In the last guide, 4.1 The Internet, we mentioned that information travels across the internet in data streams that are broken up into packets, and that there are many different paths packets can take to reach their final destination. One of the advantages of this system is that if a node (a device on the network) on the route is down or a connection isn't working, the packets can still reach their destination through another path.
This is an example of fault tolerance.
If a system is fault tolerant, it can function properly even in the event of one part failing.
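To make the idea concrete, here is a minimal sketch of packets finding another path when a node is down. The four-node network, the node names, and the `find_route` function are all made up for illustration; real internet routing is far more involved than this breadth-first search.

```python
from collections import deque

# Hypothetical network: each node lists the nodes it is directly connected to.
NETWORK = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def find_route(network, start, goal, down=()):
    """Return a shortest route from start to goal that avoids every node in `down`.

    Returns None when no working route exists.
    """
    if start in down or goal in down:
        return None
    queue = deque([[start]])   # routes still being explored
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in network[node]:
            # Skip nodes we have already reached and nodes that are down.
            if neighbor not in visited and neighbor not in down:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(find_route(NETWORK, "A", "D"))                    # ['A', 'B', 'D']
print(find_route(NETWORK, "A", "D", down={"B"}))        # ['A', 'C', 'D']
print(find_route(NETWORK, "A", "D", down={"B", "C"}))   # None
```

With node B down, the packet still gets from A to D by going through C. Only when both B and C fail is there no route left, which shows that fault tolerance has limits.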
Image source: Thomas Kinto on Unsplash
A non-digital example of fault tolerance is the road system in your city. There are usually multiple paths to get from one place in town to another. If there's a crash or construction on one road and it's blocked, you can still get to your destination. The road system can function even if one road stops operating, so we would call it (relatively) fault tolerant.
Another example would be having a backup generator in your house or multiple heaters in an apartment. In both of these cases, the power or heating system in question can continue to work even if a part (one of the generators or one of the heaters) fails.
Fault tolerance is important for AP CSP because the internet was designed to be fault tolerant. It does this primarily through redundancy.
One of the major aspects of a fault-tolerant system is the presence of redundancy. Redundancy is defined by College Board as the inclusion of extra components that can be used to mitigate failure of a system if other components fail. Unlike what your English teacher told you about redundancy in your essays, redundancy in the digital world is beneficial and helps keep everything running.
In the opening example, redundancy took the form of multiple paths that a packet could travel along. (This often literally means having multiple physical cables and/or routers.) However, redundancy can take several other forms.
- For example, a company that wants to create a more fault-tolerant internet connection could use multiple internet providers or multiple wire channels. If one ocean-floor wire gets chewed on by a hungry fish or one router gets struck by lightning, the packet can go through a different wire/router and get to its destination.
- For another example, websites use two main types of solutions to promote redundancy: load balancing and failover.
  - Load balancing solutions use load balancers, machines that let a website run on multiple servers and distribute incoming traffic across them. This lets the website keep running even if one server is down, because the traffic can be sent to the remaining servers.
  - Failover solutions are used when a part or a whole system fails, and basically entail switching to a backup machine.
❗ Load balancing and failover solutions won't be on the AP CSP test. They're just examples!
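If you're curious how those two ideas fit together, here is a small sketch, not a real load balancer. The `LoadBalancer` class, its method names, and the server names are hypothetical. It hands out servers in round-robin order and skips any server marked as down, which is a simple form of failover.

```python
class LoadBalancer:
    """Toy round-robin load balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the server to try next

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["server-1", "server-2"])
print([lb.route() for _ in range(4)])  # ['server-1', 'server-2', 'server-1', 'server-2']

lb.mark_down("server-1")               # simulate a server failure
print([lb.route() for _ in range(2)])  # ['server-2', 'server-2']
```

While both servers are up, requests alternate between them. Once `server-1` is marked down, every request goes to `server-2`, so the website stays available.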
Fault tolerance helps protect digital systems against at least two sources of failure: hardware malfunctions and cyberattacks. It also makes the system more scalable.
- The fault tolerance of a system increases how reliable it is because it prevents malfunctions from completely shutting it down. Parts of a system can and do fail. They can fail at unpredictable times, potentially leading to a long wait before someone is available to fix them, and they often fail in groups.
  - For example, a natural disaster that destroys one piece of equipment in a certain area is likely to knock out other pieces in the same area, because events like storms and floods affect everything nearby at once.
Fault tolerance keeps the system from shutting down. When the system in question is, say, the transatlantic communications system between Europe and the US, a shutdown could be very harmful for the people who use it, the other systems that rely on it, and society at large.
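A quick calculation shows why extra components improve reliability. This sketch assumes each redundant copy fails independently with the same probability, and the 10% figure is invented for illustration. Since parts often fail in groups, as described above, real-world numbers will usually be worse than this estimate.

```python
def outage_probability(p_fail, copies):
    """Chance that ALL redundant copies are down at once.

    Assumes each copy fails independently with probability p_fail,
    which is an optimistic simplification.
    """
    return p_fail ** copies

# Hypothetical component that is down 10% of the time.
for copies in (1, 2, 3):
    print(copies, f"{outage_probability(0.1, copies):.1%}")
# 1 10.0%
# 2 1.0%
# 3 0.1%
```

Under these assumptions, each extra copy cuts the chance of a total outage by a factor of ten.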
- Fault tolerance can also make it easier to reduce the damage done by some cyberattacks.
  - For example, a Distributed Denial of Service (DDoS) attack takes place when a server or network is overwhelmed with a flood of traffic, causing it to slow down or even crash. In this situation, having a redundant server or network connection could allow you to route around the attack and continue to operate.
- Finally, redundancy, a method of fault tolerance, makes it easier for a system to expand. It helps the system be more scalable than it would be otherwise. (Remember scalability from 4.1?) This is because the extra components allow more devices to connect to the system and more traffic to flow through it.
  - For example, a redundant routing system, with multiple ways to get a packet between two points, can handle more packets than one without redundancy because there are more paths for the packets to travel along.
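The scalability point can be sketched with a bit of arithmetic. The path names and capacities below are invented, and the sketch assumes the paths are fully separate so that their capacities simply add up.

```python
# Hypothetical parallel paths between two points, with made-up
# capacities in packets per second.
PATH_CAPACITY = {"path-1": 1000, "path-2": 1000, "path-3": 500}

def total_capacity(paths, down=()):
    """Add up the capacity of every path that is still working."""
    return sum(cap for name, cap in paths.items() if name not in down)

print(total_capacity({"path-1": 1000}))                # 1000 (no redundancy)
print(total_capacity(PATH_CAPACITY))                   # 2500 (all paths up)
print(total_capacity(PATH_CAPACITY, down={"path-2"}))  # 1500 (one path down)
```

The redundant system can carry more traffic when everything works, and it still carries some traffic when a path fails.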
Developing fault tolerance requires more resources than would be needed otherwise. For example, a fault-tolerant routing system needs more cables and routers than one that isn't fault tolerant, because all those additional paths have to be built.
It can be quite expensive in materials, construction costs, and maintenance to build these additional resources, and places with fewer resources can end up with more vulnerable computing networks as a result.
Not everything can be duplicated or made fault-tolerant due to cost and design concerns. (You can't have two steering wheels on a car, for example!) Developers often have to pick and choose what they need to develop fault tolerance for.