🛫Lessons from Building Large Campus Networks | A Retrospective🛃
Exploring the Evolution of Ethernet in Campus Networks | A Journey Through Time
Many moons ago, I ventured into the world of building large campus networks. Back then, Ethernet technologies were evolving, and the landscape of networking was much different from what it is today. Over time, newer Ethernet technologies have emerged, revolutionizing how networks can be deployed in both business and data center environments. Carrier Ethernet, in particular, has made significant inroads into service providers, changing the game entirely. As I reflect on my experiences, I've learned valuable lessons that are as relevant today as they were back then.
The Myth of Plug-and-Play Networking
Building large campus networks isn't just a matter of cascading switches in an unplanned fashion. This laissez-faire approach is fraught with issues. Some of the industry’s 800-pound gorillas like to sell the idea that their devices can be rolled in and magically cure any network ailment. This, coupled with the snake oil sales pitch that Layer 2 doesn’t matter and can be discounted, creates a dangerous bubble. Into this bubble and squeak, security bunnies insist that every network device should be a firewall, leading to an overcomplicated and unnecessary deployment of firewalls across the network.
An example of this misguided strategy is the wholesale replacement of network switch chassis with different models, under the belief that this will improve performance and resolve issues. In reality, this often results in nothing more than a salesman earning a healthy commission. Performance problems and issues rarely stem from a single cause; they are usually the result of multiple underlying factors. The gorilla arrives, sells their shiny new device, and leaves—but the problem remains.
The Complexity of Network Issues
Network administrators often fall into the trap of believing that any issue can be traced back to a single root cause. This gullibility allows vendors to "pimp" their devices as the all-dancing, all-flashing solution. However, there is rarely a single cause, and no single device fixes every problem. The symptoms may change when some causes are addressed, but the underlying problem often remains. The causes of issues in a campus network span the entire spectrum, from the user interface, through the application, operating system, and network, all the way to the back-end processing infrastructure. Stability in a campus network requires addressing the complete spectrum, not just the most obvious symptoms.
The Role of Spanning Tree
The glue of Layer 2 networks is Spanning Tree Protocol (STP). While newer protocols have emerged, particularly in carrier and data center environments, STP is still present in many campus networks. It’s a simple protocol, but that simplicity can be deceiving. It’s dangerous to discount the effort required to correctly design and maintain a campus network with predictable STP behavior. When a campus network behaves unpredictably, the resultant incidents can cause significant business disruptions.
A critical aspect of STP is ensuring that the root bridge is forced to a known location. In networks built with devices from various vendors—or even from a single vendor with non-standard versioning—determining the root bridge location can be challenging. Compatibility and interoperability issues often arise, leading to unpredictable network behavior. Different switch code versions may contain bugs, and in a mixed environment, these can result in some very strange symptoms, causing network administrators to chase red herrings.
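To make the root bridge point concrete, here is a minimal Python sketch of the election rule as I understand it from 802.1D: the bridge with the numerically lowest bridge ID (priority first, then MAC address) becomes root. The switch names, MAC addresses, and the `Bridge` and `elect_root` helpers are invented for illustration; real switches also fold a VLAN or instance number into the priority field, which this sketch ignores.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Bridge:
    name: str       # hypothetical switch name
    priority: int   # configurable; 32768 is the usual default
    mac: str        # base MAC address, e.g. "00:1a:2b:00:00:30"

    def bridge_id(self):
        # Compare priority first, then the MAC address as a number.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    # The lowest bridge ID wins the election.
    return min(bridges, key=lambda b: b.bridge_id())


# All priorities left at default: the lowest MAC wins, wherever it sits.
bridges = [
    Bridge("core-1", 32768, "00:1a:2b:00:00:30"),
    Bridge("old-access-7", 32768, "00:00:0c:00:00:01"),
    Bridge("core-2", 32768, "00:1a:2b:00:00:20"),
]
print(elect_root(bridges).name)  # old-access-7

# Lowering the priority on the intended root makes the outcome predictable.
forced = [replace(b, priority=4096) if b.name == "core-1" else b for b in bridges]
print(elect_root(forced).name)  # core-1
```

With every priority left at its default, the election is decided by MAC address alone, which is how an old access switch in a cupboard can end up as the root of the campus.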
Dealing with Network Device Failures
Campus networks are often built using a core, distribution, and access hierarchy model. When a switch fails in the campus, it’s not just a case of replacing the hardware, plugging in all the cables, switching it on, and restoring the configuration. I’ve seen some "gurus" do precisely that, only to trigger a network tsunami as connectivity across the campus is disrupted. Spanning Tree requires incremental changes to the campus. Thus, when replacing a failed switch, the following process works best:
1. Label Everything: Make sure all cables are correctly labeled before removing the failed switch.
2. Check Configuration: Ensure you have the configuration of the failed switch available. If not, can you recreate it?
3. Replace Switch, Hold the Cables: Replace the switch, but do not immediately reconnect the cables.
4. Power On: Power on the switch and restore the configuration via the out-of-band management ports.
5. Disable Interfaces: Disable all the interfaces on the switch.
6. Reconnect Cables: Reconnect the cables, but do not enable the interfaces yet.
7. Incremental Enablement: Enable each interface individually, checking for stability after each one.
8. Save Configuration: Once all interfaces are enabled and stable, save the switch configuration to a repository.
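The incremental enablement step above can be sketched as a small loop. This is only an illustration in Python: `enable`, `disable`, and `is_stable` are hypothetical callbacks that you would wire to your own management tooling, and the simulated switch below exists purely to exercise the loop.

```python
def enable_incrementally(interfaces, enable, disable, is_stable):
    """Enable interfaces one at a time, stopping at the first instability.

    Returns (enabled, failed): the interfaces left enabled, and the
    interface that caused instability (None if every one was fine).
    """
    enabled = []
    for ifname in interfaces:
        enable(ifname)
        if not is_stable():
            # Back out the last change and stop so it can be investigated.
            disable(ifname)
            return enabled, ifname
        enabled.append(ifname)
    return enabled, None


# Simulated switch: bringing up Gi1/0/3 destabilises the network.
up = set()
enabled, failed = enable_incrementally(
    ["Gi1/0/1", "Gi1/0/2", "Gi1/0/3", "Gi1/0/4"],
    enable=up.add,
    disable=up.discard,
    is_stable=lambda: "Gi1/0/3" not in up,
)
print(enabled, failed)  # ['Gi1/0/1', 'Gi1/0/2'] Gi1/0/3
```

In practice the stability check would need to wait for Spanning Tree to settle before judging, and what counts as stable (topology change counters, error rates, reachability of key hosts) is a judgment call for each campus.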
VLAN Design
VLAN design in a campus network is as important as predicting Spanning Tree behavior. It’s not advisable to propagate all VLANs to all switches. Instead, VLANs should be designed in a structured and hierarchical fashion, filtering them from network devices where they are not needed. Failure to do so can overload individual switches with unnecessary VLAN processing. Additionally, VLANs should be used to segment traffic properly. Overlapping Layer 3 networks on the same physical segment can lead to easy man-in-the-middle attacks and unnecessary traffic flows. Proper VLAN design can help partition malicious traffic, making it easier to suppress using ACLs or traffic policing.
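As a rough illustration of filtering VLANs from devices that do not need them, the Python sketch below works out which VLANs each uplink actually has to carry. It assumes a strict tree (core, distribution, access) with the Layer 3 gateways at the top, so a VLAN is needed on an uplink only if some switch beneath it has ports in that VLAN. The switch names, VLAN numbers, and the `vlans_per_uplink` helper are made up for the example.

```python
def vlans_per_uplink(parent, access_vlans):
    """Return, for each switch, the VLANs required on its uplink.

    parent:       maps each non-root switch to its upstream switch
    access_vlans: maps a switch to the set of VLANs used on its access ports
    """
    needed = {switch: set() for switch in parent}
    for switch, vlans in access_vlans.items():
        node = switch
        # Walk up towards the root, marking the VLANs on every uplink crossed.
        while node in parent:
            needed[node] |= vlans
            node = parent[node]
    return needed


parent = {
    "dist-a": "core", "dist-b": "core",
    "acc-1": "dist-a", "acc-2": "dist-a", "acc-3": "dist-b",
}
access_vlans = {"acc-1": {10, 20}, "acc-2": {20}, "acc-3": {30}}

allowed = vlans_per_uplink(parent, access_vlans)
for switch in sorted(allowed):
    print(switch, sorted(allowed[switch]))
```

Here VLAN 30 never needs to appear on dist-a or its access switches, and VLANs 10 and 20 never need to reach dist-b. Anything beyond those sets is a candidate for removal from the trunk.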
The Importance of Cabling
Cabling plays a critical role in campus networks. Fiber optic links, for example, are susceptible to unidirectional failures, where the signal is acceptable in one direction but not the other, and Spanning Tree can behave unpredictably as a result. Switches can be configured to detect and contain this, but the protection is not always enabled by default. Physical errors occur regularly in a campus network, with fiber connections generally experiencing fewer errors than copper. When copper is used between switches or in large server clusters, the network can become unstable due to physical cable faults or mismatched interface settings.
Some administrators mistakenly believe that forced interface settings are more stable, but these non-negotiated connections are actually more prone to problems: if the two ends are not forced to identical values, or one end is left to negotiate, the result is typically a speed or duplex mismatch.
A campus network should also have at least two separate networks: a production network and a backup/restore network. Backup and restore traffic flows are the most intensive in a campus network, and segregating these environments can prevent them from adversely affecting live production systems.
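Mismatched interface settings are easy to check for if you have an inventory of both ends of each link. The Python sketch below is one way to express the check. `PortSettings` and `link_problems` are hypothetical names, and the data would have to come from your own configuration records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PortSettings:
    autoneg: bool
    speed: Optional[int] = None    # Mbit/s, only meaningful when forced
    duplex: Optional[str] = None   # "full" or "half", only when forced


def link_problems(a: PortSettings, b: PortSettings) -> List[str]:
    """List the configuration mismatches between the two ends of a link."""
    problems = []
    if a.autoneg != b.autoneg:
        problems.append("autonegotiation enabled on one end only")
    elif not a.autoneg:
        # Both ends are forced, so the forced values must agree.
        if a.speed != b.speed:
            problems.append("forced speed differs")
        if a.duplex != b.duplex:
            problems.append("forced duplex differs")
    return problems


print(link_problems(PortSettings(True), PortSettings(False, 1000, "full")))
print(link_problems(PortSettings(False, 100, "full"), PortSettings(False, 100, "half")))
print(link_problems(PortSettings(True), PortSettings(True)))
```

The first two links are flagged and the third, with both ends negotiating, comes back clean.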
Monitoring & Diagnostics
Monitoring is key to maintaining the health of campus networks. Unfortunately, many network devices in campuses come equipped with vast sets of monitoring and diagnostic features that are never enabled. Even basic monitoring, such as enabling SYSLOG and aggregating logs to a central collector, is often overlooked. This oversight leaves expensive switches in campus networks useless when disaster strikes, as they fail to provide the necessary physical, visual, and audio feedback.
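Once logs do reach a central collector, even a little processing goes a long way. As a small sketch in Python, the snippet below decodes the priority value at the start of a syslog line (facility times 8 plus severity, as defined in the syslog RFCs) so that the urgent messages can be pulled out of the noise. The sample messages are invented, and `parse_pri` and `urgent` are illustrative helpers, not part of any collector product.

```python
import re

SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_pri(line):
    """Split a raw syslog line into facility, severity, and message."""
    match = PRI_RE.match(line)
    if not match:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    return {
        "facility": pri // 8,
        "severity": pri % 8,
        "severity_name": SEVERITIES[pri % 8],
        "message": match.group(2),
    }


def urgent(lines, max_severity=3):
    """Keep messages at 'err' or worse (lower numbers are more severe)."""
    parsed = (parse_pri(line) for line in lines)
    return [p for p in parsed if p is not None and p["severity"] <= max_severity]


sample = [
    "<189>Interface Gi1/0/1 changed state to up",
    "<187>Interface Gi1/0/3 changed state to down",
    "not a syslog line",
]
for entry in urgent(sample):
    print(entry["severity_name"], entry["message"])
# err Interface Gi1/0/3 changed state to down
```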
While tools like ping and traceroute (tracert on Windows) are useful for basic connectivity testing, they provide limited insight into actual application traffic flow problems. More advanced tools, such as NetFlow (or its equivalents like J-Flow, NetStream, and IPFIX) and IP SLA, offer far better monitoring capabilities. These tools allow for more accurate diagnosis of network issues and should be part of any modern network monitoring strategy.
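To give a feel for what flow data adds over ping, here is a minimal Python sketch that totals bytes per source and destination pair and lists the heaviest conversations. It assumes the flow records have already been exported and decoded into dictionaries. The field names, addresses, and the `top_talkers` helper are made up for the example and do not follow any particular exporter's schema.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per (source, destination) pair and return the n largest."""
    totals = Counter()
    for flow in flows:
        totals[(flow["src"], flow["dst"])] += flow["bytes"]
    return totals.most_common(n)


flows = [
    {"src": "10.0.1.5", "dst": "10.0.9.10", "dst_port": 443, "bytes": 1200},
    {"src": "10.0.2.7", "dst": "10.0.9.20", "dst_port": 10000, "bytes": 900000},
    {"src": "10.0.1.5", "dst": "10.0.9.10", "dst_port": 443, "bytes": 800},
    {"src": "10.0.3.9", "dst": "10.0.9.10", "dst_port": 443, "bytes": 500},
]

for (src, dst), total in top_talkers(flows, n=2):
    print(src, "->", dst, total)
# 10.0.2.7 -> 10.0.9.20 900000
# 10.0.1.5 -> 10.0.9.10 2000
```

A report like this shows at a glance when one conversation, such as a backup job, is dominating a link, which ping and traceroute cannot tell you.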
Wrap-Up
Building and maintaining large campus networks is a complex task that requires careful planning, execution, and ongoing management. From understanding the intricacies of Spanning Tree Protocol to designing VLANs and implementing robust monitoring, the lessons I’ve learned over the years have taught me that there are no shortcuts to achieving network stability. As Ethernet technology continues to evolve, embracing new methodologies and tools will be essential for staying ahead of the curve in campus network design and management.