# 🔗 Link Aggregation (LAG) | The Backbone of Layer 2 Resilience✂️

In Layer 2 Ethernet networks, **resilience is key**. Unlike Layer 3, where routing protocols dynamically reroute traffic, Layer 2 has **limited options** when it comes to path redundancy. **Enter Link Aggregation (LAG)**—a method that allows multiple physical links to be bundled into a single logical link, ensuring both **higher bandwidth** and **resilience against failures**.

But here’s the catch: **LAG isn’t a set-and-forget solution**. Many engineers deploy it, assume it's working, and move on—only to get a nasty surprise when a failure occurs **without any alerts**. **If your LAG fails silently, that’s a monitoring failure, not just a network failure.**

---

## **Why Use Link Aggregation?**

### **1️⃣ All Links Forward Traffic**

One of the biggest advantages of LAG is that, unlike spanning tree (which blocks redundant links), **all links in the bundle are actively forwarding traffic**. This means:  
✅ **Increased aggregate throughput** by spreading flows across multiple links (a single flow is normally hashed to one member, so it is still capped at that link's speed)  
✅ **Automatic failover**—if a link fails, traffic shifts to the remaining links  
✅ **Less complexity** compared to other redundancy methods
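
To make the "all links forward" idea concrete, here is a minimal sketch of hash-based flow distribution and failover. It is an illustration only: real switches use vendor-specific hash inputs and algorithms, and the flow tuples, link names, and CRC32 hash below are assumptions, not any platform's actual behaviour.

```python
import zlib

def pick_member(flow, active_links):
    """Map a flow to one active member link using a deterministic hash.

    Assumption: CRC32 over the flow fields stands in for a vendor hash.
    """
    if not active_links:
        raise RuntimeError("LAG is down: no active member links")
    key = "|".join(str(field) for field in flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]

# Hypothetical flows: (src_ip, dst_ip, protocol, src_port, dst_port)
flows = [("10.0.0.%d" % i, "10.0.1.1", "tcp", 40000 + i, 443) for i in range(8)]

links = ["eth1", "eth2", "eth3", "eth4"]
before = {flow: pick_member(flow, links) for flow in flows}

# eth2 fails: it leaves the bundle and every flow is hashed over what remains
links_after_failure = [link for link in links if link != "eth2"]
after = {flow: pick_member(flow, links_after_failure) for flow in flows}

for flow in flows:
    print(flow[0], before[flow], "->", after[flow])
```

Because the hash is deterministic, every packet of a given flow stays on one link, which keeps packets in order but also means one large flow cannot use more than one member's bandwidth.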

---

## **How Does LAG Detect Failures?**

### **🔍 1. Basic Link Status (The Default Check)**

By default, most switches determine whether a LAG member is **active** based on **link status** (up/down). If a cable is physically unplugged or a switchport goes down, the device **removes the failed link from the aggregation**.

Sounds good, right? Well, **not always**.

**💀 The Problem:**  
What if the link stays *physically* up but traffic isn’t actually passing? 🤔

This can happen due to:  
❌ One-way traffic failures (e.g., unidirectional fiber failure)  
❌ Misconfigured VLANs, preventing some traffic from flowing  
❌ Physical layer degradation (e.g., bad optics, high CRC errors)

Link status alone **isn’t enough** to ensure all LAG members are actually **working**.
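
As a rough sketch of what "up but not working" looks like in data, the function below flags a member whose link is up but whose counters look wrong. The counter names and the error-ratio threshold are hypothetical; pick values that match your own platform and baseline.

```python
def member_health(link_up, rx_frames, tx_frames, rx_crc_errors,
                  max_error_ratio=1e-4):
    """Classify one LAG member from counters collected over a polling window.

    All thresholds here are illustrative assumptions, not vendor defaults.
    """
    if not link_up:
        return "down"
    # Sending but hearing nothing back can indicate a unidirectional failure
    if tx_frames > 0 and rx_frames == 0:
        return "suspect: transmitting but receiving nothing"
    # A high share of corrupted frames points at optics or cabling
    if rx_frames > 0 and rx_crc_errors / rx_frames > max_error_ratio:
        return "suspect: high CRC error ratio"
    return "ok"

print(member_health(True, 0, 1000, 0))
print(member_health(True, 1000, 1000, 5))
print(member_health(True, 1_000_000, 1_000_000, 1))
```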

---

### **🛠 2. LACP (Link Aggregation Control Protocol)**

LACP is the most commonly used **dynamic** LAG protocol. Instead of assuming a link is working, LACP sends **PDUs (Protocol Data Units)** down each member link. This helps detect certain failures:  
✅ **Misconfigured LAG groups** (for example, a member cabled to the wrong switch)  
✅ **Link negotiation issues**  
✅ **Members that don't belong to the correct aggregation group**

**⚠️ But LACP isn't fast.**  
LACP failure detection relies on **timer-based mechanisms**, which aren’t always quick enough for critical networks.

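💡 **Pro Tip:** You can adjust the LACP timer to improve failure detection. Fast LACP mode sends PDUs every second instead of every 30 seconds, and because a partner is timed out after three missed PDUs, that cuts detection from roughly 90 seconds to roughly 3 seconds.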
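
The arithmetic behind the LACP timers is simple enough to sketch. The intervals and the multiplier of three follow the IEEE 802.1AX defaults; the function and dictionary names are just for illustration.

```python
# LACP PDU transmit intervals in seconds (IEEE 802.1AX slow and fast periodic times)
LACP_PDU_INTERVAL_S = {"slow": 30, "fast": 1}

# A partner is considered lost after three intervals without a PDU
TIMEOUT_MULTIPLIER = 3

def lacp_timeout_s(mode):
    """Worst-case time before LACP declares the partner gone, in seconds."""
    return LACP_PDU_INTERVAL_S[mode] * TIMEOUT_MULTIPLIER

for mode in ("slow", "fast"):
    print(mode, lacp_timeout_s(mode), "seconds")
```

Even in fast mode, a few seconds of black-holed traffic is a long time for latency-sensitive services.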

---

### **🚀 3. BFD (Bidirectional Forwarding Detection) for Faster Failure Detection**

For networks needing **sub-second failure detection**, **BFD (Bidirectional Forwarding Detection)** is the go-to solution.

Unlike LACP, BFD:  
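✅ Can protect both **routed (Layer 3) paths** and **individual LAG member links** (per-member "micro-BFD" sessions, RFC 7130)  
✅ Can detect failures in as little as **50ms** on platforms that support aggressive timers (detection time is the negotiated interval multiplied by the detect multiplier)  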
✅ Doesn’t rely on periodic LACP PDUs—BFD runs separately to actively confirm **if traffic is flowing correctly**

**BFD + LACP?**  
Yep! **You can run both.** LACP ensures the LAG is properly formed, while BFD provides **rapid failure detection** and ensures the data plane is working as expected.
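
Here is a small sketch of how BFD detection time works out in asynchronous mode, based on the rule in RFC 5880: the remote detect multiplier times the agreed interval, where the agreed interval is the larger of what the local side is willing to receive and what the remote side wants to send. The parameter names and the example timer values are illustrative assumptions, and what your hardware actually supports will vary.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Detection time seen by the local system, in milliseconds."""
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms

# Both ends configured for 50 ms with a multiplier of 3
print(bfd_detection_time_ms(50, 50, 3))

# The slower side wins: the remote end only wants to send every 100 ms
print(bfd_detection_time_ms(50, 100, 3))
```

Note that mismatched timers quietly fall back to the slower value, so it is worth checking the negotiated interval rather than the configured one.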

---

## **🚨 The Hidden Danger | LAG Without Monitoring 🚨**

LAG is **great**—but only **if you know when it fails**. A broken LAG member **won’t always trigger an alert** unless **you actively monitor it**.

Too many engineers **deploy LAG once and never check it again**. That’s like keeping a spare tyre in your car but never checking whether it’s inflated until the day you need it.

---

## **How to Properly Monitor LAG**

### **📊 1. Set Up SNMP Monitoring for LAG Interfaces**

Use SNMP polling to check:  
✅ **LAG status (is the bundle still active?)**  
✅ **Number of active links (are all members up?)**  
✅ **Traffic distribution (is one link overloaded while others sit idle?)**

💡 **If a LAG member fails and your monitoring doesn’t catch it, your problem isn’t just a network failure—it’s a monitoring failure.**
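
The three checks above can be sketched as a small function that runs over data you have already polled. This deliberately leaves out the SNMP transport: it assumes your poller has collected each member's `ifOperStatus` and a byte counter from the IF-MIB at two points in time. The `MemberSample` class, the sample values, and the link names are hypothetical.

```python
from dataclasses import dataclass

IF_OPER_STATUS_UP = 1  # IF-MIB ifOperStatus: up(1), down(2)

@dataclass
class MemberSample:
    name: str
    oper_status: int   # polled ifOperStatus
    octets_prev: int   # byte counter at the previous poll
    octets_now: int    # byte counter at this poll

def summarize_lag(samples):
    """Summarise bundle state, active member count, and traffic share.

    Simplification: counter wrap and resets between polls are ignored.
    """
    active = [s for s in samples if s.oper_status == IF_OPER_STATUS_UP]
    deltas = {s.name: s.octets_now - s.octets_prev for s in active}
    total = sum(deltas.values())
    shares = {name: (d / total if total else 0.0) for name, d in deltas.items()}
    return {
        "configured": len(samples),
        "active": len(active),
        "bundle_up": bool(active),
        "traffic_share": shares,
    }

# Hypothetical poll results for a 4-member bundle
samples = [
    MemberSample("eth1", 1, 0, 600),
    MemberSample("eth2", 1, 0, 300),
    MemberSample("eth3", 2, 0, 0),    # down
    MemberSample("eth4", 1, 0, 100),
]
summary = summarize_lag(samples)
print(summary)
```

In this made-up sample the bundle is still up, but one member is missing and `eth1` is carrying most of the traffic, which is exactly the kind of state that goes unnoticed without polling.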

---

### **🔔 2. Configure Alerts for LAG Failures**

Your NMS (Network Monitoring System) should notify you if:  
❌ The LAG goes from 4 links to 3 (or any reduction in members)  
❌ Traffic on one member **drops to zero**  
❌ LACP negotiation fails
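
As a sketch, those three alert conditions could be expressed as a rule function like the one below. The input structure (`active`, `bps`, `lacp_in_sync`) is a hypothetical shape for whatever your NMS exposes, not a real product's API.

```python
def lag_alerts(expected_members, member_state):
    """Return a list of alert messages for one LAG.

    member_state maps a member name to a dict with hypothetical keys:
    'active' (bool), 'bps' (current traffic rate), 'lacp_in_sync' (bool).
    """
    alerts = []
    active = [name for name, s in member_state.items() if s["active"]]
    if len(active) < expected_members:
        alerts.append(
            f"LAG degraded: {len(active)}/{expected_members} members active"
        )
    for name, s in member_state.items():
        if s["active"] and s["bps"] == 0:
            alerts.append(f"{name}: active but carrying no traffic")
        if not s["lacp_in_sync"]:
            alerts.append(f"{name}: LACP not in sync")
    return alerts

state = {
    "eth1": {"active": True, "bps": 4.2e8, "lacp_in_sync": True},
    "eth2": {"active": True, "bps": 0, "lacp_in_sync": True},
    "eth3": {"active": False, "bps": 0, "lacp_in_sync": False},
    "eth4": {"active": True, "bps": 3.9e8, "lacp_in_sync": True},
}
for alert in lag_alerts(4, state):
    print(alert)
```

Alerting on the member count against an expected value, rather than only on the bundle going down, is what catches the first failure instead of the last one.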

---

### **📈 3. Periodically Verify LAG Performance**

**Test failover manually** from time to time. If a LAG failure goes undetected for months, then suddenly a second link drops, you’ll experience **an unexpected outage**.

---

## **🔄 Conclusion: LAG Works—If You Do It Right**

LAG is the best way to provide **resilience at Layer 2**. It ensures:  
✅ **Higher bandwidth** by bundling multiple links  
✅ **Automatic failover** if a link drops  
✅ **Better link utilisation** than spanning tree, which blocks redundant links

However, **LAG isn’t perfect**. Relying on default **link status checks** is risky. For **better failure detection**, you should:  
✔️ Use **LACP** to detect misconfigurations  
✔️ Deploy **BFD** for **faster failover**  
✔️ **Monitor** your LAG with SNMP & alerts

**Remember:** LAG failure without monitoring = **Monitoring failure.**

---

### **🔹 Wrapping Up | The Classic LAG Horror Story**

A **large enterprise** had a **4-member LAG** to their data centre. One link **failed silently**, but no one noticed because traffic simply shifted to the remaining three links.

Months later, **a second link failed**—**cutting bandwidth in half**.

Still, no one noticed.

Then, **a third link failed**. Suddenly, the company saw **huge performance issues**, and **only then** did the IT team investigate.

By then, they were running on **a single remaining link**—one failure away from a total outage.

Had they **monitored** their LAG properly, they would’ve **fixed the issue after the first failure**, instead of waiting for a **near-disaster**.

---

### **So, are you monitoring your LAG?**

If not, **start now.** Otherwise, **the next outage might already be in progress.** 🚨

**More Information:**

%[https://youtu.be/4P9cnoJGl50?si=YcyVbASpouGkrB-y]