Spanning Tree safeguards

Problem Description:

A bridging loop (spanning tree loop) caused a network outage. To break the loop, you pulled one of the redundant links or shut down one of the switches participating in the loop. Now you are unsure how to find the source of the loop and how to prevent it from occurring again.

Action Plan:
Prior to bringing the redundant link/switch back online, implement Layer 2 safeguards designed to protect against STP loops and mitigate the impact if one does occur.

1) Implement Spanning Tree PortFast and BPDU guard on all edge ports
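
A minimal sketch of step 1 on a Catalyst IOS switch. The interface name is a placeholder, and exact syntax varies by platform and release (newer releases, for example, use "spanning-tree portfast edge"), so treat this as an illustration rather than a tested configuration:

```
! Assumed host-facing access port; substitute your own interface
ITLABSW(config)#interface GigabitEthernet1/0/5
ITLABSW(config-if)#switchport mode access
! Skip the listening/learning delay on this edge port
ITLABSW(config-if)#spanning-tree portfast
! Err-disable the port if a BPDU is ever received on it
ITLABSW(config-if)#spanning-tree bpduguard enable
ITLABSW(config-if)#exit
! Global alternative: PortFast on all access ports,
! BPDU guard on every PortFast-enabled port
ITLABSW(config)#spanning-tree portfast default
ITLABSW(config)#spanning-tree portfast bpduguard default
```

A port that BPDU guard has shut down should be listed by "show interfaces status err-disabled".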

2) Verify that the proper switch is currently the STP root for all VLANs. Consider enabling root guard on root/core switch uplink ports to the distribution layer switches to ensure your root bridge does not change unexpectedly (such as when new switches are connected to the network). Root guard can instead be enabled at the access layer, rather than on the root bridge(s), if you maintain control of the distribution layer and are not concerned about anyone making changes or adding switches there.

Below is an excellent doc that details root guard. See the section titled “What Is the Difference Between STP BPDU Guard and STP Root Guard?” for clarification on the difference. You do not want root guard on the port-channel between core switches running HSRP. It should be enabled ONLY on the uplinks to other switches (or access ports) that you do NOT want to become spanning tree root.

http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00800ae96b.shtml
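
As a rough sketch of step 2, the commands below first check which bridge is root and then enable root guard on one port. The interface name is an assumption for illustration. Apply the command only on ports facing switches that must never become root:

```
! Confirm the root bridge for every VLAN before changing anything
ITLABSW#show spanning-tree root
ITLABSW#configure terminal
! Assumed port toward a distribution switch; substitute your own
ITLABSW(config)#interface TenGigabitEthernet1/1
! Superior BPDUs received here put the port in root-inconsistent state
ITLABSW(config-if)#spanning-tree guard root
ITLABSW(config-if)#end
! Lists any ports currently blocked by root guard or loop guard
ITLABSW#show spanning-tree inconsistentports
```

A root-inconsistent port recovers on its own once the superior BPDUs stop arriving.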

3) Enable loop guard on all distribution/access layer switches*
4) Enable BPDU guard on all distribution/access layer switches*
5) Enable UDLD on all fiber uplinks*
– Unidirectional links can cause spanning tree loops. UDLD prevents this by shutting down a unidirectional link. Note that in some NX-OS vPC environments UDLD in Aggressive mode is NOT recommended. See http://tools.ietf.org/html/rfc5171#section-5.4 for the RFC 5171 definition of the difference between “Normal” and “Aggressive” modes.
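
Steps 3 through 5 could look roughly like this on a Catalyst IOS switch. The interface name is hypothetical, and whether aggressive mode is appropriate depends on your platform, so check its documentation first:

```
! Step 3: loop guard on all point-to-point links
ITLABSW(config)#spanning-tree loopguard default
! Step 4: BPDU guard on every PortFast-enabled port
ITLABSW(config)#spanning-tree portfast bpduguard default
! Step 5: UDLD in normal mode on all fiber-optic ports
ITLABSW(config)#udld enable
! Or aggressive mode on all fiber-optic ports, where recommended
ITLABSW(config)#udld aggressive
! Per-port alternative (assumed uplink name)
ITLABSW(config)#interface TenGigabitEthernet1/1
ITLABSW(config-if)#udld port aggressive
```

Root guard and loop guard are set by the same interface command ("spanning-tree guard root" or "spanning-tree guard loop"), so a port carries one or the other, not both. UDLD must be enabled on both ends of a link to detect anything.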

6) Prune unnecessary VLANs off your trunks
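
A hedged example of pruning a trunk by hand. The VLAN numbers and interface name are made up for illustration:

```
! See which VLANs each trunk currently allows and forwards
ITLABSW#show interfaces trunk
ITLABSW#configure terminal
! Assumed trunk uplink; substitute your own interface
ITLABSW(config)#interface TenGigabitEthernet1/1
! Replaces the whole allowed list with VLANs 10, 20 and 30
ITLABSW(config-if)#switchport trunk allowed vlan 10,20,30
! To change an existing list instead, use the add/remove keywords
ITLABSW(config-if)#switchport trunk allowed vlan remove 30
```

Without the "add" or "remove" keyword the command overwrites the existing list, so double-check the VLANs before applying it to a production trunk.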

After implementing root guard, loop guard, UDLD aggressive, and BPDU guard, bring the link/switch back up and see if the loop reforms.
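
Before and after bringing the link back, a few show commands can confirm the safeguards are active. This is a suggested checklist rather than an exhaustive one, and command availability differs between platforms:

```
! Which global STP features (PortFast default, BPDU guard, loop guard) are on
ITLABSW#show spanning-tree summary
! Ports blocked by root guard or loop guard
ITLABSW#show spanning-tree inconsistentports
! UDLD neighbor state on each enabled port
ITLABSW#show udld neighbors
! Ports shut down by BPDU guard, UDLD, etc.
ITLABSW#show interfaces status err-disabled
```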

* Prior to implementing any of these features, it is recommended that you familiarize yourself with how each feature works.

IF THE LOOP REFORMS:
1) Have a TAC engineer online to troubleshoot

2) Enable MAC address move notification (if applicable – this is disabled by default on the 6500/7600 platform and enabled by default on most others, including the 3750/3560/2960 platforms)

 ITLABSW(config)#mac-address-table notification mac-move

Check the switch log for MAC addresses flapping between interfaces. The interfaces a MAC flaps between are the ports participating in the loop. Trace the MAC back to its source. Look for:
– A link flapping on an upstream switch, causing spanning tree TCNs (topology change notifications) and spanning tree reconvergence. This should be used in conjunction with step 3 below.
– A unidirectional link on an upstream switch causing the loop.
– A hub or switch connected to a PortFast-enabled access port where this MAC is learned. Shut this port down and see if this breaks the loop.
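
One way to pull the flap messages out of the log and then locate the MAC is sketched below. The MAC address is a placeholder. The filter assumes the log message contains the word "flapping", which is typical of Catalyst MAC-move messages but worth confirming on your platform:

```
! Show only MAC-flap log entries
ITLABSW#show logging | include flapping
! Find the port where the flapping MAC (placeholder) is currently learned
ITLABSW#show mac address-table address 0011.2233.4455
```

Older releases use the hyphenated form "show mac-address-table". Repeat the lookup on the switch behind the learned port until you reach the device itself or find yourself going in a circle.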

3) Check for TCNs
While the loop is occurring, if you see excessive TCNs you need to trace the TCNs to their source. To do this, start from the core and run the following command.

 
ITLABSW#show spanning-tree detail | inc ieee|occurr|from|is exec

The output from this command shows the port on which the last TCN was received and how long ago it was received.
Look for the port that received a TCN in the last few seconds.

 ITLABSW#sh spanning-tree detail | i ieee|occur|from|is exec
   VLAN0001 is executing the rstp compatible Spanning Tree protocol
     Number of topology changes 187927 last change occurred 00:01 ago <-time rec'd
         from Port-Channel12 <--interface that received the TCN

Follow this port to the neighboring switch and repeat the command there. Keep going until the port that receives the TCN is an access port, or until the switch in question is generating TCNs but not receiving them. If you find an access port receiving TCNs, shut it down and see if that stabilizes the network.

If you find a switch generating TCNs, you will want to look for two uplink ports or trunks in a spanning tree forwarding state for the same VLAN. If you find two ports in a forwarding state, shut one port down and see if this breaks the loop. Check for a unidirectional link or excessive link flaps.
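
On the switch that is generating TCNs, the checks above might be run like this. VLAN 10 is an example value:

```
! Role and state of every port in the VLAN; look for two
! uplinks/trunks that are both forwarding (FWD)
ITLABSW#show spanning-tree vlan 10
! Link flaps show up as repeated UPDOWN messages in the log
ITLABSW#show logging | include UPDOWN
```

Keep in mind that on the root bridge all ports are expected to be designated and forwarding, so two forwarding uplinks are only suspicious on a non-root switch.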

4) Look for an interface with a very high input rate and low output rate

 ITLABSW#sh int | i is up|rate

When a bridging loop occurs you will usually see multiple interfaces with a high output rate and low input rate and a single interface with a high input rate and low output rate.
– Trace the port with the high input rate down until you come to an access port, then shut it down.
– If the port with the high input rate leads you into a loop, check spanning tree and EtherChannel states until you find a switch that has a port either in an incorrect forwarding state or incorrectly bundled.
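
To check bundling while tracing the high-rate port, something along these lines can help. The interface name is an assumption:

```
! Input/output rates for the one suspect port (assumed name)
ITLABSW#show interfaces GigabitEthernet1/0/49 | include rate
! Bundle state of every port-channel and its member ports
ITLABSW#show etherchannel summary
```

In the summary output, compare each member port's flag against the legend printed at the top. Members that are stand-alone or suspended rather than bundled, or a channel that is bundled on one switch but not on its neighbor, are the mismatches to look for.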

5) Look for packets hitting the CPU. Sniff the CPU and see if the packets share a common source. This is only an option on certain platforms, and you’ll need to contact TAC to assist with setting it up and analyzing the data. Track down the source. If they are STP or CDP packets (or packets destined to the 0100.0CCC.CCCX reserved multicast address), trace where the source MAC is learned. See if the source MAC leads you in a loop.

If you see two ports in a forwarding state for the same VLAN on the same switch, check the following:
a) Does this switch think it is the root for this VLAN (or VLANs)?
b) Should it be?
c) Is it receiving BPDUs from its neighbor on the ports in a forwarding state? (Sniff both forwarding ports to look for BPDUs.)
d) Look for a unidirectional link on one of the ports in a forwarding state.
e) Shut one of the ports in a forwarding state and see if the loop stops.
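
Checks a) through e) map roughly onto the commands below. VLAN 10 and the interface name are placeholders, and the BPDU counters are offered as a quick first look, not a replacement for sniffing the ports:

```
! a/b) Root bridge ID, cost and root port as this switch sees them
ITLABSW#show spanning-tree vlan 10 root
! c) Per-port state plus BPDU sent/received counters; run it twice
!    and see whether "received" increments on the forwarding ports
ITLABSW#show spanning-tree vlan 10 detail | include Port|BPDU
! d) UDLD state for one forwarding port (assumed name)
ITLABSW#show udld GigabitEthernet1/0/49
! e) Shut one of the forwarding ports and watch whether the loop stops
ITLABSW#configure terminal
ITLABSW(config)#interface GigabitEthernet1/0/49
ITLABSW(config-if)#shutdown
```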

Good doc for troubleshooting bridging loops:

http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00800951ac.shtml#brid_loop

ref # https://community.cisco.com/t5/networking-documents/spanning-tree-loop-troubleshooting-and-safeguards/ta-p/3115040