Introduction

Welcome back to this two-part series on the open source tool Batfish.

In the previous article we looked at the steps required to install Batfish and then explored how to perform configuration analysis against our example network.

We will now look at how to perform impact analysis by simulating network failures and analysing the results, to ensure our network is configured and designed to withstand failure. Let’s begin ...

Example Topology

For our impact analysis example we will use the same topology (shown below) as in our previous article, with a configuration error added to make things a little more interesting! To recap, this is a spine and leaf topology that uses OSPF to distribute the loopbacks, with eBGP peerings formed between them.

Figure 1 - Example Topology.

With regards to our snapshot, the configuration that was previously found to be incorrect (i.e. jumbo frames and multipath BGP) has now been fixed and saved as snapshot-2 within the following location: https://github.com/rickdonato/network-automation/tree/master/batfish/nxos9k-ebgp-spine-leaf/snapshot-2

Snapshot Forking

To simulate network failures, Batfish provides bf_fork_snapshot, which allows you to apply a series of failure conditions to a cloned snapshot. These conditions can be based on node failure (deactivate_nodes) or interface failure (deactivate_interfaces). Below is an example of a node failure:

from pybatfish.client.commands import bf_fork_snapshot

FAIL_SNAPSHOT_NAME = "spine1-failure-snapshot"
bf_fork_snapshot(BASE_SNAPSHOT_NAME, FAIL_SNAPSHOT_NAME, deactivate_nodes=["spine1"], overwrite=True)
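
Interface failures are defined in much the same way, using the pybatfish Interface datamodel class. A minimal sketch (the snapshot name here is hypothetical):

from pybatfish.client.commands import bf_fork_snapshot
from pybatfish.datamodel import Interface

# Clone the base snapshot with one end of the leaf1 <-> spine1 link shut down
IFACE_FAIL_SNAPSHOT_NAME = "leaf1-eth1-1-failure-snapshot"  # hypothetical name
bf_fork_snapshot(
    BASE_SNAPSHOT_NAME,
    IFACE_FAIL_SNAPSHOT_NAME,
    deactivate_interfaces=[Interface(hostname="leaf1", interface="Ethernet1/1")],
    overwrite=True,
)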

NOTE! It is important to note that when a snapshot is forked, the fork becomes the active snapshot. In other words, once you perform bf_fork_snapshot, all further questions will use the new snapshot containing your failure conditions. To set the snapshot back to the original, use:

bf_set_snapshot(BASE_SNAPSHOT_NAME)

Differential Reachability

Now that we have our forked snapshot containing a set of failures, we can ask various questions, such as differentialReachability().

This question traces flows through the network in both the base and failure snapshots, compares the traces for differences in reachability, and returns the result.

from pybatfish.datamodel.flow import HeaderConstraints

# Compare reachability to server2 between the failure and base snapshots
diff_reach_answer = bfq.differentialReachability(
        headers=HeaderConstraints(dstIps='server2')
    ).answer(
        snapshot=FAIL_SNAPSHOT_NAME,
        reference_snapshot=BASE_SNAPSHOT_NAME
    ).frame()

If reachability is unaffected by the failure, the result should show no differences, like so:

>>> diff_reach_answer
Empty DataFrame
Columns: [Flow, Snapshot_Traces, Snapshot_TraceCount, Reference_Traces, Reference_TraceCount]
Index: []
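
In a script, this makes for a simple pass/fail gate. A minimal sketch, using the DataFrame returned above:

# An empty answer means reachability to server2 is unchanged by the failure
if diff_reach_answer.empty:
    print("[PASS] reachability to server2 is unaffected by the failure")
else:
    print("[FAIL] difference found between the failure and base snapshots")
    print(diff_reach_answer[["Flow", "Snapshot_Traces"]])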

Releasing a Chaos Monkey

Using snapshot forking along with the differentialReachability question, we can now build ourselves a chaos monkey.

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage.[1]

Our chaos monkey, a modified version of a script pulled from here, will deactivate interfaces at random throughout our network and then report any reachability differences between the base and failure snapshots.
We will run the script via python3 -i so that, when it completes, we are dropped into the Python 3 interpreter with access to all of the script's variables, which we can use for further troubleshooting.
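
The full script is available in the repository linked above. As a rough, hypothetical sketch of its core loop (the snapshot names, the choice of 15 rounds, and the way links are picked are assumptions; the real script also pretty-prints the failing traces):

import sys

from pybatfish.client.commands import bf_fork_snapshot, bf_init_snapshot
from pybatfish.datamodel.flow import HeaderConstraints
from pybatfish.question import bfq, load_questions

BASE_SNAPSHOT_NAME = "BASE_SNAPSHOT"
FAIL_SNAPSHOT_NAME = "FAIL_SNAPSHOT"

# Assumes a Batfish service running locally on the default host/port
load_questions()

print("[*] Initializing BASE_SNAPSHOT")
bf_init_snapshot(sys.argv[1], name=BASE_SNAPSHOT_NAME, overwrite=True)

print("[*] Collecting link data")
links = bfq.edges().answer(snapshot=BASE_SNAPSHOT_NAME).frame()

print("[*] Releasing the Chaos Monkey")
for _ in range(15):
    # Pick a random layer-3 link and deactivate both ends in a forked snapshot
    link = links.sample(n=1).iloc[0]
    failed = [link["Interface"], link["Remote_Interface"]]
    print(f" - Deactivating Links:{failed[0]} + {failed[1]}")
    bf_fork_snapshot(
        BASE_SNAPSHOT_NAME,
        FAIL_SNAPSHOT_NAME,
        deactivate_interfaces=failed,
        overwrite=True,
    )
    # Compare reachability to server2 between the failure and base snapshots
    diff = bfq.differentialReachability(
        headers=HeaderConstraints(dstIps="server2")
    ).answer(
        snapshot=FAIL_SNAPSHOT_NAME,
        reference_snapshot=BASE_SNAPSHOT_NAME,
    ).frame()
    if not diff.empty:
        print("[FAIL] difference found between BASE_SNAPSHOT and FAIL_SNAPSHOT")
        print(diff)
        break
else:
    print("[SUCCESS] No reachability issues found after 15 rounds of chaos!")

Running it against snapshot-2 gives the following: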

# python3 -i scripts/chaos_monkey.py "nxos9k-ebgp-spine-leaf/snapshot-2"
[*] Initializing BASE_SNAPSHOT
[*] Collecting link data
[*] Releasing the Chaos Monkey
 - Deactivating Links:leaf2[Ethernet1/1] + spine2[Ethernet1/3]
 - Deactivating Links:spine2[Ethernet1/4] + spine2[Ethernet1/2]
 - Deactivating Links:spine2[Ethernet1/2] + spine2[Ethernet1/3]
 - Deactivating Links:spine2[Ethernet1/2] + spine1[Ethernet1/2]
 - Deactivating Links:spine2[Ethernet1/3] + leaf1[Ethernet1/4]
 - Deactivating Links:spine1[Ethernet1/3] + spine2[Ethernet1/3]
 - Deactivating Links:leaf1[Ethernet1/4] + leaf1[Ethernet1/2]
[FAIL] difference found between BASE_SNAPSHOT and FAIL_SNAPSHOT

FLOW: start=leaf1 [3.3.3.3->172.16.2.1 ICMP]
node: leaf1
  ORIGINATED(default)
  NO_ROUTE

FLOW: start=server1 [172.16.1.1->172.16.2.1 ICMP]
node: server1
  ORIGINATED(default)
  FORWARDED(Routes: static [Network: 0.0.0.0/0, Next Hop IP:172.16.1.254])
  TRANSMITTED(eth1)
node: leaf1
  RECEIVED(Vlan10)
  NO_ROUTE

Analysing the Results

As you can see from our results, we hit a failure after deactivating ports e1/4 and e1/2 on leaf1, with the output showing each affected flow and the steps of its trace.

If we look at the first flow, we can see the loopback of leaf1 is unable to reach the destination because there is no route. OK, let's check the routes on leaf1.

>>> bfq.routes(nodes="leaf1").answer().frame()
    Node      VRF          Network Next_Hop     Next_Hop_IP Next_Hop_Interface   Protocol Metric Admin_Distance   Tag
0  leaf1  default       3.3.3.3/32     None  AUTO/NONE(-1l)          Loopback0  connected      0              0  None
1  leaf1  default      10.0.0.0/16     None  AUTO/NONE(-1l)     null_interface     static      0            254  None
2  leaf1  default      10.2.0.0/30     None  AUTO/NONE(-1l)        Ethernet1/3  connected      0              0  None
3  leaf1  default      10.0.0.0/30     None  AUTO/NONE(-1l)        Ethernet1/1  connected      0              0  None
4  leaf1  default      10.2.0.2/32     None  AUTO/NONE(-1l)        Ethernet1/3      local      0              0  None
5  leaf1  default  172.16.1.254/32     None  AUTO/NONE(-1l)             Vlan10      local      0              0  None
6  leaf1  default    172.16.1.0/24     None  AUTO/NONE(-1l)             Vlan10  connected      0              0  None
7  leaf1  default      10.0.0.2/32     None  AUTO/NONE(-1l)        Ethernet1/1      local      0              0  None
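
We can confirm programmatically that no dynamically learned routes are present by filtering on the Protocol column (a quick check based on the protocol values shown above):

routes = bfq.routes(nodes="leaf1").answer().frame()
# Anything other than connected, local, or static would be a dynamically learned route
dynamic_routes = routes[~routes["Protocol"].isin(["connected", "local", "static"])]
print(dynamic_routes)  # an empty frame confirms there are no OSPF or BGP routes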

Strange, there are no OSPF or BGP routes. Let’s first check why we aren't receiving the loopback addresses, which should be learnt via OSPF. Therefore, let's look at the OSPF interfaces:

>>> bfq.interfaceProperties(nodes="/leaf|spine/", properties="OSPF_Enabled|OSPF_Passive|Description").answer().frame().dropna()
               Interface     Description OSPF_Enabled OSPF_Passive
8       leaf2[Loopback0]        Loopback         True        False
9     leaf2[Ethernet1/4]       to spine1         True        False
10    leaf1[Ethernet1/3]       to spine2         True        False
11    leaf2[Ethernet1/3]       to spine1         True        False
12    leaf1[Ethernet1/2]       to spine1         True        False
13    leaf1[Ethernet1/1]       to spine1         True        False
14    leaf1[Ethernet1/4]       to spine2         True        False
15   spine1[Ethernet1/2]        to leaf1         True        False
16   spine2[Ethernet1/3]        to leaf1        False        False
17   spine2[Ethernet1/2]        to leaf2         True        False
18   spine1[Ethernet1/1]        to leaf1        False        False
19   spine2[Ethernet1/1]        to leaf2         True        False
20   spine1[Ethernet1/4]        to leaf2         True        False
21   spine2[Ethernet1/4]        to leaf1         True        False
22   spine1[Ethernet1/3]        to leaf2         True        False
23    leaf2[Ethernet1/1]       to spine2         True        False
24    leaf2[Ethernet1/2]       to spine2         True        False
38    leaf2[Ethernet1/5]     to server-2        False        False
42         spine2[mgmt0]  OOB Management        False        False
43         spine1[mgmt0]  OOB Management        False        False
48          leaf2[mgmt0]  OOB Management        False        False
56          leaf1[mgmt0]  OOB Management        False        False
85      leaf1[Loopback0]        Loopback         True        False
87    leaf1[Ethernet1/5]     to server-1        False        False
175    spine2[Loopback0]        Loopback         True        False
177    spine1[Loopback0]        Loopback         True        False

OK, so no interfaces look like they are set to passive. However, there are a few interfaces set to False under OSPF_Enabled. Let's explore further:

>>> ospf_interfaces = bfq.interfaceProperties(nodes="/leaf|spine/", properties="OSPF_Enabled|OSPF_Passive|Description").answer().frame().dropna()
>>> ospf_interfaces[ospf_interfaces['OSPF_Enabled'] != True]
              Interface     Description OSPF_Enabled OSPF_Passive
16  spine2[Ethernet1/3]        to leaf1        False        False
18  spine1[Ethernet1/1]        to leaf1        False        False
38   leaf2[Ethernet1/5]     to server-2        False        False
42        spine2[mgmt0]  OOB Management        False        False
43        spine1[mgmt0]  OOB Management        False        False
48         leaf2[mgmt0]  OOB Management        False        False
56         leaf1[mgmt0]  OOB Management        False        False
87   leaf1[Ethernet1/5]     to server-1        False        False

Interesting, there are two spine interfaces facing leaf1 that do not have OSPF enabled. This isn't correct: without the loopbacks being learned via OSPF, the eBGP peerings would fail. Let’s fix this (config below) and create snapshot-3.

# spine1
interface Ethernet1/1
  description to leaf1
  no switchport
  mtu 9126
  mac-address fa16.3e00.0001
  ip address 10.0.0.1/30
+ ip router ospf 1 area 0.0.0.0
  no shutdown

# spine2
interface Ethernet1/3
  description to leaf1
  no switchport 
  mtu 9126
  mac-address fa16.3e00.0008
+ ip router ospf 1 area 0.0.0.0
  ip address 10.2.0.1/30
  no shutdown
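
With the corrected configurations saved as snapshot-3, the new snapshot can be loaded in the same way as before. A minimal sketch using bf_init_snapshot (the value passed to name= is just an arbitrary label):

from pybatfish.client.commands import bf_init_snapshot

BASE_SNAPSHOT_NAME = "snapshot-3"
bf_init_snapshot("nxos9k-ebgp-spine-leaf/snapshot-3", name=BASE_SNAPSHOT_NAME, overwrite=True)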

Now, we can rerun the chaos monkey.

# python3 -i scripts/chaos_monkey.py "nxos9k-ebgp-spine-leaf/snapshot-3"
[*] Initializing BASE_SNAPSHOT
[*] Collecting link data
[*] Releasing the Chaos Monkey
 - Deactivating Links:leaf2[Ethernet1/1] + spine2[Ethernet1/3]
 - Deactivating Links:spine2[Ethernet1/4] + spine2[Ethernet1/2]
 - Deactivating Links:spine2[Ethernet1/2] + spine2[Ethernet1/3]
 - Deactivating Links:spine2[Ethernet1/2] + spine1[Ethernet1/2]
 - Deactivating Links:spine2[Ethernet1/3] + leaf1[Ethernet1/4]
 - Deactivating Links:spine1[Ethernet1/3] + spine2[Ethernet1/3]
 - Deactivating Links:leaf1[Ethernet1/4] + leaf1[Ethernet1/2]
 - Deactivating Links:spine1[Ethernet1/2] + leaf1[Ethernet1/1]
 - Deactivating Links:leaf2[Ethernet1/2] + leaf1[Ethernet1/1]
 - Deactivating Links:leaf1[Ethernet1/2] + leaf1[Ethernet1/4]
 - Deactivating Links:spine2[Ethernet1/2] + spine2[Ethernet1/2]
 - Deactivating Links:leaf1[Ethernet1/2] + leaf1[Ethernet1/3]
 - Deactivating Links:leaf2[Ethernet1/1] + leaf2[Ethernet1/1]
 - Deactivating Links:spine1[Ethernet1/1] + spine2[Ethernet1/1]
 - Deactivating Links:leaf2[Ethernet1/3] + spine1[Ethernet1/2]
[SUCCESS] No reachability issues found after 15 rounds of chaos!

GREAT! This resolved the issue and we completed 15 rounds of network failures whilst keeping network reachability to server-2.

Outro

That concludes this two-part series on Batfish, though I can honestly say we have only scratched the surface of what Batfish can do. It is a great tool and a must-have in any network automation stack.

Thanks for reading, and remember to check out our free newsletter to get all the latest networking updates.

For all the code related to this article, check out: https://github.com/rickdonato/network-automation/tree/master/batfish

References


  1. "Chaos engineering - Wikipedia." https://en.wikipedia.org/wiki/Chaos_engineering. Accessed 1 Jul. 2019. ↩︎