Why Level3 Connectivity Issues Are Difficult To Diagnose

Level3-based network issues are often very difficult to diagnose.

This is because Level3 operates its BGP network over MPLS (Multiprotocol Label Switching), a wonderful and useful technology, but one that makes faults difficult to diagnose when the network spans the entire world.

Take this snippet of a traceroute for example:

2  ge-6-18.car1.Manchester1.Level3.net (195.50.119.73)  0.868 ms  0.981 ms  0.770 ms
3  ae-4-90.edge1.Washington4.Level3.net (4.69.149.207)  80.715 ms  80.546 ms  80.671 ms

On a regular network, this would indicate a direct link between Manchester, UK and Washington, US.  However, no such link exists.

The Manchester and Washington routers are talking directly to each other, but the underlying path is hidden from view. The actual path travels from Manchester to London before crossing to New York, Newark or New Jersey in the US and onwards.
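One way to spot a hidden path like this is to look for a large latency jump between two consecutive hops. The following is a minimal Python sketch, not a tool we run in production: the function names and the 50 ms threshold are our own assumptions, and the input is the two-line snippet shown above.

```python
import re

# Matches "<hop>  <hostname> (<ip>)" at the start of a traceroute line.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)")
RTT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*ms")

def parse_hop(line):
    """Return (hop number, hostname, mean RTT in ms), or None if unparseable."""
    m = HOP_RE.match(line)
    if not m:
        return None
    # Only look for RTT values after the "(ip)" part of the line.
    rtts = [float(x) for x in RTT_RE.findall(line[m.end():])]
    if not rtts:
        return None
    return int(m.group(1)), m.group(2), sum(rtts) / len(rtts)

def latency_jumps(lines, threshold_ms=50.0):
    """List (from_host, to_host, jump_ms) where the mean RTT rises sharply."""
    hops = [h for h in map(parse_hop, lines) if h]
    jumps = []
    for (_, name1, rtt1), (_, name2, rtt2) in zip(hops, hops[1:]):
        if rtt2 - rtt1 >= threshold_ms:  # threshold is an arbitrary assumption
            jumps.append((name1, name2, round(rtt2 - rtt1, 1)))
    return jumps

SNIPPET = [
    "2  ge-6-18.car1.Manchester1.Level3.net (195.50.119.73)  0.868 ms  0.981 ms  0.770 ms",
    "3  ae-4-90.edge1.Washington4.Level3.net (4.69.149.207)  80.715 ms  80.546 ms  80.671 ms",
]

for jump in latency_jumps(SNIPPET):
    print(jump)
```

A jump of roughly 80 ms in a single hop is far more than one physical link out of Manchester would normally add, which hints that several hidden hops sit in between.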

Think of MPLS as a network of tunnels: one tunnel exists between Manchester and Washington.  The path taken between the two may automatically re-route behind the scenes.  It’s entirely possible (although highly unlikely, due to routing policies) that traffic could route via Japan and back and still appear as a single hop.
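The tunnel idea can be sketched as a toy model in Python. This is a simplification under stated assumptions: it assumes routers inside a tunnel never answer traceroute probes (as happens when the IP TTL is not propagated into the MPLS label), and the router names and paths are hypothetical, not Level3's real topology.

```python
def visible_hops(path, tunnels):
    """Return the routers a traceroute would show in this simplified model.

    path    -- router names in forwarding order
    tunnels -- (ingress, egress) pairs; routers strictly between them are hidden
    """
    hidden = set()
    for ingress, egress in tunnels:
        start, end = path.index(ingress), path.index(egress)
        hidden.update(path[start + 1:end])
    return [router for router in path if router not in hidden]

# Hypothetical paths: the usual route, and an unlikely detour via Tokyo.
TUNNELS = [("Manchester", "Washington")]
USUAL = ["Customer", "Manchester", "London", "NewYork", "Washington", "Destination"]
DETOUR = ["Customer", "Manchester", "London", "Tokyo", "Washington", "Destination"]

print(visible_hops(USUAL, TUNNELS))
print(visible_hops(DETOUR, TUNNELS))
```

Both calls print the same list of four routers, which is exactly the problem: the traceroute looks identical whichever way the tunnel is routed underneath.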

The underlying network will often be OSPF-based (OSPF is an interior gateway routing protocol), which picks the lowest-“cost” path between two destinations, with the cost of each link set according to a variety of factors.

This makes the network very resilient, but very difficult to diagnose.  You now have to diagnose two separate network paths, one nested inside the other.
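To illustrate how a cost-based protocol re-routes, here is a small shortest-path sketch in Python (Dijkstra's algorithm, the kind of calculation OSPF performs). The link costs and topology are invented for illustration and do not describe any real network.

```python
import heapq

def lowest_cost_path(links, src, dst):
    """Return (total cost, path) for the cheapest route, or None if unreachable.

    links -- dict mapping (router_a, router_b) to a cost; links are two-way
    """
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None

# Hypothetical link costs, chosen only to make the example readable.
LINKS = {
    ("Manchester", "London"): 10,
    ("London", "NewYork"): 70,
    ("NewYork", "Washington"): 10,
    ("London", "Tokyo"): 240,
    ("Tokyo", "Washington"): 160,
}

print(lowest_cost_path(LINKS, "Manchester", "Washington"))

# Simulate the London-NewYork link failing: traffic re-routes via Tokyo.
without_atlantic = {k: v for k, v in LINKS.items() if k != ("London", "NewYork")}
print(lowest_cost_path(without_atlantic, "Manchester", "Washington"))
```

In this made-up topology the cost rises from 90 to 410 when the transatlantic link is removed, yet a traceroute across the tunnel would still show Manchester followed directly by Washington.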

US-based connectivity issue

We received reports this morning of network alerts from third-party providers indicating trouble reaching the Netnorth network.

After further investigation, we narrowed the issue to a part of the Level3 network located in the US (traceroutes included below). We raised this with Level3, who confirmed they currently have a network issue ticket open for UK-US traffic, with packet loss exceeding 70% (although we see 90-100% in our tests).

We have extensive monitoring throughout the UK and Europe, but only key-point monitoring within the USA, which did not flag any issues automatically.

We will look to add further test points in the US and Asia so that we can locate these remote issues more quickly.

This issue should not have affected any UK or EU traffic, but many “site uptime” services test from multiple locations and report a problem if any one of those locations fails.
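As a rough illustration of why such services raised alerts while UK and EU visitors were unaffected, the sketch below groups failed probes by region. It is a hypothetical example in Python: the probe locations, region mapping and function name are assumptions, not the behaviour of any particular uptime service.

```python
from collections import defaultdict

def summarize_probes(results, region_of):
    """Group failed probe locations by region.

    results   -- dict of probe location -> True if the site was reachable
    region_of -- dict of probe location -> region label
    """
    failed = defaultdict(list)
    for location, reachable in results.items():
        if not reachable:
            failed[region_of.get(location, "unknown")].append(location)
    return {
        # An "alert on any location" policy fires as soon as one probe fails.
        "any_location_failed": bool(failed),
        "failed_by_region": {r: sorted(locs) for r, locs in sorted(failed.items())},
    }

# Hypothetical probe results resembling this incident.
RESULTS = {"London": True, "Frankfurt": True, "Washington": False, "Los Angeles": False}
REGIONS = {"London": "UK", "Frankfurt": "EU", "Washington": "US", "Los Angeles": "US"}

print(summarize_probes(RESULTS, REGIONS))
```

Breaking the failures down by region makes it clear at a glance that only US probe points were affected.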

Here are a few traceroutes showing the issue lying within Level3’s network:

 1  vzd114.mediatemple.net (205.186.158.19)  0.060 ms  0.024 ms  0.022 ms
 2  e1.2.cr01.iad01.mtsvc.net (70.32.64.249)  0.286 ms  0.276 ms  0.251 ms
 3  65.97.50.1 (65.97.50.1)  6.981 ms  7.347 ms  7.680 ms
 4  br01-1-1.iad2.netdc.com (65.97.48.205)  0.468 ms  0.460 ms  0.545 ms
 5  209.48.42.149 (209.48.42.149)  0.382 ms  0.372 ms  0.420 ms
 6  206.111.0.66.ptr.us.xo.net (206.111.0.66)  0.935 ms  0.947 ms  0.957 ms
 7  * * *
 8  * * ae-14-14.bar1.Toronto1.Level3.net (4.69.200.93)  142.194 ms
 9  ae-0-11.bar2.Toronto1.Level3.net (4.69.151.242)  144.597 ms  144.601 ms *
 10  * * *
 11  * * *
 12  ae-41-41.ebr2.London1.Level3.net (4.69.137.65)  234.130 ms * *
 13  * * vlan102.ebr1.London1.Level3.net (4.69.143.89)  224.504 ms
 14  ae-4-4.car1.Manchesteruk1.Level3.net (4.69.133.101)  224.407 ms * *
 15  * NETNORTH-LT.car1.Manchester1.Level3.net (195.50.119.74)  219.169 ms  219.394 ms

HOST: stats.netnorth.co.uk                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. po1-16.router.tcw.netnorth.co.uk         0.0%    10    0.7   0.7   0.5   0.8   0.1
 2. ge-6-18.car1.Manchester1.Level3.net      0.0%    10    0.9   0.9   0.7   1.1   0.1
 3. ???                                     100.0    10    0.0   0.0   0.0   0.0   0.0
 4. AMAZON.COM.edge2.Washington1.Level3.net 90.0%    10  221.4 221.4 221.4 221.4   0.0
 5. 72.21.220.149                           80.0%    10  223.5 223.4 223.3 223.5   0.1
 6. 205.251.245.232                         70.0%    10  223.9 223.6 223.0 224.0   0.5
 7. ???                                     100.0    10    0.0   0.0   0.0   0.0   0.0

Sprint Source Region: Anaheim, CA (sl-crs3-ana)
 IP Destination: 82.148.224.24
 Performing: ICMP Traceroute
Wed May  6 08:11:49.079 UTC
 Tracing the route to 82.148.224.24
 1  144.232.13.244 4 msec  3 msec  2 msec
 2  144.232.24.40 6 msec  6 msec  5 msec
 3  ae14.edge1.LosAngeles9.Level3.net (4.68.111.89) 4 msec  3 msec  2 msec
 4   *  *  *
 5   *  *  *
 6   *  *  *

core1.tyo1.he.net> traceroute 82.148.224.24
 traceroute to 82.148.224.24 (82.148.224.24), 30 hops max, 60 byte packets
 1  74.82.46.5  3.918 ms  3.946 ms  4.023 ms
 2  184.105.223.105  133.830 ms  133.818 ms  133.894 ms
 3  80.239.167.189  98.187 ms  98.262 ms  98.247 ms
 4  213.155.137.58  98.174 ms 213.155.134.252  98.211 ms 213.155.130.126  98.086 ms
 5  4.68.70.129  102.706 ms  97.844 ms  102.817 ms
 6  * * *
 7  * * *
 8  * * *
 9  * * 4.69.151.242  232.329 ms
 10  * * *
 11  * * *
 12  * * 4.69.137.77  309.750 ms
 13  4.69.143.97  318.915 ms * 4.69.143.85  309.842 ms
 14  4.69.133.101  319.404 ms * *
 15  * 195.50.119.74  318.680 ms *
 16  * * *
 17  82.148.224.24  309.715 ms *  318.889 ms
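
The loss figures in the mtr report above (the one starting "HOST: stats.netnorth.co.uk") can also be pulled out programmatically. The Python sketch below is a minimal example under our own assumptions: the regular expression is written only for the report layout shown here, and the 50% threshold is arbitrary. Loss at a single hop can also be caused by a router rate-limiting its replies, so it matters most when it carries on to later hops, as it does here.

```python
import re

# Matches lines such as " 4. host.example.net  90.0%  10 ..." (the % may be absent).
MTR_RE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+(\d+(?:\.\d+)?)%?\s+(\d+)")

def parse_mtr(report):
    """Return a list of (hop, host, loss percent, packets sent)."""
    hops = []
    for line in report.splitlines():
        m = MTR_RE.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2), float(m.group(3)), int(m.group(4))))
    return hops

def first_lossy_responder(hops, threshold=50.0):
    """First hop that replied at least once but lost >= threshold percent."""
    for hop, host, loss, _sent in hops:
        if host != "???" and loss >= threshold:
            return hop, host, loss
    return None

REPORT = """\
HOST: stats.netnorth.co.uk                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. po1-16.router.tcw.netnorth.co.uk         0.0%    10    0.7   0.7   0.5   0.8   0.1
 2. ge-6-18.car1.Manchester1.Level3.net      0.0%    10    0.9   0.9   0.7   1.1   0.1
 3. ???                                     100.0    10    0.0   0.0   0.0   0.0   0.0
 4. AMAZON.COM.edge2.Washington1.Level3.net 90.0%    10  221.4 221.4 221.4 221.4   0.0
 5. 72.21.220.149                           80.0%    10  223.5 223.4 223.3 223.5   0.1
 6. 205.251.245.232                         70.0%    10  223.9 223.6 223.0 224.0   0.5
 7. ???                                     100.0    10    0.0   0.0   0.0   0.0   0.0
"""

print(first_lossy_responder(parse_mtr(REPORT)))
```

For this report the first responding hop with heavy loss is hop 4, the Level3 edge router in Washington, and the loss persists on the hops after it.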