This week a technical hitch caused websites to wobble worldwide. Tom Chivers discovers the net is held together with chewing gum and string
On Tuesday, at 8.48am British Summer Time, Verizon, a major US internet service provider (ISP), did something relatively mundane and technical. It took some big groups of IP addresses (which we can think of as the phone numbers of the internet, one of which is assigned to every desktop computer, tablet or smartphone) and divided them up into smaller blocks, to free up some unused addresses. And in doing so, through no fault of its own, it broke the internet (a bit).
Major websites around the world slowed down, locked up or refused to allow visitors to log in with their usernames and passwords. The most high-profile casualty was eBay: British users were unable to log in for much of the day, causing traders to demand compensation for lost sales. (Even here in the super-high-tech fortress of Telegraph Towers we experienced a bit of a wobble.) The question is, how did a boring little reallocation of some addresses by an American telecoms company knock over large bits of the internet all around the world?
The answer is complicated, according to Dr Joss Wright, a computer scientist at Oxford University. “There are relatively few experts in this. It really is the deep magic,” he says. But fundamentally the difficulty lies in the fact that no one planned and built the internet: it grew organically, like a weed. When problems arose, engineers found ways to patch them or work around them. But sometimes those fixes became problems themselves a few years later. And that’s what happened on Tuesday.
The story of the Verizon meltdown begins with something called the border gateway protocol (BGP). In every major internet hub – ISPs such as Verizon and BT, but also big businesses and universities – there are routers, large versions of those little black boxes with blinking lights that you probably have somewhere in your house powering your Wi-Fi. The job of those routers is to find a path for data from one bit of the internet to another: from your computer on your desk in Huddersfield, for example, to the hotel in Sydney where you’re trying to reserve a room (or rather, from the big BT hub a mile or so from your desk in Huddersfield to the big Telstra hub near the hotel in Sydney). But the internet is a thicket of countless millions of possible routes, so to find their way across, the routers keep a record of the most reliable ones.
That record is, in essence, a big list – a list with room for 512,000 entries, on older Cisco machines – with each entry storing a route to a group of IP addresses. This is the BGP routing table, and it is how one bit of the internet finds another bit.
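For readers who like to see the idea in code, here is a minimal sketch of such a list and how a router might consult it. It is an illustration only, not how any real router is programmed: the three address blocks are reserved documentation ranges, the neighbour names are invented, and the rule of preferring the most specific matching block (known as longest-prefix matching) is standard routing practice rather than something described in this article.

```python
import ipaddress

# Hypothetical, tiny routing table. Each entry maps a block of addresses
# to the neighbour that traffic for that block should be handed to.
# The blocks are documentation ranges; the neighbour names are invented.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "neighbour-a",
    ipaddress.ip_network("198.51.100.0/24"): "neighbour-b",
    ipaddress.ip_network("198.51.100.128/25"): "neighbour-c",
}


def next_hop(address):
    """Return the neighbour for the most specific block containing address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in ROUTES if ip in net]
    if not matches:
        return None  # no route known: the data has nowhere to go
    # Prefer the smallest (most specific) block, i.e. the longest prefix.
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


print(next_hop("198.51.100.200"))
print(next_hop("203.0.113.1"))
```

A real table holds hundreds of thousands of entries rather than three, which is why its size limit matters.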
The trouble is, though, that the internet is built on old software, old decisions and ancient, fudged repairs. “That is the fundamental problem of the internet,” says Wright. “There are 2 to the power 32 IP addresses – about 4.2 billion. That was the number chosen in the naive days of the internet, when no one knew it was going to be the global system it is now. Now there are about 4 billion computing devices trying to use the internet, and we’re running out of addresses.”
In the early days of the internet, when regulators were handing out this seemingly inexhaustible supply of IP addresses, they did so generously. Stanford wants 16 million IP addresses? Sure, let them have them. We have lots. But Stanford, or whoever, doesn’t need 16 million IP addresses – it only uses a few tens of thousands – so now that we are running short, the regulators have to claw back groups of them and divide them up, like bulbs in the garden, in order to create additional groups of IP addresses that can be reallocated where they are needed. This is what Verizon did with its collection of in-house IP addresses: it subdivided its groups into smaller groups, temporarily making about 15,000 new groups of addresses. But every time more groups are created, new routes are needed in the BGP table, so that the routers can find them.
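The bulb-dividing can be sketched with Python’s standard ipaddress module. This is a hedged illustration, not a record of what Verizon actually did: the block 10.0.0.0/8 is a private range chosen only because it contains about 16 million addresses, and the helper name split_block is made up for this example.

```python
import ipaddress


def split_block(block, new_prefix):
    """Divide one block of addresses into equal smaller blocks (hypothetical helper)."""
    return list(ipaddress.ip_network(block).subnets(new_prefix=new_prefix))


# Illustrative only: a private block of the "16 million addresses" size.
big = ipaddress.ip_network("10.0.0.0/8")
print(big.num_addresses)  # 2 ** 24 addresses in the original block

pieces = split_block("10.0.0.0/8", 10)
for piece in pieces:
    # Each smaller block would need its own entry in the routing table.
    print(piece, piece.num_addresses)
```

One block becomes four here, and one routing-table entry becomes four. Do that across many blocks and thousands of extra routes appear.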
Unfortunately, as well as the 4.2 billion IP address limit, the internet is bound by another arbitrary restriction: the 512,000 slots in the BGP table of those older Cisco routers. That seemed like a huge number when it was put in place, but as of the morning of August 12, most ISPs’ routers had about 500,000 of their places already full. So when the 15,000 new Verizon routes popped in, they suddenly pushed the number of routes that the routers had to remember up to 515,000. The older machines just couldn’t handle this, so they broke: they shut down, failed to remember new routes, or forgot old routes. And that’s why the internet broke on Tuesday.
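The arithmetic of Tuesday morning is simple enough to write down. The sketch below uses the round figures quoted above; the function name is invented, and real routers fail in messier ways than a single subtraction suggests.

```python
TABLE_CAPACITY = 512_000  # slots on the older routers, as quoted above


def routes_that_do_not_fit(existing, new, capacity=TABLE_CAPACITY):
    """Return how many routes exceed the table's capacity (0 if all fit)."""
    return max(0, existing + new - capacity)


# About 500,000 routes already stored, then 15,000 new ones arrive.
overflow = routes_that_do_not_fit(500_000, 15_000)
print(overflow)
```

On those figures, roughly 3,000 routes had nowhere to go.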
“It’s not the first day of the Apocalypse,” Wright says, soothingly. Both the shortage of IP addresses and the shortage of slots in the BGP table can be fixed. The fixes sound simple, if technical: there is a new IP address system, version 6 (IPv6), which ISPs can upgrade to, which would create trillions more addresses; and it is possible to reallocate some of the memory in your router to increase the number of BGP routes by another quarter of a million or so, which would stave off problems like the Verizon one – at least for the foreseeable future.
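To put rough numbers on both fixes: current addresses are 32 bits long and version 6 addresses are 128 bits long, and the sketch below treats “another quarter of a million or so” as exactly 250,000 extra slots. That last figure is an assumption made for the sake of the sum, as is the helper name.

```python
IPV4_ADDRESSES = 2 ** 32   # the "about 4.2 billion" limit
IPV6_ADDRESSES = 2 ** 128  # version 6 addresses are 128 bits long

OLD_TABLE_LIMIT = 512_000
EXTRA_SLOTS = 250_000      # assumption: "another quarter of a million or so"


def headroom(route_count, limit):
    """Return spare slots in the table (negative means it has overflowed)."""
    return limit - route_count


print(IPV4_ADDRESSES)
print(IPV6_ADDRESSES // IPV4_ADDRESSES == 2 ** 96)
print(headroom(515_000, OLD_TABLE_LIMIT))
print(headroom(515_000, OLD_TABLE_LIMIT + EXTRA_SLOTS))
```

With the larger table, the 515,000 routes that caused Tuesday’s trouble would fit with plenty of room to spare.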
The reason that this has not happened already is that it is a risky process. “To fix it, they need to reboot the routers, and lots of them will be old machines that have never been rebooted before, and sometimes when you reboot something like that it doesn’t switch back on again,” says Wright. “Something could just go ‘Bing’.” On a similar note, changing from a broadly accepted system to a new, less widely supported one could lead to all sorts of failures. “It involves getting everyone to agree on the protocol. It’s like getting everyone in the USA and Britain to stop speaking English and speak Esperanto instead. It could be done, in theory, but it’s going to be tricky.”
We view the digital world as a place of constant innovation. But because of the risks of upgrading the infrastructure on which it relies, engineers are under pressure not to experiment, and not to fix things before they become a problem. Now that a definite problem has arisen, and caused a fairly major outage for some fairly major internet players, ISPs might overcome their innate (and entirely sensible) conservatism and make the switch.
Of course, this will only push the problem down the road a few more years, because the internet – as noted – is a patchwork quilt of fixes and workarounds and temporary solutions. “The internet – you have no idea. It’s held together with chewing gum and string,” sighs Wright. “If everyone said, we don’t need the internet for a year, let’s shut it down, we could make it so much better. But we can’t do that.”