On Saturday, March 30th, Information Technology Services (ITS) informed students via email that Eduroam was having partial connectivity issues. At the time, ITS reported that about 15 percent of devices were affected. Affected users could still connect to the Carleton Guest network if they needed internet access. After this initial email, ITS continued working to isolate and resolve the issue.
By the afternoon of Sunday, March 31st, ITS had deployed a fix that allowed more users to connect their devices to Eduroam. It was another 24 hours before ITS could confirm that the fix had worked and give the all clear.
According to Dave Flynn, Director of Systems and Infrastructure, a total of 2,500 devices were denied a connection between the start of the issue and its resolution on April 1st.
According to the ITS blog, ITS first noticed the problem when it began receiving reports of trouble connecting to Eduroam. Upon further examination, ITS discovered that the network was unable to allocate IP addresses to new devices. This was particularly problematic because students were returning from spring break, which sharply increased the number of devices attempting to connect to the network.
The issue was caused by an update made to the wireless registration system over spring break. Because few students were on campus during the break, the problem went unnoticed until more students, and their devices, returned. ITS typically conducts updates during breaks to minimize potential disruptions to campus internet services.
“This update introduced an undocumented syntax change, which reduced the number of devices that could connect at one time,” said Flynn. Specifically, the update changed the way the network interpreted a comma.
Because so many devices need to be online at any given time, Eduroam does not move the information from every device through a single network; instead, it is divided into 16 subnetworks. This is done primarily for security reasons, although it also aids performance. Each subnetwork can accommodate only a limited number of devices. Following the update, the syntax change left only two of these subnetworks available for connection. With fewer subnetworks available and a cap on the number of devices each could hold, the network could not accommodate all of the devices attempting to connect.
“The effect was to shrink the size of the eduroam network by 75 percent, from 4048 simultaneous devices to 506, and the Carleton Guest network by 50 percent, from 1012 to 506,” said Flynn.
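The arithmetic behind those figures can be sketched in a few lines of Python. This is an illustration only, not ITS's configuration: the per-subnetwork capacity of 253 devices is an assumption inferred by dividing the quoted 4048-device total by the 16 subnetworks, and the function and variable names are hypothetical.

```python
# Illustrative arithmetic only; this is not ITS's actual configuration.
# ASSUMPTION: each subnetwork holds 253 devices, inferred from 4048 / 16.
DEVICES_PER_SUBNET = 253


def capacity(subnets: int, per_subnet: int = DEVICES_PER_SUBNET) -> int:
    """Maximum simultaneous devices across the given number of subnetworks."""
    return subnets * per_subnet


def percent_lost(before: int, after: int) -> float:
    """Percentage of capacity lost when going from `before` to `after`."""
    return (before - after) / before * 100


# Eduroam: 16 subnetworks before the update, 2 usable afterwards.
eduroam_before = capacity(16)
eduroam_after = capacity(2)

# Carleton Guest: device counts taken directly from the quoted figures.
guest_before = 1012
guest_after = 506

print(eduroam_before, eduroam_after, percent_lost(eduroam_before, eduroam_after))
print(guest_before, guest_after, percent_lost(guest_before, guest_after))
```

Under that assumption, 16 subnetworks give the quoted 4048 devices and two give 506. Taken at face value, those two numbers work out to a loss of 87.5 percent for Eduroam, while the Guest figures work out to 50 percent.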
When something like this happens, ITS follows an incident response plan to manage the disruption. According to Flynn, the plan also “provides guidelines for how we communicate, with whom, and how often. It also lays out a framework for how the technical lead (the person in charge of actually fixing what’s broken) provides information to the incident manager (the overall coordinator of the response, and usually the individual responsible for sending communication to affected audiences).”
“This particular issue was very specific and we are unlikely to hit this combination of factors again. We work hard to learn from disruptions and to make our systems more resilient to further outages,” said Flynn.