The yearly test of our incident response protocol was on my radar for later this year, but circumstances moved it up the agenda and turned it from a tabletop exercise into a real-world crisis.
What precipitated the crisis was the distributed denial-of-service attack against DNS provider Dyn on Oct. 21. Calls started coming in around 8:30 a.m. EDT, just as our East Coast customers were arriving at work and attempting to log in to my company’s cloud-based software-as-a-service applications. By 9:30, more than 200 calls had been logged. That’s when I got involved.
After a hastily arranged teleconference among the heads of various departments, we decided on several courses of action, but it was frustrating to know that we were being severely affected by something entirely out of our control and that no immediate fix was available.
A top priority was to deal with a crucial shortcoming in our disaster recovery plan — one that we probably would have been able to identify and address at our leisure if this were only an exercise: We had no secondary DNS provider. Now we needed one fast, but DNS is a somewhat tricky service. It’s no trivial matter to point to another provider, let alone configure a secondary. Things such as DNS caching and time-to-live (TTL) settings are out of our control and make switching to another DNS provider anything but quick.
TTL is a method for limiting the lifespan of data in systems and networks in order to improve performance. When the preset TTL expires, data is discarded, forcing a data refresh. That would be necessary so that our customers’ systems would see the new DNS provider that they would need to point to. But we only have control of our local TTL, not the TTL of our DNS service provider. And typically, TTL is set at 24 or 48 hours.
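To make the TTL behavior concrete, here is a minimal Python sketch of a toy cache that keeps an answer only until its TTL runs out. It is an illustration, not how any particular resolver is implemented: the `TtlCache` class, the host name `app.example.com` and the address `192.0.2.10` are all made up for the example, and the 24-hour TTL simply mirrors the figure mentioned above.

```python
import time


class TtlCache:
    """Toy cache: each answer is kept only until its TTL expires."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so expiry can be shown without waiting.
        self._clock = clock
        self._entries = {}  # name -> (value, expiry time in seconds)

    def put(self, name, value, ttl_seconds):
        self._entries[name] = (value, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # TTL has run out: discard the entry, forcing a fresh lookup.
            del self._entries[name]
            return None
        return value


# Hypothetical usage with a fake clock measured in seconds.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("app.example.com", "192.0.2.10", ttl_seconds=24 * 3600)

now[0] = 23 * 3600
print(cache.get("app.example.com"))  # still cached after 23 hours

now[0] = 24 * 3600
print(cache.get("app.example.com"))  # expired after 24 hours, returns None
```

The point of the sketch is that a cache holding the old answer keeps serving it until the expiry time passes, no matter what has changed upstream in the meantime.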
Nonetheless, we had to do something, and since we have invested fairly significantly in Amazon Web Services for some development infrastructure, we turned to Amazon’s DNS service, Route 53. We first had to copy into the Route 53 configuration all the DNS records that we had been “advertising” to Dyn. We then advertised in our local DNS configurations that Route 53 would be our primary DNS going forward. Then we had to wait for caches to be flushed and the TTLs to expire. Fortunately, calls started to dwindle within a couple of hours of making the switch, and by the end of the day, everything seemed to have stabilized.
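For readers wondering what copying records into Route 53 can look like, the Python sketch below builds a change batch in the JSON shape that Route 53's ChangeResourceRecordSets operation accepts. This is an assumption about one possible approach, not a description of the steps we actually followed. The record names, addresses and TTLs are placeholders, and the `build_change_batch` helper is hypothetical.

```python
import json


def build_change_batch(records, comment="Copy of records from previous provider"):
    """Turn simple record dicts into a Route 53 style change batch.

    Each input record is a dict with the keys name, type, ttl and values.
    UPSERT creates the record set if it is absent and updates it otherwise.
    """
    changes = []
    for rec in records:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": rec["name"],
                "Type": rec["type"],
                "TTL": rec["ttl"],
                "ResourceRecords": [{"Value": v} for v in rec["values"]],
            },
        })
    return {"Comment": comment, "Changes": changes}


# Placeholder records; real ones would come from an export of the old zone.
exported_records = [
    {"name": "app.example.com.", "type": "A", "ttl": 300,
     "values": ["192.0.2.10", "192.0.2.11"]},
    {"name": "www.example.com.", "type": "CNAME", "ttl": 300,
     "values": ["app.example.com."]},
]

print(json.dumps(build_change_batch(exported_records), indent=2))
```

The resulting JSON could then be saved to a file and submitted with the AWS CLI command `aws route53 change-resource-record-sets`, which takes a hosted zone ID and a change batch. Keeping the TTLs short on the new records is a design choice that would make any later change take effect sooner.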
While the DNS switchover was in progress, we took other steps to address customers’ concerns. The attack was getting plenty of attention in the news, but we couldn’t assume that our customers would realize that our problems were the result of an attack against a totally different company. We had to reassure them that our company wasn’t under attack, that none of their data was at risk and that we were in the process of modifying our infrastructure to ensure that this type of attack couldn’t hurt us again.
Thanks to the efforts of our customer support team, we now have a standard email response that we can send to customers in the event of some future DDoS attack, as well as useful fodder for the company’s status page and an FAQ that answers questions about these types of attacks.
As for the operations team, it will be busy figuring out how to configure two DNS service providers, how to sync data between them and how to deal with other idiosyncrasies arising from having much-needed redundancy in our DNS service.
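One piece of that syncing work is simply detecting when the two providers disagree. The Python sketch below shows one way such a check might be structured, assuming each provider's records have already been exported into a dictionary keyed by record name and type. The `diff_records` helper and all of the sample data are hypothetical, and a real check would have to fetch the records through each provider's own interface.

```python
def diff_records(primary, secondary):
    """Compare two mappings of (name, type) -> set of record values.

    Returns three dicts: records missing from the secondary, records
    present only on the secondary, and records whose values differ.
    """
    missing = {k: v for k, v in primary.items() if k not in secondary}
    extra = {k: v for k, v in secondary.items() if k not in primary}
    mismatched = {
        k: (primary[k], secondary[k])
        for k in primary.keys() & secondary.keys()
        if primary[k] != secondary[k]
    }
    return missing, extra, mismatched


# Placeholder data standing in for exports from two DNS providers.
provider_a = {
    ("app.example.com.", "A"): {"192.0.2.10", "192.0.2.11"},
    ("www.example.com.", "CNAME"): {"app.example.com."},
}
provider_b = {
    ("app.example.com.", "A"): {"192.0.2.10"},
    ("mail.example.com.", "A"): {"192.0.2.25"},
}

missing, extra, mismatched = diff_records(provider_a, provider_b)
print("Missing from secondary:", missing)
print("Only on secondary:", extra)
print("Different values:", mismatched)
```

Run on a schedule, a comparison like this would flag drift between the providers before a failover exposes it.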
This week's journal is written by a real security manager, "Mathias Thurman," whose name and employer have been disguised for obvious reasons. Contact him at firstname.lastname@example.org.