Inter.link continually works to be at the forefront of network-as-a-service. This includes a huge focus on automation to improve efficiency and ultimately enhance the customer experience overall.
Like many companies, Inter.link has been using Ansible to automate tasks on its network. However, every tool involves trade-offs, and at some point those trade-offs stop paying off. We're now reaching the point where Ansible's benefits are outweighed by its limitations. This article looks at some of the issues we've had to work around as we scale up our automation, and where we're heading next.
On the positive side:
- Ansible is an industry standard tool for performing simple automated tasks.
- It’s easy to hire engineers who have experience with Ansible, and it’s easy to learn for those who don’t.
- Most network vendors have Ansible modules, which most of the time “just work”.
- Ansible makes it possible to go from zero automation to something that is “OK” in a matter of hours.
This ubiquity, ease of use, and rapid progress afforded by Ansible come at a cost, though.
Generating Prefix Filters
Early on we packaged up Ansible inside a REST API with authentication and centralised storage of the generated Ansible logs. Calling the API triggers an asynchronous execution of Ansible. Once it has completed, the Ansible logs are pushed to central storage so that we can see what our orchestration system is doing when it calls this app. The app also makes a callback to let the caller know the task outcome.
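The shape of that wrapper can be sketched roughly as follows. This is a minimal illustration, not Inter.link's actual application: the log directory stands in for central storage, and the HTTP callback helper is an assumption about how the outcome notification might look.

```python
import json
import subprocess
import threading
import urllib.request
from pathlib import Path

def post_json(url, payload):
    """Default notifier: POST the task outcome back to the caller."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def run_async(task_id, cmd, log_dir, callback_url, notify=post_json):
    """Run `cmd` (e.g. an ansible-playbook invocation) in a background
    thread, store its log centrally, then notify the caller of the
    outcome. The API handler returns to the caller immediately."""
    def worker():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        Path(log_dir, f"{task_id}.log").write_text(proc.stdout + proc.stderr)
        notify(callback_url, {"task_id": task_id,
                              "rc": proc.returncode,
                              "ok": proc.returncode == 0})
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

An API handler would then call something like `run_async("job-42", ["ansible-playbook", "site.yml", "--tags", "filters"], "/var/log/ansible", callback_url)` and respond to the caller straight away.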
Engineers can run Ansible manually via the CLI (Command Line Interface) too, but this in-house application ensures the same actions are being performed by humans and machines (the same Ansible roles and tags are called via CLI or via our REST API).
We run all our software in Kubernetes, and after a while we noticed that our Ansible API container was occasionally consuming 4 GB of RAM and crashing. A bit of digging revealed that it was related to the prefix- and AS path filter generation step, which used a very simple Jinja2 template to generate filter list files (note: prefix and AS path filter lists are not part of a device’s running config; they are extra files on the device alongside the running config).
As an IP Transit provider, prefix- and AS path filters are an essential part of our daily operations. Some customers have prefix lists which are 100K entries long (this is after aggregation!), some of our larger peers have several hundred thousand entries.
As a result, deploying prefix list updates to “busy” routers (those with many customers and peers) started to take a long time, and the container in Kubernetes was occasionally crashing because it was running out of memory. After replacing the simple Jinja2 template called inside the Ansible play with a call to a Python script which essentially does the same thing, RAM usage dropped from 4 GB to 256 MB.
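The key difference is streaming: rather than rendering one giant template string in memory, each entry is written straight to the filter file. A minimal sketch of the idea, with an illustrative syntax that isn't any particular vendor's:

```python
def write_prefix_list(name, prefixes, out_path):
    """Stream a prefix-list file entry by entry. `prefixes` can be any
    iterable (including a generator), so memory use stays flat even
    for lists with hundreds of thousands of entries."""
    with open(out_path, "w") as f:
        f.write(f"prefix-list {name} {{\n")
        for prefix in prefixes:
            f.write(f"    {prefix};\n")
        f.write("}\n")
```

Feeding this a generator that yields prefixes one at a time keeps peak memory proportional to a single entry, not the whole filter.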
Copying Configs
We always generate a full device configuration rather than a partial configuration, to ensure the device state is consistent with our source of truth. This has become standard practice within the industry in recent years, and Inter.link is no exception.
However, as our device configs have grown over time, we noticed that config deployments were becoming slower.
On busy routers, such as those connected to a large Internet Exchange Point, the single step of copying the config file to the router was now taking 4 minutes. Some routers have 80k lines of config (as stated above, this is without any prefix or AS path filters), which sounds like a lot, but this is just text, so the actual volume of data is very low, making 4 minutes a very long time in relative terms. After replacing the internal Ansible file copying module with a direct call to rsync, the file copy time dropped from 4 minutes to 2 seconds.
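A sketch of what that replacement might look like from the playbook's side: a helper that shells out to rsync over SSH. The flag set here is a plausible baseline, not taken from our actual playbooks.

```python
import subprocess

def rsync_config(local_path, host, remote_path, dry_run=False):
    """Copy a config file to a router with rsync over SSH instead of
    Ansible's internal copy module. rsync only transfers changed
    blocks, which is why it is so much faster for large text files."""
    cmd = ["rsync", "--archive", "--compress",
           local_path, f"{host}:{remote_path}"]
    if dry_run:
        return cmd  # let callers inspect the command without running it
    subprocess.run(cmd, check=True)
    return cmd
```

An Ansible task can invoke a helper like this via the `command` module, keeping the rest of the play unchanged.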
Generating Config Diffs
When we push a new config to a device, a diff (difference) is always generated for review. Again, as device configs started to grow in size, the busier routers on the network were taking longer and longer to generate a config diff. Initially the task of simply creating and displaying the diff was limited to 5 minutes. Later, once it started to exceed 5 minutes, we simply bumped that limit to 10 minutes and carried on. Once we had devices which were taking 12 minutes to generate and display a config diff, we decided we couldn’t keep increasing this timer and that something needed to be done.
Ansible provides us with the following workflow: first, it pushes a candidate config to the device; the device loads this into a temporary working space and compares it against the running config to generate the diff; finally, the diff is sent back to the user to check whether the changes are acceptable.
After some digging we found that the device generates the diff in less than 1 second, but because Ansible streams 70k lines of diff output back over the device CLI’s standard output, Pexpect has to parse all of that output. We rewrote the Ansible task to save the diff generated on the device into a file, rsync the file back to the user, and have Ansible display the contents of that file. As a result, generating a config diff on our busiest routers dropped from 12 minutes to 2 seconds.
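For reference, the diff itself is cheap to compute: in the revised workflow the device produces it, but the equivalent computation can be shown locally with Python's standard difflib. This is purely illustrative of the diff step, not code from our stack:

```python
import difflib

def config_diff(running_lines, candidate_lines):
    """Unified diff between running and candidate configs.
    Each input is a list of lines (newline-terminated)."""
    return "".join(difflib.unified_diff(
        running_lines, candidate_lines,
        fromfile="running-config", tofile="candidate-config"))
```

The expensive part in our case was never this computation; it was streaming the result line by line through an interactive CLI session.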
Abstracting Dependencies
As mentioned above, we have packaged Ansible into a container with an in-house application to provide an authenticated, centrally logged, and asynchronous REST API. But as also mentioned above, engineers can also run Ansible directly via CLI.
Ansible and its modules are written in Python, which isn’t known for having great package management. To ensure the same version of Ansible and its various dependencies are used in both cases, when our engineers run Ansible CLI commands, Ansible is actually being run transparently inside a container. Both containers use the same source for dependency version control.
We’re quite satisfied with this now, but before we created this transparent Ansible container for CLI users, the dependencies had diverged between CLI and API, which caused inconsistencies.
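One way to make that transparent is a small shim on engineers' PATH that builds and execs a container invocation. The image name and mount paths below are assumptions for illustration, not our actual registry layout:

```python
import os

def containerised_ansible(argv):
    """Build the command that runs ansible-playbook inside a pinned
    container image, mounting the current directory as the workspace
    so the CLI experience is unchanged for the engineer."""
    return ["docker", "run", "--rm", "-it",
            "-v", f"{os.getcwd()}:/work", "-w", "/work",
            "registry.example.com/netops/ansible:pinned",
            "ansible-playbook", *argv]

# A shim script named `ansible-playbook` could then do:
#   os.execvp("docker", containerised_ansible(sys.argv[1:]))
```

Because CLI and API both pull the same pinned image, the dependency set cannot silently diverge again.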
Deprecating Ansible and What Comes Next
Each time we work around something in Ansible, we add a bit more complexity to our tech stack. This path isn’t sustainable, especially when you consider that Ansible’s role in our stack is fairly basic.
Before Ansible is ever used, a different application in our tech stack has pulled the required data from our source of truth and compiled that into an internal neutral configuration format. Ansible’s role is simply to load that neutral configuration format, serialise it to CLI config (or to prefix/AS filter config), and push to the network device. It has no other function.
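That serialisation role is small enough to sketch. The key names and output syntax below are made up for illustration; the real neutral format is internal to Inter.link:

```python
def serialise(neutral):
    """Serialise a (hypothetical) neutral config dict into flat CLI
    config lines, the whole of Ansible's job in this pipeline."""
    lines = [f"hostname {neutral['hostname']}"]
    for iface in neutral.get("interfaces", []):
        lines.append(f"interface {iface['name']}")
        lines.append(f"  mtu {iface['mtu']}")
    return "\n".join(lines) + "\n"
```

Anything that can load the neutral format, render it, and push the result to a device can take over this job.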
Before the list of Ansible workarounds gets much bigger, we have already started to think about deprecating Ansible. We’re still investigating this, so we can’t say exactly where we’re going next, but there are plenty of open-source tools that already provide multi-threaded, asynchronous, API-driven interactions with network devices, and loading our existing neutral configuration format and populating YANG models with that data isn’t a huge step. This is the direction we will start testing first, and we’ll provide a follow-up article once we’ve made some progress.