For quite some time now, four years to be exact, DroidWiki.org has provided its users with a hassle-free and easy way to edit pages without any knowledge of wikitext, the markup language used by MediaWiki to generate the content of a page. Additionally, using the citoid service, the wiki also provides an easy way to add citations and references to external pages without needing to know the particular template or to manually fill in all the required information about the third-party site.
All this is powered by different services running in the background: Parsoid, which transforms wikitext into HTML and back to support the visual editing experience, as well as citoid and zotero, which generate citation information for external links. These services are accessible through a REST API provided by RESTBase, a node.js service developed by the Wikimedia Foundation.
In this blog post, I want to give some insights into how these services are run and maintained at DroidWiki.org and what has changed recently in this regard.
How it was done so far
Until recently, the setup of the services providing the features described above was fairly simple and not that complex. RESTBase, Parsoid (available as an apt package) as well as citoid are node.js services that were installed directly on one of the machines DroidWiki.org uses. Zotero, a component used by citoid, was installed on the same machine. All of this was managed by puppet, the automated configuration management system used for the DroidWiki servers.
However, this way of operating the services came with some difficulties as well. The installation procedure was different for each single service and required deep knowledge of the services' architecture as well as of the infrastructure dependencies between them. For example, node.js was a requirement for most of these services, so upgrading a single service needed careful consideration to not break another one. This resulted in slower upgrade and rollout plans, and in new versions being rolled out only as a batch instead of as small, independent deployments of a single service.
Another problem emerged from the fact that only one instance of a particular service was running, on a single machine. If the machine or the service had difficulties, the service was not reachable, resulting in a bad user experience at that moment. The reliability of the services could be improved by a better availability of infrastructure resources, but only to a certain extent.
Together, these problems led to a re-evaluation of the way these services are operated by DroidWiki.
Moving to Docker Swarm
Instead of “just” installing the services on a second machine and load balancing the requests to them with nginx, our frontend webserver, I decided to take a new approach that had not been used at DroidWiki so far: Docker Swarm. Docker is already part of the overall system landscape at DroidWiki, e.g. for running Concourse as the CI/CD system. However, so far, no service of the user-facing production system was hosted using a docker-related solution.
Because of the specific requirements of the RESTBase systems, I decided to put some effort into moving the services to a docker swarm setup. The services can run on different systems without complicated locking or synchronisation of filesystems or databases, as they usually request the required data from other backend systems like the wiki or zotero itself.
Instead of switching all services from the current setup to the new one at once, I decided to take a one-by-one approach. This resulted in a bit of overhead, as some cleanup steps were required at the end; however, it allowed me to distribute the work more easily, deploy in small steps and verify that everything still worked as expected.
But wait… before doing that, I needed to enable the machines of DroidWiki.org to run in docker swarm mode. Until then, the machines ran independent docker daemons and did not communicate with each other. This also required some work on getting a highly available filesystem shared between the machines. So far, each of the machines had its own disks only, plus an NFS share for assets like uploads from the users. However, that system was neither suitable nor meant for storing server configuration files.
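Enabling swarm mode itself boils down to a few Docker CLI commands. A minimal sketch (the IP address and the join token are placeholders, not DroidWiki's actual values):

```shell
# On the first machine: initialize the swarm and make this node a manager.
# The advertise address is a placeholder for the machine's internal IP.
docker swarm init --advertise-addr 10.0.0.1

# "docker swarm init" prints a join token; on each additional machine,
# join the swarm as a worker node using that token.
docker swarm join --token <worker-join-token> 10.0.0.1:2377

# Back on the manager, verify that all nodes have joined the swarm.
docker node ls
```

From that point on, the formerly independent daemons form one cluster that can schedule services across all machines.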
To achieve that, after some looking around on the internet, I decided to go with gluster, a highly available file system that is able to synchronize data between different servers and provides some redundancy when set up accordingly. The configuration was implemented in puppet pretty easily.
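For illustration, a replicated Gluster volume for two machines looks roughly like this on the CLI (hostnames, brick paths and the volume name are made up for this sketch; the actual setup at DroidWiki is managed through puppet):

```shell
# Probe the second machine so both form a trusted storage pool
# (hostnames are placeholders).
gluster peer probe server2

# Create a volume with "replica 2": every file is stored on both bricks,
# so each machine keeps a full copy of the data.
gluster volume create config-vol replica 2 \
  server1:/data/glusterfs/config-vol/brick \
  server2:/data/glusterfs/config-vol/brick
gluster volume start config-vol

# Mount the volume on each machine, e.g. where the service
# configuration files for the docker deployment live.
mount -t glusterfs localhost:/config-vol /mnt/docker-config
```

With this in place, configuration files written on one machine are immediately available on the other, and losing a single machine does not lose the data.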
After that, containerizing citoid was straightforward. Luckily, Zotero already has an official docker image, so there was no need to build that image myself. RESTBase, being just another node.js service, was set up easily as well.
After that, the cleanup steps I mentioned before needed to be done. One thing I did while moving the services to docker was to expose all of them directly on the host, so that from the outside they would still look like they were installed directly on the host. Doing so allowed me to migrate them without any downtime of the services. However, as all of the traffic and requests go through RESTBase, there is no need for the other services to be exposed to the internet or even outside of docker's overlay network.
So, the next step was to move the internal traffic of the services to docker-swarm-internal traffic. Docker provides a so-called overlay network for that and allows services to be requested by their service name, which is automatically resolvable as a DNS name. In the end, after some adjustments in the docker-compose.yml (used to configure the docker swarm deployment), all of the services were reachable only inside an internal overlay network and exposed only through the well-defined API that RESTBase provides.
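The resulting docker-compose.yml follows roughly this shape (service names, image references and ports are illustrative, not the exact DroidWiki configuration): only RESTBase publishes a port on the hosts, while citoid and zotero are attached to the internal overlay network only and are reached via their service names as DNS names.

```yaml
version: "3.7"

services:
  restbase:
    image: example/restbase:latest    # illustrative image name
    ports:
      - "7231:7231"                   # the only port published on the hosts
    networks:
      - services-net
    deploy:
      replicas: 2                     # multiple instances for availability

  citoid:
    image: example/citoid:latest      # illustrative image name
    networks:
      - services-net                  # internal only, no published ports
    deploy:
      replicas: 2

  zotero:
    image: zotero/translation-server  # official image, no need to build one
    networks:
      - services-net

networks:
  services-net:
    driver: overlay                   # swarm overlay network; service names
                                      # resolve through docker's built-in DNS
```

A stack like this is deployed (and updated) with `docker stack deploy -c docker-compose.yml services`, so rolling out a new version of a single service is just a change of one image reference followed by a redeploy.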
After moving the services to the docker swarm deployment and increasing the number of requested instances, the availability of the services went up to nearly 100%. There were still some minor interruptions, but they were no longer related to this specific setup. This results in 99.90 % availability for Citoid and 100 % for the other services mentioned in this blog post over the last 7 days (as of now), which is incredible compared with the situation before. Also, the services can now be deployed independently, as they all bring their own environment, which makes applying updates and upgrades much easier as well.