The JobQueue – How droidwiki.de runs Jobs

The MediaWiki JobQueue is a powerful tool for wiki administrators and users alike, and especially for developers who want to perform asynchronous tasks that take a long time. By default, MediaWiki uses the JobQueue for some small but very important tasks, such as the job called LinksUpdate, a little job type that is added whenever a page is edited in order to update linked pages. The goal is to update the status of links: if a user creates a new page, for example, links to it should no longer be marked as red links, and pages that include a template get updated when the template changes. Depending on the number of links, all of these tasks can be very expensive in terms of loading time and processing power. Other packages (extensions) also rely heavily on the JobQueue, e.g. CirrusSearch, the extension that gives MediaWiki a much better search engine based on elasticsearch. After an edit, some jobs make sure that the edit is indexed by elasticsearch and is available in the search as fast as possible.

For quite some time now, MediaWiki has executed jobs asynchronously: if a user opens a wiki page (and has JavaScript enabled), a little script sends a request to a hidden special page to start the execution of one job. This way, MediaWiki makes sure that the job queue is processed after a short waiting time (depending on how many jobs are already in the queue and how many page views the wiki gets). The problem: even with only one job per request or fewer (the run rate can be adjusted with a configuration variable), processing jobs automatically on every request can become a performance problem. And lowering the run rate of jobs exposes the next problem: how can a wiki sysadmin make sure that all jobs are executed in a reasonable amount of time? In this blog post, I’ll try to give a short overview of the history, background and current implementation of how droidwiki.de handles the job queue.

The history

From the beginning and for a long time, droidwiki.de simply used the default configuration for jobs: one page view executed one job asynchronously. That worked very well with only a few edits and page views per day and without any extension that relied on the job queue (CirrusSearch and elasticsearch were enabled on droidwiki.de much later). But over time, with more features, more (self-developed) extensions and so on, the performance of the wiki became really bad. After investing quite a lot of work into improving the performance of the wiki, e.g. the loading and saving times, the next big thing to tackle was the job queue. Even though it had been asynchronous from the beginning, it still has a small performance impact that can be removed. Furthermore, moving away from executing jobs on page views is an investment into the future: when the wiki gets more edits and page views, jobs should not become a problem.

The first attempt was to disable the execution of jobs on page views and to collect the jobs for a specific time frame. Disabling is easily done by setting the configuration variable $wgJobRunRate to 0. After each time frame, a cron job executed the runJobs.php maintenance script of the MediaWiki software to run all jobs saved in the JobQueue. The first time frame was 10 minutes. That means: every edit, upload or other action that uses the JobQueue saved a job in the job queue, which was executed (in the worst case) 10 minutes later. That’s probably not the definition of asynchronous.

And the first problem showed up very quickly: the indexing of uploads. A user who edits a page with the VisualEditor and wants to use a new image in it normally starts the VisualEditor, starts writing and uploads the image just a moment before adding it to the page. The problem: the search API used by the VisualEditor relies on a search index, which is updated by a job that is inserted into the job queue after the upload. In the worst case, the user therefore had to wait 10 minutes before the image could be used in the VisualEditor: a pain! The quick fix: reduce the cron job interval from 10 minutes to 5 minutes. That’s not the best solution either, but waiting 5 minutes is better than waiting 10.
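
The cron-based setup described above can be sketched as a small wrapper script. This is only an illustration, not the script droidwiki.de actually used: the install path /var/www/mediawiki, the script name run-mw-jobs.sh and the time limit of 540 seconds are assumptions. It also presumes that $wgJobRunRate = 0; is already set in LocalSettings.php.

```shell
#!/bin/sh
# Sketch of a cron wrapper around MediaWiki's runJobs.php.
# Assumed crontab entry (every 10 minutes; the script name is hypothetical):
#   */10 * * * * /usr/local/bin/run-mw-jobs.sh
MW_DIR="${MW_DIR:-/var/www/mediawiki}"   # assumed install path
MAX_SECONDS="${MAX_SECONDS:-540}"        # stop before the next cron run starts

run_jobs() {
    # --maxtime limits the wall-clock time of one runJobs.php invocation
    php "$MW_DIR/maintenance/runJobs.php" --maxtime="$MAX_SECONDS"
}

# Only run when a MediaWiki checkout exists at MW_DIR.
if [ -f "$MW_DIR/maintenance/runJobs.php" ]; then
    run_jobs
else
    echo "runJobs.php not found below $MW_DIR" >&2
fi
```

Changing the crontab entry to */5 (and lowering MAX_SECONDS accordingly) gives the 5-minute variant.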

It’s clear: there must be a much better solution to run jobs asynchronously without having to wait for a fixed time frame.

The current implementation

As so often, the solution can be found at the Wikimedia Foundation. Instead of relying on a cron job that is triggered at a configurable interval, the development and operations teams of the Wikimedia Foundation use a much better approach: two services called jobrunner and jobchron. First of all: the implementation of the Wikimedia Foundation is a bit oversized for small wikis. It uses:

  • a cluster of redis servers, which hold the job queue data
  • application servers (which run the MediaWiki code), which insert the jobs into the redis servers
  • a cluster of jobrunner servers, which run the jobs saved in the redis cluster

The goal: separate the servers that run the jobs from the servers that insert them (the servers that serve the content to[1] and process the requests of the users). That ensures that users don’t notice the load of the jobrunners, and the servers of the jobrunner cluster can be configured according to their needs.

For droidwiki.de (a really small wiki), and probably for any other wiki that isn’t a wiki farm, this implementation is really oversized. So, for droidwiki.de I adapted the implementation, but reduced the several servers and clusters to two: one app server (which exists already), which delivers the content to the user, and one jobrunner server (which already runs other supporting applications and services for the app server). The basic idea/workflow:

  1. An app server (the one that delivers droidwiki.de) saves a job into the redis key/value store
  2. the jobchron service aggregates the job
  3. the jobrunner service picks up the job and executes it against the droidwiki.de database

There we have two new words: jobchron and jobrunner. These two are essential for using the Wikimedia job handling model. The jobchron service aggregates jobs, which means:

Recycle or destroy any jobs that have been claimed for too long and release any ready delayed jobs into the queue. Also abandon and prune out jobs that failed too many times.

Basically, this means that every job in the local queue (local means the queue of the wiki) is checked and released into the global job queue, which is processed by the jobrunners. The jobchron service organizes the job queue and makes sure that it is clean and doesn’t contain “dead” jobs. The jobrunner, on the other hand, is a service that provides several (configurable) groups of so-called runners. These runners execute the jobs in the job queue.

So, basically there are three overall steps to do:

  1. configure MediaWiki to use the redis server to insert jobs
  2. install and configure the jobchron service to organize the job queue
  3. install and configure the jobrunner service to actually run the jobs

[1] That’s not quite accurate, as requests from users who are not logged in are not forwarded to the app servers if they only want to view a page (in that case, a server of the caching cluster provides the content)

Configure MediaWiki

That’s probably the easiest part, as it only requires some changes to LocalSettings.php (or, in the droidwiki.de case, CommonsSettings.php). The first configuration variable is $wgJobTypeConf. It describes how MediaWiki manages its jobs and defaults to the database job queue class, which means that jobs are saved in the database (in the job table). droidwiki.de overwrites the whole variable so that it looks like this:

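// Store all job types in redis instead of the database (job table)
$wgJobTypeConf = array(
	'default' => array(
		'class' => 'JobQueueRedis',
		'redisServer' => '127.0.0.1:6379',
		// connection options such as password or timeouts go here
		'redisConfig' => array(),
		// seconds a job may stay claimed before it counts as stale
		'claimTTL' => 3600,
		// the queue is maintained by a background service
		'daemonized' => true,
	)
);
// Keep track of the queues with pending jobs in redis as well
$wgJobQueueAggregator = array(
	'class'        => 'JobQueueAggregatorRedis',
	'redisServers' => array(
		'localhost',
	),
	'redisConfig'  => array(
		'connectTimeout' => 2,
	)
);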

This sets the JobQueue class to JobQueueRedis, which is the driver for a job queue in the redis key/value store. It also defines the redis server and port to use as well as the configuration for the redis connection, such as a password or timeouts (removed in the above example). The second variable, $wgJobQueueAggregator, configures the redis-based aggregator, which keeps track of the queues that have jobs waiting. The really important option is daemonized. If it is set to true, MediaWiki assumes that the job queue is handled by another service and leaves queue maintenance, such as recycling jobs that have been claimed for too long, to that service. Because $wgJobRunRate is still set to 0, jobs are not executed on page views either, so from now on droidwiki.de jobs are saved in redis, but not yet executed by anyone anywhere. That has to change.

Installation and Configuration of Jobchron and Jobrunner

The jobchron and jobrunner services are provided in the same repository, so it makes sense to install and configure both at the same time (especially because both share the same configuration file). droidwiki.de has its own directory structure: services related to MediaWiki are saved in /data/mediawiki/services/, where a new directory called jobrunner was created for the jobrunner service. There’s no installation script or deb package, so everything has to be done manually, which isn’t a big problem, because the steps were very easy for droidwiki.de. A configuration file at /etc/default/jobrunner describes the options for the jobrunner and jobchron services:

JOBRUNNER_CONFIG="/etc/mediawiki/jobrunner.conf"
JOBRUNNER_PID="/var/run/jobrunner/pid"
JOBCHRON_PID="/var/run/jobchron/pid"
JOBRUNNER_USER="jobrunner"
JOBRUNNER_GROUP="jobrunner"
DAEMON_OPTS=""

This simply defines where the config file is saved and where the pid files for both services have to be created. The user and group under which the services run are defined as well. The next interesting part is the jobrunner.conf in the /etc/mediawiki directory. /etc/mediawiki is a central place in the droidwiki.de implementation (as far as I know, the WMF has/had this, too) where configuration files for services related to the MediaWiki software itself are saved (such as parsoid and the jobrunner services). The config file is a little JSON-encoded file, which was adapted from the sample shipped with the services. It simply defines how many runners should be started for each group of jobs. For example, the parsoid jobs aren’t processed by the main/general job runner group, but in their own group. This makes it possible to separate jobs and to assign a specific number of runners to each group.
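
To make the idea of runner groups more concrete, here is a rough sketch of what such a file could look like for a single-server setup. It is not the droidwiki.de configuration. The key names (groups, runners, include, exclude, limits, redis, dispatcher) and the dispatcher placeholders are an assumption based on the sample config shipped with the service, and the job type names, runner counts, redis address and MediaWiki path are purely illustrative. Treat the sample file in the jobrunner repository as the authoritative reference.

```shell
#!/bin/sh
# Sketch: write a small jobrunner.conf with two runner groups.
# All key names and values are assumptions; compare them with the
# sample config in the jobrunner repository before using them.
CONF_FILE="${CONF_FILE:-./jobrunner.conf}"   # copy to /etc/mediawiki/ afterwards

cat > "$CONF_FILE" <<'EOF'
{
    "groups": {
        "basic": {
            "runners": 2,
            "include": ["*"],
            "exclude": ["ParsoidCacheUpdateJobOnEdit"]
        },
        "parsoid": {
            "runners": 1,
            "include": ["ParsoidCacheUpdateJobOnEdit"]
        }
    },
    "limits": {
        "attempts": {"*": 3},
        "claimTTL": {"*": 3600},
        "real": {"*": 300},
        "memory": {"*": "300M"}
    },
    "redis": {
        "aggregators": ["127.0.0.1:6379"],
        "queues": ["127.0.0.1:6379"]
    },
    "dispatcher": "php /var/www/mediawiki/maintenance/runJobs.php --wiki=%(db)x --type=%(type)x --maxtime=%(maxtime)x --memory-limit=%(maxmem)x --result=json"
}
EOF
```

In this sketch, the "basic" group handles every job type except the excluded one with two runners, while the "parsoid" group gets a single runner of its own.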

That was it, the configuration part is done. Before I was able to start both services to process the job queue of droidwiki.de, I had to write two init scripts, which can be adapted from the Wikimedia puppet source (jobrunner, jobchron). Now the time has come: start both services.

The services live!

It’s very simple to check whether both services work correctly: I simply made an edit. Running the showJobs.php maintenance script directly after the edit gave me the number of jobs that were currently in the queue. Just a few seconds later, the job queue was at 0: the job runners had processed all jobs. After checking that the new text of the edit was in the elasticsearch index, I was sure that I was finished 🙂
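
The check described above can also be scripted. The following is a minimal sketch that polls showJobs.php until the queue is empty. The install path and the function names are assumptions, and it relies on showJobs.php printing just the number of queued jobs when called without options.

```shell
#!/bin/sh
# Sketch: poll showJobs.php until the job queue is empty.
MW_DIR="${MW_DIR:-/var/www/mediawiki}"   # assumed install path

queue_size() {
    # Without options, showJobs.php prints the number of queued jobs.
    php "$MW_DIR/maintenance/showJobs.php"
}

wait_for_empty_queue() {
    # $1: maximum number of polls (default 30), one poll per second
    tries="${1:-30}"
    while [ "$tries" -gt 0 ]; do
        size="$(queue_size)"
        echo "jobs in queue: $size"
        if [ "$size" = "0" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

Calling wait_for_empty_queue 60 right after an edit returns successfully as soon as the runners have drained the queue, and fails if jobs are still waiting after about a minute.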
