A Multi-Threaded Python S3 Application

The Application: keevalbak

keevalbak (= "Key-Value Backup") is a simple application to back up files from a directory to an Amazon S3 bucket.

The mapping from files to S3 values is the most direct possible: Unicode file names (relative to the root of the directory being backed up) map to UTF-8 encoded S3 keys (with a prefix representing the backup identity and date-time) and the file contents map to the S3 value.

Problem: Latency

The original version of the application was single threaded. This resulted in a latency problem. For example, for each file written, you have to finish writing one file to S3 before you start writing the next one. The overall data transfer rate is lower than the maximum possible between your computer and Amazon S3.

Solution: Multi-threading with Queues

The solution to the latency problem is to execute multiple file writes at the same time, starting new file write requests before previous requests have finished.

There are various ways to achieve multi-threading within a Python application. For example, you can divide tasks across separate Python processes. Or you can rewrite your application based on the Twisted framework.

However, the design philosophy of keevalbak is to keep everything simple.

Recently I read Practical threaded programming with Python at IBM developerWorks, which describes how to do multi-threaded applications in Python using the Queue module.

I decided to use this as a basis for speeding up those parts of keevalbak which involve executing multiple independent S3 requests, because it would let me multi-thread those parts of the application with minimal impact on the rest of the application.

There are actually three places where keevalbak generates large groups of S3 requests that can be executed independently:

When writing files to backup (write requests)
When restoring files from backup (read requests)
When pruning files from older backups (delete requests)

For the writing and reading requests, the local file operations can also be executed independently of each other.

A mini-framework

To enable multi-threading for each of these three types of request, I created a mini task-based multi-threading framework. Like all frameworks, it makes assumptions about what kind of program is going to use it.

The framework is based on the notion of a "task", which is an object representing one of a group of similar tasks. Each task object is required to implement the following methods:

doUnsynchronized: the initial part of the task which can be performed independently by each thread
doSynchronized: the final part of the task which must be performed synchronously on the main thread (for example, updating the results of the request to a central data structure)
getThreadLocals: retrieve a map of task shared attributes which are cloned for use by all tasks within a single thread

The first two are reasonably self-explanatory. (One might expect there to be a "before" version of the doSynchronized method, however any such code would be implicitly included in the task object initialiser, since the tasks are initially constructed on the main thread before being executed on multiple threads.)

The getThreadLocals method deals with the fact that tasks may use shared resources which are not thread-safe, and in effect gives the tasks a chance to provide a per-thread copy of those resources. The framework calls this method once for the first task handled by each thread, and replaces the shared attribute values with the per-thread values upon executing each task.

In the case of keevalbak, I found that the boto objects representing S3 buckets were not thread-safe. Although I could have investigated why, and attempted to fix the problem, I decided to just be lazy and make per-thread copies of all the relevant objects.

(To see that boto is not thread-safe in this respect, replace the "return" line of the method BackupFileTask:.getThreadLocals in BackupOperations.py with return {} and you will see errors like:

S3DataError: S3Error: ETag from S3 did not match computed MD5

)

The multi-threading framework is contained in the file ThreadedTaskRunner.py. The main class is ThreadedTaskRunner. This class is derived from TaskRunner, which has the same functionality as ThreadedTaskRunner except that it runs tasks in single-threaded mode.

(An additional feature of the framework is "checkpointing", where tasks are run in batches of some size, and a checkpoint operation is run after each batch. In keevalbak this feature supports recording the state of a partially completed backup, so that an very large backup interrupted for some reason can be resumed without having to start all over again from the beginning.)

Results

It is difficult to be completely scientific about testing performance of a networking application which communicates with a second-party application over the Internet, however the following is a typical result of running a test backup and restore with verification of a Git repository representing my Emacs site-lisp directory:

With single-threaded TaskRunner: 81 seconds
With multi-threaded ThreadedTaskRunner set to 10 threads: 19.5 seconds