keevalbak

Keevalbak: Python scripted on-line backups to Amazon S3

Keevalbak provides scriptable on-line backups to an Amazon S3 bucket ("keevalbak" stands for "Key/Value Backups"). It does this via the s3bucketmap.py Python dictionary interface to S3 (so, for example, it could easily be extended to use any other key/value storage service which has a Python dictionary interface).
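
To illustrate why a dictionary interface makes the storage back end swappable, here is a minimal sketch. It is not keevalbak's actual code: the function names `store_file` and `fetch_file` are hypothetical, and in the usage below a plain Python `dict` stands in for the S3 bucket map.

```python
from collections.abc import MutableMapping


def store_file(store: MutableMapping, key: str, path: str) -> None:
    """Copy the bytes of the file at `path` into `store` under `key`."""
    with open(path, "rb") as f:
        store[key] = f.read()


def fetch_file(store: MutableMapping, key: str, path: str) -> None:
    """Write the bytes stored under `key` to the file at `path`."""
    with open(path, "wb") as f:
        f.write(store[key])
```

Because both functions only use item assignment and item lookup, any object that behaves like a dictionary (an in-memory `dict` for testing, or a wrapper around a remote key/value service) could be passed as `store`.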

Download

Keevalbak can be downloaded from the keevalbak git repository at GitHub.

Sample code

For some sample code, see BackupExample.py.

Design

Keevalbak is designed to be simple and reasonably robust.

A backup group is a set of backups consisting of a full backup and all incremental backups after that until the next full backup.

Each backup is represented by a date-time string to the nearest second.
Incremental backups are done by recording SHA1 hashes of file contents. When a file has the same content as one already written within the same backup group, the content is not written again; instead, the file's record points to where that content was written earlier.
The set of directories and files backed up within each backup is written explicitly to a meta-data file.
All backup records currently in a backup map are recorded explicitly in a global meta-data file.
Verification is done by a full restore and then a recursive comparison of original and restored files on the source computer.
Currently, if a backup is aborted, there will be no record of it in the meta-data file, and any future incremental backups will only increment on top of previous successful backups. (But an ability to build on partial backups is planned; this will involve recording a backup before it starts but marking it as "incomplete".)
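
The hash-based approach described above can be sketched as follows. This is an illustration under stated assumptions, not keevalbak's implementation: the function name `backup_directory`, the key layout (`<backup id>/content/<relative path>`), and the shape of the returned listing are all hypothetical, and `store` can be any dictionary-like object.

```python
import hashlib
import os


def backup_directory(store, source_dir, backup_id, written_hashes):
    """Back up `source_dir` into `store` (any dict-like object).

    `written_hashes` maps a SHA1 hex digest to the key under which that
    content was first written within the current backup group. Pass an
    empty dict for a full backup and reuse it for later incrementals.
    Returns this backup's listing: relative path -> content key.
    """
    listing = {}
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, source_dir).replace(os.sep, "/")
            with open(path, "rb") as f:
                data = f.read()
            digest = hashlib.sha1(data).hexdigest()
            if digest not in written_hashes:
                # Content not seen in this backup group yet: write it once.
                content_key = f"{backup_id}/content/{rel}"
                store[content_key] = data
                written_hashes[digest] = content_key
            # Either way, the listing points at where the content lives.
            listing[rel] = written_hashes[digest]
    return listing
```

Starting a new backup group corresponds to calling the function with a fresh, empty `written_hashes`, so that every file's content is written again in the new full backup.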

Version

Version 1.1a: added and tested handling of Unicode file names
Version 1.2: added multi-threaded S3 operations
Version 2.0: separate recording of which files are in a backup, and which file contents have been written. This speeds up incremental backups in the case where the directory being backed up has many files and there are many incremental backups between each full backup, because it avoids the need to retrieve a full file listing for each previous incremental backup.

Version 2 upgrade

The following only applies if you have already installed and used version 1 of keevalbak to do backups.

Version 2 of keevalbak is not compatible with version 1.x. This means that:

version 2 cannot be used to do an incremental backup following on from a previous version 1.x backup
version 2 cannot be used to restore a version 1 backup

Therefore, if you have been using version 1 of keevalbak, the procedure for upgrading to version 2 is:

Upgrade to version 2 (e.g. by updating from the master branch on GitHub).
Perform a full backup of each directory that you back up.
If, after doing this, you need to restore an earlier version 1 backup, then download a separate copy of version 1 (from the v1 branch on GitHub) and use that to do the restore.

My reasons for not maintaining backwards compatibility between versions 1 and 2 are as follows:

Maintaining backwards compatibility would add extra complexity to new versions of the software.
Most backups are of limited value once newer backups of the same content have been made. If you really need to keep a history of something, you should use source control (or some application which takes responsibility for maintaining its own data history across upgrades), and then back up your repository.
A separate installation of keevalbak version 1 can be used to restore existing backups, so there is no risk of losing data.

Final note: the basic file layout has not changed substantially, and although versions 1 and 2 are incompatible for incremental backups and for restores, the "prune" functionality will work whether or not previous backups are version 1 or version 2.

Features

Full backup
Incremental backup (including an incremental backup that builds on a previous full or incremental backup that was not completed)
Full restore
Handles Unicode file names
Optional full verification (i.e. via full restore and directory comparison)
Incremental verification, i.e. trusting records of hash values of file contents read previously (at least once) from the backup
Pruning of previous backup groups
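
As an illustration of the directory comparison step in full verification, here is a minimal sketch using Python's standard `filecmp` module. The function name `directories_match` is hypothetical, and this is one possible way to do the comparison rather than a description of keevalbak's own code.

```python
import filecmp
import os


def directories_match(original, restored):
    """Return True if both trees hold the same files with identical bytes."""
    # ignore=[] so that no names (e.g. "CVS", "tags") are skipped by default.
    comparison = filecmp.dircmp(original, restored, ignore=[])
    if comparison.left_only or comparison.right_only or comparison.common_funny:
        return False
    # dircmp itself compares files shallowly (by os.stat() signature),
    # so re-check the common files byte by byte.
    _match, mismatch, errors = filecmp.cmpfiles(
        original, restored, comparison.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    # Recurse into subdirectories present on both sides.
    return all(
        directories_match(os.path.join(original, d), os.path.join(restored, d))
        for d in comparison.common_dirs
    )
```

Calling `directories_match(source_dir, restored_dir)` after a full restore into a scratch directory would then report whether the restored tree is identical to the source.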

Limitations

Has been tested only on Windows Vista.
Has been tested only via the verification option (i.e. there are no unit tests).
Works within limitations of S3, i.e. maximum key length (1024 bytes) and maximum value size (5 gigabytes).
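
Because the key limit is measured in bytes rather than characters, Unicode file names can reach it sooner than their character count suggests. The following sketch shows one way to check a candidate key; the names `S3_MAX_KEY_BYTES` and `key_fits_s3` are hypothetical and are not part of keevalbak, and the check assumes keys are encoded as UTF-8.

```python
S3_MAX_KEY_BYTES = 1024  # maximum key length stated above


def key_fits_s3(key: str) -> bool:
    """Return True if `key`, encoded as UTF-8, is within the key length limit."""
    return len(key.encode("utf-8")) <= S3_MAX_KEY_BYTES
```

A key made of 600 two-byte characters, for example, would be rejected even though it is well under 1024 characters long.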

Road Map

Partial restores
Compression and encryption options (possibly these could be done via enhancements to s3bucketmap.py, using S3 metadata where relevant).
Define an arbitrary "source" (currently the only option is a file system directory and all the files within it)
