Keevalbak: Python scripted on-line backups to Amazon S3
Keevalbak provides scriptable on-line backups to an
Amazon S3 bucket ("keevalbak" stands for
"Key/Value Backups"). It does this via the
s3bucketmap.py Python dictionary interface to S3 (so, for example,
it could easily be extended to use any other key/value storage service which has a
Python dictionary interface).
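As a rough illustration of what a "Python dictionary interface" makes possible, the sketch below writes a file through a generic mapping, with an in-memory class standing in for the S3 bucket map. The names `InMemoryMap` and `store_file` and the key layout are hypothetical, chosen for this example only; they are not part of keevalbak or s3bucketmap.py, whose actual interface may differ.

```python
from collections.abc import MutableMapping


class InMemoryMap(MutableMapping):
    """Hypothetical stand-in for a key/value store with a dict interface."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def store_file(backup_map, key, content):
    """Write content under key; works with any mapping, not only S3."""
    backup_map[key] = content


# The key layout below is an assumption made for this example.
store = InMemoryMap()
store_file(store, "2010-01-01T00-00-00/docs/a.txt", b"hello")
```

Because `store_file` relies only on item assignment, any storage service wrapped in a class like this could in principle be swapped in without changing the backup logic.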
Download
Keevalbak can be downloaded from
the keevalbak git repository at GitHub.
Sample code
For some sample code, see
BackupExample.py.
Design
Keevalbak is designed to be simple and reasonably robust.
A backup group is a set of backups consisting
of a full backup and all incremental backups after that until the next full backup.
- Each backup is represented by a date-time string to the nearest second.
- Incremental backups are done by recording SHA1 hashes of file contents. When
a file has the same content as one already written within the same backup group,
the content is not written again; instead, the file is recorded with a pointer to
where that content was written earlier.
- The set of directories and files backed up within each backup is written
explicitly to a meta-data file.
- All backup records currently in a backup map are recorded explicitly in a global
meta-data file.
- Verification is done by a full restore and then a recursive comparison of
original and restored files on the source computer.
- Currently, if a backup is aborted, there will be no record of it in the meta-data
file, and any future incremental backups will only increment on top of previous
successful backups. (But an ability to build on partial backups is planned; this
will involve recording a backup before it starts but marking it as "incomplete".)
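The hash-based approach to incremental backups described in the list above can be sketched roughly as follows. This is a simplified illustration, not keevalbak's actual code: the function `backup_files`, the key layout, the date-time format, and the use of a plain `dict` in place of the S3 bucket map are all assumptions made for the example.

```python
import hashlib


def backup_files(store, backup_id, files, written):
    """Back up files (path -> bytes) into store, a dict-like map.

    written maps SHA1 hex digests to the key under which that content
    was first stored within the current backup group.
    Returns the listing for this backup: path -> (sha1, content key).
    """
    listing = {}
    for path, content in sorted(files.items()):
        digest = hashlib.sha1(content).hexdigest()
        if digest not in written:
            # New content: store it and remember where it lives.
            key = f"{backup_id}/content/{path}"
            store[key] = content
            written[digest] = key
        # Either way, the listing points at the stored content.
        listing[path] = (digest, written[digest])
    return listing


# A full backup followed by an incremental one in the same group.
store, written = {}, {}
full = backup_files(store, "2010-01-01T00-00-00",
                    {"a.txt": b"one", "b.txt": b"two"}, written)
incr = backup_files(store, "2010-01-02T00-00-00",
                    {"a.txt": b"one", "b.txt": b"changed"}, written)
```

In this sketch the second backup stores only the changed `b.txt`; the entry for the unchanged `a.txt` points back to the content written by the full backup.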
Version
- Version 1.1a: added and tested handling of Unicode file names
- Version 1.2: added multi-threaded S3 operations
- Version 2.0: separate recording of which files are in a backup,
and which file contents have been written. This speeds up incremental backups
in the case where the directory being backed up has many files and there are
many incremental backups between each full backup, because it avoids the need
to retrieve a full file listing for each previous incremental backup.
Version 2 upgrade
The following only applies if you have already installed and used version 1
of keevalbak to do backups.
Version 2 of keevalbak is not compatible with version 1.x.
This means that:
- version 2 cannot be used to do an incremental backup
following on from a previous version 1.x backup
- version 2 cannot be used to restore a version 1 backup
Therefore, if you have been using version 1 of keevalbak,
the procedure for upgrading to version 2 is:
- Upgrade to version 2 (e.g. by updating from
the master branch on GitHub).
- Perform a full backup of each directory that you back up.
- If after doing this, you need to restore an earlier version 1 backup, then
download a separate copy of version 1 (from
the v1 branch on GitHub)
and use that to do the restore.
My reasons for not maintaining backwards compatibility between versions 1 and 2 are
as follows:
- Maintaining backwards compatibility would add extra complexity to new versions of
the software.
- Most backups are of limited value once newer backups of the same content have been made.
If you really need to keep a history of something, you should use source control (or some
application which takes responsibility for maintaining its own data history across upgrades),
and then back up your repository.
- A separate installation of keevalbak version 1 can be used to restore existing
backups, so there is no risk of losing data.
Final note: the basic file layout has not changed substantially, and although
versions 1 and 2 are incompatible for incremental backups and for restores,
the "prune" functionality will work whether previous backups are version 1 or version 2.
Features
- Full backup
- Incremental backup (including an incremental backup that builds on a previous full
or incremental backup that was not completed)
- Full restore
- Handles Unicode file names
- Optional full verification (i.e. via full restore and directory comparison)
- Incremental verification, i.e. trusting records of hash values of file contents
read previously (at least once) from the backup
- Pruning of previous backup groups
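Full verification, as listed above, amounts to restoring into a separate directory and comparing it recursively with the original. The following is a minimal sketch of such a comparison using only the standard library; the function names are hypothetical and keevalbak's own comparison may work differently.

```python
import filecmp
import os
import tempfile


def relative_listing(root):
    """Return the sets of relative directory and file paths under root."""
    dirs, files = set(), set()
    for current, subdirs, names in os.walk(root):
        rel = os.path.relpath(current, root)
        for d in subdirs:
            dirs.add(os.path.normpath(os.path.join(rel, d)))
        for n in names:
            files.add(os.path.normpath(os.path.join(rel, n)))
    return dirs, files


def directories_match(original, restored):
    """True if both trees have the same layout and byte-identical files."""
    orig_dirs, orig_files = relative_listing(original)
    rest_dirs, rest_files = relative_listing(restored)
    if orig_dirs != rest_dirs or orig_files != rest_files:
        return False
    return all(
        filecmp.cmp(os.path.join(original, f),
                    os.path.join(restored, f), shallow=False)
        for f in orig_files
    )


# Small demonstration with two temporary directory trees.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for root in (a, b):
        os.mkdir(os.path.join(root, "sub"))
        with open(os.path.join(root, "sub", "f.txt"), "wb") as fh:
            fh.write(b"data")
    same = directories_match(a, b)
    # Change one restored file so that the trees no longer match.
    with open(os.path.join(b, "sub", "f.txt"), "wb") as fh:
        fh.write(b"other data")
    differs = not directories_match(a, b)
```

Passing `shallow=False` makes `filecmp.cmp` compare file contents rather than only size and modification time, which is the stricter check one would want for verifying a restore.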
Limitations
- Has been tested only on Windows Vista.
- Has been tested only by using the verification option (i.e. there are no unit tests).
- Works within limitations of S3, i.e. maximum key length (1024 bytes) and maximum value size (5 gigabytes).
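As a small illustration of the S3 limits mentioned above, a pre-flight check along these lines could flag files that cannot be stored. The function name is hypothetical, and reading "5 gigabytes" as 5 GiB is an assumption; nothing here is taken from keevalbak itself.

```python
MAX_KEY_BYTES = 1024            # maximum key length stated above
MAX_VALUE_BYTES = 5 * 1024**3   # "5 gigabytes", read here as 5 GiB (assumption)


def fits_s3_limits(key, size_in_bytes):
    """Return True if the key and value size are within the limits above."""
    # Key length is measured in bytes of the UTF-8 encoding, so
    # non-ASCII file names use up the limit faster than ASCII ones.
    return (len(key.encode("utf-8")) <= MAX_KEY_BYTES
            and size_in_bytes <= MAX_VALUE_BYTES)
```

Note that a long path containing non-ASCII characters can exceed the key limit even when it has fewer than 1024 characters.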
Road Map
- Partial restores
- Compression and encryption options (possibly these could be done via
enhancements to s3bucketmap.py, using S3 metadata where relevant).
- Support for defining an arbitrary "source" (currently the only option is
a file system directory and all the files within it).