My Strategy for Personal Backups

2013-11-22

A sane backup strategy is guided by the data recovery needs and the appropriate threat model. Executing it must be effortless, or else it won’t be followed through.

As with every technical system, the backup system must be validated by testing – a backup is worthless unless the lost data can be restored from it. Unfortunately, this is easier said than done. To minimize the residual risk, good rules of thumb are to be conservative, to follow a simple strategy that is well understood, and to use software that is in wide use and has a proven track record.

This post is an attempt to reflect on my strategy for personal backups – to find bugs and opportunities for improvement. It is tailored to backing up an Apple laptop with an SSD running OS X 10.9, which sees daily use at home, at work and on the commute in between.

Recovery Needs

A specification of the recovery needs should include the acceptable data loss, the manageable service interruption and the desired restore capabilities:

  • I accept losing an hour’s worth of work under likely threats, and up to a day for a bad scenario where some of the backups fail as well. Losing more than a week should be close to impossible.

  • Getting the computer to a workable state should only take a couple of minutes, even in the case of total primary disk failure.

  • The backup system should make it easy to restore everything from individual files up to a full bare-metal restore.

Threat Model

The threat model should specify logical and physical threats to the data (including the backups!) and how to mitigate them. A sophisticated threat model also takes into account the likelihood of each threat and its associated cost. The risk, defined as the probability of occurrence times the cost of failure, is a good measure for ranking threats, but it is difficult to estimate accurately for unlikely threats with heavy costs.

  1. Loss of individual files: User error (e.g. a rogue delete command) is a likely threat that is mitigated by a backup that is as current as possible.

  2. Data corruption: Bugs in applications or the storage stack, as well as hardware glitches, can lead to data corruption that might not be noticed immediately. A backup history is necessary to avoid propagating the failure from the primary disk to the backup.

  3. Total loss of the primary copy: This includes failure of the SSD controller, damage to the laptop or having it stolen. I use full disk encryption to protect against data theft, and have insurance against the financial cost of buying a new laptop. A bootable backup significantly reduces down-time in case of disk failure and provides flexibility to schedule the disk replacement. A bare-metal restore capability avoids tedious re-installation of the OS and software.

  4. Partial corruption or loss of backup data: This threat is mitigated by multiple independent copies of the primary data. Truly independent copies reduce the probability of an unrecoverable failure, whereas incremental backups or de-duplication efforts increase it. Unfortunately, a partial corruption of the backup is often not noticed unless a restore is attempted. Therefore, adding redundancy in the form of checksums and error correcting codes makes it possible to detect and correct corrupt backups automatically.

  5. Corruption of backup meta-data: Backup programs store the data in a specific structure. Corruption of the structure can render the whole backup unusable, so a simple storage format with minimal meta-data carries less risk than a complex format.

  6. Total loss of a backup: This again includes failure of the disk controller, damage or theft. Physical damage is mitigated by storing multiple backups in different locations (e.g. at home and at work), while full disk encryption of the backup protects against data theft.

  7. Data format obsolescence: The progression from backup to archive is gradual. Backups should be readable without access to the backup software, and be portable across different platforms.

Backup Software

I evaluated several free and non-free backup programs according to the threat model presented above (T1 to T7), before settling on my current implementation.

GNU tar

tar is one of the most mature backup programs available. The file format is simple, well documented and the program exists on all major platforms (T5, T7). GNU tar provides byte-wise verification of the backup using -W (T4). However, I am not sure if all the idiosyncrasies of OS X and the HFS+ filesystem can be represented faithfully in the archive, i.e. if a bare-metal restore would accurately reproduce the primary copy (T3). Browsing and restoring individual files from a .tar archive is not particularly convenient, and incremental backups with many levels become unwieldy (T1).
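
For illustration, a minimal invocation could look like the following – the paths are placeholders, and on OS X GNU tar typically has to be installed separately (e.g. as gtar via Homebrew), since the system tar is bsdtar:

# full backup of the home directory, verified byte-wise after writing (--verify / -W)
sudo gtar --create --verify --file /Volumes/History/home-full.tar /Users/me

# incremental backup relative to the snapshot file home.snar
sudo gtar --create --listed-incremental=/Volumes/History/home.snar --file /Volumes/History/home-incr.tar /Users/me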

bup

bup stores backups in Git packfiles, so backup histories are space efficient (T2), but pruning the history is not possible because backup snapshots become intermingled over time. Storing multiple independent copies is not supported directly, and the storage format is quite complex, i.e. a corruption of the repository potentially affects the complete history (T4, T5). The packfile format is well documented and can also be managed using standard Git repository tools (T7).
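
For reference, the basic bup workflow is only a few commands; a sketch, with the source path and snapshot name made up for illustration:

bup init                      # create the default repository in ~/.bup
bup index /Users/me           # scan the file tree and record metadata in the index
bup save -n laptop /Users/me  # store a snapshot under the name "laptop"
bup ls laptop                 # list the snapshots stored under that name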

Time Machine

Time Machine is super convenient for creating an up-to-date backup history, as well as for browsing and restoring individual files (T1, T2). It is also the preferred way to transfer user data from one machine to the next when cloning the disk is not recommended due to differences in hardware etc. However, corruption of the archive is not unheard of (T5), and the archive is not bootable (T3).

SuperDuper! and Carbon Copy Cloner

SuperDuper! and Carbon Copy Cloner are two (modestly priced) commercial backup programs for OS X. Both support fast incremental mirroring of the internal disk to another partition or disk image. The backup partition is bootable (given the right set-up, see below), and a bare-metal restore can be done using either the backup program or Disk Utility (T3). Restoring individual files from a .dmg is more convenient than from a .tar file (T1), and the disk image format is generic, although not as portable as a .tar archive (T7).
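
When the backup target is a disk image rather than a partition, browsing it amounts to mounting the image; a sketch with made-up image and volume names:

hdiutil attach /Volumes/History/mirror.dmg   # mount the image, then browse or copy files as usual
hdiutil detach /Volumes/Mirror               # unmount when done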

par2

par2 creates redundancy in the form of Reed-Solomon error correcting codes, which makes it possible to detect and correct data corruption (up to the redundancy stored in the .par2 recovery files). The software is apparently not actively developed, but a stable version is available via Homebrew, for example.
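
The command-line interface is straightforward; for example, to protect an archive with 10% redundancy and to check (and, if necessary, repair) it later – the file name is a placeholder:

par2 create -r10 backup.tar    # writes backup.tar.par2 plus recovery volumes alongside the archive
par2 verify backup.tar.par2    # checks the archive against the recovery data
par2 repair backup.tar.par2    # reconstructs damaged blocks, if possible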

Implementation

The implementation of my backup strategy builds on two external 4 TB hard-disks and a combination of Time Machine, Carbon Copy Cloner (CCC), GNU tar and par2.

One disk is at home and the other at the office (for physical separation), and both are set up such that the backup process starts when the USB cable is plugged in. Both also have an identical partition layout: one partition has the same size as the laptop SSD and is used as a bootable mirror, while the other partition spans the rest of the disk and contains incremental and full backup histories.
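
For the record, such a layout can be created in one go with diskutil; a sketch, assuming the external disk shows up as /dev/disk2 and the SSD holds 500 GB (adapt both to the actual hardware, and note that the command erases the disk):

# two Journaled HFS+ partitions: a 500 GB bootable mirror, and a history partition on the remaining space
diskutil partitionDisk /dev/disk2 2 GPT JHFS+ Mirror 500G JHFS+ History R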

Bootable Mirror

Enabling full disk encryption (aka FileVault 2) while keeping the mirror partition bootable proved to be more of a challenge than expected.

The instructions in the CCC documentation don't work on my machine. After a lot of trial and error, I found a laborious procedure which resulted in a bootable encrypted mirror partition, but the success was only temporary: after a number of backups, bootability was lost. Because I could not reliably boot from it, I gave up, de-activated FileVault on the mirror partition, and now have to rely on physical security for the backups.

I use the CCC scheduler to start the backup of the mirror partition automatically whenever the USB cable is plugged in.

Incremental Backup History

For both hard-disks, I specify the history partition as a Time Machine volume. The external disk stays plugged in during the office day, so hourly backups happen automagically. If the history partition gets close to full, I delete old Time Machine snapshots.
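
Both steps can also be done from the command line with tmutil; a sketch with made-up volume and snapshot paths:

sudo tmutil setdestination /Volumes/History   # use the history partition as the Time Machine volume
tmutil listbackups                            # show the snapshots on the current destination
sudo tmutil delete /Volumes/History/Backups.backupdb/laptop/2013-01-15-103000   # remove an old snapshot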

Full Backup History

I make a full backup every week using GNU tar, and protect the archive against bit rot using par2. A barebones Python script called tarpar.py automates the steps for creating the archive, adding redundancy and verifying its integrity. The script is started (as root) using

caffeinate -i tarpar.py </path/to/destination> 

where caffeinate prevents the laptop from falling asleep during the backup process.
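
For the curious, the manual equivalent of what tarpar.py automates boils down to three commands along these lines (archive name and source path are placeholders):

gtar --create --file /Volumes/History/full-2013-11-22.tar /Users/me   # write the full archive
par2 create -r10 /Volumes/History/full-2013-11-22.tar                 # add error correcting redundancy
par2 verify /Volumes/History/full-2013-11-22.tar.par2                 # verify that archive and recovery data are consistent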

If the history partition gets close to full, I only keep one full backup per month and one per year.

Discussion

There are issues which I have thought about but have not yet addressed in my current implementation:

  • I don’t have any data for threat likelihoods beyond personal and anecdotal evidence, and qualitative information from backup manuals and books. Obviously, priority should be given to preventing threats that have a substantial risk. Rare threats with minimal associated cost can be ignored.

  • I don’t routinely check the bootability of the mirror partition; doing so would require an extra manual effort which I am not willing to make every day.