How to Sync Video and Separately Recorded Audio, Using Only Open-Source Software

A Common Scenario: Home-Recording a Music Performance Video

So, you've practiced a piece of music, either an original or a cover, and now you want to record a video of yourself playing it and put that video on YouTube.

At this point you discover a problem that many others before you have discovered:

Your home video camera records video of good enough quality for uploading to YouTube. (Here, a "video camera" might be an actual camcorder, a digital photo camera, or just your mobile phone.)
Your home video camera has a microphone, but its sound quality is not good enough. Also, it records in mono only (no stereo).
You can easily record high-quality audio on your home computer: full-quality stereo sound, recorded via a line-in or microphone input on the computer's sound card.
But, and here's the problem we need to solve, there is no easy, automatic way to sync the two recordings.

Some Possible Solutions

Get a camera that accepts stereo line-in or external microphone input

Cheaper consumer video recorders typically do not have external stereo line-in audio inputs, so this is a more expensive solution.

A variant on this solution is to combine a recent-model iPhone with a USB audio adapter such as the "Mikey" Digital. (I have not used this device, but the FAQ assures us that the line-in is a digital stereo input.)

Alignment Software

PluralEyes from Red Giant claims to solve the alignment problem, and you can see from that page how much it costs: US$199 (at the time of writing). It is also currently available only for Mac, so if you don't have a Mac, you'll have to buy one of those as well, or wait for the Windows version. (And if you don't have either Mac or Windows, then presumably you're out of luck.)

The Clapperboard

The Clapperboard is the traditional approach to this problem: a device that makes a sharp noise at the moment it reaches a position that can be identified visually.

In a musical context, the clapperboard can be replaced by playing a particular note on your musical instrument, preferably with a timbre that has a rapid attack. For this to work properly, the video camera has to have a clear view of whatever finger or hand action is producing the note.

Of course once you've recorded your video and audio with the clapperboard clapping at the start of the video, you then have to find some video editing software that makes it easy to identify the moment of the "clap" within both the video and audio tracks, and then carry out the necessary realignment.

An Alternative Solution: Align the Audio Tracks

The best solution I have found is a variant on the clapperboard method, but it requires only an audio "clap", and it works independently of any particular video editing software.

What it does require is that the video camera has its own microphone which is recording the same sound as the separate higher quality audio recording that you are making. And it requires access to some audio editing software.

The alignment problem can be stated as a question: by how much time does the separate audio recording have to be moved forwards or backwards to align with the video?

To solve this problem, you only need audio software. The most suitable software for this purpose is Audacity, which is freely available for Mac, Windows and Linux.

Detailed Instructions

To successfully align your video and audio, proceed as follows:

Record the video and separate audio, with a suitable reference sound at the beginning of the recording (this will be trimmed off once you are finished, so remember to leave a suitable gap between the reference sound and whatever it is you are actually recording).
If they are not already there, copy the video and audio recordings to your computer. (For sound, it is recommended to save as uncompressed WAV, only compressing at the last moment when the final video is to be created. For video, from a consumer video recorder, it is likely to be pre-compressed anyway.)
Start Audacity.
In Audacity, open the video file. Being quite clever about media formats, Audacity will automatically extract and load the audio track of your video. If it doesn't, then you'll have to find some other way to get the audio track out of the video.
To add the separately recorded audio track to the same Audacity window, choose the menu item File => Import => Audio ...
Note that individual tracks can be muted and unmuted. Also note that, due to being recorded differently, the correspondence between the two tracks may not be completely obvious visually. For example, the camera audio may contain ambient room noise that is not readily distinguishable from music just from looking at the wave forms.
For each track, separately, determine the precise millisecond when the reference note occurs. Do this using a combination of play, stop, zoom in, zoom out and scrolling back and forth in time.
Write down the time of the reference sound in milliseconds for each track.
Using your calculator, subtract one number from the other to determine the size of the gap in milliseconds.
Now, shift the separately recorded audio track in the required direction by the required number of milliseconds. If the reference sound occurs later in the camera track than in the separate track, shift forward; if it occurs earlier, shift backward:
- Shift forward, by inserting silence. In Audacity this is done via the Generate => Silence... menu option, making sure to select the hh:mm:ss + milliseconds input option (confusingly, the default "seconds" option looks like it might work, but that's a comma, not a decimal point, so choosing 3,600 will add an hour of silence, not 3.6 seconds).
- Shift backward, by first selecting from time zero to the required number of milliseconds (you can, for example, type the actual numbers into the Selection Start and End values at the bottom of the screen, selecting format hh:mm:ss + milliseconds so that you can specify milliseconds), and then hitting the Delete key.
Once shifted, you can test for alignment by playing with both tracks unmuted. If it just sounds like one audio track, then you're probably good to go.
Now, delete the extracted video audio track from the Audacity window, and then export the aligned audio track to a new WAV file.
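
The subtraction and shift steps above can also be sketched in a few lines of Python, using only the standard library's wave module. This is a minimal sketch, not part of the Audacity workflow: the function names are made up for illustration, and it assumes the separate recording is an uncompressed 16-bit (or wider) PCM WAV file, so that zero bytes mean silence.

```python
import wave


def alignment_offset_ms(camera_ref_ms, separate_ref_ms):
    """Return how far to shift the separate track, in milliseconds.

    Positive: insert that much silence at the start of the separate track.
    Negative: delete that much from the start of the separate track.
    """
    return camera_ref_ms - separate_ref_ms


def shift_wav(src_path, dst_path, offset_ms):
    """Write a copy of src_path shifted by offset_ms (sign as above)."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    frame_size = params.nchannels * params.sampwidth  # bytes per frame
    n_frames = round(abs(offset_ms) * params.framerate / 1000)

    if offset_ms > 0:
        # Assumption: zero bytes are silence (true for signed 16-bit PCM,
        # not for unsigned 8-bit WAV files).
        frames = b"\x00" * (n_frames * frame_size) + frames
    elif offset_ms < 0:
        frames = frames[n_frames * frame_size:]

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)  # the frame count in the header is fixed up on close
```

For example, if the reference sound is at 5200 ms in the camera track and at 3600 ms in the separate track, alignment_offset_ms(5200, 3600) gives 1600, meaning 1.6 seconds of silence should be inserted at the start of the separate track.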

At this point, you still have a camera video with its own audio track, and you have a separately recorded audio track which is now perfectly aligned with the video audio track.

Now you need to use some video editing software to replace the video's original audio track with the separately recorded track.

The video editor I have used is Avidemux. The following steps will create an output video with replacement audio:

Open Avidemux
In Avidemux, choose Open to open the video file.
From the Avidemux menu, choose Audio => Main Track
In the dialog, change the Audio Source option from Video to External WAV
Click on the Browse... button to select the aligned audio track.
Click OK
This gives you a video with the audio track replaced by the separately recorded and correctly aligned audio track.
WARNING: Avidemux may not actually correctly align video and audio when playing the video. (At least it didn't for me. However this doesn't prevent you using Avidemux for simple editing tasks.)
Use the selection options to trim the beginning and end of the video (in particular you won't want to retain the initial reference sound).
Select desired video encoding, audio encoding and container format.
Click Save and choose a file name to save to.
Wait a while ...
When it's done, press OK, and you're done!
Upload to YouTube, or whatever.
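
If Avidemux is not available, the audio replacement step could probably also be done with the open-source ffmpeg command-line tool. I have not used this as part of the workflow described here, so treat the following as a hedged sketch only: it assumes ffmpeg is installed and on your PATH, the function names and file names are hypothetical, and it does not trim the reference sound from the start of the video.

```python
import subprocess


def build_replace_audio_command(video_path, audio_path, output_path):
    """Build an ffmpeg command that swaps in the aligned audio track."""
    return [
        "ffmpeg",
        "-i", video_path,   # input 0: camera video (its own audio is ignored)
        "-i", audio_path,   # input 1: aligned, separately recorded audio
        "-map", "0:v:0",    # take the video stream from input 0
        "-map", "1:a:0",    # take the audio stream from input 1
        "-c:v", "copy",     # keep the video as-is, without re-encoding
        "-c:a", "aac",      # compress the audio only at this final stage
        "-shortest",        # stop when the shorter of the two streams ends
        output_path,
    ]


def replace_audio(video_path, audio_path, output_path):
    """Run the command; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(
        build_replace_audio_command(video_path, audio_path, output_path),
        check=True,
    )
```

A call such as replace_audio("camera.mp4", "aligned.wav", "final.mp4") would then write the combined file, assuming the chosen output container accepts AAC audio.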

Drift

One issue I have not dealt with here is drift, i.e. the audio and video might be in sync at one point in time, but then drift apart. Drift is caused by different recording devices not agreeing on how fast time is, or, to put it another way, disagreeing on how long a second is.

The simplest form of drift is where the clocks of the two devices run at rates that differ by a constant factor. In this case, it should be possible to fix the problem with the help of two reference sounds: one at the start of the recording, and one at the end. You'll need to note the alleged time of both reference sounds in both recordings, and then do a bit of algebra to figure out the required multiplication and addition to get one track in sync with the other from start to finish.
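
The "bit of algebra" can be sketched as follows, assuming the drift really is linear, i.e. that camera time equals a scale factor times separate-track time, plus an offset. The function and parameter names are made up for illustration, and all times are in milliseconds.

```python
def drift_correction(camera_start_ms, camera_end_ms,
                     separate_start_ms, separate_end_ms):
    """Return (scale, offset_ms) such that
    camera_time = scale * separate_time + offset_ms
    holds at both reference sounds."""
    scale = (camera_end_ms - camera_start_ms) / (separate_end_ms - separate_start_ms)
    offset_ms = camera_start_ms - scale * separate_start_ms
    return scale, offset_ms


def to_camera_time(separate_ms, scale, offset_ms):
    """Map a time in the separate track onto the camera's timeline."""
    return scale * separate_ms + offset_ms
```

Here the separate track would first be stretched in time by the scale factor, and then shifted by the offset in the same way as for a simple alignment. A scale of exactly 1 means there is no drift, and the offset is just the difference between the two start reference times.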

The PluralEyes software mentioned above does claim to solve this type of problem, and possibly more complicated desynchronization problems as well.

In practice, for the length of recordings I've made, and the equipment I have been using, I haven't had any problems with drift. This is one thing you can verify by the step above where you play the camera track and aligned separate audio track simultaneously. If there is any drift, this should become obvious as a change in the quality of the sound as the tracks play from beginning to end.

My Equipment, and an Example

The equipment I have used to record video and audio is:

A Canon IXUS 100 IS
A Dell Inspiron running Ubuntu 11.0 with an Asus Xonar-D1 sound card.

For an example of video that I have processed using the steps described above, see the following two videos on YouTube ...
