Understanding Audio Compression: MP3, WMA, Ogg, and More
By Article Admin
Published: 01/11/2005
Transmitting uncompressed audio or video is not the easiest task. After all, even an audio CD contains data that plays back at roughly 1400kb/s, a fairly large chunk of data, and more than many compressed DivX movies. The ability to stream that kind of data is one reason why there has been an increase in the bandwidth of wireless networks within homes, and why things like gigabit LAN have become a standard feature on many new motherboards. The joy of digital audio is that there are many different ways to decrease the amount of space required to store it, depending on how the signals are represented.
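To see where the roughly 1400kb/s figure comes from, here is a minimal sketch of the arithmetic. It assumes the standard audio CD parameters (44,100 samples per second, 16 bits per sample, two channels); the function name is just an illustration.

```python
def pcm_bits_per_second(sample_rate_hz, bit_depth, channels):
    """Raw (uncompressed) data rate of a PCM audio stream, in bits per second."""
    return sample_rate_hz * bit_depth * channels

# Assumed CD parameters: 44,100 samples/s, 16 bits per sample, stereo.
cd_bps = pcm_bits_per_second(44_100, 16, 2)

print(cd_bps / 1000)                 # data rate in kb/s
print(cd_bps * 60 / 8 / 1_000_000)   # megabytes needed for one minute of audio
```

That works out to a little over 1400kb/s, or about ten and a half megabytes for every minute of music.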
Most music is created in an analog form - sound waves. Depending on the initial recording medium, it might be captured to another analog format (tape, though not the crappy cassettes that you put in your car) or to a digital format. When the source is first captured, as much data as possible is usually retained to ensure there is at least one high quality version. It’s easy enough to translate the initial recording to a lower grade one later; you can’t, however, increase the quality.
An analog recording has the potential, at least in principle, to copy the original waveform exactly. This ignores the noise that creeps into the recording, and other factors that can affect quality. There are an infinite number of points or levels that can be used to determine pitch (frequency) and loudness (amplitude) when you are dealing with analog; it’s the equivalent of a curvy line, or a string. If your equipment is up to the challenge, you can make any kind of continuous waveform.
A digital copy, however, is not a “curvy line”. Instead, it’s similar to a bar graph, or “connect the dots”, depending on how you choose to display the end result. There is a series of discrete data points, with only certain values available on both the horizontal (time) and vertical (level) scales. The scale along the bottom follows regular intervals, determined by the sampling rate. That sampling rate is measured in samples per second, or Hertz. (One kHz is 1000 samples per second.) According to the “Nyquist Theorem,” the sampling rate has to be at least twice the highest frequency in the analog signal you are trying to represent in order to have enough data to rebuild it accurately. Since human hearing runs from roughly 20Hz up to somewhere around 20,000 to 22,000Hz, you’d need a sampling rate of about 44,000Hz to have a digital representation of it; CDs use 44,100Hz. That’s the minimum theoretical rate, which is one reason why you see 48,000Hz sampling rates on things such as DVDs, or 96kHz on DVD-Audio. The extra precision is useful for making up for rounding errors inherent in the process of moving a signal to a digital format.
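The Nyquist rule is easy to play with in a few lines. This is only a sketch: the function names are made up for illustration, and the second function uses the standard frequency-folding formula to show what a tone turns into when the sampling rate is too low for it (an effect usually called aliasing).

```python
def minimum_sample_rate(highest_frequency_hz):
    """Nyquist: sample at least twice as fast as the highest frequency."""
    return 2 * highest_frequency_hz

def apparent_frequency(tone_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling.

    Tones at or below half the sampling rate come through unchanged;
    higher tones "fold" back down to a lower frequency.
    """
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)

print(minimum_sample_rate(22_000))         # 44000
print(apparent_frequency(10_000, 44_000))  # 10000, captured correctly
print(apparent_frequency(30_000, 44_000))  # 14000, folded down to the wrong pitch
```

A 30,000Hz tone sampled at 44,000Hz does not simply vanish; it shows up as a 14,000Hz tone that was never in the original, which is why the sampling rate has to clear twice the highest frequency you care about.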
Digital also factors in on the vertical scale of that “graph” I mentioned earlier. When you record to an analog medium, you store data as a voltage signal over time. In transferring it to digital, each voltage has to be rounded to one of a limited number of possible values - this is called “quantization” of the signal. The bit depth determines how many values are available to round to. With one bit, you can have either on or off, and you aren’t exactly going to enjoy much fidelity with that. With two bits, you can have off, 1, 2 or 3 as values. That’s very coarse, but now you at least have levels to round to. Adding more bits gives you more levels to play with, and more ability to end up with a digital representation close to that of the original recording. Compact Discs use a bit depth of 16, allowing for 2^16 possible levels. That works out to 65,536 values, which in many cases is sufficient to follow an analog waveform well.
Some newer formats such as DVD-Audio are moving towards recording at 24 bits and 96kHz, or in some cases even more. (Super Audio CD, or SACD for short, goes a different route, using a 1-bit stream at a much higher sampling rate.) Why the extra headroom when your ear can’t physically tell the difference? Any time you do something to the sound, whether you mix channels, add instruments, or change volume levels, you are introducing possible errors into the whole. With 16,777,216 possible values for a sound as opposed to about 65,000, taken twice as many times per second, any single error does far less damage. The other factor is called “dynamic range”. Each bit represents around 6 decibels (dB), a logarithmic unit used to describe “loudness”. 16 bits gives you 96dB of range to work with - that’s fine and dandy if every sound is a loud one.
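The relationship between bit depth, number of levels, and dynamic range can be sketched in a few lines. The function names here are illustrative, and the `quantize` helper assumes a signal scaled to run from 0.0 (off) to 1.0 (full scale), which is a simplification of how real converters work.

```python
import math

def levels(bit_depth):
    """Number of distinct values a sample can take at this bit depth."""
    return 2 ** bit_depth

def dynamic_range_db(bit_depth):
    """Ratio of the largest to the smallest step, in decibels (about 6dB per bit)."""
    return 20 * math.log10(levels(bit_depth))

def quantize(sample, bit_depth):
    """Round a sample between 0.0 and 1.0 to the nearest available level."""
    steps = levels(bit_depth) - 1
    return round(sample * steps) / steps

for bits in (2, 16, 24):
    print(bits, levels(bits), round(dynamic_range_db(bits)))

# The same input value, rounded at two different bit depths.
print(quantize(0.3, 2))    # lands on the nearest of only four levels
print(quantize(0.3, 16))   # lands very close to the original 0.3
```

At two bits the value 0.3 has to jump to the nearest third; at 16 bits the rounding error is too small to notice in a single sample, though it can still add up over repeated processing.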
That dynamic range covers everything from the quietest to the loudest. Now, what happens if you are recording a quiet sound, one of the small harmonics that join in with the main ones? Say, for example, a light cymbal ride, the scratching of a string on a violin, or the thwack of a thumb hitting the thick strings on a bass guitar? If those are only recorded at a loudness of 24-48dB, you have only 4-8 bits, or 16-256 levels, to record that sound with. Of course it’s not going to sound very good - it’s the equivalent of those MIDI files from games in the Pong or Commander Keen days. Moving to 24 bit, with a theoretical 144dB of dynamic range, is a much different story. The loudest sound can again be set to match the highest “loudness” value (144 as opposed to 96, though you can also leave 12dB of fudge room and still have 22 bits of recording depth). Your quietest sound, instead of registering at 24-48dB, can be pulled up much higher - 36-72dB, if everything follows to scale. That means 6 to 12 bits are used for recording the sound, with the associated 64 to 4096 levels of possible values, so those small background nuances are going to show up much closer to true form, and be less quantized in the digital format.
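The arithmetic for quiet sounds follows directly from the 6dB-per-bit rule of thumb. Here is a small sketch using that approximation; the constant and function names are only for illustration.

```python
DB_PER_BIT = 6  # rule-of-thumb value used in this article

def bits_used(loudness_db):
    """Whole bits actually exercised by a sound that peaks at this level."""
    return loudness_db // DB_PER_BIT

def levels_available(loudness_db):
    """Distinct values available to describe a sound peaking at this level."""
    return 2 ** bits_used(loudness_db)

for label, quiet_db, loud_db in (("16 bit", 24, 48), ("24 bit", 36, 72)):
    print(label,
          bits_used(quiet_db), "to", bits_used(loud_db), "bits,",
          levels_available(quiet_db), "to", levels_available(loud_db), "levels")
# 16 bit 4 to 8 bits, 16 to 256 levels
# 24 bit 6 to 12 bits, 64 to 4096 levels
```

Two extra bits at the quiet end quadruples the number of levels, and four extra bits at the upper end multiplies it by sixteen, which is where the improvement in those background details comes from.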