Taken from An Introduction to Steganography

By Duncan Stellars

Steganography in Audio

Because of the range of the human auditory system (HAS), data hiding in audio signals is especially challenging. The HAS perceives over a range of power greater than one billion to one and a range of frequencies greater than one thousand to one. The auditory system is also very sensitive to additive random noise: disturbances in a sound file can be detected at levels as low as one part in ten million (80dB below ambient level) [1]. However, while the HAS has a large dynamic range, it has a fairly small differential range, so loud sounds tend to drown out quiet sounds.

When performing data hiding on audio, one must exploit the weaknesses of the HAS, while at the same time being aware of the extreme sensitivity of the human auditory system.

8.1 Audio Environments

When working with transmitted audio signals, one should bear in mind two main considerations: first, the means of audio storage, or digital representation of the audio, and second, the transmission medium the signal might pass through.

8.1.1 Digital representation

Digital audio files generally have two primary characteristics:

  • Sample quantisation method: The most popular format for representing samples of high-quality digital audio is 16-bit linear quantisation, such as that used by WAV (Waveform Audio File Format) and AIFF (Audio Interchange File Format). This quantisation introduces some signal distortion.
  • Temporal sampling rate: The most popular temporal sampling rates for audio include 8kHz (kilohertz), 9.6kHz, 10kHz, 12kHz, 16kHz, 22.05kHz and 44.1kHz. The sampling rate puts an upper bound on the usable portion of the frequency range. Generally, usable data space increases at least linearly with increased sampling rate.
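
As a rough illustration of the two characteristics above, the following sketch computes the number of amplitude levels for a given bit depth and the upper frequency bound for each listed sampling rate. It assumes the standard Nyquist relation (the usable band extends to half the sampling rate), which the text does not state explicitly; the function names are hypothetical.

```python
# Sampling rates in Hz, as listed in the text above.
SAMPLING_RATES_HZ = [8000, 9600, 10000, 12000, 16000, 22050, 44100]


def quantisation_levels(bits):
    """Number of distinct amplitude levels for linear quantisation."""
    return 2 ** bits


def nyquist_limit_hz(sampling_rate_hz):
    """Assumed upper bound on representable frequency: half the sampling rate."""
    return sampling_rate_hz / 2


print("16-bit linear quantisation levels:", quantisation_levels(16))
for rate in SAMPLING_RATES_HZ:
    print(rate, "Hz sampling -> usable band up to", nyquist_limit_hz(rate), "Hz")
```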

Another digital representation that should be considered is the ISO MPEG-Audio format, a perceptual encoding standard. This format drastically changes the statistics of the signal by encoding only the parts the listener perceives, thus maintaining the sound, but changing the signal.

8.1.2 Transmission medium

The transmission medium, or transmission environment, of an audio signal refers to the environments the signal might go through on its way from encoder to decoder.

Bender identifies four possible transmission environments:

  • Digital end-to-end environment: If a sound file is copied directly from machine to machine, but never modified, then it will go through this environment. As a result, the sampling will be exactly the same between the encoder and decoder. Very few constraints are placed on data-hiding in this environment.
  • Increased/decreased resampling environment: In this environment, a signal is resampled to a higher or lower sampling rate, but remains digital throughout. Although the absolute magnitude and phase of most of the signal are preserved, the temporal characteristics of the signal are changed.
  • Analog transmission and resampling: This occurs when a signal is converted to an analog state, played on a relatively clean analog line, and resampled. Absolute signal magnitude, sample quantisation and temporal sampling rate are not preserved. In general, phase will be preserved.
  • "Over the air" environment: This occurs when the signal is "played into the air" and "resampled with a microphone". The signal will be subjected to possibly unknown nonlinear modifications causing phase changes, amplitude changes, drifting of different frequency components, echoes, etc.

The signal representation and transmission environment both need to be considered when choosing a data-hiding method.

8.2 Methods of Audio Data Hiding

We now need to consider some methods of audio data-hiding.

8.2.1 Low-bit encoding

Just as data can be stored in the least-significant bits of images, binary data can be stored in the least-significant bit of each sample of an audio file. Ideally, the channel capacity is 1kbps per kilohertz of sampling rate; for example, the channel capacity would be 44kbps in a sequence sampled at 44kHz. Unfortunately, this introduces audible noise. The primary disadvantage of this method, however, is its poor immunity to manipulation: factors such as channel noise and resampling can easily destroy the hidden signal.
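
A minimal sketch of low-bit encoding follows, assuming the host is a list of signed 16-bit integer samples from a single channel and that one payload bit is written per sample. The helper names (`embed_lsb`, `extract_lsb`, and so on) are hypothetical and chosen only for illustration.

```python
import math


def bytes_to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def bits_to_bytes(bits):
    """Pack a list of bits (most significant first) back into bytes."""
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[j:j + 8]))
        for j in range(0, len(bits), 8)
    )


def embed_lsb(samples, bits):
    """Return a copy of the samples with one payload bit in each sample's LSB."""
    if len(bits) > len(samples):
        raise ValueError("payload is longer than the host signal")
    stego = list(samples)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the lowest bit, then set it
    return stego


def extract_lsb(samples, n_bits):
    """Read the payload back from the lowest bit of the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]


def lsb_capacity_bps(sampling_rate_hz):
    """Ideal capacity: one bit per sample, for a single channel."""
    return sampling_rate_hz


# Illustrative host: a short 440 Hz tone quantised to 16-bit integers.
host = [int(12000 * math.sin(2 * math.pi * 440 * t / 44100)) for t in range(64)]
message = b"hi"
stego = embed_lsb(host, bytes_to_bits(message))
recovered = bits_to_bytes(extract_lsb(stego, 8 * len(message)))
print(recovered, lsb_capacity_bps(44100))
```

Each sample changes by at most one quantisation step, which is why the scheme is simple but fragile: any operation that rescales or resamples the signal rewrites the lowest bits.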

A particularly robust implementation of such a method is described by Bassia and Pitas in [8]. The result is a slight amplitude modification of each sample in a way that does not produce any perceptual difference. Their implementation offers high robustness to MPEG compression plus other forms of signal manipulation, such as filtering, resampling and requantization.

8.2.2 Phase coding

The phase coding method works by substituting the phase of an initial audio segment with a reference phase that represents the data. The procedure for phase coding is as follows:

  • The original sound sequence is broken into a series of N short segments.
  • A discrete Fourier transform (DFT) is applied to each segment to create a matrix of the phase and magnitude.
  • The phase difference between each adjacent segment is calculated.
  • For segment S0, the first segment, an artificial absolute phase p0 is created.
  • For all other segments, new phase frames are created.
  • The new phase and original magnitude are combined to get a new segment, Sn.
  • Finally, the new segments are concatenated to create the encoded output.

For the decoding process, the sequence must be synchronised before decoding begins. The length of the segment, the DFT points, and the data interval must be known at the receiver. The values of the underlying phase of the first segment are then detected as 0s or 1s, which together represent the coded binary string.
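
The procedure above can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: a phase of +π/2 is assumed to stand for a binary zero and -π/2 for a binary one, one bit is assumed per DFT bin starting at bin 1, the host is assumed to have energy in those bins, and the function names are hypothetical. A direct O(n²) DFT is used to keep the example self-contained.

```python
import cmath
import math
import random


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def phase_encode(signal, bits, seg_len):
    """Write bits into the phase of the first segment, keeping relative phases."""
    if len(bits) >= seg_len // 2:
        raise ValueError("too many bits for this segment length")
    n_seg = len(signal) // seg_len
    spectra = [dft(signal[i * seg_len:(i + 1) * seg_len]) for i in range(n_seg)]
    mags = [[abs(c) for c in s] for s in spectra]
    phases = [[cmath.phase(c) for c in s] for s in spectra]
    # Phase difference between each pair of adjacent segments.
    deltas = [[phases[n][k] - phases[n - 1][k] for k in range(seg_len)]
              for n in range(1, n_seg)]
    # Artificial absolute phase for the first segment.
    new_phases = [list(phases[0])]
    for i, bit in enumerate(bits):
        k = i + 1  # skip the DC bin
        p = -math.pi / 2 if bit else math.pi / 2
        new_phases[0][k] = p
        new_phases[0][seg_len - k] = -p  # keep the spectrum conjugate-symmetric
    # Later segments: previous new phase plus the original phase difference.
    for d in deltas:
        prev = new_phases[-1]
        new_phases.append([prev[k] + d[k] for k in range(seg_len)])
    out = []
    for m, p in zip(mags, new_phases):
        out.extend(idft([cmath.rect(a, ph) for a, ph in zip(m, p)]))
    out.extend(signal[n_seg * seg_len:])  # any leftover tail is left untouched
    return out


def phase_decode(stego, n_bits, seg_len):
    """Read the bits back from the sign of the first segment's phases."""
    spectrum = dft(stego[:seg_len])
    return [1 if cmath.phase(spectrum[k]) < 0 else 0 for k in range(1, n_bits + 1)]


# Illustrative noise-like host: 4 segments of 32 samples.
rng = random.Random(0)
host = [rng.uniform(-1.0, 1.0) for _ in range(4 * 32)]
payload = [1, 0, 1, 1, 0, 0, 1, 0]
stego = phase_encode(host, payload, seg_len=32)
recovered = phase_decode(stego, len(payload), seg_len=32)
print(recovered == payload)
```

Note that the receiver here is assumed to know the segment length and the number of bits, and to be aligned with the start of the first segment, in line with the synchronisation requirement described above.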

8.2.3 Spread spectrum

Most communication channels try to concentrate audio data in as narrow a region of the frequency spectrum as possible in order to conserve bandwidth and power. When using a spread spectrum technique, however, the encoded data is spread across as much of the frequency spectrum as possible.

One particular method discussed in [1], Direct Sequence Spread Spectrum (DSSS) encoding, spreads the signal by multiplying it by a certain maximal length pseudorandom sequence, known as a chip. The sampling rate of the host signal is used as the chip rate for coding. The calculation of the start and end quanta for phase locking purposes is taken care of by the discrete, sampled nature of the host signal. As a result, a higher chip rate, and therefore a higher associated data rate, is possible.

However, unlike phase coding, DSSS does introduce additive random noise to the sound.
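
A minimal sketch of the DSSS idea follows. It makes several simplifying assumptions that are not from the text: a seeded pseudorandom ±1 sequence stands in for a true maximal length sequence, each bit is mapped to +1 or -1 before spreading, and the embedding gain is exaggerated so that the example decodes reliably. All function names are hypothetical. One chip is applied per host sample, matching the statement that the host sampling rate is used as the chip rate.

```python
import math
import random


def make_chip_sequence(length, seed):
    """Pseudorandom +1/-1 chips (a stand-in for a maximal length sequence)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(length)]


def dsss_embed(host, bits, chips, gain):
    """Add each bit, spread over len(chips) samples, to the host as low-level noise."""
    n = len(chips)
    if len(bits) * n > len(host):
        raise ValueError("host is too short for this payload")
    stego = list(host)
    for i, bit in enumerate(bits):
        symbol = 1 if bit else -1
        for j, chip in enumerate(chips):
            stego[i * n + j] += gain * symbol * chip
    return stego


def dsss_extract(stego, n_bits, chips):
    """Despread by correlating each block with the same chip sequence."""
    n = len(chips)
    bits = []
    for i in range(n_bits):
        corr = sum(stego[i * n + j] * chips[j] for j in range(n))
        bits.append(1 if corr > 0 else 0)
    return bits


# Illustrative host: a 440 Hz tone at 44.1kHz, long enough for 8 bits.
chips = make_chip_sequence(2048, seed=1)
payload = [1, 0, 1, 1, 0, 0, 1, 0]
host = [math.sin(2 * math.pi * 440 * t / 44100)
        for t in range(len(payload) * len(chips))]
stego = dsss_embed(host, payload, chips, gain=0.2)
recovered = dsss_extract(stego, len(payload), chips)
print(recovered == payload)
```

The receiver needs the same chip sequence and block alignment as the encoder. The added term `gain * symbol * chip` is exactly the additive noise mentioned above; a lower gain makes it less audible but requires a longer chip sequence per bit to decode reliably.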

8.2.4 Echo data hiding

Echo data hiding embeds data into a host signal by introducing an echo. The data are hidden by varying three parameters of the echo: initial amplitude, decay rate, and offset, or delay. As the offset between the original and the echo decreases, the two signals blend. At a certain point, the human ear cannot distinguish between the two signals, and the echo is merely heard as added resonance. This point depends on factors such as the quality of the original recording, the type of sound, and the listener.

By using two different delay times, both below the human ear's perceptual level, we can encode a binary one or zero. The decay rate and initial amplitude can also be adjusted below the audible threshold of the ear, to ensure that the information is not perceivable. To encode more than one bit, the original signal is divided into smaller portions, each of which can be echoed to encode the desired bit. The final encoded signal is then just the recombination of all independently encoded signal portions.

As a binary one is represented by a certain delay y, and a binary zero is represented by a certain delay x, detection of the embedded signal then just involves the detection of spacing between the echoes. A process for doing this is described in Gruhl et al.'s work [13].
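
The encoding and detection steps can be sketched as follows. This is a simplified illustration, not the process described in [13]: the decoder here compares a plain lagged correlation of each block at the two candidate delays, which is assumed to be adequate only for a noise-like host. The delays, block length and echo amplitude are hypothetical values chosen so that the example decodes reliably, and only a single echo amplitude is modelled (no decay rate).

```python
import random


def echo_embed(host, bits, block_len, delay_zero, delay_one, amplitude):
    """Add an echo to each block, with a delay chosen by the bit it carries."""
    if len(bits) * block_len > len(host):
        raise ValueError("host is too short for this payload")
    stego = list(host)
    for i, bit in enumerate(bits):
        delay = delay_one if bit else delay_zero
        for t in range(i * block_len, (i + 1) * block_len):
            if t >= delay:
                stego[t] = host[t] + amplitude * host[t - delay]
    return stego


def lag_correlation(signal, start, stop, lag):
    """Correlation of signal[start:stop] with itself shifted back by `lag` samples."""
    return sum(signal[t] * signal[t - lag] for t in range(max(start, lag), stop))


def echo_extract(stego, n_bits, block_len, delay_zero, delay_one):
    """Decide each bit by which candidate delay shows the stronger correlation."""
    bits = []
    for i in range(n_bits):
        start, stop = i * block_len, (i + 1) * block_len
        c_zero = lag_correlation(stego, start, stop, delay_zero)
        c_one = lag_correlation(stego, start, stop, delay_one)
        bits.append(1 if c_one > c_zero else 0)
    return bits


# Illustrative noise-like host; delays of 40 and 60 samples are assumptions.
rng = random.Random(0)
payload = [1, 0, 1, 1, 0, 0, 1, 0]
block_len = 2048
host = [rng.uniform(-1.0, 1.0) for _ in range(len(payload) * block_len)]
stego = echo_embed(host, payload, block_len, delay_zero=40, delay_one=60,
                   amplitude=0.4)
recovered = echo_extract(stego, len(payload), block_len, delay_zero=40, delay_one=60)
print(recovered == payload)
```

Real audio is strongly correlated with itself at short lags, so a plain correlation like this one would likely be unreliable on music or speech; it is used here only to make the "detect the spacing between echoes" step concrete.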

Echo hiding was found to work exceptionally well on sound files where there is no additional degradation, such as from line noise or lossy encoding, and where there are no gaps of silence. Work to eliminate these drawbacks is ongoing.