Phase Coding in audio

A method to hide the data by altering the wave forms.

This article covers audio steganography using the Phase Coding method. A technique that hides data by manipulating audio wave phases rather than amplitude bits.

Why phase coding over LSB?

Before diving into phase code, let's understand why it's superior to the commonly used LSB method:

LSB method: It modifies the least significant bits of audio samples which can introduce audible noise, especially in quiet passages.
Phase coding: It alters the timing/ phase of frequency components while preserving amplitude, making changes virtually imperceptible to human ears.
Robustness: Phase changes are less susceptible to compression and filtering compared to LSB modifications.
Capacity: While LSB canhide larger files, phase coding is ideal for secure message hiding with better concealment.

Audio steganography

Most audio steganography uses WAV file due to their uncompressed nature, which preserves the exact audio data needed for reliable extraction. If you don't know why, check my previous section on LSB for audio . When we hide data inside an audio, our main goal is imperceptibility means making sure the listener hears no difference between original audio and stego audio.

Phase coding method is an interesting method in audio steganography because it take advantage of limitation of human hearing: while our ears are highly sensistive to amplitude chnages (loudness) , we are much less sensitive to phase shifts in complex audio signals. But what are phases? To give the answer for this, we will be learning about WAV in depth here.

Understanding Sound and Phases

What is Sound? Sound is basically a vibration of air pressure that travels as waves to our ears. These vibrations can be visualized as a waveform like a graph of amplitude(air pressure change) over time. You will see the diagram later for better understanding. Digital audio store these waves as sequences of numbers, each representing the amplitude at a given instant. For example of 440 Hz sine wave might have samples like:

Sample #1: 0.000000 Sample #2: 0.062791 Sample #3: 0.125333 Sample #4: 0.187381 Sample #5: 0.248690

Each sample value represents the amplitude at specific moment in time. The mathematical representation of a sine wave is:

x(t) = A sin(2πft + ϕ)

where: A is amplitude which controls loudness means higher amplitude = louder sound. f is frequency which controls pitch means higher frequency = higher pitch. ϕ is phase which controls starting position of the wave.

This diagram shows how a sine wave looks like, the curly lines you see which starts from 0.0 then it goes up to 1.0 then goes down, it's a sine wave which a WAV file stores in raw PCM sample. The more composed sine wave you have the higher the frequency will audio have.

See those different frequencies, it's a visual comparison of 2Hz, 5Hz and 10Hz tones. The more composed a sine wave is, the more oscillation will it make and the higher the frequency will be there.

Now lets, talk about phases and magnitude.

Phase and Magnitude

Phase and magnitude are represented in complex number, when a sound is analyzed using Fast Fourier Transform(FFT): the result of it will be complex number whose size is the magnitude and whose angle is the phase.

The diagram shows, how FFT breaks a signal into strength and timing of each frequency.

In short, magnitude is how the loudness/strength of each frequency is in the signal and phase is the timing offset of each frequency's sinusoid.

Now, If we keep the magnitude (loudness of each frequency) the same but adjust the phase, the perceived sound remains almost identical but the changes carry the secret message's bit. To simply put, think of two identical songs played with the same volume and notes but one starts a tiny fraction of a second later. To the listener, they sound the same, but a precise measurement shows the start time shifted, now that shift is phase.

The top diagram shows the sine wave might look like a single wave but there are two waves overlapping each other. These two waves are perfect in-phase ( which has 0° difference) , the peaks and valleys align to each other. The bottom diagram shows two waves which has difference of 90° means they are out-of-phase, One wave starts from 1.0 peak and other wave starts from 0.0. So they are out phase. Now you understand what are phases, we will alter these phases to hide the secret message's bit.

Another example of phase shifting, it has same frequency but there are different starting points (0°, 90°, 180°).

Phase coding

Phase coding hides secret data by modifying these phase values of certain frequency components in an audio signal while keeping their magnitude unchanged. How this is going to work? In short, we will be dividing the audio into chunks then we will apply Fast Fourier Transform (FFT) to break these chunks into magnitude and phases. Then we alter these phases by embedding our secret data's bit which will shift the phase by 90° if bit is 1 and 0° if bit is 0. After embedding our data by shifting these phases, we will use that magnitudes which were broke down by FFT and apply Inverse Fast Fourier Transform(IFFT) to reconstruct the audio with modified phase.

Before we start implementing the code, i am sure you must be thinking about FFT , like what it is and how is it dividing the chunks into magnitude and phases.

Fast Fourier Transform

FFT is an algorithm that converts the chunks of audio from the time domain(waveform) into the frequency domain(a set of complex number). How FFT gives magnitude and phase? Well the output of FFT is a vector of complex values where each complex value represent a bin. A complex number is a combination of real and imaginary number where real part corresponds to the cosine(in-phase) contribution while imaginary part corresponds to the sine (quadrature) contribution. For example: If Bin 0 has complex value of 25 + 0j , the magnitude will be |X| = squrt(25² + 0²) = 25 and phase will be atan2(0,25) = 0°.

If it were -25 + 0j , then phase = π radians (180°)

if it were 0 + 25j , then phase = π/2 radians (90°)

if it were 0 - 25j, then phase = -π/2 radians (-90°) or 3π/2 radius depending on convention.

The magnitude will be the same for all as it's |X|.

The atan2 is 2 argument arctangent function that returns the angle in radians from the positive x-axis to the point (x,y) , correctly handling all four quadrants and the case when x <= 0.

The mathematical formula used by FFT to calculate and return the complex number is this:

Xₖ = ∑ₙ₌₀ⁿ⁻¹ xₙ e^{−2π i k n / N}

Xₖ is the complex magnitude at the k-th frequency bin. The magnitude |Xₖ| tells you how strong that frequency is. The argument(phase) arg Xₖ tells you the shift of that waveform.

The Inverse FFT is an algorithm that converts frequency domain (complex bins with magnitude and phases) back into time domain(waveform). It reverse the FFT process, if we keep magnitude the same and use modified phase , then IFFT will give you an audio signal that sounds nearly identical to the original but now it hides the information in the pahse. The mathematical formula used by IFFT to calculate to reverse the process is this:

xₙ = (1/N) ∑ₖ₌₀ⁿ⁻¹ Xₖ e^{+2π i k n / N}

Implementation of code

Before starting, ensure you have:

Read our AES Encryption guide.
Basics of NumPy.
The aes.py file will be used from AES encryption page.

Install dependencies first:

pip install numpy scipy pycryptodome

The implementation consists of two main functions:

embed_phase(): Hides the message in audio phases
extract_phase(): Recovers the message from phases

Both functions handle:

Audio format validation and conversion to int16
Mono/stereo channel selection
Block division and FFT processing
Phase manipulation with proper mirroring for real signals
Data integrity verification with CRC checksums

Embedding

import numpy as np
from scipy.io import wavfile
import random
from aes import encryption, decryption, to_seed
import binascii
import struct

def final_payload(payload):
    length = len(payload)
    if length > 0xFFFF: # 0xFFFF is 65,355 in decimal which is 65kb
        raise ValueError("The length is big to fit the 2byte length prefix!!")
    length_prefix = struct.pack(">H", length) #it takes big endian with H (unsigned 16bits) and length of the payload (138)
    checksum = binascii.crc32(payload) & 0xFFFFFFFF
    checksum_prefix = struct.pack(">I", checksum) #it takes big endian with I (unsigned 32 bits) so the data will have exactly 4bytes of data and checksum
    final = length_prefix + checksum_prefix + payload
    return final

Before we jump to the main function, here is function which for crc checksum.

We import numpy as np , numpy is a huge library for mathematical complex calculation which provide as array, matrix, multidimensional arrays, and will help us in calculations. We import wavefile from scipy, we are not using wave module because wavefile from scipy.io returns the audio as a numpy array with the sample rate in one call. Basically, you can think of scipy as a partner of numpy for calculation. We import random module to randomize the bins of FFT to make it impossible to extract the data without the correct key. Then we have encryption, decryption and to_seed function from aes file which i have mentioned earlier to read how AES encryption works. We import binascii for crc32 checksum to convert between the binary data and ASCII encoded binary representations. Finally we have struct module which allows us to work with binary data by providing functionality to convert between python values and C-style binary data.

First function of the code is to create crc32 checksum to check the integrity of the paylaod when extracting, the function final_payload() takes encrypted payload as in only argument. we store the length of the encrypted playload in length variable first, then check the length if it's too big to fit the 2 byte length prefix or not because we are storing the length in 2 bytes. Using struct.pack we converts python values into binary format where it takes 2 arguments '>H' and length. '>' means Big endian (most significant byte first) and 'H' means unsigned short (16bits) and if the length is 138 then length prefix will be 2 byte representaion of 138 in big-endian which is 00 8A. checksum variable conputes the CRC32 checksum of the encrypted payload(epayload) where binascii.crc32() returns an integer that can be negative in python so we use '& 0xFFFFFFFF' to make sure it's a positive 32-bit integer. Next, we packs the checksum into 4 bytes using struct.pack() again as checksum prefix, where '>' means big-endian as mentioned earlier and 'I' means unsigned int (32 bits) . Finally we return the final payload which is concatenated addition of length prefix, checksum prefix and payload. The final payload will be like:

You might have noticed the length of the payload i have been mentioning is 138, why? because it's a fixed length of data you can embed here. Why limited? It's because i am using this method to hide the secret messages not files means i will give the user 100 characters to write the message. 100 characters are equivalent to 100 bytes and as you can see the final payload adding 6 bytes so we are left with 32 bytes, you will get to know where those 32 byte come from.

Moving on to the main function:

def embed_phase(audio, payload, key):
    rate,audio = wavfile.read(audio)

    if audio.dtype != np.int16:
        if audio.dtype == np.float32 or audio.dtype == np.float64:
            if np.max(np.abs(audio)) <= 1.0:
                audio = (audio * 32767).astype(np.int16)
            else:
                audio = np.clip(audio, -32768, 32767).astype(np.int16)
        elif audio.dtype == np.int32:
            audio = (audio//65536).astype(np.int16)

    if np.all(audio == 0):
        raise ValueError("Input audio file appears to be silent or corrupted!!")
    
    payload = payload.ljust(100).encode('utf-8')
    if len(payload) > 100:
        raise ValueError("The limit exceeds 100 characters")

The main function takes 3 arguments: audio as cover file , payload as 100 character message and the key for encryption and randomization of bins using pseudo random number generator object.

First, we read the audio file using wavefile.read() function which gives us sample rate and audio data in array. For example:

rate - 44100 audio - [[0 0] [0 0] [0 0] ... [0 0] [0 0] [0 0]]

Next, we check if the audio file's data type is int16 or not, since int16 is the standard PCM format for WAV file. This nested if else condition ensures everything is converted to int16, why? because standard uncompressed PCM WAV format uses 16-bit signed integers which range from -32768 to +32767. If we keep it in float or int32, operation like FFT then IFFT then saving back to WAV could cause distortions or incompatibility. Converting to int16 guarantees consistency for embedding the data and later extracting the data. np.max() function returns the maximum value in an array and np.abs() function convertes negative audio sample to positive to check their magnitude, so we are checking if the largest absolute value in the audio is <= 1.0 (normalized float), if it is , we multiply the audio array list with 32767 to scale to int16 range. otherwise we limit the values in an array so they stay within a specified range using np.clip() funciton. If datatype is int32 we divide the audio array by 65536 to scale down to 16-bit means int16. Next, np.all() function checks if all elements of array are 0, if they are that means the audio file is either silent or invalid and throws an error.

After analyzing valid audio file and converting it to int16 , we move onto the payload. We give user 100 characters to fill, but if the user has less than 100 characters we pads the paylaod with blank space to make it 100 character using payload.ljust() function and encode it to utf-8. If the length of the payload is larger than 100 characters it will throw an error. Continuing on the code:

    epayload = encryption(payload, key)
    finalp = final_payload(epayload)
    textInBinary = np.unpackbits(np.frombuffer(finalp,dtype=np.uint8))
    total_bits = len(textInBinary)

    blockLength = int(2 * 2 ** np.ceil(np.log2(2 * total_bits)))
   
    # checks shape to change data to 1 axis
    if len(audio.shape) == 1:
        samples = len(audio)
        mono = True
        audio_channel = audio.copy()

    elif len(audio.shape) == 2:     
        samples = audio.shape[0]
        mono = False
        right_channel = audio[:,1].copy()
        left_channel = audio[:, 0].copy()

        if np.sum(np.abs(left_channel)) >= np.sum(np.abs(right_channel)):
            audio_channel = left_channel
            org_right = right_channel

        else:
            audio_channel = right_channel
            org_right = left_channel
      
    else:
        raise NotImplementedError("Only mono or stereo channels are allowed!!")

We encrypt those 100 characters(since 1 character is equivalent to 1 byte so 100 bytes) of payload or padded payload using encryption() function which takes 2 arguments: payload and the key. Now if you look at AES Encryption encryption() function, it adds 16 bytes of cipher.nounce and another 16 bytes of tag to the bytes of encrypted payload making it 132 bytes in total to embed. After encrypting the payload , we pass the encrypted payload to the final_payload() function which i described earlier to create CRC32 checksum for integrity checks which adds total of 6 bytes to the 132 bytes making it 138 bytes finally.

Next, we convert the final payload with CRC32 checksum into a binary array (0s and 1s). Using np.frombuffer() function which interprets the byte data as an array of 8-bit unsigned integers (0-255 in values). For example if the payload is "hey" the result of np.frombuffer() would be like this:

[ 0 132 25 190 54 144 6 19 105 4 181 124 200 190 120 46 223 136 23 32 96 87 62 174 172 68 86 68 166 6 58 214 152 52 136 107 214 40 234 92 131 110 237 212 175 206 211 135 45 155 24 127 241 141 24 200 228 252 12 103 215 157 157 165 119 104 57 164 183 167 231 113 4 129 228 180 138 31 215 197 20 222 10 160 46 167 83 103 246 95 119 173 31 188 97 178 22 26 97 245 230 71 46 175 68 139 240 150 96 240 249 58 125 17 250 183 28 103 139 184 148 116 203 161 130 117 149 89 106 208 15 67 212 38 106 205 49 168]

which is basically a list of byte values range from 0-255 from each byte of final 138 bytes. Then we use np.unpackbits on this list of byte values which converts these each values to the bits which results in the list of bits [0 0 0 ... 1 0 1]. Then we store the length of the list of bits in total_bits which result in 1104 bits.

Next, we have to calculate the blockLength according to the total bits. Block length decides in how many blocks/chunks we have to divide the audio file. To calculate the blockLength there is mathematical formula:

blockLength = 2 * 2^{[log_2(2×total\_bits)]}

To use this formula in code we will use np.ceil() and np.log2() functions. First we multiple total_bits with 2 and then use np.log2() function on it, np.log2() returns float and it finds the power of 2 that fits the size. For example: if total_bits were 8 then we will end up having np.log2(2*8) = np.log2(16) = 4.0 and not only 4.0 if the total bits are 1104 as we calculated previously it would result in 11.108524456778168. Can we use that float number further to divide the audio in chunks? No we need a whole number and integer. That's why we use np.ceil() function which rounds up to the nearest whole number. So if we use np.ceil on 11.108524456778168 , the result would be 12.0. Now this 12.0 will become the power on 2 which will be multiplied by 2 again, the blockLength would result in 8192.

Unlike LSB method, I made this tool limited to mono and stereo modes, cause every mode will have different shape, different array, different channels and different size. Also the calculation will become more complicated than it is right now. Handling single mode is already a headache here, Imgaine having a mode 4-8 channels.

Now, we have nested if else condition that checks if the audio is mono or stereo, depending on the mode we will provide the audio data in array to fetch the phases using FFT. In if else condition, we first check the length of audio shape using len(audio.shape) , 'audio.shape' gives out (Nsample,) if the audio is mono also called 1D numpy array and 'audio.shape' gives out (Nsample, Nchannels) if the audio is stereo also called 2D numpy array.

If len(audio.shape) is 1 then the audio is in mono, if len(audio.shape) is 2 then the audio is in stereo mode. For mono: 'audio' is already a list of array, so we store the length of the audio (which is equal to the number of frames in it) in samples then we copy the original audio file's array into audio_channel variable. Samples are total frames of that audio and audio_channel is entire audio. For stereo: we get the Nsample from audio using audio.shape[0], since audio.shape for stereo would be like (8560559, 2) so on 0 index we have Nsample and then we store the total number of frames(Nsample) in samples variable.

Now here is the thing, in mono, audio_channel holds the entire audio as in list of arrays. But stereo has 2 channels, Left and right channels so which one to use or can we use both? We can definitely use both channels but it might reveal to the listener that something is wrong in either left or right channel cause of noise. So we are going to use either left or right channel but it will depend which channel is louder so the listener wont notice if something is wrong. To decide which channel to use, we copy the audio[:,1] means all samples of the right channels in right_channel variable and copy audio[:,0] means all samples of the left channel in left_channel variable.

After extracting left and right channels separately, we check the loudness in both channels. If the left channel has more loudness than right channel, audio_channel variable will hold the data of left_channel to modify the phase and save the right channel's data in org_channel for later when reconstructing the audio. If right channel has more loudness than left, we reverse the situation by storing the right channel's data to audio_channel to modify the phases. If the mode is neither mono or stereo , it will throw an error. Continuing on code:

    blockNumber = int(np.ceil(samples / blockLength))

    B = 8
    capacity = blockNumber * B
    if total_bits > capacity:
        raise ValueError(f"Audio too short! Need {total_bits} bits, capacity is {capacity}")
    
    required_samples = blockNumber * blockLength

    if len(audio_channel) < required_samples:
        audio_channel = np.pad(audio_channel, (0, required_samples - len(audio_channel)), mode='constant')
    else:
        audio_channel = audio_channel[:required_samples]

    blocks = audio_channel.reshape((blockNumber, blockLength))
    
    dft = np.fft.fft(blocks, axis=1)  # Calculate fft  
    magnitudes = np.abs(dft) # calculate magnitudes   
    phases = np.angle(dft) # create phase matrix   

    blockMid = blockLength // 2
    candidates = list(range(1, blockMid))
    seed = to_seed(key)
    magic = random.Random(seed)
    magic.shuffle(candidates)

BlockNumber variable hold the number of chunks which can be divided from the audio file. We simply use the samples (number of frames in mono or stereo) and divide it by blockLength. Then use np.ceil() function to round up the result as in whole number.

Next, there is B = 8 , means we will embed 8 bits of our message in each block. Now we can calculate capacity: multiplying the block number by B(8) and check if the message is bigger than what audio can store. We compute the required sample that are needed because we split the audio into equal sized blocks for FFT. Next, we pad or trim the audio depeding on required sample. If audio is too short, we pad with zeros. If audio is too long, we trim it down. By padding or trimming we ensure exact length for reshaping into blocks. Now we reshape the blocks using blocknumber and blockLength. Why? because we need multiple chunks for embedding also after padding/trimming the audio becomes 1D array so to make it 2D array we use blocknumber(as rows) and blockLength(as columns). We can embed all the data in a single block but if that block somehow get deteriorates the entire hidden data will be lost and instead of treating the entire audio as one big array, we divides it into equal sized blocks(or frames). Now each block will undergo FFT separately so we can manipulate phase value without messing up the entire signal. Using np.fft.fft() function we calculate the FFT but we give axis=1 argument with blocks so we can apply FFT along each row(each block). Now each block gets its own frequency representation.

Now after calculating FFT of each block we get the array of Bins (complex number) as i have described earlier how FFT works and what it returns. So these bins (complex numbers) are magnitude and phase, we need to extract them. Using np.abs() function we extract magnitudes from FFT output and using np.angle() function we extract phases from FFT output. These function works on the formula i have described for magnitude and phases.

Now, we calculate blockMid which is basically the middle array of the blocklength using blockLength // 2. We must avoid touching the 1st bin (index 0) and the middle bin(Nyquist bin) as we just calculated, Because 1st Bin or Bin 0 (index 0) represent DC component means zero frequency. It's essentially the average amplitude of the entire block. If we change its phase, the entire block will shifts up or down in amplitude creating an obvious distortion. As for Middle Bin (blockMid) it represent Nyquist frequency(half of the sampling rate), this bin is special because it doesn't have a mirrored pair (unlike others). Changing it alone breaks the symmetry needed for a real signal after IFFT. So these are the reason we must avoid Bin 0 and Middle Bin.

We will use the bins that are in between Bin 0 and Middle Bin, because FFT of a real signal is symmetric and to keep the signal real after inverse FFT we need to maintain that symmetry.

Moving on, after calculating blockMid we create a list of Bins which range from 1 to blockMid (obvious excludes in range) and store them in candidates variables. Now we create a seed using to_seed() function on user provided key and create pseudo random sequence to shuffle those Bins in candidates variables so nobody gonna know in which Bins we have hide the data. Continuing on the code:

    bit_idx = 0

    for block_idx in range(blockNumber):
        if bit_idx >= total_bits:
            break

        bins = candidates[block_idx*B : (block_idx+1)*B]
        for b in bins:
            if bit_idx >= total_bits:
                break

            bit = textInBinary[bit_idx]
            omega = (+np.pi/2) if bit else (-np.pi/2)
            phases[block_idx, b] = omega
            phases[block_idx, blockLength - b] = -omega #Mirror frequency
            bit_idx += 1

        if bit_idx >= total_bits:
            break

Now, here is the most important or cruical or complex part of this code the for loop. We have total bits our message so we set if condition on bit_idx , if bit_idx gets equal or higher than total bits the loop breaks meaning it already embedded all the bits of message. This loop checks for every block if we still have bits left to embed. Next, we select bins from shuffled candidate list, if block_idx is 2 then selected bins for this block comes from index 16 to index 24. Then we use for loop again on those selected bins, we again first check if bits are left embed or not. If left, we get the current bit from textInBinary[current index] and then we convert the bit into a phase angle using (+np.pi/2) if bit is 1 otherwise it use (-np.pi/2) if bit is 0 . This is how phase coding works: embed information in phase shift. Here we are shifting the phase from current position to +90° or -90°. Then we assign this phase to the bin using phase[block_idx, b] , where block_idx means current block and b means those selected bins from candidates. We assign these phases to its mirror also using 'phases[block_idx, blockLength - b]' . Why? Because for real signals, FFT needs conjugate symmetry. So if you change bin b, you must apply the opposite phase to its mirror (N-b) . Then we move on to the next bit using bit_idx += 1. This loop continues till bit_idx becomes equal or greater than total_bits.

Top diagram (orignal phases) show the random original phase values for each FFT in one block. Middle (Modified phases) shows the selected bins (2,3,4) are replaced with +π/2 or -π/2 depending on the bit(1 or 0). Their mirror bins (blockLength - b) are also modified to the opposite sign to maintain symmetry. Those blue arrows shows the mirror relationship between bins. Bottom (Difference) shows how much each bin's phase chnaged (non-zero only where we modified).

Continuing on the code:

    modified_fft = magnitudes * np.exp(1j * phases)
    renew_block = np.fft.ifft(modified_fft, axis=1)
    renew_block = np.real(renew_block)
    renew_block = np.clip(renew_block, -32768, 32767)

    new_audio = renew_block.ravel().astype(np.int16)

    if mono:
        output = new_audio
    else:
        min_length = min(len(new_audio), len(org_right))
        output = np.zeros((min_length, 2), dtype=np.int16)
        output[:, 0] = new_audio[:min_length]
        output[:,1] = org_right[:min_length]

    wavfile.write("encoded.wav", rate, output)
    return "encoded.wav"

Finally on the last part of embed function, we rebuild complex FFT with modified phases and magnitude of each FFT bin (from original audio). np.exp() function converts the phase angles back to the complex representation. By multiplying magnitude and complex phase we reconstruct the full FFT representation for each block. Next, using np.fft.ifft() function we convert each block from frequency domain to time domain(waveform). axis=1 means we apply IFFT row-wise because each row is a block. The output of this function would be complex but the real part of the complex number is the actual signal. Then using np.real() function we discard any imaginary component from the IFFT, keeping only real part. Next, using np.clip() function we clip values to valid PCM range, basically audio sample must be 16-bit signed integers and PCM range is [-32768, +32767]. This ensure we don't overflow and distort the sound.

Using .ravel() function we flattens the 2D block structure back into a 1D array(like original audio) and using .astype(np.int16) which converts to 16-bit integer type of WAV format.

Now, the audio was mono the output would be new_audio remains same. If audio was stereo, we have to ensure the output WAV file has two channels, remember we saved the right channel in org_right, we will use it to add this modified left channel to reconstruct the audio. First, we find the smallest length using len(new_audio) and len(org_right) , sometime after embedding the new_audio might be slightly shorter than the original right channel or vice versa which can throw an index errors so to avoid this we take the minimum length.

Then we create an empty stereo array using np.zeros() which will fill the array with 0s. we use this function with min_length and 2 means we create a 2D array with min-length rows and 2 columns. Column 0 is left channel and column 1 is right channel. Next, we fill the left and right channels, entire first column (0) will be filled with left channel(modified audio) and second column(1) with right channel (original audio). We slice both arrays up to min_length to avoid overflow.

Lastly, we use wavfile.write() function to write the modified audio back to a WAV file with the same sample rate. We are done with embedding.

Extraction

We learned more than you can think, now this will make easier to understand the extraction process.

def extract_phase(audio, key):
    rate, audio = wavfile.read(audio)

    if audio.ndim == 1:
        mono = True
        audio_channel = audio.copy()
        audio_samples = len(audio)
    elif audio.ndim == 2:
        mono = False
        audio_samples = len(audio)
        left_channel = audio[:,0].copy()
        right_channel = audio[:,1].copy()

        if np.sum(np.abs(left_channel)) >= np.sum(np.abs(right_channel)):
            audio_channel = left_channel
        else:
            audio_channel = right_channel

    else:
        raise NotImplementedError("Only mono or stereo channels are allowed!!")
    
    total_bits = 138 * 8

    blockLength = int(2 * 2 ** np.ceil(np.log2(2 * total_bits)))
    blockNum = int(np.ceil(audio_samples / blockLength))

Let's start implementing the extraction function. If you want to create separate file then make sure to impor the same modules you imported during embed function or you can create this function after embed function.

Extract function takes 2 arguments: audio as stego_file and the key for decryption adn for the same sequence order when you embed the message. First, using wavfile.read() function on audio we get the sample rate and audio data in array. Then we check if the audio is mono or stereo using audio.ndim, it returns 1 if the audio is mono and returns 2 if audio is stereo. So if audio is mono, we store array of original stego file in audio_channel and stores the total samples in audio_samples. If audio is stereo, we store the total sample in audio_samples , we then store the left channel's data in left_channel and right channel's data in right_channel using numpy slicing. Then we check which channel has the loudness or has more strength in frequency using np.abs() and np.sum() function together.

After choosing the louder channel, we have set the total_bits = 138 * 8 , cause we know we have embedded limited and fixed amount of data which is 138 bytes. Then we calculate blockLength and blockNum using the same formula as we did during embedding.

    B = 8
    capacity = blockNum * B

    if total_bits > capacity:
        raise ValueError(f"Audio is too short! Need {total_bits} bits, can extract {capacity} bits.")

    required = blockNum * blockLength
    if len(audio_channel) < required:
        padding = required - len(audio_channel)
        audio_channel = np.pad(audio_channel, (0, padding), mode='constant')
    else:
        audio_channel = audio_channel[:required]


    blocks = audio_channel.reshape((blockNum, blockLength))
    dft = np.fft.fft(blocks, axis=1)
    phases = np.angle(dft)

Next, we have set B = 8 means 8 bits per block. Then we check capacity of the audio with the same method as we did during embedding. Mostly things in the extraction function is same, so i don't need to explain these over again. Now, we compute the required samples to trim or pad the audio. We then reshape the blocks using blockNum and blockLength giving each block its own shape, like creating chunks audio.

After creating blocks, we will use np.fft.fft() function to calculate Fast Fourier Transform and get the Bins(complex numbers) of every block on every row (as we using axis = 1) . We don't need to extract magnitude here as there is no need for it. We directly extract phases from output of FFT.

    blockMid = blockLength // 2
    candidates = list(range(1, blockMid))
    magic = random.Random(to_seed(key))
    magic.shuffle(candidates)

    extracted_bits = []
    bit_idx = 0

    for block_idx in range(blockNum):
        if bit_idx >= total_bits:
            break
        bins = candidates[block_idx*B : (block_idx+1) * B]
        for b in bins:
            if bit_idx >= total_bits:
                break
            phase_val = phases[block_idx, b]
            bit = 1 if phase_val > 0 else 0
            extracted_bits.append(bit)
            bit_idx += 1

We find the blockMid so we can avoid touching it and create the list of candidates from 1 to blockMid. Then we use random.Random() function to create the same pseudo random sequence if the key is correct and shuffle those list of candidates.

Next, we create an empty list of extracted_bits so we can append the bits we find during extraction. We set bit_idx to 0 then start the for loop, again if the bit_idx is equal or greater than total_bits the loop will breaks meaning we already collected all the bits. Then, we select bins from shuffled candidate list, if block_idx is 2 then selected bins for this block comes from index 16 to index 24. Then we use for loop again on those selected bins, we again first check if bits are left embed or not. Now, we have phase_value which store the phase of frequency bin b in block block_idx, these phases were modified during embedding to either +π/2 or -π/2. Next , we convert those phase value back into bits, if the phase is positive,it means the original bit was 1 and if the phase si negative , means the original bit was 0. The we append these bits in extracted_bits[] empty list to create a list of all bits of our message. Then move on the next bit till the loop breaks.

    if len(extracted_bits) < total_bits:
        raise ValueError("Not enough bits extracted") 

    extracted_bytes = np.packbits(np.array(extracted_bits[:total_bits], dtype=np.uint8), bitorder='big').tobytes()
    if len(extracted_bytes) < 6:
        print("Insufficient data extracted - audio may not contain a message.")
        return None
    
    length = struct.unpack(">H", extracted_bytes[:2])[0]
    crc = struct.unpack(">I", extracted_bytes[2:6])[0]
    epayload = extracted_bytes[6:6+length]

    ocrc = binascii.crc32(epayload) & 0xFFFFFFFF
    if ocrc != crc:
        print("Data integrity check failed - file may be corrupted")
        raise ValueError("Crc mismatch - data corrupted")
        
    text = decryption(epayload, key)
    recover = text.rstrip().decode("utf-8")
    print(f"Extracted message : {recover}")
    return recover

After extracting bits from phase, we check if the extracted bits are less than total_bits or not, if it is not then it throws an error. Then we convert those extracted bits back into byes using np.packbits() function. Next, we check first 6 bytes contains the 2 bytes for message length and 3 byets for CRC checksum. If we don't have at least 6 bytes, the message structure is invalid, probably no message was embedded.

Using struct.unpack() fucntion we read 2 byes as an unsigned short( big-endian) for the message length. Then again using struct.unpack() function we read 4 bytes as an unsigned int for CRC checksum. Now we know length means how many bytes the hidden payload has and crc means the checksum of that payload (used for integrity check). Now we extract the actual payload using 'extracted_bytes[6:6+length]', the real embedded message starts after the first 6 bytes (headers) and take length bytes from that position.

Lastly, we verify the data integrity using CRC check, We first calculate the CRC32 of epayload then compares it with the original CRC stored in the header. If the don't match, the message is corrupted or tampered. We finally decypt the encrypted payload we extracted using the key. and return the recover message.

To use this tool, check my repository:

GitHub - kaizoku73/Resono: Advanced audio steganography tool using phase coding to hide AES-encrypted messages in WAV files. Secure, imperceptible, and robust covert communication.GitHub

I named it Resono because it comes from Latin word “resonare” → to resound, to echo, which ties directly to waves / phases / sound resonance. Clone the repository to use it.

Thanks.

PreviousLSB for audio NextHistogram Shift in Image

Last updated 4 months ago