问题

机器学习课上的一道题,题目是:仅使用矩阵乘法来生成频谱图矩阵。输入声音数据是一个列向量 $x$。请详细描述您将如何构建一个矩阵 $A$,使得 $A∙x$ 将产生复数频谱系数的 $vec(∙)$。$\text{DFT_size}=64,\text{Hop_size}=32$,并使用Hann window减少泄漏。

As promised in class, matrix multiplies can perform all kinds of linear operations.
Here you will have to implement a spectrogram using only a matrix multiplication.
Your input sound is a column vector x. Describe exactly how you would construct
a matrix A, such that the product A∙x will produce the vec(∙) of the complex
spectrogram coefficients. Your transform should have a DFT size of 64, a hop size
of 32, and will use a Hann window. Make an image plot of the real part of the matrix
A and allow me to marvel at its beauty.

解答思路

要解决这个问题,我们首先假设目标向量(即频谱图)是$S$,输入的声音数据列向量大小为$N$,给定$\text{DFT_size}=64, \text{Hop_size}=32$,帧频的数量就能被表示为:

$$ \text{Frame#} = \left\lceil \frac{N - \text{DFT_size}}{\text{Hop_size}} \right\rceil + 1 = \left\lceil \frac{N-64}{32} \right\rceil + 1 $$

每个帧频包含 $\text{DFT_size}$ 个频格。这些频格随后构成我们的频谱图向量 $S$。因此,$S$ 的长度可以表示为:

$$ |S| = \left\lceil \frac{N-64}{32} \right\rceil \times 64 = 2N - 64 $$

根据矩阵乘法的性质,$|S|$ 等于$A$ 中行的数量。所以矩阵 $A$ 的大小应该为 $(2N-64) \times N$

现在,让我们确定这个矩阵中每个entry的值。为了计算一个段 $ [a_1, a_2, \ldots, a_{64}]$ 的复数频谱系数,首先应用Hann Window来减少泄漏:

$$ w(n) = 0.5 \times \left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right), \quad N = 64 $$

应用Hann Window后的结果数据变为:

$$ \text{data_with_hann_window} = \left[ w(1)a_1, w(2)a_2, \ldots, w(64)a_{64} \right] $$

随后的步骤是应用 DFT:

$$ X(k) = \sum_{n=0}^{N-1} x[n] e^{-j\left(\frac{2\pi kn}{N}\right)} $$

给定 $x = \text{data_with_hann_window} $ 以及$N = 64$, 转换过后的数据能表示为:

$$ \text{data_with_hann_window_DFT} = \left[ X(0), X(1), \ldots, X(63) \right] $$

为了以矩阵形式表示这个过程,并考虑到Hop size,让我们定义三个函数:

$$ X(n, k) = e^{-j(2\pi nk/64)} \\ H(k) = 0.5 \times \left(1 - \cos\left(\frac{2\pi k}{63}\right)\right) \\ q(i) = 32 \left( \left\lfloor \frac{i}{64} \right\rfloor \right) $$

对于矩阵 $A$,其第 $i$ 行和第 $j$ 列的元素表示为 $A_{i, j}$,该元素必须满足以下条件:


Matrix A for sound sample size 64
Matrix A for sound sample size 64


从对$A$的可视化中,我们不难看出矩阵$A$是中心对称的,所以生成算法其实还有优化的空间,但由于只是作业就没管那么多了。

Code

import tqdm
import matplotlib.pyplot as plt
import numpy as np
from pydub import AudioSegment

sample_size = 20000
DFT_size = 128
Hop_size = 64

DRAW_A = False
DRAW_SPECTROGRAM = True
SAMPLE_FILENAME = 'sample.m4a'
assert (DFT_size / 2 == Hop_size)


# DFT function
def dft(n, k):
    return np.exp(-1j * (2 * np.pi * n * k / DFT_size))


# Hann window function
def hann(k):
    return 0.5 * (1 - np.cos(2 * np.pi * k / (DFT_size - 1)))


# down sampling, or computer will bomb
sample_raw = AudioSegment.from_file(SAMPLE_FILENAME).set_frame_rate(8000)
sample = np.array(sample_raw.get_array_of_samples())
N = min(sample_size, sample.shape[0])
sample = sample[:min(sample_size, N)]
print(f"sound has shape: {sample.shape[0]}, we use {N} datas as sample")

A = np.zeros((2 * N - DFT_size, N))

for i, row in enumerate(tqdm.tqdm(A)):
    q = Hop_size * (i // DFT_size)
    for j in range(row.shape[0]):
        if (j >= q) and (j <= (q + DFT_size)):
            # A_ij != 0
            row[j] = dft(n=j - q, k=i % DFT_size) * hann(j - q)

        # otherwise A_ij = 0

# visualize A
if DRAW_A:
    plt.imshow(A.real)
    plt.colorbar()
    plt.title(f"matrix A (sample size: {sample_size})")
    plt.axis('off')
    plt.show()

# draw spectrogram
if DRAW_SPECTROGRAM:
    spec_coefficient = A @ sample
    # keep partial of spec coefficient which can be | 64, add up real and imaginary part, and normalize log10
    spec_coefficient = np.abs(spec_coefficient[:spec_coefficient.shape[0] // DFT_size * DFT_size])
    spec_coefficient = spec_coefficient.reshape((DFT_size, -1))

    plt.imshow(np.log10(spec_coefficient), origin='lower', aspect='auto')
    plt.colorbar(label='Log Magnitude')
    plt.title(f"Spectrogram (DFT size: {DFT_size})")
    plt.xlabel('time frame')
    plt.ylabel('frequency bin')
    plt.show()

完整版

附件
附件名称:CS545 HW1.zip
文件大小:1097.4 KB
下载次数: 231
最后修改: 2023-11-27 01:06