问题
机器学习课上的一道题,题目是:仅使用矩阵乘法来生成频谱图矩阵。输入声音数据是一个列向量 $x$。请详细描述您将如何构建一个矩阵 $A$,使得 $A∙x$ 将产生复数频谱系数的 $vec(∙)$。$\text{DFT_size}=64,\text{Hop_size}=32$,并使用Hann window减少泄漏。
As promised in class, matrix multiplies can perform all kinds of linear operations.
Here you will have to implement a spectrogram using only a matrix multiplication.
Your input sound is a column vector x. Describe exactly how you would construct
a matrix A, such that the product A∙x will produce the vec(∙) of the complex
spectrogram coefficients. Your transform should have a DFT size of 64, a hop size
of 32, and will use a Hann window. Make an image plot of the real part of the matrix
A and allow me to marvel at its beauty.
解答思路
要解决这个问题,我们首先假设目标向量(即频谱图)是$S$,输入的声音数据列向量大小为$N$,给定$\text{DFT_size}=64, \text{Hop_size}=32$,帧频的数量就能被表示为:
$$ \text{Frame#} = \left\lceil \frac{N - \text{DFT_size}}{\text{Hop_size}} \right\rceil + 1 = \left\lceil \frac{N-64}{32} \right\rceil + 1 $$
每个帧频包含 $\text{DFT_size}$ 个频格。这些频格随后构成我们的频谱图向量 $S$。因此,$S$ 的长度可以表示为:
$$ |S| = \left\lceil \frac{N-64}{32} \right\rceil \times 64 = 2N - 64 $$
根据矩阵乘法的性质,$|S|$ 等于$A$ 中行的数量。所以矩阵 $A$ 的大小应该为 $(2N-64) \times N$
现在,让我们确定这个矩阵中每个entry的值。为了计算一个段 $ [a_1, a_2, \ldots, a_{64}]$ 的复数频谱系数,首先应用Hann Window来减少泄漏:
$$ w(n) = 0.5 \times \left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right), \quad N = 64 $$
应用Hann Window后的结果数据变为:
$$ \text{data_with_hann_window} = \left[ w(1)a_1, w(2)a_2, \ldots, w(64)a_{64} \right] $$
随后的步骤是应用 DFT:
$$ X(k) = \sum_{n=0}^{N-1} x[n] e^{-j\left(\frac{2\pi kn}{N}\right)} $$
给定 $x = \text{data_with_hann_window} $ 以及$N = 64$, 转换过后的数据能表示为:
$$ \text{data_with_hann_window_DFT} = \left[ X(0), X(1), \ldots, X(63) \right] $$
为了以矩阵形式表示这个过程,并考虑到Hop size,让我们定义三个函数:
$$ X(n, k) = e^{-j(2\pi nk/64)} \\ H(k) = 0.5 \times \left(1 - \cos\left(\frac{2\pi k}{63}\right)\right) \\ q(i) = 32 \left( \left\lfloor \frac{i}{64} \right\rfloor \right) $$
对于矩阵 $A$,其第 $i$ 行和第 $j$ 列的元素表示为 $A_{i, j}$,该元素必须满足以下条件:
从对$A$的可视化中,我们不难看出矩阵$A$是中心对称的,所以生成算法其实还有优化的空间,但由于只是作业就没管那么多了。
Code
import tqdm
import matplotlib.pyplot as plt
import numpy as np
from pydub import AudioSegment
sample_size = 20000
DFT_size = 128
Hop_size = 64
DRAW_A = False
DRAW_SPECTROGRAM = True
SAMPLE_FILENAME = 'sample.m4a'
assert (DFT_size / 2 == Hop_size)
# DFT function
def dft(n, k):
return np.exp(-1j * (2 * np.pi * n * k / DFT_size))
# Hann window function
def hann(k):
return 0.5 * (1 - np.cos(2 * np.pi * k / (DFT_size - 1)))
# down sampling, or computer will bomb
sample_raw = AudioSegment.from_file(SAMPLE_FILENAME).set_frame_rate(8000)
sample = np.array(sample_raw.get_array_of_samples())
N = min(sample_size, sample.shape[0])
sample = sample[:min(sample_size, N)]
print(f"sound has shape: {sample.shape[0]}, we use {N} datas as sample")
A = np.zeros((2 * N - DFT_size, N))
for i, row in enumerate(tqdm.tqdm(A)):
q = Hop_size * (i // DFT_size)
for j in range(row.shape[0]):
if (j >= q) and (j <= (q + DFT_size)):
# A_ij != 0
row[j] = dft(n=j - q, k=i % DFT_size) * hann(j - q)
# otherwise A_ij = 0
# visualize A
if DRAW_A:
plt.imshow(A.real)
plt.colorbar()
plt.title(f"matrix A (sample size: {sample_size})")
plt.axis('off')
plt.show()
# draw spectrogram
if DRAW_SPECTROGRAM:
spec_coefficient = A @ sample
# keep partial of spec coefficient which can be | 64, add up real and imaginary part, and normalize log10
spec_coefficient = np.abs(spec_coefficient[:spec_coefficient.shape[0] // DFT_size * DFT_size])
spec_coefficient = spec_coefficient.reshape((DFT_size, -1))
plt.imshow(np.log10(spec_coefficient), origin='lower', aspect='auto')
plt.colorbar(label='Log Magnitude')
plt.title(f"Spectrogram (DFT size: {DFT_size})")
plt.xlabel('time frame')
plt.ylabel('frequency bin')
plt.show()
那啥,这个附件我也想抄
这个我用的插件,Accessories,改了一丢丢css