I did a PCA in Python on audio spectrograms and am facing the following problem: I have a matrix where each row consists of flattened song features. After applying PCA it is clear that the dimensions are reduced, but I can't find that reduced-dimensional data anywhere in the original dataset.
import sys
import glob
from scipy.io.wavfile import read
from scipy import signal
from scipy.fftpack import fft
import numpy as np
import matplotlib.pyplot as plt
import pylab
# Read file to get samplerate and numpy array containing the signal
files = glob.glob('../some/*.wav')
song_list = []
for wav in files:
    (fs, x) = read(wav)
    channels = [
        np.array(x[:, 0]),
        np.array(x[:, 1])
    ]
    # Combine channels to make a mono signal out of stereo
    channel = np.mean(channels, axis=0)
    channel = channel[0:1024]
    # Generate spectrogram
    ## Freqs is the same with different songs, t differs slightly
    Pxx, freqs, t, plot = pylab.specgram(
        channel,
        NFFT=128,
        Fs=44100,
        detrend=pylab.detrend_none,
        window=pylab.window_hanning,
        noverlap=int(128 * 0.5))
    # Magnitude spectrum to use
    Pxx = Pxx[0:2]
    X_flat = Pxx.flatten()
    song_list.append(X_flat)
song_matrix = np.vstack(song_list)
If I now apply PCA to the song_matrix...
import matplotlib
from matplotlib.mlab import PCA
from sklearn import decomposition
#test = matplotlib.mlab.PCA(song_matrix.T)
pca = decomposition.PCA(n_components=2)
song_matrix_pca = pca.fit_transform(song_matrix.T)
pca.components_ #These components should be most helpful to discriminate between the songs due to their high variance
...the final 2 components are the following:
Final components - two dimensions from 15 wav-files
The problem is that I can't find those two vectors in the original dataset with all of its dimensions. What am I doing wrong, or am I misinterpreting the whole thing?
PCA doesn't give you the vectors in your dataset.
From Wikipedia:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Say you have a column vector V containing ONE flattened spectrogram. PCA will find a matrix M whose columns are orthogonal vectors (think of them as being at right angles to every other column in M).
Projecting V onto those columns (i.e. multiplying the transpose of M by V) gives you a vector T of "scores", which tells you how much of the original data's variance each column of M captures; each successive column of M captures progressively less variance in the data.
Taking M' as the first 2 columns of M, multiplying its transpose by V produces a 2x1 vector T' representing the "dimension-reduced spectrogram". You could reconstruct an approximation of V by multiplying M' by T'. This works for a whole matrix of spectrograms, too. Keeping only two principal components produces an extremely lossy compression of your data.
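To make that concrete, here is a minimal sketch of that projection/reconstruction round trip using sklearn's PCA (the 15x128 song_matrix here is random stand-in data, not your actual features):
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for song_matrix: 15 songs x 128 flattened spectrogram values
song_matrix = np.random.rand(15, 128)

pca = PCA(n_components=2)
scores = pca.fit_transform(song_matrix)      # T': one 2-value "score" row per song
approx = pca.inverse_transform(scores)       # back-projection onto the 2 components

print(scores.shape)                          # (15, 2)  -- the reduced representation
print(pca.components_.shape)                 # (2, 128) -- directions in feature space
print(np.linalg.norm(song_matrix - approx))  # reconstruction error: lossy compression
Note that pca.components_ are directions in feature space, which is why you will never find them as rows of song_matrix.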
But what if you want to add a new song to your dataset? Unless it is very much like the original songs (meaning it introduces little variance to the original data set), there's no reason to think that the vectors of M will describe the new song well. For that matter, even multiplying all the elements of V by a constant would render M useless. PCA is quite data specific, which is why it isn't used in image/audio compression.
The good news? You can use a Discrete Cosine Transform (DCT) to compress your training data. Instead of data-derived directions, it uses a fixed basis of cosines, so it doesn't suffer from the data-specific limitation. The DCT is used in JPEG, MP3 and other compression schemes.
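As an illustration, here is a minimal sketch of compressing a single flattened feature vector by keeping only its first few DCT coefficients (the random vector and the cutoff of 32 coefficients are arbitrary choices, not recommendations):
import numpy as np
from scipy.fftpack import dct, idct

x = np.random.rand(1024)                 # stand-in for one flattened spectrogram row

coeffs = dct(x, norm='ortho')            # fixed cosine basis, independent of the dataset
compressed = np.zeros_like(coeffs)
compressed[:32] = coeffs[:32]            # keep only the 32 lowest-frequency coefficients

approx = idct(compressed, norm='ortho')  # lossy reconstruction from the kept coefficients
print(np.linalg.norm(x - approx))        # reconstruction error
Because the cosine basis is fixed, the same scheme works unchanged for any new song you add.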
Related
I have two images: an original image and a binarized image.
I have applied the Discrete Cosine Transform to the two images by dividing each 256x256 image into 8x8 blocks. Afterwards, I want to compare their DCT coefficient distributions.
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import numpy as np
import os.path
import scipy
import statistics
from numpy import pi
from numpy import sin
from numpy import zeros
from numpy import r_
from PIL import Image
from scipy.fftpack import fft, dct
from scipy import signal
from scipy import misc
if __name__ == '__main__':
    image_counter = 1

    # Open the noisy image.
    noise_image_path = 'noise_images/' + str(image_counter) + '.png'
    noise_image = Image.open(noise_image_path)

    # Open the binarized image.
    ground_truth_image_path = 'ground_truth_noise_patches/' + str(image_counter) + '.png'
    ground_truth_image = Image.open(ground_truth_image_path)

    # Convert the images into ndarrays.
    noise_image = np.array(noise_image)
    ground_truth_image = np.array(ground_truth_image)

    # Create arrays `noise_dct_data` and `ground_truth_dct_data` to hold the DCT coefficients of the two images.
    noise_image_size = noise_image.shape
    noise_dct_data = np.zeros(noise_image_size)
    ground_truth_image_size = ground_truth_image.shape
    ground_truth_dct_data = np.zeros(ground_truth_image_size)

    for i in r_[:noise_image_size[0]:8]:
        for j in r_[:noise_image_size[1]:8]:
            # Apply the DCT to every 8x8 block of the noisy image.
            noise_dct_data[i:(i+8), j:(j+8)] = dct(noise_image[i:(i+8), j:(j+8)])
            # Apply the DCT to every 8x8 block of the binarized image.
            ground_truth_dct_data[i:(i+8), j:(j+8)] = dct(ground_truth_image[i:(i+8), j:(j+8)])
The above code gets the DCT of the two images. I want to plot their DCT coefficient distributions, just like the image below:
The thing is, I don't know how to plot it. Below is what I did:
#Convert 2D array to 1D array
noise_dct_data = noise_dct_data.ravel()
ground_truth_dct_data = ground_truth_dct_data.ravel()
#I just used a Histogram!
n, bins, patches = plt.hist(ground_truth_dct_data, 2000, facecolor='blue', alpha=0.5)
plt.show()
n, bins, patches = plt.hist(noise_dct_data, 2000, facecolor='blue', alpha=0.5)
plt.show()
image_counter = image_counter + 1
My questions are:
1. What do the X- and Y-axes in the figure represent?
2. Are the values stored in noise_dct_data and ground_truth_dct_data the DCT coefficients?
3. Does the Y-axis represent the frequency of the corresponding DCT coefficients?
4. Is a histogram appropriate for representing the DCT coefficient distribution?
5. The DCT coefficients are normally classified into three sub-bands based on their frequencies, namely the low, middle and high frequency bands. What threshold values can we use to classify a DCT coefficient into the low, middle or high frequency band? In other words, how can we classify the DCT coefficient frequency bands radially? Below is an example of the radial classification of the DCT coefficient frequency bands.
The idea is based on the paper: Noise Characterization in Ancient Document Images Based on DCT Coefficient Distribution.
The plot example you shared looks, to me, like a kernel density plot. A density plot is "a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise." (See https://datavizcatalogue.com/methods/density_plot.html)
The seaborn library, which is built on top of matplotlib, has a kdeplot function, and it can handle two sets of data. Here's a toy example:
import numpy as np
from scipy.fftpack import dct
import seaborn
sample1 = dct(np.random.rand(100))
sample2 = dct(np.random.rand(30))
seaborn.kdeplot(sample1, color="r")
seaborn.kdeplot(sample2, color="b")
Note that rerunning this code will produce a slightly different image, as I'm using randomly generated data.
To answer your numbered questions directly:
1. What do the X- and Y-axes in the figure represent?
In a kdeplot, the X axis represents the coefficient values and the Y axis represents their estimated density. Unlike a histogram, it applies a smoothing method to try to estimate the "true" distribution of the data behind the noisy observations.
2. Are the values stored in noise_dct_data and ground_truth_dct_data the DCT coefficients?
Based on the way you've set up your code, yes, those variables store the results of the DCT transforms you apply.
3. Does the Y-axis represent the frequency of the corresponding DCT coefficients?
Yes, but with smoothing. Analogous to a histogram but not exactly the same.
4. Is the histogram appropriate to represent the DCT coefficient distribution?
It depends on the number of observations but, if you have enough data, a histogram should give you very similar results.
5. The DCT coefficients are normally classified into three sub-bands based on their frequencies, namely the low, middle and high frequency bands. What threshold values can we use to classify a DCT coefficient into the low, middle or high frequency band? In other words, how can we classify the DCT coefficient frequency bands radially?
I think this question is possibly too complicated to answer satisfactorily on Stack Overflow, but my advice is to try to figure out how the authors of the paper did this task. The cited article, "Blind Image Quality Assessment: A Natural Scene Statistics Approach in the DCT Domain", appears to be talking about a Radial Basis Function (RBF), but that looks like a way of training a supervised model on the frequency data to predict the overall quality of the scan.
Regarding data partitions, they state, "In order to capture directional information from the local image patches, the DCT block is partitioned directionally. ... The upper, middle, and lower partitions correspond to the low-frequency, mid-frequency, and high-frequency DCT subbands, respectively."
I take this to mean that, in at least one of their scenarios, the partitions are determined by a subband DCT (see https://ieeexplore.ieee.org/document/499836). There appears to be a great deal of literature on these types of approaches.
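If you just need a simple working rule for an 8x8 block, one common heuristic (not necessarily what either paper uses) is to partition the coefficients radially by the sum of their row and column indices. Here is a minimal sketch with arbitrary cutoffs:
import numpy as np

def partition_8x8_block(block_dct, low_cut=4, high_cut=9):
    """Split an 8x8 array of DCT coefficients into low/mid/high frequency groups
    based on the radial index i + j. The cutoffs are illustrative, not canonical."""
    low, mid, high = [], [], []
    for i in range(8):
        for j in range(8):
            if i + j < low_cut:
                low.append(block_dct[i, j])
            elif i + j < high_cut:
                mid.append(block_dct[i, j])
            else:
                high.append(block_dct[i, j])
    return np.array(low), np.array(mid), np.array(high)

# Example: partition one 8x8 block of coefficients
block = np.random.rand(8, 8)
low, mid, high = partition_8x8_block(block)
print(len(low), len(mid), len(high))   # 10 33 21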
This is the code I've found online:
import numpy as np
import pandas as pd

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)

from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)

# find the covariance matrix, which is: (A^T * A)/n
sample_data = standardized_data

# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
How does multiplying the data by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot-product is transposed, not the second. The E[...] means expected value, which is the mean for Gaussian-distributed data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
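To see the equivalence numerically, here is a minimal sketch on made-up data showing that the tutorial's matmul expression matches numpy's own covariance routine when the same 1/n normalisation is used:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 5)                          # 200 samples, 5 features (made-up data)
Xs = StandardScaler().fit_transform(X)              # zero mean, unit variance per feature

covar_matmul = np.matmul(Xs.T, Xs) / len(Xs)        # the tutorial's expression: (A^T * A)/n
covar_numpy = np.cov(Xs, rowvar=False, bias=True)   # same thing; bias=True divides by n

print(np.allclose(covar_matmul, covar_numpy))       # True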
Let's build towards the covariance matrix step by step. First, let's define variance.
The variance of a random variable X is a measure of how much the values in its distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables; it describes how the two variables change together.
Armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with every other feature. For mean-centered (standardized) data X with n samples, it can be calculated as cov(X) = (X^T * X) / n, which is exactly the expression you are confused about. (See the covariance matrix article on Wikipedia for the full definition.) If you have any further queries, comment below.
My dataset has 2000 attributes and 200 samples, and I need to reduce its dimensionality. I am trying to use the Fourier transform for dimensionality reduction. The Fourier transform returns the discrete Fourier transform when I feed it my data, but I do not know how to use it to reduce the number of dimensions.
from scipy.fftpack import fft
import pandas as pd

price = pd.read_csv(priceFile(), sep=",")
transformed = fft(price)
Can you please help me?
The Fourier transform is best suited if your samples are each a time series. If they are, you can extract frequency-domain features for each sample from transformed; there is a wide range of common time- and frequency-domain features you can consider (see the reference), and a small sketch of computing a few of them per sample is shown below.
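As an illustration only (the specific features and the 200x2000 shape are assumptions based on your description), this sketch treats each row as a time series and reduces it to a handful of frequency-domain summary statistics:
import numpy as np
import pandas as pd

# Made-up data frame: 200 samples, each a time series of 2000 points
df = pd.DataFrame(np.random.rand(200, 2000))

def frequency_features(row):
    spectrum = np.abs(np.fft.rfft(row))                      # magnitude spectrum of one sample
    freqs = np.fft.rfftfreq(len(row))
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)   # spectral centroid
    return pd.Series({
        "spectral_centroid": centroid,
        "spectral_energy": np.sum(spectrum ** 2),
        "dominant_freq": freqs[np.argmax(spectrum)],
    })

features = df.apply(frequency_features, axis=1)  # 200 samples x 3 features
print(features.shape)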
Let's say you have a Pandas data frame with 2000 attributes and 200 samples, as you mentioned:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(200, 2000)))
To reduce the dimensionality using scipy, you can generate a new array with the transformed values by first setting the number of dimensions (n_dimensions) that you want and then calling the scipy function fft.
First we import the function:
from scipy.fftpack import fft
Then we set the number of dimensions; in this case we will use 1 dimension:
n_dimensions = 1
Then we call the function, passing our data frame and the number of dimensions:
transformed_data = fft(df,n=n_dimensions)
Then, if we want to work with real numbers only, we can keep the real part of the transformed array:
transformed_data = transformed_data.real
I realize there are several articles that demonstrate how to fit a GMM to a 1D Gaussian with sklearn ([1] and [2], to name a few). However, in all of those cases, the data is present as individual points whose distribution is Gaussian. In my case, I essentially have a frequency table (I'm working with spectroscopic data), where the distribution is Gaussian but the individual points are unknown.
My distribution (i.e., the data I'm trying to fit) looks like this: 1D Gaussian Peak
I'd like to use GMM to deconvolve the 2 initial Gaussian distributions that make up this peak.
So far, I've tried the following (assume my data is a 200x2 array, with position in one column and AFU in the other):
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
def gengmm(nc=4, n_iter=2):
    g = mixture.GMM(n_components=nc)  # number of components
    g.init_params = ""                # no initialization
    g.n_iter = n_iter                 # number of EM iterations
    return g
I tried to see if I could fit this peak to just a single Gaussian:
g = gengmm(1, 100)
g.fit(data)
However, the mean and covariance I get don't define my data particularly well (notably, the mean for that Gaussian distribution is 127.5, which is not what is recovered with a 1 component GMM).
Is there an easier way to do this? (I realize I can just use a least-squares fit to recover the initial Gaussian, but again, I'm trying to ultimately use this to determine the two underlying Gaussians distributions that make up the final one.)
Thanks!
The data I have is stored in a 2D list where one column represents a frequency and the other column is its corresponding dB value. I would like to programmatically identify the frequencies of the 3 dB points on either end of the passband. I have a few ideas on how to do this, but they all have drawbacks.

Ideas:
1. Find the maximum point, then the average of the points in the passband, then find the points about 3 dB lower.
2. Use the sympy library to perform numerical differentiation and identify the critical points/inflection points.
3. Use a histogram/bin function to find the amplitude of the passband.

Drawbacks:
1. Sensitive to spikes; also not quite sure how to do this.
2. I don't understand the math involved, and the data is noisy, which could lead to a lot of false positives.
3. Correlating the amplitude values with list index values could be tricky.

Can you think of better ideas and/or ways to implement what I have described?
Assuming that you've loaded multiple readings of the PSD from the signal analyzer, try averaging them before attempting to find the band edges. If the signal isn't changing too dramatically, the averaging process should smooth away peaks, valleys and noise within the passband, making it easier to find the edges. This is what many spectrum analyzers do to produce a smoother PSD.
In case that wasn't clear: assume each reading gives you 128 tuples of frequency and power, and that you capture 100 of these buffers of data. Now average the 100 samples from bin 0, then the samples from bins 1, 2, ..., 127, and try to locate the passband on this averaged data. It should be easier than on any single buffer. Note that I used 100 as an example; if your data is very noisy it may require more, and if there isn't much noise, fewer.
Be careful when doing the averaging. Your data is in dB. To add the samples together in order to find an average, you must first convert the dB data back to linear power, do the additions, divide to find the average, and then convert the averaged power back into dB.
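A minimal sketch of that averaging step, assuming psd_db is a hypothetical 100x128 array of dB readings with one row per capture:
import numpy as np

psd_db = np.random.rand(100, 128) * -60.0   # hypothetical captures: 100 buffers x 128 bins, in dB

linear = 10.0 ** (psd_db / 10.0)            # dB -> linear power
avg_linear = linear.mean(axis=0)            # average each frequency bin across captures
avg_db = 10.0 * np.log10(avg_linear)        # back to dB

print(avg_db.shape)                         # (128,)
Then try to locate the passband on avg_db instead of on any single buffer.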
OK, it seems this has to be solved by data analysis. I would propose these steps:
Preprocess your data if you suspect it to be too noisy. I'd suggest either a moving-average filter (sp.convolve(data, sp.ones(n)/n, "same")) or, better, a Savitzky-Golay filter (sp.signal.savgol_filter(data, n, polyorder=3)), because you will be interested in the extrema of the data, which would be unnecessarily distorted by the moving-average filter. You might also want to get rid of artifacts like 60 Hz noise at this stage.
If the signal you are interested in lives in a narrow band, the spectrum will be a single pronounced peak. In that case you can just fit a curve to your data; a Gaussian would be appropriate:
import scipy as sp
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

freq, pow = read_in_your_data_here()
freq, pow = sp.asarray(freq), sp.asarray(pow)

def gauss(x, a, mu, sig):
    return a * sp.exp(-(x - mu)**2 / (2. * sig**2))

(a, mu, sig), _ = curve_fit(gauss, freq, pow)
fitted_curve = gauss(freq, a, mu, sig)

plt.plot(freq, pow)
plt.plot(freq, fitted_curve)
plt.vlines(mu, min(pow) - 2, max(pow) + 2)
plt.show()

center_idx = sp.absolute(freq - mu).argmin()
pow_center = pow[center_idx]
pow_3db = pow_center - 3.

def interv_from_binvec(data):
    # locate the rising and falling edges of the boolean "above -3 dB" mask
    indicator = sp.convolve(data, [-1, 1], "same")
    return indicator.argmin(), indicator.argmax()

passband_idx = interv_from_binvec(pow > pow_3db)
passband = freq[passband_idx[0]], freq[passband_idx[1]]
This is more of an example than a solution, and it relies heavily on the assumption that you are searching for a high-SNR signal with a narrow band. It could be extended to handle more than one signal by using a mixture model.
You can use scipy's UnivariateSpline and leastsq methods:
Create a spline of y-(np.max(y)-3)
Find the roots of it.
Calculate the difference between the two roots.
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import leastsq

x = df["Wavelength / nm"]
y = df["Power / dBm"]

# create a spline of the data shifted down so the 3 dB level sits at zero
spline = UnivariateSpline(x, y - (np.max(y) - 3), s=0)

# find the roots (the points where the shifted data crosses zero)
r1, r2 = spline.roots()

# calculate the difference between the two roots
threedB_bandwidth = abs(r2 - r1)