My dataset has 2000 attributes and 200 samples, and I need to reduce its dimensionality. I am trying to use the Fourier transform for this. The fft function returns the discrete Fourier transform of the data I feed in, but I do not know how to use the result for dimensionality reduction.
from scipy.fftpack import fft
import pandas as pd
price = pd.read_csv(priceFile(), sep=",")
transformed = fft(price)
Can you please help me?
The Fourier transform is best suited to the case where each of your samples is a time series. If so, you can extract frequency-domain features for each sample from transformed. Here is a listing of common time- and frequency-domain features that you can consider (reference):
Let's say you have a pandas DataFrame with 2000 attributes and 200 samples, as you mentioned:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(200, 2000)))
To reduce the dimensionality using scipy, you can generate a new array with the transformed values by first setting the number of dimensions (n_dimensions) that you want and then calling the scipy function (fft).
First we import the function fft:
from scipy.fftpack import fft
Then we set the number of dimensions; in this case we will use 1 dimension:
n_dimensions = 1
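Then we call the function, passing our data frame first and the number of dimensions second. Note that n sets the length of the transform along the last axis: if n is smaller than the number of columns, the input is truncated to its first n columns before it is transformed.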
transformed_data = fft(df,n=n_dimensions)
Then, if you want to work with real numbers, you can take the real part of the transformed array:
transformed_data = transformed_data.real
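As a hedged alternative to truncating the input, the idea of extracting frequency-domain features per sample can be sketched as follows. This assumes each row really is a time series; the random data, the variable names and the choice of 20 features are illustrative assumptions, not part of the original answers.

```python
import numpy as np

# Stand-in data: 200 samples (rows), each treated as a time series of length 2000.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 2000)).astype(float)

n_features = 20  # hypothetical number of low-frequency features to keep

# Remove each row's mean so the zero-frequency term does not dominate,
# then take the FFT of every row (real input, so rfft is sufficient).
spectrum = np.fft.rfft(X - X.mean(axis=1, keepdims=True), axis=1)

# Keep the magnitudes of the first n_features non-zero frequencies.
features = np.abs(spectrum[:, 1:n_features + 1])

print(features.shape)  # (200, 20)
```

Whether the low frequencies carry the useful information depends on the data; for attributes that are not ordered in time, this kind of reduction is unlikely to be meaningful.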
I have an input dataset that has 4 time series with 288 values for 80 days. So the actual shape is (80,4,288). I would like to cluster the different days. I have 80 days and all of them have 4 time series: outside temperature, solar radiation, electrical demand, electricity prices. What I want is to group similar days with regard to these 4 time series combined into clusters. Days belonging to the same cluster should have similar time series.
Before clustering the days using k-means or Ward's method, I would like to scale them using scikit learn. For this I have to transform the data into a 2 dimensional shape array with the shape (80, 4*288) = (80, 1152), as the Standard Scaler of scikit learn does not accept 3-dimensional input. The Standard Scaler just standardizes features by removing the mean and scaling to unit variance.
Now I scale this data using scikit-learn's StandardScaler:
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";")
scaler = StandardScaler()
data_Scaled = scaler.fit_transform(data_Unscaled)
np.savetxt("C:/Users/User1/Desktop/data_Scaled.csv", data_Scaled, delimiter=";")
When I now compare the unscaled and scaled data e.g. for the first day (1 row) and the 4th time series (columns 864 - 1152 in the csv file), the results look quite strange as you can see in the following figure:
As far as I see it, they are not in line with each other. For example in the timeslots between 111 and 201 the unscaled data does not change at all whereas the scaled data fluctuates. I can't explain that. Do you have any idea why this is happening and why they don't seem to be in line?
Here is the unscaled input data with shape (80,1152): https://filetransfer.io/data-package/CfbGV9Uk#link
and here the scaled output of the scaling with shape (80,1152): https://filetransfer.io/data-package/23dmFFCb#link
You have two issues here: scaling and clustering. As the question title refers to scaling, I'll handle that one in detail. The clustering issue is probably better suited for CrossValidated.
You don't say it, but it seems natural that all temperatures, be it on day 1 or day 80, are measured on the same scale. The same holds for the other three variables. So, for the purpose of scaling you essentially have four time series.
StandardScaler, like basically everything in sklearn, expects your observations to be organised in rows and variables in columns. It treats each column separately, subtracting its mean from all the values in the column and dividing the resulting values by their standard deviation.
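The column-wise behaviour described above can be checked with a small sketch. The demo array and its three columns are made up for illustration; the comparison relies on StandardScaler using the population standard deviation, which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 50 observations (rows) of 3 variables (columns) on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, -5.0, 0.0], scale=[1.0, 20.0, 0.1], size=(50, 3))

# What StandardScaler does ...
scaled = StandardScaler().fit_transform(X_demo)

# ... is the same as standardising every column on its own.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
```

This is why feeding it the (80, 1152) array standardises each of the 1152 time slots across the 80 days, rather than each of the four variables.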
I reckon from your data that the first 288 entries in each row correspond to one variable, the next 288 to the second one etc. You need to reshape these data to form 288*80=23040 rows and 4 columns, one for each variable.
You apply StandardScaler on that array and reformat the data into the original shape, with 80 rows and 4*288=1152 columns. The code below should do the trick:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";", header=None)
X = data_Unscaled.to_numpy()
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
scaler = StandardScaler()
X_narrow_scaled = scaler.fit_transform(X_narrow)
X_scaled = np.array([X_narrow_scaled[i*288:(i+1)*288, :].T.ravel() for i in range(80)])
# Plot the original data:
i=3
j=0
plt.plot(X[j, i*288:(i+1)*288])
plt.title('TimeSeries_Unscaled')
plt.show()
# plot the scaled data:
plt.plot(X_scaled[j, i*288:(i+1)*288])
plt.title('TimeSeries_Scaled')
plt.show()
resulting in the following graphs:
The line
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
uses a list comprehension to generate the four columns of the long, narrow array X_narrow. Basically, it is just a shorthand for a for-loop over your four variables. It takes the first 288 columns of X and flattens them into a vector, which it then puts into the first column of X_narrow. Then it does the same for the next 288 columns, X[:, 288:576], and then for the third and the fourth block of the 288 observed values per day. This way, each column in X_narrow contains a long time series, spanning 80 days (and 288 observations per day), of exactly one of your variables (outside temperature, solar radiation, electrical demand, electricity prices).
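For comparison, the same rearrangement can probably be written with reshape and transpose instead of list comprehensions. This is only a sketch on random stand-in data with the shapes from the question; the names n_days, n_vars, n_steps and X_back are introduced here for illustration.

```python
import numpy as np

n_days, n_vars, n_steps = 80, 4, 288

# Stand-in for the real data: one row per day, the 4 variables stored block by block.
rng = np.random.default_rng(0)
X = rng.normal(size=(n_days, n_vars * n_steps))

# List-comprehension version, as in the answer.
X_narrow_loop = np.array(
    [X[:, i * n_steps:(i + 1) * n_steps].ravel() for i in range(n_vars)]
).T

# Equivalent version: view as (day, variable, step), swap the last two axes,
# then stack all (day, step) pairs as rows with one column per variable.
X_narrow = X.reshape(n_days, n_vars, n_steps).transpose(0, 2, 1).reshape(-1, n_vars)

# Inverse rearrangement, back to one row per day.
X_back = X_narrow.reshape(n_days, n_steps, n_vars).transpose(0, 2, 1).reshape(n_days, -1)

print(X_narrow.shape, np.array_equal(X_narrow, X_narrow_loop), np.array_equal(X_back, X))
```

The inverse step is the one to apply to the scaled array to get back to the (80, 1152) layout.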
Now, you might try to cluster X_scaled using K-means, but I doubt it will work. You have just 80 points in a 1152-dimensional space, so the curse of dimensionality will almost certainly kick in. You'll most probably need to perform some kind of dimensionality reduction, but, as I noted above, that's a different question.
I realize there are several articles that demonstrate how to fit a GMM to a 1D Gaussian with sklearn ([1] and [2], to name a few). However, in all of those cases, the data is present as single points where the distribution is Gaussian. In my case, I essentially have a frequency table (I'm working with spectroscopic data), where the distribution is Gaussian, but the individual points are unknown.
My distribution (i.e., the data I'm trying to fit) looks like this: 1D Gaussian Peak
I'd like to use GMM to deconvolve the 2 initial Gaussian distributions that make up this peak.
So far, I've tried the following (assume my data is a 200x2 array, with position in one column and AFU on the second) :
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
def gengmm(nc=4, n_iter=2):
    g = mixture.GMM(n_components=nc)  # number of components
    g.init_params = ""  # No initialization
    g.n_iter = n_iter  # iteration of EM method
    return g
I tried to see if I could fit this peak to just a single Gaussian:
g = gengmm(1, 100)
g.fit(data)
However, the mean and covariance I get don't define my data particularly well (notably, the mean for that Gaussian distribution is 127.5, which is not what is recovered with a 1 component GMM).
Is there an easier way to do this? (I realize I can just use a least-squares fit to recover the initial Gaussian, but again, I'm trying to ultimately use this to determine the two underlying Gaussian distributions that make up the final one.)
Thanks!
I'm clustering data with DBSCAN in order to remove outliers. The computation is very memory consuming because the implementation of DBSCAN in scikit-learn can't handle almost 1 GB of data. The problem was already mentioned here
The bottleneck of the following code appears to be the matrix calculation, which is very memory consuming (size of matrix: 10mln x 10mln). Is there a way to optimize the computation of DBSCAN?
My brief research shows that the matrix should be reduced to a sparse matrix in some way to make it feasible to compute.
My ideas for solving this problem:
1. Create and calculate a sparse matrix.
2. Calculate parts of the matrix, save them to files and merge them later.
3. Perform DBSCAN on small subsets of the data and merge the results.
4. Switch to Java and use the ELKI tool.
Code:
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
# sample data
speed = np.random.uniform(0,25,1000000)
power = np.random.uniform(0,3000,1000000)
# create a dataframe
data_dict = {'speed': speed,
'power': power}
df = pd.DataFrame(data_dict)
# convert to matrix
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
data = df.to_numpy().astype("float64", copy=False)
X = data
# normalize data
X = StandardScaler().fit_transform(X)
# precompute matrix of distances
dist_matrix = sklearn.metrics.pairwise.euclidean_distances(X, X)
# perform DBSCAN clustering
db = DBSCAN(eps=0.1, min_samples=60, metric="precomputed", n_jobs=-1).fit(dist_matrix)
1 to 3 will not work.
Your data is dense. There aren't "mostly 0s", so sparse formats will actually need much more memory. The exact thresholds vary, but as a rule of thumb, you'll need at least 90% of 0s for sparse formats to become effective.
DBSCAN does not use a distance matrix.
Working on parts and then merging isn't that easy (there is GriDBSCAN, which does this for Euclidean distance). You cannot just take random partitions and merge them later.
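Since DBSCAN does not need a distance matrix, one thing to try is to skip metric="precomputed" and hand the feature matrix to scikit-learn directly, so that neighbours are found through its nearest-neighbour search instead. The sketch below uses a much smaller sample than the question, and the eps and min_samples values are placeholders chosen for this toy data, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy data in the spirit of the question, but far smaller (assumed size).
rng = np.random.default_rng(0)
n = 5000
speed = rng.uniform(0, 25, n)
power = rng.uniform(0, 3000, n)

# Normalise the two features.
X = StandardScaler().fit_transform(np.column_stack([speed, power]))

# No n x n distance matrix is built here: the data itself is passed in.
db = DBSCAN(eps=0.2, min_samples=10).fit(X)

labels = db.labels_  # -1 marks points treated as noise / outliers
n_noise = int(np.sum(labels == -1))
print(labels.shape, n_noise)
```

Memory use still grows with the number of neighbours that fall within eps of each point, so at 10 million points this may not be enough on its own, which is where a tool such as ELKI comes in.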
I have an input dataset (DataFrame / numpy matrix) that has a skewed normal distribution. I am trying to find the python transformation function (or numpy matrix) which will transform the input dataset to a normal distribution with no skew.
I have looked at curve_fit (in scipy.optimize) and am not sure how I would go about applying it.
Is there a simple method of doing this?
I've done one of 2 things:
Use Box-Cox transformations. This requires you to find the appropriate power (lambda) that transforms your data to have zero skew.
Force a normal distribution.
Example
import numpy as np
import pandas as pd
from scipy.stats import norm
df = pd.DataFrame(np.random.rand(1000), columns=['Uniform'])
df['Normal'] = norm.ppf((df.Uniform.rank() - .5) / len(df))
df.plot(kind='kde')
df.skew()
Uniform 2.392991e-02
Normal 2.114051e-15
dtype: float64
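The Box-Cox option mentioned above could look roughly like this with scipy.stats.boxcox, which picks lambda by maximum likelihood when none is given. That choice of lambda usually reduces the skew a lot but is not guaranteed to make it exactly zero. The lognormal sample is made up for illustration, and Box-Cox requires strictly positive data.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Made-up right-skewed, strictly positive sample.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# With no lambda supplied, boxcox returns the transformed data and the fitted lambda.
x_bc, lam = boxcox(x)

skew_before = skew(x)
skew_after = skew(x_bc)
print(lam, skew_before, skew_after)
```

If the data contains zeros or negative values, it would need to be shifted first, or a different transformation used.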
I did a PCA in Python on audio spectrograms and face the following problem: I have a matrix where each row consists of flattened song features. After applying PCA, it's clear to me that the dimensions are reduced, but I can't find those reduced dimensions in the original dataset.
import sys
import glob
from scipy.io.wavfile import read
from scipy import signal
from scipy.fftpack import fft
import numpy as np
import matplotlib.pyplot as plt
import pylab
# Read file to get samplerate and numpy array containing the signal
files = glob.glob('../some/*.wav')
song_list = []
for wav in files:
    (fs, x) = read(wav)
    channels = [
        np.array(x[:, 0]),
        np.array(x[:, 1])
    ]
    # Combine channels to make a mono signal out of stereo
    channel = np.mean(channels, axis=0)
    channel = channel[0:1024,]
    # Generate spectrogram
    ## Freqs is the same with different songs, t differs slightly
    Pxx, freqs, t, plot = pylab.specgram(
        channel,
        NFFT=128,
        Fs=44100,
        detrend=pylab.detrend_none,
        window=pylab.window_hanning,
        noverlap=int(128 * 0.5))
    # Magnitude Spectrum to use
    Pxx = Pxx[0:2]
    X_flat = Pxx.flatten()
    song_list.append(X_flat)
song_matrix = np.vstack(song_list)
If I now apply PCA to the song_matrix...
import matplotlib
from matplotlib.mlab import PCA
from sklearn import decomposition
#test = matplotlib.mlab.PCA(song_matrix.T)
pca = decomposition.PCA(n_components=2)
song_matrix_pca = pca.fit_transform(song_matrix.T)
# These components should be most helpful to discriminate between the songs due to their high variance
pca.components_
...the final 2 components are the following:
Final components - two dimensions from 15 wav-files
The problem is that I can't find those two vectors in the original dataset with all dimensions. What am I doing wrong, or am I misinterpreting the whole thing?
PCA doesn't give you the vectors in your dataset.
From Wikipedia :
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Say you have a column vector V containing ONE flattened spectrogram. PCA will find a matrix M whose columns are orthogonal vectors (think of them as being at right angles to every other column in M).
Multiplying M and V will give you a vector T of "scores", which can be used to determine how much variance each column of M captures from the original data; each successive column of M captures progressively less variance in the data.
Multiplying the transpose of matrix M' (the first 2 columns of M) by V will produce a 2x1 vector T' representing the "dimension-reduced spectrogram". You could reconstruct an approximation of V by multiplying M' by T'. (M' is not square, so it has no true inverse; when its columns are orthonormal, its transpose plays that role.) This would work if you had a matrix of spectrograms, too. Keeping only two principal components would produce an extremely lossy compression of your data.
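The projection and reconstruction described above can be sketched with scikit-learn. The random matrix is a stand-in with one song per row (the question's code passes the transposed matrix), and the sizes are assumptions. Note that scikit-learn stores the principal directions as rows of components_ and subtracts the feature means before projecting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 15 songs (rows), 30 flattened spectrogram values each.
rng = np.random.default_rng(0)
song_matrix = rng.normal(size=(15, 30))

pca = PCA(n_components=2)
scores = pca.fit_transform(song_matrix)  # one 2-D score vector per song

# The scores are the centred data projected onto the two principal directions.
manual_scores = (song_matrix - pca.mean_) @ pca.components_.T

# Lossy reconstruction of the original rows from only two scores each.
approx = pca.inverse_transform(scores)

print(pca.components_.shape, scores.shape, approx.shape)
print(np.allclose(scores, manual_scores))
```

The rows of components_ are directions in feature space (weighted combinations of the original columns), which is why they do not appear anywhere in the original dataset.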
But what if you want to add a new song to your dataset? Unless it is very much like the original song (meaning it introduces little variance to the original data set), there's no reason to think that the vectors of M will describe the new song well. For that matter, even multiplying all the elements of V by a constant would render M useless. PCA is quite data specific, which is why it's not used in image/audio compression.
The good news? You can use a discrete cosine transform (DCT) to compress your training data. Instead of lines, it finds cosines that form a descriptive basis, and doesn't suffer from the data-specific limitation. DCT is used in JPEG, MP3 and other compression schemes.
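As a rough illustration of DCT-based compression, the sketch below transforms a synthetic signal, keeps only the first few coefficients and inverts the transform. The test signal and the number of retained coefficients (k = 32) are arbitrary assumptions; real codecs add quantisation and many other steps on top of this.

```python
import numpy as np
from scipy.fft import dct, idct

# Synthetic 1-D signal standing in for one row of audio features.
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)

# Orthonormal DCT-II; the basis is fixed and does not depend on the data.
coeffs = dct(signal, norm="ortho")

# "Compress" by keeping only the first k coefficients.
k = 32
truncated = np.zeros_like(coeffs)
truncated[:k] = coeffs[:k]

# Reconstruct from the truncated coefficients and measure the relative error.
approx = idct(truncated, norm="ortho")
rel_error = np.linalg.norm(signal - approx) / np.linalg.norm(signal)
print(rel_error)
```

Because the cosine basis is the same for every input, a new song can be transformed without refitting anything, unlike the PCA case.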