Python: PCA issue with data analysis

I am attempting to do some data analysis with the PCA class from the sklearn package. The issue I'm currently running into is the way my code passes the data to PCA.
An example of some of the data is as follows:
wavelength intensity
; [um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code written so far is as follows:
import pandas as pd
from astropy.io import ascii  # assumption: ascii.read here is astropy's reader
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True, with_std=True)  # scales the data
# where the data is being read
data_crescent = ascii.read('earth_crescent.dat', data_start=4958, data_end=13300, delimiter=' ')
# where each variable comes from in the data
y_intensity_crescent = data_crescent['col2'][:]
x_wave_crescent = data_crescent['col1'][:]
# standardizing the intensity variable (a 1D column is passed here)
standard_y_crescent = StandardScaler().fit_transform(y_intensity_crescent)
# PCA run-through of the data
pca = PCA(n_components=2)
principalCrescentY = pca.fit_transform(standard_y_crescent)
principalDfcrescent = pd.DataFrame(data=principalCrescentY,
                                   columns=['principal component 1', 'principal component 2'])
finalDfcrescent = pd.concat([principalDfcrescent, [y_intensity_crescent]], axis=1)
When run, the code produces this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
To analyze the data with PCA, it apparently needs to be passed as a 2D array to produce the expected results. Any workaround would be much appreciated!

The problem is that you are giving only one feature, y_intensity_crescent, to your pca object when you call principalCrescentY=pca.fit_transform(standard_y_crescent). In other words, you are passing a single dimension to the PCA algorithm. Roughly: principal component analysis takes several features (here, several series) and combines them into components, each of which is a combination of the original features. If you want 2 components, you need at least 2 features.
Here is an example of how to use it properly: PCA tutorial using sklearn
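As a rough illustration of the point above, here is a minimal sketch that feeds both columns (wavelength and intensity) to PCA as a 2D array of shape (n_samples, n_features). The numbers are synthetic stand-ins, not the contents of earth_crescent.dat, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two columns of the .dat file (assumed values).
rng = np.random.default_rng(0)
wavelength = np.linspace(100.0, 200.0, 50)
intensity = 1e-3 + 1e-5 * wavelength + rng.normal(scale=1e-5, size=50)

# Stack the two 1D columns into one 2D array: 50 samples, 2 features.
X = np.column_stack([wavelength, intensity])

# Standardize each feature, then project onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

With only the intensity column, reshape(-1, 1) would silence the "Expected 2D array" error, but n_components=2 would still be rejected because there is just one feature.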

Related

Extracting data from the chemical_kinetics data.dataset

I recreated the "simple example" from the documentation of the chemical_kinetics module, which is used to load, plot and fit chemical kinetics data. I used the module to fit some of my own data successfully, but now I also want to get numerical values out of the data.Dataset that was created. It contains all the input data, but also the parameters and the output data used to plot the fitted line.
The following code is used to import the data:
ds = data.Dataset(
files_c = ["data/concentrations vs time.csv"],
t_label = "Time [a.u.]",
c_label = "Concentration [a.u.]"
)
And the following adds the fitting data:
from chemical_kinetics import fit
fit.fit_dataset(
dataset = ds,
derivatives = derivatives,
parameters = parameters,
c0 = c0
)
The function
fit.print_result(ds)
only reports the values for each parameter and
plot.plot_c(ds)
plots all the data I want to extract. I want to numerically extract the very data that is plotted.
I tried several of the usual methods for getting data out of a pandas DataFrame, since the documentation says that data.Dataset creates pandas DataFrames, but none of them worked.
The entire example is available at the following link:
https://chemical-kinetics.readthedocs.io/en/latest/simple_example.html

tsfresh time series feature extraction

I am using tsfresh for extracting features from my data.
Initial data:
My initial data was time series data from a machine sensor.
I used the third column to add another column named hub, which represents the cycles of the machine. I also converted the timestamp to integer "timesteps" for each cycle, which gave me the DataFrame I pass to tsfresh.
When extracting the features, the algorithm returns 787 features for each of my data rows.
from tsfresh import extract_features
extracted_features = extract_features(df_sample, column_id="hub", column_sort="step")
features = extracted_features.columns.tolist()
But when I use the select_features method with the label vector y, it gives back an empty DataFrame.
I don't understand why.
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
I am pretty new to feature extraction.
If anybody has any pointers, as to how I could extract good features from the cyclic timeseries data, I would be very thankful.
The plot of the sensor value over the crankshaft position is shown below.

Sentiment Analysis Feature Selection based on word to label correlation

In my sentiment analysis on a dataset of 194k review texts with labels (class 1-5), I am trying to reduce the features (words) based on a word to label correlation by which a classifier can be trained.
Using sklearn.feature_extraction.text.CountVectorizer with default parameters, I get 86.7k features. fit_transform returns a sparse CSR matrix, which I tried to put into a data frame using toarray().
Unfortunately, an array of size (194439, 86719) causes a MemoryError. I think I need the data in a data frame in order to calculate the correlations with df.corr(). My code is below:
corpus = data['reviewText']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)
content = X.toarray() # here comes the Memory Error
vocab = vectorizer.get_feature_names_out() # get_feature_names() in older sklearn versions
df = pd.DataFrame(data=content, columns=vocab)
# assumption: the class labels live in data['overall'], not in the word-count frame
corr = pd.Series(df.corrwith(data['overall']) > 0.6)
new_vocab = df[corr[corr == True].index] # should return features that we want to use
Is there a way to filter by correlation without having to change the format into a data frame?
Most posts that go in the same direction and use correlation on a data frame do not have to handle this amount of data.
I figured that there are other ways to implement feature selection based on correlation, for example SelectKBest with the scoring function f_regression.

plotting k-modes cluster in python

I've got 10 clusters from k-modes.
Data: categorical (I converted it to binary and then ran the model).
Technology used: Jupyter, Python.
Questions:
1. How do I find the accuracy?
2. How do I plot/visualise the clusters in 2D and 3D?
Something like this should be a good start. Note that it uses k-means on two numeric columns rather than k-modes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

# recreate data to feed into the algorithm (two columns, cast to float)
data = np.column_stack([df['field1'], df['field2']]).astype(float)
So now running the following piece of code:
# computing K-Means with K = 5 (5 clusters)
centroids, _ = kmeans(data, 5)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# some plotting using numpy's logical indexing
plt.plot(data[idx==0,0], data[idx==0,1], 'ob',
         data[idx==1,0], data[idx==1,1], 'oy',
         data[idx==2,0], data[idx==2,1], 'or',
         data[idx==3,0], data[idx==3,1], 'og',
         data[idx==4,0], data[idx==4,1], 'om')
# mark the cluster centres with green squares
plt.plot(centroids[:,0], centroids[:,1], 'sg', markersize=8)
plt.show()
This is a great resource.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/

Issue with Scikit-learn data analysis

I am attempting to take a .dat file of about 90,000 data lines of two variables (wavelength and intensity) and apply sklearn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using to analyze the data is below:
pca= PCA(n_components=2)
pca.fit(data)
print(pca.components_)
This is the error I get when I try to apply 2 PCA components to one of the data sets:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
I think in your case the problem is the column names, especially [W/m**2/um/sr].
Also, when using PCA, do not forget to rescale the input variables into "comparable" units using StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# sample rows from the question, one list per column
data = pd.DataFrame({
    'wavelength [um]': [196.078431372549, 192.307692307692, 188.679245283019],
    'intensity [W/m**2/um/sr]': [1.108370393265022E-003, 1.163428008597600E-003, 1.223639983609668E-003],
})
scaler = StandardScaler(with_mean=True, with_std=True)
pca = PCA(n_components=2)
pca.fit(scaler.fit_transform(data))
print(pca.components_)
Worked well for me. Maybe you just need to specify:
data.columns = data.columns.astype(str)
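If the header and units rows are what gets in the way, another option is to skip them and load only the numeric rows into a plain float array before scaling. The sketch below is an assumption about the file layout based on the three sample rows in the question (two header lines, whitespace-separated values); the in-memory string stands in for the real .dat file.

```python
import io

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the .dat file, using the sample rows from the question.
sample = """wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
"""

# Skip the two header lines so only numeric rows are parsed.
data = np.loadtxt(io.StringIO(sample), skiprows=2)

# data is now a float64 array of shape (n_samples, 2), ready for PCA.
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(data))
print(pca.components_)
```

For the real file, the StringIO object would be replaced by the file path, and skiprows adjusted to however many non-numeric lines precede the data.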
