I am using tsfresh for extracting features from my data.
Initial data:
My initial data was time series data from a machine sensor.
I used the third column to add another column named hub, which represents the cycles of the machine. I also converted the timestamp into an integer time step ("step") for each cycle, resulting in this DataFrame:
When extracting the features, the algorithm returns 787 features for each of my data rows.
from tsfresh import extract_features
extracted_features = extract_features(df_sample, column_id="hub", column_sort="step")
features = extracted_features.columns.tolist()
But when I use the select_features method with the label vector y, it returns an empty DataFrame, and I don't understand why:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
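One way to see why every feature is rejected (a sketch, assuming y is a pandas Series whose index matches the hub ids of extracted_features) is to look at the relevance table that select_features builds internally, and to loosen the FDR level if the default filter turns out to be too strict:

from tsfresh.feature_selection.relevance import calculate_relevance_table

# p-value and relevant/irrelevant decision for every extracted feature;
# if every row has relevant == False, select_features returns an empty DataFrame
relevance_table = calculate_relevance_table(extracted_features, y)
print(relevance_table.sort_values("p_value").head(20))

# fdr_level defaults to 0.05; raising it shows whether the filter is simply too strict
features_relaxed = select_features(extracted_features, y, fdr_level=0.5)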
I am pretty new to feature extraction.
If anybody has any pointers as to how I could extract good features from cyclic time series data, I would be very thankful.
The plot of the sensor value over the crankshaft position is shown below.
I recreated the "simple example" from the documentation of the chemical_kinetics module, which is used to load, plot, and fit chemical kinetics data. I used the module to fit some of my own data successfully, but now I also want to get numerical values out of the data.Dataset that is created. It contains all the input data, but also the parameters and the output data used to plot the fitted line.
The following code is used to import the data:
from chemical_kinetics import data

ds = data.Dataset(
    files_c=["data/concentrations vs time.csv"],
    t_label="Time [a.u.]",
    c_label="Concentration [a.u.]"
)
And the following adds the fitting data:
from chemical_kinetics import fit

fit.fit_dataset(
    dataset=ds,
    derivatives=derivatives,
    parameters=parameters,
    c0=c0
)
The function
fit.print_result(ds)
only reports the values for each parameter and
plot.plot_c(ds)
plots all the data I want to extract; I want to get the very data that is plotted as numerical values.
I tried some of the usual methods for getting data out of a pandas DataFrame, since the documentation says that data.Dataset stores pandas DataFrames, but none of them worked.
The entire example is listed on the following link:
https://chemical-kinetics.readthedocs.io/en/latest/simple_example.html
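Since I do not know which attribute of ds actually holds the fitted curve, here is a minimal sketch for locating it by inspecting the object for DataFrame attributes and then exporting whichever one contains the fit (the df_c_fit name in the comment is only a guess, not taken from the docs):

import pandas as pd

# list every attribute of the Dataset object that is a pandas DataFrame
for name, value in vars(ds).items():
    if isinstance(value, pd.DataFrame):
        print(name, value.shape)

# once the right attribute is identified, it can be exported like any DataFrame, e.g.
# getattr(ds, "df_c_fit").to_csv("fitted_concentrations.csv", index=False)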
In my sentiment analysis on a dataset of 194k review texts with labels (classes 1-5), I am trying to reduce the features (words) based on a word-to-label correlation, so that a classifier can be trained on the reduced set.
Using sklearn.feature_extraction.text.CountVectorizer with its default parameters, I get 86.7k features. fit_transform returns a CSR sparse matrix, which I tried to put into a DataFrame using toarray().
Unfortunately, an array of size (194439, 86719) causes a MemoryError. I think I need the data in a DataFrame in order to calculate the correlations with df.corr(). Below is my code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = data['reviewText']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)            # CSR sparse matrix, shape (194439, 86719)
content = X.toarray()                           # here comes the MemoryError
vocab = vectorizer.get_feature_names()
df = pd.DataFrame(data=content, columns=vocab)
df['overall'] = data['overall']                 # the labels have to be in the frame for corrwith
corr = df.corrwith(df['overall']) > 0.6
new_vocab = corr[corr].index                    # should return the features we want to keep
Is there a way to filter by correlation without having to convert the sparse matrix into a DataFrame?
Most posts going in the same direction of using correlation on a DataFrame do not have to handle this amount of data.
I have since figured out that there are other ways to implement a correlation-based feature selection, for example SelectKBest with the scoring function f_regression.
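A sketch of that approach (reusing X, vocab, and the 'overall' label column from above; the value of k is an arbitrary choice here, and SelectKBest works directly on the sparse matrix, so no toarray() is needed):

from sklearn.feature_selection import SelectKBest, f_regression

y = data['overall']                        # 1-5 review scores
selector = SelectKBest(score_func=f_regression, k=1000)
X_reduced = selector.fit_transform(X, y)   # stays sparse, no memory blow-up
kept_words = [w for w, keep in zip(vocab, selector.get_support()) if keep]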
I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining the integrity of the data, as I plan to plot these sets and compare the curves of different sets to each other over time.
Example
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried scikit-learn's KNNImputer and IterativeImputer, but I'm either misunderstanding how these imputers are supposed to be used or they're not the right tool for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values   # 1D array of the view counts
da = da.reshape(-1, 1)                   # reshaped into a single 2D column
imputer = IterativeImputer(n_nearest_features=100, max_iter=10)
df_imputed = imputer.fit_transform(da)
df.Views = df_imputed.reshape(-1)        # flatten back to 1D and write into the frame
df
With what I have currently implemented, all of the NaN values are imputed to the exact same number.
Any help would be greatly appreciated.
The problem here was how I was reshaping the array. My data was just a 1D array of values, so reshaping it into a single 2D column gave the imputer nothing to work with, which is why all the NaN values were calculated as the same number. When I added an index column and included it as an input to the imputer, the values were calculated correctly. I also ended up using a KNNImputer from sklearn instead of the IterativeImputer in this instance.
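A minimal sketch of that fix (the choice of n_neighbors and the use of a plain integer index as the extra column are my own assumptions about the described solution):

import numpy as np
from sklearn.impute import KNNImputer

# two-column input: day index plus the values, so the imputer can use the
# neighbouring days instead of seeing every row as identical
X = np.column_stack([np.arange(len(df)), df.Views.values])
imputer = KNNImputer(n_neighbors=5)
df['Views'] = imputer.fit_transform(X)[:, 1]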
I am attempting to do some data analysis with the sklearn PCA package. The issue I'm currently running into is the way my code is analysing the data.
An example of some of the data is as follows
wavelength intensity
; [um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code written so far is as follows:
from astropy.io import ascii   # assuming the ascii reader used here is astropy's
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler(with_mean=True, with_std=True)  # scales the data
data_crescent = ascii.read('earth_crescent.dat', data_start=4958, data_end=13300, delimiter=' ')  # where the data is being read

# where each variable comes from in the dat file
y_intensity_crescent = data_crescent['col2'][:]
x_wave_crescent = data_crescent['col1'][:]
standard_y_crescent = StandardScaler().fit_transform(y_intensity_crescent)  # standardizing the intensity variable

# PCA run-through of the data
pca = PCA(n_components=2)
principalCrescentY = pca.fit_transform(standard_y_crescent)
principalDfcrescent = pd.DataFrame(data=principalCrescentY,
                                   columns=['principal component 1', 'principal component 2'])
finalDfcrescent = pd.concat([principalDfcrescent, pd.Series(y_intensity_crescent)], axis=1)
Once run, the code produces this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
In order to analyze the data via PCA, the data needs to be transformed into a 2D array to produce the expected results. Any workaround would be much appreciated!
The problem is that you are giving only one feature, y_intensity_crescent, to your PCA object when you do principalCrescentY = pca.fit_transform(standard_y_crescent). You are in fact giving only one dimension to your PCA algorithm. Roughly speaking, principal component analysis takes multiple feature time series and combines them into components, each of which is a combination of the features. If you want 2 components, you need more than 1 feature.
Here is an example of how to use it properly: PCA tutorial using sklearn
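A minimal sketch of feeding PCA two features here (assuming the intent is to decompose wavelength and intensity together; the variable names reuse those from the question):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# stack the two columns into an (n_samples, 2) feature matrix
X = np.column_stack([x_wave_crescent, y_intensity_crescent])
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
principal = pca.fit_transform(X_std)
principalDfcrescent = pd.DataFrame(principal,
                                   columns=['principal component 1', 'principal component 2'])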
I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best features for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append(row['boilerplate'][1:-1].lower())

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)

selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per word per document for the selected words; however, I have no idea which words were chosen, and methods like get_feature_names() are unavailable for the SelectPercentile class.
This is necessary because I need to add these features to a bunch of numeric features and only then do my training and predictions.
selector.get_support() gets you a boolean array of the columns that fall within the percentile range you specified.
train.columns.values gets you the complete list of column names for the original DataFrame.
Filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it's hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)              # features only, target column dropped
x_features = selection.fit_transform(train_minus_target, y_train)

columns = np.asarray(train_minus_target.columns.values)   # all original column names
support = np.asarray(selection.get_support())             # boolean mask of the kept columns
columns_with_support = columns[support]                   # names of the selected columns
Reference:
about get_support
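Applied back to the TF-IDF setup from the question, the same idea would look roughly like this (get_feature_names_out() is the newer spelling of get_feature_names(); use whichever your sklearn version provides):

import numpy as np

# vocabulary in column order, filtered by the percentile mask
all_words = np.asarray(vectorizer.get_feature_names())
chosen_words = all_words[selector.get_support()]
print(chosen_words)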