I am using tsfresh for extracting features from my data.
Initial data:
My initial data was time-series data from a machine sensor.
I used the third column to add another column named hub, which represents the cycles of the machine. I also converted the timestamp into an integer "step" counter for each cycle, resulting in this DataFrame:
When extracting the features, the algorithm returns 787 features for each of my data rows.
from tsfresh import extract_features
extracted_features = extract_features(df_sample, column_id="hub", column_sort="step")
features = extracted_features.columns.tolist()
But when I use the select_features method with the label vector y, it returns an empty DataFrame. I don't understand why.
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
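(As a note on what might be happening here: select_features runs a statistical test per feature and keeps only those relevant at an FDR level that defaults to 0.05. A minimal sketch of loosening that threshold, purely as an assumption about the cause, not a confirmed fix:)

# Assumption: with only a few cycles, the default FDR level (0.05) may reject
# every feature; a looser threshold keeps more candidates for inspection.
features_filtered = select_features(extracted_features, y, fdr_level=0.25)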
I am pretty new to feature extraction. If anybody has any pointers on how to extract good features from cyclic time-series data, I would be very thankful.
The plot of the sensor value over the crankshaft position is shown below.
I'm trying to implement k-means clustering in PySpark. I am using MNIST as my dataset, which has hundreds of columns with integer values.
After creating a DataFrame, when I try to create a features column to be used in the clustering, I don't know what to pass as the inputCols parameter to VectorAssembler. Below is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.csv('mnist_train.csv')
df.show()

# cast every column to float so it can go into a feature vector
df_feat = df.select(*(df[c].cast("float").alias(c) for c in df.columns))
df_feat.show()

vecAssembler = VectorAssembler(inputCols=???????, outputCol="features")
What should I put as parameter for inputCols for this large integer valued data that I am using?
The VectorAssembler needs a list of column names to create the feature vector. So for the MNIST dataset you can give it every column except the labels. For example:
# I assume that df_feat.columns[0] is the column which contains the labels
cols = df_feat.columns[1:]
vecAssembler = VectorAssembler(inputCols=cols, outputCol="features")
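From there, a minimal end-to-end sketch (k=10 for the ten digit classes is my assumption, not something from the question):

from pyspark.ml.clustering import KMeans

# assemble the feature vectors and cluster them; k=10 assumes one cluster per digit
df_vec = vecAssembler.transform(df_feat)
kmeans = KMeans(k=10, seed=1, featuresCol="features")
model = kmeans.fit(df_vec)
df_clustered = model.transform(df_vec)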
I am fitting a linear classifier on pretty wide and sparse data, using a number of categorical columns with hash buckets and crossed feature columns as feature columns.
Later I want to use the weights/coefficients of the model in a custom serving infrastructure. I know how to extract the weights from the model, but obviously, for the aforementioned columns, they correspond to already-hashed feature values.
I can reconstruct a hash table (value -> hashed value) for a simple categorical column using tf.string_to_hash_bucket_fast, but I am having trouble doing that for crossed feature columns.
For a pair of values from the two categorical columns making up a crossed column, how can I tell which bucket they will end up in?
After inspecting the source code, I found that the simplest way is to construct an input layer over input data consisting of all the distinct values (or combinations of values) in the column.
The result is a dense tensor of 0s and 1s: each row corresponds to a distinct value, and the 1 sits in the column matching the actual hash bucket number (I've verified this for categorical columns; it should be the same for crossed columns).
Here is the example code (for both Categorical Column and Crossed Column):
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc

actual_sex = {'sex': tf.Variable(['male', 'female', 'female', 'male'], dtype=tf.string)}
actual_nationality = {'nationality': tf.Variable(['belgian', 'french', 'belgian', 'belgian'], dtype=tf.string)}
actual_sex_nationality = dict(actual_sex, **actual_nationality)

# hashed column
sex_hashed_raw = fc.categorical_column_with_hash_bucket("sex", 10)
sex_hashed = fc.indicator_column(sex_hashed_raw)

# crossed column
crossed_sn_raw = fc.crossed_column(['sex', 'nationality'], hash_bucket_size=20)
crossed_sn = fc.indicator_column(crossed_sn_raw)

# each row of these layers is one-hot over the hash buckets
layer_s = tf.feature_column.input_layer(actual_sex_nationality, sex_hashed)
layer_sn = tf.feature_column.input_layer(actual_sex_nationality, crossed_sn)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(layer_s))
print(sess.run(layer_sn))
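To read the bucket number directly off each row, a small follow-up sketch of my own (using NumPy's argmax over the one-hot rows):

import numpy as np

# the column index of the 1 in each one-hot row is the hash bucket number
print(np.argmax(sess.run(layer_s), axis=1))   # bucket per 'sex' value
print(np.argmax(sess.run(layer_sn), axis=1))  # bucket per (sex, nationality) pair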
I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.
Say my column names are continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case) 8 different categories:
string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)
Everything works fine to this point, and I run the model:
glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients
Which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)
The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
Which gives me a small DataFrame relating the string names to the numerical indices, which I think I could then relate back to the keys in the sparse vector? But this is very clunky and slow when you consider the scale of the data.
Is there a better way to do this?
For PySpark, here is the solution to map feature index to feature name:
First, train your model:
pipeline = Pipeline().setStages([label_stringIdx, assembler, classifier])
model = pipeline.fit(x)
Transform your data:
df_output = model.transform(x)
Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.
numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')
merge_list = numeric_metadata + binary_metadata
OUTPUT:
[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
....
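From there, a hedged sketch of joining these entries to the fitted model's coefficients; it assumes the last pipeline stage is the classifier and that it exposes a coefficients vector aligned with these indices:

# assumption: model.stages[-1] is the fitted classifier with a `coefficients` attribute
coefficients = model.stages[-1].coefficients
name_to_coef = {attr['name']: coefficients[attr['idx']] for attr in merge_list}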
I also ran into this exact problem, and here is a solution :)
It is based on the Scala version here:
How to map variable names to features after pipeline
# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)
# extract features metadata
meta = [f.metadata
        for f in best_pred.schema.fields
        if f.name == 'features'][0]
# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
                    meta['ml_attr']['attrs']['binary']
print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
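From there, a name-to-index lookup is one comprehension away (a small usage sketch):

# build a {feature_name: vector_index} dict from the metadata entries
name_to_idx = {f['name']: f['idx'] for f in features_name_ind}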
I didn't investigate the previous versions, but in Spark 2.4.3 it is possible to retrieve a lot of information about the features just by using the summary attribute of a GeneralizedLinearRegressionModel.
Printing the summary results in something like this:
Coefficients:
Feature Estimate Std Error T Value P Value
(Intercept) -0.1742 0.4298 -0.4053 0.6853
x1_enc_(-inf,5.5] -0.7781 0.3661 -2.1256 0.0335
x1_enc_(5.5,8.5] 0.1850 0.3736 0.4953 0.6204
x1_enc_(8.5,9.5] -0.3937 0.4324 -0.9106 0.3625
x45_enc_1-10-7-8-9 -0.5382 0.2718 -1.9801 0.0477
x45_enc_2-3-4-ND 0.5187 0.2811 1.8454 0.0650
x45_enc_5 -0.0456 0.3353 -0.1361 0.8917
x33_enc_1 0.6361 0.4043 1.5731 0.1157
x33_enc_10 0.0059 0.4083 0.0145 0.9884
x33_enc_2-3-4-8-ND 0.6121 0.1741 3.5152 0.0004
x102_enc_(-inf,4.5] 0.5315 0.1695 3.1354 0.0017
(Dispersion parameter for binomial family taken to be 1.0000)
Null deviance: 937.7397 on 666 degrees of freedom
Residual deviance: 858.8846 on 666 degrees of freedom
AIC: 880.8846
The Feature column can be constructed by accessing an internal Java object:
In [131]: glm.summary._call_java('featureNames')
Out[131]:
['x1_enc_(-inf,5.5]',
'x1_enc_(5.5,8.5]',
'x1_enc_(8.5,9.5]',
'x45_enc_1-10-7-8-9',
'x45_enc_2-3-4-ND',
'x45_enc_5',
'x33_enc_1',
'x33_enc_10',
'x33_enc_2-3-4-8-ND',
'x102_enc_(-inf,4.5]']
The Estimate column can be constructed by the following concatenation:
In [134]: [glm.intercept] + list(glm.coefficients)
Out[134]:
[-0.17419580191414719,
-0.7781490190325139,
0.1850214800764976,
-0.3936963366945294,
-0.5382255101657534,
0.5187453074755956,
-0.045649677050663987,
0.6360647167539958,
0.00593020879299306,
0.6121475986933201,
0.531510974697773]
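Putting the two together, a small sketch pairing each feature name with its estimate:

# pair the intercept and coefficients with their feature names
names = ['(Intercept)'] + glm.summary._call_java('featureNames')
estimates = [glm.intercept] + list(glm.coefficients)
for name, estimate in zip(names, estimates):
    print(name, estimate)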
P.S.: This line shows why the Feature column can be retrieved by using an internal Java object.
Sorry, this is a very late answer and you may already have figured it out, but anyway. I recently did the same implementation of StringIndexer, OneHotEncoder and VectorAssembler, and as far as I understand, the following code will give you what you are looking for.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")
    # Add the stages so that they will all be run at once later
    stages += [stringIndexer, encoder]

# Convert the label into label indices using StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a pipeline for training
pipeline = Pipeline(stages=stages)

# Run the feature transformations
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
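If you then need the index-to-name mapping for the assembled features column, the metadata trick from the answers above applies here unchanged (a small usage sketch):

# read the per-slot feature names out of the features column metadata
meta = [f.metadata for f in df.schema.fields if f.name == 'features'][0]
features_name_ind = meta['ml_attr']['attrs']['numeric'] + meta['ml_attr']['attrs']['binary']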
To use "MLUtils.saveAsLibSVMFile", the data parameter should be an RDD of labeled points. Suppose the categorical variable (which is the response variable) has 3 categories. How can we create an RDD of labeled points with those 3 categories as labels? I have already converted my categorical variable to binary features.
I am trying this code:
from pyspark.mllib.regression import LabeledPoint
from pyspark import SparkContext

sc = SparkContext()
laxman = sc.textFile(r"G:\converted data.txt")  # raw string so the backslash survives
labeled_point_rdd = laxman.map(lambda row: row.split(',')) \
                          .map(lambda seq: LabeledPoint(seq[-3], seq[:-4]))
print(labeled_point_rdd.take(2))
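For reference, a minimal sketch of what a three-class LabeledPoint RDD and the LibSVM dump could look like; the column layout here is my assumption (label in the last field, encoded as 0, 1 or 2), not something taken from the question:

from pyspark.mllib.util import MLUtils

# assumption: the last field of each CSV row is the class label (0, 1 or 2);
# LabeledPoint accepts the multiclass label directly as a float
def parse(row):
    parts = row.split(',')
    return LabeledPoint(float(parts[-1]), [float(x) for x in parts[:-1]])

labeled = laxman.map(parse)
MLUtils.saveAsLibSVMFile(labeled, "G:/libsvm_output")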