I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have used OneHotEncoder to encode all the categorical variables in my dataset:
# Encode categorical data with OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
Z = ohe.fit_transform(Z)
I now want to create a dataframe with the results from the OneHotEncoder. I want the dataframe columns to be the new categories that resulted from the encoding, which is why I am using the categories_ attribute. When running the following line of code:
ohe_df = pd.DataFrame(Z, columns=ohe.categories_)
I get the error: ValueError: all arrays must be same length
I understand that the arrays referred to in the error message are the arrays of categories, each of which has a different length depending on the number of categories it contains, but I am not sure of the correct way to create a dataframe with the new categories as columns when there are multiple features.
I tried to do this with a small dataset that contained one feature only and it worked:
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
df = pd.DataFrame(['Male', 'Female', 'Female'])
results = ohe.fit_transform(df)
ohe_df = pd.DataFrame(results, columns=ohe.categories_)
ohe_df.head()
Female Male
0 0.0 1.0
1 1.0 0.0
2 1.0 0.0
So how do I do the same for my large dataset with numerous features?
Thank you in advance.
EDIT:
As requested, I have come up with an MWE to demonstrate how it is not working:
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
['Female','missing','None'], ['Male','Yes','Ventouse']]),
columns=['gender', 'diabetes', 'assistance'])
df.head()
# encode categorical data
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)
print(results)
By this step, I have created a dataframe of categorical data and encoded it. I now want to create another dataframe such that the columns of the new dataframe are the categories created by the OneHotEncoder and the rows are the encoded data. To do this, I tried two things:
ohe_df = pd.DataFrame(results, columns=np.concatenate(ohe.categories_))
And I tried:
ohe_df = pd.DataFrame(results, columns=ohe.get_feature_names(input_features=df.columns))
Which both resulted in the error:
ValueError: Shape of passed values is (4, 1), indices imply (4, 9)
IIUC,
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
['Female','missing','None'], ['Male','Yes','Ventouse']]),
columns=['gender', 'diabetes', 'assistance'])
df.head()
# encode categorical data
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)
df_results = pd.DataFrame.sparse.from_spmatrix(results)
df_results.columns = ohe.get_feature_names(df.columns)
df_results
Output:
gender_Female gender_Male diabetes_No diabetes_Yes diabetes_missing assistance_Forceps assistance_Forceps and ventouse assistance_None assistance_Ventouse
0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
2 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
Note, the output of ohe.fit_transform(df) is a sparse matrix.
print(type(results))
<class 'scipy.sparse.csr.csr_matrix'>
You can convert this to a dataframe using pd.DataFrame.sparse.from_spmatrix. Then, you can use ohe.get_feature_names, passing in the original dataframe columns, to name the columns of the results dataframe, df_results.
ohe.categories_ is a list of arrays, one array for each feature. You need to flatten that into a 1D list/array for pd.DataFrame, e.g. with np.concatenate(ohe.categories_).
But better still, use the built-in method get_feature_names.
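For completeness, a sketch of that manual route (densify the sparse matrix first; note the flattened names lack the feature prefixes, so the same category value appearing under two features would yield duplicate column names):
ohe_df = pd.DataFrame(results.toarray(),
                      columns=np.concatenate(ohe.categories_))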
I have the following dataset:
dates A B C
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0
I want to calculate the slopes based on the timestamp index. This should be the result:
slope:
A 0.4
B -0.7
C -0.1
I tried this solution:
slope = df.apply(lambda x: np.polyfit(df.index), x, 1)[0])
But it returns an error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Any help will be greatly appreciated.
a) Don't apply() the polynomial fitting to the Timestamp column, only to the float columns A, B, C. So either make dates the index, or don't include it in the columns passed into apply().
Make dates column your index:
df.set_index('dates', inplace=True)
A B C
dates
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0
b) Now as to fixing up the apply() call:
the parentheses are mismatched (there's a stray closing parenthesis after df.index), and you need a trailing axis=1 argument to apply your function row-wise.
also, since we changed df.index to be the dates rather than the autonumbered integers 0, 1, 2, you need to pass an explicit integer range into polyfit().
Solution:
#pd.options.display.float_format = '{:.3f}'.format
#pd.options.display.precision = 3
#np.set_printoptions(floatmode='fixed', precision=3, suppress=True)
df.apply(lambda x: np.polyfit(range(len(x)), x, 1), axis=1)
dates
2005-01-01 [-1.9860273225978183e-16, 1.3333333333333333]
2005-01-02 [-0.5000000000000004, 1.8333333333333341]
2005-01-04 [-0.9999999999999998, 2.3333333333333335]
(Note: I'm unsuccessfully trying to set the np and pd display options to suppress the unwanted decimal places and scientific notation on the arrays returned by polyfit. You can figure that part out yourself.)
And here's the boilerplate to make your example reproducible:
import numpy as np
import pandas as pd
from io import StringIO
data = """dates A B C
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0"""
df = pd.read_csv(StringIO(data), sep=r'\s+', parse_dates=['dates'])
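Incidentally, if what you want is the per-column slopes shown in the question (one number per column A, B, C), apply down the columns instead (pandas' default axis=0) after turning the dates into a numeric scale. A sketch, assuming "days since the first sample" is the right time unit:
df = df.set_index('dates')
t = (df.index - df.index[0]).days   # numeric time axis: 0, 1, 3
slopes = df.apply(lambda col: np.polyfit(t, col, 1)[0])
print(slopes)
# A    0.642857
# B   -0.642857
# C    0.000000
These differ from the expected numbers in the question because the result depends on how the timestamps are mapped onto a numeric scale.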
So, my input file is a large file containing a list within a list. It has thousands of rows. Each value in the list is either 0 or 1. There is also another column specifying a value associated with each list. However, after passing my data through scikit-learn's train_test_split, the ends of the nested lists appear to be separated from the rest of the list.
I have tried using the pandas.read_csv and scikit-learn train_test_split functions, but this does not appear to solve the issue.
I have also tried using the dtype parameter of pandas.read_csv; however, this has not worked either.
import pandas
from sklearn.model_selection import train_test_split
data = pandas.read_csv('file_to_be_read', sep=',')
X_train, X_test = train_test_split(data, test_size=0.3, random_state=42)
Input file:
[[0,0,0,0,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,0,0] ..., [0,0,0,0,0,0,0,0,1,0]] 4.567645
The resulting output:
[[0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0]... [0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0] 4.567645
Desired Output:
[[0,0,0,0,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,0,0] ..., [0,0,0,0,0,0,0,0,1,0]] 4.567645
In essence I just want to take random rows of data from the original file and place them into another file.
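Since each row of the raw file already has exactly the text formatting you want to preserve, one option is to sample whole lines before pandas ever parses them. A sketch (the output file names are hypothetical):
import random

with open('file_to_be_read') as f:
    lines = f.readlines()

random.seed(42)              # reproducible, like random_state=42
random.shuffle(lines)
cut = int(len(lines) * 0.7)  # 70/30 split, matching test_size=0.3

with open('train_rows.txt', 'w') as f:
    f.writelines(lines[:cut])
with open('test_rows.txt', 'w') as f:
    f.writelines(lines[cut:])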
I'm currently dealing with a set of similar DataFrames that have a double header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
Actually, the main differences are the ordering of the first header row, which could be made the same for all of them, and the position of the RHS column in the second header row. I'm currently wondering if there is an easy way of saving/reading all these DataFrames into/from a single CSV file, instead of having a different CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd
writer = pd.ExcelWriter('file.xlsx')
for i, df in enumerate(df_list):
    df.to_excel(writer, 'sheet{}'.format(i))
writer.save()
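To load them back later, a sketch (sheet names follow the sheet{i} pattern used above; header=[0, 1] restores the two header rows, assuming each frame was saved with its double header):
dfs = pd.read_excel('file.xlsx', sheet_name=None, header=[0, 1], index_col=0)
df0 = dfs['sheet0']  # sheet_name=None returns a dict of DataFrames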
Returning to your example (with random numbers instead of your values):
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (How to change the order of DataFrame columns?):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (Merge, join, and concatenate):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now, it just uses two sub-columns, which is not very elegant). If, from your point of view, this row is meaningless, just reset the headers as you like before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it (df4, or df5 if you dropped the second header row) in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv',sep=',')
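A sketch of the round trip: if you kept the two-level header (df4), read_csv can rebuild the MultiIndex columns with header=[0, 1]; with the flattened header (df5), a plain header=0 is enough.
df_back = pd.read_csv('file_name.csv', header=[0, 1], index_col=0)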
I want to train a model and finally predict a true value using a random forest model in Python, on a three-column dataset (click the link in the original post to download the full CSV dataset) formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I wanted to predict the current value of Y (the true value) using the last (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using the random forest model of sklearn in Python. Meaning, taking [0,0,1,2,3] of the X column as input for the first window, I want to predict the 5th row's value of Y, trained on the previous values of Y.
Let's say we have 5 traces of the dataset (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (for example, a1.csv), I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
df = pd.read_csv('a1.csv')
for i in range(1, 5):
    df['X_t' + str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X=pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
reg = RandomForestRegressor(criterion='mse')
reg.fit(X,y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:",len(modelPred))
modelPred.tofile('predictedValues1.txt',sep="\n",format="%s")
meanSquaredError=mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields the following df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv): how do I divide them into 60% for training and the remaining 40% for testing (meaning 3 traces used for training and 2 for testing), using sklearn in Python?
PS: All the files have the same structure, but they have different lengths, as they were generated with different parameters.
import glob, os
import pandas as pd
from sklearn.cross_validation import train_test_split

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# build your X and Y from df, then:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' % i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use sklearn.cross_validation.train_test_split (moved to sklearn.model_selection in newer versions) to segment your data:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
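Note that both answers above split rows at random across the concatenated data. If you literally want a file-level split (3 traces for training, 2 for testing), a minimal sketch:
import pandas as pd

train_files = ['a1.csv', 'a2.csv', 'a3.csv']
test_files = ['a4.csv', 'a5.csv']

# keep each trace intact: fit on the training traces, evaluate on the rest
df_train = pd.concat(map(pd.read_csv, train_files), ignore_index=True)
df_test = pd.concat(map(pd.read_csv, test_files), ignore_index=True)
Build the lag features and X/y separately for each split, then fit on the training frame and score on the test frame.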
I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so I can generate a PMML file later on.
This is the code to create a mapper. The categorical variables I would like to encode are stored in a list called 'dummies'.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
mapper = DataFrameMapper(
[(d, LabelEncoder()) for d in dummies] +
[(d, OneHotEncoder()) for d in dummies]
)
And this is the code to create a pipeline, including the mapper and linear regression.
from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression
lm = PMMLPipeline([("mapper", mapper),
("regressor", LinearRegression())])
When I now try to fit (with 'features' being a dataframe, and 'targets' a series), it gives an error 'could not convert string to float'.
lm.fit(features, targets)
OneHotEncoder doesn't support string features (in scikit-learn versions before 0.20), and with [(d, OneHotEncoder()) for d in dummies] you are applying it to all the dummies columns. Use LabelBinarizer instead:
from sklearn.preprocessing import LabelBinarizer

mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)
An alternative would be to use the LabelEncoder with a second OneHotEncoder step.
mapper = DataFrameMapper(
[(d, LabelEncoder()) for d in dummies]
)
lm = PMMLPipeline([("mapper", mapper),
("onehot", OneHotEncoder()),
("regressor", LinearRegression())])
LabelEncoder and LabelBinarizer are intended for encoding/binarizing the target (label) of your data, i.e. the y vector. They do more or less the same thing as OneHotEncoder, the main difference being that the label preprocessing steps don't accept matrices, only 1-D vectors.
import numpy as np
import pandas as pd

example = pd.DataFrame({'x': np.arange(2, 14, 2),
                        'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                        'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})
dummies = ['cat1', 'cat2']
x cat1 cat2
0 2 A p
1 4 B q
2 6 A w
3 8 B p
4 10 C q
5 12 A w
As an example, LabelEncoder().fit_transform(example['cat1']) works, but LabelEncoder().fit_transform(example[dummies]) throws a ValueError exception.
In contrast, OneHotEncoder accepts multiple columns:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(example[dummies])
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
This can be incorporated into a pipeline using a ColumnTransformer, passing through (or alternatively applying different transformations to) the other columns:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encode_cats', OneHotEncoder(), dummies),],
remainder='passthrough')
pd.DataFrame(ct.fit_transform(example), columns = ct.get_feature_names_out())
encode_cats__cat1_A encode_cats__cat1_B ... encode_cats__cat2_w remainder__x
0 1.0 0.0 ... 0.0 2.0
1 0.0 1.0 ... 0.0 4.0
2 1.0 0.0 ... 1.0 6.0
3 0.0 1.0 ... 0.0 8.0
4 0.0 0.0 ... 0.0 10.0
5 1.0 0.0 ... 1.0 12.0
Finally, slot this into a pipeline:
from sklearn.pipeline import Pipeline
Pipeline([('preprocessing', ct),
('regressor', LinearRegression())])
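A minimal usage sketch, fitting the pipeline on the toy frame from above with a made-up target (y here is purely illustrative, not from the question):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

y = np.arange(6)  # dummy target, for illustration only
pipe = Pipeline([('preprocessing', ct),
                 ('regressor', LinearRegression())])
pipe.fit(example, y)
print(pipe.predict(example))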