I've been following a tutorial to understand machine learning, trying out what the instructor does as I go along.
My array is:
0 44 72000
2 27 48000
1 30 54000
2 38 61000
1 40 63777.77777777778
0 35 58000
2 38.77777777777777857 52000
0 48 79000
1 50 83000
0 37 67000
The first column used to contain the country name, but he used LabelEncoder to transform it to 0s, 1s, and 2s.
He also wanted to use OneHotEncoder to expand that column into more features, but since his videos are a bit outdated, he passed the categorical_features parameter to OneHotEncoder; in my sklearn version OneHotEncoder has changed and no longer has that parameter.
So how can I use OneHotEncoder now on that specific feature?
What he tried was:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Assuming your data X has shape (n_rows, n_features), and you would like to apply one-hot encoding to, say, the first column, a quick approach would be:
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
one_hot = onehotencoder.fit_transform(X[:, 0:1]).toarray()
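Note that this returns only the encoded first column, so you would still need to concatenate it with the remaining columns yourself.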
A better approach for applying one-hot encoding to only a specific column is to use ColumnTransformer:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
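Note that ct.fit_transform puts the one-hot country columns first, followed by the passthrough columns (age and salary in your case), so the column order changes.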
One-hot encoding represents your data with binary vectors, one per category. For instance, if you have 2 classes, each vector has length 2:
[_,_]
Each class is then represented using only 0s and 1s: the index of the represented class takes 1 and the others take 0. For instance, class 0 will be:
[1,0]
Class1 will be:
[0,1]
In your example, you have 3 classes, so your one-hot vector will have length 3. Each class is represented like this:
Class0 -> [1,0,0]
Class1 -> [0,1,0]
Class2 -> [0,0,1]
Then your array will look like:
[1,0,0] 44 72000
[0,0,1] 27 48000
[0,1,0] 30 54000
[0,0,1] 38 61000
[0,1,0] 40 63777.77777777778
[1,0,0] 35 58000
[0,0,1] 38.77777777777777857 52000
[1,0,0] 48 79000
[0,1,0] 50 83000
[1,0,0] 37 67000
I hope this clarifies your question. You can also write your own function to do this, as in the sketch below.
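A minimal numpy sketch of such a function (assuming the labels are the integers 0..n_classes-1, as LabelEncoder produces):

import numpy as np

def one_hot(labels, n_classes):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing it with the label array encodes every row at once.
    return np.eye(n_classes, dtype=int)[labels]

labels = np.array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0])  # the encoded country column
print(one_hot(labels, 3))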
Related
I'm building a predictive model for whether a car is a sports car or not. The model works fine; however, I would like to join the predicted values back to the unique IDs and visualize the proportions, etc. Basically I have two dataframes:
Testing with labelled data - test_cars:

CarId  Feature1  Feature2  IsSportCar
1      90        150       True
2      60        200       False
3      560       500       True
Unlabelled data to be predicted - cars_new:

CarId  Feature1  Feature2
4      88        666
5      55        458
6      150       125
from sklearn.neighbors import KNeighborsClassifier
# Create arrays for the features and the response variable
y = test_cars['IsSportCar'].values
X = test_cars.drop(['IsSportCar','CarId'], axis=1).values
X_new = cars_new.drop(['CarId'], axis=1).values
# Create a k-NN classifier with 10 neighbors
knn = KNeighborsClassifier(n_neighbors=10)
# Fit the classifier to the data
knn.fit(X,y)
y_pred = knn.predict(X_new)
The model works fine, but I would like to join the predicted values back to each car (CarId), so that cars_new would be output with a predicted IsSportCar column:
CarId  Feature1  Feature2  IsSportCar
4      88        666       False
5      55        458       True
6      150       125       True
Any ideas how to join the predicted values back to the unique IDs?
I assume y_pred is the variable you want to put into cars_new?

cars_new['IsSportCar'] = y_pred
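If you prefer not to modify cars_new in place, a minimal sketch that builds a separate result frame and joins it back on CarId (assuming cars_new is a pandas DataFrame and y_pred is aligned with its rows):

import pandas as pd

# Pair each CarId with its predicted label
result = pd.DataFrame({'CarId': cars_new['CarId'], 'IsSportCar': y_pred})
# Join the predictions back onto the original frame by ID
cars_new = cars_new.merge(result, on='CarId')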
I am working on a Data Science project with the Fifa dataset. I cleaned the data and took care of any NaN values to get it ready to be split into test and train sets. I need to use StratifiedShuffleSplit to split the data. I have updated to a cleaner way of dividing the value data into groups, but I am still getting NaN values once the data goes through the split.
Link to the data set I am using: https://www.kaggle.com/karangadiya/fifa19
n = fifa['value'].count()
folds = 3
fifa.sort_values('value', ascending=False, inplace=True)
fifa['group_id'] = np.floor(np.arange(n)/folds)
fifa['value_cat'] = fifa.groupby('group_id', as_index = False)['name'].transform(lambda x: np.random.choice(v_cats, size=x.size, replace = False))
At this point, when I check the test and train data, I have mystery NaN values appearing. I think the NaN values may be a result of .loc, since I am getting a warning in jupyter:
c:\python37\lib\site-packages\ipykernel_launcher.py:6: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Code below:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(fifa, fifa['value_cat']):
    strat_train_set = fifa.loc[train_index]
    strat_test_set = fifa.loc[test_index]
fifa = strat_train_set.drop('value', axis=1)
value_labels = strat_train_set['value'].copy()
PLEASE HELP MY POOR SOUL!!
Here's one solution.
import numpy as np
import pandas as pd
n = 100
folds = 3
# Make some data
df = pd.DataFrame({'id':np.arange(n), 'value':np.random.lognormal(mean=10, sigma=1, size=n)})
# Sort by value
df.sort_values('value', ascending=False, inplace=True)
# Insert 'group' ids, 0, 0, 0, 1, 1, 1, 2, 2, 2, ...
df['group_id'] = np.floor(np.arange(n)/folds)
# Randomly assign folds within each group
df['fold'] = df.groupby('group_id', as_index=False)['id'].transform(lambda x: np.random.choice(folds, size=x.size, replace=False))
# Inspect
df.head(10)
id value group_id fold
46 46 208904.679048 0.0 0
3 3 175730.118616 0.0 2
0 0 137067.103600 0.0 1
87 87 101894.243831 1.0 2
11 11 100570.573379 1.0 1
90 90 93681.391254 1.0 0
73 73 92462.150435 2.0 2
13 13 90349.408620 2.0 1
86 86 87568.402021 2.0 0
88 88 82581.010789 3.0 1
Assuming you want k folds, the idea is to sort the data by value, then randomly assign the k fold labels to the first k rows, then do the same to the next k rows, and so on.
By the way, you will have more luck getting answers to questions here if you can create reproducible examples with data that make it easy for others to tinker with. :)
Assume I have four inputs and I want to predict the next 2-hour value of the first input. When I try to predict, there are NaN values in the first input column. To skip the NaN values, I tried to shift the earlier predicted value into that input column, but it didn't work for me.
[ 120  30  40  50
  110  20  10  20
  NaN  12  30  30
  120  50  60  70
  NaN  10  28  40 ]  (inputs to the model)
The output I expect when training the model:
[ 120   30  40  50  = pred1
  110   20  10  20  = pred2
  pred2  12  30  30  = pred3
  120   50  60  70  = pred4
  pred4  10  28  40  = pred5 ]
So when training the model, the NaN values should be removed and the earlier prediction value should be shifted into that NaN position.
I wrote the code for that but it didn't work for me. Here is my code:
model.reset_states()
pred = model.predict(x_test_n)
pred_count = pred[0]
forecasts = []
next_pred = []
for col in range(len(x_test_n)-1):
    print('Prediction %s: ' % str(pred))
    next_pred_res = np.reshape(next_pred, (next_pred.shape[1], 1, next_pred.shape[0]))
    # make predictions
    forecastPredict = model.predict(next_pred_res, batch_size=1)
    forecastPredictInv = scaler.inverse_transform(forecastPredict)
    forecasts.append(forecastPredictInv)
    next_pred = next_pred[1:]
    next_pred = np.concatenate([next_pred, forecastPredict])
    pred_count += 1
Can anyone help me solve this error? I just want to replace each NaN with the earlier prediction value.
You can iterate through each row, get predictions, and fill the NaNs. Something like below:
prev_preds = 0
preds = []
# For each row of the dataframe get the predictions.
for _, row in df.iterrows():
    # Fill the missing values with the previous prediction; initially it will be zero.
    row = row.fillna(prev_preds)
    # Now get the prediction and store it in an array.
    preds.append(model.predict([row.values]))
    # Update the previous prediction to the new prediction by accessing the last element of the predictions array.
    prev_preds = preds[-1]
# Assign the predictions to a new column in the dataframe.
df['predictions'] = preds
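Depending on the model, predict on a single row usually returns a length-1 array, so you may want to flatten before assigning, e.g. df['predictions'] = [p[0] for p in preds].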
I want ONLY ONE of my features to be converted to separate binary features:
df["pattern_id"]
Out[202]:
0 3
1 3
...
7440 2
7441 2
7442 3
Name: pattern_id, Length: 7443, dtype: int64
df["pattern_id"]
Out[202]:
0 0 0 1
1 0 0 1
...
7440 0 1 0
7441 0 1 0
7442 0 0 1
Name: pattern_id, Length: 7443, dtype: int64
I want to use OneHotEncoder; the data is already int, so there should be no need to label-encode it first:
onehotencoder = OneHotEncoder(categorical_features=["pattern_id"])
df = onehotencoder.fit_transform(df).toarray()
ValueError: could not convert string to float: 'http://www.zaragoza.es/sedeelectronica/'
Interestingly enough, I receive an error... sklearn tried to encode another column, not the one I wanted.
We have to encode pattern_id to be an integer value
I used this link: Issue with OneHotEncoder for categorical features
#transform the pattern_id feature to int
encoding_feature = ["pattern_id"]
enc = LabelEncoder()
enc.fit(encoding_feature)
working_feature = enc.transform(encoding_feature)
working_feature = working_feature.reshape(-1, 1)
ohe = OneHotEncoder(sparse=False)
#convert the pattern_id feature to separate binary features
onehotencoder = OneHotEncoder(categorical_features=working_feature, sparse=False)
df = onehotencoder.fit_transform(df).toarray()
And I get the same error. What am I doing wrong ?
Edit
source:
https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/analysis.py
df
Out[259]:
found_img is_http link_img \
0 True 0 img/aahoteles.svg
//www.zaragoza.es/cont/paginas/img/sede/logo_e...
pattern_id current_link site_id \
0 3 https://www.aa-hoteles.com/es/reservas 3
6 3 https://www.aa-hoteles.com/es/ofertas-hoteles 3
7 2 http://about.pressreader.com/contact-us/ 4
8 3 http://about.pressreader.com/contact-us/ 4
status link_id
0 200 https://www.aa-hoteles.com/
1 200 https://www.365travel.asia/
2 200 https://www.365travel.asia/
3 200 https://www.365travel.asia/
4 200 https://www.aa-hoteles.com/
5 200 https://www.aa-hoteles.com/
6 200 https://www.aa-hoteles.com/
7 200 http://about.pressreader.com
8 200 http://about.pressreader.com
9 200 https://www.365travel.asia/
10 200 https://www.365travel.asia/
11 200 https://www.365travel.asia/
12 200 https://www.365travel.asia/
13 200 https://www.365travel.asia/
14 200 https://www.365travel.asia/
15 200 https://www.365travel.asia/
16 200 https://www.365travel.asia/
17 200 https://www.365travel.asia/
18 200 http://about.pressreade
[7443 rows x 8 columns]
If you take a look at the documentation for OneHotEncoder, you can see that the categorical_features argument expects '"all" or array of indices or mask', not a string. You can make your code work by changing it to the following lines:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a dataframe of random ints
df = pd.DataFrame(np.random.randint(0, 4, size=(100, 4)),
                  columns=['pattern_id', 'B', 'C', 'D'])

onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
df = onehotencoder.fit_transform(df)
However df will no longer be a DataFrame, I would suggest working directly with the numpy arrays.
You can also do it like this
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df.required_column.values.reshape(-1, 1)).toarray()
We need to reshape the column because fit_transform requires a 2-D array. You can then turn this numpy array into columns and merge them back into your DataFrame, as sketched below.
Seen from this link here
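For instance, a minimal sketch of merging the encoded columns back into the DataFrame (required_column is the placeholder name from the snippet above; get_feature_names_out assumes a recent sklearn version, older versions have get_feature_names instead):

import pandas as pd

# Wrap the encoded array in a DataFrame with readable column names
encoded = pd.DataFrame(X, columns=onehotenc.get_feature_names_out(['required_column']),
                       index=df.index)
# Drop the original column and append the new binary columns
df = pd.concat([df.drop(columns='required_column'), encoded], axis=1)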
The recommended way to work with different column types is detailed in the sklearn documentation here.
Representative example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
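As the linked documentation shows, the preprocessor is then typically chained with an estimator in a single Pipeline (X_train and y_train below stand in for your own split data):

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
clf.fit(X_train, y_train)  # X_train / y_train: your own training split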
I am using bank data to predict the number of tickets on a daily basis. I am using stacking to get a more accurate result, via the brew library.
Here is a sample of the important features and of the target attribute (images omitted).
Here is the code:
from stacked_generalization.lib.stacking import StackedClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
# Stage 1 model
bclf = LogisticRegression(random_state=1)
# Stage 0 models
clfs = [RandomForestClassifier(n_estimators=40, criterion='gini', random_state=1),
        gbm,
        RidgeClassifier(random_state=1)]
sl = StackedClassifier(bclf, clfs)
sl.fit(training.select_columns(features).to_dataframe().as_matrix(), np.array(training['class']))
Here is the training data format:
[[ 21 11 2014 46 4 3]
[ 22 11 2014 46 5 4]
[ 24 11 2014 47 0 4]
...,
[ 30 9 2016 39 4 5]
[ 3 10 2016 40 0 1]
[ 4 10 2016 40 1 1]]
Now, when I try to fit the model, it gives an error. I compared my code with the example given in the library but still couldn't figure out where I am going wrong. Kindly assist me.
I had a similar issue; it seems to just be a bug in brew that needs to be fixed. The problem is that c.classes_ (the array of class labels) contains floats (e.g., if you have two classes it returns [0.0, 1.0] instead of the integers [0, 1]). The code tries to use these floats to index columns, but you cannot index numpy columns with floats.
probas has one row per training example and one column per class, and c.predict_proba(X) returns the probabilities of each class for each training example. The line

probas[:, list(c.classes_)] = c.predict_proba(X)

should put the probability of each class for each row of X into probas, using the class number to index the columns of probas.
This would work if you add astype(int):
probas[:, list(et.classes_.astype(int))] = et.predict_proba(X)
or just
probas = np.copy(et.predict_proba(X))
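To see why the cast matters, here is a minimal numpy sketch (the shapes and values are illustrative):

import numpy as np

probas = np.zeros((4, 2))
classes = np.array([0.0, 1.0])  # float class labels, as brew receives them
# probas[:, list(classes)] = 0.5  # IndexError: floats are not valid column indices
probas[:, list(classes.astype(int))] = 0.5  # works once cast to int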