I am working on a data science project with the FIFA dataset. I cleaned the data and handled the NaN values to get it ready to be split into test and train sets. I need to use StratifiedShuffleSplit to split the data. I have updated the code to a cleaner way of dividing the value data into groups, but I am still getting NaN values once the data goes through the split.
Link to the data set I am using: https://www.kaggle.com/karangadiya/fifa19
n = fifa['value'].count()
folds = 3
fifa.sort_values('value', ascending=False, inplace=True)
fifa['group_id'] = np.floor(np.arange(n)/folds)
fifa['value_cat'] = fifa.groupby('group_id', as_index = False)['name'].transform(lambda x: np.random.choice(v_cats, size=x.size, replace = False))
At this point, when I check the test and train data, I find that mystery NaN values have been inserted. I think the NaN values may be a result of .loc, since I am getting a warning in Jupyter.
c:\python37\lib\site-packages\ipykernel_launcher.py:6: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Code below:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(fifa, fifa['value_cat']):
    strat_train_set = fifa.loc[train_index]
    strat_test_set = fifa.loc[test_index]
fifa = strat_train_set.drop('value', axis=1)
value_labels = strat_train_set['value'].copy()
PLEASE HELP MY POOR SOUL!!
Here's one solution.
import numpy as np
import pandas as pd
n = 100
folds = 3
# Make some data
df = pd.DataFrame({'id':np.arange(n), 'value':np.random.lognormal(mean=10, sigma=1, size=n)})
# Sort by value
df.sort_values('value', ascending=False, inplace=True)
# Insert 'group' ids, 0, 0, 0, 1, 1, 1, 2, 2, 2, ...
df['group_id'] = np.floor(np.arange(n)/folds)
# Randomly assign folds within each group
df['fold'] = df.groupby('group_id', as_index=False)['id'].transform(lambda x: np.random.choice(folds, size=x.size, replace=False))
# Inspect
df.head(10)
id value group_id fold
46 46 208904.679048 0.0 0
3 3 175730.118616 0.0 2
0 0 137067.103600 0.0 1
87 87 101894.243831 1.0 2
11 11 100570.573379 1.0 1
90 90 93681.391254 1.0 0
73 73 92462.150435 2.0 2
13 13 90349.408620 2.0 1
86 86 87568.402021 2.0 0
88 88 82581.010789 3.0 1
Assuming you want k folds, the idea is to sort the data by value, then randomly assign folds 0, 1, ..., k-1 to the first k rows, then do the same to the next k rows, etc.
By the way, you will have more luck getting answers to questions here if you can create reproducible examples with data that make it easy for others to tinker with. :)
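On the NaN values in the question: one likely cause, assuming rows were dropped during cleaning so that the index has gaps, is that StratifiedShuffleSplit.split() yields positional indices while .loc looks rows up by label. A minimal sketch with made-up data (the toy frame and its column values are assumptions, not the FIFA data) that selects by position instead:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame whose index has gaps, as it would after dropping rows during cleaning
df = pd.DataFrame(
    {"value": np.arange(20, dtype=float), "value_cat": np.tile([0, 1], 10)},
    index=np.arange(0, 40, 2),  # labels 0, 2, 4, ... so positions differ from labels
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df["value_cat"]):
    # split() yields positions, so select with .iloc rather than .loc
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
```

Calling reset_index(drop=True) on the frame before splitting should have the same effect, because labels and positions then coincide.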
I have two original dataframes.
One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is ideal value,
UL - upper limit,
LL - lower limit
The other contains the original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm writing a function whose desired output is the ID and the features that are below or above the thresholds (the limits from the first dataframe). So far I can recognise which features are out of limits, but I'm getting the full original dataframe as output...
def table(df_limits, df_to_check, column):
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target': df_limits[column].loc['target']}
        return pd.DataFrame(above_limit)
What should I change so that my desired output is:
(showing only the ID and column where observations are out of limits)
Ideally it would also show by how many percent the original value deviates from the ideal target value (I would be glad for advice on how to add such a column).
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
Now, after running this function, it returns the whole dataset because the statement checks whether the index is not None, and it never is. I understand why I have this issue, but I don't know how to change it.
The issue is with the statement if UL_index is not None:, since it returns the whole dataset, and I'm looking for a way to replace this part.
Approach
reshape
merge
calculate
new_df = (df_to_check.set_index("ID").unstack().reset_index()
.rename(columns={"level_0":"column",0:"value"})
.merge(df_limits.T, left_on="column", right_index=True)
.assign(deviate=lambda dfa: (dfa.value-dfa.target)/dfa.target)
)
column  ID   value  target  UL   LL  deviate
feat_1  123  12.5   12      15   9   0.0416667
feat_1  456  18     12      15   9   0.5
feat_1  789  9      12      15   9   -0.25
feat_2  123  9.6    9       10   8   0.0666667
feat_2  456  3      9       10   8   -0.666667
feat_2  789  11     9       10   8   0.222222
feat_3  123  100    90      120  60  0.111111
feat_3  456  100    90      120  60  0.111111
feat_3  789  100    90      120  60  0.111111
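To get from this reshaped frame to the output the question asks for, the rows can be filtered on the limits after the merge. This is a minimal sketch using the sample data from the question; it assumes UL and LL are absolute bounds (as the desired output implies) rather than offsets from target, and the names long_df, merged, out_of_limits and result are only illustrative:

```python
import pandas as pd

df_limits = pd.DataFrame(
    {"feat_1": [12, 15, 9], "feat_2": [9, 10, 8], "feat_3": [90, 120, 60]},
    index=["target", "UL", "LL"],
)
df_to_check = pd.DataFrame(
    {
        "ID": [123, 456, 789],
        "feat_1": [12.5, 18, 9],
        "feat_2": [9.6, 3, 11],
        "feat_3": [100, 100, 100],
    }
)

# Reshape to one row per (ID, feature), then attach that feature's limits
long_df = df_to_check.melt(id_vars="ID", var_name="column", value_name="value")
merged = long_df.merge(df_limits.T, left_on="column", right_index=True)

# Keep only observations outside [LL, UL] (assumed to be absolute bounds)
outside = (merged["value"] > merged["UL"]) | (merged["value"] < merged["LL"])
out_of_limits = merged[outside].copy()

# Percentage deviation of the observed value from the target
out_of_limits["deviate(%)"] = (
    (out_of_limits["value"] - out_of_limits["target"]) / out_of_limits["target"] * 100
)
result = out_of_limits[["ID", "column", "target", "value", "deviate(%)"]].reset_index(drop=True)
```

With the sample data this keeps three rows: ID 456 for feat_1 and feat_2, and ID 789 for feat_2.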
First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code that produces the two initial dataframes. Please keep this in mind next time you ask a question. Without those, I made a toy example with my own (random) data.
I start by unpivoting what you call df_to_check: if you want to check each feature independently, that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc, and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np
df_limits = pd.DataFrame(index =['min val','max val','target'])
df_limits['a']=[2,4,3]
df_limits['b']=[3,5,4.5]
df =pd.DataFrame(columns = df_limits.columns, data =np.random.rand(100,2)*6 )
df_unpiv = pd.melt( df.reset_index().rename(columns ={'index':'id'}), id_vars='id', var_name ='feature', value_name = 'value' )
# I reset the index because I couldn't get a join on a column and index, but there is probably a better way to do it
df_joined = pd.merge( df_unpiv, df_limits.transpose().reset_index().rename(columns = {'index':'feature'}), how='left', on ='feature' )
df_joined['abs diff from target'] = abs( df_joined['value'] - df_joined['target'] )
df_joined['outside range'] = (df_joined['value'] < df_joined['min val'] ) | (df_joined['value'] > df_joined['max val'])
df_outside_range = df_joined.query(" `outside range` == True " )
df_inside_range = df_joined.query(" `outside range` == False " )
I solved my issue, maybe in a slightly clumsy way, but it works as desired.
If someone has a better answer, I would still appreciate it.
Here is an example of how to get only the observations above the limits; to get both, just concatenate the observations from UL_index and LL_index.
def table(df_limits,df_to_check,column):
    above_limit = []
    df_above_limit = pd.DataFrame()
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_above_limit = pd.DataFrame(above_limit, index = df_to_check_UL.index)
    return df_above_limit
Assume I have four inputs and I want to predict the next 2-hour value of the first input. When I try to predict the value, the first input column contains NaN.
To skip the NaN values, I am trying to shift the earlier predicted value into that input column, but it didn't work for me.
[ 120 30 40 50
110 20 10 20
NaN 12 30 30
120 50 60 70
NaN 10 28 40] inputs to the model
The output I expect when training the model:
[ 120 30 40 50 = pred1
110 20 10 20 = pred2
pred2 12 30 30 = pred3
120 50 60 70 = pred4
pred4 10 28 40 = pred5 ]
Here, when training the model, the NaN values should be removed and the earlier prediction should move into the position of the NaN value.
I wrote the code for that but it didn't work for me. Here is my code:
model.reset_states()
pred= model.predict(x_test_n)
pred_count=pred[0]
forecasts=[]
next_pred=[]
for col in range(len(x_test_n)-1):
    print('Prediction %s: ' % str(pred))
    next_pred_res = np.reshape(next_pred, (next_pred.shape[1], 1, next_pred.shape[0]))
    # make predictions
    forecastPredict = model.predict(next_pred_res, batch_size=1)
    forecastPredictInv = scaler.inverse_transform(forecastPredict)
    forecasts.append(forecastPredictInv)
    next_pred = next_pred[1:]
    next_pred = np.concatenate([next_pred, forecastPredict])
    pred_count += 1
Can anyone help me to solve this error? I just want to shift the earlier prediction value with NaN value.
You can iterate through each row, get the prediction, and fill the NaNs. Something like the following:
import numpy as np
prev_pred = 0
preds = []
# For each row of the dataframe, get the prediction.
for _, row in df.iterrows():
    # Fill the missing values with the previous prediction; initially it is zero.
    row = row.fillna(prev_pred)
    # Predict on a one-row 2-D array and keep the result as a scalar,
    # so that it can be passed to fillna on the next iteration.
    pred = np.ravel(model.predict(np.array([row.values])))[0]
    preds.append(pred)
    # The latest prediction becomes the fill value for the next row.
    prev_pred = pred
# Assign the predictions to a new column in the dataframe.
df['predictions'] = preds
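To see the fill-forward behaviour end to end, here is a small self-contained sketch on the five input rows from the question. RowSumModel is a hypothetical stand-in for the trained model (it just sums each row) and the column names are made up; any object with a predict method that accepts a 2-D array could take its place.

```python
import numpy as np
import pandas as pd


class RowSumModel:
    """Hypothetical stand-in for a trained model: predicts the sum of each row."""

    def predict(self, X):
        return np.asarray(X, dtype=float).sum(axis=1)


model = RowSumModel()
df = pd.DataFrame(
    [[120, 30, 40, 50],
     [110, 20, 10, 20],
     [np.nan, 12, 30, 30],
     [120, 50, 60, 70],
     [np.nan, 10, 28, 40]],
    columns=["in1", "in2", "in3", "in4"],
)

prev_pred = 0.0
preds = []
for _, row in df.iterrows():
    # Replace any NaN in this row with the previous row's prediction
    filled = row.fillna(prev_pred)
    # Predict on a single-row 2-D array and keep the result as a plain float
    pred = float(np.ravel(model.predict(filled.to_numpy().reshape(1, -1)))[0])
    preds.append(pred)
    prev_pred = pred

df["predictions"] = preds
```

Here the third row is predicted from (160, 12, 30, 30) and the fifth from (300, 10, 28, 40), because 160 and 300 are the predictions for the rows just before them.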
I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) using the last N data points of X (for example, N = 5, 10, 100, 300, 1000, etc.) with the random forest model of sklearn in Python. That is, taking [0,0,1,2,3] of the X column as the input for the first window, I want to predict the 5th-row value of Y, trained on the previous values of Y.
Let's say we have 5 traces (datasets) (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (for example, a1.csv), I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
df = pd.read_csv('a1.csv')
for i in range(1,5):
    df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X=pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
reg = RandomForestRegressor(criterion='mse')
reg.fit(X,y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:",len(modelPred))
modelPred.tofile('predictedValues1.txt',sep="\n",format="%s")
meanSquaredError=mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv), using 60% of the datasets whose file names start with a for training and the remaining 40% for testing, with sklearn in Python (meaning 3 traces will be used for training and 2 for testing). How can I do this?
PS: All the files have the same structure, but they have different lengths because they were generated with different parameters.
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# get your X and Y Df's
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' %i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use sklearn.model_selection.train_test_split to segment your data (older scikit-learn versions exposed it as sklearn.cross_validation.train_test_split):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
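Note that train_test_split on the concatenated frame splits by row, not by file. If the goal is to hold out whole traces (3 files for training, 2 for testing), one possible sketch is to build the lagged features per trace and then split the list of traces. The synthetic frames below are hypothetical stand-ins for a1.csv to a5.csv, and lag_features and stack are illustrative helper names:

```python
import numpy as np
import pandas as pd


def lag_features(trace, window=5):
    """Build lagged X features inside one trace, so no window spans two files."""
    lags = pd.DataFrame({f"X_{i}": trace["X"].shift(i) for i in range(window)})
    lags["Y"] = trace["Y"]
    lags = lags.dropna()  # drops the first window-1 rows of features and target together
    return lags.drop(columns="Y").to_numpy(), lags["Y"].to_numpy()


# Hypothetical stand-ins for the five traces, with different lengths
traces = [
    pd.DataFrame({"X": np.arange(n), "Y": np.repeat(10, n)})
    for n in (20, 25, 30, 35, 40)
]

# Shuffle the trace order, then take 60% of the traces for training
rng = np.random.default_rng(0)
order = rng.permutation(len(traces))
n_train = int(round(0.6 * len(traces)))
train_ids, test_ids = order[:n_train], order[n_train:]


def stack(ids):
    """Concatenate the per-trace feature matrices and targets."""
    parts = [lag_features(traces[i]) for i in ids]
    return np.vstack([p[0] for p in parts]), np.concatenate([p[1] for p in parts])


X_train, y_train = stack(train_ids)
X_test, y_test = stack(test_ids)
```

With real data, each entry of traces would come from pd.read_csv on one file, and the resulting arrays can be passed to RandomForestRegressor.fit and predict as before.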
How can we make predictions using a random forest model? I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) using the last N data points of X (for example, N = 5, 10, 100, 300, 1000, etc.) with the random forest model of sklearn in Python. That is, taking [0,0,1,2,3] of the X column as the input for the first window, I want to predict the 5th-row value of Y, trained on the previous values of Y. With a simple rolling OLS regression model we can do it as follows, but I want to do it using a random forest model.
import pandas as pd
df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
window_type='rolling', window=5, intercept=True)
I have solved this problem with random forest, which yields df:
t_stamp X Y X_t1 X_t2 X_t3 X_t4 X_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
.........................................................
[ 10. 10. 10. 10. .................................]
MSE: 1.3273548431
This seems to work fine for window sizes of 5, 10, 15, 20 and 22. However, it doesn't seem to work for window sizes greater than 23 (it prints MSE: 0.0). This is because, as you can see from the dataset, the value of Y is fixed (10) for rows 1 - 23 and then changes to another value (20, and so on) from row 24. How can we train a model and predict in such cases based on the last data points?
It seems with the existing code, when calling dropna, you truncate X but not y. You also train and test on the same data.
Fixing this will give non-zero MSE.
Code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')
df1 = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)
X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values
x = int(len(X) * 0.66)
X_train = X[:x]
X_test = X[x:]
y_train = y[:x]
y_test = y[x:]
reg = RandomForestRegressor()  # squared error is the default criterion; the 'mse' alias was removed in scikit-learn 1.2
reg.fit(X_train, y_train)
modelPred = reg.predict(X_test)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError = mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
print(df1.size)
df2 = df1.iloc[x:, :].copy()
df2['pred'] = modelPred
df2.head()
Output:
[ 267.7 258.26608241 265.07037249 ..., 267.27370169 256.7 272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026
X_0 pred
170625 48 267.700000
170626 66 258.266082
170627 184 265.070372
170628 259 294.700000
170629 271 281.966667
I am using bank data to predict the number of tickets on a daily basis. I am using stacking to get a more accurate result, with the brew library.
Here is the sample dataset for important features:
[]
Here is the target attribute sample:
[]
Here is the code:
from stacked_generalization.lib.stacking import StackedClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
# Stage 1 model
bclf = LogisticRegression(random_state=1)
# Stage 0 models
clfs = [RandomForestClassifier(n_estimators=40, criterion = 'gini', random_state=1),
gbm,
RidgeClassifier(random_state=1)]
sl = StackedClassifier(bclf, clfs)
sl.fit(training.select_columns(features).to_dataframe().as_matrix(), np.array(training['class']))
Here is the training data format:
[[ 21 11 2014 46 4 3]
[ 22 11 2014 46 5 4]
[ 24 11 2014 47 0 4]
...,
[ 30 9 2016 39 4 5]
[ 3 10 2016 40 0 1]
[ 4 10 2016 40 1 1]]
Now, when I try to fit the model, it gives the following error:
I compared my code with the example given in the library, but I still couldn't figure out where I am going wrong. Kindly assist me.
I had a similar issue, and it seems to just be a bug in brew that needs to be fixed. The problem is that c.classes_ (the class labels) returns a numpy array of floats (e.g., if you have two classes it returns [0.0, 1.0] instead of the integers [0, 1]). The code tries to use these floats to index the columns, but you cannot index the columns of a numpy array with floats.
probas.shape: # rows = # training examples; # columns = # of classes
c.predict_proba(X) returns the probabilities for each class for each training example.
probas[:, list(c.classes_)] = c.predict_proba(X)
This should put the probability for each class, for each row in X, into probas, using the class number to index the columns of probas.
This would work if you add astype(int):
probas[:, list(et.classes_.astype(int))] = et.predict_proba(X)
or just
probas = np.copy(et.predict_proba(X))
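The indexing problem can be reproduced with numpy alone. This is a minimal sketch with made-up values: classes stands in for the classifier's classes_ attribute and proba for the output of predict_proba(X); neither comes from the library itself.

```python
import numpy as np

classes = np.array([0.0, 1.0])               # float class labels, as described above
proba = np.array([[0.8, 0.2], [0.3, 0.7]])   # made-up predict_proba output
probas = np.zeros((2, 2))

try:
    # Float labels are not valid column indices, so this assignment raises
    probas[:, list(classes)] = proba
    float_index_ok = True
except (IndexError, TypeError):
    float_index_ok = False

# Casting the labels to int makes them usable as column indices
probas[:, list(classes.astype(int))] = proba
```

After the cast, each column of probas holds the probabilities for the class whose label equals the column number.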