append predictions to new data frame - python

Here is a data example (not real data). I have a 500 x 3 data set with columns named Job Level (numerical), Job Code (categorical) and Stock Value (numerical). I am using linear regression to fit the Stock Value based on Job Level, grouped by Job Code.
For example:
Job Code  Job Level  Job Title            Stock Value
20        1          Production Engineer
20        2          Production Engineer
20        3          Production Engineer  6,985
20        4          Production Engineer  7,852
20        5          Production Engineer
30        1          Production Engineer
30        2          Logistics Analyst
30        3          Logistics Analyst    4,962
30        4          Logistics Analyst    22,613
30        5          Logistics Analyst    31,689
40        1          Logistics Analyst
Here is what I have done. How can I see my original data set columns with the predicted values added? Right now I can only see the predictions, and I cannot join them together because the shapes no longer match: when I first start, df_nonull.shape = (268, 4); after the for loop, df_nonull.shape = (4, 4); and df_results.shape = (89, 2).
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")
df_nonull = df.dropna()

model = LinearRegression()
groups = []
results = []
level = []

for (group, df_nonull) in df_nonull.groupby('Job Code'):
    X = df_nonull[['Job Level']]
    y = df_nonull[['Stock Value']]
    model.fit(X, y)
    coefs = list(zip(X.columns, model.coef_))
    results.append(model.predict(735947)[0])
    groups.append(group)

df_results = pd.DataFrame({'Job Code': groups, 'prediction': results})

print(df_results.head(50))
Just FYI, my main goal here is to run a regression model on the subset with no NaNs (df_nonull) and then apply the linear regression coefficients to the entire data set (df) to predict Stock Value. This has nothing to do with what I am asking, but I wanted to give some background on why I am pursuing this.

Assuming you have a consistent index for the input data and the prediction series, I think what you need is pd.concat.
>>> import pandas as pd
>>> X = pd.DataFrame({'input': [i for i in range(10)]}) ## fake input data
>>> pred = pd.DataFrame({'prediction':[i-5 for i in range(10)]}) ## fake prediction data
>>> pd.concat([X, pred], axis=1)
input prediction
0 0 -5
1 1 -4
2 2 -3
3 3 -2
4 4 -1
5 5 0
6 6 1
7 7 2
8 8 3
9 9 4
I would recommend the pandas docs (0.20.1), specifically the section on concatenation.

You can then use the following command to create a single data frame containing the original data set values and the predicted values:
df_nonull.join(df_results, how="outer")
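That join only lines up if df_results keeps the original row index, which the loop in the question does not preserve (the loop also rebinds the name df_nonull to each group, which is why its shape shrinks to (4, 4)). A minimal sketch of a restructured loop, assuming Job Level is never missing, so predictions carry the index of the rows they were made for:
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")   # same input file as in the question
df_nonull = df.dropna()

preds = []
for code, grp in df_nonull.groupby('Job Code'):
    model = LinearRegression()
    model.fit(grp[['Job Level']], grp['Stock Value'])
    # predict for every original row with this Job Code, keeping its index
    rows = df[df['Job Code'] == code]
    preds.append(pd.Series(model.predict(rows[['Job Level']]), index=rows.index))

df_results = pd.concat(preds).rename('prediction')
print(df.join(df_results).head(50))   # original columns plus the predictions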

Related

Remove outliers within training set in cross-validation

I am trying to do cross-validation, but I want to remove outliers (e.g. keeping only values lower than 0.95y) in the training sets while keeping the test set intact. I am using:
cv_scores = cross_validate(reg, X=X, y=y_tr, cv=GroupKFold(n_splits=3), groups=groups, scoring=scoring, return_train_score=True, verbose=0)
for the cross-validation (a function from sklearn.model_selection); however, I don't know how to make the necessary changes.
Sample:
date  id  x1  x2  y
1     a   10  15  100
2     a   20  30  150
3     a   12  10  130
2     b   15  13  1000
3     b   16  19  90
1     c   18  12  700
2     c   20  15  60
For example: one of the training folds will contain ids a and b. In this case I want to remove the outlier (date 2, id b) while keeping the outlier in the test fold (date 1, id c). Similarly, in the a and c training fold, I should remove (date 1, id c) while keeping (date 2, id b).
Removing outliers is not recommended by most statisticians. Instead, you should analyze the outliers, for example using Cook's distance. Importantly, removing outliers from the training data introduces a distribution shift between the training and test data.
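If you still want to drop outliers inside each training fold, one option (a minimal sketch, not a drop-in replacement for cross_validate) is to loop over the folds yourself. It assumes reg, X, y_tr and groups are the objects from the question, with X and y_tr as NumPy arrays, and it interprets the "0.95y" rule as keeping rows below the 0.95 quantile of y:
import numpy as np
from sklearn.model_selection import GroupKFold

scores = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y_tr, groups=groups):
    X_train, y_train = X[train_idx], y_tr[train_idx]
    # drop training rows above the assumed outlier threshold
    keep = y_train < np.quantile(y_train, 0.95)
    reg.fit(X_train[keep], y_train[keep])
    # score on the untouched test fold
    scores.append(reg.score(X[test_idx], y_tr[test_idx]))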

How to change only second row in multiple groups of a dataframe

For each group of three rows in the data frame df_task, I would like to modify the second row of the column Task.
import pandas as pd
df_task = pd.DataFrame({'Days': [5, 5, 5, 20, 20, 20, 10, 10],
                        'Task': ['Programing', 'Presentation', 'Training', 'Development',
                                 'Presentation', 'Workshop', 'Coding', 'Communication']})
df_task.groupby(["Days"])
This is the expected output: if a group contains three rows, the value of Task from the first row is appended to the value of Task in the second row, as shown in the new column New_Task; if a group has only two rows, nothing is modified:
Days Task New_Task
0 5 Programing Programing
1 5 Presentation Presentation,Programing
2 5 Training Training
3 20 Development Development
4 20 Presentation Presentation,Development
5 20 Workshop Workshop
6 10 Coding Coding
7 10 Communication Communication
Your requirement is pretty straightforward. Try:
import numpy as np

groups = df_task.groupby('Days')
# enumeration of the rows within each group
enums = groups.cumcount()
# sizes of the groups, broadcast to each row
sizes = groups['Task'].transform('size')
# update only the second row of groups larger than two rows
df_task['New_Task'] = np.where(enums.eq(1) & sizes.gt(2),
                               df_task['Task'] + ',' + groups['Task'].shift(fill_value=''),
                               df_task['Task'])
print(df_task)
Output:
Days Task New_Task
0 5 Programing Programing
1 5 Presentation Presentation,Programing
2 5 Training Training
3 20 Development Development
4 20 Presentation Presentation,Development
5 20 Workshop Workshop
6 10 Coding Coding
7 10 Communication Communication
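Note that shift(fill_value='') requires pandas 0.24 or later; on older versions the equivalent shift can be written as:
groups['Task'].shift().fillna('')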

Python: How to migrate an R function that applies an SVM per group in a data frame

I have a data frame (df) that looks like that:
Value Country ID
1 21 RU AAAU9001025
2 24 NG AAAU9001848
3 17 EG ACLU2799370
4 2 EG ACLU2799370
5 56 RU ACLU2799370
I am running an SVM classifier for outlier detection on the value, per country, based on a relatively small sample (max rows: 5K), and I flag in each row whether it is an outlier. So my output is a data frame with an additional logical column that indicates if it is an outlier:
Value Country ID SVM
1 21 RU AAAU9001025 FALSE
2 24 NG AAAU9001848 FALSE
3 17 EG ACLU2799370 FALSE
4 2 EG ACLU2799370 TRUE
5 56 RU ACLU2799370 TRUE
6 25 EG AMFU3022141 FALSE
I am using the following code in R:
library(e1071)
SVM_f = function(x, limit=5000){
  N = min(c(limit, length(x)))
  mdl = svm(x[sample(length(x), N)],
            nu=0.98, type="one-classification", kernel="polynomial")
  predict(mdl, x)
}
res = by(df, df$Country, function(x){
  data.frame(x, SVM = SVM_f(x$Value))
})
res = do.call(rbind, res)
Now I need to migrate this code to Python - please assist.
You could just use R Markdown, which lets you run native Python chunks inside the R environment, so you could leave the R code above as it is.
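If you do want a native Python port, a rough sketch using scikit-learn's OneClassSVM could look like the following. It assumes df has the Value and Country columns shown above and that True should flag an outlier, as in your sample output; note that e1071's and scikit-learn's nu parameters are not guaranteed to behave identically:
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

def svm_flag(values, limit=5000):
    x = values.to_numpy().reshape(-1, 1)
    n = min(limit, len(x))
    mdl = OneClassSVM(nu=0.98, kernel='poly')
    # fit on a random sample of at most `limit` rows, mirroring the R function
    mdl.fit(x[np.random.choice(len(x), n, replace=False)])
    return mdl.predict(x) == -1   # scikit-learn labels outliers as -1

df['SVM'] = False
for country, grp in df.groupby('Country'):
    df.loc[grp.index, 'SVM'] = svm_flag(grp['Value'])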

Can I forward recalculate a value in a pandas dataframe when a value has been reset, e.g. a water meter

I want to forward fill my water meter reading data when resets occur so that the data is clean for analysis. A reset is when the value in the next row is less than the previous one.
My pandas dataframe looks like this:
water
0 31031
1 31037
2 31038
3 31043
4 131 (system was reset)
5 223
6 331
7 412
...
It is possible that there are several resets of the water data in my pandas dataframe.
Research suggests that using for loops/iteration is not the best option with pandas dataframes, so I am trying to avoid them.
I would like to update the dataframe df so that the fact that the system was reset at index 4 is no longer visible and the water figures continue to cumulate.
e.g.
water
0 31031
1 31037
2 31038
3 31043
4 31174 # system reset to 0 so value should be 31043 + 131
5 31266 # continuing with the difference through to the end of df
6 31374
7 31455
...
import pandas as pd
import numpy as np
df = pd.DataFrame({'water': [31031, 31037, 31038, 31043, 131, 223, 331, 412]})
df["waterreset"] = np.where(df["water"] - df["water"].shift(1) < 0, df["water"] + df["water"].shift(1), df["water"])
print(df)
"waterreset" code line above only identifies the one line where the reset occurs and doesn't fill forward, plus I would rather use inplace=True to update the current dataframe.
You can find where the resets are, take the readings just before each reset, and add them to the subsequent values:
# resets
resets = df.water.diff().le(0)
# reading at resets
# cumsum is used to accumulate readings
readings = df.water.shift().where(resets).fillna(0).cumsum()
df.water += readings
Output:
water
0 31031.0
1 31037.0
2 31038.0
3 31043.0
4 31174.0
5 31266.0
6 31374.0
7 31455.0
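As a quick sanity check (with made-up numbers, not from the question) that the same three lines also cope with more than one reset:
import pandas as pd

df = pd.DataFrame({'water': [100, 150, 10, 40, 5, 20]})  # resets at index 2 and 4
resets = df.water.diff().le(0)
readings = df.water.shift().where(resets).fillna(0).cumsum()
df.water += readings
print(df.water.tolist())  # [100.0, 150.0, 160.0, 190.0, 195.0, 210.0]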

How does one interpret the random forest classifier from scikit-learn?

I know a little about how random forests work.
Usually in classification I fit the training data with the random forest classifier and ask it to predict the test data.
Currently I am working on Titanic data that was provided to me. These are the top rows of the data set, which has approximately 1,300 rows.
survived pclass sex age sibsp parch fare embarked
0 1 1 female 29 0 0 211.3375 S
1 1 1 male 0.9167 1 2 151.55 S
2 0 1 female 2 1 2 151.55 S
3 0 1 male 30 1 2 151.55 S
4 0 1 female 25 1 2 151.55 S
5 1 1 male 48 0 0 26.55 S
6 1 1 female 63 1 0 77.9583 S
7 0 1 male 39 0 0 0 S
8 1 1 female 53 2 0 51.4792 S
9 0 1 male 71 0 0 49.5042 C
10 0 1 male 47 1 0 227.525 C
11 1 1 female 18 1 0 227.525 C
12 1 1 female 24 0 0 69.3 C
13 1 1 female 26 0 0 78.85 S
There is no test data given, so I want the random forest to predict survival on the entire data set and compare it with the actual values (more like checking the accuracy score).
So what I have done is divide my complete data set into two parts: one with the features and one with the value to predict (survived).
The features consist of all the columns except survived, and the target consists of the survived column.
dfFeatures = dfCopy.drop('survived', 1)
dfTarget = df['survived']
Note: df is the entire dataset.
Here is the code that checks the score of the random forest:
rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
rfClf = rfClf.fit(dfFeatures, dfTarget)
scoreForRf = rfClf.score(dfFeatures, dfTarget)
I get the score output with something like this
The accuracy score for random forest is : 0.983193277311
I am finding it a little difficult to understand what is happening behind the code given above.
Does it predict survival for all the rows based on the other features (dfFeatures), compare the result with the actual values (dfTarget) and give the prediction score, or does it internally create random train and test splits from the data provided and report accuracy for the test split it generated?
To be more precise, while calculating the accuracy score, does it predict survival for the entire data set or just a random subset of it?
I don't see you trying to split the data set into train and test sets.
dfTarget = df['survived']
dfTarget contains only the column survived, which holds the labels.
dfFeatures = dfCopy.drop('survived', 1)
dfFeatures contains all the features (pclass, sex, age, etc.).
Now, jumping to the code:
rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
The line above creates the random forest classifier. Note that n_estimators is the number of trees in the forest, not the depth of a tree; depth is controlled by max_depth, and overly deep trees are what tend to overfit the data.
rfClf = rfClf.fit(dfFeatures, dfTarget)
The line above is the training step. .fit() needs two parameters: first the features, and second the labels (the target values from the survived column).
scoreForRf = rfClf.score(dfFeatures, dfTarget)
.score() also needs two parameters: the features first and the labels second.
It uses the model produced by .fit() to predict from the features in the first parameter, and validates the predictions against the labels in the second parameter.
From what I see, you are using the same data to train and test the model, which is not good.
To be more precise, while calculating the accuracy score, does it predict survival for the entire data set or just a random subset of it?
You used all of the data to test the model.
I could use cross-validation, but then again the question is: do I have to for random forest? Also, cross-validation for random forest seems to be very slow.
Of course you need to use validation to test your model. Create a confusion matrix and compute precision and recall; don't depend on accuracy alone.
If you think the model runs too slowly, decrease the n_estimators value.
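A minimal sketch of both a held-out test set and cross-validation, reusing the question's dfFeatures and dfTarget and assuming the categorical columns (sex, embarked) have already been encoded as numbers:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    dfFeatures, dfTarget, test_size=0.3, random_state=0)

rfClf = RandomForestClassifier(n_estimators=100)
rfClf.fit(X_train, y_train)
print(rfClf.score(X_test, y_test))   # accuracy on rows the model never saw

# or: 5-fold cross-validation over the full data set
print(cross_val_score(rfClf, dfFeatures, dfTarget, cv=5).mean())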
