Remove outliers within training set in cross-validation - python

I am trying to do cross-validation, but I want to remove outliers (e.g. keep only values lower than 0.95y) from the training sets while keeping the test set intact. I am using:
cv_scores = cross_validate(reg, X=X, y=y_tr, cv=GroupKFold(n_splits=3), groups=groups, scoring=scoring, return_train_score=True, verbose=0)
for the cross-validation (a function from sklearn.model_selection); however, I don't know how to make the necessary changes.
Sample:
date id x1 x2 y
1 a 10 15 100
2 a 20 30 150
3 a 12 10 130
2 b 15 13 1000
3 b 16 19 90
1 c 18 12 700
2 c 20 15 60
For example: one of the training folds will contain ids a and b. In this case I want to remove the outlier (date 2, id b) while keeping the outlier in the test fold (date 1, id c). Similarly, in the training fold with a and c, I should remove (date 1, id c) while keeping (date 2, id b).

Removing outliers is not recommended by most statisticians. Instead, you should analyze the outliers, for example using Cook's distance. Importantly, removing outliers from the training data will cause data shift.
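That said, if you still want to filter only the training folds, one option is to skip cross_validate and loop over the GroupKFold splits yourself. A rough sketch, assuming X, y and groups are numpy arrays, reg is your estimator, and using "keep y below its 95th percentile" as a stand-in for whatever outlier rule you actually want:
import numpy as np
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=3)
test_scores = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # hypothetical outlier rule: keep training rows whose y is below the 95th percentile
    keep = y_train < np.quantile(y_train, 0.95)
    reg.fit(X_train[keep], y_train[keep])

    # the test fold is left untouched
    test_scores.append(reg.score(X_test, y_test))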

Special case of counting empty cells "before" an occupied cell in Pandas

Pandas question here.
I have a specific dataset in which we are sampling subjective ratings several times per second. The information is sorted as below. What I need is a way to "count" the number of blank cells before every "second" (i.e. the "1"s in the seconds column, which occur at regular intervals), so I can feed that value into a greatest-common-factor equation and create somewhat of a linear extrapolation based on milliseconds. In the example below that number would be "2", and I would feed that into the GCF formula. The end goal is to make a more accurate/usable timestamp. Sampling rates may vary by dataset.
index  rating  seconds
1      26
2      28
3      30      1
4      33
5      40
6      45      1
7      50
8      48
9      49      1
If you just want to count the number of NaNs before the first 1:
df['seconds'].isna().cummin().sum()
If you have another value (e.g. empty string)
df['seconds'].eq('').cummin().sum()
Output: 2
Or, if you have a range Index:
df['seconds'].first_valid_index()
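For reference, a quick reproduction of the example above, assuming the blank cells come through as NaN:
import numpy as np
import pandas as pd

# sample data from the question, with blanks as NaN
df = pd.DataFrame({
    'rating': [26, 28, 30, 33, 40, 45, 50, 48, 49],
    'seconds': [np.nan, np.nan, 1, np.nan, np.nan, 1, np.nan, np.nan, 1],
})

print(df['seconds'].isna().cummin().sum())   # 2
print(df['seconds'].first_valid_index())     # 2 with the default RangeIndex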

looking for better iteration approach for slicing a dataframe

First post: I apologize in advance for sloppy wording (and possibly poor searching, if this question has been answered ad nauseam elsewhere - maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
For this answer, we're going to repurpose the approach from "Bin pandas dataframe by every X rows".
We'll create a dataframe with missing data in the 'seconds' column, which is how I understand your data based on the description given:
secs=[1,2,3,4,5,6,7,8,9,11,12,14,15,17,19]
data = [np.random.randint(-25,54)/100000 for _ in range(15)]
df=pd.DataFrame(data=zip(secs,data), columns=['seconds','input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay; we'll say we want to bin every 5 seconds instead of every 60. Now, because we're just trying to use equal time measures for grouping, we can use floor division to see which bin each time interval ends up in. (In your case, you'd divide by 60 instead.)
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)  # drop the extra 'seconds' column because we don't care about the root sum of squares of the seconds themselves
grouped
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
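For the original 600-second data, the same idea applied only to the 'input' column gives you the list directly. A sketch, assuming 'seconds' runs from 1 to 600 as described:
# (seconds - 1) // 60 maps seconds 1-60 to bin 0, 61-120 to bin 1, ..., 541-600 to bin 9
output = df.groupby((df['seconds'] - 1) // 60)['input'].apply(realized_volatility).tolist()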

Average for similar looking data in a column using Pandas

I'm working on a large data set with more than 60K rows.
I have a continuous measurement of current in a column. A code is measured for about a second, during which the equipment takes 14/15/16/17 readings depending on the equipment speed; then the measurement moves to the next code and again takes 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where the similar data (the 14/15/16/17 readings) are averaged into a separate column for each code measurement. I have been thinking of doing this with pandas.
I want the data to look like this:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
Need some help to get this done. Please help
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check whether it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (three, in the case of your sample data) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values as a single column rather than a list), convert the list to a pd.Series and reset_index(drop=True):
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
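If you do want to keep more columns available (the former case above), one possible alternative, sketched here, is to label each block with a cumulative sum of the jump mask and group on that label instead of splitting:
# each True in the jump mask starts a new block id; gives the same averages as above
block = (df['Curr(mA)'].diff() > 0.15).cumsum()
df.groupby(block)['Curr(mA)'].mean().reset_index(drop=True)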

append predictions to new data frame

A data example (not real data) can also be seen here. I have a data set of 3x500 with column names of: Job Level (numerical), Job Code (categorical) and Stock Value (numerical). I am using linear regression to fit the Stock Values based on Job Levels, grouped by Job Code.
For example:
Job Code Job Level Job Title Stock Value
20 1 Production Engineer
20 2 Production Engineer
20 3 Production Engineer 6,985
20 4 Production Engineer 7,852
20 5 Production Engineer
30 1 Production Engineer
30 2 Logistics Analyst
30 3 Logistics Analyst 4,962
30 4 Logistics Analyst 22,613
30 5 Logistics Analyst 31,689
40 1 Logistics Analyst
Here is what I have done. How can I see my data set columns (the original data) with the predicted values added? Right now I can only see the predictions. I cannot join them together because of the following:
Here is the situation: when I first start my code, df_nonull.shape = (268, 4); then after the for loop df_nonull.shape = (4, 4), and df_results.shape = (89, 2). As a result, I am not able to join them.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")
df_nonull = df.dropna()

model = LinearRegression()
groups = []
results = []
level = []

for (group, df_nonull) in df_nonull.groupby('Job Code'):
    X = df_nonull[['Job Level']]
    y = df_nonull[['Stock Value']]
    model.fit(X, y)
    coefs = list(zip(X.columns, model.coef_))
    results.append(model.predict(735947)[0])
    groups.append(group)

df_results = pd.DataFrame({'Job Code': groups, 'prediction': results})

print df_results.head(50)
Just FYI, my main goal here is running a regression model on the data set where there are no NaNs (df_nonull), and applying the linear regression coefficients to the entire data (for Stock Values, y) (df). This has nothing to do with what I am asking, but I wanted to give some background info about why I am pursuing this.
Assuming you have a consistent index for the input data and the prediction series, I think what you need is pd.concat.
import pandas as pd
>>> X = pd.DataFrame({'input': [i for i in range(10)]}) ## fake input data
>>> pred = pd.DataFrame({'prediction':[i-5 for i in range(10)]}) ## fake prediction data
>>> pd.concat([X, pred], axis=1)
input prediction
0 0 -5
1 1 -4
2 2 -3
3 3 -2
4 4 -1
5 5 0
6 6 1
7 7 2
8 8 3
9 9 4
I would recommend the pandas documentation (0.20.1), specifically this section on concatenation.
You can then use the following command to create a single data frame containing the data set values and the predicted values:
df_nonull.join(df_results,how="outer")
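If the shapes don't match because the predictions were collected per group, another option is to predict per group against the full data and keep the original index, so that everything aligns when joining back. A rough sketch along the lines of your code (the file name and column names are taken from the question; this is an assumption about what you want, not a drop-in fix):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")
df_nonull = df.dropna()

pred_parts = []
for group, part in df_nonull.groupby('Job Code'):
    # fit on the non-null rows of this Job Code
    model = LinearRegression().fit(part[['Job Level']], part['Stock Value'])
    # predict for every row of the full data belonging to this Job Code
    full_part = df[df['Job Code'] == group]
    pred = model.predict(full_part[['Job Level']])
    pred_parts.append(pd.Series(pred, index=full_part.index))

# index alignment puts each prediction next to its original row
df['prediction'] = pd.concat(pred_parts)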

How does one interpret the random forest classifier from sci-kit learn?

I know little about how random forest works.
Usually in classification I can fit the training data to the random forest classifier and ask it to predict the test data.
Currently I am working on the Titanic data that was provided to me. These are the top rows of the data set; there are approximately 1300 rows.
survived pclass sex age sibsp parch fare embarked
0 1 1 female 29 0 0 211.3375 S
1 1 1 male 0.9167 1 2 151.55 S
2 0 1 female 2 1 2 151.55 S
3 0 1 male 30 1 2 151.55 S
4 0 1 female 25 1 2 151.55 S
5 1 1 male 48 0 0 26.55 S
6 1 1 female 63 1 0 77.9583 S
7 0 1 male 39 0 0 0 S
8 1 1 female 53 2 0 51.4792 S
9 0 1 male 71 0 0 49.5042 C
10 0 1 male 47 1 0 227.525 C
11 1 1 female 18 1 0 227.525 C
12 1 1 female 24 0 0 69.3 C
13 1 1 female 26 0 0 78.85 S
There is no test data given. So I want the random forest to predict survival on the entire data set and compare it with the actual values (more like checking the accuracy score).
So what I have done is divide my complete dataset into two parts: one with the features and the other with the value to predict (survived).
Features consists of all the columns except survived, and predict consists of the survived column.
dfFeatures = df['survived']
dfTarget = dfCopy.drop('survived', 1)
Note: df is the entire dataset.
Here is the code that checks the score of the random forest:
rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
rfClf = rfClf.fit(dfFeatures, dfTarget)
scoreForRf = rfClf.score(dfFeatures, dfTarget)
I get a score output something like this:
The accuracy score for random forest is : 0.983193277311
I am finding it a little difficult to understand what is happening behind the code given above.
Does it predict survival for all the rows based on the other features (dfFeatures), compare it with the test data (dfTarget) and give the prediction score, or does it randomly create train and test data from the data provided and compare accuracy on the test data it generated behind the scenes?
To be more precise: while calculating the accuracy score, does it predict survival for the entire data set or just a random partial data set?
Somehow I don't see you trying to split the dataset into train and test sets.
dfWithTestFeature = df['survived']
dfWithTestFeature contains only the survived column, which holds the labels.
dfWithTrainFeatures = dfCopy.drop('survived', 1)
dfWithTrainFeatures contains all the features (pclass, sex, age, etc.).
And now, jumping to the code:
rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
The line above creates the random forest classifier; n_estimators is the number of trees in the forest (it is tree depth, via max_depth, that most directly drives overfitting, not the number of trees).
rfClf = rfClf.fit(dfWithTrainFeatures, dfWithTestFeature)
The line above is the training process; .fit() needs 2 parameters: the first is the features, and the second is the labels (the target values, i.e. the values from the 'survived' column).
scoreForRf = rfClf.score(dfWithTrainFeatures, dfWithTestFeature)
.score() needs 2 parameters: the 1st is the features and the 2nd is the labels.
It uses the model we created with .fit() to predict from the features in the 1st parameter, while the 2nd parameter is used as the validation values.
From what I see, you're using the same data to train and test the model, which is not good.
To be more precise, while calculating the accuracy score does it predict the survival for entire data set or just random partial data set?
You used all the data to test the model.
I could use cross validation but then again the question is: do I have to for random forest? Also, cross-validation for random forest seems to be very slow.
Of course you need to use validation to test your model. Create a confusion matrix, compute precision and recall, and don't just depend on the accuracy.
If you think the model is running too slow, then decrease the n_estimators value.
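As an example, here is a minimal sketch of a proper hold-out split plus cross-validation (assuming X holds the already-encoded features and y the 'survived' column; these names are placeholders, not your variables):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# hold out 30% of the rows that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))               # accuracy on unseen data
print(cross_val_score(rf, X, y, cv=5).mean()) # 5-fold cross-validated accuracy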
