Multiclass classification: how to balance classes in Python (oversampling)

I have a classification problem. The training set has 50,000 rows and Y has 60 labels, but the data is imbalanced: one class has 35,000 values, while the other 59 classes share the remaining 15,000 values, some of them with only about 30 values. For example, X has (column_1, column_2, column_3) and Y:
column_1 column_2 column_3 Y
0.5 1 2 1
0.5 1.1 2 1
0.55 0.95 3 1
0.1 1 2 2
2 0.9 3 3
I need to add "noisy" data so that there is no imbalance, i.e. so that all classes end up with roughly the same number of rows:
column_1 column_2 column_3 Y
0.5 1 2 1
0.5 1.1 2 1
0.55 0.95 3 1
0.1 1 2 2
0.15 0.99 2 2
0.05 1.01 2 2
2 0.9 3 3
1.95 0.95 3 3
2.05 0.85 3 3
This is only a toy example; the real data has many more values.

Although the question is not exactly clear, I think you're looking for help with oversampling the minority classes. A common approach would be the SMOTE algorithm, which you can find in the imblearn package.
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)  # by default, every minority class is oversampled up to the size of the majority class
X_res, Y_res = sm.fit_resample(X_train, Y_train)
Just make sure you split your data into train and test sets first, and then over-sample only the training set, so you don't end up with synthetic copies of the same samples in both.
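A minimal end-to-end sketch of that workflow (a sketch under assumptions: X and Y are the feature matrix and label column from the question, and the split ratio is arbitrary):
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first so the test set never contains synthetic samples.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)

# Over-sample only the training set; by default SMOTE brings every minority
# class up to the size of the majority class.
sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X_train, Y_train)
Note that SMOTE interpolates between existing neighbours, so classes with only a handful of samples (like the ~30-value classes mentioned above) may need a smaller k_neighbors or a simpler method such as RandomOverSampler.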

Related

Take average of range entries and replace them in pandas column

I have a dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
df = pd.DataFrame([0.647,0.88,0,0.73,'1.7 - 2.1','1.2 - 1.5',2.5 ,np.NaN,'1.5 - 1.9','1.3 - 1.5',0.4,'1.7 - 2.9'],columns=['Average Weight (Kg)'])
I would like to take the average of the range entries and replace them in the dataframe, e.g. 1.7 - 2.1 will be replaced by 1.9. The following code doesn't work: TypeError: 'float' object is not iterable
np.where(df['Average Weight (Kg)'].str.contains('-'),
         df['Average Weight (Kg)'].str.split('-').apply(lambda x: statistics.mean(list(map(float, x)))),
         df['Average Weight (Kg)'])
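The TypeError comes from the cells that are not strings: .str.split returns NaN for them, and map(float, ...) then tries to iterate over that float. A minimal sketch of how the np.where idea could be kept working, under the assumption that the column is cast to string first so that every cell can be split:
import statistics
import numpy as np

s = df['Average Weight (Kg)'].astype(str)   # every cell becomes a string, including NaN -> 'nan'
df['Average Weight (Kg)'] = np.where(
    s.str.contains('-'),
    s.str.split('-').apply(lambda parts: statistics.mean(map(float, parts))),
    df['Average Weight (Kg)'])
df['Average Weight (Kg)'] = df['Average Weight (Kg)'].astype(float)   # ranges become their means; plain numbers and NaN are untouched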
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = (df['Average Weight (Kg)'].astype(str).str.split(r'\s-\s')
                             .explode().astype(float).groupby(level=0).mean())
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Edit: slight change to avoid creating a new column.
You could go for something like this (I renamed your column to avg, because the original name was long to type :-) ):
new_average = (df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float)) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg, dtype: float64

boxplot at a certain x, y location in matplotlib, seaborn

I have some data which is organized as follows:
x y score
1 1 0.951
1 1 0.956
1 1 0.976
1 1 0.875
1.5 1.5 0.94
1.5 1.5 0.76
1.5 1.5 0.88
...
10 10 0.51
10 10 0.66
So what I want to do is aggregate the data by (x, y) values and then show a box plot for the score at those (x, y) values.
I realize that I will need 2 y-axes and I think matplotlib allows one to do that. I see an example here: https://matplotlib.org/gallery/api/two_scales.html
However, I am not sure if it is even possible to arrange this so that the scales will correspond to the means of these datasets.
So, I am guessing my question is whether this can be done or if there is a recommendation on how to visualise this sort of data?
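One rough sketch rather than a definitive answer: instead of two y-axes, matplotlib's boxplot accepts a positions argument, so each score distribution can be drawn at its group's x value. This assumes the data is in a DataFrame df with columns x, y and score, and that x alone identifies each group (in the sample, x and y move together):
import matplotlib.pyplot as plt

grouped = df.groupby(['x', 'y'])['score']
positions = [x for (x, y) in grouped.groups]       # place each box at its group's x value
data = [vals.values for _, vals in grouped]

fig, ax = plt.subplots()
ax.boxplot(data, positions=positions, widths=0.3)  # widths kept small for the 0.5-spaced x values
ax.set_xlabel('x')
ax.set_ylabel('score')
plt.show()
If the y value also needs to be visible, it could be encoded in the box colour or appended to the tick labels.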

Does the test set need data cleaning in machine learning?

I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv); the target is predicting the tip amount. The raw data looks like this (2 data samples):
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \
0 2 2017-04-01 00:03:54 2017-04-01 00:20:51 N
1 2 2017-04-01 00:00:29 2017-04-01 00:02:44 N
RatecodeID PULocationID DOLocationID passenger_count trip_distance \
0 1 25 14 1 5.29
1 1 263 75 1 0.76
fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \
0 18.5 0.5 0.5 1.00 0.0 NaN
1 4.5 0.5 0.5 1.45 0.0 NaN
improvement_surcharge total_amount payment_type trip_type
0 0.3 20.80 1 1.0
1 0.3 7.25 1 1.0
There are five different 'payment_type' values, indicated by the numbers 1, 2, 3, 4, 5.
I find that the 'tip_amount' is only meaningful when 'payment_type' is 1; 'payment_type' 2, 3, 4 and 5 all have zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behaviour?
Second question: We know that the 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5; it is just not correctly recorded. If I drop these samples and only keep 'payment_type' == 1, then the model cannot predict a zero tip for 'payment_type' 2, 3, 4, 5 on an unseen test dataset, so I have to keep 'payment_type' as an important feature, right?
Third question: Let's say I keep all the 'payment_type' samples and the model is able to predict a zero tip amount for 'payment_type' 2, 3, 4, 5. But is this what we really want? The underlying true tip should not be zero; that is just how the data was recorded.
A common saying in machine learning goes: garbage in, garbage out. Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since 'payment_type' 2, 3, 4 and 5 all result in a zero tip, why not keep it simple: replace every payment type that is not 1 with 0. This lets your model easily associate 1 with a recorded tip and 0 with no recorded tip, and it reduces the amount your model has to learn.
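A one-line sketch of that replacement, assuming the raw DataFrame from the question:
# keep 1 (the payment type with recorded tips) and map every other type to 0
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)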
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is what we want or not what we want is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides making some changes to your payment_type features, you should also look into normalizing your data, which will help your machine learning algorithm better generalize relations between your data.
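A minimal sketch of the normalization step with scikit-learn's StandardScaler; which columns to scale is an assumption here, and in a real pipeline the scaler should be fit on the training split only:
from sklearn.preprocessing import StandardScaler

numeric_cols = ['trip_distance', 'fare_amount', 'tolls_amount']   # assumed subset of numeric features
scaler = StandardScaler()
raw[numeric_cols] = scaler.fit_transform(raw[numeric_cols])       # zero mean, unit variance per column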

Predictive modelling - Regression with grouped id's and moving average (python)

I have a predictive modelling problem. Hopefully someone has time and can help me. The starting data is shown below; s1-s3 are sensor measurements and RUL is my target value.
DataStructure:
id period s1 s2 s3 RUL
1 1 510.23 643.43 1585.29 6
1 2 512.34 644.89 1586.12 5
1 3 514.65 645.11 1587.99 4
1 4 512.98 647.59 1588.45 3
1 5 516.34 649.04 1590.65 2
1 6 518.12 652.62 1593.09 1
2 1 509.77 640.61 1584.91 9
2 2 510.26 642.06 1586.00 8
2 3 511.95 643.62 1588.09 7
2 4 513.51 646.51 1589.45 6
2 5 512.17 648.06 1589.54 5
2 6 515.56 646.11 1586.22 4
2 7 518.78 649.34 1586.96 3
2 8 519.90 650.30 1588.95 2
2 9 521.05 651.39 1591.34 1
3 1 501.11 653.99 1580.45 8
3 2 511.45 643.23 1584.09 7
3 3 505.45 643.78 1586.11 6
3 4 504.45 643.43 1588.34 5
3 5 506.45 643.71 1589.89 4
3 6 511.45 643.33 1591.21 3
3 7 516.45 643.61 1592.42 2
3 8 518.45 643.05 1596.77 1
Target:
My goal is to predict the remaining useful life (RUL) of unseen data. In this case I have only one type of machine with different ids (that means one type and three different physical systems). For the prediction the id doesn't matter, because it's the same machine type. Furthermore, I want to add new features: the moving averages of s1, s2 and s3. So I have to add three new columns with the names a1, a2 and a3.
For instance, a1 should look like:
a1
NaN
NaN
512.41
513.32
514.66
515.81
NaN
NaN
510.66
511.91
512.54
513.75
515.50
518.08
519.91
NaN
NaN
506.00
507.12
505.45
507.45
511.45
515.45
The next problem is that I can't work with NaN, because it's a string. How can I ignore or work with it for a1, a2 and a3?
Next question: how can I use regression models like RandomForest and Bagged Decision Trees with a train_test_split to predict the RUL of unseen new data? (Of course I need more data; this example just gives the structure.) [s1], [s2], [s3] are my inputs and RUL is the output.
Furthermore, I want to evaluate the model with Mean Absolute Error, Mean Squared Error and R².
Finally, I want to use grid search for tuning.
Thank you:
Thank you in advance. I know what I want to do, but I'm not able to realize it in Python. A complete code example would be perfect.
The standard way of solving this problem is through imputation. scikit-learn has a built-in class for imputation. The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
There are 3 strategies for replacing the NaNs:
1) replacing them with the mean of the column
2) replacing them with the most frequent value (mode) of the column
3) replacing them with the median of the column
An example of usage would look like this:
from sklearn.preprocessing import Imputer
imp = Imputer(strategy='mean', axis=0)   # axis=0 imputes each column with its own statistic
a1 = imp.fit_transform(a1)               # a1 must be 2-D, e.g. df[['a1']]
There are also usage examples available here: http://scikit-learn.org/stable/modules/preprocessing.html#imputation
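The imputation aside, the moving-average features themselves can be built per id with a groupby plus a rolling mean, and the rest of the question (RandomForest, train_test_split, MAE/MSE/R², grid search) fits a standard scikit-learn workflow. A rough sketch, with an assumed window of 3 periods and an assumed parameter grid, not a complete solution:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# df is the DataFrame from the question.
# Moving averages per machine id; the first window-1 rows of each id are NaN by construction.
window = 3
for col in ['s1', 's2', 's3']:
    df['a' + col[1]] = df.groupby('id')[col].transform(lambda s: s.rolling(window).mean())

# Simplest way to deal with the NaNs: drop the rows where the window is not yet full.
data = df.dropna(subset=['a1', 'a2', 'a3'])

X = data[['s1', 's2', 's3', 'a1', 'a2', 'a3']]
y = data['RUL']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Grid search over a small, assumed parameter grid.
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 5, 10]},
                    cv=3)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print('MAE:', mean_absolute_error(y_test, pred))
print('MSE:', mean_squared_error(y_test, pred))
print('R2 :', r2_score(y_test, pred))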

Rolling Linear Fit with Python DataFrame

I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. And the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I am not understanding this well, because I was expecting a string of numbers but I get just one result:
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling'; if you specify a window, the window type will automatically be set to rolling.
Source.
I have problems doing this with the DatetimeIndex you created with pd.date_range, and I find datetimes a confusing pain to work with in general, due to the number of types out there and the apparent incompatibility between APIs. Here's how I would do it if the date in your example were an integer (e.g. days since 12/31/99, or years) or a float. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()  # duplicate every row and keep the copies adjacent
df_dbl = df_dbl.iloc[1:-1]                 # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
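As a rough sketch of that groupby/linregress step (fitting B against date within each two-row group and keeping only the slope), which matches the values listed below:
from scipy.stats import linregress

# one two-row group per index value; the slope of B vs. date is the rolling slope
slopes = df_dbl.groupby(df_dbl.index).apply(lambda g: linregress(g['date'], g['B']).slope)
print(slopes)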
1 0.4
2 2.1
3 1.2
4 1.7
