scikit-learn Linear Regression (Final Prediction always 0) - Python

I'm trying to do simple linear regression using this small dataset (screenshot).
The dataset contains record counts divided into time blocks of 4 years each (except for the second-to-last time block, 2016-2018).
What I'm trying to do is predict the number of records for the 2019-2022 time block. To do this, I added a 2019-2022 time block with all of its rows set to 0 (since nothing has been made during that time yet, it being in the future). I did that to accommodate the syntax of sklearn's train_test_split and went with this code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv("TCO.csv")
df = df[['2000-2003', '2004-2007', '2008-2011','2012-2015','2016-2018','2019-2022']]
linreg = LinearRegression()
X1_train, X1_test, y1_train, y1_test = train_test_split(
    df[['2000-2003', '2004-2007', '2008-2011', '2012-2015', '2016-2018']],
    df['2019-2022'], test_size=0.4, random_state=42)
linreg.fit(X1_train, y1_train)
linreg.intercept_
list( zip( ['2000-2003','2004-2007','2008-2011','2012-2015','2016-2018'],list(linreg.coef_)))
y1_pred = linreg.predict(X1_test)
print(y1_pred)
test_pred_df = pd.DataFrame({'actual': y1_test,
                             'predicted': np.round(y1_pred, 2),
                             'residuals': y1_test - y1_pred})
print(test_pred_df[0:10].to_string())
For some reason, the algorithm always returns 0 as the final prediction for every row, with 0 residuals (this is because the 2019-2022 time block is all zeros, so the model is being trained on a target that is always 0).
I think I did something wrong but I can't tell what it is (I'm a beginner in this topic). Can someone point out what went wrong and how to fix it?
Edit: I added a copy-able version of the data:
df = pd.DataFrame({'Country:': ['Brunei', 'Cambodia', 'Indonesia', 'Laos',
                                'Malaysia', 'Myanmar', 'Philippines', 'Singapore',
                                'Thailand', 'Vietnam'],
                   '2000-2003': [0, 0, 14, 1, 6, 0, 25, 8, 26, 8],
                   '2004-2007': [0, 3, 15, 6, 21, 0, 37, 11, 44, 36],
                   '2008-2011': [0, 5, 31, 9, 75, 0, 58, 27, 96, 61],
                   '2012-2015': [5, 11, 129, 35, 238, 3, 99, 65, 170, 96],
                   '2016-2018': [6, 22, 136, 17, 211, 10, 66, 89, 119, 88]})

Based on your data, I think this is what you're asking for [Edit: see the updated version below]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'Country:': ['Brunei', 'Cambodia', 'Indonesia', 'Laos',
                                'Malaysia', 'Myanmar', 'Philippines', 'Singapore',
                                'Thailand', 'Vietnam'],
                   '2000-2003': [0, 0, 14, 1, 6, 0, 25, 8, 26, 8],
                   '2004-2007': [0, 3, 15, 6, 21, 0, 37, 11, 44, 36],
                   '2008-2011': [0, 5, 31, 9, 75, 0, 58, 27, 96, 61],
                   '2012-2015': [5, 11, 129, 35, 238, 3, 99, 65, 170, 96],
                   '2016-2018': [6, 22, 136, 17, 211, 10, 66, 89, 119, 88]})

# create a transposed version with the country names as the column headers
df_T = df.T
df_T.columns = df_T.loc["Country:"]
df_T = df_T.drop("Country:")

# create a new column for the target
df["2019-2022"] = np.nan

# now fit a model per country and add the prediction
for country in df_T:
    y = df_T[country].values.astype(float)
    X = np.arange(0, len(y)).reshape(-1, 1)  # time-block index 0..4
    m = LinearRegression()
    m.fit(X, y)
    # index 5 (= len(y)) is the next time block, i.e. 2019-2022
    df.loc[df["Country:"] == country, "2019-2022"] = m.predict([[len(y)]])[0]

print(df)
This prints:
Country: 2000-2003 2004-2007 2008-2011 2012-2015 2016-2018 2019-2022
Brunei 0 0 0 5 6 7.3
Cambodia 0 3 5 11 22 23.8
Indonesia 14 15 31 129 136 172.4
Laos 1 6 9 35 17 31.9
Malaysia 6 21 75 238 211 298.3
Myanmar 0 0 0 3 10 9.5
Philippines 25 37 58 99 66 100.2
Singapore 8 11 27 65 89 104.8
Thailand 26 44 96 170 119 184.6
Vietnam 8 36 61 96 88 123.8
Forget about my comment regarding shift(). I thought about it, but it makes no sense for this small amount of data, I think. Still, treating each country's series as a time series and looking into time series methods may be worthwhile for you.
Edit:
Excuse me. The above code is unnecessarily complicated; it was just the result of me going through it step by step. Of course, it can simply be done row by row like this:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'Country:': ['Brunei', 'Cambodia', 'Indonesia', 'Laos',
                                'Malaysia', 'Myanmar', 'Philippines', 'Singapore',
                                'Thailand', 'Vietnam'],
                   '2000-2003': [0, 0, 14, 1, 6, 0, 25, 8, 26, 8],
                   '2004-2007': [0, 3, 15, 6, 21, 0, 37, 11, 44, 36],
                   '2008-2011': [0, 5, 31, 9, 75, 0, 58, 27, 96, 61],
                   '2012-2015': [5, 11, 129, 35, 238, 3, 99, 65, 170, 96],
                   '2016-2018': [6, 22, 136, 17, 211, 10, 66, 89, 119, 88]})

# create a new column for the target
df["2019-2022"] = np.nan

# fit one model per row (country) on the time-block index 0..4
for idx, row in df.iterrows():
    y = row.drop(["Country:", "2019-2022"]).values.astype(float)
    X = np.arange(0, len(y)).reshape(-1, 1)
    m = LinearRegression()
    m.fit(X, y)
    # len(y) == 5 is the index of the next time block (2019-2022)
    df.loc[idx, "2019-2022"] = m.predict([[len(y)]])[0]
1500 rows should be no problem.
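If the row-by-row loop ever does become a bottleneck, here is a minimal sketch of a vectorized alternative using np.polyfit, which fits a degree-1 line to every country in a single call (it assumes the same example df and column names as above):
import numpy as np

year_cols = ['2000-2003', '2004-2007', '2008-2011', '2012-2015', '2016-2018']
Y = df[year_cols].to_numpy(dtype=float).T   # shape (5 time blocks, 10 countries)
x = np.arange(Y.shape[0])                   # time-block index 0..4

# np.polyfit fits one straight line per column of Y in a single call
slope, intercept = np.polyfit(x, Y, deg=1)
df["2019-2022"] = slope * len(x) + intercept  # evaluate each line at the next index, 5
This gives the same predictions as the per-country LinearRegression loop.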

Related

Pre-processing single feature containing different scales

How do I preprocess this data containing a single feature with different scales? This will then be used for supervised machine learning classification.
Data
import pandas as pd
import numpy as np
rng = np.random.default_rng(4)  # seed the generator so the example is reproducible
df_eur_jpy = pd.DataFrame({"value": rng.uniform(0.07, 3.85, 50)})
df_usd_cad = pd.DataFrame({"value": rng.uniform(0.0004, 0.02401, 50)})
df_usd_cad["ticker"] = "usd_cad"
df_eur_jpy["ticker"] = "eur_jpy"
df = pd.concat([df_eur_jpy,df_usd_cad],axis=0)
df.head(1)
value ticker
0 0.161666 eur_jpy
We can see that the different tickers contain data on different scales by looking at the min/max of this groupby:
df.groupby("ticker")["value"].agg(['min', 'max'])
min max
ticker
eur_jpy 0.079184 3.837519
usd_cad 0.000405 0.022673
I have many tickers in my real data and would like to combine all of these into one feature (pandas column) and use it with a scikit-learn estimator for supervised machine learning classification.
If I understand correctly (IIUC), you can use the min-max scaling formula: x_scaled = (x - x_min) / (x_max - x_min).
You can apply this formula to your dataframe with sklearn.preprocessing.MinMaxScaler like below:
from sklearn.preprocessing import MinMaxScaler
df2 = df.pivot(columns='ticker', values='value')
# ticker eur_jpy usd_cad
# 0 3.204568 0.021455
# 1 1.144708 0.013810
# ...
# 48 1.906116 0.002058
# 49 1.136424 0.022451
df2[['min_max_scl_eur_jpy', 'min_max_scl_usd_cad']] = MinMaxScaler().fit_transform(df2[['eur_jpy', 'usd_cad']])
print(df2)
Output:
ticker eur_jpy usd_cad min_max_scl_eur_jpy min_max_scl_usd_cad
0 3.204568 0.021455 0.827982 0.896585
1 1.144708 0.013810 0.264398 0.567681
2 2.998154 0.004580 0.771507 0.170540
3 1.916517 0.003275 0.475567 0.114361
4 0.955089 0.009206 0.212517 0.369558
5 3.036463 0.019500 0.781988 0.812471
6 1.240505 0.006575 0.290608 0.256373
7 1.224260 0.020711 0.286163 0.864584
8 3.343022 0.020564 0.865864 0.858280
9 2.710383 0.023359 0.692771 0.978531
10 1.218328 0.008440 0.284540 0.336588
11 2.005472 0.022898 0.499906 0.958704
12 2.056680 0.016429 0.513916 0.680351
13 1.010388 0.005553 0.227647 0.212368
14 3.272408 0.000620 0.846543 0.000149
15 2.354457 0.018608 0.595389 0.774092
16 3.297936 0.017484 0.853528 0.725720
17 2.415297 0.009618 0.612035 0.387285
18 0.439263 0.000617 0.071386 0.000000
19 3.335262 0.005988 0.863740 0.231088
20 2.767412 0.013357 0.708375 0.548171
21 0.830678 0.013824 0.178478 0.568255
22 1.056041 0.007806 0.240138 0.309336
23 1.497400 0.023858 0.360896 1.000000
24 0.629698 0.014088 0.123489 0.579604
25 3.758559 0.020663 0.979556 0.862509
26 0.964214 0.010302 0.215014 0.416719
27 3.680324 0.023647 0.958150 0.990918
28 3.169445 0.017329 0.818372 0.719059
29 1.898905 0.017892 0.470749 0.743299
30 3.322663 0.020508 0.860293 0.855869
31 2.735855 0.010578 0.699741 0.428591
32 2.264645 0.017853 0.570816 0.741636
33 2.613166 0.021359 0.666173 0.892456
34 1.976168 0.001568 0.491888 0.040928
35 3.076169 0.013663 0.792852 0.561335
36 3.330470 0.013048 0.862429 0.534891
37 3.600527 0.012340 0.936318 0.504426
38 0.653994 0.008665 0.130137 0.346288
39 0.587896 0.013134 0.112052 0.538567
40 0.178353 0.011326 0.000000 0.460781
41 3.727127 0.016738 0.970956 0.693658
42 1.719622 0.010939 0.421696 0.444123
43 0.460177 0.021131 0.077108 0.882665
44 3.124722 0.010328 0.806136 0.417826
45 1.011988 0.007631 0.228085 0.301799
46 3.833281 0.003896 1.000000 0.141076
47 3.289872 0.017223 0.851322 0.714495
48 1.906116 0.002058 0.472721 0.062020
49 1.136424 0.022451 0.262131 0.939465
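If you have many tickers and would rather keep the long format (a single value column), here is an equivalent sketch that scales each ticker group in place with groupby().transform instead of pivoting:
# min-max scale each ticker's values separately, keeping the long format
df["min_max_scl"] = df.groupby("ticker")["value"].transform(
    lambda s: (s - s.min()) / (s.max() - s.min()))
Each ticker then ends up on a 0-1 scale in one column, which you can feed to your estimator.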

Looped regression model in Python/Sklearn

I'm trying to systematically regress a couple of different dependent variables (countries) on the same set of inputs/independent variables, and I want to do this in a looped fashion in Python using sklearn. The dependent variables look like this:
Europe UK Japan USA Canada
Jan-10 10 13 39 42 16
Feb-10 13 16 48 51 19
Mar-10 15 18 54 57 21
Apr-10 12 15 45 48 18
May-10 11 14 42 45 17
while the independent variables look like this:
Input 1 Input 2 Input 3 Input 4
Jan-10 90 50 3 41
Feb-10 95 54 5 43
Mar-10 92 52 1 45
Apr-10 91 60 1 49
May-10 90 67 11 49
I find it easy to manually regress them and store the predictions one at a time (i.e. Europe on all four inputs, then Japan, etc.) but haven't figured out how to program a single looped function that could do them all in one go. I suspect I may need to use a list/dictionary to store the dependent variables and call them one by one, but I don't quite know how to write this in a Pythonic way.
The existing code for a single country looks like this:
X = pd.read_csv('countryinputs.csv')
countries = pd.read_csv('countryoutputs.csv')
y = countries['Europe']

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
Simply iterate through the column names and pass each name into a defined function. In fact, you can wrap the process in a dictionary comprehension and pass the result into the DataFrame constructor to return a dataframe of predicted values (same shape as the original dataframe):
X = pd.DataFrame(...)
countries = pd.DataFrame(...)

def reg_proc(label):
    y = countries[label]
    regressor = LinearRegression()
    regressor.fit(X, y)
    y_pred = regressor.predict(X)
    return y_pred

pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in countries.columns},
                       columns=countries.columns)
To demonstrate with random, seeded data where tools below would be your countries:
Data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
np.random.seed(7172018)
tools = pd.DataFrame({'pandas': np.random.uniform(0, 1000, 50),
                      'r': np.random.uniform(0, 1000, 50),
                      'julia': np.random.uniform(0, 1000, 50),
                      'sas': np.random.uniform(0, 1000, 50),
                      'spss': np.random.uniform(0, 1000, 50),
                      'stata': np.random.uniform(0, 1000, 50)},
                     columns=['pandas', 'r', 'julia', 'sas', 'spss', 'stata'])

X = pd.DataFrame({'Input1': np.random.randn(50) * 10,
                  'Input2': np.random.randn(50) * 10,
                  'Input3': np.random.randn(50) * 10,
                  'Input4': np.random.randn(50) * 10})
Model
def reg_proc(label):
    y = tools[label]
    regressor = LinearRegression()
    regressor.fit(X, y)
    y_pred = regressor.predict(X)
    return y_pred

pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in tools.columns},
                       columns=tools.columns)
print(pred_df.head(10))
# pandas r julia sas spss stata
# 0 547.631679 576.025733 682.390046 507.767567 246.020799 557.648181
# 1 577.334819 575.992992 280.579234 506.014191 443.044139 396.044620
# 2 430.494827 576.211105 541.096721 441.997575 386.309627 558.472179
# 3 440.662962 524.582054 406.849303 420.017656 508.701222 393.678200
# 4 588.993442 472.414081 453.815978 479.208183 389.744062 424.507541
# 5 520.215513 489.447248 670.708618 459.375294 314.008988 516.235188
# 6 515.266625 459.292370 477.485995 436.398180 446.777292 398.826234
# 7 423.930650 414.069751 629.444118 378.059735 448.760240 449.062734
# 8 549.769034 406.531405 653.557937 441.425445 348.725447 456.089921
# 9 396.826924 399.327683 717.285415 361.235709 444.830491 429.967976
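As a side note, LinearRegression also supports multi-target regression out of the box, so a minimal sketch that skips the loop entirely (using the same tools and X frames as above) would be:
# y may be a 2-D frame with one column per country/tool; one model fits them all
regressor = LinearRegression()
regressor.fit(X, tools)
pred_df = pd.DataFrame(regressor.predict(X), columns=tools.columns)
The predictions match the per-column fits from the loop.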

Adding a sparse matrix from CountVectorizer into a dataframe with complementary information for a classifier - keep it in sparse format

I have the following problem. I am building a classifier system which will use text and some additional complementary information as input. I store the complementary information in a pandas DataFrame. I transform the text using CountVectorizer and get a sparse matrix. Now, in order to train a classifier I need to have both inputs in the same dataframe. The problem is that when I merge the dataframe with the output of CountVectorizer, I get a dense matrix, which means I run out of memory really fast. Is there any way to avoid this and properly merge these two inputs into a single dataframe without getting a dense matrix?
Example code:
import datetime
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# how many of the most popular words we consider
n_features = 5000

df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# getting the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])

# this densifies the sparse matrix, which is where the memory blows up
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.A)], axis=1)

# binning target variable into 4 bins
df['helpful'] = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3])

# creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']

# splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)

# creating GBC
gbc = GradientBoostingClassifier(max_depth=7, n_estimators=1500, min_samples_leaf=10)
print('Training GBC')
print(datetime.datetime.now())

# fit classifier, look for best
gbc.fit(X_train, y_train)
As you can see, I set up my CountVectorizer to have 5000 words. I have just 50000 rows in my original dataframe, but I already get a matrix of 50000x5000 cells, which is 250 million values. It already requires a lot of memory.
You don't need to use a DataFrame.
Convert the numerical features from the dataframe to a NumPy array:
num_feats = df[cols].values  # cols: list of the complementary (numeric) column names
from scipy import sparse
training_data = sparse.hstack((count_vectorizer_features, num_feats))
Then you can use a scikit-learn algorithm which supports sparse data.
For GBM, you can use xgboost, which supports sparse input.
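For illustration, here is a minimal end-to-end sketch of that idea that keeps everything sparse. SGDClassifier is used only as an example of an estimator that accepts sparse input, and the numeric column names are placeholders taken from this dataset:
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier  # accepts sparse matrices directly

df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# sparse term counts from the review text
tf = CountVectorizer(max_features=5000, stop_words='english').fit_transform(df['reviewText'])

# complementary numeric features, converted to a sparse matrix
num_feats = sparse.csr_matrix(df[['LenReview', 'LenSummary', 'overall']].to_numpy(dtype=float))

X = sparse.hstack((tf, num_feats)).tocsr()   # stays sparse end to end
y = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3]).astype(int)

clf = SGDClassifier()
clf.fit(X, y)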
As #AbhishekThakur has already said, you don't have to put your one-hot-encoded data into the DataFrame.
But if you want to do so, you can add pandas SparseSeries as columns:
# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')
# getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))
# adding "features" columns as SparseSeries
for i, col in enumerate(tf_vectorizer.get_feature_names()):
    df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)
Result:
In [107]: df.head(3)
Out[107]:
asin price reviewerID LenReview Summary LenSummary overall helpful reviewSentiment 0 \
0 151972036 8.48 A14NU55NQZXML2 199 really a difficult read 23 3 2 -0.7203 0.002632
1 151972036 8.48 A1CSBLAPMYV8Y0 77 wha 3 4 0 -0.1260 0.005556
2 151972036 8.48 A1DDECXCGHDYZK 114 wordy and drags on 18 1 4 0.5707 0.004545
... think thought trailers trying wanted words worth wouldn writing young
0 ... 0 0 0 0 1 0 0 0 0 0
1 ... 0 0 0 1 0 0 0 0 0 0
2 ... 0 0 0 0 1 0 1 0 0 0
[3 rows x 78 columns]
Pay attention to the memory usage:
In [108]: df.memory_usage()
Out[108]:
Index 80
asin 112
price 112
reviewerID 112
LenReview 112
Summary 112
LenSummary 112
overall 112
helpful 112
reviewSentiment 112
0 112
1 112
2 112
3 112
4 112
5 112
6 112
7 112
8 112
9 112
10 112
11 112
12 112
13 112
14 112
...
parts 16 # memory used: # of ones multiplied by 8 (np.int64)
peter 16
picked 16
point 16
quick 16
rating 16
reader 16
reading 24
really 24
reviews 16
stars 16
start 16
story 32
tedious 16
things 16
think 16
thought 16
trailers 16
trying 16
wanted 24
words 16
worth 16
wouldn 16
writing 24
young 16
dtype: int64
Pandas also supports importing sparse matrices directly, which it stores using its SparseDtype:
import scipy.sparse
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)
You can then concatenate the result to the rest of your dataframe.
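Applied to this question, a short sketch of that newer API (pandas >= 1.0, and a recent scikit-learn for get_feature_names_out), assuming tf, tf_vectorizer, and df as in the code above:
# build a sparse-backed DataFrame, one column per vocabulary term
tf_df = pd.DataFrame.sparse.from_spmatrix(
    tf, index=df.index, columns=tf_vectorizer.get_feature_names_out())

# concatenate with the complementary features; the term columns keep a Sparse dtype
df_full = pd.concat([df, tf_df], axis=1)
print(df_full.dtypes.value_counts())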

operations in pandas DataFrame

I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first two parameters ('Hs', 'Tp') and get the maximum standard deviation over the 'seed's, calculated individually for each 'wd'.
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding of DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramey way =)
import pandas as pd
import numpy as np
#
# data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1, 4, size=total)
data['Tp'] = np.random.randint(5, 15, size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0, 3)] for _ in range(total)]
data['seed'] = np.random.randint(1, 51, size=total)
data['max'] = np.random.randint(100, 250, size=total)
data['min'] = np.random.randint(10, 25, size=total)

# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns=['Hs', 'Tp', 'max', 'min'])
i = 0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd'] == wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd'] == wd]['min']) for wd in set(data_tp['wd'])])]
        i += 1
Thanks!
PS: if curious, this is statistics on variables depending on sea waves. Hs is the wave height, Tp the wave period, wd the wave direction, the seeds represent different realizations of an irregular wave train, and min and max are the peaks of my variable during a certain exposure time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.
This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).groupby(level=['Hs', 'Tp']).max()
(add .reset_index() at the end if you want)
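Broken into steps, a sketch of what that chain does with the example data above:
# standard deviation across the seeds, for every (Hs, Tp, wd) combination
per_wd_std = data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0)

# for each (Hs, Tp) pair, keep the largest of those per-wd standard deviations
stdev = per_wd_std.groupby(level=['Hs', 'Tp']).max().reset_index()
print(stdev.head())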

cross validation in sklearn with given fold splits

I'm learning how to use Lasso and Ridge with sklearn in Python. I am given the folds in a column. I want to find the best parameter based on a 5-fold cross-validation.
My data looks like the following:
mpg cylinders displacement horsepower weight acceleration origin fold
0 18 8 307 130 3504 12.0 1 3
1 15 8 350 165 3693 11.5 1 0
2 18 8 318 150 3436 11.0 1 2
3 16 8 304 150 3433 12.0 1 2
4 17 8 302 140 3449 10.5 1 3
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
mpg is the y/target variable and the other columns are the predictors. The last column contains the folds. I want to run a Lasso and a Ridge and find the best parameter. The problem I am having is incorporating the specified folds into the cross-validation. Here is what I have so far (for Lasso):
from sklearn.linear_model import Lasso, LassoCV
lasso_model = LassoCV(cv=5, alphas=reg_para)
lasso_fit = lasso_model.fit(X,y)
Is there a simple way to incorporate the fold splits? Any help is greatly appreciated
If your data are in a pandas dataframe, then all you need to do is access that column
fold_labels = df["fold"]
from sklearn.cross_validation import LeaveOneLabelOut
cv = LeaveOneLabelOut(fold_labels)
lasso_model = LassoCV(cv=cv, alphas=reg_para)
So if you obtain the fold labels in an array fold_labels, you can just use LeaveOneLabelOut (sorry for the non-functional code; it should be sufficient to elucidate the idea, though).
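In current scikit-learn versions, sklearn.cross_validation and LeaveOneLabelOut no longer exist; a minimal sketch of the same idea with today's API would use PredefinedSplit from sklearn.model_selection (LeaveOneGroupOut is another option), assuming df, X, y and reg_para are defined as above:
from sklearn.linear_model import LassoCV
from sklearn.model_selection import PredefinedSplit

# each row's value in the fold column is used directly as its test-fold index
cv = PredefinedSplit(test_fold=df['fold'])

lasso_model = LassoCV(alphas=reg_para, cv=cv)
lasso_fit = lasso_model.fit(X, y)
print(lasso_fit.alpha_)  # best regularization parameter found by cross-validation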
