Looped regression model in Python/Sklearn - python

I'm trying to systematically regress a couple of different dependent variables (countries) on the same set of inputs/independent variables, and want to do this in a looped fashion in Python using Sklearn. The dependant variables look like this:
Europe UK Japan USA Canada
Jan-10 10 13 39 42 16
Feb-10 13 16 48 51 19
Mar-10 15 18 54 57 21
Apr-10 12 15 45 48 18
May-10 11 14 42 45 17
while the independent variables look like this:
Input 1 Input 2 Input 3 Input 4
Jan-10 90 50 3 41
Feb-10 95 54 5 43
Mar-10 92 52 1 45
Apr-10 91 60 1 49
May-10 90 67 11 49
I find it easy to manually regress them + store predictions one at a time (ie Europe on all four inputs, then Japan etc) but haven't figured out how to program a single looped function that could do them all in one go. I suspect I may need to use a list/dictionary to store the dependent variables and call them one-by-one but don't quite know how to write this in a Pythonic way.
The existing code for a single loop looks like this:
x = pd.DataFrame('countryinputs.csv')
countries = pd.DataFrame('countryoutputs.csv')
y = countries['Europe']
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)

Simply iterate through the column names. Then pass name into a defined function. In fact, you can wrap the process in a dictionary comprehension and pass into DataFrame constructor to return a dataframe of predicted values (same shape as original dataframe):
X = pd.DataFrame(...)
countries = pd.DataFrame(...)
def reg_proc(label):
y = countries[label]
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
return(y_pred)
pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in countries.columns},
columns = countries.columns)
To demonstrate with random, seeded data where tools below would be your countries:
Data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
np.random.seed(7172018)
tools = pd.DataFrame({'pandas': np.random.uniform(0,1000,50),
'r': np.random.uniform(0,1000,50),
'julia': np.random.uniform(0,1000,50),
'sas': np.random.uniform(0,1000,50),
'spss': np.random.uniform(0,1000,50),
'stata': np.random.uniform(0,1000,50)
},
columns=['pandas', 'r', 'julia', 'sas', 'spss', 'stata'])
X = pd.DataFrame({'Input1': np.random.randn(50)*10,
'Input2': np.random.randn(50)*10,
'Input3': np.random.randn(50)*10,
'Input4': np.random.randn(50)*10})
Model
def reg_proc(label):
y = tools[label]
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
return(y_pred)
pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in tools.columns},
columns = tools.columns)
print(pred_df.head(10))
# pandas r julia sas spss stata
# 0 547.631679 576.025733 682.390046 507.767567 246.020799 557.648181
# 1 577.334819 575.992992 280.579234 506.014191 443.044139 396.044620
# 2 430.494827 576.211105 541.096721 441.997575 386.309627 558.472179
# 3 440.662962 524.582054 406.849303 420.017656 508.701222 393.678200
# 4 588.993442 472.414081 453.815978 479.208183 389.744062 424.507541
# 5 520.215513 489.447248 670.708618 459.375294 314.008988 516.235188
# 6 515.266625 459.292370 477.485995 436.398180 446.777292 398.826234
# 7 423.930650 414.069751 629.444118 378.059735 448.760240 449.062734
# 8 549.769034 406.531405 653.557937 441.425445 348.725447 456.089921
# 9 396.826924 399.327683 717.285415 361.235709 444.830491 429.967976

Related

Pre-processing single feature containing different scales

How do I preprocess this data containing a single feature with different scales? This will then be used for supervised machine learning classification.
Data
import pandas as pd
import numpy as np
np.random.seed = 4
df_eur_jpy = pd.DataFrame({"value": np.random.default_rng().uniform(0.07, 3.85, 50)})
df_usd_cad = pd.DataFrame({"value": np.random.default_rng().uniform(0.0004, 0.02401, 50)})
df_usd_cad["ticker"] = "usd_cad"
df_eur_jpy["ticker"] = "eur_jpy"
df = pd.concat([df_eur_jpy,df_usd_cad],axis=0)
df.head(1)
value ticker
0 0.161666 eur_jpy
We can see the different tickers contain data with a different scale when looking at the max/min of this groupby:
df.groupby("ticker")["value"].agg(['min', 'max'])
min max
ticker
eur_jpy 0.079184 3.837519
usd_cad 0.000405 0.022673
I have many tickers in my real data and would like to combine all of these in the one feature (pandas column) and use with an estimator in sci-kit learn for supervised machine learning classification.
If I Understand Carefully (IIUC), you can use the min-max scaling formula:
You can apply this formula to your dataframe with implemented sklearn.preprocessing.MinMaxScaler like below:
from sklearn.preprocessing import MinMaxScaler
df2 = df.pivot(columns='ticker', values='value')
# ticker eur_jpy usd_cad
# 0 3.204568 0.021455
# 1 1.144708 0.013810
# ...
# 48 1.906116 0.002058
# 49 1.136424 0.022451
df2[['min_max_scl_eur_jpy', 'min_max_scl_usd_cad']] = MinMaxScaler().fit_transform(df2[['eur_jpy', 'usd_cad']])
print(df2)
Output:
ticker eur_jpy usd_cad min_max_scl_eur_jpy min_max_scl_usd_cad
0 3.204568 0.021455 0.827982 0.896585
1 1.144708 0.013810 0.264398 0.567681
2 2.998154 0.004580 0.771507 0.170540
3 1.916517 0.003275 0.475567 0.114361
4 0.955089 0.009206 0.212517 0.369558
5 3.036463 0.019500 0.781988 0.812471
6 1.240505 0.006575 0.290608 0.256373
7 1.224260 0.020711 0.286163 0.864584
8 3.343022 0.020564 0.865864 0.858280
9 2.710383 0.023359 0.692771 0.978531
10 1.218328 0.008440 0.284540 0.336588
11 2.005472 0.022898 0.499906 0.958704
12 2.056680 0.016429 0.513916 0.680351
13 1.010388 0.005553 0.227647 0.212368
14 3.272408 0.000620 0.846543 0.000149
15 2.354457 0.018608 0.595389 0.774092
16 3.297936 0.017484 0.853528 0.725720
17 2.415297 0.009618 0.612035 0.387285
18 0.439263 0.000617 0.071386 0.000000
19 3.335262 0.005988 0.863740 0.231088
20 2.767412 0.013357 0.708375 0.548171
21 0.830678 0.013824 0.178478 0.568255
22 1.056041 0.007806 0.240138 0.309336
23 1.497400 0.023858 0.360896 1.000000
24 0.629698 0.014088 0.123489 0.579604
25 3.758559 0.020663 0.979556 0.862509
26 0.964214 0.010302 0.215014 0.416719
27 3.680324 0.023647 0.958150 0.990918
28 3.169445 0.017329 0.818372 0.719059
29 1.898905 0.017892 0.470749 0.743299
30 3.322663 0.020508 0.860293 0.855869
31 2.735855 0.010578 0.699741 0.428591
32 2.264645 0.017853 0.570816 0.741636
33 2.613166 0.021359 0.666173 0.892456
34 1.976168 0.001568 0.491888 0.040928
35 3.076169 0.013663 0.792852 0.561335
36 3.330470 0.013048 0.862429 0.534891
37 3.600527 0.012340 0.936318 0.504426
38 0.653994 0.008665 0.130137 0.346288
39 0.587896 0.013134 0.112052 0.538567
40 0.178353 0.011326 0.000000 0.460781
41 3.727127 0.016738 0.970956 0.693658
42 1.719622 0.010939 0.421696 0.444123
43 0.460177 0.021131 0.077108 0.882665
44 3.124722 0.010328 0.806136 0.417826
45 1.011988 0.007631 0.228085 0.301799
46 3.833281 0.003896 1.000000 0.141076
47 3.289872 0.017223 0.851322 0.714495
48 1.906116 0.002058 0.472721 0.062020
49 1.136424 0.022451 0.262131 0.939465

Python - decision tree in lightgbm with odd values

I am trying to fit a single decision tree using the Python module lightgbm. However, I find the output a little strange. I have 15 explanatory variables and the numerical response variable has the following characteristic:
count 653.000000
mean 31.503813
std 11.838267
min 13.750000
25% 22.580000
50% 28.420000
75% 38.250000
max 76.750000
Name: X2, dtype: float64
I do the following to fit the tree: I first construct the Dataset object
df_train = lightgbm.Dataset(
df, # The data
label = df[response], # The response series
feature_name = features, # A list with names of all explanatory variables
categorical_feature = categorical_vars # A list with names of the categorical ones
)
Next, I define the parameters and fit the model:
param = {
# make it a single tree:
'objective': 'regression',
'bagging_freq':0, # Disable bagging
'feature_fraction':1, # don't randomly select features. consider all.
'num_trees': 1,
# tuning parameters
'max_leaves': 20,
'max_depth': -1,
'min_data_in_leaf': 20
}
model = lightgbm.train(param, df_train)
From the model I extract the leaves of the tree as:
tree = model.trees_to_dataframe()[[
'right_child',
'node_depth',
'value',
'count']]
leaves = tree[tree.right_child.isnull()]
print(leaves)
right_child node_depth value count
5 None 6 29.957982 20
6 None 6 30.138253 28
8 None 6 30.269373 34
9 None 6 30.404353 38
12 None 6 30.528705 33
13 None 6 30.651690 62
14 None 5 30.842856 59
17 None 5 31.080432 51
19 None 6 31.232860 21
20 None 6 31.358547 26
22 None 5 31.567571 43
23 None 5 31.795345 46
28 None 6 32.034321 27
29 None 6 32.247890 24
31 None 6 32.420886 22
32 None 6 32.594289 21
34 None 5 32.920932 20
35 None 5 33.210205 22
37 None 4 33.809376 36
38 None 4 34.887632 20
Now, if you look at the values, they range from (approximately) 30 to 35. This is far from capturing the distribution (shown above with min = 13.75 and max = 76.75) of the response variable.
Can anyone explain to me what is going on here?
Follow Up Based On Accepted Answer:
I tried to add 'learning_rate':1 and 'min_data_in_bin':1 to the parameter dict which resulted in the following tree:
right_child node_depth value count
5 None 6 16.045500 20
6 None 6 17.824074 27
8 None 6 19.157500 36
9 None 6 20.529730 37
12 None 6 21.805834 36
13 None 6 23.048387 62
14 None 5 24.975263 57
17 None 5 27.335385 52
19 None 6 29.006800 25
20 None 6 30.234286 21
22 None 5 32.221591 44
23 None 5 34.472272 44
28 None 6 36.808889 27
29 None 6 38.944583 24
31 None 6 40.674546 22
32 None 6 42.408572 21
34 None 5 45.675000 20
35 None 5 48.567728 22
37 None 4 54.559445 36
38 None 4 65.341999 20
This is much more desirable. This means, that we can now use lightgbm to mimic the behavior of a single decision tree with categorical features. As opposed to sklearn, lightgbm honors "true" categorical variables whereas in sklearn one needs to one-hot encode all categorical variables which can turn out really bad; see this kaggle post.
As you may know LightGBM does a couple of tricks to speed things up. One of them is feature binning, where the values of the features are assigned to bins to reduce the possible number of splits. By default this number is 3, so for example if you have 100 samples you'd have about 34 bins.
Another important thing here when using a single tree is that LightGBM does boosting by default, which means that it will start from an initial score and try to gradually improve on it. That gradual change is controlled by the learning_rate which by default is 0.1, so the predictions from each tree are multiplied by this number and added to the current score.
The last thing to consider is that the tree size is controlled by num_leaves which is 31 by default. If you want to fully grow the tree you have to set this number to your number of samples.
So if you want to replicate a full-grown decision tree in LightGBM you have to adjust these parameters. Here's an example:
import lightgbm as lgb
import numpy as np
import pandas as pd
X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=1)
print(pd.concat([
bst.trees_to_dataframe().loc[lambda x: x['left_child'].isnull(), 'value'].describe().rename('leaves'),
pd.Series(y).describe().rename('y'),
], axis=1))
leaves
y
count
100
100
mean
2.33502
2.33502
std
0.882451
0.882451
min
1
1
25%
1.56252
1.56252
50%
2.25003
2.25003
75%
3.06252
3.06252
max
4
4
Having said that, if you're looking for a decision tree it's easier to use scikit-learn's:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor().fit(X, y)
np.allclose(bst.predict(X), tree.predict(X))
# True

sciklearn Linear Regression (Final Prediciton always 0)

I'm trying to do simple linear regression using this small Dataset (Screenshot).
The dataset is records divided into small time blocks of 4 years each (Except for the 2nd to the last time block of 2016-2018).
What I'm trying to do is try to predict the output of records for the timeblock of 2019-2022. To do this, I placed a 2019-2022 time block with all its rows containing the value of 0 (Since there's nothing made during that time since it's the future). I did that to accommodate the syntax of sklearn's train_test_split and went with this code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv("TCO.csv")
df = df[['2000-2003', '2004-2007', '2008-2011','2012-2015','2016-2018','2019-2022']]
linreg = LinearRegression()
X1_train, X1_test, y1_train, y1_test = train_test_split(df[['2000-2003','2004-2007','2008-2011',
'2012-2015','2016-2018']],df['2019-2022'],test_size=0.4,random_state = 42)
linreg.fit(X1_train, y1_train)
linreg.intercept_
list( zip( ['2000-2003','2004-2007','2008-2011','2012-2015','2016-2018'],list(linreg.coef_)))
y1_pred = linreg.predict(X1_test)
print(y1_pred)
test_pred_df = pd.DataFrame({'actual': y1_test,
'predicted': np.round(y1_pred, 2),
'residuals': y1_test - y1_pred})
print(test_pred_df[0:10].to_string())
For some reason, the algorithm would always return a 0 as the final prediction for all rows with 0 residuals (This is due to the timeblock of 2019-2022 having all rows of zero.)
I think I did something wrong but I can't tell what it is. (I'm a beginner in this topic.) Can someone point out what went wrong and how to fix it?
Edit: I added a copy-able version of the data:
df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
'Malaysia','Myanmar','Philippines','Singaore',
'Thailand','Vietnam'],
'2000-2003': [0,0,14,1,6,0,25,8,26,8],
'2004-2007': [0,3,15,6,21,0,37,11,44,36],
'2008-2011': [0,5,31,9,75,0,58,27,96,61],
'2012-2015': [5,11,129,35,238,3,99,65,170,96],
'2016-2018': [6,22,136,17,211,10,66,89,119,88]})
Based on your data, I think this is what you ask for [Edit: see updated version below]:
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
'Malaysia','Myanmar','Philippines','Singaore',
'Thailand','Vietnam'],
'2000-2003': [0,0,14,1,6,0,25,8,26,8],
'2004-2007': [0,3,15,6,21,0,37,11,44,36],
'2008-2011': [0,5,31,9,75,0,58,27,96,61],
'2012-2015': [5,11,129,35,238,3,99,65,170,96],
'2016-2018': [6,22,136,17,211,10,66,89,119,88]})
# create a transposed version with country in header
df_T = df.T
df_T.columns = df_T.iloc[-1]
df_T = df_T.drop("Country:")
# create a new columns for target
df["2019-2022"] = np.NaN
# now fit a model per country and add the prediction
for country in df_T:
y = df_T[country].values
X = np.arange(0,len(y))
m = LinearRegression()
m.fit(X.reshape(-1, 1), y)
df.loc[df["Country:"] == country, "2019-2022"] = m.predict(5)[0]
This prints:
Country: 2000-2003 2004-2007 2008-2011 2012-2015 2016-2018 2019-2022
Brunei 0 0 0 5 6 7.3
Cambodia 0 3 5 11 22 23.8
Indonesia 14 15 31 129 136 172.4
Laos 1 6 9 35 17 31.9
Malaysia 6 21 75 238 211 298.3
Myanmar 0 0 0 3 10 9.5
Philippines 25 37 58 99 66 100.2
Singaore 8 11 27 65 89 104.8
Thailand 26 44 96 170 119 184.6
Vietnam 8 36 61 96 88 123.8
Forget about my comment with shift(). I thought about it, but it makes not sense for this small amount of data, I think. But considering time series methods and treating each country's series as a time series may still be worth for you.
Edit:
Excuse me. The above code is unnessary complicated, but was just result of me going through it step by step. Of course it can simply be done row by row like tihs:
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
'Malaysia','Myanmar','Philippines','Singaore',
'Thailand','Vietnam'],
'2000-2003': [0,0,14,1,6,0,25,8,26,8],
'2004-2007': [0,3,15,6,21,0,37,11,44,36],
'2008-2011': [0,5,31,9,75,0,58,27,96,61],
'2012-2015': [5,11,129,35,238,3,99,65,170,96],
'2016-2018': [6,22,136,17,211,10,66,89,119,88]})
# create a new columns for target
df["2019-2022"] = np.NaN
for idx, row in df.iterrows():
y = row.drop(["Country:", "2019-2022"]).values
X = np.arange(0,len(y))
m = LinearRegression()
m.fit(X.reshape(-1, 1), y)
df.loc[idx, "2019-2022"] = m.predict(len(y)+1)[0]
1500 rows should be no problem.

Non-linear regression in Seaborn Python

I have the following dataframe that I wish to perform some regression on. I am using Seaborn but can't quite seem to find a non-linear function that fits. Below is my code and it's output, and below that is the dataframe I am using, df. Note I have truncated the axis in this plot.
I would like to fit either a Poisson or Gaussian distribution style of function.
import pandas
import seaborn
graph = seaborn.lmplot('$R$', 'Equilibrium Value', data = df, fit_reg=True, order=2, ci=None)
graph.set(xlim = (-0.25,10))
However this produces the following figure.
df
R Equilibrium Value
0 5.102041 7.849315e-03
1 4.081633 2.593005e-02
2 0.000000 9.990000e-01
3 30.612245 4.197446e-14
4 14.285714 6.730133e-07
5 12.244898 5.268202e-06
6 15.306122 2.403316e-07
7 39.795918 3.292955e-18
8 19.387755 3.875505e-09
9 45.918367 5.731842e-21
10 1.020408 9.936863e-01
11 50.000000 8.102142e-23
12 2.040816 7.647420e-01
13 48.979592 2.353931e-22
14 43.877551 4.787156e-20
15 34.693878 6.357120e-16
16 27.551020 9.610208e-13
17 29.591837 1.193193e-13
18 31.632653 1.474959e-14
19 3.061224 1.200807e-01
20 23.469388 6.153965e-11
21 33.673469 1.815181e-15
22 42.857143 1.381050e-19
23 25.510204 7.706746e-12
24 13.265306 1.883431e-06
25 9.183673 1.154141e-04
26 41.836735 3.979575e-19
27 36.734694 7.770915e-17
28 18.367347 1.089037e-08
29 44.897959 1.657448e-20
30 16.326531 8.575577e-08
31 28.571429 3.388120e-13
32 40.816327 1.145412e-18
33 11.224490 1.473268e-05
34 24.489796 2.178927e-11
35 21.428571 4.893541e-10
36 32.653061 5.177167e-15
37 8.163265 3.241799e-04
38 22.448980 1.736254e-10
39 46.938776 1.979881e-21
40 47.959184 6.830820e-22
41 26.530612 2.722925e-12
42 38.775510 9.456077e-18
43 6.122449 2.632851e-03
44 37.755102 2.712309e-17
45 10.204082 4.121137e-05
46 35.714286 2.223883e-16
47 20.408163 1.377819e-09
48 17.346939 3.057373e-08
49 7.142857 9.167507e-04
EDIT
Attached are two graphs produced from both this and another data set when increasing the order parameter beyond 20.
Order = 3
I have problems understanding why a lmplot is needed here. Usually you want to perform a fit by taking a model function and fit it to the data.
Assume you want a gaussian function
model = lambda x, A, x0, sigma, offset: offset+A*np.exp(-((x-x0)/sigma)**2)
you can fit it to your data with scipy.optimize.curve_fit:
popt, pcov = curve_fit(model, df["R"].values,
df["EquilibriumValue"].values, p0=[1,0,2,0])
Complete code:
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
df = ... # your dataframe
# plot data
plt.scatter(df["R"].values,df["EquilibriumValue"].values, label="data")
# Fitting
model = lambda x, A, x0, sigma, offset: offset+A*np.exp(-((x-x0)/sigma)**2)
popt, pcov = curve_fit(model, df["R"].values,
df["EquilibriumValue"].values, p0=[1,0,2,0])
#plot fit
x = np.linspace(df["R"].values.min(),df["R"].values.max(),250)
plt.plot(x,model(x,*popt), label="fit")
# Fitting
model2 = lambda x, sigma: model(x,1,0,sigma,0)
popt2, pcov2 = curve_fit(model2, df["R"].values,
df["EquilibriumValue"].values, p0=[2])
#plot fit2
x2 = np.linspace(df["R"].values.min(),df["R"].values.max(),250)
plt.plot(x2,model2(x2,*popt2), label="fit2")
plt.xlim(None,10)
plt.legend()
plt.show()

Adding sparse matrix from CountVectorizer into dataframe with complimentary information for classifier - keep it in sparse format

I have the following problem. Right now I am building a classifier system which will use text and some additional complimentary information as an input. I store complimentary information in pandas DataFrame. I transform text using CountVectorizer and get a sparse matrix. Now, in order to train a classifier I need to have both inputs in same dataframe. The problem is that, when I merge dataframe with output of CountVectorizer I get a dense matrix, which I means I run out of memory really fast. Is there any way to avoid it and properly merge together this 2 inputs into single dataframe without getting a dense matrix?
Example code:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
#how many most popular words we consider
n_features = 5000
df = pd.DataFrame.from_csv('DataWithSentimentAndTopics.csv',index_col=None)
#vecotrizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
max_features=n_features,
stop_words='english')
#getting the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.A)], axis=1)
#binning target variable into 4 bins.
df['helpful'] = pd.cut(df['helpful'],[-1,0,10,50,100000], labels = [0,1,2,3])
#creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']
#splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)
#creating GBR
gbc = GradientBoostingClassifier(max_depth = 7, n_estimators=1500, min_samples_leaf=10)
print('Training GBC')
print(datetime.datetime.now())
#fit classifier, look for best
gbc.fit(X_train, y_train)
As you see, I set up my CountVectorizer to have 5000 words. I have just 50000 lines in my original dataframe but I already get a matrix of 50000x5000 cells, which is 2.5 billion of units. It already requires a lot of memory.
you dont need to use a data frame.
convert the numerical features from dataframe to a numpy array:
num_feats = df[[cols]].values
from scipy import sparse
training_data = sparse.hstack((count_vectorizer_features, num_feats))
then you can use a scikit-learn algorithm which supports sparse data.
for GBM, you can use xgboost which supports sparse.
As #AbhishekThakur has already said, you don't have to put your one-hot-encoded data into the DataFrame.
But if you want to do so, you can add Pandas.SparseSeries as a columns:
#vecotrizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
max_features=n_features,
stop_words='english')
#getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))
# adding "features" columns as SparseSeries
for i, col in enumerate(tf_vectorizer.get_feature_names()):
df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)
Result:
In [107]: df.head(3)
Out[107]:
asin price reviewerID LenReview Summary LenSummary overall helpful reviewSentiment 0 \
0 151972036 8.48 A14NU55NQZXML2 199 really a difficult read 23 3 2 -0.7203 0.002632
1 151972036 8.48 A1CSBLAPMYV8Y0 77 wha 3 4 0 -0.1260 0.005556
2 151972036 8.48 A1DDECXCGHDYZK 114 wordy and drags on 18 1 4 0.5707 0.004545
... think thought trailers trying wanted words worth wouldn writing young
0 ... 0 0 0 0 1 0 0 0 0 0
1 ... 0 0 0 1 0 0 0 0 0 0
2 ... 0 0 0 0 1 0 1 0 0 0
[3 rows x 78 columns]
Pay attention at memory usage:
In [108]: df.memory_usage()
Out[108]:
Index 80
asin 112
price 112
reviewerID 112
LenReview 112
Summary 112
LenSummary 112
overall 112
helpful 112
reviewSentiment 112
0 112
1 112
2 112
3 112
4 112
5 112
6 112
7 112
8 112
9 112
10 112
11 112
12 112
13 112
14 112
...
parts 16 # memory used: # of ones multiplied by 8 (np.int64)
peter 16
picked 16
point 16
quick 16
rating 16
reader 16
reading 24
really 24
reviews 16
stars 16
start 16
story 32
tedious 16
things 16
think 16
thought 16
trailers 16
trying 16
wanted 24
words 16
worth 16
wouldn 16
writing 24
young 16
dtype: int64
Pandas also supports importing sparse matrices, which it stores using its sparseDtype
import scipy.sparse
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)
Which you could concatenate to the rest of your dataframe

Categories