Performing Regression for all DataFrames Inside a List

I have a list called "Data" containing 81 DataFrames (df1, df2, ..., df81), each with the same shape and column labels. Let's say the independent variables (X) are 'a', 'b', 'c', and the dependent variable (Y) is 'y'. Can I run a multivariate regression on every DataFrame in "Data" at once instead of doing it one by one, and also store each regression's accuracy (r2_score) in accuracy_list?
For example, this is how I currently run the regression:
accuracy_list =[]
#First dataframe (df1)
X = Data['df1'][['a','b','c']]
Y = Data['df1']['y']
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,train_size=0.9,random_state=42)
from sklearn.linear_model import LinearRegression
rgs = LinearRegression()
rgs.fit(X_train,Y_train)
from sklearn.metrics import r2_score
y_pred = rgs.predict(X_test)
r2_score(Y_test,y_pred) # append it to accuracy_list
#second dataframe (df2)
X = Data['df2'][['a','b','c']]
Y = Data['df2']['y']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,train_size=0.9,random_state=42)
rgs = LinearRegression()
rgs.fit(X_train,Y_train)
y_pred = rgs.predict(X_test)
r2_score(Y_test,y_pred) # append it to accuracy_list
# and so on

As I understand your question, the important part is processing the DataFrames in parallel to speed up the computation. You could therefore try multiprocessing, which spins up several processes to execute your code. One very convenient way, which scikit-learn also uses under the hood, is joblib's Parallel.
In code, that would roughly read as
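from joblib import Parallel, delayed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
def compute_r2_score(df) -> float:
    # Fit and score one DataFrame, mirroring the per-DataFrame code in the question
    X_train, X_test, Y_train, Y_test = train_test_split(df[['a','b','c']], df['y'], train_size=0.9, random_state=42)
    rgs = LinearRegression().fit(X_train, Y_train)
    return r2_score(Y_test, rgs.predict(X_test))
n_jobs = 2 # For having 2 processes. That should be at max n_cpus - 1
# verbose=10 gives you some output on the iterations
accuracy_list = Parallel(n_jobs=n_jobs, verbose=10)(delayed(compute_r2_score)(df) for df in Data.values())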
Note that multiprocessing doesn't come for free: it introduces additional communication and processing overhead. Apart from that, everything that runs in an individual process must be picklable, just in case you run into that issue.
As a side note, multithreading would most likely not speed anything up here, because of the Global Interpreter Lock and because this task is CPU bound.
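If the goal is mainly to avoid repeating the same code by hand, a plain loop over the container already does that, and it is a useful baseline before adding processes. Here is a minimal sketch, assuming `Data` is a dict that maps names to DataFrames (as the `Data['df1']` indexing in the question suggests). The synthetic frames and the helper name `fit_and_score` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 81 DataFrames (assumption: Data is a dict)
rng = np.random.default_rng(0)
Data = {}
for i in range(1, 4):
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
    df["y"] = 2 * df["a"] - df["b"] + 0.5 * df["c"] + rng.normal(scale=0.1, size=200)
    Data[f"df{i}"] = df


def fit_and_score(df):
    """Fit a linear regression on one DataFrame and return the test-set R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[["a", "b", "c"]], df["y"], train_size=0.9, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# One score per DataFrame, in the dict's insertion order
accuracy_list = [fit_and_score(df) for df in Data.values()]
print(accuracy_list)
```

For 81 small frames and a linear model this loop is usually fast; the same `fit_and_score` function can be handed to `joblib.Parallel` unchanged if it turns out to be too slow.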

Related

Why different random_states in ML model?

I recently read that specifying a number for random_state ensures the same results in each run.
Why, then, do I use random_state=1 when splitting the data into training and validation sets but random_state=0 when creating the model?
I would have expected them to be both the same value.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes") # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(train_X, train_y)

Don't read too much into the number itself. Basically, random_state seeds the random number generator (comparable to calling numpy.random.seed()) and ensures that the random numbers you draw are the same on every run. Initializing with 1 gives you a different sequence than initializing with 0, but the split and the random forest use their random numbers for different purposes anyway: the split shuffles the data, while the forest introduces randomness into the trees (e.g. by selecting sub-features for a tree). So the number you give it matters little; it only ensures the reproducibility of your draws. To see that, set the seed, e.g. numpy.random.seed(42), and then draw several random numbers with numpy.random.rand(). Resetting the seed to 42 and repeating the draws will give you the same sequence.
From time to time (after setting everything up satisfactorily) it might be wise to drop the fixed random_state and see what your results look like over repeated runs with more randomness included. Trying other values (or no seed at all) gives you a sense of how independent and valid your results are in the end. If you need to compare and reproduce the results exactly, the seed should be given.
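The reproducibility described above can be checked in a few lines. This is a small sketch with made-up toy arrays; it only shows that the same seed gives the same draws and the same split, not that any particular seed is better:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same seed, same sequence of random numbers
np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))

# Same random_state, same split; a different random_state shuffles differently
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
_, test_a, _, _ = train_test_split(X, y, random_state=1)
_, test_b, _, _ = train_test_split(X, y, random_state=1)
_, test_c, _, _ = train_test_split(X, y, random_state=0)
print(np.array_equal(test_a, test_b))
print(test_a.ravel(), test_c.ravel())
```

The split's seed and the model's seed feed two separate steps, so there is no requirement that they share a value.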

Returning a trained scikit learn (random forest) model from a function?

I am training a random forest model and have found that returning the trained model object from a function consistently results in different .predict behavior. Is this intended or not?
I think this is completely reproducible code. Input data is just 1000 rows of 6 columns of floats:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
def as_a_function():
    df = pd.read_csv() # read file
    lcscols = #just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    selcol = #y 'real' data
    train_df = df.sample(frac=testsize,random_state=42)
    test_df = df.drop(train_df.index) #test/train split
    rfmodel, fitvals_mid = RF_model(train_df,test_df,selcol, lcscols)
    tempdf = df.copy(deep=True) # new copy, not totally necessary but helpful in edge cases
    tempdf.dropna(inplace=True)
    selcolname = selcol + '_cal'
    mid_cal = pd.DataFrame(data=rfmodel.predict(tempdf[lcscols]),index=tempdf.index,columns=[selcolname])
    #new df just made from a .predict call
    # note that input order of columns matters, needs to be identical to training order??
def RF_model(train_df, test_df, ycol, xcols):
    rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
    rfmodel = rf.fit(train_df[xcols], train_df[ycol])
    y_pred_test = rfmodel.predict(test_df[xcols])
    #missing code to test predicted values of testing set
    return rfmodel
#################################
def inline():
    df = pd.read_csv() # read file
    lcscols = #just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    refcol = #'true' data
    X = df[lcscols].values
    y = df[[refcol]].values
    x_train,x_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
    ramp = rf.fit(x_train, y_train.flatten())
    y_pred_test = ramp.predict(x_test)
    #missing code to check prediction on test values
    tempdf = df.copy(deep=True)[lcscols]
    tempdf.dropna(axis=1,how='all',inplace=True)
    tempdf.dropna(axis=0,inplace=True)
    df_cal = pd.DataFrame(data=ramp.predict(tempdf),index=tempdf.index,columns=['name'])
    return df_cal
The problem is that rfmodel.predict(tempdf[lcscols]) produces different output than ramp.predict(tempdf).
I imagine that it's going to be somewhat different, given that pd.DataFrame.sample is not going to produce the exact same split as train_test_split, but the r^2 value is 0.98 when .predict is called on the trained model in the same function, compared to r^2 = 0.5 when .predict is called on a returned model object. That seems like way too big a difference to be attributable to a different split method?

Try calling np.random.seed(42) before you call the method (make sure you have numpy imported first). A fixed seed means the random numbers used while fitting are the same every time you run your code, so the results become repeatable. Note, though, that a fitted random forest does not draw random numbers in .predict, and both of your versions already pass random_state=42 to the estimator, so the seed by itself is unlikely to explain a gap between r^2 = 0.98 and r^2 = 0.5.
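One way to rule out the "returned from a function" part is to compare a model that is returned from a function with one fitted inline on the same data. The sketch below uses synthetic data and a hypothetical helper `train_model`; the forest is kept small so it runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def train_model(X, y):
    """Fit a forest inside a function and hand the fitted object back."""
    rf = RandomForestRegressor(n_estimators=50, random_state=42)
    return rf.fit(X, y)


# Synthetic stand-in for the 6 feature columns
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X @ np.arange(1, 7) + rng.normal(scale=0.1, size=300)

returned_model = train_model(X, y)
inline_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

pred_returned = returned_model.predict(X)
# Same data and same random_state: both models predict the same values
print(np.allclose(pred_returned, inline_model.predict(X)))
# Calling predict twice on one fitted model gives identical output
print(np.array_equal(pred_returned, returned_model.predict(X)))
```

If both checks hold, the difference in the question more likely comes from what each version trains on. The first version fits on `df.sample(frac=testsize, ...)` and the second on the 80% side of `train_test_split`; if `testsize` is the test fraction (an assumption, since its value is not shown), the first model sees far fewer rows.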

How to parallelize multiple model-building procedures in sklearn

Is there a way to parallelize multiple model-building procedures in scikit-learn? I know that I can use the n_jobs argument in both GridSearchCV and cross_validate to achieve some sort of parallelization within one model building procedure. However, I am running multiple model-building procedures in a for-loop with different input parameters and save the results in a list. Just as an example, suppose I have 15 free CPUs and I am using n_jobs=5 in cross_validate. If I am not mistaken, that means that one single model-building procedure uses 5 CPUS. Now is there a way to already start the next 2 model-building procedures in my for-loop so I am using all 15 CPUS? Here's a dummy example:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']
# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []
for penalty_type in penalty_types:
    # create a random number generator
    rng = np.random.RandomState(42)
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng,penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1,stop=16,num=11,endpoint=True)
    p_grid = {'lr__C':lr_c}
    # create pipeline
    lr_pipe = Pipeline([
        ('scaler',scaler),
        ('lr',lr)
    ])
    # define cross validation strategy
    cv = KFold(shuffle=True,random_state=rng)
    # implement GridSearch (inner cross validation)
    grid = GridSearchCV(lr_pipe,param_grid=p_grid,cv=cv)
    # implement cross_validate (outer cross validation)
    nested_cv_scores = cross_validate(grid,X,y,cv=cv,n_jobs=5)
    # append result to list
    nested_cv_scores_list.append(nested_cv_scores)
Is there a way to parallelize this for-loop?

joblib.Parallel is made for this job! Just put your loop content in a function and call it using Parallel and delayed:
from joblib import Parallel, delayed
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']
# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []
# put rng-seed outside of loop so that not all results are the same
rng = np.random.RandomState(42)
def run_as_job(penalty_type, X, y):
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng,penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1,stop=16,num=11,endpoint=True)
    p_grid = {'lr__C':lr_c}
    .... # additional calculation that is missing in the example
    .... # e.g. res = cross_val_score(clf, X, y, n_jobs=2)
    return res
if __name__ == '__main__':
    # pass X and y along as well, since run_as_job expects them
    results = Parallel(n_jobs=2)(delayed(run_as_job)(penalty_type, X, y) for penalty_type in penalty_types)
For more usage options, have a look at the joblib documentation: Embarrassingly parallel for loops
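A complete, runnable variant of the same idea might look as follows. It is a sketch rather than a drop-in replacement for the nested cross-validation in the question: it varies a hypothetical list of `C` values instead of penalty types, and uses `cross_val_score` on a plain pipeline to keep each job small:

```python
from joblib import Parallel, delayed
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_as_job(C, X, y):
    """Build and cross-validate one pipeline; returns the mean accuracy."""
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(pipe, X, y, cv=cv).mean()


X, y = load_breast_cancer(return_X_y=True)
C_values = [0.1, 1.0, 10.0]  # hypothetical settings, one job each

# Each delayed call becomes one task; n_jobs caps how many run at once
scores = Parallel(n_jobs=2)(delayed(run_as_job)(C, X, y) for C in C_values)
print(dict(zip(C_values, scores)))
```

When the outer loop is parallelized like this, it is usually sensible to keep `n_jobs` inside each job at 1 (or small), so that the outer and inner levels together do not request more workers than there are CPUs.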

Different results when using train_test_split vs manually splitting the data

I have a pandas dataframe that I want to make predictions on and get the root mean squared error for each feature. I'm following an online guide that splits the dataset manually, but I thought it would be more convenient to use train_test_split from sklearn.model_selection. Unfortunately, I'm getting different results when looking at the rmse values after splitting the data manually vs using train_test_split.
A (hopefully) reproducible example:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['feature_1','feature_2','feature_3','feature_4'])
df['target'] = np.random.randint(2,size=100)
df2 = df.copy()
Here is a function, knn_train_test, that splits the data manually, fits the model, makes predictions, etc:
def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    # Fit a KNN model using default k value.
    knn.fit(train_df[[train_col]], train_df[target_col])
    # Make predictions using model.
    predicted_labels = knn.predict(test_df[[train_col]])
    # Calculate and return RMSE.
    mse = mean_squared_error(test_df[target_col], predicted_labels)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = df.columns.drop('target')
# For each column (minus `target`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'target', df)
    rmse_results[col] = rmse_val
# Create a Series object from the dictionary so
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
#Output
feature_3 0.541110
feature_2 0.548452
feature_4 0.559285
feature_1 0.569912
dtype: float64
Now, here is a function, knn_train_test2, that splits the data using train_test_split:
def knn_train_test2(train_col, target_col, df2):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    X_train, X_test, y_train, y_test = train_test_split(df2[[train_col]],df2[[target_col]], test_size=0.5)
    knn.fit(X_train,y_train)
    predictions = knn.predict(X_test)
    mse = mean_squared_error(y_test,predictions)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = df2.columns.drop('target')
for col in train_cols:
    rmse_val = knn_train_test2(col, 'target', df2)
    rmse_results[col] = rmse_val
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
# Output
feature_4 0.522303
feature_3 0.556417
feature_1 0.569210
feature_2 0.572713
dtype: float64
Why am I getting different results? I think I'm misunderstanding the split > train > test process in general, or maybe misunderstanding/mis-specifying train_test_split. Thank you in advance

Your custom train_test_split implementation differs from scikit-learn's implementation, which is why you get different results for the same seed.
Here you can find the official implementation. One notable detail is that train_test_split builds on ShuffleSplit, whose n_splits parameter defaults to 10 iterations of re-shuffling and splitting. As far as I can tell, train_test_split only takes the first of those splits, but the way it assigns the shuffled indices to the training and test sets still need not match your manual slicing.
Only if your approach does exactly the same as the scikit-learn approach can you expect the same result for the same seed.

This is the basic nature of machine learning. When you split the data manually, you get one version of the training and testing sets. When you use the sklearn function, you get a different training and testing set. Your model makes predictions based on the training data it receives, and thus your final results differ between the two.
If you want to reproduce a result, use train_test_split with a seed value, which makes the function produce the same split every time. Then, when running your ML function, set a seed there too, since many ML models also use randomness during training (random initial weights, for example). Try your model on these datasets with the same seed and you will get the same results.

Splitting data manually is just slicing, but train_test_split also randomizes the rows before slicing. Try fixing the random number seed and see whether you get the same results each time when using train_test_split.
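To see how the two procedures relate, one can compare the row indices directly. The sketch below assumes a made-up toy DataFrame and seeds NumPy's global generator before each split, as the question does. In the scikit-learn versions I am aware of, train_test_split without random_state draws from that same global generator and assigns the first part of the shuffled indices to the test set, whereas the manual code uses the first half for training. Treat that as an implementation detail that may change rather than as documented behaviour:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature_1": np.arange(100), "target": np.arange(100) % 2})

# Manual split as in the question: first half of the permutation is the training set
np.random.seed(0)
shuffled_index = np.random.permutation(df.index)
manual_train = shuffled_index[:50]
manual_test = shuffled_index[50:]

# train_test_split without random_state falls back to NumPy's global generator
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_1"]], df["target"], test_size=0.5
)

# Do the training sets contain the same rows?
print(sorted(X_train.index) == sorted(manual_train))
# Or does scikit-learn's test set match the manual training set?
print(sorted(X_test.index) == sorted(manual_train))
```

If the second comparison prints True, the two approaches used the same shuffle but with the roles of training and test set swapped, which would be enough to explain different RMSE values.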

Using ranking data in Logistic Regression

I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise so I don't expect great results; just hoping to learn the "process" and coding techniques.
I have put my data in a .csv as follows :
URL WebsiteText AlexaRank GooglePageRank
In my Test CSV we have :
URL WebsiteText AlexaRank GooglePageRank Label
Label is a binary classification indicating "good" with 1 or "bad" with 0.
I currently have my LR running using only the website text; which I run a TF-IDF on.
I have two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer, as this is something I'd like some good help with so that I, and others, may learn.
1. How can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages, for which I have the Alexa rank of all of them; however, they aren't ranked 1-10,000. They are ranked out of the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn in order to get the best possible results from my data?
2. I am running my Logistic Regression in this way; I am nearly sure I have done this incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the Logistic Regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve would also be appreciated here.
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')
print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:,2])#Reading WebsiteText column for TF-IDF.
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1] #reading label
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1,sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)
X_all = traindata + testdata
lentrain = len(traindata)
print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))
#Add Two Integer Columns
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:,3])#Not sure if I am doing this correctly. Expecting it to contain AlexaRank and GooglePageRank columns.
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:,3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData
#Add two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1) #Think I have done this incorrectly.
print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
Thank you very much for all feedback - please post if you need any further information!

I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into features with mean 0 and standard deviation 1.
This should get rid of your first problem. AlexaRank will be centered around 0 with unit variance. (Yes, even massive AlexaRank values like 83904803289480 are transformed to small floating point numbers.) Of course, the results will not be integers between 1 and 10000, but they will maintain the same order as the original ranks. And in this case, keeping the rank normalized will help with your second problem, as follows.
In order to understand why normalization would help in LR, let's revisit the logit formulation of LR:
logit(p) = log(p / (1 - p)) = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5
In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank features. The linear form of the equation says that each coefficient represents the change in the logit of y for a one-unit change in that variable. Think about what happens when your X4 takes a massive rank value, say 83904803289480. In that case, the Alexa rank variable dominates your LR fit and a small change in a TF-IDF value has almost no effect on it. Now one might think that the coefficients should be able to adjust to small or large values to account for the differences between these features. Not in this case: it is not only the magnitude of the variables that matters but also their range. Alexa rank has a very large range and would dominate your LR fit. Therefore, I guess that normalizing all variables with StandardScaler to adjust their range will improve the fit.
Here is how you can scale the X matrix.
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
Don't forget to use same scaler to transform X_test.
X_test = sc.transform(X_test)
Now you can use the fitting procedure etc.
rd.fit(X, y)
rd.predict_proba(X_test)
Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
Edit: The parsing and column merging can easily be done with pandas, i.e. there is no need to convert the matrices into lists and then append them. Moreover, pandas DataFrames can be indexed directly by their column names.
AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
# train rows first, matching the order of X_all; DataFrame.append was removed in pandas 2.0
AllAlexaAndGoogleInfo = p.concat([AlexaAndGoogleTrainData, AlexaAndGoogleTestData])
Note that we pass the header=0 argument to read_table to keep the original header names from the tsv file. Also note how we can index using an entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack.
X = np.hstack((X, AllAlexaAndGoogleInfo))
hstack horizontally combines two multi-dimensional array-like structures, provided they have the same number of rows. If X is still the sparse TF-IDF matrix, scipy.sparse.hstack is the counterpart to use.

Regarding normalizing the numeric ranks, either scikit-learn's StandardScaler or a logarithmic transform (or both) should work well enough.
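As a rough illustration of the "both" option, the sketch below applies a log transform and then StandardScaler to some made-up rank values. The numbers are hypothetical; only the order of magnitude of the largest one is taken from the question:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Alexa ranks spanning many orders of magnitude
train_ranks = np.array([[1.0], [250.0], [40_000.0], [9_000_000.0], [83_904_803_289_480.0]])
test_ranks = np.array([[12.0], [3_000_000.0]])

# log1p compresses the huge range before standardizing
train_log = np.log1p(train_ranks)
test_log = np.log1p(test_ranks)

scaler = StandardScaler().fit(train_log)    # statistics come from the training data only
train_scaled = scaler.transform(train_log)
test_scaled = scaler.transform(test_log)    # reuse the same scaler for the test data

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Both steps are monotonic, so the ordering of the ranks is preserved, while the single huge value no longer stretches the scale for everything else.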
For building up a working pipeline, I find my sanity greatly benefits from using the Pandas package and the sklearn.pipeline utilities. Here is a simple script that should do what you need.
First, a couple of utility classes I always seem to need. It would be nice to have something like these in sklearn.pipeline or sklearn.utilities.
from sklearn import base
class Columns(base.TransformerMixin, base.BaseEstimator):
    def __init__(self, columns):
        super(Columns, self).__init__()
        # stored under the parameter's own name so get_params()/clone() keep working
        self.columns = columns
    def fit(self, *args, **kwargs):
        return self
    def transform(self, X, *args, **kwargs):
        return X[self.columns]
class Text(base.TransformerMixin, base.BaseEstimator):
    def fit(self, *args, **kwargs):
        return self
    def transform(self, X, *args, **kwargs):
        return (X.apply("\t".join, axis=1, raw=False))
Now set up the pipeline.
I used the SGDClassifier implementation of logistic regression since it tends to be more efficient for high-dimensional data like text classification. Also, I usually find that hinge loss gives better results than logistic regression anyway.
from sklearn import linear_model as lin
from sklearn import metrics
from sklearn.feature_extraction import text as txt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing as prep
import numpy as np
from pandas.io import parsers
import pandas as pd
pipe = Pipeline([
    ('feat', FeatureUnion([
        ('txt', Pipeline([
            ('txtcols', Columns(["WebsiteText"])),
            ('totxt', Text()),
            ('vect', txt.TfidfVectorizer()),
        ])),
        ('num', Pipeline([
            ('numcols', Columns(["AlexaRank", "GooglePageRank"])),
            ('scale', prep.StandardScaler()),
        ])),
    ])),
    # logistic loss; this option was called "log" in older scikit-learn releases
    ('clf', lin.SGDClassifier(loss="log_loss")),
])
Next train the model:
train=parsers.read_csv("train.csv")
pipe.fit(train, train.Label)
Finally evaluate on test data:
test=parsers.read_csv("test.csv")
tstlbl=np.array(test.Label)
print(pipe.score(test, tstlbl))
pred = pipe.predict(test)
print(metrics.confusion_matrix(tstlbl, pred))
print(metrics.classification_report(tstlbl, pred))
print(metrics.f1_score(tstlbl, pred))
prob = pipe.decision_function(test)
print(metrics.roc_auc_score(tstlbl, prob))
print(metrics.average_precision_score(tstlbl, prob))
You will probably not get very good results with everything left at its default settings like this, but it should give you a working baseline to build on. I can suggest some parameter settings that usually work for me if you like.
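For comparison, newer scikit-learn releases ship ColumnTransformer, which can route columns to different transformers without custom helper classes. The following is a sketch of that alternative, not the method used above: the four-row data set is invented, and LogisticRegression stands in for SGDClassifier to keep the toy example stable:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Tiny made-up data set with the column names from the question
train = pd.DataFrame({
    "WebsiteText": ["great recipes and cooking tips", "buy cheap pills now",
                    "in-depth cooking tutorials", "cheap pills cheap deals"],
    "AlexaRank": [1_200, 85_000_000, 40_000, 990_000_000],
    "GooglePageRank": [7, 1, 6, 0],
    "Label": [1, 0, 1, 0],
})

features = ColumnTransformer([
    # a single column name (not a list) hands the vectorizer a 1-D series of strings
    ("txt", TfidfVectorizer(), "WebsiteText"),
    # log-compress, then standardize the two numeric rank columns
    ("num", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()),
     ["AlexaRank", "GooglePageRank"]),
])

pipe = Pipeline([("feat", features), ("clf", LogisticRegression())])
pipe.fit(train, train["Label"])
pred = pipe.predict(train)
print(pred)
```

Columns not listed in the ColumnTransformer (here URL-like extras or the label) are dropped by default, so the whole DataFrame can be passed to `fit` and `predict`.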
