My question has to do with a very large dataset I'm running a regression on in Python. I have categorical data (gender, industry, region, salary groupings, etc.) that I would like to run a regression on with statsmodels. After using pd.get_dummies() on roughly 5 million rows, the whole dataframe comes out to about 83 columns wide.
Code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from datetime import datetime as dt
#Start time
print('Start Time: ', dt.now())
#Variables
groups = ['sex', 'central_age', 'group_size', 'industry', 'region', 'salary']
base_cases = ['sex_Male', 'central_age_47.0', 'group_size_F. 100-249', 'salary_A. < 25',
              'industry_H. Manufacturing - heavy, steel etc.', 'region_C. Division 3: East North Central']
aggregates = ['death_amount_exposed', 'death_claim_amount']
#Read/ format data to transform data into categorical variables
df = pd.read_pickle(r'./Life_Mortality_Data.pkl')
df = df[df['death_amount_exposed']!=0]
df['central_age'] = df['central_age'].apply(str)
final = pd.get_dummies(df[groups]).join(df[aggregates]).astype(float)
final.drop(base_cases, axis=1, inplace=True)
#Prepare string of variables to regress on in the next step
var_columns = list(final.columns)
for i in aggregates:
    var_columns.remove(i)
variables = '+'.join('Q("' + i + '")' for i in var_columns)
#Training and testing with Poisson model
print('Regression Time: ', dt.now(), '\n')
res1 = smf.glm(formula='death_claim_amount ~'+variables, data=final, offset=np.log(final['death_amount_exposed']), family=sm.families.Poisson(sm.families.links.log())).fit()
#Print stats summary, base cases, and multiplicative factors
print(res1.summary())
print('Base Cases:')
for case in base_cases:
    print(case)
print('\nParameters:\n', np.exp(res1.params))
#This takes the result of a statsmodel results table and transforms it into a dataframe
def results_summary_to_dataframe(results):
    pvals = results.pvalues
    coeff = results.params
    std_err = results.bse
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]
    results_df = pd.DataFrame({"pvals": pvals,
                               "coeff": coeff,
                               "std_error": std_err,
                               "conf_lower": conf_lower,
                               "conf_higher": conf_higher})
    #Reordering columns
    results_df = results_df[["coeff","std_error","pvals","conf_lower","conf_higher"]]
    return results_df
#Write data to excel
results_summary_to_dataframe(res1).to_excel(r'./All_Regression_Amounts_v1.xlsx')
#End time
print('\nEnd Time: ', dt.now())
The problem I'm having is that I run out of memory at the point where the statsmodels regression is run. I am using the 64-bit version of Python on Windows and have 32 GB of memory, which I thought would be more than enough for this kind of computation, but I'm not sure whether I'm failing to use all the available memory or whether something is wrong with my code. I'm very new to this kind of analysis and to handling this much data. I'd really appreciate any help on what I can do to resolve this error.
When building linear models on datasets that are too large to hold in memory, your best bet is to train the model with stochastic gradient descent (SGD). This fits the model iteratively by gradient descent, using repeated small samples of the data rather than all the data at once.
Scikit-learn has an SGDClassifier estimator that fits a linear model this way (and SGDRegressor for regression targets like yours). You could take a look at those and see if they might work for you.
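A minimal sketch of that idea, assuming the final dataframe and var_columns list from the question above. Note this fits a plain least-squares linear model with SGDRegressor, not the Poisson GLM with an exposure offset, so it only approximates the original model:

from sklearn.linear_model import SGDRegressor

# incremental fit over chunks of the in-memory dataframe; the same pattern
# works when reading the data chunk by chunk from disk instead
sgd = SGDRegressor()
X = final[var_columns].to_numpy()
y = final['death_claim_amount'].to_numpy()

chunk_size = 100_000
for start in range(0, len(y), chunk_size):
    sl = slice(start, start + chunk_size)
    sgd.partial_fit(X[sl], y[sl])  # one gradient-descent pass per chunk

print(sgd.coef_)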
Related
[image of the error]
I am trying to build a collaborative recommendation system with the code below. I am new to deep learning and I am stuck on this error when I try to train the model. I want to train the model with a CSV dataset. Can anyone please help me understand what's happening? I would really appreciate it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the surprise packages
from surprise import Dataset
from surprise import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV
# Import train_test_split
from surprise.model_selection import train_test_split
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('user_clean.csv')
user_df.head()
# Merge the two dataframes on appid
df = user_df.merge(games_df,on='appid')
df = df.drop('name', axis=1)
df.head()
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]
# Let's find this users favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()
# initialize the reader with 1-10 rating scale
my_reader = Reader(rating_scale=(0,10))
# load the dataframe with the reader
md = Dataset.load_from_df(rating_10_df, my_reader)
%%time
# Set the parameter grid for optimization
param_grid = {
    # Number of latent factors. More factors could give better results, but can also lead to overfitting
    'n_factors': [50, 100, 150],
    # Number of epochs, i.e. the number of iterations the algorithm will run
    'n_epochs': [10, 20, 50],
    # Learning rate. Larger values give faster learning, smaller values give more accurate learning.
    'lr_all': [0.005, 0.1],
    'biased': [False] }
# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)
# Fit the model to the data
GS.fit(md)
I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not have this functionality directly. As Kevin S mentioned, though, pmdarima does have a wrapper that provides it: specifically, the update method. Per their documentation: "Update the model fit with additional observed endog/exog values."
See example below around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the model parameters with the new parameters
model.update(data_2)
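For completeness, a hedged usage note (standard pmdarima API, not part of the code above): after update(), forecasts from the extended sample come from predict():

# forecast the next 7 observations after incorporating data_2
forecast = model.predict(n_periods=7)
print(forecast)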
I don't know how to achieve that with AutoReg; I think it can be done somehow, but you would need to evaluate the results manually or add the data yourself.
With ARIMA and SARIMAX, however, it's already implemented and simple.
For incremental learning there are three related functions, documented here. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append; append can optionally refit. I don't know the exact difference between them, though.
Here is my example, which is a bit different but similar:
import numpy as np
from statsmodels.tsa.api import ARIMA

data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)
print(prediction) # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction) # [800. 801. 802. 803. 804. 805. 806.]
This replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast compared to fit.
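A hedged sketch of extend and append on the same kind of fitted ARIMA results object (assuming a statsmodels version where ARIMAResults.extend/append are available). Unlike apply, these continue the original sample:

# refit on the original `data` so this snippet stands on its own
fitted = ARIMA(data, order=order).fit()
more_data = np.array(range(200, 260))

extended = fitted.extend(more_data)              # keep the parameters; only the new data becomes the sample
appended = fitted.append(more_data, refit=True)  # old + new data together, optionally refitting

print(extended.forecast(7))
print(appended.forecast(7))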
I am trying to do a linear regression. With the results, I want to multiply each x by its own estimated coefficient: x_i · β_i.
However, I am doing a lot of transformations on x_i.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I would need pow(Literacy, 2) and log_plus_1(Wealth). Since they go into the model, I was hoping to get them back out of it too, instead of re-applying the transformations to the original dataset.
In R I would use res$model to get them.
The data is stored as attributes of the model: e.g. the design matrix is mod.exog and the dependent or response variable is mod.endog.
(I'm not sure I remember the details correctly: the data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the explanatory-variable data for prediction in the same way the data was transformed when the model was created.
predict only returns the prediction, not the internally transformed exog used for prediction.
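A short sketch of what that looks like with the Guerry example above (the attribute names follow the hedged description in this answer, so verify them against your statsmodels version):

import pandas as pd

# transformed design matrix as a labelled DataFrame (includes the intercept column)
design = pd.DataFrame(mod.exog, columns=mod.exog_names)
print(design.head())

# patsy's original, pandas-flavoured version of the same design matrix
print(mod.data.orig_exog.head())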
I would like some assistance with a problem I have. I have a big CSV file (shape (6239292, 5)) and want to apply an unsupervised machine learning technique (kmodes). My code is this:
import numpy as np
import pandas as pd
print("initialising")
syms = np.genfromtxt('foo.csv', delimiter = ';', dtype=str, skip_header=1, invalid_raise=False)[:, 0:]
print(syms.shape)
X = np.genfromtxt('foo.csv',dtype=object, delimiter=';', invalid_raise=False, skip_header=1)[:, 1:]
X[1:, 0] = X[1:, 0].astype(float)
from kmodes.kprototypes import KPrototypes
print("Imported successfully")
kproto = KPrototypes(n_clusters=6, init='random', n_init=2, verbose=2)
clusters = kproto.fit_predict(X, categorical=[2,1,3,])
Due to the size of the file, it's taking forever. Is there any technique I could use to reduce the time? Thank you in advance!
You can select the first n rows like:
pd.read_csv(..., nrows=999999)
or skip some rows and then select the next n rows:
pd.read_csv(..., skiprows=1000000, nrows=999999)
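A hedged sketch of that idea with the file layout from the question (semicolon-separated with a header row; the skiprows variant shown skips a range of data rows so the header is kept):

import pandas as pd
from kmodes.kprototypes import KPrototypes

sample = pd.read_csv('foo.csv', sep=';', nrows=999999)                                # first n data rows
# sample = pd.read_csv('foo.csv', sep=';', skiprows=range(1, 1000001), nrows=999999)  # a later slice

X = sample.iloc[:, 1:].to_numpy()
kproto = KPrototypes(n_clusters=6, init='random', n_init=2, verbose=2)
clusters = kproto.fit_predict(X, categorical=[2, 1, 3])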
There shouldn't be a problem with your results, due to the Central Limit Theorem:
The Central Limit Theorem (CLT) is a statistical theory which states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
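A tiny illustration of the quoted claim with made-up data: the mean of a large random subset tracks the mean of the full dataset.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=3.0, size=6_000_000)   # stand-in for the full 6.2M-row column
sample = rng.choice(population, size=1_000_000, replace=False)

print(population.mean(), sample.mean())   # the two means are very close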
The data I have is recorded hourly over the past 4 months. I am building a time series model and I've tried several methods so far (ARIMA, LSTMs, Prophet), but they can be quite slow for my task since I have to run the model on thousands of time series in different locations. So then I thought it might be interesting to transform the problem into a supervised one and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average and hourly average. So at the moment I am using these 4 predictors, but could possibly extract more (like beginning of the day, noon, etc.); also, if you have any other suggestions here, they are very welcome :)
I've used XGBoost for the regression and here are parts of the code:
# XGB
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Functions needed
def convert_dates(x):
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return(x)
def add_avg(x):
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek','hour'])['y'].transform('mean')
    #x['monthly_avg'] = x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg'] = x.groupby(['week_no'])['y'].transform('mean')
    return x
xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build the model for it. Here I split the data into a train part and a test part. I knew there might be problems due to the Easter holidays in my country last week, because those are rare events, so that is why I split the training and test data in that manner. I actually use the data from the beginning of the year up to two weeks ago as training data, and the very next week after that as test data.
for j in range(10,20):
    data = df_all.loc[df_all['Cell_Id']==top_cells[j]]
    data.drop(['Cell_Id', 'WDay'], axis = 1, inplace = True)
    data['date'] = data.index
    period = 168
    data_train = data.iloc[:-2*period,:]
    data_test = data.iloc[-2*period:-period,:]
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages and the 168 hourly averages computed from the training period. This is actually the part that takes the longest to run, and I would like to improve its efficiency.
value_dict = {}
for k in range(168):
    value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
data_test['daily_avg'] = 0
data_test['hourly_avg'] = 0
for i in range(len(data_test)):
    data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
    data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is about 30 seconds for every iteration of the for loop, which is way too slow because of the poor way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this bit faster.
I will also add the rest of my code and make some other observations as well:
x_train = data_train.drop('y',axis=1)
x_test = data_test.drop('y',axis=1)
y_train = data_train['y']
y_test = data_test['y']
def XGBmodel(x_train, x_test, y_train, y_test):
    matrix_train = xgb.DMatrix(x_train, label=y_train)
    matrix_test = xgb.DMatrix(x_test, label=y_test)
    model = xgb.train(params={'objective':'reg:linear','eval_metric':'mae'},
                      dtrain=matrix_train, num_boost_round=500,
                      early_stopping_rounds=20, evals=[(matrix_test,'test')])
    return model
model=XGBmodel(x_train,x_test,y_train,y_test)
#submission = pd.DataFrame(x_pred.pop('id'))
y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit = model.best_ntree_limit)
#submission['sales']= y_pred
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
y_test.reset_index(inplace = True, drop = True)
compare_df = pd.concat([y_test, y_pred], axis = 1)
compare_df.columns = ['Real', 'Predicted']
compare_df.plot()
mape = (np.abs((y_test['y'] - y_pred[0])/y_test['y']).mean())*100
r2 = r2_score(y_test['y'], y_pred[0])
xgb_mape_r2_dict[top_cells[j]] = [mape,r2]
I've used both R-squared and MAPE as accuracy measures, although I don't think MAPE is indicated anymore since I've transformed the time series problem into a regression problem. Any thoughts on this?
Thank you very much for your time and consideration. Any help is very much appreciated.
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg, ['dayofweek'], 'daily_avg')
data_test = merge(data_test, hourly_avg, ['dayofweek','hour'], 'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where we used the merge function defined as:
def merge(x, y, col, col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False, validate=None)
    x = x.rename(columns={'sales': col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.