How to make a data frame combining different regression results in Python?

I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The predictions are arrays of float64, while y_test is a DataFrame.
I want to create a table with the results. I have tried several approaches (passing the values as lists, converting them, selecting .values) without success so far. Can anyone help?
My last attempt was:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVR})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to add two new rows with the mean and standard deviation of the results, placed at the beginning (the first rows) of the DataFrame.
Thanks

import pandas as pd
import numpy as np
real = np.array([2] * 10).reshape(-1,1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)
real = real.flatten()
comparison = pd.DataFrame({'real':real,'y_pred_LR':y_pred_LR,'y_pred_SVR':y_pred_SVR,"y_pred_RF":y_pred_RF})
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean,StD],axis=1).T
result = pd.concat([Mean_StD,comparison],ignore_index=True)
print(result)
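A variation on the same idea, as a minimal sketch: here `y_test` is assumed to be a one-column DataFrame with a shuffled index (as a train/test split would typically return), and the prediction arrays are made-up stand-ins. Flattening `y_test` to a 1-D array avoids index alignment problems, and `agg(["mean", "std"])` keeps the two summary rows labelled instead of numbering them 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for y_test and the model predictions.
y_test = pd.DataFrame({"target": [2.0, 4.0, 6.0]}, index=[7, 2, 5])
y_pred_LR = np.array([1.0, 4.0, 7.0])
y_pred_RF = np.array([2.0, 5.0, 5.0])

comparison = pd.DataFrame(
    {
        "Real": y_test.to_numpy().ravel(),  # flatten the (n, 1) frame to 1-D
        "LR": y_pred_LR,
        "RF": y_pred_RF,
    },
    index=y_test.index,  # keep the original row labels
)

# One row per statistic, labelled "mean" and "std".
summary = comparison.agg(["mean", "std"])

# Summary rows first, then the individual predictions.
result = pd.concat([summary, comparison])
print(result)
```

Because `ignore_index` is not used here, the row labels of `result` are a mix of strings and the original index values; pass `ignore_index=True` to `pd.concat` if a plain 0..n index is preferred.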

Related

LOOP univariate rolling window regression on entire DF Python

I have a dataframe of 24 variables (24 columns x 4580 rows) from 2008 to 2020.
My independent variable is the first one in the DF and the dependent variables are the 23 others.
I've tested a single rolling window regression and it works well. Here is my code:
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
import seaborn
seaborn.set_style('darkgrid')
pd.plotting.register_matplotlib_converters()
x = sm.add_constant(df[['DIFFSWAP']])
y = df[['CADUSD']]
rols = RollingOLS(y,x, window=60)
rres = rols.fit()
params = rres.params
r_sq = rres.rsquared
Now I would like to loop over all the dependent variables of the DF (columns 2 to 24), run a rolling window regression of each one on the independent variable (column 1), and store the coefficients and the R-squared values.
My ultimate goal is to extract the R-squared values and coefficients, put them in DataFrames (or lists, or whatever works), and then graph them.
I'm new to Python so I'd be very grateful for any help.
Thank you!
Can you throw it all in a loop and store the results in some other object like a dict?
Potential solution:
data = {}
for column in list(df.columns)[2:]: # iterate over columns 2 to 24
    x = sm.add_constant(df[column])
    y = df[['CADUSD']] ## This never changes from CADUSD, right?
    rols = RollingOLS(y, x, window=60)
    rres = rols.fit()
    params = rres.params
    r_sq = rres.rsquared
    # Store results from each column's fit as a new dict entry
    data[column] = {'params':params, 'r_sq':r_sq}
results_df = pd.DataFrame(data).T
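For the setup described in the question (first column as the single regressor, every other column as a dependent variable), the loop can also be sketched with plain pandas rolling statistics instead of RollingOLS. This is a swapped-in technique, not the statsmodels API: for a regression with one regressor plus a constant, the slope is cov(x, y) / var(x), the intercept is mean(y) - slope * mean(x), and R-squared is the squared correlation. The data and the column name `EURUSD` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: DIFFSWAP is the regressor, the rest depend on it.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"DIFFSWAP": rng.normal(size=n)})
df["CADUSD"] = 2.0 * df["DIFFSWAP"] + rng.normal(scale=0.1, size=n)
df["EURUSD"] = -1.0 * df["DIFFSWAP"] + rng.normal(scale=0.1, size=n)

window = 60
x = df.iloc[:, 0]          # independent variable: first column
x_roll = x.rolling(window)

slopes, intercepts, r_squared = {}, {}, {}
for column in df.columns[1:]:  # every column except the first
    y = df[column]
    slope = x_roll.cov(y) / x_roll.var()
    slopes[column] = slope
    intercepts[column] = y.rolling(window).mean() - slope * x_roll.mean()
    r_squared[column] = x_roll.corr(y) ** 2

# One column per dependent variable, one row per date.
slopes_df = pd.DataFrame(slopes)
intercepts_df = pd.DataFrame(intercepts)
r_squared_df = pd.DataFrame(r_squared)
```

Each resulting frame has the same index as `df`, with NaN in the first `window - 1` rows, so a call such as `r_squared_df.plot()` should graph all series at once.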

How to save predicted regression values inside a for loop?

I'm trying to use statsmodels to run separate logistic regressions for each "group" in a pandas dataframe and save the predicted probabilities for each observations (row). Each "group" represents about 2500 respondents or observations; I would like to get the predicted probability for each respondent - similar to how with SPSS you can "save" predicted probabilities when running a logistic regression.
I've read what others have attempted, but nothing seems to work. I'm using SPSS to check that the looping operation in Python is working correctly - the predicted probabilities should be the same (SPSS has a split function which makes this really easy).
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
df = pd.read_csv('test_data.csv')
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df)
    print(est_result.summary())
    df['pred'] = pred
The model summaries are correct (est_result.summary()) and match SPSS exactly. However, the saved predicted values do not match at all. I cannot seem to understand how to get it to work correctly.
Any advice is appreciated.
I solved it in a really un-pythonic kind of way. I hope someone can improve this code. The probabilities now match exactly what SPSS produces when you split the file by group, and run individual regressions by group.
results = []
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df_slice) # predict only for this group's rows
    results.append(pred)
    # print(est_result.summary())
n = len(df['Brand'].unique())
r = pd.DataFrame(results) # put the results into a dataframe
rt = r.T # transpose the dataframe
r_small = rt[rt.columns[-n:]] # remove all but the last n columns, n = number of categories
r_new = r_small.bfill(axis=1).iloc[:, 0] # merge the n columns and remove the NaNs
r_new # show us
df['predicted'] = r_new # combine the r_new array with the original dataframe
df # show us
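One possible simplification, sketched under assumptions: each per-group prediction is a Series that keeps the row labels of its slice, so the pieces can be concatenated and assigned back in one step, and pandas aligns them on the index. To keep the sketch runnable without statsmodels, `predict_for_group` is a hypothetical placeholder (it just returns the group mean); in the real code it would be the `logit(...).fit().predict(df_slice)` call from above.

```python
import pandas as pd

# Made-up data with the same column names as in the question.
df = pd.DataFrame({
    "Brand": ["a", "b", "a", "b", "a"],
    "binary": [1, 0, 0, 1, 1],
})

def predict_for_group(df_slice):
    # Hypothetical placeholder for logit(...).fit().predict(df_slice).
    # It returns a Series indexed like the slice, as predict() would.
    return pd.Series(df_slice["binary"].mean(), index=df_slice.index)

pieces = []
for cat, df_slice in df.groupby("Brand"):
    pieces.append(predict_for_group(df_slice))

# Assignment aligns on the index, so every row gets its own group's value.
df["predicted"] = pd.concat(pieces)
print(df)
```

This avoids the transpose and `bfill` steps, since no NaN-padded columns are ever created.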

DataFrame NumPy value accuracy

I'm using the following code to scale values from one interval to another. "outputs_max" and "outputs_min" are numpy arrays, and so (as a result) are "slope" and "intercept".
For higher clarity when displaying the result "scaled_outputs", I used pandas to create a DataFrame of the file "out.npy" which I called "output_array". The resulting array "scaled_outputs" hence is displayed in a DataFrame too and later on stored as a numpy file.
import pandas as pd
import numpy as np
output_file = np.load("U:\\out.npy")
output_array = pd.DataFrame(output_file)
desired_upper_bound = 1
desired_lower_bound = 0
slope = (desired_upper_bound - desired_lower_bound) / (outputs_max - outputs_min)
intercept = desired_upper_bound - (slope * rounded_outputs_max)
scaled_outputs = slope * output_array + intercept
np.save("U:\\scaled_outputs.npy", scaled_outputs)
Am I losing accuracy of the values by creating a DataFrame and passing it into the equation? Would it be better to pass the numpy array "output_file" and create a DataFrame of "scaled_outputs" afterwards?
The result in the console is displayed with 5 decimals at max, which is why I'm asking.
No, you're not losing precision or accuracy by using a dataframe in your equation. What you're seeing on the console is a result of display precision. You can change the display.precision property to see more digits when a dataframe is displayed.
pandas.set_option("display.precision", 10)
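A small check of this, as a sketch with made-up data (the random array stands in for the contents of out.npy, and the scalar `slope` and `intercept` are hypothetical): the same arithmetic is done once on the NumPy array and once on the DataFrame, and the stored values are compared element by element.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = rng.random((4, 3))       # stand-in for the array loaded from out.npy
frame = pd.DataFrame(raw)

slope, intercept = 0.5, 0.25   # hypothetical scaling parameters

scaled_np = slope * raw + intercept
scaled_df = slope * frame + intercept

# Compare the underlying float64 values, not their printed form.
values_match = np.array_equal(scaled_np, scaled_df.to_numpy())
print(values_match)

# Temporarily show more digits; this changes the display only.
with pd.option_context("display.precision", 10):
    print(scaled_df)
```

`pd.option_context` restores the previous display setting when the block ends, which is handy when the extra digits are only needed once.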

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference
Unfortunately, pandas.DataFrame.rolling().apply() passes the function one column's window at a time as 1-D data, so it cannot be used as one might expect to roll over the rows of the df and pass windows of whole rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000,10))
df = pd.DataFrame(data)
# Set the window size
window = 100
# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame( np.zeros((data.shape[0] - window + 1, data.shape[1])) )
# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output array.
def rolling_pca(window_data):
    idx = window_data.astype(int)  # rolling passes the row indices as floats
    pca = PCA()
    transf = pca.fit_transform(df.iloc[idx])
    df_pca.iloc[idx[0]] = transf[0,:]
    return 0.0  # apply() expects a numeric return value; it is discarded
# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
# Use `rolling` to apply the PCA function; raw=True passes NumPy arrays
_ = df_idx.rolling(window).apply(rolling_pca, raw=True)
# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
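The manual slicing mentioned as a control can itself serve as the implementation. The sketch below loops over the windows explicitly and, to stay free of extra dependencies, computes the PCA scores with `numpy.linalg.svd` rather than scikit-learn's `PCA`; this is a swapped-in technique, and the signs of individual components may differ from scikit-learn's output because principal axes are only defined up to sign. The data sizes are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)))  # 200 time points, 5 features
window = 50

def pca_scores(block):
    """Project the rows of `block` onto its principal axes (via SVD)."""
    centered = block - block.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T

values = df.to_numpy()
n_windows = len(df) - window + 1
first_row_scores = np.empty((n_windows, df.shape[1]))

# For each window, keep the scores of its first row, as in the code above.
for start in range(n_windows):
    first_row_scores[start] = pca_scores(values[start:start + window])[0]

df_pca = pd.DataFrame(first_row_scores, index=df.index[:n_windows])
print(df_pca.shape)
```

Keeping a different row of each window (for example the last one, `[-1]`) only requires changing the index in the loop and the index labels passed to the final DataFrame.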

How to fit multidimensional output using scikit-learn?

I am trying to fit a one-vs-all classifier on training data whose output is multidimensional; each row of the output adds up to 1.
One possible way is to read all the rows, find which column has the highest value in each, and prepare the training data from that.
E.g. y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], taking a, b, c as the classes corresponding to columns 0, 1, 2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np
y = np.array([[0.2,0.8,0],[0,1,0],[0,0.3,0.7]])
def transform(y, labels):
    # For each row, pick the label of the column with the highest value
    return np.array(list(labels))[y.argmax(axis=1)]
y = transform(y,'abc')
EDIT: Using the comment by alko, I made it more general by letting the user supply the labels to the transform function.
