How to get feature importance in RF - python

I am trying to get RF feature importance. I fit the random forest on the data like this:

model = RandomForestRegressor()
n = model.fit(self.X_train, self.y_train)
if n is not None:
    df = pd.DataFrame(data=n, columns=["Feature", "Importance_Score"])
    df["Feature_Name"] = np.array(self.X_Headers)
    df = df.drop(["Feature"], axis=1)
    df[["Feature_Name", "Importance_Score"]].to_csv("RF_Importances.csv", index=False)
    del df

However, the n variable returns None. Why is this happening?

I'm not sure why model.fit(self.X_train, self.y_train) would return None here; we would need more information about how you set up the model.
If we set this up using simulated data, it works:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

np.random.seed(111)
X = pd.DataFrame(np.random.normal(0, 1, (100, 5)), columns=['A', 'B', 'C', 'D', 'E'])
y = np.random.normal(0, 1, 100)

model = RandomForestRegressor()
n = model.fit(X, y)
if n is not None:
    df = pd.DataFrame({'features': X.columns, 'importance': n.feature_importances_})

df
features importance
0 A 0.176091
1 B 0.183817
2 C 0.169927
3 D 0.267574
4 E 0.202591
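Note that fit() in scikit-learn returns the fitted estimator itself, so n is the model (never None), and the importances live in its feature_importances_ attribute rather than in the return value directly. A minimal sketch of writing them to CSV the way the question intends, assuming self.X_Headers holds the feature names in the same order as the columns of self.X_train:

model = RandomForestRegressor()
model.fit(self.X_train, self.y_train)  # fit() returns the estimator; assigning the result is optional

# one importance score per input feature, in column order
df = pd.DataFrame({
    "Feature_Name": list(self.X_Headers),            # assumed list of column names
    "Importance_Score": model.feature_importances_,
})
df.to_csv("RF_Importances.csv", index=False)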

Related

How do I create a linear regression model for a file that has about 500 columns as y variables? Working with Python

This code manually selects a column from the y table and then joins it to the X table. The program then performs linear regression. Any idea how to do this for every single column from the y table?
import pandas as pd
from sklearn import linear_model

yDF = pd.read_csv('ytable.csv')
yDF.drop('Dates', axis = 1, inplace = True)
XDF = pd.read_csv('Xtable.csv')
ycolumnDF = yDF.iloc[:,0].to_frame()
regressionDF = pd.concat([XDF,ycolumnDF], axis=1)
X = regressionDF.iloc[:,1:20]
y = regressionDF.iloc[:,20:].squeeze()
lm = linear_model.LinearRegression()
lm.fit(X,y)
cf = lm.coef_
print(cf)
You can regress multiple y's on the same X's at the same time. Something like this should work:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_X = pd.DataFrame(columns=['x1', 'x2', 'x3'], data=np.random.normal(size=(10, 3)))
df_y = pd.DataFrame(columns=['y1', 'y2'], data=np.random.normal(size=(10, 2)))

X = df_X.iloc[:, :]
y = df_y.iloc[:, :]

lm = LinearRegression().fit(X, y)
print(lm.coef_)
produces
[[ 0.16115884 0.08471495 0.39169592]
[-0.51929011 0.29160846 -0.62106353]]
The first row here ([ 0.16115884 0.08471495 0.39169592]) holds the regression coefficients of y1 on the x's, and the second row holds the coefficients of y2 on the x's.
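Applied to the question's own files, a sketch might look like the following, assuming (as in the original code) that the rows of Xtable.csv and ytable.csv are aligned, that ytable.csv has a 'Dates' column to drop, and that the first column of Xtable.csv is an identifier rather than a feature:

import pandas as pd
from sklearn.linear_model import LinearRegression

XDF = pd.read_csv('Xtable.csv')
yDF = pd.read_csv('ytable.csv')
yDF.drop('Dates', axis=1, inplace=True)

X = XDF.iloc[:, 1:]                   # drop the first (identifier) column, as in the original code
lm = LinearRegression().fit(X, yDF)   # fits all ~500 y columns in one call

# one row of coefficients per y column, one column per X feature
coefs = pd.DataFrame(lm.coef_, index=yDF.columns, columns=X.columns)
print(coefs)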

Encoding multiple columns

Suppose a dataframe has two or more columns with numerical and text values, plus one Label/Target column. If I want to apply a model like SVM, how can I use only the columns I am most interested in?
Ex.
Data                                     Num     Label/Target  No_Sense
What happens here?                       group1  1             Migrate
Customer Management                      group2  0             Change Stage
Life Cycle Stages                        group1  1             Restructure
Drop-down allows to select status type   group3  1             Restructure Status
and so on.
The approach I have taken is:
1. Encode the "Num" column:

one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num', axis=1)
df = df.join(one_hot)
2. Encode the "Data" column:

def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist()
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
Then apply bag_words to the dataset
X, y = bag_words(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
Is there anything that I missed in these steps? How can I select only "Data" and "Num" features in my training dataset? (as I think "No_Sense" is not so relevant for my purposes)
EDIT: I have tried with

def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist() + df["group1"].tolist() + df["group2"].tolist() + df["group3"].tolist()  # <----
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels

but I get this error:
TypeError: 'int' object is not iterable
The error comes from appending the dummy columns to the text corpus: df["group1"].tolist() and the others contain integers, and CountVectorizer can only transform text documents. Keep the dummy columns as separate numeric features instead. I hope this helps you:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer

# this part only recreates your df from the string you posted
# remove this part !!!!
data = """
Data                                     Num     Label/Target  No_Sense
What happens here?                       group1  1             Migrate
Customer Management                      group2  0             Change Stage
Life Cycle Stages                        group1  1             Restructure
Drop-down allows to select status type   group3  1             Restructure Status
"""
lines = [line for line in data.split("\n") if line.strip()]
df = pd.DataFrame(np.array([re.split(r'\s{2,}', line) for line in lines[1:]]),
                  columns=lines[0].split())

# what you want starts from here!!!!
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num', axis=1)
df = df.join(one_hot)
# at this point you have 3 new features for the 'Num' variable

def bag_words(df):
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    matrix = count_vectorizer.transform(df['Data'])
    # this dataframe `encoded_df` has 15 new features, the result of fitting
    # the CountVectorizer to the 'Data' variable
    encoded_df = pd.DataFrame(data=matrix.toarray(),
                              columns=["Data" + str(i) for i in range(matrix.shape[1])])
    # adding them to the dataframe
    df = df.join(encoded_df)
    # getting the numpy arrays that you can use in training
    X = df.loc[:, ["Data" + str(i) for i in range(matrix.shape[1])] + ["group1", "group2", "group3"]].to_numpy()
    y = df.loc[:, ["Label/Target"]].to_numpy()
    return X, y

X, y = bag_words(df)
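If the corpus gets large, calling matrix.toarray() densifies the bag-of-words matrix, which can be wasteful. A hedged alternative sketch, assuming df already carries the group1/group2/group3 dummy columns created above, is to keep the text features sparse and stack the dummies onto them with scipy.sparse.hstack:

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer().fit(df['Data'])
text_matrix = count_vectorizer.transform(df['Data'])                    # sparse bag-of-words
num_matrix = csr_matrix(df[["group1", "group2", "group3"]].astype(int).to_numpy())

X_sparse = hstack([text_matrix, num_matrix])                            # text counts + Num dummies
y = df["Label/Target"].to_numpy()

Most scikit-learn estimators, including LinearSVC, accept this sparse X directly.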

How to resolve Boolean value error in linear regression model in python?

I am trying to run a Fama-MacBeth regression in Python. As a first step I am running the time-series regression for every asset in my portfolio, but I am unable to run it because I am getting an error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to Python and have heavily relied on this forum to help me out. I hope you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful to you!
I assume this call is producing the error, because when I run the function without the for loop, it works perfectly.

for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset=0, dependent=data_set.iloc[:, i],
                             independent=data_set.iloc[:, 30:], const=True,
                             parameters='beta', win=12)

The dimension of my matrix is 108x35: 30 stocks and 5 factors over 108 time points. Hence I want to run a regression for every stock against the factors and store the resulting coefficients in a dataframe. Sample dataframe:
Date        BAS GY   AI FP    SGL GY   LNA GY   AKZA NA  Market Factor
1/29/2010   -5.28%   -7.55%   -1.23%   -5.82%   -7.09%   -5.82%
2/26/2010    0.04%   13.04%   -1.84%    4.06%  -14.62%  -14.62%
3/31/2010   10.75%    1.32%    7.33%    6.61%   12.21%   12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name='Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)

X = data_set.iloc[:, 30:]
y = data_set.iloc[:, :30]

def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df

    # Loop info
    end = df.shape[0]
    win = win
    rng = np.arange(start=win, stop=end, step=1)

    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        #print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent

        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()

        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_results = pd.concat([df_results, df_temp], axis=0)

        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
            df_results = pd.concat([df_results, df_temp], axis=0)

    return df_results

cols = len(y.columns)
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset=0, dependent=data_set.iloc[:, i],
                             independent=data_set.iloc[:, 30:], const=True,
                             parameters='beta', win=12)
ValueError: Must pass DataFrame with boolean values only
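One possible cause, offered as an assumption since the full traceback is not shown: dependent and independent are passed in as a Series and a DataFrame (slices of data_set), so inside the function dfr[x] ends up indexing the rolling window dfr with another DataFrame, and pandas only accepts a DataFrame indexer when it is a boolean mask, hence "Must pass DataFrame with boolean values only". A minimal sketch of the idea, passing column labels instead so that dfr[y] and dfr[x] select columns from each window (the 30-stock / factor column layout is taken from the question):

# sketch, not a verified fix: pass column labels rather than slices of the data
dep_cols = data_set.columns[:30]            # the 30 stock return columns
factor_cols = list(data_set.columns[30:])   # the factor columns

for col in dep_cols:
    df_beta = RegressionRoll(df=data_set, subset=0, dependent=col,
                             independent=factor_cols, const=True,
                             parameters='beta', win=12)

Inside RegressionRoll, dfr[col] then yields a Series for the dependent variable and dfr[factor_cols] a DataFrame of regressors, both of which sm.OLS accepts.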

Arima grid search function

I am writing a function that does a grid search for an ARIMA model based on the BIC criterion.
I am able to do an ARIMA grid search, but I am trying to update it to a SARIMA grid search.
The code looks like the below; I just can't make it work.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm
# data: the time series being modelled (not shown in the question)

p_min = 0
d_min = 0
q_min = 0
p_max = 4
d_max = 0
q_max = 4

# Initialize a DataFrame to store the results
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min, p_max + 1)],
                           columns=['MA{}'.format(i) for i in range(q_min, q_max + 1)])

for p, d, q in itertools.product(range(p_min, p_max + 1),
                                 range(d_min, d_max + 1),
                                 range(q_min, q_max + 1)):
    if p == 0 and d == 0 and q == 0:
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
        continue
    try:
        model = sm.tsa.SARIMAX(data, order=(p, d, q), trend="c",
                               seasonal_order=(p, d, q, 12))
        results = model.fit()
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
    except:
        continue

results_bic = results_bic[results_bic.columns].astype(float)
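One likely sticking point is that the seasonal order reuses (p, d, q), so every candidate couples its seasonal and non-seasonal lags, and the AR/MA-labelled results table has no room for a separate seasonal dimension. A minimal sketch of a SARIMA grid search, assuming data is a monthly series and keeping only the best model by BIC instead of a table:

import itertools
import numpy as np
import statsmodels.api as sm

p = q = range(0, 3)
d = [0, 1]
P = Q = range(0, 2)
D = [0, 1]

best_bic, best_order, best_seasonal = np.inf, None, None
for order in itertools.product(p, d, q):
    for seasonal in itertools.product(P, D, Q):
        try:
            res = sm.tsa.SARIMAX(data, order=order,
                                 seasonal_order=seasonal + (12,),
                                 trend="c").fit(disp=False)
        except Exception:
            continue
        if res.bic < best_bic:
            best_bic = res.bic
            best_order = order
            best_seasonal = seasonal + (12,)

print(best_order, best_seasonal, best_bic)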

Python pandas create new column with groupby with custom agg function

My DataFrame:
from random import random, randint
from pandas import DataFrame
t = DataFrame({"metasearch":["A","B","A","B","A","B","A","B"],
"market":["A","B","A","B","A","B","A","B"],
"bid":[random() for i in range(8)],
"clicks": [randint(0,10) for i in range(8)],
"country_code":["A","A","A","A","A","B","A","B"]})
I want to fit LinearRegression for each market, so I:
1) Group df - groups = t.groupby(by="market")
2) Prepare function to fit model on a group -
from sklearn.linear_model import LinearRegression

def group_fitter(group):
    lr = LinearRegression()
    X = group["bid"].fillna(0).values.reshape(-1, 1)
    y = group["clicks"].fillna(0)
    lr.fit(X, y)
    return lr.coef_[0]  # THIS IS A SCALAR
3) Create a new Series with market as an index and coef as a value:
s = groups.transform(group_fitter)
But the 3rd step fails: KeyError: ('bid_cpc', 'occurred at index bid')
I think you need apply instead of transform, because the function works with several columns of the group together, while transform passes each column to the function one at a time. To attach the result as a new column, use join:
from sklearn.linear_model import LinearRegression

def group_fitter(group):
    lr = LinearRegression()
    X = group["bid"].fillna(0).values.reshape(-1, 1)
    y = group["clicks"].fillna(0)
    lr.fit(X, y)
    return lr.coef_[0]  # THIS IS A SCALAR

groups = t.groupby(by="market")
df = t.join(groups.apply(group_fitter).rename('new'), on='market')
print(df)
        bid  clicks country_code market metasearch       new
0  0.462734       9            A      A          A -8.632301
1  0.438869       5            A      B          B  6.690289
2  0.047160       9            A      A          A -8.632301
3  0.644263       0            A      B          B  6.690289
4  0.579040       0            A      A          A -8.632301
5  0.820389       6            B      B          B  6.690289
6  0.112341       5            A      A          A -8.632301
7  0.432502       0            B      B          B  6.690289
Just return the group from the function instead of the coefficient.

# return the group instead of a scalar value
def group_fitter(group):
    lr = LinearRegression()
    X = group["bid"].fillna(0).values.reshape(-1, 1)
    y = group["clicks"].fillna(0)
    lr.fit(X, y)
    group['coefficient'] = lr.coef_[0]  # <- This is the changed line
    return group

# the new column gets added to the data
s = groups.apply(group_fitter)
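As a side note, if all that is wanted is the Series described in step 3 of the question (market as the index, coefficient as the value), the original scalar-returning group_fitter already produces it when used with apply rather than transform:

# using the original group_fitter that returns lr.coef_[0]
s = t.groupby("market").apply(group_fitter)
print(s)
# market
# A   -8.632301
# B    6.690289
# dtype: float64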
