How to use LinearRegression with categorical variables in sklearn - Python

I am trying to perform a speed comparison test between Python and R and am struggling with an issue: LinearRegression under sklearn with categorical variables.
Code R:
# Start the clock!
ptm <- proc.time()
ptm
test_data = read.csv("clean_hold.out.csv")
# Regression Model
model_liner = lm(test_data$HH_F ~ ., data = test_data)
# Stop the clock
new_ptm <- proc.time() - ptm
Code Python:
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
start = time.time()
test_data = pd.read_csv("./clean_hold.out.csv")
x_train = [col for col in test_data.columns[1:] if col != 'HH_F']
y_train = ['HH_F']
model_linear = LinearRegression(normalize=False)
model_linear.fit(test_data[x_train], test_data[y_train])
But it does not work for me:
return X.astype(np.float32 if X.dtype == np.int32 else np.float64)
ValueError: could not convert string to float: Bee True
I tried another approach:
test_data = pd.read_csv("./clean_hold.out.csv").to_dict()
v = DictVectorizer(sparse=False)
X = v.fit_transform(test_data)
However, I caught another error:
File "C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 258, in transform
    Xa[i, vocab[f]] = dtype(v)
TypeError: float() argument must be a string or a number
I don't understand how I should resolve these issues in Python ...
Example of data:
http://screencast.com/t/hYyyu7nU9hQm

You have to do some encoding before using fit.
There are several classes that can be used :
LabelEncoder: turns your strings into incremental integer values
OneHotEncoder: uses the one-of-K scheme to transform your strings into binary indicator columns
I wanted a more scalable solution but didn't get any answer, so I selected OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many different string values the matrix will grow very quickly and a lot of memory will be required.
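For illustration, here is a minimal sketch of that idea (the toy column names are made up, not taken from clean_hold.out.csv, and it assumes a scikit-learn recent enough to have ColumnTransformer and string support in OneHotEncoder):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for clean_hold.out.csv: HH_F is the numeric target,
# cat_feat is a string-valued column that LinearRegression cannot digest as-is.
test_data = pd.DataFrame({
    "HH_F":     [1.0, 2.0, 3.0, 4.0],
    "num_feat": [10, 20, 30, 40],
    "cat_feat": ["Bee True", "Bee False", "Bee True", "Bee False"],
})

X = test_data.drop(columns=["HH_F"])
y = test_data["HH_F"]

# One-hot encode only the string column, pass the numeric column through untouched.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["cat_feat"])],
    remainder="passthrough",
)
model = Pipeline([("encode", preprocess), ("linreg", LinearRegression())])
model.fit(X, y)
print(model.named_steps["linreg"].coef_)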

Related

Iris decision tree in python

I am trying to learn about decision trees and ended up finding an article about them. The goal of the article is to decide whether a flower is an iris flower or not, but I seem to run into some errors that I hope somebody has the answer to. I get two errors like the following:
iris: Bunch iris: inner_f Instance of 'tuple' has no 'target' member
and
iris: Bunch iris: inner_f Instance of 'tuple' has no 'data' member
I get these errors at the x = iris.data line and at the y = iris.target line.
Here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Does anybody know why I get these errors?
Your error message indicates that the problematic value iris is a tuple, which doesn't have the attributes you're referencing. Check the documentation for the tools you're using; they should explain how to unpack datasets.load_iris() into the objects you need.
I would not filter warnings in most cases, as you get useful information from the warnings.
So, the sklearn datasets format is a Bunch, which is a specialized container object that works like a dictionary. You can access it with dot notation, e.g. iris.data or dictionary notation, e.g. iris['data']. Here, it is unclear what the error is on your machine, as I (like other commenters) had no problem accessing iris.data or iris['data'] in python 3.8.5.
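For reference, a small sketch of the two access styles, plus the return_X_y variant, which really does hand back a plain tuple and could explain a "tuple has no 'data' member" complaint if it was used somewhere (or picked up by a linter):
from sklearn import datasets

iris = datasets.load_iris()
print(type(iris))           # sklearn.utils.Bunch, a dict-like container
print(iris.data.shape)      # attribute-style access
print(iris['data'].shape)   # dictionary-style access to the same array

# load_iris(return_X_y=True) instead returns a plain (data, target) tuple,
# which has neither a .data nor a .target attribute.
X, y = datasets.load_iris(return_X_y=True)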
I wanted to let you know a couple of places to improve your approach:
(1) It is unclear why you need to construct a dataframe as you can get the samples you need directly from calling train_test_split on the concatenated numpy arrays or you can get a random sample of indices from the numpy arrays directly.
(2) Your method for constructing the dataframe is more complex than it needs to be.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load iris data
iris = datasets.load_iris()
# train test split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
# random shuffle of data/target indices
rng = np.random.default_rng()
rng_size = iris.data.shape[0]
idx_sample = rng.choice(np.arange(rng_size), size=rng_size, replace=False)
# simpler way to create dataframe
# concatenate along the columns (axis 1)
# then set the column names in one place
df = pd.concat([pd.DataFrame(iris.data), pd.DataFrame(iris.target)], axis=1)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
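For completeness, a short sketch of how the shuffled indices above could then be used (reusing idx_sample, iris and df as defined in the snippet):
# Apply the shuffled indices directly to the arrays or the dataframe rows,
# which replaces the df.sample(frac=1.0) shuffle from the question.
X_shuffled = iris.data[idx_sample]
y_shuffled = iris.target[idx_sample]
df_shuffled = df.iloc[idx_sample].reset_index(drop=True)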

Feature selection and categorical variables

I work on a dataset which contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get an error: KeyError: could not convert string to float:
Should I use LabelEncoder and then do the feature selection? Any ideas on how to deal with this?
Here is my code
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()
It is problematic to use one-hot encoding here because each category will be coded as a separate binary column, and feeding that into the lasso doesn't allow selection of the categorical variable as a whole, which is what you are after, I guess. You can also check out this post.
You can use the group lasso implementation in Python. Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse
data = pd.DataFrame({'cat1': np.random.choice(['A','B','C'], 100),
                     'cat2': np.random.choice(['D','E','F'], 100),
                     'bin1': np.random.choice([0,1], 100),
                     'bin2': np.random.choice([0,1], 100)})
data['y'] = 1.5*data['bin1'] - 3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0, 1, 100)
Define the categorical and numeric (binary) columns. You don't need the min max scaler since your values are binary. Next we onehot encode the categorical columns and extract the groups out:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put the numeric and one-hot columns together. You can also convert them to dense, but that can be problematic if the data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
Run the group lasso:
grpLasso = GroupLasso(groups=groups, supress_warning=True, n_iter=1000)
grpLasso.fit(X, y)
grpLasso.sparsity_mask_
array([ True, True, True, False, False, False, True, True])
grpLasso.chosen_groups_
{0, 3, 4}
Check out also the help page for using it in a pipeline.
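Such a pipeline might look roughly like the sketch below, reusing X, y and groups from above. This is only a sketch, under the assumption that GroupLasso exposes the usual scikit-learn fit/transform interface for feature selection, as its documentation describes:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from group_lasso import GroupLasso

# GroupLasso acts as the variable-selection step (its transform keeps the
# chosen groups), followed by an ordinary estimator fit on the reduced set.
pipe = Pipeline([
    ("variable_selection", GroupLasso(groups=groups, supress_warning=True, n_iter=1000)),
    ("regressor", Ridge()),
])
pipe.fit(X, y)
print(r2_score(y, pipe.predict(X)))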

Bad Input Shape

I don't know whether my code is correct or not, but I got the error:
bad input shape (1, 301)
from sklearn import svm
import pandas as pd
clf = svm.SVC(gamma='scale')
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
x = [df.age[1:302], df.sex[1:302], df.cp[1:302], df.trestbps[1:302], df.chol[1:302], df.fbs[1:302], df.restecg[1:302], df.thalach[1:302], df.exang[1:302], df.oldpeak[1:302], df.slope[1:302], df.ca[1:302], df.thal[1:302]]
y = [df.target[1:302]]
clf.fit(x, y)
This is a very simple fix.
You need all the columns from df in x except the target column, for that, just do:
x = df.drop('target', axis=1)
And your target column will be:
y = df['target']
And now do your fit:
clf.fit(x, y)
It will work.
PS: What you were trying to do was pass a list of Series holding the feature values. What you actually need to do is pass the feature set and target from the dataframe directly.
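Putting the pieces together, the corrected version would look roughly like this (same file path and target column as in the question):
from sklearn import svm
import pandas as pd

df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')

# Features: every column except the target; target: the 'target' column itself.
x = df.drop('target', axis=1)
y = df['target']

clf = svm.SVC(gamma='scale')
clf.fit(x, y)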
Some more references for you to get started and keep going:
Read more about what to pass to the fit method here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit
Here is a super basic tutorial from the folks of scikit themselves: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Python running extremely slow on one line of code

I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:\\path_here\\train.csv')
test=pd.read_csv('C:\\path_here\\test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns)))
other_col=['Type'] #Test and Train Data set identifier
fullData.isnull().any() # Returns True/False per column; True means the column has missing values
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
    if fullData[var].isnull().any() == True:
        fullData[var+'_NA'] = fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(),inplace=True)
#Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run endlessly on this line:
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. How can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.
One way to check where the code is getting to is to add print statements. For example, you can add (right before the label encoder):
print("Code got before label encoder")
And then after that code block add another print statement. You can see in your console exactly where the code is getting stuck and debug that specific line.
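A rough sketch of that idea, reusing the names from the question's code and adding timestamps so you can also see how long each block takes:
import time

t0 = time.time()
print("before categorical fillna")
fullData[cat_cols] = fullData[cat_cols].fillna(value=0)
print("after categorical fillna: %.1f seconds" % (time.time() - t0))

t0 = time.time()
print("before label encoding")
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
print("after label encoding: %.1f seconds" % (time.time() - t0))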

Why does this return 'Too Many Indexers'?

My code is:
import pandas as pd
import numpy as np
from sklearn import svm
name = '../CLIWOC/CLIWOC15.csv'
data = pd.read_csv(name)
# Get info into dataframe and drop NaNs
data = pd.concat([data.UTC, data.Lon3, data.Lat3, data.Rain]).dropna(how='any')
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']]
y = data['Rain']
# Partition a test set
Xtest = X[-1]
ytest = y[-1]
X = X[1:-2]
y = y[1:-2]
# Train classifier
classifier = svm.svc(gamma=0.01, C=100.)
classifier.fit(X, y)
classifier.predict(Xtest)
y
Arriving at the 'set target' section, the compiler returns the error 'Too Many Indexers'. I lifted this syntax directly from the documentation, so I'm unsure what could be wrong.
The csv is organized with these headers for columns of data.
Without your data, it is hard to verify. My immediate suspicion, however, is that you need to pass a numpy array instead of a DataFrame.
Try this to extract them:
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']].values
y = data['Rain'].values
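One likely reason .loc complains in the first place: pd.concat with the default axis=0 stacks the four Series into one long Series, and .loc[:, [...]] on a Series is exactly what raises "Too many indexers". A sketch of the combined fix, keeping the columns side by side with axis=1 and then extracting the numpy arrays as suggested above (same file path and column names as in the question):
import pandas as pd

data = pd.read_csv('../CLIWOC/CLIWOC15.csv')

# axis=1 keeps UTC, Lon3, Lat3 and Rain as separate columns of one DataFrame,
# so the .loc call below stays valid.
data = pd.concat([data.UTC, data.Lon3, data.Lat3, data.Rain], axis=1).dropna(how='any')

X = data.loc[:, ['UTC', 'Lon3', 'Lat3']].values
y = data['Rain'].values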
Use data.loc[['UTC', 'Lon3', 'Lat3']]. This also works with the iloc method. Do not use something like data.loc[:, 0], etc.
