My code is:
import pandas as pd
import numpy as np
from sklearn import svm
name = '../CLIWOC/CLIWOC15.csv'
data = pd.read_csv(name)
# Get info into dataframe and drop NaNs
data = pd.concat([data.UTC, data.Lon3, data.Lat3, data.Rain]).dropna(how='any')
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']]
y = data['Rain']
# Partition a test set
Xtest = X[-1]
ytest = y[-1]
X = X[1:-2]
y = y[1:-2]
# Train classifier
classifier = svm.svc(gamma=0.01, C=100.)
classifier.fit(X, y)
classifier.predict(Xtest)
y
Arriving at the 'set target' section, the compiler returns the error 'Too Many Indexers'. I lifted this syntax directly from the documentation, so I'm unsure what could be wrong.
The csv is organized with these headers for columns of data.
Without your data, it is hard to verify. My immediate suspicion, however, is that you need to pass a numpy array instead of a DataFrame.
Try this to extract them:
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']].values
y = data['Rain'].values
Use data.loc[['UTC', 'Lon3', 'Lat3']]. This will also work in iloc method as well.
Do not use like data.loc[:, 0] etc...
Related
I am using sklearn for KNN regressor:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv")#pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
a = theta.get("YearBuilt")
b = theta.get("YrSold")
A = a.to_frame()
B = b.to_frame()
glasses = [A,B]
x = pd.concat(glasses)
#getting target data
y = theta.get("SalePrice")
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y)
I get this error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could someone please explain this? My data is in the hundred thousands for target and the thousands for input. And there is no blanks in the data.
Before answering the question, Let me refactor the code. You are using a dataframe so you can index single or muliple fields of the dataframe without going through the extra steps you've used:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv") # pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
x = theta[["YearBuilt", "YrSold"]] # index multiple fields
#getting target data
y = theta["SalePrice"] # index single field
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y) # fit KNN
Regarding your error, it indicates that you have some NaN, Inf, large values in your data. You can ensure these doesnt occur by filtering out the NaN and inf values using this:
theta = theta.replace([np.inf, -np.inf], np.nan)
theta.dropna(inplace=True)
I don't know that my code is correct or not. but I got the error:
bad input shape (1, 301)
from sklearn import svm
import pandas as pd
clf = svm.SVC(gamma='scale')
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
x = [df.age[1:302], df.sex[1:302], df.cp[1:302], df.trestbps[1:302], df.chol[1:302], df.fbs[1:302], df.restecg[1:302], df.thalach[1:302], df.exang[1:302], df.oldpeak[1:302], df.slope[1:302], df.ca[1:302], df.thal[1:302]]
y = [df.target[1:302]]
clf.fit(x, y)
This is a very simple fix.
You need all the columns from df in x except the target column, for that, just do:
x = df.drop('target', axis=1)
And your target column will be:
y = df['target']
And now do your fit:
clf.fit(x, y)
It will work.
PS: What you were trying to do is passing list of Series having the features value. But what you just need to do is, pass the actual values of your feature set and targets from the dataframe directly.
Some more references for you to get started and keep going:
Read more about what to pass to the fit method here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit
Here is a super basic tutorial from the folks of scikit themselves: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:\\path_here\\train.csv')
test=pd.read_csv('C:\\path_here\\test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns)))
other_col=['Type'] #Test and Train Data set identifier
fullData.isnull().any()#Will return the feature with True or False,True means have missing value else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
if fullData[var].isnull().any()==True:
fullData[var+'_NA']=fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(),inplace=True)
#Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
#create label encoders for categorical features
for var in cat_cols:
number = LabelEncoder()
fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run, endlessly, in this line.
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. how can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.
One way to check where to code is getting to is to add print statements. For example you can add (right before the label encoder):
print("Code got before label encoder")
And then after that code block add another print statement. You can see in your console exactly where the code is getting stuck and debug that specific line.
I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print model_1.coef_
print model_2.coef_
print model_3.coef_
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
#Robin posted a great answer, but for me I had to make one tweak on it to work the way I wanted, and it was to refer to the dimension of the 'coef_' np.array that I wanted, namely modifying to this: model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great but what personally worked for me was this, as the feature names I needed were the columns of my train_date dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except last one is used in training
columns = data.iloc[:,-1].columns
coef_dict = {}
for i in range(0,len(columns)):
coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coeficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64
I am trying to fit OneVsAll Classification output in training data , rows of output adds upto 1 .
One possible way is to read all the rows and find which column has highest value and prepare data for training .
Eg : y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c] , considering a,b,c as corresponding class of the columns 0,1,2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np
import string
y = np.array([[0.2,0.8,0],[0,1,0],[0,0.3,0.7]])
def transform(y,labels):
f = np.vectorize(lambda i : string.letters[i])
y = f(y.argmax(axis=1))
return y
y = transform(y,'abc')
EDIT: Using the comment by alko, I made it more general be letting the user supply the labels to the transform function.