Performing logistic regression analysis in Python using sklearn

I am trying to perform a logistic regression analysis, but I can't tell which part of my code is wrong. It raises an error on the line logistic_regression.fit(X_train, y_train), even though the code looks fine when I compare it with other sources. Can anybody help?
Here is my code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=";")
data = pd.DataFrame(df)
x = data.drop(columns=["churn"]) #features
y = data.churn # target variable
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)

There are multiple problems here.
Your first row of headers has a ';' at the end, so pandas is going to read an extra column. You need to remove that ';' after churn.
The training data you are trying to use here, X_train, is going to have multiple text/categorical columns. You need to convert these into numbers. Check out OneHotEncoder here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html and LabelEncoder here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html (note that the scikit-learn documentation describes LabelEncoder as intended for the target y rather than for feature columns).
After you have converted your text and categorical data to numbers and removed the extra ';' separator, run your algorithm again.
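To make the two fixes concrete, here is a minimal sketch. The data, the column names age and plan, and the use of pd.get_dummies in place of OneHotEncoder are all assumptions for illustration, since the real data.csv is not shown. It drops the empty column created by the trailing ';' and one-hot encodes the text column before fitting:

```python
from io import StringIO

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data.csv; note the stray ';' ending the header row.
raw = StringIO(
    "age;plan;churn;\n"
    "25;basic;0\n"
    "47;premium;1\n"
    "33;basic;0\n"
    "52;premium;1\n"
    "29;basic;0\n"
    "61;premium;1\n"
    "38;basic;1\n"
    "44;premium;0\n"
)
df = pd.read_csv(raw, sep=";")

# The trailing ';' yields an extra, completely empty column; drop it.
df = df.dropna(axis=1, how="all")

# One-hot encode the text column so every feature is numeric.
x = pd.get_dummies(df.drop(columns=["churn"]), columns=["plan"], dtype=int)
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train, y_train)
predictions = logistic_regression.predict(X_test)
```

Editing the header in the file itself, as suggested above, works just as well; dropping the all-empty column is only a way to do it from code.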

Related

Error doing Random Forest from data frame using pandas

I am fitting a random forest on my data frame using pandas, but I can't seem to get it right. My data frame contains two columns with non-numeric values (letters), so I think that's why it's not letting me split the data into values and attributes. My code is below. For reference, columns 0 and 7 are the non-numeric columns.
import pandas as pd
import numpy as np
new_df.head()
X = new_df.iloc[:, 1:16].values
y = new_df.iloc[:, 16].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
When I run this, the error I get is "could not convert string to float: 'TCGA-CH-5740'"
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20,
random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
The columns extend further to the right but are not captured in the picture.
Since a decision tree works by splitting on a feature's value (e.g. "is this value greater than 5? Is it greater than 10?"), it requires the features to be numbers.
I would assume that sklearn converts all data to floats first, and since some of your columns contain strings that cannot be converted to a number, e.g. TCGA-CH-5740, it fails.
One way to overcome this is to use one-hot encoding to convert your strings to numbers; another is to use an implementation that accepts categorical values, such as LightGBM or CatBoost.
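As a rough sketch of the one-hot route, the snippet below encodes a text column inside a pipeline so the forest only ever sees numbers. The frame and its column names (stage, age, target) are made up for illustration and are not the asker's data. Note also that a column of unique identifiers such as 'TCGA-CH-5740' usually carries no predictive signal, so dropping it may be more sensible than encoding it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for new_df: one text column, one numeric column, a numeric target.
new_df = pd.DataFrame({
    "stage": ["I", "II", "III", "I", "II", "III", "I", "II", "III", "I"],
    "age": [54, 61, 47, 70, 58, 66, 49, 73, 52, 64],
    "target": [1.2, 2.4, 3.1, 1.0, 2.2, 3.5, 1.1, 2.6, 3.0, 1.3],
})
X = new_df[["stage", "age"]]
y = new_df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One-hot encode the text column; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["stage"])],
    remainder="passthrough",
)
regressor = make_pipeline(
    preprocess, RandomForestRegressor(n_estimators=20, random_state=0)
)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
```

Scaling with StandardScaler is left out here because tree-based models do not depend on feature scale.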

Split data into testing and training and convert to csv or excel files

I have a large dataset (around 200k rows) and I want to split it randomly into two parts: 70% as training data and 30% as testing data. Is there a way to do this in Python? Note that I also want to save these datasets as Excel or CSV files on my computer. Thanks!
from sklearn.model_selection import train_test_split
#split the data into train and test set
train,test = train_test_split(data, test_size=0.30, random_state=0)
#save the data
train.to_csv('train.csv',index=False)
test.to_csv('test.csv',index=False)
Start by importing the following:
from sklearn.model_selection import train_test_split
import pandas as pd
In order to split you can use the train_test_split function from sklearn package:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
where X and y are taken from your original dataframe.
Later, you can export each of them as CSV using the pandas package (the file names here are only examples):
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
The same goes for the y data as well.
EDIT: since you clarified that you need both the X and y columns in the same file, you can do the following:
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)
and then export them to CSV as I mentioned above.
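Putting the pieces together, a self-contained sketch might look like this. The toy frame and the temporary output directory are assumptions so that the snippet runs anywhere; with real data you would load your own frame and choose your own paths.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the 200k-row dataset.
data = pd.DataFrame({
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
})

# Random 70/30 split; random_state makes the split reproducible.
train, test = train_test_split(data, test_size=0.3, random_state=42)

# Write both parts to CSV files (a temporary directory is used here).
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")
train.to_csv(train_path, index=False)
test.to_csv(test_path, index=False)

# For Excel output, train.to_excel("train.xlsx", index=False) works the same
# way, but it needs an Excel writer such as openpyxl to be installed.
```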

How to fix this error: ValueError: could not convert string to float: 'A'

I'm not sure how to get rid of this error. Below is my example dataset. Is there another step that I'm missing?
Code below:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
models = RandomForestClassifier(n_estimators=100)
np.random.seed(42)
X = re_arrange.drop('Gender',axis=1)
y = re_arrange['Gender']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
models.fit(X_train,y_train)
models.score(X_test, y_test)
Your column "Branch" has letters, whereas RandomForestClassifier expects numbers.
I believe it is of categorical type, so you can encode the column "Branch" with a categorical encoding, as shown below, before you do the train/test split:
X = pd.get_dummies(X, columns=["Branch"])
This replaces "Branch" with one indicator column per letter ('A', 'B', etc.). It does not change the information in your data; it just puts it in a form the model can compute with.
RandomForestClassifier can handle only numerical values in its features. As you can see, you have text/object data in almost all of your features. So, first of all:
run X.info() to see the data type of each feature. If you find 'string' or 'object' columns, encode all of those features as numbers using one-hot encoding or label encoding.
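A minimal sketch of that workflow follows, under the assumption of a made-up frame with columns Branch, Total and Gender (the asker's real data is not shown). It finds the non-numeric feature columns, one-hot encodes them with pd.get_dummies, and only then fits the classifier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for re_arrange.
re_arrange = pd.DataFrame({
    "Branch": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
    "Total": [10.5, 20.0, 15.2, 30.1, 12.3, 18.8, 25.0, 11.1, 22.2, 16.4],
    "Gender": ["Male", "Female"] * 5,
})
X = re_arrange.drop("Gender", axis=1)
y = re_arrange["Gender"]

# Collect every feature column that is not numeric.
text_columns = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]

# Replace each of them with 0/1 indicator columns.
X_encoded = pd.get_dummies(X, columns=text_columns, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)
models = RandomForestClassifier(n_estimators=100, random_state=42)
models.fit(X_train, y_train)
accuracy = models.score(X_test, y_test)
```

The target y can stay as text: scikit-learn classifiers accept string class labels, so only the features need encoding.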

How does one code data from a different (test) file vs all the data in one file?

All examples I've ever come across always conveniently have data in one file to show how train_test_split works (or any model really). But quite often the training data and testing data are two separate files.
So, I made an ultra-basic logistic regression train file and test file, each consisting of two columns, 'age' and 'insurance', and named the data frames df_train and df_test.
I realize df_test hasn't been trained, hence the error but...isn't that the point?
I know model.predict(X_test) doesn't throw an error, but that is based on the training data not the test data.
Word of warning, this is what happens when you're old and trying to learn new things. Don't get old.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['age']],df.insurance,test_size=0.1)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.predict(df_test)
Thanks,
Old fart
As you stated :
train file and test file consisting of two columns, 'age',
'insurance'.
So if the test file contains both the age and insurance columns and is used as it is, the predict function will not work, because the input at prediction time does not match the input used for training.
Also, model.predict expects only the independent variable (in your case, age), in the format below:
predict(self, X)
Predict class labels for samples in X.
Parameters:
X : array_like or sparse matrix, shape (n_samples, n_features)
Samples.
Now coming to the modification (pass only the age column, as a 2-D structure):
model.predict(df_test[["age"]])
Edit: Try this:
from sklearn.model_selection import train_test_split
X = df[["age"]].values  # 2-D array of shape (n_samples, 1)
y = df["insurance"].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.predict(df_test[["age"]].values)  # one row per test sample
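When the training and test data really do live in two separate files, train_test_split is not needed at all. The sketch below assumes two small made-up frames in place of the files: fit on the whole training frame, then predict on (and score against) the test frame, using the same feature column in both.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical contents of the train and test files.
df_train = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 28, 60, 30, 62],
    "insurance": [0, 0, 1, 1, 1, 1, 0, 1, 0, 1],
})
df_test = pd.DataFrame({
    "age": [21, 58, 35],
    "insurance": [0, 1, 0],
})

model = LogisticRegression()
# Fit on the training file: features only in X, target only in y.
model.fit(df_train[["age"]], df_train["insurance"])

# Predict on the test file using the same feature column(s).
predictions = model.predict(df_test[["age"]])

# The test file's own 'insurance' column is what the predictions are checked against.
accuracy = model.score(df_test[["age"]], df_test["insurance"])
```

The test data is never "trained"; it is only passed to predict and score, which is exactly what makes it a test set.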

LinearRegression in Python giving incorrect results?

I have a comma-separated CSV file with two numerical columns, inputs and outputs. They are correlated by a more or less linear function; see below. The sample I have is very small.
Below, is the Python code I wrote using sklearn in order to predict values. Somehow it's not giving me the correct values (reasonable predictions). I am quite new to this, so please bear with me.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Data.
89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259
The problem is actually a misunderstanding of what your own code does.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
Up to this point, you have loaded the dataframe and then separated X and y from the dataset.
labels represents the y values.
train1 represents the x values.
Since you wrote that you don't understand this line: train1 = data.drop(['kg'], axis=1)
Let me explain it. The dataframe consists of both columns, 'kg' and 'cm'. This line removes the 'kg' column (axis=1 means column, axis=0 means row), so only 'cm' remains, and that is your x.
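A tiny illustration of the axis argument, using the first three rows of the data from the question:

```python
import pandas as pd

data = pd.DataFrame({"kg": [89, 86, 82.5], "cm": [155, 161, 168]})

# axis=1 drops by column label: only 'cm' is left.
features = data.drop(["kg"], axis=1)

# axis=0 drops by row label: the row with index 0 is removed.
without_first_row = data.drop([0], axis=0)

# drop returns a new frame; the original still has both columns.
print(list(features.columns), len(without_first_row), list(data.columns))
```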
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Now you train the model with the 'cm' column as x and the 'kg' column as y.
When you call predict(80), you are passing 80 as the 'cm' value. Let me just plot 'cm' vs 'kg' for the training data.
A height of 80 lies far to the left of everything in the plot. As you can see, y increases as x decreases, i.e. 'kg' goes up as 'cm' goes down, so the output of about 110 is higher than any weight in your data.
If what you want is to predict 'cm' from a 'kg' value of 80, make 'cm' the label and 'kg' the feature:
from io import StringIO
input_data = StringIO("""89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259""")
import pandas as pd
data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1) #This is similar to selecting the kg column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.
I think your problem is the small data size. The code flow looks normal to me. I would suggest computing the p-value for the input-output relationship; it will tell you whether the correlation found by your linear regression is significant (p-value < 0.05) or not.
You can find the p-value using:
from scipy.stats import linregress
print(linregress(input, output))
scikit-learn's LinearRegression does not report a p-value, so to get one from it you would probably need to compute it from the formula yourself. Good luck.
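For reference, here is a sketch of that check on the thirteen data points from the question, with 'kg' as the input and 'cm' as the output (the variable names are my own):

```python
from scipy.stats import linregress

# The data from the question: first column 'kg', second column 'cm'.
kg = [89, 86, 82.5, 79.25, 76.25, 73, 70, 66.66, 63.5, 60.25, 57, 54, 51]
cm = [155, 161, 168, 174, 182, 189, 198, 207, 218, 229, 241, 257, 259]

result = linregress(kg, cm)

# slope and intercept define the fitted line; rvalue is the correlation
# coefficient and pvalue tests the null hypothesis that the slope is zero.
print(result.slope, result.intercept, result.rvalue, result.pvalue)
```

A p-value below 0.05 here would indicate that the linear relationship is statistically significant even with this few points.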
