I am running a random forest on my data frame using pandas, but I can't seem to get it right. My data frame contains two columns with non-numeric values (strings), so I think that's why it won't let me divide the data into attributes and labels. My code is below. For reference, columns 0 and 7 are the non-numeric columns.
import pandas as pd
import numpy as np
new_df.head()
X = new_df.iloc[:, 1:16].values
y = new_df.iloc[:, 16].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
When I run this, the error I get is "could not convert string to float: 'TCGA-CH-5740'".
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Since a decision tree operates by splitting a feature based on its value (e.g. "is this value greater than 5? Is it greater than 10?"), it requires that the features are numbers.
I would assume that sklearn tries to convert all data to floats first, but since some of your columns contain strings that cannot be converted to a number, e.g. 'TCGA-CH-5740', it fails.
One way to overcome this is to use OneHotEncoder to convert your strings to numbers; another is to switch to an implementation that accepts categorical values natively, such as LightGBM or CatBoost. A sketch of the first route is below.
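For illustration, a minimal sketch of the OneHotEncoder route on a made-up frame (the column names and values here are hypothetical stand-ins for new_df; the ID-like column is simply dropped, since a unique identifier carries no predictive signal):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for new_df: an ID column, a categorical string column,
# a numeric feature, and a numeric target.
df = pd.DataFrame({
    "sample_id": ["TCGA-CH-5740", "TCGA-CH-5741", "TCGA-CH-5742", "TCGA-CH-5743"],
    "feat1": [1.0, 2.0, 3.0, 4.0],
    "stage": ["I", "II", "II", "III"],
    "target": [10.0, 12.0, 15.0, 20.0],
})

X = df.drop(columns=["sample_id", "target"])  # drop the pure-ID column
y = df["target"]

# One-hot encode the string column; numeric columns pass through untouched.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["stage"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
])
model.fit(X, y)
print(model.predict(X))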
Related
I am currently using the scikit-learn module to help with a crime prediction problem. I am having an issue running the knn.predict method over my entire DataFrame in one batch.
How can I apply the knn.predict() method to the two feature columns of my DataFrame at once and store the output in another DataFrame?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
knn_df = pd.read_csv("/Users/helenapunset/Desktop/knn_dataframe.csv")
# x is the set of features
x = knn_df[['latitude', 'longitude']]
# y is the target variable
y = knn_df['Class']
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
# training the data
knn.fit(x_train,y_train)
# test score was approximately 69%
knn.score(x_test,y_test)
# this is predicted to be a safe zone
crime_prediction = knn.predict([[25.787882, -80.358427]])
print(crime_prediction)
In the last line of the code I was able to pass the two features I am using, latitude and longitude, from my DataFrame labeled knn_df. But this is a single point; I have been searching through the documentation for a way to streamline this knn prediction for the entire DataFrame and cannot seem to find one. Is it possible to use a for loop for this?
Let the new set to be predicted be 'knn_df_predict'. Assuming the same column names, try the following lines of code:
x_new = knn_df_predict[['latitude', 'longitude']]  # formatting the features
crime_prediction = knn.predict(x_new)              # predicting for the whole new set
knn_df_predict['prediction'] = crime_prediction    # adding the prediction to the dataframe
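If the goal is to score the original knn_df itself rather than a new file, the same vectorized call works on the whole frame at once (a sketch reusing the question's own names; no for loop needed):

knn_df['prediction'] = knn.predict(knn_df[['latitude', 'longitude']])  # predict every row in one call
print(knn_df.head())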
I am trying to perform a logistic regression analysis but I don't know which part of my code is mistaken. It gives an error on the line logistic_regression.fit(X_train, y_train), but that line looks okay against the different sources I checked. Can anybody help?
Here is my code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=";")
data = pd.DataFrame(df)
x = data.drop(columns=["churn"]) #features
y = data.churn # target variable
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
There are multiple problems here.
Your header row has a ';' at the end, so pandas is going to read an extra column. You need to remove that ';' after churn.
The training data you are using here, X_train, will have multiple text/categorical columns. You need to convert these into numbers. Check out OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
After you have converted your text and categorical data to numbers and removed the extra ';' separator, run your algorithm again.
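Putting that together, a hedged sketch of what the fixed pipeline could look like (it assumes the question's file path and churn column, and that the trailing ';' in the header has already been removed):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=";")
x = df.drop(columns=["churn"])  # features
y = df["churn"]                 # target

# One-hot encode every text column; numeric columns pass through untouched.
text_cols = x.select_dtypes(include="object").columns.tolist()
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), text_cols)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))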
I'm not sure how to get rid of this error. Below is my example dataset. Is there another step that I'm missing?
Code below:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)
models = RandomForestClassifier(n_estimators=100)
X = re_arrange.drop('Gender',axis=1)
y = re_arrange['Gender']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
models.fit(X_train,y_train)
models.score(X_test, y_test)
Your column "Branch" has letters whereas the RandomForestClassifier expects numbers.
I believe it is of categorical type. So you can encode the column "Branch" using some categorical encoding as shown below before you do train test split
X["Branch"] = pd.get_dummies(X["Branch"])
This maps the letters 'A', 'B', etc. to numbers. It does not change your data; it just converts it into a computation-friendly form.
RandomForestClassifier can handle only numerical values in its features. As you can see, you have text/object data in almost all your features. So first of all:
run X.info() to check the data type of each feature. If you find 'string' or 'object' dtypes, encode those features as numbers using one-hot encoding or label encoding:
One-Hot-Encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
LabelEncoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
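A small sketch of that check-then-encode step on a toy frame (the columns here are made up, since the real columns of re_arrange are not shown):

import pandas as pd

X = pd.DataFrame({
    "Branch": ["A", "B", "A", "C"],
    "Age": [21, 34, 29, 41],
})
X.info()  # 'Branch' shows up as object dtype, so it needs encoding

# pandas' one-hot shortcut: replaces 'Branch' with one 0/1 column per letter
X_encoded = pd.get_dummies(X, columns=["Branch"])
print(X_encoded.head())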
I have a comma-separated CSV file with two numerical columns, inputs and outputs. They are correlated in a more or less linear way; see below. The sample I have is very small.
Below is the Python code I wrote using sklearn to predict values. Somehow it's not giving me correct (reasonable) predictions. I am quite new to this, so please bear with me.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Data.
89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259
Actually, you are having a problem understanding your own code.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
Up to here, what you have done is load the dataframe and then separate X and y from the dataset.
labels represents the y values.
train1 represents the x values.
Since you wrote that you can't understand train1 = data.drop(['kg'], axis=1), let me explain. The dataframe contains both the 'kg' and 'cm' columns; this line removes the 'kg' column (axis=1 means column, axis=0 means row). Hence only 'cm' remains, which is your x.
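A tiny illustration of that drop on a toy frame:

import pandas as pd

df = pd.DataFrame({'kg': [89, 86], 'cm': [155, 161]})
print(df.drop(['kg'], axis=1))  # only the 'cm' column remains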
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Now you train the model on x values that represent 'cm' and y values that represent 'kg'.
When you call predict(80), you are feeding in a 'cm' value of 80. Consider the plot of 'cm' vs 'kg' for the training data.
An input of 80 cm lies far to the left of all your training points, so the model is extrapolating; and since y increases as x decreases ('kg' goes up as 'cm' goes down), the output is about 108, which is high. The code below swaps the roles so that 'kg' is the feature and 'cm' is the target:
from io import StringIO
input_data = StringIO("""89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259""")
import pandas as pd
data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1) #This is similar to selecting the kg column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.
I think your problem is the small data size. The code flow looks normal to me; I would suggest you find the p-value for the input-output relationship. It will tell you whether the correlation found by your linear regression is significant (p-value < 0.05).
You can find p-value using:
from scipy.stats import linregress
print(linregress(input, output))
To find the p-value with scikit-learn you would need to compute it from the formula yourself. Good luck.
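For completeness, a sketch of that formula route: the standard t-test on the slope of a fitted sklearn LinearRegression (degrees of freedom n - 2), using the data from the question. It should agree with scipy's linregress.

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# x = 'kg', y = 'cm', taken from the data in the question
x = np.array([89, 86, 82.5, 79.25, 76.25, 73, 70, 66.66, 63.5, 60.25, 57, 54, 51]).reshape(-1, 1)
y = np.array([155, 161, 168, 174, 182, 189, 198, 207, 218, 229, 241, 257, 259])

reg = LinearRegression().fit(x, y)
residuals = y - reg.predict(x)
n = len(y)
s_err = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # residual standard error
se_slope = s_err / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of the slope
t_stat = reg.coef_[0] / se_slope
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)          # two-sided p-value
print(t_stat, p_value)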
I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score for my evaluation metric. The strange thing is that I'm noticing my model giving me different results in a pattern at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for col in categorical_data:
    X[col] = labelenc.fit_transform(X[col])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.transform(X_val)  # transform only: reuse the scaler fitted on the training data
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but within a pattern. For example, it fluctuates within the range of 0.39 to 0.42.
On some iterations, I even get the UndefinedMetricWarning, which says "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess my two questions can be organized as:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into train and test sets, and that split is random on each run. Because of this, you train your model on different training data every time and test it on different test data, so you get a range of F-scores depending on how well the model happens to be trained.
To replicate the result on each run, use the random_state parameter. It fixes the state of the random number generator, so the random numbers are generated in the same order and you get the same split every time. It can be any number.
# train/test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
# decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
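A quick check on toy data that a fixed random_state reproduces the split exactly:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
a = train_test_split(X, y, test_size=0.2, random_state=13)
b = train_test_split(X, y, test_size=0.2, random_state=13)
print(all((p == q).all() for p, q in zip(a, b)))  # True: identical splits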