Do predictions using training data - python

I have 2 csv files: one is a training dataset and the other is a test dataset. The training dataset contains 36 columns; one of them is the outcome, which has A-F as values. The test dataset has 35 columns and does not have the outcome. I want to add an outcome column to the test dataset as well. I searched several tutorials but did not find the method that I should follow. Can anyone tell me about the process that I should follow?

You haven't supplied any sample data or said which technique you want to use, but the code below should help you understand how to make predictions in general:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# Assuming you have already read the 2 csv files into DataFrames named train and test
X_train = train.loc[:, train.columns != 'Outcome'] # <---- Here you exclude the Outcome from train
y_train = train['Outcome'] # <---- This is your Outcome
le = LabelEncoder()
y_train = le.fit_transform(y_train) # <---- Here you convert your A-F values to numeric (0-5)
# Assuming the rest of the X variables are numeric
rf = RandomForestClassifier() # <---- Here you call the classifier
rf.fit(X_train, y_train) # <---- Fit the classifier on train data
rf.score(X_train, y_train) # <---- Check model accuracy on the training data
y_pred = pd.DataFrame(rf.predict(test), columns = ['Outcome']) # <---- Make predictions on test data
test_pred = pd.concat([test, y_pred['Outcome']], axis = 1) # <---- Here you add the predictions column to your test dataset
test_pred.to_excel(r'path\Test.xlsx')
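One thing to note: rf.predict(test) returns the encoded 0-5 values, not the original letters. If you want A-F back in your Outcome column, you can invert the encoding with the same le fitted above, e.g.:
pred_labels = le.inverse_transform(rf.predict(test)) # <---- Map the numeric 0-5 predictions back to A-F
y_pred = pd.DataFrame(pred_labels, columns = ['Outcome'])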

That depends on how you will find/calculate the outcome you need to add.
One way would be to load the test dataset as a Pandas dataframe, calculate the outcome, and add the values to a list which you then add to your Pandas dataframe:
import pandas as pd
data = pd.DataFrame(columns=['Names', 'Age', 'Outcome'])
names = ['John', 'Nicole', 'Evan']
age = [53, 23, 27]
data['Names'] = names
data['Age'] = age
outcome = [6545, 5252, 85665]
data['Outcome'] = outcome
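If you then want to save the result, pandas' to_csv will write it back out (the filename here is just a placeholder):
data.to_csv('test_with_outcome.csv', index=False) # write the dataframe to a new csv, without the row index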

Related

Adding a column to Pandas Dataframe, randomly fill with values with percentage splits

I want to do a test/train/valid split on a pandas dataframe, but I do not want to generate new data frames. Rather, I want to add a new column called 'Split', where each row's value is one of ['train', 'valid', 'test']. I want 'train', 'valid', and 'test' to be randomly distributed across 64%, 16%, and 20% of the rows, respectively.
I know of scikit learn's train_test_split, but again, I don't want new frames. So I could try:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
but I just want a column 'Split' with values of train, valid, and test as labels. This is for machine learning purposes so I would like to make sure the splits are completely random.
Does anyone know how this may be possible?
Here's one way, using the suggested numpy.random.choice:
import pandas as pd
import numpy as np
# Set up a little example
data = np.ones(shape=(100, 3))
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
df['split'] = pd.NA
# Split
split = ['train', 'valid', 'test']
df['split'] = df['split'].apply(lambda x: np.random.choice(split, p=[0.64, 0.16, 0.20]))
# Verify
df['split'].value_counts()
For one given run, this yielded
train 64
valid 19
test 17
Name: split, dtype: int64
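Note that sampling row by row only approximates the percentages (64/19/17 above). If you need the proportions to be exact, a small sketch (reusing the same df) is to build a label list of the right sizes and shuffle its order:
# Exact 64/16/20 proportions: build the labels, then shuffle them
n = len(df)
n_train, n_valid = int(n * 0.64), int(n * 0.16)
labels = ['train'] * n_train + ['valid'] * n_valid + ['test'] * (n - n_train - n_valid)
df['split'] = np.random.permutation(labels)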

How to load features and label from dataframe in Python?

I'm trying to use keras with tensorflow to train a network. I have my own digit dataset of the Myanmar language, and I'm trying to develop Myanmar digit recognition using a neural network in python. I have a train.csv file and a test.csv file, each with a header in the format label,pixel0,...,pixel783. I used pandas to load the dataframes, but I want to split each dataframe into features and labels.
import pandas as pd
dataframe = pd.read_csv("mmdigitstrain.csv")
dataframe2 = pd.read_csv("mmdigitstest.csv")
(X_train, y_train) = split_features_and_label_from(dataframe)   # pseudocode: this is the step I need
(X_test, y_test) = split_features_and_label_from(dataframe2)
If the last column of your dataframe is the label column, then use the following:
X_train = dataframe.iloc[:,:-1]
Y_train = dataframe.iloc[:,-1:]
Updated according to the comment below, now subsetting the dataframe by the column name label:
X_train = dataframe.loc[:, dataframe.columns != 'label']
Y_train = dataframe.loc[:, dataframe.columns == 'label']
The other way is to combine/merge the two dataframes and use train_test_split.
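For example, a minimal sketch of that approach (the test_size of 0.2 here is just an assumption):
import pandas as pd
from sklearn.model_selection import train_test_split
combined = pd.concat([dataframe, dataframe2], ignore_index=True) # merge the two dataframes
X = combined.drop(columns=['label']) # features
y = combined['label'] # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)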
You have to put the data into NumPy arrays:
import pandas as pd
import numpy as np
df_train = pd.read_csv("mmdigitstrain.csv")
df_test = pd.read_csv("mmdigitstest.csv")
y_train = df_train['label'].to_numpy()
# check the shape: should be (n_items,) for the train dataset
print(y_train.shape)
X_train = df_train.drop(columns=['label']).to_numpy()
# check the shape: should be (n_items, 784) for the train dataset
print(X_train.shape)
y_test = df_test['label'].to_numpy()
# check the shape: should be (n_items,) for the test dataset
print(y_test.shape)
X_test = df_test.drop(columns=['label']).to_numpy()
# check the shape: should be (n_items, 784) for the test dataset
print(X_test.shape)

Different results when using train_test_split vs manually splitting the data

I have a pandas dataframe that I want to make predictions on and get the root mean squared error for each feature. I'm following an online guide that splits the dataset manually, but I thought it would be more convenient to use train_test_split from sklearn.model_selection. Unfortunately, I'm getting different results when looking at the rmse values after splitting the data manually vs using train_test_split.
A (hopefully) reproducible example:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['feature_1','feature_2','feature_3','feature_4'])
df['target'] = np.random.randint(2,size=100)
df2 = df.copy()
Here is a function, knn_train_test, that splits the data manually, fits the model, makes predictions, etc:
def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    # Fit a KNN model using default k value.
    knn.fit(train_df[[train_col]], train_df[target_col])
    # Make predictions using model.
    predicted_labels = knn.predict(test_df[[train_col]])
    # Calculate and return RMSE.
    mse = mean_squared_error(test_df[target_col], predicted_labels)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = df.columns.drop('target')
# For each column (minus `target`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'target', df)
    rmse_results[col] = rmse_val
# Create a Series object from the dictionary so
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
# Output
feature_3 0.541110
feature_2 0.548452
feature_4 0.559285
feature_1 0.569912
dtype: float64
Now, here is a function, knn_train_test2, that splits the data using train_test_split:
def knn_train_test2(train_col, target_col, df2):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    X_train, X_test, y_train, y_test = train_test_split(df2[[train_col]], df2[[target_col]], test_size=0.5)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = df2.columns.drop('target')
for col in train_cols:
    rmse_val = knn_train_test2(col, 'target', df2)
    rmse_results[col] = rmse_val
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
# Output
feature_4 0.522303
feature_3 0.556417
feature_1 0.569210
feature_2 0.572713
dtype: float64
Why am I getting different results? I think I'm misunderstanding the split > train > test process in general, or maybe misunderstanding/mis-specifying train_test_split. Thank you in advance
Your custom train/test split implementation differs from scikit-learn's implementation; that's why you get different results for the same seed.
Here you can find the official implementation. The notable part is that scikit-learn shuffles the data internally (it builds a ShuffleSplit under the hood), and that shuffling consumes the random state differently from your np.random.permutation call.
Only if your approach does exactly the same operations, in the same order, as the scikit-learn approach can you expect the same result for the same seed.
This is the basic nature of machine learning. When you manually split the data, you get one version of the training and testing sets; when you use the sklearn function, you get a different version. Your model makes predictions based on the training data it receives, and thus your final results differ between the two.
If you want to reproduce a result, use train_test_split with a seed value (its random_state parameter) so the same training set is created each time. Then set a seed in your ML function too, since even ML models often start training from random weights. Run your model on these datasets with the same seeds and you will get the same results.
Splitting data manually is just slicing, but train_test_split also randomizes the sliced data. Try fixing the random number seed and see whether you get the same results each time you use train_test_split.
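For example, a quick sketch showing that a fixed random_state makes train_test_split reproducible across calls (even though its split will still differ from manual slicing):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['x', 'y'])
a_train, a_test = train_test_split(df, test_size=0.5, random_state=0)
b_train, b_test = train_test_split(df, test_size=0.5, random_state=0)
assert a_train.equals(b_train) and a_test.equals(b_test) # identical splits on every call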

"ValueError: could not convert string to float" when using RandomForestClassifier

I am attempting to use the RandomForestClassifier of the Scikit Learn library.
I have my data in a dataframe which I am preprocessing using LabelEncoder like so:
from sklearn.preprocessing import LabelEncoder
for column in df.columns:
    if df[column].dtype == type(object):
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
I then create my training and test sets like so:
# Labels are the values we want to predict
labels = np.array(df['hta_tota'])
# Remove the labels from the features
# axis 1 refers to the columns
df= df.drop('hta_tota', axis = 1)
# Saving feature names for later use
feature_list = list(df.columns)
# Convert to numpy array
dfNpy = np.array(df)
train_features, test_features, train_labels, test_labels = train_test_split(dfNpy, labels, test_size = 0.25, random_state = 42)
Now I am trying to use the RandomForestClassifier to fit my training set...
rf = RandomForestClassifier(n_jobs=2, random_state=0)
rf.fit(train_features, train_labels);
... but I get the following error:
ValueError: could not convert string to float: masculino
masculino is one of the string values in one of my columns in the dataframe. However, I used LabelEncoder to encode this column!
What's going on? Any ideas?
Thanks in advance.
UPDATE:
Some more information regarding the dataframe df; it is created and simplified like so:
df = pd.read_stata('health_data/Hipertension_entrega.dta')
cols_wanted = ['folio', 'desc_ent', 'desc_mun', 'sexo', 'edad', 'hta_tota']
df = df[cols_wanted]
df = df[pd.notnull(df['hta_tota'])]
df.set_index('folio')
Then once I do the preprocessing via LabelEncoder (as shown above), the df still contains the original string values.

Multiclass Classification and probability prediction

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
fi = "df.csv"
# Read in the data
data = pd.read_csv(fi, sep=",")
# split the data into training and test data
train, test = train_test_split(data, test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
train_features = train.iloc[:, 0:127]
train_label = train.iloc[:, 127]
test_features = test.iloc[:, 0:127]
test_label = test.iloc[:, 127]
naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)
print("test_data\n", test_data["p_malw"])
print("Accuracy:", naive_b.score(test_features, test_label))
I have written this code to accept input from a csv file with 128 columns, where 127 columns are features and the 128th column is the class label.
I want to predict the probability that each sample belongs to each class (there are 5 classes, 1-5), print it in the form of a matrix, and determine the class of each sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.
GaussianNB.predict_proba returns the probabilities of the samples for each class in the model. In your case, it should return a result with five columns and the same number of rows as your test data. You can verify which column corresponds to which class using naive_b.classes_. So it is not clear why you are saying that this is not the desired output. Perhaps your problem comes from the fact that you are assigning the output of predict_proba to a single data frame column. Try:
pred_prob = naive_b.predict_proba(test_features)
instead of
test_data["p_malw"] = naive_b.predict_proba(test_features)
and verify its shape using pred_prob.shape. The second dimension should be 5.
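As a small follow-up sketch (assuming naive_b has been fitted as above), you can line the probabilities up with their class labels in a dataframe:
pred_prob = naive_b.predict_proba(test_features)
# one column per class, in the order given by naive_b.classes_
prob_df = pd.DataFrame(pred_prob, columns=naive_b.classes_, index=test_features.index)
print(prob_df.head())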
If you want the predicted label for each sample, you can use the predict method, followed by a confusion matrix to see how many labels have been predicted correctly.
from sklearn.metrics import confusion_matrix
naive_b.fit(train_features, train_label)
pred_label = naive_b.predict(test_features)
confusion_m = confusion_matrix(test_label, pred_label)
confusion_m
Here is some useful reading.
sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba
sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
