How to load features and label from dataframe in Python? - python

I'm trying to use Keras with TensorFlow to train a network. I have my own dataset of Myanmar-language digits, and I want to build a Myanmar digit recognizer with a neural network in Python. I have a train.csv file and a test.csv file, each with a header of the form label,pixel0,...,pixel783. I load them into dataframes with pandas, but I want to split each dataframe into features and labels.
import pandas as pd
dataframe = pd.read_csv("mmdigitstrain.csv")
dataframe2 = pd.read_csv("mmdigitstest.csv")
(X_train, y_train) = splitfeaturesandlabelfromdataframe
(X_test, y_test) = splitfeaturesandlabelfromdataframe2

If the last column of your dataframe is the label column, use the following:
X_train = dataframe.iloc[:,:-1]
Y_train = dataframe.iloc[:,-1:]
X_train = dataframe.loc[:, dataframe.columns != 'label']
Y_train = dataframe.loc[:, dataframe.columns == 'label']
Updated according to the comment below: the dataframe is now subset by the column name label.
Another option is to combine (merge) the two dataframes and then use train_test_split.
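The column-name approach above can be wrapped in a small helper. This is only a minimal sketch: it assumes the label column is named label, as in the question's header, and the helper name split_features_and_label and the tiny in-memory frame are made up for illustration.

```python
import pandas as pd


def split_features_and_label(df, label_col="label"):
    """Return (features, labels) with the label column removed from the features."""
    features = df.drop(columns=[label_col])
    labels = df[label_col]
    return features, labels


# Tiny stand-in for mmdigitstrain.csv: one label column plus two pixel columns.
demo = pd.DataFrame({"label": [3, 7], "pixel0": [0, 12], "pixel1": [255, 0]})
X_demo, y_demo = split_features_and_label(demo)
print(X_demo.shape, y_demo.shape)
```

Selecting by name rather than by position keeps working whether the label is the first column (as in the question's header) or the last.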

You have to put the data into NumPy arrays:
import pandas as pd
import numpy as np
df_train = pd.read_csv("mmdigitstrain.csv")
df_test = pd.read_csv("mmdigitstest.csv")
y_train = df_train['label'].to_numpy()
# Check the shape: it should be (n_items,) for the training labels
print(y_train.shape)
X_train = df_train.drop(columns=['label']).to_numpy()
# Check the shape: it should be (n_items, 784) for the training features (pixel0..pixel783)
print(X_train.shape)
y_test = df_test['label'].to_numpy()
# Check the shape: it should be (n_items,) for the test labels
print(y_test.shape)
X_test = df_test.drop(columns=['label']).to_numpy()
# Check the shape: it should be (n_items, 784) for the test features
print(X_test.shape)
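Once the data is in NumPy arrays, a network usually wants scaled pixels and, depending on the loss, one-hot labels. The sketch below shows one possible preparation step with NumPy only. It assumes the 784 pixel columns are 28 x 28 grayscale images with values 0 to 255 and that there are 10 digit classes; neither is stated in the question, and the random data is a stand-in for the real arrays.

```python
import numpy as np

# Stand-in for X_train / y_train: 5 fake images of 784 pixels, labels 0-9.
rng = np.random.default_rng(0)
X_flat = rng.integers(0, 256, size=(5, 784))
y_int = np.array([0, 3, 9, 3, 1])

# Scale pixels to [0, 1] and restore the assumed 28 x 28 layout with one channel.
X_img = (X_flat.astype("float32") / 255.0).reshape(-1, 28, 28, 1)

# One-hot encode the labels (10 classes assumed, one per digit).
y_onehot = np.eye(10, dtype="float32")[y_int]

print(X_img.shape, y_onehot.shape)
```

A model built only from dense layers can take the flat (n_items, 784) array directly, so the reshape is needed only for convolutional input.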

Related

How to handle data type & shape when splitting the data?

I'm following this for a classification problem, and I noticed that their train and test data are arrays. My data is in an xlsx file; when I inspected it, it showed up as a Series. Also, their data is 2D, but mine is 1D.
My question is: how should I handle this, and when do I need to reshape my data?
df = data[['normalized_data_arr', 'label']] #5 label
df.shape #(5091, 2)
df["normalized_data_arr"].values.shape #(5091,)
df["label"].values.shape #(5091,)
type(df["normalized_data_arr"].values) #numpy.ndarray
type(df["label"].values) #numpy.ndarray
train, others = train_test_split(df, train_size=0.7, shuffle=False)
val, test = train_test_split(others, test_size=0.2, shuffle=False)
type(train) #dataframe
type(train["label"]) # series
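One way to get from a Series to the 2D array most estimators expect is sketched below. It assumes each cell of normalized_data_arr holds a fixed-length array, which the shapes in the question suggest but do not confirm; the small frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame: each cell of normalized_data_arr holds a fixed-length array.
df = pd.DataFrame({
    "normalized_data_arr": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    "label": [0, 1],
})

# .to_numpy() gives a 1D object array of arrays; np.stack builds a real 2D array.
X = np.stack(df["normalized_data_arr"].to_numpy())
y = df["label"].to_numpy()

# For a column of plain scalars, reshape(-1, 1) turns shape (n,) into (n, 1).
single_feature = df["label"].to_numpy().reshape(-1, 1)

print(X.shape, y.shape, single_feature.shape)
```

Labels can usually stay 1D; it is the feature matrix that needs the (n_samples, n_features) layout.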

Unable to make prediction after loading sklearn model

I have created an ML model with scikit-learn and saved it. Now, when I load the model, I have trouble with transformation and prediction.
I have 4 features in a DataFrame. The first two features are textual, and the other two are numerical. The result column is 1 or 0.
To train my model, I used ColumnTransformer and CountVectorizer to transform and vectorize the textual features. I specified the names of the columns that I want to transform/vectorize (columns text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' takes care of them.
The part of the code that works:
features = df.iloc[:, :-1]
results = df.iloc[:, -1]
transformerVectoriser = ColumnTransformer(transformers=[('vector word 1', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features = 12000, stop_words = 'english'), 'text1'),
('vector phrase 3', CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features = 2500, stop_words = 'english'), 'text2')],
remainder='passthrough') # Default is to drop untransformed columns, passthrough == leave columns as they are
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
model = clf.fit(x_train, y_train)
y_pred = model.predict(x_test)
filename = 'ml_model.sav'
pickle.dump(model, open(filename, 'wb'))
filename = 'ml_transformer.sav'
pickle.dump(transformerVectoriser, open(filename, 'wb'))
But when I load the model and try to make a prediction, I get an error:
# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))
# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
When I was training my model I used a pandas DataFrame, but when I wanted to make a prediction I just put the values in a list. So data_for_prediction is a list that looks like this:
["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
I think that is the cause of the error: I used column names when setting up the ColumnTransformer, so at prediction time the vectorizer does not know which values to vectorize.
My final model and vectorizer will be used in an API that only accepts JSON, so I do not want to convert the JSON to a DataFrame before passing it to the model.
Is there a way to fix this error without using a pandas DataFrame in my final Flask app?
The training data is a dataframe with the columns (here x_train means the DataFrame as it was before fit_transform replaced it):
x_train.columns
The function vectorizer.transform() wants data in the same format, so assuming that
data_f_p = ["text that should be vectorized", "more text that should be vectorized", 4, 7]
corresponds to the same four columns as x_train, you can turn it into a dataframe with
data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)
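If the API receives a JSON object rather than a list, the same idea can be applied to the parsed dict. The sketch below is only an illustration: text1 and text2 come from the question, while the numeric column names num1 and num2 and the payload itself are hypothetical.

```python
import pandas as pd

# Hypothetical column order used at training time (only text1 and text2 come from the question).
TRAIN_COLUMNS = ["text1", "text2", "num1", "num2"]

# Example JSON body, already parsed into a dict by the web framework.
payload = {"num1": 4, "text2": "more text", "text1": "text to vectorize", "num2": 7}

# A list with one dict becomes a one-row frame; selecting by name fixes the column order.
row = pd.DataFrame([payload])[TRAIN_COLUMNS]
print(row.shape)
```

The resulting one-row frame has the named columns that a ColumnTransformer configured with string column names looks up.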
If you don't want to use pandas.DataFrame in your REST API endpoint, don't train your model on the DataFrame; convert your data to a NumPy array first:
>>> df
                    TEXT_1                TEXT_2    NUM_1  NUM_2
0  This is the first text.      The second text.  300.000   23.3
1  Here is the third text.  And the fourth text.    2.334   29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
       ['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
      dtype=object)
Then change how you define the model. I'd suggest combining the preprocessing and prediction steps into a single model using sklearn.pipeline.Pipeline, like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
model = Pipeline(steps=[
    ('transformer', ColumnTransformer(
        transformers=[
            ('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
            ('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
        ],
        remainder='passthrough',
    )),
    ('predictor', RandomForestClassifier()),
])
Note that here we use indices instead of names to reference the text columns when defining the transformers for the ColumnTransformer instance. Once the initial DataFrame has been converted to a NumPy array, the TEXT_1 feature is at index 0 and TEXT_2 at index 1 in each data row. Here is how you can use the model:
from joblib import dump, load
X = df.to_numpy()
model.fit(X, y)
dump(model, 'model.joblib')
...
model = load('model.joblib')
results = model.predict(data)
As a result, you don't have to convert your incoming data to the DataFrame in order to make a prediction.
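For reference, here is a small end-to-end sketch of the index-based pipeline predicting from a plain list. The rows and labels are invented, stop_words is left out so the tiny vocabulary is not emptied, and n_estimators and random_state are set only to keep the example quick and repeatable.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up rows: two text columns (indices 0 and 1) and two numeric columns.
X = np.array([
    ["red car fast", "engine loud", 4, 7],
    ["blue truck slow", "engine quiet", 1, 2],
    ["red truck fast", "brakes loud", 5, 8],
    ["blue car slow", "brakes quiet", 0, 1],
], dtype=object)
y = np.array([1, 0, 1, 0])

model = Pipeline(steps=[
    ("transformer", ColumnTransformer(
        transformers=[
            ("text_0", CountVectorizer(), 0),
            ("text_1", CountVectorizer(), 1),
        ],
        remainder="passthrough",
    )),
    ("predictor", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(X, y)

# A request payload arrives as a plain list; wrap it so it forms one 2D row.
incoming = np.array([["red car fast", "engine loud", 4, 7]], dtype=object)
proba = model.predict_proba(incoming)
print(proba.shape)
```

The incoming values must be in the same column order as the training array, because the transformer only knows positions, not names.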

How to output Prediction Values into an Excel File?

I'm new to scikit-learn, and I want to take the prediction values, convert them back to text, and write them to an Excel file.
The project is set up so that it takes a row of strings and predicts whether or not the column belongs to a certain category (there are approximately 5 categories).
Description | Actual Answer | Prediction
Some string that is random in length per row | Car | Truck
I want the Excel file to look something like the table above. I do not want to output the numerical prediction results; I want to output the actual text itself.
Can anyone help me with how to do this?
This is my code so far:
X = df['without_Tags']
Y = df['Tower']
tokens = Tokenizer()
VectorX = tokens.texts_to_sequences(df['without_Tags'].values)
VectorX = pad_sequences(VectorX, maxlen=200)
VectorY = pd.get_dummies(df['Tower'])
X_train, X_test, y_train, y_test = train_test_split(VectorX, VectorY, test_size=0.20, random_state=0)
# Model Creation
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
You can do this:
import pandas as pd
CSV = pd.DataFrame({
"Prediction": y_pred
})
CSV.to_csv("prediction.csv", index=False)
The file will be named "prediction.csv" and will be saved in the current working directory (by default, the directory you run the script from).
Update, for a 2D y_pred with one column per one-hot category:
import pandas as pd
csv = pd.DataFrame(y_pred, columns=["1st", "2nd", "3rd", "4th", "5th"])
csv.to_csv("pred.csv", index=False)
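To get text labels rather than one-hot columns, the predicted column index can be mapped back to the column names that pd.get_dummies produced. This is a sketch under assumptions: the category names, descriptions and predictions below are invented, and it assumes each predicted row marks exactly one category.

```python
import io

import numpy as np
import pandas as pd

# Stand-ins: one-hot targets as produced by pd.get_dummies, and fake predictions.
VectorY = pd.get_dummies(pd.Series(["Car", "Truck", "Bus", "Car"], name="Tower"))
descriptions = ["first text", "second text"]
actual = ["Car", "Truck"]
y_pred = np.array([[0, 1, 0], [0, 0, 1]])  # columns follow VectorY.columns

# argmax picks the predicted column per row; the column names are the text labels.
predicted = VectorY.columns[y_pred.argmax(axis=1)]

out = pd.DataFrame({
    "Description": descriptions,
    "Actual Answer": actual,
    "Prediction": predicted,
})

# Written to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Calling out.to_excel with a file name writes an .xlsx file instead, provided an Excel writer such as openpyxl is installed. Note that a row of all zeros would be reported as the first category by argmax, so such rows may need separate handling.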

Do predictions using training data

I have 2 CSV files: one is a training dataset and the other is a test dataset. The training dataset contains 36 columns, one of which is the outcome, with values A-F. The test dataset has 35 columns and does not include the outcome. I want to add an outcome column to the test dataset as well. I searched several tutorials but did not find the method I should follow. Can anyone tell me the process I should follow?
You haven't supplied any sample data or the technique you want to use, so the code below is only meant to help you understand how to make predictions in general:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
Assuming you have read the 2 CSV files into train and test:
X_train = train.loc[:, train.columns != 'Outcome'] # <---- Here you exclude the Outcome from train
y_train = train['Outcome'] # <---- This is your Outcome
le = LabelEncoder()
y_train = le.fit_transform(y_train) # <---- Here you convert your A-F values to numeric(0-5)
I am assuming the rest of the X variables are numeric.
rf = RandomForestClassifier() # <---- Here you call the classifier
rf.fit(X_train, y_train) # <---- Fit the classifier on train data
rf.score(X_train, y_train) # <---- Check accuracy on the training data (an optimistic estimate)
y_pred = pd.DataFrame(le.inverse_transform(rf.predict(test)), columns = ['Outcome']) # <---- Make predictions on test data and map 0-5 back to A-F
test_pred = pd.concat([test, y_pred['Outcome']], axis = 1) # <---- Here you add predictions column to your test dataset
test_pred.to_excel(r'path\Test.xlsx')
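Because the outcome was encoded with LabelEncoder, the classifier itself works with integers, and the same encoder can map them back to letters. A minimal round-trip sketch, with invented labels and predictions:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Letters -> integers; classes are sorted, so A=0, B=1, C=2, F=3 here.
encoded = le.fit_transform(["A", "C", "B", "A", "F"])

# Pretend these integers came from rf.predict(test).
numeric_predictions = [2, 0, 3]

# Integers -> letters, using the mapping learned during fit_transform.
decoded = le.inverse_transform(numeric_predictions)
print(list(encoded), list(decoded))
```

The encoder has to be the same fitted object that was used on the training labels, otherwise the integer-to-letter mapping may differ.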
That depends on how you will find or calculate the outcome you need to add.
One way would be to load the test dataset as a pandas dataframe, calculate the outcome, and collect the values in a list, which you then add to your dataframe:
import pandas as pd
data = pd.DataFrame(columns=['Names', 'Age', 'Outcome'])
names = ['John', 'Nicole', 'Evan']
age = [53, 23, 27]
data['Names'] = names
data['Age'] = age
outcome = [6545, 5252, 85665]
data['Outcome'] = outcome

"ValueError: could not convert string to float" when using RandomForestClassifier

I am attempting to use the RandomForestClassifier of the Scikit Learn library.
I have my data in a dataframe which I am preprocessing using LabelEncoder like so:
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
for column in df.columns:
    if df[column].dtype == type(object):
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
I then create my training and test sets like so:
# Labels are the values we want to predict
labels = np.array(df['hta_tota'])
# Remove the labels from the features
# axis 1 refers to the columns
df= df.drop('hta_tota', axis = 1)
# Saving feature names for later use
feature_list = list(df.columns)
# Convert to numpy array
dfNpy = np.array(df)
train_features, test_features, train_labels, test_labels = train_test_split(dfNpy, labels, test_size = 0.25, random_state = 42)
Now I am trying to use the RandomForestClassifier to fit my training set...
rf = RandomForestClassifier(n_jobs=2, random_state=0)
rf.fit(train_features, train_labels);
... but I get the following error:
ValueError: could not convert string to float: masculino
masculino is one of the string values under one of my columns in the dataframe. However I used LabelEncoder to encode this column!
What's going on? Any ideas?
Thanks in advance.
UPDATE:
Some more information regarding the dataframe, df; it is created and simplified like so:
df = pd.read_stata('health_data/Hipertension_entrega.dta')
cols_wanted = ['folio', 'desc_ent', 'desc_mun', 'sexo', 'edad', 'hta_tota']
df = df[cols_wanted]
df = df[pd.notnull(df['hta_tota'])]
df.set_index('folio')
Then once I do the preprocess via LabelEncoder (as shown above), the df still returns the following:
