I'm new to Python.
First of all, I would like to split my data into training data and testing data, where
Data Training = 6 and Data Testing = 2.
I'm confused about how to use a CSV file to pick the training and testing data randomly.
I've been trying to split the data, but the rows keep the same order as in the CSV file.
Here is how I build my training and testing data:
import pandas as pd

def ambilData():
    df = pd.read_csv("datalatihnodummy.csv", sep=';')
    dropdata = df.drop(['data', 'Klasifikasi'], axis=1)
    datalatih = dropdata.iloc[:6]   # first 6 rows, in file order
    datauji = dropdata.iloc[6:]     # remaining rows, in file order
    return datalatih, datauji
and here is the output for training:
and here is the output for testing:
I would like to test on Hepatitis B only or Hepatitis A only.
Does anyone know how to randomize my dataset? Thank you.
Here is my data: https://drive.google.com/open?id=1tD3h0aS-AB4qrMg2vw0fHcMx6F3jzJCx
Try this:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
It will randomly split your data into training and testing sets.
x_data holds the independent variables, so here you want to drop the 'Klasifikasi' column.
y_data is the dependent variable; in your df it is the 'Klasifikasi' column.
Hope it helps.
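For the 6/2 split in the question, here is a minimal sketch. It assumes a small made-up frame standing in for datalatihnodummy.csv: the feature columns f1 and f2 and all values are invented, and only the 'data' and 'Klasifikasi' column names come from the question. Passing an integer test_size asks for exactly that many test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV in the question: 8 rows, 2 made-up features.
df = pd.DataFrame({
    "data": range(1, 9),
    "f1": [0.1, 0.4, 0.3, 0.9, 0.8, 0.7, 0.2, 0.6],
    "f2": [1, 0, 1, 0, 1, 0, 1, 0],
    "Klasifikasi": ["Hepatitis A", "Hepatitis B"] * 4,
})

x_data = df.drop(["data", "Klasifikasi"], axis=1)  # independent variables
y_data = df["Klasifikasi"]                         # dependent variable

# test_size=2 requests exactly 2 test rows; rows are shuffled before splitting.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=2, random_state=1
)

print(len(x_train), len(x_test))
```

This gives 6 training rows and 2 test rows. Changing random_state changes which rows land in each set, and keeping it fixed makes the split repeatable.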
Related
I am currently using scikit-learn to help with a crime prediction problem. I am having trouble running the knn.predict method over my entire DataFrame in one batch.
How can I run knn.predict() on the two feature columns of my whole DataFrame and store the output in another DataFrame?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
knn_df = pd.read_csv("/Users/helenapunset/Desktop/knn_dataframe.csv")
# x is the set of features
x = knn_df[['latitude', 'longitude']]
# y is the target variable
y = knn_df['Class']
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
# training the data
knn.fit(x_train,y_train)
# test score was approximately 69%
knn.score(x_test,y_test)
# this is predicted to be a safe zone
crime_prediction = knn.predict([[25.787882, -80.358427]])
print(crime_prediction)
In the last line of the code I was able to pass the two features I am using, latitude and longitude, from my DataFrame knn_df. But this is a single point. I have been searching the documentation for a way to run this knn prediction on the entire DataFrame and cannot seem to find one. Is it possible to use a for loop for this?
Let the new set to be predicted be 'knn_df_predict'. Assuming the same column names, try the following lines of code:
x_new = knn_df_predict[['latitude', 'longitude']]  # select the feature columns
crime_prediction = knn.predict(x_new)  # predict for every row of the new set
knn_df_predict['prediction'] = crime_prediction  # add the predictions to the DataFrame
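To see the whole thing run end to end, here is a small sketch with made-up coordinates and made-up class labels ("safe" / "unsafe"). None of the data comes from the question. The point is only that predict() accepts the full frame, so no for loop is needed.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two well-separated clusters with hypothetical labels.
rng = np.random.default_rng(0)
knn_df = pd.DataFrame({
    "latitude": np.concatenate([rng.normal(25.7, 0.01, 20), rng.normal(25.9, 0.01, 20)]),
    "longitude": np.concatenate([rng.normal(-80.3, 0.01, 20), rng.normal(-80.1, 0.01, 20)]),
    "Class": ["safe"] * 20 + ["unsafe"] * 20,
})

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(knn_df[["latitude", "longitude"]], knn_df["Class"])

# New rows to classify; predict() handles all of them in one call.
knn_df_predict = pd.DataFrame({
    "latitude": [25.70, 25.90, 25.71],
    "longitude": [-80.30, -80.10, -80.29],
})
knn_df_predict["prediction"] = knn.predict(knn_df_predict[["latitude", "longitude"]])
print(knn_df_predict)
```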
I'm new to scikit-learn and I want to take the prediction values, convert them back to text, and output them to an Excel file.
The project is set up so that it takes a row of strings and predicts which category it belongs to (there are approximately 5 categories).
Description | Actual Answer | Prediction
Some string that is random in length per row | Car | Truck
I want the Excel file output to look something like the table above. I do not want to output the numerical prediction results; I want to output the actual text itself.
Can anyone help me with how to do this?
This is my code so far:
X = df['without_Tags']
Y = df['Tower']
tokens = Tokenizer()
VectorX = tokens.texts_to_sequences(df['without_Tags'].values)
VectorX = pad_sequences(VectorX, maxlen=200)
VectorY = pd.get_dummies(df['Tower'])
X_train, X_test, y_train, y_test = train_test_split(VectorX, VectorY, test_size=0.20, random_state=0)
# Model Creation
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
You can do this:
import pandas as pd
CSV = pd.DataFrame({
    "Prediction": y_pred
})
CSV.to_csv("prediction.csv", index=False)
The file will be named "prediction.csv" and will be saved in the current working directory (usually the directory you run the script from).
Update:
import pandas as pd
csv = pd.DataFrame(y_pred, columns=["1st", "2nd", "3rd", "4th", "5th"])
csv.to_csv("pred.csv", index=False)
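Since the question asks for the label text rather than numbers, here is one possible sketch, not necessarily what the answerer had in mind. It assumes the targets were one-hot encoded with pd.get_dummies, as in the question, so the column names of that frame hold the label text. The features and the label "Bike" are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up features and labels standing in for VectorX and df['Tower'].
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [10, 0], [10, 1], [11, 0]])
labels = pd.Series(["Car"] * 3 + ["Truck"] * 3 + ["Bike"] * 3)

# One column per category; each column is named after the category text.
Y = pd.get_dummies(labels, dtype=int)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, Y)

X_new = np.array([[0.2, 0.2], [5.5, 5.5], [10.5, 0.5]])
y_pred = np.asarray(knn.predict(X_new))  # one row per sample, same column order as Y

# Map each row back to the name of the column holding its largest value.
# Caveat: a row of all zeros would silently map to the first column.
pred_text = Y.columns[y_pred.argmax(axis=1)]

out = pd.DataFrame({"Prediction": pred_text})
print(out)
```

The resulting frame can be written out with to_csv exactly as in the answer. A simpler alternative is to skip get_dummies and fit the classifier on the text labels directly, since scikit-learn classifiers accept string targets and then predict strings.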
I have a large dataset (around 200k rows). I want to split it randomly into 2 parts, 70% as the training data and 30% as the testing data. Is there a way to do this in Python? Note that I also want to save these datasets as Excel or CSV files on my computer. Thanks!
from sklearn.model_selection import train_test_split
#split the data into train and test set
train,test = train_test_split(data, test_size=0.30, random_state=0)
#save the data
train.to_csv('train.csv',index=False)
test.to_csv('test.csv',index=False)
Start by importing the following:
from sklearn.model_selection import train_test_split
import pandas as pd
To split the data, you can use the train_test_split function from the sklearn package:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
where X and y are taken from your original dataframe.
Later, you can export each of them as CSV using the pandas package (the file names here are placeholders; without a path, to_csv only returns the CSV text and writes nothing to disk):
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
The same goes for the y data as well.
EDIT: as you clarified the question and need both the X and y columns in the same file, you can do the following:
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)
and then export them to CSV as I mentioned above.
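Putting the split and the export together, here is a minimal sketch. The frame is made up (100 rows instead of 200k, with hypothetical 'feature' and 'target' columns) and the output goes to a temporary folder, so swap in your own data and paths.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the real dataset.
data = pd.DataFrame({"feature": range(100), "target": [i % 2 for i in range(100)]})

# Features and target stay together, so each file holds complete rows.
train, test = train_test_split(data, test_size=0.3, random_state=42)

out_dir = tempfile.mkdtemp()  # hypothetical output location; use any folder you like
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")
train.to_csv(train_path, index=False)
test.to_csv(test_path, index=False)

print(len(train), len(test))
```

For Excel output, DataFrame.to_excel works the same way, but it needs an Excel writer package such as openpyxl installed.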
I have a dataset that shows whether a person has diabetes based on indicators, it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (that need to be predicted)? That is to say, should I upload the file again? Let's say I want to predict the following:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means that your prediction variable will come both from data that have already been used for model fitting (i.e. the X_train part) and from data unseen by the model during training (the X_test part). Not quite sure what your final objective is (nor is this what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new for which to predict the outcome, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again removing the Outcome column, as you have done with X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
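One detail worth spelling out: the model in the question was fitted on scaled features, so new rows should go through the same fitted scaler (transform, not fit_transform) before predict. Below is a sketch under that assumption. The data are synthetic and the column names Glucose and BMI are made up for illustration; only Outcome comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for diabetes.csv: two made-up features plus Outcome.
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, 200)
bmi = rng.normal(30, 5, 200)
data = pd.DataFrame({
    "Glucose": glucose,
    "BMI": bmi,
    "Outcome": (glucose > 125).astype(int),
})

feature_cols = [c for c in data.columns if c != "Outcome"]
sc = StandardScaler()
X = sc.fit_transform(data[feature_cols])  # the scaler is fitted on the training features
model = LogisticRegression().fit(X, data["Outcome"])

# New rows with a blank Outcome: same feature columns, same fitted scaler.
data_new = pd.DataFrame({
    "Glucose": [90.0, 180.0],
    "BMI": [25.0, 35.0],
    "Outcome": [np.nan, np.nan],
})
X_new = sc.transform(data_new[feature_cols])  # transform only, no refitting
data_new["prediction"] = model.predict(X_new)
print(data_new)
```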
I have a .csv file that contains my data. I would like to do Logistic Regression, Naive Bayes and Decision Trees. I already know how to implement these.
However, my teacher wants me to split the data in my .csv file into 80% and let my algorithms predict the other 20%. I would like to know how to actually split the data in that way.
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()
with open("diabetes.csv", "rb") as f:
    data = f.read().split()
    train_data = data[:80]
    test_data = data[20:]
I tried to split it like this (of course it isn't working).
Workflow
Load the data (see How do I read and write CSV files with Python?)
Preprocess the data (e.g. filtering / creating new features)
Make the train-test (validation and dev-set) split
Code
scikit-learn's sklearn.model_selection.train_test_split is what you are looking for:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
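For the 80/20 split the question asks about, test_size would be 0.2. Here is a rough sketch with the three model types mentioned. It assumes synthetic data from make_classification in place of diabetes.csv, GaussianNB as the Naive Bayes variant, and default settings throughout, so treat the scores as illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and labels loaded from diabetes.csv.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# test_size=0.2 keeps 80% for training and holds out 20% for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn from the 80%
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out 20%
    print(name, round(scores[name], 3))
```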
splitted_csv = "value1,value2,value3".split(',')
print(str(splitted_csv)) #['value1', 'value2', 'value3']
print(splitted_csv[0]) #value1
print(splitted_csv[1]) #value2
print(splitted_csv[2]) #value3
There are also libraries that parse CSV and let you access a value by column name, but from your example I thought you needed a more low-level way to do it.
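As one example of access by column name, Python's standard csv module has DictReader. A small sketch, using an in-memory string with made-up columns instead of a real file:

```python
import csv
import io

# In-memory stand-in for a CSV file; the column names and values are made up.
csv_text = "name,age,outcome\nalice,34,1\nbob,51,0\n"

# DictReader uses the header row as keys, so values are looked up by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0]["name"])
print(rows[1]["age"])
```

Note that every value comes back as a string, so numeric columns need an explicit int() or float() conversion.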