My aim is to predict an array of five or six numbers, based on CSV data with six columns. The script below is supposed to predict only one number from an array of 5; I assumed I could work my way up to the full 5 or 6 from there, but I might be wrong about that.
MRE:
import csv
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('subdata.csv')
ft = [9,8,15,4,6]
fintest = np.array(ft)
def train():
    df.astype(np.float64)
    df.drop(['One'], axis=1)
    X = df
    y = X['One']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(X_train)
    test_scaled = scaler.transform(X_test)
    tree_model = DecisionTreeRegressor()
    rf_model = RandomForestRegressor()
    tree_model.fit(train_scaled, y_train)
    rf_model.fit(train_scaled, y_train)
    rfp = rf_model.predict(fintest.reshape(1, -1))
    tmp = tree_model.predict(fintest.reshape(1, -1))
    print(rfp)
    print(tmp)

train()
Could you please clarify what I am asking this script to predict in the final rfp and tmp lines?
My data has six columns, labelled 'Zero' through 'Five'; the first row is 9, 8, 15, 4, 6, 2.
The script as-is currently gives an error:
Traceback (most recent call last):
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 43, in <module>
train()
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 37, in train
rfp = rf_model.predict(fintest.reshape(1, -1))
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 784, in predict
X = self._validate_X_predict(X)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py", line 402, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 437, in _validate_data
self._check_n_features(X, reset=reset)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 365, in _check_n_features
raise ValueError(
ValueError: X has 5 features, but DecisionTreeRegressor is expecting 6 features as input.
By adding a sixth digit to the ft array I can get around this error, but I then receive wildly inaccurate outputs that appear to have no correlation with the data whatsoever. For example, setting ft to [9,8,15,4,6,2] (the first row in the CSV file) and setting X and y to use the 'Four' label, I get outputs of [37.22] and [37.].
My other questions will probably be answered by my first. But here they are:
Could you also please clarify why I need to pass an array of 6?
And why are my predictions so close together (all ~35), no matter what array I pass for the prediction?
The way you defined your X is wrong: it contains all 6 features, including your target. Note that df.drop(['One'], axis=1) returns a new DataFrame rather than modifying df in place, so X = df still holds six columns, and your y is contained in your X:
X = df        # 6 features, 'One' still included
y = X['One']  # the target, taken from inside X
I think what you wanted to do was something like this:
X = df[['Two', 'Three', 'Four', 'Five', 'Zero']]
y = df['One']
That is also why you have to pass an array of 6: the models were fitted on all six columns, so predict() expects six features per sample. As for the predictions clustering around ~35, tree-based regressors predict averages of training targets; when the features carry no signal about the target, as with lottery draws, those averages all sit near the overall mean of the column, whatever input you pass. Your data is an example without context, but trying to predict the 'One' column of what looks like lottery data with these two models doesn't make sense to me.
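For completeness, a minimal sketch of a corrected train(), reusing df, fintest, and the imports from the script above; note it also scales fintest with the fitted scaler before predicting, a step the original script skips (the column order of fintest is assumed to match X):
def train():
    X = df[['Two', 'Three', 'Four', 'Five', 'Zero']]  # 5 features, target excluded
    y = df['One']                                     # the target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(X_train)
    test_scaled = scaler.transform(X_test)
    rf_model = RandomForestRegressor()
    rf_model.fit(train_scaled, y_train)
    print(rf_model.score(test_scaled, y_test))  # R^2 on the held-out rows
    # scale the new sample exactly like the training data before predicting
    print(rf_model.predict(scaler.transform(fintest.reshape(1, -1))))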
Related
I have a CSV file of
lemma,trained
iran seizes bitcoin mining machines power spike,-1
... (goes on for 1054 lines)
And my code looks like:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('lemma copy.csv')
X = df.iloc[:, 0].values
y = df.iloc[:, 1].values
print(y)
X_train, X_test, y_train, y_test =train_test_split(X,y,test_size= 0.25, random_state=0)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
I am getting the error
Traceback (most recent call last):
File "/home/arctesian/Scripts/School/EE/Algos/Qual/bayes/sklean.py", line 20, in <module>
X_train = sc_X.fit_transform(X_train)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 867, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 809, in fit
return self.partial_fit(X, y, sample_weight)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 844, in partial_fit
X = self._validate_data(
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 577, in _validate_data
X = check_array(X, input_name="X", **check_params)
File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/utils/validation.py", line 856, in check_array
array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'twitter ios beta lays groundwork bitcoin tips'
Printing this out shows that the random split happens to put that line first, so it must be a problem with encoding the data. How do I fix this problem?
Sometimes searching for the right question on Stack Overflow (or the internet as a whole) is difficult. The reason you're having trouble finding an answer is that your question is really an NLP problem: your CSV contains lemmas, i.e. raw text rather than numbers.
You'll have to preprocess your data in some way, such as by using word vectors. Word vectors come from a model trained on a large corpus of text so that each word can be represented by an N-length vector. I'm greatly simplifying this, of course.
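To illustrate the word-vector route, a small sketch using gensim's Word2Vec (my choice of library here, not something from the question); each document can then be represented by, say, the mean of its word vectors:
from gensim.models import Word2Vec  # assumes gensim 4.x
import numpy as np

docs = [["iran", "seizes", "bitcoin", "mining", "machines"],
        ["twitter", "ios", "beta", "lays", "groundwork", "bitcoin", "tips"]]
model = Word2Vec(docs, vector_size=50, window=3, min_count=1)
# one fixed-length vector per document: the average of its word vectors
doc_vecs = np.array([model.wv[doc].mean(axis=0) for doc in docs])
print(doc_vecs.shape)  # (2, 50)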
Another strategy is to use the bag of words approach. A bag of words takes the count of each word that appears in your corpus. You use the bag of words rather than the original strings to train your models. Here's a very small example using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I like cats", "meow", "Espeon is a cool Pokemon",
          "my friend has lotsof pet fish",
          "my pet cat wants to eat my friend's fish",
          "spams spam", "not spam",
          "someone please hire me for a job", "nlp is cool",
          "this corpus isn't actually large enough to use counter vectorizer well"]
count_vec = CountVectorizer(ngram_range=(1, 3), stop_words="english").fit(corpus)
corpus_cv = count_vec.transform(corpus)
I skipped steps to keep the code concise, but the above is the gist of using CountVectorizer.
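As a hedged sketch of the skipped steps: once you have corpus_cv, you can fit a classifier directly on the counts. The labels below are made up for the toy corpus:
from sklearn.naive_bayes import MultinomialNB

toy_labels = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # hypothetical labels, one per document in corpus
clf = MultinomialNB().fit(corpus_cv, toy_labels)
print(clf.predict(count_vec.transform(["spams spam spam"])))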
So I fixed it by using @joshua megauth's method and getting rid of pandas. Did this:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from coalas import csvReader as c
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# df = pd.read_csv('lemma copy.csv')

def vect(X):
    features = vectorizer.fit_transform(X)
    features_nd = features.toarray()
    return features_nd

def test():
    y_pred = classifier.predict(X_test)
    print(accuracy_score(y_pred, y_test))

if __name__ == "__main__":
    c.importCSV('lemma copy.csv')
    vectorizer = CountVectorizer(
        analyzer='word',
        lowercase=False,
    )
    X = c.lemma
    # y = c.Best
    y = c.trained
    features_nd = vect(X)
    X_train, X_test, y_train, y_test = train_test_split(features_nd, y, test_size=0.2, random_state=0)
    sc_X = StandardScaler()
    # print(X_train)
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)  # transform (not fit_transform) so the test set reuses the training scaling
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    test()
I am following Müller & Guido's Introduction to Machine Learning with Python book, and I am trying to run classifications on this dataset.
So far my code looks like this:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
# Read the Churn data into a dataset (pandas) from the csv file
dataset = pd.read_csv(r'C:\Users\Amalie\IdeaProjects\INFO284\src\Lab2.csv')
# Make the data into a 2D NumPy array (as scikit-learn expects for the data)
dataframe = dataset[['SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'Churn']]
y = dataframe['Churn'] # Target
X = dataframe.drop('Churn', axis=1)  # Features (all columns other than target column 'Churn')
# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20) # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))
When I run it, I get this error:
Traceback (most recent call last):
File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 19, in <module>
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
estimator=estimator,
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'No'
Process finished with exit code 1
It says that the problem is with this line
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
I have used the fit() method when running other classification problems, but I've never come across this issue before. What am I doing wrong?
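The ValueError points at string-valued columns: in this dataset, fields like 'Partner', 'Dependents', and 'Churn' typically hold 'Yes'/'No' strings, which LogisticRegression cannot consume. A minimal sketch of one common fix, assuming those columns are the culprits: encode the categorical features with get_dummies and map the target to 0/1.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv(r'C:\Users\Amalie\IdeaProjects\INFO284\src\Lab2.csv')
dataframe = dataset[['SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService',
                     'MultipleLines', 'InternetService', 'OnlineSecurity', 'Churn']]
y = dataframe['Churn'].map({'No': 0, 'Yes': 1})      # target as 0/1
X = pd.get_dummies(dataframe.drop('Churn', axis=1))  # one-hot encode the string columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))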
I am working on using sklearn's train_test_split to create a training set and testing set of my data.
My script is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import neighbors
# function to perform one-hot encoding and drop the original column,
# in this case the part number
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
# read in data from csv
data = pd.read_csv('export2.csv')
# one hot encode the part number
new = encode_and_bind(data, 'PART_NO')
# create the labels, or field we are trying to estimate
label = new['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# remove the header
thedata = thedata[1:]
print(label.shape)
print(thedata.shape)
# split into training and testing sets
train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)
# create a knn model
knn = neighbors.KNeighborsRegressor()
# fit it with our data
knn.fit(train_data, train_classes)
Running it, I get the following:
C:\Users\jerry\Desktop>python test.py
(6262,)
(6262, 253)
Traceback (most recent call last):
File "test.py", line 37, in <module>
knn.fit(train_data, train_classes)
File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
check_consistent_length(X, y)
File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]
So it looks like my X and Y both have the same number of rows (6262) but a different number of columns, which I expected, since I thought Y was just supposed to be the one column with the label or value you are trying to predict.
How can I use train_test_split to give me a training and testing dataset that I can use for a KNN Regressor?
You've switched the outputs of train_test_split, from what I can tell.
The function returns, in order: training features, testing features, training labels, testing labels.
The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features and y is the targets (labels, or, I'm assuming, "classes" in your code).
You appear to be unpacking it as if it returned X_train, y_train, X_test, y_test instead.
Try this and see if it works for you:
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
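That also explains the numbers in the ValueError: with your unpacking, the second variable was actually the 1879-row test feature set, so fit() saw 4383 X samples against 1879 y values. A quick sanity check with the corrected order (the shape comments follow from your 6262 rows and test_size=0.3):
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size=0.3)
print(train_data.shape, train_classes.shape)  # (4383, 253) (4383,)
print(test_data.shape, test_classes.shape)    # (1879, 253) (1879,)
knn.fit(train_data, train_classes)            # X and y now agree on sample count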
# -*- coding: utf-8 -*-
"""
Created on Sun Jun 3 01:36:10 2018
#author: Sharad
"""
import numpy as np
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
dbfile=open("D:/df_train_api.pk", 'rb')
df=pickle.load(dbfile)
y=df[['label']]
features=['groups']
X=df[features].copy()
X.columns
y.columns
# for splitting into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
#for vectorizing
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
The problem lies in the vectorisation: it gives me an X_train_counts of size [1, 1], and I don't know why. That's why MultinomialNB can't perform the fit, since y_train has 3185 samples.
I'm new to machine learning. Any help would be much appreciated.
traceback:
Traceback (most recent call last):
File "<ipython-input-52-5b5949203f76>", line 1, in <module>
runfile('C:/Users/Sharad/.spyder-py3/hypothizer.py', wdir='C:/Users/Sharad/.spyder-py3')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Sharad/.spyder-py3/hypothizer.py", line 37, in <module>
clf = MultinomialNB().fit(X_train_tfidf, y_train)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 579, in fit
X, y = check_X_y(X, y, 'csr')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3185]
CountVectorizer (and, by inheritance, TfidfVectorizer) expects an iterable of raw documents in fit() and fit_transform():
raw_documents : iterable
An iterable which yields either str, unicode or file objects.
So internally it will do this:
for doc in raw_documents:
    do_processing(doc)
When you pass a pandas DataFrame to it, the for ... in X loop yields only the column names, and hence only a single "document" (the column name itself) is processed instead of the data inside that column.
You need to do this:
X = df[features].values.ravel()
Or else do this:
X = df['groups'].copy()
There is a difference between the code above and the code you are using. You are doing this:
X = df[features].copy()
Here features is already a list of columns, so essentially this becomes:
X = df[['groups']].copy()
The difference is the double brackets here (which return a DataFrame) versus the single brackets in my code (which return a Series).
for value in X works as expected when X is a Series, but only yields column names when X is a DataFrame.
Hope this is clear.
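A quick illustration of that last point, using a made-up two-row frame:
import pandas as pd

df = pd.DataFrame({'groups': ['first doc', 'second doc']})
for v in df[['groups']]:   # DataFrame: iteration yields the column names
    print(v)               # -> groups
for v in df['groups']:     # Series: iteration yields the values
    print(v)               # -> first doc, second doc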
I've been trying out linear regression using sklearn. Sometimes I get a ValueError, sometimes it works fine. I'm not sure which approach to use.
Error Message is as follows:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 512, in fit
y_numeric=True, multi_output=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 531, in check_X_y
check_consistent_length(X, y)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 200]
The code is something like this:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0);
x = data['TV']
y = data['Sales']
lm = LinearRegression()
lm.fit(x,y)
Please help me out. I am a student, trying to pick up on Machine Learning basics.
lm.fit expects X to be a
numpy array or sparse matrix of shape [n_samples, n_features]
Your x has shape:
In [6]: x.shape
Out[6]: (200,)
Just use:
lm.fit(x.values.reshape(-1, 1), y)
(Go through .values here: on recent pandas versions a Series no longer has a reshape method.)
Pass your X in as a DataFrame and not a Series; for a single feature you can use [[]] "double brackets" or to_frame():
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0);
x = data[['TV']]
Or
x = data['TV'].to_frame()
y = data['Sales']
lm = LinearRegression()
lm.fit(x,y)
Output:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)