My aim is to predict between five and six numbers in an array, based on csv data with six columns. The below script is supposed to predict only one number, from an array of 5. I assumed I could work my way up to the entire 5 or 6 from there, but I might be wrong about that.
MRE:
import csv
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('subdata.csv')
ft = [9,8,15,4,6]
fintest = np.array(ft)
def train():
    df.astype(np.float64)
    df.drop(['One'], axis = 1)
    X = df
    y = X['One']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(X_train)
    test_scaled = scaler.transform(X_test)
    tree_model = DecisionTreeRegressor()
    rf_model = RandomForestRegressor()
    tree_model.fit(train_scaled, y_train)
    rf_model.fit(train_scaled, y_train)
    rfp = rf_model.predict(fintest.reshape(1, -1))
    tmp = tree_model.predict(fintest.reshape(1, -1))
    print(rfp)
    print(tmp)
train()
Could you please clarify what I am asking this script to predict in the final rfp and tmp lines?
My data looks like this:
The script as is currently gives an error:
Traceback (most recent call last):
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 43, in <module>
train()
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 37, in train
rfp = rf_model.predict(fintest.reshape(1, -1))
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 784, in predict
X = self._validate_X_predict(X)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py", line 402, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 437, in _validate_data
self._check_n_features(X, reset=reset)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 365, in _check_n_features
raise ValueError(
ValueError: X has 5 features, but DecisionTreeRegressor is expecting 6 features as input.
By adding a sixth digit to the ft array I can get around this error, but I then receive wildly inaccurate outputs that appear to have no correlation with the data whatsoever. For example, by setting ft to [9,8,15,4,6,2], which is the first row in the csv file, and setting X and y to use the 'Four' label, I get outputs of [37.22] and [37.].
My other questions will probably be answered by my first. But here they are:
Could you also please clarify why I need to pass an array of 6?
And why are my predictions so close together (all ~35), no matter what array I pass for the prediction?
The way you defined X is wrong: it contains 6 features.
Your y is contained in your X, given the way you defined it:
X = df #6 features
y = X['One'] #1 feature
I think what you wanted to do was something like this:
X = df[['Two', 'Three', 'Four', 'Five', 'Zero']]
y = df['One']
It depends on your data, and as far as I can see your data is an example without context. As it stands, you are training two different models to predict the 'One' column, which doesn't make sense to me.
The error occurs because df.drop(['One'], axis = 1) returns a new DataFrame instead of modifying df, so X = df still includes the 'One' column that you then assign to y with y = X['One']. The models are therefore trained on 6 features, including the target itself, and predict() rejects an input with only 5.
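The points above can be put together in a minimal sketch. It assumes a synthetic stand-in for subdata.csv (the values are made up, and the column order is a guess based on the names used in the answer), keeps the result of drop(), and passes the new sample through the same scaler before predicting, since the model is fit on scaled features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for subdata.csv: random integers, six columns.
rng = np.random.default_rng(0)
cols = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five']
df = pd.DataFrame(rng.integers(1, 50, size=(200, 6)), columns=cols).astype(np.float64)

# drop() returns a new DataFrame, so the result has to be kept.
X = df.drop(columns=['One'])   # 5 feature columns
y = df['One']                  # the single target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)

rf_model = RandomForestRegressor(n_estimators=20, random_state=0)
rf_model.fit(train_scaled, y_train)

# A new sample needs the same 5 columns and the same scaling as the training data.
new_row = pd.DataFrame([[9, 8, 15, 4, 6]], columns=X.columns)
prediction = rf_model.predict(scaler.transform(new_row))
print(prediction.shape)
```

With this layout the model answers the question "given these five values, what is 'One'?". Feeding unscaled values to a model trained on scaled data is one plausible reason for predictions that barely change between inputs.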
# -*- coding: utf-8 -*-
"""
Created on Sun Jun 3 01:36:10 2018
#author: Sharad
"""
import numpy as np
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
dbfile=open("D:/df_train_api.pk", 'rb')
df=pickle.load(dbfile)
y=df[['label']]
features=['groups']
X=df[features].copy()
X.columns
y.columns
#for splitting into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
#for vectorizing
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
The problem lies in the vectorisation, as it gives me X_train_counts of shape (1, 1), and I don't know why. That's why MultinomialNB can't be fitted: X_train_tfidf has 1 sample while y_train has 3185.
I'm new to machine learning. Any help would be much appreciated.
traceback:
Traceback (most recent call last):
File "<ipython-input-52-5b5949203f76>", line 1, in <module>
runfile('C:/Users/Sharad/.spyder-py3/hypothizer.py', wdir='C:/Users/Sharad/.spyder-py3')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Sharad/.spyder-py3/hypothizer.py", line 37, in <module>
clf = MultinomialNB().fit(X_train_tfidf, y_train)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 579, in fit
X, y = check_X_y(X, y, 'csr')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3185]
CountVectorizer (and, by inheritance, TfidfVectorizer) expects an iterable of raw documents in fit() and fit_transform():
raw_documents : iterable
An iterable which yields either str, unicode or file objects.
So internally it will do this:
for doc in raw_documents:
    do_processing(doc)
When you pass a pandas DataFrame to it, the for ... in X loop yields only the column names. Hence only a single document is processed (the column name, instead of the data inside that column).
You need to do this:
X = df[features].values.ravel()
Or else do this:
X=df['groups'].copy()
There is a difference between the code above and your code. You are doing this:
X=df[features].copy()
Here features is already a list of columns. So essentially this becomes:
X=df[['groups']].copy()
The difference is in the double brackets here (which return a dataframe) and single bracket in my code (which returns a Series).
for value in X works as expected when X is a series, but only returns column names when X is a dataframe.
Hope this is clear.
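The difference is easy to reproduce on a toy frame. The sketch below uses a made-up three-row 'groups' column (the contents are an assumption, only the column name comes from the question) and compares what CountVectorizer sees in each case:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data; only the column name 'groups' comes from the question.
df = pd.DataFrame({'groups': ['red green', 'green blue', 'blue blue red']})

# Iterating a DataFrame yields its column names, not its rows.
frame_items = list(df[['groups']])
# Iterating a Series yields the values stored in the column.
series_items = list(df['groups'])
print(frame_items)
print(series_items)

# The DataFrame is treated as one document (the string 'groups').
frame_counts = CountVectorizer().fit_transform(df[['groups']])
# The Series is treated as three documents.
series_counts = CountVectorizer().fit_transform(df['groups'])
print(frame_counts.shape)   # (1, 1)
print(series_counts.shape)  # (3, 3)
```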
I am trying to find the score of a given data set with respect to some training data. I have written the following code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [[10,20,30,40,50,60,70,80,90], [10,20,30,40,50,60,70,80,90]]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
On running the program I get the error:
Traceback (most recent call last):
File "trial.py", line 16, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 89, in _check_targets
raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported
On checking the documentation related to the score method it says:
score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
Both X and y in my case are 2D arrays.
I also went through this question but I couldn't understand where I am going wrong.
EDIT
So as per the answer and the comments that follow, I have edited the program as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
mlb = MultiLabelBinarizer()
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [100,200]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
After this edit I am getting the error:
Traceback (most recent call last):
File "trial.py", line 19, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 82, in _check_targets
"".format(type_true, type_pred))
ValueError: Can't handle mix of binary and multiclass-multioutput
According to the documentation:
Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.
The score method invokes sklearn's accuracy metric but this isn't supported for the multi-class, multi-output classification problem you've defined.
It's not clear from your question if you really intend to solve a multi-class, multi-output problem. If that's not your intention, then you should restructure your input arrays.
If on the other hand you really want to solve this kind of problem, you'll simply need to define your own scoring function.
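As a rough illustration of what such a scoring function could look like, here is a sketch with two hypothetical helpers, per_output_accuracy and exact_match_ratio. Neither is part of scikit-learn; they are plain NumPy comparisons, and which one is appropriate depends on what you want to measure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def per_output_accuracy(y_true, y_pred):
    """Fraction of correct predictions, computed separately for each output column."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred, axis=0)

def exact_match_ratio(y_true, y_pred):
    """Fraction of samples for which every output column is predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

# Hand-made example: only the last column of the first row is wrong.
y_true = [[1, 2, 3], [4, 5, 6]]
y_pred = [[1, 2, 0], [4, 5, 6]]
column_scores = per_output_accuracy(y_true, y_pred)   # 1.0, 1.0, 0.5
row_score = exact_match_ratio(y_true, y_pred)         # 0.5

# Usage with a multi-output forest on made-up data: call predict() and
# score the result yourself instead of calling model.score().
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 2], [1, 2], [0, 3], [1, 3]]
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, Y)
pred = model.predict(X)
print(per_output_accuracy(Y, pred), exact_match_ratio(Y, pred))
```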
UPDATE
Since you are not solving a multi-class, multi-label problem you should restructure your data so that it looks something like this:
from sklearn.ensemble import RandomForestClassifier
# create the classifier (same settings as in the question)
randomForest = RandomForestClassifier(n_estimators = 200)
# training data
X = [
    [1,2,3,4,5,6,7,8,9],
    [1,2,3,4,5,6,7,8,9]
]
y = [0,1]
# fit the model
randomForest.fit(X,y)
# test data
Xtest = [
    [1,2,0,4,5,6,0,8,9],
    [1,1,3,1,5,0,7,8,9]
]
ytest = [0,1]
output = randomForest.score(Xtest,ytest)
print(output) # 0.5
I am trying to load a CSV file into a NumPy array and use the array in LogisticRegression etc. I am now struggling with the error shown below:
import numpy as np
import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
dataset = pd.read_csv('../Bookie_test.csv').values
X = dataset[1:, 32:34]
y = dataset[1:, 14]
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
I got an error:
C:\Anaconda32\lib\site-packages\sklearn\utils\validation.py:332: UserWarning: The normalize function assumes floating point values as input, got object
"got %s" % (estimator, X.dtype))
Traceback (most recent call last):
File "X:/test3.py", line 23, in <module>
normalized_X = preprocessing.normalize(X)
File "C:\Anaconda32\lib\site-packages\sklearn\preprocessing\data.py", line 553, in normalize
norms = row_norms(X)
File "C:\Anaconda32\lib\site-packages\sklearn\utils\extmath.py", line 65, in row_norms
norms = np.einsum('ij,ij->i', X, X)
TypeError: invalid data type for einsum
I am new to Python and don't like this chain of transformations:
Load CSV to Pandas
Convert Pandas to NumPy
Use NumPy in LogisticRegression
Is there a simpler approach, like:
Load to Pandas
Use Pandas DataFrames in ML methods?
Regarding the main question: thanks to Evert for the advice, I will check it.
Regarding #2: I found a great tutorial, http://www.markhneedham.com/blog/2013/11/09/python-making-scikit-learn-and-pandas-play-nice/
and achieved the desired result with pandas + sklearn.
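For reference, a minimal sketch of the pandas + sklearn route. It uses a small made-up frame instead of Bookie_test.csv, and it assumes the object dtype in the warning comes from numeric values stored as strings, so the feature columns are cast to float before being handed to scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Hypothetical data: numeric features stored as strings (dtype object).
df = pd.DataFrame({
    'f1': ['1.0', '2.0', '3.0', '4.0'],
    'f2': ['0.5', '0.1', '0.9', '0.7'],
    'label': [0, 0, 1, 1],
})

# Select columns by name and make them numeric; no manual .values conversion needed.
X = df[['f1', 'f2']].astype(float)
y = df['label']

# scikit-learn accepts the DataFrame and Series directly.
normalized_X = preprocessing.normalize(X)
model = LogisticRegression().fit(X, y)
predicted = model.predict(X)
print(predicted)
```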
I am building a Bayesian Ridge Regression using sklearn on the Parkinson's Telemonitoring Data Set. This is the code:
import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
data1 = pd.read_csv("data.csv")
msk = np.random.rand(len(data1)) < 0.66
train = data1[msk]
test = data1[~msk]
y = train[['motor_UPDRS','total_UPDRS']]
X = train.drop('motor_UPDRS',axis = 1)
X = X.drop('total_UPDRS',axis = 1)
labels = test[['motor_UPDRS','total_UPDRS']]
test = test.drop('motor_UPDRS',axis = 1)
test = test.drop('total_UPDRS',axis = 1)
clf = linear_model.BayesianRidge()
clf.fit(X,y)
The data set is split into roughly 66% training and 33% test data. When I run it, I get the following error:
Traceback (most recent call last):
File "<ipython-input-8-c4e92f3e0bf9>", line 1, in <module>
runfile('C:/Users/Keshav/Desktop/Spring/ML/Project/parkinsons/main6.py', wdir='C:/Users/Keshav/Desktop/Spring/ML/Project/parkinsons')
File "C:\Users\Keshav\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:/Users/Keshav/Desktop/Spring/ML/Project/parkinsons/main6.py", line 27, in <module>
clf.fit(X,y)
File "C:\Users\Keshav\Anaconda\lib\site-packages\sklearn\linear_model\bayes.py", line 212, in fit
self._set_intercept(X_mean, y_mean, X_std)
File "C:\Users\Keshav\Anaconda\lib\site-packages\sklearn\linear_model\base.py", line 159, in _set_intercept
self.coef_ = self.coef_ / X_std
ValueError: operands could not be broadcast together with shapes (20,2) (20,)
Any idea how to resolve it?
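One possible way around the shape mismatch, sketched here under assumptions: BayesianRidge expects a one-dimensional y, so a two-column target can be handled by fitting one regressor per column, for example with MultiOutputRegressor. The data below is synthetic; only the two target column names come from the question, and the feature names f0 to f3 are made up:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in for the Parkinson's data (hypothetical features f0..f3).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(60, 4)), columns=['f0', 'f1', 'f2', 'f3'])
targets = pd.DataFrame({
    'motor_UPDRS': features['f0'] * 2.0 + rng.normal(scale=0.1, size=60),
    'total_UPDRS': features['f1'] * 3.0 + rng.normal(scale=0.1, size=60),
})

# MultiOutputRegressor fits an independent BayesianRidge for each target column.
model = MultiOutputRegressor(BayesianRidge()).fit(features, targets)
predictions = model.predict(features)
print(predictions.shape)
```

Fitting clf.fit(X, y['motor_UPDRS']) and clf.fit(X, y['total_UPDRS']) with two separate BayesianRidge instances would be the equivalent manual version.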