I have a comma-separated CSV file with two numerical columns, inputs and outputs. They are correlated in a more or less linear way; see below. The sample I have is very small.
Below is the Python code I wrote using sklearn to predict values. Somehow it's not giving me correct values (reasonable predictions). I am quite new to this, so please bear with me.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Data:
89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259
Actually, the problem is that you are misunderstanding your own code.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
Up to here, you have loaded the dataframe and then separated X and y from the dataset.
labels represents the y values.
train1 represents the x values.
Since you wrote that you can't understand train1 = data.drop(['kg'], axis=1), let me explain it: from the dataframe, which contains both the 'kg' and 'cm' columns, it removes the 'kg' column (axis=1 means column, axis=0 means row). Hence only 'cm' remains, which is your x.
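To see concretely what that drop call returns, here is a tiny sketch on a hypothetical two-row frame with the same column names (not your actual data):
import pandas as pd

# Hypothetical two-row frame with the same columns as above.
df = pd.DataFrame({'kg': [89, 86], 'cm': [155, 161]})

X = df.drop(['kg'], axis=1)  # removes the 'kg' column, leaving only 'cm'
print(X)
#     cm
# 0  155
# 1  161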
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Now you train the model on x values, which represent 'cm', and y values, which represent 'kg'.
When you call predict(80), you are supplying a 'cm' value of 80. Let me just plot 'cm' vs 'kg' for the training data.
When you input a height of 80 you are going further left, even further left than the range of the plot. As you can see, as x decreases, y increases; that is, as 'cm' decreases, 'kg' increases. Hence the output is about 110, which is larger.
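To reproduce that plot, here is a minimal matplotlib sketch (assuming the x_train and y_train produced by the snippet above, where the single feature column is 'cm'):
import matplotlib.pyplot as plt

# Scatter the training data: 'cm' on the x-axis, 'kg' on the y-axis.
plt.scatter(x_train['cm'], y_train)
plt.xlabel('cm')
plt.ylabel('kg')
plt.title("Training data: 'kg' falls as 'cm' rises")
plt.show()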
from io import StringIO
input_data = StringIO("""89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259""")
import pandas as pd
data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1) #This is similar to selecting the kg column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.
I think your problem is the small data size. The code flow looks normal to me. I would suggest you find the p-value for the input-output relationship; it will tell you whether the correlation found by your linear regression is significant (p-value < 0.05).
You can find the p-value using:
from scipy.stats import linregress
print(linregress(input, output))
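For instance, applied to the sample data above (a minimal sketch, assuming the columns are loaded into data['cm'] and data['kg'] as in the earlier answer):
from scipy.stats import linregress

# linregress returns slope, intercept, rvalue, pvalue and stderr.
result = linregress(data['cm'], data['kg'])
print(result.pvalue)  # the relationship is significant if this is < 0.05
print(result.rvalue)  # the correlation coefficient itself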
To get a p-value from scikit-learn itself, you would probably need to compute it manually from the regression results. Good luck.
Related
I am performing a random forest on my data frame using pandas, but I can't seem to get it right. My data frame contains two columns with non-binary values (letters), so I think that's why it's not letting me divide the data into values and attributes. My code is below. For reference, columns 0 and 7 are the non-binary columns.
import pandas as pd
import numpy as np
new_df.head()
X = new_df.iloc[:, 1:16].values
y = new_df.iloc[:, 16].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
When I run this, the error I get is "could not convert string to float: 'TCGA-CH-5740'".
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Since a decision tree operates by splitting a feature based on its value (e.g. "is this value greater than 5? Is it greater than 10?"), it requires the features to be numbers.
I would assume that sklearn automatically converts all data to floats first, but since some of your columns contain strings that cannot be converted to a number, e.g. 'TCGA-CH-5740', it fails.
One way to overcome this is to use OneHotEncoder to convert your strings to numbers, or to use an implementation that accepts categorical values directly, such as LightGBM or CatBoost.
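A rough sketch of the first option on a hypothetical frame, where 'id' stands in for your identifier column (values like 'TCGA-CH-5740' usually carry no predictive signal, so dropping them is often simpler than encoding them) and 'grade' for a genuinely categorical column; here pandas.get_dummies plays the role of one-hot encoding:
import pandas as pd

# Hypothetical frame: 'id' is a free-text identifier, 'grade' is categorical, 'size' is numeric.
df = pd.DataFrame({'id': ['TCGA-CH-5740', 'TCGA-CH-5741'],
                   'grade': ['high', 'low'],
                   'size': [1.2, 3.4]})

X = df.drop(columns=['id'])               # drop the identifier column
X = pd.get_dummies(X, columns=['grade'])  # one-hot encode the categorical column
print(X)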
I'm trying to create a kNN algorithm for stock prediction, with at least 80% correct predictions on the test data. I have a problem with the StandardScaler from sklearn: for some reason it says there is a "typo" in the word "Scaler", which I find weird. Does someone know how to solve this issue? If you find more mistakes in the code, please tell me how to fix them, but I think it should be mostly correct (some parts might be wrong). I want the polynomial line to show around a week into the future. I use data from a private API key from Marketstack.com, provided in JSON format. The data consists of EOD (end-of-day) data with a limit of 1000 days in descending order.
# Exports API data to a csv file on my hardware and then I import the csv data after it's sorted
df.to_csv('Test_Sample.csv', index=False)
dataframe = pd.read_csv('Test_Sample.csv')
dataframe['symbol']=dataframe['symbol'].astype(float)
dataframe['exchange']=dataframe['exchange'].astype(float)
dataframe['date']=dataframe['date'].astype(float)
dataframe.info()
X = df.iloc[:, :-1].values
Y = df.iloc[:, 4].values
# 80% training data, 20% testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Scale train and test data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #Here is the mistake, under scaler (Error code: 'Typo in the word scaler')
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Classify data
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
# Train and test result
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))
# Scatter all the data points in a figure
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(X, Y, color='blue')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Financial Instrument Predicted Price')
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, Y)
plt.plot(X, poly.fit_transform(X), color='red')
plt.show()
ValueError: could not convert string to float: 'AAPL'
You don't have a typo. In the comments you said:
ValueError: could not convert string to float: 'AAPL'
The error is actually clear: you have a string in your dataset and you are trying to normalize/standardize it. For most algorithms you need to encode strings into numbers first. Since you did not provide a data sample, before splitting you can check your dataframe with
dataframe.info()
to see whether it contains strings.
Edit: Check whether your first row is supposed to be your header; if so, you can do the following:
dataframe = pd.read_csv('Test_Sample.csv', header = 0)
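A rough sketch of that check plus a simple encoding step (the 'symbol' column name is taken from your snippet; adapt it to whichever columns show up as object):
import pandas as pd

dataframe = pd.read_csv('Test_Sample.csv', header=0)
print(dataframe.dtypes)  # columns reported as 'object' still hold strings

# One simple option: map a string column such as 'symbol' to integer codes before scaling.
dataframe['symbol'] = dataframe['symbol'].astype('category').cat.codes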
I am trying to perform a logistic regression analysis but I don't know where the mistake is in my code. It gives an error on the line logistic_regression.fit(X_train, y_train), but it seems okay as far as I checked against different sources. Can anybody help?
Here is my code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=";")
data = pd.DataFrame(df)
x = data.drop(columns=["churn"]) #features
y = data.churn # target variable
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
There are multiple problems here.
Your first row of headers has a ';' at the end, so an extra column is going to be read. You need to remove that ';' after churn.
The training data that you are trying to use here, X_train, is going to have multiple text/categorical columns. You need to convert these into numbers. Check out OneHotEncoder here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html and LabelEncoder here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
After you have converted your text and categorical data to numbers and removed the extra ';' separator, run your algorithm again.
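As a minimal sketch of that encoding step, pandas.get_dummies is a convenient shortcut that one-hot encodes every text column in one call (this assumes the x and y built above; max_iter is raised here only to help convergence once extra columns appear):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One-hot encode every object/categorical column in the feature frame.
x_encoded = pd.get_dummies(x)

X_train, X_test, y_train, y_test = train_test_split(x_encoded, y, test_size=0.2, random_state=0)

logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train, y_train)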
I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data and f1_score as my evaluation metric. The strange thing is that my model gives me different results, in a pattern, at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
    X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern. For example, it fluctuates within the range of 0.39 to 0.42.
On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means after doing some searching on this community and Google. I guess my two questions can be summarized as:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into train and test sets randomly. Because of this, the model is trained on different training data and tested on different test data on every run, so you get a range of F-scores depending on how well the model happens to be trained.
To reproduce the result on each run, use the random_state parameter. It fixes the state of the random number generator, so the random numbers are generated in the same order every time. It can be any number.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
I'm trying to learn scikit-learn and Machine Learning by using the Boston Housing Data Set.
# I split the initial dataset ('housing_X' and 'housing_y')
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_X, housing_y, test_size=0.25, random_state=33)
# I scaled those two datasets
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train)
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
# I created the model
from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
train_and_evaluate(clf_sgd,X_train,y_train)
Based on this new model clf_sgd, I am trying to predict the y based on the first instance of X_train.
X_new_scaled = X_train[0]
print (X_new_scaled)
y_new = clf_sgd.predict(X_new_scaled)
print (y_new)
However, the result is quite odd to me (1.34032174, instead of 20-30, the range of the house prices):
[-0.32076092 0.35553428 -1.00966618 -0.28784917 0.87716097 1.28834383
0.4759489 -0.83034371 -0.47659648 -0.81061061 -2.49222645 0.35062335
-0.39859013]
[ 1.34032174]
I guess that this 1.34032174 value should be scaled back, but I am trying to figure out how to do it with no success. Any tip is welcome. Thank you very much.
You can use inverse_transform using your scalery object:
y_new_inverse = scalery.inverse_transform(y_new)
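A short usage sketch, continuing from the variables above (recent scikit-learn versions expect 2-D input for scalers, so the prediction may need a reshape first):
import numpy as np

# y_new is in the scaled space; map it back to the original price units.
y_new_2d = np.asarray(y_new).reshape(-1, 1)   # scalers work on arrays of shape (n_samples, n_features)
y_new_inverse = scalery.inverse_transform(y_new_2d)
print(y_new_inverse)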
Bit late to the game:
Just don't scale your y. By scaling y you actually lose your units. The regression or loss optimization is really determined by the relative differences between the features. By the way, for house prices (or any other monetary value) it is common practice to take the logarithm; then you obviously need a numpy.exp() to get back to actual dollars/euros/yen...
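A minimal sketch of that log-target workflow, assuming raw (unscaled) prices in y_train and the scaled feature matrices X_train/X_test from above:
import numpy as np
from sklearn.linear_model import SGDRegressor

# Fit on log-prices; the features can still be standardized as before.
reg = SGDRegressor(random_state=42)
reg.fit(X_train, np.log(y_train))

# Predictions come back in log-space, so exponentiate to recover actual prices.
y_pred = np.exp(reg.predict(X_test))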