I am new to machine learning and Python. I have created a simple linear regression model in Python. I can test the accuracy of my model, but only on the data in my dataset, which is a CSV file containing the relation between salary and years of experience. But I want to use it in practical life: I will input the years of experience and the output will be the predicted salary. Here is what I have done so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
I want to modify the above code so that I can give the years of experience as input and get the expected salary as output.
Thanks in advance.
After training the model, save your model to a file and load it later in order to make predictions. In Python, you can use 'pickle' to achieve this.
References:
scikit-learn Model Persistence
save and load machine learning models, an example
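As a rough sketch (assuming the regressor trained above; the file name salary_model.pkl is just an example):
import pickle
# Save the trained model to a file
with open('salary_model.pkl', 'wb') as f:
    pickle.dump(regressor, f)
# Later, load it back and predict from user input
with open('salary_model.pkl', 'rb') as f:
    model = pickle.load(f)
years = float(input('Years of experience: '))
print(model.predict([[years]]))  # scikit-learn expects a 2D array of samples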
You can use your trained model to make a prediction.
As a previous answer mentioned, you would want to use
regressor.predict([[years_of_xp]])
(note the nested brackets: scikit-learn expects a 2D array of samples). This will ask your model to predict the salary someone will receive, given years_of_xp years of experience.
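For instance (the value 5.0 is only an illustration):
years_of_xp = 5.0
predicted = regressor.predict([[years_of_xp]])
print(predicted[0])  # the predicted salary for 5 years of experience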
For a school project I need to make a forecast of the baseline sales. I've split the dataset into an X and a Y set: X = all my variables except Total_Units (the baseline), Y = Total_Units (the baseline).
I would love to train my model on the weeks where there are no promotions (FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0]) and test it on all the weeks, not only the weeks without promotion; this should give a better result than training and testing on the same set (FINAL_df).
But I have no idea how to build the training set separately from the test set.
(I know this part of my code is wrong: X_train = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0], but I don't know how to correct it.)
(I am new to coding and ML, so any help is very much appreciated!) Thanks in advance!
code:
X = FINAL_df.drop(["Total_Units"],axis="columns")
Y = FINAL_df.Total_Units
from sklearn.model_selection import train_test_split
X_train = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0]  # <- the line I know is wrong (it gets overwritten below)
from sklearn.linear_model import LinearRegression
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=101)
lin_model = LinearRegression()
lin_model.fit(X_train,Y_train)
--> here I get an error: Unable to allocate 3.48 GiB for an array with shape (453971, 1028) and data type float64
from sklearn.model_selection import train_test_split
FINAL_df_prom_pres0 = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0]
Y = FINAL_df_prom_pres0.Total_Units
X = FINAL_df_prom_pres0.drop(["Total_Units"],axis="columns")
from sklearn.linear_model import LinearRegression
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=101)
lin_model = LinearRegression()
lin_model.fit(X_train,Y_train)
I have changed the answer; can you please try it like this.
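If the goal is to train only on the no-promotion weeks but still evaluate on held-out weeks of every kind (as the question asks), a hedged sketch is to split first and then filter only the training rows; it reuses the column names from the question and an arbitrary 80/20 split:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = FINAL_df.drop(["Total_Units"], axis="columns")
Y = FINAL_df.Total_Units
# Hold out 20% of ALL weeks (promo and non-promo) for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=101)
# Keep only the no-promotion weeks in the training portion
no_promo = X_train['PROMOTIONAL_PRESENCE'] == 0
X_train, Y_train = X_train[no_promo], Y_train[no_promo]
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
print(lin_model.score(X_test, Y_test))  # R^2 measured on weeks of every kind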
I am using SMOTE to balance the output (y) only for model training, but I want to test the model on the original data, since it makes little sense to test the model on SMOTE-generated outputs. Please ask anything for clarification if I didn't explain it well; this is my start on Stack Overflow.
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)
# Splitting Dataset into Train and Test (Smote)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm,test_size=0.2,random_state=42)
Here I applied the Random Forest classifier to my data:
import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y,y_pred))
RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())
If I do this, X also contains the data we used for training. How can we remove the data that was already used for training the model?
y_pred = RF.predict(X)
print(metrics.classification_report(y,y_pred))
I used SMOTE in the past; it is suboptimal. Lately, researchers have demonstrated flaws in the distribution generated by the Synthetic Minority Oversampling Technique (SMOTE). I know sometimes we don't have a choice with unbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, where you can define a proper class_weight to handle the unbalanced-class problem.
Check the scikit-learn documentation: Scikit-documentation
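A minimal sketch of that option (assuming the same X_train and y_train as above):
from sklearn.ensemble import RandomForestClassifier
# 'balanced' reweights samples inversely to class frequency,
# so the minority class counts more during training
RF = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
RF.fit(X_train, y_train.values.ravel())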
I agree with razimbres about using class_weight.
Another option for you would be to split the dataset into train and test first. Then keep the test set aside, and use only the training set from here on:
X_sm, y_sm = oversample.fit_resample(X_train, y_train)
...
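Put together, a sketch of that order of operations (split first, oversample only the training fold, evaluate on the untouched test set) could look like this:
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# 1. Split the ORIGINAL data; the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# 2. Oversample only the training fold
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)
# 3. Train on the resampled data, evaluate on the real test data
RF = RandomForestClassifier(n_estimators=100, random_state=42)
RF.fit(X_sm, np.ravel(y_sm))  # ravel in case y is a one-column DataFrame
print(classification_report(y_test, RF.predict(X_test)))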
Logistic Regression with inputs from the "Machine Learning.csv" file:
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have a machine learning / logistic regression code (Python) as above. It has trained my model properly and gives a really good match with the test data. But unfortunately it only gives me 0/1 (binary) results when I test with some other random values (the training set has only 0/1, as in failed/succeeded).
How can I get a probability result instead of a binary result from this algorithm? I have tried very different sets of numbers and would like to find out the probability of failing, instead of a 0 or a 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, refer to Logistic Regression Probability
predict_proba(X_test) will give you the probability of each sample for each class, i.e. if X_test contains n_samples and you have 2 classes, the output of the above function will be an n_samples x 2 matrix, and the two predicted class probabilities in each row will sum to 1. For more details, have a look at the documentation here.
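A small illustration (the column order follows classifier.classes_):
proba = classifier.predict_proba(X_test)  # shape: (n_samples, 2)
print(classifier.classes_)                # column order, e.g. [0 1]
print(proba[:3])                          # each row sums to 1
p_fail = proba[:, 0]                      # probability of the first class (here 0 = failed)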
I've just split my data into a training and a testing set, and my plan is to train a Linear Regression model and check its performance using my testing split.
My current code is:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0], 1)  # note: this column is never used below
split = np.random.rand(len(df)) <= 0.75
training_set = df[split]
testing_set = df[~split]
Is there a proper method I should be using to plot a Linear Regression model from an external file such as a .csv?
Since you want to use scikit-learn, here's an approach using sklearn.linear_model.LinearRegression:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, y_train = training_set[x_vars], training_set[y_var]  # x_vars: list of feature columns, y_var: target column
X_test, y_test = testing_set[x_vars], testing_set[y_var]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Depending on whether you need more descriptive output, you might also look into using statsmodels for linear regression.
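A hedged sketch of the statsmodels route (same training frames as above; sm.add_constant adds the intercept that scikit-learn would otherwise handle for you):
import statsmodels.api as sm
X_train_const = sm.add_constant(X_train)  # statsmodels does not add an intercept by itself
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary())                # coefficients, p-values, R^2, etc.
predictions = ols_model.predict(sm.add_constant(X_test))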
I'm trying to learn scikit-learn and machine learning by using the Boston Housing data set.
# I split the initial dataset ('housing_X' and 'housing_y')
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
X_train, X_test, y_train, y_test = train_test_split(housing_X, housing_y, test_size=0.25, random_state=33)
# I scaled those two datasets
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train.reshape(-1, 1))  # StandardScaler expects 2D input
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train.reshape(-1, 1)).ravel()
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test.reshape(-1, 1)).ravel()
# I created the model
from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_error', penalty=None, random_state=42)  # 'squared_loss' was renamed 'squared_error' in newer scikit-learn
train_and_evaluate(clf_sgd, X_train, y_train)  # train_and_evaluate is a helper defined elsewhere in my code
Based on this new model clf_sgd, I am trying to predict the y based on the first instance of X_train.
X_new_scaled = X_train[0]
print (X_new_scaled)
y_new = clf_sgd.predict(X_new_scaled.reshape(1, -1))  # predict expects a 2D array of samples
print (y_new)
However, the result is quite odd to me (1.34032174, instead of 20-30, the usual range of the house prices):
[-0.32076092 0.35553428 -1.00966618 -0.28784917 0.87716097 1.28834383
0.4759489 -0.83034371 -0.47659648 -0.81061061 -2.49222645 0.35062335
-0.39859013]
[ 1.34032174]
I guess that this 1.34032174 value should be scaled back, but I am trying to figure out how to do it with no success. Any tip is welcome. Thank you very much.
You can use inverse_transform with your scalery object:
y_new_inverse = scalery.inverse_transform(y_new.reshape(-1, 1))  # newer scikit-learn expects 2D input here
A bit late to the game:
Just don't scale your y. By scaling y you actually lose your units. The regression or loss optimization is actually determined by the relative differences between the features. BTW, for house prices (or any other monetary value) it is common practice to take the logarithm; then you obviously need to apply numpy.exp() to get back to the actual dollars/euros/yen...
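A minimal sketch of that practice (assuming the unscaled, positive-valued target is available as y_train_raw; the _raw names are just placeholders):
import numpy as np
from sklearn import linear_model
clf = linear_model.SGDRegressor(loss='squared_error', random_state=42)
clf.fit(X_train, np.log(y_train_raw))      # fit on log-prices; the features may still be scaled
pred_prices = np.exp(clf.predict(X_test))  # back to actual prices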