copying column data into another dataframe using iloc [closed] - python

After splitting the data into training and test sets:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
print(x_train.shape,y_train.shape)
(354, 13) (354,)
I then need to join the y_train column back onto x_train as a new column named Price:
x_train['Price'] = y_train
but this does not work. I also tried iloc as follows, but it gives a warning:
x_train['price'] = y_train.iloc[0:354]
Please help me with this.

You get that warning because x_train is a view of X. Using an example:
df = pd.DataFrame(np.random.uniform(0,1,(100,4)),
columns=['x1','x2','x3','y'])
X = df[['x1','x2','x3']]
Y = df[['y']]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
You can see:
x_train._is_view
True
If I try to run your code, I get the same warning.
See this post about views of a data frame, and this one on dealing with the warning. If working on a copy is acceptable for your use case, make one explicitly:
x_train = x_train.copy()
x_train['Price'] = y_train
Or use insert:
x_train.insert(x_train.shape[1],"Price",y_train)
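To check that the copy-then-assign approach attaches the right target to each row, here is a minimal sketch on synthetic data (the column names and the random data are assumptions, not the asker's dataset). It relies on pandas aligning the assigned Series by index label, so the shuffling done by train_test_split does not mix up rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data: three features and a target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, (100, 4)), columns=["x1", "x2", "x3", "y"])
X = df[["x1", "x2", "x3"]]
Y = df["y"]

x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Work on an explicit copy so the new column is not written into a slice of X.
x_train = x_train.copy()
# Assignment aligns on the index labels, so each shuffled row gets its own target.
x_train["Price"] = y_train

# Compare against the original target looked up by the same index labels.
prices_match = (x_train["Price"] == df.loc[x_train.index, "y"]).all()
print(prices_match)
```

Because alignment is by label, slicing y_train with iloc first is unnecessary.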

Related

I was trying to evaluate the efficiency of different ML algorithms but I am getting this error [closed]

This is my code.
names = []
res = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=None)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    res.append(cv_results.mean())
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
pyplot.ylim(.990, .999)
pyplot.bar(names, res, color='maroon', width=0.6)
pyplot.title('Algorithm Comparison')
pyplot.show()
This error occurred when I executed the code.

How to scale datasets correctly [closed]

Which of these approaches is correct, or is there another way to scale data? (I've used StandardScaler as an example.)
I've tried each one and computed the accuracy of every model. There is no meaningful difference, but I want to know which approach is correct.
dataset= pd.read_csv("wine.csv")
x = dataset.iloc[:,:13]
y = dataset.iloc[:,13]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.8,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)
or
dataset= pd.read_csv("wine.csv")
x = dataset.iloc[:,:13]
y = dataset.iloc[:,13]
sc=StandardScaler()
x = sc.fit_transform(x)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.8,random_state=0)
or
dataset= pd.read_csv("wine.csv")
x = dataset.iloc[:,:13]
y = dataset.iloc[:,13]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.8,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
Test data should not be seen or used while training a model, because it is used to assess the model's performance.
Therefore the last option is the correct one. The scaling parameters should be computed solely on the training set, as follows:
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
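A minimal sketch of the same idea on synthetic data (the random array stands in for wine.csv, which is not available here). It checks that the statistics stored in the scaler come from the training rows only, and that the test rows are transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels; shapes and values are illustrative only.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)  # learns mean/std from training rows only
x_test_scaled = sc.transform(x_test)        # reuses the training statistics

# The stored mean is the training mean, not the mean of the full dataset.
train_mean_matches = np.allclose(sc.mean_, x_train.mean(axis=0))
# Training columns end up centred at zero; test columns are only approximately so.
train_centred = np.allclose(x_train_scaled.mean(axis=0), 0.0)
print(train_mean_matches, train_centred)
```

Calling fit_transform on x_test instead would silently replace the training statistics with test statistics, which is why the first option in the question is not equivalent.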

How to export dataset from sklearn after model applied? [closed]

I'm taking a machine learning course with sklearn/Python. I understand the preprocessing, selection and running of the model, etc., but now that I've run the data through I'm unsure how to:
Export this data, OR
How to find predictions for specific rows (IDs).
Here's my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('test_dataset.csv')
dataset.set_index('ID', inplace=True) # replace ID with identifier field
X = dataset.iloc[:, 0:-1].values #input variables
y = dataset.iloc[:, -1].values #output variable (to predict)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)
y_pred = regressor.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
First create a DataFrame using pd.DataFrame, then write it to a CSV file with the DataFrame.to_csv method.
See the pandas documentation for both.
Option: 1
In this example, if you want to predict for a specific record, you can do something like this (ID was set as the index above, so the row is selected with .loc rather than as a column):
some_data_for_predict = dataset.loc[[1]].iloc[:, 0:-1].values
y_pred = regressor.predict(poly_reg.transform(some_data_for_predict))
print(f"actual: \n{dataset.loc[[1]]} \ny_pred: \n{y_pred}")
Option: 2
If the data is preprocessed (e.g. handling missing data, encoding, feature scaling), you end up working with transformed values. To see the original values behind the transformed ones, you can use inverse_transform.
Something like:
X_normalized = scaler.fit_transform(X)
X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)
some_data = X_test_norm[:5]
regressor.predict(some_data)
scaler.inverse_transform(some_data) # this will give the actual data.
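For the export part of the question, here is one possible sketch. It assumes a small synthetic dataset with an ID index and uses a plain LinearRegression instead of the polynomial model, purely to keep the example short. The key point is to split DataFrames rather than .values arrays, so each prediction stays attached to its ID.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: an ID index, two input columns and one output column.
rng = np.random.default_rng(0)
dataset = pd.DataFrame(
    {"a": rng.normal(size=50), "b": rng.normal(size=50)},
    index=pd.RangeIndex(1, 51, name="ID"),
)
dataset["target"] = 2.0 * dataset["a"] - dataset["b"]

# Keep DataFrames (no .values) so the ID index survives the split.
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)

# Put predictions next to the actual values, keyed by the original ID.
results = pd.DataFrame(
    {"actual": y_test, "predicted": regressor.predict(X_test)},
    index=X_test.index,
)

# to_csv accepts a path or a file-like object; a buffer avoids touching disk here.
buffer = io.StringIO()
results.to_csv(buffer)
csv_text = buffer.getvalue()

# Look up the prediction for one specific ID from the test set.
some_id = X_test.index[0]
prediction_for_id = results.loc[some_id, "predicted"]
print(some_id, prediction_for_id)
```

In practice you would pass a file name such as "predictions.csv" (a made-up name) to to_csv instead of the buffer.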

Getting not aligned value error in statsmodel [closed]

I am using statsmodels for robust regression and getting the following error:
not aligned: 9 (dim 1) != 258095 (dim 0).
I think this is because of the shapes of X_train and y_train, which are NumPy arrays. Here is my code:
# Input
print(X1)
print(y1)
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.15)
print(X_train.shape)
print(y_train.shape)
rlm_model = sm.RLM(X_train,y_train, M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()
# output
X1=[[ 36.299999 4.8 321. ... 1341. 22.
0. ]
y1=[1.700012 1.600006 1.399994 ... 1.899994 0.899994 1.199997]
x_train shape=(258095, 9)
y_train shape=(258095,)
#value error on rlm_model.fit()
You have your variables in the wrong order. It should be
rlm_model = sm.RLM(y_train, X_train, M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()
See the help for RLM. endog is what most people call y and exog is what most would call x.

Split dataset into training and test by month

I was not able to find the answer to this anywhere. I have three months of data and would like to split it so that the first two months ('Jan-19', 'Feb-19') form the training set and the last month ('Mar-19') forms the test set.
Previously I have done random sampling with simple code like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=109)
Before that, I assigned y as the label and X as the columns to use for prediction. I'm not sure how to assign the training and test sets to the months I want.
Thank you
If your data is in a pandas dataframe, you can use subsetting like this:
X_train = X[X['month'] != 'Mar-19']
y_train = y[X['month'] != 'Mar-19']
X_test = X[X['month'] == 'Mar-19']
y_test = y[X['month'] == 'Mar-19']
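The subsetting above can be tried on a toy frame like the one below (the feature and label values are invented). Building the boolean mask once and reusing it keeps X and y in step. Dropping the month column afterwards is an assumption about the use case, since the month label is often not meant to be a predictor.

```python
import pandas as pd

# Toy data with the month labels from the question; other values are made up.
X = pd.DataFrame(
    {
        "month": ["Jan-19", "Jan-19", "Feb-19", "Feb-19", "Mar-19", "Mar-19"],
        "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
y = pd.Series([0, 1, 0, 1, 0, 1], name="label")

# One mask, applied to both X and y so the rows stay matched.
is_test = X["month"] == "Mar-19"

X_train, y_train = X[~is_test], y[~is_test]
X_test, y_test = X[is_test], y[is_test]

# Assumption: the month label itself is not a feature, so drop it before fitting.
X_train = X_train.drop(columns="month")
X_test = X_test.drop(columns="month")

print(len(X_train), len(X_test))
```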
Alternatively, assuming the DataFrame has a sorted DatetimeIndex, you can slice it by timestamp:
dataset_train = df['2004-02-12 11:02:39':'2004-02-13 23:52:39']
dataset_test = df['2004-02-13 23:52:39':]
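One detail to watch with timestamp slicing: label-based slices on a DatetimeIndex include both endpoints, so if a row exists exactly at the boundary timestamp it lands in both the training and the test slice. A sketch that avoids the overlap by comparing against a cutoff is shown below; the daily index and the cutoff date are assumptions chosen to mirror the Jan/Feb versus Mar split.

```python
import pandas as pd

# Hypothetical daily data covering January to March 2019.
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Everything strictly before the cutoff is training data, the rest is test data.
cutoff = pd.Timestamp("2019-03-01")

dataset_train = df[df.index < cutoff]
dataset_test = df[df.index >= cutoff]

print(len(dataset_train), len(dataset_test))
```

Using strict and non-strict comparisons on the same cutoff guarantees that every row falls into exactly one of the two sets.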
