How to scale datasets correctly [closed] - python

Which one is more correct, or is there another way to scale data? (I've used StandardScaler as an example.)
I've tried each way and computed the accuracy of every model, and there is no meaningful difference, but I want to know which way is more correct.
dataset= pd.read_csv("wine.csv")
x = dataset.iloc[:,:13]
y = dataset.iloc[:,13]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.8,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)
or
dataset= pd.read_csv("wine.csv")
x = dataset.iloc[:,:13]
y = dataset.iloc[:,13]
sc=StandardScaler()
x = sc.fit_transform(x)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.8,random_state=0)
or
dataset= pd.read_csv("wine.csv")
x = dataset.iloc[:,:13]
y = dataset.iloc[:,13]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.8,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)

Test data should not be seen or used during the training of a model, as it is used to assess the model's performance.
Therefore the last option is the correct one. The scaling parameters should be computed solely on the training set, as follows:
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
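A convenient way to enforce this is scikit-learn's Pipeline, which fits the scaler on the training data only and reuses those parameters at prediction time, so the test set is never fit by mistake. A minimal sketch applied to the unscaled split (LogisticRegression is just a placeholder for whatever model you use):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline fits the scaler on whatever .fit() receives (the training set)
# and applies the already-fitted transform inside .predict() / .score().
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)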

Related

inserting values without having to format them specifically like the training model [closed]

My plan was to feed just two values (the Gold and S&P prices) into a machine learning model (scikit-learn's linear regression), then download the opening values of the S&P and Gold each day and predict the closing price BTC would have given these values. The model works well and predicts the values correctly about 99% of the time.
Here's the problem:
I want to feed single values into the prediction algorithm, but I can't work out how to input values without first converting them to np.arrays and then going through the classic sklearn split:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)
Does anyone know how to convert the values to the right format without having to use this line of code?
I simply want to be able to use:
Prediction_of_today = linear.predict(!some format of downloaded opening prices of Gold and S&P!)
print(Prediction_of_today)
import yfinance as yf
import pandas as pd
import numpy as np
import sklearn
import time
from sklearn import model_selection
from sklearn import linear_model
import matplotlib.pyplot as pyplot
from matplotlib import style
import pickle
# Download via the yfinance API and rename the relevant columns.
import yfinance as yf
import pandas as pd
import numpy as np
import sklearn
from sklearn import model_selection
from sklearn import linear_model

BTC = yf.download(tickers='BTC-USD', period="16wk", interval='1d')
BTC.rename(columns={'Adj Close': 'BTC closing'}, inplace=True)  # yfinance calls the column 'Adj Close'
SaP = yf.download('SPY', period="16wk", interval="1d")
SaP.rename(columns={'Open': 'S&P opening'}, inplace=True)
Gold = yf.download('GC=F', period="16wk", interval="1d")
Gold.rename(columns={'Open': 'Gold opening'}, inplace=True)

# Align all three series on the S&P trading calendar.
btc_closing = pd.Series(BTC['BTC closing'], index=SaP.index)
gold_opening = pd.Series(Gold['Gold opening'], index=SaP.index)
sap_opening = SaP['S&P opening']

add_input_data = pd.concat([gold_opening, sap_opening], axis=1)
full_data = pd.concat([add_input_data, btc_closing], axis=1)
full_data = full_data.dropna()  # drop rows left empty by calendar mismatches

corr_SB = sap_opening.corr(btc_closing)
corr_GB = gold_opening.corr(btc_closing)
print(f'The correlation of the SaP to BTC is {corr_SB}, the correlation of Gold to BTC amounts to {corr_GB}')

X = np.array(full_data.drop(columns=['BTC closing']))
y = np.array(full_data['BTC closing'])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
If you want to predict on a single value, note that predict() takes a 2D input: a list of samples, where each sample is itself a list of feature values, e.g. Prediction_of_today = linear.predict([[v1, v2], [v1, v2], ...]) for several samples. For one single sample, the input therefore looks like Prediction_of_today = linear.predict([[v1, v2]]).
Hope it is helpful for you.
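For completeness, a minimal sketch of a single-day prediction with the model fitted above; gold_open and sap_open are hypothetical placeholders for today's downloaded values:
import numpy as np

gold_open = 1950.0  # hypothetical Gold opening price
sap_open = 440.0    # hypothetical S&P opening price

# predict() expects a 2D array: one row per sample, one column per feature.
Prediction_of_today = linear.predict(np.array([[gold_open, sap_open]]))
print(Prediction_of_today)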
I'm sorry, I misunderstood the concept.
For clarification: I could simply format the values as an np.array and use that to predict the needed value.
This is what I was looking for:
test_data = pd.concat([gold_closing_test, sap_closing_test], axis=1)
test_data = np.array(test_data)
prediction_results = linear.predict(test_data)

I want to know the resulting label for all of the input data [closed]

Training was carried out using the random forest algorithm. I want to append the prediction results to the existing input data; how do I do that? scikit-learn provides model evaluation metrics such as accuracy, precision, recall, and F1 score, but I am not sure whether there is a function that returns the predicted label for each input, the way Keras does. I don't know where to start with the code, so I'll just ask the question.
Usually, you have something like this when using sklearn:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

input_data = pd.read_csv("path/to/data")
features = ["area", "location", "rooms"]
y = input_data["Price"]
X = input_data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestRegressor()  # "Price" is continuous, so a regressor fits here
model.fit(train_X, train_y)
Now your model is trained. As you mentioned you could get different metrics from the model using sklearn on your validation set.
Getting output label from the model means getting predictions(inference):
output_label = model.predict(val_X)
# output_label is an ndarray with the same length as val_y
x = val_X.copy()  # val_X is still a DataFrame after train_test_split
x["output_label"] = output_label
Or you could use numpy.concatenate to append the labels directly to your input array, as in the sketch below.
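A short sketch of the numpy route, reusing output_label from above:
import numpy as np

# Column-stack the validation features and the predicted labels into one array.
combined = np.concatenate(
    [np.asarray(val_X), output_label.reshape(-1, 1)], axis=1
)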

How to export dataset from sklearn after model applied? [closed]

I'm taking a machine learning course with sklearn/Python. I understand the preprocessing, model selection, and running of the model, but now that I've run the data through I'm unsure how to:
Export this data, OR
Find predictions for specific rows (IDs).
Here's my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('test_dataset.csv')
dataset.set_index('ID', inplace=True) # replace ID with identifier field
X = dataset.iloc[:, 0:-1].values #input variables
y = dataset.iloc[:, -1].values #output variable (to predict)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)
y_pred = regressor.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
First you need to create a DataFrame using pd.DataFrame, and then you can write it to a CSV file with the DataFrame.to_csv method. The pandas documentation covers both.
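A minimal sketch, assuming you also split the ID index so each prediction can be traced back to its row (train_test_split accepts any number of aligned arrays):
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, y, dataset.index, test_size=0.2, random_state=0
)
# ... fit the regressor and compute y_pred on X_test as above ...
results = pd.DataFrame({"ID": id_test, "actual": y_test, "predicted": y_pred})
results.to_csv("predictions.csv", index=False)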
Option: 1
In this example, if you want to predict on a specific record you can do something like this (ID is the index after set_index, so select it with .loc):
some_data_for_predict = dataset.loc[[1]].iloc[:, 0:-1].values
y_pred = regressor.predict(poly_reg.transform(some_data_for_predict))
print(f"actual: \n{dataset.loc[[1]]} \ny_pred: \n{y_pred}")
Option: 2
If your workflow involves preprocessing of the data (e.g. handling missing data, applying proper encoding, feature scaling), you will end up with transformed data, and if you want to recover the original values from the transformed ones you can use inverse_transform.
Something like:
X_normalized = scaler.fit_transform(X)
X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)
some_data = X_test_norm[:5]
regressor.predict(some_data)         # predictions on the scaled features
scaler.inverse_transform(some_data)  # recovers the original (unscaled) feature values

Multiple non-linear regression in Python [closed]

I am looking for any libraries or method which can help me to find a regression equation. The equation is in this format:
Y = a1*x^a + a2*y^b + a3*z^c + D
where:
Y is the dependent variable
x, y, z are independent variables
D is constant
a1, a2, a3 are the coefficients
a, b, c are the exponents of the independent variables respectively.
I have values of Y and x, y, z stored in a data frame.
You can use the RandomForestRegressor implementation from scikit-learn. It's quite easy to use; you simply do:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
# train the model
clf.fit(df[['x','y','z']], df['Y'])
# predict on test data
predict = clf.predict(test_data[['x','y','z']])
Make sure the train and test data have the same number of independent variables.
For more non-linear regressors, check the scikit-learn ensemble module.
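Note that a random forest will not give you the coefficients and exponents of the stated equation. If you need those explicitly, scipy.optimize.curve_fit can fit the functional form directly; a sketch, assuming a DataFrame df with columns x, y, z, Y and positive x, y, z (non-integer powers of negative bases are not real-valued):
import numpy as np
from scipy.optimize import curve_fit

# The model Y = a1*x^a + a2*y^b + a3*z^c + D as a function of the stacked
# independent variables and the seven unknown parameters.
def model(X, a1, a2, a3, a, b, c, D):
    x, y, z = X
    return a1 * x**a + a2 * y**b + a3 * z**c + D

xdata = (df['x'].values, df['y'].values, df['z'].values)
params, _ = curve_fit(model, xdata, df['Y'].values, p0=np.ones(7), maxfev=10000)
a1, a2, a3, a, b, c, D = params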

Speed Improvements to Leave One Group Out in Large Datasets [closed]

I am performing classification by LogisticRegression over a large dataset (1.5 million observations) using LeaveOneGroupOut cross-validation. I am using scikit-learn for implementation. My code takes around 2 days to run and I would appreciate your inputs on how to make it faster. A snippet of my code is shown below:
grp = data['id_x'].values
logo = LeaveOneGroupOut()
LogReg = LogisticRegression()
params_grid = {
    'C': [0.78287388, 1.19946909, 1.0565957, 0.69874106, 0.88427995,
          1.33028731, 0.51466415, 0.91421747, 1.25318725, 0.82665192, 1, 10],
    'penalty': ['l1', 'l2'],
}
random_search = RandomizedSearchCV(LogReg, param_distributions=params_grid,
                                   n_iter=3, cv=logo, scoring='accuracy')
random_search.fit(X, y, groups=grp)
print(random_search.best_params_)
print(random_search.best_score_)
I am going to make the following assumptions: you are using scikit-learn, and you need your code to run faster.
To get your final results sooner, you can train multiple models at once by running them in parallel. To do so, set the n_jobs parameter in scikit-learn. A good value is the number of CPU cores, or one less than that if you are running anything else on your machine while training; n_jobs=-1 uses all available cores.
Examples:
RandomizedSearchCV in parallel:
random_search = RandomizedSearchCV(LogReg, n_jobs=3, param_distributions=params_grid, n_iter=3, cv=logo, scoring='accuracy')
LogisticRegression in parallel:
LogisticRegression(n_jobs=3)
I recommend parallelizing only RandomizedSearchCV; enabling n_jobs at both levels can oversubscribe your CPU cores.
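A sketch of the full call using every available core (n_jobs=-1 is shorthand for all cores), keeping the rest of the setup from the question:
random_search = RandomizedSearchCV(
    LogReg,
    param_distributions=params_grid,
    n_iter=3,
    cv=logo,
    scoring='accuracy',
    n_jobs=-1,  # use all available CPU cores
)
random_search.fit(X, y, groups=grp)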
It might also be helpful to look at the original scikit-learn documentation:
sklearn.linear_model.LogisticRegression
sklearn.model_selection.RandomizedSearchCV
