How to extract data from a DataFrame in Python (Index)

I am trying to extract the feature and build the training arrays, but I am not getting the expected results.
df_close = df['Close']
df_train = df_close[:'2019-12-31']
df_train.shape
training_set = df_close
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
training_set_scaled[1]
import numpy as np
X_train = []
y_train = []
for i in range(100, training_set.shape[1]):
    X_train.append(training_set_scaled[i-100:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
X_train
and the result is:
array([], dtype=float64)

If the value of training_set.shape[1] is smaller than 100, the body of the for loop never runs, leaving X_train empty. Note that shape[1] is the number of columns, not the number of rows, and training_set here holds only the Close column, so range(100, training_set.shape[1]) is empty; you most likely want to iterate over the rows with range(100, training_set_scaled.shape[0]).
You could test this by adding a print statement inside the for loop. Let me know if it worked, good luck!
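A minimal sketch of the corrected loop, assuming training_set_scaled is the scaled 2-D array of shape (n_samples, 1) built above:

import numpy as np

window = 100
X_train, y_train = [], []
for i in range(window, training_set_scaled.shape[0]):   # iterate over rows
    X_train.append(training_set_scaled[i-window:i, 0])   # previous 100 scaled closes
    y_train.append(training_set_scaled[i, 0])            # value to predict
X_train, y_train = np.array(X_train), np.array(y_train)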


'numpy.ndarray' object has no attribute 'columns'

I was following a machine learning tutorial on YouTube, using this dataset. The person in the video had no problem running the code, but I received an error saying the numpy.ndarray object has no attribute 'columns'.
Below is the code I ran:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

cols = ['integrated_mean','integrated_standard_deviation','integrated_excess_kurtosis','integrated_skewness','DM_mean','DM_standard_deviation','DM_excess_kurtosis','DM_skewness','class']
df = pd.read_csv("HTRU_2.data", names = cols)
train, valid, test = np.split(df.sample(frac = 1), [int(0.6*len(df)), int(0.8*len(df))])

def scale_dataset(dataframe, oversample = False):
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    if oversample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

train, X_train, y_train = scale_dataset(train, oversample = True)
valid, X_train, y_train = scale_dataset(train, oversample = False)
test, X_train, y_train = scale_dataset(train, oversample = False)
I do not know what is happening or how to fix it; I've tried searching elsewhere but had no luck. If anyone can help, it would be much appreciated.
I couldn't find the exact minute in the tutorial, but maybe it's just a consequence of copy-paste.
In the function scale_dataset you build data as a NumPy array and then assign that value to the train variable. When you call scale_dataset again for the validation set, you pass this train variable expecting a pandas DataFrame, but by that point it is a NumPy array.
My guess is that you meant to pass the valid and test sets instead of train, like this:
train, X_train, y_train = scale_dataset(train, oversample = True)
valid, X_train, y_train = scale_dataset(valid, oversample = False)
test, X_train, y_train = scale_dataset(test, oversample = False)
Instead of
X = dataframe[dataframe.columns[:-1]].values
y = dataframe[dataframe.columns[-1]].values
I did
X = dataframe[:, :-1]
y = dataframe[:, -1]
and now all the code works fine.
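For reference, a minimal sketch (an illustration layered on the code above, not from the tutorial) of scale_dataset written against plain NumPy arrays, so it no longer matters whether it receives a DataFrame or the array returned by an earlier call, together with the corrected calls and distinct variable names for each split:

import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

def scale_dataset(dataframe, oversample = False):
    values = np.asarray(dataframe)          # works for a DataFrame or an ndarray
    X, y = values[:, :-1], values[:, -1]
    X = StandardScaler().fit_transform(X)
    if oversample:
        X, y = RandomOverSampler().fit_resample(X, y)
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

train, X_train, y_train = scale_dataset(train, oversample = True)
valid, X_valid, y_valid = scale_dataset(valid, oversample = False)
test, X_test, y_test = scale_dataset(test, oversample = False)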

Training loop for XGBoost on different datasets

I have built several different datasets and I want to write a for loop that trains on each of them and, at the end, reports the RMSE for each dataset. I tried a for loop, but it does not work: it returns the same value for every dataset, while I know the values should differ. The code I have written is below:
for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100, :]
    # Append an empty sublist inside the list
    FINAL_DF.append(DF)
    y = DF.iloc[:, 3]
    X = DF.drop(columns='Target')
    index_train = int(0.7 * len(X))
    X_train = X[:index_train]
    y_train = y[:index_train]
    X_test = X[index_train:]
    y_test = y[index_train:]
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test = scaler_x.transform(X_test)

xgb_r = xg.XGBRegressor(objective ='reg:linear',
                        n_estimators = 20, seed = 123)

for i in range(len(NEW_middle_index)):
    # print(i)
    # Fitting the model
    xgb_r.fit(X_train, y_train)
    # Predict the model
    pred = xgb_r.predict(X_test)
    # RMSE Computation
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    # print(rmse)
    RMSE.append(rmse)
Not sure if you indented it correctly, but you are overwriting X_train and X_test in the first loop, so when you fit your model it is always on the same (last) dataset; hence you get the same results.
One option is to fit the model as soon as you create each train/test split. Otherwise, if you want to keep every train/test set, you could store them in a list of dictionaries without changing too much of your code, something like below:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg

df1 = pd.DataFrame(np.random.normal(0,1,(600,3)))
df1['Target'] = np.random.uniform(0,1,600)
NEW_middle_index = [100,300,500]

NEWDF = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    scaler_x = MinMaxScaler().fit(X)
    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]
    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    NEWDF.append({'X_train':X_train, 'y_train':y_train, 'X_test':X_test, 'y_test':y_test})
Then we fit and calculate RMSE:
RMSE = []
xgb_r = xg.XGBRegressor(objective ='reg:linear', n_estimators = 20, seed = 123)
for i in range(len(NEW_middle_index)):
    xgb_r.fit(NEWDF[i]['X_train'], NEWDF[i]['y_train'])
    pred = xgb_r.predict(NEWDF[i]['X_test'])
    rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'], pred))
    RMSE.append(rmse)

RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]
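For completeness, a minimal sketch of the first option mentioned above: fit and score inside the same loop that builds each split, so nothing gets overwritten. The objective string is the current name of the deprecated 'reg:linear':

RMSE = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    X_train, X_test = X.iloc[:index_train], X.iloc[index_train:]
    y_train, y_test = y.iloc[:index_train], y.iloc[index_train:]
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test = scaler_x.transform(X_test)
    # 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
    xgb_r = xg.XGBRegressor(objective='reg:squarederror', n_estimators=20, random_state=123)
    xgb_r.fit(X_train, y_train)
    pred = xgb_r.predict(X_test)
    RMSE.append(np.sqrt(mean_squared_error(y_test, pred)))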

How to iterate over rows in a dataset for distance calculation

I have the Iris dataset and I want to calculate the distance between all pairs of rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., for KNN.
My code:
import numpy as np
from sklearn import datasets
import pandas as pd

#1 Handle the data
iris = datasets.load_iris()
x = iris.data[:, :4]
y = iris.target.reshape((150,1))

def shuffle(x, y, percentage):
    iris_data = np.concatenate((x,y), axis=1)
    shuffling = iris_data[np.random.permutation(len(iris_data))]
    train, test = np.split(shuffling, [int(percentage*len(iris_data))])
    x_train = train[:, :4]
    y_train = train[:, -1]
    x_test = test[:, :4]
    y_test = test[:, -1]
    return [iris_data, x_train, y_train, x_test, y_test]

shuf = shuffle(x, y, 0.7)
x_train = shuf[1]; y_train = shuf[2]
x_test = shuf[3]; y_test = shuf[4]

#2 Distance function
def distance(x, x_test, y, y_test):
    cont = 0
    dist = {}
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            cont += (x[i] - x_test[j])**2
        dist[i] = (np.sqrt(cont), y[i])
    return dist
But I get a dictionary whose values are NumPy arrays of shape (4,) instead of scalars.
I tried to use itertools.combinations but I got some errors.
One more question: how can I store my output in a DataFrame with the distances and the labels instead of a dict (dist = {})?
Thank you.
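Since the question raises both issues, here is a minimal sketch (an illustration, not an accepted answer) that computes one scalar distance per pair by summing the squared differences over the four features, resetting the accumulator for each pair, and collects the results into a DataFrame; it reuses the x_train, x_test and y_train produced by shuffle above:

import numpy as np
import pandas as pd

def pairwise_distances(x_train, x_test, y_train):
    rows = []
    for j in range(x_test.shape[0]):
        for i in range(x_train.shape[0]):
            # scalar Euclidean distance, not a (4,) array
            d = np.sqrt(np.sum((x_train[i] - x_test[j]) ** 2))
            rows.append({'test_index': j, 'train_index': i,
                         'distance': d, 'label': y_train[i]})
    return pd.DataFrame(rows)

dist_df = pairwise_distances(x_train, x_test, y_train)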

Is there any way I can optimize the code for a Logistic Regression model?

X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)
X_val = np.asarray(X_valid)
y_val = np.asarray(y_valid)
import cv2
X_train_full = []
X_test_full = []
X_valid_full = []
for i in X_train:
    res = cv2.resize(i, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
    X_train_full.append(res)
for i in X_test:
    res = cv2.resize(i, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
    X_test_full.append(res)
for i in X_val:
    res = cv2.resize(i, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
    X_valid_full.append(res)
4.
X_train_full = np.asarray(X_train_full)
X_test_full = np.asarray(X_test_full)
X_valid_full = np.asarray(X_valid_full)
5.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train_full)
X_train_full = scaler.transform(X_train_full)
X_test_full = scaler.transform(X_test_full)
6.
from sklearn.linear_model import LogisticRegression
models = list()
accuracy = list()
save = 'svm/'
name = 'svm'
for i in range(len(dataset)):
    name = 'model'+str(i)
    data = dataset[i]
    X_train = data[0][0]
    y_train = data[0][1]
    X_test = data[1][0]
    y_test = data[1][1]
    logisticRegr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
    logisticRegr.fit(X_train, y_train)
    prediction = logisticRegr.predict(X_test)
    accuracy.append(accuracy_score(y_test, prediction))
    print('Accuracy ', str(i), ': ', accuracy_score(y_test, prediction))
7.
logisticRegr = LogisticRegression(solver='lbfgs',multi_class='multinomial')
logisticRegr.fit(X_train,y_train)
predictions = logisticRegr.predict(X_test)
After I run the StandardScaler() part in my Jupyter/Colab notebook, it crashes because memory is over-allocated. Is there a way I can fix the code for the LogisticRegression model?
First, I load the dataset, which consists of 161 folders with 500 images in each folder.
Then I shuffle the data among all of them.
Then I build X_train_full by resizing each image and appending it.
Then I apply StandardScaler to the resulting X_train_full, but it crashes. Is there any solution, given that I have already resized the image dimensions from 192x256 down to 28x28?
Well, you have a lot of data, and loading it all at once while performing operations on it is leading to the memory crash.
In such a scenario, you should use a dataset pipeline such as TensorFlow's tf.data.Dataset. This loads your data in batches rather than all at once and is very memory efficient.
If you are using PyTorch, you can use torch.utils.data.DataLoader, which is also a batched data loader.
For more information visit
Tensorflow
https://www.tensorflow.org/api_docs/python/tf/data/Dataset
Pytorch
https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
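As an illustration only (the folder layout, file pattern and label scheme below are assumptions, not taken from the question), a minimal sketch of streaming the images in batches with tf.data instead of holding every resized array in memory at once:

import cv2
import numpy as np
import tensorflow as tf
from pathlib import Path

IMAGE_DIR = Path("data/images")   # hypothetical root folder: one subfolder per class

def image_generator():
    # Yield one (image, label) pair at a time instead of building a full array
    for label_idx, class_dir in enumerate(sorted(IMAGE_DIR.iterdir())):
        for img_path in class_dir.glob("*.png"):            # assumed file pattern
            img = cv2.imread(str(img_path), cv2.IMREAD_GRAYSCALE)
            img = cv2.resize(img, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
            yield img.astype(np.float32) / 255.0, label_idx

dataset = (
    tf.data.Dataset.from_generator(
        image_generator,
        output_signature=(
            tf.TensorSpec(shape=(28, 28), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    .shuffle(buffer_size=1000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

for images, labels in dataset.take(1):
    print(images.shape, labels.shape)   # e.g. (64, 28, 28) (64,)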

How to predict multiple features using keras with time series?

I have a problem I don't know how to fix: how to change the transform so that I can add new features and make a more accurate forecast. The code below predicts stock prices using the Close value. Data:
Open High Low Close Adj Close Volume
Datetime
2020-03-10 09:30:00+03:00 5033.0 5033.0 4690.0 4840.0 4840.0 702508
2020-03-10 10:30:00+03:00 4840.0 4870.0 4700.0 4746.5 4746.5 1300648
2020-03-10 11:30:00+03:00 4746.5 4783.0 4706.0 4745.5 4745.5 1156482
2020-03-10 12:30:00+03:00 4745.5 4884.0 4730.0 4870.0 4870.0 1213268
2020-03-10 13:30:00+03:00 4874.0 4990.5 4867.5 4886.5 4886.5 1958028
... ... ... ... ... ... ...
2020-04-03 14:30:00+03:00 5177.0 5217.0 5164.0 5211.5 5211.5 385696
2020-04-03 15:30:00+03:00 5212.0 5364.0 5191.0 5269.5 5269.5 1091066
2020-04-03 16:30:00+03:00 5270.0 5297.0 5209.0 5220.5 5220.5 518686
2020-04-03 17:30:00+03:00 5222.0 5271.0 5184.0 5220.5 5220.5 665096
2020-04-03 18:30:00+03:00 5217.5 5223.5 5197.0 5204.5 5204.5 261400
I want to add the Volume and Open features, but I get this error:
predictions = scaler.inverse_transform(predictions)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 436, in inverse_transform
X -= self.min_
ValueError: non-broadcastable output operand with shape (40,1) doesn't match the broadcast shape (40,3)
Q1: How do I change inverse_transform, and what else do I need to change (the input_shape argument, maybe?) to get correct results?
Q2: The result will be a prediction of the Close value. But how do I also predict the Volume value? I guess I need to set model.add(Dense(2)), but can I make two predictions correctly in one script, or do I need to run it separately? How do I do that? How do I get Volume, and then Open, when using model.add(Dense(2))?
Full code:
from math import sqrt
from numpy import concatenate
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding
from keras.layers import LSTM
import numpy as np
from datetime import datetime, timedelta
import yfinance as yf
start = (datetime.now() - timedelta(days=30))
end = (datetime.now() - timedelta(days=0))
df = yf.download(tickers="LKOH.ME", start=start.strftime("%Y-%m-%d"), end=end.strftime("%Y-%m-%d"), interval="60m")
df = df.loc[start.strftime("%Y-%m-%d"):end.strftime("%Y-%m-%d")]
# I need here add another features
# df.filter(['Close', 'Open', 'Volume']) <-- this will make further an error with shapes
data = df.filter(['Close'])
dataset = data.values
#Get the number of rows to train the model on, 40 rows for test
training_data_len = len(dataset) - 40
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(dataset)
train_data = scaled_data[0:int(training_data_len), :]
x_train = []
y_train = []
for i in range(60, len(train_data)):
    x_train.append(train_data[i-60:i, 0])
    y_train.append(train_data[i, 0])
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
model = Sequential()
# should i change to input_shape=(x_train.shape[1], 3) ?
model.add(LSTM(50, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(LSTM(50, return_sequences=False))
model.add(Dense(25))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train, y_train, batch_size=1, epochs=1)
test_data = scaled_data[training_data_len - 60: , :]
x_test = []
y_test = dataset[training_data_len:, :]
for i in range(60, len(test_data)):
    x_test.append(test_data[i-60:i, 0])
x_test = np.array(x_test)
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1 ))
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions) # error here
The problem is that you are fitting MinMaxScaler on dataset, then splitting dataset into x_train and y_train, and later trying to use the inverse_transform method on the predictions, which have the same shape as y_train. I suggest you create x_train and y_train first and fit MinMaxScaler only on x_train. y_train doesn't need to be scaled for the model, which saves you from having to inverse_transform the predictions at all.
So instead of
#Get the number of rows to train the model on, 40 rows for test
training_data_len = len(dataset) - 40
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(dataset)
train_data = scaled_data[0:int(training_data_len), :]
x_train = []
y_train = []
for i in range(60, len(train_data)):
    x_train.append(train_data[i-60:i, 0])
    y_train.append(train_data[i, 0])
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
Use
#Get the number of rows to train the model on, 40 rows for test
training_data_len = len(dataset) - 40
train_data = dataset[0:int(training_data_len), :] # use the unscaled values here; scaling happens below on x_train only
x_train = []
y_train = []
for i in range(60, len(train_data)):
    x_train.append(train_data[i-60:i, 0])
    y_train.append(train_data[i, 0])
x_train, y_train = np.array(x_train), np.array(y_train)
scaler = MinMaxScaler(feature_range=(0,1))
x_train = scaler.fit_transform(x_train) # Only scaling x_train
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
and just delete the line predictions = scaler.inverse_transform(predictions).
Updates relating to additional questions in the comments
The definition of y_test is inconsistent with y_train. Specifically, y_test is defined as y_test = dataset[training_data_len:, :] which is using all of the columns of dataset. Instead, to be consistent with y_train, it should be dataset[training_data_len:, 0].
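For example, keeping the names used above:

# consistent with y_train, keep only the Close column (column 0)
y_test = dataset[training_data_len:, 0]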
Splitting the data is often clearer and less error-prone if done in pandas:
# Starting with the dataframe 'data'
data = df.filter(['Close', 'Open', 'Volume'])
# Create x/y test/train directly from 'data'
training_data_len = len(data) - 40
x_train = data[['Open', 'Volume']][:training_data_len]
y_train = data.Close[:training_data_len]
x_test = data[['Open', 'Volume']][training_data_len:]
y_test = data.Close[training_data_len:]
# Then confirm you have the expected subsets by checking things like
# shape (and info(), describe(), etc.)
x_train.shape, x_test.shape
> ((160, 2), (40, 2))
y_train.shape, y_test.shape
> ((160,), (40,))
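Regarding Q2, a minimal sketch (an illustration, not part of the fix above) of what a two-output network could look like: the input carries three features per timestep and the final Dense(2) emits two values per sample, so y_train must then have shape (samples, 2), e.g. next Close and next Volume in that column order:

from keras.models import Sequential
from keras.layers import LSTM, Dense

n_steps, n_features = 60, 3          # window length and features per timestep (e.g. Close, Open, Volume)
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(50, return_sequences=False))
model.add(Dense(25))
model.add(Dense(2))                  # two outputs, e.g. [next Close, next Volume]
model.compile(optimizer='adam', loss='mean_squared_error')
# model.fit(x_train, y_train, ...) with x_train shaped (samples, n_steps, n_features)
# and y_train shaped (samples, 2)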
