SMOTE on dataframe of arrays issues - python

I'm trying to apply SMOTE to a DataFrame full of sliding windows, shown here:
DataFrame (screenshot)
I'm using imblearn's SMOTE() on it. Without any manipulation, I get an error that each cell must hold a size-1 array. SMOTE-ing row by row, or exploding the DataFrame and SMOTE-ing each window (same index) separately, raises a ValueError because each attempt contains only one class. How do I SMOTE an entire sliding window, without aggregating the windows and without dimensional errors, while keeping the data in the DataFrame shown in the first picture?
new_df_labels = X_with_labels.reset_index().apply(pd.Series.explode)
new_df = X_smoted.reset_index().apply(pd.Series.explode)
np.unique(new_df.index)
X_list = pd.DataFrame(columns=X_smoted.columns)
y_list = []
for j in np.unique(new_df.index):
    new_df1 = new_df[new_df.index == j]
    new_df_labels1 = new_df_labels[new_df_labels.index == j]
    X_smoted_1, y_smoted_1 = smot.fit_resample(new_df1, new_df_labels1['Activity'])
    X_list = X_list.append(X_smoted_1)
    y_list.append(y_smoted_1.ravel())
Exploded DataFrame (screenshot)
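One way around both errors is to flatten each window into a single numeric row before resampling, so SMOTE sees one sample per window (with one label per window) instead of cells that hold arrays. Below is a minimal sketch of the idea; the frame X_windows with array-valued cells and the per-window 'Activity' label column are assumptions for illustration, not the exact frames above.

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

def smote_windows(X_windows: pd.DataFrame, y: pd.Series):
    # Flatten every window so each row becomes one long 1-D feature vector
    X_flat = np.stack([
        np.concatenate([np.asarray(cell).ravel() for cell in row])
        for row in X_windows.itertuples(index=False)
    ])
    # Resample whole windows at once; y must hold one class label per window
    return SMOTE().fit_resample(X_flat, y)

# Usage (assuming df holds the windowed data and 'Activity' the labels):
# X_res, y_res = smote_windows(df.drop(columns='Activity'), df['Activity'])
# The resampled rows can be reshaped back into windows afterwards if needed.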

Related

Question about Yfinance, IndexError, and Numpy arrays

I am trying to run a linear regression on data pulled from yfinance to predict future stock prices, but I am having trouble with the regression after transposing my data's shape.
Here I create a normalization function
def normalize_data(df):
    # df on input should contain only one column with the price data (plus dataframe index)
    min = df.min()
    max = df.max()
    x = df
    # time series normalization part
    # y will be a column in a dataframe
    y = (x - min) / (max - min)
    return y
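As a quick sanity check, the function maps a single price column onto the 0-1 range (a made-up example, not part of the original post):

import pandas as pd

prices = pd.DataFrame({'Adj Close': [10.0, 12.5, 15.0, 11.0]})
print(normalize_data(prices))
# the minimum price becomes 0, the maximum becomes 1, everything else falls in between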
And another function to pull stock prices from Yfinance that calls the normalization function
def closing_price(ticker):
    #Asset = pd.DataFrame(yf.download(ticker, start=Start, end=End)['Adj Close'])
    Asset = pd.DataFrame(yf.download(ticker, start='2022-07-13', end='2022-09-16')['Adj Close'])
    Asset = normalize_data(Asset)
    return Asset.to_numpy()
I then pull 11 different stocks using the function
MRO= closing_price('MRO')
HES= closing_price('HES')
FANG= closing_price('FANG')
DVN= closing_price('DVN')
PXD= closing_price('PXD')
COP= closing_price('COP')
CVX= closing_price('CVX')
APA= closing_price('APA')
EOG= closing_price('EOG')
HAL= closing_price('HAL')
BLK = closing_price('BLK')
Which works so far
But when I try to merge the first 10 numpy arrays together,
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVS, APA, EOG, HAL])[:, :, 0]
X = np.transpose(X)
it gives me the following warning for the first line, where I merge the numpy arrays:
<ipython-input-53-a30faf3e4390>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Have you tried passing the following as is suggested by your error message?
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVS, APA, EOG, HAL], dtype=float)[:, :, 0]
Alternatively, what are you trying to do with your data afterwards, run a linear regression? Does the data have to be a NumPy array? Working with data is often a lot easier using a pandas.DataFrame, and basically all machine learning libraries you might want to use, such as sklearn or statsmodels, have pandas support.
To create one big dataset out of these you could try the following:
data = pd.DataFrame()  # empty dataframe to collect everything
list_of_tickers = {'MRO': MRO, 'HES': HES, 'FANG': FANG, 'DVN': DVN, 'PXD': PXD,
                   'COP': COP, 'CVS': CVS, 'APA': APA, 'EOG': EOG, 'HAL': HAL, 'BLK': BLK}
for name, ticker in list_of_tickers.items():
    # each column is just labelled "Adj Close" and you can't name multiple columns the same way,
    # so prefix every column with its ticker symbol: "MRO_Adj Close", "HES_Adj Close", etc.
    ticker = ticker.rename(columns={column: name + "_" + str(column) for column in ticker.columns})
    data = pd.concat([data, ticker], axis=1)
Additionally, this neatly avoids the problems that can arise when different stock tickers cover different dates in their datasets, as Kevin Choon Liang Yew correctly pointed out in the comments above.
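To follow up on the regression goal, here is a minimal sketch of fitting sklearn's LinearRegression directly on such a combined DataFrame. It assumes the layout produced above, with one "<TICKER>_Adj Close" column per stock and BLK used as the target; the exact column names are illustrative.

from sklearn.linear_model import LinearRegression

X = data.drop(columns=['BLK_Adj Close'])   # the 10 predictor columns
y = data['BLK_Adj Close']                  # target column

model = LinearRegression()
model.fit(X, y)             # sklearn accepts pandas DataFrames directly
print(model.score(X, y))    # in-sample R^2, just as a quick check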

How to make a data frame combining different regression results in python?

I am running some regression models to predict performance.
After running the models I created variables to look at the predictions (each y_pred_* holds 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The prediction outputs are arrays of float64, while y_test is a DataFrame.
I want to create a table with the results. I have tried several approaches (treating them as lists, converting them, selecting .values) and have not succeeded so far; can anyone help?
My last attempt was the following:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVM})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to add two new rows with the mean and standard deviation of the results, and these rows should be located at the beginning (the first rows) of the DataFrame.
Thanks
import pandas as pd
import numpy as np

# Dummy data standing in for y_test and the three prediction arrays
real = np.array([2] * 10).reshape(-1, 1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)

# Flatten the (10, 1) array to 1-D so it can be used as a DataFrame column
real = real.flatten()
comparison = pd.DataFrame({'real': real, 'y_pred_LR': y_pred_LR,
                           'y_pred_SVR': y_pred_SVR, 'y_pred_RF': y_pred_RF})

# Column-wise mean and standard deviation, turned into two rows
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean, StD], axis=1).T

# Put the summary rows on top of the per-sample rows
result = pd.concat([Mean_StD, comparison], ignore_index=True)
print(result)
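If you want the two summary rows to be recognizable rather than just numbered 0 and 1, a small variation (not part of the original answer) is to label them and skip ignore_index:

Mean_StD.index = ['mean', 'std']
result = pd.concat([Mean_StD, comparison])
print(result)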

pandas dataframe conditional selecting

I have a DataFrame and I'm trying to apply some ML algorithms to it.
I'm using pandas to handle it, but I'm having several problems with it:
As you can see in the 3rd cell, I have split Y into Ytr and Yts. After this the DataFrame loses its column names. I've tried to name the columns again but it doesn't work.
In the 4th cell, I'm trying to use a conditional statement to create a subset of Y in which the Y values are 1 (it is named ytr1), but it returns an empty DataFrame.
Any suggestions on the whole code would be really appreciated, since I'm not very experienced with pandas.
Note: if you haven't worked with Jupyter notebooks, #%% just marks a new cell.
#%%
from pandas import DataFrame as df
import random
import numpy as np
import pandas as pd
import re
#%%
# Preparing the DataFrame
labels = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\labels.csv', header=None)
ll = labels.loc[:, 0].tolist()
data = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\pima-indians-diabetes2.csv', names=ll)
i = data.columns.values.tolist() # i is the labels of the csv file
i[-1]
#%%
# Spliting the Dataset
X = data.drop(i[-1], axis=1)
Y = data.iloc[:, 8]
Y = Y.to_frame()
Y = pd.DataFrame(Y.values.reshape(-1, 1), columns=i[-1])
tr_idx = data.sample(frac=0.7).index
Xtr = df(X[X.index.isin(tr_idx)])
Xts = df(X[~X.index.isin(tr_idx)])
Ytr = df(Y[X.index.isin(tr_idx)], columns='result')
Yts = df(Y[~X.index.isin(tr_idx)], columns=i[-1])
#%%
# splitting the Classes
ytr1 = Ytr.drop(Ytr[Ytr.iloc[0]!=1].index)
X: all the columns except the labels/classes, which are 0 or 1
Y: the last column of the CSV file, loaded as the labels
Xtr: the fraction of X that I'm planning to use for training
Xts: the fraction of X that I'm planning to use for testing
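A sketch of changes that would address the two problems described above (untested against the actual CSV files, based only on the code shown): pd.DataFrame's columns argument expects a list-like, and the class filter needs a column selection (iloc[:, 0]) rather than a row selection (iloc[0]).

# columns= expects a list-like, so wrap the single name in a list
Y = pd.DataFrame(Y.values.reshape(-1, 1), columns=[i[-1]])

tr_idx = data.sample(frac=0.7).index
Xtr = X[X.index.isin(tr_idx)]
Xts = X[~X.index.isin(tr_idx)]
# Y already carries its column name, so there is no need to pass columns= again
Ytr = Y[Y.index.isin(tr_idx)]
Yts = Y[~Y.index.isin(tr_idx)]

# Keep the rows of Ytr whose label is 1: iloc[:, 0] selects the label column,
# whereas iloc[0] selects the first row, which is why the original filter came back empty
ytr1 = Ytr[Ytr.iloc[:, 0] == 1]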

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference
Unfortunately, pandas.DataFrame.rolling() applies its function to one column at a time, so it cannot be used as one might expect, i.e. to roll over the rows of the df and pass whole windows of rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)

# Set the window size
window = 100

# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))

# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output array.
def rolling_pca(window_data):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data.astype(int)])
    df_pca.iloc[int(window_data[0])] = transf[0, :]
    return True

# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))

# Use `rolling` to apply the PCA function
# (raw=True passes each window as a plain ndarray of row indices)
_ = df_idx.rolling(window).apply(rolling_pca, raw=True)

# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
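For reference, here is a sketch of that check, assuming the df, window, and df_pca objects from the snippet above: compute the same thing with an explicit loop over windows and compare.

# Control computation: slice each window manually and run PCA on it
control = np.zeros_like(df_pca.values)
for start in range(df.shape[0] - window + 1):
    transf = PCA().fit_transform(df.iloc[start:start + window])
    control[start] = transf[0, :]

# Compare against the rolling-apply result; this should print True
print(np.allclose(control, df_pca.values))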

dask DataFrame.assign blows up dask graph

So I have an issue with dask DataFrame.assign. I generate a lot of derivative features from the main data and add them to the main dataframe. After that, the dask graph for any set of columns is blown up. Here is a small example:
%pylab inline
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dot import dot_graph
df=pd.DataFrame({'x%s'%i:np.random.rand(20) for i in range(5)})
ddf = dd.from_pandas(df, npartitions=2)
dot_graph(ddf['x0'].dask)
Here the dask graph is as expected.
g=ddf.assign(y=ddf['x0']+ddf['x1'])
dot_graph(g['x0'].dask)
Here the graph for the same column has exploded with irrelevant computations.
Imagine I have lots and lots of spawned columns; the computation graph for any particular column then includes irrelevant computations for all the other columns. In my case I have len(ddf['someColumn'].dask) > 100000, so this becomes unusable quickly.
So my question is: can this issue be resolved? Are there any existing means to do this? If not, what direction should I look in to implement it?
Thanks!
Rather than continuously assigning new columns to the dask dataframe, you might want to build several dask series and then concat them all together at the end.
So instead of doing this:
df['x'] = df.w + 1
df['y'] = df.x * 10
df['z'] = df.y ** 2
Do this
x = df.w + 1
y = x * 10
z = y ** 2
df = df.assign(x=x, y=y, z=z)
Or this:
dd.concat([df, x, y, z], axis=1)
This may still result in the same number of tasks in your graph, but it will probably result in fewer memory copies.
Alternatively, if all of your transformations are row-wise, then you can construct a pandas function and map it across all partitions:
def f(part):
    part = part.copy()
    part['x'] = part.w + 1
    part['y'] = part.x * 10
    part['z'] = part.y ** 2
    return part

df = df.map_partitions(f)
Also, while a million-node task graph is less than ideal, it should also be OK. I've seen larger graphs run comfortably.
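As a quick way to see whether a refactor helps, you can compare graph sizes directly, just as the question already does with len(...). A small sketch using the toy frame from above:

# Number of tasks behind a single column, before and after adding a derived column
print(len(ddf['x0'].dask))    # original frame

g = ddf.assign(y=ddf['x0'] + ddf['x1'])
print(len(g['x0'].dask))      # after assign: the graph typically grows

h = ddf.map_partitions(lambda part: part.assign(y=part['x0'] + part['x1']))
print(len(h['x0'].dask))      # map_partitions version, for comparison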
