Converting pandas series to 3D input vectors for LSTM implementation - python

I have a pandas series where each row is a list holding a 50-timestep input sequence, and another series with the corresponding 10-timestep output sequences. The head of each series has shape (5,). I want to convert the input data to shape (n_samples, 50, 1) and the output data to shape (n_samples, 10) in order to feed them to a many-to-many LSTM model. I've tried several approaches from Stack Overflow, but none of them work for me. Whatever I do, I keep getting the error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

My colleague helped me with the answer to this:
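import numpy as np

N = 100  # number of samples
X = df_new['sequence'].iloc[:N]
y = df_new['target'].iloc[:N]
# Stack the per-row lists into one numeric array of shape (N, 50),
# then add a trailing feature axis to get (N, 50, 1).
X = np.array(X.tolist()).reshape(N, 50, 1)
# The targets stack directly into shape (N, 10).
y = np.array(y.tolist())
print(X.shape)
print(y.shape)
The part I missed was building a single numeric array from the per-row lists before reshaping. Note that a transpose is not needed here: transposing before the reshape yields the right shape but mixes values across samples.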
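As a hedged, self-contained illustration of this conversion, the sketch below builds a toy DataFrame (the column names sequence and target follow the snippet above; the data itself is made up) and checks that each row of the stacked array still corresponds to one sample.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 6 samples, 50 input steps and 10 output steps each.
n_samples = 6
df_new = pd.DataFrame({
    "sequence": [list(range(i * 100, i * 100 + 50)) for i in range(n_samples)],
    "target": [list(range(i * 10, i * 10 + 10)) for i in range(n_samples)],
})

# A Series of equal-length lists becomes one numeric 2D array via tolist().
X = np.array(df_new["sequence"].tolist(), dtype="float32")[..., np.newaxis]
y = np.array(df_new["target"].tolist(), dtype="float32")

print(X.shape, y.shape)  # (6, 50, 1) (6, 10)
print(X[1, :3, 0])       # first three timesteps of sample 1
```

Because the arrays now have a numeric dtype instead of object, the "Unsupported object type list" conversion error should no longer apply to them.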

Related

Dealing with NaN in BERT for Multi-class Text Classification of unlabelled text

I am just starting to explore BERT in a multiclass text classification task. In doing that, I am using this data. This is my code. During model testing, my attempt to compute cosine similarities:
#--- Model Algorithm ---#
## compute cosine similarities
similarities = np.array(
[metrics.pairwise.cosine_similarity(X, y).T.tolist()[0]
for y in dic_y.values()]
).T
gives this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Many Stack Overflow posts suggest eliminating NaN and null cells, which I did at the beginning (see my code). There were also suggestions to replace NaN, but when I did so:
X = np.nan_to_num(X.astype(np.float32))
similarities = np.array(
[metrics.pairwise.cosine_similarity(X, y).T.tolist()[0]
for y in dic_y.values()]
).T
I got another error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Applying .reshape(-1, 1) or .reshape(1, -1) to X:
X = np.reshape(X, (-1, 1))
similarities = np.array(
[metrics.pairwise.cosine_similarity(X, y).T.tolist()[0]
for y in dic_y.values()]
).T
generates the same error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I am sure that the input X is a 2D array and not 1D, because X.ndim is 2. Any help will be appreciated.
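One possible reading of the second error, offered as a guess rather than a diagnosis: cosine_similarity validates both of its arguments, so the 1D array in the message may be y (each value of dic_y) rather than X. The sketch below uses made-up arrays and two hypothetical helpers, as_row and cosine_to_rows, to show the shape check and the reshape. It uses plain NumPy instead of scikit-learn so that it runs on its own.

```python
import numpy as np

def as_row(v):
    """Return v as a float32 array of shape (1, n_features), with NaN replaced by 0."""
    v = np.nan_to_num(np.asarray(v, dtype=np.float32))
    return v.reshape(1, -1) if v.ndim == 1 else v

def cosine_to_rows(X, y):
    """Cosine similarity between every row of X (2D) and the single row y of shape (1, d)."""
    num = X @ y.T
    den = np.linalg.norm(X, axis=1, keepdims=True) * np.linalg.norm(y)
    return (num / den).ravel()

# Made-up inputs: X is 2D, while each value of dic_y is a 1D vector.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], dtype=np.float32)
dic_y = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

similarities = np.array(
    [cosine_to_rows(X, as_row(y)) for y in dic_y.values()]
).T
print(similarities.shape)  # (3, 2): one row per sample, one column per label
```

If this reading is right, the analogous change in the original code would be to reshape y with .reshape(1, -1) and apply np.nan_to_num to it as well, rather than reshaping X.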

How to remove 2d array from 3d array if it contains NA values

I am working on a seq2seq machine learning problem with Conv1D and LSTM. To do this I must produce an input tensor of shape (samples, timesteps, features). Aside from the problems I was having with the LSTM layer (a different topic), I find myself struggling to delete a 2D slice of my 3D input tensor if it contains NA value(s). I want to delete the entire sample if any feature, in any timestep, is NA.
Up until now, to keep it simple, I was working with univariate data, and my solution was to transform my array into a pandas DataFrame and use df.dropna(axis=0) to drop the entire sample. However, that function only works with 2D DataFrames. I've tried looping over my samples to produce 2D arrays that I can then convert into pandas DataFrames, but got stuck trying to put the 2D arrays back together. I figured there has got to be a cleaner way to go about this, so I found this example:
x = np.array([[[1,2,3], [4,5,np.nan]], [[7,8,9], [10,11,12]]])
print("Original array:")
print(x)
print("Remove all non-numeric elements of the said array")
print(x[~np.isnan(x).any(axis=2)])
This works for 2D arrays, and I figured it would work with any number of dimensions, but I was wrong, and I don't understand what I'm doing wrong here. For completeness' sake, here is my function that successfully deletes an input and its corresponding output from X_train and y_train if either X_train or y_train contains NA value(s). It only works for univariate data, because the third dimension of the X_train tensor then has size 1 and can be dropped:
def drop_days_with_na(df, df1):
    df_shape = df.shape
    df = df.reshape(df.shape[0], df.shape[1])
    df = np.concatenate((df, df1), axis=1)
    df = pd.DataFrame(df)
    na_index = df.isna()
    df = df.dropna(axis=0)
    df = np.array(df)
    df = df.reshape(df.shape[0], df.shape[1], 1)
    df1 = df[:, df_shape[1]:, :]
    df1 = df1.reshape(df1.shape[0], df1.shape[1])
    df = df[:, :df_shape[1], :]
    return df, df1, na_index
This solved my problem:
def remove_nan(X, Y):
    x = []
    y = []
    for sample in range(X.shape[0]):
        # Skip the sample if its input or its target contains any NaN.
        if np.isnan(X[sample, :, :]).any() or np.isnan(Y[sample, :]).any():
            continue
        x.append(X[sample, :, :])
        y.append(Y[sample, :])
    x = np.array(x)
    y = np.array(y)
    return x, y

x_train, y_train = remove_nan(x_train, y_train)
x_test, y_test = remove_nan(x_test, y_test)
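The same filtering can also be written without a Python loop. The sketch below is one way to do it, assuming X has shape (samples, timesteps, features) and Y has shape (samples, outputs). The key difference from the example in the question is reducing over both trailing axes with any(axis=(1, 2)), so the mask has one entry per sample. The toy arrays are made up.

```python
import numpy as np

X = np.array([[[1, 2, 3], [4, 5, np.nan]],
              [[7, 8, 9], [10, 11, 12]]], dtype=float)
Y = np.array([[0.1, 0.2],
              [0.3, 0.4]])

# One boolean per sample: True if the sample has no NaN in X or in Y.
keep = ~(np.isnan(X).any(axis=(1, 2)) | np.isnan(Y).any(axis=1))

X_clean, Y_clean = X[keep], Y[keep]
print(X_clean.shape, Y_clean.shape)  # (1, 2, 3) (1, 2)
```

With any(axis=2) alone, the mask has shape (samples, timesteps), so indexing with it selects individual timesteps instead of whole samples.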

Cannot reshape array of size x into shape y

I'm following a tutorial to create an LSTM neural network using keras.
I have an array of 1270 rows and 26 features.
I split the data like this:
train_ind = int(0.8 * X.shape[0])
X_train = X[:train_ind]
X_test = X[train_ind:]
y_train = y[:train_ind]
y_test = y[train_ind:]
And I'm trying to reshape it for the LSTM using this:
num_steps = 4
X_train_shaped = np.reshape(X_train_sc, newshape=(-1, num_steps, 26))
y_train_shaped = np.reshape(y_train_sc, newshape=(-1, num_steps, 26))
assert X_train_shaped.shape[0] == y_train_shaped.shape[0]
However, I'm getting this error:
ValueError: cannot reshape array of size 1016 into shape (4,26)
Well, 4 x 26 = 104, and the array being reshaped has 1016 elements (int(0.8 * 1270) = 1016), which isn't divisible by 104, so np.reshape() can't choose an integer number of rows (the -1) to fit the data into that shape. You need to change either num_steps or num_features (26) so that num_steps * num_features evenly divides the array's size. Unfortunately, this is impossible with num_features = 26, since 13 divides neither 1016 nor 1270. Your other option is to choose a different size, say 1040 or 1144, which are both divisible by 104.
So, instead of setting train_ind = int(0.8 * X.shape[0]), try train_ind = 1040 or a smaller multiple of 104. Note, however, that your test data will also need a suitable number of rows in order to be reshaped in the same way.
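The divisibility rule can be checked directly. The sketch below uses zero-filled arrays with the sizes from the question. That the failing array is a 1D target with one value per row is an assumption inferred from the size 1016 in the error message, not something the question states, and fits is a hypothetical helper.

```python
import numpy as np

num_steps, num_features = 4, 26
X_train = np.zeros((1016, num_features))  # 1016 rows x 26 features
y_train = np.zeros(1016)                  # assumed: one target value per row

def fits(arr, tail_shape):
    """True if arr.size is a multiple of the product of tail_shape."""
    return arr.size % int(np.prod(tail_shape)) == 0

print(fits(X_train, (num_steps, num_features)))  # True: 26416 = 254 * 104
print(fits(y_train, (num_steps, num_features)))  # False: 1016 is not a multiple of 104

X_shaped = X_train.reshape(-1, num_steps, num_features)
print(X_shaped.shape)  # (254, 4, 26)
```

Under that assumption, only the target reshape fails, because a (-1, 4, 26) shape asks for 104 values per block from an array that holds 1016 values in total.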
First of all, you don't need to reshape an array. The shape attribute of a numpy array simply determines how the underlying data is displayed to you and how the data is accessed; changing the shape doesn't actually move any data around.
Likewise, note that you cannot change the shape to something impossible. For example, if an array has shape (100,5,6), you can't change it to (100,5,7). In general, the axis lengths have to multiply to the same total: 100*5*6 does not equal 100*5*7.
In your case, you sound like you want to work with an LSTM, which would normally mean that you want to simply add an additional axis so that you have input vectors of size 1. A new axis can be added with a None entry in numpy. Something like:
X_train = X[:train_ind,:,None] #The axes are Batch, Time, and the Input Vector.
Shape should now be (1016,26,1).
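A minimal sketch of the new-axis indexing, using a zero-filled array with the dimensions from the question (the data is a placeholder):

```python
import numpy as np

X = np.zeros((1270, 26))
train_ind = int(0.8 * X.shape[0])  # 1016

# Indexing with None (an alias for np.newaxis) appends an axis of length 1.
X_train = X[:train_ind, :, None]
print(X_train.shape)                 # (1016, 26, 1)
print(np.shares_memory(X, X_train))  # True: the result is a view, nothing is copied
```

Note that this treats the 26 columns as 26 timesteps of a single feature, which is a modelling choice and may or may not match the data.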

How to shape input array appropriately for ML

I apologize in advance if this question seems silly, but I am trying to understand a machinelearningmastery blog post about LSTM-type ML algorithms, more specifically how the input data gets reshaped. I don't have a ton of wisdom on the subject, or a CS degree for that matter.
About halfway into the blog, in the LSTM CNN section, Jason talks about:
The first step is to split the input sequences into subsequences that
can be processed by the CNN model. For example, we can first split our
univariate time series data into input/output samples with four steps
as input and one as output. Each sample can then be split into two
sub-samples, each with two time steps. The CNN can interpret each
subsequence of two time steps and provide a time series of
interpretations of the subsequences to the LSTM model to process as
input.
We can parameterize this and define the number of subsequences as
n_seq and the number of time steps per subsequence as n_steps. The
input data can then be reshaped to have the required structure:
[samples, subsequences, timesteps, features]
My question: is it a requirement for the data to be shaped into only 4 steps, or can it be larger? The code below attempts to print the array; I am using my own data sample here, on my git account.
import pandas as pd
import numpy as np
# univariate data preparation
from numpy import array
df = pd.read_csv("trainData.csv")
df = df[['kW']].shift(-1)
df = df.dropna()
raw_seq = df.values.tolist()
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# define input sequence
#raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 4
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# reshape from [samples, timesteps] into [samples, subsequences, timesteps, features]
n_features = 1
n_seq = 2
n_steps = 2
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
The code above works but when I change n_steps = 7 (from 4) I get this shape error.
File "convArray.py", line 39, in <module>
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
ValueError: cannot reshape array of size 2499 into shape (357,2,2,1)
The reason I want to try and use 7 time steps is the data that I am experimenting with is electrical demand units for a building per day, and 7 days in a week would be an ideal experimental time step!
Any tips greatly appreciated
The error message states the problem: you are trying to reshape the array into a shape that cannot hold its elements. The array X has 2499 elements, and 2499 elements cannot be arranged in a (357,2,2,1) shape. The product of the numbers in a shape is the total number of elements, and a (357,2,2,1) array has 357*2*2*1 = 1428 elements.
So the target shape has room for only n_seq * n_steps = 2 * 2 = 4 values per sample, while each of the 357 samples built with a window of 7 holds 7 values (357 * 7 = 2499).
I think in your case it also depends on raw_seq, because its length decides how many times the for loop inside split_sequence() runs and therefore the size of the arrays it returns. raw_seq depends on the data, so for this dataset, n_steps might be limited.
I'm not 100% sure about it though.
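A hedged sketch of the constraint behind the error: the window length passed to split_sequence() has to equal n_seq * n_steps for the 4D reshape to fit. Since 7 is prime, a 7-step window only splits as 1 x 7 or 7 x 1. The two-week window of 14 = 2 x 7 used below is an illustrative assumption, as is the synthetic series standing in for the kW column.

```python
import numpy as np

def split_sequence(sequence, window):
    """Split a 1D series into (window inputs, next value) pairs."""
    X, y = [], []
    for i in range(len(sequence) - window):
        X.append(sequence[i:i + window])
        y.append(sequence[i + window])
    return np.array(X), np.array(y)

raw_seq = np.arange(100, dtype=float)  # synthetic stand-in for the kW series

n_seq, n_steps, n_features = 2, 7, 1   # two subsequences of one week each
window = n_seq * n_steps               # 14 values per sample

X, y = split_sequence(raw_seq, window)
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
print(X.shape, y.shape)  # (86, 2, 7, 1) (86,)
```

Deriving the window from n_seq and n_steps, instead of setting the two independently, keeps the reshape valid whichever values are chosen.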

How to reshape input for keras LSTM?

I have a NumPy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so it has shape (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction, and I'm stuck on the input shape. I'm trying to predict a value 1 timestep from the last data point. How do I reshape the data into the 3D form needed for an LSTM model in Keras?
A small sample program would also be very helpful. There doesn't seem to be any example or tutorial where the input has more than one feature (and is not NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10 # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((5000,1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context,:])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy array
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape == (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
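As an alternative sketch of the same windowing without a Python loop, the block below uses numpy.lib.stride_tricks.sliding_window_view, which assumes NumPy 1.20 or newer. The smaller sizes and the random data are placeholders chosen to keep the example quick to inspect.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

recording_length, n_features, prediction_context = 50, 4, 10
rng = np.random.default_rng(0)
X_data = rng.random((recording_length, n_features))
to_predict = rng.random((recording_length, 1))

# Windows along the time axis: shape (n - w + 1, n_features, w).
windows = sliding_window_view(X_data, prediction_context, axis=0)
# Move the window axis in front of the feature axis and drop the last
# window, which has no following value to predict.
X_train = windows.transpose(0, 2, 1)[:-1]
Y_train = to_predict[prediction_context:]

print(X_train.shape, Y_train.shape)  # (40, 10, 4) (40, 1)
```

The result has the (samples, timesteps, features) layout described above, with sample i covering rows i to i + prediction_context - 1 of X_data.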
