I am still lost on how Spark and deep learning models fit together.
If I have a (2D) time series that I want to use for, e.g., an LSTM model, then I first convert it to a 3D array and pass that to the model. This is normally done in memory with NumPy.
But what happens when I manage my BIG file with Spark?
The solutions I've seen so far all work with Spark and then convert the data to a NumPy array at the end. But that step puts everything back in memory... or am I thinking about this wrong?
A common Spark LSTM solution looks like this:
# create a fake dataset
import random
import numpy as np
from keras import models
from keras import layers

data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
df = spark.createDataFrame(data,['Node', 'day','Temp','hum','press'])
# transform the data
df_trans = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)
#make train/test data
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day > 70]
################## we lost the SPARK #############################
# create train/test array
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(testDF.columns).collect())
# drop the target columns
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]
# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]
# reshape 2D to 3D
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))
# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1,400)))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# train the model
loss = model.fit(xtrain, ytrain, batch_size=10, epochs=100)
My problem with this is:
If my Spark data has millions of rows and thousands of columns, then the '# create train/test array' step, which collects everything to the driver, causes a memory overflow. Am I right?
My question is:
Can SPARK be used to train LSTM models on big data, or is it not possible?
Is there any Generator function that can solve this? Like the Keras Generator function?
Perhaps you have too many columns in your dataframe - why would you have hundreds of columns? Are you collecting that many data points for each timestamp? If so, then I would argue that you need to subset your data. In my experience, a time-series is driven largely by the timestamp - even a small number of data points stretched across a long collection of time provides enormous information. In other words, you have a dataset that is wide and tall, but it should perhaps be thin and tall instead.
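As for the generator question: old-style Keras accepts a Python generator through model.fit_generator (tf.keras accepts one directly in model.fit), and Spark can feed such a generator without a single collect(). Below is a rough, untested sketch, assuming the pivoted df_trans from the question with the target in the last column; toLocalIterator() still brings rows to the driver, but only one partition at a time rather than the whole DataFrame.
import numpy as np

def spark_batch_generator(spark_df, batch_size=10, timesteps=1):
    # Keras expects a generator that loops forever
    while True:
        batch = []
        for row in spark_df.toLocalIterator():
            batch.append(list(row))
            if len(batch) == batch_size:
                arr = np.array(batch, dtype='float32')
                # all but the last column as features, last column as target
                x = arr[:, :-1].reshape((arr.shape[0], timesteps, arr.shape[1] - 1))
                y = arr[:, -1:]
                yield x, y
                batch = []

# old Keras:  model.fit_generator(spark_batch_generator(trainDF), steps_per_epoch=7, epochs=100)
# tf.keras:   model.fit(spark_batch_generator(trainDF), steps_per_epoch=7, epochs=100)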
Related
Is there any way to batch a large input file (111 MB) of 22 million cells (222 rows by 110k columns) in MLlib, something similar to this Keras batching tutorial?
The file contains features extracted from 222 images following that tutorial, but instead of using a Keras model I would like to replicate the code with PySpark and MLlib.
Unfortunately I don't have enough resources to handle such a big file in memory, and the computation fails with a Java heap space error.
The file structure is as follows: each row represents an image, "_c0" holds the 0/1 label, and "_c1" up to "_c100353" hold the extracted features.
Here's my code. I don't care about precision or accuracy; I'm only interested in running the model to measure resource usage.
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

sql, sc = init_spark()
df = sql.read.option("maxColumns", 100400).load(file3,format="csv",inferSchema="true",sep=',',header="false")
labelIndexer = StringIndexer(inputCol="_c0", outputCol="indexedLabel").fit(df)
cols = df.columns
cols.remove("_c0")
assembler = VectorAssembler(inputCols=cols,outputCol="features")
data = assembler.transform(df)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=100).fit(data)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train the model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(100)
predictions.printSchema()
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g " % accuracy)
Please don't suggest using the sparkdl library for feature extraction with DeepImageFeaturizer, because it's completely broken.
I have built a multi-step, multi-variate LSTM model to predict the target variable 5 days into the future with a 5-day look-back. The model runs smoothly (even though it can be further improved), but I cannot correctly invert the transformation once I get my predictions.
I have seen on the web that there are many ways to pre-process and transform data. I decided to follow these steps:
Data fetching and cleaning
import yfinance

df = yfinance.download(['^GSPC', '^GDAXI', 'CL=F', 'AAPL'], period='5y', interval='1d')['Adj Close']
df.dropna(axis=0, inplace=True)
df.describe()
Data set table
Split the data set into train and test
size = int(len(df) * 0.80)
df_train = df.iloc[:size]
df_test = df.iloc[size:]
Scaled train and test sets separately with MinMaxScaler()
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)
Creation of 3D X and y time-series compatible with the LSTM model
I borrowed the following function from this article
import numpy as np

def create_X_Y(ts: np.array, lag=1, n_ahead=1, target_index=0) -> tuple:
    """
    A method to create the X and Y matrices from a time-series array for
    training deep learning models.
    """
    # Extract the number of features passed in the array
    n_features = ts.shape[1]
    # Create placeholder lists
    X, Y = [], []
    if len(ts) - lag <= 0:
        X.append(ts)
    else:
        for i in range(len(ts) - lag - n_ahead):
            Y.append(ts[(i + lag):(i + lag + n_ahead), target_index])
            X.append(ts[i:(i + lag)])
    X, Y = np.array(X), np.array(Y)
    # Reshape the X array to the RNN input shape (samples, timesteps, features)
    X = np.reshape(X, (X.shape[0], lag, n_features))
    return X, Y
#In this example let's assume that the first column (AAPL) is the target variable.
trainX,trainY = create_X_Y(df_train_sc,lag=5, n_ahead=5, target_index=0)
testX,testY = create_X_Y(df_test_sc,lag=5, n_ahead=5, target_index=0)
Model creation
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True,
                        input_shape=(trainX.shape[1], trainX.shape[2])))
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True))
    grid_model.add(LSTM(64, activation='tanh'))
    grid_model.add(Dropout(0.2))
    grid_model.add(Dense(trainY.shape[1]))
    grid_model.compile(loss='mse', optimizer=optimizer)
    return grid_model

grid_model = KerasRegressor(build_fn=build_model, verbose=1, validation_data=(testX, testY))
parameters = {'batch_size': [12, 24],
              'epochs': [8, 30],
              'optimizer': ['adam', 'Adadelta']}
grid_search = GridSearchCV(estimator=grid_model,
                           param_grid=parameters,
                           cv=3)
grid_search = grid_search.fit(trainX, trainY)
grid_search.best_params_
my_model = grid_search.best_estimator_.model
Get predictions
yhat = my_model.predict(testX)
Invert transformation of predictions and actual values
Here my problems begin, because I am not sure which way to go. I have read many tutorials, but it seems those authors prefer to apply MinMaxScaler() to the entire dataset before splitting it into train and test. I do not agree with this, because otherwise the training data would be scaled using information we should not have (i.e. the test set). So I followed my own approach, but I am stuck here.
I found this possible solution on another post, but it's not working for me:
# invert scaling for forecast
pred_scaler = MinMaxScaler(feature_range=(0, 1)).fit(df_test.values[:,0].reshape(-1, 1))
inv_yhat = pred_scaler.inverse_transform(yhat)
# invert scaling for actual
inv_y = pred_scaler.inverse_transform(testY)
In fact, when I double-check the last values of the target in my original data set, they don't match the inverse-scaled version of testY.
Can someone please help me on this? Many thanks in advance for your support!
A few things could be mentioned here. First, you cannot inverse-transform something the scaler has never seen. This happens because you use two different scalers: the network predicts values in the range of scaler 1, and there is no guarantee these lie within the range of scaler 2 (which was fit on the test data). Second, best practice is to fit your scaler on the training set and use that same scaler (transform only) on the test data as well; then you should be able to inverse-transform your test results. Third, if the scaling goes off because the test set contains completely different values (which happens, e.g., with live streaming data), it is up to you to deal with it; the min-max scaler will then produce values > 1.0, for example.
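A minimal sketch of that advice, assuming the variable names from the question and that column 0 (AAPL) is the target: keep the feature scaling from the question (fit on df_train only, transform df_test), and add a single-column scaler fit on the training target so the (n_samples, n_ahead) prediction matrix can be inverted.
from sklearn.preprocessing import MinMaxScaler

# Single-column scaler fit on the *training* target only. Because MinMaxScaler
# scales each column independently, this matches the scaling the target column
# received from the feature scaler fit on df_train.
target_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler.fit(df_train.values[:, 0].reshape(-1, 1))

# yhat = my_model.predict(testX)   # as in the question
# Invert the scaling of predictions and actual values column-wise
inv_yhat = target_scaler.inverse_transform(yhat.reshape(-1, 1)).reshape(yhat.shape)
inv_y = target_scaler.inverse_transform(testY.reshape(-1, 1)).reshape(testY.shape)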
I want to create a neural network with Keras and my training data is in a pandas data frame, called df_train, which has the following form. Every row is an event/observation consisting of 51 variables.
df_train.head()
My question is: can I use this df_train data frame directly as input to the Keras model.fit() command, as follows?
net = Sequential()
net.add(Dense(70, input_dim = 51, activation = "relu"))
net.add(Dense(70, activation = "relu"))
net.add(Dense(1, activation = "sigmoid"))
net.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
net.fit(df_train, train_labels, epochs = 300, batch_size = 100)
In net.fit() I pass a data frame as training data, but the Sequential documentation doesn't mention a data frame as a valid input. However, my code works and the model trains normally. Is something going wrong behind the scenes that simply doesn't raise an error, or does it run as intended even when you use a pandas data frame as input?
Also, if it works, does fit() in this case take one row of the data frame at a time as input?
Thanks a lot.
net.fit(df_train, train_labels, epochs = 300, batch_size = 100)
Here df_train is 2D and train_labels can be 2D or 1D (it depends on the loss you use and the number of units in the output layer).
As for the second question: yes, you can.
Note that if you try to pass X as a single row, which is 1D, you get an error:
ValueError: Expected 2D array, got 1D array instead: array=[1. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample.
How do we resolve this? Convert the single row into 2D by slicing with a range:
X = train.iloc[0:1, :]
print(X.shape)
# output: (1, 25) -- the single row is now two-dimensional
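If in doubt, you can also hand Keras the underlying NumPy array explicitly; assuming the same net and labels as in the question, this should behave the same as passing the data frame itself:
net.fit(df_train.values, train_labels, epochs=300, batch_size=100)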
There is a specialized method called flow_from_dataframe. Similar to flow_from_directory, which reads images from a directory and feeds them to your neural network, flow_from_dataframe lets you feed data from a dataframe.
You can have a look here in order to see an application of flow_from_dataframe:
https://medium.com/@vijayabhaskar96/tutorial-on-keras-flow-from-dataframe-1fd4493d237c
The second option is to use tf.data.Dataset.from_tensor_slices(), which builds on TensorFlow's specialized tf.data input pipeline.
An example of it is available here: https://www.tensorflow.org/guide/data#consuming_csv_data
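As a small sketch of the from_tensor_slices option, assuming the df_train and train_labels from the question (51 numeric feature columns plus a separate label array):
import tensorflow as tf

# Build a tf.data pipeline from the in-memory data, then shuffle and batch it
dataset = tf.data.Dataset.from_tensor_slices(
    (df_train.values.astype('float32'), train_labels))
dataset = dataset.shuffle(buffer_size=len(df_train)).batch(100)

# The dataset can then be passed straight to fit:
# net.fit(dataset, epochs=300)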
The third method, and perhaps in the future the most elegant and fastest from a performance point of view, is tf.data.experimental.CsvDataset. The reason I say "in the future" is that, as you can see, it is part of the 'experimental' package, which means it is new and not yet mature and stable. A link about it is provided here: https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset
You can try any of these methods, but to begin with I would stick to flow_from_dataframe.
I am using the getting-started example of the Tensorflow CNN and updating its parameters for my own data, but since my inputs are large (244 * 244 features) I get an OutOfMemory error.
I am running the training on Ubuntu 14.04 with 4 CPUs and 16 GB of RAM.
Is there a way to shrink my data so I don't get this OOM error?
My code looks like this:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="path/to/model")

# Load the data
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    batch_size=5,
    shuffle=True)

# Train the model
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
Is there a way to shrink my data so I don't get this OOM error?
You can slice your training_set to obtain just a portion of the dataset. Something like:
x={"x": np.array(training_set.data)[:(len(training_set)/2)]},
y=np.array(training_set.target)[:(len(training_set)/2)],
In this example you are getting the first half of your dataset (you can select up to what point of your dataset you want to load).
Edit: another way to do this is to take a random subset of your training dataset. You can achieve this by masking elements of the dataset array. For example:
import numpy as np
from random import random as rn
# obtain a boolean mask to filter out some elements
# here you can define your sampling fraction
r = 0.5  # say, drop roughly half of the elements
mask = [rn() >= r for _ in range(len(training_set))]
# finally, mask out those elements;
# the result keeps roughly (1 - r) of the original elements
reduced_ds = training_set[mask]
I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so its shape is (5000, 4). Each row is an observation taken at a regular interval. This is the first time I'm doing time-series prediction, and I'm stuck on the input shape. I'm trying to predict
a value 1 timestep beyond the last data point. How do I reshape the data into the 3D form required by an LSTM model in Keras?
It would also be very helpful if a small sample program were included. There doesn't seem to be any example/tutorial where the input has more than one feature (and that also isn't NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10 # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((5000,1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context, :])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy array
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
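For completeness, here is a small sketch of an LSTM that consumes the arrays built above; the layer size, epochs, and batch size are arbitrary placeholders, not tuned values.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(prediction_context, n_features)))
model.add(Dense(1))                      # one predicted value per window
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, Y_train, epochs=20, batch_size=32)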