How to include continuous and categorical predictors in a Keras LSTM? - python

I want to use a Keras LSTM (or similar) to forecast the energy consumption of businesses based on:
1. historical consumption data
2. some numerical features (e.g. total yearly consumption)
3. some categorical features (e.g. business type)
This is a cold-start problem because, while 2. and 3. are present for both the training and the test set, 1. is not, i.e. I am trying to predict the consumption of new businesses for which there is no historical data.
My question is: how do I structure the dataframe and the RNN to accommodate both 2. (numerical features) and 3. (categorical features) as my predictors?
Here is a made-up example of the data:
# generate x (predictors dataframe)
import pandas as pd
x = pd.DataFrame({'ID': [0, 1, 2, 3],
                  'business_type': [0, 2, 2, 1],
                  'contract_type': [0, 0, 2, 1],
                  'yearly_consumption': [1000, 200, 300, 900],
                  'n_sites': [9, 1, 2, 5]})
print(x)
# note: the first 2 are categorical and the second 2 are numerical
   ID  business_type  contract_type  yearly_consumption  n_sites
0   0              0              0                1000        9
1   1              2              0                 200        1
2   2              2              2                 300        2
3   3              1              1                 900        5
# generate y (timeseries dataframe)
import numpy as np

time_series = []
data_length = 6
period = 1
for k in range(4):
    level = 10 * np.random.rand()
    seas_amplitude = (0.1 + 0.3 * np.random.rand()) * level
    sig = 0.05 * level  # noise parameter (constant in time)
    time_ticks = np.array(range(data_length))
    source = level + seas_amplitude * np.sin(time_ticks * (2 * np.pi) / period)
    noise = sig * np.random.randn(data_length)
    data = source + noise
    time_series.append(pd.Series(data=data, index=['t0', 't1', 't2', 't3', 't4', 't5']))
y = pd.DataFrame(time_series)
print(y)
         t0        t1        t2        t3        t4        t5
0  9.611984  8.453227  8.153665  8.801166  8.208920  8.399184
1  2.139507  2.118636  2.160479  2.216049  1.943978  2.008407
2  0.131757  0.133401  0.135168  0.141212  0.136568  0.123730
3  5.990021  6.219840  6.637837  6.745850  6.648507  5.968953
# note: the real data has thousands of data points (one year with half hourly frequency)
# note: the first row belongs to ID = 0 in x, the second row to ID = 1 etc.
I have looked extensively online, and there seems to be no example where categorical, numerical, and time-series data are all used together. For a simple forecasting problem, this post explains that in order to learn from the previous time period, the LSTM must be fed something like this:
# process df for a classical forecasting problem for first ID
y_lstm = pd.DataFrame(y.iloc[0,:])
y_lstm.columns = ['t']
y_lstm['t-1'] = y_lstm['t'].shift()
print(y_lstm)
           t       t-1
t0  9.611984       NaN
t1  8.453227  9.611984
t2  8.153665  8.453227
t3  8.801166  8.153665
t4  8.208920  8.801166
t5  8.399184  8.208920
# note: t-1 represents the previous time point
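As I understand it, a Keras LSTM then expects this as a 3D array of shape (samples, timesteps, features), so the frame above would be reshaped along these lines (a sketch using the names from the snippet above):
# drop the NaN row and reshape for Keras: (samples, timesteps, features)
y_clean = y_lstm.dropna()
X = y_clean['t-1'].values.reshape(-1, 1, 1)  # 5 samples, 1 timestep, 1 feature
target = y_clean['t'].values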
However, while this works for a single time series, it is unclear how to structure the dataset when there are multiple time series, and how to include the rest of the predictors in this structure.
This post talks about how to include both categorical and numerical variables through embeddings, but it does not fit my problem, where time-series data also has to be included. This post discusses the choice between one-hot encoding and embeddings without any example code and does not answer my question.
Could anyone please provide me with example code on how to structure the data appropriately for the RNN, and/or show how a simple LSTM structure with Keras would look? Note that this structure should be able to use the time-series data for training, but not for prediction (i.e. only x and not y is available for the test set).
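For concreteness, here is a rough sketch of what I imagine such a model might look like with the Keras functional API: an Embedding per categorical column, the numerical columns as a plain vector, and the concatenated static vector repeated along the time axis so an LSTM can decode it into the consumption series. The layer sizes are guesses, and I am not sure this structure is right, which is exactly what I am asking:
import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, Flatten, Concatenate,
                          RepeatVector, LSTM, TimeDistributed, Dense)

n_steps = 6     # length of each consumption series
n_business = 3  # number of business_type categories
n_contract = 3  # number of contract_type categories

# static inputs (one row per business, as in x above)
business_in = Input(shape=(1,), name='business_type')
contract_in = Input(shape=(1,), name='contract_type')
numeric_in = Input(shape=(2,), name='numeric')  # yearly_consumption, n_sites

# embed the categoricals and concatenate with the numerical features
business_emb = Flatten()(Embedding(n_business, 2)(business_in))
contract_emb = Flatten()(Embedding(n_contract, 2)(contract_in))
static = Concatenate()([business_emb, contract_emb, numeric_in])

# repeat the static vector across time so the LSTM sees it at every step
seq = RepeatVector(n_steps)(static)
seq = LSTM(32, return_sequences=True)(seq)
out = TimeDistributed(Dense(1))(seq)

model = Model([business_in, contract_in, numeric_in], out)
model.compile(optimizer='adam', loss='mse')

# training: x supplies the predictors, y supplies the target series;
# at prediction time only x is needed, which matches the cold-start setup
# model.fit([x['business_type'], x['contract_type'],
#            x[['yearly_consumption', 'n_sites']].values],
#           y.values[..., np.newaxis], epochs=10)
Thank you very much in advance.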

Related

Understanding TimeSeriesDataSet in Pytorch-Forecasting

I have 913,000 rows of data (image of the dataframe omitted).
First, let me explain this data.
It is sales data for 10 stores and 50 items from 2013-01-01 to 2017-12-31.
I understand why this data has 913,000 rows: 10 stores × 50 items × 1826 days (the range includes the 2016 leap day).
Anyway, I made my training set:
training = TimeSeriesDataSet(
    train_df[train_df.apply(lambda x: x['time_idx'] <= training_cutoff, axis=1)],
    time_idx="time_idx",
    target="sales",
    group_ids=["store", "item"],  # list of column names identifying a time series
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=["store", "item"],
    # categorical variables that do not change over time (e.g. product length)
    time_varying_unknown_reals=["sales"],
)
Now:
First question: as I understand it, the TimeSeriesDataSet's data parameter reflects the input data minus the prediction horizon (set by training_cutoff) and minus the max_encoder_length reserved for prediction. Is this right? If not, please tell me the truth.
Second question: similarly, this is the output of the code above (image omitted).
Why is the length 863,500? I calculated the length based on my understanding:
prediction horizon cut by training_cutoff: 20 × 50 × 10 = 10,000
max_encoder_length reserved for prediction: 60 × 50 × 10 = 30,000
Thus 913,000 − 40,000 = 873,000.
Where did the remaining 9,500 rows go?
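For what it's worth, the raw row count itself checks out; a quick sanity check in pandas (dates and counts taken from the description above):
import pandas as pd

# 10 stores x 50 items, one row per day from 2013-01-01 to 2017-12-31
n_groups = 10 * 50
n_days = pd.date_range("2013-01-01", "2017-12-31").size  # 1826 days (2016 is a leap year)
print(n_groups * n_days)  # 913000
I have done my best with Google; please tell me the truth.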

What is the meaning of normalization in machine learning? Does it correspond to one sample?

I am dealing with a classification problem: I want to classify data into 2 classes. I generate 1000 samples at different temperatures ranging from 1 to 5 and load the data using the function load_data below. Here "data" is a 2-dimensional array of shape (1000, 16): the rows correspond to the samples in "1.0.npy" (and similarly for the other temperatures), and 16 is the number of features. I picked the max and min values from each sample by applying a for loop, but I'm afraid my normalization is not correct because I'm not sure what the right normalization strategy is in machine learning. Should I take np.amax of each sample, or np.amax over all 1000 samples contained in the "1.0.npy" file? My goal is to normalize the data to between 0 and 1.
import os
import numpy as np

def load_data():
    path = "./directory"
    files = sorted(os.listdir(path))  # {1.0.npy, 2.0.npy, ..., 5.0.npy}
    dictData = {}
    for df in sorted(files):
        print(df)
        data = np.load(os.path.join(path, df))
        a = data
        lis = []
        for i in range(len(data)):
            # per-sample min-max scaling to [0, 1]
            old_range = np.amax(a[i]) - np.amin(a[i])
            new_range = 1 - 0
            f = ((a[i] - np.amin(a[i])) / old_range) * new_range + 0
            lis.append(f)
        dictData[df] = lis
    return dictData
After normalization I get the following result, where the first value of every sample is 0 and the last value is 1:
[0, ..., 1]  # first sample
[0, ..., 1]  # second sample
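As I understand it, the more common strategy in machine learning is to compute the min and max per feature (per column) over all training samples rather than per sample; that alternative would look something like this (file name taken from above):
import numpy as np

data = np.load("1.0.npy")  # shape (1000, 16), as described above
col_min = data.min(axis=0)  # per-feature minimum over all samples
col_max = data.max(axis=0)  # per-feature maximum over all samples
normalized = (data - col_min) / (col_max - col_min)  # each column now in [0, 1]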

LSTM with more features / classes

How can I use more than one feature/class as input/output on an LSTM using Sequential from Keras models in Python?
To be more specific, I would like to use as input and output to the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorical class with 100 different possible values indicating the sensor which collected the data;
FeatureB is an on/off indicator, being 0 or 1;
FeatureC is a categorical class too, having 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' in model.compile, the loss is over 10.0.
When normalising the data to values between 0 and 1 and using mean_squared_error as the loss, it averages 0.27. But when testing predictions, the results do not make any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression, and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there are 5 possible classes, you can think of it as trying to predict whether the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers, but with a regression your model is going to predict values like 3.24, which doesn't make sense.
In effect you're converting the values of FeatureC to 5 columns of binary values. Since it seems the classes are exclusive, there should be a single 1 and the rest of the columns in a row should be 0s. So if the first row is 'Red' it would be [1, 0, 0, 0, 0].
For best results you should also convert FeatureA to binary categorical features, for the same reason as above: sensor 80 is not 4x more than sensor 20, but a different entity.
The last layer of your model should be a softmax layer with 5 neurons; in the end, your model is going to be predicting the probability of each class.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd # assume that this has probably already been done though
feature_a = data.loc[:, "FeatureA"].values # pull out feature A
labels = data.loc[:, "FeatureC"].values # y are the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1 # subtract 1 from all labels values to zero base (0 - 4)
y = keras.utils.to_categorical(labels)
# y.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows, 1)
feature_a_data = keras.utils.to_categorical(feature_a)
# After: feature_a_data.shape=(rows, 100)
data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
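If it helps, a minimal model sketch matching this setup might look like the following (the hidden-layer size is just a guess; data and y are the arrays built above):
from keras.models import Sequential
from keras.layers import Dense

# softmax output over the 5 classes, as described above
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(101,)))  # 100 one-hot sensor cols + FeatureB
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(data, y, epochs=10, batch_size=32, validation_split=0.2)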
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

K-means clustering on 3 dimensions with sklearn

I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to work only on columns 1-3), but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import pandas as pd

# Import csv file with data in the following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv', index_col=['PM'])
numProjects = len(df)
K = numProjects // 3  # around three projects can be worked per day
print("Number of projects: ", numProjects)
print("K-clusters: ", K)
for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    # random_state helps ensure that the algorithm returns the
    # same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
    # These are our fitted labels for clusters --
    # the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_
    # Sum of distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_
    print("k:", k, " SSE:", SSE)
# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short .iloc is an integer based indexing method for selecting data by position.
Let's say you have the dataframe:
 A  B  C
 1  2  3
 4  5  6
 7  8  9
10 11 12
The use of iloc in the example you provided iloc[:,:] selects all rows and columns and produces the entire dataframe. In case you aren't familiar with Python's slice notation take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error iloc[1:4] selects the rows at index 1-3. This would result in:
 A  B  C
 4  5  6
 7  8  9
10 11 12
Now, if you think about what you are trying to do and the error you received, you will realize that you selected fewer samples from your data than the number of clusters you are looking for: 3 samples (rows 1, 2, 3), but you're telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and the columns corresponding to your lat, lng, and z values. To do this, just add a colon as the first argument to iloc, like so:
df.iloc[:, 1:4]
One caveat: since you passed index_col=['PM'] to read_csv, PM is the index rather than a column, so your three feature columns actually sit at positions 0-2 and the slice would be df.iloc[:, 0:3]. Either way, once you have selected all of your samples and the three feature columns, and assuming you have enough samples, KMeans should work as you intended.
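Putting it together, a minimal sketch of the fix (column names taken from your question; selecting by name sidesteps the positional ambiguity entirely):
from sklearn.cluster import KMeans
import pandas as pd

df = pd.read_csv('point_data_test.csv', index_col=['PM'])

# cluster on the three coordinate columns; the PM index stays intact
features = df[['Longitude', 'Latitude', 'DaysUntilDueDate']]
kmeans_model = KMeans(n_clusters=4, random_state=1).fit(features)
df['Labels'] = kmeans_model.labels_  # labels remain aligned with PM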

Classify stream of data using hidden markov models

Problem
In an online process consisting of different steps, I have data on people who complete the process and people who drop out. For each user, the data consists of a sequence of process steps per time interval, let's say a second.
An example of such a sequence for a user who completed the process would be [1,1,1,1,2,2,2,3,3,3,3,...,-1], where the user is in step 1 for four seconds,
followed by step 2 for three seconds and step 3 for four seconds, etc., before reaching the end of the process (denoted by -1).
An example of a drop-out would be [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2], where the user spends an excessive timespan in step 1, then 5 seconds in step 2, and then closes the webpage (so never reaching the end, -1).
Based on a model, I would like to be able to predict/classify online (as in 'real-time') the probability of the user completing the process or dropping out.
Approach
I have read about HMMs and would like to apply the following principle:
train one model using the sequences of people who completed the process
train another model using the sequences of people who did not complete the process
collect the stream of incoming data of an unseen user and, at each timestep, use the forward algorithm on each of the models to see which of the two is most likely to have produced this stream; the corresponding model then represents the label associated with the stream
What is your opinion? Is this doable? I have been looking at the Python libraries hmmlearn and pomegranate, but I cannot seem to create a small working example to test with. Some test code of mine can be found below, with some artificial data:
from pomegranate import *
import numpy as np

# generate data of some sample sequences of length 4
# mean and std of each step in the sequence
means = [1, 2, 3, 4]
stds = [0.1, 0.1, 0.1, 0.1]
num_data = 100
data = []
for mean, std in zip(means, stds):
    d = np.random.normal(mean, std, num_data)
    data.append(d)
data = np.array(data).T

# create model (based on the pomegranate sample code:
# https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb)
s1 = State(NormalDistribution(1, 1), name="s1")
s2 = State(NormalDistribution(2, 1), name="s2")
model = HiddenMarkovModel()
model.add_states([s1, s2])
model.add_transition(model.start, s1, 0.5, pseudocount=4.2)
model.add_transition(model.start, s2, 0.5, pseudocount=1.3)
model.add_transition(s1, s2, 0.5, pseudocount=5.2)
model.add_transition(s2, s1, 0.5, pseudocount=0.9)
model.bake()
#model.plot()

# fit model
model.fit(data, use_pseudocount=False, algorithm='baum-welch', verbose=False)

# get probability of a very clean sequence (the mean of each step)
p = model.probability([1, 2, 3, 4])
print(p)  # 3.51e-112
I would expect the probability of this very clean sequence to be close to 1, since the values are the means of each of the step distributions. How can I make this example better, and eventually apply it to my application?
Concerns
I am not sure what states and transitions my model should comprise. What is a 'good' model? How do you know that you need to add more states to the model to capture more of the structure in the data? The pomegranate tutorials are nice but insufficient for me to apply HMMs in this context.
Yes, an HMM is a viable way to do this, although it's a bit of overkill, since the FSM is a simple linear chain. The "model" can also be built from the mean and variation for each string length, and you can simply compare the distance of the partial string to each set of parameters, rechecking at each desired time point.
The states are simple enough:
1 ==> 2 ==> 3 ==> ... ==> done
Each state has a loop back to itself; this is the most frequent choice.
There is also a transition to "failed" from any state.
Thus, the Markov matrices will be sparse, something like:
          1    2    3    4   done  failed
1        0.8  0.1  0    0    0     0.1
2        0    0.8  0.1  0    0     0.1
3        0    0    0.8  0.1  0     0.1
4        0    0    0    0.8  0.1   0.1
done     0    0    0    0    1.0   0
failed   0    0    0    0    0     1.0
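To make the two-model idea from the question concrete, here is a minimal sketch (assuming the pre-1.0 pomegranate API used above; the training sequences are toy data for illustration):
from pomegranate import HiddenMarkovModel, DiscreteDistribution

# toy training data: step sequences for completers and drop-outs
completed = [[1, 1, 1, 2, 2, 3, 3, -1], [1, 1, 2, 2, 2, 3, -1]]
dropped = [[1, 1, 1, 1, 1, 1, 2, 2], [1, 1, 1, 1, 1, 1]]

# fit one HMM per outcome
model_done = HiddenMarkovModel.from_samples(DiscreteDistribution, n_components=4, X=completed)
model_drop = HiddenMarkovModel.from_samples(DiscreteDistribution, n_components=4, X=dropped)

# classify a partial stream by comparing log-likelihoods under each model
stream = [1, 1, 1, 1, 1, 2]
if model_done.log_probability(stream) > model_drop.log_probability(stream):
    print("more likely to complete")
else:
    print("more likely to drop out")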
