Classify a stream of data using hidden Markov models - Python

Problem
In an online process consisting of different steps, I have data on the people who complete the process and the people who drop out. For each user, the data consists of a sequence of process steps, one per time interval (say, one second).
An example of such a sequence of a completed user would be [1,1,1,1,2,2,2,3,3,3,3....-1] where the user is in step 1 for four seconds,
followed by step 2 for three seconds and step 3 for four seconds etc before reaching the end of the process (denoted by -1).
An example of a drop out would be [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2] where the user would be spending an excessive timespan in step 1, then 5 seconds in step 2 and then closing the webpage (so not reaching the end (-1))
Based on a model, I would like to be able to predict/classify online (as in 'real-time') the probability of the user completing the process or dropping out.
Approach
I have read about HMMs and I would like to apply the following principle:
1. train one model using the sequences of people that completed the process
2. train another model using the sequences of people that did not complete the process
3. collect the stream of incoming data of an unseen user and, at each timestep, use the forward algorithm on each of the models to see which of the two is more likely to have produced this stream. The more likely model then gives the label for this stream.
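The principle above can be sketched in plain NumPy as a scaled forward algorithm that is evaluated on two models at every timestep. Everything numeric here is an assumption made for illustration: the three 0-indexed steps, the near-identity emission matrix, and the two hand-written transition matrices (a "completed" model that advances quickly and a "dropout" model that lingers). In practice each model's parameters would be fitted to its own set of sequences.

```python
import numpy as np

def running_loglik(start, trans, emit, obs):
    """Scaled forward algorithm: log P(obs[0..t]) after every time step t."""
    out, loglik, alpha = [], 0.0, None
    for t, o in enumerate(obs):
        if t == 0:
            alpha = start * emit[:, o]            # initialise with the first symbol
        else:
            alpha = (alpha @ trans) * emit[:, o]  # propagate, then weight by emission
        scale = alpha.sum()                       # P(o_t | o_0..o_{t-1})
        loglik += np.log(scale)
        alpha = alpha / scale                     # rescale to avoid underflow
        out.append(loglik)
    return np.array(out)

# Hypothetical parameters: 3 hidden states, observed symbols are step ids 0..2.
start = np.array([1.0, 0.0, 0.0])
emit = np.array([[0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90]])
trans_completed = np.array([[0.70, 0.30, 0.00],
                            [0.00, 0.70, 0.30],
                            [0.00, 0.00, 1.00]])
trans_dropout = np.array([[0.95, 0.05, 0.00],
                          [0.00, 0.95, 0.05],
                          [0.00, 0.00, 1.00]])

def log_odds_completed(obs):
    """Running log-likelihood ratio, completed model vs dropout model."""
    ll_c = running_loglik(start, trans_completed, emit, obs)
    ll_d = running_loglik(start, trans_dropout, emit, obs)
    return ll_c - ll_d

fast_user = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
slow_user = [0] * 11 + [1] * 5
print(log_odds_completed(fast_user)[-1])  # positive: looks like a completer
print(log_odds_completed(slow_user)[-1])  # negative: looks like a dropout
```

The value at index t only depends on the observations up to t, so the same computation can be run incrementally on a live stream. Adding the log prior odds of the two classes and applying a sigmoid would turn the log-odds into a posterior probability of completion.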
What is your opinion? Is this doable? I have been looking at the Python libraries hmmlearn and pomegranate, but I can not seem to create a small working example to test with. Some test code of mine can be found below with some artificial data:
from pomegranate import *
import numpy as np
# generate data of some sample sequences of length 4
# mean and std of each step in sequence
means = [1,2,3,4]
stds = [0.1, 0.1, 0.1, 0.1]
num_data = 100
data = []
for mean, std in zip(means, stds):
    d = np.random.normal(mean, std, num_data)
    data.append(d)
data = np.array(data).T
# create model (based on sample code of pomegranate https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb)
s1 = State( NormalDistribution( 1, 1 ), name="s1" )
s2 = State( NormalDistribution( 2, 1 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, pseudocount=4.2 )
model.add_transition( model.start, s2, 0.5, pseudocount=1.3 )
model.add_transition( s1, s2, 0.5, pseudocount=5.2 )
model.add_transition( s2, s1, 0.5, pseudocount=0.9 )
model.bake()
#model.plot()
# fit model
model.fit( data, use_pseudocount=False, algorithm = 'baum-welch', verbose=False )
# get probability of very clean sequence (mean of each step)
p = model.probability([1,2,3,4])
print(p) # 3.51e^-112
I would expect here that the probability of this very clean sequence would be close to 1, since the values are the means of each of the distributions of the steps. How can I make this example better and eventually apply it for my application?
Concerns
I am not sure what states and transitions my model should consist of. What is a 'good' model? How can you know whether the model needs more states to be expressive enough for the data? The pomegranate tutorials are nice but insufficient for me to apply HMMs in this context.

Yes, the HMM is a viable way to do this, although it's a bit of overkill, since the FSM is a simple linear chain. The "model" can also be built from mean and variation for each string length, and you can simply compare the distance of the partial string to each set of parameters, rechecking at each desired time point.
The states are simple enough:
1 ==> 2 ==> 3 ==> ... ==> done
Each state has a loop back to itself; this is the most frequent transition.
There is also a transition to "failed" from any state.
Thus, the Markov Matrices will be sparse, something like
         1    2    3    4    done failed
1        0.8  0.1  0    0    0    0.1
2        0    0.8  0.1  0    0    0.1
3        0    0    0.8  0.1  0    0.1
4        0    0    0    0.8  0.1  0.1
done     0    0    0    0    1.0  0
failed   0    0    0    0    0    1.0
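As a rough illustration of what such a matrix implies, the sketch below treats it as an absorbing Markov chain and computes, for each step, the probability of eventually ending in "done" versus "failed". The numbers are only the illustrative values from the matrix above, not estimates from real data, and the variable names are made up for this example.

```python
import numpy as np

states = ["1", "2", "3", "4", "done", "failed"]
P = np.array([
    [0.8, 0.1, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.8, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

Q = P[:4, :4]   # transitions among the transient steps 1..4
R = P[:4, 4:]   # transitions from a step into "done" / "failed"

# Absorption probabilities B solve (I - Q) B = R.
absorb = np.linalg.solve(np.eye(4) - Q, R)

for name, (p_done, p_failed) in zip(states[:4], absorb):
    print(f"step {name}: P(done)={p_done:.4f}  P(failed)={p_failed:.4f}")
```

With these particular values a user in step 4 has an even chance of finishing, and the chance halves for each earlier step. With transition probabilities estimated from the real sequences, the same calculation gives a baseline completion probability per step.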

Related

How to implement a Forward Selection using KNN?

I am trying to use a wrapper method in Python to implement a simple forward selection using KNN from the data I have.
My data:
ID S_LENGTH S_WIDTH P_LENGTH P_WIDTH SPECIES
------------------------------------------------------------------
1 3.5 2.5 5.6 1.7 VIRGINICA
2 4.5 5.6 3.4 8.7 SETOSA
This is where I have defined X and y:
X = df[['S_LENGTH', 'S_WIDTH', 'P_LENGTH', 'P_WIDTH']].values
y = df['SPECIES'].values
This is a simple KNN model:
clf = neighbors.KNeighborsClassifier()
clf.fit(X_fs,y)
predictions = clf.predict(X_fs)
metrics.accuracy_score(y, predictions)
Therefore, how would I implement a KNN model using forward selection?
Thanks!
I do not believe that KNN has a built-in feature importance, so you have basically three options. First, you can use a model-agnostic version of feature importance, like permutation importance.
Second, you can try adding one feature at a time at each step, and pick the model that most increases performance.
Third (closely related to the second), just try every combination! Since you only have 4 features, and assuming you don't have too much data, you could simply try all combinations of features. There are 4 models with one feature, 6 (4 choose 2) with two features, 4 with three, and 1 with all four. With only 15 models, that is cheap enough to evaluate exhaustively.
So something like this:
feat_lists = [
    ['S_LENGTH'],
    ['S_WIDTH'],
    ...
    ['S_LENGTH', 'S_WIDTH', 'P_LENGTH'],
    ['S_LENGTH', 'S_WIDTH', 'P_WIDTH'],
    ...
    ['S_LENGTH', 'S_WIDTH', 'P_LENGTH', 'P_WIDTH']
]
for feats in feat_lists:
    X = df[feats].values
    y = df['SPECIES'].values
    # ...all your other code...
    print(feats)
    print(metrics.accuracy_score(y, predictions))
To clarify, I'm assuming that's not actually your data, but only the first two rows, correct? If you only have two rows, you have bigger problems :)
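The second option (adding one feature at a time) could look roughly like the sketch below. It is a minimal greedy loop, not a definitive implementation: the helper name `forward_select`, the synthetic data, and the choice of 5-fold cross-validated accuracy as the score are all assumptions made for this example. The column names are borrowed from the question, and only P_LENGTH is made informative so the behaviour is easy to check.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, names, cv=5):
    """Greedy forward selection; returns [(feature_name, cv_accuracy), ...]."""
    remaining = list(range(X.shape[1]))
    chosen, history = [], []
    best_score = -np.inf
    while remaining:
        # Score every candidate feature when added to the current subset.
        scores = {}
        for j in remaining:
            cols = chosen + [j]
            clf = KNeighborsClassifier()
            scores[j] = cross_val_score(clf, X[:, cols], y, cv=cv).mean()
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no candidate improves the score: stop
        best_score = scores[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
        history.append((names[j_best], best_score))
    return history

# Synthetic stand-in for the real dataframe (assumption: 200 rows, 2 classes).
rng = np.random.default_rng(0)
names = ['S_LENGTH', 'S_WIDTH', 'P_LENGTH', 'P_WIDTH']
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 2] += 3 * y  # only P_LENGTH carries class information

steps = forward_select(X, y, names)
for name, score in steps:
    print(name, round(score, 3))
```

Scoring with cross-validation rather than training accuracy matters here, because KNN evaluated on its own training points tends to look better than it is. scikit-learn also ships `SequentialFeatureSelector` (with `direction='forward'`), which wraps the same idea.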

Odd linear model results

I'm unit acceptance testing some code I wrote. It's conceivable that at some point in the real world we will have input data where the dependent variable is constant. Not the norm, but possible. A linear model should yield coefficients of 0 in this case (right?), which is fine and what we would want -- but for some reason I'm getting some wild results when I try to fit the model on this use case.
I have tried 3 models and get different weird results every time -- or no results in some cases.
For this use case all of the dependent observations are set at 100, all the freq_weights are set at 1, and the independent variables are a binary coded dummy set of 20 features.
In total there are 150 observations.
Again, this data is unlikely in the real world but I need my code to be able to work on this ugly data. IDK why I'm getting such erroneous and different results.
As I understand with no variance in the dependent variable I should be getting 0 for all my coefficients.
freq = freq['Freq']
Indies = sm.add_constant(df)
model = sm.OLS(df1, Indies)
res = model.fit()
res.params
yields:
const 65.990203
x1 17.214836
reg = statsmodels.GLM(df1, Indies, freq_weights = freq)
results = reg.fit(method = 'lbfgs', max_start_irls=0)
results.params
yields:
const 83.205034
x1 82.575228
reg = statsmodels.GLM(df1, Indies, freq_weights = freq)
result2 = reg.fit()
result2.params
yields
PerfectSeparationError: Perfect separation detected, results not available
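One thing that may be worth checking, sketched below with NumPy only: with a constant dependent variable the expected fit is an intercept of 100 and zero slopes, but that holds only if the design matrix has full column rank. The sketch assumes (this is not confirmed by the question) that the 20 dummies form one complete one-hot set, which makes them collinear with the added constant. In that case least squares has infinitely many exact solutions, and a minimum-norm solver spreads the constant across the intercept and the dummies.

```python
import numpy as np

n_obs, n_levels = 150, 20
levels = np.arange(n_obs) % n_levels   # hypothetical category per observation
dummies = np.eye(n_levels)[levels]     # complete one-hot coding (assumption)
y = np.full(n_obs, 100.0)              # constant dependent variable
const = np.ones((n_obs, 1))

# Full-rank design: constant plus all dummies except one reference level.
X_ok = np.hstack([const, dummies[:, 1:]])
beta_ok, _, rank_ok, _ = np.linalg.lstsq(X_ok, y, rcond=None)

# Rank-deficient design: constant plus the complete dummy set.
X_full = np.hstack([const, dummies])
beta_full, _, rank_full, _ = np.linalg.lstsq(X_full, y, rcond=None)

print(rank_ok, X_ok.shape[1])      # rank equals the number of columns
print(beta_ok[:3])                 # intercept 100, slopes 0
print(rank_full, X_full.shape[1])  # rank is one less than the number of columns
print(beta_full[:3])               # nonzero "coefficients", same fitted values
```

Both fits reproduce y exactly; only the second has coefficients that look wild. Different solvers can pick different members of that solution set, which would be consistent with getting different numbers from different fitting methods.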

How to Include continuous and categorical predictors in Keras LSTM?

I want to use Keras LSTM (or similar) to forecast energy consumption of businesses based on:
1. historical consumption data
2. some numerical features (e.g. total yearly consumption)
3. some categorical features (e.g. business type)
This is a cold-start problem because, while 2. and 3. are present both for the training and the test set, 1. is not, i.e. I am trying to predict consumption of new businesses for which there is no historical data.
My question is: how to structure the dataframe and the RNN to accommodate both 2. (numerical features) and 3. (categorical data) as my predictors?
Here is a made-up example of the data:
# generate x (predictors dataframe)
import pandas as pd
x = pd.DataFrame({'ID':[0,1,2,3],'business_type':[0,2,2,1], 'contract_type':[0,0,2,1], 'yearly_consumption':[1000,200,300,900], 'n_sites':[9,1,2,5]})
print(x)
# note: the first 2 are categorical and the second 2 are numerical
ID business_type contract_type yearly_consumption n_sites
0 0 0 0 1000 9
1 1 2 0 200 1
2 2 2 2 300 2
3 3 1 1 900 5
# generate y (timeseries data)
import numpy as np
time_series = []
data_length = 6
period = 1
for k in range(4):
    level = 10 * np.random.rand()
    seas_amplitude = (0.1 + 0.3*np.random.rand()) * level
    sig = 0.05 * level # noise parameter (constant in time)
    time_ticks = np.array(range(data_length))
    source = level + seas_amplitude*np.sin(time_ticks*(2*np.pi)/period)
    noise = sig*np.random.randn(data_length)
    data = source + noise
    time_series.append(pd.Series(data=data, index=['t0','t1','t2','t3','t4','t5']))
y = pd.DataFrame(time_series)
print(y)
t0 t1 t2 t3 t4 t5
0 9.611984 8.453227 8.153665 8.801166 8.208920 8.399184
1 2.139507 2.118636 2.160479 2.216049 1.943978 2.008407
2 0.131757 0.133401 0.135168 0.141212 0.136568 0.123730
3 5.990021 6.219840 6.637837 6.745850 6.648507 5.968953
# note: the real data has thousands of data points (one year with half hourly frequency)
# note: the first row belongs to ID = 0 in x, the second row to ID = 1 etc.
I have looked extensively online, and there seems to be no example where categorical, numerical and time-series data are all used together. For a simple forecasting problem, this post explains that in order to learn from the previous time period, the LSTM must be fed something like this:
# process df for a classical forecasting problem for first ID
y_lstm = pd.DataFrame(y.iloc[0,:])
y_lstm.columns = ['t']
y_lstm['t-1'] = y_lstm['t'].shift()
print(y_lstm)
t t-1
t0 9.611984 NaN
t1 8.453227 9.611984
t2 8.153665 8.453227
t3 8.801166 8.153665
t4 8.208920 8.801166
t5 8.399184 8.208920
# note: t-1 represents the previous time point
However, while this works for a single timeseries, it is unclear how to structure the dataset when there are multiple timeseries, and how to include the rest of the predictors in this structure.
This post talks about how to include both categorical and numerical variables through embedding, but does not fit my problem, where time-series data also has to be included. This post discusses the choice between one-hot encoding and embedding without any example code and does not answer my question.
Could anyone please provide me with example code on how to structure the data appropriately for the RNN and/or what a simple LSTM structure with Keras would look like? Note that this structure should be able to use the time-series data for training, but not for predictions (i.e. only x and not y is available for the test set).
Thank you very much in advance.

What is the meaning of normalization in machine learning language? Does it correspond to one sample?

I am dealing with a classification problem in which I want to classify data into 2 classes. I generate 1000 samples at each of several temperatures ranging from 1 to 5 and load the data using the following function, load_data. Here "data" is a 2-dimensional array of shape (1000, 16): the rows are the samples stored in "1.0.npy" (and similarly for the other temperatures), and 16 is the number of features. I picked the max and min values of each sample in a for loop. But I'm afraid my normalization is not correct, because I'm not sure what the usual normalization strategy is in machine learning. Should I use np.amax of each sample, or np.amax over the whole "1.0.npy" file, i.e. over all 1000 samples it contains? My goal is to normalize the data to between 0 and 1.
import os
import numpy as np

def load_data():
    path = "./directory"
    files = sorted(os.listdir(path)) # {1.0.npy, 2.0.npy,.....5.0.npy}
    dictData = {}
    for df in sorted(files):
        print(df)
        data = np.load(os.path.join(path, df))
        a = data
        lis = []
        for i in range(len(data)):
            # rescale sample i to [0, 1] using its own min and max
            old_range = np.amax(a[i]) - np.amin(a[i])
            new_range = 1 - 0
            f = ((a[i] - np.amin(a[i])) / old_range)*new_range + 0
            lis.append(f)
After normalization I get the following result, where the first value of every sample is 0 and the last value is 1.
[0, ...., 1] #first sample
[0,.....,1] #second sample
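To make the two candidate strategies concrete, here is a small NumPy sketch that applies min-max scaling once per sample (per row, as in the loop above) and once per feature (per column, over all samples). The random array merely stands in for one of the .npy files, and the helper name `minmax` is made up for this example. Per-feature scaling is what tools such as scikit-learn's MinMaxScaler do; whether per-sample scaling is also meaningful depends on what the 16 values of a sample represent.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 16))  # stand-in for 1.0.npy

def minmax(x, axis):
    """Rescale to [0, 1] along the given axis (assumes max > min everywhere)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / (hi - lo)

per_sample = minmax(data, axis=1)   # each row uses its own min and max
per_feature = minmax(data, axis=0)  # each column uses min and max over all rows

print(per_sample.min(axis=1)[:3], per_sample.max(axis=1)[:3])
print(per_feature.min(axis=0)[:3], per_feature.max(axis=0)[:3])
```

With per-feature scaling, the min and max should be computed on the training data only and then reused unchanged for validation and test data, so that all sets are mapped through the same transformation.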

LSTM with more features / classes

How can I use more than one feature/class as input/output on a LSTM using Sequential from Keras Models in Python?
To be more specific, I would like to use as input and output to the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorical class with 100 different possible values indicating the sensor which collected the data;
FeatureB is an on/off indicator, being 0 or 1;
FeatureC is a categorical class too, having 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' in model.compile, the loss is over 10.0.
When normalizing the data to values between 0 and 1 and using mean_squared_error as the loss, I get an average loss of 0.27. But when testing it on prediction, the results do not make any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression and, as best I can tell, you are trying to predict which class a certain combination of sensors and states belongs to. Since there are 5 possible classes, you can think of it as trying to predict whether the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers, but with a regression your model is going to be predicting values like 3.24, which doesn't make sense.
In effect you're converting values of FeatureC to 5 columns of binary values. Since it seems like the classes are exclusive there should be a single 1 and the rest of the columns of a row would be 0s. So if the first row is 'red' it would be [1, 0, 0, 0, 0]
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x more than sensor 20, but instead a different entity.
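As a small illustration of the conversion described above, the sketch below one-hot encodes the FeatureC column of the six example rows from the question using plain NumPy. The array name `feature_c` is made up for this example; the values are 1-based, so they are shifted to 0-based before indexing.

```python
import numpy as np

# FeatureC values from the six example rows in the question (1..5).
feature_c = np.array([5, 2, 5, 1, 4, 3])
n_classes = 5

# Row i of the identity matrix is the one-hot vector for class i (0-based).
one_hot = np.eye(n_classes, dtype=int)[feature_c - 1]

print(one_hot[0])  # [0 0 0 0 1]
print(one_hot.shape)

# The original labels can be recovered with argmax.
recovered = one_hot.argmax(axis=1) + 1
```

Each row contains exactly one 1, which is the target format a 5-unit softmax output trained with categorical cross-entropy expects.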
The last layer of your model should be a softmax layer with 5 neurons. Basically, your model is going to predict the probability of each class.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd # assume that this has probably already been done though
feature_a = data.loc[:, "FeatureA"].values # pull out feature A
labels = data.loc[:, "FeatureC"].values # y are the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1 # subtract 1 from all label values to make them zero-based (0 - 4)
y = keras.utils.to_categorical(labels, num_classes=5)
# y.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows,)
feature_a_data = keras.utils.to_categorical(feature_a, num_classes=100)
# After: feature_a_data.shape=(rows, 100)
data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
