LSTM with more features / classes - python

How can I use more than one feature/class as input and output of an LSTM built with Keras's Sequential model in Python?
To be more specific, I would like to use as both input and output of the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorical variable with 100 possible values, identifying the sensor that collected the data;
FeatureB is an on/off indicator, either 0 or 1;
FeatureC is also a categorical variable, with 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data with loss='categorical_crossentropy' in model.compile, the loss is above 10.0.
When normalising the data to values between 0 and 1 and using mean_squared_error as the loss, the loss averages 0.27. But when I test the predictions, the results do not make any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.

You need to convert FeatureC to binary (one-hot) categories. Mean squared error is for regression, and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there are 5 possible classes, you can think of it as trying to predict whether the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers, but a regression model will predict values like 3.24, which doesn't make sense for a class.
In effect you're converting the values of FeatureC to 5 columns of binary values. Since it seems the classes are mutually exclusive, each row should have a single 1 with the rest of the columns 0. So if the first row is 'red' it would be [1, 0, 0, 0, 0].
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x more than sensor 20, but instead a different entity.
The last layer of your model should be a softmax layer with 5 neurons. In the end, your model will be predicting the probability of each class.
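As a rough illustration of that last point, here is a minimal sketch of an LSTM classifier ending in a 5-unit softmax layer. The window length (`timesteps = 10`) and the LSTM size (32) are assumptions made only for the example; the 101 input columns assume 100 one-hot sensor columns plus the on/off flag.

```python
import numpy as np
import keras

timesteps = 10     # assumed window length, not given in the question
n_features = 101   # 100 one-hot sensor columns + 1 on/off column
n_classes = 5      # FeatureC categories

model = keras.Sequential([
    keras.Input(shape=(timesteps, n_features)),
    keras.layers.LSTM(32),  # 32 units is an arbitrary choice
    keras.layers.Dense(n_classes, activation="softmax"),
])
# categorical_crossentropy expects one-hot targets of shape (rows, 5)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# A dummy batch, only to show the output shape: one probability per class
dummy_batch = np.zeros((4, timesteps, n_features), dtype="float32")
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)
```

Each output row sums to 1, so the predicted class is the column with the highest probability (`probs.argmax(axis=1)`).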
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd  # assume that this has probably already been done though
# .copy() so the in-place "-= 1" below does not touch the dataframe itself
feature_a = data.loc[:, "FeatureA"].values.copy()  # pull out feature A
labels = data.loc[:, "FeatureC"].values.copy()  # the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values  # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1  # subtract 1 from all label values to zero base (0 - 4)
y = keras.utils.to_categorical(labels, num_classes=5)
# y.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows,)
# num_classes keeps 100 columns even if some sensors are absent from the data
feature_a_data = keras.utils.to_categorical(feature_a, num_classes=100)
# After: feature_a_data.shape=(rows, 100)
X = np.hstack([feature_a_data, feature_b])  # new name, so the dataframe is not overwritten
# X.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py


How to modify the Timeseries forecasting for weather prediction example to increase the number of predictions?

(And plot them all in the same figure).
I've been following the "Timeseries forecasting for weather prediction" code found here:
https://keras.io/examples/timeseries/timeseries_weather_forecasting/
The article says:
"The trained model above is now able to make predictions for 5 sets of values from validation set."
And it uses this code to get predictions and plot them:
def show_plot(plot_data, delta, title):
    labels = ["History", "True Future", "Model Prediction"]
    marker = [".-", "rx", "go"]
    time_steps = list(range(-(plot_data[0].shape[0]), 0))
    if delta:
        future = delta
    else:
        future = 0
    plt.title(title)
    for i, val in enumerate(plot_data):
        if i:
            plt.plot(future, plot_data[i], marker[i], markersize=10, label=labels[i])
        else:
            plt.plot(time_steps, plot_data[i].flatten(), marker[i], label=labels[i])
    plt.legend()
    plt.xlim([time_steps[0], (future + 5) * 2])
    plt.xlabel("Time-Step")
    plt.show()
    return
for x, y in dataset_val.take(5):
    show_plot(
        [x[0][:, 1].numpy(), y[0].numpy(), model.predict(x)[0]],
        12,
        "Single Step Prediction",
    )
On my computer, in order to downsample the series to 1 hour, instead of using "sampling_rate=6" I have directly modified the frequency of the input data and I'm using "sampling_rate=1".
Now, assuming the model was fitted properly, what do I need to modify to get predictions for the next 500 intervals instead of just 5?
dataset_val.take(500)
Or something else?
The configuration at the beginning also says:
split_fraction = 0.715
train_split = int(split_fraction * int(df.shape[0]))
step = 6
past = 720
future = 72
learning_rate = 0.001
batch_size = 256
epochs = 10
What values do I need to use now for past and future (if my data has a frequency of 1 hour and I want to predict 500 points forward)?
future = 500
past = ? (it seems to be the number of timestamps taken backwards for training)
What about delta? It's fixed to 12, but it seems to be the value for future.
According to the source at
https://github.com/keras-team/keras-io/blob/master/examples/timeseries/timeseries_weather_forecasting.py
here is the model:
inputs = keras.layers.Input(shape=(inputs.shape[1], inputs.shape[2]))
lstm_out = keras.layers.LSTM(32)(inputs)
outputs = keras.layers.Dense(1)(lstm_out)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss="mse")
model.summary()
As you can see, it uses a Dense layer with 1 unit as the last layer. If you want, for example, 2 predictions, you should use 2 units in that final Dense layer, and be careful about the shapes of (X_train, Y_train) and (X_validation, Y_validation): by default each Y holds a single value per sample, so you will probably need to convert it.
Simple example:
Default X:1, Y:1 changes to X:1, Y:1,2
The Y data should then probably cover N shifted future steps, where N is exactly the number of units in the last (Dense) layer.
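To make the shifting of Y concrete, here is a minimal NumPy sketch that builds input windows and N-step targets from a 1-D series. The helper name `make_windows` and the toy series are made up for illustration; they are not part of the Keras example.

```python
import numpy as np

def make_windows(series, past, horizon):
    """Return X of shape (n, past) and Y of shape (n, horizon).

    Each Y row holds the `horizon` values that follow the matching X row.
    """
    series = np.asarray(series)
    n = len(series) - past - horizon + 1
    X = np.stack([series[i : i + past] for i in range(n)])
    Y = np.stack([series[i + past : i + past + horizon] for i in range(n)])
    return X, Y

series = np.arange(10)  # toy data: 0, 1, ..., 9
X, Y = make_windows(series, past=3, horizon=2)
print(X[0], Y[0])
print(X.shape, Y.shape)
```

With `horizon=2`, the last layer would be `Dense(2)`. For an LSTM, X also needs a trailing feature axis, e.g. `X[..., np.newaxis]` for a single feature.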
If you just want to predict over a bigger time frame, you can instead convert your whole dataset to a coarser frequency.
For example, if the default time frame of the (weather) data is hourly, you can convert the data to daily (x24) and predict daily values, or do the same thing with monthly data (x30) and predict monthly values, and so on.
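One possible way to do that conversion is with pandas `resample`. This is only a sketch: the column name `temperature` and the toy values are invented, and taking the mean is just one choice of aggregation.

```python
import numpy as np
import pandas as pd

# 48 hourly readings (two full days) of made-up data
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"temperature": np.arange(48, dtype=float)}, index=index)

# Aggregate every 24 hourly rows into one daily row
daily = hourly.resample("D").mean()
print(daily)
```

A model trained on `daily` then predicts one day per step, so the same `future` setting reaches 24 times further ahead in wall-clock time.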

Label encode then impute missing then inverse encoding

I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID 0.000000
Name 0.000000
Age 0.018653
Gender 0.000640
Race 0.317429
Date 0.000000
City 0.000320
State 0.000000
Manner_of_death 0.000000
Armed 0.454487
Mental_illness 0.000000
Flee 0.000000
dtype: float64
I created a copy of the original df to encode it and then impute missing values. My plan was:
Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
'Mental_illness', 'Flee'],
dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
Then, I located those nan values in the original dataframe (pf), to substitute those encoded nan's in lpfdf:
for col in lpfdf:
    print(col,"\n",len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the feature City had >3000 values, and it was not possible to locate it using value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))
Looking at any of the rows corresponding to these indices, I saw that the nan label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
Then I proceeded to substitute np.nan for these labels:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
Create the instance of iterative imputer and then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then make a dataframe for these new imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)
And finally, when I try to inverse transform to see the corresponding labels it imputed, I get the following error:
for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps?
Sorry for my long-winded explanation but I felt that I need to explain all the steps so that you can understand the issue properly. Thank you all.
A possibility that might be worth exploring is predicting missing categorical (encoded) values using a machine learning algorithm e.g. sklearn.ensemble.RandomForestClassifier.
Here, you would train a multiclass classification model for predicting missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g -99), and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set made from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.
Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.
Edit: Below is an example of what I mean, using dummy data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier
# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features, n_informative=3, n_repeated=16, n_redundant=0)
# convert to fake categorical data
features_og = (features_og*10).astype(int)
# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):
        if np.random.random() > 0.85:
            features[i,j] = -99
# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # do train test split based on whether the selected column value is -99.
    missing = features[:,j] == -99
    train = features[~missing]
    test = features[missing]
    # every column except the one being predicted
    other_cols = [x for x in range(n_features) if x != j]
    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators= 300,cat_features=[identify categorical features here])
    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:,other_cols], train[:,j])
    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:,other_cols])
    # substitute the missing values in column j with the predicted values
    features_fixed[missing,j] = preds
Your approach of encoding categorical values first and then imputing missing values is prone to problems and thus, not recommended.
Some imputing strategies, like IterativeImputer, do not guarantee that the output contains only previously known numeric values. This can result in imputed values which are unknown to the encoder and will cause an error upon the inverse transformation (which is exactly your case).
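A related thing that may be worth checking in the question's code, since it can trigger the same error: the loop calls fit_transform on one shared LabelEncoder, and every call refits it, so after the loop the encoder only remembers the classes of the last column. A minimal sketch, with made-up column values, of keeping one encoder per column so that each column can be inverted with its own classes:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up stand-ins for two of the categorical columns
columns = {
    "Race": np.array(["A", "B", "W", "H", "A", "W"]),
    "Flee": np.array(["No", "Yes", "No", "No", "Yes", "No"]),
}

encoders = {}  # one fitted encoder per column
encoded = {}
for name, values in columns.items():
    enc = LabelEncoder()
    encoded[name] = enc.fit_transform(values)
    encoders[name] = enc

# Each column is decoded with the encoder that was fitted on it
decoded = {
    name: encoders[name].inverse_transform(codes)
    for name, codes in encoded.items()
}
print(decoded["Race"])
```

This does not remove the imputer issue described above: codes outside the range seen during fitting would still fail to decode.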
It is better to first impute the missing values for both numeric and categorical features, and then encode the categorical features. One option would be to use SimpleImputer and replace missing values with the most frequent category or a new constant value.
Also, a note on LabelEncoder: it is clearly mentioned in its documentation that:
This transformer should be used to encode target values, i.e. y, and not the input X.
If you insist on an encoding strategy like LabelEncoder, you can use OrdinalEncoder which does the same but is actually meant for feature encoding. However, you should be aware that such an encoding strategy might falsely suggest an ordinal relationship between each category of a feature, which might lead to undesired consequences. You should therefore consider other encoding strategies as well.
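A minimal sketch of that order of operations (impute first, encode second), using SimpleImputer with the most frequent category followed by OrdinalEncoder. The tiny array below is made up and only stands in for the categorical columns of the real dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data with missing entries (columns: gender, race)
X = np.array([
    ["Male", "White"],
    ["Female", np.nan],
    ["Male", "Black"],
    [np.nan, "White"],
], dtype=object)

# 1) Impute on the raw categories, so no unknown values can appear
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X)

# 2) Encode afterwards; every value is now a category the encoder has seen
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_imputed)

# 3) The inverse transform is safe because nothing changed after encoding
X_decoded = encoder.inverse_transform(X_encoded)
print(X_decoded)
```

Because the imputation happens before encoding, the round trip returns exactly the imputed table.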
The entire process can be automated with the datawig package. You just need to create an imputation model for each to-be-imputed column and it will handle the encoding and inverse encoding by itself.
It was even tested against kNN and iterative imputer and showed better results.
Here is a personal guide.

get the size of dataset after applying a filter from tf.data.Dataset

I wonder how I can get the size (length) of a dataset after applying a filter. Using tf.data.experimental.cardinality gives -2, which is not what I am looking for. I want to know how many samples remain in my filtered dataset so that I can split it into training and validation datasets using take() and skip().
Example:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.filter(lambda x: x < 4)
size = tf.data.experimental.cardinality(dataset).numpy()
#size here is equal to -2 but I want to get the real size which is 3
My dataset contains images and their labels; this is just an illustrative example.
Taking a look at the documentation reveals that a cardinality of -2 (tf.data.experimental.UNKNOWN_CARDINALITY) means that TensorFlow is unable to determine the cardinality of the dataset. For your example, you can do:
# keep `dataset` itself intact so take() and skip() can still be used on it
size = len(list(dataset.as_numpy_iterator()))  # iterates the whole dataset once
print(size)
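For image datasets, loading every element into a Python list just to count them can be expensive. A possible alternative sketch counts with `Dataset.reduce` and then splits with take() and skip(); the 0.67 split ratio is an arbitrary assumption for the example.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.filter(lambda x: x < 4)

# Count elements by iterating once, without keeping them in memory
size = int(dataset.reduce(0, lambda count, _: count + 1).numpy())

n_train = int(0.67 * size)  # hypothetical split ratio
train_ds = dataset.take(n_train)
val_ds = dataset.skip(n_train)

print(size)
print(list(train_ds.as_numpy_iterator()), list(val_ds.as_numpy_iterator()))
```

The count still requires one full pass over the filtered data, since the filter has to be evaluated for every element.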

Scikit Learn SVM - Input types

Hi, I am starting to learn scikit-learn, but I am not interested in the iris data or Orlando real estate price examples that all the tutorials use. They do not make any sense to me. I want to use my own data, but I cannot figure out what input format should be used.
This is how my code looks:
import matplotlib.pyplot as plt
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100)
x,y = [[1,2], [2,4]]
clf.fit(x,y)
I always get the message:
ValueError: Found input variables with inconsistent numbers of
samples: [1, 2]
I tried many other formats like [[1],[1]] or 1,1.
So my simple question: in which format do I have to write this
"x,y = [[1,2], [2,4]]" for my data?
Also, how can I train a model to make a forecast? For example, I have 10 sports teams in one league.
In my table I have:
Team 1 | Team 2 | Result | location
So I want to figure out, if 2 teams play against each other, who will win; the location can of course be a factor.
I want to predict, if team A plays against team B at home, who is more likely to win.
How to enter your data:
The way you enter your data, x contains just a single sample with 2 features, whereas y provides 2 labels.
Use this notation to get 2 samples with one feature each: x,y = [[1],[4]],[2,4]
Or, just to make it more obvious:
x = [[1],[4]]
y = [2,4]
Btw.: Given that you are new to scikit-learn, you should definitely try to do the same with NumPy arrays.
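For reference, a minimal sketch of the same two-sample example with NumPy arrays. The test value `[[3]]` is arbitrary; the point is only that X is 2-D (samples by features) and y is 1-D.

```python
import numpy as np
from sklearn import svm

X = np.array([[1], [4]])  # shape (2, 1): 2 samples, 1 feature each
y = np.array([2, 4])      # shape (2,): one label per sample

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X, y)

# predict() also expects a 2-D array, even for a single sample
prediction = clf.predict(np.array([[3]]))
print(X.shape, y.shape, prediction)
```

Checking `X.shape` and `y.shape` before fitting is a quick way to catch the "inconsistent numbers of samples" error.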
Make a classification:
If you want to make a forecast about who wins, you need to follow a couple of steps:
- Split your data, so that your features ("teamA", "teamB" and "location") are contained in your training data and the results represent the labels, e.g.:
x = [[teamA1,teamB1,Loc1],[teamA2,teamB2,Loc2],[teamA3,teamB3,Loc3],...]
y = [result1,result2,result3,...]
- Fit your model as before
- Make a prediction given your test data, e.g.:
x_test = [[teamX,teamY,locX]] # data for which you want the forecast (2-D: one row per match)
clf.predict(x_test) # this returns the estimated result
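The steps above can be sketched end to end as follows. The team names, locations and results are invented, and one-hot encoding the team and location columns is an assumption about how the text values would be turned into numbers (SVC cannot fit on raw strings).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical matches: [team 1, team 2, location of team 1]
matches = [
    ["A", "B", "home"],
    ["B", "A", "home"],
    ["A", "C", "away"],
    ["C", "B", "home"],
    ["B", "C", "away"],
    ["C", "A", "home"],
]
# Hypothetical results: which side won each match
results = ["team1", "team2", "team1", "team2", "team1", "team2"]

# One-hot encode the categorical features, then fit the classifier
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    SVC(gamma="scale", C=100),
)
model.fit(matches, results)

# Forecast for a single match: still a list of rows
forecast = model.predict([["A", "B", "home"]])
print(forecast)
```

With only a handful of matches the forecast is not meaningful; the sketch is just meant to show the data layout.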

How to split Test and Train data such that there is guaranteed at least one of each Class in each

I have some fairly unbalanced data I am trying to classify.
However, it is classifying fairly well.
To evaluate exactly how well, I must split the data into training and test subsets.
Right now I am doing that by the very simple measure of:
import numpy as np
import pandas
corpus = pandas.DataFrame(..., columns=["data","label"]) # My data, simplified
train_index = np.random.rand(len(corpus))>0.2
training_data = corpus[train_index]
test_data = corpus[np.logical_not(train_index)]
This is nice and simple, but some of the classes occur very very rarely:
about 15 occur less than 100 times each in the corpus of over 50,000 cases, and two of them each occur only once.
I would like to partition my data corpus into test and training subsets such that:
- If a class occurs fewer than two times, it is excluded from both
- Each class occurs at least once in test and in training
- The split into test and training is otherwise random
I can throw together something to do this
(likely the simplest way is to remove things with fewer than 2 occurrences and then just resample until the split has both on each side), but I wonder if there is a clean method that already exists.
I don't think that sklearn.cross_validation.train_test_split will do for this, but that it exists suggests that sklearn might have this kind of functionality.
The following meets your 3 conditions for partitioning the data into test and training:
#get rid of items with fewer than 2 occurrences.
corpus=corpus[corpus.groupby('label').label.transform(len)>1]
# sklearn.cross_validation was removed in scikit-learn 0.20; the class now lives in sklearn.model_selection
from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=None)
train_index, test_index = next(sss.split(corpus, corpus['label']))
training_data=corpus.iloc[train_index]
test_data=corpus.iloc[test_index]
I've tested the code above by using the following fictitious dataframe:
import random
import numpy as np
import pandas as pd
#create random data with labels 0 to 39, then add a label with 2 cases (40) and a label with 1 case (41).
corpus=pd.DataFrame({'data':np.random.randn(49998),'label':np.random.randint(40,size=49998)})
corpus.loc[49998]=[random.random(),40]
corpus.loc[49999]=[random.random(),40]
corpus.loc[50000]=[random.random(),41]
Which produces the following output when testing the code:
test_data[test_data['label']==40]
Out[110]:
data label
49999 0.231547 40
training_data[training_data['label']==40]
Out[111]:
data label
49998 0.253789 40
test_data[test_data['label']==41]
Out[112]:
Empty DataFrame
Columns: [data, label]
Index: []
training_data[training_data['label']==41]
Out[113]:
Empty DataFrame
Columns: [data, label]
Index: []
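Rather than inspecting labels one at a time, the conditions can also be checked in one go by comparing the label sets of the two subsets. The sketch below uses a small made-up corpus and assumes a scikit-learn version where StratifiedShuffleSplit is in sklearn.model_selection.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up corpus: label 2 occurs twice, label 3 only once
labels = [0] * 10 + [1] * 6 + [2] * 2 + [3]
corpus = pd.DataFrame({"data": range(len(labels)), "label": labels})

# Drop classes with fewer than 2 occurrences
corpus = corpus[corpus.groupby("label")["label"].transform("size") > 1]

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_index, test_index = next(splitter.split(corpus, corpus["label"]))
training_data = corpus.iloc[train_index]
test_data = corpus.iloc[test_index]

# Every remaining class should appear on both sides of the split
train_labels = set(training_data["label"])
test_labels = set(test_data["label"])
print(train_labels == test_labels)
```

With a test_size other than 0.5 the allocation of very rare classes may differ, so it seems worth keeping a check like this next to the split.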
