Getting a probability distribution curve from a TensorFlow model

I am trying to learn TensorFlow, so I tried to build a probabilistic ML model that gives the probability distribution of the next day's stock price based on the price sequence of the last n days. I managed to predict the next day's price, but I am not getting a probability distribution out of the model. How do I get the curve that the model's predictions are based on from the TensorFlow model?
This is the code I have so far, which predicts the actual price for the next day (based on this video: https://www.youtube.com/watch?v=PuZY9q-aKLw):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfpl = tfp.layers
tfk = tf.keras
tfkl = tf.keras.layers
# preparing data
min_max_scaler = MinMaxScaler(feature_range=(0,1)) # a scaler object which normalizes the number between 0 and 1 in relation to the rest of the dataset
close_prices = price_data.loc[:,'Close'].values # all the close prices of the price_data df
scaled_close_prices = min_max_scaler.fit_transform(close_prices.reshape(-1,1)) # all the close prices, normalized between 1 and 0
n = 60 # the number of days that are used to determine the next value
# x contains the last n days before the predicted day which is y
X_train = [] # 70% of the dataset
y_train = []
X_test = [] # 30% of the dataset
y_test = []
for x in range(n, int(len(scaled_close_prices)*0.7)):
    X_train.append(scaled_close_prices[x-n:x,0])
    y_train.append(scaled_close_prices[x,0])
for x in range(int(len(scaled_close_prices)*0.7), len(scaled_close_prices)):
    X_test.append(scaled_close_prices[x-n:x,0])
    y_test.append(scaled_close_prices[x,0])
X_train, y_train = np.array(X_train), np.array(y_train)
X_test, y_test = np.array(X_test), np.array(y_test)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
# building the model
model = tfk.Sequential()
model.add(tfkl.LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(tfkl.Dropout(0.2))
model.add(tfkl.LSTM(units=50, return_sequences=True))
model.add(tfkl.Dropout(0.2))
model.add(tfkl.LSTM(units=50))
model.add(tfkl.Dropout(0.2))
model.add(tfkl.Dense(units=1))
# compiling model
model.compile(optimizer=tfk.optimizers.Adam(learning_rate=0.05),
loss='mean_squared_error',
metrics=[])
# fitting model
model.fit(X_train, y_train, epochs=25, batch_size=32)
# testing model
y_predicted = model.predict(X_test)
y_predicted = min_max_scaler.inverse_transform(y_predicted).reshape(-1)

Probabilistic modelling is one of the things that tensorflow_probability (tfp) can do.
It can do probabilistic time series forecasting, e.g. see https://www.tensorflow.org/probability/examples/Structural_Time_Series_Modeling_Case_Studies_Atmospheric_CO2_and_Electricity_Demand
However, I don't believe it includes a variational LSTM in its toolbox, and such models may be close to the current research frontier, e.g. see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7842932/

As written, this model is not especially probabilistic per se, but it can be viewed as outputting an estimate of the mean of a Gaussian distribution with unit variance; this is implicit in using a squared error loss. To make this more concrete, you could instantiate a tfd.Normal distribution with the output of your model as the loc parameter and scale=1., and call prob(...) to evaluate the pdf of this distribution. I'm not sure if this is what you're looking for, though.
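To illustrate what that curve would look like, here is a minimal sketch in plain NumPy that evaluates the same unit-variance Gaussian density that tfd.Normal(loc=..., scale=1.).prob(...) would return. The value of predicted_mean is a made-up stand-in for one model output, and gaussian_pdf is a helper written for this example, not part of TensorFlow Probability.

```python
import numpy as np

def gaussian_pdf(y, loc, scale=1.0):
    """Density of a Normal(loc, scale) distribution evaluated at y."""
    z = (y - loc) / scale
    return np.exp(-0.5 * z ** 2) / (scale * np.sqrt(2.0 * np.pi))

# Hypothetical point prediction from the network (in scaled units).
predicted_mean = 0.62

# Grid of candidate next-day values centred on the prediction.
grid = np.linspace(predicted_mean - 4.0, predicted_mean + 4.0, 401)

# The "curve": density of each candidate value under the implied Gaussian.
density = gaussian_pdf(grid, loc=predicted_mean, scale=1.0)

print(grid[np.argmax(density)])  # the peak sits at the predicted mean
```

Note that the width of this curve is fixed by the assumed scale of 1 rather than learned from the data, so it says nothing about how uncertain the model actually is; it only makes the implicit assumption behind the squared error loss visible.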

Related

ANN problem: I am building an ANN model to predict the profit of a new startup based on certain features

The image of the dataset
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
Loading the data set using pandas as data frame format
import pandas as pd
df = pd.read_csv(r"E:\50_Startups.csv")
df.drop(['State'],axis = 1, inplace = True)
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
df.iloc[:,:] = mm.fit_transform(df.iloc[:,:])
info = df.describe()
x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split( x,y, test_size=0.2, random_state=42)
Initializing the model
model = Sequential()
model.add(Dense(40,input_dim =3,activation="relu",kernel_initializer='he_normal'))
model.add(Dense(30,activation="relu"))
model.add(Dense(1))
model.compile(loss="mean_squared_error",optimizer="adam",metrics=["accuracy"])
fitting model on train data
model.fit(x=x_train,y=y_train,epochs=150, batch_size=32,verbose=1)
Evaluating the model on test data
eval_score_test = model.evaluate(x_test,y_test,verbose = 1)
I am getting zero accuracy.

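The problem is that accuracy is a metric for discrete values (classification).
For a regression problem you should use a metric such as:
R² score
MAPE (mean absolute percentage error)
SMAPE (symmetric mean absolute percentage error)
instead.
E.g.:
model.compile(loss="mean_squared_error",optimizer="adam",metrics=["mean_absolute_percentage_error"])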
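For reference, the three metrics named above can be sketched in a few lines of NumPy. The function names and the sample arrays below are illustrative only; in practice you would more likely use the built-in Keras or scikit-learn versions.

```python
import numpy as np

def r2_score(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    # Symmetric variant: error relative to the mean magnitude of both values.
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

# Made-up profits and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

print(r2_score(y_true, y_pred), mape(y_true, y_pred), smape(y_true, y_pred))
```

One caveat: MAPE divides by the true value, so it is undefined whenever a target is exactly 0, which does happen when the target column has been min-max scaled to [0, 1].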

Adding to the answer of @GuintherKovalski: accuracy is not meant for regression, but if you still want to use it, you can combine it with a threshold using the following steps:
Set a threshold such that if the absolute difference between the predicted value and the actual value is less than or equal to the threshold, you count that prediction as correct, otherwise as incorrect.
Example: predicted values = [0.3, 0.7, 0.8, 0.2], original values = [0.2, 0.8, 0.5, 0.4].
The absolute differences are [0.1, 0.1, 0.3, 0.2]; take a threshold of 0.2. With this threshold, correct = [1, 1, 0, 1] and the accuracy is correct.sum()/len(correct), i.e. 3/4 = 0.75.
This could be implemented in TensorFlow like this:
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_regression
data = make_regression(10000)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(100,))])
def custom_metric(a, b):
    threshold = 1 # Choose accordingly
    abs_diff = tf.abs(b - a)
    # a prediction counts as correct when it is within the threshold
    correct = abs_diff <= threshold
    correct = tf.cast(correct, dtype=tf.float32)
    res = tf.math.reduce_mean(correct)
    return res
model.compile('adam', 'mae', metrics=[custom_metric])
model.fit(data[0], data[1], epochs=30, batch_size=32)
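The worked example from the steps above can be checked without training anything. This is a small NumPy sketch; threshold_accuracy is a name chosen for this example, not a Keras or scikit-learn function.

```python
import numpy as np

def threshold_accuracy(y_true, y_pred, threshold):
    """Fraction of predictions whose absolute error is within the threshold."""
    abs_diff = np.abs(np.asarray(y_pred) - np.asarray(y_true))
    correct = abs_diff <= threshold
    return correct.mean()

predicted = [0.3, 0.7, 0.8, 0.2]
original = [0.2, 0.8, 0.5, 0.4]

acc = threshold_accuracy(original, predicted, threshold=0.2)
print(acc)  # 0.75
```

The result depends entirely on the chosen threshold, so it should be reported together with that threshold rather than as a standalone accuracy.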

Just want to say Thank you to everyone who took their precious time to help me. I am posting this code as this worked for me. I hope it helps everyone who is stuck somewhere looking for answers. I got this code after consulting with my friend.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
import pandas as pd
from sklearn.model_selection import train_test_split
# Loading the data set using pandas as data frame format
startups = pd.read_csv(r"E:\0Assignments\DL_assign\50_Startups.csv")
startups = startups.drop("State", axis =1)
train, test = train_test_split(startups, test_size = 0.2)
x_train = train.iloc[:,0:3].values.astype("float32")
x_test = test.iloc[:,0:3].values.astype("float32")
y_train = train.Profit.values.astype("float32")
y_test = test.Profit.values.astype("float32")
def norm_func(i):
    x = ((i-i.min())/(i.max()-i.min()))
    return (x)
x_train = norm_func(x_train)
x_test = norm_func(x_test)
y_train = norm_func(y_train)
y_test = norm_func(y_test)
# one hot encoding outputs for both train and test data sets
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
# Storing the number of classes into the variable num_of_classes
num_of_classes = y_test.shape[1]
x_train.shape
y_train.shape
x_test.shape
y_test.shape
# Creating a user defined function to return the model for which we are
# giving the input to train the ANN model
def design_mlp():
    # Initializing the model
    model = Sequential()
    model.add(Dense(500,input_dim =3,activation="relu"))
    model.add(Dense(200,activation="tanh"))
    model.add(Dense(100,activation="tanh"))
    model.add(Dense(50,activation="tanh"))
    model.add(Dense(num_of_classes,activation="linear"))
    model.compile(loss="mean_squared_error",optimizer="adam",metrics=["accuracy"])
    return model
# building an MLP model using train data set and validating on test data set
model = design_mlp()
# fitting model on train data
model.fit(x=x_train,y=y_train,batch_size=100,epochs=10)
# Evaluating the model on test data
eval_score_test = model.evaluate(x_test,y_test,verbose = 1)
print ("Accuracy: %.3f%%" %(eval_score_test[1]*100))
# accuracy score on train data
eval_score_train = model.evaluate(x_train,y_train,verbose=0)
print ("Accuracy: %.3f%%" %(eval_score_train[1]*100))

Solar power prediction using Keras

Dataset:
The PV Yield (kWh) is my output. My model is supposed to predict this.
This is what I have done. I have attached the image of the dataset. The columns from AirTemp to Zenith are my X, and Y is PV Yield (kWh).
df=pd.read_csv("Data1.csv")
X=df.drop(['Date-PrimaryKey','output-PV Yield (kWh)'],axis=1)
Y=df['output-PV Yield (kWh)']
pca = PCA(n_components=9)
pca.fit(X_train)
X_train = pca.transform(X_train)
pca.fit(X_test)
X_test = pca.transform(X_test)
#normalizing the input values to fall in -1 to 1
X_train = X_train/180000000.0
X_test = X_test/180000000.0
#Creating Model
model = Sequential()
model.add(Dense(15, input_shape=(9,)))
model.add(Activation('tanh'))
model.add(Dense(11))
model.add(Activation('tanh'))
model.add(Dense(1))
model.summary()
sgd = optimizers.SGD(lr=0.1,momentum=0.2)
model.compile(loss='mean_absolute_error',optimizer=sgd,metrics=['accuracy'])
#Training
model.fit(X_train, train_y, epochs=20, batch_size = 50, validation_data=(X_test, test_y))
My weights are not getting updated. Accuracy is zero in all epochs.

The model seems OK, but there are two problems I can spot quickly:
pca = PCA(n_components=9)
pca.fit(X_train)
X_train = pca.transform(X_train)
pca.fit(X_test)
X_test = pca.transform(X_test)
Anything used to transform the data must not be fit on the test data. You fit it on the training samples and then use it to transform both the training and test parts. You should assume that you know nothing about the data you will be predicting on in production, e.g. you know nothing about tomorrow's weather, the results of sports matches a month from now, etc. You won't be able to fit on such data then, so you can't do so during training. The correct way:
pca = PCA(n_components=9)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
The second, more serious problem is here:
#normalizing the input values to fall in -1 to 1
X_train = X_train/180000000.0
X_test = X_test/180000000.0
Of course you want to normalize your data, but this way you will end up with extremely small values in columns where the values are low, e.g. AlbedoDaily, and quite large values in columns where they are high, such as SurfacePressure. For this kind of scaling you can use existing classes such as StandardScaler. The code is very simple and each column is treated independently:
from sklearn.preprocessing import StandardScaler
transformer = StandardScaler().fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)
You have not provided or explained what your target variable is and where you get it from, so there could be other problems in your code that I cannot see right now.
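To show what "each column is treated independently" and "fit on train only" mean in practice, here is a rough NumPy sketch of the computation that StandardScaler performs. The two columns are invented stand-ins for a small-valued feature (like an albedo) and a large-valued one (like a surface pressure); they are not taken from the asker's dataset.

```python
import numpy as np

# Hypothetical features on very different scales.
X_train = np.array([[0.18, 101200.0],
                    [0.22, 100800.0],
                    [0.20, 101000.0]])
X_test = np.array([[0.21, 100900.0]])

# Statistics are computed per column, and only from the training data.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same training statistics are applied to both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_test_scaled)

# For comparison: dividing everything by one large constant leaves the
# first column at roughly 1e-9 and the second at roughly 5e-4.
print(X_train / 180000000.0)
```

After standardization both columns have zero mean and unit variance on the training set, so neither one dominates the gradient simply because of its units.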

Classification ANN stuck at 60%

I am trying to create a binary classifier on a data set of 10,000. I have tried multiple Activators and Optimizers, however the results are always between 56.8% and 58.9%. Given the fairly steady results over many dozen iterations, I assume the problem is either:
My dataset is not classifiable
My model is broken
This is the data set: training-set.csv
I may be able to get 2000 more records but that would be it.
My question is: is there something in the way my model is constructed that is preventing it from learning to a higher degree?
Note that I am happy to have as many layers and nodes as needed, and time is not a factor in generating the model.
dataframe = pandas.read_csv(r"training-set.csv", index_col=None)
dataset = dataframe.values
X = dataset[:,0:48].astype(float)
Y = dataset[:,48]
#count the input variables
col_count = X.shape[1]
#normalize X
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_scale = sc_X.fit_transform(X)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scale, Y, test_size = 0.2)
# define baseline model
activator = 'linear' #'relu' 'sigmoid' 'softmax' 'exponential' 'linear' 'tanh'
#opt = 'Adadelta' #adam SGD nadam RMSprop Adadelta
nodes = 1000
max_layers = 2
max_epochs = 100
max_batch = 32
loss_funct = 'binary_crossentropy' #for binary
last_act = 'sigmoid' # 'softmax' 'sigmoid' 'relu'
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(nodes, input_dim=col_count, activation=activator))
    for x in range(0, max_layers):
        model.add(Dropout(0.2))
        model.add(Dense(nodes, input_dim=nodes, activation=activator))
        #model.add(BatchNormalization())
    model.add(Dense(1, activation=last_act)) #model.add(Dense(1, activation=last_act))
    # Compile model
    adam = Adam(lr=0.001)
    model.compile(loss=loss_funct, optimizer=adam, metrics=['accuracy'])
    return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=max_epochs, batch_size=max_batch)
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
#confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
score = np.sum(cm.diagonal())/float(np.sum(cm))

Two points:
There is absolutely no point in stacking dense layers with linear activations: together they collapse into a single linear layer. Change to activator = 'relu' (and just don't bother with the other candidate activation functions in your commented-out list).
Do not use dropout by default, especially if your model has difficulties in learning (like here); remove the dropout layer(s), and be ready to put (some of) them back in only if you see overfitting (you are currently still very far from that point, so this is not something to worry about now).
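The first point can be verified numerically. The sketch below uses small random weight matrices (arbitrary shapes and seed chosen for illustration) to show that two stacked linear layers compute exactly the same function as one linear layer, while inserting a ReLU between them does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # batch of 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two dense layers with linear activation.
two_layers = (x @ W1 + b1) @ W2 + b2

# Equivalent single layer: W = W1 W2, b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# Same two layers, but with a ReLU after the first one.
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

print(np.allclose(two_layers, one_layer))   # linear stack collapses
print(np.allclose(with_relu, one_layer))    # the nonlinearity breaks the equivalence
```

So with activator = 'linear' the network, however wide or deep, can only learn a linear decision boundary followed by the final sigmoid, which is what logistic regression already does.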

Predicting next value of game results

I am working on predicting the outcome of a game. The game has three seats: seat A, seat B and seat C. Each round, exactly one seat wins at random.
I have written LSTM code to predict the next value. I built a dataset of around 8000 games and recorded the results. But the LSTM is not predicting the next value: it predicts either all "0"s or all "1"s. I'm curious why it is not learning.
Here you can find the dataset
Here is my attached Google Colab code
# univariate data preparation
from sklearn.preprocessing import OneHotEncoder
from numpy import array
import pandas as pd
# split a univariate sequence into samples
df = pd.read_csv('DatasetLarge.csv') # here you can find dataset https://drive.google.com/open?id=1WZBMYO-Oi3uErPlBXphXFAACmn9VK_yZ
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# Removing some errors of dataset
df=df.replace(to_replace ="q",
value ="a")
df=df.replace(to_replace ="aa",
value ="a")
df=df.replace(to_replace ="cc",
value ="c")
#label encoding
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.values)
df=le.transform(df)
# define input sequence
raw_seq=df
# choose a number of time steps
n_steps = 5
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
#Spliting dataset for train and test
X_train=X[0:7500]
y_train=y[0:7500]
X_test=X[7500:]
y_test=y[7500:]
import tensorflow.keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Flatten,Input,LSTM
n_features = 1
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], n_features))
# define model
model = Sequential()
model.add(LSTM(200, activation='softmax', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='binary_crossentropy')
# # fit model
model.fit(X_train, y_train, epochs=100, verbose=0)
# # demonstrate prediction
X_test=X_test.reshape((X_test.shape[0], X_test.shape[1], n_features))
yhat_classes =(model.predict(X_test) > 0.5).astype("int32")
print(yhat_classes)
Please help me understand why it is not predicting correctly.

Keras Prediction after test values

I am currently trying to build a neural network to predict time series, but the question is: is it possible to predict further than just the test dataset? For example, I have a dataset of about 3000 values, of which I keep 90% for training and 10% for testing. When I compare the predictions with the actual test values, they match, but is it possible, for instance, to ask the program to predict the next 500 values (i.e. from 3001 to 3500)?
Here is a snippet of the code I use.
import csv
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM, GRU
from keras.models import Sequential
from keras import optimizers
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.kernel_ridge import KernelRidge
import time
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (-1, 1))
def load_data(datasetname, column, seq_len, normalise_window):
    # A support function to help prepare datasets for an RNN/LSTM/GRU
    data = datasetname.loc[:,column]
    sequence_length = seq_len + 1
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])
    result = np.array(result)
    result.reshape(-1,1)
    training_set_scaled = sc.fit_transform(result)
    print (result)
    #Last 10% is used for validation test, first 90% for training
    row = round(0.9 * training_set_scaled.shape[0])
    train = training_set_scaled[:int(row), :]
    #np.random.shuffle(train)
    x_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = training_set_scaled[int(row):, :-1]
    y_test = training_set_scaled[int(row):, -1]
    print ("shape train", x_train)
    print ("shape train", X_test)
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
    return [x_train, X_test, y_train, y_test]
def build_model():
    model = Sequential()
    layers = {'input': 100, 'hidden1': 150, 'hidden2': 256, 'hidden3': 100, 'output': 10}
    model.add(LSTM(
        50,
        return_sequences=True,
        input_shape=(200,1)
    ))
    model.add(Dropout(0.2))
    model.add(LSTM(
        layers['hidden2'],
        return_sequences=True,
    ))
    model.add(Dropout(0.2))
    model.add(LSTM(
        layers['hidden3'],
        return_sequences=False,
    ))
    model.add(Dropout(0.2))
    model.add(Activation("linear"))
    model.add(Dense(
        output_dim=layers['output']))
    start = time.time()
    model.compile(loss="mean_squared_error", optimizer="adam")
    print ("Compilation Time : ", time.time() - start)
    return model
dataset = pd.read_csv(
'data.csv')
X_train, X_test, y_train, y_test = load_data(dataset, 'mean anomaly', 200, False)
model = build_model()
print ("train",X_train)
print ("test",X_test)
model.fit(X_train, y_train, batch_size=256, epochs=1, validation_split=0.05)
predictions = model.predict(X_test)
predictions = np.reshape(predictions, (predictions.size,))
plt.figure(1)
plt.subplot(311)
plt.title("Actual Test Signal w/Anomalies & noise")
plt.plot(y_test)
plt.subplot(312)
plt.title("predicted signal")
plt.plot(predictions, 'g')
plt.subplot(313)
plt.title("training signal")
plt.plot(y_train, 'b')
plt.plot(y_test, 'y')
plt.legend(['train', 'test'])
plt.show()
I have read that I should increase the output dim of the dense layer to get more than one predicted value, or increase the size of my window in the load_data function?
Here is the result. The yellow plot is supposed to come after the blue one; it represents my input test data. The first plot is a zoom on this data and the second one is the prediction.

If you want to predict the output value of your series at t+x based on data at time t, the data you feed to the network should already have this format.
Time series data formatting:
If you have 3000 data points and want to predict the output value for the next "virtual" 500 points, you should offset the output values by this amount. For example:
In your dataset, your 500th data point corresponds to the 500th output value. If you want to predict "future" values, then the 500th data point should have the 1000th output value. You can do this in pandas with the shift function. Be aware that you will lose the last 500 data points by doing so, as they will no longer have an output value.
Then, when you predict on data point x_i, you'll get the output value y_(i+500). You should find some basic guides to time series forecasting on sites like machinelearningmastery.
Good practice for model evaluation:
If you want to better evaluate the quality of your model, first find some metrics that suit your problem and try to increase the test set percentage. While graphics are a good way to visualise results, they can be deceiving, so try combining them with some metrics! (Be careful with Mean Squared Error: it can give you a biased score with values in the range [-1;1], as the square of an error in this range is never larger than the absolute error itself. Try Mean Absolute Error instead.)
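A quick numeric check of the remark about Mean Squared Error on data scaled to [-1;1], using a few made-up error values:

```python
# Hypothetical prediction errors on data scaled to [-1, 1].
errors = [0.5, -0.1, 0.02]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error

print(f"MAE = {mae:.4f}, MSE = {mse:.4f}")
```

Every squared term is smaller than the corresponding absolute error (0.5 becomes 0.25, 0.1 becomes 0.01), so the MSE looks much lower than the MAE even though both describe the same predictions. Taking the square root of the MSE (RMSE) is another common way to bring the number back to the units of the data.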
Data leakage when scaling data:
While scaling data is usually a good thing, you need to be careful when doing so. You committed something called a data leak: you scaled the whole data set before splitting it into training and test sets. Further reading about this data leak.
Update
I think I misunderstood your problem.
If you want to "predict further than just the test dataset" you will need some unseen/new data to make more predictions. The test set is only there to evaluate the performance of the learning phase.
Now, if you want to predict further than just the next step (this won't allow you to "predict further than just the test dataset" because of the way you change your dataset, see below):
Your model as it is built will only ever predict the next step.
In your example you feed the algorithm series of length 'seq_len' and give it as output the value right after the end of each series. If you want your algorithm to learn to predict more than one step into the future, your y_train must hold the value at the corresponding time in the future, for example:
x = [0,1,2,3,4,5,6,7,8,9,10,...]
seq_len = 5
step_to_predict = 5
So to predict not one step into the future but five, your series will have to look like this :
x_serie_1 = [0,1,2,3,4]
y_serie_1 = [9]
x_serie_2 = [1,2,3,4,5]
y_serie_2 = [10]
This is a way to get your model to learn how to make predictions further into the future than just the next step.
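The windowing described above can be sketched as a small helper. The name make_windows and its exact signature are made up for this example; the variable names seq_len and step_to_predict follow the ones used in the answer.

```python
def make_windows(series, seq_len, step_to_predict):
    """Pair each window of seq_len values with the value that lies
    step_to_predict steps after the last element of the window."""
    xs, ys = [], []
    last_start = len(series) - seq_len - step_to_predict
    for start in range(last_start + 1):
        end = start + seq_len
        xs.append(series[start:end])
        ys.append(series[end - 1 + step_to_predict])
    return xs, ys

series = list(range(11))          # [0, 1, ..., 10]
xs, ys = make_windows(series, seq_len=5, step_to_predict=5)

for x_window, y_value in zip(xs, ys):
    print(x_window, y_value)
# [0, 1, 2, 3, 4] 9
# [1, 2, 3, 4, 5] 10
```

With step_to_predict=1 this reduces to the usual next-step setup, where each window is paired with the value immediately following it.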
