Time series data prediction with multiple series - Python

I am studying time series data.
All the time series examples I have worked through so far have only two columns: a date, and a single value.
For example, in the case of a stock price forecast, we predict a 'single' stock.
Given that, is it possible to predict multiple series simultaneously in time series analysis?
For example, suppose subjects took a medicine that affects their liver levels, and we have each subject's liver measurements by date so far. Based on this, I would like to predict at which point the liver level rises or falls in the future. Here I need to predict several patients at the same time, not just one patient. How do I structure the data set in this case?
Is it possible to label the patients by adding one column? Or am I misunderstanding the nature of time series analysis?
If anyone knows anything related, I would be really grateful for advice or a pointer to a reference site.

You should do the predictions for each patient separately. You probably don't want the prediction for one patient to vary because of what happens to the others at the same time.
Machine learning is not just about giving data to a model and getting back results; you also have to think about the model: what should its input and output be here? For time series, you would typically give as input what was observed for a patient in the previous days and try to predict what happens in the next one. For one patient, you do not need the data of the other patients, and if you give it to your model, it will try to use it and capture noise from the training data, which is not what you want.
However, since you can expect similar behavior in each patient, you can build one model for all the patients rather than one model per patient. The typical input would be of the form:
[X(t - k, i), X(t - k + 1, i), ..., X(t - 1, i)]
where X(t, i) is the observation at time t for patient i, and the target is X(t, i). Train your model with the data of all the patients.
Since you give a medical example: if you have covariates such as the weight or the sex of the patients, you can include them in your model to capture their individual characteristics. In this case the input of the model to predict X(t, i) would be:
[X(t - k, i), X(t - k + 1, i), ..., X(t - 1, i), C1(i), ..., Cp(i)]
where C1(i), ..., Cp(i) are the covariates of patient i. If you do not have these covariates, it is not a problem; they can simply improve the results in some cases. Note that not all covariates are necessarily useful.
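As a rough sketch of that layout (assuming, purely for illustration, a pandas DataFrame df with columns patient_id, date, and liver_level; the names are made up, not from the question):

import numpy as np
import pandas as pd

def make_windows(df, k=7):
    # Build (input, target) pairs per patient: the k previous
    # observations predict the next one.
    X, y = [], []
    for _, g in df.sort_values('date').groupby('patient_id'):
        values = g['liver_level'].to_numpy()
        for t in range(k, len(values)):
            X.append(values[t - k:t])   # the k previous observations
            y.append(values[t])         # the value to predict
    return np.array(X), np.array(y)

Each patient's windows never mix with another patient's, but all windows go into one training set, so the model is shared across patients; per-patient covariates, if available, could simply be concatenated onto each row of X.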

Related

sklift inference: how to get probabilities for treatment vs no-treatment?

I am working with sklift to estimate the uplift for a given treatment (in this case a marketing discount). When training the model, we can get both probabilities, such as:
# model results: conditional probabilities of treatment effect
# probability of performing the targeted action (visits):
prob_treat = model_sm.trmnt_preds_    # probability in treatment group
prob_control = model_sm.ctrl_preds_   # probability in control group
But when I try to get these prob_treat and prob_control for inference (unseen data), I cannot find anything in the docs. All I can do is get the uplift using model.predict(X_val). I need to understand, for inference, what the probabilities are if treatment is given vs. not given. Can someone help?
I haven't used sklift; I use the causalml package, which is from Uber.
But I have some ideas about your question.
From your description I guess you are using a T-learner or S-learner, which predicts the probability separately in the treatment group and the control group.
So the result (the uplift value) you want is prob_treat - prob_control.
And the probabilities when treatment is or is not given are counterfactual outcomes: for each individual you only ever observe one of the two, so the model has to estimate the other.
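I can't speak for sklift's inference API, but the two-model idea itself is easy to reproduce with plain scikit-learn, which gives you both counterfactual probabilities on unseen data. A minimal sketch with toy stand-in data (none of these names come from sklift):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-ins for your treatment/control training splits
X_treat, y_treat = rng.normal(size=(100, 3)), rng.integers(0, 2, 100)
X_ctrl, y_ctrl = rng.normal(size=(100, 3)), rng.integers(0, 2, 100)
X_val = rng.normal(size=(10, 3))  # unseen data

# One model per group (the T-learner idea)
model_treat = LogisticRegression().fit(X_treat, y_treat)
model_ctrl = LogisticRegression().fit(X_ctrl, y_ctrl)

prob_treat = model_treat.predict_proba(X_val)[:, 1]    # P(action | treated)
prob_control = model_ctrl.predict_proba(X_val)[:, 1]   # P(action | not treated)
uplift = prob_treat - prob_control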

Bayesian update in pymc3: adding more data doesn't work

I am new to pymc3, but I've heard it can be used to build a Bayesian update model. So I tried, without success. My goal was to predict on which day of the week a person buys a certain product, based on prior information from a number of customers as well as that person's shopping history.
So let's suppose I know that customers in general buy this product only on Mondays, Tuesdays, Wednesdays, and Thursdays, and that the number of customers who bought the product in the past on those days is 3, 2, 1, and 1, respectively. I thought I would set up my model like this:
import numpy as np
import pymc3 as pm

dow = ['m', 'tu', 'w', 'th']
c = np.array([3, 2, 1, 1])
# hyperparameters (initially all equal)
alphas = np.array([1, 1, 1, 1])
with pm.Model() as model:
    # Parameters of the Multinomial are from a Dirichlet
    parameters = pm.Dirichlet('parameters', a=alphas, shape=4)
    # Observed data is from a Multinomial distribution
    observed_data = pm.Multinomial(
        'observed_data', n=7, p=parameters, shape=4, observed=c)
This set up my model without any issues. Then I have an individual customer's data from five weeks: 1 means they bought the product on that day of the week, 0 means they didn't. I thought updating the model would be as simple as:
c = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1]])
with pm.Model() as model:
    # Parameters are a Dirichlet distribution
    parameters = pm.Dirichlet('parameters', a=alphas, shape=4)
    # Observed data is a multinomial distribution
    observed_data = pm.Multinomial(
        'observed_data', n=1, p=parameters, shape=4, observed=c)
    trace = pm.sample(draws=100, chains=2, tune=50, discard_tuned_samples=True)
This didn't work.
My questions are:
Does this still take into account the priors I set up before, or does it create a brand-new model?
As written above, the code didn't work; it gave me a "bad initial energy" error. Through trial and error I found that the parameter n has to be the sum of the elements in each observation (so I can't have observations adding up to different n's). Why is that? Surely the situation I described above (where some weeks they shop only on Mondays, and other weeks on Monday and Thursday) is not impossible?
Is there a better way of using pymc3 or a different package for this type of problem? Thank you!
To answer your specific questions first:
The second model is a new model. You can reuse context managers by changing the line to just with model:, but looking at the code, that is probably not what you intended to do.
A multinomial distribution takes n draws, using the provided probabilities, and returns one list. pymc3 will broadcast for you if you provide an array for n. Here's a tidied version of your model:
with pm.Model() as model:
    parameters = pm.Dirichlet('parameters', a=alphas)
    observed_data = pm.Multinomial(
        'observed_data', n=c.sum(axis=-1), p=parameters, observed=c)
    trace = pm.sample()
You also ask about whether pymc3 is the right library for this question, which is great! The two models you wrote down are well known, and you can solve the posterior by hand, which is much faster: in the first model, it is a Dirichlet([4, 3, 2, 2]), and in the second Dirichlet([5, 2, 1, 2]). You can confirm this with PyMC3, or read up here.
If you wanted to expand your model, or chose distributions that were not conjugate, then PyMC3 might be a better choice.
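As a quick numeric check of the conjugacy claim: the Dirichlet posterior parameters are just the prior alphas plus the column-wise counts of the observations.

import numpy as np

alphas = np.array([1, 1, 1, 1])
c1 = np.array([3, 2, 1, 1])                      # first model's counts
c2 = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
               [1, 0, 0, 0], [1, 0, 0, 0],
               [1, 0, 0, 1]])                    # second model's observations

print(alphas + c1)               # [4 3 2 2], the first posterior
print(alphas + c2.sum(axis=0))   # [5 2 1 2], the second posterior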

What is the proper model for combining both input types (time-series inputs and auxiliary static inputs)?

I have searched a few posts and issue threads about the reshaping problem, but none of those solutions seems to work for me so far.
The goal of the project: predicting weekly sales for each store and SKU (the properties of the clothes, e.g. colour/size).
The structure of the dataset is as follows:
Week, store_id, color, size, last_week_sales, last_2week_sales, actual_sales (the target we want to predict)
1, 341, red, LL, 0, 1, 1
1, 341, yellow, M, 2, 4, 2
1, 341, blue, S, 2, 2, 3
2, 342, blue, M, 2, 3, 1
2, 342, green, S, 2, 3, 2
So for each week, every record is unique by the combination of features (the properties of the clothes, the store_id, etc.).
The number of records per week is not the same.
Update on 8-23-2018:
I tried a fully-connected NN, but its accuracy is about 75% and I could not improve it in various ways. I wonder whether there is another way of dealing with this issue. Thanks in advance!
The 'reshaping problem' stems from a lack of understanding of your own data and prediction goals. LSTMs (and RNNs in general) expect a sequence-of-vectors data structure. Essentially, you want to model some function f(x) where your features are time-varying, x = x(t), so f(x) may be rewritten as f(t). This is not particularly evident in your sample dataset, as only a small subset of your features (the sales) is time-varying.
What you could do is consider a single time slice as a vector whose elements are the static features (color, store id, and so on) PLUS a single instance of sales, as sketched below. A full sample would then be a matrix of N vertically stacked time slices, where N is the number of time slices you have. Many such samples form a batch, i.e. a 3-dimensional tensor, which is the expected input of a recurrent network.
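A rough sketch of that stacking (all shapes and arrays here are made-up placeholders, not from the question):

import numpy as np

n_samples, n_steps = 32, 10    # hypothetical: 32 items, 10 weeks each
static = np.random.rand(n_samples, 4)          # e.g. encoded color/size/store_id
sales = np.random.rand(n_samples, n_steps, 1)  # one sales value per time slice

# Repeat the static features across time and stack them with the sales,
# giving the (batch, timesteps, features) tensor an RNN expects.
static_tiled = np.repeat(static[:, np.newaxis, :], n_steps, axis=1)
X = np.concatenate([static_tiled, sales], axis=-1)  # shape (32, 10, 5)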
If you're not just doing this for the sake of experimentation and learning, keep in mind that this is a bad approach. Your features do not have any temporal structure and, intuitively, they should not have any predictive power for next week's sales. Additionally, using an RNN here is overkill, and you will almost certainly overfit your dataset.

Predicting weather data using LSTM neural nets with Keras

I've spent months reading an endless number of posts and I still feel as confused as I initially was. Hopefully someone can help.
Problem: I want to use time series to make predictions of weather data at a particular location.
Set-up:
X1 and X2 are both vectors containing daily values of indices for 10 years (3650 total values in each vector).
Y is a time series of temperature at Newark airport (T), every day for 10 years (3650 days).
There's a strong case to be made that X1 and X2 can be used as predictors for Y. So I break everything into windows of 100 days and create the following:
X1 = (3650,100,1)
X2 = (3650,100,1)
Such that window 1 includes the values from t=0 to t=99, window 2 includes values from t=1 to t=100, etc. (Assume that I have enough extra data at the end that we still have 3650 windows).
What I've learned from other tutorials is that to go into Keras I'd do this:
X = (3650,100,2) = (#_of_windows,window_length,#_of_predictors) which I get by merging X1 and X2.
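In numpy terms the merge is a single concatenation (assuming X1 and X2 are arrays of shape (3650, 100, 1)):

import numpy as np
X = np.concatenate([X1, X2], axis=-1)  # shape (3650, 100, 2)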
Then I have this code:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(1, return_sequences=True, input_shape=(100, 2)))
model.add(LSTM(4))
model.add(Dropout(0.2))
model.add(Dense(1))  # output layer: one value per window, matching Y
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X, Y, batch_size=128, epochs=2, shuffle=True)  # Y is shape (3650,)
predictions = model.predict(?????????????)
So my question is: how do I set up the model.predict part to get back forecasts of N days into the future? Sometimes I might want 2 days, sometimes 2 weeks. I only need to get back N values (shape: [N,]); I don't need to get back windows or anything like that.
Thanks so much!
The only format in which you can predict is the format in which you trained the model. If I understand correctly, you trained the model as follows:
You used windows of size 100 (that is, features at times T-99,T-98,...,T) to predict the value of the target at time T.
If this is indeed the case, then the only thing that you can do with the model is the same type of prediction. That is, you can provide the values of your features for 100 days, and ask the model to predict the value of the target for the last day among the 100.
If you want it to be able to forecast N days, you have to train your model accordingly: every element of Y should be a sequence of N days, as in the sketch below. Here is a blog post that describes how to do that.
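A minimal sketch of that setup (the window size, horizon, and random stand-in data are all illustrative, not from the question): each training target becomes the next N values, and the output layer is widened to N units.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

N = 14                           # forecast horizon, e.g. two weeks
window = 100
series = np.random.rand(4000)    # stand-in for the temperature series Y
feats = np.random.rand(4000, 2)  # stand-in for the X1/X2 predictors

X, Y = [], []
for t in range(window, len(series) - N):
    X.append(feats[t - window:t])   # 100 days of both predictors
    Y.append(series[t:t + N])       # the next N days of the target
X, Y = np.array(X), np.array(Y)     # shapes (samples, 100, 2) and (samples, N)

model = Sequential()
model.add(LSTM(32, input_shape=(window, 2)))
model.add(Dense(N))                 # one output per forecast day
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X, Y, batch_size=128, epochs=2)
forecast = model.predict(X[-1:])    # shape (1, N): the next N days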

Create Feature Using K-Nearest Neighbors

I'm relatively new to Python and machine learning, but I've been working on building a predictive model for mortgage prices. Where I'm struggling is using the k-nearest-neighbors algorithm to create a feature.
Here's how I understand the mechanics of what I want to accomplish:
1. I have two data files: Mortgages Sold and Mortgages Listed.
2. Both data files have the same features (including lat/long).
3. I want to create a column in Mortgages Listed that represents the median price of the most closely related homes in the immediate area.
4. I'll use the methodology listed in 3 to create columns for 1-3 months, 4-6 months, and 7-12 months.
5. Another column would be the trend across those three columns.
I've found something on KNN imputation, but that doesn't seem to be what I'm looking for.
How do I go about executing this idea? Are there resources that I may have missed that would help?
Any guidance would be appreciated. Thanks!
So, from what I understand, you want to fit a KNN model using the Mortgages Sold data to predict prices for the Mortgages Listed data.
This is a classical KNN problem: for each feature vector in the Listed data you find the nearest feature vectors in the Sold data, and then take the median of their prices.
Suppose there are n rows in the Sold data, with feature vectors X1, X2, ..., Xn and corresponding prices P1, P2, ..., Pn:
X_train = [X1, X2, ..., Xn]
y_train = [P1, P2, ..., Pn]
Note that each Xi is itself a feature vector representing the ith row.
For now, suppose you want the 5 closest Sold rows for each row in the Listed data. So a model parameter, which might need to be optimised later, is:
NUMBER_OF_NEIGHBOURS = 5
Now, the training code will look something like this. Note that since you want the neighbours' prices rather than a class label, NearestNeighbors is the right tool here; KNeighborsClassifier would treat every distinct price as a class:
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors(n_neighbors=NUMBER_OF_NEIGHBOURS)
knn_model.fit(X_train)
For prediction, consider there are m rows in Listed data, and the feature vectors for each row are F1, F2, ..., Fm. The corresponding median prices Z1, Z2, ..., Zm need to be determined.
X_test = [F1, F2, ..., Fm]
Note that the feature vectors in X_train and X_test should be vectorized using the same Vectorizer/Transformer. Read more about Vectorizers here.
The prediction code will look something like this:
import numpy as np
# indices of the NUMBER_OF_NEIGHBOURS nearest Sold rows for each Listed row
distances, indices = knn_model.kneighbors(X_test)
# median of the neighbours' prices, row by row
y_predicted = np.median(np.asarray(y_train)[indices], axis=1)
Each row of indices points at the (in this case) 5 closest Sold rows, so np.asarray(y_train)[indices] is an m x 5 array of prices:
[(P11, P12, .., P15), (P21, P22, .., P25), .., (Pm1, Pm2, .., Pm5)]
and taking the median along each row gives the median price Zj for each row of the Listed data.
Now, coming to the parameter optimisation part. The only hyper-parameter in your model is NUMBER_OF_NEIGHBOURS. You can find its optimal value by splitting X_train itself, say in an 80:20 ratio: train on the 80% part and validate on the remaining 20%, as sketched below. Once you are happy with the validation error, use that value of NUMBER_OF_NEIGHBOURS for the predictions on X_test.
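A minimal version of that validation loop (assuming X_train and y_train are numpy arrays, and reusing the NearestNeighbors/median approach from above):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2)

best_k, best_err = None, np.inf
for k in (3, 5, 10, 20):
    nn = NearestNeighbors(n_neighbors=k).fit(X_tr)
    _, idx = nn.kneighbors(X_val)
    preds = np.median(y_tr[idx], axis=1)    # median price of the k neighbours
    err = np.mean(np.abs(preds - y_val))    # mean absolute error
    if err < best_err:
        best_k, best_err = k, err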
In the end, for the month-wise analysis, you will need month-wise models: for example, M1 trained on 1-3 month Sold data, M2 on 4-6 month Sold data, M3 on 7-12 month Sold data, etc.
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
