I've spent months reading an endless number of posts and I still feel as confused as I initially was. Hopefully someone can help.
Problem: I want to use time series to make predictions of weather data at a particular location.
Set-up:
X1 and X2 are both vectors containing daily values of indices for 10 years (3650 total values in each vector).
Y is a time series of temperature at Newark airport (T), every day for 10 years (3650 days).
There's a strong case to be made that X1 and X2 can be used as predictors for Y. So I break everything into windows of 100 days and create the following:
X1 = (3650,100,1)
X2 = (3650,100,1)
Such that window 1 includes the values from t=0 to t=99, window 2 includes values from t=1 to t=100, etc. (Assume that I have enough extra data at the end that we still have 3650 windows).
What I've learned from other tutorials is that to go into Keras I'd do this:
X = (3650,100,2) = (#_of_windows,window_length,#_of_predictors) which I get by merging X1 and X2.
Then I have this code:
model = Sequential()
model.add(LSTM(1,return_sequences=True,input_shape=(100,2)))
model.add(LSTM(4))
model.add(Dropout(0.2))
model.add(Dense(1)) # single output so predictions match Y's shape
model.compile(loss='mean_squared_error',optimizer='rmsprop') # 'mean_square_error' is not a valid Keras loss name
model.fit(X,Y,batch_size=128,epochs=2,shuffle=True) # shuffle is a fit() argument, not compile(); Y is shape (3650,)
predictions = model.predict(?????????????)
So my question is: how do I set up model.predict to get back forecasts for N days in the future? Sometimes I might want 2 days, sometimes I might need 2 weeks. I only need to get back N values (shape: [N,]); I don't need to get back windows or anything like that.
Thanks so much!
The only format in which you can predict is the format in which you trained the model. If I understand correctly, you trained the model as follows:
You used windows of size 100 (that is, features at times T-99,T-98,...,T) to predict the value of the target at time T.
If this is indeed the case, then the only thing that you can do with the model is the same type of prediction. That is, you can provide the values of your features for 100 days, and ask the model to predict the value of the target for the last day among the 100.
If you want it to be able to forecast N days, you have to train your model accordingly. That is, every element in Y should be a sequence of N days. Here is a blog post that describes how to do that.
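To make that concrete, here is a minimal numpy sketch of the windowing this answer describes, pairing each 100-day window of the two predictors with the next N target values (`make_windows`, the horizon of 3, and the random data are all illustrative assumptions; a final Dense(N) layer in the Keras model would then emit N forecasts per window):

```python
import numpy as np

def make_windows(x1, x2, y, window=100, horizon=3):
    """Pair each `window`-day slice of the two predictors with the next
    `horizon` target values (hypothetical helper; names are illustrative)."""
    X = np.stack([x1, x2], axis=-1)                     # (T, 2)
    n = len(y) - window - horizon + 1                   # number of usable windows
    Xw = np.stack([X[i:i + window] for i in range(n)])  # (n, window, 2)
    Yw = np.stack([y[i + window:i + window + horizon] for i in range(n)])  # (n, horizon)
    return Xw, Yw

rng = np.random.default_rng(0)
x1, x2, y = rng.normal(size=(3, 3650))  # stand-ins for the two indices and T
Xw, Yw = make_windows(x1, x2, y, window=100, horizon=3)
print(Xw.shape, Yw.shape)  # (3548, 100, 2) (3548, 3)
```

With Y shaped (n, horizon), model.predict on one new 100-day window returns the N future values directly.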
I have a high-frequency time series (observations separated by 3 seconds), which I'd like to analyse and eventually forecast over short-term horizons (10/20/30 minutes ahead) using different models. My whole dataset contains 20K observations. My goal is to draw conclusions about how well the different models can forecast the data.
I first tried to plot the whole dataset, but I couldn't identify anything:
Whole dataset
Then I plotted only the first 500 observations, and this is the result:
First 500 observations
I don't know why it looks just like white noise!
After running the ADF test on the whole dataset, it gives me a p-value of 0.0. This means my dataset is stationary, right?
I decided to try the ARIMA model first, but from the ACF and PACF plots I can't identify p and q:
ACF
PACF
1- Is the dataset white noise? Is it possible to make predictions on this time series?
2- I tried to downsample the dataset (taking the mean over each 4-minute interval), but it was the same thing: I couldn't identify anything, and I think this results in a loss of information, no?
3- How much data should I fit the ARIMA on in the training set? Does it make sense to use a short training set for a short-term forecasting horizon?
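As a rough check on question 1, the sample autocorrelations can be compared against the approximate ±1.96/√n white-noise band. Here is a numpy sketch on simulated data (`acf` is a hypothetical helper; in practice a library implementation such as the one in statsmodels would be used):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 1..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
series = rng.normal(size=20_000)        # stand-in for the 20K-observation series
r = acf(series, max_lag=20)
bound = 1.96 / np.sqrt(len(series))     # approximate 95% white-noise band
share_inside = np.mean(np.abs(r) < bound)
print(share_inside)                     # typically near 0.95 for true white noise
```

If roughly 95% of the lags fall inside the band, the series is consistent with white noise, in which case there is little to forecast beyond the mean.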
So I'm having a hard time conceptualizing how to make mathematical representation of my solution for a simple logistic regression problem. I understand what is happening conceptually and have implemented it, but I am answering a question which asks for a final solution.
Say I have a simple two-column dataset denoting something like the likelihood of getting a promotion per year worked, so the likelihood would increase as the person accumulates experience. X denotes the year and Y is a binary indicator for receiving a promotion:
X | Y
1 | 0
2 | 1
3 | 0
4 | 1
5 | 1
6 | 1
I implement logistic regression to find the probability per year worked of receiving a promotion, and get an output set of probabilities that seem correct.
I get an output weight vector that is two items long, which makes sense as there are only two inputs: the number of years X, and, since I fix the intercept to handle bias, a column of 1s. So one weight for years, one for bias.
So I have two questions about this.
Since it is easy to get an equation of the form y = mx + b as a decision boundary for something like linear regression or a PLA, how can I similarly denote a mathematical solution with the weights of the logistic regression model? Say I have a weight vector [0.9, -0.34]; how can I convert this into an equation?
Secondly, I am performing gradient descent, which returns a gradient that I multiply by my learning rate. Am I supposed to update the weights at every epoch? My gradient never returns zeros in this case, so I am always updating.
Thank you for your time.
The logistic regression is trying to map the input value (x = years) to the output value (y = likelihood) through this relationship:
L(x) = 1 / (1 + exp(-(theta * x + b)))
where theta and b are the weights you are trying to find.
The decision boundary will then be defined as L(x) > p or L(x) < p, where L(x) is the right-hand side of the equation above. That is the relationship you want.
You can eventually transform it into a more linear form, like that of linear regression, by moving the exponential to the numerator and taking the log of both sides, which gives log(L(x) / (1 - L(x))) = theta * x + b.
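For the first question, a minimal numpy sketch of reading the weight vector as the linear score inside the sigmoid, assuming [0.9, -0.34] is ordered (years weight, bias):

```python
import numpy as np

# Hypothetical fitted weights: w[0] for years, w[1] for the bias column of 1s
w = np.array([0.9, -0.34])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_promotion(x):
    """P(Y=1 | x years worked) under the fitted model."""
    return sigmoid(w[0] * x + w[1])

# The p = 0.5 decision boundary is where the linear score crosses zero,
# i.e. w[0]*x + w[1] = 0, so x = -w[1]/w[0]
boundary = -w[1] / w[0]
print(round(boundary, 3))  # 0.378
```

At p = 0.5 the boundary is simply the root of the linear score, exactly analogous to the y = mx + b case.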
I am studying time series data.
If you look at the time series data in the examples I have run so far, they all similarly have only two columns: one is a date, and one is some value.
For example, in the case of a stock price increase forecast, we predict a 'single' stock.
If so, can you predict multiple stocks simultaneously in time series data analysis?
For example, after subjects had taken medicines that affect their liver levels, we have their liver-level readings by date so far. Based on this, I would like to experiment with predicting at which point a liver level will rise or fall in the future. Here I need to predict for several patients at the same time, not just one patient. How do I specify the dataset in this case?
Is it possible to label by adding one column? Or am I not really understanding the nature of time series data analysis?
If anyone knows anything related, I would be really grateful if you can advise me or give me a reference site.
You should do the predictions for each patient separately. You probably don't want the prediction for one patient to vary because of what happens to the others at the same time.
Machine learning is not just about giving data to a model and getting back results; you also have to think about the model and what its input and output should be here. For time series, you would probably give as input what was observed for a patient in the previous days, and try to predict what will happen on the next one. For one patient, you do not need the data of the other patients, and if you give it to your model, it will try to use it and capture some noise from the training data, which is not what you want.
However, since you could expect similar behavior in each patient, you can build one model for all the patients rather than one model per patient. The typical input would be of the form:
[X(t - k, i), X(t - k + 1, i), ..., X(t - 1, i)]
where X(t, i) is the observation at time t for the patient i, to predict X(t, i). Train your model with the data of all the patients.
As you give a medical example, know that if you have some covariates like the weight or the gender of the patients you can include them in your model to capture their individual characteristics. In this case the input of the model to predict X(t, i) would be :
[X(t - k, i), X(t - k + 1, i), ..., X(t - 1, i), C1(i), ..., Cp(i)]
where C1(i), ..., Cp(i) are the covariates of the patient. If you do not have these covariates, it is not a problem; they can just improve the results in some cases. Note that not all covariates are necessarily useful.
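A minimal numpy sketch of that input layout, pooling windows across patients and appending each patient's covariates (`build_samples` and the toy data are illustrative assumptions):

```python
import numpy as np

def build_samples(series_by_patient, covariates_by_patient, k=7):
    """Build (input, target) pairs pooled across patients: the last k
    observations plus the patient's covariates predict the next value.
    (Hypothetical helper illustrating the input layout described above.)"""
    X, y = [], []
    for i, x in enumerate(series_by_patient):
        c = covariates_by_patient[i]
        for t in range(k, len(x)):
            X.append(np.concatenate([x[t - k:t], c]))  # [X(t-k,i)..X(t-1,i), C1(i)..Cp(i)]
            y.append(x[t])                             # target X(t, i)
    return np.array(X), np.array(y)

# Toy data: 3 patients, 30 daily liver-level readings each, 2 covariates (e.g. weight, sex)
rng = np.random.default_rng(1)
series = [rng.normal(size=30) for _ in range(3)]
covs = [np.array([70.0, 0.0]), np.array([55.0, 1.0]), np.array([82.0, 0.0])]
X, y = build_samples(series, covs, k=7)
print(X.shape, y.shape)  # (69, 9) (69,)
```

Any regressor can then be trained on the pooled (X, y), giving one model that still knows which individual each window came from through the covariate columns.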
I have a dataset of peak load for a year. It's a simple two-column dataset with the date and load (kWh).
I want to train on the first 9 months and then let it predict the next three months. I can't get my head around how to implement SVR. I understand my y would be the predicted value in kWh, but what about my X values?
Can anyone help?
Given multi-variable regression, y = f(x_1, ..., x_n), regression is a multi-dimensional separation, which can be hard to visualize in one's head since it is not 3D.
The better question might be: which inputs are consequential to the output value y?
Since you have the code to loadavg in the kernel source, you can use the input parameters.
For Python (I suppose the same approach will work for R):
Collect the data in this way:
[x_i-9, x_i-8, ..., x_i] vs [x_i+1, x_i+2, x_i+3]
The first vector is your input; the second is your output vector (or value, if you like). Then use the fit method from here, for example: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.fit
You can try scaling, removing outliers, applying weights and so on. Play :)
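A minimal numpy sketch of that data-collection step, turning the daily load series into lagged input/output pairs (`make_supervised`, the lag and horizon sizes, and the synthetic series are assumptions; each column of Y would then be fit with its own sklearn SVR, since SVR predicts a single value at a time):

```python
import numpy as np

def make_supervised(load, n_lags=10, n_ahead=3):
    """Turn a 1-D load series into (X, Y) pairs: each row of X holds the
    previous n_lags values, each row of Y the next n_ahead values.
    (Hypothetical helper for the [x_i-9..x_i] vs [x_i+1..x_i+3] layout above.)"""
    X, Y = [], []
    for i in range(n_lags, len(load) - n_ahead + 1):
        X.append(load[i - n_lags:i])
        Y.append(load[i:i + n_ahead])
    return np.array(X), np.array(Y)

load = np.sin(np.linspace(0, 20, 365)) + 5.0  # stand-in for a year of daily peak load
X, Y = make_supervised(load, n_lags=10, n_ahead=3)
print(X.shape, Y.shape)  # (353, 10) (353, 3)
```

Splitting the rows roughly 9:3 by date then gives the train/test partition described in the question.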
The issue
So I have 50 netCDF4 data files that contain decades of monthly temperature predictions on a global grid. I'm using np.mean() to make an ensemble average of all 50 data files together while preserving time length & spatial scale, but np.mean() gives me two different answers. The first time I run its block of code, it gives me a number that, when averaged over latitude & longitude & plotted against the individual runs, is slightly lower than what the ensemble mean should be. If I re-run the block, it gives me a different mean which looks correct.
The code
I can't copy every line here since it's long, but here's what I do for each run.
#Historical (1950-2020) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_195001-202012.nc") #Import data file
tash1 = ncin_1.variables['tas'][:] #extract tas (temperature) variable
ncin_1.close() #close to save memory
#Repeat for future (2021-2100) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_202101-210012.nc")
tasr1 = ncin_1.variables['tas'][:]
ncin_1.close()
#Concatenate historical & future files together to make one time series array
tas11 = np.concatenate((tash1,tasr1),axis=0)
#Subtract the 1950-1979 mean (the first 360 months) to obtain anomalies
tas11 = tas11 - np.mean(tas11[0:360],axis=0,dtype=np.float64)
And I repeat that 49 more times for the other datasets. Each tas11, tas12, etc. array has the shape (1812, 64, 128), corresponding to time length in months, latitude, and longitude.
To get the ensemble mean, I do the following.
#Move all tas data to one array
alltas = np.zeros((1812,64,128,51)) #months, lat, lon, members (51st slot reserved for the ensemble mean)
alltas[:,:,:,0] = tas11
(...)
alltas[:,:,:,49] = tas50
#Calculate ensemble mean & fill into 51st slot in axis 3
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
When I check a coordinate & month, the ensemble mean is off from what it should be. Here's what a plot of globally averaged temperatures from 1950-2100 looks like with the first mean (with monthly values averaged into annual values). The black line is the ensemble mean & colored lines are individual runs.
Obviously that deviated below the real ensemble mean. Here's what the plot looks like when I run alltas[:,:,:,50]=np.mean(alltas,axis=3,dtype=np.float64) a second time & keep everything else the same.
Much better.
The question
Why does np.mean() calculate the wrong value the first time? I tried specifying the data type as a float when using np.mean() like in this question- Wrong numpy mean value?
But it didn't work. Any way I can fix it so it works correctly the first time? I don't want this problem to occur on a calculation where it's not so easy to notice a math error.
In the line
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
the argument to mean should be alltas[:,:,:,:50]:
alltas[:,:,:,50] = np.mean(alltas[:,:,:,:50], axis=3, dtype=np.float64)
Otherwise you are including the still-zero slot 50 in the calculation of the ensemble means, which drags the first result down by a factor of 50/51. Re-running the line only looks correct because slot 50 then already holds something close to the right mean.
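A toy reproduction of the effect, with a 1-D array standing in for the 4-D ensemble:

```python
import numpy as np

# Three "runs" plus an empty slot reserved for their mean
vals = np.zeros(4)
vals[:3] = [1.0, 2.0, 3.0]

wrong = np.mean(vals)        # includes the empty slot: (1+2+3+0)/4
vals[3] = np.mean(vals[:3])  # exclude the slot being filled: (1+2+3)/3
print(wrong, vals[3])        # 1.5 2.0
```

The same slicing fix (`alltas[:,:,:,:50]`) applies along axis 3 of the full array.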