I am working on a time-series classification problem using a CNN. The dataset is financial stock market data (e.g., from Yahoo Finance). I am using technical indicators calculated from the raw values: high, low, open, close, and volume.
One of these indicators is MACD (Moving Average Convergence Divergence), computed with the ta library. In most references, MACD is calculated with n_fast = 12 and n_slow = 26 periods and n_sign = 9 (the signal parameter of macd_diff() in the ta library), while RSI (Relative Strength Index) is calculated over 14 days.
So, if I am calculating RSI over a 5-day period, how should n_fast and n_slow be scaled accordingly? Should they be n_fast = 3 and n_slow = 8? And what should n_sign be in that case? I am new to the finance domain.
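For reference, this is roughly how those indicators are computed with the ta library (a minimal sketch; the n_fast/n_slow/n_sign and n parameter names follow the ta version referenced above, while newer releases rename them to window_fast/window_slow/window_sign and window):
import pandas as pd
import ta

# df holds the raw OHLCV columns pulled from Yahoo Finance (hypothetical file name)
df = pd.read_csv('stock_data.csv')

# Standard parameterization mentioned above: 12/26/9 MACD and 14-day RSI
df['macd_diff'] = ta.trend.macd_diff(df['Close'], n_fast=12, n_slow=26, n_sign=9)
df['rsi_14'] = ta.momentum.rsi(df['Close'], n=14)

# The question is how n_fast, n_slow and n_sign should be rescaled when the
# RSI window is shortened to 5 days, e.g. whether n_fast=3, n_slow=8 is right
df['rsi_5'] = ta.momentum.rsi(df['Close'], n=5)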
I have a weekly dataset with 1 input variable and 1 output variable. I want to include week-of-the-year seasonality as an input variable in a regression model. Generally we include seasonality in a regression model by creating dummy variables, but since there are 52 weeks, including 52 dummy variables is not a feasible solution. I saw one blog here where a Fourier series is used to capture seasonality. How can I use a Fourier series to capture week-of-the-year seasonality? Something like
week    x1      seasonal_input    target
w1      0.5     3.4               1.22
w2      0.22    4.2               0.23
such that every week has a unique value for the seasonal input variable.
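A sketch of the Fourier-feature idea from that blog, assuming the week number is available as an integer 1-52: each week then gets a unique combination of a few sine/cosine terms instead of 52 dummies.
import numpy as np
import pandas as pd

# Hypothetical weekly frame with a week-of-year column (1..52)
df = pd.DataFrame({'week_of_year': np.arange(1, 53)})

# K pairs of Fourier terms; larger K allows a more flexible seasonal shape
K = 3
for k in range(1, K + 1):
    df[f'sin_{k}'] = np.sin(2 * np.pi * k * df['week_of_year'] / 52)
    df[f'cos_{k}'] = np.cos(2 * np.pi * k * df['week_of_year'] / 52)

# These sin_k/cos_k columns are then used as ordinary numeric inputs in the
# regression model alongside x1, in place of 52 dummy variables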
I have around 23,300 hourly data points in my dataset, and I am trying to forecast with Facebook Prophet.
To fine-tune the hyperparameters, one can use cross-validation:
from fbprophet.diagnostics import cross_validation
The whole procedure is shown here:
https://facebook.github.io/prophet/docs/diagnostics.html
Using cross_validation one needs to specify initial, period and horizon:
df_cv = cross_validation(m, initial='xxx', period='xxx', horizon = 'xxx')
I am now wondering how to configure these three values in my case. As stated, I have about 23,300 hourly data points. Should the horizon be some particular fraction of that, or does the exact fraction not matter much, so that I can take whatever value seems appropriate?
Furthermore, cutoffs can also be defined, as below:
cutoffs = pd.to_datetime(['2013-02-15', '2013-08-15', '2014-02-15'])
df_cv2 = cross_validation(m, cutoffs=cutoffs, horizon='365 days')
Should these cutoffs be equally spaced, as above, or can they be set individually, wherever one prefers?
initial is the first training period: the minimum amount of data needed to begin training.
horizon is the length of time over which you want to evaluate your forecast. Say a retail outlet is building a model so that it can predict sales over the next month; a horizon of 30 days makes sense, so that the model is evaluated under the same setting in which it will be used.
period is the amount of time between each fold. It can be greater than, less than, or equal to the horizon.
cutoffs are the dates at which each horizon begins.
You can understand these terms by looking at this image (credits: Forecasting Time Series Data with Facebook Prophet, by Greg Rafferty).
Let's imagine that a retail outlet wants a model that can predict the next month of daily sales, and they plan on running the model at the beginning of each quarter. They have 3 years of data.
They would therefore set their initial training data to 2 years. They want to predict the next month of sales, so they would set the horizon to 30 days. They plan to run the model each business quarter, so they would set the period to 90 days.
This is also shown in the image above.
Let's apply these parameters to our model:
df_cv = cross_validation(model,
                         horizon='30 days',
                         period='90 days',
                         initial='730 days')
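For the hourly dataset in the question, the same call could look roughly as follows; the specific windows (about two years of training data, monthly folds, a one-week horizon) are only illustrative assumptions and should be matched to how far ahead the forecast is actually needed:
# Illustrative settings for ~23,300 hourly points (roughly 2.7 years of data)
df_cv = cross_validation(m,
                         initial='730 days',   # ~2 years of training data
                         period='30 days',     # a new fold roughly every month
                         horizon='7 days')     # evaluate one week ahead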
I have a Python Gekko application to estimate and control disturbances to an industrial polymer manufacturing process (UNIPOL polyethylene). The approach is to update the unknown catalyst activity to minimize the difference between the measured and predicted production rate. The catalyst activity is then used in production control. The predicted production rate is based on heat exchanged with cooling water. The problem that I'm running into is that sometimes the production rate measurements are not good because of intermittent issues associated with the measurements (flow meter, temperatures) and calculations during large transients. The distributed control system (Honeywell Experion with TDC3000) has appropriate protections against bad measurements and reports a value but with bad status. How can I use the available good measurements but ignore the intermittent bad measurements in Python Gekko? I don't have example code that I can share due to proprietary issues, but it is similar to this TCLab exercise.
for i in range(1,n):
    # Read temperatures in Celsius
    T1m[i] = a.T1
    T2m[i] = a.T2
    # Insert measurements
    TC1.MEAS = T1m[i]
    TC2.MEAS = T2m[i]
    Q1.MEAS = Q1s[i-1]
    Q2.MEAS = Q2s[i-1]
    # Predict Parameters and Temperatures with MHE
    m.solve(disp=True)
Can I use np.nan (NaN) as the measurement or is there another way to deal with bad data?
For any bad data, you can set the feedback status FSTATUS to off (0) for FVs, MVs, SVs, or CVs.
if bad_measurements:
    TC1.FSTATUS = 0
    TC2.FSTATUS = 0
    Q1.FSTATUS = 0
    Q2.FSTATUS = 0
else:
    TC1.FSTATUS = 1
    TC2.FSTATUS = 1
    Q1.FSTATUS = 1
    Q2.FSTATUS = 1
Gekko eliminates the bad measurement from the time series model update but keeps the good data. For the CVs, it does this by storing and time-shifting the measurements and the value of FSTATUS for each one. The bad data eventually leaves the data horizon along with the FSTATUS=0 indicator. You can also have FSTATUS values between 0 and 1 if you want to filter the input data:
x = LSTVAL * (1 - FSTATUS) + MEAS * FSTATUS
where LSTVAL is the last value, MEAS is the measurement, and x is the new filtered input for that measurement. More information on FSTATUS is in the documentation.
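Putting the two pieces together, a minimal sketch of the measurement loop with bad-data handling; the NaN check and the t1_bad_status/t2_bad_status flags are assumptions standing in for the DCS bad-status signal, and the variable names follow the TCLab snippet above:
import numpy as np

for i in range(1,n):
    # Read temperatures in Celsius
    T1m[i] = a.T1
    T2m[i] = a.T2
    # Turn feedback off for any measurement that is NaN or flagged bad,
    # so the MHE update ignores it but keeps the good data
    if np.isnan(T1m[i]) or t1_bad_status:   # t1_bad_status: hypothetical DCS flag
        TC1.FSTATUS = 0
    else:
        TC1.FSTATUS = 1
        TC1.MEAS = T1m[i]
    if np.isnan(T2m[i]) or t2_bad_status:   # t2_bad_status: hypothetical DCS flag
        TC2.FSTATUS = 0
    else:
        TC2.FSTATUS = 1
        TC2.MEAS = T2m[i]
    # Heater values are known inputs, so they keep full feedback status
    Q1.MEAS = Q1s[i-1]
    Q2.MEAS = Q2s[i-1]
    # Predict parameters and temperatures with MHE
    m.solve(disp=True)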
For a machine learning problem, I want to derive the hourly PV power of a specific system given various weather parameters, including hourly GHI and DHI but no DNI. If I use one of the pvlib DNI estimation models, I always need the zenith angle. Since I only have hourly irradiance values, I cannot be very specific about the angle. Would you take an hourly average? There is always the problem that angles close to 90° result in extremely high DNI values.
So far, I have tried to manually calculate hourly DNI = (GHI - DHI) / cos(zenith), taking the mean of 5-minute-resolution zenith angles as the hourly zenith. Sunrise at the location is almost always before 7 am, so I should get some very small PV power in hour 6 of the day. However, because the averaged angle is almost always above 90°, I get 0 kW AC power; on the few days when the mean angle is just below 90°, I instead get 40 kW AC power, which is the system's maximum as limited by the inverters and is even more unrealistic at these early hours.
ModelChain Parameters:
pvsys_ref=pvsyst
loc_ref=loc
orient_strat_ref=None
sky_mod_ref='ineichen'
transp_mod_ref='haydavies'
sol_pos_mod_ref='nrel_numpy'
airm_mod_ref='kastenyoung1989'
dc_mod_ref='cec'
ac_mod_ref=None
aoi_mod_ref='physical'
spec_mod_ref='no_loss'
temp_mod_ref='sapm'
loss_mod_ref='no_loss'
The required weather DataFrame consists of the hourly simulated ghi, dhi, temp, and wind speed, as well as the manually calculated dni.
Usually the midpoint of the hour is used to calculate the sun position/sun zenith, and for the sunset and sunrise hours, the midpoint of the period when the sun is above the horizon.
To calculate DNI from GHI and DHI, try using the function dni in pvlib.irradiance:
https://pvlib-python.readthedocs.io/en/latest/generated/pvlib.irradiance.dni.html#pvlib.irradiance.dni
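A hedged sketch of that approach, with an assumed site location and a tiny made-up weather frame for illustration: the zenith is evaluated at the midpoint of each hour and passed to pvlib.irradiance.dni, which is designed to avoid the unrealistically large values that a plain (GHI - DHI)/cos(zenith) gives near 90°:
import pandas as pd
import pvlib

# Hypothetical site; replace with the system's real coordinates
lat, lon = 52.0, 5.0

# weather: hourly ghi/dhi with timestamps labelling the start of each hour
weather = pd.DataFrame(
    {'ghi': [0.0, 120.0, 350.0], 'dhi': [0.0, 80.0, 150.0]},
    index=pd.date_range('2020-06-01 05:00', periods=3, freq='H', tz='UTC'))

# Solar position at the midpoint of each hour, as suggested above
midpoints = weather.index + pd.Timedelta(minutes=30)
solpos = pvlib.solarposition.get_solarposition(midpoints, lat, lon)

# Derive DNI from GHI, DHI and the midpoint zenith
weather['dni'] = pvlib.irradiance.dni(
    weather['ghi'].values, weather['dhi'].values, solpos['zenith'].values)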
Good day,
I have applied the LightGBM algorithm to a real estate price data set (85,524 observations and 167 features). I want to obtain the interaction effect of year and real estate area size on price. The dependent variable is transformed with log1p to bring it closer to a normal distribution.
I have used Python and the pdpbox module to generate an interaction plot. As I understand it, the coloring shows the average predicted price for each combination of the two variables; however, I would like to obtain the interval of the interaction, i.e. the min and max. Is it possible to do so?
LGBMR.fit(df_train.drop(["Price"], axis = 1, inplace = False), df_train["Price"])
feats = ['Year', 'Real estate area']
p = pdp.pdp_interact(LGBMR, df, model_features = columns, features = feats)
pdp.pdp_interact_plot(p, feats, plot_type = 'grid')
I am attaching the pdp interaction plot. For example, in the year 2008, a real estate object of size 0.52 has an average predicted price of 5.697, but I would like to know the min and max predicted price for this interaction.
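One way to get such an interval, sketched under the assumption of an sklearn-style LGBMR model and the same df/columns/feature names as above, is to build the grid by hand and record the minimum and maximum prediction in each cell instead of only the mean:
import numpy as np
import pandas as pd

# Grid values for the two features of interest (hypothetical choices)
year_grid = np.sort(df['Year'].unique())
area_grid = np.quantile(df['Real estate area'], np.linspace(0, 1, 10))

rows = []
for year in year_grid:
    for area in area_grid:
        X = df[columns].copy()            # same feature matrix passed to pdpbox
        X['Year'] = year                  # fix both features at this grid cell
        X['Real estate area'] = area
        preds = LGBMR.predict(X)          # ICE-style per-observation predictions
        rows.append({'Year': year, 'Real estate area': area,
                     'mean': preds.mean(), 'min': preds.min(), 'max': preds.max()})

interaction_interval = pd.DataFrame(rows)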