I would like to predict values (e.g. transport volumes). As input data I have the volumes from the last two years. I already did some time series prediction on those values, basically following the instructions in Basics of Time Series Prediction and Techniques for Time Series Prediction.
I now would like to go a step further and include some indicators (e.g. economic indicators) in the prediction to see if this will increase the accuracy of the predictions.
What is the right approach to do so? Looking around, I found this post, which basically describes the same use case. Unfortunately, it got no responses.
One approach might be to do a "simple" prediction based on a model with the current volume and indicators as features and the future volume as label. But then I would lose the time series aspect, that is, the connection between the individual data points.
Do you have experience with such predictions? What worked in your case? Please point me in the right direction!
One approach might be to do a "simple" prediction based on a model with the current volume and indicators as features and the future volume as label. But then I would lose the time series aspect, that is, the connection between the individual data points.
In this case a common solution is to include N 'lagged' values (i.e. the volumes for the N previous periods) as features for every observation, in addition to the indicator features. This allows using pretty much any regression model for time series forecasting. Just make sure there is no data leakage of 'future' values when calculating your indicators.
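A minimal sketch of this with pandas, assuming a DataFrame df ordered by period with a "volume" column plus your indicator columns (all names here are hypothetical):

import pandas as pd

# df: one row per period, with "volume" plus indicator columns
N = 12  # number of lagged periods to include as features
for lag in range(1, N + 1):
    # row t receives the volume observed at t - lag, so only past values are used
    df[f"volume_lag_{lag}"] = df["volume"].shift(lag)

# the label is the next period's volume
df["target"] = df["volume"].shift(-1)
df = df.dropna()  # drop rows with incomplete lag windows

X = df.drop(columns=["target"])  # lags + indicators as features
y = df["target"]
# X and y can now be fed to pretty much any regression model

The direction of shift() matters: positive lags pull values from the past, so the features never contain future information.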
I have yearly data over time (longitudinal data) with repeated measures for many of the subjects. I think I need multilevel modeling/regression to deal with the surely correlated clusters of measurements for the same individuals over time. The data is currently in separate tables, one for each year.
I was wondering if there was something built into scikit-learn, like LinearRegression(), that would be able to conduct a multilevel regression where Level 1 is all the data over the years and Level 2 clusters on the subjects (one cluster per subject's measurements over time). And if so, whether it is better to have the longitudinal data laid out wide (where each subject's measures over time are all in one row) or stacked (where each measure for each year is its own row).
Is there a way to do this?
Estimation of random effects in multilevel models is non-trivial and you typically have to resort to Bayesian inference methods.
I would suggest you look into Bayesian inference packages such as pymc3 or brms (if you know R), where you can specify such a model. Alternatively, look at the lme4 package in R for a fully frequentist implementation of multilevel models.
Also, I think you could get some inspiration from the sleep-deprivation dataset, which is used as a textbook example of longitudinal data analysis (https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf, p. 4).
To get started in pymc3 have a look here:
https://github.com/fonnesbeck/Bios8366/blob/master/notebooks/Section4_7-Multilevel-Modeling.ipynb
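As a starting point, here is a minimal random-intercept sketch in pymc3; the data and all variable names are made up for illustration. One intercept per subject at Level 1 is drawn from a shared population distribution at Level 2, which is what captures the within-subject correlation:

import numpy as np
import pymc3 as pm

# Fake longitudinal data in stacked (long) form: one row per subject-year
n_subjects = 20
subject_idx = np.repeat(np.arange(n_subjects), 5)  # which subject each row belongs to
year = np.tile(np.arange(5), n_subjects)           # year within the study
y = 2.0 + 0.5 * year + np.random.normal(0, 1, size=year.size)

with pm.Model() as model:
    # Level 2: population distribution of the subject intercepts
    mu_a = pm.Normal("mu_a", mu=0, sigma=10)
    sigma_a = pm.HalfNormal("sigma_a", sigma=5)
    # Level 1: one random intercept per subject
    a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=n_subjects)
    b = pm.Normal("b", mu=0, sigma=10)             # shared slope for year
    sigma_y = pm.HalfNormal("sigma_y", sigma=5)
    pm.Normal("y_obs", mu=a[subject_idx] + b * year, sigma=sigma_y, observed=y)
    trace = pm.sample(1000, tune=1000)

Note the stacked layout: pymc3 (like lme4) expects one row per measurement with a column identifying the subject, rather than one row per subject.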
I have a monthly time series which I want to forecast using Prophet. I also have external regressors which are only available on a quarterly basis.
I have thought of the following possibilities:
- repeat the quarterly values across the months of each quarter and include them as regressors
- linearly interpolate the quarterly values for the months in between
What other options can I evaluate?
Which would be the most sensible thing to do in this situation?
You have to evaluate based on your business problem, but there are some questions you can ask yourself.
How are the external regressors making their predictions? Are they trained on completely different data?
If not, are they worth including?
How quickly do we expect those regressors to get "stale"? How far in the future are their predictions available? How well do they perform more than one quarter into the future?
Interpolation can be reasonable based on these factors, but don't leak information about the future to your model at training time (see the sketch after these questions).
Do they relate to subsets of your features?
If so, some feature engineering could be fun: combine the external regressor's output with your other data in meaningful ways.
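For the mechanics, a minimal sketch of the repeat-the-quarterly-value option with Prophet's add_regressor, using made-up frames monthly and quarterly (both hypothetical). Forward-filling repeats only the last known quarterly value, so nothing from the future leaks into the training rows:

import pandas as pd
from fbprophet import Prophet  # the package is named `prophet` in newer releases

# Hypothetical inputs: a monthly target and a quarterly indicator
monthly = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=36, freq="MS"),
    "y": range(36),
})
quarterly = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=12, freq="QS"),
    "indicator": range(12),
})

# Upsample quarterly -> monthly by forward-filling the last known value
reg = quarterly.set_index("ds").resample("MS").ffill().reset_index()
df = monthly.merge(reg, on="ds", how="left").ffill()

m = Prophet()
m.add_regressor("indicator")
m.fit(df)

Note that at prediction time the future dataframe passed to m.predict() must also contain the indicator column, which is exactly where the "how far ahead are the regressors available" question above starts to bite.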
I have a data file with one column full of timestamps, and I have aggregated the times into 10-minute intervals. I am trying to visualize them to find underlying patterns in the demand. I have looked at a histogram of this information, and a heat map did not return good results.
My information is just one column full of timestamps like this:
2017-08-28 14:37:00
I have 100,000 rows and I am trying to use pandas for forecasting. I don't know if I should use linear regression or a Kalman filter. So far this is my visualization:

import matplotlib.pyplot as plt

plt.figure()
df["time"].apply(lambda x: x.hour).plot.hist(bins=24)

I am trying to get more granular, down to 10-minute intervals, and then look at patterns and implement a forecasting technique.
I'm not sure I understand precisely what your question is. From what I understood, you have a one-dimensional time series of "demand" and you want to develop a prediction algorithm.
For your data exploration and identification of patterns, I understand you are having difficulty with the visualization. First, to increase the granularity of your histogram, you may want to group your data on a daily basis and plot a histogram with 24*6 = 144 bins (one per 10-minute interval). If you want to try more visualizations, some basic options are:
- you could try a simple line plot, as your data seems to be one-dimensional
- another option is to build heat maps with the hour of the day, the day of the week (Monday, Tuesday, etc.), or the month of the year as axes (see the sketch below)
- a scatter plot with the hour of the day (0h to 23h) on the x-axis, ...
You should find many different options.
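A minimal sketch of the 144-bin histogram and an hour-by-weekday heat map, assuming the timestamps live in a column named "time" (as in your snippet):

import pandas as pd
import matplotlib.pyplot as plt

df["time"] = pd.to_datetime(df["time"])

# 144 ten-minute bins per day: minutes since midnight, integer-divided by 10
ten_min_bin = (df["time"].dt.hour * 60 + df["time"].dt.minute) // 10
ten_min_bin.plot.hist(bins=144)
plt.xlabel("10-minute interval of the day")
plt.show()

# Heat map: demand counts by day of week vs hour of day
dow = df["time"].dt.dayofweek.rename("dow")
hour = df["time"].dt.hour.rename("hour")
counts = df.groupby([dow, hour]).size().unstack(fill_value=0)
plt.imshow(counts, aspect="auto")
plt.xlabel("hour of day")
plt.ylabel("day of week (0 = Monday)")
plt.colorbar(label="count")
plt.show()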
For the prediction algorithm, you did not provide enough information for us to give a hint. Try to be more specific, or do a quick search for "time series prediction".
I hope you guys can help me sort this out, as I feel this is above me. It might seem silly to some of you, but I am lost and I come to you for advice.
I am new to statistics, data analysis and big data. I just started studying and I need to make a project on churn prediction. Yes, this is sort of a homework task, but I hope you can answer some of my questions.
I would be most grateful for beginner-level, step-by-step answers.
Basically, I have a very big data set (obviously) of customer activity data from a cellular company covering 3 months, with the 4th month ending in churned or not churned. Each month has these columns:
['year',
'month',
'user_account_id',
'user_lifetime',
'user_intake',
'user_no_outgoing_activity_in_days',
'user_account_balance_last',
'user_spendings',
'user_has_outgoing_calls',
'user_has_outgoing_sms',
'user_use_gprs',
'user_does_reload',
'reloads_inactive_days',
'reloads_count',
'reloads_sum',
'calls_outgoing_count',
'calls_outgoing_spendings',
'calls_outgoing_duration',
'calls_outgoing_spendings_max',
'calls_outgoing_duration_max',
'calls_outgoing_inactive_days',
'calls_outgoing_to_onnet_count',
'calls_outgoing_to_onnet_spendings',
'calls_outgoing_to_onnet_duration',
'calls_outgoing_to_onnet_inactive_days',
'calls_outgoing_to_offnet_count',
'calls_outgoing_to_offnet_spendings',
'calls_outgoing_to_offnet_duration',
'calls_outgoing_to_offnet_inactive_days',
'calls_outgoing_to_abroad_count',
'calls_outgoing_to_abroad_spendings',
'calls_outgoing_to_abroad_duration',
'calls_outgoing_to_abroad_inactive_days',
'sms_outgoing_count',
'sms_outgoing_spendings',
'sms_outgoing_spendings_max',
'sms_outgoing_inactive_days',
'sms_outgoing_to_onnet_count',
'sms_outgoing_to_onnet_spendings',
'sms_outgoing_to_onnet_inactive_days',
'sms_outgoing_to_offnet_count',
'sms_outgoing_to_offnet_spendings',
'sms_outgoing_to_offnet_inactive_days',
'sms_outgoing_to_abroad_count',
'sms_outgoing_to_abroad_spendings',
'sms_outgoing_to_abroad_inactive_days',
'sms_incoming_count',
'sms_incoming_spendings',
'sms_incoming_from_abroad_count',
'sms_incoming_from_abroad_spendings',
'gprs_session_count',
'gprs_usage',
'gprs_spendings',
'gprs_inactive_days',
'last_100_reloads_count',
'last_100_reloads_sum',
'last_100_calls_outgoing_duration',
'last_100_calls_outgoing_to_onnet_duration',
'last_100_calls_outgoing_to_offnet_duration',
'last_100_calls_outgoing_to_abroad_duration',
'last_100_sms_outgoing_count',
'last_100_sms_outgoing_to_onnet_count',
'last_100_sms_outgoing_to_offnet_count',
'last_100_sms_outgoing_to_abroad_count',
'last_100_gprs_usage']
The end result for this homework would be a k-means cluster analysis and a churn prediction model.
My biggest headache regarding this dataset is:
How do I make a cluster analysis for monthly data including most of these variables? I tried to look for an example, but I only found examples analyzing either one variable over several months or many variables for a single month.
I am using Python and Spark.
I think I can make it work as long as I know what to do with months and a huge list of variables.
Thanks, your help will be greatly appreciated!
P.S. Would a code example be too much to ask?
Why would you use k-means here?
k-means will not do anything meaningful on such data. It is too sensitive to scaling and to attribute types (e.g. year, month).
Churn prediction is a supervised problem. Never use an unsupervised algorithm for a supervised problem: that means you are ignoring the single most valuable piece of information you have, the labels, to guide the search.
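Since you asked for code: a minimal sketch of the supervised route, assuming the three monthly tables are stacked into one frame monthly (one row per user_account_id and month) and the 4th-month outcome sits in a frame labels with a churned column; both names are made up here. The months are handled by pivoting each user's monthly rows into one wide feature row:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One wide row per user: every variable gets one column per month
features = monthly.pivot_table(index="user_account_id", columns="month")
features.columns = [f"{col}_m{m}" for col, m in features.columns]

# Attach the 4th-month churn label
data = features.join(labels.set_index("user_account_id"))

X = data.drop(columns="churned")
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))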