I am new to prophet (and stackoverflow in general ;) ) and have some issues with creating a predictive model using python. I am trying to predict daily sales of a product, using around 5 years of data. The data looks as follows: General data plot.
The company is closed in the weekends en during holidays, so there will be no orders. I accounted for this by creating a dataframe with al the weekends/holidays and using this dataframe as an argument for the holidays parameter. Furthermore I didn't change anything from the model, so it looks like: Prophet(holidays = my weekend/holiday dataframe).
However, my model doens't seem to work right and predicts negative values, see the following plot: Predicition 1. Hereby also the different component plots as extra information: trend, holidays, weekly, yearly. I also tried to just replace the negative values in the prediction by 0, which gives some better result (see prediction 2), but I don't think this is the right way to tackle this problem. The last thing I tried was to remove all the weekends from the training and predicting data. The results weren't good either: prediction 3.
I would love to hear some tips from you guys, for things I could try to do. If anything is unclear or you need more information, just let me know. Thank you in advance!!
My suggestions:
Try normalization
If that doesn't work try using Recurrent Neural Networks
Related
I would like to predict values (e.g. transport volumes). As input data I have the volumes from the last two years. I already did some timeseries prediction on those values basically following the instruction on Basics of Time Series Prediction and Techniques for Time Series Prediction.
I now would like to go a step further and include some indicators (e.g. economic indicators) in the prediction to see if this will increase the accuracy of the predictions.
What is the right approach to do so? Looking around I found this Post, basically describing the same usecase. Unfortunately it got no responses.
One approach might be to do a "simple" prediction based on a model with the current volume and indicators as features and the future volume as label. But I then would loose the timeseries, the connection between the single data points so to say.
Do you have experience with such predictions? What did work in your case? Please point me in the right direction!
One approach might be to do a "simple" prediction based on a model
with the current volume and indicators as features and the future
volume as label. But I then would loose the timeseries, the connection
between the single data points so to say.
In this case a common solution is to include N 'lagging' values (i.e. volumes for N previous periods) as features for every observation, in addition to some indicator value features. This allows using pretty much any regression model for time series forecasting. Just make sure there's no data leakage of the 'future' values when calculating your indicators.
I have a dataset that I am trying to analyze for a project.
The first step of the project is to basically model the data, and I am running into some issues. The data is on house sales within the past 5 years. Collecting data on buyers, cost of house, income, age, year purchased, years in loan, years at current job, and whether or not this house was foreclosed on with YES or NO.
The goal is to train a model to make predictions using machine learning, but I am stuck on part 1 - describing the data. I am using Jupyter notebooks to analyze the data and trying to put together a linear or multilinear regression model, and I am failing. When I throw together a scatter plot, my data is all over the chart with no way to really "group" the data at intersection point and cast a prediction line. This makes it difficult to figure out what is actually happening, perhaps the data I am comparing is not correlated in any way.
The problem also comes in with the YES or NO data. I was thinking this might need to be converted into 0s and 1s, but then my linear regression model would have an incredible weight on both ends of the spectrum. Perhaps regression is not the best choice?
I'm just struggling to figure out what to do and how to do it. I am kind of new to data analysis, so perhaps I am thinking of this all wrong. If anyone has any insight it would be much appreciated.
You can see in the following picture a demand problem.
My question relates to how one can/should implement fixed holidays in a LSTM model, which as seen here contain no demand and therefore cause sudden strong 1-day deviations from the average. I am specifically not referring to the change in trend between December and January
An Arima model, for example, can handle such days well.
After hours of searching the internet, all I could find was things how to deal with a change in trend. However, this is not the case, the trend remains the same and is only suspended for one day. I Hope there is someone here who has a paper or an approach for this kind of problem.
since the holydays have predefined dates, why not change the value of the data at that specific date to another value that wouldn't disturb the learning much, maybe the previous one, or the one after. or you could just remove the holydays data from your data and the sequence would be now unharmed by their drastic effect.
I'm working on a model that would predict an exam schedule for a given course and term. My input would be the term and the course name, and the output would be the date. I'm currently done with the data cleaning and preprocessing step, however, I can't wrap my head around a way to make a model whose input is two strings and the output is two numbers (the day and month of exam). One approach that I thought of would be encoding my course names, and writing the term as a binary list. I.E input: encoded(course), [0,0,1] output: day, month. and then feeding to a regression model.
I hope someone who's more experienced could tell me a better approach.
Before I start answering your question:
/rant
I know this sounds dumb and doesn't really help your question, but why are you using Neural Networks for this?!
To me, this seems like the classical case of "everybody uses ML/AI in their area, so now I have to, too!" (which is completely not true) /rant over
For string-like inputs, there exist several methods to encode these; choosing the right one might depend on your specific task. As you have a very "simple" (and predictable) input - i.e., you know in advance that there might not be any new/unseen course titles during testing/inference, or you do not need contextual/semantic information, you can resort to something like scikit-learn's LabelEncoder, which will turn it into different classes.
Alternatively, you could also throw a more heavy-weight encoding structure at the problem, that embeds the values in a matrix. Most DL frameworks offer some form of internal function for this, which basically requires you to pass an unique index for your input data, and actively learns some k-dimensional embedding vector for this. Intuitively, these embeddings correspond to a semantic or topical direction. If you have for example 3-dimensional embeddings, the first one could represent "social sciences course", the other one "technical courses", and the third for "seminar".
Of course, this is just a simplification of it, but helps imagining how it works.
For the output, predicting a specific date is actually a really good question. As I have personally never predicted dates myself, I can only recommend tips by other users. A nice answer on dates (as input) is given here.
If you can sacrifice a little bit of accuracy in the result, predicting the calendar week in which the exam is happening might be a good idea. Otherwise, you could simply treat it as two regressed values, but you might end up with invalid combinations (i.e. "negative days/months", or something like "31st February".
Depending on how much training data of high quality you have, results might vary quite heavily. Lastly, I would again recommend you to overthink whether you actually need a neural network for this task, or whether there are simpler metrics to do this.
Create dummy variables or use RandomForest. They accept text input and numerical output.
I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey I wanted to get some input if this is a sensible way to approach this (I have started playing around but and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable - so I can see that Mailing the 50% off offer in the 3d mailing is not as impactful as the $25 giftcard etc, and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some combinations of the many possible are respresented, and what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it so I'm hoping I could create something useful out of this. I have access to lots and lots of data, it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic Regression is a good first option, you could also a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column, however this may not work so well for a linear model like logistic regression as the correlation will probably not be linear.
If you wanted to model the first chart directly you could use a linear regressions for predicting the conversion rate, I'm not sure about the difference is in doing this, it's actually something I've been wondering about for a while, you've motivated me to post a question on stats.stackexchange.com