I have a dataset that contains 300 rows and 4 columns: Date, Hour, Counts (how many ads were aired on TV during this hour), and Visits (how many visits were made to the website during this hour). Here is an example of the data:
If I want to test the effect of the TV spots on visits to the website, should I treat it as a time series and use regression, for example? And what should the input table look like in that case? I know that I have to split the date into day and month, but how should I treat the Counts column? Do I leave the values as they are, given that my y is the number of visits?
Thanks
To avoid ending up with a single-input, single-output regression model, you could use Hour and Counts as inputs and predict Visits.
I don't know what format the hours are in; if they are in 12-hour format, convert them to 24-hour format before feeding them to your model.
If you want to predict the next dates and hours in the time series, regression models or classical time series models such as ARIMA, ARMA, or exponential smoothing would be useful.
But since you need to estimate the effectiveness of the TV spots, I recommend generating features from Counts using the tsfresh library in Python to remove the time effect, and then using a machine learning model such as SVR or Gradient Boosting for the prediction.
In your problem:
from tsfresh import extract_features

extracted_features = extract_features(df,
                                      column_id="Hour",
                                      column_kind=None,
                                      column_value="Counts",
                                      column_sort="time")
So, your target table will be:
Hour  Feature_1    Feature_2    ...  Visits(Avg)
0     min(Counts)  max(Counts)  ...  mean(Visits)
1     min(Counts)  max(Counts)  ...  mean(Visits)
2     min(Counts)  max(Counts)  ...  mean(Visits)
min() and max() are just example features; tsfresh can extract many other features. Visit here for more information.
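To illustrate the last step, here is a minimal sketch of fitting a Gradient Boosting model on the extracted features. The visits_avg target (mean Visits per Hour) is an assumption and would need to be built separately from the original table; the split sizes are placeholders.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from tsfresh.utilities.dataframe_functions import impute

# Replace NaN/inf values produced by feature extraction (tsfresh helper)
features = impute(extracted_features)

# visits_avg: mean Visits per Hour, aligned with the feature rows (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    features, visits_avg, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out hours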
I have a pandas time series y that does not work well with statsmodels functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y.mean_rainfall, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following error:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my time series is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m' or freq='D', for example. What is the workaround in this case?
I am new to using time series; any advice on how to keep my index from being ignored during forecasting would help, since this prevents any predictions from being made.
First of all, it is extremely important to understand the relationship between the datetime column and the target column (rainfall). Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the time series is a kind of aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end of month), and you have two options:
You can disaggregate your data using either equal weighting or, even better, interpolation, to create a continuous and relatively smooth time series. You can then fit your model on the daily time series and generate predictions, which will naturally also be daily. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is the code below.
import pandas as pd

# Parse dates and compute the gap (in days) and rainfall change to the next row
ts.Date = pd.to_datetime(ts.Date, format='%d/%m/%y')
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']

# Upsample to a daily grid, forward-filling the bucket-level values
ts.set_index('timesteps', inplace=True)
ts = ts.resample('d').ffill()

# Linearly interpolate within each bucket and convert to a per-day amount
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']
print(ts.head(50))
daily_rain is now the target column, and the index (timesteps) is the timestamp.
The other option is to approximate the date ranges 1-10, 10-20, 20-(EOM) as roughly 10 days each, so that they are treated as equal timesteps. Of course statsmodels won't allow that directly, so you would need to reset the index to a mock datetime index for which you maintain a mapping. Below is what you would use in statsmodels as y, but do maintain a mapping back to your original dates (a minimal sketch of this follows the example output). The frequency would be 'd' (daily), and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
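A minimal sketch of that re-indexing, assuming the original series is called y and that a plain daily range starting at an arbitrary date is acceptable as the mock index:

import pandas as pd

# Mock a regular daily index; the start date '2019-09-01' is arbitrary
mock_index = pd.date_range(start='2019-09-01', periods=len(y), freq='D')

# Keep a mapping from the mock dates back to the original dates
index_mapping = dict(zip(mock_index, y.index))

y_mock = pd.Series(y.values, index=mock_index, name=y.name)
# y_mock now has freq='D' and can be passed to SARIMAX; translate any
# fitted or forecast dates back with index_mapping where it applies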
I would recommend the first option though, as it is just more accurate in nature. You can also try out other aggregation levels, both during model training and for your predictions. More control!
The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean that, technically, you do not have enough information to construct an accurate time series: your timesteps are not equidistant and you don't know what happened between them. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach you'd need to interpolate, but given that the target variable is rainfall, which has a lot of variation, I would highly discourage this!
As far as I can see, the package uses the frequency as a premise for everything, since it's a time series problem.
So you will not be able to use it with data at irregular frequencies. In fact, you will have to make an assumption in your analysis to adapt your data for use. Some options are:
1) Consider 3 different analyses (1st days, 10th days, 20th days individually) and use 30d frequency.
2) As your data points are roughly 10 days apart, you can consider using some kind of interpolation and then resampling to a frequency of 1 day (a small sketch follows this list). Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
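A minimal sketch of option 2, assuming y is the irregular series from the question; linear interpolation is just one possible choice:

# Upsample the irregular series to a daily grid and interpolate the gaps
y_daily = y.resample('D').asfreq().interpolate(method='linear')

# y_daily now has a regular 'D' frequency and can be fed to statsmodels
print(y_daily.head(15))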
Either way, I just want to point out that how you model your problem and your data is a key thing when dealing with time series, and with data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data came from) that you get a feeling for which approach will work better.
I am trying to find anomalies in a huge sales-transactions dataset (more than 1 million observations) with thousands of unique customers. The same customer can purchase multiple times on the same date. The dataset contains a mix of both random and seasonal transactions. A dummy sample of my data is below:
Date      CustomerID  TransactionType  CompanyAccountNum  Amount
01.01.19  1           Sales            111xxx             100
01.01.19  1           Credit           111xxx             -3100
01.01.19  4           Sales            111xxx             100
02.01.19  3           Sales            311xxx             100
02.01.19  1           Refund           211xxx             -2100
03.01.19  4           Sales            211xxx             3100
Which algorithm or approach would suit this problem best? I have tried a multivariate FBprophet model (in Python) so far and received less-than-satisfactory results.
You may try the pyod package, with methods like Isolation Forest or HBOS.
It's advertised as 'a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data' but your mileage may vary in terms of performance, so first check out their benchmarks.
If you have time series data, it is better to first apply methods like moving averages or exponential smoothing to your data to remove trends and seasonality. Otherwise, all data points that fall within seasonal or trend periods will be labeled as anomalies.
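For what it's worth, a minimal pyod sketch along those lines; the feature matrix X (e.g. Amount plus encoded TransactionType per row) is an assumption, and contamination=0.01 is just a placeholder:

import numpy as np
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS

# X: numeric feature matrix built from the transactions (assumed);
# random stand-in data is used here so the sketch runs on its own
X = np.random.randn(1000, 3)

clf = IForest(contamination=0.01, random_state=0)
clf.fit(X)
labels = clf.labels_            # 0 = inlier, 1 = outlier
scores = clf.decision_scores_   # higher = more anomalous

# HBOS works the same way and is very fast on large datasets
hbos = HBOS(contamination=0.01)
hbos.fit(X)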
I have data on 5000 customers over a monthly time series, which looks like:
This is my first time dealing with time series data. Can someone explain some strategies for predicting churn probability (3 months, 6 months) in advance?
I am confused because, for every customer, the churn probability 3 or 6 months in advance will be zero (according to the target). So should I look at trends or create lag variables?
But I still don't know, if I use regression, what the target variable would be.
You can use lag features and then define churn as the target for a predictive model. Since I can see the product category being bought, you can take lags over the last 3 months for each unique customer ID and then define churn over a 3-month horizon.
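As an illustration only, here is a minimal sketch of 3-month lag features and one possible churn definition ("no purchases in the next 3 months") with pandas; the customer_id, month, and purchases columns are assumptions about the data, and the toy frame is just for the example:

import pandas as pd

# Toy data: one row per customer per month with a purchase count (assumed schema)
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'month':       ['2020-01', '2020-02', '2020-03', '2020-04'] * 2,
    'purchases':   [3, 0, 0, 0, 1, 2, 1, 4],
})
df = df.sort_values(['customer_id', 'month'])

# Lag features: purchases 1, 2 and 3 months back, per customer
for k in (1, 2, 3):
    df[f'purchases_lag_{k}'] = df.groupby('customer_id')['purchases'].shift(k)

# Sum of purchases over the next 3 months, per customer
def next_3m_sum(s):
    return s.shift(-1).fillna(0) + s.shift(-2).fillna(0) + s.shift(-3).fillna(0)

future = df.groupby('customer_id')['purchases'].transform(next_3m_sum)

# Churn label: 1 if the customer makes no purchases in the next 3 months;
# rows near the end of a customer's history have incomplete future windows
df['churn_3m'] = (future == 0).astype(int)
print(df)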
I have a series of data that corresponds to values for each day. The data covers 2 weeks, and there is a pattern in which the last 2 days of each week have drops.
data=[2,4,6,8,10,1,1,3,5,8,11,10,2,1]
I need to develop a simple prediction model in Python using this data to predict the values for next week. This model needs to account for seasonal data (or patterns).
I've tried using the pandas library but can't get it to work.
If you can explain your mathematical model as well that would be great.
So here is an approach
def runningSums(lst):
    # Generator that yields the running (cumulative) sums of the list
    s = 0
    for addend in lst:
        s += addend
        yield s

next(runningSums(data))
>>> 2
Which is the next possible value.
To obtain a list, call list() on the result of this function.
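For example, with the data above:

list(runningSums(data))
>>> [2, 6, 12, 20, 30, 31, 32, 35, 40, 48, 59, 69, 71, 72]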
For more details refer
I have created a huge log of daily activity in the format [timestamp, location]. For example
[{1365650747255, 'san francisco'},
{1365650743354, 'san francisco'},
{1365650741349, 'san mateo'},
{1365650756324, 'mountain view'},
...
{1365650813354, 'menlo park'}]
What are the ways I can mine this information to find patterns like
"On Sunday evenings, it's probable that I am near San Francisco"
"On Monday afternoons it's probable that I am near Menlo Park"
The problems are:
The dataset is huge.
It looks impossible to judge the date/time/day by applying a function to the timestamp value (unless we decode the timestamp into date/time values).
I do not see the problem here. As the timestamp counts milliseconds since the epoch, you only have to apply the modulo operator with the value being the range of interest. If you train a classifier on that, you should be able to predict every upcoming place. The main problem is not performance, as the learning is only done now and then, but how to update the learned dataset.
As already stated, you do not have to use machine learning for this; however, if you want to, this can basically be done with a k-nearest-neighbors classifier on your 1-D dataset.
[EDIT]:
Mixed up languages but fixed it: A classifier is the algorithm which will do the statistical classification.
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.[1]
As I have only used sklearn for such things, the following is a minimalistic example of how you could use a k-nearest-neighbors classifier [2]. To be able to classify, you have to convert the location strings into numbers, then train the classifier on the given dataset, and afterwards you are able to predict the location for a new timestamp.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

data = [[1365650747255, 'san francisco'],
        [1365650743354, 'san francisco'],
        [1365650741349, 'san mateo'],
        [1365650756324, 'mountain view'],
        ...
        [1365650813354, 'menlo park']]

# Map location strings to integers and replace them in place
location_mapping = {}
location_index = 0
for index, (time, location) in enumerate(data):
    if location not in location_mapping:
        location_mapping[location] = location_index
        location_index += 1
    data[index][1] = location_mapping[location]
inverse_location_mapping = {value: key for key, value in location_mapping.items()}

data = np.array(data)
week = 1000 * 60 * 60 * 24 * 7  # one week in milliseconds (the timestamps are in ms)

# Set up the classifier
classifier = KNeighborsClassifier(n_neighbors=10)

# Train the classifier on the time-of-week feature (needs a 2-D array)
classifier.fit((data[:, 0] % week).reshape(-1, 1), data[:, 1])

# Predict the desired location for a new timestamp
prediction = classifier.predict([[1365444444444 % week]])
print(inverse_location_mapping[prediction[0]])
[1] : http://en.wikipedia.org/wiki/Statistical_classification
[2] : http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
The performance of this solution depends on how granular your requirement for pattern recognition is.
Let's assume your requirement is dividing the day into 4 parts:
Morning, Noon, Evening, Night; let's call them time_slots.
Now let's take a look at how big your daily activity log is: 1 year, 2 years, 3 years?
Let's assume it is 1 year.
So we have a total of 365 * 4 = 1460 time_slots to monitor.
Now, create a simple map based on timestamps for each time_slot.
E.g. it begins at T1 and ends at T2 (where T1 and T2 are timestamps like 1365650813354).
Based on the timestamp value in your log, it is easy to find its time_slot, i.e. the evening of 28th January, or
the morning of 30th January.
You will have to store time_slot vs. place_i_was data in any suitable database with a proper schema.
That depends on the kind of querying and analysis you would want.
This way you will not need to run formulas on your dataset, and the predefined map/database lookup will serve your purpose (a small sketch is below).
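A rough in-memory sketch of that lookup idea, assuming the timestamps are in milliseconds and using a plain counter per (weekday, part-of-day) slot; the four-part split of the day is just the example above:

from collections import Counter, defaultdict
from datetime import datetime

log = [(1365650747255, 'san francisco'),
       (1365650741349, 'san mateo'),
       (1365650813354, 'menlo park')]

def time_slot(ts_ms):
    """Map a millisecond timestamp to a (weekday, part-of-day) slot."""
    dt = datetime.fromtimestamp(ts_ms / 1000)
    part = ['Night', 'Morning', 'Noon', 'Evening'][dt.hour // 6]
    return dt.strftime('%A'), part

# Count how often each location appears in each slot
slot_counts = defaultdict(Counter)
for ts, place in log:
    slot_counts[time_slot(ts)][place] += 1

# Most likely place for a given slot, e.g. Sunday evening
print(slot_counts[('Sunday', 'Evening')].most_common(1))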
I am not sure these questions require machine learning; you can use regular statistics for that, i.e. build a probability distribution plot with x = time of day and y = probability that it is San Francisco, then calculate the probability of San Francisco when the time is between a and b.
This is how to load your data into a pandas DataFrame:
from __future__ import print_function, division
import pandas as pd
import datetime

df = pd.read_csv("data.csv",
                 names=["timestamp", "location"],
                 parse_dates=["timestamp"],
                 date_parser=lambda x: datetime.datetime.fromtimestamp(int(x) / 1000))
print(df.head())
Outputs:
timestamp location
0 2013-04-11 04:25:47.255000 "san francisco"
1 2013-04-11 04:25:43.354000 "san francisco"
2 2013-04-11 04:25:41.349000 "san mateo"
3 2013-04-11 04:25:56.324000 "mountain view"
4 2013-04-11 04:26:53.354000 "menlo park"
Convert the timestamps into tokens: "sunday morning".
Then do association rule mining to obtain rules such as
night => home
sunday morning => running in the park
where you only keep those rules, where the desired locations occur on the right.
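If you want a concrete starting point, one option (an assumption on my part, not something named in the answer) is the Apriori implementation in mlxtend; each transaction below is one observation consisting of its time token and its location, using toy data:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction: one time token plus one location (toy data)
transactions = [["sunday morning", "park"],
                ["sunday morning", "park"],
                ["monday afternoon", "menlo park"],
                ["night", "home"],
                ["night", "home"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

itemsets = apriori(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)

# Keep only rules whose right-hand side is a location
locations = {"park", "menlo park", "home"}
rules = rules[rules["consequents"].apply(lambda c: c <= locations)]
print(rules[["antecedents", "consequents", "support", "confidence"]])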
First, convert the timestamp value to year, month, and weekday. Replace the timestamp column with 3 columns corresponding to year, month, and weekday.
Then you could simply group by certain ranges of dates and count the number of instances for each location (see the sketch below).
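A minimal pandas sketch of that grouping, assuming df is the DataFrame loaded above with a parsed timestamp column:

# Split the timestamp into year / month / weekday columns
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["weekday"] = df["timestamp"].dt.day_name()

# Count how often each location shows up per weekday
counts = df.groupby(["weekday", "location"]).size().rename("n")

# Turn the counts into per-weekday probabilities
probs = counts / counts.groupby(level="weekday").transform("sum")
print(probs.sort_values(ascending=False).head())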