I have a dataset which has a date variable, and a boolean variable.
The date variable has already been broken down into 'Year' and 'month'. So I have 2 fields corresponding to the date. And the boolean indicates if that particular record is late(1) or not(0).
Here is a snapshot of the data:
Date(Index) Date_Year_Key Date_Month_Key Is_Late
2014-01-01 2014 1 1
2014-01-03 2014 1 1
2014-01-03 2014 1 1
2014-01-03 2014 1 1
I want to plot the data with time to see if any trend or pattern exists in the data(orders) being late or not and if I can predict the future orders using time-series modeling.
I have tried plotting the data using an aggregate function.
temp=big_cust_tm_series.groupby(['Date_Year_Key', 'Date_Month_Key'])['Is_Late'].mean()
temp.plot(figsize=(15,5),
title= 'Late records(Monthwise)', fontsize=14)
Also, I tried this following code but it gave me an error.
import statsmodels.api as sm
sm.tsa.seasonal_decompose(temp).plot()
result = sm.tsa.stattools.adfuller(temp)
plt.show()
AttributeError: 'MultiIndex' object has no attribute 'inferred_freq'
I don't see any increasing or decreasing trend in the plot, nor any patterns, so I am not even sure whether this is a suitable case for time-series analysis.
I'll start by noting that I'm more of a general stats person and don't know much about time series modelling.
also note that "binomial data" like this isn't very informative, so you need a surprisingly large amount of it to detect changes. e.g. with 300 rows in a month and your mean of 0.05 we'd expect the monthly mean to have a standard deviation (its standard error) of sqrt(0.05 * (1-0.05) / 300) = ~0.013, so we'd expect to see monthly values within about 0.05 +- 2 * 0.013 = (0.024, 0.076) from noise alone.
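to check that arithmetic yourself, here is a minimal sketch. n and p are just the figures quoted above (300 rows, mean of 0.05), so swap in your own monthly row count and late rate:

```python
import math

n = 300   # rows in one month (figure assumed above)
p = 0.05  # overall proportion of late records (figure assumed above)

# standard error of a proportion estimated from n independent 0/1 records
se = math.sqrt(p * (1 - p) / n)

# rough 2-standard-error band around the overall mean
low, high = p - 2 * se, p + 2 * se
print(f"se={se:.4f}, band=({low:.3f}, {high:.3f})")
```

monthly means that stay inside this band are consistent with pure noise around a constant late rate.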
what you're plotting looks sensible, but I'd suggest "truncating" the date at the month, rather than working with (year, month) tuples, as it's easier to turn into plots nicely. for example, if you have a datetime64 column called Date in a Pandas dataframe called df, you could do:
df['Month'] = df['Date'].dt.to_period('M').dt.start_time
the dts are annoying, but signal the desire to work with datetime-like properties. after doing this, you can do your plot as before:
plt.plot(df.groupby('Month')['Is_Late'].mean())
and you'll get date labels on the x-axis. note that your plot is basically in line with the maths at the top (i.e. lowest value is around 0.03 and largest is around 0.08).
I'm not aware of being able to set a "link function" on any of the statsmodels time series analysis stuff, which means you probably can't do much with that. that said, you might get some mileage out of using a "generalised linear model" on your data, for example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
df['month'] = df.Date.dt.month
df['dayofweek'] = df.Date.dt.dayofweek
mod = smf.glm('Is_Late ~ C(dayofweek) + C(month)', df, family=sm.families.Binomial())
res = mod.fit()
res.summary()
would allow you to fit a "day of the week" effect (and a "month of the year" effect) to the data. note that a day-of-week effect would have shown up in your above aggregation/plot only as added noise. also note that in general fitting timeseries data with a regression like the above is not recommended, but given your use of statsmodels and the lack of an obvious way of giving a link function this serves as a low-tech fallback.
note that there are a lot more stats/time-series related packages/code available for the programming language "R"
I've been reading about time-series decomposition, and have a fairly good idea of how it works on simple examples, but am having trouble extending the concepts.
For example, some simple synthetic data I'm playing with:
So there is no actual time associated with this data. It could be sampled every second or every year. Whatever the sampling frequency, the period is roughly 160 time steps, and using this as the period argument yields the expected results:
# seasonal=13 based on example in the statsmodels user guide
decomp = STL(synth.value, period=160, seasonal=13).fit()
fig, ax = plt.subplots(3,1, figsize=(12,6))
decomp.trend.plot(title='Trend', ax=ax[0])
decomp.seasonal.plot(title='Seasonal', ax=ax[1])
decomp.resid.plot(title='Residual', ax=ax[2])
plt.tight_layout()
plt.show()
But looking at other datasets, it's not really that easy to see the period of the seasonality, so it leads me to a couple of questions:
How do you find the correct arguments in real-world messy data, particularly the period argument but also the others too? Is it just a parameter search that you perform until the decomposition looks sane?
Parameters
endog : array_like
Data to be decomposed. Must be squeezable to 1-d.
period : Periodicity of the sequence. If None and endog is a pandas Series or DataFrame, attempts to determine from endog. If endog is a ndarray, period must be provided.
seasonal : Length of the seasonal smoother. Must be an odd integer, and should normally be >= 7 (default).
trend : Length of the trend smoother. Must be an odd integer. If not provided uses the smallest odd integer greater than 1.5 * period / (1 - 1.5 / seasonal), following the suggestion in the original implementation.
I had the same question. After tracing some of their codebase, I have found the following. This may help:
Statsmodels expects a DatetimeIndex'd DataFrame.
This DatetimeIndex can have a frequency. You can either resample your data with Pandas, or explicitly set a frequency in your index. You can check df.index, look for the freq attribute.
This leads to two situations:
Your index has frequency set
If you have set a frequency in your index, statsmodels will inherit this frequency and automatically use this to determine a period.
It makes use of the freq_to_period method internally, defined here in the tsatools submodule.
To summarise what this does: the period is the number of observations that make up one seasonal cycle, and for the frequencies in the note below that cycle is assumed to be a year.
In other words: "how many observations fit into one repeat of your seasonal cycle".
For reference, read the note on the freq_to_period method definition:
Annual maps to 1, quarterly maps to 4, monthly to 12, weekly to 52.
This is both done for the method seasonal_decompose here, as well as for STL here.
Your index has no frequency set
It gets a bit more complicated if your data does not have a freq attribute set.
The seasonal_decompose checks whether it can find an inferred_freq attribute of your index set here, STL takes the same approach here.
This inferred_freq is computed with the pandas function infer_freq, which is defined in the Pandas package here, to "infer the most likely frequency given the input index". Pandas automatically gives a DataFrame with a DatetimeIndex an index.inferred_freq attribute by default, if you have at least 3 elements.
TLDR: The period parameter should be set to the number of observations in one seasonal cycle (for example 12 for monthly data with a yearly cycle). You can explicitly set this, or otherwise statsmodels will automatically infer this from the freq attribute of your DatetimeIndex. If the freq attribute is None, it will depend on Pandas' index.inferred_freq attribute to determine the frequency, and then convert this to a pre-set periodicity.
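To see which of the two situations applies to your own data, you can inspect the index directly. This is a minimal sketch using pandas only; the series and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical monthly series built with date_range: the index carries a freq.
explicit = pd.Series(range(36),
                     index=pd.date_range("2020-01-01", periods=36, freq="MS"))
print(explicit.index.freq)

# Same timestamps built from a plain list: no freq is attached to the index,
# but pandas can still infer one from the spacing.
implicit = pd.Series(range(36), index=pd.DatetimeIndex(list(explicit.index)))
print(implicit.index.freq)
print(implicit.index.inferred_freq)

# Attach the frequency explicitly so it no longer has to be inferred.
fixed = implicit.asfreq("MS")
print(fixed.index.freq)
```

If neither freq nor inferred_freq is available (for example with irregular timestamps), passing period explicitly is the remaining option.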
I am calculating ema with python on binance (BTC Futures) monthly open price data(20/12~21/01).
ema2 gives 25872.82333 on the second month like below.
df = pd.Series([19722.09, 28948.19])
ema2 = df.ewm(span=2,adjust=False).mean()
ema2
0 19722.090000
1 25872.823333
But on Binance, EMA(2) gives a different value (25108.05), as in the picture.
https://www.binance.com/en/futures/BTCUSDT_perpetual
Any help would be appreciated.
I had the same problem: the EMA calculated with pandas (df.ewm...) wasn't the same as the one from Binance. You have to use a longer series. First I used 25 candlesticks of data, then changed to 500. When you query Binance, query a lot of data, because the EMA is calculated recursively from the beginning of the series, so where the series starts affects the later values.
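A small sketch of the effect. Only the last two prices come from the question; the earlier ones are made-up placeholders, so the longer result will not match Binance either, it only shows that the history changes the answer:

```python
import pandas as pd

# Hypothetical monthly opens; only the last two values are from the question.
prices = pd.Series([9000.0, 11000.0, 10500.0, 13800.0, 19722.09, 28948.19])

# EMA(2) seeded from only the last two points, as in the question
ema_short = prices.tail(2).ewm(span=2, adjust=False).mean().iloc[-1]

# EMA(2) seeded from the start of the longer series
ema_long = prices.ewm(span=2, adjust=False).mean().iloc[-1]

print(ema_short, ema_long)
```

With adjust=False the first value of the series is used as the starting EMA, which is why the two results differ.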
best regards
I have a timeseries data and I would like to clean the data by approximating the missing data points and standardizing the sample rate.
Given the fact that there might be some unevenly spaced datapoints, I would like to define a function to get the timeseries and an interval X (e.g., 30 minutes or any other interval) as an input and gives the timeseries with points being spaced within X intervals as an output.
As you can see below, the periods are every 10 minutes but some data points are missing. So the algorithm should detect the missing times and remove them and create the appropriate times and generate the value for them. Then based on the defined function, the sample rate should be changed and standardized.
For approximating missing data and cleaning it, either average or linear interpolation would work.
Here is a part of raw data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Time": ["10:09:00","10:19:00","10:29:00","10:43:00","10:59:00 ", "11:09:00"],
"Value": ["378","378","379","377","376", "377"],
})
df
First of all you need to convert "Time" into a datetime index. Make pandas recognize the times as actual datetimes with df["Time"] = pd.to_datetime(df["Time"]). Then set time as the index: df = df.set_index("Time").
Once you have the datetime index, you can do all sorts of time-based operations with it. In your case, you want to resample: df.resample('10min') (the older alias '10T' is deprecated in recent pandas versions).
This leaves us with the following code:
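# strip stray whitespace, then parse as hours:minutes:seconds
df["Time"] = pd.to_datetime(df["Time"].str.strip(), format="%H:%M:%S")
# the values are strings in the sample data, so make them numeric
df["Value"] = pd.to_numeric(df["Value"])
df = df.set_index("Time")
df.resample('10min').mean()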
From here on you have a lot of options on how to treat cases in which you have missing data (fill / interpolate / ...), or in which you have multiple data points for one new one (average / sum / ...). I suggest you take a look at the pandas resampling api. For conversions and formatting between string and datetime refer to strftime.
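If you want the whole thing wrapped in one function that takes the series and an interval, something along these lines could work. It is only a sketch: regularize is a name I made up, it assumes unique timestamps, and it uses time-weighted linear interpolation rather than averaging:

```python
import pandas as pd

def regularize(series, interval):
    """Interpolate `series` (DatetimeIndex, unique timestamps) onto a
    regular grid spaced by `interval`, e.g. "10min" or "30min"."""
    series = series.sort_index()
    # Regular grid that lies inside the observed time range
    grid = pd.date_range(series.index.min().ceil(interval),
                         series.index.max().floor(interval),
                         freq=interval)
    # Add the grid timestamps, interpolate in time, then keep only the grid
    combined = series.reindex(series.index.union(grid))
    return combined.interpolate(method="time").reindex(grid)

raw = pd.DataFrame({
    "Time": ["10:09:00", "10:19:00", "10:29:00", "10:43:00", "10:59:00", "11:09:00"],
    "Value": [378, 378, 379, 377, 376, 377],
})
raw["Time"] = pd.to_datetime(raw["Time"], format="%H:%M:%S")
series = raw.set_index("Time")["Value"].astype(float)

every_10 = regularize(series, "10min")
every_30 = regularize(series, "30min")
print(every_10)
print(every_30)
```

Because the grid is confined to the observed range, no values are extrapolated before the first or after the last measurement.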
I have a pandas timeseries y that does not work well with statsmodel functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y.mean_rainfall, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following error:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my timeseries is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m' or freq='D', for example. What is the workaround in this case?
I am new to using timeseries, any advice on how to not have my index ignored during forecasting would help. This prevents any predictions from being possible
First of all, it is extremely important to understand what the relationship between the datetime column and the target column (rainfall) is. Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date-range between the current row's date and the next row's date. If that is the case, the timeseries is kind of an aggregated rainfall series with unequal buckets of date i.e. 1-10, 10-20, 20-(end-of-month). If that is the case, you have two options:
You can disaggregate your data using either an equal weightage or, even better, an interpolation to create a continuous and relatively smooth timeseries. You can then fit your model on the daily timeseries and generate predictions, which will also naturally be daily in nature. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is using the code below.
import pandas as pd

# ts: DataFrame with a 'Date' column (day/month/year strings) and a 'Rain' column
ts['Date'] = pd.to_datetime(ts['Date'], format='%d/%m/%y')
# days until the next observation, and change in rain over that bucket
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']
ts.set_index('timesteps', inplace=True)
# one row per day; each day carries the values of the bucket it falls in
ts = ts.resample('D').ffill()
# value interpolated between this observation and the next ...
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days
# ... divided by the bucket length in days
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']
print(ts.head(50))
daily_rain is now the target column and the index i.e. timesteps is the timestamp.
The other option is to treat the date ranges 1-10, 10-20, 20-(EOM) as roughly 10 days each, so that they count as equal timesteps. Of course statsmodels won't accept that as is, so you would need to replace the index with mock datetimes, for which you maintain a mapping back to your original dates. Below is what you would pass to statsmodels as y. The freq will be 'd' (daily), and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
I would recommend the first option though as it's just more accurate in nature. Also you can try out other aggregation levels also during model training as well as for your predictions. More control!
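For completeness, the mock-index option could be set up roughly like this. It is a sketch only: the start date of the mock index is arbitrary, and the names mock_index, mock_to_real and y_mock are mine:

```python
import pandas as pd

# Stand-in for y, using a few dates and values from the question
dates = pd.to_datetime(["2019-11-20", "2019-12-01", "2019-12-10", "2019-12-20"])
y = pd.Series([84.779, 56.215, 23.347, 31.051], index=dates, name="mean_rainfall")

# Evenly spaced mock dates, one "day" per real observation
mock_index = pd.date_range("2019-09-01", periods=len(y), freq="D")

# Mapping from each mock date back to the real observation date
mock_to_real = pd.Series(y.index, index=mock_index)

# Same values on the mock index; this is what would go into the model
y_mock = pd.Series(y.to_numpy(), index=mock_index, name=y.name)

print(y_mock)
print(mock_to_real)
```

With three observations per month, a yearly cycle spans about 36 steps on the mock scale, which is the rescaled seasonality mentioned above.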
The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean that technically you do not have enough info now to construct an accurate timeseries - your timesteps are not equidistant and you don't have enough info for what happened between the timesteps. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach, you'd need to do interpolation but given the target variable which is rainfall and rainfall has a lot of variation, I would highly discourage this!!
As I can see, the package uses the frequency as a premise for everything, since it's a time-series problem.
So you will not be able to use it with data of irregular frequency. In fact, you will have to make an assumption in your analysis to adapt your data for this use. Some options are:
1) Consider 3 different analyses (1st days, 10th days, 20th days individually) and use 30d frequency.
2) As you have ~10d equally separated data, you can consider using some kind of interpolation and then make downsampling to a frequency of 1d. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
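As an illustration of option 1, the series can be split by day of month. This is a sketch under my own assumptions: it labels each sub-series with a monthly PeriodIndex instead of a literal 30-day frequency, and the sample values are the ones shown in the question:

```python
import pandas as pd

dates = pd.to_datetime([
    "2019-10-01", "2019-10-10", "2019-10-20",
    "2019-11-01", "2019-11-10", "2019-11-20",
    "2019-12-01", "2019-12-10", "2019-12-20",
])
y = pd.Series([44.559, 46.910, 49.053, 24.881, 52.882,
               84.779, 56.215, 23.347, 31.051],
              index=dates, name="mean_rainfall")

# One regular monthly series per observation day (1st, 10th, 20th)
by_day = {}
for day in (1, 10, 20):
    part = y[y.index.day == day]
    by_day[day] = pd.Series(part.to_numpy(),
                            index=part.index.to_period("M"),
                            name=f"{y.name}_day{day}")

print(by_day[10])
```

Each of the three series then has a regular frequency and can be analysed on its own.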
Either way, I would just like to point out that how you model your problem and your data is a key thing when dealing with time series and data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data came from) that you get a feeling for which approach will work better.
I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
TIMESTAMP TYPE
0 2014-07-25 11:50:30.640 2
1 2014-07-25 11:50:46.160 3
2 2014-07-25 11:50:57.370 2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month then plot it as a bar chart with months in the x-axis (April 2014, May 2014... etc.). I managed to calculate these values using the code below
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(),dfWIM.TIMESTAMP.dt.year.max()+1):
for j in range(1,13):
print dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean()
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is ok as it is, and with some more work, I can map the results to the correct month names, then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import numpy as np
import pandas as pd
# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
'TYPE': np.random.randint(0, 10, N)})
data.head()
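Now I'll do the two groupbys using pd.Grouper (older pandas versions offered pd.TimeGrouper for this, but it has since been removed) and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
# first round: number of records per day
daily = data.set_index('TIMESTAMP').groupby(pd.Grouper(freq='D'))['TYPE'].count()
# second round: mean of the daily counts per month ('MS' labels each month by its first day)
monthly = daily.groupby(pd.Grouper(freq='MS')).mean()
ax = monthly.plot(kind='bar')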
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
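One possible tweak is to turn the month timestamps into labels like "April 2014" before plotting. The sketch below regenerates fake data so that it runs on its own, and uses resample, which does the same two-step aggregation; the seed and variable names are arbitrary:

```python
import numpy as np
import pandas as pd

# Fake data: 1000 random timestamps in 2014 (seed chosen arbitrarily)
rng = np.random.default_rng(0)
times = pd.date_range("2014-01-01", "2014-12-31 23:59", freq="min")
picks = rng.choice(len(times), size=1000, replace=False)
data = pd.DataFrame({"TIMESTAMP": times[picks]})

# Records per day, then the mean of those daily counts per month
daily = data.set_index("TIMESTAMP").resample("D").size()
monthly = daily.resample("MS").mean()

# Replace the timestamps with readable month labels for the x axis
monthly.index = monthly.index.strftime("%B %Y")
print(monthly)

# With matplotlib installed, the bar chart is then:
# ax = monthly.plot(kind="bar")
```

Days without any records inside the observed range count as zero here, which matches what resampling by day does in the original loop.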