Resample time series data to find tail characteristics - python

I have daily time series data. I am able to convert to monthly (or quarterly) time series and obtain monthly mean using the resample function, provided by this link.
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
df.resample('MS').mean()
Instead of monthly mean, I am interested in obtaining monthly skewness (or kurtosis).

You could try something like this with scipy.stats.skew:
from scipy.stats import skew
df.resample('MS').agg(skew)
Or with scipy.stats.kurtosis:
from scipy.stats import kurtosis
df.resample('MS').agg(kurtosis)
Or as @Ben.T suggests, you can use the functions that pandas provides (pd.Series.skew, pd.Series.kurtosis):
df.resample('MS').agg([pd.Series.skew, pd.Series.kurtosis])
#Same as:
#df.resample('MS').skew()
#or:
#df.resample('MS').kurtosis()
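One detail worth checking when choosing between the two options: scipy.stats.skew and scipy.stats.kurtosis default to the biased estimators (bias=True), while the pandas methods apply a small-sample correction, so the monthly numbers can differ slightly. The sketch below uses made-up daily data (the column name value and the date range are assumptions for illustration) and shows that passing bias=False to scipy reproduces the pandas results.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

# Hypothetical daily data covering three months
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", "2021-03-31", freq="D")
df = pd.DataFrame({"value": rng.normal(size=len(dates))}, index=dates)

monthly = df["value"].resample("MS")

# pandas estimators (bias-corrected)
pandas_skew = monthly.apply(lambda x: x.skew())
pandas_kurt = monthly.apply(lambda x: x.kurt())

# scipy defaults (biased) and the bias-corrected variants
scipy_biased_skew = monthly.apply(lambda x: skew(x))
scipy_unbiased_skew = monthly.apply(lambda x: skew(x, bias=False))
scipy_unbiased_kurt = monthly.apply(lambda x: kurtosis(x, bias=False))

print(pd.DataFrame({
    "pandas_skew": pandas_skew,
    "scipy_skew_default": scipy_biased_skew,
    "scipy_skew_bias_false": scipy_unbiased_skew,
}))
```

Both kurtosis variants here report excess kurtosis (a normal distribution gives 0).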

Related

Pandas interpolate does not work on hourly time series data

I have got the below plot of temperature in a time series dates aggregated hourly.
What I am trying to do is to interpolate the missing values between 2019 and 2020, using pandas pd.interpolate, and generate results hourly (same frequency as the rest of the data in weather_data). My data is called weather_data, the index column is called date_time (dtype is float64) and the temperature column has also got float64 as the dtype. Here is what I have tried:
test = weather_datetime_index.temperature.interpolate("cubicspline")
test.plot()
This gave the same plot. I also tried (based on this post):
interpolated_temp = weather_datetime_index["temperature"].astype(float).interpolate(method="time")
still gave the same plot.
I also tried (as per this post):
test = weather_datetime_index.temperature.interpolate("spline",limit_direction="forward", order=1)
test.plot()
but still gave me the same plot.
How can I interpolate this data using pd.interpolate?
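One possible explanation, offered only as an assumption since the data itself is not shown: if the hours between 2019 and 2020 are missing as rows (rather than present with NaN values), interpolate has nothing to fill and the plot stays the same. A minimal sketch with synthetic data (the series below is invented for illustration) that first puts the series on a regular hourly grid with asfreq and then interpolates:

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperature with a block of rows removed entirely
idx = pd.date_range("2019-12-30", periods=96, freq="h")
temperature = pd.Series(np.sin(np.arange(96) / 6.0), index=idx, name="temperature")
gappy = temperature.drop(idx[30:60])  # the gap contains no rows, hence no NaN

# Interpolating directly changes nothing: there are no NaN values to fill
unchanged = gappy.interpolate(method="time")

# Reinsert the missing hours as NaN, then interpolate over them
hourly = gappy.asfreq("h")
filled = hourly.interpolate(method="time")

print(len(gappy), len(hourly), int(hourly.isna().sum()), int(filled.isna().sum()))
```

This requires a DatetimeIndex; an index stored as float64 would have to be converted with pd.to_datetime first.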

how can i use dataframe and datetimeindex to return rolling 12-month?

Imagine a pandas dataframe with 2 columns (“Manager Returns” and “Benchmark Returns”) and a DatetimeIndex of monthly frequency. Please write a function to calculate the rolling 12-month manager alpha and rolling-12 month tracking error (both annualized).
so far I have this but confused about the rolling-12 month:
import pandas as pd
import numpy as np
#define dummy dataframe with monthly returns
df = pd.DataFrame(1 + np.random.rand(20), columns=['returns'])
#compute 12-month rolling returns
df_roll = df.rolling(window=12).apply(np.prod) - 1
So, you want to calculate the excess return of 'Manager Returns' over 'Benchmark Returns'. First, we create some random data for these two columns.
import pandas as pd
import numpy as np
n = 20
df = pd.DataFrame(dict(
    Manager=np.random.randint(2, 9, size=n),
    Benchmark=np.random.randint(1, 7, size=n),
    index=pd.date_range("20180101", freq='MS', periods=n)))
df.set_index('index', inplace=True)
To calculate the excess return (Alpha), the rolling mean of Alpha, and the rolling Tracking Error (the rolling standard deviation of Alpha), we create a new column for each value.
# Create Alpha
df['Alpha'] = df['Manager'] - df['Benchmark']
# Rolling mean of Alpha
df['Alpha_rolling'] = df['Alpha'].rolling(12).mean()
# Rolling standard deviation of Alpha (tracking error)
df['TrackingError_rolling'] = df['Alpha'].rolling(12).std()
Edit: I see that the values should be annualized, so I guess you would have to scale the monthly figures accordingly; my finance lingo is not currently up to date.
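A minimal sketch of the annualization step, assuming the common arithmetic convention (mean monthly active return times 12, monthly standard deviation times the square root of 12). Other conventions exist, for example geometric compounding, so treat the scaling as an assumption. The function name rolling_alpha_te and its parameters are made up for illustration; the column names are taken from the question.

```python
import numpy as np
import pandas as pd

def rolling_alpha_te(returns, window=12, periods_per_year=12):
    """Rolling annualized alpha and tracking error from monthly returns."""
    # Active return: manager minus benchmark, month by month
    active = returns["Manager Returns"] - returns["Benchmark Returns"]
    rolling = active.rolling(window)
    return pd.DataFrame({
        # Arithmetic annualization of the mean active return
        "alpha_annualized": rolling.mean() * periods_per_year,
        # Standard deviation scales with the square root of time
        "tracking_error_annualized": rolling.std() * np.sqrt(periods_per_year),
    })

# Dummy monthly returns, for demonstration only
rng = np.random.default_rng(1)
idx = pd.date_range("2018-01-01", periods=24, freq="MS")
returns = pd.DataFrame({
    "Manager Returns": rng.normal(0.010, 0.03, size=len(idx)),
    "Benchmark Returns": rng.normal(0.008, 0.03, size=len(idx)),
}, index=idx)

result = rolling_alpha_te(returns)
print(result.tail())
```

The first 11 rows are NaN because a full 12-month window is not yet available.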

how to add to plot gaps when observations are missed?

Here is what i got (time series) in pandas dataframe
screenshot
(also dates were converted from timestamps)
My goal is to plot not only observations, but all the range of dates. I need to see horizontal line or gap when there is no new observations.
Dealing with data that is not observed equidistant in time is a typical challenge with real-world time series data. Given your problem, this code should work.
from datetime import datetime
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# sample Frame
df = pd.DataFrame({'time': ['2022,7,3,0,1,21', '2022,7,3,0,2,47', '2022,7,3,0,2,47',
                            '2022,7,3,0,5,5', '2022,7,3,0,5,5'],
                   'balance': [12.6, 12.54, 12.494426, 12.482481, 12.449206]})
df['time'] = pd.to_datetime(df['time'], format='%Y,%m,%d,%H,%M,%S')
# aggregate time duplicates by mean
df = df.groupby('time').mean()
df.reset_index(inplace=True)
# pick equidistant time grid
df_new = pd.DataFrame({'time': pd.date_range(start=df['time'].iloc[0], end=df['time'].iloc[-1], freq='s')})
df = pd.merge(left=df_new, right=df, on='time', how='left')
# fill nan (forward fill, i.e. padding with the last observed value)
df['balance'] = df['balance'].ffill()
df.set_index("time", inplace=True)
# plot
_ = df.plot(title='Time Series of Balance')
There are several caveats to this solution.
First, your data has a high temporal resolution (seconds). However, there are hours-long gaps in between observations. You can either coarsen the timestamps by rounding (e.g. to minutes or hours) or keep the time series at a second-by-second resolution and accept that most of your balance values will be filled-in values rather than true observations.
Second, you have different balance values for the same timestamp which indicates faulty entries or a misspecified timestamp. I unified those entries via grouping by timestamp and averaged the balance over those non-unique timestamps.
Third, filled-in gaps and true observations both have the same visual representation in the plot (blue dots in the graph). Commenting out the line under "# fill nan" would show only the true observations and leave everything in between blank.
Finally, the missing values are merely filled in via padding. See the pandas documentation for alternatives (for example interpolate) in case you want to interpolate linearly etc.
Summary
The problems described above are typical for event-driven time series data. Since you deal with a (financial) balance that constitutes a state that is only changed by events (orders), I believe that the assumptions made above are reasonable and can be adjusted easily for your or many other use cases.
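The first caveat (coarsening the timestamps) can be sketched as follows. This is only an illustration on the same sample values; taking the last balance within each minute is an assumption, chosen because a balance is a state, and the mean used above would work just as well.

```python
import pandas as pd

# Same sample observations as above
df = pd.DataFrame({
    "time": ["2022,7,3,0,1,21", "2022,7,3,0,2,47", "2022,7,3,0,2,47",
             "2022,7,3,0,5,5", "2022,7,3,0,5,5"],
    "balance": [12.6, 12.54, 12.494426, 12.482481, 12.449206],
})
df["time"] = pd.to_datetime(df["time"], format="%Y,%m,%d,%H,%M,%S")

# Coarsen to minutes and keep the last balance seen within each minute
df["minute"] = df["time"].dt.floor("min")
per_minute = df.groupby("minute")["balance"].last()

# Regular one-minute grid; minutes without observations become NaN
grid = pd.date_range(per_minute.index.min(), per_minute.index.max(), freq="min")
per_minute = per_minute.reindex(grid)

# Forward-filled version: the balance stays constant until the next event
filled = per_minute.ffill()
print(pd.DataFrame({"observed": per_minute, "filled": filled}))
```

Plotting per_minute leaves gaps where nothing was observed, while plotting filled (for example with drawstyle="steps-post") draws horizontal segments across them.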
this helped
data = data.set_index('time').resample('1M').mean()

How to use exponential smoothing to smooth the timeseries in python?

I am trying to use exponential smoothing to smooth a timeseries.
Suppose my timeseries looks like this:
import pandas as pd
data = [446.6565, 454.4733, 455.663 , 423.6322, 456.2713, 440.5881, 425.3325, 485.1494, 506.0482, 526.792 , 514.2689, 494.211 ]
index= pd.date_range(start='1996', end='2008', freq='A')
oildata = pd.Series(data, index)
I want to get the smoothed version of that timeseries.
If I did something like this;
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(oildata).fit(smoothing_level=0.2,optimized=False)
fcast1 = fit1.forecast(3).rename(r'$\alpha=0.2$')
it only outputs the three forecasted values, but not the smoothed version of my original timeseries. Is there a way to get the smoothed version of my original timeseries?
I am happy to provide more details if needed.
You can get the smoothed values in the fittedvalues attribute of the model, apparently.
import pandas as pd
data = [446.6565, 454.4733, 455.663 , 423.6322, 456.2713, 440.5881, 425.3325, 485.1494, 506.0482, 526.792 , 514.2689, 494.211 ]
index= pd.date_range(start='1996', end='2008', freq='A')
oildata = pd.Series(data, index)
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(oildata).fit(smoothing_level=0.2,optimized=False)
# fcast1 = fit1.forecast(3).rename(r'$\alpha=0.2$')
import matplotlib.pyplot as plt
plt.plot(oildata)
plt.plot(fit1.fittedvalues)
plt.show()
It yields:
The documentation states:
fittedvalues: ndarray
An array of the fitted values. Fitted by the Exponential Smoothing model.
Note that you can also use the fittedfcast attribute which contains all values + the first forecast, or the fcastvalues attribute which contains the forecast only.
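For intuition, simple exponential smoothing with a fixed alpha is the recursion s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_(t-1). The sketch below computes that recursion with pandas ewm(adjust=False) and checks it against an explicit loop. It is not the statsmodels implementation: the fitted values of the model are one-step-ahead predictions and depend on how the initial level is chosen, so they need not coincide with this series. The plain integer year index is an assumption made to keep the example short.

```python
import pandas as pd

data = [446.6565, 454.4733, 455.663, 423.6322, 456.2713, 440.5881,
        425.3325, 485.1494, 506.0482, 526.792, 514.2689, 494.211]
oildata = pd.Series(data, index=range(1996, 2008))

alpha = 0.2

# adjust=False gives the plain recursion s_t = alpha*x_t + (1-alpha)*s_(t-1)
smoothed = oildata.ewm(alpha=alpha, adjust=False).mean()

# The same recursion written out explicitly
manual = [data[0]]
for x in data[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

print(pd.DataFrame({"original": oildata, "smoothed": smoothed}))
```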
ExponentialSmoothing is not a tool for smoothing time series data; it is a time series forecasting method.
The fit() function will return an instance of the HoltWintersResults class that contains the learned coefficients. The forecast() or the predict() function on the result object can be called to make a forecast.
So by calling predict, what the class does is provide a forecast using the learned coefficients.
In order to smooth the time series, however, you can use the fittedvalues attribute, as @smarie points out.
However, I'd go with a more appropriate tool, such as a savgol_filter:
from scipy.signal import savgol_filter
savgol_filter(oildata, 5, 3)
array([444.87816   , 461.58666   , 444.99296   , 441.70785143,
       442.40769143, 438.36852857, 441.50125714, 472.05622571,
       512.20891429, 521.74822857, 517.63141429, 493.37037143])
As mentioned in the comments, the Savitzky-Golay filter fits a polynomial of a given order (polyorder) by least squares within a sliding window of a given size (window_length). This amounts to a local Taylor-like approximation and results in a smoothing of the time series.
Here's what it would look like with the above set up:
plt.plot(oildata)
plt.plot(pd.Series(savgol_filter(oildata, 5, 3), index=oildata.index))
plt.show()

xarray: compute daily anomalies from monthly resampled average (not the climatology)

xarray's documentation explains how to compute anomalies to the monthly climatology. Here I am trying to do something slightly different: from daily timeseries, I would like to compute the daily anomaly to this month's average (not from the monthly climatology).
I managed to do it using groupby and a manually created monthly stamp (code below). Is there a better, less hacky way to obtain the same result?
import xarray as xr
import numpy as np
import pandas as pd
# Create a data array
t = pd.date_range('2001', '2003', freq='D')
da = xr.DataArray(np.arange(len(t)), coords={'time':t}, dims='time')
# Monthly time stamp for groupby
da.coords['stamp'] = ('time', [str(y) + '-' + str(m) for (y, m) in
                               zip(da['time.year'].values,
                                   da['time.month'].values)])
# Anomaly
da_ano = da.groupby('stamp') - da.groupby('stamp').mean()
da_ano.plot();
You could explicitly resample the monthly time-series of means into a daily time-series. Example:
monthly = da.resample(time='1MS').mean()
upsampled_monthly = monthly.resample(time='1D').ffill()
anomalies = da - upsampled_monthly
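A variant worth considering, sketched under the assumption that the daily series may end in the middle of a month: as far as I can tell, upsampling the monthly means with resample only extends to the last monthly label (the first day of the final month), so the remaining days of that month would drop out of the subtraction. Reindexing the monthly means onto the original daily time axis with forward fill keeps every day. The example data below is made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two full years of synthetic daily data, ending on the last day of a month
t = pd.date_range("2001-01-01", "2002-12-31", freq="D")
da = xr.DataArray(np.arange(len(t), dtype=float), coords={"time": t}, dims="time")

# Monthly means, labelled by the first day of each month
monthly = da.resample(time="MS").mean()

# Put each monthly mean on every day of its month by forward-filling
# onto the original daily time axis
monthly_on_days = monthly.reindex(time=da["time"], method="ffill")

# Daily anomaly relative to the mean of the same month
anomalies = da - monthly_on_days
print(anomalies.sizes["time"], float(anomalies.isel(time=0)))
```

By construction, the anomalies within each month average to zero.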
