Date sampling / averaging for plotting in Pandas - python

Is there any way to specify the sampling rate of the x-axis in Pandas, in particular when this axis contains datetime objects? For example:
df['created_dt'][0]
datetime.date(2014, 3, 24)
Ideally I would like to specify how many days (from beginning to end) to include in the plot, either by having Pandas sub-sample from my dataframe or by averaging every N days.

I think you can simply use groupby and cut to group the data into time intervals. In this example, the original dataframe has 10 days, and I group the days into 3 intervals (80 hours each). Then you can do whatever you want with each group, for example take the average:
In [21]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 3)))
df.index = pd.date_range('1/1/2011', periods=10, freq='D')
print(df)
                   0         1         2
2011-01-01  0.125353  0.661480  0.849405
2011-01-02  0.551803  0.558052  0.905813
2011-01-03  0.221589  0.070754  0.312004
2011-01-04  0.452728  0.513566  0.535502
2011-01-05  0.730282  0.163804  0.035454
2011-01-06  0.205623  0.194948  0.180352
2011-01-07  0.586136  0.578334  0.454175
2011-01-08  0.103438  0.765212  0.570750
2011-01-09  0.203350  0.778980  0.546947
2011-01-10  0.642401  0.525348  0.500244
[10 rows x 3 columns]
In [22]:
dfgb = df.groupby(pd.cut(df.index.values.astype(float), 3), as_index=False)
df_resample = dfgb.mean()
df_resample.index = dfgb.head(1).index   # label each interval with its first day
del df_resample[None]                    # drop the unnamed grouping column left by cut
print(df_resample)
                   0         1         2
2011-01-01  0.337868  0.450963  0.650681
2011-01-05  0.507347  0.312362  0.223327
2011-01-08  0.316396  0.689847  0.539314
[3 rows x 3 columns]
In [23]:
import matplotlib.pyplot as plt

f = plt.figure()
ax0 = f.add_subplot(121)
ax1 = f.add_subplot(122)
_ = df.T.boxplot(ax=ax0)
_ = df_resample.T.boxplot(ax=ax1)
_ = [item.set_rotation(90) for item in ax0.get_xticklabels() + ax1.get_xticklabels()]
plt.tight_layout()
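For completeness: on a DatetimeIndex like this one, resample() expresses the same grouping directly (a sketch on recent pandas; resample's bin edges need not match the float-based cut above exactly):
# average in ~80-hour bins (3 groups over these 10 days)
df_resample = df.resample('80h').mean()
# or average every N days, e.g. N = 3
df_3d = df.resample('3D').mean()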

Related

Convert 5-minute data points to 30-minute intervals by averaging the data points

I have a dataset like this:
Timestamp            Index  Var1
19/03/2015 05:55:00      1     3
19/03/2015 06:00:00      2     4
19/03/2015 06:05:00      3     6
19/03/2015 06:10:00      4     5
19/03/2015 06:15:00      5     7
19/03/2015 06:20:00      6     7
19/03/2015 06:25:00      7     4
The data points were collected at 5-minute intervals. I need to convert them to 30-minute intervals by averaging Var1: for example, the first 30-minute data point should be the average of the 1st through 6th data points (rows 1 – 6) in the dataset above.
I tried using
df.groupby(pd.Grouper(key='Timestamp', freq='30min')).mean()
To start from the first timestamp instead of aligning to hours, you just need to specify origin='start'. (I found that in the docs on Grouper.)
Also, averaging the Index column doesn't really make sense. It seems like you want to select only the Var1 column.*
df.groupby(
    pd.Grouper(key='Timestamp', freq='30min', origin='start')
)['Var1'].mean()
Output:
Timestamp
2015-03-19 05:55:00    5.333333
2015-03-19 06:25:00    4.000000
Freq: 30T, Name: Var1, dtype: float64
* Or you could just as easily do something else with the Index column, for example, keep the first value from each group:
...
).agg({'Index': 'first', 'Var1': 'mean'})
                     Index      Var1
Timestamp
2015-03-19 05:55:00      1  5.333333
2015-03-19 06:25:00      7  4.000000
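For reference, the same grouping can be written with resample, which also accepts origin (pandas >= 1.1); a sketch assuming Timestamp has been parsed as datetime, as the Grouper solution also requires:
out = (df.set_index('Timestamp')
         .resample('30min', origin='start')['Var1']
         .mean())
print(out)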

Group by list of different time ranges in Pandas

Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame. eg:
import numpy as np
import pandas as pd

ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T, columns=["Start", "Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
            Start            Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data = pd.DataFrame(data=list(range(0, 40*5, 5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
# .iloc[:-1] drops the Stop row, since .loc slicing on labels is inclusive
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
    data
0    7.5
1   60.0
2  122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the range)
timedelta = [pd.timedelta_range(a, b, closed='left', freq='1s')
             for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds   # length of each range in seconds
counts = np.arange(counts.size).repeat(counts)      # one group label per generated second
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
    data
0    7.5
1   60.0
2  122.5
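Another apply-free sketch, assuming recent pandas where pd.cut accepts an IntervalIndex over timedeltas: build the half-open intervals from ranges, label each row of my_data with the interval it falls into, and group on those labels (the result is indexed by the intervals themselves):
# rows outside every range get NaN labels and are dropped by groupby
bins = pd.IntervalIndex.from_arrays(ranges.Start, ranges.Stop, closed='left')
labels = pd.cut(my_data.index, bins)
my_data.groupby(labels, observed=True).mean()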

How to replace a timedelta value with NaN in a pandas Series?

I would like to calculate the mean of a timedelta Series, excluding 00:00:00 values.
This is my Series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I would like to replace rows 5 and 9 with NaN and then apply .mean() to the Series; mean() skips NaN values, so that would give the desired result.
How can I do that?
I am trying:
df["time_column"].replace('0 days 00:00:00', np.NaN).mean()
but no values are replaced.
One idea is to use a zero Timedelta object, since the string '0 days 00:00:00' does not match the underlying Timedelta values:
out = df["time_column"].replace(pd.Timedelta(0), np.NaN).mean()
print(out)
0 days 01:20:30
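An equivalent spelling that avoids replace altogether is Series.mask(); a minimal sketch on a standalone Series built from the values above:
import pandas as pd

s = pd.Series(pd.to_timedelta(['00:28:00', '01:57:00', '00:00:00',
                               '01:27:00', '00:00:00', '01:30:00']))
# mask() turns values where the condition holds into missing values (NaT),
# and mean() skips missing values automatically
out = s.mask(s == pd.Timedelta(0)).mean()
print(out)  # 0 days 01:20:30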

How to plot kernel density plot of dates in Pandas?

I have a pandas dataframe where each observation has a date (a column of entries in datetime64 format). These dates are spread over a period of about 5 years. I would like to plot a kernel-density estimate of the dates of all the observations, with the years labelled on the x-axis.
I have figured out how to create a time-delta relative to some reference date and then create a density plot of the number of hours/days/years between each observation and the reference date:
df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')
But this isn't exactly what I want: If I convert to year-deltas, then the x-axis is right but I lose the within-year variation. But if I take a smaller unit of time like hour or day, the x-axis labels are much harder to interpret.
What's the simplest way to make this work in Pandas?
Inspired by @JohnE's answer, an alternative way to convert dates to numeric values is .toordinal().
import pandas as pd
import numpy as np

# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates, 100), columns=['dates'])

# use toordinal() to get a numeric day number
df['ordinal'] = [x.toordinal() for x in df.dates]
print(df)
        dates  ordinal
0  2010-01-13   733785
1  2010-01-16   733788
2  2010-01-22   733794
3  2010-01-01   733773
4  2010-01-04   733776
5  2010-01-28   733800
6  2010-01-04   733776
7  2010-01-08   733780
8  2010-01-10   733782
9  2010-01-20   733792
..        ...      ...
90 2010-01-19   733791
91 2010-01-28   733800
92 2010-01-01   733773
93 2010-01-15   733787
94 2010-01-04   733776
95 2010-01-22   733794
96 2010-01-13   733785
97 2010-01-26   733798
98 2010-01-11   733783
99 2010-01-21   733793
[100 rows x 2 columns]
import datetime

# plot a non-parametric KDE on the numeric ordinals
ax = df['ordinal'].plot(kind='kde')

# relabel every other xtick with the corresponding date
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
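If you'd rather not manage tick labels by hand, a hedged alternative is matplotlib's own date machinery (a sketch, assuming the df built above): date2num() yields floats in matplotlib's date units, and a DateFormatter labels the ticks directly:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# convert to matplotlib's float date representation and run the KDE on that
df['datenum'] = mdates.date2num(df['dates'])
ax = df['datenum'].plot(kind='kde')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.show()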
I imagine there is some better and automatic way to do this, but if not then this ought to be a decent workaround. First, let's set up some sample data:
import numpy as np
import pandas as pd

np.random.seed(479)
start_date = '2011-1-1'
df = pd.DataFrame({'date': np.random.choice(
    pd.date_range(start_date, periods=365*5, freq='D'), 50)})
df['rel'] = df['date'] - pd.to_datetime(start_date)
df.rel = df.rel.astype('timedelta64[D]')
        date   rel
0 2014-06-06  1252
1 2011-10-26   298
2 2013-08-24   966
3 2014-09-25  1363
4 2011-12-23   356
As you can see, 'rel' is just the number of days since the starting day. It's essentially an integer, so all you really need to do is normalize it with respect to the starting date.
df['year_as_float'] = pd.to_datetime(start_date).year + df.rel / 365.
        date   rel  year_as_float
0 2014-06-06  1252    2014.430137
1 2011-10-26   298    2011.816438
2 2013-08-24   966    2013.646575
3 2014-09-25  1363    2014.734247
4 2011-12-23   356    2011.975342
You'd need to adjust that slightly for a date not starting on Jan 1 (see the sketch below). It also ignores leap years, which really isn't a practical issue if you're just producing a KDE plot over 5 years, but it could matter depending on what else you might want to do.
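A sketch of that adjustment, offsetting by the start date's day-of-year (still ignoring leap years; for the Jan 1 start used here, dayofyear == 1 and this reproduces the column above):
start = pd.to_datetime(start_date)
df['year_as_float'] = start.year + (start.dayofyear - 1 + df.rel) / 365.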
Here's the plot
df['year_as_float'].plot(kind='kde')

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs  timestamp   value
0    2010-10-30      1
     2010-11-30      2
1    2000-01-01    300
     2007-01-01     33
     2010-01-01    400
2    2000-01-01     11
So basically the result I want is
IDs  timestamp   value
0    2010-11-30      2
1    2010-01-01    400
2    2000-01-01     11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_csv(content, header=0, sep='\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
using reset_index followed by groupby
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
              timestamp  value
IDs
0   2010-11-30 00:00:00      2
1   2010-01-01 00:00:00    400
2   2000-01-01 00:00:00     11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This takes the last row for each label in level "IDs" and, unlike last(), does not skip NaN values.
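Applied to the setup above (before the reset_index step), this returns something like the following; note that tail(1) also preserves the original MultiIndex:
print(df.groupby("IDs").tail(1))
                value
IDs timestamp
0   2010-11-30      2
1   2010-01-01    400
2   2000-01-01     11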
