I have a pandas dataframe where each observation has a date (stored in a column of dtype datetime64). These dates are spread over a period of about 5 years. I would like to plot a kernel-density plot of the dates of all the observations, with the years labelled on the x-axis.
I have figured out how to create a time-delta relative to some reference date and then create a density plot of the number of hours/days/years between each observation and the reference date:
df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')
But this isn't exactly what I want: If I convert to year-deltas, then the x-axis is right but I lose the within-year variation. But if I take a smaller unit of time like hour or day, the x-axis labels are much harder to interpret.
What's the simplest way to make this work in Pandas?
Inspired by @JohnE's answer, an alternative way to convert a date to a numeric value is to use .toordinal().
import datetime
import pandas as pd
import numpy as np
# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates,100), columns=['dates'])
# use toordinal() to get datenum
df['ordinal'] = [x.toordinal() for x in df.dates]
print(df)
dates ordinal
0 2010-01-13 733785
1 2010-01-16 733788
2 2010-01-22 733794
3 2010-01-01 733773
4 2010-01-04 733776
5 2010-01-28 733800
6 2010-01-04 733776
7 2010-01-08 733780
8 2010-01-10 733782
9 2010-01-20 733792
.. ... ...
90 2010-01-19 733791
91 2010-01-28 733800
92 2010-01-01 733773
93 2010-01-15 733787
94 2010-01-04 733776
95 2010-01-22 733794
96 2010-01-13 733785
97 2010-01-26 733798
98 2010-01-11 733783
99 2010-01-21 733793
[100 rows x 2 columns]
# plot non-parametric kde on numeric datenum
ax = df['ordinal'].plot(kind='kde')
# rename the xticks with labels
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
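One possible way to avoid relabelling the ticks by hand is to let matplotlib's date locators place one tick per year. The sketch below is not part of the answer above. It assumes scipy is installed (pandas needs it for kind='kde' anyway), uses made-up sample dates, and evaluates the density directly with scipy.stats.gaussian_kde rather than through DataFrame.plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# made-up sample: 200 dates drawn from a 5-year window (an assumption)
rng = np.random.default_rng(0)
all_days = pd.date_range("2011-01-01", periods=365 * 5, freq="D")
dates = pd.Series(rng.choice(all_days, 200))

# matplotlib date numbers are floats measured in days
x = mdates.date2num(dates.to_numpy())
kde = gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 500)
density = kde(grid)

fig, ax = plt.subplots()
ax.plot(grid, density)
# interpret the float x-values as dates: one major tick per year, labelled YYYY
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
fig.savefig("kde_by_year.png")
```

The within-year variation is kept because the density is estimated on day-resolution values; only the tick labels are coarsened to years.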
I imagine there is some better and automatic way to do this, but if not then this ought to be a decent workaround. First, let's set up some sample data:
np.random.seed(479)
start_date = '2011-1-1'
df = pd.DataFrame({ 'date':np.random.choice(
pd.date_range(start_date, periods=365*5, freq='D'), 50) })
df['rel'] = df['date'] - pd.to_datetime(start_date)
df['rel'] = df['rel'].dt.days  # whole days since start_date; astype('timedelta64[D]') fails on recent pandas
date rel
0 2014-06-06 1252
1 2011-10-26 298
2 2013-08-24 966
3 2014-09-25 1363
4 2011-12-23 356
As you can see, 'rel' is just the number of days since the starting day. It's essentially an integer, so all you really need to do is normalize it with respect to the starting date.
df['year_as_float'] = pd.to_datetime(start_date).year + df.rel / 365.
date rel year_as_float
0 2014-06-06 1252 2014.430137
1 2011-10-26 298 2011.816438
2 2013-08-24 966 2013.646575
3 2014-09-25 1363 2014.734247
4 2011-12-23 356 2011.975342
You'd need to adjust that slightly for a date not starting on Jan 1. That's also ignoring any leap years which really isn't a practical issue if you're just producing a KDE plot over 5 years, but it could matter depending on what else you might want to do.
Here's the plot
df['year_as_float'].plot(kind='kde')
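If the leap-year and start-date caveats matter for your use case, the fractional year can be computed from each date directly instead of from a day offset. This is only a sketch: the helper name to_fractional_year and the three sample dates are made up for illustration.

```python
import pandas as pd

def to_fractional_year(dates: pd.Series) -> pd.Series:
    """Convert datetimes to floats such as 2012.5, using each year's true length."""
    days_in_year = 365 + dates.dt.is_leap_year.astype(int)
    # dayofyear starts at 1, so subtract 1 to make Jan 1 map to exactly .0
    return dates.dt.year + (dates.dt.dayofyear - 1) / days_in_year

sample = pd.Series(pd.to_datetime(["2011-01-01", "2012-07-02", "2014-06-06"]))
result = to_fractional_year(sample)
print(result)
```

Because nothing here depends on a reference date, it works unchanged when the data does not start on Jan 1.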
I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:
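def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    # accept strings or Timestamps; .value is nanoseconds since the epoch
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')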
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The problem is that the resampled months always start on day 1, and I want every month to start on day 15. I'm using pandas 1.1.4, and I've tried origin='15/01/2018' and offset='15', but neither works with the 'M' resample rule (they do work with 30D, but that is of no use to me). I've also tried '2SM', which doesn't work either.
So my question is: is there a way to change the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of each month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
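The same idea can be sketched without overwriting the Month column, by building the shifted dates only as a grouping key. This is a variation on the steps above rather than the answer's own code. It reuses the sample data shown earlier and relabels each group with the day its window opens (the 15th).

```python
import pandas as pd

df = pd.DataFrame({
    "Month": pd.to_datetime(["2020-05-05", "2020-05-14", "2020-05-15",
                             "2020-05-20", "2020-05-30", "2020-06-15",
                             "2020-06-20"]),
    "Amount": [1, 1, 10, 10, 10, 20, 20],
})

# shift by 14 days only to build the grouping key; df["Month"] is left untouched
key = (df["Month"] - pd.Timedelta(days=14)).dt.to_period("M")
res = df.groupby(key)["Amount"].mean()

# label each group by the first day of its window (the 15th of the month)
res.index = res.index.to_timestamp() + pd.Timedelta(days=14)
print(res)
```

Note that a fixed 14-day shift means each window runs from the 15th of one month through the 14th of the next.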
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS', the semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean for each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain that each period always runs from the 15th of the month to the 14th of the next month.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Rolling sum and rolling count; Find the mean out of them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers sum_rolling count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer
I have time-series data for a full year, with one row per minute.
timestamp day hour min rainfall_rate
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 1
2010-01-01 00:02:00 1 0 2 2
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 5
... ...
2010-12-31 23:55:00 365 23 55 3
2010-12-31 23:56:00 365 23 56 9
2010-12-31 23:57:00 365 23 57 32
2010-12-31 23:58:00 365 23 58 12
2010-12-31 23:59:00 365 23 59 22
I used sampled_df = rainfall_df.groupby(pd.Grouper(freq="M")).resample('D').sum() to group the data by month and calculate the daily sum of rainfall_rate.
Structure of sampled_df.
How do I plot the monthly data against the timestamp for every month? How do I index rainfall_rate? I want the daily rainfall_rate values for every month. Also, is the grouping correct? Suppose I want to plot timestamp vs. rainfall_rate for the month of January. How do I do that?
I am new to pandas.
To generate a plot from the resampled data, simply call DataFrame.plot. However, since the result has a MultiIndex with two timestamp levels (a month indicator and a day indicator), first call DataFrame.reset_index to drop the redundant month level. To plot a specific month, use boolean indexing on the day index:
import matplotlib.pyplot as plt
...
# RESET INDEX AND FILTER COLUMNS
sampled_df = (sampled_df.reindex(['rainfall_rate'], axis='columns')
.reset_index(level=0, drop=True)
)
### ALL MONTHS
sampled_df.plot(kind='line')
### ONLY JANUARY
sampled_df[sampled_df.index.month == 1].plot(kind='line')
To demonstrate with random, seeded data:
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(22820)
rainfall_df = pd.DataFrame({'timestamp': pd.date_range('2010-01-01 00:00',
'2010-12-31 23:59',
freq="min"),
'rainfall_rate': np.random.normal(1, 2, 525600)
})
Resampling
sampled_df = (rainfall_df.set_index('timestamp')
.groupby(pd.Grouper(freq="M"))
.resample('D')
.sum()
)
sampled_df.tail(10)
# rainfall_rate
# timestamp
# 2010-12-22 1454.287302
# 2010-12-23 1367.539650
# 2010-12-24 1460.319823
# 2010-12-25 1464.392407
# 2010-12-26 1338.139227
# 2010-12-27 1454.540103
# 2010-12-28 1553.949133
# 2010-12-29 1301.670684
# 2010-12-30 1536.173442
# 2010-12-31 1333.492614
Plots
sampled_df = sampled_df.reset_index(level=0, drop=True)
### ALL MONTHS
sampled_df.plot(kind='line')
### ONLY JANUARY
sampled_df[sampled_df.index.month == 1].plot(kind='line')
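As a possible simplification, the daily sums can also be computed by resampling alone, since daily bins never straddle a month boundary. The sketch below is an alternative to the groupby-plus-resample chain, not the answer's code. The constant rainfall value of 1 per minute is a placeholder chosen so the result is easy to check.

```python
import numpy as np
import pandas as pd

# one row per minute for 2010; the constant value 1.0 is placeholder data
idx = pd.date_range("2010-01-01 00:00", "2010-12-31 23:59", freq="min")
rainfall_df = pd.DataFrame({"rainfall_rate": np.ones(len(idx))}, index=idx)

# daily totals, indexed by a plain DatetimeIndex (no MultiIndex to reset)
daily = rainfall_df["rainfall_rate"].resample("D").sum()

# partial-string indexing selects a single month
january = daily.loc["2010-01"]
print(january.head())
```

Calling january.plot(kind='line') would then draw the January series, and daily.plot(kind='line') the whole year.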
I'm trying to use Pandas to filter a dataframe. The dataset covers 1982-01 to 2019-11, and I want to keep only the data from 2010 onwards, i.e. 2010-01 to 2019-11.
mydf = pd.read_csv('temperature_mean.csv')
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
I set the index to month, and I'm able to get the mean temperature for the filtered data. However, I need the index values as the x labels of my line graph, and I'm not able to do that. I tried to use a regex to get the data from 201x onwards, but there's still an error.
How do I get the labels for the months, i.e. 2010-01, 2010-02, ..., 2019-10, 2019-11, for the line graph?
Thanks!
mydf = pd.read_csv('temperature_mean.csv')
month mean_temp
______________________
0 1982-01-01 39
1 1985-04-01 29
2 1999-03-01 19
3 2010-01-01 59
4 2013-05-01 32
5 2015-04-01 34
6 2016-11-01 59
7 2017-08-01 14
8 2017-09-01 7
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
mean_temp
month
______________________
2010-01-01 59
2013-05-01 32
2015-04-01 34
2016-11-01 59
2017-08-01 14
2017-09-01 7
Drawing the line plot (default x & y arguments):
df.plot.line()
If, for some reason, you want to specify the column names manually:
df.reset_index().plot.line(x='month', y='mean_temp')
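To get month labels in the 2010-01 form, one option is to parse the month column as datetimes before slicing and then format the index. The sketch below assumes the CSV has the two columns shown above. The data is typed inline here instead of being read from temperature_mean.csv.

```python
import pandas as pd

# inline copy of the sample rows shown above (stand-in for the CSV file)
mydf = pd.DataFrame({
    "month": ["1982-01-01", "1985-04-01", "1999-03-01", "2010-01-01",
              "2013-05-01", "2015-04-01", "2016-11-01", "2017-08-01",
              "2017-09-01"],
    "mean_temp": [39, 29, 19, 59, 32, 34, 59, 14, 7],
})

# a real DatetimeIndex makes the date-range slice unambiguous
mydf["month"] = pd.to_datetime(mydf["month"])
df = mydf.set_index("month").sort_index().loc["2010-01":"2019-11"]

# year-month strings to use as x-axis labels
labels = df.index.strftime("%Y-%m").tolist()
print(labels)
```

Passing labels and df['mean_temp'] to matplotlib's plt.plot would treat each label as a category, so every month present in the data gets its own tick.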
user_id start_time weekday time hour
0 7622 2019-01-01 09:15:06 1 09:15:06 9
1 2689 2019-01-01 09:38:09 1 09:38:09 9
2 5320 2019-01-01 09:45:10 1 09:45:10 9
3 7024 2019-01-01 09:47:21 1 09:47:21 9
4 11565 2019-01-01 09:48:43 1 09:48:43 9
My dataframe looks something like above. I want to create timeseries plot where bottom x-axis shows hours and y-axis shows number of rows that belong to specific hour.
I want to create two charts for weekdays and weekends.
I tried using sns.distplot:
sns.set();
ax = sns.distplot(df["hour"], rug=True, hist=False)
But this has lots of curves; I want simple, smooth curves.
Previously I had a dataframe that looked like this:
datetime dayofweek count hour
2011-01-01 00:00:00 1 44 0
2011-01-01 01:00:00 1 15 1
2011-01-01 02:00:00 1 498 2
2011-01-01 03:00:00 1 11 3
and so on... and when I run:
fig,(ax1,ax2)= plt.subplots(nrows=2)
fig.set_size_inches(18,25)
sns.pointplot(data=train, x="hour", y="count", ax=ax1)
sns.pointplot(data=train, x="hour", y="count", hue="dayofweek", ax=ax2)
This renders exactly what I want.
I want something like this for my dataframe I am working on right now.
Thanks in advance!
You can groupby and count:
data = train.groupby(['weekday','hour'], as_index=False)['user_id'].count()
sns.pointplot(data=data, x='hour', y='user_id', hue='weekday')
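For the two separate charts (weekdays and weekends), the counts can be split before plotting. This is only a sketch. The rows are made up in the same shape as the question's dataframe, and the weekend flag is derived from start_time (Monday=0) because the numbering of the existing weekday column is not stated.

```python
import pandas as pd

# made-up rows shaped like the question's dataframe
df = pd.DataFrame({
    "user_id": [7622, 2689, 5320, 7024, 11565, 42],
    "start_time": pd.to_datetime([
        "2019-01-01 09:15:06", "2019-01-01 09:38:09", "2019-01-01 10:45:10",
        "2019-01-05 09:47:21", "2019-01-05 11:48:43", "2019-01-06 11:02:00",
    ]),
})
df["hour"] = df["start_time"].dt.hour
# dayofweek is Monday=0 ... Sunday=6, so 5 and 6 are Saturday and Sunday
df["is_weekend"] = df["start_time"].dt.dayofweek >= 5

# number of rows per (weekend flag, hour)
counts = (df.groupby(["is_weekend", "hour"]).size()
            .rename("n_rows").reset_index())
weekday_counts = counts[~counts["is_weekend"]]
weekend_counts = counts[counts["is_weekend"]]
print(weekday_counts)
print(weekend_counts)
```

Each of the two frames can then be passed to sns.pointplot with x='hour' and y='n_rows' on its own axes.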
Is there any way to specify the sampling rate of the x-axis in Pandas, in particular when this axis contains datetime objects? For example:
df['created_dt'][0]
datetime.date(2014, 3, 24)
Ideally I would like to specify how many days (from beginning to end) to include in the plot, either by having Pandas sub-sample from my dataframe or by averaging every N days.
I think you can simply use groupby and cut to group the data into time intervals. In this example, the original dataframe has 10 days, and I group the days into 3 intervals (about 72 hours each, since pd.cut splits the 9-day span between the first and last timestamps into equal-width bins). Then you can do whatever you want with each group, for example take the average:
In [21]:
df=pd.DataFrame(np.random.random((10,3)))
df.index=pd.date_range('1/1/2011', periods=10, freq='D')
print(df)
0 1 2
2011-01-01 0.125353 0.661480 0.849405
2011-01-02 0.551803 0.558052 0.905813
2011-01-03 0.221589 0.070754 0.312004
2011-01-04 0.452728 0.513566 0.535502
2011-01-05 0.730282 0.163804 0.035454
2011-01-06 0.205623 0.194948 0.180352
2011-01-07 0.586136 0.578334 0.454175
2011-01-08 0.103438 0.765212 0.570750
2011-01-09 0.203350 0.778980 0.546947
2011-01-10 0.642401 0.525348 0.500244
[10 rows x 3 columns]
In [22]:
dfgb=df.groupby(pd.cut(df.index.values.astype(float), 3),as_index=False)
df_resample=dfgb.mean()
df_resample.index=dfgb.head(1).index
df_resample.__delitem__(None)
print(df_resample)
0 1 2
2011-01-01 0.337868 0.450963 0.650681
2011-01-05 0.507347 0.312362 0.223327
2011-01-08 0.316396 0.689847 0.539314
[3 rows x 3 columns]
In [23]:
f=plt.figure()
ax0=f.add_subplot(121)
ax1=f.add_subplot(122)
_=df.T.boxplot(ax=ax0)
_=df_resample.T.boxplot(ax=ax1)
_=[item.set_rotation(90) for item in ax0.get_xticklabels()+ax1.get_xticklabels()]
plt.tight_layout()
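If the goal is specifically "average every N days", DataFrame.resample with a day-based rule is another route. The sketch below is an alternative to the groupby/cut approach above, with a made-up single column so the averages are easy to verify. If the dates are stored as datetime.date objects, as in the question, they would first need converting with pd.to_datetime.

```python
import numpy as np
import pandas as pd

# ten daily rows with values 0..9 (placeholder data)
df = pd.DataFrame({"value": np.arange(10.0)},
                  index=pd.date_range("2011-01-01", periods=10, freq="D"))

# average over consecutive 3-day bins, labelled by the first day of each bin
every_3_days = df.resample("3D").mean()
print(every_3_days)
```

Unlike pd.cut with a fixed number of bins, the rule here fixes the bin width, so the last bin may contain fewer rows.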