I am trying to group a Pandas Dataframe into buckets of 2 days. For example, if I do the below:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03', '2017-01-04', '2017-01-04', '2017-01-05', '2017-01-06']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf', 'dfe', 'dsd', 'erw', 'fds']
df['number_of_apples'] = [1,2,3,4,5,6,2]
df = df.groupby('action_date')[['number_of_apples']].sum()
I get a dataframe grouped by action_date with number_of_apples per day.
However, if I wanted to look at the dataframe in chunks of 2 days, how could I do so? I would then like to analyze the number_of_apples per date chunk, either by making new dataframes for the dates 2017-01-01 & 2017-01-03, another for 2017-01-04 & 2017-01-05, and one last one for 2017-01-06, or just by regrouping and working within the grouped dataframe.
EDIT: I ultimately would like to make lists of users based on the number of apples they have for each day chunk, so I do not want the sum or the mean of each chunk's apples. Sorry for the confusion!
Thank you in advance!
You can use resample:
print(df.resample('2D', on='action_date')['number_of_apples'].sum().reset_index())
action_date number_of_apples
0 2017-01-01 3
1 2017-01-03 12
2 2017-01-05 8
EDIT:
print(df.resample('2D', on='action_date')['user_name'].apply(list).reset_index())
action_date user_name
0 2017-01-01 [abc, wdt]
1 2017-01-03 [sdf, dfe, dsd]
2 2017-01-05 [erw, fds]
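If you need both the apple totals and the user lists in one pass, a minimal sketch using agg with a column-to-aggregation mapping:
out = df.resample('2D', on='action_date').agg(
    {'number_of_apples': 'sum', 'user_name': list}).reset_index()
print(out)
  action_date  number_of_apples        user_name
0  2017-01-01                 3       [abc, wdt]
1  2017-01-03                12  [sdf, dfe, dsd]
2  2017-01-05                 8       [erw, fds]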
Try using pd.Grouper (the modern replacement for the deprecated pd.TimeGrouper) to group by two days.
>>> df.index = df.action_date
>>> dg = df.groupby(pd.Grouper(freq='2D'))['user_name'].apply(list)  # 2-day frequency
>>> dg.head()
action_date
2017-01-01 [abc, wdt]
2017-01-03 [sdf, dfe, dsd]
2017-01-05 [erw, fds]
My DF looks like below:
column1 column2
2020-11-01 1
2020-12-01 2
2021-01-01 3
NaT 4
NaT 5
NaT 6
Output should be like this:
column1 column2
2020-11-01 1
2020-12-01 2
2021-01-01 3
2021-02-01 4
2021-03-01 5
2021-04-01 6
I can't work out how to create the next dates (only the month and year change) based on the last existing date in the df. Is there a pythonic way to do this? Thanks for any help!
Regards
Tomasz
This is how I would do it; you could probably tidy this up into more of a one-liner, but this way illustrates the process a little more.
# convert to datetime (the sample dates are year-month-day)
df['column1'] = pd.to_datetime(df['column1'], format='%Y-%m-%d')
# forward-fill the last known date to create a group for each missing section
df['temp'] = df['column1'].ffill()
# count each row's position within its group
df['temp2'] = df.groupby('temp').cumcount()
# add that many months to the forward-filled date
df['column1'] = [x + pd.DateOffset(months=y) for x, y in zip(df['temp'], df['temp2'])]
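You can then drop the helper columns; on the sample above this should reproduce the desired output:
df = df.drop(columns=['temp', 'temp2'])
print(df)
     column1  column2
0 2020-11-01        1
1 2020-12-01        2
2 2021-01-01        3
3 2021-02-01        4
4 2021-03-01        5
5 2021-04-01        6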
pandas has built-in support for generating time series:
pd.date_range("2020-11-1", freq=pd.tseries.offsets.DateOffset(months=1), periods=10)
will give
DatetimeIndex(['2020-11-01', '2020-12-01', '2021-01-01', '2021-02-01',
'2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
'2021-07-01', '2021-08-01'],
dtype='datetime64[ns]', freq='<DateOffset: months=1>')
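Applied to the question, a minimal sketch (assuming column1 is already datetime and the NaT rows sit at the end, as in the sample):
last = df['column1'].dropna().iloc[-1]  # last existing date, 2021-01-01 in the sample
mask = df['column1'].isna()
fill = pd.date_range(last, freq=pd.tseries.offsets.DateOffset(months=1),
                     periods=mask.sum() + 1)[1:]  # skip the start date itself
df.loc[mask, 'column1'] = fill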
I have a dataframe with dates and tick-data like below
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.192 160.225
3 20160601 00:00:00.327 160.230
4 20160601 00:00:01.606 160.231
5 20160601 00:00:01.613 160.230
I want to keep only the unique values in the 'Bid' column within set intervals,
e.g. 2016-06-01 00:00:00 - 00:15:00, 2016-06-01 00:15:00 - 00:30:00, ...
The result will be a new dataframe keeping the filtered values with their datetimes.
Here's the code I have so far:
#Convert Date column to index with seconds as base
df['Date'] = pd.DatetimeIndex(df['Date'])
df['Date'] = df['Date'].astype('datetime64[s]')
df.set_index('Date', inplace=True)
#Create new DataFrame with filtered values
ts = pd.DataFrame(df.loc['2016-06-01'].between_time('00:00', '00:30')['Bid'].unique())
With the method above I lose the dates (datetime) of the filtered values in the process of creating a new DataFrame, plus I have to manually input each date and time interval, which is unrealistic.
Output:
0
0 160.225
1 160.226
2 160.230
3 160.231
4 160.232
5 160.228
6 160.227
Ideally I'm looking for an operation where I can set the time interval as a timedelta and have the operation applied to the whole file (about 8 GB) at once, creating a new DataFrame with Date and Bid columns containing the unique values within each interval, like this:
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.327 160.230
3 20160601 00:00:01.606 160.231
...
805 20160601 00:15:00.606 159.127
P.S. I also tried the pd.rolling() and pd.resample() methods with apply (e.g. lambda x: x.unique()), but I was never able to make it work; maybe someone better at this could attempt it.
Just to clarify: this is not a rolling calculation. You mentioned attempting to solve this using rolling, but from your clarification it seems you want to split the time series into discrete, non-overlapping 15-minute chunks.
Setup
import pandas as pd

df = pd.DataFrame({
    'Date': [
        '2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
        '2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
    ],
    'Bid': [150, 150, 200, 200]
})
print(df)
Date Bid
0 2016-06-01 00:00:00.020 150
1 2016-06-01 00:00:00.136 150 # Should be dropped
2 2016-06-01 00:15:00.636 200
3 2016-06-01 00:15:02.836 200 # Should be dropped
First, verify that your Date column is datetime:
df.Date = pd.to_datetime(df.Date)
Now use dt.floor to round each value down to the nearest 15 minutes, and use this new column to drop_duplicates per 15-minute window, while still keeping the full precision of your dates.
df.assign(flag=df.Date.dt.floor('15T')).drop_duplicates(['flag', 'Bid']).drop(columns='flag')
Date Bid
0 2016-06-01 00:00:00.020 150
2 2016-06-01 00:15:00.636 200
The following is from my original answer, and I still believe it holds value: if you'd like to access the unique values per group, you can make use of pd.Grouper and unique. Learning to leverage pd.Grouper is a powerful tool to have with pandas:
df.groupby(pd.Grouper(key='Date', freq='15T')).Bid.unique()
Date
2016-06-01 00:00:00 [150]
2016-06-01 00:15:00 [200]
Freq: 15T, Name: Bid, dtype: object
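If you'd rather have that result back as a flat Date/Bid frame, a sketch using explode (pandas >= 0.25); note this keeps the 15-minute bucket start as the Date rather than the original tick timestamp, which is why the drop_duplicates approach above is preferable when you need the exact times:
out = (df.groupby(pd.Grouper(key='Date', freq='15T'))
         .Bid.unique()
         .explode()
         .reset_index())
print(out)
                 Date  Bid
0 2016-06-01 00:00:00  150
1 2016-06-01 00:15:00  200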
I have a DataFrame covering one month, with a datetime column and a bunch of functions I want to apply to it, week by week. So I want to loop over the DataFrame and apply the functions to each week. How do I iterate over weekly time periods?
My DataFrame looks like this:
Here is some code that generates random datetimes:
import numpy as np
import pandas as pd

np.random.seed(123)
n = 500
df = pd.DataFrame({
    'date': pd.to_datetime(pd.DataFrame({
        'year': np.random.choice(range(2017, 2019), size=n),
        'month': np.random.choice(range(1, 2), size=n),
        'day': np.random.choice(range(1, 28), size=n),
    }))
})
df['random_num'] = np.random.choice(range(0, 1000), size=n)
The amount of data per week is inconsistent (sometimes I have 1,000 tweets per week, sometimes 100,000). Could someone please give me an example of how to loop over this dataframe by week? (I don't need aggregation or groupby functions.)
If you really don't want to use groupby and aggregations then:
for week in df['date'].dt.isocalendar().week.unique():
    this_weeks_data = df[df['date'].dt.isocalendar().week == week]
This will, of course, go wrong if you have data from more than one year. (dt.week is deprecated in recent pandas; dt.isocalendar().week is its replacement.)
Given your sample dataframe
date random_num
0 2017-01-01 214
1 2018-01-19 655
2 2017-01-24 663
3 2017-01-26 723
4 2017-01-01 974
First, you can set the index to the datetime column as follows:
df.set_index(df.date, inplace=True)
df.drop('date', axis=1, inplace=True)
This sets the index to the date column and drops the original column. You will get
>>> df.head()
            random_num
date
2017-01-01         214
2018-01-19         655
2017-01-24         663
2017-01-26         723
2017-01-01         974
Then you can use the pandas groupby function to group the data as per your frequency and apply any function of your choice.
# To group by week and count the number of occurrences
>>> df.groupby(pd.Grouper(freq='W')).count().head()
            random_num
date
2017-01-01          11
2017-01-08          65
2017-01-15          55
2017-01-22          66
2017-01-29          45
# To group by week and sum the random numbers per week
>>> df.groupby(pd.Grouper(freq='W')).sum().head()
            random_num
date
2017-01-01        7132
2017-01-08       33916
2017-01-15       31028
2017-01-22       31509
2017-01-29       22129
You can also apply any generic function myFunction by using the apply method of pandas
df.groupby(pd.Grouper(freq='W')).apply(myFunction)
If you want to apply a function myFunction to any specific column columnName after grouping, you can also do that as follows
df.groupby(pd.Grouper(freq='W'))[columnName].apply(myFunction)
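If you'd rather loop than aggregate, you can also iterate over the weekly groups directly. A minimal sketch (assuming the datetime index set above; myFunction is the same placeholder as before):
for week_start, week_df in df.groupby(pd.Grouper(freq='W')):
    # week_start is the bin label; week_df holds that week's rows
    myFunction(week_df)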
[SOLVED FOR MULTIPLE YEARS]
pd.Grouper(freq='W') works fine, but I have sometimes come across undesired behaviour related to how weeks are split when the number of weeks is not even. So this is why I sometimes prefer to do the week split by hand, as shown in this example.
So, given a dataset that spans multiple years:
import numpy as np
import pandas as pd
import datetime
# Create dataset
np.random.seed(123)
n = 100000
date = pd.to_datetime({
    'year': np.random.choice(range(2017, 2020), size=n),
    'month': np.random.choice(range(1, 13), size=n),
    'day': np.random.choice(range(1, 28), size=n)
})
random_num = np.random.choice(range(0, 1000), size=n)
df = pd.DataFrame({'date': date, 'random_num': random_num})
Such as:
print(df.head())
date random_num
0 2019-12-11 413
1 2018-06-08 594
2 2019-08-06 983
3 2019-10-11 73
4 2017-09-19 32
First create a helper index that allows you to iterate by week (considering the year as well):
df['grp_idx'] = df['date'].apply(
    lambda x: '%s-%s' % (x.year, '{:02d}'.format(x.week)))
print(df.head())
date random_num grp_idx
0 2019-12-11 413 2019-50
1 2018-06-08 594 2018-23
2 2019-08-06 983 2019-32
3 2019-10-11 73 2019-41
4 2017-09-19 32 2017-38
Then just apply the function that makes a computation on the weekly subset, something like this:
def something_to_do_by_week(week_data):
    """Computes the mean random value."""
    return week_data['random_num'].mean()
weekly_mean = df.groupby('grp_idx').apply(something_to_do_by_week)
print(weekly_mean.head())
grp_idx
2017-01 515.875668
2017-02 487.226704
2017-03 503.371681
2017-04 497.717647
2017-05 475.323420
Once you have your weekly metrics you'll probably want to get back to actual dates, which are more useful than year-week indices:
def from_year_week_to_date(year_week):
    """Convert a 'YYYY-WW' label back to an approximate date."""
    year, week = year_week.split('-')
    year, week = int(year), int(week)
    date = pd.to_datetime('%s-01-01' % year)
    date += datetime.timedelta(days=week * 7)
    return date
weekly_mean.index = [from_year_week_to_date(x) for x in weekly_mean.index]
print(weekly_mean.head())
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
2017-02-05 475.323420
dtype: float64
Finally, you can make nice plots with nice, interpretable dates.
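For example, a minimal plotting sketch (assuming matplotlib is installed):
import matplotlib.pyplot as plt

weekly_mean.plot(title='Weekly mean of random_num')
plt.show()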
Just as a sanity check, the computation using pd.Grouper(freq='W') gives me almost the same results; it simply adds an extra bin at the beginning of the pd.Series, because 'W' anchors the bins to Sundays and the first partial week therefore gets its own label.
df.set_index('date').groupby(
    pd.Grouper(freq='W')
).mean().head()
Out[27]:
random_num
date
2017-01-01 532.736364
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
I have a column in a dataframe which contains non-continuous dates. I need to group these dates by a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have an l_date-sorted dataframe, you can create a continuous dummy date (dum_date) column and group by a 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key='dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group n consecutive rows could be:
n = 2  # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i // n + 1) for i in range(df.shape[0])]
df.groupby('grouping')
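A quick check of the second approach on the sample dates (a sketch, assuming l_date has already been parsed with pd.to_datetime):
for label, chunk in df.groupby('grouping'):
    print(label, chunk['l_date'].dt.strftime('%Y-%m-%d').tolist())
# 1 ['2015-04-18', '2015-04-20']
# 2 ['2015-04-20', '2015-04-21']
# 3 ['2015-04-27', '2015-04-30']
# 4 ['2015-05-07', '2015-05-08']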
In Rodeo I created a dataframe with a 'Date' column containing a range of dates by inputting:
import numpy as np
import pandas as pd

temp_df = pd.DataFrame({
    'Date': pd.date_range('2017-01-01', periods=100, freq='D'),
    'Value': np.random.normal(10, 5, size=100).tolist()
})
However, when I click on the dataframe, it shows
Date
1483228800000
1483315200000
1483401600000
1483488000000
...
in the 'Date' column. Yet, the datetime format works properly when I try:
>>> temp_df.Date.head()
0 2017-01-01
1 2017-01-02
2 2017-01-03
3 2017-01-04
4 2017-01-05
May I know what I am missing in my code? Thanks a lot.
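For what it's worth, the integers shown are consistent with Unix epoch milliseconds, which suggests the Rodeo frame viewer is simply rendering the datetime64 values numerically rather than anything being wrong with the code. A quick check:
import pandas as pd

# 1483228800000 ms since 1970-01-01 is exactly 2017-01-01 00:00:00 UTC
print(pd.to_datetime(1483228800000, unit='ms'))  # 2017-01-01 00:00:00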