I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W') (see df_final below), but the resulting weeks are anchored on Sundays rather than Mondays.
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W; see the pandas documentation on anchored offsets:
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem, described below. First, I should point out that the previously accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt':pd.date_range('2020-03-08', periods=9, freq='D'),
'counts':0})
> s
   dt                   counts
0  2020-03-08 00:00:00  0
1  2020-03-09 00:00:00  0
2  2020-03-10 00:00:00  0
3  2020-03-11 00:00:00  0
4  2020-03-12 00:00:00  0
5  2020-03-13 00:00:00  0
6  2020-03-14 00:00:00  0
7  2020-03-15 00:00:00  0
8  2020-03-16 00:00:00  0
These nine days span three Monday-to-Sunday weeks: the weeks starting March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt',freq='W-Mon')).count()
dt                   counts
2020-03-09 00:00:00  2
2020-03-16 00:00:00  7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W'
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
dt                   counts
2020-03-08 00:00:00  1
2020-03-15 00:00:00  7
2020-03-22 00:00:00  1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
A third solution, arguably the most readable one, is to convert dt to a period first, then group, and finally (if needed) convert back to a timestamp:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
all of these solutions return what the OP asked for:
dt                   counts
2020-03-02 00:00:00  1
2020-03-09 00:00:00  7
2020-03-16 00:00:00  1
Explanation: for weekly frequencies such as W, pd.Grouper defaults both the closed and label kwargs to 'right' (most other frequencies, such as D or H, default to 'left'). Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included, and g.closed == 'right' handles this). Unfortunately, the pd.Grouper docstring does not show the default values, but you can see them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right
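To compare the three approaches side by side, here is a small self-contained sketch. It rebuilds the nine-day frame s from above; the names a, b, c and week_starts are only illustrative. All three results should be labelled with the Monday that starts each week.

```python
import pandas as pd

# Nine consecutive days; 2020-03-08 is a Sunday.
s = pd.DataFrame({"dt": pd.date_range("2020-03-08", periods=9, freq="D"),
                  "counts": 0})

# 1) Group Monday-to-Sunday weeks (labelled by Sunday), then shift labels to Monday.
a = s.groupby(pd.Grouper(key="dt", freq="W")).count()
a.index = a.index - pd.Timedelta(days=6)

# 2) Monday-anchored bins, closed and labelled on the left edge.
b = s.groupby(pd.Grouper(key="dt", freq="W-MON",
                         label="left", closed="left")).count()

# 3) Convert to weekly periods, group, then convert back to timestamps.
c = s.groupby(s["dt"].dt.to_period("W"))["counts"].count().to_timestamp()

week_starts = [str(d.date()) for d in a.index]
print(week_starts)
print(a["counts"].tolist(), b["counts"].tolist(), c.tolist())
```

Which variant to prefer is mostly a matter of taste: the period-based version makes the intent explicit, while the Grouper version stays closest to the original groupby chain.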
I have a dataframe:
df:
a  b
7  2019-05-01 00:00:01
6  2019-05-02 00:15:01
1  2019-05-06 00:10:01
3  2019-05-09 01:00:01
8  2019-05-09 04:20:01
9  2019-05-12 01:10:01
4  2019-05-16 03:30:01
And
l = [datetime.datetime(2019, 5, 2), datetime.datetime(2019, 5, 10), datetime.datetime(2019, 5, 22)]
I want to add a column with the following:
For each row, find the last date from l that is before it, and add the number of days between them.
If none of the dates is earlier than the row's date, use the delta from the smallest date in l.
So the new column will be:
df:
a  b                    delta  date
7  2019-05-01 00:00:01  -1     2019-05-02
6  2019-05-02 00:15:01  0      2019-05-02
1  2019-05-06 00:10:01  4      2019-05-02
3  2019-05-09 01:00:01  7      2019-05-02
8  2019-05-09 04:20:01  7      2019-05-02
9  2019-05-12 01:10:01  2      2019-05-10
4  2019-05-16 03:30:01  6      2019-05-10
How can I do it?
Using merge_asof to align df['b'] and the list (as Series), then computing the difference:
# ensure datetime
df['b'] = pd.to_datetime(df['b'])
# craft Series for merging (could be combined with line below)
s = pd.Series(l, name='l')
# merge and fillna with minimum date
ref = pd.merge_asof(df['b'], s, left_on='b', right_on='l')['l'].fillna(s.min())
# compute the delta as days
df['delta'] =(df['b']-ref).dt.days
output:
a b delta
0 7 2019-05-01 00:00:01 -1
1 6 2019-05-02 00:15:01 0
2 1 2019-05-06 00:10:01 4
3 3 2019-05-09 01:00:01 7
4 8 2019-05-09 04:20:01 7
5 9 2019-05-12 01:10:01 2
6 4 2019-05-16 03:30:01 6
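For reference, here is a self-contained version of the merge_asof approach that rebuilds the sample data from the question. The names ref_dates and matched are illustrative. The sketch assumes df is already sorted by b (merge_asof requires sorted keys) and casts both keys to the same datetime resolution, since merge_asof needs matching key dtypes.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "a": [7, 6, 1, 3, 8, 9, 4],
    "b": pd.to_datetime([
        "2019-05-01 00:00:01", "2019-05-02 00:15:01", "2019-05-06 00:10:01",
        "2019-05-09 01:00:01", "2019-05-09 04:20:01", "2019-05-12 01:10:01",
        "2019-05-16 03:30:01",
    ]),
})
l = [datetime.datetime(2019, 5, 2), datetime.datetime(2019, 5, 10),
     datetime.datetime(2019, 5, 22)]

# Keep both merge keys at the same resolution and sorted.
df["b"] = df["b"].astype("datetime64[ns]")
ref_dates = pd.Series(pd.to_datetime(l), name="l").astype("datetime64[ns]").sort_values()

# For each b, pick the last reference date that is <= b (NaT if there is none).
matched = pd.merge_asof(df[["b"]], ref_dates.to_frame(), left_on="b", right_on="l")

# Rows with no earlier reference date fall back to the smallest one.
ref = matched["l"].fillna(ref_dates.min())
df["delta"] = (df["b"] - ref).dt.days
print(df)
```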
Here's a one-line solution, assuming your b column holds datetime objects (otherwise convert it first with pd.to_datetime):
df['delta'] = df['b'].apply(lambda b: (b - max((d for d in l if d <= b), default=min(l))).days)
Explanation: to each value of b you apply a function that:
1. Keeps only the dates in l that are not later than the row's datetime
2. Takes the latest of those dates, falling back to the smallest date in l when none qualifies
3. Subtracts that date from the row's datetime and returns the number of whole days
This code splits the timestamp column of the dataset into the following fields (example values shown):
weekday Friday
year 2014
day 01
hour 00
minute 03
rides['weekday'] = rides.timestamp.dt.strftime("%A")
rides['year'] = rides.timestamp.dt.strftime("%Y")
rides['day'] = rides.timestamp.dt.strftime("%d")
rides['hour'] = rides.timestamp.dt.strftime("%H")
rides["minute"] = rides.timestamp.dt.strftime("%M")
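Note that strftime returns strings. If numeric fields are needed (for sorting or arithmetic), the .dt accessor attributes are an alternative. A minimal sketch, assuming a hypothetical rides frame with a timestamp column like the one above:

```python
import pandas as pd

# Hypothetical sample data; 2014-08-01 is a Friday.
rides = pd.DataFrame({"timestamp": pd.to_datetime(["2014-08-01 00:03:00",
                                                   "2014-08-02 13:45:00"])})

rides["weekday"] = rides["timestamp"].dt.day_name()  # string, e.g. "Friday"
rides["year"] = rides["timestamp"].dt.year           # integers from here on
rides["day"] = rides["timestamp"].dt.day
rides["hour"] = rides["timestamp"].dt.hour
rides["minute"] = rides["timestamp"].dt.minute
print(rides)
```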
I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days it is from today and add these values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
    temp = item.copy()
    timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
    temp["Days"] = timediff.days
    newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that the calculation is done only when the date value is valid?
I would convert the whole Date column to datetime using pd.to_datetime() with errors set to 'coerce', which replaces the 'N/A' string with NaT (Not a Time):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
dft['Days'] = (datetime.now() - dft['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
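Put together as a self-contained sketch: the frame is rebuilt from the question, and the reference date is fixed to 2022-06-29 (an assumption made here so the numbers are reproducible; in real use it would be something like pd.Timestamp.now()).

```python
import pandas as pd

dft = pd.DataFrame({
    "Date": ["02/01/2022", "03/01/2022", "N/A",
             "03/11/2022", "03/15/2022", "05/01/2022"],
    "Total Value": [2, 6, 4, 4, 4, 4],
})

# Fixed reference date for reproducibility (assumption, not from the question).
today = pd.Timestamp("2022-06-29")

# Unparseable entries such as "N/A" become NaT instead of raising.
dates = pd.to_datetime(dft["Date"], format="%m/%d/%Y", errors="coerce")

# NaT propagates through the subtraction and ends up as NaN in the day count.
dft["Days"] = (today - dates).dt.days
print(dft)
```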
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
from datetime import datetime
dft['Days'] = (
    dft['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({'Date': [
'02/01/2022',
'03/01/2022',
None,
'03/11/2022',
'03/15/2022',
'05/01/2022'],
'Total Value': [2,6,4,4,4,4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
(dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59
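If the goal is simply to keep integer day counts next to a missing value, another option (not the mask() approach above, just a sketch) is pandas' nullable Int64 dtype. The fixed reference date 2022-06-29 is an assumption for reproducibility.

```python
import pandas as pd

dft = pd.DataFrame({
    "Date": ["02/01/2022", "03/01/2022", None,
             "03/11/2022", "03/15/2022", "05/01/2022"],
    "Total Value": [2, 6, 4, 4, 4, 4],
})

today = pd.Timestamp("2022-06-29")  # assumed reference date
dates = pd.to_datetime(dft["Date"], format="%m/%d/%Y")

# The nullable "Int64" dtype stores integers and represents the gap as <NA>.
dft["Days"] = (today - dates).dt.days.astype("Int64")
print(dft)
```

The original Date strings stay untouched, and the Days column holds integers rather than floats.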
This is a real use case that I am trying to implement in my work.
Sample data (fake data but similar data structure)
Lap Starttime Endtime
1 10:00:00 10:05:00
format: hh:mm:ss
Desired output
Lap time
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
So far I am only trying to think through the logic and techniques required; the code below is not correct.
import re
import pandas as pd
df = pd.read_csv('sample.csv')
#1. determine how many rows to generate, e.g. 10:00 to 10:05 is 6 rows
df['time'] = df['Endtime'] - df['Starttime']
#2. add one new row per added minute, e.g. 6 rows
for i in No_of_rows:
    if df['time'] < df['Endtime']: #if 'time' is still before the end time, keep appending
        df['time'] = df['Starttime'] += 1 #not sure how to select the minute part only
    else:
        continue
Pardon my limited coding skills. I appreciate all the help from you experts. Thanks!
Try with pd.date_range and explode:
#convert to datetime if needed
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
#create list of 1min ranges
df["Range"] = df.apply(lambda x: pd.date_range(x["Starttime"], x["Endtime"], freq="1min"), axis=1)
#explode, drop unneeded columns and keep only time
df = df.drop(["Starttime", "Endtime"], axis=1).explode("Range")
df["Range"] = df["Range"].dt.time
>>> df
Range
Lap
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
Input df:
df = pd.DataFrame({"Lap": [1],
"Starttime": ["10:00:00"],
"Endtime": ["10:05:00"]}).set_index("Lap")
>>> df
Starttime Endtime
Lap
1 10:00:00 10:05:00
You can convert the times to datetimes. That will arbitrarily prepend today's date (whatever date you're running this on), but we can remove it later, and it allows for easier manipulation:
>>> bounds = df[['Starttime', 'Endtime']].transform(pd.to_datetime)
>>> bounds
Starttime Endtime
0 2021-09-29 10:00:00 2021-09-29 10:05:00
1 2021-09-29 10:00:00 2021-09-29 10:02:00
Then we can simply use pd.date_range with a 1 minute frequency:
>>> times = bounds.agg(lambda s: pd.date_range(*s, freq='1min'), axis='columns')
>>> times
0 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
1 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
dtype: object
Now joining that with the Lap info and using df.explode():
>>> result = df[['Lap']].join(times.rename('time')).explode('time').reset_index(drop=True)
>>> result
Lap time
0 1 2021-09-29 10:00:00
1 1 2021-09-29 10:01:00
2 1 2021-09-29 10:02:00
3 1 2021-09-29 10:03:00
4 1 2021-09-29 10:04:00
5 1 2021-09-29 10:05:00
6 2 2021-09-29 10:00:00
7 2 2021-09-29 10:01:00
8 2 2021-09-29 10:02:00
Finally we wanted to remove the day:
>>> result['time'] = result['time'].dt.time
>>> result
Lap time
0 1 10:00:00
1 1 10:01:00
2 1 10:02:00
3 1 10:03:00
4 1 10:04:00
5 1 10:05:00
6 2 10:00:00
7 2 10:01:00
8 2 10:02:00
The objects in your series are now datetime.time
Here is another way without using apply/agg:
Convert to datetime first:
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
Get the difference between the end and start times in minutes, then use index.repeat to repeat each row that many times plus one. Then use groupby and cumcount to number the repeated rows, convert that number to a timedelta in minutes with pd.to_timedelta, and add it to the start time:
repeats = df['Endtime'].sub(df['Starttime']).dt.total_seconds()//60
out = df.loc[df.index.repeat(repeats+1),['Lap','Starttime']]
out['Starttime'] = (out['Starttime'].add(
pd.to_timedelta(out.groupby("Lap").cumcount(),'min')).dt.time)
print(out)
Lap Starttime
0 1 10:00:00
0 1 10:01:00
0 1 10:02:00
0 1 10:03:00
0 1 10:04:00
0 1 10:05:00
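For comparison, here is a compact self-contained variant of the date_range plus explode idea on two laps. The second lap and the names laps and out are made up for illustration.

```python
import pandas as pd

# Sample data; lap 2 is hypothetical and added only to show several rows.
laps = pd.DataFrame({"Lap": [1, 2],
                     "Starttime": ["10:00:00", "10:00:00"],
                     "Endtime": ["10:05:00", "10:02:00"]})

start = pd.to_datetime(laps["Starttime"], format="%H:%M:%S")
end = pd.to_datetime(laps["Endtime"], format="%H:%M:%S")

# One list of datetime.time values per lap, one entry per minute (end inclusive).
laps["time"] = pd.Series(
    [list(pd.date_range(s, e, freq="1min").time) for s, e in zip(start, end)],
    index=laps.index,
)

# Turn each list element into its own row.
out = laps[["Lap", "time"]].explode("time", ignore_index=True)
print(out)
```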
I have a dataframe of a step counter. It has a column M_DATE (dd-mm-yy hh-mm-ss) that I set to date time. It also has a column M_STEPS that contains the number of steps that are done.
I split the date column into several columns, including one named "day_of_week". This one holds the name of the day.
I wanted to use a groupby function on the day_of_week and wanted to have the mean per Monday, Tuesday, Wednesday etc. But I get an answer that doesn't look right.
I have tried
To get the names of the days I did:
df['day_of_week'] = df['M_DATE'].dt.day_name()
then I did :
df.groupby('day_of_week')['M_STEPS'].mean()
I hoped that it would group, for example, all the Mondays and then give me the mean of the number of steps taken on Mondays. But the outcome is a very big number that I cannot make sense of.
The strange thing is when I use:
df.groupby('day_of_week')['M_STEPS'].sum()
it does give me a correct number.
What am I doing wrong?
Edit
Here I copied and pasted the df.head()
M_ID M_DATE M_CALORIES M_STEPS M_DISTANCE M_METS M_WEEK M_WEEKDAY M_HOUR M_MINUTE year month day day_of_week
0 27 2016-01-24 00:00:00 1 0 0.0 10 3 1 0 0 2016 1 24 Sunday
1 28 2016-01-24 00:01:00 1 0 0.0 10 3 1 0 1 2016 1 24 Sunday
2 29 2016-01-24 00:02:00 1 0 0.0 10 3 1 0 2 2016 1 24 Sunday
3 30 2016-01-24 00:03:00 1 0 0.0 10 3 1 0 3 2016 1 24 Sunday
4 31 2016-01-24 00:04:00 1 0 0.0 10 3 1 0 4 2016 1 24 Sunday
Let's say you have:
day_of_week M_STEPS
Monday 1
Monday 2
Tuesday 1
Tuesday 3
then df.groupby('day_of_week')['M_STEPS'].mean():
Monday 1.5
Tuesday 2
and df.groupby('day_of_week')['M_STEPS'].sum():
Monday 3
Tuesday 4
That is simply what groupby does; the result is probably just sorted differently from what you expect. Could you add your original dataframe to your example?
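A small runnable sketch of the difference, using made-up minute-level records shaped like the df.head() in the question. One thing worth checking (an assumption about the intent, not something the question states): with one row per minute, mean() averages over records, not over days. To get the average daily total per weekday, sum per calendar day first and then average.

```python
import pandas as pd

# Made-up minute-level records; 2016-01-24 and 2016-01-31 are Sundays.
df = pd.DataFrame({
    "M_DATE": pd.to_datetime(["2016-01-24 00:00", "2016-01-24 00:01",
                              "2016-01-25 00:00", "2016-01-25 00:01",
                              "2016-01-31 00:00", "2016-01-31 00:01"]),
    "M_STEPS": [1, 2, 1, 3, 5, 5],
})
df["day_of_week"] = df["M_DATE"].dt.day_name()

# Mean over individual records for each weekday.
per_record_mean = df.groupby("day_of_week")["M_STEPS"].mean()

# Total per calendar day first, then the mean of those totals per weekday.
daily_totals = df.groupby(
    [df["M_DATE"].dt.normalize().rename("date"), "day_of_week"]
)["M_STEPS"].sum()
per_day_mean = daily_totals.groupby(level="day_of_week").mean()

print(per_record_mean)
print(per_day_mean)
```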
I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate table.
The top of the first table table_a looks like this:
COMPANY_ID DATE MEASURE
1 2010-01-01 00:00:00 10
1 2010-01-02 00:00:00 10
1 2010-01-03 00:00:00 10
1 2010-01-04 00:00:00 10
1 2010-01-05 00:00:00 10
Here is the code to create the table:
table_a = pd.concat(\
[pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
'COMPANY_ID': 1 , 'MEASURE': 10}),\
pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
'COMPANY_ID': 2 , 'MEASURE': 10})])
The second table, table_b, looks like this:
COMPANY END_DATE
1 2010-03-01 00:00:00
1 2010-06-02 00:00:00
2 2010-03-01 00:00:00
2 2010-06-02 00:00:00
and the code to create it is:
table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),\
'COMPANY':(1,1,2,2)})
I want to be able to get the sum of the 'measure' column for each 'COMPANY_ID' for each 30-day period prior to the 'END_DATE' in table_b.
This is (I think) the SQL equivalent:
select
    b.COMPANY,
    b.END_DATE,
    sum(a.MEASURE) AS MEASURE_TO_END_DATE
from table_a a, table_b b
where a.COMPANY_ID = b.COMPANY and
    a.DATE < b.END_DATE and
    a.DATE > b.END_DATE - 30
group by b.COMPANY, b.END_DATE;
Well, I can think of a few ways:
1. Essentially blow up the dataframe by merging on the exact field (company), then filter on the 30-day windows after the merge. This should be fast but could use lots of memory.
2. Move the merging and filtering on the 30-day window into a groupby(). This results in a merge for each group, so it is slower but should use less memory.
Option #1
Suppose your data looks like the following (I expanded your sample data):
print(df)
company date measure
0 0 2010-01-01 10
1 0 2010-01-15 10
2 0 2010-02-01 10
3 0 2010-02-15 10
4 0 2010-03-01 10
5 0 2010-03-15 10
6 0 2010-04-01 10
7 1 2010-03-01 5
8 1 2010-03-15 5
9 1 2010-04-01 5
10 1 2010-04-15 5
11 1 2010-05-01 5
12 1 2010-05-15 5
print(windows)
company end_date
0 0 2010-02-01
1 0 2010-03-15
2 1 2010-04-01
3 1 2010-05-15
Create a beginning date for the 30 day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print(windows)
company end_date beg_date
0 0 2010-02-01 2010-01-02
1 0 2010-03-15 2010-02-13
2 1 2010-04-01 2010-03-02
3 1 2010-05-15 2010-04-15
Now do a merge and then select based on if date falls within beg_date and end_date:
df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print(df)
company date measure end_date beg_date
2 0 2010-01-15 10 2010-02-01 2010-01-02
4 0 2010-02-01 10 2010-02-01 2010-01-02
7 0 2010-02-15 10 2010-03-15 2010-02-13
9 0 2010-03-01 10 2010-03-15 2010-02-13
11 0 2010-03-15 10 2010-03-15 2010-02-13
16 1 2010-03-15 5 2010-04-01 2010-03-02
18 1 2010-04-01 5 2010-04-01 2010-03-02
21 1 2010-04-15 5 2010-05-15 2010-04-15
23 1 2010-05-01 5 2010-05-15 2010-04-15
25 1 2010-05-15 5 2010-05-15 2010-04-15
You can compute the 30 day window sums by grouping on company and end_date:
print(df.groupby(['company','end_date'])[['measure']].sum())
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
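Option #1 as a single self-contained sketch, rebuilding the sample df and windows shown above. Here beg_date is computed with pd.Timedelta, and Series.between (inclusive on both ends) stands in for the two comparisons; merged, in_window and totals are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(["2010-02-01", "2010-03-15",
                                "2010-04-01", "2010-05-15"]),
})
windows["beg_date"] = windows["end_date"] - pd.Timedelta(days=30)

# Pair every row with every window of the same company, then keep rows in range.
merged = df.merge(windows, on="company", how="left")
in_window = merged[merged["date"].between(merged["beg_date"], merged["end_date"])]

totals = in_window.groupby(["company", "end_date"])["measure"].sum()
print(totals)
```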
Option #2
Move all merging into a groupby. This should be better on memory but I would think much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
def cond_merge(g, windows):
    # merge this company's rows with the windows, then keep rows inside each window
    g = g.merge(windows, on='company', how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()
print(df.groupby('company').apply(cond_merge, windows))
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another option
If your windows never overlap (as in the example data), you could do something like the following, which doesn't blow up the dataframe and is still pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print(df)
company date measure end_date
0 0 2010-01-01 10 NaT
1 0 2010-01-15 10 NaT
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 NaT
4 0 2010-03-01 10 NaT
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 NaT
8 1 2010-03-15 5 NaT
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 NaT
11 1 2010-05-01 5 NaT
12 1 2010-05-15 5 2010-05-15
This merge essentially inserts your window end dates into the dataframe; backfilling the end dates (by group) then gives you a structure from which you can easily create your summation windows:
df['end_date'] = df.groupby('company')['end_date'].bfill()
print(df)
company date measure end_date
0 0 2010-01-01 10 2010-02-01
1 0 2010-01-15 10 2010-02-01
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 2010-03-15
4 0 2010-03-01 10 2010-03-15
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 2010-04-01
8 1 2010-03-15 5 2010-04-01
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 2010-05-15
11 1 2010-05-01 5 2010-05-15
12 1 2010-05-15 5 2010-05-15
df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print(df)
company date measure end_date beg_date
0 0 2010-01-01 10 2010-02-01 2010-01-02
1 0 2010-01-15 10 2010-02-01 2010-01-02
2 0 2010-02-01 10 2010-02-01 2010-01-02
3 0 2010-02-15 10 2010-03-15 2010-02-13
4 0 2010-03-01 10 2010-03-15 2010-02-13
5 0 2010-03-15 10 2010-03-15 2010-02-13
7 1 2010-03-01 5 2010-04-01 2010-03-02
8 1 2010-03-15 5 2010-04-01 2010-03-02
9 1 2010-04-01 5 2010-04-01 2010-03-02
10 1 2010-04-15 5 2010-05-15 2010-04-15
11 1 2010-05-01 5 2010-05-15 2010-04-15
12 1 2010-05-15 5 2010-05-15 2010-04-15
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print(df.groupby(['company','end_date'])[['measure']].sum())
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window (for example with .rolling(...).sum()), and then select the end dates that you are interested in. This could be quite memory intensive too.
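A rough sketch of that rolling idea, using table_a and table_b as defined in the question (which are already daily, so no resampling is needed). The names rolled, result and MEASURE_30D are illustrative. Note that a "30D" time window ending at a date covers that date and the 29 days before it, so the window bounds are an assumption you may want to adjust.

```python
import pandas as pd

dates = pd.date_range("2010-01-01", "2010-12-31", freq="D")
table_a = pd.concat([
    pd.DataFrame({"DATE": dates, "COMPANY_ID": 1, "MEASURE": 10}),
    pd.DataFrame({"DATE": dates, "COMPANY_ID": 2, "MEASURE": 10}),
], ignore_index=True)
table_b = pd.DataFrame({
    "COMPANY": [1, 1, 2, 2],
    "END_DATE": pd.to_datetime(["2010-03-01", "2010-06-02",
                                "2010-03-01", "2010-06-02"]),
})

# Trailing 30-day sum per company, computed for every date.
rolled = (table_a.set_index("DATE")
          .groupby("COMPANY_ID")["MEASURE"]
          .rolling("30D").sum()
          .rename("MEASURE_30D")
          .reset_index())

# Keep only the rolling sums that fall on the requested end dates.
result = table_b.merge(rolled, left_on=["COMPANY", "END_DATE"],
                       right_on=["COMPANY_ID", "DATE"], how="left")
print(result[["COMPANY", "END_DATE", "MEASURE_30D"]])
```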
pandas has no built-in conditional (non-equi) join, so one easy and practical way to do one is with an additional library, pandasql.
Install the library pandasql from pip using the command pip install pandasql. This library allows you to manipulate the pandas dataframes using the SQL queries.
import pandas as pd
from pandasql import sqldf
df = pd.read_excel(r'play_data.xlsx')
df
id Name Amount
0 A001 A 100
1 A002 B 110
2 A003 C 120
3 A005 D 150
Now let's just do a conditional join to compare the Amount of the IDs
# Make your pysqldf object:
pysqldf = lambda q: sqldf(q, globals())
# Write your query in SQL syntax, here you can use df as a normal SQL table
cond_join= '''
select
df_left.*,
df_right.*
from df as df_left
join df as df_right
on
df_left.[Amount] > (df_right.[Amount]+10)
'''
# Now, get your queries results as dataframe using the sqldf object that you created
pysqldf(cond_join)
id Name Amount id Name Amount
0 A003 C 120 A001 A 100
1 A005 D 150 A001 A 100
2 A005 D 150 A002 B 110
3 A005 D 150 A003 C 120
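For comparison, the same self-join condition can also be sketched in plain pandas with a cross join followed by a filter. This materialises every pair of rows first, so it only suits small frames. The frame below is rebuilt by hand from the output above rather than read from play_data.xlsx, and pairs and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"id": ["A001", "A002", "A003", "A005"],
                   "Name": ["A", "B", "C", "D"],
                   "Amount": [100, 110, 120, 150]})

# Every left row paired with every right row (requires pandas >= 1.2).
pairs = df.merge(df, how="cross", suffixes=("_left", "_right"))

# Apply the join condition as an ordinary boolean filter.
result = pairs[pairs["Amount_left"] > pairs["Amount_right"] + 10].reset_index(drop=True)
print(result)
```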
I know I am late to the party, but here are two solutions. The first one is rather simple but not very general, while the second one should be more universal. In what follows I assume that the table_a and table_b objects are already defined as in the original question.
Solution 1
This one is simple. Here we just do a left join and append END_DATE values to table_a and then filter out the rows we are not interested in. So the memory overhead here is size of table_a * number of unique END_DATE values per COMPANY in table_b.
table_c = table_a.merge(table_b, left_on="COMPANY_ID", right_on="COMPANY")
table_c[(table_c["DATE"] - table_c["END_DATE"]).dt.days.between(-30, 0)] \
.groupby(["COMPANY", "END_DATE"])["MEASURE"].sum()
## OUTPUT:
COMPANY END_DATE
1 2010-03-01 310
2010-06-02 310
2 2010-03-01 310
2010-06-02 310
Name: MEASURE, dtype: int64
This is quite fast, but could blow up the size of table_a significantly if table_b contained many values.
Solution 2
This one is a bit smarter and operates row-by-row, where to each row in table_b we explicitly map only the relevant subset of table_a. Thus, we get only the data we need, so there is no memory overhead (beyond the memory needed to represent the raw records over which we want to sum).
table_b.groupby(["COMPANY", "END_DATE"]) \
.apply(lambda g: table_a[
(table_a["COMPANY_ID"] == g["COMPANY"].iloc[0]) & \
((table_a["DATE"] - g["END_DATE"].iloc[0]).dt.days.between(-30, 0))
]["MEASURE"].sum())
## OUTPUT:
COMPANY END_DATE
1 2010-03-01 310
2010-06-02 310
2 2010-03-01 310
2010-06-02 310
dtype: int64
Note that in this case for each inequality we use only the relevant subsets of table_a, which will be much more memory efficient. The price is that this solution seems to be about 2-3 times slower (but in general still relatively fast; ~2-3ms runtime on your data).
I am using karl D's data.
conditional_join from pyjanitor offers a way to deal with non-equi joins efficiently:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.conditional_join(
windows, # series or dataframe to join to
# variable arguments
# left column, right column, join operator
('company', 'company', '=='),
('date', 'beg_date', '>='),
('date', 'end_date', '<='),
# for more performance, depending on the data size
# you can turn on use_numba
use_numba = False,
# filter for specific columns, if required
df_columns=['company', 'measure'],
right_columns='end_date')
.groupby(['company', 'end_date'])
.sum()
)
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15