I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
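For reproducibility, a minimal sketch that builds the sample frame above (assuming the date strings are day-first):
import pandas as pd

df = pd.DataFrame({'Date': ['01/01/2016', '05/01/2016', '10/05/2017',
                            '05/05/2014', '07/09/2014', '12/08/2017'],
                   'ID': list('aaabbb')})
# parse the day-first strings into datetimes
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)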
I think you need between with boolean indexing to filter first, and then groupby with aggregation by size.
The outputs are concatenated, and reindex adds the missing IDs filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
#filter each window, then count rows per ID
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
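If you also want these counts back on the original frame as two columns, as asked, a small follow-up sketch (the column names last_year and next_year are my own):
df1.columns = ['last_year', 'next_year']   # name the two count columns
df = df.join(df1, on='ID')                 # map the per-ID counts back onto each row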
EDIT:
If you need to compare each date against the group's last date minus or plus a one-year offset, you need a custom function per group with the condition and a sum of the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
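If you prefer one row per id with one column per year, you can unstack that result; a small sketch building on the output above (fill_value=0 fills years in which an id has no entries):
# one row per id, one column per year; missing combinations become 0
counts = x.groupby(['date', 'id']).size().unstack('date', fill_value=0)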
Using resample takes care of missing in-between years; see the zero-filled year 2015 in the output.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the column labels
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
Related
I have a dataFrame with a date column, and sometimes the date might appear twice.
When I write to a certain date, I would like to write to the last row that has this date, not the first.
Right now I use:
df.loc[df['date'] == date, columnA] = value
Which in the case of a df like this will write at index 1, not 2:
date columnA
0 17.4.2022
1 17.5.2022
2 17.5.2022 value #in the case of 17.5 write the data to this row.
3 17.6.2022
How can I make sure I always write to the last row with that date (and, if there is only one matching row, write to that one)?
You can chain a mask that keeps only the last duplicated date, built with Series.duplicated:
print (df)
date columnA
0 17.4.2022 8
1 17.5.2022 1
2 17.5.2022 1
2 17.5.2022 1
3 17.6.2022 3
date = '17.5.2022'
df.loc[(df['date'] == date) & ~df['date'].duplicated(keep='last'), 'columnA'] = 100
print (df)
date columnA
0 17.4.2022 8
1 17.5.2022 1
2 17.5.2022 1
2 17.5.2022 100
3 17.6.2022 3
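An alternative sketch (my addition, not part of the answer above) that writes by position, which also works when index labels repeat:
import numpy as np

# position of the last row whose date equals the target
pos = np.flatnonzero(df['date'].eq(date))[-1]
# write only to that single row
df.iloc[pos, df.columns.get_loc('columnA')] = 100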
I have a table like this...
Date       PlayerId  Goals
June 1     A         1
June 14    A         1
June 15    B         2
June 28    A         1
July 6th   B         0
July 17th  A         1
I would like to calculate the number of goals a player scored in the 30 days prior (NOT the prior 30 games). The final result should look like...
Date       PlayerId  Goals  Goals_Prev_30
June 1     A         1      0
June 14    A         1      1
June 15    B         2      0
June 28    A         1      2
July 6th   B         0      2
July 17th  A         1      1
I created a for loop that takes each row of the dataframe, filters the dataframe to the same PlayerId within the 30 days before that row's date, sums the goals in the filtered dataframe, and appends the total to a list, which is finally assigned to the Goals_Prev_30 column. The code looks like...
goals_prev_30 = []
for i in range(len(df)):
    row = df.iloc[i]
    # same player, dates in the 30 days strictly before this row's date
    filtered_df = df[(df['Date'] < row['Date'])
                     & (df['Date'] >= row['Date'] - pd.to_timedelta(30, unit='d'))
                     & (df['PlayerId'] == row['PlayerId'])]
    total = filtered_df['Goals'].sum()
    goals_prev_30.append(total)
df['Goals_Prev_30'] = goals_prev_30
This solution works, but it's slow: it processes around 30 rows a second. That is not viable, as I have multiple similar measures and over 1.2M rows, which means roughly 11 hours per measure.
How can this problem be solved in a more efficient manner?
I changed your solution to a custom function per group: a mask is created by broadcasting, and the values of the Goals column are summed per group wherever the mask is True:
import numpy as np

#if necessary
#df['Date'] = pd.to_datetime(df['Date'], format='%B %d')
def f(x):
d1 = x['Date']
d2 = d1 - pd.to_timedelta(30,unit='d')
a1 = d1.to_numpy()
a2 = d2.to_numpy()
m = (a1 < a1[:, None]) & (a1 >=a2[:, None])
x['Goals_Prev_30'] = np.where(m, x['Goals'], 0).sum(axis=1)
return x
df = df.groupby('PlayerId').apply(f)
print (df)
Date PlayerId Goals Goals_Prev_30
0 1900-06-01 A 1 0
1 1900-06-14 A 1 1
2 1900-06-15 B 2 0
3 1900-06-28 A 1 2
4 1900-07-06 B 0 2
5 1900-07-17 A 1 1
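A possible alternative to the broadcasting answer (my own sketch, not the answer's method): if Date is a real datetime column and sorting by player and date is acceptable, a time-offset rolling window with closed='left' sums the previous 30 days while excluding the current row:
df = df.sort_values(['PlayerId', 'Date'])
roll = (df.set_index('Date')
          .groupby('PlayerId')['Goals']
          .rolling('30D', closed='left')   # window is [date - 30 days, date)
          .sum())
# group-by-group order matches the sorted frame, so values align row for row
df['Goals_Prev_30'] = roll.fillna(0).to_numpy().astype(int)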
Motivation: I want to check whether customers have bought anything within 2 months of their first purchase (retention).
Resources: I have 2 tables:
Buy date, ID and purchase code
ID and the first day they bought
Sample data:
Table1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
First convert the columns to datetimes if necessary, then attach the first-buy dates with DataFrame.merge and create the new columns by comparing with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
.le(df['Date'])
.astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
.gt(df['Date'])
.astype(int))
Last, aggregate with GroupBy.agg, using max (if you only need a 0 or 1 output) and sum to count values:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
.agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4
There is a dataframe like the following:
id year number
1 2016 3
1 2017 5
2 2016 1
2 2017 5
...
I want to group by id and extract the rows where the value of the number column is at least 3 in both 2016 and 2017.
For example, for the first 4 rows above, the result is:
id year number
1 2016 3
1 2017 5
Thanks!
Compare with >= 3 and use GroupBy.transform to get a Series the same size as the original, so you can filter with boolean indexing:
df1 = df[(df["number"] >= 3).groupby(df["id"]).transform('all')]
#alternative for reassign mask to column
#df = df[df.assign(number= df["number"] >= 3).groupby("id")['number'].transform('all')]
print (df1)
id year number
0 1 2016 3
1 1 2017 5
Or use filter, but it can be slow for a large DataFrame or many groups:
df1 = df.groupby("id").filter(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years = df.groupby("id").apply(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years
id
1 True
2 False
dtype: bool
>>> df.loc[lambda x: x["id"].map(great_in_both_years)]
id year number
0 1 2016 3
1 1 2017 5
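One caveat (my own note, not from either answer): transform('all') also keeps an id that only has rows for one of the two years. If "both 2016 and 2017" should additionally require that both years are present, a hedged sketch:
ok = (df.assign(ge3=df['number'] >= 3)
        .groupby('id')
        .agg(all_ge3=('ge3', 'all'),        # every row for the id has number >= 3
             n_years=('year', 'nunique')))  # number of distinct years per id
keep = ok.index[ok['all_ge3'] & ok['n_years'].eq(2)]
df1 = df[df['id'].isin(keep)]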
I have a list of dates and letters. I have to find the count of each letter occurring within a week. I'm trying to group by letter and resample by '1w', but I get a weird data frame that contains a MultiIndex. How can I do all this and get a DataFrame with just three columns containing the score, the resampled date and the count?
PS: What I'm looking for is, for each week, the count of occurrences of every letter in that week, something like this:
datetime alphabet count
2016-12-27 22:57:45.407246 a 1
2016-12-30 22:57:45.407246 a 2
2017-01-02 22:57:45.407246 a 0
2016-12-27 22:57:45.407246 b 0
2016-12-30 22:57:45.407246 b 1
2017-01-02 22:57:45.407246 b 4
2016-12-27 22:57:45.407246 c 7
2016-12-30 22:57:45.407246 c 0
2017-01-02 22:57:45.407246 c 0
Here is the code
import random
import pandas as pd
import datetime
def randchar(a, b):
    return chr(random.randint(ord(a), ord(b)))
# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list variable that creates 365 days of rows of datetime values
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
score_list =[randchar('a', 'h') for i in range(365)]
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column into a datetime datatype
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column from the numeric score variable
df['score'] = score_list
df_s = tt = df.groupby('score').resample('1w').count()
You can apply a groupby + count with 2 grouping keys:
- pd.Grouper with a frequency of one week
- the score column
Finally, unstack the result.
df = df.groupby([pd.Grouper(freq='1w'), 'score']).count().unstack(fill_value=0)
df.head()
datetime
score a b c d e f g h
datetime
2016-12-25 0 0 1 1 0 1 0 1
2017-01-01 1 0 0 1 3 0 2 0
2017-01-08 0 3 1 1 1 0 0 1
2017-01-15 1 2 0 2 0 0 1 1
2017-01-22 0 1 2 1 1 2 0 0
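If you want the long three-column shape from the question (week, score, count) instead of the wide table, you can stack the result back; a small sketch building on the output above (the 'count' column name is my own):
long_df = (df.stack()                               # back to one row per (week, score)
             .rename(columns={'datetime': 'count'}) # the counted column was 'datetime'
             .reset_index())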