The code below groups the dataframe by the id column and prints each group.
import pandas as pd

df = pd.DataFrame(data, columns=['id', 'date', 'cnt'])  # data: the records shown below
df['date'] = pd.to_datetime(df['date'])
for c_id, group in df.groupby('id'):
    print(c_id)
    print(group)
This produces a result like this:
id date cnt
1 2019-01-02 1
1 2019-01-03 2
1 2019-01-04 3
1 2019-01-05 1
1 2019-01-06 2
1 2019-01-07 1
id date cnt
2 2019-01-01 478964
2 2019-01-02 749249
2 2019-01-03 1144842
2 2019-01-04 1540846
2 2019-01-05 1444918
2 2019-01-06 1624770
2 2019-01-07 2227589
id date cnt
3 2019-01-01 41776
3 2019-01-02 82322
3 2019-01-03 93467
3 2019-01-04 56674
3 2019-01-05 47606
3 2019-01-06 41448
3 2019-01-07 145827
id date cnt
4 2019-01-01 41776
4 2019-01-02 82322
4 2019-01-06 93467
4 2019-01-07 56674
From this result, I want to find the maximum number of consecutive days for each id. So id 1 would be 6, id 2 would be 7, id 3 would be 7, and id 4 would be 2.
Use:
m = (df.assign(date=pd.to_datetime(df['date']))  # convert if necessary, else drop this line
       .groupby('id')['date']
       .diff()
       .gt(pd.Timedelta('1D'))
       .cumsum())
df.groupby(['id', m]).size().groupby(level='id').max()  # older pandas allowed .max(level='id'), now removed
Output
id
1 6
2 7
3 7
4 2
dtype: int64
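To see why this works, you can inspect the grouping key m for one id (a quick sketch, run against the sample data above): rows within one unbroken run of consecutive dates share a label, and the label changes wherever the day-to-day gap exceeds one day.
print(df.assign(run=m).loc[df['id'] == 4, ['id', 'date', 'run']])
# Expected for the sample (index omitted; the exact label values depend
# on earlier rows, only the changes between them matter):
#  id       date  run
#   4 2019-01-01    0
#   4 2019-01-02    0
#   4 2019-01-06    1   <- the 4-day gap starts a new run
#   4 2019-01-07    1
The final groupby then counts the rows per run, and the max per id is the longest run.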
To get your result, run:
result = df.groupby('id').apply(
    lambda grp: grp.groupby(
        (grp.date.shift() + pd.Timedelta(1, 'd') != grp.date).cumsum())
    .id.count().max())
Details:
df.groupby('id') - First-level grouping (by id).
grp.groupby(...) - Second-level grouping (by sequences
of consecutive dates).
grp.date.shift() - Date from the previous row.
+ pd.Timedelta(1, 'd') - Shifted by 1 day.
!= grp.date - Not equal to the current date. The result
is a Series with True at the start of each sequence of
consecutive dates.
cumsum() - Converts the above (bool) Series to a Series
of int - consecutive numbers of the above sequences,
starting from 1.
id - Takes the id column from each (second-level) group.
count() - Computes the size of the current group.
.max() - Takes the max of the sizes of the second-level
groups (within the current first-level group).
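As a quick check, printing result for the sample data should reproduce the expected counts:
print(result)
# id
# 1    6
# 2    7
# 3    7
# 4    2
# dtype: int64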
In one data frame (called X) I have the Patient_admitted_id, Date, and Hospital_ID of patients who tested positive for covid (I show this data frame below). I want to generate a separate data frame (called Y) with calendar dates, the total number of covid cases per day, and the cumulative cases.
I don't know how to generate the Cases column.
X data frame:
data = {'Patient_admitted_id': [214321, 224323, 3234234, 23423],
        'Date': ['2021-01-22', '2021-01-22', '2021-01-22', '2021-01-20'],  # example only; the real X contains proper datetime values
        'Hospital_ID': ['1', '2', '3', '2'],
}
X = pd.DataFrame(data, columns=['Patient_admitted_id', 'Date', 'Hospital_ID'])
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desired Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...
Use DataFrame.resample by day, count with Resampler.size, and add Series.cumsum for the cumulative counts:
X['Date'] = pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print(df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
You can use groupby on the Date column and call size to get the count for each date, then call cumsum on Cases to get the desired output (see the note after the output below about missing calendar days):
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
The out variable holds the desired dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
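Note that grouping the raw Date strings skips calendar days with zero cases (2021-01-21 is absent above). If you need the full calendar, one option is to reindex against a date_range and recompute the cumulative sum (a sketch, assuming the Date values parse cleanly):
out['Date'] = pd.to_datetime(out['Date'])
full = pd.date_range(out['Date'].min(), out['Date'].max(), freq='D')
out = (out.set_index('Date')
          .reindex(full, fill_value=0)  # insert the zero-case days
          .rename_axis('Date')
          .reset_index())
out['Cumulative'] = out['Cases'].cumsum()  # recompute over the full calendar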
Just adding a solution with pd.Grouper:
X['Date'] = pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
I have a DataFrame which contains data from last year, but the date column has some missing dates:
date
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
4 2019-11-05
I want to create a dictionary of gaps between dates, where each key is a gap's start date and its value is the gap's end date, something like:
dates_gaps = {'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04', ...}
so I created a column to indicate whether a gap exists with the following:
df['missing_dates'] = df['date'].diff().dt.days > 1
which outputs the following:
# True marks a row that comes after a gap
0 2019-10-21 False
1 2019-10-29 True
2 2019-11-01 True
3 2019-11-04 True
4 2019-11-05 False
and I'm having trouble going forward from here.
You can add a condition to also keep the row where diff is missing (the first row), convert the date column to strings with Series.dt.strftime, and finally build the dictionary with zip:
diff = df['date'].diff()
s = df.loc[(diff.dt.days > 1) | diff.isna(), 'date'].dt.strftime('%Y-%m-%d')
print(s)
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
Name: date, dtype: object
d = dict(zip(s, s.shift(-1)[:-1]))
print(d)
{'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}
Just convert the dates to datetime and find the difference between two adjacent dates:
a = pd.to_datetime('1900-01-01', format='%Y-%m-%d')
b = pd.to_datetime('1900-02-01', format='%Y-%m-%d')
c = a - b
c.days  # -31
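Applied to the question's dataframe, the same idea yields the gaps dictionary directly (a sketch, assuming df['date'] is already datetime):
nxt = df['date'].shift(-1)             # date from the next row
gap = (nxt - df['date']).dt.days > 1   # True where a gap starts
d = dict(zip(df.loc[gap, 'date'].dt.strftime('%Y-%m-%d'),
             nxt[gap].dt.strftime('%Y-%m-%d')))
print(d)
# {'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}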
Motivation: I want to check whether customers have bought anything within 2 months of their first purchase (retention).
Resources: I have 2 tables:
Buy date, ID, and purchase code
ID and the date of the first purchase
Sample data:
Table1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
First convert the columns to datetimes if necessary, then add the first-purchase dates with DataFrame.merge, and create the new columns by comparing with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
.le(df['Date'])
.astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
.gt(df['Date'])
.astype(int))
Finally, aggregate with GroupBy.agg, using max for Retention (if you need only a 0/1 output) and sum to count the first-month purchases:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
.agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print(df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4
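For reference, here is roughly what the merged frame looks like for the sample data before the aggregation: each row carries a 0/1 flag from the two comparisons, and the groupby then collapses them per ID.
print(df)
#         Date  ID Purchase_code First_Buy_Date  Retention  Frequency_buy_at_first_month
# 0 2019-01-01   1          AQT1     2019-01-01          0                             1
# 1 2019-01-02   1          TRR1     2019-01-01          0                             1
# 2 2019-03-01   1          QTD1     2019-01-01          1                             0
# 3 2019-02-01   2          IGJ5     2019-02-01          0                             1
# 4 2019-02-05   2          ILW2     2019-02-01          0                             1
# 5 2019-02-20   2          WET2     2019-02-01          0                             1
# 6 2019-02-28   2          POY6     2019-02-01          0                             1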
I have yet another Python question. This one could probably be done with the help of a loop; however, I was looking for a leaner solution.
Suppose that I have a data frame like the one built in the Setup below:
I am looking for code to generate the column ID, which is nothing more than a descending counter for when the value in column Sold changes; i.e., for each Salesman I would like the ID column to give the number of days left until the Sold value changes.
For example, on date 01/01/2018, salesman Joe would have ID = 2 because the Sold value changes in 2 days.
Any ideas on how to solve this one?
Many thanks.
J
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    pd.Series(pd.date_range('1/1/2018', '1/7/2018').append(pd.date_range('1/1/2018', '1/7/2018'))),
    pd.Series(['Joe'] * 7 + ['Helen'] * 7),
    pd.Series([1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]),
]).T
df.columns = ['date', 'salesman', 'sold']
df['date'] = pd.to_datetime(df['date'])
df['sold'] = df['sold'].astype(int)  # the transpose leaves object dtype; expanding() needs numeric
Computation:
df['changes'] = (df.groupby('salesman')['sold']
                   .expanding()
                   .apply(lambda x: (np.diff(x) != 0).sum())
                   .reset_index(drop=True))
df['id'] = (df.groupby(['salesman', 'changes'])
              .apply(lambda grp: pd.Series(len(grp) - grp.sort_values('date').reset_index().index))
              .reset_index(drop=True))
df.drop('changes', axis=1, inplace=True)
Results:
>>> df
date salesman sold id
0 2018-01-01 Joe 1 2
1 2018-01-02 Joe 1 1
2 2018-01-03 Joe 0 4
3 2018-01-04 Joe 0 3
4 2018-01-05 Joe 0 2
5 2018-01-06 Joe 0 1
6 2018-01-07 Joe 1 1
7 2018-01-01 Helen 0 1
8 2018-01-02 Helen 1 2
9 2018-01-03 Helen 1 1
10 2018-01-04 Helen 0 1
11 2018-01-05 Helen 1 1
12 2018-01-06 Helen 0 2
13 2018-01-07 Helen 0 1
Explanation:
Create a 'changes' column that increments every time an individual salesperson's 'sold' field changes. Then, for each increment group (still grouped by salesperson), take the length of the group (which equals how many consecutive rows share that value) and subtract from it the index of each row, sorted by date. The result of that subtraction is a series that descends from the length of the group to 1. Reset the index and merge back into your original dataframe. It's a somewhat confusing solution, but it works.
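If you prefer something leaner, the same result can be obtained without the expanding/apply plumbing, by labeling the runs with a shift-compare-cumsum and counting down with GroupBy.cumcount (a sketch of an alternative approach, assuming rows are sorted by date within each salesman, as in the Setup):
# Label each unbroken run of equal 'sold' values within a salesman.
runs = df.groupby('salesman')['sold'].transform(lambda s: s.ne(s.shift()).cumsum())
# Count down from the run length to 1 within each run.
df['id'] = df.groupby(['salesman', runs]).cumcount(ascending=False) + 1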
I have a dataframe (df) whose head looks like:
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
Is there a way to remove the non-date values (but not the rows)? With this example the resulting frame would look like:
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
All the examples I can see want me to drop the rows and I'd like to keep them.
Use pd.to_datetime with errors='coerce':
df.assign(Date=pd.to_datetime(df.Date, errors='coerce'))
Date
0 2015-01-04
1 1996-01-09
2 NaT
3 1992-12-05
4 NaT
You can fill those NaT with empty strings if you'd like (though I don't recommend it):
df.assign(Date=pd.to_datetime(df.Date, errors='coerce').fillna(''))
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4
If you want to preserve the original values in your dataframe and simply replace the ones that don't parse as dates with '':
df.assign(Date=df.Date.mask(pd.to_datetime(df.Date, errors='coerce').isna(), ''))
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
One more simple way:
>>> df
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce').fillna('')
>>> df
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4