Given the following table, how can I count the days elapsed by ID, without using a for loop (or any loop), since the data is large?
ID Date
a 01/01/2020
a 05/01/2020
a 08/01/2020
a 10/01/2020
b 05/05/2020
b 08/05/2020
b 12/05/2020
c 08/08/2020
c 22/08/2020
to get this result:
ID Date Days Evolved Since Initial Date
a 01/01/2020 1
a 05/01/2020 4
a 08/01/2020 7
a 10/01/2020 9
b 05/05/2020 1
b 08/05/2020 3
b 12/05/2020 7
c 08/08/2020 1
c 22/08/2020 14
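The answers below assume Date is already a datetime column; a minimal sketch of building the sample frame from the table above (note the day-first date format):
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'Date': ['01/01/2020', '05/01/2020', '08/01/2020', '10/01/2020',
             '05/05/2020', '08/05/2020', '12/05/2020', '08/08/2020', '22/08/2020'],
})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates in the question are dd/mm/yyyy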
Use GroupBy.transform with 'first' to broadcast each group's first value, so you can subtract it. Then, if there are no duplicated datetimes within a group, you can replace 0 with 1:
df['new'] = df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
Or set 1 for the first value of each group with Series.where and Series.duplicated:
df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
.dt.days.where(df['ID'].duplicated(), 1))
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
You could do something like this (with df being your DataFrame):
def days_evolved(sdf):
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
    return sdf

df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)
Result for the sample:
ID Date Days_evolved
0 a 2020-01-01 1 days
1 a 2020-01-05 4 days
2 a 2020-01-08 7 days
3 a 2020-01-10 9 days
4 b 2020-05-05 1 days
5 b 2020-05-08 3 days
6 b 2020-05-12 7 days
7 c 2020-08-08 1 days
8 c 2020-08-22 14 days
If you want int instead of pd.Timedelta, then do
df["Days_evolved"] = df["Days_evolved"].dt.days
at the end.
Related
I Have a dataframe as follows:
df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B'],
                   'Date': ['2022-12-03', '2022-12-04', '2022-12-06', '2022-12-08',
                            '2022-12-03', '2022-12-06', '2022-12-10', '2022-12-03',
                            '2022-12-04', '2022-12-07', '2022-12-03', '2022-12-13']})
I need to count the activities for each 'Key' that occur before 'Activity' == 'H' as follows:
Required Output
My Approach
Sort df by Key & Date (the sample input is already sorted)
Drop the rows that occur after the 'H' Activity in each group
Group by: df.groupby(['Key', 'Activity']).count()
Is there a better approach? If not, then help me with the code for dropping the rows that occur after the 'H' Activity in each group.
Thanks in advance!
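The first answer below compares dates, so it assumes Date has been converted to datetime; a minimal sketch (sorting kept for the general case, since the sample is already sorted):
df['Date'] = pd.to_datetime(df['Date'])   # the constructor above stores dates as strings
df = df.sort_values(['Key', 'Date'])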
You can bring the H dates "back" into each previous row to use in a comparison.
First mark each H date in a new column:
df.loc[df["Activity"] == "H" , "End"] = df["Date"]
Key Activity Date End
0 1 A 2022-12-03 NaT
1 1 A 2022-12-04 NaT
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 NaT
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 NaT
8 4 C 2022-12-04 NaT
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
Backward fill the new column for each group:
df["End"] = df.groupby("Key")["End"].bfill()
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 2022-12-06
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
You can then select rows with Date before End:
df.loc[df["Date"] < df["End"]]
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
4 2 B 2022-12-03 2022-12-06
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
To generate the final form, you can use .pivot_table():
(df.loc[df["Date"] < df["End"]]
.pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
.reindex(df["Key"].unique()) # Add in keys with no match e.g. `5`
.fillna(0)
.astype(int))
Activity A B C
Key
1 2 0 0
2 0 1 0
4 1 0 1
5 0 0 0
Try this:
(df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
.set_index('Key')['Activity']
.str.get_dummies()
.groupby(level=0).sum()
.reindex(df['Key'].unique(),fill_value=0)
.reset_index())
Output:
Key A B C
0 1 2 0 0
1 2 0 1 0
2 4 1 0 1
3 5 0 0 0
You can try:
# sort by Key and Date
df.sort_values(['Key', 'Date'], inplace=True)
# this is to keep Key in the result when no values are kept after the filter
df.Key = df.Key.astype('category')
# filter all rows after the 1st H for each Key and then pivot
df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
index='Key', columns='Activity', aggfunc='size'
).reset_index()
#Activity Key A B C
#0 1 2 0 0
#1 2 0 1 0
#2 4 1 0 1
#3 5 0 0 0
I have the following dataframe:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-10 1
1 2 2 2022-02-02 2022-02-05 2
2 1 2 2022-01-11 2022-01-15 3
3 2 2 2022-02-06 2022-02-10 4
4 2 2 2022-02-11 2022-02-15 5
5 2 3 2022-01-14 2022-01-17 6
6 2 3 2022-01-19 2022-01-22 7
There are several records that follow one after the other. For example, the rows with id 1 and id 3: row 3 has the same A and B values and starts on the day after row 1 ends. I want to compress this dataframe into the following form:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4
That is, I keep a single record whenever the difference between the start_date of the next record and the end_date of the previous one is 1 day; in that case, end_date is set to the end_date of the last record in such a sequence.
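For reference, a minimal sketch of building the sample frame shown above (dates as strings; the first answer below converts them itself):
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 1, 2, 2, 2, 2],
    'B': [2, 2, 2, 2, 2, 3, 3],
    'start_date': ['2022-01-01', '2022-02-02', '2022-01-11', '2022-02-06',
                   '2022-02-11', '2022-01-14', '2022-01-19'],
    'end_date': ['2022-01-10', '2022-02-05', '2022-01-15', '2022-02-10',
                 '2022-02-15', '2022-01-17', '2022-01-22'],
    'id': [1, 2, 3, 4, 5, 6, 7],
})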
You can use a custom grouper to join the successive dates per group:
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
m = (df['start_date']
     .sub(df.groupby(['A', 'B'])['end_date'].shift().add(pd.Timedelta('1d')))
     .ne(pd.Timedelta(0))
     .groupby([df['A'], df['B']]).cumsum()
     )
out = (df
       .groupby(['A', 'B', m], as_index=False)
       .agg({'start_date': 'first', 'end_date': 'last'})
       .assign(id=lambda d: range(1, len(d) + 1))
       )
Output:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4
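To see what the custom grouper does, you can inspect m: within each (A, B) group it increments whenever a row does not start exactly one day after the previous row's end_date. On the sample data it should come out as follows (values derived by hand, so treat this as a sketch):
print(m.tolist())
# expected: [1, 1, 1, 1, 1, 1, 2]
# only the last (A=2, B=3) row opens a new run, because 2022-01-19 is two days after 2022-01-17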
def function1(dd: pd.DataFrame):
    col1 = dd.start_date - dd.end_date.shift()
    dd1 = dd.assign(col1=col1.ne("1 days").cumsum())
    return dd1.groupby("col1").agg(start_date=("start_date", "min"),
                                   end_date=("end_date", "max"))

df1.groupby(["A", "B"]).apply(function1).reset_index().assign(id=lambda dd: dd.index + 1)
Output:
A B col1 start_date end_date id
0 1 2 1 2022-01-01 2022-01-15 1
1 2 2 1 2022-02-02 2022-02-15 2
2 2 3 1 2022-01-14 2022-01-17 3
3 2 3 2 2022-01-19 2022-01-22 4
I have the following data-frame:
ID date X
0 A 2021-12-15 7
1 A 2022-01-30 6
2 A 2022-02-15 2
3 B 2022-01-30 2
4 B 2022-02-15 2
5 B 2022-02-18 7
6 C 2021-12-01 7
7 C 2021-12-15 4
8 C 2022-01-30 2
9 C 2022-02-15 7
10 D 2021-12-16 5
11 D 2022-01-30 4
12 D 2022-03-15 9
I want to keep the observations for those IDs who first showed up in week, say, 51 of the year (I would like to change this parameter in the future).
For example, IDs A and D showed up first in week 51 in the data, B didn't, C showed up in week 51, but not for the first time.
So I want to keep in this example only the data pertaining to A and D.
Filter rows whose week matches the week variable and which are the first occurrence of their ID (using Series.duplicated), then get the ID values:
week = 50
df['date'] = pd.to_datetime(df['date'])
s = df.loc[df['date'].dt.isocalendar().week.eq(week) & ~df['ID'].duplicated(), 'ID']
Or:
df1 = df.drop_duplicates(['ID'])
s = df1.loc[df1['date'].dt.isocalendar().week.eq(week), 'ID']
print (s)
0 A
10 D
Name: ID, dtype: object
Finally, filter by ID with Series.isin and boolean indexing:
df = df[df['ID'].isin(s)]
print (df)
ID date X
0 A 2021-12-15 7
1 A 2022-01-30 6
2 A 2022-02-15 2
10 D 2021-12-16 5
11 D 2022-01-30 4
12 D 2022-03-15 9
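Since the week is meant to be a parameter that will change later (as the question mentions), the filter can be wrapped in a small helper; a sketch using the drop_duplicates variant above (the function name is made up here):
def ids_first_seen_in_week(df, week):
    first = df.drop_duplicates('ID')   # first row per ID (data sorted by date within each ID)
    return first.loc[first['date'].dt.isocalendar().week.eq(week), 'ID']

out = df[df['ID'].isin(ids_first_seen_in_week(df, week=50))]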
df = pd.DataFrame([[11,'b',10,'2020-01-05'],
[11,'c',4,'2020-01-02'],
[11,'a',6,'2020-01-01'],
[22,'c',2,'2020-01-13'],
[22,'a',8,'2020-01-05'],
[33,'b',2,'2020-01-09'],
[33,'d',6,'2020-01-05'],
[33,'a',8,'2020-01-01']], columns=['user','lecture','not','date'])
The output will then be:
user lecture not date
0 11 b 10 2020-01-05
1 11 c 4 2020-01-02
2 11 a 6 2020-01-01
3 22 c 2 2020-01-13
4 22 a 8 2020-01-05
5 33 b 2 2020-01-09
6 33 d 6 2020-01-05
7 33 a 8 2020-01-01
I want to get the average of not for each user, but it should be the running average over the current date and all earlier dates for that user.
The result should look like this:
user lecture not date avg
0 11 b 10 2020-01-05 6.666667 ((10+4+6)/3)
1 11 c 4 2020-01-02 5 ((4+6)/2)
2 11 a 6 2020-01-01 6
3 22 c 2 2020-01-13 5 ((2+8)/2)
4 22 a 8 2020-01-05 8
5 33 b 2 2020-01-09 5.33334 ((2+6+8)/3)
6 33 d 6 2020-01-05 7 ((6+8)/2)
7 33 a 8 2020-01-01 8
I'm trying some lambda code, but I couldn't reach the result:
grouped = df.sort_values(['user'], ascending=False).groupby('user',as_index = False).apply(lambda x: x.reset_index(drop = True))
grouped['count'] = grouped.groupby('user').note.transform(lambda x:((x.count()-1)))
grouped['mean'] = grouped.groupby('user').note.transform(lambda x:(x.shift(1).sum()/len(x)))
Try a reversed expanding mean:
df['avg'] = (
df.groupby('user')['not']
.apply(lambda g: g[::-1].expanding().mean())
.droplevel(0)
)
Or
df['avg'] = (
df.loc[::-1, 'not'].groupby(df['user']).expanding().mean().droplevel(0)
)
df:
user lecture not date avg
0 11 b 10 2020-01-05 6.666667
1 11 c 4 2020-01-02 5.000000
2 11 a 6 2020-01-01 6.000000
3 22 c 2 2020-01-13 5.000000
4 22 a 8 2020-01-05 8.000000
5 33 b 2 2020-01-09 5.333333
6 33 d 6 2020-01-05 7.000000
7 33 a 8 2020-01-01 8.000000
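As a quick sanity check against the arithmetic shown in the question, the first row of each user should equal the plain mean of all of that user's values:
print(df.loc[df['user'].eq(11), 'not'].mean())   # 6.666..., matching avg in row 0 above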
I have used a for loop to accomplish the requirement. df.loc[row, col] addresses each cell by its row and column location for filtering and manipulation.
df['avg'] = ''  # initialize an empty column
for i in range(len(df)):
    temp = df.loc[i:, 'not'][df.loc[i:, 'user'] == df.loc[i, 'user']]
    df.loc[i, 'avg'] = sum(temp) / len(temp)
Output df
My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91: In this image, an extract of the first 10 rows is presented. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The length of the objects is 2.
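If you want the hour and minute columns as integers rather than two-character strings (as described in the comment above), they can be converted up front; a minimal sketch, assuming the column names from the question:
cols = ['hour_scheduled_departure', 'minute_scheduled_departure']
df[cols] = df[cols].apply(pd.to_numeric)   # '05' -> 5, '07' -> 7, ...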
Another solution:
pd.concat([df.A, pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(), name='Date').map(lambda x: '0'.join(map(str, x))))], axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time):
df.month = df.month.map("{:02}".format)
df.day = df.day.map("{:02}".format)
pd.concat([df.A, pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(), name='Date').map(lambda x: ''.join(map(str, x))))], axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns so that pandas.to_datetime receives the expected names year, month, day, hour, minute:
df = pd.DataFrame({
    'A': list('abcdef'),
    'year': [2002, 2002, 2002, 2002, 2002, 2002],
    'month': [7, 8, 9, 4, 2, 3],
    'day': [1, 3, 5, 7, 1, 5],
    'hour_scheduled_departure': [5, 3, 6, 9, 2, 4],
    'minute_scheduled_departure': [7, 8, 9, 4, 2, 3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3