How do I group by date with Pandas?

I made a game and collected the players' data, which looks like this:
StartTime                   Id   Rank  Score
2018-04-24 08:46:35.684000  aaa  1     280
2018-04-24 23:54:47.742000  bbb  2     176
2018-04-25 15:28:36.050000  ccc  1     223
2018-04-25 00:13:00.120000  aaa  4     79
2018-04-26 04:59:36.464000  ddd  1     346
2018-04-26 06:01:17.728000  fff  2     157
2018-04-27 04:57:37.701000  ggg  4     78
but I want to group it by day, like this:
Date  2018/4/24  2018/4/25  2018/4/26  2018/4/27
ID    aaa        ccc        ddd        ggg
      bbb        aaa        fff        NaN
How do I group by date with Pandas?

Use set_index and cumcount:
df.set_index([df['StartTime'].dt.floor('D'),
              df.groupby(df['StartTime'].dt.floor('D')).cumcount()])['Id'].unstack(0)
Output:
StartTime 2018-04-24 2018-04-25 2018-04-26 2018-04-27
0                aaa        ccc        ddd        ggg
1                bbb        aaa        fff        NaN
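Note that this assumes StartTime is already a datetime column. If it is stored as strings, convert it first; a minimal sketch, assuming df is the question's dataframe:

import pandas as pd

# Parse the StartTime strings into datetimes so .dt.floor('D') works.
df['StartTime'] = pd.to_datetime(df['StartTime'])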

You can use cumcount to align index by group, then concat to concatenate series.
# normalize to zero out time
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.normalize()
# get unique days and make index count by group
cols = df['StartTime'].unique()
df.index = df.groupby('StartTime').cumcount()
# concatenate list comprehension of series
res = pd.concat([df.loc[df['StartTime'] == i, 'Id'] for i in cols], axis=1)
res.columns = cols
print(res)
  2018-04-24 2018-04-25 2018-04-26 2018-04-27
0        aaa        ccc        ddd        ggg
1        bbb        aaa        fff        NaN
Performance
For smaller dataframes, use @ScottBoston's more succinct solution. For larger dataframes, concat seems to scale better than unstack:
def scott(df):
    df['StartTime'] = pd.to_datetime(df['StartTime'])
    return df.set_index([df['StartTime'].dt.floor('D'),
                         df.groupby(df['StartTime'].dt.floor('D')).cumcount()])['Id'].unstack(0)

def jpp(df):
    df['StartTime'] = pd.to_datetime(df['StartTime']).dt.normalize()
    cols = df['StartTime'].unique()
    df.index = df.groupby('StartTime').cumcount()
    res = pd.concat([df.loc[df['StartTime'] == i, 'Id'] for i in cols], axis=1)
    res.columns = cols
    return res
df2 = pd.concat([df]*100000)
%timeit scott(df2) # 1 loop, best of 3: 681 ms per loop
%timeit jpp(df2) # 1 loop, best of 3: 271 ms per loop

import pandas as pd
df = pd.DataFrame({'StartTime': ['2018-04-01 15:25:11', '2018-04-04 16:25:11', '2018-04-04 15:27:11'], 'Score': [10, 20, 30]})
print(df)
This yields:
   Score            StartTime
0     10  2018-04-01 15:25:11
1     20  2018-04-04 16:25:11
2     30  2018-04-04 15:27:11
Now we create a new column based on the StartTime column, which contains only the date:
df['Date'] = df['StartTime'].apply(lambda x: x.split(' ')[0])
print(df)
Output:
   Score            StartTime        Date
0     10  2018-04-01 15:25:11  2018-04-01
1     20  2018-04-04 16:25:11  2018-04-04
2     30  2018-04-04 15:27:11  2018-04-04
We can now use the pd.DataFrame.groupby method to group the rows by the values of the new Date column. In the example below, I first group the rows and then iterate over the groups to print the name (the value of the Date column for that group) and the mean score achieved:
for name, group in df.groupby('Date'):
    print(name)
    print(group)
    print(group['Score'].mean())
Gives:
2018-04-01
   Score            StartTime        Date
0     10  2018-04-01 15:25:11  2018-04-01
10.0
2018-04-04
   Score            StartTime        Date
1     20  2018-04-04 16:25:11  2018-04-04
2     30  2018-04-04 15:27:11  2018-04-04
25.0
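As an aside, if you prefer working with real datetimes instead of splitting strings, the same Date column could be built with pd.to_datetime. A minimal sketch of that alternative:

import pandas as pd

df = pd.DataFrame({'StartTime': ['2018-04-01 15:25:11', '2018-04-04 16:25:11', '2018-04-04 15:27:11'],
                   'Score': [10, 20, 30]})
# Parse the strings into datetimes, then keep only the calendar date.
df['Date'] = pd.to_datetime(df['StartTime']).dt.date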
Edit: Since you initially did not provide the dataframe data in table format, I leave it as an exercise to you to adapt the dataframe in my answer ;-)

Related

Combine consecutive rows of unsorted dates (one day before, after, or on the same day) into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-15         1         1
1  A 2020-01-16 2020-01-30         1         1
2  A 2020-01-31 2020-02-15         0         1
3  A 2020-07-01 2020-07-15         0         1
4  B 2020-01-31 2020-02-15         0         0
5  B 2020-02-16        NaT         0         0
And the expected result is:
  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-30         1         1
1  A 2020-01-31 2020-02-15         0         1
2  A 2020-07-01 2020-07-15         0         1
3  B 2020-01-31        NaT         0         0
I have been trying other posts answers but they don't really match with my use case.
Thanks in advance!
You can approach it as follows:
Get the day difference between consecutive entries within the same group by subtracting the previous End from the current Start, using GroupBy.shift().
Set a group number group_no such that a new group number is issued when the day difference from the previous entry within the group is greater than 1.
Then, group by Id and group_no and aggregate the Start and End dates for each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within the group, we use x.iloc[-1] instead of 'last'.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])

day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-30         1         1
1  A 2020-01-31 2020-02-15         0         1
2  A 2020-07-01 2020-07-15         0         1
3  B 2020-01-31        NaT         0         0
Extract the month from both date columns:
df['sMonth'] = df['Start'].apply(pd.to_datetime).dt.month
df['eMonth'] = df['End'].apply(pd.to_datetime).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
df.groupby(['Id','Feature1','Feature2','sMonth','eMonth']).agg({'Start':'min','End':'max'}).reset_index().drop(['sMonth','eMonth'],axis=1)
Result
  Id  Feature1  Feature2      Start        End
0  A         0         1 2020-01-31 2020-02-15
1  A         0         1 2020-07-01 2020-07-15
2  A         1         1 2020-01-01 2020-01-30
3  B         0         0 2020-01-31 2020-02-15

Loop by Quantity from two dataframes

I have two dataframes: one shows buys and the other shows sells. I need to pull the sale date for each buy lot. Sometimes a buy is sold across different sale lots, and I need to be able to split the shares for that (or, if that is not possible, just pull the sell date without splitting). This is what I have:
df1 = pd.DataFrame({'ID': ['AAA', 'AAA', 'AAA', 'BBB', 'CCC'],
                    'Buydate': ['2017-04-13', '2019-12-31', '2019-03-05', '2018-11-04', '2019-12-31'],
                    'Quantity': [100.00, 2000.00, 385.95, 214514.00, 63205.00]})

df2 = pd.DataFrame({'ID': ['AAA', 'AAA', 'BBB'],
                    'Selldate': ['2020-01-25', '2020-10-25', '2020-12-19'],
                    'Quantity': [500.00, 1985.95, 214714.00]})
Output is:
df1
    ID     Buydate   Quantity
0  AAA  2017-04-13     100.00
1  AAA  2019-12-31    2000.00
2  AAA  2019-03-05     385.95
3  BBB  2018-11-04  214514.00
4  CCC  2019-12-31   63205.00

df2
    ID    Selldate   Quantity
0  AAA  2020-01-25     500.00
1  AAA  2020-10-25    1985.95
2  BBB  2020-12-19  214714.00
First I added a cumsum column; then I plan to loop over each group of df1 and look up df2 by ID. If the share count is less than the quantity of the first lot in df2, I use the original quantity from df1; if it exceeds it, I need to take the remaining quantity and continue with the second lot of df2. I guess I need a concat function at some point.
The ideal result is:
    ID     Buydate   Quantity  SplitQuantity    Selldate
0  AAA  2017-04-13     100.00         100.00  2020-01-25
1  AAA  2019-03-05     385.95         385.95  2020-01-25
2  AAA  2019-12-31    2000.00          14.05  2020-01-25
3  AAA  2019-12-31    2000.00        1985.95  2020-10-25
4  BBB  2018-11-04  214514.00      214514.00  2020-12-19
5  CCC  2019-12-31   63205.00            NaN         NaT
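For reference, the cumulative-quantity column mentioned above could be added per ID like this; a minimal sketch, assuming df1 and df2 from the snippet above, with illustrative column names CumBuy and CumSell:

# Running totals of quantity per ID, in date order.
df1 = df1.sort_values(['ID', 'Buydate'])
df2 = df2.sort_values(['ID', 'Selldate'])
df1['CumBuy'] = df1.groupby('ID')['Quantity'].cumsum()
df2['CumSell'] = df2.groupby('ID')['Quantity'].cumsum()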
This solution is a little messy, but what you're asking is a little complicated, so here comes a working prototype:
# Sort values by date.
df1 = df1.sort_values(by='Buydate').reset_index()

# id_jump tracks, per ID, how many buy lots have already been fully consumed.
id_jump = {}
for id_ in df1['ID']:
    id_jump[id_] = 0

new_index = ['ID', 'Buydate', 'Quantity', 'SplitQuantity', 'Selldate']
new_df = []

# For each row in df2, subtract its quantity from the df1 rows with the same ID.
for index, row in df2.iterrows():
    sum_ = row['Quantity']
    for index2, row2 in df1[df1['ID'] == row['ID']].iterrows():
        if index2 < id_jump[row['ID']]:
            # Skip buy lots already consumed by previous sales.
            continue
        if sum_ > row2['Quantity']:
            sub = row2['Quantity']
            sum_ = sum_ - row2['Quantity']
            id_jump[row['ID']] += 1
            new_df.append(
                [row2['ID'], row2['Buydate'], row2['Quantity'], sub, row['Selldate']])
        else:
            id_jump[row['ID']] += 1
            new_df.append(
                [row2['ID'], row2['Buydate'], row2['Quantity'], sum_, row['Selldate']])
            break

df3 = pd.DataFrame(new_df, columns=new_index)

# Append rows for IDs that never appear in df2 (never sold), e.g. 'CCC'.
for id_ in df1['ID']:
    if id_jump[id_] == 0:
        df4 = pd.concat([df3, df1[df1['ID'] == id_]]).drop(columns='index').reset_index()

print(df4)
#     ID     Buydate   Quantity  SplitQuantity    Selldate
# 0  AAA  2017-04-13     100.00         100.00  2020-01-25
# 1  AAA  2019-03-05     385.95         385.95  2020-01-25
# 2  AAA  2019-12-31    2000.00          14.05  2020-01-25
# 3  AAA  2019-12-31    2000.00        1985.95  2020-10-25
# 4  BBB  2018-11-04  214514.00      214514.00  2020-12-19
# 5  CCC  2019-12-31   63205.00            NaN         NaN

Pandas partial transpose

I want to reformat a dataframe by transposing some columns while keeping other columns fixed.
Original data:
ID  subID  values_A
--  -----  --------
A   aaa    10
B   baa    20
A   abb    30
A   acc    40
C   caa    50
B   bbb    60
Pivot once:
pd.pivot_table(df, index=["ID", "subID"])
Output:
ID  subID  values_A
--  -----  --------
A   aaa    10
    abb    30
    acc    40
B   baa    20
    bbb    60
C   caa    50
What I want to do (fix the ['ID'] column and partially transpose the rest):
ID  subID_1  value_1  subID_2  value_2  subID_3  value_3
--  -------  -------  -------  -------  -------  -------
A   aaa      10       abb      30       acc      40
B   baa      20       bbb      60       NaN      NaN
C   caa      50       NaN      NaN      NaN      NaN
I know the maximum number of subIDs under each ID.
I don't need any value calculations when pivoting and transposing the dataframe.
Please help
Use cumcount for a counter, create a MultiIndex with set_index, reshape with unstack, and sort the first level of the MultiIndex in the columns with sort_index. Finally, flatten the columns with a list comprehension and call reset_index:
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)

# python 3.6+
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
# python below 3.6
# df.columns = ['{}_{}'.format(a, b+1) for a, b in df.columns]

df = df.reset_index()
print (df)
  ID subID_1  values_A_1 subID_2  values_A_2 subID_3  values_A_3
0  A     aaa        10.0     abb        30.0     acc        40.0
1  B     baa        20.0     bbb        60.0     NaN         NaN
2  C     caa        50.0     NaN         NaN     NaN         NaN

python pandas - group by date and count

I have the below dataframe. Dates are in DD/MM/YYYY format.
Date               id
1/5/2017 2:00 PM   100
1/5/2017 3:00 PM   101
2/5/2017 10:00 AM  102
3/5/2017 09:00 AM  103
3/5/2017 10:00 AM  104
4/5/2017 09:00 AM  105
I need output such that I can group by date and count the number of ids per day, ignoring the time. The output dataframe should be as below:
DATE      Count
1/5/2017  2      -> count of ids 100, 101
2/5/2017  1
3/5/2017  2
4/5/2017  1
I need an efficient way to achieve this.
Use:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df['Date'].dt.date.value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
df1 = df.groupby(df['Date'].dt.date).size().reset_index(name='Count')
print (df1)
         DATE  Count
0  2017-05-01      2
1  2017-05-02      1
2  2017-05-03      2
3  2017-05-04      1
If you need the same format:
df1 = df['Date'].str.split().str[0].value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
new = df['Date'].str.split().str[0]
df1 = df.groupby(new).size().reset_index(name='Count')
print (df1)
       Date  Count
0  1/5/2017      2
1  2/5/2017      1
2  3/5/2017      2
3  4/5/2017      1

Resample rows for missing dates and forward fill values in all columns except one

I currently have the following sample dataframe:
No  FlNo  DATE      Loc  Type
20  1826  6/1/2017  AAA  O
20  1112  6/4/2017  BBB  O
20  1234  6/6/2017  CCC  O
20    43  6/7/2017  DDD  O
20  1840  6/8/2017  EEE  O
I want to fill in the missing dates between two adjacent rows. I also want to fill in the values of the non-date columns with the values from the row above, BUT leave the 'Type' column blank for the filled-in rows.
Please see desired output:
No  FlNo  DATE      Loc  Type
20  1826  6/1/2017  AAA  O
20  1826  6/2/2017  AAA
20  1826  6/3/2017  AAA
20  1112  6/4/2017  BBB  O
20  1112  6/5/2017  BBB
20  1234  6/6/2017  CCC  O
20    43  6/7/2017  DDD  O
20  1840  6/8/2017  EEE  O
I have searched all around Google and Stack Overflow but did not find any date fill-in answers for a pandas dataframe.
First, convert DATE to a datetime column using pd.to_datetime,
df.DATE = pd.to_datetime(df.DATE)
Option 1
Use resample + ffill, and then reset the Type column later. First, store the unique dates in some list:
dates = df.DATE.unique()
Now,
df = df.set_index('DATE').resample('1D').ffill().reset_index()
df.Type = df.Type.where(df.DATE.isin(dates), '')
df
        DATE  No  FlNo  Loc Type
0 2017-06-01  20  1826  AAA    O
1 2017-06-02  20  1826  AAA
2 2017-06-03  20  1826  AAA
3 2017-06-04  20  1112  BBB    O
4 2017-06-05  20  1112  BBB
5 2017-06-06  20  1234  CCC    O
6 2017-06-07  20    43  DDD    O
7 2017-06-08  20  1840  EEE    O
If needed, you may bring DATE back to its original format:
df.DATE = df.DATE.dt.strftime('%m/%d/%Y')
Option 2
Another option would be asfreq + ffill + fillna:
df = df.set_index('DATE').asfreq('1D').reset_index()
c = df.columns.difference(['Type'])
df[c] = df[c].ffill()
df['Type'] = df['Type'].fillna('')
df
        DATE    No    FlNo  Loc Type
0 2017-06-01  20.0  1826.0  AAA    O
1 2017-06-02  20.0  1826.0  AAA
2 2017-06-03  20.0  1826.0  AAA
3 2017-06-04  20.0  1112.0  BBB    O
4 2017-06-05  20.0  1112.0  BBB
5 2017-06-06  20.0  1234.0  CCC    O
6 2017-06-07  20.0    43.0  DDD    O
7 2017-06-08  20.0  1840.0  EEE    O
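Note that in Option 2 the No and FlNo columns come back as floats, because asfreq introduces NaN for the new rows before ffill runs. If whole numbers are wanted, a possible follow-up (assuming those columns should be integers):

# Cast the forward-filled numeric columns back to integer dtype.
df[['No', 'FlNo']] = df[['No', 'FlNo']].astype(int)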
