Pandas: how to group by different days and columns? - python

I have a dataframe like the following (here a subset):
df1
ID zone date
0 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
1 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
2 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
3 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
4 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
If I count the distinct IDs per day I get:
tmp = df1.groupby(['date']).agg({"ID": pd.Series.nunique}).reset_index()
tmp.head()
date ID
0 2019-12-31 4653
1 2020-01-01 6656
2 2020-01-02 1
Now if I group by zone and date I have the following:
distinctID = df1.groupby(['date', "zone"]).agg({"ID": pd.Series.nunique}).reset_index()
date zone ID
0 2019-12-31 00023901 1
1 2019-12-31 00025441 2
2 2019-12-31 00025442 2
3 2019-12-31 00025443 3
4 2019-12-31 00025444 2
If I then sum these per-zone counts for each day, I get:
tmp1 = distinctID.groupby(['date']).agg({"ID": 'sum'}).reset_index()
tmp1.head()
date ID
0 2019-12-31 5833
1 2020-01-01 11837
2 2020-01-02 1
Why don't I get the same count for each day?

The problem is that your two computations are not the same. I changed the data to show it:
print (df1)
date zone ID
0 2019-12-31 23901 a
0 2019-12-31 23901 b
0 2019-12-31 25441 b
1 2019-12-31 25441 a
2 2019-12-31 25442 a
#only 2 unique values per date
tmp = df1.groupby(['date']).agg({"ID": pd.Series.nunique}).reset_index()
print (tmp)
date ID
0 2019-12-31 2 <-a, b
#if tested per 2 columns there are more unique values, because each zone is tested separately
distinctID = df1.groupby(['date', "zone"]).agg({"ID": pd.Series.nunique}).reset_index()
print (distinctID)
date zone ID
0 2019-12-31 23901 2 <-a, b
1 2019-12-31 25441 2 <-a, b
2 2019-12-31 25442 1 <-a
#the sum is different, because unique values are counted per the 2 columns
tmp1 = distinctID.groupby(['date']).agg({"ID": 'sum'}).reset_index()
print (tmp1)
date ID
0 2019-12-31 5 <-a, b, a, b, a
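
To reconcile the two numbers on this toy data, a minimal sketch (only pandas assumed) showing that summing the per-zone nunique is the same as counting distinct (date, zone, ID) triples:
import pandas as pd

df1 = pd.DataFrame({'date': ['2019-12-31'] * 5,
                    'zone': ['23901', '23901', '25441', '25441', '25442'],
                    'ID':   ['a', 'b', 'b', 'a', 'a']})

#distinct IDs per date, regardless of zone -> 2 (a, b)
print (df1.groupby('date')['ID'].nunique())

#distinct (date, zone, ID) triples per date -> 5, the number tmp1 sums to
print (df1.drop_duplicates(['date', 'zone', 'ID']).groupby('date').size())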

Populate new column with added day to date in another row and another column based on condition in Python

I have a dataset like this:
date        Condition
20-01-2015  1
20-02-2015  1
20-03-2015  2
20-04-2015  2
20-05-2015  2
20-06-2015  1
20-07-2015  1
20-08-2015  2
20-09-2015  2
20-09-2015  1
I want a new column date_new which should look at the Condition column. If the condition is 1, do nothing. If the condition is 2, add a day to the date and store it in date_new.
Additional condition: there should be 3 continuous 2's for this to work.
The final output should look like this.
date        Condition  date_new
20-01-2015  1
20-02-2015  1
20-03-2015  2          21-02-2015
20-04-2015  2
20-05-2015  2
20-06-2015  1
20-07-2015  1
20-08-2015  2
20-09-2015  2
20-09-2015  1
Any help is appreciated. Thank you.
This solution is a little bit different: if the condition is 1 I put None, otherwise I add (condition value - 1) days to the date:
df['date_new'] = np.where(df['condition'] == 1, None, (df['date'] + pd.to_timedelta(df['condition']-1,'d')).dt.strftime('%d-%m-%Y') )
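For reference, a minimal, hypothetical setup under which this one-liner runs (column names as in the snippet above, i.e. lowercase condition; dates parsed day-first):
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['20-01-2015', '20-03-2015', '20-04-2015'], format='%d-%m-%Y'),
                   'condition': [1, 2, 2]})
df['date_new'] = np.where(df['condition'] == 1, None,
                          (df['date'] + pd.to_timedelta(df['condition'] - 1, 'd')).dt.strftime('%d-%m-%Y'))
#condition 1 -> None; condition 2 -> date + 1 day, e.g. 21-03-2015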
Ok, so I've edited my answer and transformed it into a function:
def newdate(df):
    L = df.Condition
    res = [i for i, j, k in zip(L, L[1:], L[2:]) if i == j == k]
    if 2 in res:
        df['date'] = pd.to_datetime(df['date'])
        df['new_date'] = df.apply(lambda x: x["date"] + pd.DateOffset(days=2) if x["Condition"] == 2 else pd.NA, axis=1)
        df['new_date'] = pd.to_datetime(df['new_date'])
    df1 = df
    return df1
#output:
index  date                 Condition  new_date
0      2015-01-20 00:00:00  1          NaT
1      2015-02-20 00:00:00  1          NaT
2      2015-03-20 00:00:00  2          2015-03-22 00:00:00
3      2015-04-20 00:00:00  2          2015-04-22 00:00:00
4      2015-05-20 00:00:00  2          2015-05-22 00:00:00
5      2015-06-20 00:00:00  1          NaT
6      2015-07-20 00:00:00  1          NaT
7      2015-08-20 00:00:00  2          2015-08-22 00:00:00
8      2015-09-20 00:00:00  2          2015-09-22 00:00:00
9      2015-09-20 00:00:00  1          NaT
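Note that the check above looks for three consecutive equal values anywhere in the column; if the rule should hold per run of 2's, here is a hedged sketch (assuming date is already datetime, and reading "add a day" as date + 1 day; use df['date'].shift() instead if the day should come from the previous row):
#label consecutive runs of equal Condition values
run_id = df['Condition'].ne(df['Condition'].shift()).cumsum()
run_len = df.groupby(run_id)['Condition'].transform('size')
#only 2's inside runs of length >= 3 qualify
mask = df['Condition'].eq(2) & run_len.ge(3)
df['date_new'] = (df['date'] + pd.DateOffset(days=1)).where(mask)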

Combine consecutive rows of unsorted dates (one day before/after, or the same day) into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying answers from other posts but they don't really match my use case.
Thanks in advance!
You can approach this by:
1. Getting the day diff of each consecutive entry within the same group, by subtracting the current Start from the last End within the group using GroupBy.shift().
2. Setting a group number group_no such that a new group number is issued when the day diff from the previous entry within the group is greater than 1.
3. Grouping by Id and group_no and aggregating the Start and End dates for each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within the group, we use x.iloc[-1] instead of 'last'.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract the months from both date columns:
df['sMonth'] = df['Start'].apply(pd.to_datetime).dt.month
df['eMonth'] = df['End'].apply(pd.to_datetime).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
df.groupby(['Id','Feature1','Feature2','sMonth','eMonth']).agg({'Start':'min','End':'max'}).reset_index().drop(['sMonth','eMonth'],axis=1)
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

Generating rows (mins) based on difference between start and end time

This is a real use case that I am trying to implement in my work.
Sample data (fake data but similar data structure)
Lap Starttime Endtime
1 10:00:00 10:05:00
format: hh:mm:ss
Desired output
Lap time
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
So far I am only trying to think of the logic and techniques required... the code below is not correct:
import re
import pandas as pd

df = pd.read_csv('sample.csv')

#1. to determine how many rows to generate. eg. 1000 to 1005 is 6 rows
df['time'] = df['Endtime'] - df['Startime']

#2. add one new row with 1 added minute. eg. 6 rows
for i in No_of_rows:
    if df['time'] < df['Endtime']: #if 'time' still before end time, then continue append
        df['time'] = df['Startime'] += 1 #not sure how to select Minute part only
    else:
        continue
Pardon my limited coding skills. I appreciate all the help from you experts. Thanks!
Try with pd.date_range and explode:
#convert to datetime if needed
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
#create list of 1min ranges
df["Range"] = df.apply(lambda x: pd.date_range(x["Starttime"], x["Endtime"], freq="1min"), axis=1)
#explode, drop unneeded columns and keep only time
df = df.drop(["Starttime", "Endtime"], axis=1).explode("Range")
df["Range"] = df["Range"].dt.time
>>> df
Range
Lap
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
Input df:
df = pd.DataFrame({"Lap": [1],
"Starttime": ["10:00:00"],
"Endtime": ["10:05:00"]}).set_index("Lap")
>>> df
Starttime Endtime
Lap
1 10:00:00 10:05:00
You can convert the times to datetimes; that will arbitrarily prepend today's date (whatever date you're running on), but we can remove it later, and it allows for easier manipulation:
>>> bounds = df[['Starttime', 'Endtime']].transform(pd.to_datetime)
>>> bounds
Starttime Endtime
0 2021-09-29 10:00:00 2021-09-29 10:05:00
1 2021-09-29 10:00:00 2021-09-29 10:02:00
Then we can simply use pd.date_range with a 1 minute frequency:
>>> times = bounds.agg(lambda s: pd.date_range(*s, freq='1min'), axis='columns')
>>> times
0 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
1 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
dtype: object
Now joining that with the Lap info and using df.explode():
>>> result = df[['Lap']].join(times.rename('time')).explode('time').reset_index(drop=True)
>>> result
Lap time
0 1 2021-09-29 10:00:00
1 1 2021-09-29 10:01:00
2 1 2021-09-29 10:02:00
3 1 2021-09-29 10:03:00
4 1 2021-09-29 10:04:00
5 1 2021-09-29 10:05:00
6 2 2021-09-29 10:00:00
7 2 2021-09-29 10:01:00
8 2 2021-09-29 10:02:00
Finally we wanted to remove the day:
>>> result['time'] = result['time'].dt.time
>>> result
Lap time
0 1 10:00:00
1 1 10:01:00
2 1 10:02:00
3 1 10:03:00
4 1 10:04:00
5 1 10:05:00
6 2 10:00:00
7 2 10:01:00
8 2 10:02:00
The objects in your series are now datetime.time
Here is another way without using apply/agg:
Convert to datetime first:
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
Get the difference between the end and start times, then repeat the rows using index.repeat. Then, using groupby and cumcount, build a pd.to_timedelta in minutes and add it to the existing start times:
repeats = df['Endtime'].sub(df['Starttime']).dt.total_seconds()//60
out = df.loc[df.index.repeat(repeats+1),['Lap','Starttime']]
out['Starttime'] = (out['Starttime'].add(
    pd.to_timedelta(out.groupby("Lap").cumcount(), 'min')).dt.time)
print(out)
Lap Starttime
0 1 10:00:00
0 1 10:01:00
0 1 10:02:00
0 1 10:03:00
0 1 10:04:00
0 1 10:05:00

Check if date in one dataframe is between two dates in another dataframe, by group

I have the following problem. I've got a dataframe with start and end dates for each group. There might be more than one start and end date per group, like this:
group start_date end_date
1 2020-01-03 2020-03-03
1 2020-05-03 2020-06-03
2 2020-02-03 2020-06-03
And another dataframe with one row per date, per group, like this:
group date
1 2020-01-03
1 2020-02-03
1 2020-03-03
1 2020-04-03
1 2020-05-03
1 2020-06-03
2 2020-02-03
3 2020-03-03
4 2020-04-03
.
.
So I want to create a column is_between in an efficient way, ideally avoiding loops, so that I get the following dataframe:
group date is_between
1 2020-01-03 1
1 2020-02-03 1
1 2020-03-03 1
1 2020-04-03 0
1 2020-05-03 1
1 2020-06-03 1
2 2020-02-03 1
3 2020-03-03 1
4 2020-04-03 1
.
.
So it gets a 1 when a group's date is between the dates in the first dataframe. I'm guessing some combination of groupby, where, between and maybe map might do it, but I'm not finding the correct one. Any ideas?
Based on YOBEN_S's and Quang Hoang's advice, this did it:
df = df.merge(dic_dates, how='left')
df['is_between'] = np.where(df.date.between(pd.to_datetime(df.start_date),
                                            pd.to_datetime(df.end_date)), 1, 0)
df = (df.sort_values(by=['group', 'date', 'is_between'])
        .drop_duplicates(subset=['group', 'date'], keep='last'))
You could try merge_asof, matching by group and on date against start_date, then check where the date is less than or equal to end_date, and finally assign back to the original df2:
ser = (pd.merge_asof(df2.reset_index()  # for later index alignment
                        .sort_values('date'),
                     df1.sort_values('start_date'),
                     by='group',
                     left_on='date', right_on='start_date',
                     direction='backward')
         .assign(is_between=lambda x: x.date <= x.end_date)
         .set_index(['index'])['is_between']
       )
df2['is_between'] = ser.astype(int)
print (df2)
group date is_between
0 1 2020-01-03 1
1 1 2020-02-03 1
2 1 2020-03-03 1
3 1 2020-04-03 0
4 1 2020-05-03 1
5 1 2020-06-03 1
6 2 2020-02-03 1
7 3 2020-03-03 0
8 4 2020-04-03 0

Sum a column based on groupby and condition

I have a dataframe with some columns. I want to sum the column gap where the time falls in certain time slots.
region. date. time. gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column. I have the time slots in a dict like this:
'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'
After summation, the above dataframe should look like this:
region. date. time. gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots from 00:00:00 to 23:59:49. I have tried this:
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
But it doesn't work.
The idea is to convert the time column to datetimes floored by 10Min, then convert to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Aggregate the sum, and last map the values with the dictionary with swapped keys and values:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
If you want to display the next 10Min slot instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over floor-and-convert-to-strings is binning by cut or searchsorted:
df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
labels = labels[:-1]
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]
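A hedged sketch of finishing this binned version the same way as above (d1 is the swapped dict from earlier; observed=True drops bins that never occur; labels missing from the example dict would map to NaN):
regres = df.groupby(['region','date','time1'], as_index=False, observed=True)['gap'].sum()
regres['time1'] = regres['time1'].astype(str) + '/' + regres['time1'].astype(str).map(d1)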
Just to avoid the complication of the datetime comparison (unless that is your whole point, in which case ignore my answer), and to show the essence of this group-by-slot-window problem, I assume here that times are integers.
df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
df['slot'] = df.apply(func=lambda x: slots[np.argmax(slots[x['time'] > slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output
      gap
slot
0       2
1000    1
1500    3
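If the row-wise apply is slow on large data, a hedged vectorized equivalent (assuming slots is sorted ascending and every time is strictly greater than slots[0], matching the strict > comparison above):
idx = np.searchsorted(slots, df['time'].to_numpy(), side='left') - 1
df['slot'] = slots[idx]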
The way to think about approaching this problem is to convert your time column to the values you want first, and then do a groupby sum on the time column.
The code below shows the approach I've used. I used np.select so I can include as many conditions and choice options as I want. After converting time to the values I wanted, I did a simple groupby sum.
None of the fuss of formatting times or converting strings is really needed; simply let the pandas dataframe handle it intuitively.
import pandas as pd
import numpy as np  #This is the library you require for the np.select function

#Just creating the DataFrame using a dictionary here
regdict = {
    'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
    'gap': [1,0,1,0,0,1,0,0],}
df = pd.DataFrame(regdict)
#Add in all your conditions and options here
condlist = [df['time']<'00:10:00',df['time']<'00:20:00']
choicelist = ['00:10:00/slot1','00:20:00/slot2']
#Use np.select after you have defined all your conditions and options
answerlist = np.select(condlist, choicelist)
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
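An aside, not part of the original answer: np.select fills rows that match no condition with 0 by default, so any time at or beyond '00:20:00' would get 0 here; an explicit catch-all can be passed instead (the label below is just a hypothetical example):
answerlist = np.select(condlist, choicelist, default='00:30:00/slot3')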
#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
time gap
0 00:10:00/slot1 1
1 00:10:00/slot1 0
2 00:10:00/slot1 1
3 00:10:00/slot1 0
4 00:20:00/slot2 0
5 00:20:00/slot2 1
6 00:20:00/slot2 0
7 00:20:00/slot2 0
df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
time gap
0 00:10:00/slot1 2
1 00:20:00/slot2 1
If you wish to keep the original time you can instead do df['timeNew'] = answerlist and then filter from there.
df['timeNew'] = answerlist
print (df)
time gap timeNew
0 00:00:08 1 00:10:00/slot1
1 00:00:48 0 00:10:00/slot1
2 00:02:50 1 00:10:00/slot1
3 00:00:52 0 00:10:00/slot1
4 00:10:01 0 00:20:00/slot2
5 00:10:03 1 00:20:00/slot2
6 00:10:05 0 00:20:00/slot2
7 00:10:08 0 00:20:00/slot2
#Use transform function here to retain all prior values
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform(sum)
print (df)
time gap timeNew aggregate sum of gap
0 00:00:08 1 00:10:00/slot1 2
1 00:00:48 0 00:10:00/slot1 2
2 00:02:50 1 00:10:00/slot1 2
3 00:00:52 0 00:10:00/slot1 2
4 00:10:01 0 00:20:00/slot2 1
5 00:10:03 1 00:20:00/slot2 1
6 00:10:05 0 00:20:00/slot2 1
7 00:10:08 0 00:20:00/slot2 1
