Sum a column based on groupby and condition

Sum a column based on groupby and condition - python

I have a dataframe with several columns. I want to sum the "gap" column for rows whose time falls within given time slots.
region date time gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column. I have the time slots in a dict like this:
'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'
After summation, the dataframe above should look like this:
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots from 00:00:00 to 23:59:49. I have tried this.
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
But it doesn't work.

The idea is to convert the time column to datetimes, floor them to 10-minute intervals, and then convert them back to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Then aggregate with sum and, as a last step, map the times to slot names using the dictionary with its keys and values swapped:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
If you want to display the next 10-minute boundary (the end of each slot) instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
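The question mentions 144 slots, so typing the dictionary by hand is impractical. Below is a minimal sketch, assuming the slots are consecutive 10-minute windows starting at midnight and are named slot1 to slot144; the variable `starts` and the small sample frame are only illustrative and not part of the original answer.

```python
import pandas as pd

# Assumption: 144 consecutive 10-minute slots covering one day, named slot1..slot144.
starts = pd.timedelta_range("00:00:00", periods=144, freq="10min")
# str(Timedelta) looks like '0 days 00:10:00'; keep the trailing HH:MM:SS part.
d = {f"slot{i}": str(t)[-8:] for i, t in enumerate(starts, start=1)}
d1 = {v: k for k, v in d.items()}  # slot start time -> slot name

# Sample data copied from the question.
df = pd.DataFrame({
    "region": [1] * 8,
    "date": ["2016-01-01"] * 8,
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Floor each time to the start of its 10-minute slot, then back to a string.
df["time"] = (pd.to_datetime(df["time"], format="%H:%M:%S")
                .dt.floor("10min")
                .dt.strftime("%H:%M:%S"))
regres = df.groupby(["region", "date", "time"], as_index=False)["gap"].sum()
regres["time"] = regres["time"] + "/" + regres["time"].map(d1)
print(regres)
```

Slots with no rows simply do not appear in the result, because groupby only produces groups that are present in the data.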
EDIT:
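An improvement over flooring and converting to strings is binning with cut or searchsorted:
import numpy as np
import pandas as pd
df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
#label each bin by its start time as HH:MM:SS; the last edge (24:00:00) starts no bin
labels = np.array([str(x)[-8:] for x in bins])
labels = labels[:-1]
#pd.cut bins are right-closed by default, so a time exactly on an edge goes to the earlier slot
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]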
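One detail worth checking is what happens to a time that falls exactly on a slot boundary. With flooring, 00:10:00 belongs to the slot that starts at 00:10:00, while a right-closed bin puts it in the previous slot. The sketch below, which is an assumption about the desired behaviour rather than part of the original answer, uses `side='right'` in `np.searchsorted` so that each bin is closed on the left and matches the floor result (the analogous `pd.cut` option is `right=False`). The sample times are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative times, including values exactly on slot boundaries.
times = pd.to_timedelta(
    pd.Series(["00:00:00", "00:09:59", "00:10:00", "00:10:08", "23:59:59"])
)
bins = pd.timedelta_range("00:00:00", "24:00:00", freq="10min")
labels = np.array([str(x)[-8:] for x in bins[:-1]])  # one label per slot start

# side='right' counts the bin edges that are <= each time, so subtracting 1
# gives the index of the slot [start, start + 10min) that contains the time.
idx = np.searchsorted(bins.to_numpy(), times.to_numpy(), side="right") - 1
slot_start = labels[idx]
print(slot_start)
```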

Just to avoid the complication of the Datetime comparison (unless that is your whole point, in which case ignore my answer), and show the essence of this group by slot window problem, I here assume times are integers.
df = pd.DataFrame({'time':[8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
df['slot'] = df.apply(func = lambda x: slots[np.argmax(slots[x['time']>slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output
gap
slot
-----------
0 2
1000 1
1500 3
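The row-wise `apply` can be replaced by a single vectorized lookup. This is a minimal sketch using `np.searchsorted`, assuming `slots` is sorted in ascending order and no time is smaller than the first slot. Note one difference from the code above: a time exactly equal to a slot start (for example 1000) is assigned to that slot here, whereas the strict `>` comparison above assigns it to the previous one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])  # must be sorted ascending

# For each time, find the index of the last slot start that is <= time.
idx = np.searchsorted(slots, df['time'].to_numpy(), side='right') - 1
df['slot'] = slots[idx]
result = df.groupby('slot')['gap'].sum()
print(result)
```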

The way to approach this problem is to first convert your time column to the values you want, and then group by the time column and sum gap.
The code below shows the approach I used. I used np.select because it lets me include as many conditions and corresponding choices as I want. After converting time to the values I wanted, I did a simple groupby sum.
None of the fuss of formatting times or converting strings is really needed. Simply let the pandas dataframe handle it.
import pandas as pd
import numpy as np #This is the library you require for np.select function
#Just creating the DataFrame using a dictionary here
regdict = {
'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
'gap': [1,0,1,0,0,1,0,0],}
df = pd.DataFrame(regdict)
#Add in all your conditions and options here
#(zero-padded HH:MM:SS strings compare correctly as plain strings)
condlist = [df['time']<'00:10:00',df['time']<'00:20:00']
choicelist = ['00:10:00/slot1','00:20:00/slot2']
#Use np.select after you have defined all your conditions and options
#default='' keeps the fallback value a string, like the choices
answerlist = np.select(condlist, choicelist, default='')
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
time gap
0 00:10:00/slot1 1
1 00:10:00/slot1 0
2 00:10:00/slot1 1
3 00:10:00/slot1 0
4 00:20:00/slot2 0
5 00:20:00/slot2 1
6 00:20:00/slot2 0
7 00:20:00/slot2 0
df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
time gap
0 00:10:00/slot1 2
1 00:20:00/slot2 1
If you wish to keep the original time you can instead do df['timeNew'] = answerlist and then filter from there.
df['timeNew'] = answerlist
print (df)
time gap timeNew
0 00:00:08 1 00:10:00/slot1
1 00:00:48 0 00:10:00/slot1
2 00:02:50 1 00:10:00/slot1
3 00:00:52 0 00:10:00/slot1
4 00:10:01 0 00:20:00/slot2
5 00:10:03 1 00:20:00/slot2
6 00:10:05 0 00:20:00/slot2
7 00:10:08 0 00:20:00/slot2
#Use transform function here to retain all prior values
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform('sum')
print (df)
time gap timeNew aggregate sum of gap
0 00:00:08 1 00:10:00/slot1 2
1 00:00:48 0 00:10:00/slot1 2
2 00:02:50 1 00:10:00/slot1 2
3 00:00:52 0 00:10:00/slot1 2
4 00:10:01 0 00:20:00/slot2 1
5 00:10:03 1 00:20:00/slot2 1
6 00:10:05 0 00:20:00/slot2 1
7 00:10:08 0 00:20:00/slot2 1

Related

Populate new column with added day to date in another row and another column based on condition in Python

I have a dataset like this:
date Condition
20-01-2015 1
20-02-2015 1
20-03-2015 2
20-04-2015 2
20-05-2015 2
20-06-2015 1
20-07-2015 1
20-08-2015 2
20-09-2015 2
20-09-2015 1
I want a new column date_new that depends on the Condition column. If the condition is 1, do nothing. If the condition is 2, add a day to the date and store it in date_new.
Additional condition: there should be 3 consecutive 2's for this to apply.
The final output should look like this.
date Condition date_new
20-01-2015 1
20-02-2015 1
20-03-2015 2 21-02-2015
20-04-2015 2
20-05-2015 2
20-06-2015 1
20-07-2015 1
20-08-2015 2
20-09-2015 2
20-09-2015 1
Any help is appreciated. Thank you.

This solution is a little bit different. If the condition is 1, it puts None; otherwise it adds (condition value - 1) days to the date:
df['date_new'] = np.where(df['condition'] == 1, None, (df['date'] + pd.to_timedelta(df['condition']-1,'d')).dt.strftime('%d-%m-%Y') )

Ok, so I've edited my answer and transformed it into a function:
def newdate(df):
    L = df.Condition
    #collect the values that start a run of three equal consecutive conditions
    res = [i for i, j, k in zip(L, L[1:], L[2:]) if i == j == k]
    if 2 in res:
        df['date'] = pd.to_datetime(df['date'])
        df['new_date'] = df.apply(lambda x: x["date"]+pd.DateOffset(days=2) if x["Condition"]==2 else pd.NA, axis=1)
        df['new_date'] = pd.to_datetime(df['new_date'])
    df1 = df
    return df1
#output:
index date Condition new_date
0 2015-01-20 00:00:00 1 NaT
1 2015-02-20 00:00:00 1 NaT
2 2015-03-20 00:00:00 2 2015-03-22 00:00:00
3 2015-04-20 00:00:00 2 2015-04-22 00:00:00
4 2015-05-20 00:00:00 2 2015-05-22 00:00:00
5 2015-06-20 00:00:00 1 NaT
6 2015-07-20 00:00:00 1 NaT
7 2015-08-20 00:00:00 2 2015-08-22 00:00:00
8 2015-09-20 00:00:00 2 2015-09-22 00:00:00
9 2015-09-20 00:00:00 1 NaT
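Neither answer restricts the new date to runs of at least three consecutive 2's; the function above fills every row where Condition is 2 as soon as one such run exists anywhere. A minimal sketch of a per-run check follows. It assumes that every row inside a qualifying run should get date + 1 day, which is one reading of the question (the expected output there fills only one row), and the names `run_id`, `run_len` and `mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20-01-2015", "20-02-2015", "20-03-2015", "20-04-2015", "20-05-2015",
             "20-06-2015", "20-07-2015", "20-08-2015", "20-09-2015", "20-09-2015"],
    "Condition": [1, 1, 2, 2, 2, 1, 1, 2, 2, 1],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")

# Give each run of equal consecutive Condition values its own id.
run_id = df["Condition"].ne(df["Condition"].shift()).cumsum()
# Length of the run each row belongs to.
run_len = df.groupby(run_id)["Condition"].transform("size")

# Rows inside a run of three or more 2's (assumed interpretation).
mask = df["Condition"].eq(2) & run_len.ge(3)
# Keep date + 1 day where the mask holds, NaT elsewhere.
df["date_new"] = (df["date"] + pd.Timedelta(days=1)).where(mask)
print(df)
```

With the sample data, only rows 2 to 4 get a value; the two 2's at rows 7 and 8 form a run of length two and stay empty.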

interpolation of missing values, not NA

I want to interpolate my data (linear interpolation), but there are no NA values; the rows are simply absent.
Here is my data, with many missing rows:
timestamp id strength
1383260400000 1 -0.3803901328171995
1383261000000 1 -0.42196042219455937
1383265200000 1 -0.460714706261982
My expected output:
timestamp id strength
1383260400000 1 -0.3803901328171995
1383261000000 1 -0.42196042219455937
1383261600000 1 Linear interpolated data
1383262200000 1 Linear interpolated data
1383262800000 1 Linear interpolated data
1383263400000 1 Linear interpolated data
1383264000000 1 Linear interpolated data
1383264600000 1 Linear interpolated data
1383265200000 1 -0.460714706261982
The timestamps start at 1383260400000 and end at 1383343800000,
and the other ids (from 1 to 2025) have the same issue.

The idea is to create datetimes, set them as a DatetimeIndex, and in a lambda function add the missing datetimes with Series.asfreq and then interpolate:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
f = lambda x: x.asfreq('10Min').interpolate()
df = df.set_index('timestamp').groupby('id')['strength'].apply(f).reset_index()
print (df)
id timestamp strength
0 1 2013-10-31 23:00:00 -0.380390
1 1 2013-10-31 23:10:00 -0.421960
2 1 2013-10-31 23:20:00 -0.427497
3 1 2013-10-31 23:30:00 -0.433033
4 1 2013-10-31 23:40:00 -0.438569
5 1 2013-10-31 23:50:00 -0.444106
6 1 2013-11-01 00:00:00 -0.449642
7 1 2013-11-01 00:10:00 -0.455178
8 1 2013-11-01 00:20:00 -0.460715
Finally, if you need the original timestamp format:
df['timestamp'] = df['timestamp'].astype(np.int64) // 1000000
print (df)
id timestamp strength
0 1 1383260400000 -0.380390
1 1 1383261000000 -0.421960
2 1 1383261600000 -0.427497
3 1 1383262200000 -0.433033
4 1 1383262800000 -0.438569
5 1 1383263400000 -0.444106
6 1 1383264000000 -0.449642
7 1 1383264600000 -0.455178
8 1 1383265200000 -0.460715
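The conversion above relies on the datetimes being stored as nanoseconds. A hedged alternative that does not depend on the storage resolution is to integer-divide the offset from the Unix epoch by one millisecond. The sketch below uses the three timestamps from the question; the names `ms`, `ts` and `back` are illustrative.

```python
import pandas as pd

ms = pd.Series([1383260400000, 1383261000000, 1383265200000])
ts = pd.to_datetime(ms, unit="ms")

# Offset from the Unix epoch, integer-divided by one millisecond,
# gives back integer millisecond timestamps.
back = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)
print(back)
```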
EDIT:
#data from question
df =pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000],
'id': [1, 1, 1],
'strength':[-0.3803901328171995,-0.4219604221945593,-0.460714706261982]})
print (df)
timestamp id strength
0 1383260400000 1 -0.380390
1 1383261000000 1 -0.421960
2 1383265200000 1 -0.460715
This solution creates all datetimes for each id with date_range, creates the missing rows with DataFrame.reindex on a MultiIndex, and finally interpolates per id:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
r = pd.date_range(pd.to_datetime(1383260400000, unit='ms') ,
pd.to_datetime(1383343800000, unit='ms'),
freq='10Min')
ids = df['id'].unique()
mux = pd.MultiIndex.from_product([r, ids], names=['timestamp','id'])
f = lambda x: x.interpolate()
df = (df.set_index(['timestamp', 'id'])
.reindex(mux)
.groupby('id')['strength']
.transform(f)
.reset_index())
print (df)
timestamp id strength
0 2013-10-31 23:00:00 1 -0.380390
1 2013-10-31 23:10:00 1 -0.421960
2 2013-10-31 23:20:00 1 -0.427497
3 2013-10-31 23:30:00 1 -0.433033
4 2013-10-31 23:40:00 1 -0.438569
.. ... .. ...
135 2013-11-01 21:30:00 1 -0.460715
136 2013-11-01 21:40:00 1 -0.460715
137 2013-11-01 21:50:00 1 -0.460715
138 2013-11-01 22:00:00 1 -0.460715
139 2013-11-01 22:10:00 1 -0.460715
[140 rows x 3 columns]

Generating rows (mins) based on difference between start and end time

This is a real use case that I am trying to implement in my work.
Sample data (fake data but similar data structure)
Lap Starttime Endtime
1 10:00:00 10:05:00
format: hh:mm:ss
Desired output
Lap time
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
So far I am only trying to work out the logic and techniques required; the code below is not correct:
import re
import pandas as pd
df = pd.read_csv('sample.csv')
#1. to determine how many rows to generate. eg. 1000 to 1005 is 6 rows
df['time'] = df['Endtime'] - df['Startime']
#2. add one new row with 1 added minute. eg. 6 rows
for i in No_of_rows:
if df['time'] < df['Endtime']: #if 'time' still before end time, then continue append
df['time'] = df['Startime'] += 1 #not sure how to select Minute part only
else:
continue
Pardon my limited coding skills. I appreciate any help from the experts here. Thanks!

Try with pd.date_range and explode:
#convert to datetime if needed
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
#create list of 1min ranges
df["Range"] = df.apply(lambda x: pd.date_range(x["Starttime"], x["Endtime"], freq="1min"), axis=1)
#explode, drop unneeded columns and keep only time
df = df.drop(["Starttime", "Endtime"], axis=1).explode("Range")
df["Range"] = df["Range"].dt.time
>>> df
Range
Lap
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
Input df:
df = pd.DataFrame({"Lap": [1],
"Starttime": ["10:00:00"],
"Endtime": ["10:05:00"]}).set_index("Lap")
>>> df
Starttime Endtime
Lap
1 10:00:00 10:05:00

You can convert the times to datetimes. That will arbitrarily prepend today's date (whatever date you run it on), but we can remove that later, and it allows for easier manipulation:
>>> bounds = df[['Starttime', 'Endtime']].transform(pd.to_datetime)
>>> bounds
Starttime Endtime
0 2021-09-29 10:00:00 2021-09-29 10:05:00
1 2021-09-29 10:00:00 2021-09-29 10:02:00
Then we can simply use pd.date_range with a 1 minute frequency:
>>> times = bounds.agg(lambda s: pd.date_range(*s, freq='1min'), axis='columns')
>>> times
0 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
1 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
dtype: object
Now joining that with the Lap info and using df.explode():
>>> result = df[['Lap']].join(times.rename('time')).explode('time').reset_index(drop=True)
>>> result
Lap time
0 1 2021-09-29 10:00:00
1 1 2021-09-29 10:01:00
2 1 2021-09-29 10:02:00
3 1 2021-09-29 10:03:00
4 1 2021-09-29 10:04:00
5 1 2021-09-29 10:05:00
6 2 2021-09-29 10:00:00
7 2 2021-09-29 10:01:00
8 2 2021-09-29 10:02:00
Finally we wanted to remove the day:
>>> result['time'] = result['time'].dt.time
>>> result
Lap time
0 1 10:00:00
1 1 10:01:00
2 1 10:02:00
3 1 10:03:00
4 1 10:04:00
5 1 10:05:00
6 2 10:00:00
7 2 10:01:00
8 2 10:02:00
The objects in your series are now datetime.time

Here is another way without using apply/agg:
Convert to datetime first:
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
Get the difference between the end and start times, then repeat the rows using index.repeat. Then, using groupby and cumcount, build a timedelta in minutes with pd.to_timedelta and add it to the existing start time:
repeats = df['Endtime'].sub(df['Starttime']).dt.total_seconds()//60
out = df.loc[df.index.repeat(repeats+1),['Lap','Starttime']]
out['Starttime'] = (out['Starttime'].add(
pd.to_timedelta(out.groupby("Lap").cumcount(),'min')).dt.time)
print(out)
Lap Starttime
0 1 10:00:00
0 1 10:01:00
0 1 10:02:00
0 1 10:03:00
0 1 10:04:00
0 1 10:05:00

Repeat rows in DataFrame N times based on len(list) in column with different list values

I have a DataFrame, which looks like:
col_1 col_2 ... col_n date
1 1 0 1 [[2017-02-01, 2017-12-01]]
2 0 1 1 [[2018-01-01, 2018-01-01], [2019-01-01, 2019-02-01]]
3 1 1 0 [[2018-04-01, 2019-03-01]]
...
n 0 0 1 [[2017-12-01, 2017-12-01], [2018-03-01, 2018-03-01], [2018-05-01, 2018-05-01], [2018-08-01, 2018-12-01]]
I need to repeat the rows whose df.date holds multiple lists, and then split each list into new columns df.start_date and df.end_date,
e.g.
col_1 col_2 ... col_n date_start date_end
1 1 0 1 2017-02-01 2017-12-01
2 0 1 1 2018-01-01 2018-01-01
3 0 1 1 2019-01-01 2019-02-01
4 1 1 0 2018-04-01 2019-03-01
...
n 0 0 1 2017-12-01 2017-12-01
n 0 0 1 2018-03-01 2018-03-01
n 0 0 1 2018-05-01 2018-05-01
n 0 0 1 2018-08-01 2018-12-01
I tried
date_df['repeat_num'] = [[[row, idx] for idx, item in enumerate(_list)] for row, _list in enumerate(date_df['date'])]
for row in range(len(date_df)):
if id_tuple[row][0][1] == 1: np.repeat(date_df.values, 1, axis = 0)
elif id_tuple[row][0][1] == 2: np.repeat(date_df.values, 2, axis = 0)
elif id_tuple[row][0][1] == 3: np.repeat(date_df.values, 3, axis = 0)
elif id_tuple[row][0][1] == 4: np.repeat(date_df.values, 4, axis = 0)
elif id_tuple[row][0][1] == 5: np.repeat(date_df.values, 5, axis = 0)
But I don't think it worked properly.
Is there a way to do it?

Use DataFrame.explode, available in pandas 0.25+, and create the new columns with the DataFrame constructor:
print (date_df)
a date
0 4 [[2017-02-01 00:00:00, 2017-03-01 00:00:00]]
1 7 [[2017-02-01 00:00:00, 2017-04-01 00:00:00], [...
df = date_df.explode('date')
print (df)
a date
0 4 [2017-02-01 00:00:00, 2017-03-01 00:00:00]
1 7 [2017-02-01 00:00:00, 2017-04-01 00:00:00]
1 7 [2017-02-01 00:00:00, 2017-04-01 00:00:00]
df[['date_start','date_end']] = pd.DataFrame(df.pop('date').values.tolist(), index=df.index)
print (df)
a date_start date_end
0 4 2017-02-01 2017-03-01
1 7 2017-02-01 2017-04-01
1 7 2017-02-01 2017-04-01
EDIT:
Solution for older pandas versions:
s = date_df.pop('date')
df = date_df.loc[date_df.index.repeat(s.str.len())]
df[['date_start','date_end']] = pd.DataFrame(np.concatenate(s), index=df.index)
df = df.reset_index(drop=True)
print (df)
a date_start date_end
0 4 2017-02-01 2017-03-01
1 7 2017-02-01 2017-04-01
2 7 2017-02-01 2017-04-01

Creating sessions based on timestamps and different location

I want to create sessions based on Location and Timestamp. If the location is new or the 15-minute interval has been exceeded, a new session is assigned to the record in the dataframe. Example below:
Location | Time | Session
A 2016-01-01 00:00:15 1
A 2016-01-01 00:05:00 1
A 2016-01-01 00:10:08 1
A 2016-01-01 00:14:08 1
A 2016-01-01 00:15:49 2
B 2016-01-01 00:15:55 3
C 2016-01-01 00:15:58 4
C 2016-01-01 00:26:55 4
C 2016-01-01 00:29:55 4
C 2016-01-01 00:31:08 5
This is my code, which doesn't work for the given problem:
from datetime import timedelta
cond1 = df.DateTime-df.DateTime.shift(1) > pd.Timedelta(15, 'm')
#OR
#15_min = df.DateTime.diff() > pd.Timedelta(minutes=15)
cond2 = df.location != df.location.shift(1)
session_id = (cond1|cond2).cumsum()
df['session_id'] = session_id.map(pd.Series(range(0,10000)))
I want a new session, if new location is found or 15 minutes are up for current location.

You can group by both Location and 15-minute bins of Time (using pd.Grouper), then use ngroup to number each group:
df['Session'] = (df.groupby(['Location',pd.Grouper(key='Time',freq='15min')])
.ngroup()+1)
>>> df
Location Time Session
0 A 2016-01-01 00:00:15 1
1 A 2016-01-01 00:05:00 1
2 A 2016-01-01 00:10:08 1
3 A 2016-01-01 00:14:08 1
4 A 2016-01-01 00:15:49 2
5 B 2016-01-01 00:15:55 3
6 C 2016-01-01 00:15:58 4
7 C 2016-01-01 00:26:55 4
8 C 2016-01-01 00:29:55 4
9 C 2016-01-01 00:31:08 5
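The bins produced by `pd.Grouper` are aligned to the clock (00:00, 00:15, 00:30, ...), and `ngroup` numbers groups in sorted key order, which happens to match the sample. If instead a session should end 15 minutes after its own first record, an explicit loop is one way to express it. This is a minimal sketch under that assumption; `assign_sessions` is a hypothetical helper, not a pandas function, and it expects the rows to already be in the desired order.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": list("AAAAABCCCC"),
    "Time": pd.to_datetime([
        "2016-01-01 00:00:15", "2016-01-01 00:05:00", "2016-01-01 00:10:08",
        "2016-01-01 00:14:08", "2016-01-01 00:15:49", "2016-01-01 00:15:55",
        "2016-01-01 00:15:58", "2016-01-01 00:26:55", "2016-01-01 00:29:55",
        "2016-01-01 00:31:08",
    ]),
})

def assign_sessions(frame, limit=pd.Timedelta(minutes=15)):
    """Hypothetical helper: start a new session when the location changes
    or when `limit` has elapsed since the current session began."""
    sessions = []
    session = 0
    start = None
    prev_loc = None
    for loc, t in zip(frame["Location"], frame["Time"]):
        # The location check runs first, so `start` is never None in the subtraction.
        if loc != prev_loc or t - start >= limit:
            session += 1
            start = t
        prev_loc = loc
        sessions.append(session)
    return sessions

df["Session"] = assign_sessions(df)
print(df)
```

On the sample data this gives the same numbering as the expected output in the question; the two approaches differ only when a record falls in the same clock-aligned bin but more than 15 minutes after its session started, or the other way round.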
