Python/Pandas: How to merge rows based on other column values

I have a table of work experience data. The problem I'm facing is that some people's work experiences have overlapping dates (e.g. rows 240, 241 and 242, 243), where the start date occurs before the end date of the previous job. This overstates the total years of experience. How can I get total years of work experience without double counting overlapping jobs like the ones in the example?
I initially summed the position tenure for each person to get the total years of experience, but that doesn't account for the double counting.

Try:
Input data:
>>> df
start_date end_date
237 2005-01-01 2007-12-01
238 2008-01-01 2012-09-01
239 2012-09-01 2013-07-01
240 2013-07-01 2016-05-01
241 2014-06-01 2016-05-01
242 2016-05-01 2019-10-01
243 2018-01-01 2019-10-01
244 2020-05-01 2021-08-03
First compute the diff between end_date and start_date:
df['diff1'] = df['end_date'] - df['start_date']
print(df)
start_date end_date diff1
237 2005-01-01 2007-12-01 1064 days
238 2008-01-01 2012-09-01 1705 days
239 2012-09-01 2013-07-01 303 days
240 2013-07-01 2016-05-01 1035 days
241 2014-06-01 2016-05-01 700 days
242 2016-05-01 2019-10-01 1248 days
243 2018-01-01 2019-10-01 638 days
244 2020-05-01 2021-08-03 459 days
Now, compute start_date minus the previous row's end_date, keeping the (negative) value only where the start date occurs on or before the previous row's end date (otherwise it becomes 0 days):
df['diff2'] = (df['start_date'] - df['end_date'].shift()) \
.mul(df['start_date'].le(df['end_date'].shift()))
print(df)
start_date end_date diff1 diff2
0 2005-01-01 2007-12-01 1064 days NaT
1 2008-01-01 2012-09-01 1705 days 0 days
2 2012-09-01 2013-07-01 303 days 0 days
3 2013-07-01 2016-05-01 1035 days 0 days
4 2014-06-01 2016-05-01 700 days -700 days
5 2016-05-01 2019-10-01 1248 days 0 days
6 2018-01-01 2019-10-01 638 days -638 days
7 2020-05-01 2021-08-03 459 days 0 days
Finally, add the two diffX columns:
df['real'] = df[['diff1', 'diff2']].sum(axis=1)
print(df)
start_date end_date diff1 diff2 real
0 2005-01-01 2007-12-01 1064 days NaT 1064 days
1 2008-01-01 2012-09-01 1705 days 0 days 1705 days
2 2012-09-01 2013-07-01 303 days 0 days 303 days
3 2013-07-01 2016-05-01 1035 days 0 days 1035 days
4 2014-06-01 2016-05-01 700 days -700 days 0 days
5 2016-05-01 2019-10-01 1248 days 0 days 1248 days
6 2018-01-01 2019-10-01 638 days -638 days 0 days
7 2020-05-01 2021-08-03 459 days 0 days 459 days
The real experience is df['real'].sum().days / 365, almost 16 years instead of 19.5 years. You can put this code into a function and call it via apply after a groupby on person_id.
How would I create a function that can be used with the apply method?
def total_xp_years(df):
    diff1 = df['end_date'] - df['start_date']
    diff2 = df['start_date'] - df['end_date'].shift()
    diff2 *= df['start_date'].le(df['end_date'].shift())
    return (diff1.sum() + diff2.sum()).days / 365

dfxp = df.groupby('person_id').apply(total_xp_years)
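As a quick self-contained check, here is a minimal sketch of that groupby on toy data (the person_id values are invented, and rows within each person are assumed to be sorted by start_date):
import pandas as pd

df = pd.DataFrame({
    'person_id': [1, 1, 2],
    'start_date': pd.to_datetime(['2013-07-01', '2014-06-01', '2005-01-01']),
    'end_date': pd.to_datetime(['2016-05-01', '2016-05-01', '2007-12-01']),
})

dfxp = df.groupby('person_id').apply(total_xp_years)
print(dfxp)
# person 1: ~2.8 years (the fully overlapped second job adds nothing)
# person 2: ~2.9 years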

Related

Aggregate time series data on weekly basis

I have a dataframe that consists of 3 years of data with two columns: remaining useful life (rul) and predicted remaining useful life (pred_diff).
I am aggregating rul and pred_diff over the 3 years of data for each machineID at the maximum date it has. The original dataframe looks like this-
rul pred_diff machineID datetime
10476749 870 312.207825 408 2021-05-25 00:00:00
11452943 68 288.517578 447 2023-03-01 12:00:00
12693829 381 273.159698 493 2021-09-16 16:00:00
3413787 331 291.326416 133 2022-10-26 12:00:00
464093 77 341.506195 19 2023-10-10 16:00:00
... ... ... ... ...
11677555 537 310.586090 456 2022-04-07 00:00:00
2334804 551 289.307129 92 2021-09-04 20:00:00
5508311 35 293.721771 214 2023-01-06 04:00:00
12319704 348 322.199219 479 2021-11-11 20:00:00
4777501 87 278.089417 186 2021-06-29 12:00:00
1287421 rows × 4 columns
And I am aggregating it based on this code-
y_test_grp = y_test.groupby('machineID').agg({'datetime':'max', 'rul':'mean', 'pred_diff':'mean'})[['datetime','rul', 'pred_diff']].reset_index()
which gives the following output-
machineID datetime rul pred_diff
0 1 2023-10-03 20:00:00 286.817681 266.419401
1 2 2023-11-14 00:00:00 225.561953 263.372531
2 3 2023-10-25 00:00:00 304.736237 256.933351
3 4 2023-01-13 12:00:00 204.084899 252.476066
4 5 2023-09-07 00:00:00 208.702431 252.487156
... ... ... ... ...
495 496 2023-10-11 00:00:00 302.445285 298.836798
496 497 2023-08-26 04:00:00 281.601613 263.479885
497 498 2023-11-28 04:00:00 292.593906 263.985034
498 499 2023-06-29 20:00:00 260.887529 263.494844
499 500 2023-11-08 20:00:00 160.223614 257.326034
500 rows × 4 columns
Since this is grouped by machineID only, it gives just 500 rows, which is fewer than I need. I want to aggregate rul and pred_diff on a weekly basis, so that for each machineID I get 52 weeks * 3 years = 156 rows. I am not able to identify which function to use to take 7 days as the interval and aggregate rul and pred_diff over it.
You can use pd.Grouper:
y_test.groupby(['machineID', pd.Grouper(key='datetime', freq='7D')]).mean()
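As a self-contained illustration, here is a minimal sketch on invented data (only the column names follow the question; the values and dates are made up):
import pandas as pd

y_test = pd.DataFrame({
    'machineID': [1, 1, 1, 2, 2],
    'datetime': pd.to_datetime(['2021-05-01', '2021-05-03', '2021-05-09',
                                '2021-05-01', '2021-05-08']),
    'rul': [100, 98, 92, 200, 193],
    'pred_diff': [101.0, 97.5, 93.2, 198.7, 192.1],
})

# one row per machineID per 7-day bin, averaging the numeric columns
weekly = (
    y_test.groupby(['machineID', pd.Grouper(key='datetime', freq='7D')])[['rul', 'pred_diff']]
          .mean()
          .reset_index()
)
print(weekly)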

Sum of individual labels over a month of granular data

I have a dataframe which contains life logging data gathered over several years from 44 unique individuals.
Int64Index: 77171 entries, 0 to 4279
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start 77171 non-null datetime64[ns]
1 end 77171 non-null datetime64[ns]
2 labelName 77171 non-null category
3 id 77171 non-null int64
The start column contains granular datetimes of the format 2020-11-01 11:00:00, in intervals of 30 minutes. The labelName column has 14 different categories.
Categories (14, object): ['COOK', 'EAT', 'GO WALK', 'GO TO BATHROOM', ..., 'DRINK', 'WAKE UP', 'SLEEP', 'WATCH TV']
Here's the head and tail of one sample user, whose data is [2588 rows x 4 columns] and spans from 2020 to 2021. There are also occasional gaps in the data.
start end labelName id
0 2020-08-05 00:00:00 2020-08-05 00:30:00 GO TO BATHROOM 486
1 2020-08-05 06:00:00 2020-08-05 06:30:00 WAKE UP 486
2 2020-08-05 09:00:00 2020-08-05 09:30:00 COOK 486
3 2020-08-05 11:00:00 2020-08-05 11:30:00 EAT 486
4 2020-08-05 12:00:00 2020-08-05 12:30:00 EAT 486
.. ... ... ... ...
859 2021-03-10 12:30:00 2021-03-10 13:00:00 GO TO BATHROOM 486
861 2021-03-10 13:30:00 2021-03-10 14:00:00 GO TO BATHROOM 486
862 2021-03-10 18:30:00 2021-03-10 19:00:00 COOK 486
864 2021-03-11 08:00:00 2021-03-11 08:30:00 EAT 486
865 2021-03-11 12:30:00 2021-03-11 13:00:00 COOK 486
I want a sum of each unique labelName per user per month, but I'm not sure how to do this.
I would first split the data frame by id, which is easy. But how do you split these start datetimes, recorded every 30 minutes over several years, into months, and then create 14 new columns that record the sums?
The final data frame might look something like this (with fake values):
user  month  SLEEP  ...  WATCH TV
486   jun20  324    ...  23
486   jul20  234    ...  12
The use-case for this data frame is training a few statistical and machine-learning models.
How do I achieve something like this?
Because the data comes in 30-minute records, you can count them with crosstab per month (converting to monthly periods with Series.dt.to_period) and then multiply by 0.5 to get the output in hours:
If start is 2020-09-30 23:30:00 and end is 2020-10-01 00:00:00, use df['end'] in the crosstab if this record should count towards October, or df['start'] if it should count towards September.
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df1 = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName']).mul(0.5)
.rename_axis(columns=None, index=['id','month'])
.rename(columns=str)
.reset_index()
.assign(month=lambda x:x['month'].dt.strftime('%b%Y')))
print (df1)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0.0 0.0 1.0 0.5 1.0
1 650 Mar2021 0.5 1.0 0.5 0.5 0.0
For output as counts of 30-minute slots:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName'])
.rename_axis(columns=None, index=['id','month'])
.reset_index()
.assign(month=lambda x:x['month'].dt.strftime('%b%Y')))
print (df)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0 0 2 1 2
1 650 Mar2021 1 2 1 1 0
Use:
from collections import Counter
(df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName']
   .apply(lambda x: Counter(x))
   .reset_index()
   .pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0))
Output:
Demonstration:
#Preparing Data
string = """start end labelName id
2020-09-21 14:30:00 2020-09-21 15:00:00 WAKE UP 650
2020-09-21 15:00:00 2020-09-21 15:30:00 GO TO BATHROOM 650
2020-09-21 15:30:00 2020-09-21 16:00:00 SLEEP 650
2020-09-29 17:00:00 2020-09-29 17:30:00 WAKE UP 650
2020-09-29 17:30:00 2020-09-29 18:00:00 GO TO BATHROOM 650
2021-03-11 13:00:00 2021-03-11 13:30:00 EAT 650
2021-03-11 14:30:00 2021-03-11 15:00:00 GO TO BATHROOM 650
2021-03-11 15:00:00 2021-03-11 15:30:00 COOK 650
2021-03-11 15:30:00 2021-03-11 16:00:00 EAT 650
2021-03-11 16:00:00 2021-03-11 16:30:00 SLEEP 650"""
data = [x.split(' ') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns = data[0])
df['start'] = pd.to_datetime(df['start'])
#Solution
from collections import Counter
(df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName']
   .apply(lambda x: Counter(x))
   .reset_index()
   .pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0))

use grouper to aggregate by week from the exact day

I would like to aggregate my data by week using pandas Grouper, with each week ending on the same weekday as my last date rather than at the default end of the week.
This is the code I wrote:
fp.groupby(pd.Grouper(key='date',freq='w')).collectionName.nunique().tail(10)
And these are the results:
date
2021-10-03 644
2021-10-10 698
2021-10-17 756
2021-10-24 839
2021-10-31 883
2021-11-07 905
2021-11-14 961
2021-11-21 1028
2021-11-28 990
2021-12-05 726
Freq: W-SUN, Name: collectionName, dtype: int64
The last date I have is 2021-12-02, so I would like that to be the last day of the weekly aggregate, with the bins then going back 7 days at a time to the beginning of the dataset. I need help with this.
Use pd.DataFrame.resample with rule='1w', on='date' and origin='end_day', or anchor a weekly Grouper to the weekday of the last date as demonstrated below. This assumes you can find the last date prior to grouping. See the offset-alias reference (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) and the strftime reference (https://www.programiz.com/python-programming/datetime/strftime).
df = pd.DataFrame({'date': pd.date_range(start='2021-01-01 01:04:16', periods=250), 'val':range(1,251)})
date val
0 2021-01-01 01:04:16 1
1 2021-01-02 01:04:16 2
2 2021-01-03 01:04:16 3
3 2021-01-04 01:04:16 4
4 2021-01-05 01:04:16 5
.. ... ...
245 2021-09-03 01:04:16 246
246 2021-09-04 01:04:16 247
247 2021-09-05 01:04:16 248
248 2021-09-06 01:04:16 249
249 2021-09-07 01:04:16 250
[250 rows x 2 columns]
# locate last date and get day of week in correct format
anchor = df['date'].iat[-1].strftime("%a")
df.groupby(pd.Grouper(key='date',freq='w-' + anchor)).nunique().tail(5)
# week ends on the same day as the original dataset
val
date
2021-08-10 7
2021-08-17 7
2021-08-24 7
2021-08-31 7
2021-09-07 7
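Alternatively, a minimal resample sketch along the lines of the first suggestion (assuming pandas >= 1.3 for origin='end_day', and using the question's fp frame with a datetime 'date' column; '7D' is a fixed frequency, so the origin adjustment takes effect):
# bins are anchored to midnight after the last date and extend back 7 days at a time
weekly = (
    fp.resample('7D', on='date', origin='end_day')['collectionName']
      .nunique()
)
print(weekly.tail(10))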

Is it possible to convert integers to time intervals in a Pandas pivot table?

This is my pivot DataFrame:
Name Tutor Student
Date
2021-04-12 310 112
2021-04-13 394 210
2021-04-14 357 3
2021-04-15 359 0
2021-04-16 392 0
2021-04-17 307 0
2021-04-18 335 0
2021-04-19 0 121
The values under the Tutor and Student columns are integers representing the number of seconds.
Is it possible to convert these values to time intervals like Python's datetime.timedelta?
It is not very clear what output you are looking for.
We can leverage the pd.to_timedelta() method to convert seconds to timedeltas.
Solution
df.iloc[:].apply(pd.to_timedelta, unit='s')
(This assumes you want all columns in df converted to timedelta; if not, use explicit column names, as sketched after the dry run below.)
Dry run on provided input:
Input
Name Tutor Student
Date
2021-04-12 310 112
2021-04-13 394 210
2021-04-14 357 3
2021-04-15 359 0
2021-04-16 392 0
2021-04-17 307 0
2021-04-18 335 0
2021-04-19 0 121
Output
Name Tutor Student
Date
2021-04-12 0 days 00:05:10 0 days 00:01:52
2021-04-13 0 days 00:06:34 0 days 00:03:30
2021-04-14 0 days 00:05:57 0 days 00:00:03
2021-04-15 0 days 00:05:59 0 days 00:00:00
2021-04-16 0 days 00:06:32 0 days 00:00:00
2021-04-17 0 days 00:05:07 0 days 00:00:00
2021-04-18 0 days 00:05:35 0 days 00:00:00
2021-04-19 0 days 00:00:00 0 days 00:02:01
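If only some of the columns should be converted, here is a minimal sketch with explicit column names (the Tutor/Student list below is an assumption based on the question's frame):
import pandas as pd

df = pd.DataFrame({'Tutor': [310, 394], 'Student': [112, 210]},
                  index=pd.to_datetime(['2021-04-12', '2021-04-13']))

# convert only the listed columns to Timedelta; any other columns stay untouched
cols = ['Tutor', 'Student']
df[cols] = df[cols].apply(pd.to_timedelta, unit='s')
print(df)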
try this:
df["Tutor"] = pd.to_datetime(df["Tutor"], unit='s').dt.time
df["Student"] = pd.to_datetime(df["Student"], unit='s').dt.time
Result:
Name Tutor Student
1 2021-04-12 00:05:10 00:01:52
2 2021-04-13 00:06:34 00:03:30
3 2021-04-14 00:05:57 00:00:03
4 2021-04-15 00:05:59 00:00:00
5 2021-04-16 00:06:32 00:00:00
6 2021-04-17 00:05:07 00:00:00
7 2021-04-18 00:05:35 00:00:00
8 2021-04-19 00:00:00 00:02:01
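One design note, as a small sketch: .dt.time yields plain datetime.time objects, which cannot be added or averaged later, whereas pd.to_timedelta values keep their arithmetic, so the timedelta route is preferable if the durations still need to be summed:
s = pd.Series([310, 394])
pd.to_timedelta(s, unit='s').sum()           # Timedelta('0 days 00:11:44')
times = pd.to_datetime(s, unit='s').dt.time  # datetime.time objects
# times.sum() would raise a TypeError: time objects do not support addition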

Grouping daily data by month in python/pandas while firstly grouping by user id

I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users, and for this purpose:
I would like to group the rows by month for each user (there are thousands of them), summing whole_cost over the entire month. For example, if user_id=1 has a whole_cost of 1790 on 02/10/2012 (with cost1 12) and a whole_cost of 364 on 07/10/2012, then the new table should have a single entry of 2154 (as the whole_cost) on 31/10/2012; all dates in the transformed table will be month-end dates representing the whole month to which they relate.
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results:
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': 'sum'})
or, to group by day of week instead:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': 'sum'})
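For a current pandas version, here is a minimal end-to-end sketch (this assumes the question's expenses.csv with dd/mm/yyyy dates, parsed with dayfirst=True and kept as a regular column so that Grouper(key='date') can see it):
import pandas as pd

newnames = ['date', 'user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names=newnames,
                 parse_dates=['date'], dayfirst=True)

# freq='M' labels each group with the month-end date, e.g. 2012-10-31
monthly = (
    df.groupby(['user_id', pd.Grouper(key='date', freq='M')])['whole_cost']
      .sum()
      .reset_index()
)
print(monthly)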
