Python dataframe find closest date for each ID

I have a dataframe like this:
data = {'SalePrice':[10,10,10,20,20,3,3,1,4,8,8],'HandoverDateA':['2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-03-30','2022-03-30'],'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Red','Red','Blue','Blue','Red','Red','Red','Red'],'Totals':[100,100,100,50,50,90,90,70,60,70,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2022-03-14','2022-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15','2022-03-12'],'Amount':[20,10,14,34,15,60,25,10,10,40,20],'Opened':['2021-12-29','2021-12-29','2021-12-29','2022-12-29','2022-12-29','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29','2021-12-29']}
I need to find the Sent date which is closest to the HandoverDate. I've seen plenty of examples that work when you give a single date to search for, but here the date I want to be closest to can change for every ID. I have tried to adapt the following:
def nearest(items, pivot):
    return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))
And also tried to write a loop where I make a dataframe for each ID and use max on the date column then stick them together, but it's incredibly slow!
Thanks for any suggestions :)

IIUC, you can use:
data[['HandoverDateA', 'Sent']] = data[['HandoverDateA', 'Sent']].apply(pd.to_datetime)
out = data.loc[data['HandoverDateA']
               .sub(data['Sent']).abs()
               .groupby(data['ID']).idxmin()]
Output:
SalePrice HandoverDateA ID Tranche Totals Sent Amount Opened
10 8 2022-03-30 Adam Red 70 2022-03-12 20 2021-12-29
5 3 2022-04-30 Ben Blue 90 2022-03-03 60 2021-12-19
7 1 2022-04-30 Eden Red 70 2022-01-04 10 2021-12-29
4 20 2022-04-30 Joseph Red 50 2022-04-22 15 2022-12-29
8 4 2022-04-30 Tim Red 60 2022-01-10 10 2021-12-29
2 10 2022-04-30 Tom Red 100 2022-03-14 14 2021-12-29
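Note that data in the question is a plain dict, so presumably it needs to be loaded into a DataFrame first before the snippet above will run, e.g.:
import pandas as pd
data = pd.DataFrame(data)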

Considering that the goal is to find the sent date which is closest to the HandoverDate, one approach would be as follows.
First of all, create the dataframe df from data with pandas.DataFrame
import pandas as pd
df = pd.DataFrame(data)
Then, make sure that the columns HandoverDateA and Sent are of datetime type, using pandas.to_datetime
df['HandoverDateA'] = pd.to_datetime(df['HandoverDateA'])
df['Sent'] = pd.to_datetime(df['Sent'])
Then, in order to make it more convenient, create a column, diff, to store the absolute value of the difference between the columns HandoverDateA and Sent
df['diff'] = (df['HandoverDateA'] - df['Sent']).dt.days.abs()
With that column in place, one can simply sort by it as follows
df = df.sort_values(by=['diff'])
[Out]:
SalePrice HandoverDateA ID ... Amount Opened diff
4 20 2022-04-30 Joseph ... 15 2022-12-29 8
10 8 2022-03-30 Adam ... 20 2021-12-29 18
2 10 2022-04-30 Tom ... 14 2021-12-29 47
5 3 2022-04-30 Ben ... 60 2021-12-19 58
8 4 2022-04-30 Tim ... 10 2021-12-29 110
7 1 2022-04-30 Eden ... 10 2021-12-29 116
and the first row is the one where Sent is closest to HandoverDateA.
With the column diff, one option to get the one where diff is minimum is with pandas.DataFrame.query as follows
df = df.query('diff == diff.min()')
[Out]:
SalePrice HandoverDateA ID Tranche ... Sent Amount Opened diff
4 20 2022-04-30 Joseph Red ... 2022-04-22 15 2022-12-29 8
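Note that this keeps only the single overall closest pair. If, as in the question, the closest Sent is needed for every ID, one possible variant (a sketch reusing the same diff column) would be:
df = df.loc[df.groupby('ID')['diff'].idxmin()]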
Notes:
For more information on sorting dataframes by columns, read my answer here.

Related

Randomly sample from panel data by 3months periods

I have a pandas data frame that is panel data, i.e. data of multiple customers over a timeframe. I want to sample (for bootstrapping) a continuous three-month period (I always want to get full months) of a random customer, 90 times.
I have googled a bit and found several sampling techniques, but none that covers sampling based on three continuous months.
I was considering just making a list of all the month names and sampling three consecutive ones (although I'm not sure how to do consecutive). But how would I then be able to e.g. pick Nov21-Dec21-Jan22?
Would appreciate the help a lot!
import pandas as pd
date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value":3}, index=date_range)
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selected 5 sample values from each quarter group.
From now on you can format the date column (index) so that it shows the month as text.
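For example, one way to show the month as text could be strftime on the DatetimeIndex (a sketch, using the df from above):
sampled = df.groupby(df.index.quarter).sample(5)
sampled.index = sampled.index.strftime('%b %Y')  # e.g. 'Nov 2021'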

How do I convert a dataframe with time-periods into a particular format?

My initial dataframe looks as follows:
User  License  Status  Start-Date  End-Date
A     xy       access  10.01.2022  13.01.2022
B     xy       access  11.01.2022  14.01.2022
C     xy       access  11.01.2022  14.01.2022
A     xy       access  12.01.2022  15.01.2022
A     xy       access  14.01.2022  17.01.2022
B     xy       access  21.01.2022  24.01.2022
A     xy       access  21.01.2022  24.01.2022
There are three users (A, B, C) who request a license on different days. In principle, the end date is always three days after the start date, because a license is locked for a period of 3 days.
For example, if user A accesses again within these three days, the period is extended again by three days.
My (ultimate) goal is to get a graph like the following:
So I can see how many licenses were blocked.
But this is just my goal for a later step. I thought the best way to achieve an output like this would be the following table (dataframe) as a next step, and to just plot Sum over Date:
Date        A  B  C  Sum
10.01.2022  1  0  0  1
11.01.2022  1  1  1  3
12.01.2022  1  1  1  3
13.01.2022  1  1  1  3
14.01.2022  1  1  1  3
15.01.2022  1  0  0  1
16.01.2022  1  0  0  1
17.01.2022  1  0  0  1
18.01.2022  0  0  0  0
19.01.2022  0  0  0  0
20.01.2022  0  0  0  0
21.01.2022  1  1  0  2
22.01.2022  1  1  0  2
23.01.2022  1  1  0  2
24.01.2022  1  1  0  2
(But I am not sure if this is the best way to achieve it.)
Would that be possible with pandas? If yes, how? Tbh I have no clue. I hope I didn't explain the question in too complicated a way.
So I'm only concerned with the second dataframe, how I get it, not the graph itself.
From the graph it seems like you only care about the total number of active licenses per day, so I'm providing an answer in that context. If you need the breakdown at the user level then it has to be changed a bit.
First, let's import the packages and create a sample dataframe. I've added one extra row for User A at the end to solidify some of the core concepts. Also, I'm explicitly changing the start and end date columns from string to date type.
import pandas as pd
import datetime
data_dict = {
    'User': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'A'],
    'License': ['xy', 'xy', 'xy', 'xy', 'xy', 'xy', 'xy', 'xy'],
    'Status': ['access', 'access', 'access', 'access', 'access', 'access', 'access', 'access'],
    'Start-Date': ['10.01.2022', '11.01.2022', '11.01.2022', '12.01.2022', '14.01.2022', '21.01.2022', '21.01.2022', '16.01.2022'],
    'End-Date': ['13.01.2022', '14.01.2022', '14.01.2022', '15.01.2022', '17.01.2022', '24.01.2022', '24.01.2022', '20.01.2022']
}
pdf = pd.DataFrame.from_dict(data=data_dict)
pdf['Start-Date'] = pd.to_datetime(pdf['Start-Date'], format='%d.%m.%Y')
pdf['End-Date'] = pd.to_datetime(pdf['End-Date'], format='%d.%m.%Y')
The dataframe will be like the one below.
User License Status Start-Date End-Date
A xy access 2022-01-10 2022-01-13
B xy access 2022-01-11 2022-01-14
C xy access 2022-01-11 2022-01-14
A xy access 2022-01-12 2022-01-15
A xy access 2022-01-14 2022-01-17
B xy access 2022-01-21 2022-01-24
A xy access 2022-01-21 2022-01-24
A xy access 2022-01-16 2022-01-20
The tricky part here is to merge the overlapping intervals for each user into a bigger interval. I've borrowed the core ideas from this SO post. The idea is to first group the data per user (and the other additional columns except the dates). Then, within each group, we further group the data into subgroups, so that the overlapping intervals per user belong to the same subgroup. Once we have that, all we need is to extract the min start-date and max end-date per subgroup and assign them as the new start and end dates for all entries in that subgroup. In the end, we drop the duplicates because they are redundant.
The next part is pretty straightforward. We create a new dataframe with only the dates that range between the global start and end dates. Then we join the two dataframes and sum up all the licenses per day.
def f(df_grouped):
    df_grouped = df_grouped.sort_values(by='Start-Date').reset_index(drop=True)
    df_grouped["group"] = (df_grouped["Start-Date"] > df_grouped["End-Date"].shift()).cumsum()
    grp = df_grouped.groupby("group")
    df_grouped['New-Start-Date'] = grp['Start-Date'].transform('min')
    df_grouped['New-End-Date'] = grp['End-Date'].transform('max')
    return df_grouped.drop("group", axis=1)
pdf2 = pdf.groupby(['User', 'License', 'Status']).apply(f).drop(['Start-Date', 'End-Date'], axis=1).drop_duplicates()
pdf2 contains the refined start and end dates per user, as we can see below.
User License Status New-Start-Date New-End-Date
User License Status
A xy access 0 A xy access 2022-01-10 2022-01-20
4 A xy access 2022-01-21 2022-01-24
B xy access 0 B xy access 2022-01-11 2022-01-14
1 B xy access 2022-01-21 2022-01-24
C xy access 0 C xy access 2022-01-11 2022-01-14
Let's create the new dataframe now, with date ranges.
num_days = (pdf['End-Date'].max() - pdf['Start-Date'].min()).days + 1
# num_days = 15
min_date = pdf['Start-Date'].min()
pdf3 = pd.DataFrame.from_dict(data={'Dates': [min_date + datetime.timedelta(days=n) for n in range(num_days)]})
pdf3 looks as follows:
Dates
2022-01-10
2022-01-11
2022-01-12
2022-01-13
2022-01-14
2022-01-15
2022-01-16
2022-01-17
2022-01-18
2022-01-19
2022-01-20
2022-01-21
2022-01-22
2022-01-23
2022-01-24
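As a side note, the same range of dates could presumably also be built in one line with pandas.date_range:
pdf3 = pd.DataFrame({'Dates': pd.date_range(pdf['Start-Date'].min(), pdf['End-Date'].max(), freq='D')})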
Let's now join these two dataframes and flag, for each combination, whether the date falls inside the interval:
pdf4 = pd.merge(pdf2, pdf3, how='cross')
def final_data_prep(start, end, date):
    if date >= start and date <= end:
        return 1
    else:
        return 0

pdf4['Num-License'] = pdf4[['New-Start-Date', 'New-End-Date', 'Dates']].apply(
    lambda x: final_data_prep(x[0], x[1], x[2]), axis=1)
pdf4 looks as follows:
User License Status New-Start-Date New-End-Date Dates Num-License
0 A xy access 2022-01-10 2022-01-20 2022-01-10 1
1 A xy access 2022-01-10 2022-01-20 2022-01-11 1
2 A xy access 2022-01-10 2022-01-20 2022-01-12 1
3 A xy access 2022-01-10 2022-01-20 2022-01-13 1
4 A xy access 2022-01-10 2022-01-20 2022-01-14 1
... ... ... ... ... ... ... ...
70 C xy access 2022-01-11 2022-01-14 2022-01-20 0
71 C xy access 2022-01-11 2022-01-14 2022-01-21 0
72 C xy access 2022-01-11 2022-01-14 2022-01-22 0
73 C xy access 2022-01-11 2022-01-14 2022-01-23 0
74 C xy access 2022-01-11 2022-01-14 2022-01-24 0
Now, all we need to do is do a groupby per day, and voila!
pdf_final = (pdf4[['Dates', 'User', 'License', 'Status', 'Num-License']]
             .groupby('Dates')['Num-License'].sum()
             .reset_index()
             .sort_values(['Dates'], ascending=True)
             .reset_index(drop=True))
# here's the final dataframe
Dates Num-License
0 2022-01-10 1
1 2022-01-11 3
2 2022-01-12 3
3 2022-01-13 3
4 2022-01-14 3
5 2022-01-15 1
6 2022-01-16 1
7 2022-01-17 1
8 2022-01-18 1
9 2022-01-19 1
10 2022-01-20 1
11 2022-01-21 2
12 2022-01-22 2
13 2022-01-23 2
14 2022-01-24 2
If you need to retain user level info as well, then just add the user col in the last groupby. Also, to replicate your results, remove the last row when creating the dataframe.
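Since the ultimate goal in the question is a graph, pdf_final can then be plotted directly, for instance (a sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt
# step plot of the number of blocked licenses per day
pdf_final.plot(x='Dates', y='Num-License', drawstyle='steps-post')
plt.show()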

Pandas returning 0 string as 0%

I'm doing an evaluation of how many stores report back and in how much time (same day (0), 1 day (1), etc.), but when I calculate the percentage of the total, all same-day stores return 0% of the total.
I tried turning the column into object, float and int, but with the same result.
DF['T_days'] = (DF['day included in the server'] - DF['day of sale']).dt.days
creates my T_days column and fills it with the difference in days based on the 2 datetime columns. This works fine. And then:
DF['Percentage'] = (DF['T_days'] /DF['T_days'].sum()) * 100
returns this table. I know what I should do, but not how to do it.
COD_store  date in server  Date bought  T_days  Percentage
1          2021-12-03      2021-12-02   1       0.013746
1          2021-12-03      2021-12-02   1       0.013746
922        2022-01-27      2022-01-10   17      0.233677
922        2022-01-27      2022-01-10   17      0.233677
...        ...             ...          ...     ...
65         2022-01-12      2022-01-12   0       0.0
new DF after groupby:
T_DIAS
0 0.000000
1 1.374570
2 0.192440
3 15.793814
7 0.384880
17 82.254296
Name: Percentage, dtype: float64
I know I should divide the resulting day counts by the total number of rows in DF and then group them by days, but my search on how to do this turned up nothing. BTW: I already have a separate DF for those days and percentages.
Expected table:
T_days  Percentage
0       50
2       30
3       10
4       3
5       7
(DF['T_days'].value_counts(normalize=True) * 100)
worked. Afterwards, I turned it from a Series into a DataFrame to make it easier to use.
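For reference, a minimal sketch of the full step (computing the percentages per day value and turning the resulting Series into a DataFrame) might look like:
pct = DF['T_days'].value_counts(normalize=True).sort_index() * 100
pct_df = pct.rename_axis('T_days').reset_index(name='Percentage')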

How to get the most recurring row from a pandas dataframe column

I would like to get the most recurring amount, alongside its description, from the below dataframe. The actual dataframe is longer than what I display here.
dataframe
description amount type cosine_group
1 295302|service fee 295302|microloa 1500.0 D 24
2 1292092|rpmt microloan|71302 20000.0 D 31
3 qr q10065116702 fund trf 0002080661 30000.0 D 12
4 qr q10060597280 fund trf 0002080661 30000.0 D 12
5 1246175|service fee 1246175|microlo 3000.0 D 24
6 qr q10034118487 fund trf 0002080661 2000.0 D 12
Here I tried using the groupby function
df.groupby(['cosine_group'])['amount'].value_counts()[:2]
the above code returns
cosine_group amount
12 30000.0 7
30000.0 6
I need the description along side the most recurring amount
Expected output is :
description amount
qr q10065116702 fund trf 0002080661 30000.0
qr q10060597280 fund trf 0002080661 30000.0
You can use mode:
description amount type
0 A 15
1 B 2000
2 C 3000
3 C 3000
4 C 3000
5 D 30
6 E 20
7 A 15
df[df['amount type'].eq(df['amount type'].mode().loc[0])]
description amount type
2 C 3000
3 C 3000
4 C 3000
Explanation:
df[mask] # will slice the dataframe based on boolean series (select the True rows) which is called a mask
df['amount type'].eq(3000) # .eq() stands for equal, it is a synonym for == in pandas
df['amount type'].mode() # is the mode of multiple values, which is defined as the most common
df['amount type'].mode().loc[0] # returns the mode value at index 0, to get a scalar instead of a Series
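Applied to the dataframe from the question, the same idea would presumably look something like:
most_common = df['amount'].mode().loc[0]  # most recurring amount
df.loc[df['amount'].eq(most_common), ['description', 'amount']]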

Updating Pandas DataFrame Specific Range

I have a Pandas DataFrame whose index is a DatetimeIndex (hourly steps), the columns are the names of the rooms, and each cell is a set().
room_a room_b ... room_az
2017-01-01 12:00 {} {} ... {}
2017-01-01 13:00 {} {} ... {}
2017-01-01 14:00 {} {} ... {}
...
2019-12-12 23:00 {} {} ... {}
I have to put the right persons in the right rooms during the time they occupied them. The data comes from another DataFrame that looks like
index person_id room beg_effective_dt_tm end_effective_dt_tm
1 55 room_a 2017-01-01 15:45:33 2017-01-15 10:33:54
2 55 room_a 2017-01-25 09:15:55 2017-02-15 15:33:42
3 10 room_a 2017-01-05 12:10:33 2017-02-10 09:33:25
4 10 room_b 2017-02-10 09:34:15 2017-03-25 10:14:15
...
15000 55 room_z 2019-05-10 12:15:45 2019-05-10 15:33:25
15001 60 room_x 2019-06-02 15:10:33 2019-08-10 10:33:42
...
n
So I tried
for _, row in enumerate(df_origin.itertuples(), 1):
    interval_start = hour_rounder(row.beg_effective_dt_tm)
    interval_finish = hour_rounder(row.end_effective_dt_tm)
    df_sets.update(
        df_sets.loc[
            interval_start:interval_finish, row.room
        ].apply(lambda s: s.add(row.person_id))
    )
But for each row, this code updates the whole respective column (room), ignoring the time span. Without apply, the series is selected correctly, but the apply sets the whole respective column.
What am I missing here?
How can I implement this idea?
Thanks in advance.
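For what it's worth, one possible sketch of the idea (untested, assuming hour_rounder as defined in the question and that every cell holds its own distinct set object) is to mutate the sets in place instead of going through update:
for row in df_origin.itertuples():
    interval_start = hour_rounder(row.beg_effective_dt_tm)
    interval_finish = hour_rounder(row.end_effective_dt_tm)
    # .loc slicing keeps the time span; each cell's set is mutated in place
    for cell in df_sets.loc[interval_start:interval_finish, row.room]:
        cell.add(row.person_id)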
