I have a pandas DataFrame containing panel data, i.e. data for multiple customers over a timeframe. I want to sample (for bootstrapping) a continuous three-month period (I always want full months) of a random customer, 90 times.
I have googled a bit and found several sampling techniques, but none that samples based on three continuous months.
I was considering just making a list of all the month names and sampling three consecutive ones (although I'm not sure how to do consecutive). But how would I then be able to e.g. pick Nov21-Dec21-Jan22?
Would appreciate the help a lot!
import pandas as pd

# two years of daily data with a constant value
date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value": 3}, index=date_range)

# draw 5 random rows from each quarter (1-4)
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selected 5 sample values from each quarter group.
From here you can format the date column (the index) so that it displays the month as text.
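The answer above samples within quarters. If you instead need three consecutive calendar months of one random customer, where the window may cross a year boundary (e.g. Nov21-Dec21-Jan22), monthly Periods make the "consecutive" part easy, since adding 1 to a Period rolls over year ends. A minimal sketch, assuming your panel has a datetime column named date and a customer column named customer (both names are placeholders):

import random
import pandas as pd

def sample_three_months(df, n_samples=90, seed=0):
    # each sample: all rows of one random customer within three consecutive calendar months
    random.seed(seed)
    months = df['date'].dt.to_period('M')          # calendar month of each row
    start_months = sorted(months.unique())[:-2]    # leave room for two following months
    customers = list(df['customer'].unique())
    samples = []
    for _ in range(n_samples):
        customer = random.choice(customers)
        start = random.choice(start_months)
        window = [start, start + 1, start + 2]     # Period arithmetic crosses year ends
        samples.append(df[(df['customer'] == customer) & months.isin(window)])
    return samples

Note that a drawn customer may have no rows in the drawn window, in which case that sample is empty; depending on your bootstrap you may want to redraw in that case.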
My initial dataframe looks as follows:
User  License  Status  Start-Date  End-Date
A     xy       access  10.01.2022  13.01.2022
B     xy       access  11.01.2022  14.01.2022
C     xy       access  11.01.2022  14.01.2022
A     xy       access  12.01.2022  15.01.2022
A     xy       access  14.01.2022  17.01.2022
B     xy       access  21.01.2022  24.01.2022
A     xy       access  21.01.2022  24.01.2022
There are three users (A, B, C) who request a license on different days. In principle, the end date is always three days after the start date, because a license is locked for a period of 3 days.
For example, if user A accesses again within these three days, the period is extended by another three days.
My (ultimate) goal is to get a graph like the following, so I can see how many licenses were blocked per day.
But this is just my goal in a later step. I thought the best way to achieve an output like this would be to build the following table (dataframe) in a next step and just plot Sum over Date:
Date        A  B  C  Sum
10.01.2022  1  0  0  1
11.01.2022  1  1  1  3
12.01.2022  1  1  1  3
13.01.2022  1  1  1  3
14.01.2022  1  1  1  3
15.01.2022  1  0  0  1
16.01.2022  1  0  0  1
17.01.2022  1  0  0  1
18.01.2022  0  0  0  0
19.01.2022  0  0  0  0
20.01.2022  0  0  0  0
21.01.2022  1  1  0  2
22.01.2022  1  1  0  2
23.01.2022  1  1  0  2
24.01.2022  1  1  0  2
(But I am not sure if this is the best way to achieve it.)
Would that be possible with pandas? If yes, how? To be honest, I have no clue. I hope I didn't explain the question in too complicated a way.
So I'm only concerned with the second dataframe and how to get it, not the graph itself.
From the graph it seems like you only care about the total number of active licenses per day, so I'm providing an answer in that context. If you need the breakdown at the user level, it has to be changed a bit.
First, let's import the packages and create a sample dataframe. I've added one extra row for User A at the end to solidify some of the core concepts. Also, I'm explicitly converting the start and end date columns from strings to dates.
import pandas as pd
import datetime
data_dict = {
'User': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'A'],
'License': ['xy', 'xy', 'xy', 'xy', 'xy', 'xy', 'xy', 'xy'],
'Status': ['access', 'access', 'access', 'access', 'access', 'access', 'access', 'access'],
'Start-Date': ['10.01.2022', '11.01.2022', '11.01.2022', '12.01.2022', '14.01.2022', '21.01.2022', '21.01.2022', '16.01.2022'],
'End-Date': ['13.01.2022', '14.01.2022', '14.01.2022', '15.01.2022', '17.01.2022', '24.01.2022', '24.01.2022', '20.01.2022']
}
pdf = pd.DataFrame.from_dict(data=data_dict)
pdf['Start-Date'] = pd.to_datetime(pdf['Start-Date'], format='%d.%m.%Y')
pdf['End-Date'] = pd.to_datetime(pdf['End-Date'], format='%d.%m.%Y')
The dataframe will be like the one below.
User License Status Start-Date End-Date
A xy access 2022-01-10 2022-01-13
B xy access 2022-01-11 2022-01-14
C xy access 2022-01-11 2022-01-14
A xy access 2022-01-12 2022-01-15
A xy access 2022-01-14 2022-01-17
B xy access 2022-01-21 2022-01-24
A xy access 2022-01-21 2022-01-24
A xy access 2022-01-16 2022-01-20
The tricky part here is to merge the overlapping intervals for each user into a bigger interval. I've borrowed the core ideas from this SO post. The idea is to first group your data per user (and the other additional columns, except the dates). Within each group, we then further split the data into subgroups, so that the overlapping intervals per user belong to the same subgroup. Once we have that, all we need is to extract the min start-date and max end-date per subgroup and assign those as the new start and end dates for all entries in that subgroup. In the end, we drop the duplicates because they're redundant.
The next part is pretty straightforward. We create a new dataframe with only the dates ranging between the global start and end dates. Then we join these two dataframes and sum up all the licenses per day.
def f(df_grouped):
    df_grouped = df_grouped.sort_values(by='Start-Date').reset_index(drop=True)
    # a new subgroup starts whenever an interval begins after the previous one ended
    df_grouped["group"] = (df_grouped["Start-Date"] > df_grouped["End-Date"].shift()).cumsum()
    grp = df_grouped.groupby("group")
    df_grouped['New-Start-Date'] = grp['Start-Date'].transform('min')
    df_grouped['New-End-Date'] = grp['End-Date'].transform('max')
    return df_grouped.drop("group", axis=1)

pdf2 = pdf.groupby(['User', 'License', 'Status']).apply(f).drop(['Start-Date', 'End-Date'], axis=1).drop_duplicates()
pdf2 contains the refined start and end dates per user, as we can see below.
                          User License  Status New-Start-Date New-End-Date
User License Status
A    xy      access 0        A      xy  access     2022-01-10   2022-01-20
                    4        A      xy  access     2022-01-21   2022-01-24
B    xy      access 0        B      xy  access     2022-01-11   2022-01-14
                    1        B      xy  access     2022-01-21   2022-01-24
C    xy      access 0        C      xy  access     2022-01-11   2022-01-14
Let's create the new dataframe now, with date ranges.
num_days = (pdf['End-Date'].max() - pdf['Start-Date'].min()).days + 1
# num_days = 15
min_date = pdf['Start-Date'].min()
pdf3 = pd.DataFrame.from_dict(data={'Dates': [min_date + datetime.timedelta(days=n) for n in range(num_days)]})
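As an aside, pd.date_range builds the same daily range in one line (both endpoints inclusive), so the list comprehension above could be replaced by:

pdf3 = pd.DataFrame({'Dates': pd.date_range(pdf['Start-Date'].min(), pdf['End-Date'].max(), freq='D')})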
pdf3 looks as follows:
Dates
2022-01-10
2022-01-11
2022-01-12
2022-01-13
2022-01-14
2022-01-15
2022-01-16
2022-01-17
2022-01-18
2022-01-19
2022-01-20
2022-01-21
2022-01-22
2022-01-23
2022-01-24
Let's now cross-join these two dataframes and mark, for each row, whether the license was actually active on that date:
pdf4 = pd.merge(pdf2, pdf3, how='cross')
def final_data_prep(start, end, date):
    # 1 if the license interval [start, end] covers this date, else 0
    if start <= date <= end:
        return 1
    else:
        return 0

pdf4['Num-License'] = pdf4[['New-Start-Date', 'New-End-Date', 'Dates']].apply(
    lambda x: final_data_prep(x['New-Start-Date'], x['New-End-Date'], x['Dates']), axis=1)
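As a side note, the same flag can be computed without a row-wise apply via Series.between, which is vectorized and usually much faster (between is inclusive on both ends by default):

pdf4['Num-License'] = pdf4['Dates'].between(pdf4['New-Start-Date'], pdf4['New-End-Date']).astype(int)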
pdf4 looks as follows:
User License Status New-Start-Date New-End-Date Dates Num-License
0 A xy access 2022-01-10 2022-01-20 2022-01-10 1
1 A xy access 2022-01-10 2022-01-20 2022-01-11 1
2 A xy access 2022-01-10 2022-01-20 2022-01-12 1
3 A xy access 2022-01-10 2022-01-20 2022-01-13 1
4 A xy access 2022-01-10 2022-01-20 2022-01-14 1
... ... ... ... ... ... ... ...
70 C xy access 2022-01-11 2022-01-14 2022-01-20 0
71 C xy access 2022-01-11 2022-01-14 2022-01-21 0
72 C xy access 2022-01-11 2022-01-14 2022-01-22 0
73 C xy access 2022-01-11 2022-01-14 2022-01-23 0
74 C xy access 2022-01-11 2022-01-14 2022-01-24 0
Now, all we need is a groupby per day, and voila!

pdf_final = (
    pdf4[['Dates', 'User', 'License', 'Status', 'Num-License']]
    .groupby('Dates')['Num-License']
    .sum()
    .reset_index()
    .sort_values('Dates')
    .reset_index(drop=True)
)
# here's the final dataframe
Dates Num-License
0 2022-01-10 1
1 2022-01-11 3
2 2022-01-12 3
3 2022-01-13 3
4 2022-01-14 3
5 2022-01-15 1
6 2022-01-16 1
7 2022-01-17 1
8 2022-01-18 1
9 2022-01-19 1
10 2022-01-20 1
11 2022-01-21 2
12 2022-01-22 2
13 2022-01-23 2
14 2022-01-24 2
If you need to retain the user-level info as well, then just add the user column to the last groupby (sketched below). Also, to replicate your results exactly, remove the last row when creating the sample dataframe.
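For completeness, here's a minimal sketch of that user-level variant, reusing pdf4 from above; it reproduces the wide Date/A/B/C/Sum table from the question:

# per-user daily counts, pivoted so each user becomes a column
per_user = pdf4.groupby(['Dates', 'User'])['Num-License'].sum().unstack(fill_value=0)
per_user['Sum'] = per_user.sum(axis=1)   # total active licenses per day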
I'm doing an evaluation of how many stores report back and in how much time (same day (0), 1 day (1), etc.), but when I calculate the percentage of the total, all same-day stores return 0% of the total.
I tried converting the column to object, float and int, but with the same result.
DF['T_days'] = (DF['day included in the server'] - DF['day of sale']).dt.days
creates my T_days column and fills it with the difference in days based on the two datetime columns. This works fine. And
DF['Percentage'] = (DF['T_days'] / DF['T_days'].sum()) * 100
returns this table. I know what I should do, but not how to do it.
COD_store  date in server  Date bought  T_days  Percentage
1          2021-12-03      2021-12-02   1       0.013746
1          2021-12-03      2021-12-02   1       0.013746
922        2022-01-27      2022-01-10   17      0.233677
922        2022-01-27      2022-01-10   17      0.233677
...        ...             ...          ...     ...
65         2022-01-12      2022-01-12   0       0.0
new DF after groupby:
T_DIAS
0 0.000000
1 1.374570
2 0.192440
3 15.793814
7 0.384880
17 82.254296
Name: Percentage, dtype: float64
I know I should divide the count of each day value by the total number of rows in DF and then group by days, but my searches on how to do this turned up nothing. BTW: I already have a separate DF for those days and percentages.
Expected table:
T_days  Percentage
0       50
2       30
3       10
4       3
5       7
DF['T_days'].value_counts(normalize=True) * 100
worked. Afterwards I turned it from a Series into a DataFrame to make it easier to use.
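For reference, a short way to turn that Series into the expected two-column DataFrame (assuming DF is the frame from the question):

pct = DF['T_days'].value_counts(normalize=True) * 100
result = pct.rename('Percentage').rename_axis('T_days').reset_index().sort_values('T_days')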
I would like to get the most recurring amount, along with its description, from the dataframe below. The dataframe is longer than what I've displayed here.
dataframe
description amount type cosine_group
1 295302|service fee 295302|microloa 1500.0 D 24
2 1292092|rpmt microloan|71302 20000.0 D 31
3 qr q10065116702 fund trf 0002080661 30000.0 D 12
4 qr q10060597280 fund trf 0002080661 30000.0 D 12
5 1246175|service fee 1246175|microlo 3000.0 D 24
6 qr q10034118487 fund trf 0002080661 2000.0 D 12
Here I tried using the groupby function
df.groupby(['cosine_group'])['amount'].value_counts()[:2]
the above code returns
cosine_group amount
12 30000.0 7
30000.0 6
I need the description alongside the most recurring amount.
Expected output is :
description amount
qr q10065116702 fund trf 0002080661 30000.0
qr q10060597280 fund trf 0002080661 30000.0
You can use mode:
description amount type
0 A 15
1 B 2000
2 C 3000
3 C 3000
4 C 3000
5 D 30
6 E 20
7 A 15
df[df['amount type'].eq(df['amount type'].mode().loc[0])]
description amount type
2 C 3000
3 C 3000
4 C 3000
Explanation:
df[mask] # slices the dataframe based on a boolean series (selects the True rows), which is called a mask
df['amount type'].eq(3000) # .eq() stands for "equal"; it is a synonym for == in pandas
df['amount type'].mode() # the mode of the values, defined as the most common value(s)
df['amount type'].mode().loc[0] # takes the entry at index 0, to get a scalar instead of a Series
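Applied to the question's own dataframe (columns description and amount), the same pattern gives the expected output; a minimal sketch:

most_common = df['amount'].mode().loc[0]                       # most recurring amount overall
df.loc[df['amount'].eq(most_common), ['description', 'amount']]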
I have a Pandas DataFrame whose index is a DatetimeIndex (hourly stepped); the columns are the names of the rooms and each cell is a set().
room_a room_b ... room_az
2017-01-01 12:00 {} {} ... {}
2017-01-01 13:00 {} {} ... {}
2017-01-01 14:00 {} {} ... {}
...
2019-12-12 23:00 {} {} ... {}
I have to put the right persons in the right rooms for the time spans they occupied them. The data comes from another DataFrame that looks like:
index person_id room beg_effective_dt_tm end_effective_dt_tm
1 55 room_a 2017-01-01 15:45:33 2017-01-15 10:33:54
2 55 room_a 2017-01-25 09:15:55 2017-02-15 15:33:42
3 10 room_a 2017-01-05 12:10:33 2017-02-10 09:33:25
4 10 room_b 2017-02-10 09:34:15 2017-03-25 10:14:15
...
15000 55 room_z 2019-05-10 12:15:45 2019-05-10 15:33:25
15001 60 room_x 2019-06-02 15:10:33 2019-08-10 10:33:42
...
n
So I tried
for _, row in enumerate(df_origin.itertuples(), 1):
    interval_start = hour_rounder(row.beg_effective_dt_tm)
    interval_finish = hour_rounder(row.end_effective_dt_tm)
    df_sets.update(
        df_sets.loc[
            interval_start:interval_finish, row.room
        ].apply(lambda s: s.add(row.person_id))
    )
But, for each row, this code updates the whole respective column (room), ignoring the time span.
Without the apply, the series is selected correctly, but the apply is setting the whole respective column.
What am I missing here?
How can I implement this idea?
Thanks in advance.