Pandas returning 0 string as 0% - python

I'm doing an evaluation of how many stores report back and how quickly (same day (0), one day later (1), etc.), but when I calculate the percentage of the total, all same-day stores come out as 0% of the total.
I tried casting the column to object, float and int, but got the same result.
DF['T_days'] = (DF['day included in the server'] - DF['day of sale']).dt.days
creates my T_days column and fills it with the difference in days between the two datetime columns. This works fine. Then:
DF['Percentage'] = (DF['T_days'] /DF['T_days'].sum()) * 100
returns the table below. I know what I should do, but not how to do it.
COD_store  date in server  Date bought  T_days  Percentage
1          2021-12-03      2021-12-02   1       0.013746
1          2021-12-03      2021-12-02   1       0.013746
922        2022-01-27      2022-01-10   17      0.233677
922        2022-01-27      2022-01-10   17      0.233677
...        ...             ...          ...     ...
65         2022-01-12      2022-01-12   0       0.0
new DF after groupby:
T_DIAS
0 0.000000
1 1.374570
2 0.192440
3 15.793814
7 0.384880
17 82.254296
Name: Percentage, dtype: float64
I know I should divide the count for each day value by the total number of rows in the DF and then group them by days, but my searches on how to do this turned up nothing. BTW: I already have a separate DF for those days and percentages.
Expected table:
T_days  Percentage
0       50
2       30
3       10
4       3
5       7

DF['T_days'].value_counts(normalize=True) * 100
worked. Afterwards I converted the resulting Series to a DataFrame to make it easier to use.
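For reference, a minimal sketch of that conversion, assuming DF from the question (the pct and result names are my own):
import pandas as pd

# percentage of rows for each T_days value, as in the fix above
pct = DF['T_days'].value_counts(normalize=True) * 100

# turn the Series into a two-column DataFrame like the expected table
result = pct.rename_axis('T_days').reset_index(name='Percentage')
print(result)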

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from users first transaction

I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by customer and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I seem to have gotten on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start summing on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date being 7/1/2017 rather than 7/5/2017), then at least summing the profit based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month until you find a more perfect solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
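If the six-month windows really need to start at each user's first transaction date rather than at calendar month starts, one possible sketch (my own, not from the thread) buckets each row by its offset from that first purchase, using 183 days as a rough six-month length; first, bucket and window_start are names I made up:
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 2, 3],
    "Date": ["4/13/2018", "4/26/2018", "10/23/2018",
             "1/1/2017", "7/5/2017", "7/7/2017", "5/4/2019"],
    "Profit": [10.00, 13.27, 15.00, 7.39, 9.99, 10.01, 30.30],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# each customer's first purchase date, broadcast back onto their rows
first = df.groupby("Customer ID")["Date"].transform("min")

# 0 for the first ~6 months after that date, 1 for the next, and so on
bucket = (df["Date"] - first).dt.days // 183

out = (
    df.assign(window_start=first + pd.to_timedelta(bucket * 183, unit="D"))
      .groupby(["Customer ID", "window_start"], as_index=False)["Profit"]
      .sum()
)
print(out)
With this bucketing, customer 1's second window starts on 2018-10-13 and customer 2's on 2017-07-03, close to (but not exactly) the dates in the desired output.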

Python dataframe find closest date for each ID

I have a dataframe like this:
data = {'SalePrice':[10,10,10,20,20,3,3,1,4,8,8],'HandoverDateA':['2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-03-30','2022-03-30'],'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Red','Red','Blue','Blue','Red','Red','Red','Red'],'Totals':[100,100,100,50,50,90,90,70,60,70,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2022-03-14','2022-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15','2022-03-12'],'Amount':[20,10,14,34,15,60,25,10,10,40,20],'Opened':['2021-12-29','2021-12-29','2021-12-29','2022-12-29','2022-12-29','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29','2021-12-29']}
I need to find the sent date which is closest to the HandoverDate. I've seen plenty of examples that work when you give one date to search but here the date I want to be closest to can change for every ID. I have tried to adapt the following:
def nearest(items, pivot):
    return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))
And also tried to write a loop where I make a dataframe for each ID and use max on the date column then stick them together, but it's incredibly slow!
Thanks for any suggestions :)
IIUC, you can use:
data[['HandoverDateA', 'Sent']] = data[['HandoverDateA', 'Sent']].apply(pd.to_datetime)

out = data.loc[data['HandoverDateA']
               .sub(data['Sent']).abs()
               .groupby(data['ID']).idxmin()]
Output:
SalePrice HandoverDateA ID Tranche Totals Sent Amount Opened
10 8 2022-03-30 Adam Red 70 2022-03-12 20 2021-12-29
5 3 2022-04-30 Ben Blue 90 2022-03-03 60 2021-12-19
7 1 2022-04-30 Eden Red 70 2022-01-04 10 2021-12-29
4 20 2022-04-30 Joseph Red 50 2022-04-22 15 2022-12-29
8 4 2022-04-30 Tim Red 60 2022-01-10 10 2021-12-29
2 10 2022-04-30 Tom Red 100 2022-03-14 14 2021-12-29
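For completeness, a minimal sketch of running that answer end to end, assuming the data dict from the question is first wrapped in a DataFrame (the df and closest_idx names are mine):
import pandas as pd

df = pd.DataFrame(data)  # `data` is the dict from the question
df[['HandoverDateA', 'Sent']] = df[['HandoverDateA', 'Sent']].apply(pd.to_datetime)

# absolute gap between handover and sent, then the row label of the
# smallest gap within each ID
closest_idx = (df['HandoverDateA'] - df['Sent']).abs().groupby(df['ID']).idxmin()
out = df.loc[closest_idx]
print(out)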
Considering that the goal is to find the sent date which is closest to the HandoverDate, one approach would be as follows.
First of all, create the dataframe df from data with pandas.DataFrame
import pandas as pd
df = pd.DataFrame(data)
Then, make sure that the columns HandoverDateA and Sent are of datetime using pandas.to_datetime
df['HandoverDateA'] = pd.to_datetime(df['HandoverDateA'])
df['Sent'] = pd.to_datetime(df['Sent'])
Then, in order to make it more convenient, create a column, diff, to store the absolute value of the difference between the columns HandoverDateA and Sent
df['diff'] = (df['HandoverDateA'] - df['Sent']).dt.days.abs()
With that column, one can simply sort by that column as follows
df = df.sort_values(by=['diff'])
[Out]:
SalePrice HandoverDateA ID ... Amount Opened diff
4 20 2022-04-30 Joseph ... 15 2022-12-29 8
10 8 2022-03-30 Adam ... 20 2021-12-29 18
2 10 2022-04-30 Tom ... 14 2021-12-29 47
5 3 2022-04-30 Ben ... 60 2021-12-19 58
8 4 2022-04-30 Tim ... 10 2021-12-29 110
7 1 2022-04-30 Eden ... 10 2021-12-29 116
and the first row is the one where Sent is closest to HandOverDateA.
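If instead the closest Sent date per ID is wanted (as the question asks), a small variant on the same diff column could be used (a sketch of mine, not part of the original answer):
# one row per ID: the label of the smallest diff within each ID group
closest_per_id = df.loc[df.groupby('ID')['diff'].idxmin()]
print(closest_per_id)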
With the column diff, one option to get the one where diff is minimum is with pandas.DataFrame.query as follows
df = df.query('diff == diff.min()')
[Out]:
SalePrice HandoverDateA ID Tranche ... Sent Amount Opened diff
4 20 2022-04-30 Joseph Red ... 2022-04-22 15 2022-12-29 8

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning python and would like to ask for your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar week. For the mockup data below, the rows stretch over 2 weeks (week 14 (the current week) and week 13 (the past week)). I need to find a way to group rows per calendar week, recognise which year they belong to, and calculate the week sum and week average. In the example input there are only two different IDs; in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
my goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week; not sure if this is possible), create a 'week' column with the week number, then sum the 'activeMembers' values for that week, save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# make sure the date column is of datetime type
df['date'] = pd.to_datetime(df['date'])

(df.groupby(['id', df.date.dt.strftime('%Y%W')], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('activeMembers')
   .reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
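If the week numbers should line up with the ones in the desired output (202014 for 2020-03-30), ISO week numbering via dt.isocalendar() can be used instead of strftime('%Y%W'). A minimal sketch, assuming pandas >= 1.1 and the file paths from the question; like the answer above, the average is taken over the days actually present:
import pandas as pd

df = pd.read_csv('path/to/my/input.csv', parse_dates=['date'])

# ISO year and week (Monday-based): 2020-03-30 -> year 2020, week 14
iso = df['date'].dt.isocalendar()
df['week'] = iso['year'].astype(str) + iso['week'].astype(str).str.zfill(2)

out = (df.groupby(['id', 'week'], as_index=False)['activeMembers']
         .agg(WeeklyActiveMembersSum='sum',
              WeeklyAverageActiveMembers='mean'))
out.to_csv('path/to/my/output.csv', index=False)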

Rolling sum based on dates, adding in conditions that actively update values in Pandas Dataframe if met?

I am calculating rolling last 180 day sales totals by ID in Python using Pandas and need to be able to update the last 180 day cumulative sales column if a user hits a certain threshold. For example, if someone reaches $100 spent cumulatively in the last 180 days, their cumulative spend for that day should reflect them reaching that level and effectively "redeeming" that $100, leaving them only with the excess from the last visit as progress towards their next $100 hit. (See the example below)
I also need to create a separate data frame during this process containing only the dates & user_ids for when the $100 is met to keep track of how many times the threshold has been met across all users.
I was thinking somehow I could use apply with conditional statements, but was not sure exactly how it would work as the data frame needs to be updated on the fly to have the rolling sums for later dates be calculated taking into account this updated total. In other words, the cumulative sums for dates after they hit the threshold need to be adjusted for the fact that they "redeemed" the $100.
This is what I have so far that gets the rolling cumulative sum by user. I don't know if it's possible to chain conditional methods with apply onto this, or what the best way forward is.
order_data['rolling_sales_180'] = order_data.groupby('user_id').rolling(window='180D', on='day')['sales'].sum().reset_index(drop=True)
See the below example of expected results. In row 6, the user reaches $120, crossing the $100 threshold, but the $100 is subtracted from his cumulative sum as of that date and he is left with $20 as of that date because that was the amount in excess of the $100 threshold that he spent on that day. He then continues to earn cumulatively on this $20 for his subsequent visit within 180 days. A user can go through this process many times, earning many rewards over different 180 day periods.
print(order_data)
          day  user_id  sales  rolling_sales_180
0  2017-08-10        1     10                 10
1  2017-08-22        1     10                 20
2  2017-08-31        1     10                 30
3  2017-09-06        1     10                 40
4  2017-09-19        1     10                 50
5  2017-10-16        1     30                 80
6  2017-11-28        1     40                 20
7  2018-01-22        1     10                 30
8  2018-03-19        1     10                 40
9  2018-07-25        1     10                 20
Additionally, as mentioned above, I need a separate data frame to be created throughout this process with the day, user_id, sales, and rolling_sales_180 that only includes all the days during which the $100 threshold was met in order to count the number of times this goal is reached. See below:
print(threshold_reached)
day user_id sales rolling_sales_180
0 2017-11-28 1 40 120
.
.
.
If I understand your question correctly, the following should work for you:
def groupby_rolling(grp_df):
    df = grp_df.set_index("day")
    cum_sales = df.rolling("180D")["sales"].sum()
    hundreds = (cum_sales // 100).astype(int)
    progress = cum_sales % 100
    df["rolling_sales_180"] = cum_sales
    df["progress"] = progress
    df["milestones"] = hundreds
    return df
result = df.groupby("user_id").apply(groupby_rolling)
Output of this is (for your provided sample):
user_id sales rolling_sales_180 progress milestones
user_id day
1 2017-08-10 1 10 10.0 10.0 0
2017-08-22 1 10 20.0 20.0 0
2017-08-31 1 10 30.0 30.0 0
2017-09-06 1 10 40.0 40.0 0
2017-09-19 1 10 50.0 50.0 0
2017-10-16 1 30 80.0 80.0 0
2017-11-28 1 40 120.0 20.0 1
2018-01-22 1 10 130.0 30.0 1
2018-03-19 1 10 90.0 90.0 0
2018-07-25 1 10 20.0 20.0 0
What the groupby(...).apply(...) does is apply the provided function to each group of the original df. Here I've encapsulated your more complex logic, which currently isn't possible with a straightforward groupby-rolling operation, in a simple, easy-to-parse function.
The function should hopefully be self-documenting by how I named variables, but I'd be happy to add comments if you'd like.
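As for the separate threshold_reached frame the question asks for, one possible follow-up sketch (my own, not part of the answer above, and under the answer's rolling-window semantics rather than the "redeem and reset" accounting) keeps only the rows where a user's milestone count goes up:
# flatten the (user_id, day) MultiIndex produced by the groupby-apply,
# keeping 'day' as a column again
flat = result.reset_index(level="day").reset_index(drop=True)

# a threshold is reached on a day where the running milestone count
# increases compared to the user's previous row
increased = flat.groupby("user_id")["milestones"].diff().fillna(flat["milestones"]) > 0
threshold_reached = flat.loc[increased, ["day", "user_id", "sales", "rolling_sales_180"]]
print(threshold_reached)
On the sample data this keeps only the 2017-11-28 row, matching the threshold_reached example in the question.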

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning python (+pandas) and hope I can explain this well enough. I have a large time series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting an average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (mon-fri), and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can also group by the 'Id' column and then use resample followed by .sum() (the old how='sum' argument is deprecated).
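Applied to the toy dataset from the edit, the two-step idea (sum per Id per calendar date, then average per Id, Dow and Hour) reproduces the expected means. A minimal self-contained sketch, using the Dow and Hour values exactly as given in the toy table:
import pandas as pd

# toy rows from the edited question: one row per ticket, Count is always 1
df = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': [1234] * 14,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})

# step 1: total tickets per Id per calendar date (with its Dow and Hour)
per_date = df.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()

# step 2: average those daily totals per Id, per day of week, per hour
mean_per_hour = (per_date.groupby(['Id', 'Dow', 'Hour'], as_index=False)['Count']
                         .mean()
                         .rename(columns={'Count': 'Mean'}))
print(mean_per_hour)  # (0, 9) -> 4.0, (0, 10) -> 1.0, (1, 11) -> 2.5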
