Using one dataframe output to find matching rows in another dataframe - python
I would like to use some daily data in one dataframe as a qualifier to run some code on another dataframe. Both dataframes contain ['Date', 'Time', 'Ticker', 'Open', 'High', 'Low', 'Close']. One dataframe has only daily bars; the other contains 5-minute bars with the same fields. Here are some examples.
print(df)
Date Time Ticker Open High Low Close
0 01/02/18 3:00 PM ES 2687.00 2696.00 2681.75 2695.75
1 01/03/18 3:00 PM ES 2697.25 2714.25 2697.00 2712.50
2 01/04/18 3:00 PM ES 2719.25 2729.00 2718.25 2724.00
3 01/05/18 3:00 PM ES 2732.25 2743.00 2726.50 2741.25
4 01/08/18 3:00 PM ES 2740.25 2748.50 2737.00 2746.50
5 01/09/18 3:00 PM ES 2751.00 2760.00 2748.00 2753.00
6 01/10/18 3:00 PM ES 2744.00 2751.75 2736.50 2748.75
7 01/11/18 3:00 PM ES 2754.25 2768.50 2752.75 2768.00
8 01/12/18 3:00 PM ES 2771.25 2788.75 2770.00 2786.50
9 01/15/18 3:00 PM ES 2793.75 2796.00 2792.50 2794.50
print(df_tick)
Date Time Ticker Open High Low Close
0 01/02/18 8:45 AM ES 2687.00 2687.25 2681.75 2685.75
1 01/02/18 9:00 AM ES 2686.00 2687.75 2683.50 2687.50
2 01/02/18 9:15 AM ES 2687.50 2690.50 2687.25 2689.25
3 01/02/18 9:30 AM ES 2689.50 2692.00 2689.25 2692.00
4 01/02/18 9:45 AM ES 2692.00 2692.25 2687.25 2690.00
5 01/02/18 10:00 AM ES 2690.00 2691.00 2689.75 2690.75
6 01/02/18 10:15 AM ES 2690.50 2691.25 2690.25 2691.00
7 01/02/18 10:30 AM ES 2691.00 2692.00 2689.00 2689.50
8 01/02/18 10:45 AM ES 2689.50 2689.75 2687.75 2688.25
9 01/02/18 11:00 AM ES 2688.25 2689.50 2687.75 2689.25
10 01/02/18 11:15 AM ES 2689.25 2690.75 2689.25 2690.00
11 01/02/18 11:30 AM ES 2690.00 2690.75 2689.25 2690.00
12 01/02/18 11:45 AM ES 2690.25 2690.50 2688.50 2688.75
13 01/02/18 12:00 PM ES 2689.00 2689.25 2688.50 2689.25
14 01/02/18 12:15 PM ES 2689.25 2691.00 2689.00 2690.50
15 01/02/18 12:30 PM ES 2690.75 2691.00 2689.75 2690.50
16 01/02/18 12:45 PM ES 2690.75 2691.25 2690.25 2691.00
17 01/02/18 1:00 PM ES 2691.25 2691.25 2689.50 2690.75
18 01/02/18 1:15 PM ES 2690.50 2691.50 2690.25 2690.50
19 01/02/18 1:30 PM ES 2690.50 2691.00 2689.75 2690.75
20 01/02/18 1:45 PM ES 2690.75 2691.50 2690.25 2690.75
21 01/02/18 2:00 PM ES 2690.75 2691.25 2690.75 2691.00
22 01/02/18 2:15 PM ES 2691.25 2691.75 2690.50 2691.50
23 01/02/18 2:30 PM ES 2691.50 2693.00 2691.50 2692.75
24 01/02/18 2:45 PM ES 2693.00 2693.75 2691.00 2693.75
25 01/02/18 3:00 PM ES 2693.75 2696.00 2693.25 2695.75
26 01/03/18 8:45 AM ES 2697.25 2702.25 2697.00 2700.75
27 01/03/18 9:00 AM ES 2701.00 2703.75 2700.50 2703.25
28 01/03/18 9:15 AM ES 2703.25 2706.00 2703.00 2705.00
29 01/03/18 9:30 AM ES 2705.00 2707.25 2704.00 2706.50
Code for calculating the gap percentage:

# Calculating the gap percentage
df['Gap %'] = (df['Open'].sub(df['Close'].shift())
                         .div(df['Close'] - 1)
                         .fillna(0)) * 100
I have the code on df to find the percentage gap from the previous Close to the Open, and I would like to use this information as a qualifier to run some code on df_tick.
For example, if df['Gap %'] > .2, then I want to use that date in df_tick and ignore (or drop) the rest of the information.
# Drop rows not meeting the gap percentage threshold
df.drop(df[df['Gap %'] < .2].index, inplace=True)
print(df)
Date Time Ticker Open High Low Close Gap Gap %
2 01/04/18 3:00 PM ES 2719.25 2729.0 2718.25 2724.00 6.75 0.247888
3 01/05/18 3:00 PM ES 2732.25 2743.0 2726.50 2741.25 8.25 0.301067
9 01/15/18 3:00 PM ES 2793.75 2796.0 2792.50 2794.50 7.25 0.259531
Now I'd like to use df['Date'] to find the matching dates in df_tick['Date'] for some code I've already written. I tried to simply drop all the rows where the dates don't match, but received an error.
# Drop rows in df_tick whose dates don't match a date in df
df_tick.drop(df_tick[df_tick['Date'] != df['Date']].index, inplace=True)
ValueError: Can only compare identically-labeled Series objects
The element-wise comparison df_tick['Date'] != df['Date'] raises that ValueError because the two Series have different indexes (and different lengths), so pandas cannot align them. You may be able to reset the index of both dataframes and get away with what you are trying to do, but I would try this:
df_tick = df_tick[df_tick.Date.isin(df.Date.unique())]
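This works because isin only tests membership in a set of values, so no index alignment between the two Series is needed. For context, here is a minimal end-to-end sketch of the whole pipeline, assuming the column names above. The file names are placeholders, and the denominator here uses the prior close (the usual gap definition), whereas the question's snippet divides by df['Close'] - 1:

import pandas as pd

# Hypothetical file names for illustration
df = pd.read_csv('daily.csv')
df_tick = pd.read_csv('intraday.csv')

# Gap % of today's open relative to the prior daily close
df['Gap %'] = (df['Open'].sub(df['Close'].shift())
                         .div(df['Close'].shift())
                         .fillna(0)) * 100

# Keep only the daily rows that clear the threshold
qualified = df[df['Gap %'] >= 0.2]

# Keep only the intraday rows whose Date falls on a qualified day
df_tick = df_tick[df_tick['Date'].isin(qualified['Date'].unique())]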
Related
How to get cumulative sum between two different dates under conditions
I would like to get a cumulative sum of Tran_amt for each Cust_ID within 24 hours of the first transaction. Please see the example below for illustration.

Original Data

DateTime              Tran_amt  Cust_ID
1/1/2021 2:00:00 PM   1000      c103102
1/1/2021 3:00:00 PM   2000      c103102
1/2/2021 10:00:00 AM  2000      c103102
1/2/2021 11:00:00 AM  1000      c211203
1/2/2021 12:00:00 PM  1000      c103102
1/2/2021 5:00:00 PM   2000      c103102
1/3/2021 3:00:00 AM   1000      c211203

Expected Output Data

DateTime              Tran_amt  Cust_ID  First Transaction DateTime  Cumulative_amt  Remark
1/1/2021 2:00:00 PM   1000      c103102  1/1/2021 2:00:00 PM         1000
1/1/2021 3:00:00 PM   2000      c103102  1/1/2021 2:00:00 PM         3000
1/2/2021 10:00:00 AM  2000      c103102  1/1/2021 2:00:00 PM         5000
1/2/2021 11:00:00 AM  1000      c211203  1/2/2021 1:00:00 PM         1000
1/2/2021 12:00:00 PM  1000      c103102  1/1/2021 2:00:00 PM         6000
1/2/2021 5:00:00 PM   2000      c103102  1/2/2021 5:00:00 PM         2000            The transaction datetime exceeds 24 hours from the previous first transaction datetime, so the cumulative_amt is reset
1/3/2021 3:00:00 AM   1000      c211203  1/2/2021 1:00:00 PM         2000

Hope someone can help me with the above question. Thanks a lot.
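No answer is attached to this related question here; below is a minimal sketch of one approach, assuming the column names above. It walks each customer's transactions in time order and starts a new 24-hour window whenever a transaction falls more than 24 hours after the current window's first transaction. (For c211203 it yields 11:00 AM as the first-transaction time where the expected output lists 1:00 PM; that looks like a typo in the question.)

import pandas as pd

# Sample data transcribed from the question
df = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '1/1/2021 2:00 PM', '1/1/2021 3:00 PM', '1/2/2021 10:00 AM',
        '1/2/2021 11:00 AM', '1/2/2021 12:00 PM', '1/2/2021 5:00 PM',
        '1/3/2021 3:00 AM']),
    'Tran_amt': [1000, 2000, 2000, 1000, 1000, 2000, 1000],
    'Cust_ID': ['c103102', 'c103102', 'c103102', 'c211203',
                'c103102', 'c103102', 'c211203'],
})

def window_cumsum(group):
    # Walk the transactions in time order; start a new 24-hour window
    # whenever a transaction falls outside the current one.
    group = group.sort_values('DateTime')
    first, cum, firsts, cums = None, 0, [], []
    for ts, amt in zip(group['DateTime'], group['Tran_amt']):
        if first is None or ts - first > pd.Timedelta(hours=24):
            first, cum = ts, 0
        cum += amt
        firsts.append(first)
        cums.append(cum)
    group['First Transaction DateTime'] = firsts
    group['Cumulative_amt'] = cums
    return group

out = df.groupby('Cust_ID', group_keys=False).apply(window_cumsum)
print(out.sort_values('DateTime'))

A plain loop per group is used because each window's start depends on the previous reset decision, which rules out a fixed-offset rolling window.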
Format response content with python requests module
I have a web page which I can access from my server. The contents of the web page are as below.

xys.server.com - /xys/reports/

[To Parent Directory]
3/4/2021    6:09 AM    <dir>  All_Master
3/4/2021    6:09 AM    <dir>  Hartland
3/4/2021    6:09 AM    <dir>  Hauppauge
3/4/2021    6:09 AM    <dir>  Hazelwood
2/15/2019   7:41 AM    58224  NetBackup Retention and Full Backup Occupancy.xlsx
1/1/2022   11:00 AM    23959  OpsCenter_All_Master_Server_Backup_Report_01_01_2022_10_00_45_259_AM_49.zip
2/1/2022   11:00 AM    18989  OpsCenter_All_Master_Server_Backup_Report_01_02_2022_10_00_04_813_AM_4.zip
3/1/2022   11:00 AM    18969  OpsCenter_All_Master_Server_Backup_Report_01_03_2022_10_00_24_664_AM_17.zip
4/1/2021   10:00 AM    21709  OpsCenter_All_Master_Server_Backup_Report_01_04_2021_10_00_02_266_AM_31.zip
5/1/2021   10:00 AM    27491  OpsCenter_All_Master_Server_Backup_Report_01_05_2021_10_00_27_655_AM_11.zip
6/1/2021   10:00 AM    21260  OpsCenter_All_Master_Server_Backup_Report_01_06_2021_10_00_54_053_AM_19.zip
7/1/2021   10:00 AM    19898  OpsCenter_All_Master_Server_Backup_Report_01_07_2021_10_00_12_544_AM_42.zip
8/1/2021   10:00 AM    22642  OpsCenter_All_Master_Server_Backup_Report_01_08_2021_10_00_28_384_AM_25.zip
9/1/2021   10:00 AM    19426  OpsCenter_All_Master_Server_Backup_Report_01_09_2021_10_00_43_851_AM_70.zip
10/1/2021  10:01 AM    19149  OpsCenter_All_Master_Server_Backup_Report_01_10_2021_10_01_00_422_AM_7.zip
11/1/2021  10:00 AM    19638  OpsCenter_All_Master_Server_Backup_Report_01_11_2021_10_00_15_326_AM_20.zip
12/1/2021  11:00 AM    19375  OpsCenter_All_Master_Server_Backup_Report_01_12_2021_10_00_29_943_AM_13.zip
1/2/2022   11:00 AM    22281  OpsCenter_All_Master_Server_Backup_Report_02_01_2022_10_00_45_803_AM_37.zip
2/2/2022   11:00 AM    19435  OpsCenter_All_Master_Server_Backup_Report_02_02_2022_10_00_05_577_AM_71.zip
3/2/2022   11:00 AM    19380  OpsCenter_All_Master_Server_Backup_Report_02_03_2022_10_00_24_973_AM_90.zip
4/2/2021   10:00 AM    21411  OpsCenter_All_Master_Server_Backup_Report_02_04_2021_10_00_03_069_AM_56.zip

Now I need to get the contents from this page in a structured format. I am using the requests module, but the data is highly unstructured and difficult to parse. The code is as below.

req = requests.get(url)
print(req.content.decode('utf-8'))

Output is like:

<pre>[To Parent Directory]<br><br>
3/4/2021 6:09 AM <dir> All_Master<br>
3/4/2021 6:09 AM <dir> Hartland<br>
3/4/2021 6:09 AM <dir> Hauppauge<br>
3/4/2021 6:09 AM <dir> Hazelwood<br>
2/15/2019 7:41 AM 58224 NetBackup Retention and Full Backup Occupancy.xlsx<br>
1/1/2022 11:00 AM 23959 OpsCenter_All_Master_Server_Backup_Report_01_01_2022_10_00_45_259_AM_49.zip<br>
2/1/2022 11:00 AM 18989 OpsCenter_All_Master_Server_Backup_Report_01_02_2022_10_00_04_813_AM_4.zip<br>
3/1/2022 11:00 AM 18969 OpsCenter_All_Master_Server_Backup_Report_01_03_2022_10_00_24_664_AM_17.zip<br>
4/1/2021 10:00 AM 21709 OpsCenter_All_Master_Server_Backup_Report_01_04_2021_10_00_02_266_AM_31.zip<br>
5/1/2021 10:00 AM 27491 OpsCenter_All_Master_Server_Backup_Report_01_05_2021_10_00_27_655_AM_11.zip<br>
6/1/2021 10:00 AM 21260 OpsCenter_All_Master_Server_Backup_Report_01_06_2021_10_00_54_053_AM_19.zip<br>
7/1/2021 10:00 AM 19898 OpsCenter_All_Master_Server_Backup_Report_01_07_2021_10_00_12_544_AM_42.zip<br>
8/1/2021 10:00 AM 22642 OpsCenter_All_Master_Server_Backup_Report_01_08_2021_10_00_28_384_AM_25.zip<br>
9/1/2021 10:00 AM 19426 OpsCenter_All_Master_Server_Backup_Report_01_09_2021_10_00_43_851_AM_70.zip<br>
10/1/2021 10:01 AM 19149 OpsCenter_All_Master_Server_Backup_Report_01_10_2021_10_01_00_422_AM_7.zip<br>
11/1/2021 10:00 AM 19638 OpsCenter_All_Master_Server_Backup_Report_01_11_2021_10_00_15_326_AM_20.zip<br>
12/1/2021 11:00 AM 19375 OpsCenter_All_Master_Server_Backup_Report_01_12_2021_10_00_29_943_AM_13.zip<br>
1/2/2022 11:00 AM 22281 OpsCenter_All_Master_Server_Backup_Report_02_01_2022_10_00_45_803_AM_37.zip<br>
2/2/2022 11:00 AM 19435 OpsCenter_All_Master_Server_Backup_Report_02_02_2022_10_00_05_577_AM_71.zip<br>
3/2/2022 11:00 AM 19380 OpsCenter_All_Master_Server_Backup_Report_02_03_2022_10_00_24_973_AM_90.zip<br>
4/2/2021 10:00 AM 21411 OpsCenter_All_Master_Server_Backup_Report_02_04_2021_10_00_03_069_AM_56.zip<br>
5/2/2021 10:00 AM 24191 OpsCenter_All_Master_Server_Backup_Report_02_05_2021_10_00_28_556_AM_14.zip<br>
6/2/2021 10:00 AM 21675 OpsCenter_All_Master_Server_Backup_Report_02_06_2021_10_00_54_962_AM_73.zip<br>
7/2/2021 10:00 AM 19954 OpsCenter_All_Master_Server_Backup_Report_02_07_2021_10_00_13_058_AM_31.zip<br>
8/2/2021 10:00 AM 21085 OpsCenter_All_Master_Server_Backup_Report_02_08_2021_10_00_28_778_AM_79.zip<br>
9/2/2021 10:00 AM 19691 OpsCenter_All_Master_Server_Backup_Report_02_09_2021_10_00_44_294_AM_5.zip<br>
10/2/2021 10:01 AM 23477 OpsCenter_All_Master_Server_Backup_Report_02_10_2021_10_01_00_793_AM_9.zip<br>
11/2/2021 10:00 AM 2

This is very unstructured. Kindly suggest a way to make this content more readable so it is easy to parse the data.
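No answer is attached here; below is a minimal sketch of one way to parse it, assuming the listing format shown above (the URL is a placeholder). Each <br>-separated line is either a directory entry or a date/time/size/name row, so a line-by-line regex is enough:

import re
import requests

url = 'http://xys.server.com/xys/reports/'  # placeholder URL
text = requests.get(url).content.decode('utf-8')

# Drop the <pre> wrapper, then split the body on its <br> tags
text = re.sub(r'</?pre>', '', text)
lines = [ln.strip() for ln in re.split(r'<br\s*/?>', text) if ln.strip()]

# Matches "3/4/2021 6:09 AM <dir> All_Master" and
#         "1/1/2022 11:00 AM 23959 some file name.zip"
row = re.compile(r'^(\d+/\d+/\d+)\s+(\d+:\d+\s+[AP]M)\s+(<dir>|\d+)\s+(.+)$')

entries = []
for ln in lines:
    m = row.match(ln)
    if m:
        date, time, size, name = m.groups()
        entries.append({'date': date, 'time': time,
                        'is_dir': size == '<dir>',
                        'size': None if size == '<dir>' else int(size),
                        'name': name})

for entry in entries:
    print(entry)

Lines that don't match the pattern, such as [To Parent Directory], are simply skipped.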
Finding a regex pattern to separate unformatted comma-separated values in a dataframe column in python
Hi, I have to preprocess a column that has comma-separated values, and I can't apply .split(',\s*') because there are places where commas and spaces should not be split on; therefore I'm looking for a regex pattern.

column:

0      12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
1      11 AM to 11 PM
2      11:30 AM to 4:30 PM, 6:30 PM to 11 PM
3      12 Noon to 2 AM
4      12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
                         ...
100    11 AM to 11 PM
101    10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
102    12 Noon to 11 PM
103    8am to 12:30AM (Mon-Sun)
104    11:30 AM to 3 PM, 7 PM to 12 Midnight

What I've tried:

import re
pattern = '([\w+\:*\s*\w*(w{2})*]*\s*to\s*[\w+\:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s* [a-zA-Z]*\s*\)]*)'
timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()

output:

array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
       list(['11 am to 11 pm']),
       list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
       list(['12 noon to 2 am']),
       list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
       list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
       list(['7 am to 10 pm']),
       list(['12 noon to 12 midnight']),
       list(['12 noon to 12 midnight']),
       list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
       list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
       list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
       list(['11:30 am to 1 am']),
       list(['', '12noon to 12midnight (mon-sun)', '']),
       list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
       list(['11 am to 11 pm']),
       list(['12 noon to 10:30 pm']),
       list(['11:30 am to 1 am']),
       list(['12 noon to 12 midnight']),
       list(['12 noon to 11 pm']),
       list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
       list(['11:30 am to 3 pm, 7 pm to 11 pm']),
       list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
       list(['10 am to 5 am']),
       list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
       list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
       list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
       list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
       list(['10 am to 1 am']),
       list(['4:30 pm to 5:30 am']),
       list(['11 am to 12 midnight']),
       list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
       list(['11 am to 12 midnight']),
       list(['', '6am to 12midnight (mon-sun)', '']),
       list(['12 noon to 11 pm']),
       list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
       list(['12 noon to 4 pm, 7 pm to 11 pm']),
       list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
       list(['12 noon to 10:30 pm']),
       list(['', '12noon to 11pm (mon-sun)', '']),
       list(['10 am to 10 pm']),
       list(['10 am to 10 pm']),
       list(['7 am to 1 am']),
       list(['12 noon to 11:30 pm']),
       list(['', '12noon to 11:30pm (mon-sun)', '']),
       list(['12 noon to 11:30 pm']),
       list(['12 noon to 11 pm']),
       list(['6 am to 10:30 pm']),
       list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
       list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
       list(['12 noon to 11 pm']),
       list(['11 am to 11 pm']),
       list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
       list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
       list(['8 am to 11:30 pm']),
       list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
       list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
       list(['11:30 am to 1 am']),
       list(['9 am to 10 pm']),
       list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
       list(['', '5pm to 12midnight (mon-sun)', '']),
       list(['11 am to 11:30 pm']),
       list(['', '11:30am to 11pm (mon-sun)', '']),
       list(['12 noon to 10:30 pm']),
       list(['1 pm to 11 pm']),
       list(['11:30 am to 12 midnight']),
       list(['12 noon to 12 midnight']),
       list(['', '12noon to 12midnight (mon-sun)', '']),
       list(['', '12noon to 11pm (mon-sun)', '']),
       list(['12 noon to 3 pm, 7 pm to 11 pm']),
       list(['12 noon to 3 pm, 7 pm to 11 pm']),
       list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
       list(['4 am to 12 midnight']),
       list(['9 am to 1 am']),
       list(['10:30 am to 11 pm']),
       list(['7 am to 11 pm']),
       list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
       list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
       list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
       list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
       list(['', '11am to 11pm (mon-sun)', '']),
       list(['6 am to 11:30 pm']),
       list(['11:30 am to 5 am']),
       list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
       list(['', '6pm to 2am (mon-sun)', '']), ......)

but what I want is something like this:

[['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'], ...]

I think I have to design a better regex pattern in order to separate these values. Can anyone design a better regex pattern? Thanks in advance :).
Here's my attempt:

import re, pandas

data = pandas.read_excel('C:\\Users\\Administrator\\Desktop\\test.xls')
pattern = '(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
re.findall(pattern, data["myData"].str.cat(sep=", "))

With the call to re.findall() my output was:

['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM',
 '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)',
 '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)',
 '11:30 AM to 3 PM', '7 PM to 12 Midnight']
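If a list of matches per row (rather than one flat list) is wanted, the same pattern can be applied row-wise with Series.str.findall. A small sketch, with the pattern rewritten as raw strings and a hypothetical two-row sample:

import pandas as pd

# Hypothetical sample rows
data = pd.DataFrame({'timings': [
    '12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)',
    '10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fri,Sat)',
]})

pattern = (r'\d{1,2}(?::\d{1,2})? ?\w{2,8} to '
           r'\d{1,2}(?::\d{1,2})? ?\w{2,8}'
           r' ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?')

# One list of matches per row instead of a single flat list
print(data['timings'].str.findall(pattern).tolist())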
Get each shift's values with groupby - python data frame
I am working on a production analysis data set (shift-wise: Day/Night). The day shift is 7 AM-7 PM and the night shift is 7 PM-7 AM. Sometimes a day or night shift is divided into two or more portions (e.g. the 7AM-7PM day shift can be split into 7AM-10AM and 10AM-7PM). If a shift is divided into two or more portions, first check whether the Brand is the same across all the shift partitions.

If YES: set the start time to the beginning of the first partition and the end time to the end of the last partition; for Production, take the total across the partitions; for RPM, take the average across the partitions.

If NO: get the appropriate values for each Brand. (For more understanding, please check the expected output.)

Sample of the raw dataframe:

Start            end              shift  Brand  Production  RPM
7/8/2020 19:00   7/9/2020 7:00    Night  A      10          50
7/9/2020 7:00    7/9/2020 17:07   Day    A      5           50
7/9/2020 17:07   7/9/2020 17:58   Day    A      10          100
7/9/2020 17:58   7/9/2020 19:00   Day    A      5           60
7/9/2020 19:00   7/9/2020 21:30   Night  A      2           10
7/9/2020 21:30   7/9/2020 22:40   Night  B      5           20
7/9/2020 22:40   7/10/2020 7:00   Night  B      5           30
7/10/2020 7:00   7/10/2020 18:27  Day    C      15          20
7/10/2020 18:27  7/10/2020 19:00  Day    C      5           40

Expected output:

Start            end              shift  Brand  Production  RPM
7/8/2020 19:00   7/9/2020 7:00    Night  A      10          50
7/9/2020 7:00    7/9/2020 19:00   Day    A      20          70
7/9/2020 19:00   7/9/2020 21:30   Night  A      2           10
7/9/2020 21:30   7/10/2020 7:00   Night  B      10          25
7/10/2020 7:00   7/10/2020 19:00  Day    C      20          30

Thanks in advance.
Here's a suggestion: make sure the columns Start and End have datetime values (I've renamed end to End and shift to Shift :)):

df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

Then

df['Day'] = df['Start'].dt.strftime('%Y-%m-%d')
df = (df.groupby(['Day', 'Shift', 'Brand'])
        .agg(Start=pd.NamedAgg(column='Start', aggfunc='min'),
             End=pd.NamedAgg(column='End', aggfunc='max'),
             Production=pd.NamedAgg(column='Production', aggfunc='sum'),
             RPM=pd.NamedAgg(column='RPM', aggfunc='mean'))
        .reset_index()[df.columns]
        .drop('Day', axis='columns'))

gives you

                Start                 End  Shift Brand  Production  RPM
0 2020-07-08 19:00:00 2020-07-09 07:00:00  Night     A          10   50
1 2020-07-09 07:00:00 2020-07-09 19:00:00    Day     A          20   70
2 2020-07-09 19:00:00 2020-07-09 21:30:00  Night     A           2   10
3 2020-07-09 21:30:00 2020-07-10 07:00:00  Night     B          10   25
4 2020-07-10 07:00:00 2020-07-10 19:00:00    Day     C          20   30

which seems to be your desired output (if I'm not mistaken). If you want to transform the columns Start and End back to strings with a format similar to the one you've given above (there's some additional padding):

df['Start'] = df['Start'].dt.strftime('%m/%d/%Y %H:%M')
df['End'] = df['End'].dt.strftime('%m/%d/%Y %H:%M')
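One caveat worth noting (an editorial addition, not part of the original answer): grouping on the raw Start date only works while every night-shift partition starts before midnight, as in the sample data. If a partition could start after midnight, anchoring each row to the day its shift began, e.g. by subtracting the 7 AM shift boundary before extracting the date, keeps the whole night shift in one group:

# Anchor each partition to the calendar day on which its shift began
df['Day'] = (df['Start'] - pd.Timedelta(hours=7)).dt.strftime('%Y-%m-%d')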
How to groupby multiple columns, unstack, and get the percentage of each cell by dividing by the row total in Python?
My question is as follows: I have a data set (~700 MB) which looks like

rpt_period_name_week  period_name_mth  assigned_date_utc  resolved_date_utc  handle_seconds  action  marketplace_id  login  category  currency_code  order_amount_in_usd  day_of_week_NewClmn
2020 Week 01  2020 / 01  1/11/2020 23:58  1/11/2020 23:59  84   Pass  DE  a  MRI  AT  EUR  81.32   Saturday
2020 Week 02  2020 / 01  1/11/2020 23:58  1/11/2020 23:59  37   Pass  DE  b  MRI  AQ  EUR  222.38  Saturday
2020 Week 01  2020 / 01  1/11/2020 23:57  1/11/2020 23:59  123  Pass  DE  a  MRI  DG  EUR  444.77  Saturday
2020 Week 02  2020 / 01  1/11/2020 23:54  1/11/2020 23:59  313  Hold  JP  a  MRI  AQ               Saturday
2020 Week 01  2020 / 01  1/11/2020 23:57  1/11/2020 23:59  112  Pass  FR  b  MRI  DG  EUR  582.53  Saturday
2020 Week 02  2020 / 01  1/11/2020 23:54  1/11/2020 23:58  249  Pass  DE  f  MRI  AT  EUR  443.16  Saturday
2020 Week 03  2020 / 01  1/11/2020 23:58  1/11/2020 23:58  48   Pass  DE  b  MRI  DG  EUR  20.5    Saturday
2020 Week 03  2020 / 01  1/11/2020 23:57  1/11/2020 23:58  40   Pass  IT  a  MRI  AQ  EUR  272.01  Saturday

My desired output is like this: https://i.stack.imgur.com/8oz7G.png

My code is below, but I am unable to get the desired result; my cells are getting divided by the sum of the whole row. I have tried multiple options but in vain:

df = (data_final.groupby(['login', 'category', 'rpt_period_name_week', 'action'])['action']
                .agg(np.count_nonzero)
                .unstack(['rpt_period_name_week', 'action'])
                .apply(lambda x: x.fillna(0)))
df = (df.div(df.sum(1), 0).mul(100).round(2)
        .assign(Total=lambda df: df.sum(axis=1)))
df1 = df.astype(str) + '%'
# print(df1)

Please help.
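No answer is attached here, and the linked screenshot isn't reproduced, so the intent is partly a guess. If the goal is for each cell to be a percentage of its own week's total within the row (rather than of the whole row's grand total), a sketch, assuming data_final from the question:

import pandas as pd

# Counts per login/category, with (week, action) column pairs
counts = (data_final.groupby(['login', 'category',
                              'rpt_period_name_week', 'action'])['action']
                    .count()
                    .unstack(['rpt_period_name_week', 'action'])
                    .fillna(0))

# Per-cell denominator: that row's total within the same week.
# Transposing makes the week a row-index level we can group on.
week_totals = counts.T.groupby(level='rpt_period_name_week').transform('sum').T

pct = counts.div(week_totals).mul(100).round(2)
pct_str = pct.astype(str) + '%'

Dividing by df.sum(1), as in the question's code, uses the grand total of the entire row as the denominator, which is why every week's cells come out smaller than expected.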