Transpose and Compare - python
I'm attempting to compare two DataFrames. Each Item/Summary pair has quantities spread across a set of date columns. I'd like to transpose the dates into one column alongside the associated quantities, and then compare the two DataFrames to see what changed from PreviousData to CurrentData.
Previous Data:
import numpy as np
import pandas as pd

PreviousData = { 'Item' : ['abc','def','ghi','jkl','mno','pqr','stu','vwx','yza','uaza','fupa'],
'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
'2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan,np.nan,2],
'2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
'2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
'2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
'2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
}
PreviousData = pd.DataFrame(PreviousData)
PreviousData
Current Data:
CurrentData = { 'Item' : ['ghi','stu','abc','mno','jkl','pqr','def','vwx','yza'],
'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
'2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
'2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
'2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
'2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
}
CurrentData = pd.DataFrame(CurrentData)
CurrentData
How to transpose and compare these two sets?
One way of doing this is the following. First, melt both DataFrames from wide to long format:
PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"],
var_name="Date",
value_name="value1")
which is
Item Summary Date value1
0 abc party 2022-01-01 1.0
1 def weekend 2022-01-01 NaN
2 ghi food 2022-01-01 NaN
3 jkl school 2022-01-01 1.0
4 mno tv 2022-01-01 NaN
.. ... ... ... ...
226 stu camera 2023-09-01 NaN
227 vwx python 2023-09-01 1.0
228 yza r 2023-09-01 NaN
229 uaza rstudio 2023-09-01 NaN
230 fupa spyder 2023-09-01 NaN
and
CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"],
var_name="Date",
value_name="value2")
Item Summary Date value2
0 ghi food 2022-01-01 3.0
1 stu camera 2022-01-01 NaN
2 abc party 2022-01-01 NaN
3 mno tv 2022-01-01 1.0
4 jkl school 2022-01-01 NaN
.. ... ... ... ...
139 jkl school 2023-04-01 NaN
140 pqr photo 2023-04-01 2.0
141 def weekend 2023-04-01 NaN
142 vwx python 2023-04-01 NaN
143 yza r 2023-04-01 2.0
[144 rows x 4 columns]
Then merge:
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'], how = 'left')
Item Summary Date value1 value2
0 abc party 2022-01-01 1.0 NaN
1 def weekend 2022-01-01 NaN NaN
2 ghi food 2022-01-01 NaN 3.0
3 jkl school 2022-01-01 1.0 NaN
4 mno tv 2022-01-01 NaN 1.0
.. ... ... ... ... ...
226 stu camera 2023-09-01 NaN NaN
227 vwx python 2023-09-01 1.0 NaN
228 yza r 2023-09-01 NaN NaN
229 uaza rstudio 2023-09-01 NaN NaN
230 fupa spyder 2023-09-01 NaN NaN
[231 rows x 5 columns]
and compare by creating a column that marks differences:
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
Item Summary Date value1 value2 diff
0 abc party 2022-01-01 1.0 NaN 1
1 def weekend 2022-01-01 NaN NaN 1
2 ghi food 2022-01-01 NaN 3.0 1
3 jkl school 2022-01-01 1.0 NaN 1
4 mno tv 2022-01-01 NaN 1.0 1
.. ... ... ... ... ... ...
226 stu camera 2023-09-01 NaN NaN 1
227 vwx python 2023-09-01 1.0 NaN 1
228 yza r 2023-09-01 NaN NaN 1
229 uaza rstudio 2023-09-01 NaN NaN 1
230 fupa spyder 2023-09-01 NaN NaN 1
[231 rows x 6 columns]
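Note that every row above ends up with diff = 1: in numpy/pandas, NaN != NaN evaluates to True, so pairs where both sides are missing get flagged as different. If two NaNs should count as "no change", the comparison needs a NaN-aware tweak. A minimal sketch on a toy frame (column names mirror the Compare table above; the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged Compare table above (illustrative values only)
Compare = pd.DataFrame({
    "value1": [1.0, np.nan, np.nan, 1.0],
    "value2": [np.nan, np.nan, 3.0, 1.0],
})

# NaN != NaN is True, so a plain != would flag the all-missing row as a diff.
# Requiring that not BOTH sides are NaN treats NaN/NaN pairs as unchanged.
both_nan = Compare["value1"].isna() & Compare["value2"].isna()
Compare["diff"] = np.where((Compare["value1"] != Compare["value2"]) & ~both_nan, 1, 0)
print(Compare["diff"].tolist())  # [1, 0, 1, 0]
```

With this tweak, only rows where exactly one side is missing, or where both values are present but differ, are marked as changed.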
If you only want to compare entries that are common to both DataFrames, use the default inner merge:
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'])
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
Item Summary Date value1 value2 diff
0 abc party 2022-01-01 1.0 NaN 1
1 def weekend 2022-01-01 NaN NaN 1
2 ghi food 2022-01-01 NaN 3.0 1
3 jkl school 2022-01-01 1.0 NaN 1
4 mno tv 2022-01-01 NaN 1.0 1
.. ... ... ... ... ... ...
139 mno tv 2023-04-01 NaN NaN 1
140 pqr photo 2023-04-01 NaN 2.0 1
141 stu camera 2023-04-01 NaN NaN 1
142 vwx python 2023-04-01 1.0 NaN 1
143 yza r 2023-04-01 NaN 2.0 1
[144 rows x 6 columns]
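If you also need to know which (Item, Summary, Date) rows exist in only one of the two frames, pandas' merge supports an indicator argument that labels each row's origin. A small sketch on hypothetical data (prev/curr are toy stand-ins for the melted frames above):

```python
import pandas as pd

# Toy long-format stand-ins for PreviousData_t / CurrentData_t
prev = pd.DataFrame({"Item": ["abc", "def", "ghi"],
                     "Date": ["2022-01-01"] * 3,
                     "value1": [1.0, 2.0, 3.0]})
curr = pd.DataFrame({"Item": ["abc", "ghi", "xyz"],
                     "Date": ["2022-01-01"] * 3,
                     "value2": [1.0, 4.0, 5.0]})

# indicator=True adds a _merge column valued 'both', 'left_only' or
# 'right_only', separating "value changed" from "row missing on one side".
cmp = prev.merge(curr, on=["Item", "Date"], how="outer", indicator=True)
print(cmp[["Item", "_merge"]])
```

Rows marked 'both' can then be checked for value changes, while 'left_only'/'right_only' rows flag items that were added or dropped between the two snapshots.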
Related
How to replicate index/match (with multiple criteria) in python with multiple dataframes?
I am trying to replicate an Excel model I have in Python to automate it as I scale it up, but I am stuck on how to translate the complex formulas into Python. I have information in three dataframes.

DF1:

    ID type 1  ID type 2  Unit
    a          1_a        400
    b          1_b        26
    c          1_c        23
    d          1_b        45
    e          1_d        24
    f          1_b        85
    g          1_a        98

DF2:

    ID type 1  ID type 2  Tech
    a          1_a        wind
    b          1_b        solar
    c          1_c        gas
    d          1_b        coal
    e          1_d        wind
    f          1_b        gas
    g          1_a        coal

And DF3, the main DF:

    Date        Time      ID type 1  ID type 2  Period  output  Unit *  Tech *
    03/01/2022  02:30:00  a          1_a        1       254
    03/01/2022  02:30:00  b          1_b        1       456
    03/01/2022  02:30:00  c          1_c        2       3325
    03/01/2022  02:30:00  d          1_b        2       1254
    05/01/2022  02:30:00  e          1_d        3       489
    05/01/2022  02:30:00  a          1_a        3       452
    05/01/2022  02:30:00  b          1_b        4       12
    05/01/2022  02:30:00  c          1_c        4       1
    05/01/2022  03:00:00  d          1_b        35      54
    05/01/2022  03:00:00  e          1_d        35      48
    05/01/2022  03:00:00  a          1_a        48      56

I wish to get the information for each ID type in DF3 for "Unit" and "Tech" from DF1 & DF2 into DF3. The conditional statements I have in Excel at the moment are based on INDEX, MATCH and IFNA, as some of the ID types in DF3 will be from either ID type 1 or ID type 2, so the function checks both columns and, on a positive match, yields the required result. For more context, DF1 and DF2 do not change, but DF3 changes, and I need a function for that which I will explain later.
The Excel function I use to fill in Unit * from DF1 is (note I have replaced the Excel sheet name with DF1 to help conceptualize the problem):

    =IFNA(INDEX('DF1'!$K$3:$K$1011,MATCH(N2,'DF1'!$E$3:$E$1011,0)),INDEX('DF1'!$K$3:$K$1011,MATCH(M2,'DF1'!$D$3:$D$1011,0)))

The Excel function I use to fill in Tech * is a bit more straightforward:

    =IFNA(INDEX('DF2'$L:$L,MATCH(O3,'DF2'$K:$K,0)),INDEX('DF2'$L:$L,MATCH(N3,'DF2'$J:$J,0)))

That is the main stumbling block at the moment. After this is achieved, I need a function that for each day produces the following DF:

    ID type 1  Tech   Period 1                                        Period 2  Period 3  Period 4  Period 5  Period 6  Period 7  …
    a          wind   Sum of output for this ID Type 1 and Period 1
    b          solar
    c          gas
    d          coal
    e          wind
    a          gas
    …

The idea here is that I can use a conditional function again to sum the "output" column of DF3 under the condition of date, ID type and period number.

EDIT: Output based on possible solution:

               time settlementDate BM Unit ID 1 BM Unit ID 2  settlementPeriod  \
    0      00:00:00     03/01/2022      RCBKO-1    T_RCBKO-1                 1
    1      00:00:00     03/01/2022      LARYO-3    T_LARYW-3                 1
    2      00:00:00     03/01/2022       LAGA-1     T_LAGA-1                 1
    3      00:00:00     03/01/2022      CRMLW-1    T_CRMLW-1                 1
    4      00:00:00     03/01/2022      GRIFW-1    T_GRIFW-1                 1
    ...         ...            ...          ...          ...               ...
    52533  23:30:00     08/01/2022      CRMLW-1    T_CRMLW-1                48
    52534  23:30:00     08/01/2022      LARYO-4    T_LARYW-4                48
    52535  23:30:00     08/01/2022      HOWBO-3    T_HOWBO-3                48
    52536  23:30:00     08/01/2022      BETHW-1    E_BETHW-1                48
    52537  23:30:00     08/01/2022      HMGTO-1    T_HMGTO-1                48

           quantity  Capacity_x Technology Technology_x  \
    0       278.658         NaN        NaN         WIND
    1       162.940         NaN        NaN         WIND
    2       262.200         NaN        NaN         CCGT
    3         3.002         NaN        NaN         WIND
    4         9.972         NaN        NaN         WIND
    ...         ...         ...        ...          ...
    52533     8.506         NaN        NaN         WIND
    52534   159.740         NaN        NaN         WIND
    52535    32.554         NaN        NaN          NaN
    52536     5.010         NaN        NaN         WIND
    52537    92.094         NaN        NaN         WIND

           Registered Resource Name_x  Capacity_y Technology_y  \
    0                             NaN         NaN         WIND
    1                             NaN         NaN         WIND
    2                             NaN         NaN         CCGT
    3                             NaN         NaN         WIND
    4                             NaN         NaN         WIND
    ...                           ...         ...          ...
    52533                         NaN         NaN         WIND
    52534                         NaN         NaN         WIND
    52535                         NaN         NaN          NaN
    52536                         NaN         NaN         WIND
    52537                         NaN         NaN         WIND

           Registered Resource Name_y  Capacity
    0                             NaN       NaN
    1                             NaN       NaN
    2                             NaN       NaN
    3                             NaN       NaN
    4                             NaN       NaN
    ...                           ...       ...
    52533                         NaN       NaN
    52534                         NaN       NaN
    52535                         NaN       NaN
    52536                         NaN       NaN
    52537                         NaN       NaN

    [52538 rows x 14 columns]

EDIT: New query:

    ID Type 1  Tech  Period_1  Period_2  Period_3  Period_4  Period_35  Period_48
    a          wind  450       0         0         0         0          0          >>> these are the mean of all dates
    b          wind  0         0         550       0         0          85
    b          wind  0         0         895       0         452        0
For the first part of your question you want to do a left merge on those two columns twice, like this:

    df3 = (
        df3
        .merge(df1, on=['ID type 1', 'ID type 2'], how='left')
        .merge(df2, on=['ID type 1', 'ID type 2'], how='left')
    )
    print(df3)

              Date      Time ID type 1 ID type 2  Period  output  Unit   Tech
    0   03/01/2022  02:30:00         a       1_a       1     254   400   wind
    1   03/01/2022  02:30:00         b       1_b       1     456    26  solar
    2   03/01/2022  02:30:00         c       1_c       2    3325    23    gas
    3   03/01/2022  02:30:00         d       1_b       2    1254    45   coal
    4   05/01/2022  02:30:00         e       1_d       3     489    24   wind
    5   05/01/2022  02:30:00         a       1_a       3     452   400   wind
    6   05/01/2022  02:30:00         b       1_b       4      12    26  solar
    7   05/01/2022  02:30:00         c       1_c       4       1    23    gas
    8   05/01/2022  03:00:00         d       1_b      35      54    45   coal
    9   05/01/2022  03:00:00         e       1_d      35      48    24   wind
    10  05/01/2022  03:00:00         a       1_a      48      56   400   wind

For the next part you could use pandas.pivot_table:

    out = (
        df3
        .pivot_table(
            index=['Date', 'ID type 1', 'Tech'],
            columns='Period',
            values='output',
            aggfunc=sum,
            fill_value=0)
        .add_prefix('Period_')
    )
    print(out)

Output:

    Period                      Period_1  Period_2  Period_3  Period_4  Period_35  Period_48
    Date       ID type 1 Tech
    03/01/2022 a         wind        254         0         0         0          0          0
               b         solar       456         0         0         0          0          0
               c         gas           0      3325         0         0          0          0
               d         coal          0      1254         0         0          0          0
    05/01/2022 a         wind          0         0       452         0          0         56
               b         solar         0         0         0        12          0          0
               c         gas           0         0         0         1          0          0
               d         coal          0         0         0         0         54          0
               e         wind          0         0       489         0         48          0

I used fill_value to show you that option; without it you get NaN in those cells.

UPDATE: from the question in the comments, to only get the pivoted data of one Technology (e.g. 'wind'):

    out.loc[out.index.get_level_values('Tech')=='wind']

    Period                      Period_1  Period_2  Period_3  Period_4  Period_35  Period_48
    Date       ID type 1 Tech
    03/01/2022 a         wind        254         0         0         0          0          0
    05/01/2022 a         wind          0         0       452         0          0         56
               e         wind          0         0       489         0         48          0
How can I correctly write a function? (Python)
Here is my definition:

    def fill(df_name):
        """ Function to fill rows and dates. """
        # Fill Down
        for row in df_name[0]:
            if 'Unnamed' in row:
                df_name[0] = df_name[0].replace(row, np.nan)
        df_name[0] = df_name[0].ffill(limit=2)
        df_name[1] = df_name[1].ffill(limit=2)
        # Fill in Dates
        for col in df_name.columns:
            if col >= 3:
                old_dt = datetime(1998, 11, 15)
                add_dt = old_dt + relativedelta(months=col - 3)
                new_dt = add_dt.strftime('%#m/%d/%Y')
                df_name = df_name.rename(columns={col: new_dt})

and then I call:

    fill(df_cars)

The first half of the function works (columns 0 and 1 have filled in correctly). However, as you can see, the columns are still labeled 0-288. When I delete this function and simply run the code (changing df_name to df_cars) it runs correctly and the column names are the dates specified in the second half of the function. What could be causing the # Fill in Dates portion not to take effect when defined in a function? Does it have to do with local variables?

                0    1       2         3         4         5  ...  287  288  289  290  291  292
    0      France  NaN  Market      3330      7478      2273  ...  NaN  NaN  NaN  NaN  NaN  NaT
    1      France  NaN   World       362       798       306  ...  NaN  NaN  NaN  NaN  NaN  NaT
    2      France  NaN       %  0.108709  0.106713  0.134624  ...  NaN  NaN  NaN  NaN  NaN  NaT
    3     Germany  NaN  Market      1452      2025      1314  ...  NaN  NaN  NaN  NaN  NaN  NaT
    4     Germany  NaN   World       209       246       182  ...  NaN  NaN  NaN  NaN  NaN  NaT
    ..        ...  ...     ...       ...       ...       ...  ...  ...  ...  ...  ...  ...   ..
    349  Slovakia    0   World         1         1         0  ...  NaN  NaN  NaN  NaN  NaN  NaT
    350  Slovakia    0       %       0.5       0.5         0  ...  NaN  NaN  NaN  NaN  NaN  NaT
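The likely culprit here is rebinding rather than scoping per se: DataFrame.rename returns a new DataFrame, and assigning it back to the parameter name only rebinds the function's local variable, so the caller's df_cars never sees the renamed columns (the fill-down half "works" because column assignment like df_name[0] = ... mutates the frame in place). A minimal sketch of the fix, on a made-up toy frame with a hypothetical renaming rule, is to return the new frame and re-assign at the call site:

```python
import pandas as pd

def fill(df):
    """Illustrative stand-in for the rename half of the original function.
    rename() returns a NEW DataFrame, so it must be returned to the caller;
    rebinding the local name alone is invisible outside the function."""
    df = df.rename(columns={c: f"date_{c}" for c in df.columns
                            if isinstance(c, int) and c >= 3})
    return df

df_cars = pd.DataFrame({0: ["France"], 3: [3330], 4: [7478]})
df_cars = fill(df_cars)          # re-assign, instead of calling fill(df_cars) alone
print(list(df_cars.columns))     # [0, 'date_3', 'date_4']
```

Alternatively, renaming in place with df.rename(columns=..., inplace=True) inside the function would also propagate to the caller, though returning the frame is usually considered cleaner.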
transpose/stack/unstack pandas dataframe whilst concatenating the field name with existing columns
I have a dataframe that looks something like:

    Component  Date      MTD   YTD  QTD    FC
    ABC        Jan 2017  56    nan  nan    nan
    DEF        Jan 2017  453   nan  nan    nan
    XYZ        Jan 2017  657
    PQR        Jan 2017  123
    ABC        Feb 2017  56    nan  nan    nan
    DEF        Feb 2017  456   nan  nan    nan
    XYZ        Feb 2017  6234  57
    PQR        Feb 2017  123   346
    ABC        Dec 2017  56    nan  nan    nan
    DEF        Dec 2017  nan   nan  345    324
    XYZ        Dec 2017  6234  57
    PQR        Dec 2017  nan   346  54654  546

And I would like to transpose this dataframe in such a way that the Component becomes the prefix of the existing MTD, QTD, etc. columns, so the expected output would be:

    Date      ABC_MTD  DEF_MTD  XYZ_MTD  PQR_MTD  ABC_YTD  DEF_YTD  XYZ_YTD  PQR_YTD  etc.
    Jan 2017  56       453      657      123      nan      nan      nan      nan
    Feb 2017  56       456      6234     123      nan      nan      57       346
    Dec 2017  56       nan      6234     nan      nan      nan      57       346

I am not sure whether a pivot or stack/unstack would be efficient here. Thanks in advance.
You could try this:

    newdf = df.pivot(values=df.columns[2:], index='Date', columns='Component')
    # join the multiindex columns names
    newdf.columns = ['%s%s' % (b, '_%s' % a if b else '') for a, b in newdf.columns]
    print(newdf)

Output:

    df
       Component       Date     MTD  YTD    QTD   FC
    0        ABC 2017-01-01    56.0  NaN    NaN  NaN
    1        DEF 2017-01-01   453.0  NaN    NaN  NaN
    2        XYZ 2017-01-01   657.0
    3        PQR 2017-01-01   123.0
    4        ABC 2017-02-01    56.0  NaN    NaN  NaN
    5        DEF 2017-02-01   456.0  NaN    NaN  NaN
    6        XYZ 2017-02-01  6234.0   57
    7        PQR 2017-02-01   123.0  346
    8        ABC 2017-12-01    56.0  NaN    NaN  NaN
    9        DEF 2017-12-01     NaN  NaN    345  324
    10       XYZ 2017-12-01  6234.0   57
    11       PQR 2017-12-01     NaN  346  54654  546

    newdf
                ABC_MTD DEF_MTD PQR_MTD XYZ_MTD ABC_YTD DEF_YTD PQR_YTD XYZ_YTD ABC_QTD DEF_QTD PQR_QTD XYZ_QTD ABC_FC DEF_FC PQR_FC XYZ_FC
    Date
    2017-01-01       56     453     123     657     NaN     NaN                     NaN     NaN                    NaN    NaN
    2017-02-01       56     456     123    6234     NaN     NaN     346      57     NaN     NaN     NaN            NaN    NaN    NaN
    2017-12-01       56     NaN     NaN    6234     NaN     NaN     346      57     NaN     345   54654            NaN    324    546
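The '%s%s' comprehension above can read obscurely; the same column flattening can be written with an f-string. A small self-contained sketch on toy data (the frame and its values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Component": ["ABC", "DEF", "ABC", "DEF"],
                   "Date": ["Jan 2017", "Jan 2017", "Feb 2017", "Feb 2017"],
                   "MTD": [56, 453, 56, 456],
                   "YTD": [1, 2, 3, 4]})

# pivot with a list of value columns yields 2-level columns: (metric, component)
newdf = df.pivot(index="Date", columns="Component", values=["MTD", "YTD"])

# flatten the MultiIndex to "<component>_<metric>" names
newdf.columns = [f"{comp}_{metric}" for metric, comp in newdf.columns]
print(sorted(newdf.columns))  # ['ABC_MTD', 'ABC_YTD', 'DEF_MTD', 'DEF_YTD']
```

The tuple order in the comprehension matters: the first level of the pivoted columns holds the metric names (MTD/YTD) and the second holds the Component, so they are swapped when building the prefix.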
Python: Expand a dataframe row-wise based on datetime
I have a dataframe like this:

        ID       Date   Value
    783  C 2018-02-23   0.704
    580  B 2018-08-04  -1.189
    221  A 2018-08-10  -0.788
    228  A 2018-08-17   0.038
    578  B 2018-08-02   1.188

What I want is to expand the dataframe based on the Date column back to 1 month earlier, filling ID with the same person and Value with NaN until the last observation. The expected result is similar to this (intermediate daily rows elided with ".."):

        ID        Date   Value
    0    C  2018/01/24     nan
    1    C  2018/01/25     nan
    2    C  2018/01/26     nan
    ..   .         ...     ...
    29   C  2018/02/22     nan
    30   C  2018/02/23   1.093
    31   B  2018/07/05     nan
    32   B  2018/07/06     nan
    ..   .         ...     ...
    60   B  2018/08/03     nan
    61   B  2018/08/04   0.764
    62   A  2018/07/11     nan
    ..   .         ...     ...
    91   A  2018/08/09     nan
    92   A  2018/08/10   2.144
    93   A  2018/07/18     nan
    ..   .         ...     ...
    122  A  2018/08/16     nan
    123  A  2018/08/17   0.644
    124  B  2018/07/03     nan
    ..   .         ...     ...
    153  B  2018/08/01     nan
    154  B  2018/08/02  -0.767

The source data can be created as below:

    import pandas as pd
    from itertools import chain
    import numpy as np

    df_1 = pd.DataFrame({
        'ID' : list(chain.from_iterable([['A'] * 365, ['B'] * 365, ['C'] * 365])),
        'Date' : pd.date_range(start = '2018-01-01', end = '2018-12-31').tolist() +
                 pd.date_range(start = '2018-01-01', end = '2018-12-31').tolist() +
                 pd.date_range(start = '2018-01-01', end = '2018-12-31').tolist(),
        'Value' : np.random.randn(365 * 3)
    })
    df_1 = df_1.sample(5, random_state = 123)

Thanks for the advice!
You can create another DataFrame shifted to the previous month, then join both together with concat, create a DatetimeIndex, and use groupby with resample by 'D' (days) to add all the dates in between:

    df_2 = df_1.assign(Date = df_1['Date'] - pd.DateOffset(months=1) + pd.DateOffset(days=1),
                       Value = np.nan)

    df = (pd.concat([df_2, df_1], sort=False)
            .reset_index()
            .set_index('Date')
            .groupby('index', sort=False)
            .resample('D')
            .ffill()
            .reset_index(level=1)
            .drop('index', 1)
            .rename_axis(None))
    print (df)

              Date ID     Value
    783 2018-01-24  C       NaN
    783 2018-01-25  C       NaN
    783 2018-01-26  C       NaN
    783 2018-01-27  C       NaN
    783 2018-01-28  C       NaN
    ..         ... ..       ...
    578 2018-07-29  B       NaN
    578 2018-07-30  B       NaN
    578 2018-07-31  B       NaN
    578 2018-08-01  B       NaN
    578 2018-08-02  B  0.562684

    [155 rows x 3 columns]

Another solution with a list comprehension and concat, but at the end back-filling of the index and ID columns is necessary; this works if there are no missing values in the original ID column:

    offset = pd.DateOffset(months=1) + pd.DateOffset(days=1)
    df = pd.concat([df_1.iloc[[i]]
                        .reset_index()
                        .set_index('Date')
                        .reindex(pd.date_range(d - offset, d))
                    for i, d in enumerate(df_1['Date'])], sort=False)
    df = (df.assign(index = df['index'].bfill().astype(int),
                    ID = df['ID'].bfill())
            .rename_axis('Date')
            .reset_index()
            .set_index('index')
            .rename_axis(None))
    print (df)

              Date ID     Value
    783 2018-01-24  C       NaN
    783 2018-01-25  C       NaN
    783 2018-01-26  C       NaN
    783 2018-01-27  C       NaN
    783 2018-01-28  C       NaN
    ..         ... ..       ...
    578 2018-07-29  B       NaN
    578 2018-07-30  B       NaN
    578 2018-07-31  B       NaN
    578 2018-08-01  B       NaN
    578 2018-08-02  B  1.224345

    [155 rows x 3 columns]
We can create a date range in the "Date" column, then explode it. Then group the "Value" column by the index and set all values to NaN but the last. Finally reset the index.

    def drange(t):
        return pd.date_range(t - pd.DateOffset(months=1) + pd.DateOffset(days=1),
                             t, freq="D", normalize=True)

    df["Date"] = df["Date"].transform(drange)

          ID                                               Date  Value
    index
    783    C  DatetimeIndex(['2018-01-24', '2018-01-25', '20...  0.704
    580    B  DatetimeIndex(['2018-07-05', '2018-07-06', '20... -1.189
    221    A  DatetimeIndex(['2018-07-11', '2018-07-12', '20... -0.788
    228    A  DatetimeIndex(['2018-07-18', '2018-07-19', '20...  0.038
    578    B  DatetimeIndex(['2018-07-03', '2018-07-04', '20...  1.188

    df = df.reset_index(drop=True).explode(column="Date")

       ID       Date  Value
    0   C 2018-01-24  0.704
    0   C 2018-01-25  0.704
    0   C 2018-01-26  0.704
    0   C 2018-01-27  0.704
    0   C 2018-01-28  0.704
    .. ..        ...    ...
    4   B 2018-07-29  1.188
    4   B 2018-07-30  1.188
    4   B 2018-07-31  1.188
    4   B 2018-08-01  1.188
    4   B 2018-08-02  1.188

    df["Value"] = df.groupby(level=0)["Value"].transform(
        lambda v: [np.nan] * (len(v) - 1) + [v.iloc[0]])
    df = df.reset_index(drop=True)

        ID       Date  Value
    0    C 2018-01-24    NaN
    1    C 2018-01-25    NaN
    2    C 2018-01-26    NaN
    3    C 2018-01-27    NaN
    4    C 2018-01-28    NaN
    ..  ..        ...    ...
    150  B 2018-07-29    NaN
    151  B 2018-07-30    NaN
    152  B 2018-07-31    NaN
    153  B 2018-08-01    NaN
    154  B 2018-08-02  1.188
Pivoting DataFrame with multiple columns for the index
I have a dataframe and I want to transpose only a few rows to columns. This is what I have now:

       Entity   Name        Date  Value
    0     111  Name1  2018-03-31    100
    1     111  Name2  2018-02-28    200
    2     222  Name3  2018-02-28   1000
    3     333  Name1  2018-01-31   2000

I want to create the dates as the columns and then fill in the Value. Something like this:

       Entity   Name  2018-01-31  2018-02-28  2018-03-31
    0     111  Name1         NaN         NaN       100.0
    1     111  Name2         NaN       200.0         NaN
    2     222  Name3         NaN      1000.0         NaN
    3     333  Name1      2000.0         NaN         NaN

I can have an identical Name for two different Entitys. Here is an updated dataset. Code:

    import pandas as pd
    import datetime

    data1 = {
        'Entity': [111, 111, 222, 333],
        'Name': ['Name1', 'Name2', 'Name3', 'Name1'],
        'Date': [datetime.date(2018, 3, 31), datetime.date(2018, 2, 28),
                 datetime.date(2018, 2, 28), datetime.date(2018, 1, 31)],
        'Value': [100, 200, 1000, 2000]
    }
    df1 = pd.DataFrame(data1, columns=['Entity', 'Name', 'Date', 'Value'])

How do I achieve this? Any pointers? Thanks all.
Based on your update, you'd need pivot_table with two index columns:

    v = df1.pivot_table(
        index=['Entity', 'Name'],
        columns='Date',
        values='Value'
    ).reset_index()
    v.index.name = v.columns.name = None

    v

       Entity   Name  2018-01-31  2018-02-28  2018-03-31
    0     111  Name1         NaN         NaN       100.0
    1     111  Name2         NaN       200.0         NaN
    2     222  Name3         NaN      1000.0         NaN
    3     333  Name1      2000.0         NaN         NaN
From unstack:

    df1.set_index(['Entity', 'Name', 'Date']).Value.unstack().reset_index()

    Date  Entity   Name  2018-01-31 00:00:00  2018-02-28 00:00:00  \
    0        111  Name1                  NaN                  NaN
    1        111  Name2                  NaN                200.0
    2        222  Name3                  NaN               1000.0
    3        333  Name1               2000.0                  NaN

    Date  2018-03-31 00:00:00
    0                   100.0
    1                     NaN
    2                     NaN
    3                     NaN