Transpose and Compare - python

I'm attempting to compare two data frames. Item and Summary variables correspond to various dates and quantities. I'd like to transpose the dates into one column of data along with the associated quantities. I'd then like to compare the two data frames and see what changed from PreviousData to CurrentData.
Previous Data:
import numpy as np
import pandas as pd

PreviousData = { 'Item' : ['abc','def','ghi','jkl','mno','pqr','stu','vwx','yza','uaza','fupa'],
'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
'2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan,np.nan,2],
'2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
'2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
'2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
'2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
}
PreviousData = pd.DataFrame(PreviousData)
PreviousData
Current Data:
CurrentData = { 'Item' : ['ghi','stu','abc','mno','jkl','pqr','def','vwx','yza'],
'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
'2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
'2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
'2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
'2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
}
CurrentData = pd.DataFrame(CurrentData)
CurrentData
As requested, here's an example of a difference:
How to transpose and compare these two sets?

One way of doing this is the following. Reshape both dataframes from wide to long format with melt, so that the dates end up in a single column:
PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"],
var_name="Date",
value_name="value1")
which is
Item Summary Date value1
0 abc party 2022-01-01 1.0
1 def weekend 2022-01-01 NaN
2 ghi food 2022-01-01 NaN
3 jkl school 2022-01-01 1.0
4 mno tv 2022-01-01 NaN
.. ... ... ... ...
226 stu camera 2023-09-01 NaN
227 vwx python 2023-09-01 1.0
228 yza r 2023-09-01 NaN
229 uaza rstudio 2023-09-01 NaN
230 fupa spyder 2023-09-01 NaN
and
CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"],
var_name="Date",
value_name="value2")
Item Summary Date value2
0 ghi food 2022-01-01 3.0
1 stu camera 2022-01-01 NaN
2 abc party 2022-01-01 NaN
3 mno tv 2022-01-01 1.0
4 jkl school 2022-01-01 NaN
.. ... ... ... ...
139 jkl school 2023-04-01 NaN
140 pqr photo 2023-04-01 2.0
141 def weekend 2023-04-01 NaN
142 vwx python 2023-04-01 NaN
143 yza r 2023-04-01 2.0
[144 rows x 4 columns]
Then merge:
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'], how = 'left')
Item Summary Date value1 value2
0 abc party 2022-01-01 1.0 NaN
1 def weekend 2022-01-01 NaN NaN
2 ghi food 2022-01-01 NaN 3.0
3 jkl school 2022-01-01 1.0 NaN
4 mno tv 2022-01-01 NaN 1.0
.. ... ... ... ... ...
226 stu camera 2023-09-01 NaN NaN
227 vwx python 2023-09-01 1.0 NaN
228 yza r 2023-09-01 NaN NaN
229 uaza rstudio 2023-09-01 NaN NaN
230 fupa spyder 2023-09-01 NaN NaN
[231 rows x 5 columns]
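and compare by creating a column that marks differences. Note that NaN != NaN evaluates to True, so rows where both values are missing are flagged as well: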
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
Item Summary Date value1 value2 diff
0 abc party 2022-01-01 1.0 NaN 1
1 def weekend 2022-01-01 NaN NaN 1
2 ghi food 2022-01-01 NaN 3.0 1
3 jkl school 2022-01-01 1.0 NaN 1
4 mno tv 2022-01-01 NaN 1.0 1
.. ... ... ... ... ... ...
226 stu camera 2023-09-01 NaN NaN 1
227 vwx python 2023-09-01 1.0 NaN 1
228 yza r 2023-09-01 NaN NaN 1
229 uaza rstudio 2023-09-01 NaN NaN 1
230 fupa spyder 2023-09-01 NaN NaN 1
[231 rows x 6 columns]
If you only want to compare those entries that are common to both, do this:
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'])
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
Item Summary Date value1 value2 diff
0 abc party 2022-01-01 1.0 NaN 1
1 def weekend 2022-01-01 NaN NaN 1
2 ghi food 2022-01-01 NaN 3.0 1
3 jkl school 2022-01-01 1.0 NaN 1
4 mno tv 2022-01-01 NaN 1.0 1
.. ... ... ... ... ... ...
139 mno tv 2023-04-01 NaN NaN 1
140 pqr photo 2023-04-01 NaN 2.0 1
141 stu camera 2023-04-01 NaN NaN 1
142 vwx python 2023-04-01 1.0 NaN 1
143 yza r 2023-04-01 NaN 2.0 1
[144 rows x 6 columns]
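Because NaN never compares equal to NaN, the diff column above is 1 even where both frames are empty for that date. The following is a minimal sketch of a NaN-aware variant, using a small made-up sample rather than the full data from the question; the names `previous`, `current`, and `both_missing` are illustrative only. It also uses an outer merge with `indicator=True` so that items present in only one frame remain visible.

```python
import numpy as np
import pandas as pd

# Small illustrative sample (not the data from the question).
previous = pd.DataFrame({
    "Item": ["abc", "def", "uaza"],
    "Summary": ["party", "weekend", "rstudio"],
    "2022-01-01": [1.0, np.nan, np.nan],
    "2022-02-01": [1.0, np.nan, 2.0],
})
current = pd.DataFrame({
    "Item": ["abc", "def"],
    "Summary": ["party", "weekend"],
    "2022-01-01": [3.0, np.nan],
    "2022-02-01": [1.0, 2.0],
})

id_cols = ["Item", "Summary"]
prev_long = previous.melt(id_vars=id_cols, var_name="Date", value_name="value1")
curr_long = current.melt(id_vars=id_cols, var_name="Date", value_name="value2")

# Outer merge keeps rows from both frames; _merge records where each row came from.
compare = prev_long.merge(curr_long, on=id_cols + ["Date"], how="outer", indicator=True)

# Treat "missing in both" as unchanged instead of relying on NaN != NaN.
both_missing = compare["value1"].isna() & compare["value2"].isna()
compare["diff"] = ((compare["value1"] != compare["value2"]) & ~both_missing).astype(int)

changed = compare[compare["diff"] == 1]
print(changed)
```

Filtering on `diff == 1` then leaves only the cells whose quantity was added, removed, or altered between the two frames.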

Related

How to replicate index/match (with multiple criteria) in python with multiple dataframes?

I am trying to replicate an Excel model in python so that I can automate it as I scale it up, but I am stuck on how to translate the complex formulas into python.
I have information in three dataframes:
DF1:
ID type 1  ID type 2  Unit
a          1_a        400
b          1_b        26
c          1_c        23
d          1_b        45
e          1_d        24
f          1_b        85
g          1_a        98
DF2:
ID type 1  ID type 2  Tech
a          1_a        wind
b          1_b        solar
c          1_c        gas
d          1_b        coal
e          1_d        wind
f          1_b        gas
g          1_a        coal
And DF 3, the main DF:
Date        Time      ID type 1  ID type 2  Period  output  Unit *  Tech *
03/01/2022  02:30:00  a          1_a        1       254
03/01/2022  02:30:00  b          1_b        1       456
03/01/2022  02:30:00  c          1_c        2       3325
03/01/2022  02:30:00  d          1_b        2       1254
05/01/2022  02:30:00  e          1_d        3       489
05/01/2022  02:30:00  a          1_a        3       452
05/01/2022  02:30:00  b          1_b        4       12
05/01/2022  02:30:00  c          1_c        4       1
05/01/2022  03:00:00  d          1_b        35      54
05/01/2022  03:00:00  e          1_d        35      48
05/01/2022  03:00:00  a          1_a        48      56
I wish to bring "Unit" and "Tech" from DF1 and DF2 into DF3 for each ID in DF3. The conditional statements I currently use in Excel are based on INDEX, MATCH and IFNA: some of the IDs in DF3 match on ID type 1 and others on ID type 2, so the formula checks both columns and returns the result from whichever one matches.
For more context, DF1 and DF2 do not change but DF3 changes, and I need a function for that which I will explain later.
The Excel formula I use to fill in Unit * from DF1 is (note I have replaced the Excel sheet name with DF1 to help conceptualize the problem):
=IFNA(INDEX('DF1'!$K$3:$K$1011,MATCH(N2,'DF1'!$E$3:$E$1011,0)),INDEX('DF1'!$K$3:$K$1011,MATCH(M2,'DF1'!$D$3:$D$1011,0)))
The Excel formula I use to fill in Tech * is a bit more straightforward:
=IFNA(INDEX('DF2'$L:$L,MATCH(O3,'DF2'$K:$K,0)),INDEX('DF2'$L:$L,MATCH(N3,'DF2'$J:$J,0)))
That is the main stumbling block at the moment, but once this is achieved I need a function that produces the following DF for each day:
ID type 1  Tech   Period 1                                       Period 2  Period 3  Period 4  Period 5  Period 6  Period 7  …
a          wind   Sum of output for this ID Type 1 and Period 1
b          solar
c          gas
d          coal
e          wind
a          gas
…          …
The idea here is that I can again use a conditional function to sum the "output" column of DF3, conditioned on date, ID type and period number.
EDIT: Output based on possible solution:
time settlementDate BM Unit ID 1 BM Unit ID 2 settlementPeriod \
0 00:00:00 03/01/2022 RCBKO-1 T_RCBKO-1 1
1 00:00:00 03/01/2022 LARYO-3 T_LARYW-3 1
2 00:00:00 03/01/2022 LAGA-1 T_LAGA-1 1
3 00:00:00 03/01/2022 CRMLW-1 T_CRMLW-1 1
4 00:00:00 03/01/2022 GRIFW-1 T_GRIFW-1 1
... ... ... ... ... ...
52533 23:30:00 08/01/2022 CRMLW-1 T_CRMLW-1 48
52534 23:30:00 08/01/2022 LARYO-4 T_LARYW-4 48
52535 23:30:00 08/01/2022 HOWBO-3 T_HOWBO-3 48
52536 23:30:00 08/01/2022 BETHW-1 E_BETHW-1 48
52537 23:30:00 08/01/2022 HMGTO-1 T_HMGTO-1 48
quantity Capacity_x Technology Technology_x \
0 278.658 NaN NaN WIND
1 162.940 NaN NaN WIND
2 262.200 NaN NaN CCGT
3 3.002 NaN NaN WIND
4 9.972 NaN NaN WIND
... ... ... ... ...
52533 8.506 NaN NaN WIND
52534 159.740 NaN NaN WIND
52535 32.554 NaN NaN NaN
52536 5.010 NaN NaN WIND
52537 92.094 NaN NaN WIND
Registered Resource Name_x Capacity_y Technology_y \
0 NaN NaN WIND
1 NaN NaN WIND
2 NaN NaN CCGT
3 NaN NaN WIND
4 NaN NaN WIND
... ... ... ...
52533 NaN NaN WIND
52534 NaN NaN WIND
52535 NaN NaN NaN
52536 NaN NaN WIND
52537 NaN NaN WIND
Registered Resource Name_y Capacity
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
52533 NaN NaN
52534 NaN NaN
52535 NaN NaN
52536 NaN NaN
52537 NaN NaN
[52538 rows x 14 columns]
EDIT: New query
ID Type 1  Tech  Period_1  Period_2  Period_3  Period_4  Period_35  Period_48
a          wind  450       0         0         0         0          0          >>> These are mean of all dates*
b          wind  0         0         550       0         0          85
b          wind  0         0         895       0         452        0
For the first part of your question, you want to do a left merge on those two ID columns twice, once against df1 and once against df2, like this:
df3 = (
df3
.merge(df1, on=['ID type 1', 'ID type 2'], how='left')
.merge(df2, on=['ID type 1', 'ID type 2'], how='left')
)
print(df3)
Date Time ID type 1 ID type 2 Period output Unit Tech
0 03/01/2022 02:30:00 a 1_a 1 254 400 wind
1 03/01/2022 02:30:00 b 1_b 1 456 26 solar
2 03/01/2022 02:30:00 c 1_c 2 3325 23 gas
3 03/01/2022 02:30:00 d 1_b 2 1254 45 coal
4 05/01/2022 02:30:00 e 1_d 3 489 24 wind
5 05/01/2022 02:30:00 a 1_a 3 452 400 wind
6 05/01/2022 02:30:00 b 1_b 4 12 26 solar
7 05/01/2022 02:30:00 c 1_c 4 1 23 gas
8 05/01/2022 03:00:00 d 1_b 35 54 45 coal
9 05/01/2022 03:00:00 e 1_d 35 48 24 wind
10 05/01/2022 03:00:00 a 1_a 48 56 400 wind
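The merge above requires both ID columns to match at once, whereas the Excel formula in the question tries one ID column and falls back to the other when there is no match. A minimal sketch of that fallback with `Series.map` and `fillna` follows; the sample rows (including the `"zzz"` and `"unknown"` IDs) are made up for illustration, and `by_id1` and `by_id2` are hypothetical helper names.

```python
import pandas as pd

# Illustrative lookup table and main table (not the real data).
df1 = pd.DataFrame({
    "ID type 1": ["a", "b", "c"],
    "ID type 2": ["1_a", "1_b", "1_c"],
    "Unit": [400, 26, 23],
})
df3 = pd.DataFrame({
    "ID type 1": ["a", "b", "zzz"],
    "ID type 2": ["1_a", "unknown", "1_c"],
    "output": [254, 456, 3325],
})

# One lookup Series per ID column. MATCH returns the first hit, so keep the
# first row for each key to mirror that behaviour.
by_id2 = df1.drop_duplicates("ID type 2").set_index("ID type 2")["Unit"]
by_id1 = df1.drop_duplicates("ID type 1").set_index("ID type 1")["Unit"]

# Try ID type 2 first; where that yields NaN, fall back to ID type 1 (the IFNA part).
df3["Unit"] = df3["ID type 2"].map(by_id2).fillna(df3["ID type 1"].map(by_id1))
print(df3)
```

One caveat: in the question's DF1 the value 1_b appears under several different ID type 1 values, so a lookup by ID type 2 alone returns the first matching row, just as MATCH would. The same pattern applies to Tech from DF2.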
For the next part you could use pandas.pivot_table:
out = (
df3
.pivot_table(
index=['Date', 'ID type 1', 'Tech'],
columns='Period',
values='output',
aggfunc='sum',
fill_value=0)
.add_prefix('Period_')
)
print(out)
Output:
Period Period_1 Period_2 Period_3 Period_4 Period_35 Period_48
Date ID type 1 Tech
03/01/2022 a wind 254 0 0 0 0 0
b solar 456 0 0 0 0 0
c gas 0 3325 0 0 0 0
d coal 0 1254 0 0 0 0
05/01/2022 a wind 0 0 452 0 0 56
b solar 0 0 0 12 0 0
c gas 0 0 0 1 0 0
d coal 0 0 0 0 54 0
e wind 0 0 489 0 48 0
I used fill_value to show you that option; without it you get NaN in those cells.
UPDATE:
Following a question in the comments, to get the pivot data for only one technology (e.g. 'wind'):
out.loc[out.index.get_level_values('Tech')=='wind']
Period Period_1 Period_2 Period_3 Period_4 Period_35 Period_48
Date ID type 1 Tech
03/01/2022 a wind 254 0 0 0 0 0
05/01/2022 a wind 0 0 452 0 0 56
e wind 0 0 489 0 48 0
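The "New query" edit in the question mentions values that are the mean over all dates. One possible reading, sketched below under that assumption, is to build the per-date pivot first and then average it across dates by grouping on the remaining index levels. The sample rows and the names `per_date` and `mean_over_dates` are made up for illustration.

```python
import pandas as pd

# Illustrative rows in the shape of df3 after the merge (not the real data).
df3 = pd.DataFrame({
    "Date": ["03/01/2022", "05/01/2022", "05/01/2022", "03/01/2022"],
    "ID type 1": ["a", "a", "a", "b"],
    "Tech": ["wind", "wind", "wind", "solar"],
    "Period": [1, 1, 3, 1],
    "output": [254, 100, 452, 456],
})

# Per-date sums, one column per period.
per_date = (
    df3.pivot_table(
        index=["Date", "ID type 1", "Tech"],
        columns="Period",
        values="output",
        aggfunc="sum",
        fill_value=0,
    )
    .add_prefix("Period_")
)

# Average across dates by grouping on the non-date index levels.
mean_over_dates = per_date.groupby(level=["ID type 1", "Tech"]).mean()
print(mean_over_dates)
```

Because `fill_value=0` is applied before averaging, a date with no output in a period counts as zero in the mean; dropping `fill_value` would average only over the dates that have data.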

How can I correctly write a function? (Python)

Here is my definition:
def fill(df_name):
"""
Function to fill rows and dates.
"""
# Fill Down
for row in df_name[0]:
if 'Unnamed' in row:
df_name[0] = df_name[0].replace(row, np.nan)
df_name[0] = df_name[0].ffill(limit=2)
df_name[1] = df_name[1].ffill(limit=2)
# Fill in Dates
for col in df_name.columns:
if col >= 3:
old_dt = datetime(1998, 11, 15)
add_dt = old_dt + relativedelta(months=col - 3)
new_dt = add_dt.strftime('%#m/%d/%Y')
df_name = df_name.rename(columns={col: new_dt})
and then I call:
fill(df_cars)
The first half of the function works (columns 0 and 1 have been filled in correctly). However, as you can see, the columns are still labeled 0-288. When I delete the function and simply run the code at the top level (changing df_name to df_cars), it runs correctly and the column names are the dates specified in the second half of the function.
What could be preventing the # Fill in Dates portion from taking effect when it is defined in a function? Does it have to do with local variables?
0 1 2 3 4 5 ... 287 288 289 290 291 292
0 France NaN Market 3330 7478 2273 ... NaN NaN NaN NaN NaN NaT
1 France NaN World 362 798 306 ... NaN NaN NaN NaN NaN NaT
2 France NaN % 0.108709 0.106713 0.134624 ... NaN NaN NaN NaN NaN NaT
3 Germany NaN Market 1452 2025 1314 ... NaN NaN NaN NaN NaN NaT
4 Germany NaN World 209 246 182 ... NaN NaN NaN NaN NaN NaT
.. ... ... ... ... ... ... ... ... ... ... ... ... ..
349 Slovakia 0 World 1 1 0 ... NaN NaN NaN NaN NaN NaT
350 Slovakia 0 % 0.5 0.5 0 ... NaN NaN NaN NaN NaN NaT
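The behaviour described is consistent with how `rename` and local names work: `DataFrame.rename` returns a new DataFrame by default, and assigning it to `df_name` inside the function only rebinds the local name, so the caller's `df_cars` keeps its old column labels. The column assignments in the first half modify the passed object in place, which is why they are visible outside. A minimal sketch of one fix, returning the renamed frame, is below; `fill_dates`, `first_date`, and `first_col` are hypothetical names, and the portable `%m` format is used in place of the Windows-only `%#m`.

```python
from datetime import datetime

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta


def fill_dates(df, first_date=datetime(1998, 11, 15), first_col=3):
    """Return a copy of df whose integer columns >= first_col are renamed to monthly dates."""
    new_names = {}
    for col in df.columns:
        if isinstance(col, (int, np.integer)) and col >= first_col:
            dt = first_date + relativedelta(months=int(col) - first_col)
            new_names[col] = dt.strftime("%m/%d/%Y")
    # rename returns a new DataFrame, so hand it back to the caller.
    return df.rename(columns=new_names)


# Tiny illustrative frame with integer column labels.
df_cars = pd.DataFrame([["France", None, "Market", 3330, 7478]])
df_cars = fill_dates(df_cars)  # rebind the caller's name to the result
print(df_cars.columns.tolist())
```

Building one mapping and calling `rename` once also avoids creating a new DataFrame on every loop iteration.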

transpose/stack/unstack pandas dataframe whilst concatenating the field name with existing columns

I have a dataframe that looks something like:
Component Date MTD YTD QTD FC
ABC Jan 2017 56 nan nan nan
DEF Jan 2017 453 nan nan nan
XYZ Jan 2017 657
PQR Jan 2017 123
ABC Feb 2017 56 nan nan nan
DEF Feb 2017 456 nan nan nan
XYZ Feb 2017 6234 57
PQR Feb 2017 123 346
ABC Dec 2017 56 nan nan nan
DEF Dec 2017 nan nan 345 324
XYZ Dec 2017 6234 57
PQR Dec 2017 nan 346 54654 546
I would like to transpose this dataframe so that the component becomes the prefix of the existing MTD, QTD, etc. columns.
The expected output would be:
Date ABC_MTD DEF_MTD XYZ_MTD PQR_MTD ABC_YTD DEF_YTD XYZ_YTD PQR_YTD etcetc
Jan 2017 56 453 657 123 nan nan nan nan
Feb 2017 56 456 6234 123 nan nan 57 346
Dec 2017 56 nan 6234 nan 57 346
I am not sure whether a pivot or stack/unstack would be efficient out here.
Thanks in advance.
You could try this:
newdf=df.pivot(values=df.columns[2:], index='Date', columns='Component' )
newdf.columns = ['%s%s' % (b, '_%s' % a if b else '') for a, b in newdf.columns] #join the multiindex columns names
print(newdf)
Output:
df
Component Date MTD YTD QTD FC
0 ABC 2017-01-01 56.0 NaN NaN NaN
1 DEF 2017-01-01 453.0 NaN NaN NaN
2 XYZ 2017-01-01 657.0
3 PQR 2017-01-01 123.0
4 ABC 2017-02-01 56.0 NaN NaN NaN
5 DEF 2017-02-01 456.0 NaN NaN NaN
6 XYZ 2017-02-01 6234.0 57
7 PQR 2017-02-01 123.0 346
8 ABC 2017-12-01 56.0 NaN NaN NaN
9 DEF 2017-12-01 NaN NaN 345 324
10 XYZ 2017-12-01 6234.0 57
11 PQR 2017-12-01 NaN 346 54654 546
newdf
ABC_MTD DEF_MTD PQR_MTD XYZ_MTD ABC_YTD DEF_YTD PQR_YTD XYZ_YTD ABC_QTD DEF_QTD PQR_QTD XYZ_QTD ABC_FC DEF_FC PQR_FC XYZ_FC
Date
2017-01-01 56 453 123 657 NaN NaN NaN NaN NaN NaN
2017-02-01 56 456 123 6234 NaN NaN 346 57 NaN NaN NaN NaN
2017-12-01 56 NaN NaN 6234 NaN NaN 346 57 NaN 345 54654 NaN 324 546

Python: Expand a dataframe row-wise based on datetime

I have a dataframe like this:
ID Date Value
783 C 2018-02-23 0.704
580 B 2018-08-04 -1.189
221 A 2018-08-10 -0.788
228 A 2018-08-17 0.038
578 B 2018-08-02 1.188
What I want is expanding the dataframe based on Date column to 1-month earlier, and fill ID with the same person, and fill Value with nan until the last observation.
The expected result is similar to this:
ID Date Value
0 C 2018/01/24 nan
1 C 2018/01/25 nan
2 C 2018/01/26 nan
3 C 2018/01/27 nan
4 C 2018/01/28 nan
5 C 2018/01/29 nan
6 C 2018/01/30 nan
7 C 2018/01/31 nan
8 C 2018/02/01 nan
9 C 2018/02/02 nan
10 C 2018/02/03 nan
11 C 2018/02/04 nan
12 C 2018/02/05 nan
13 C 2018/02/06 nan
14 C 2018/02/07 nan
15 C 2018/02/08 nan
16 C 2018/02/09 nan
17 C 2018/02/10 nan
18 C 2018/02/11 nan
19 C 2018/02/12 nan
20 C 2018/02/13 nan
21 C 2018/02/14 nan
22 C 2018/02/15 nan
23 C 2018/02/16 nan
24 C 2018/02/17 nan
25 C 2018/02/18 nan
26 C 2018/02/19 nan
27 C 2018/02/20 nan
28 C 2018/02/21 nan
29 C 2018/02/22 nan
30 C 2018/02/23 1.093
31 B 2018/07/05 nan
32 B 2018/07/06 nan
33 B 2018/07/07 nan
34 B 2018/07/08 nan
35 B 2018/07/09 nan
36 B 2018/07/10 nan
37 B 2018/07/11 nan
38 B 2018/07/12 nan
39 B 2018/07/13 nan
40 B 2018/07/14 nan
41 B 2018/07/15 nan
42 B 2018/07/16 nan
43 B 2018/07/17 nan
44 B 2018/07/18 nan
45 B 2018/07/19 nan
46 B 2018/07/20 nan
47 B 2018/07/21 nan
48 B 2018/07/22 nan
49 B 2018/07/23 nan
50 B 2018/07/24 nan
51 B 2018/07/25 nan
52 B 2018/07/26 nan
53 B 2018/07/27 nan
54 B 2018/07/28 nan
55 B 2018/07/29 nan
56 B 2018/07/30 nan
57 B 2018/07/31 nan
58 B 2018/08/01 nan
59 B 2018/08/02 nan
60 B 2018/08/03 nan
61 B 2018/08/04 0.764
62 A 2018/07/11 nan
63 A 2018/07/12 nan
64 A 2018/07/13 nan
65 A 2018/07/14 nan
66 A 2018/07/15 nan
67 A 2018/07/16 nan
68 A 2018/07/17 nan
69 A 2018/07/18 nan
70 A 2018/07/19 nan
71 A 2018/07/20 nan
72 A 2018/07/21 nan
73 A 2018/07/22 nan
74 A 2018/07/23 nan
75 A 2018/07/24 nan
76 A 2018/07/25 nan
77 A 2018/07/26 nan
78 A 2018/07/27 nan
79 A 2018/07/28 nan
80 A 2018/07/29 nan
81 A 2018/07/30 nan
82 A 2018/07/31 nan
83 A 2018/08/01 nan
84 A 2018/08/02 nan
85 A 2018/08/03 nan
86 A 2018/08/04 nan
87 A 2018/08/05 nan
88 A 2018/08/06 nan
89 A 2018/08/07 nan
90 A 2018/08/08 nan
91 A 2018/08/09 nan
92 A 2018/08/10 2.144
93 A 2018/07/18 nan
94 A 2018/07/19 nan
95 A 2018/07/20 nan
96 A 2018/07/21 nan
97 A 2018/07/22 nan
98 A 2018/07/23 nan
99 A 2018/07/24 nan
100 A 2018/07/25 nan
101 A 2018/07/26 nan
102 A 2018/07/27 nan
103 A 2018/07/28 nan
104 A 2018/07/29 nan
105 A 2018/07/30 nan
106 A 2018/07/31 nan
107 A 2018/08/01 nan
108 A 2018/08/02 nan
109 A 2018/08/03 nan
110 A 2018/08/04 nan
111 A 2018/08/05 nan
112 A 2018/08/06 nan
113 A 2018/08/07 nan
114 A 2018/08/08 nan
115 A 2018/08/09 nan
116 A 2018/08/10 nan
117 A 2018/08/11 nan
118 A 2018/08/12 nan
119 A 2018/08/13 nan
120 A 2018/08/14 nan
121 A 2018/08/15 nan
122 A 2018/08/16 nan
123 A 2018/08/17 0.644
124 B 2018/07/03 nan
125 B 2018/07/04 nan
126 B 2018/07/05 nan
127 B 2018/07/06 nan
128 B 2018/07/07 nan
129 B 2018/07/08 nan
130 B 2018/07/09 nan
131 B 2018/07/10 nan
132 B 2018/07/11 nan
133 B 2018/07/12 nan
134 B 2018/07/13 nan
135 B 2018/07/14 nan
136 B 2018/07/15 nan
137 B 2018/07/16 nan
138 B 2018/07/17 nan
139 B 2018/07/18 nan
140 B 2018/07/19 nan
141 B 2018/07/20 nan
142 B 2018/07/21 nan
143 B 2018/07/22 nan
144 B 2018/07/23 nan
145 B 2018/07/24 nan
146 B 2018/07/25 nan
147 B 2018/07/26 nan
148 B 2018/07/27 nan
149 B 2018/07/28 nan
150 B 2018/07/29 nan
151 B 2018/07/30 nan
152 B 2018/07/31 nan
153 B 2018/08/01 nan
154 B 2018/08/02 -0.767
The source data can be created as below:
import pandas as pd
from itertools import chain
import numpy as np
df_1 = pd.DataFrame({
'ID' : list(chain.from_iterable([['A'] * 365, ['B'] * 365, ['C'] * 365])),
'Date' : pd.date_range(start = '2018-01-01', end = '2018-12-31').tolist() + pd.date_range(start = '2018-01-01', end = '2018-12-31').tolist() + pd.date_range(start = '2018-01-01', end = '2018-12-31').tolist(),
'Value' : np.random.randn(365 * 3)
})
df_1 = df_1.sample(5, random_state = 123)
Thanks for the advice!
You can create another DataFrame with the dates shifted back one month, join the two with concat, set a DatetimeIndex, and then use groupby with resample('D') to add every day in between:
df_2 = df_1.assign(Date = df_1['Date'] - pd.DateOffset(months=1) + pd.DateOffset(days=1),
Value = np.nan)
df = (pd.concat([df_2, df_1], sort=False)
.reset_index()
.set_index('Date')
.groupby('index', sort=False)
.resample('D')
.ffill()
.reset_index(level=1)
.drop(columns='index')
.rename_axis(None))
print (df)
Date ID Value
783 2018-01-24 C NaN
783 2018-01-25 C NaN
783 2018-01-26 C NaN
783 2018-01-27 C NaN
783 2018-01-28 C NaN
.. ... .. ...
578 2018-07-29 B NaN
578 2018-07-30 B NaN
578 2018-07-31 B NaN
578 2018-08-01 B NaN
578 2018-08-02 B 0.562684
[155 rows x 3 columns]
Another solution uses a list comprehension with concat; at the end, the index and ID columns need to be back-filled. This works only if there are no missing values in the original ID column:
offset = pd.DateOffset(months=1) + pd.DateOffset(days=1)
df=pd.concat([df_1.iloc[[i]].reset_index().set_index('Date').reindex(pd.date_range(d-offset,d))
for i, d in enumerate(df_1['Date'])], sort=False)
df = (df.assign(index = df['index'].bfill().astype(int), ID = df['ID'].bfill())
.rename_axis('Date')
.reset_index()
.set_index('index')
.rename_axis(None)
)
print (df)
Date ID Value
783 2018-01-24 C NaN
783 2018-01-25 C NaN
783 2018-01-26 C NaN
783 2018-01-27 C NaN
783 2018-01-28 C NaN
.. ... .. ...
578 2018-07-29 B NaN
578 2018-07-30 B NaN
578 2018-07-31 B NaN
578 2018-08-01 B NaN
578 2018-08-02 B 1.224345
[155 rows x 3 columns]
We can create a date range in the "Date" column, then explode it.
Then group the "Value" column by the index and set values to nan but the last.
Finally reset the index.
def drange(t):
return pd.date_range( t-pd.DateOffset(months=1)+pd.DateOffset(days=1),t,freq="D",normalize=True)
df["Date"]= df["Date"].transform(drange)
ID Date Value
index
783 C DatetimeIndex(['2018-01-24', '2018-01-25', '20... 0.704
580 B DatetimeIndex(['2018-07-05', '2018-07-06', '20... -1.189
221 A DatetimeIndex(['2018-07-11', '2018-07-12', '20... -0.788
228 A DatetimeIndex(['2018-07-18', '2018-07-19', '20... 0.038
578 B DatetimeIndex(['2018-07-03', '2018-07-04', '20... 1.188
df= df.reset_index(drop=True).explode(column="Date")
ID Date Value
0 C 2018-01-24 0.704
0 C 2018-01-25 0.704
0 C 2018-01-26 0.704
0 C 2018-01-27 0.704
0 C 2018-01-28 0.704
.. .. ... ...
4 B 2018-07-29 1.188
4 B 2018-07-30 1.188
4 B 2018-07-31 1.188
4 B 2018-08-01 1.188
4 B 2018-08-02 1.188
df["Value"]= df.groupby(level=0)["Value"].transform(lambda v: [np.nan]*(len(v)-1)+[v.iloc[0]])
df= df.reset_index(drop=True)
ID Date Value
0 C 2018-01-24 NaN
1 C 2018-01-25 NaN
2 C 2018-01-26 NaN
3 C 2018-01-27 NaN
4 C 2018-01-28 NaN
.. .. ... ...
150 B 2018-07-29 NaN
151 B 2018-07-30 NaN
152 B 2018-07-31 NaN
153 B 2018-08-01 NaN
154 B 2018-08-02 1.188

Pivoting DataFrame with multiple columns for the index

I have a dataframe and I want to pivot only some of the rows into columns.
This is what I have now.
Entity Name Date Value
0 111 Name1 2018-03-31 100
1 111 Name2 2018-02-28 200
2 222 Name3 2018-02-28 1000
3 333 Name1 2018-01-31 2000
I want to turn the dates into columns and fill them with the values. Something like this:
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
The same Name can appear under two different Entity values. Here is an updated dataset.
Code:
import pandas as pd
import datetime
data1 = {
'Entity': [111,111,222,333],
'Name': ['Name1','Name2', 'Name3','Name1'],
'Date': [datetime.date(2018,3, 31), datetime.date(2018,2,28), datetime.date(2018,2,28), datetime.date(2018,1,31)],
'Value': [100,200,1000,2000]
}
df1 = pd.DataFrame(data1, columns= ['Entity','Name','Date', 'Value'])
How do I achieve this? Any pointers? Thanks all.
Based on your update, you'd need pivot_table with two index columns -
v = df1.pivot_table(
index=['Entity', 'Name'],
columns='Date',
values='Value'
).reset_index()
v.index.name = v.columns.name = None
v
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
Alternatively, with unstack:
df1.set_index(['Entity','Name','Date']).Value.unstack().reset_index()
Date Entity Name 2018-01-31 00:00:00 2018-02-28 00:00:00 \
0 111 Name1 NaN NaN
1 111 Name2 NaN 200.0
2 222 Name3 NaN 1000.0
3 333 Name1 2000.0 NaN
Date 2018-03-31 00:00:00
0 100.0
1 NaN
2 NaN
3 NaN
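One difference between the two approaches is worth keeping in mind: `pivot_table` aggregates duplicate index/column combinations (using the mean by default), while `unstack` and `pivot` raise an error when duplicates exist. A small sketch with made-up rows, using string dates for brevity, shows the aggregation and how `aggfunc` changes it; `averaged` and `summed` are illustrative names.

```python
import pandas as pd

# Illustrative data with a duplicated (Entity, Name, Date) combination.
df = pd.DataFrame({
    "Entity": [111, 111, 222],
    "Name": ["Name1", "Name1", "Name3"],
    "Date": ["2018-03-31", "2018-03-31", "2018-02-28"],
    "Value": [100, 300, 1000],
})

# Default aggregation is the mean of the duplicates.
averaged = df.pivot_table(index=["Entity", "Name"], columns="Date", values="Value")

# Pass aggfunc explicitly to sum the duplicates instead.
summed = df.pivot_table(
    index=["Entity", "Name"], columns="Date", values="Value", aggfunc="sum"
)

print(averaged)
print(summed)
```

If duplicates are not expected in the data, `unstack` or `pivot` can act as a useful check, since they fail loudly instead of silently averaging.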
