Pivoting DataFrame with multiple columns for the index - python

I have a dataframe and I want to transpose only a few rows into columns.
This is what I have now.
Entity Name Date Value
0 111 Name1 2018-03-31 100
1 111 Name2 2018-02-28 200
2 222 Name3 2018-02-28 1000
3 333 Name1 2018-01-31 2000
I want to create date as the column and then add value. Something like this:
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
I can have an identical Name for two different Entities. Here is an updated dataset.
Code:
import pandas as pd
import datetime
data1 = {
'Entity': [111,111,222,333],
'Name': ['Name1','Name2', 'Name3','Name1'],
'Date': [datetime.date(2018,3, 31), datetime.date(2018,2,28), datetime.date(2018,2,28), datetime.date(2018,1,31)],
'Value': [100,200,1000,2000]
}
df1 = pd.DataFrame(data1, columns= ['Entity','Name','Date', 'Value'])
How do I achieve this? Any pointers? Thanks all.

Based on your update, you'd need pivot_table with two index columns -
v = df1.pivot_table(
index=['Entity', 'Name'],
columns='Date',
values='Value'
).reset_index()
v.index.name = v.columns.name = None
v
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
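One caveat worth keeping in mind (not specific to the answer above, just pandas' documented default): pivot_table aggregates duplicate (Entity, Name, Date) combinations with aggfunc='mean', so repeated rows get silently averaged. If duplicates should be summed instead, and the NaN holes filled, something like this sketch works:

```python
import datetime
import pandas as pd

df1 = pd.DataFrame({
    'Entity': [111, 111, 222, 333],
    'Name': ['Name1', 'Name2', 'Name3', 'Name1'],
    'Date': [datetime.date(2018, 3, 31), datetime.date(2018, 2, 28),
             datetime.date(2018, 2, 28), datetime.date(2018, 1, 31)],
    'Value': [100, 200, 1000, 2000],
})

# aggfunc='sum' sums duplicate combinations; fill_value=0 replaces the NaN holes
v = df1.pivot_table(index=['Entity', 'Name'], columns='Date',
                    values='Value', aggfunc='sum', fill_value=0).reset_index()
v.columns.name = None
print(v)
```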

Using unstack:
df1.set_index(['Entity','Name','Date']).Value.unstack().reset_index()
Date Entity Name 2018-01-31 00:00:00 2018-02-28 00:00:00 \
0 111 Name1 NaN NaN
1 111 Name2 NaN 200.0
2 222 Name3 NaN 1000.0
3 333 Name1 2000.0 NaN
Date 2018-03-31 00:00:00
0 100.0
1 NaN
2 NaN
3 NaN
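The trailing `00:00:00` in those column labels shows up when the Date values are full timestamps. One way to get clean date-string labels instead (a sketch, assuming the df1 constructor from the question) is to reformat the columns after unstacking:

```python
import datetime
import pandas as pd

df1 = pd.DataFrame({
    'Entity': [111, 111, 222, 333],
    'Name': ['Name1', 'Name2', 'Name3', 'Name1'],
    'Date': [datetime.date(2018, 3, 31), datetime.date(2018, 2, 28),
             datetime.date(2018, 2, 28), datetime.date(2018, 1, 31)],
    'Value': [100, 200, 1000, 2000],
})

out = df1.set_index(['Entity', 'Name', 'Date']).Value.unstack().reset_index()
# Format the date columns as plain 'YYYY-MM-DD' strings, leaving Entity/Name alone
out.columns = [c.strftime('%Y-%m-%d') if isinstance(c, datetime.date) else c
               for c in out.columns]
print(out)
```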

Related

Transpose and Compare

I'm attempting to compare two data frames. Item and Summary variables correspond to various dates and quantities. I'd like to transpose the dates into one column of data along with the associated quantities. I'd then like to compare the two data frames and see what changed from PreviousData to CurrentData.
Previous Data:
import numpy as np
import pandas as pd

PreviousData = { 'Item' : ['abc','def','ghi','jkl','mno','pqr','stu','vwx','yza','uaza','fupa'],
'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
'2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan,np.nan,2],
'2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
'2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
'2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
'2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
}
PreviousData = pd.DataFrame(PreviousData)
PreviousData
Current Data:
CurrentData = { 'Item' : ['ghi','stu','abc','mno','jkl','pqr','def','vwx','yza'],
'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
'2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
'2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
'2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
'2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
}
CurrentData = pd.DataFrame(CurrentData)
CurrentData
How to transpose and compare these two sets?
One way of doing this is the following. Transpose both dataframes:
PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"],
var_name="Date",
value_name="value1")
which is
Item Summary Date value1
0 abc party 2022-01-01 1.0
1 def weekend 2022-01-01 NaN
2 ghi food 2022-01-01 NaN
3 jkl school 2022-01-01 1.0
4 mno tv 2022-01-01 NaN
.. ... ... ... ...
226 stu camera 2023-09-01 NaN
227 vwx python 2023-09-01 1.0
228 yza r 2023-09-01 NaN
229 uaza rstudio 2023-09-01 NaN
230 fupa spyder 2023-09-01 NaN
and
CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"],
var_name="Date",
value_name="value2")
Item Summary Date value2
0 ghi food 2022-01-01 3.0
1 stu camera 2022-01-01 NaN
2 abc party 2022-01-01 NaN
3 mno tv 2022-01-01 1.0
4 jkl school 2022-01-01 NaN
.. ... ... ... ...
139 jkl school 2023-04-01 NaN
140 pqr photo 2023-04-01 2.0
141 def weekend 2023-04-01 NaN
142 vwx python 2023-04-01 NaN
143 yza r 2023-04-01 2.0
[144 rows x 4 columns]
Then merge:
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'], how = 'left')
Item Summary Date value1 value2
0 abc party 2022-01-01 1.0 NaN
1 def weekend 2022-01-01 NaN NaN
2 ghi food 2022-01-01 NaN 3.0
3 jkl school 2022-01-01 1.0 NaN
4 mno tv 2022-01-01 NaN 1.0
.. ... ... ... ... ...
226 stu camera 2023-09-01 NaN NaN
227 vwx python 2023-09-01 1.0 NaN
228 yza r 2023-09-01 NaN NaN
229 uaza rstudio 2023-09-01 NaN NaN
230 fupa spyder 2023-09-01 NaN NaN
[231 rows x 5 columns]
and compare by creating a column marking differences
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
Item Summary Date value1 value2 diff
0 abc party 2022-01-01 1.0 NaN 1
1 def weekend 2022-01-01 NaN NaN 1
2 ghi food 2022-01-01 NaN 3.0 1
3 jkl school 2022-01-01 1.0 NaN 1
4 mno tv 2022-01-01 NaN 1.0 1
.. ... ... ... ... ... ...
226 stu camera 2023-09-01 NaN NaN 1
227 vwx python 2023-09-01 1.0 NaN 1
228 yza r 2023-09-01 NaN NaN 1
229 uaza rstudio 2023-09-01 NaN NaN 1
230 fupa spyder 2023-09-01 NaN NaN 1
[231 rows x 6 columns]
If you only want to compare those entries that are common to both, do this:
Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'])
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
Item Summary Date value1 value2 diff
0 abc party 2022-01-01 1.0 NaN 1
1 def weekend 2022-01-01 NaN NaN 1
2 ghi food 2022-01-01 NaN 3.0 1
3 jkl school 2022-01-01 1.0 NaN 1
4 mno tv 2022-01-01 NaN 1.0 1
.. ... ... ... ... ... ...
139 mno tv 2023-04-01 NaN NaN 1
140 pqr photo 2023-04-01 NaN 2.0 1
141 stu camera 2023-04-01 NaN NaN 1
142 vwx python 2023-04-01 1.0 NaN 1
143 yza r 2023-04-01 NaN 2.0 1
[144 rows x 6 columns]
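One caveat with the `np.where(value1 != value2, 1, 0)` comparison above: in numpy/pandas, `NaN != NaN` evaluates to True, so every row where both sides are missing also gets flagged (visible in the outputs, where the all-NaN rows all show diff 1). A sketch of a stricter comparison that treats two NaNs as equal, on a small made-up frame:

```python
import numpy as np
import pandas as pd

cmp = pd.DataFrame({
    'value1': [1.0, np.nan, np.nan, 2.0],
    'value2': [1.0, np.nan, 3.0, 5.0],
})

# Flag a difference only if the values differ AND they are not both NaN
both_nan = cmp['value1'].isna() & cmp['value2'].isna()
cmp['diff'] = np.where((cmp['value1'] != cmp['value2']) & ~both_nan, 1, 0)
print(cmp)
```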

How to sort columns except index column in a data frame in python after pivot

So I have a data frame
import numpy as np
import pandas as pd

testdf = pd.DataFrame({"loc": ["ab12","bc12","cd12","ab12","bc13","cd12"],
                       "months": ["Jun21","Jun21","July21","July21","Aug21","Aug21"],
                       "dept": ["dep1","dep2","dep3","dep2","dep1","dep3"],
                       "count": [15, 16, 15, 92, 90, 2]})
That looks like this:
When I pivot it,
df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
df.columns = df.columns.droplevel(0)
df
it looks like this:
I am looking for a sort function which will sort only the months columns in sequence and not the first 2 columns i.e loc & dept.
when I try this:
df.sort_values(by = ['Jun21'],ascending = False, inplace = True, axis = 1, ignore_index=True)[2:]
it gives me an error.
I want the columns to be in sequence Jun21, Jul21, Aug21
I am looking for something which will make it dynamic and I wont need to manually change the sequence when the month changes.
Any hint will be really appreciated.
It is quite simple using groupby:
df = testdf.groupby(['loc', 'dept', 'months']).sum().unstack(level=2)
df = df.reindex(['Jun21', 'July21', 'Aug21'], axis=1, level=1)
Output
count
months Jun21 July21 Aug21
loc dept
ab12 dep1 15.0 NaN NaN
dep2 NaN 92.0 NaN
bc12 dep2 16.0 NaN NaN
bc13 dep1 NaN NaN 90.0
cd12 dep3 NaN 15.0 2.0
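Since the question asks for the ordering to stay dynamic, the hardcoded ['Jun21', 'July21', 'Aug21'] list in the reindex can instead be derived by parsing the labels as dates and sorting. A sketch (note the question's data spells July as 'July21', which the '%b%y' abbreviation format cannot parse, so a second full-name pass with '%B%y' is tried):

```python
import pandas as pd

def parse_month(label):
    # '%b%y' handles 'Jun21' / 'Aug21'; '%B%y' catches the full spelling 'July21'
    for fmt in ('%b%y', '%B%y'):
        try:
            return pd.to_datetime(label, format=fmt)
        except ValueError:
            pass
    return pd.NaT

months = ['Aug21', 'Jun21', 'July21']   # arbitrary incoming order
order = sorted(months, key=parse_month)
print(order)
```

The resulting `order` list can then be passed to `df.reindex(order, axis=1, level=1)` in place of the hardcoded list.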
We can start by converting the months column to datetime, like so:
>>> testdf.months = (pd.to_datetime(testdf.months, format="%b%y", errors='coerce'))
>>> testdf
loc months dept count
0 ab12 2021-06-01 dep1 15
1 bc12 2021-06-01 dep2 16
2 cd12 2021-07-01 dep3 15
3 ab12 2021-07-01 dep2 92
4 bc13 2021-08-01 dep1 90
5 cd12 2021-08-01 dep3 2
Then, we apply your code to get the pivot:
>>> df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
>>> df.columns = df.columns.droplevel(0)
>>> df
months NaT NaT 2021-06-01 2021-07-01 2021-08-01
0 ab12 dep1 15.0 NaN NaN
1 ab12 dep2 NaN 92.0 NaN
2 bc12 dep2 16.0 NaN NaN
3 bc13 dep1 NaN NaN 90.0
4 cd12 dep3 NaN 15.0 2.0
And to finish, we can reformat the column names using strftime to get the expected result:
>>> df.columns = df.columns.map(lambda t: t.strftime('%b%y') if pd.notnull(t) else '')
>>> df
months Jun21 Jul21 Aug21
0 ab12 dep1 15.0 NaN NaN
1 ab12 dep2 NaN 92.0 NaN
2 bc12 dep2 16.0 NaN NaN
3 bc13 dep1 NaN NaN 90.0
4 cd12 dep3 NaN 15.0 2.0

Flip and shift multi-column data to the left in Pandas

Here's an Employee - Supervisor mapping data.
I'd like to flip and then shift whole column to left. Only data should be shifted to the left 1 time and the columns should be fixed. Could you tell me how can I do this?
Input: Bottom - Up approach
Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
123     234       456       678       789
234     456       678       789       NaN
456     678       789       NaN       NaN
678     789       NaN       NaN       NaN
789     NaN       NaN       NaN       NaN
Output: Top - Down approach
Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
123     789       678       456       234
234     789       678       456       NaN
456     789       678       NaN       NaN
678     789       NaN       NaN       NaN
789     NaN       NaN       NaN       NaN
Appreciate any kind of assistance
Try with fliplr:
# Get numpy structure
x = df.loc[:, 'Sup_1 ID':].to_numpy()
# flip left to right
a = np.fliplr(x)
# Overwrite not NaN values in x with not NaN in a
x[~np.isnan(x)] = a[~np.isnan(a)]
# Update DataFrame
df.loc[:, 'Sup_1 ID':] = x
df:
Emp_ID Sup_1 ID Sup_2 ID Sup_3 ID Sup_4 ID
0 123 789.0 678.0 456.0 234.0
1 234 789.0 678.0 456.0 NaN
2 456 789.0 678.0 NaN NaN
3 678 789.0 NaN NaN NaN
4 789 NaN NaN NaN NaN
DataFrame Constructor and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Emp_ID': [123, 234, 456, 678, 789],
'Sup_1 ID': [234.0, 456.0, 678.0, 789.0, np.nan],
'Sup_2 ID': [456.0, 678.0, 789.0, np.nan, np.nan],
'Sup_3 ID': [678.0, 789.0, np.nan, np.nan, np.nan],
'Sup_4 ID': [789.0, np.nan, np.nan, np.nan, np.nan]
})
In your case, try np.roll:
df = df.set_index('Emp_ID')
out = df.apply(lambda x : np.roll(x[x.notnull()].values,1)).apply(pd.Series)
0 1 2 3
Sup_1 ID 789.0 234.0 456.0 678.0
Sup_2 ID 789.0 456.0 678.0 NaN
Sup_3 ID 789.0 678.0 NaN NaN
Sup_4 ID 789.0 NaN NaN NaN
out.columns = df.columns
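A pandas-only alternative (a sketch, not from either answer above): reverse the non-null supervisor IDs within each row and pad the tail with NaN, which matches the desired top-down output directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Emp_ID': [123, 234, 456, 678, 789],
    'Sup_1 ID': [234.0, 456.0, 678.0, 789.0, np.nan],
    'Sup_2 ID': [456.0, 678.0, 789.0, np.nan, np.nan],
    'Sup_3 ID': [678.0, 789.0, np.nan, np.nan, np.nan],
    'Sup_4 ID': [789.0, np.nan, np.nan, np.nan, np.nan],
})

sup = df.loc[:, 'Sup_1 ID':]

def flip_row(row):
    vals = row.dropna().to_numpy()[::-1]                          # reverse the filled IDs
    padded = np.append(vals, [np.nan] * (len(row) - len(vals)))   # keep NaNs trailing
    return pd.Series(padded, index=row.index)

df.loc[:, 'Sup_1 ID':] = sup.apply(flip_row, axis=1)
print(df)
```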

Leading and Trailing Padding Dates in Pandas DataFrame

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# date field a datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates"? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', periods=12, freq='M')):
(Ideally I would want the month to be transposed by year across the top as columns)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex
s=df.groupby([df['account_id'],df.index.year,df.index.month]).sum()
idx=pd.MultiIndex.from_product([s.index.levels[0],s.index.levels[1],list(range(1,13))])
s=s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
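Since the question says "Ideally I would want the month to be transposed ... across the top as columns", the reindexed result above can be widened with one more unstack of the month level. A sketch building on the same approach (using a reconstruction of the sample records, since the question's `data` variable is not shown):

```python
import pandas as pd

# Reconstruction of the sample records from the question
data = [
    ('2018-01-01', 1, 100.0),
    ('2018-01-01', 1, 50.0),
    ('2018-06-01', 1, 200.0),
    ('2018-07-01', 2, 100.0),
    ('2018-10-01', 2, 200.0),
]
df = pd.DataFrame(data, columns=['date', 'account_id', 'amount'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

s = df.groupby([df['account_id'], df.index.year, df.index.month])['amount'].sum()
idx = pd.MultiIndex.from_product(
    [s.index.levels[0], s.index.levels[1], range(1, 13)],
    names=['account_id', 'year', 'month'])
s = s.reindex(idx)

# Pivot the month level across the top as columns
wide = s.unstack('month')
print(wide)
```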

Merge pandas df based on 2 keys

I have 2 df and I would like to merge them based on 2 keys - ID and date:
I following is just a small slice of the entire df
df_pw6
ID date pw10_0 pw50_0 pw90_0
0 153 2018-01-08 27.88590 43.2872 58.2024
0 2 2018-01-05 11.03610 21.4879 31.6997
0 506 2018-01-08 6.98468 25.3899 45.9486
df_ex
date ID measure f188 f187 f186 f185
0 2017-07-03 501 NaN 1 0.5 7 4.0
1 2017-07-03 502 NaN 0 2.5 5 3.0
2 2018-01-08 506 NaN 5 9.0 9 1.2
As you can see, only the third row has a match.
When I type:
#check date
df_ex.iloc[2,0]== df_pw6.iloc[1,1]
True
#check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6,df_ex,on=['date','ID'])
I get an empty df
When I try:
df19 = pd.merge(df_pw6,df_ex,how ='left',on=['date','ID'])
I get:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 153 2018-01-08 00:00:00 27.88590 43.2872 58.2024 NaN NaN NaN NaN NaN
1 2 2018-01-05 00:00:00 11.03610 21.4879 31.6997 NaN NaN NaN NaN NaN
2 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN NaN NaN NaN NaN
My desired result should be:
> ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
>
> 0 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
I ran your code after your edit, and I got the desired result.
import pandas as pd
# copy paste your first df by hand
pw = pd.read_clipboard()
# copy paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw,ex,on=['date','ID'])
# output [edited. now it is the correct result OP wanted.]
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
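For completeness: the usual reason element-wise comparisons succeed while the merge comes back empty is a mismatch in the key columns' element types, e.g. one frame's date column holding datetime.date objects and the other holding plain strings (both object dtype, so the merge runs but nothing matches). Normalizing both sides with pd.to_datetime before merging avoids this; a small sketch with made-up frames:

```python
import datetime
import pandas as pd

pw = pd.DataFrame({'ID': [506], 'date': [datetime.date(2018, 1, 8)], 'pw10_0': [6.98468]})
ex = pd.DataFrame({'ID': [506], 'date': ['2018-01-08'], 'f188': [5]})

# Both 'date' columns are object dtype, but date(2018, 1, 8) != '2018-01-08',
# so the merge silently returns nothing
empty = pw.merge(ex, on=['date', 'ID'])

# Normalize both key columns to datetime64 before merging
pw['date'] = pd.to_datetime(pw['date'])
ex['date'] = pd.to_datetime(ex['date'])
fixed = pw.merge(ex, on=['date', 'ID'])
print(len(empty), len(fixed))
```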
