Pivoting DataFrame with multiple columns for the index - python
I have a dataframe and I want to transpose only a few rows into columns.
This is what I have now.
Entity Name Date Value
0 111 Name1 2018-03-31 100
1 111 Name2 2018-02-28 200
2 222 Name3 2018-02-28 1000
3 333 Name1 2018-01-31 2000
I want to use the dates as columns, filled with the corresponding values. Something like this:
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
I can have an identical Name for two different Entities. Here is an updated dataset.
Code:
import pandas as pd
import datetime
data1 = {
'Entity': [111,111,222,333],
'Name': ['Name1','Name2', 'Name3','Name1'],
'Date': [datetime.date(2018,3, 31), datetime.date(2018,2,28), datetime.date(2018,2,28), datetime.date(2018,1,31)],
'Value': [100,200,1000,2000]
}
df1 = pd.DataFrame(data1, columns= ['Entity','Name','Date', 'Value'])
How do I achieve this? Any pointers? Thanks all.
Based on your update, you'd need pivot_table with two index columns -
v = df1.pivot_table(
index=['Entity', 'Name'],
columns='Date',
values='Value'
).reset_index()
v.index.name = v.columns.name = None
v
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
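Note that pivot_table aggregates when the same (Entity, Name, Date) combination appears more than once (it takes the mean by default; pass aggfunc='sum' or similar to change that). If every combination is known to be unique, a plain pivot gives the same table; a minimal sketch, not part of the original answer, assuming pandas 1.1+ where pivot accepts a list of index columns:
# unlike pivot_table, this raises ValueError if (Entity, Name, Date) has duplicates
v = df1.pivot(index=['Entity', 'Name'], columns='Date', values='Value').reset_index()
v.columns.name = None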
Or, using set_index + unstack:
df1.set_index(['Entity','Name','Date']).Value.unstack().reset_index()
Date Entity Name 2018-01-31 00:00:00 2018-02-28 00:00:00 \
0 111 Name1 NaN NaN
1 111 Name2 NaN 200.0
2 222 Name3 NaN 1000.0
3 333 Name1 2000.0 NaN
Date 2018-03-31 00:00:00
0 100.0
1 NaN
2 NaN
3 NaN
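The column labels in this unstack output are full timestamps. If plain dates are preferred, they can be reformatted afterwards; a small sketch, assuming the date labels expose strftime (the Entity and Name columns are left untouched):
out = df1.set_index(['Entity', 'Name', 'Date']).Value.unstack().reset_index()
# turn datetime-like column labels into plain date strings, leave 'Entity'/'Name' as-is
out.columns = [c.strftime('%Y-%m-%d') if hasattr(c, 'strftime') else c for c in out.columns]
out.columns.name = None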
Related
Transpose and Compare
I'm attempting to compare two data frames. Item and Summary variables correspond to various dates and quantities. I'd like to transpose the dates into one column of data along with the associated quantities. I'd then like to compare the two data frames and see what changed from PreviousData to CurrentData.
Previous Data:
import numpy as np
import pandas as pd

PreviousData = {
    'Item' : ['abc','def','ghi','jkl','mno','pqr','stu','vwx','yza','uaza','fupa'],
    'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
    '2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan, np.nan, 2],
    '2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
    '2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
    '2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
    '2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
    '2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
    '2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
    '2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
    '2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
    '2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
    '2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
    '2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
}
PreviousData = pd.DataFrame(PreviousData)
PreviousData
Current Data:
CurrentData = {
    'Item' : ['ghi','stu','abc','mno','jkl','pqr','def','vwx','yza'],
    'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
    '2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
    '2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
    '2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
    '2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
    '2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
    '2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
    '2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
    '2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
    '2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
    '2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
    '2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
    '2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
}
CurrentData = pd.DataFrame(CurrentData)
CurrentData
As requested, here's an example of a difference:
How to transpose and compare these two sets?
One way of doing this is the following. Transpose (melt) both dataframes:
PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"], var_name="Date", value_name="value1")
which is
     Item  Summary        Date  value1
0     abc    party  2022-01-01     1.0
1     def  weekend  2022-01-01     NaN
2     ghi     food  2022-01-01     NaN
3     jkl   school  2022-01-01     1.0
4     mno       tv  2022-01-01     NaN
..    ...      ...         ...     ...
226   stu   camera  2023-09-01     NaN
227   vwx   python  2023-09-01     1.0
228   yza        r  2023-09-01     NaN
229  uaza  rstudio  2023-09-01     NaN
230  fupa   spyder  2023-09-01     NaN
and
CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"], var_name="Date", value_name="value2")
    Item  Summary        Date  value2
0    ghi     food  2022-01-01     3.0
1    stu   camera  2022-01-01     NaN
2    abc    party  2022-01-01     NaN
3    mno       tv  2022-01-01     1.0
4    jkl   school  2022-01-01     NaN
..   ...      ...         ...     ...
139  jkl   school  2023-04-01     NaN
140  pqr    photo  2023-04-01     2.0
141  def  weekend  2023-04-01     NaN
142  vwx   python  2023-04-01     NaN
143  yza        r  2023-04-01     2.0
[144 rows x 4 columns]
Then merge:
Compare = PreviousData_t.merge(CurrentData_t, on=['Date','Item','Summary'], how='left')
     Item  Summary        Date  value1  value2
0     abc    party  2022-01-01     1.0     NaN
1     def  weekend  2022-01-01     NaN     NaN
2     ghi     food  2022-01-01     NaN     3.0
3     jkl   school  2022-01-01     1.0     NaN
4     mno       tv  2022-01-01     NaN     1.0
..    ...      ...         ...     ...     ...
226   stu   camera  2023-09-01     NaN     NaN
227   vwx   python  2023-09-01     1.0     NaN
228   yza        r  2023-09-01     NaN     NaN
229  uaza  rstudio  2023-09-01     NaN     NaN
230  fupa   spyder  2023-09-01     NaN     NaN
[231 rows x 5 columns]
and compare by creating a column marking differences:
Compare['diff'] = np.where(Compare['value1'] != Compare['value2'], 1, 0)
     Item  Summary        Date  value1  value2  diff
0     abc    party  2022-01-01     1.0     NaN     1
1     def  weekend  2022-01-01     NaN     NaN     1
2     ghi     food  2022-01-01     NaN     3.0     1
3     jkl   school  2022-01-01     1.0     NaN     1
4     mno       tv  2022-01-01     NaN     1.0     1
..    ...      ...         ...     ...     ...   ...
226   stu   camera  2023-09-01     NaN     NaN     1
227   vwx   python  2023-09-01     1.0     NaN     1
228   yza        r  2023-09-01     NaN     NaN     1
229  uaza  rstudio  2023-09-01     NaN     NaN     1
230  fupa   spyder  2023-09-01     NaN     NaN     1
[231 rows x 6 columns]
If you only want to compare those entries that are common to both, do this:
Compare = PreviousData_t.merge(CurrentData_t, on=['Date','Item','Summary'])
Compare['diff'] = np.where(Compare['value1'] != Compare['value2'], 1, 0)
    Item  Summary        Date  value1  value2  diff
0    abc    party  2022-01-01     1.0     NaN     1
1    def  weekend  2022-01-01     NaN     NaN     1
2    ghi     food  2022-01-01     NaN     3.0     1
3    jkl   school  2022-01-01     1.0     NaN     1
4    mno       tv  2022-01-01     NaN     1.0     1
..   ...      ...         ...     ...     ...   ...
139  mno       tv  2023-04-01     NaN     NaN     1
140  pqr    photo  2023-04-01     NaN     2.0     1
141  stu   camera  2023-04-01     NaN     NaN     1
142  vwx   python  2023-04-01     1.0     NaN     1
143  yza        r  2023-04-01     NaN     2.0     1
[144 rows x 6 columns]
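Note that != treats two NaNs as different, so every row where both frames are empty still gets diff = 1, as visible in the sample output. If you only want to flag rows where the quantities genuinely changed, one option is to compare with the NaNs filled; a small sketch, not part of the original answer, where the sentinel value 0 is an assumption (pick one that cannot occur in the real data):
# mark a difference only when the filled values disagree;
# rows where both sides are NaN now get diff = 0
Compare['diff'] = np.where(Compare['value1'].fillna(0) != Compare['value2'].fillna(0), 1, 0)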
How to sort columns except index column in a data frame in python after pivot
So I have a data frame
testdf = pd.DataFrame({"loc" : ["ab12","bc12","cd12","ab12","bc13","cd12"],
                       "months" : ["Jun21","Jun21","July21","July21","Aug21","Aug21"],
                       "dept" : ["dep1","dep2","dep3","dep2","dep1","dep3"],
                       "count": [15, 16, 15, 92, 90, 2]})
That looks like this:
When I pivot it,
df = pd.pivot_table(testdf, values=['count'], index=['loc','dept'], columns=['months'], aggfunc=np.sum).reset_index()
df.columns = df.columns.droplevel(0)
df
it looks like this:
I am looking for a sort function which will sort only the months columns in sequence and not the first 2 columns, i.e. loc & dept. When I try this:
df.sort_values(by=['Jun21'], ascending=False, inplace=True, axis=1, ignore_index=True)[2:]
it gives me an error. I want the columns to be in the sequence Jun21, Jul21, Aug21. I am looking for something which will make it dynamic, so I won't need to manually change the sequence when the month changes. Any hint will be really appreciated.
It is quite simple using groupby:
df = testdf.groupby(['loc', 'dept', 'months']).sum().unstack(level=2)
df = df.reindex(['Jun21', 'July21', 'Aug21'], axis=1, level=1)
Output
           count
months     Jun21 July21 Aug21
loc  dept
ab12 dep1   15.0    NaN   NaN
     dep2    NaN   92.0   NaN
bc12 dep2   16.0    NaN   NaN
bc13 dep1    NaN    NaN  90.0
cd12 dep3    NaN   15.0   2.0
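This hardcodes the month order. If you would rather derive it from the data, one way is to sort the labels by the dates they represent; a minimal sketch, not part of the original answer, assuming every label starts with the month name and ends with a two-digit year:
# 'July21' -> 'Jul21', then parse with %b%y to get a sortable date
month_order = sorted(testdf['months'].unique(),
                     key=lambda m: pd.to_datetime(m[:3] + m[-2:], format='%b%y'))
df = df.reindex(month_order, axis=1, level=1)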
We can start by converting the months column to datetime, like so:
>>> testdf.months = pd.to_datetime(testdf.months, format="%b%y", errors='coerce')
>>> testdf
    loc     months  dept  count
0  ab12 2021-06-01  dep1     15
1  bc12 2021-06-01  dep2     16
2  cd12 2021-07-01  dep3     15
3  ab12 2021-07-01  dep2     92
4  bc13 2021-08-01  dep1     90
5  cd12 2021-08-01  dep3      2
Then, we apply your code to get the pivot:
>>> df = pd.pivot_table(testdf, values=['count'], index=['loc','dept'], columns=['months'], aggfunc=np.sum).reset_index()
>>> df.columns = df.columns.droplevel(0)
>>> df
months   NaT   NaT  2021-06-01  2021-07-01  2021-08-01
0       ab12  dep1        15.0         NaN         NaN
1       ab12  dep2         NaN        92.0         NaN
2       bc12  dep2        16.0         NaN         NaN
3       bc13  dep1         NaN         NaN        90.0
4       cd12  dep3         NaN        15.0         2.0
And to finish we can reformat the column names using strftime to get the expected result:
>>> df.columns = df.columns.map(lambda t: t.strftime('%b%y') if pd.notnull(t) else '')
>>> df
months              Jun21  Jul21  Aug21
0       ab12  dep1   15.0    NaN    NaN
1       ab12  dep2    NaN   92.0    NaN
2       bc12  dep2   16.0    NaN    NaN
3       bc13  dep1    NaN    NaN   90.0
4       cd12  dep3    NaN   15.0    2.0
Flip and shift multi-column data to the left in Pandas
Here's some Employee - Supervisor mapping data. I'd like to flip the supervisor columns and then shift the data to the left so the values stay left-aligned. Only the data should be shifted; the columns themselves should stay fixed. Could you tell me how I can do this?
Input: Bottom - Up approach
Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
123     234       456       678       789
234     456       678       789       NaN
456     678       789       NaN       NaN
678     789       NaN       NaN       NaN
789     NaN       NaN       NaN       NaN
Output: Top - Down approach
Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
123     789       678       456       234
234     789       678       456       NaN
456     789       678       NaN       NaN
678     789       NaN       NaN       NaN
789     NaN       NaN       NaN       NaN
Appreciate any kind of assistance.
Try with fliplr:
# Get numpy structure
x = df.loc[:, 'Sup_1 ID':].to_numpy()
# flip left to right
a = np.fliplr(x)
# Overwrite not NaN values in x with not NaN in a
x[~np.isnan(x)] = a[~np.isnan(a)]
# Update DataFrame
df.loc[:, 'Sup_1 ID':] = x
df:
   Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
0     123     789.0     678.0     456.0     234.0
1     234     789.0     678.0     456.0       NaN
2     456     789.0     678.0       NaN       NaN
3     678     789.0       NaN       NaN       NaN
4     789       NaN       NaN       NaN       NaN
DataFrame constructor and imports:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Emp_ID': [123, 234, 456, 678, 789],
    'Sup_1 ID': [234.0, 456.0, 678.0, 789.0, np.nan],
    'Sup_2 ID': [456.0, 678.0, 789.0, np.nan, np.nan],
    'Sup_3 ID': [678.0, 789.0, np.nan, np.nan, np.nan],
    'Sup_4 ID': [789.0, np.nan, np.nan, np.nan, np.nan]
})
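The masked assignment is what keeps the result left-aligned: flipping moves each row's NaNs to the front, and writing the flipped non-NaN values back into the original non-NaN slots packs them on the left again. A row-wise pandas equivalent, shown only as a hedged alternative sketch and not part of the original answer, would be:
# reverse each row's non-null supervisor chain and reassign it to the
# leftmost supervisor columns; the remaining slots stay NaN
sup = df.loc[:, 'Sup_1 ID':]
df.loc[:, 'Sup_1 ID':] = sup.apply(
    lambda r: pd.Series(r.dropna().values[::-1], index=r.dropna().index),
    axis=1
)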
In your case try np.roll:
df = df.set_index('Emp_ID')
out = df.apply(lambda x: np.roll(x[x.notnull()].values, 1)).apply(pd.Series)
              0      1      2      3
Sup_1 ID  789.0  234.0  456.0  678.0
Sup_2 ID  789.0  456.0  678.0    NaN
Sup_3 ID  789.0  678.0    NaN    NaN
Sup_4 ID  789.0    NaN    NaN    NaN
out.columns = df.columns
Leading and Trailing Padding Dates in Pandas DataFrame
This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# the date field holds datetime.datetime values
            account_id  amount
date
2018-01-01           1   100.0
2018-01-01           1    50.0
2018-06-01           1   200.0
2018-07-01           2   100.0
2018-10-01           2   200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty" dates? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value. This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
                                 account_id  amount
account_id date date date
1          2018 1    2018-01-31          2   150.0
                6    2018-06-30          1   200.0
2          2018 7    2018-07-31          2   100.0
                10   2018-10-31          2   200.0
And this is what I want to achieve (basically reindex the data on date_range(start='2018-01-01', periods=12, freq='M')). Ideally I would want the months transposed by year across the top as columns.
                       amount
account_id Year Month
1          2018 1       150.0
                2         NaN
                3         NaN
                4         NaN
                5         NaN
                6       200.0
                ....
                12      200.0
2          2018 1         NaN
                ....
                7       100.0
                ....
                10      200.0
                ....
                12        NaN
One way is to reindex:
s = df.groupby([df['account_id'], df.index.year, df.index.month]).sum()
idx = pd.MultiIndex.from_product([s.index.levels[0], s.index.levels[1], list(range(1, 13))])
s = s.reindex(idx)
s
Out[287]:
           amount
1 2018 1    150.0
       2      NaN
       3      NaN
       4      NaN
       5      NaN
       6    200.0
       7      NaN
       8      NaN
       9      NaN
       10     NaN
       11     NaN
       12     NaN
2 2018 1      NaN
       2      NaN
       3      NaN
       4      NaN
       5      NaN
       6      NaN
       7    100.0
       8      NaN
       9      NaN
       10   200.0
       11     NaN
       12     NaN
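To get the layout mentioned at the end of the question (months across the top as columns), the reindexed result can be unstacked on the month level; a small sketch building on the answer above, not part of the original:
# move the innermost index level (month 1..12) into the columns;
# missing months remain NaN
wide = s['amount'].unstack(level=-1)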
Merge pandas df based on 2 keys
I have 2 df and I would like to merge them based on 2 keys - ID and date. The following is just a small slice of the entire df.
df_pw6
   ID        date    pw10_0   pw50_0   pw90_0
0  153  2018-01-08  27.88590  43.2872  58.2024
0    2  2018-01-05  11.03610  21.4879  31.6997
0  506  2018-01-08   6.98468  25.3899  45.9486
df_ex
         date   ID  measure  f188  f187  f186  f185
0  2017-07-03  501      NaN     1   0.5     7   4.0
1  2017-07-03  502      NaN     0   2.5     5   3.0
2  2018-01-08  506      NaN     5   9.0     9   1.2
As you can see, only the third row has a match. When I type:
# check date
df_ex.iloc[2,0] == df_pw6.iloc[1,1]
True
# check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6, df_ex, on=['date','ID'])
I get an empty df. When I try:
df19 = pd.merge(df_pw6, df_ex, how='left', on=['date','ID'])
I get:
    ID                 date    pw10_0   pw50_0   pw90_0  measure  f188  f187  f186  f185
0  153  2018-01-08 00:00:00  27.88590  43.2872  58.2024      NaN   NaN   NaN   NaN   NaN
1    2  2018-01-05 00:00:00  11.03610  21.4879  31.6997      NaN   NaN   NaN   NaN   NaN
2  506  2018-01-08 00:00:00   6.98468  25.3899  45.9486      NaN   NaN   NaN   NaN   NaN
My desired result should be:
    ID                 date   pw10_0   pw50_0   pw90_0  measure  f188  f187  f186  f185
0  506  2018-01-08 00:00:00  6.98468  25.3899  45.9486      NaN     5   9.0     9   1.2
I have run your code (as posted after your edit) and I did get the desired result.
import pandas as pd
# copy-paste your first df by hand
pw = pd.read_clipboard()
# copy-paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw, ex, on=['date','ID'])
# output [edited: now it is the correct result the OP wanted]
    ID        date   pw10_0   pw50_0   pw90_0  measure  f188  f187  f186  f185
0  506  2018-01-08  6.98468  25.3899  45.9486      NaN     5   9.0     9   1.2
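That the merge works after read_clipboard, which re-parses both frames with freshly inferred dtypes, suggests the empty result in the question comes from mismatched key dtypes, for example date stored as datetime64 in one frame but as a string or datetime.date in the other, or ID as int in one and str in the other. A hedged sketch for normalizing the keys before merging (column names follow the question; adjust as needed):
# make both key columns comparable before the merge
for d in (df_pw6, df_ex):
    d['date'] = pd.to_datetime(d['date'])   # same datetime64 dtype on both sides
    d['ID'] = d['ID'].astype(int)           # same integer dtype on both sides

df19 = pd.merge(df_pw6, df_ex, on=['date', 'ID'])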