Reshaping a pandas DataFrame using pivot with multiple columns as the index - python

I have a DataFrame like this (sample):
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V2 B2 Clearing C2 104457.22
5 V2 B2 Invoice C2 -400073.56
6 V2 B2 Payment C2 297856.45
7 V3 B3 Clearing C3 1989462.95
8 V3 B3 CreditMemo C3 538.95
9 V3 B3 CustomerPayment_Difference C3 2112329.00
10 V3 B3 Invoice C3 -4066485.69
11 V4 B4 Clearing C4 -123946.13
12 V4 B4 CreditMemo C4 127624.66
13 V4 B4 Accounting C4 424774.52
14 V4 B4 Invoice C4 -40446521.41
15 V4 B4 Payment C4 44441419.95
I want to reshape this data frame like below:
A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
V1 B1 C1 NaN 1538884.46 NaN 13537679.7
V2 B2 C2 NaN 104457.22 NaN NaN
V3 B3 C3 NaN 1989462.95 538.95 2112329.0
V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
So far I have tried using pivot:
df.pivot(index='A',columns='C', values='E').reset_index()
It gives a result like this:
C A Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 NaN 1538884.46 NaN 13537679.7
1 V2 NaN 104457.22 NaN NaN
2 V3 NaN 1989462.95 538.95 2112329.0
3 V4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
The table above leaves out the B and D columns; I need those columns as well.
I have provided this sample data for simplicity, but in the future the data may also look like this:
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B2 Clearing C1 88.9
5 V1 B2 Clearing C2 79.9
In this situation my code will raise a duplicate index error.
To fix these two problems I need to specify A, B and D as the index.
I need code similar to this:
df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()
but this code raises an error.
How can I solve this? How can I provide multiple columns as the index in a pandas pivot?

I think you need:
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 B1 C1 NaN 1538884.46 NaN 13537679.7
1 V2 B2 C2 NaN 104457.22 NaN NaN
2 V3 B3 C3 NaN 1989462.95 538.95 2112329.0
3 V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
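As a side note, newer pandas releases (1.1 and later, as far as I know) let DataFrame.pivot take a list for index directly, so the call from the question may work unchanged there. A minimal sketch on a shortened version of the sample data; the name wide is only for illustration:

```python
import pandas as pd

# Shortened version of the sample data from the question
df = pd.DataFrame({
    'A': ['V1', 'V1', 'V2', 'V2'],
    'B': ['B1', 'B1', 'B2', 'B2'],
    'C': ['Clearing', 'Invoice', 'Clearing', 'Invoice'],
    'D': ['C1', 'C1', 'C2', 'C2'],
    'E': [1538884.46, -15771005.81, 104457.22, -400073.56],
})

# Assumes pandas >= 1.1, where index may be a list of column names
wide = df.pivot(index=['A', 'B', 'D'], columns='C', values='E').reset_index()
print(wide)
```

Like set_index + unstack, this still raises if the combination of A, B, D and C is not unique.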
Another solution is to use pivot_table:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
But it aggregates if there are duplicates in the A, B, C, D columns. The first solution raises an error if there are duplicates:
print (df)
A B C D E
0 V1 B1 Clearing C1 3000.00 <-V1,B1,Clearing,C1
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B1 Clearing C1 1000.00 <-V1,B1,Clearing,C1
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
ValueError: Index contains duplicate entries, cannot reshape
But pivot_table aggregates (by default it takes the mean):
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)
C Clearing CustomerPayment_Difference Invoice PaymentDifference
A B D
V1 B1 C1 2000.0 13537679.7 -15771005.81 0.0
So the question is: is it a good idea to always use pivot_table?
In my opinion it depends on whether you need to care about duplicates. If you use pivot or set_index + unstack you get an error, so you know about the dupes; pivot_table always aggregates, so duplicates pass silently.
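One way to get the best of both is to check for duplicates explicitly before choosing. This is only a sketch: the names keys, dupes and wide are made up for illustration, and aggfunc='sum' is an arbitrary choice (pivot_table uses the mean unless told otherwise).

```python
import pandas as pd

# Small frame with a duplicated (A, B, D, C) combination, as in the example above
df = pd.DataFrame({
    'A': ['V1', 'V1', 'V1'],
    'B': ['B1', 'B1', 'B1'],
    'C': ['Clearing', 'Invoice', 'Clearing'],
    'D': ['C1', 'C1', 'C1'],
    'E': [3000.0, -15771005.81, 1000.0],
})

keys = ['A', 'B', 'D', 'C']

# keep=False marks every row that takes part in a duplicate, not only the repeats
dupes = df[df.duplicated(subset=keys, keep=False)]

if dupes.empty:
    # No duplicates: a plain reshape is safe
    wide = df.set_index(keys)['E'].unstack().reset_index()
else:
    # Duplicates found: aggregate deliberately (sum is an assumption here)
    print(dupes)
    wide = df.pivot_table(index=['A', 'B', 'D'], columns='C',
                          values='E', aggfunc='sum').reset_index()

print(wide)
```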

Related

Pandas concat with different columns

I have two dfs that I want to concat
(sorry I don't know how to properly recreate a df here)
A B
a1 b1
a2 b2
a3 b3
A C
a1 c1
a4 c4
Result:
A B C
a1 b1 c1
a2 b2 NaN
a3 b3 NaN
a4 NaN c4
I have tried:
merge = pd.concat([df1,df2],axis = 0,ignore_index= True)
but this seems to just append the second df to the first df
Thank you!
I believe you need an outer join:
>>> pd.merge(df1,df2,how='outer')
A B C
0 a1 b1 c1
1 a2 b2 NaN
2 a3 b3 NaN
3 a4 NaN c4
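To see the difference between the two calls side by side, here is a small sketch that rebuilds the frames from the question (the names merged and stacked are only for illustration). concat with axis=0 stacks rows, while an outer merge aligns rows on the shared column A:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']})
df2 = pd.DataFrame({'A': ['a1', 'a4'], 'C': ['c1', 'c4']})

# Outer join on the shared key: one row per distinct value of A
merged = pd.merge(df1, df2, on='A', how='outer')

# Row-wise concat: simply appends df2 below df1, so a1 appears twice
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

print(merged)
print(stacked)
```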

Pandas: Reducing multi-index to highest available level

I have the following type of dataframe, with values grouped by 3 different categories A, B, C:
import pandas as pd
A = ['A1', 'A2', 'A3', 'A2', 'A1']
B = ['B3', 'B2', 'B2', 'B1', 'B3']
C = ['C2', 'C2', 'C3', 'C1', 'C3']
value = ['6','2','3','3','5']
df = pd.DataFrame({'categA': A,'categB': B, 'categC': C, 'value': value})
df
Which looks like:
categA categB categC value
0 A1 B3 C2 6
1 A2 B2 C2 2
2 A3 B2 C3 3
3 A2 B1 C1 3
4 A1 B3 C3 5
Now, when I want to unstack this df by the C category, .unstack() returns some multi-indexed dataframe with 'value' at the first level and my categories of interest C1, C2 & C3 at the second level:
df = df.set_index(['categA','categB','categC']).unstack('categC')
df
Output:
value
categC C1 C2 C3
categA categB
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Is there a quick and clean way to get rid of the multi-index by reducing it to the highest available level? This is what I'd like to have as output:
categA categB C1 C2 C3
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Many thanks in advance!
Edit:
print(df.reset_index())
gives:
categA categB value
categC C1 C2 C3
0 A1 B3 NaN 6 5
1 A2 B1 3 NaN NaN
2 A2 B2 NaN 2 NaN
3 A3 B2 NaN NaN 3
Select the value column as a Series before unstacking, then add reset_index:
df.set_index(['categA','categB','categC']).value.unstack('categC').reset_index()
Out[875]:
categC categA categB C1 C2 C3
0 A1 B3 None 6 5
1 A2 B1 3 None None
2 A2 B2 None 2 None
3 A3 B2 None None 3
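If the frame has already been unstacked with the two-level columns, another option is to drop the outer 'value' level afterwards. A minimal sketch, assuming the same data as in the question; wide and flat are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'categA': ['A1', 'A2', 'A3', 'A2', 'A1'],
    'categB': ['B3', 'B2', 'B2', 'B1', 'B3'],
    'categC': ['C2', 'C2', 'C3', 'C1', 'C3'],
    'value': ['6', '2', '3', '3', '5'],
})

# Unstacking a DataFrame (not a Series) yields two column levels: 'value' and categC
wide = df.set_index(['categA', 'categB', 'categC']).unstack('categC')

# Drop the outer level so only C1, C2, C3 remain as column labels
wide.columns = wide.columns.droplevel(0)

flat = wide.reset_index()
flat.columns.name = None  # remove the leftover 'categC' label on the columns
print(flat)
```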

Merging two data frames with a key where a column becomes row and the other one doesn't

I have a data frame df1 like the following:
A B key
a1 b1 001A
a2 b2 4906
a3 b3 0190
a4 b4 1993
and another data frame df2 like:
C D key
c1 1 001A
c1 2 4906
c1 3 0190
c1 4 1993
c2 5 001A
c2 6 4906
c2 7 0190
c2 8 1993
I would like to merge them to get
A B key c1 c2
a1 b1 001A 1 5
a2 b2 4906 2 6
a3 b3 0190 3 7
a4 b4 1993 4 8
I have tried
pd.merge(df1, df2, on='key')
but it isn't matching like I want. I can't seem to get the rows as columns.
You should first pivot your df2 to get it into the shape you want.
df2.pivot(index='key', columns='C', values='D')
C c1 c2
key
001A 1 5
0190 3 7
1993 4 8
4906 2 6
Then, you can join this pivot table to your df1.
df1.join(df2.pivot(index='key', columns='C', values='D'), on='key')
A B key c1 c2
0 a1 b1 001A 1 5
1 a2 b2 4906 2 6
2 a3 b3 0190 3 7
3 a4 b4 1993 4 8
Or, if you prefer, use pd.merge, although it's more verbose.
pd.merge(df1, df2.pivot(index='key', columns='C', values='D'),
         left_on='key', right_index=True)
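Put together as a runnable sketch, with the frames rebuilt from the question. The keys are assumed to be stored as strings (values such as '001A' and '0190' would not survive as numbers); pivoted and merged are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': ['b1', 'b2', 'b3', 'b4'],
    'key': ['001A', '4906', '0190', '1993'],
})
df2 = pd.DataFrame({
    'C': ['c1'] * 4 + ['c2'] * 4,
    'D': [1, 2, 3, 4, 5, 6, 7, 8],
    'key': ['001A', '4906', '0190', '1993'] * 2,
})

# Long to wide: one row per key, one column per distinct value of C
pivoted = df2.pivot(index='key', columns='C', values='D')

# Look up each df1 row's key in the index of the pivoted frame
merged = df1.join(pivoted, on='key')
print(merged)
```

join keeps the row order of df1, which is why the result is not sorted by key the way the pivoted frame is.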

Printing a double group by pandas dataframe as a 2D array

I want to display the result of a single-value aggregation with two group-bys as a table,
such that
df.groupby(['colA', 'colB']).size()
Would yield:
B1 B2 B3 B4
A1 s11 s12 s13 ..
A2 s21 s22 s23 ..
A3 s31 s32 s33 ..
A4 .. .. .. s44
What's a quick and easy way of doing this?
EDIT: here's an example. I have the logins of all users, and I want to display the number of logins (=rows) for each user and day
Day,User
1,John
1,John
1,Ben
1,Sarah
2,Ben
2,Sarah
2,Sarah
Should yield:
D\U John Ben Sarah
1 2 1 1
2 0 1 2
Use:
df.groupby(['colA', 'colB']).size().unstack()
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.transpose([np.random.choice(['B1','B2','B3'], size=10),
                                np.random.choice(['A1','A2','A3'], size=10)]),
                  columns=['A','B'])
df
A B
0 B3 A1
1 B1 A2
2 B3 A3
3 B1 A3
4 B2 A2
5 B3 A3
6 B3 A1
7 B2 A1
8 B1 A3
9 B3 A3
Now:
df.groupby(['A','B']).size().unstack()
B A1 A2 A3
A
B1 NaN 1.0 2.0
B2 1.0 1.0 NaN
B3 2.0 NaN 3.0
Update now that your post has data:
df.groupby(['Day','User']).size().unstack().fillna(0)
User Ben John Sarah
Day
1 1.0 2.0 1.0
2 1.0 0.0 2.0
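If the float counts are a nuisance, passing fill_value=0 to unstack should keep them as integers, and pd.crosstab gives the same table in one call. A small sketch using the login data from the question; logins, counts and table are illustrative names:

```python
import pandas as pd

logins = pd.DataFrame({
    'Day': [1, 1, 1, 1, 2, 2, 2],
    'User': ['John', 'John', 'Ben', 'Sarah', 'Ben', 'Sarah', 'Sarah'],
})

# Missing (Day, User) pairs are filled with 0 during the reshape,
# so no NaN is introduced and the counts stay integers
counts = logins.groupby(['Day', 'User']).size().unstack(fill_value=0)

# crosstab counts co-occurrences of the two columns directly
table = pd.crosstab(logins['Day'], logins['User'])

print(counts)
print(table)
```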

Copying values into Nan fields

I have a large DataFrame in the format below:
If any cell is "NaN", I want to copy the value from the cell immediately above it, so that my DataFrame looks like this:
If the first row has "NaN", I'll have to leave it as is.
Can someone please help me with this?
This looks like pandas; if so, you need to call ffill:
In [72]:
df = pd.DataFrame({'A':['A0','A1','A2',np.nan,np.nan, 'A3'], 'B':['B0','B1','B2',np.nan,np.nan, 'B3'], 'C':['C0','C1','C2',np.nan,np.nan, 'C3']})
df
Out[72]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 NaN NaN NaN
4 NaN NaN NaN
5 A3 B3 C3
In [73]:
df.ffill()
Out[73]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A2 B2 C2
4 A2 B2 C2
5 A3 B3 C3
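On the first-row case from the question: ffill only copies downwards, so a NaN in the first row has nothing above it and stays NaN. A tiny sketch with made-up data to show this (filled is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 'A1', np.nan],   # NaN in the first row
    'B': ['B0', np.nan, np.nan],   # two consecutive NaNs
})

# Each NaN takes the last valid value above it in the same column
filled = df.ffill()
print(filled)
```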
