I have a dataframe of the following type, with values grouped by three different categories A, B and C:
import pandas as pd
A = ['A1', 'A2', 'A3', 'A2', 'A1']
B = ['B3', 'B2', 'B2', 'B1', 'B3']
C = ['C2', 'C2', 'C3', 'C1', 'C3']
value = ['6','2','3','3','5']
df = pd.DataFrame({'categA': A,'categB': B, 'categC': C, 'value': value})
df
Which looks like:
categA categB categC value
0 A1 B3 C2 6
1 A2 B2 C2 2
2 A3 B2 C3 3
3 A2 B1 C1 3
4 A1 B3 C3 5
Now, when I want to unstack this df by the C category, .unstack() returns a multi-indexed dataframe with 'value' at the first level and my categories of interest C1, C2 & C3 at the second level:
df = df.set_index(['categA','categB','categC']).unstack('categC')
df
Output:
value
categC C1 C2 C3
categA categB
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Is there a quick and clean way to get rid of the multi-index by reducing it to the highest available level? This is what I'd like to have as output:
categA categB C1 C2 C3
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Many thanks in advance!
Edit:
print(df.reset_index())
gives:
categA categB value
categC C1 C2 C3
0 A1 B3 NaN 6 5
1 A2 B1 3 NaN NaN
2 A2 B2 NaN 2 NaN
3 A3 B2 NaN NaN 3
Add reset_index, and unstack on the value Series instead of the whole frame:
df.set_index(['categA','categB','categC']).value.unstack('categC').reset_index()
Out[875]:
categC categA categB C1 C2 C3
0 A1 B3 None 6 5
1 A2 B1 3 None None
2 A2 B2 None 2 None
3 A3 B2 None None 3
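An alternative sketch (my own variation, not from the original answer): since the column MultiIndex only carries the redundant 'value' level, you can drop that level directly with droplevel. Note that value is stored as integers here (an assumption; the question used strings) so missing entries show up as NaN rather than None:

```python
import pandas as pd

# Rebuild the example frame with numeric values
df = pd.DataFrame({'categA': ['A1', 'A2', 'A3', 'A2', 'A1'],
                   'categB': ['B3', 'B2', 'B2', 'B1', 'B3'],
                   'categC': ['C2', 'C2', 'C3', 'C1', 'C3'],
                   'value': [6, 2, 3, 3, 5]})

wide = df.set_index(['categA', 'categB', 'categC']).unstack('categC')
wide.columns = wide.columns.droplevel(0)  # drop the redundant 'value' level
wide = wide.reset_index()
print(wide)
```

This produces the same flat columns categA, categB, C1, C2, C3 as the Series approach above.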
Related
I have two dfs that I want to concat
(sorry I don't know how to properly recreate a df here)
A B
a1 b1
a2 b2
a3 b3
A C
a1 c1
a4 c4
Result:
A B C
a1 b1 c1
a2 b2 NaN
a3 b3 NaN
a4 NaN c4
I have tried:
merge = pd.concat([df1, df2], axis=0, ignore_index=True)
but this seems to just append the second df to the first df
Thank you!
I believe you need an outer join:
>>> pd.merge(df,df2,how='outer')
A B C
0 a1 b1 c1
1 a2 b2 NaN
2 a3 b3 NaN
3 a4 NaN c4
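A self-contained sketch of the outer join, rebuilding the two frames from the sample shown (the frame and column values are my reconstruction of the question's data):

```python
import pandas as pd

# Reconstructed frames from the question's sample
df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']})
df2 = pd.DataFrame({'A': ['a1', 'a4'], 'C': ['c1', 'c4']})

# Outer join keeps every key from both frames, filling gaps with NaN
merged = df1.merge(df2, how='outer')
print(merged)
```

Unlike pd.concat with axis=0, the outer merge aligns rows on the shared key column A instead of stacking the frames.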
Hi all, I have created a dummy DataFrame:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K0', 'K1', 'K2', 'K3'],
                      'C': [7, 11, 9, 13, 9, 6, 10, 5],
                      'D': [1, 2, 1, 2, 2, 1, 2, 1]})
result = pd.merge(left, right, on='key')
output:
key A B C D
0 K0 A0 B0 7 1
1 K0 A0 B0 9 2
2 K1 A1 B1 11 2
3 K1 A1 B1 6 1
4 K2 A2 B2 9 1
5 K2 A2 B2 10 2
6 K3 A3 B3 13 2
7 K3 A3 B3 5 1
The problem I am trying to solve is that I want to group the entries by the key value and perform a mathematical operation on each group; you will notice that there are two entries per key.
For each key, if the top value in column D is less than the bottom value, subtract the column C entries. In this case the calculation would only be applied to index = [4, 5, 6, 7], and the calculations would be
9 - 10 = -1
13 - 5 = 8
Ideally these results would be stored in a list. I know the data structure is not ideal, but this is what I have been given to work with, and I have no idea how to approach it.
# perform checking on column D (and add "check" column)
result["check"] = result.groupby('key')['D'].diff(periods=-1)
result
key A B C D check
0 K0 A0 B0 7 1 -1.0
1 K0 A0 B0 9 2 NaN
2 K1 A1 B1 11 2 1.0
3 K1 A1 B1 6 1 NaN
4 K2 A2 B2 9 1 -1.0
5 K2 A2 B2 10 2 NaN
6 K3 A3 B3 13 2 1.0
7 K3 A3 B3 5 1 NaN
# perform difference between column C values (and add it as a column)
result["diff"] = result.groupby('key')['C'].diff(periods=-1)
result
key A B C D check diff
0 K0 A0 B0 7 1 -1.0 -2.0
1 K0 A0 B0 9 2 NaN NaN
2 K1 A1 B1 11 2 1.0 5.0
3 K1 A1 B1 6 1 NaN NaN
4 K2 A2 B2 9 1 -1.0 -1.0
5 K2 A2 B2 10 2 NaN NaN
6 K3 A3 B3 13 2 1.0 8.0
7 K3 A3 B3 5 1 NaN NaN
# filter dataframe to get only desired results
result[(result.check < 0)]['diff']
0 -2.0
4 -1.0
# results as a list
list(result[(result.check < 0)]['diff'])
[-2.0, -1.0]
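The steps above can be condensed into one runnable sketch (same groupby/diff idea as the answer, just gathered into a single script):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K0', 'K1', 'K2', 'K3'],
                      'C': [7, 11, 9, 13, 9, 6, 10, 5],
                      'D': [1, 2, 1, 2, 2, 1, 2, 1]})
result = pd.merge(left, right, on='key')

# Per-key differences of D (the condition) and C (the value of interest);
# keep the C difference only where the top D is less than the bottom D
check = result.groupby('key')['D'].diff(periods=-1)
diff = result.groupby('key')['C'].diff(periods=-1)
out = list(diff[check < 0])
print(out)  # [-2.0, -1.0]
```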
I'm trying to join two DataFrames by index that may have columns in common, and I only want to take a value from one if the other's value is NaN or doesn't exist. I'm using the pandas example, so I've got:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
as
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
and
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
                    'D': ['D2p', 'D3p', 'D6p', 'D7p'],
                    'F': ['F2p', 'F3p', 'F6p', 'F7p']},
                   index=[2, 3, 6, 7])
as
B D F
2 B2p D2p F2p
3 B3p D3p F3p
6 B6p D6p F6p
7 B7p D7p F7p
and the searched result is:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
This is a good use case for combine_first, where the row and column indices of the resulting dataframe will be the union of the two, i.e. in the absence of an index in one of the dataframes, the value from the other is used (the same behaviour as if it contained a NaN):
df1.combine_first(df4)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
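To make the priority rule explicit, here is a runnable sketch of the answer: the calling dataframe's values win wherever both frames have a non-NaN entry, so swapping the call order flips which frame fills the other (the flipped variant is my addition, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
                    'D': ['D2p', 'D3p', 'D6p', 'D7p'],
                    'F': ['F2p', 'F3p', 'F6p', 'F7p']},
                   index=[2, 3, 6, 7])

# df1's values win on overlap: 'B2' is kept over 'B2p' at index 2
combined = df1.combine_first(df4)

# Swapping the order flips the priority: now 'B2p' wins at index 2
flipped = df4.combine_first(df1)
print(combined)
```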
I have a DataFrame like this (sample):
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V2 B2 Clearing C2 104457.22
5 V2 B2 Invoice C2 -400073.56
6 V2 B2 Payment C2 297856.45
7 V3 B3 Clearing C3 1989462.95
8 V3 B3 CreditMemo C3 538.95
9 V3 B3 CustomerPayment_Difference C3 2112329.00
10 V3 B3 Invoice C3 -4066485.69
11 V4 B4 Clearing C4 -123946.13
12 V4 B4 CreditMemo C4 127624.66
13 V4 B4 Accounting C4 424774.52
14 V4 B4 Invoice C4 -40446521.41
15 V4 B4 Payment C4 44441419.95
I want to reshape this data frame like below:
A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
V1 B1 C1 NaN 1538884.46 NaN 13537679.7
V2 B2 C2 NaN 104457.22 NaN NaN
V3 B3 C3 NaN 1989462.95 538.95 2112329.0
V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
So far I have tried pivot:
df.pivot(index='A',columns='C', values='E').reset_index()
It gives result like below:
C A Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 NaN 1538884.46 NaN 13537679.7
1 V2 NaN 104457.22 NaN NaN
2 V3 NaN 1989462.95 538.95 2112329.0
3 V4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
In the above table it leaves out the B & D columns; I need those columns as well.
I have provided this sample data for simplicity, but in future the data will also look like this:
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
**4 V1 B2 Clearing C1 88.9
5 V1 B2 Clearing C2 79.9**
In this situation my code will throw a duplicate index error.
To fix these two problems I need to specify A, B, D as the index.
I need code similar to this:
df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()
but this code throws an error.
How do I solve this? How do I provide multiple columns as the index in a pandas pivot table?
I think you need:
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 B1 C1 NaN 1538884.46 NaN 13537679.7
1 V2 B2 C2 NaN 104457.22 NaN NaN
2 V3 B3 C3 NaN 1989462.95 538.95 2112329.0
3 V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
Another solution is to use pivot_table:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
But it aggregates if there are duplicates in the A, B, C, D columns, whereas the first solution raises an error if there are duplicates:
print (df)
A B C D E
0 V1 B1 Clearing C1 3000.00 <-V1,B1,Clearing,C1
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B1 Clearing C1 1000.00 <-V1,B1,Clearing,C1
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
ValueError: Index contains duplicate entries, cannot reshape
But pivot_table aggregates (using the mean by default):
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)
C Clearing CustomerPayment_Difference Invoice PaymentDifference
A B D
V1 B1 C1 2000.0 13537679.7 -15771005.81 0.0
So the question is: is it always a good idea to use pivot_table?
In my opinion it depends on whether you need to care about duplicates - if you use pivot or set_index + unstack you get an error, so you know about the dupes; but pivot_table always aggregates, so you get no idea about the dupes.
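One way to get the best of both (a sketch with hypothetical sample data mirroring the duplicate case above): flag duplicate key combinations explicitly before reshaping, and pass an explicit aggfunc to pivot_table so the aggregation choice is deliberate rather than silent.

```python
import pandas as pd

# Hypothetical sample with a duplicated (A, B, C, D) combination
df = pd.DataFrame({'A': ['V1'] * 5,
                   'B': ['B1'] * 5,
                   'C': ['Clearing', 'CustomerPayment_Difference',
                         'Invoice', 'PaymentDifference', 'Clearing'],
                   'D': ['C1'] * 5,
                   'E': [3000.0, 13537679.7, -15771005.81, 0.0, 1000.0]})

# Flag rows whose (A, B, C, D) combination appears more than once
dupes = df.duplicated(subset=['A', 'B', 'C', 'D'], keep=False)
print(df[dupes])  # shows both 'Clearing' rows

# Spelling out aggfunc makes the mean over the duplicates explicit
wide = df.pivot_table(index=['A', 'B', 'D'], columns='C',
                      values='E', aggfunc='mean')
print(wide['Clearing'])
```

If the duplicated-rows frame is non-empty, you know pivot_table will aggregate before you rely on its output.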
I have a large dataframe with the format as below:
If any cell is NaN, I want to copy the value from the cell immediately above it, so my dataframe should look like:
In case the first row has NaN, then I'll have to let it be.
Can someone please help me with this?
This looks like pandas; if so, you need to call ffill:
In [72]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['A0', 'A1', 'A2', np.nan, np.nan, 'A3'],
                   'B': ['B0', 'B1', 'B2', np.nan, np.nan, 'B3'],
                   'C': ['C0', 'C1', 'C2', np.nan, np.nan, 'C3']})
df
Out[72]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 NaN NaN NaN
4 NaN NaN NaN
5 A3 B3 C3
In [73]:
df.ffill()
Out[73]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A2 B2 C2
4 A2 B2 C2
5 A3 B3 C3
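A minimal runnable sketch (with hypothetical data where the leading NaN case actually occurs): ffill propagates the last valid value downward per column, and a NaN in the very first row stays NaN because there is nothing above it, which matches the "let the first row be" requirement.

```python
import numpy as np
import pandas as pd

# Column A starts with NaN to exercise the first-row case
df = pd.DataFrame({'A': [np.nan, 'A1', np.nan, np.nan, 'A3'],
                   'B': ['B0', np.nan, 'B2', np.nan, np.nan]})

filled = df.ffill()
print(filled)
```

If you only want gaps up to a certain length filled, ffill also accepts a limit parameter, e.g. df.ffill(limit=1).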