pandas - find maximum of multilevel dataframe - python

I have this simple multiindex dataframe df obtained after performing some groupby.size() operations:
U  G  C
1  1  en    0.600000
   2  en    0.400000
2  1  es    0.333333
   3  es    0.500000
I would like to select only the rows that hold the maximum value of the last column within each group of the U index level. So far I have tried grouping:
mask = df.groupby(level=[0]).max()
which returns:
U
1 0.6
2 0.5
but I would need the whole structure of the dataframe:
U G C
1 1 en
2 3 es
How can I recover the full MultiIndex of the dataframe in the result?

For your df:
            data
U  G  C
1  1  en    0.600000
   2  en    0.400000
2  1  es    0.333333
   3  es    0.500000
You can use
df[df['data'] == df.groupby(level=[0])['data'].transform('max')]
which returns
            data
U  G  C
1  1  en    0.6
2  3  es    0.5
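For reference, here is a minimal self-contained sketch of the approach above. The frame is rebuilt by hand from the values in the question, and the column name data is the same assumption the answer makes. The idxmax variant at the end is an alternative not in the original answer: it keeps exactly one row per U even when the maximum is tied.

```python
import pandas as pd

# Rebuild the example frame; the column name "data" is assumed.
index = pd.MultiIndex.from_tuples(
    [(1, 1, "en"), (1, 2, "en"), (2, 1, "es"), (2, 3, "es")],
    names=["U", "G", "C"],
)
df = pd.DataFrame({"data": [0.6, 0.4, 0.333333, 0.5]}, index=index)

# Broadcast each group's maximum back onto every row of that group,
# then keep the rows whose value equals the group maximum.
group_max = df.groupby(level="U")["data"].transform("max")
result = df[df["data"] == group_max]

# Alternative: idxmax returns the full index tuple of the first maximum
# in each group, so ties yield a single row per group.
first_max = df.loc[df.groupby(level="U")["data"].idxmax().tolist()]

print(result)
```

Because the boolean mask has the same index as df, the full (U, G, C) index survives the selection.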

Related

Merge on columns and rows

I am trying to make a large dataframe using python. I have a large amount of little dataframes with different row and column names, but there is some overlap between the row names and column names. What I was trying to do is start with one of the little dataframes and then one by one add the others.
Each of the specific row-column combinations is unique and in the end there will probably be a lot of NA.
I have tried doing this with merge from pandas, but this results in a much larger dataframe than I need with row and column names being duplicated instead of merged. If I could find a way that pandas realises that NaN is not a value and overwrites it when a new little dataframe is added, I think I would obtain the result I want.
I am also willing to try something that is not using pandas.
For example:
DF1 A B
Y 1 2
Z 0 1
DF2 C D
X 1 2
Z 0 1
Merged: A B C D
Y 1 2 NA NA
Z 0 1 0 1
X NA NA 1 2
And then a new dataframe has to be added:
DF3 C E
Y 0 1
W 1 1
The result should be:
A B C D E
Y 1 2 0 NA 1
Z 0 1 0 1 NA
X NA NA 1 2 NA
W NA NA 1 NA 1
But what happens is:
A B C_x C_y D E
Y 1 2 NA 1 NA 1
Z 0 1 0 0 1 NA
X NA NA 1 1 2 NA
W NA NA 1 1 NA 1
You want to use DataFrame.combine_first, which will align the DataFrames based on index, and will prioritize values in the left DataFrame, while using values in the right DataFrame to fill missing values.
df1.combine_first(df2).combine_first(df3)
Sample data
import pandas as pd
df1 = pd.DataFrame({'A': [1,0], 'B': [2,1]})
df1.index=['Y', 'Z']
df2 = pd.DataFrame({'C': [1,0], 'D': [2,1]})
df2.index=['X', 'Z']
df3 = pd.DataFrame({'C': [0,1], 'E': [1,1]})
df3.index=['Y', 'W']
Code
df1.combine_first(df2).combine_first(df3)
Output:
A B C D E
W NaN NaN 1.0 NaN 1.0
X NaN NaN 1.0 2.0 NaN
Y 1.0 2.0 0.0 NaN 1.0
Z 0.0 1.0 0.0 1.0 NaN
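Since the question mentions many small dataframes, the chain can be folded over a list instead of being written out by hand. This is only a sketch: the list name frames is made up for illustration, and functools.reduce is one possible way to do the folding.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"A": [1, 0], "B": [2, 1]}, index=["Y", "Z"])
df2 = pd.DataFrame({"C": [1, 0], "D": [2, 1]}, index=["X", "Z"])
df3 = pd.DataFrame({"C": [0, 1], "E": [1, 1]}, index=["Y", "W"])

# Hypothetical container for all the small frames.
frames = [df1, df2, df3]

# Fold left to right: earlier frames win where both hold a value,
# later frames only fill cells that are still missing.
merged = reduce(lambda left, right: left.combine_first(right), frames)

print(merged)
```

Keep in mind that earlier frames take priority, so the order of the list matters if two frames ever hold different values for the same cell.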

How to classify one column's value by other dataframe?

I am trying to classify data based on a dataframe that defines the standard.
The standard is df1, and I want to classify df2 based on df1.
df1:
PAUCode SubClass
1 RA
2 RB
3 CZ
df2:
PAUCode SubClass
2 non
2 non
2 non
3 non
1 non
2 non
3 non
I want df2 to end up like this:
expected result:
PAUCode SubClass
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Option 1
fillna
import numpy as np
df2 = df2.replace('non', np.nan)
df2.set_index('PAUCode').SubClass\
.fillna(df1.set_index('PAUCode').SubClass)
PAUCode
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Name: SubClass, dtype: object
Option 2
map
df2.PAUCode.map(df1.set_index('PAUCode').SubClass)
0 RB
1 RB
2 RB
3 CZ
4 RA
5 RB
6 CZ
Name: PAUCode, dtype: object
Option 3
merge
df2[['PAUCode']].merge(df1, on='PAUCode')
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 2 RB
4 3 CZ
5 3 CZ
6 1 RA
Note here the order of the data changes, but the answer remains the same.
Alternatively, use reindex:
df1.set_index('PAUCode').reindex(df2.PAUCode).reset_index()
Out[9]:
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 3 CZ
4 1 RA
5 2 RB
6 3 CZ
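To try these out, here is a self-contained sketch built from the data in the question. It writes the looked-up values back into df2 with map, and also shows merge with how='left', which keeps the rows in df2's original order. The names lookup and merged are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"PAUCode": [1, 2, 3], "SubClass": ["RA", "RB", "CZ"]})
df2 = pd.DataFrame({"PAUCode": [2, 2, 2, 3, 1, 2, 3], "SubClass": ["non"] * 7})

# Series indexed by PAUCode, used as a lookup table.
lookup = df1.set_index("PAUCode")["SubClass"]

# map replaces each code by its SubClass; codes missing from df1 become NaN.
df2["SubClass"] = df2["PAUCode"].map(lookup)

# A left merge gives the same values and preserves the row order of df2.
merged = df2[["PAUCode"]].merge(df1, on="PAUCode", how="left")

print(df2)
```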

Create new column in pandas dataframe by calculation from previous index columns

Hi, I am new to pandas and learning it for data analysis. I have data with two columns:
A B
1 2
2 3
3 4
4 5
I want to create a third column C, calculated from column B by subtracting the value in the row above from the current value and dividing the result by the current value.
A B C
1 2
2 3 0.33
3 4 0.25
4 5 0.2
For example, the first value in column C is empty because there is no value above 2.
0.33 => (3 - 2) / 3,
0.25 => (4 - 3) / 4,
0.2 => (5 - 4) / 5, and so on.
I am stuck while getting the upper value of current column. Need help how to achieve that.
Use shift to shift the column and then the remaining operations are the regular ones (sub and div):
df['B'].sub(df['B'].shift()).div(df['B'])
Out:
0 NaN
1 0.333333
2 0.250000
3 0.200000
Name: B, dtype: float64
This can also be done without chaining the methods, if you prefer.
(df['B'] - df['B'].shift()) / df['B']
Out[48]:
0 NaN
1 0.333333
2 0.250000
3 0.200000
Name: B, dtype: float64
Edit for handling NaN and decimals.
df['C'] = (1 - df.B.shift() / df.B).map(lambda x: '{0:.2f}'.format(round(x,2))).replace('nan','')
Output:
A B C
0 1 2
1 2 3 0.33
2 3 4 0.25
3 4 5 0.20
Let's simplify and use the following with shift to get the previous value:
df['C'] = 1 - df.B.shift() / df.B
Output:
A B C
0 1 2 NaN
1 2 3 0.333333
2 3 4 0.250000
3 4 5 0.200000
Or you can simply use diff:
df.B.diff()/df.B
Out[545]:
0 NaN
1 0.333333
2 0.250000
3 0.200000
Name: B, dtype: float64
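Putting the pieces together, a small runnable sketch on the data from the question. Both forms compute (current - previous) / current; the names via_shift and rounded are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5]})

# diff() is B minus the previous row's B; the first row has no
# predecessor, so it comes out as NaN.
df["C"] = df["B"].diff() / df["B"]

# Equivalent form using shift() to fetch the previous row explicitly.
via_shift = (df["B"] - df["B"].shift()) / df["B"]

# Optional: round for display while keeping the column numeric.
rounded = df["C"].round(2)

print(df)
```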

Python pandas; fill in data frame with pivot_table

I have a large python script, which makes two dataframes A and B, and at the end, I want to fill in dataframe A with the values of dataframe B, and keep the columns of dataframe A, but it is not going well.
Dataframe A is like this
A B C D
1 ab
2 bc
3 cd
Dataframe B:
A BB CC
1 C 10
2 C 11
3 D 12
My output must be:
new dataframe
A B C D
1 ab 10
2 bc 11
3 cd 12
But my output is
A B C D
1 ab
2 bc
3 cd
Why is it not filling in the values of dataframe B?
My command is
dfnew = dfB.pivot_table(index='A', columns='BB', values='CC').reindex(index=dfA.index, columns=dfA.columns).fillna(dfA)
I think you need to call set_index on column A of dfA so the data aligns with the pivoted frame, then use fillna or combine_first, and finally reset_index:
dfA = pd.DataFrame({'A':[1,2,3], 'B':['ab','bc','cd'], 'C':[np.nan] * 3,'D':[np.nan] * 3})
print (dfA)
A B C D
0 1 ab NaN NaN
1 2 bc NaN NaN
2 3 cd NaN NaN
dfB = pd.DataFrame({'A':[1,2,3], 'BB':['C','C','D'], 'CC':[10,11,12]})
print (dfB)
A BB CC
0 1 C 10
1 2 C 11
2 3 D 12
df = dfB.pivot_table(index='A', columns='BB', values='CC')
print (df)
BB C D
A
1 10.0 NaN
2 11.0 NaN
3 NaN 12.0
dfA = dfA.set_index('A').fillna(df).reset_index()
#dfA = dfA.set_index('A').combine_first(df).reset_index()
print (dfA)
A B C D
0 1 ab 10.0 NaN
1 2 bc 11.0 NaN
2 3 cd NaN 12.0
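The same steps as one runnable sketch, with the imports included. The intermediate name wide and the result name filled are assumptions for illustration. Note that with this sample data the value 12 lands in column D, because the third row of dfB has BB equal to 'D'.

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({"A": [1, 2, 3], "B": ["ab", "bc", "cd"],
                    "C": [np.nan] * 3, "D": [np.nan] * 3})
dfB = pd.DataFrame({"A": [1, 2, 3], "BB": ["C", "C", "D"], "CC": [10, 11, 12]})

# Spread the BB labels into columns, indexed by A.
wide = dfB.pivot_table(index="A", columns="BB", values="CC")

# Index dfA by A as well, so fillna can align on both rows and columns.
filled = dfA.set_index("A").fillna(wide).reset_index()

print(filled)
```

The original attempt reindexed the pivot against dfA's default 0..2 index rather than the values of column A, so the row labels did not line up as intended.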

Pandas: Group By Elements of a Column

Looking for assistance to group by elements of a column in a Pandas df.
Original df:
Country Feature Number
0 US A 1
1 DE A 2
2 FR A 3
3 US B 0
4 DE B 5
5 FR B 7
6 US C 9
7 DE C 0
8 FR C 1
Desired df:
Country A B C
0 US 1 0 9
1 DE 2 5 0
2 FR 3 7 1
I am not sure whether groupby is the best choice or whether I should create a dictionary. Thanks in advance for your help!
You could use pivot_table for that:
In [39]: df.pivot_table(index='Country', columns='Feature')
Out[39]:
Number
Feature A B C
Country
DE 2 5 0
FR 3 7 1
US 1 0 9
If you want your index to be 0, 1, 2, you can call reset_index on the result.
EDIT
If the values in your Number column are actually strings rather than numbers, you can convert that column with astype or with pd.to_numeric:
df.Number = df.Number.astype(float)
or:
df.Number = pd.to_numeric(df.Number)
Note: pd.to_numeric is available only for pandas >= 0.17.0
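To get exactly the layout asked for in the question, here is a sketch that uses pivot instead of pivot_table. That swap assumes each Country/Feature pair occurs only once, as in the sample data. The reindex step, which restores the original country order, is an addition not in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["US", "DE", "FR"] * 3,
    "Feature": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "Number": [1, 2, 3, 0, 5, 7, 9, 0, 1],
})

# Countries in order of first appearance, to undo the alphabetical sort.
order = pd.Index(df["Country"].unique(), name="Country")

wide = (
    df.pivot(index="Country", columns="Feature", values="Number")
      .reindex(order)             # restore US, DE, FR order
      .rename_axis(columns=None)  # drop the "Feature" label on the columns
      .reset_index()              # turn Country back into a regular column
)

print(wide)
```

If a Country/Feature pair can repeat, pivot raises an error; pivot_table, which aggregates duplicates (by mean unless told otherwise), is the safer choice in that case.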
