Multilevel index won't go away - python

I have a dataframe, which consists of summary statistics of another dataframe:
df = sample[['Place','Lifeexp']]
df = df.groupby('Place').agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values([('Lifeexp', 'count')], ascending=False)
Looking at the structure, the dataframe has MultiIndex columns, which makes creating plots difficult:
df.columns
MultiIndex(levels=[['Lifeexp', 'Place'], ['count', 'mean', 'max', 'min', '']],
labels=[[1, 0, 0, 0, 0], [4, 0, 1, 2, 3]])
I tried the solutions from several other questions here (e.g. this), but somehow did not get the desired result. I want df to have Place, count, mean, max and min as column names, with Lifeexp removed, so that I can easily create plots, e.g. df.plot.bar(x='Place', y='count')

I think the simplest solution is to select the column after groupby, which prevents a MultiIndex in the columns in the first place:
df = df.groupby('Place')['Lifeexp'].agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values('count', ascending=False)
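
To illustrate the answer above, here is a minimal, self-contained sketch. The `sample` data is made up for the example (the original frame is not shown in the question), and the second half shows one other possible way to flatten columns that are already a MultiIndex, using `droplevel`.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's `sample` frame.
sample = pd.DataFrame({
    "Place": ["A", "A", "B", "B", "B"],
    "Lifeexp": [70.0, 80.0, 60.0, 65.0, 75.0],
})

# Selecting the single column before agg() keeps the result columns flat.
flat = sample.groupby("Place")["Lifeexp"].agg(["count", "mean", "max", "min"]).reset_index()
flat = flat.sort_values("count", ascending=False)

# If the MultiIndex already exists, dropping its first level is another option.
multi = sample.groupby("Place").agg(["count", "mean", "max", "min"])
multi.columns = multi.columns.droplevel(0)
multi = multi.reset_index()

print(list(flat.columns))
```

With flat column names, a call such as flat.plot.bar(x="Place", y="count") can refer to the columns by plain strings.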

Related

Remove duplicate values in each row of the column

I have a DataFrame with a column that contains repeated values. It is the result of an inverse "explode" operation:
trello_dataframe = trello_dataframe.groupby(['Card ID', 'ID List'], as_index=True).agg({'Member (Full Name)': lambda x: x.tolist()})
How do I remove duplicate values in each row of the column?
I attach more information: https://prnt.sc/RjGazPcMBX47
I would like to have the data frame like this: https://prnt.sc/y0VjKuewp872
Thanks in advance!
You need to target the column and apply np.unique to each row (note that np.unique returns the unique values sorted):
import pandas as pd
import numpy as np
data = {
'Column1' : ['A', 'B', 'C'],
'Column2' : [[5, 0, 5, 0, 5], [5,0,5], [5]]
}
df = pd.DataFrame(data)
df['Column2'] = df['Column2'].apply(lambda x : np.unique(x))
df
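
As a small sketch building on the answer above: np.unique sorts the values and returns NumPy arrays, so if the original order of the list items matters, an order-preserving alternative such as dict.fromkeys may be preferable. The names sorted_unique and ordered_unique are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column1": ["A", "B", "C"],
    "Column2": [[5, 0, 5, 0, 5], [5, 0, 5], [5]],
})

# np.unique: unique values, sorted ascending (converted back to plain lists here).
sorted_unique = df["Column2"].apply(lambda x: np.unique(x).tolist())

# dict.fromkeys keeps the first occurrence of each value in its original order.
ordered_unique = df["Column2"].apply(lambda x: list(dict.fromkeys(x)))

print(sorted_unique.tolist())
print(ordered_unique.tolist())
```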

pandas largest value per group with multi columns / why does it only work when flattening?

For a pandas dataframe of:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'id': [1, 1, 2, 1], 'anomaly_score':[5, 10, 8, 100], 'match_level_0':[np.nan, 1, 1, 1], 'match_level_1':[np.nan, np.nan, 1, 1], 'match_level_2':[np.nan, 1, 1, 1]
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to get the largest values per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but requires flattening the MultiIndex columns first.
Instead, I want to use the tuple key directly:
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with the key not being found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
For me, selecting the Series first and then grouping it by the first level of the MultiIndex works; it looks like a bug that your version does not work:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id match_level_0
1 1.0 55
2 1.0 8
Name: (anomaly_score, mean), dtype: int64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
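
The workaround above can be put together as a runnable sketch. It reuses the data from the question; the variable names agg, mean_score and top are only illustrative, and the exact shape of the resulting index may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 1],
    "anomaly_score": [5, 10, 8, 100],
    "match_level_0": [np.nan, 1, 1, 1],
    "match_level_1": [np.nan, np.nan, 1, 1],
    "match_level_2": [np.nan, 1, 1, 1],
})
agg = df.groupby(["id", "match_level_0"]).agg(["mean", "sum"])

# Select the column by its tuple key first, then group the Series by index level.
mean_score = agg[("anomaly_score", "mean")]
top = mean_score.groupby(level="id").nlargest(2)
print(top)
```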

Interpreting Pandas Column Referencing Syntax

I have a basic background in using R for data wrangling but am new to Python. I came across this code snippet from a tutorial on Coursera.
Can someone please explain to me what columns ={col:'Gold' + col[4:]}, inplace = True means?
(1) From my understanding, df.rename renames the existing column (to Gold, in the case of the first line), but why is +col[4:] needed after it?
(2) Does setting inplace to True mean that the resulting df output is assigned to the original df?
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
Thank you in advance.
It means:
# for each column name
for col in df.columns:
    # check whether the first 2 characters are '01'
    if col[:2]=='01':
        # replace the column name with the text 'Gold' plus everything after the 4th character
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    # same as above, for '02' and 'Silver'
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    # same as above, for '03' and 'Bronze'
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    # check whether the first character is '№'
    if col[:1]=='№':
        # replace that first character with '#'
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
Does declaring the function inplace as True mean to assign the resulting df output to the original dataframe
Yes, you are right. It replaces the column names in place.
if col[:2]=='01':
    # replace the column name with the text 'Gold' plus everything after the 4th character
    df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
(1) Suppose col is the column name '01xx1234':
1. col[:2] == '01' is True.
2. 'Gold' + col[4:] => 'Gold' + '1234' => 'Gold1234'
3. So '01xx1234' is replaced by 'Gold1234'.
(2) inplace=True modifies the dataframe directly and returns None instead of a new dataframe.
Without this option, you have to assign the result back yourself:
df = df.rename(columns={col:'Gold'+col[4:]})
inplace=True means: The columns will be renamed in your original dataframe (df)
Your case (inplace=True):
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
print(df.columns)
# Index(['a', 'c'], dtype='object')
# df already has the renamed columns, because inplace=True.
If you did not use inplace=True, the rename method would return a new dataframe and leave the original unchanged, like this:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
new_frame = df.rename(columns={"A": "a", "B": "c"})
print(df.columns)
# Index(['A', 'B'], dtype='object')
# It contains the old column names
print(new_frame.columns)
# Index(['a', 'c'], dtype='object')
# It's a new dataframe and has renamed columns
NOTE: In this case, a better approach is to assign the new dataframe back to the original name (df):
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.rename(columns={"A": "a", "B": "c"})
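
As a possible alternative to the loop discussed above, rename also accepts a function for columns, so the whole renaming can be done in one call. This is only a sketch: clean_name is a hypothetical helper, and the column names are made up to resemble the kind of headers the tutorial's file appears to contain.

```python
import pandas as pd

def clean_name(col):
    """Map a raw column name to its cleaned form (hypothetical helper)."""
    prefixes = {"01": "Gold", "02": "Silver", "03": "Bronze"}
    if col[:2] in prefixes:
        # Keep everything from index 4 onward, e.g. a '.1' suffix.
        return prefixes[col[:2]] + col[4:]
    if col[:1] == "№":
        # Swap the leading '№' for '#'.
        return "#" + col[1:]
    return col

# Made-up column names for illustration; the frame itself is empty.
df = pd.DataFrame(columns=["№ Summer", "01 !", "02 !", "03 !", "01 !.1", "Total"])
df = df.rename(columns=clean_name)
print(list(df.columns))
```

Because rename is called once and its result is assigned back to df, no inplace argument is needed here.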

Pandas convert from Int64Index to RangeIndex

I concatenated three dataframes. How can I get df.index as a RangeIndex instead of an Int64Index?
My Input:
df = pd.concat([df1, df2, df3])
print(df.index)
My Output:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,
9,
...
73809, 73810, 73811, 73812, 73813, 73814, 73815, 73816, 73817,
73818],
dtype='int64', length=495673)
Desired Output:
RangeIndex(start=X, stop=X, step=X)
You can use reset_index to get desired indices. For example:
df = pd.concat([df1,df2,df3])
df.index
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype='int64')
After resetting the index (drop=True discards the old index instead of adding it as a column):
df.reset_index(drop=True, inplace=True)
df.index
RangeIndex(start=0, stop=9, step=1)
It is also good practice to pass the axis keyword to the concat function explicitly.
You can use the built-in ignore_index option:
df = pd.concat([df1, df2, df3],ignore_index=True)
print(df.index)
From the docs:
ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
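
Both suggestions above can be compared in a short sketch. The three small frames are made up for the example; the names combined and reset are illustrative.

```python
import pandas as pd

# Hypothetical inputs, each with its own default 0..2 index.
df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"x": [4, 5, 6]})
df3 = pd.DataFrame({"x": [7, 8, 9]})

# Option 1: discard the original index values while concatenating.
combined = pd.concat([df1, df2, df3], ignore_index=True)

# Option 2: concatenate first, then reset; drop=True avoids an extra 'index' column.
reset = pd.concat([df1, df2, df3]).reset_index(drop=True)

print(combined.index)
print(reset.index)
```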

Remove a level from a MultiIndex

I need to remove a level (either by position or name) from a DataFrame's index and create a new DataFrame with the new index. The problem is that I end up having a non-unique index.
I had a look at Remove a level from a pandas MultiIndex but the problem is that the use of unique(), as the answer in there suggests, reduces the index to an array, that doesn't retain the names of the levels.
Other than using unique(), and then creating a new Index by stitching the label names onto the array, is there a more elegant solution?
import numpy as np
import pandas as pd
index = [np.array(['foo', 'foo', 'qux']), np.array(['a', 'b', 'a'])]
data = np.random.randn(3, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
df.index.names = ["Level0", "Level1"]
print(df)
X Y
Level0 Level1
foo a -0.591649 0.831599
b 0.049961 -1.524291
qux a -0.100124 -1.059195
# idx is not defined in the original snippet; pd.IndexSlice is assumed here
idx = pd.IndexSlice
index2 = df.reset_index(level=1, drop=True).index
df2 = pd.DataFrame(index=index2)
print(df2.loc[idx['foo'], :])
Empty DataFrame
Columns: []
Index: [foo, foo]
If I understand you correctly, you are looking for a way to get the first-level index without duplicated values. The result should be an Index object, obtained without using unique and without explicitly creating the index again.
For your example data frame, you can use the following including get_level_values and drop_duplicates:
print(df.index.get_level_values(0).drop_duplicates())
Index(['foo', 'qux'], dtype='object', name='Level0')
Edit
For a more general solution either returning an Index or MultiIndex depending on the number of levels, you may use droplevel and drop_duplicates in conjunction:
print(df.index.droplevel(-1).drop_duplicates())
Index(['foo', 'qux'], dtype='object', name='Level0')
Here is the example from the linked SO post, with 3 levels that are reduced to a 2-level MultiIndex with unique values:
tuples = [(0, 100, 1000),(0, 100, 1001),(0, 100, 1002), (1, 101, 1001)]
index_3levels=pd.MultiIndex.from_tuples(tuples,names=["l1","l2","l3"])
print(index_3levels)
MultiIndex(levels=[[0, 1], [100, 101], [1000, 1001, 1002]],
labels=[[0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 2, 1]],
names=['l1', 'l2', 'l3'])
index2level= index_3levels.droplevel(-1).drop_duplicates()
print(index2level)
MultiIndex(levels=[[0, 1], [100, 101]],
labels=[[0, 1], [0, 1]],
names=['l1', 'l2'])
# show unique values of new index
print(index2level.values)
[(0, 100) (1, 101)]
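
The droplevel plus drop_duplicates approach above can be checked with a self-contained sketch. It rebuilds the 3-level index from the answer and a small frame shaped like the one in the question (with fixed values instead of random ones); the names index_2levels and level0 are illustrative.

```python
import pandas as pd

tuples = [(0, 100, 1000), (0, 100, 1001), (0, 100, 1002), (1, 101, 1001)]
index_3levels = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2", "l3"])

# Drop the last level, then remove the duplicates this creates; names are kept.
index_2levels = index_3levels.droplevel(-1).drop_duplicates()

# Same idea on a two-level DataFrame index, dropping a level by name.
df = pd.DataFrame(
    {"X": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_arrays(
        [["foo", "foo", "qux"], ["a", "b", "a"]], names=["Level0", "Level1"]
    ),
)
level0 = df.index.droplevel("Level1").drop_duplicates()

print(index_2levels)
print(level0)
```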
