How to remove row and rename multiindex table - python

I have a MultiIndex DataFrame like the one below, and I would like to remove the row above 'A' (that is, shift the DataFrame up).
metric data data
F K
C B
A 2 3
B 4 5
C 6 7
D 8 9
desired output
ALIAS data data
metric F K
A 2 3
B 4 5
C 6 7
D 8 9
I looked through multiple posts but could not find anything close to the desired outcome. How can I achieve the desired output?
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

Let's try DataFrame.droplevel to remove level 2 from the columns, and DataFrame.rename_axis to update column axis names:
df = df.droplevel(level=2, axis=1).rename_axis(['ALIAS', 'metric'], axis=1)
Or with the index equivalent methods Index.droplevel and Index.rename:
df.columns = df.columns.droplevel(2).rename(['ALIAS', 'metric'])
df:
ALIAS data
metric F K
A 2 3
B 4 5
C 6 7
D 8 9
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.arange(2, 10).reshape(-1, 2),
index=list('ABCD'),
columns=pd.MultiIndex.from_arrays([
['data', 'data'],
['F', 'K'],
['C', 'B']
], names=['metric', None, None])
)
df:
metric data
F K
C B
A 2 3
B 4 5
C 6 7
D 8 9

Related

Flatten multiindex dataframe levels and remove string from end of column names if contains

I have a dataframe like this
df = pd.DataFrame(
np.arange(2, 11).reshape(-1, 3),
index=list('ABC'),
columns=pd.MultiIndex.from_arrays([
['data1', 'data2','data3'],
['F', 'K',''],
['', '','']
], names=['meter', 'Sleeper',''])
).rename_axis('Index')
df
meter data1 data2 data3
Sleeper F K
Index
A 2 3 4
B 5 6 7
C 8 9 10
So I want to join the level names and flatten the data,
following this solution: Pandas dataframe with multiindex column - merge levels
df.columns = df.columns.map('_'.join).str.strip('|')
df.reset_index(inplace=True)
Getting this
Index data1_F_ data2_K_ data3__
0 A 2 3 4
1 B 5 6 7
2 C 8 9 10
but I don't want those _ at the end of the column names, so I added
df.columns = df.columns.apply(lambda x: x[:-1] if x.endswith('_') else x)
df
But got
AttributeError: 'Index' object has no attribute 'apply'
How can I combine map and apply (flatten the column names and remove the _ at the end of the column names in one run)?
expected output
Index data1_F data2_K data3
0 A 2 3 4
1 B 5 6 7
2 C 8 9 10
Thanks
You can try this:
df.columns = df.columns.map('_'.join).str.strip('_')
df
Out[132]:
data1_F data2_K data3
Index
A 2 3 4
B 5 6 7
C 8 9 10
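A possible variant, in case a level value itself begins with an underscore or an empty level sits between two non-empty ones: drop the empty level values before joining instead of stripping afterwards. This is only a sketch that assumes every level value is a string; flat_names is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame(
    np.arange(2, 11).reshape(-1, 3),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays([
        ['data1', 'data2', 'data3'],
        ['F', 'K', ''],
        ['', '', ''],
    ], names=['meter', 'Sleeper', '']),
).rename_axis('Index')

# Each column label is a tuple of level values; join only the non-empty ones.
flat_names = ['_'.join(part for part in col if part) for col in df.columns]
df.columns = flat_names
df = df.reset_index()
print(df.columns.tolist())
```

This flattens and cleans the names in one pass, so no separate step is needed to remove trailing underscores.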

Getting the total for some columns (independently) in a data frame with python [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
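You can just call sum with axis=1 to sum across each row. Older pandas versions silently ignore non-numeric columns; in pandas 2.0 and later, pass numeric_only=True to get the same behaviour:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1, numeric_only=True)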
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1, numeric_only=True)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up (list_name below).
df['total']=df.loc[:,list_name].sum(axis=1)
If you want the sum for certain rows only, replace the ':' with a row selection.
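As a rough illustration of selecting rows in place of the ':' (assuming the question's df; the contents of list_name and the name first_two are made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# Hypothetical list of the columns to add up.
list_name = ['a', 'b', 'd']

# ':' selects every row, so each row gets a total.
df['total'] = df.loc[:, list_name].sum(axis=1)

# A label slice selects only some rows; with .loc both ends are included.
first_two = df.loc[0:1, list_name].sum(axis=1)
print(first_two)
```

Note that .loc slices by label and includes the end label, unlike .iloc, which slices by position and excludes it.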
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
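One way that might work for mixing a range with individual positions is NumPy's np.r_ helper, which turns slices and scalars into a single array of integers that iloc accepts. A minimal sketch on the question's df (positions is a name introduced here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# np.r_ concatenates the slice 0:2 and the scalar 3 into array([0, 1, 3]).
positions = np.r_[0:2, 3]

# Columns at positions 0, 1 and 3 are 'a', 'b' and 'd'.
df['i'] = df.iloc[:, positions].sum(axis=1)
print(df)
```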
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when the columns to sum are adjacent:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use eval, which returns a new DataFrame with the added column:
df = df.eval('e = a + b + d')

Move specific columns to the rightmost of the DataFrame

I want to shift some columns in the middle of the dataframe to the rightmost.
I could do this with individual column using code:
cols=list(df.columns.values)
cols.pop(cols.index('one_column'))
df=df[cols +['one_column']]
df
But it's inefficient to do it individually when there are 100 columns in 2 series, i.e. series1_1 ... series1_50 and series2_1 ... series2_50, in the middle of the dataframe.
How can I do it by assigning the 2 series as lists, popping them and putting them back? Maybe something like
cols=list(df.columns.values)
series1 = list(df.loc['series1_1':'series1_50'])
series2 = list(df.loc['series2_1':'series2_50'])
cols.pop('series1', 'series2')
df=df[cols +['series1', 'series2']]
but this didn't work. Thanks
If you just want to shift the columns, you could call concat like this:
cols_to_shift = ['colA', 'colB']
pd.concat([
df[df.columns.difference(cols_to_shift)],
df[cols_to_shift]
], axis=1
)
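Or, since Index.difference returns the remaining labels in sorted order rather than their original order, you could do a little list manipulation on the columns to preserve that order.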
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
Minimal Example
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 10, (3, 5)), columns=list('ABCDE'))
df
A B C D E
0 6 1 4 4 8
1 4 6 3 5 8
2 7 9 9 2 7
cols_to_shift = ['B', 'C']
pd.concat([
df[df.columns.difference(cols_to_shift)],
df[cols_to_shift]
], axis=1
)
A D E B C
0 6 4 8 1 4
1 4 5 8 6 3
2 7 2 7 9 9
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
A D E B C
0 6 4 8 1 4
1 4 5 8 6 3
2 7 2 7 9 9
I think list.pop only takes the index of an element in the list.
You should use list.remove instead.
cols = df.columns.tolist()
for s in ('series1', 'series2'):
    cols.remove(s)
df = df[cols + ['series1', 'series2']]
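Since the question describes two groups of columns named series1_1 ... series1_50 and series2_1 ... series2_50, one option is to pick them up by prefix rather than listing them. The following is only a sketch on a small made-up frame; the 'id' and 'note' columns are hypothetical stand-ins for the other columns.

```python
import pandas as pd

# Hypothetical frame with the two series sitting in the middle.
df = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5]],
    columns=['id', 'series1_1', 'series1_2', 'series2_1', 'series2_2', 'note'],
)

# str.startswith accepts a tuple of prefixes, so both series are caught at once.
cols_to_shift = [c for c in df.columns if c.startswith(('series1_', 'series2_'))]
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]

# Remaining columns first, the two series last, each in original order.
df = df[cols_to_keep + cols_to_shift]
print(df.columns.tolist())
```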

How to delete the row in a dataframe panda based on the row names of another dataframe?

I want to shorten my data; the whole data shape is 30000x480. I want to drop some rows based on the row names of another data frame.
Please help me find a solution for:
df1
Row a b
A 1 2
B 3 4
C 5 6
D 7 8
E 9 10
F 11 12
G 13 14
df2
Row a b
C 5 6
D 7 8
F 11 12
G 13 14
So, I want to delete the rows in df1 that don't exist in df2. It's hard to delete them manually because the data is very big.
For clarity, I use the same data as given and restate the question as follows:
Question: delete the rows in df1 that don't exist in df2.
Restated: you need the rows of df1 that are also present in df2, i.e. the rows common to both df1 and df2. Try this:
>>> import pandas as pd
>>> df2 = pd.DataFrame({'Row': ['C', 'D', 'F','G'], 'a': [5, 7, 11, 13], 'b' : [6, 8, 12, 14]})
>>> df1 = pd.DataFrame({'Row' : ['A', 'B', 'C', 'D'], 'a': [1,3,5,7], 'b': [2,4,6, 8]})
>>> df1
Row a b
0 A 1 2
1 B 3 4
2 C 5 6
3 D 7 8
>>> df2
Row a b
0 C 5 6
1 D 7 8
2 F 11 12
3 G 13 14
>>> pd.merge(df1, df2, 'inner')
Row a b
0 C 5 6
1 D 7 8
>>>
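Note that the inner merge above matches on every shared column ('Row', 'a' and 'b'), so a row is kept only if all its values agree. If the filter should depend on the row names alone, as the question asks, a boolean mask with isin may be closer to the intent. A sketch, assuming the row names live in a column called 'Row' (filtered is a name introduced here):

```python
import pandas as pd

df1 = pd.DataFrame({'Row': list('ABCDEFG'),
                    'a': [1, 3, 5, 7, 9, 11, 13],
                    'b': [2, 4, 6, 8, 10, 12, 14]})
df2 = pd.DataFrame({'Row': ['C', 'D', 'F', 'G'],
                    'a': [5, 7, 11, 13],
                    'b': [6, 8, 12, 14]})

# Keep only the rows of df1 whose 'Row' label also appears in df2.
filtered = df1[df1['Row'].isin(df2['Row'])]
print(filtered)
```

If the row names are stored in the index instead, the same idea can be written as df1[df1.index.isin(df2.index)].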

Pandas : Sum multiple columns and get results in multiple columns

I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums both down rows and across columns.
Summing down rows within each category is not a big deal.
I got a result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code that gives a result like this
(simply add the values in columns A and B, as well as those in columns C and D).
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write the code?
By the way, I don't want to do it like this
(it looks too dull, but if it is the only way, I'll accept it).
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied to the labels of an axis. I specified axis=1, which is the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
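In recent pandas releases the axis argument of groupby is deprecated, so the call above may emit a warning. A sketch of an equivalent that avoids axis=1 by transposing, grouping the former column labels, and transposing back (the frame is rebuilt inline here instead of reading sample.txt; result is a name introduced for illustration):

```python
import pandas as pd

# Same data as sample.txt, built inline so the example is self-contained.
df = pd.DataFrame(
    {'A': [1, 4, 7, 1, 4, 7],
     'B': [2, 5, 8, 2, 5, 8],
     'C': [3, 6, 9, 3, 6, 9],
     'D': [1, 2, 3, 4, 5, 6],
     'cat': list('xxyyzz')},
    index=pd.Index(list('JKLMNO'), name='idx'),
)

d = dict(A='AB', B='AB', C='CD', D='CD')

# Select the mapped columns, transpose so they become the index,
# group those labels by the mapping, sum, and transpose back.
result = df[list(d)].T.groupby(d).sum().T
print(result)
```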
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
