How can two pandas.IndexSlice objects be combined into one?
Setup of the problem:
import pandas as pd
import numpy as np
idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y'], ['a', 'b']])
df = pd.DataFrame(np.arange(len(cols)*2).reshape((2, len(cols))), columns=cols)
df:
    A               B               C
    x       y       x       y       x       y
    a   b   a   b   a   b   a   b   a   b   a   b
0   0   1   2   3   4   5   6   7   8   9  10  11
1  12  13  14  15  16  17  18  19  20  21  22  23
How can the two slices idx['A', 'y', :] and idx[['B', 'C'], 'x', :] be combined so that the result is a single dataframe?
Separately they are:
df.loc[:, idx['A', 'y',:]]
    A
    y
    a   b
0   2   3
1  14  15
df.loc[:, idx[['B', 'C'], 'x', :]]
    B       C
    x       x
    a   b   a   b
0   4   5   8   9
1  16  17  20  21
Simply combining them as a list does not play nicely:
df.loc[:, [idx['A', 'y',:], idx[['B', 'C'], 'x',:]]]
....
TypeError: unhashable type: 'slice'
My current solution is incredibly clunky, but gives the sub df that I'm looking for:
df.loc[:, df.loc[:, idx['A', 'y', :]].columns.to_list() + df.loc[:,
idx[['B', 'C'], 'x', :]].columns.to_list()]
    A       B       C
    y       x       x
    a   b   a   b   a   b
0   2   3   4   5   8   9
1  14  15  16  17  20  21
However this doesn't work when one of the slices is just a series (as expected), which is less fun:
df.loc[:, df.loc[:, idx['A', 'y', 'a']].columns.to_list() + df.loc[:,
idx[['B', 'C'], 'x', :]].columns.to_list()]
...
AttributeError: 'Series' object has no attribute 'columns'
Are there any better alternatives to what I'm currently doing that would ideally work with dataframe slices and series slices?
A general solution is to take each slice separately and concatenate the results along the columns:
a = df.loc[:, idx['A', 'y', 'a']]
b = df.loc[:, idx[['B', 'C'], 'x', :]]
df = pd.concat([a, b], axis=1)
print (df)
    A   B       C
    y   x       x
    a   a   b   a   b
0   2   4   5   8   9
1  14  16  17  20  21
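Another way to sidestep the Series-versus-DataFrame difference is to resolve every slice to integer column positions first and then select once with iloc. The sketch below assumes the columns form a lexsorted MultiIndex (as from_product produces here) and relies on MultiIndex.get_locs; the helper name select_slices is made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y'], ['a', 'b']])
df = pd.DataFrame(np.arange(len(cols) * 2).reshape((2, len(cols))), columns=cols)


def select_slices(frame, *slices):
    """Hypothetical helper: select the union of several column slices."""
    # get_locs turns each slice into an array of integer column positions
    locs = np.concatenate([frame.columns.get_locs(s) for s in slices])
    # np.unique drops duplicates and restores the original column order
    return frame.iloc[:, np.unique(locs)]


# The first slice names a single column, the second one several columns
sub = select_slices(df, idx['A', 'y', 'a'], idx[['B', 'C'], 'x', :])
print(sub)
```

Because the selection goes through iloc with a list of positions, the result is always a DataFrame, even when one slice matches a single column.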
For example, I have a list of names:
name_list = ['a', 'b', 'c']
and 3 dataframes:
>> df1
>> k l m
0 12 13 14
1 13 14 15
>> df2
>> o p q
0 10 11 12
1 15 16 17
>> df3
>> r s t
0 1 3 4
1 3 4 5
What I want to do is replace the first column name of each dataframe with a name from name_list, so a replaces k, b replaces o and c replaces r.
The output will be:
>> df1
>> a l m
0 12 13 14
1 13 14 15
>> df2
>> b p q
0 10 11 12
1 15 16 17
>> df3
>> c s t
0 1 3 4
1 3 4 5
I can do it manually, but it would be better if there is a cleaner method. Thanks.
I totally agree with @ALollz, but nevertheless you can try something like:
df1 = pd.DataFrame([[1,2,3]], columns=['k', 'l', 'm'])
df2 = pd.DataFrame([[1,2,3]], columns=['o', 'p', 'q'])
df3 = pd.DataFrame([[1,2,3]], columns=['r', 's', 't'])
name_list = ['a', 'b', 'c']
for index, name in enumerate(name_list, 1):
    # look up df1, df2, df3 by name
    df = pd.eval('df{index}'.format(index=index))
    df.rename(columns={df.columns[0]: name}, inplace=True)
If you have the dataframes in a list like dfs = [df1, df2, df3] then you can do:
dfs = [dfs[i].rename(columns={dfs[i].columns[0]: name_list[i]}) for i in range(len(dfs))]
You can do it in place:
[df.rename(columns={df.columns[0]: c}, inplace=True)
for df,c in zip([df1,df2,df3], ['a', 'b', 'c'])]
Alternatively:
for df, c in zip([df1, df2, df3], ['a', 'b', 'c']):
    df.rename(columns={df.columns[0]: c}, inplace=True)
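If you would rather not mutate the frames in place, the same zip idea can build renamed copies and rebind the names. This is only a sketch using the data from the question; the variable names frames and renamed are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[12, 13, 14], [13, 14, 15]], columns=['k', 'l', 'm'])
df2 = pd.DataFrame([[10, 11, 12], [15, 16, 17]], columns=['o', 'p', 'q'])
df3 = pd.DataFrame([[1, 3, 4], [3, 4, 5]], columns=['r', 's', 't'])
name_list = ['a', 'b', 'c']

frames = [df1, df2, df3]
# rename() returns a new frame, so the originals are left untouched
renamed = [frame.rename(columns={frame.columns[0]: name})
           for frame, name in zip(frames, name_list)]

# Rebind the original names to the renamed copies
df1, df2, df3 = renamed
print([list(frame.columns) for frame in renamed])
```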
I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can call sum with axis=1 to sum across each row. In the pandas version used here this ignores non-numeric columns (pandas 2.0 and later need numeric_only=True for that):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
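On pandas 2.0 and later the string column is no longer skipped silently, so a version of the same idea that states the intent explicitly might look like the sketch below. It selects the wanted columns by name in one line and uses numeric_only=True in the other.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# Sum only the listed columns, row by row
df['e'] = df[['a', 'b', 'd']].sum(axis=1)

# Or sum every numeric column and skip 'c' explicitly
# (drop 'e' first so it is not counted twice)
all_numeric = df.drop(columns='e').sum(axis=1, numeric_only=True)
print(df)
```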
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up (list_name below):
df['total'] = df.loc[:, list_name].sum(axis=1)
If you want the sum for certain rows only, replace the ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like the following (both are syntax errors):
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
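One way to mix a range with individual positions is numpy's np.r_, which concatenates slices and scalars into a single integer array that iloc accepts. A minimal sketch on the dataframe from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# np.r_[0:2, 3] expands to the positions 0, 1 and 3
positions = np.r_[0:2, 3]
df['i'] = df.iloc[:, positions].sum(axis=1)
print(df)
```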
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    # cast to float so that numeric strings are summed as numbers
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when the columns to sum are next to each other:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest way here is to use eval, which returns a new dataframe with column e added (pass inplace=True to modify df itself):
df.eval('e = a + b + d')
I have a dataframe which looks like this:
pd.DataFrame({'a':['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
'b':['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
'c':[20, 5, 12, 8, 15, 10, 25, 13]})
a b c
0 A Y 20
1 B Y 5
2 B N 12
3 C Y 8
4 C Y 15
5 D N 10
6 D N 25
7 E N 13
I would like to group by column 'a', set 'b' to 'Y' if any value of 'b' in the group is 'Y' (or True), and sum column 'c'.
The resulting dataframe should look like this:
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
I tried the below but get an error:
df.groupby('a')['b'].max()['c'].sum()
You can use agg with max and sum. max on column 'b' works because strings are compared lexicographically and 'Y' sorts after 'N', so any group containing a 'Y' gets 'Y':
print(df.groupby('a', as_index=False).agg({'b': 'max', 'c': 'sum'}))
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
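If you would rather not depend on the ordering of the strings, the "any 'Y' in the group" rule can be written out explicitly with named aggregation. This is a sketch; the helper name any_yes is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b': ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c': [20, 5, 12, 8, 15, 10, 25, 13]})


def any_yes(flags):
    """Hypothetical helper: 'Y' if at least one flag in the group is 'Y'."""
    return 'Y' if flags.eq('Y').any() else 'N'


# Named aggregation: output column = (input column, aggregation)
out = df.groupby('a', as_index=False).agg(b=('b', any_yes), c=('c', 'sum'))
print(out)
```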
Given the following DataFrame, I am trying to aggregate over columns 'A' and 'C': for 'A', count the unique strings, and for 'C', sum the values.
The problem arises when some of the entries in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
'A' : ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
'C' : [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A' : lambda x: x.nunique(dropna=True),
'C' : 'sum'}
# This raises an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
This is because for 'ID' = 1 we have 'a', 'b', 'c' and 'd', and for 'ID' = 2 we have 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain
A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
# A C ID
# 0 a 1 1
# 1 a 2 1
# 2 a 15 1
# 3 b 5 1
# 4 b 13 1
# 4 c 13 1
# 4 d 13 1
# 5 a 6 2
# 6 a 7 2
# 7 a 1 2
# 7 b 1 2
# 7 c 1 2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
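On pandas 0.25 or later, the flattening step can also be done with DataFrame.explode, which expands list entries into separate rows and leaves scalar entries as they are. A minimal sketch on the same data, keeping the sum on the original frame so that 'C' is not counted once per list element:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'],
                         'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})

# One row per string in 'A'; the original index labels are repeated
flat = df.explode('A')

agg_df = pd.concat([flat.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
```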
I have a dataframe df
df = pd.DataFrame(np.arange(20).reshape(10, -1),
[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
['X', 'Y'])
How do I get the first and last rows, grouped by the first level of the index?
I tried
df.groupby(level=0).agg(['first', 'last']).stack()
and got
          X   Y
a first   0   1
  last    6   7
b first   8   9
  last   12  13
c first  14  15
  last   16  17
d first  18  19
  last   18  19
This is so close to what I want. How can I preserve the level 1 index and get this instead:
      X   Y
a a   0   1
  d   6   7
b e   8   9
  g  12  13
c h  14  15
  i  16  17
d j  18  19
  j  18  19
Option 1
def first_last(df):
    # .ix was removed in pandas 1.0; iloc does the positional selection
    return df.iloc[[0, -1]]
df.groupby(level=0, group_keys=False).apply(first_last)
Option 2 - only works if index is unique
idx = df.index.to_series().groupby(level=0).agg(['first', 'last']).stack()
df.loc[idx]
Option 3 - per notes below, this only makes sense when there are no NAs
I also abused the agg function. The code below works, but is far uglier.
df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
.set_index('level_1', append=True).reset_index(1, drop=True) \
.rename_axis([None, None])
Note
Per @unutbu: agg(['first', 'last']) takes the first and last non-NA values.
I interpreted this to mean that it works column by column, so forcing index level 1 to align with the result may not even make sense.
Let's include another test
df = pd.DataFrame(np.arange(20).reshape(10, -1),
[list('aaaabbbccd'),
list('abcdefghij')],
list('XY'))
df.loc[tuple('aa'), 'X'] = np.nan
def first_last(df):
    # .ix was removed in pandas 1.0; iloc does the positional selection
    return df.iloc[[0, -1]]
df.groupby(level=0, group_keys=False).apply(first_last)
df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
.set_index('level_1', append=True).reset_index(1, drop=True) \
.rename_axis([None, None])
Sure enough! This second solution is taking the first valid value in column X. It is now nonsensical to have forced that value to align with the index a.
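Since first and last skip NA values, a row-based alternative is to take each group's head(1) and tail(1) and put them back together. These select whole rows, so the level 1 labels and any NaN values come along unchanged. A sketch on the original frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  index=[list('aaaabbbccd'), list('abcdefghij')],
                  columns=list('XY'))

grouped = df.groupby(level=0)
# head(1) and tail(1) keep the original index, so sorting restores group order
result = pd.concat([grouped.head(1), grouped.tail(1)]).sort_index()
print(result)
```

A group with a single row (here 'd') appears twice, which matches the output asked for in the question.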
This could be one of the easier solutions:
df.groupby(level = 0, as_index= False).nth([0,-1])
      X   Y
a a   0   1
  d   6   7
b e   8   9
  g  12  13
c h  14  15
  i  16  17
d j  18  19
Hope this helps.
Please try this:
For last value: df.groupby('Column_name').nth(-1),
For first value: df.groupby('Column_name').nth(0)