I am trying to access the index of a row in a function applied across an entire DataFrame in Pandas. I have something like this:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df
a b c
0 1 2 3
1 4 5 6
and I'll define a function that accesses elements of a given row:
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
I can apply it like so:
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame, after adding d, would be Index([u'a', u'b', u'c', u'd'], dtype='object'), but I want the 0 and 1. So I can't just access row.index.
I know I could create a temporary column in the table where I store the index, but I'm wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
def rowIndex(row):
    return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really all you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
But assuming it isn't as trivial as this, whatever your rowFunc is really doing, you should look to use vectorised functions and combine them with the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
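As a sanity check on the vectorised approach, here is a minimal sketch comparing it with the row-wise version that reads row.name. The formula (the question's rowFunc plus the index) and the names via_apply and via_vector are only illustrative, and it assumes a numeric index as in the example frame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

# Row-wise: row.name holds the index label of the row being processed.
via_apply = df.apply(lambda row: row['a'] + row['b'] * row['c'] + row.name, axis=1)

# Vectorised: the index takes part in the arithmetic like any other column.
via_vector = df['a'] + df['b'] * df['c'] + df.index

print(via_apply.tolist())   # [7, 35]
print(via_vector.tolist())  # [7, 35]
```

Both give the same values, so the apply call is only needed when the row logic cannot be expressed with column operations.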
Either:
1. with row.name inside the apply(..., axis=1) call:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])
a b c
x 1 2 3
y 4 5 6
df.apply(lambda row: row.name, axis=1)
x x
y y
2. with iterrows() (slower)
DataFrame.iterrows() allows you to iterate over rows, and access their index:
for idx, row in df.iterrows():
    ...
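A minimal sketch of the iterrows() variant, using the same x/y-indexed frame as above. The loop body is only an illustration (it reuses the a + b * c formula from the question), and the list name results is made up for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'], index=['x', 'y'])

results = []
for idx, row in df.iterrows():
    # idx is the index label of the row; row is a Series holding its values.
    results.append((idx, int(row['a'] + row['b'] * row['c'])))

print(results)  # [('x', 7), ('y', 34)]
```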
To answer the original question: yes, you can access the index value of a row in apply(). It is available as the name attribute of the row and requires that you specify axis=1 (because the lambda then processes the columns of a row and not the rows of a column).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
In pandas, I regularly use the following to filter a dataframe by number of occurrences
df = df.groupby('A').filter(lambda x: len(x) >= THRESHOLD)
Assume df has another column 'B' and I want to filter the dataframe this time by the count of unique values on that column, I would expect something like
df = df.groupby('A').filter(lambda x: len(np.unique(x['B'])) >= THRESHOLD2)
But that doesn't seem to work. What would be the right approach?
It should work nicely with nunique:
df = pd.DataFrame({'B':list('abccee'),
'E':[5,3,6,9,2,4],
'A':list('aabbcc')})
print (df)
A B E
0 a a 5
1 a b 3
2 b c 6
3 b c 9
4 c e 2
5 c e 4
THRESHOLD2 = 2
df1 = df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
print (df1)
A B E
0 a a 5
1 a b 3
But if you need a faster solution, use transform and filter by boolean indexing:
df2 = df[df.groupby('A')['B'].transform('nunique') >= THRESHOLD2]
print (df2)
A B E
0 a a 5
1 a b 3
Timings:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'B': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','B']).reset_index(drop=True)
print (df)
THRESHOLD2 = 3
In [403]: %timeit df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
1 loop, best of 3: 3.05 s per loop
In [404]: %timeit df[df.groupby('A')['B'].transform('nunique')>= THRESHOLD2]
1 loop, best of 3: 558 ms per loop
Caveat
The results do not address performance given the number of groups, which will affect timings a lot for some of these solutions.
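To see how the number of groups affects the two approaches, something like the following could be run. This is only a sketch: make_frame, by_filter and by_transform are hypothetical helper names, the frame sizes are arbitrary, and no timings are shown because they depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_groups, seed=123):
    # Random test data: 'A' is the grouping key, 'B' the column whose unique values are counted.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'A': rng.integers(n_groups, size=n_rows),
        'B': rng.choice(list('abcde'), size=n_rows),
    })


def by_filter(df, threshold):
    return df.groupby('A').filter(lambda g: g['B'].nunique() >= threshold)


def by_transform(df, threshold):
    return df[df.groupby('A')['B'].transform('nunique') >= threshold]


for n_groups in (10, 1000):
    frame = make_frame(20_000, n_groups)
    t_filter = timeit.timeit(lambda: by_filter(frame, 3), number=3)
    t_transform = timeit.timeit(lambda: by_transform(frame, 3), number=3)
    print(n_groups, round(t_filter, 3), round(t_transform, 3))
```

The filter version calls the Python lambda once per group, so it is the one to watch as the number of groups grows.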
I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to compute sums along both rows and columns.
Summing over rows is not a big deal.
I got a result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply add the values in columns A and B, as well as those in columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I'd rather not do it like this.
(It looks too inelegant, but if it is the only way, I'll accept it.)
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it is applied to the labels of an axis. I specified axis=1, which is the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
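As far as I know, recent pandas releases (2.1 onward) deprecate the axis argument of groupby, so here is a sketch of the same column grouping done through a transpose instead. The frame is rebuilt inline from the sample data in the question rather than read from sample.txt, and the name result is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 4, 7, 1, 4, 7], 'B': [2, 5, 8, 2, 5, 8],
     'C': [3, 6, 9, 3, 6, 9], 'D': [1, 2, 3, 4, 5, 6],
     'cat': ['x', 'x', 'y', 'y', 'z', 'z']},
    index=pd.Index(list('JKLMNO'), name='idx'),
)
d = dict(A='AB', B='AB', C='CD', D='CD')

# Keep only the mapped columns, transpose so they become rows,
# group those rows by the mapping, then transpose back.
result = df[list(d)].T.groupby(d).sum().T
print(result)
```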
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
I have a pandas dataframe, and I created a function. I would like to apply this function to each row of the dataframe. However, the function has a third parameter that does not come from the dataframe and is constant, so to speak.
import pandas as pd
df = pd.DataFrame(data = {'a':[1, 2, 3], 'b':[4, 5, 6]})
def add(a, b, c):
    return a + b * c
df['c'] = add(df['a'], df['b'], 2)
I think I have to use the apply function but I don't see how I would pass this constant argument.
print(df)
>> a b c
>> 0 1 4 10
>> 1 2 5 14
>> 2 3 6 18
I get slightly different output in the c column. If you need to process row by row, add axis=1 to apply:
df['c'] = add(df['a'],df['b'],2)
df['d'] = df.apply(lambda x: add(x['a'], x['b'], 2), axis=1)
print (df)
a b c d
0 1 4 9 9
1 2 5 12 12
2 3 6 15 15
def add(a,b,c):
    # operator precedence, need ()
    return (a + b) * c
df['c'] = add(df['a'],df['b'],2)
df['d'] = df.apply(lambda x: add(x['a'], x['b'], 2), axis=1)
print (df)
a b c d
0 1 4 10 10
1 2 5 14 14
2 3 6 18 18
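If you prefer not to wrap the call in a lambda, DataFrame.apply can also forward the constant itself, either positionally through args or as a keyword argument. A minimal sketch, where add_row is a hypothetical row-based variant of the add function above:

```python
import pandas as pd

df = pd.DataFrame(data={'a': [1, 2, 3], 'b': [4, 5, 6]})


def add_row(row, c):
    # row is one row of the frame (a Series); c is the constant passed through apply.
    return (row['a'] + row['b']) * c


# Extra positional arguments go in the args tuple ...
df['c'] = df.apply(add_row, axis=1, args=(2,))
# ... and extra keyword arguments are forwarded to the function as well.
df['d'] = df.apply(add_row, axis=1, c=2)

print(df)
```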
I have a pandas dataframe whose indices look like:
df.index
['a_1', 'b_2', 'c_3', ... ]
I want to rename these indices to:
['a', 'b', 'c', ... ]
How do I do this without specifying a dictionary with explicit keys for each index value?
I tried:
df.rename( index = lambda x: x.split( '_' )[0] )
but this throws up an error:
AssertionError: New axis must be unique to rename
Perhaps you could get the best of both worlds by using a MultiIndex:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(8).reshape(4,2), index=['a_1', 'b_2', 'c_3', 'c_4'])
print(df)
# 0 1
# a_1 0 1
# b_2 2 3
# c_3 4 5
# c_4 6 7
index = pd.MultiIndex.from_tuples([item.split('_') for item in df.index])
df.index = index
print(df)
#      0  1
# a 1  0  1
# b 2  2  3
# c 3  4  5
#   4  6  7
This way, you can access things according to first level of the index:
In [30]: df.loc['c']
Out[30]:
0 1
3 4 5
4 6 7
or according to both levels of the index:
In [31]: df.loc[('c','3')]
Out[31]:
0 4
1 5
Name: (c, 3)
Moreover, all the DataFrame methods are built to work with DataFrames with MultiIndices, so you lose nothing.
However, if you really want to drop the second level of the index, you could do this:
df.reset_index(level=1, drop=True, inplace=True)
print(df)
# 0 1
# a 0 1
# b 2 3
# c 4 5
# c 6 7
That's the error you'd get if your function produced duplicate index values:
>>> df = pd.DataFrame(np.random.random((4,3)),index="a_1 b_2 c_3 c_4".split())
>>> df
0 1 2
a_1 0.854839 0.830317 0.046283
b_2 0.433805 0.629118 0.702179
c_3 0.390390 0.374232 0.040998
c_4 0.667013 0.368870 0.637276
>>> df.rename(index=lambda x: x.split("_")[0])
[...]
AssertionError: New axis must be unique to rename
If you really want that, I'd use a list comprehension:
>>> df.index = [x.split("_")[0] for x in df.index]
>>> df
0 1 2
a 0.854839 0.830317 0.046283
b 0.433805 0.629118 0.702179
c 0.390390 0.374232 0.040998
c 0.667013 0.368870 0.637276
but I'd think about whether that's really the right direction.
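The same relabelling can also be written with the index's string accessor instead of a Python loop. A minimal sketch, assuming every label is a string containing an underscore; like the list comprehension, it leaves duplicate labels in the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2), index=['a_1', 'b_2', 'c_3', 'c_4'])

# Split each label on '_' and keep the first piece.
df.index = df.index.str.split('_').str[0]

print(df.index.tolist())  # ['a', 'b', 'c', 'c']
```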
Consider the following DataFrames:
In [136]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'C':np.arange(10,30,5)}).set_index(['A','B'])
df
Out[136]:
      C
A B
1 1  10
  2  15
2 1  20
  2  25
In [130]:
vals = pd.DataFrame({'A':[1,2],'values':[True,False]}).set_index('A')
vals
Out[130]:
values
A
1 True
2 False
How can I select only the rows of df with corresponding True values in vals?
If I reset_index on both frames I can now merge/join them and slice however I want, but how can I do it using the (multi)indexes?
boolean indexing all the way...
In [65]: df[df.index.get_level_values('A').isin(vals[vals['values']].index)]
Out[65]:
C
A B
1 1 10
2 15
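The mask can also be built by looking up each row's A label in vals. A minimal sketch, assuming every A label in df also appears in vals (a missing label would produce NaN instead of a boolean); the names mask and selected are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 1, 2],
                   'C': np.arange(10, 30, 5)}).set_index(['A', 'B'])
vals = pd.DataFrame({'A': [1, 2], 'values': [True, False]}).set_index('A')

# Look up the flag for each row's A label, then use it as a positional boolean mask.
mask = vals['values'].reindex(df.index.get_level_values('A')).to_numpy()
selected = df[mask]

print(selected)
```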
Note that you can use xs on a multiindex.
In [66]: df.xs(1)
Out[66]:
C
B
1 10
2 15