I have a pandas DataFrame and a function that I would like to apply to each row of it. However, the function has a third parameter that does not come from the DataFrame and is constant.
import pandas as pd
df = pd.DataFrame(data = {'a':[1, 2, 3], 'b':[4, 5, 6]})
def add(a, b, c):
    return a + b * c
df['c'] = add(df['a'], df['b'], 2)
I think I have to use the apply function but I don't see how I would pass this constant argument.
print(df)
>> a b c
>> 0 1 4 10
>> 1 2 5 14
>> 2 3 6 18
With your function as written I get slightly different output in the c column. If you need to process the frame row by row, pass axis=1 to apply:
df['c'] = add(df['a'],df['b'],2)
df['d'] = df.apply(lambda x: add(x['a'], x['b'], 2), axis=1)
print (df)
a b c d
0 1 4 9 9
1 2 5 12 12
2 3 6 15 15
def add(a,b,c):
    # operator precedence: * binds tighter than +, so parentheses are needed
    return (a + b) * c
df['c'] = add(df['a'],df['b'],2)
df['d'] = df.apply(lambda x: add(x['a'], x['b'], 2), axis=1)
print (df)
a b c d
0 1 4 10 10
1 2 5 14 14
2 3 6 18 18
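If the goal is specifically to hand the constant to apply rather than capture it in a lambda, DataFrame.apply forwards extra positional arguments through args and extra keyword arguments straight to the function. A minimal sketch, assuming a hypothetical helper add_row (not part of the question) that takes the whole row as its first argument:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def add_row(row, c):
    # Hypothetical helper: `row` is one row as a Series, `c` is the constant.
    return (row['a'] + row['b']) * c

# Pass the constant positionally through `args` ...
df['c'] = df.apply(add_row, axis=1, args=(2,))
# ... or as a keyword argument, which apply forwards to the function.
df['d'] = df.apply(add_row, axis=1, c=2)

print(df)
```

For plain arithmetic like this, the vectorised call add(df['a'], df['b'], 2) shown above is still likely to be the faster option.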
I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
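For exactly two columns, another option is numpy's element-wise maximum, which compares the two Series position by position. This is only a sketch of that alternative, reusing the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [-2, 8, 1]})

# np.maximum is element-wise: it returns the larger value of each pair.
df["C"] = np.maximum(df["A"], df["B"])

print(df)
```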
@DSM's answer is perfectly fine in almost any normal scenario. But if you want to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for pandas <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along axis 1, that is, across the columns of each row.
# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
import numpy as np
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
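The sample data above contains no NaNs, so the difference is not visible there. A small sketch with made-up values (the frame df_nan is an assumption for illustration) shows what happens when NaNs are present:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values, for illustration only.
df_nan = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [-2.0, 8.0, np.nan]})

# ndarray.max propagates NaN: any row containing a NaN yields NaN.
plain = df_nan.to_numpy().max(axis=1)

# np.nanmax ignores NaNs when computing the row maximum.
safe = np.nanmax(df_nan.to_numpy(), axis=1)

# DataFrame.max skips NaNs by default (skipna=True).
pandas_max = df_nan.max(axis=1)

print(plain, safe, pandas_max.tolist())
```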
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames), and happen to be a shade faster than DataFrame.max. I imagine this difference remains roughly constant and is due to internal overhead (index alignment, handling NaNs, etc.).
The graph was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot
np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))
perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    # labels are listed in the same order as the kernels above
    labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)
To find the single largest value across multiple columns:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
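The same overall maximum can also be taken in one step from the underlying array. A minimal sketch using the values from the example above (the timestamp index is left out here for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [14.037880, 14.515359, 15.056502, 15.533981, 17.221073],
    "B": [15.217879, 15.878632, 16.309152, 16.740607, 17.195145],
})

# Row-wise maximum first, then the maximum of those values.
chained = df[["A", "B"]].max(axis=1).max(axis=0)

# Maximum over every cell of the selected columns in a single call.
overall = df[["A", "B"]].to_numpy().max()

print(chained, overall)
```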
Take the following data frame and groupby object.
df = pd.DataFrame([[1, 2, 3],[1, 4, 5],[2, 5, 6]], columns=['a', 'b', 'c'])
print(df)
a b c
0 1 2 3
1 1 4 5
2 2 5 6
dfGrouped = df.groupby(['a'])
How would I apply a function to the groupby object dfGrouped that multiplies each element of b and c together and then takes the sum? For this example, that is 2*3 + 4*5 = 26 for the 1 group and 5*6 = 30 for the 2 group.
So my desired output for the groupby object is:
a f
0 1 26
2 2 30
Do:
df = pd.DataFrame([[1, 2, 3],[1, 4, 5],[2, 5, 6]], columns=['a', 'b', 'c'])
df['f'] = df['c'] * df['b']
res = df.groupby('a', as_index=False)['f'].sum()
print(res)
Output
a f
0 1 26
1 2 30
If you need to multiply all columns except a, use DataFrame.prod and then aggregate with sum:
df = df.drop(columns='a').prod(axis=1).groupby(df['a']).sum().reset_index(name='f')
print (df)
a f
0 1 26
1 2 30
Alternative with helper column:
df = df.assign(f = df.drop(columns='a').prod(axis=1)).groupby("a", as_index=False).f.sum()
If you need to multiply only some columns, one idea is to use @sammywemmy's solution from the comments:
df = df.assign(f = df.b.mul(df.c)).groupby("a", as_index=False).f.sum()
print (df)
a f
0 1 26
1 2 30
Code:
df=(df.b * df.c).groupby(df['a']).sum().reset_index(name="f")
print(df)
Output:
a f
0 1 26
1 2 30
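Since the question asks about applying a function to the groupby object itself, here is a sketch of that route as well. It selects b and c before calling apply, so the lambda only sees those two columns of each group; it will usually be slower than the vectorised versions above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [2, 5, 6]], columns=['a', 'b', 'c'])

res = (
    df.groupby('a')[['b', 'c']]
      # For each group: multiply b and c row by row, then sum the products.
      .apply(lambda g: (g['b'] * g['c']).sum())
      .reset_index(name='f')
)

print(res)
```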
I have the following DF and f(a,b) function:
A B
0 5 3
1 4 2
2 7 1
def f(a, b):
    return (a + b, a - b)
I want to apply f(a, b) to columns A and B
and store the two returned values (sum and difference) in two new columns:
A B C D
0 5 3 8 2
1 4 2 6 2
2 7 1 8 6
Using apply with axis=1
import pandas as pd
df = pd.DataFrame({"A": [5, 4, 7], "B":[3, 2, 1]})
def f(a,b):
    return (a+b,a-b)
df[["sum", "sub"]] = df.apply(lambda row: f(row["A"], row["B"]), axis=1).apply(pd.Series)
print(df)
Output:
A B sum sub
0 5 3 8 2
1 4 2 6 2
2 7 1 8 6
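A variant of the same idea, sketched here as an alternative to chaining .apply(pd.Series): DataFrame.apply accepts result_type="expand", which turns each returned tuple into columns directly.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 4, 7], "B": [3, 2, 1]})

def f(a, b):
    return (a + b, a - b)

# result_type="expand" spreads each returned tuple across separate columns.
df[["sum", "sub"]] = df.apply(
    lambda row: f(row["A"], row["B"]), axis=1, result_type="expand"
)

print(df)
```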
This is one way. I strongly recommend you don't use pd.DataFrame.apply with a row-wise calculation, as this unnecessarily sidesteps pandas vectorisation.
def f(a, b):
    return a + b, a - b
def foo(df, a, b):
    return f(df[a], df[b])
df['C'], df['D'] = df.pipe(foo, 'A', 'B')
print(df)
A B C D
0 5 3 8 2
1 4 2 6 2
2 7 1 8 6
I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sum in row and column.
In row, it is not a big deal.
I made result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this.
(simply add the values in columns A and B, and likewise the values in columns C and D)
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write the code?
By the way, I don't want to do it like this.
(it looks too clumsy, but if it is the only way, I'll accept it)
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it is applied to the labels of the chosen axis. Here I specified axis=1, which is the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
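Note that, as far as I know, the axis=1 argument of groupby is deprecated in recent pandas releases (2.1 and later). A sketch of an equivalent that avoids it by transposing, grouping the row labels, and transposing back; the frame below is rebuilt by hand from sample.txt as an assumption:

```python
import pandas as pd

# Sample data rebuilt by hand from the question's sample.txt.
df = pd.DataFrame(
    {
        "A": [1, 4, 7, 1, 4, 7],
        "B": [2, 5, 8, 2, 5, 8],
        "C": [3, 6, 9, 3, 6, 9],
        "D": [1, 2, 3, 4, 5, 6],
        "cat": list("xxyyzz"),
    },
    index=pd.Index(list("JKLMNO"), name="idx"),
)

d = dict(A="AB", B="AB", C="CD", D="CD")

# Keep only the mapped columns, transpose so they become row labels,
# group those labels with the dict, sum, and transpose back.
out = df[list(d)].T.groupby(d).sum().T

print(out)
```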
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(axis=1), df[['C', 'D']].sum(axis=1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
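For a plain sum like this, the row-wise apply is not strictly needed. A minimal sketch of the same result with column arithmetic, assuming the same toy frame:

```python
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])

# Add the new column from whole-column arithmetic, then drop the inputs.
out = df.assign(CD=df['C'] + df['D']).drop(columns=['C', 'D'])

print(out)
```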
I am trying to access the index of a row in a function applied across an entire DataFrame in Pandas. I have something like this:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df
a b c
0 1 2 3
1 4 5 6
and I'll define a function that access elements with a given row
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
I can apply it like so:
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c', u'd'], dtype='object'), but I want the 0 and 1. So I can't just access row.index.
I know I could create a temporary column in the table where I store the index, but I'm wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
def rowIndex(row):
    return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really all you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn't as trivial as this, whatever your rowFunc is really doing, you should look to use vectorised functions and combine them with the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
Either:
1. with row.name inside the apply(..., axis=1) call:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])
a b c
x 1 2 3
y 4 5 6
df.apply(lambda row: row.name, axis=1)
x x
y y
2. with iterrows() (slower)
DataFrame.iterrows() allows you to iterate over rows, and access their index:
for idx, row in df.iterrows():
    ...
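A filled-in sketch of the loop above, using the question's formula as the loop body (that choice is an assumption for illustration). It also shows itertuples, which exposes the index label as the Index field of each tuple:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'], index=['x', 'y'])

labels = []
for idx, row in df.iterrows():
    # idx is the index label; row is that row as a Series.
    labels.append(f"{idx}: {row['a'] + row['b'] * row['c']}")

# itertuples yields namedtuples whose first field, Index, is the index label.
pairs = [(t.Index, t.a) for t in df.itertuples()]

print(labels)
print(pairs)
```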
To answer the original question: yes, you can access the index value of a row in apply(). It is available as the row's name attribute and requires that you specify axis=1 (because the lambda then processes the columns of a row and not the rows of a column).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
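When the computation depends only on the index, as in this example, it can also be written against df.index directly without apply. A minimal sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c']).set_index('a')

# Arithmetic on the Index is vectorised; the result is assigned by position.
df['index_x10'] = df.index * 10

print(df)
```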