I want to create a new column in a pandas data frame by applying a function to two existing columns. Following this answer I've been able to create a new column when I only need one column as an argument:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
def fx(x):
    return x * x
print(df)
df['newcolumn'] = df.A.apply(fx)
print(df)
However, I cannot figure out how to do the same thing when the function requires multiple arguments. For example, how do I create a new column by passing column A and column B to the function below?
def fxy(x, y):
    return x * y
You can go with @greenAfrican's example if you are able to rewrite your function. If you'd rather not rewrite it, you can wrap it in an anonymous function inside apply, like this:
>>> def fxy(x, y):
...     return x * y
...
>>> df['newcolumn'] = df.apply(lambda x: fxy(x['A'], x['B']), axis=1)
>>> df
    A   B  newcolumn
0  10  20        200
1  20  30        600
2  30  10        300
Alternatively, you can use the underlying NumPy function:
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300
or vectorize an arbitrary function in the general case:
>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300
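As far as I know, the NumPy docs describe np.vectorize as a convenience wrapper that still loops in Python, so for a function with simple branching it may be worth trying an array expression first. A rough sketch, where capped_product is a made-up function used only for illustration (it is not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def capped_product(x, y):
    # Hypothetical scalar function with a branch: cap the product at 500
    return x * y if x * y < 500 else 500

# Element-by-element call through np.vectorize
df["looped"] = np.vectorize(capped_product)(df["A"], df["B"])

# Same logic expressed on whole columns with np.where
product = df["A"] * df["B"]
df["columnwise"] = np.where(product < 500, product, 500)
```

Both columns should hold the same values; the second form avoids calling the Python function once per row.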
This solves the problem:
df['newcolumn'] = df.A * df.B
You could also do:
def fab(row):
    return row['A'] * row['B']
df['newcolumn'] = df.apply(fab, axis=1)
If you need to create multiple columns at once:
Create the dataframe:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
Create the function:
def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']
Assign the new columns:
df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))
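Another way to get several columns from one apply call is result_type="expand", which should split each returned tuple into its own column. A minimal sketch using the same fab function; the join step and the intermediate name expanded are my own choice, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fab(row):
    return row["A"] * row["B"], row["A"] + row["B"]

# result_type="expand" turns each returned tuple into separate columns (0, 1, ...)
expanded = df.apply(fab, axis=1, result_type="expand")
expanded.columns = ["newcolumn", "newcolumn2"]

# Attach the new columns to the original frame by index
df = df.join(expanded)
```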
One more clean, dict-style syntax:
df["new_column"] = df.apply(lambda x: x["A"] * x["B"], axis = 1)
or,
df["new_column"] = df["A"] * df["B"]
This gives you the desired result dynamically, and it works even if the function takes more than two arguments:
df['anothercolumn'] = df[['A', 'B']].apply(lambda x: fxy(*x), axis=1)
print(df)
    A   B  newcolumn  anothercolumn
0  10  20        100            200
1  20  30        400            600
2  30  10        900            300
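To illustrate the "more than two arguments" case, here is a small sketch with a third column. The column C and the function fxyz are invented for the example; the only requirement is that the column order matches the function's parameter order:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [1, 2, 3]})

def fxyz(x, y, z):
    # Hypothetical three-argument function
    return x * y + z

# The selected columns are unpacked positionally, so their order matters
cols = ["A", "B", "C"]
df["result"] = df[cols].apply(lambda row: fxyz(*row), axis=1)
```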
Consider the following data frame:
df = pd.DataFrame({
'group': [i % 3 for i in range(10)],
'a': np.random.rand(10),
'b': np.random.rand(10)
})
def my_agg(x):
    x = x.values.reshape([x.shape[0] // 2,2])
    prod = x[:,0] * x[:,1]
    return [np.sum(prod), np.mean(prod)]
df.set_index('group').stack().groupby('group').apply(my_agg)
This produces the following result,
group
0 [0.3625660911145343, 0.09064152277863358]
1 [1.132618561193485, 0.3775395203978283]
2 [0.37300784663400804, 0.12433594887800269]
dtype: object
whereas I would like to have a separate column for each feature. Is there a neat way to do this in pandas, taking into account that:
the multiple features generated are more complex, and computing them together is more efficient;
the number of features is much greater than 2?
You can convert the output to lists and then to a DataFrame with the constructor:
def my_agg(x):
    x = x.values.reshape([x.shape[0] // 2,2])
    return [np.sum(x[:,0] * x[:,1]), np.mean(x[:,0] * x[:,1])]
s = df.set_index('group').stack().groupby('group').apply(my_agg)
df1 = pd.DataFrame(s.values.tolist(), index=s.index, columns=['a','b'])
print(df1)
              a         b
group
0      2.210601  0.552650
1      0.335913  0.111971
2      1.696796  0.565599
Or you can return a Series and then unstack, but it should be slower:
def my_agg(x):
    x = x.values.reshape([x.shape[0] // 2,2])
    return pd.Series([np.sum(x[:,0] * x[:,1]), np.mean(x[:,0] * x[:,1])], index=['a','b'])
df1 = df.set_index('group').stack().groupby('group').apply(my_agg).unstack()
print(df1)
              a         b
group
0      0.391921  0.097980
1      0.417366  0.139122
2      0.788845  0.262948
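If reshaping the stacked values feels fragile, one alternative (a sketch, not the approach from the answers above) is to group the original frame and return a named Series per group, which pandas should assemble into one column per feature. The name my_features, the feature names, and the fixed data below are assumptions chosen so the result is easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "group": [0, 1, 2, 0, 1, 2],
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
})

def my_features(g):
    # Compute the shared intermediate once, then derive every feature from it
    prod = g["a"] * g["b"]
    return pd.Series({"prod_sum": prod.sum(), "prod_mean": prod.mean()})

# Selecting the value columns keeps the grouping column out of each sub-frame
features = df.groupby("group")[["a", "b"]].apply(my_features)
```

Adding more features then only means adding more entries to the returned Series.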
In pandas tables, the row index and the column index have a very similar interface, and some operations can act along either rows or columns simply through an axis parameter (for example sort_index, and many more).
But how can I access (read and write) either row-index or column-index by specifying the axis?
# Instead of this
if axis==0:
    table.index = some_function(table.get_index_by_axis(axis))
else:
    table.columns = some_function(table.get_index_by_axis(axis))
# I would like to simply write:
newIndex = some_function(table.get_index_by_axis(axis))
table.set_index_by_axis(newIndex, axis=axis)
Does something like get_index_by_axis and set_index_by_axis exist?
Update:
Data frames have an axes attribute that lets you select an axis by its number. However, this is effectively read-only: assigning a new value has no effect on the table.
index = table.axes[axis] # Read an index
newIndex = some_function(index)
table.axes[axis] = newIndex # This has no effect on table.
I looked into the pandas source code to see how the axis keyword is used. There's a method _get_axis_name that takes the axis as a parameter.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Pass in the axis parameter:
>>> df._get_axis_name(axis=0)
'index'
>>> df._get_axis_name(axis=1)
'columns'
You can use this with getattr or setattr.
>>> getattr(df, df._get_axis_name(axis=0))
RangeIndex(start=0, stop=3, step=1)
>>> getattr(df, df._get_axis_name(axis=1))
Index(['A', 'B'], dtype='object')
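Since _get_axis_name is a private method and could change between versions, a cautious variant is to keep your own mapping from axis number to attribute name. The helper names below mirror the ones asked for in the question; they are hypothetical and do not exist in pandas:

```python
import pandas as pd

# Hypothetical mapping; only the public attributes "index" and "columns" are used
AXIS_NAMES = {0: "index", 1: "columns"}

def get_index_by_axis(df, axis):
    # Read the row index (axis=0) or the column index (axis=1)
    return getattr(df, AXIS_NAMES[axis])

def set_index_by_axis(df, new_index, axis):
    # Replace the chosen index in place
    setattr(df, AXIS_NAMES[axis], new_index)

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
set_index_by_axis(df, get_index_by_axis(df, 1).str.lower(), axis=1)
set_index_by_axis(df, get_index_by_axis(df, 0) * 10, axis=0)
```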
Use pd.DataFrame.set_axis():
import pandas as pd
def apply_axis(df, axis, func):
    old_index = df.axes[axis]
    new_index = old_index.map(func)
    df = df.set_axis(new_index, axis=axis)
    return df
def some_function(x):
    return x+x
df = pd.DataFrame({'a': [1,2,3],
                   'b': [10,20,30],
                   'c': [100,200,300],
                   'd': [1000,2000,3000]})
#    a   b    c     d
# 0  1  10  100  1000
# 1  2  20  200  2000
# 2  3  30  300  3000
ret = apply_axis(df=df, axis=0, func=some_function)
#    a   b    c     d
# 0  1  10  100  1000
# 2  2  20  200  2000
# 4  3  30  300  3000
ret = apply_axis(df=df, axis=1, func=some_function)
#    aa  bb   cc    dd
# 0   1  10  100  1000
# 1   2  20  200  2000
# 2   3  30  300  3000
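When the new labels are computed label by label, as with some_function here, DataFrame.rename with a mapper and an axis argument may be enough on its own. A short sketch under that assumption, reusing a two-column version of the frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def some_function(x):
    return x + x

# rename applies the function to each label along the chosen axis
# and returns a new frame, leaving df untouched
by_rows = df.rename(some_function, axis=0)
by_cols = df.rename(some_function, axis=1)
```

This only covers the case where each new label depends on one old label; replacing the whole index at once still needs set_axis.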
incidentcountlevel1 and examcount were two column names on CSV file. I want to calculate two columns based on these. I have written the script below but it's failing:
import pandas as pd
import numpy as np
import time, os, fnmatch, shutil
df = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df1 = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df3 = pd.read_csv("/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',converters={"incidentcountlevel1":int})
inc_count_lvl_1 = df3.loc[:, ['incidentcountlevel1']]
exam_count=df3.loc[:, ['examcount']]
for exam_count in exam_count: #need to iterate this col to calculate for each row
    if exam_count < 1:
        print "IPTE Cannot be calculated"
    else:
        if inc_count_lvl_1 > 5:
            ipte1= (inc_count_lvl_1/exam_count)*1000
        else:
            dof = 2*(inc_count_lvl_1+ 1)
            chi_square=chi2.ppf(0.5,dof)
            ipte1=(chi_square/(2*exam_count))*1000
You can apply a lambda function to a pandas column.
I've just created an example using NumPy; you can adapt it to your case.
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 50]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  50        1500
or you can create your own function:
>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  50        1500
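In your case, the solution might look like this:
from scipy.stats import chi2  # assumption: chi2 in the question is SciPy's chi-square distribution

def fx(exam_count, inc_count_lvl_1):
    if exam_count < 1:
        return -1  # placeholder; return whatever you want here
    if inc_count_lvl_1 > 5:
        ipte1 = (inc_count_lvl_1 / exam_count) * 1000
    else:
        dof = 2 * (inc_count_lvl_1 + 1)
        chi_square = chi2.ppf(0.5, dof)
        ipte1 = (chi_square / (2 * exam_count)) * 1000
    return ipte1

# fx must be defined before it is used; otypes=[float] keeps the result from
# being truncated to int when the first row happens to return -1
df['new_column'] = np.vectorize(fx, otypes=[float])(df['examcount'], df['incidentcountlevel1'])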
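For larger files, the same branching can probably be written on whole columns with np.where. This is only a sketch: it assumes chi2 is scipy.stats.chi2, uses NaN instead of -1 for rows where the value cannot be calculated, and the function name ipte and the sample numbers are made up:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2  # assumption: the chi2 used in the question

def ipte(exam_count, inc_count):
    exam = np.asarray(exam_count, dtype=float)
    inc = np.asarray(inc_count, dtype=float)
    # Rows with fewer than one exam become NaN, which propagates through both branches
    safe_exam = np.where(exam >= 1, exam, np.nan)
    rate = inc / safe_exam * 1000
    chi_based = chi2.ppf(0.5, 2 * (inc + 1)) / (2 * safe_exam) * 1000
    # Pick the plain rate when there are more than 5 incidents, else the chi-square estimate
    return np.where(inc > 5, rate, chi_based)

# Made-up sample data with the column names from the question
df = pd.DataFrame({"examcount": [0, 100, 200], "incidentcountlevel1": [3, 0, 10]})
df["ipte"] = ipte(df["examcount"], df["incidentcountlevel1"])
```

Note that np.where evaluates both branches for every row, which is why the invalid exam counts are masked before dividing.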
If you don't want to use lambda functions, you can use iterrows.
iterrows is a generator that yields both the index and the row.
for index, row in df.iterrows():
    print(row['examcount'], row['incidentcountlevel1'])
    # do your stuff.
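If you go the explicit-loop route, itertuples is a close relative of iterrows that yields lightweight named tuples. A small sketch that collects one value per row into a list and assigns it at the end; the sample data and the column name rate_per_1000 are invented for illustration:

```python
import pandas as pd

# Made-up sample data with the column names from the question
df = pd.DataFrame({"examcount": [0, 100, 200], "incidentcountlevel1": [3, 0, 10]})

rates = []
for row in df.itertuples(index=False):
    if row.examcount < 1:
        # Cannot be calculated for this row
        rates.append(float("nan"))
    else:
        rates.append(row.incidentcountlevel1 / row.examcount * 1000)

# Assign once after the loop instead of writing into df row by row
df["rate_per_1000"] = rates
```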
I hope it helps.
I have a Pandas DataFrame, df:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
and a dict, mask:
mask = {1:32,2:64,3:100,4:200}
I want my end result to be a DataFrame like this:
A B C
1 1 32
2 2 64
2 3 96
4 4 400
nan nan nan
Right now I am doing this, which seems inefficient:
for idx, row in df.iterrows():
    if not math.isnan(row['A']):
        if row['A'] != 1:
            df.loc[idx, 'C'] = row['B'] * mask[row['A'] - 1]
        else:
            df.loc[idx, 'C'] = row['B'] * mask[row['A']]
Is there an easy way to vectorize this?
This should work:
df['C'] = df.B * (df.A - (df.A != 1)).map(mask)
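The one-liner packs three steps together. Spelled out as a sketch on the question's data (the intermediate names keys and factors are mine, and the explicit astype(int) is only there to make the boolean-to-number step visible):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 4, np.nan], "B": [1, 2, 3, 4, 5]})
mask = {1: 32, 2: 64, 3: 100, 4: 200}

# Step 1: subtract 1 from every A except where A == 1 (NaN stays NaN)
keys = df["A"] - (df["A"] != 1).astype(int)

# Step 2: look each key up in the dict; keys with no match become NaN
factors = keys.map(mask)

# Step 3: multiply by B
df["C"] = df["B"] * factors
```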
Timing
10,000 rows
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(2000)])
100,000 rows
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(20000)])
Here is an option using apply and the dictionary's get method, which returns None if the key is not in the dictionary:
df['C'] = df.apply(lambda r: mask.get(r.A) if r.A == 1 else mask.get(r.A - 1), axis = 1) * df.B
df
#     A  B    C
#0    1  1   32
#1    2  2   64
#2    2  3   96
#3    4  4  400
#4  NaN  5  NaN
I have a function which returns two lists, so I can save those in two variables like:
list_a,list_b = my_function(input)
I want to save this directly into a dataframe, something like this:
df[['list_a','list_b']] = my_function(input)
I got the following error:
array is not broadcastable to correct shape
Use
df['B'], df['C'] = my_function()
to unpack the tuple of lists returned by my_function and assign the lists to df['B'] and df['C']:
import pandas as pd
N = 5
def my_function():
    return [10]*N, [20]*N
df = pd.DataFrame({'A':[1]*N})
df['B'], df['C'] = my_function()
yields
   A   B   C
0  1  10  20
1  1  10  20
2  1  10  20
3  1  10  20
4  1  10  20
Note that the lengths of the lists returned by my_function must match the length of df.
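A quick way to see the length requirement in action is to assign lists that are too short. In this sketch, short_function is a made-up stand-in that returns three items for a five-row frame, and pandas is expected to raise a ValueError:

```python
import pandas as pd

df = pd.DataFrame({"A": [1] * 5})

def short_function():
    # Hypothetical function returning lists that are too short for df
    return [10] * 3, [20] * 3

try:
    df["B"], df["C"] = short_function()
    error = None
except ValueError as exc:
    # Raised because 3 values cannot fill an index of length 5
    error = exc
```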
import pandas as pd
list_a, list_b = my_function(input)
# Passing a dict gives one column per list
df = pd.DataFrame({'a': list_a, 'b': list_b})
or combined into one line:
df = pd.DataFrame(dict(zip(['a', 'b'], my_function(input))))