How can I pass an optional function as a keyword into another function called with pd.apply?
For example, sometimes the dataframe I am working with has z in it:
df = pd.DataFrame(
{
"x": [1, 2, 3, 4],
"y": [100, 200, 300, 400],
"z": [0.1, 0.2, 0.3, 0.4],
}
)
df
This is easy because I can just use my function (much simplified herein of course):
def myfunc(dataframe):
    a = dataframe["z"] * 2
    return pd.Series(
        {
            "a": a,
        }
    )
To get my final dataframe with a:
output = df.apply(
myfunc,
axis=1,
)
df = df.join(output)
df
In the case where z is not in my starting dataframe, I need to call a function to calculate it. I can call several different functions to calculate it, and I want to be able to name a specific function. Here, I want to use the function computeZ.
def computeZ(dataframe):
    results = dataframe["x"] / 20
    return results
df = pd.DataFrame(
{
"x": [1, 2, 3, 4],
"y": [100, 200, 300, 400],
}
)
df
Here is the problem: I am trying this below, but I get TypeError: myfunc2() takes 1 positional argument but 2 were given. What is the correct way to do this? For various reasons, I need to use pd.apply rather than working in numpy arrays or using anonymous functions.
def myfunc2(dataframe, **kwds):
    if analysis:
        res = globals()[analysis](dataframe["x"])
        a = res * 2
    else:
        a = dataframe["z"] * 2
    return pd.Series(
        {
            "a": a,
        }
    )
output = df.apply(
myfunc2,
axis=1,
args=("computeZ",),
)
It looks like you need a second positional parameter in the myfunc2 definition. With axis=1, the first argument passed is the row itself, followed by your positional parameter:
def myfunc2(row, p2):
    return p2(row)
The positional argument you pass should then be the function object itself, not its name as a string (computeZ rather than "computeZ"):
output = df.apply(
myfunc2,
axis=1,
args=(computeZ,)
)
output
0 0.05
1 0.10
2 0.15
3 0.20
Or you could do it using kwargs:
def myfunc2(row, **kwargs):
    return kwargs['your_func'](row)
output = df.apply(
myfunc2,
axis=1,
your_func=computeZ
)
output
0 0.05
1 0.10
2 0.15
3 0.20
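To cover the "optional" part of the question, here is a minimal sketch that combines both cases in one function. The names myfunc3 and analysis are hypothetical, and it assumes the function passed in takes a single row and returns z for that row:

```python
import pandas as pd


def computeZ(row):
    # Same calculation as computeZ above, applied to one row at a time
    return row["x"] / 20


def myfunc3(row, analysis=None):
    # `myfunc3` and the `analysis` keyword are hypothetical names for this sketch.
    # If a function is supplied, use it to compute z; otherwise read the z column.
    if analysis is not None:
        z = analysis(row)
    else:
        z = row["z"]
    return pd.Series({"a": z * 2})


df_no_z = pd.DataFrame({"x": [1, 2, 3, 4], "y": [100, 200, 300, 400]})
df_with_z = df_no_z.assign(z=[0.1, 0.2, 0.3, 0.4])

# Extra keyword arguments given to DataFrame.apply are forwarded to the function
out_computed = df_no_z.apply(myfunc3, axis=1, analysis=computeZ)
out_existing = df_with_z.apply(myfunc3, axis=1)
```

Using a keyword with a default of None means the same function works whether or not z is already in the dataframe, and there is no need to look the function up by name in globals().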
I have a dataset like the following:
import pandas as pd
dict1 = {
"idx": [1, 2, 9, 1, 1, 6, 1, 3, 2],
"value": ["a1", "b1", "c1", "t1", "t1", "f1", "r1", "l1", "b1"]
}
df = pd.DataFrame(dict1)
Here, a1, t1, r1 always occur with 1. To verify this, I used the following code,
def one_to_one(df, col1, col2):
    first = df.groupby(col1)[col2].count().max()
    second = df.groupby(col2)[col1].count().max()
    return first + second == 2
one_to_one(df, 'idx', 'value')
But it returned None even though a1, t1, r1 always occur with 1 and the others behave the same way. How can I verify that a1, for example, only occurs with idx = 1?
You can just drop_duplicates based on the idx + value combination. If each value has only one idx, then the result should contain the same number of records as the number of unique values in the value column:
len(df[['idx', 'value']].drop_duplicates()) == df['value'].nunique()
# True
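An alternative sketch, assuming the same df as in the question: count the distinct idx values per value with groupby and nunique. The variable names here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "idx": [1, 2, 9, 1, 1, 6, 1, 3, 2],
    "value": ["a1", "b1", "c1", "t1", "t1", "f1", "r1", "l1", "b1"],
})

# Number of distinct idx values seen for each value
idx_per_value = df.groupby("value")["idx"].nunique()

# True if every value maps to exactly one idx
every_value_has_one_idx = bool((idx_per_value == 1).all())

# Check a single value: the set of idx values that occur with "a1"
a1_only_with_1 = set(df.loc[df["value"] == "a1", "idx"]) == {1}
```

The per-value counts in idx_per_value also show which values break the rule, if any do.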
Hi guys, I wrote this small piece of code. As I understand the reduce function, it should give me the total of the prices, but it doesn't. Can anyone explain what's going on?
import functools
bill = [["jaket" , 2500, 2] , ["ball" , 450 , 1] , ["calculator" , 120 , 4]]
lmbd_total = lambda data1 , data2 : data1[1]+data2[1]
totall = functools.reduce(lmbd_total , bill)
print(totall)
lmbd_total = lambda data1 , data2 : data1[1]+data2[1]
The first iteration adds correctly. On the second iteration, however, the accumulator data1 is an int object and not a list, so data1[1] fails because an int is not subscriptable. Pass an int as the initializer so that the accumulator is always an int.
bill = [["jaket" , 2500, 2] , ["ball" , 450 , 1] , ["calculator" , 120 , 4 ]]
lmbd_total = lambda acc , data2 : acc+data2[1]
totall = functools.reduce(lmbd_total , bill,0)
print(totall)
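To make the accumulator behaviour concrete, here is a rough sketch of what reduce does with an initializer, written as a plain loop. The helper add_price and the steps list are hypothetical names used only for this illustration:

```python
import functools

bill = [["jaket", 2500, 2], ["ball", 450, 1], ["calculator", 120, 4]]


def add_price(acc, item):
    # acc is always an int; item is always one of the inner lists
    return acc + item[1]


# Roughly what functools.reduce(add_price, bill, 0) does, step by step
acc = 0
steps = []
for item in bill:
    acc = add_price(acc, item)
    steps.append(acc)

total_with_reduce = functools.reduce(add_price, bill, 0)
```

Both the loop and reduce end with the same total, and steps holds the value of the accumulator after each call.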
reduce works like chaining: the result of the previous sum goes into the next invocation. So when you return an int from the first sum, it is passed to the next call as data1, and data1[1] then fails.
So you'll have to return something that "looks like" the data you are passing to reduce:
import functools
bill = [["jaket", 2500, 2], ["ball", 450, 1], ["calculator", 120, 4]]
lmbd_total = lambda data1, data2: (None, data1[1] + data2[1], None)
totall = functools.reduce(lmbd_total, bill)
print(totall[1])
Which is quite ugly-looking. You can get the same result with a simple sum():
bill = [["jaket", 2500, 2], ["ball", 450, 1], ["calculator", 120, 4]]
totall = sum(a[1] for a in bill)
print(totall)
Now that you have your answer, it may also be useful to know about accumulate from itertools, which does basically the same thing as functools.reduce as described in TheMaster's concise and to-the-point answer. The only difference is that accumulate also returns the intermediate values, which could be suitable when you are working with data frames and would like to populate a new column based on other columns:
from itertools import accumulate
list(accumulate(bill, lambda x, y: x + y[1], initial = 0))
[0, 2500, 2950, 3070]
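If you want one running total per item (without the leading 0 from the initial value), one option is to extract the prices first and accumulate those. This is only a sketch; prices and running_totals are names made up here. Note that the initial keyword used above needs Python 3.8 or newer.

```python
from itertools import accumulate

bill = [["jaket", 2500, 2], ["ball", 450, 1], ["calculator", 120, 4]]

# Pull out the price column, then accumulate with the default addition
prices = [item[1] for item in bill]
running_totals = list(accumulate(prices))

# Pair each item name with the running total up to and including that item
named_totals = list(zip((item[0] for item in bill), running_totals))
```

Here running_totals has the same length as bill, which is what you would need to fill a new column row by row.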
I have a df as below:
I would like to group by id and flag and create a new column in the df which is the result of: [sum(value1)/sum(value2)] * 12. Therefore I will need the result to be:
I have created a function:
def calculation(value1, value2):
    result = (value1/value2) * 12
    return(result)
Could you advise which is the best way to apply this function along with the grouping, in order to get the desired output?
Many thanks
The following code should work.
import pandas as pd
df = pd.DataFrame({"id" : [1,1,2,2],"flag":["A","B","B","A"],"value1":[520,200,400,410],"value2":[12,5,11,2]})
def calculation(value1, value2):
    result = (value1/value2) * 12
    return(result)
df.groupby(['id','flag']).apply(lambda x: calculation(x['value1'],x['value2'])).astype(int)
Here's a solution using apply and a lambda function.
import pandas as pd
df = pd.DataFrame([
[1, 'A', 520, 12],
[1, 'B', 200, 5],
[2, 'B', 400, 11],
[2, 'A', 410, 2]],
columns=['id', 'flag', 'value1', 'value2'])
df.groupby(['id', 'flag']).apply(lambda x: (12 * x['value1']) / x['value2'])
If you want to use the function calculation above then just call the apply method like this.
df.groupby(['id', 'flag']).apply(lambda x: calculation(x['value1'], x['value2']))
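Note that the question asks for [sum(value1)/sum(value2)] * 12 per group. With the sample data every (id, flag) group has a single row, so the row-wise versions above give the same numbers, but with several rows per group you would probably want to sum first. A minimal sketch, assuming the same sample data (sums and the result column are names chosen here):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "flag": ["A", "B", "B", "A"],
    "value1": [520, 200, 400, 410],
    "value2": [12, 5, 11, 2],
})


def calculation(value1, value2):
    return (value1 / value2) * 12


# Sum both columns per group first, then apply the calculation to the sums
sums = df.groupby(["id", "flag"])[["value1", "value2"]].sum()
sums["result"] = calculation(sums["value1"], sums["value2"])

# Back to ordinary columns if id and flag are needed as columns
result = sums.reset_index()
```

This avoids apply altogether, since the calculation works directly on the two summed columns.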
I have two Series, each containing variables that I want to use in a function. I want to apply the function for each combination of variables with the resulting output being a DataFrame of the calculated values, the index will be the index from one Series and the columns will be the index of the other Series.
I have tried searching for an answer to a similar problem - I'm sure there's one out there but I'm not sure how to describe it for search engines.
I've solved the problem by creating a function using for loops, so you can understand the logic. I want to know if there is a more efficient operation to do this without using for loops.
From what I've read, I'm imagining some combination of a list comprehension with zipped columns to calculate the values, which is then reshaped into the DataFrame but I can't solve it this way.
Here's the code to reproduce the problem and current solution.
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
# Here is an unused function as an example
myfunc = lambda x, y: x * (1 + 1/y)
def func1(values, bands):
    # Initialise empty DataFrame
    df = pd.DataFrame(index=bands.index,
                      columns=values.index)
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for month, month_val in values.items():
        for band, band_val in bands.items():
            df.at[band, month] = band_val * (1/month_val - 1)
    return df
outcome = func1(values, bands)
You could use numpy.outer for this:
import numpy as np
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
outcome = pd.DataFrame(np.outer(bands, ((1 / values) - 1)),
index=bands.index,
columns=values.index)
[out]
   Jan       Feb       Mar       Apr
A  0.0 -0.098039 -0.238095 -0.535714
B  0.0 -0.333333 -0.809524 -1.821429
C  0.0 -0.176471 -0.428571 -0.964286
D  0.0 -0.666667 -1.619048 -3.642857
As a function:
def myFunc(ser1, ser2):
    result = pd.DataFrame(np.outer(ser1, ((1 / ser2) - 1)),
                          index=ser1.index,
                          columns=ser2.index)
    return result
myFunc(bands, values)
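np.outer only covers the case where the function is a product of two terms. For a more general two-argument function, one possible approach is NumPy broadcasting. The sketch below assumes the function works element-wise on arrays; pairwise is a hypothetical helper name, not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')


def pairwise(func, rows, cols):
    # Hypothetical helper: evaluate func for every (row, column) combination.
    # A (n, 1) array combined with a (1, m) array broadcasts to shape (n, m).
    grid = func(rows.to_numpy()[:, None], cols.to_numpy()[None, :])
    return pd.DataFrame(grid, index=rows.index, columns=cols.index)


outcome = pairwise(lambda band, val: band * (1 / val - 1), bands, values)
```

Any function built from element-wise operations can be passed in the same way, for example the unused myfunc from the question.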
I tried to calculate specific quantile values from a data frame, as shown in the code below. There was no problem when calculating them on separate lines.
When attempting to run the last 2 lines, I get the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'quantile(0.25)'
How can I fix this?
import pandas as pd
df = pd.DataFrame(
{
'x': [0, 1, 0, 1, 0, 1, 0, 1],
'y': [7, 6, 5, 4, 3, 2, 1, 0],
'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000]
}
)
f = {'number': ['median', 'std', 'quantile']}
df1 = df.groupby('x').agg(f)
df.groupby('x').quantile(0.25)
df.groupby('x').quantile(0.75)
# code below with problem:
f = {'number': ['median', 'std', 'quantile(0.25)', 'quantile(0.75)']}
df1 = df.groupby('x').agg(f)
I prefer def functions
def q1(x):
    return x.quantile(0.25)
def q3(x):
    return x.quantile(0.75)
f = {'number': ['median', 'std', q1, q3]}
df1 = df.groupby('x').agg(f)
df1
Out[1643]:
  number
  median           std     q1     q3
x
0  52500  17969.882211  40000  61250
1  43000  16337.584481  35750  55000
@WeNYoBen's answer is great. There is one limitation though: you need to create a new function for every quantile. This can be a very unpythonic exercise if the number of quantiles becomes large. A better approach is to use a function to create a function, and to rename that function appropriately.
def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator
def q_at(y):
    @rename(f'q{y:0.2f}')
    def q(x):
        return x.quantile(y)
    return q
f = {'number': ['median', 'std', q_at(0.25) ,q_at(0.75)]}
df1 = df.groupby('x').agg(f)
df1
Out[]:
  number
  median           std  q0.25  q0.75
x
0  52500  17969.882211  40000  61250
1  43000  16337.584481  35750  55000
The rename decorator renames the function so that the pandas agg function can deal with the reuse of the quantile function returned (otherwise all quantiles results end up in columns that are named q).
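As a quick illustration of why this scales, here is a self-contained sketch that builds the aggregation list from a tuple of quantiles. The quantile values chosen and the name many are arbitrary assumptions for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 0, 1, 0, 1, 0, 1],
    'y': [7, 6, 5, 4, 3, 2, 1, 0],
    'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000],
})


def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator


def q_at(y):
    # Each returned function gets its own name, e.g. q0.10, q0.50, q0.90
    @rename(f'q{y:0.2f}')
    def q(x):
        return x.quantile(y)
    return q


# One function per quantile, generated instead of written by hand
many = df.groupby('x').agg({'number': [q_at(p) for p in (0.1, 0.5, 0.9)]})
```

The resulting columns are named after the generated functions, so each quantile ends up in its own column.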
There's a nice way if you want to give names to aggregated columns:
df1.groupby('x').agg(
q1_foo=pd.NamedAgg('number', q1),
q2_foo=pd.NamedAgg('number', q2)
)
where q1 and q2 are functions.
Or even simpler:
df1.groupby('x').agg(
q1_foo=('number', q1),
q2_foo=('number', q2)
)
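For completeness, a runnable sketch of the named-aggregation form, assuming the original df from the question and the q1 and q3 functions defined in the earlier answer. The output column names number_q1 and number_q3 are arbitrary choices for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 0, 1, 0, 1, 0, 1],
    'y': [7, 6, 5, 4, 3, 2, 1, 0],
    'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000],
})


def q1(x):
    return x.quantile(0.25)


def q3(x):
    return x.quantile(0.75)


# keyword = (column, function): the keyword becomes the output column name
named = df.groupby('x').agg(
    number_q1=('number', q1),
    number_q3=('number', q3),
)
```

With named aggregation the result has flat column names, so the functions do not need to be renamed to keep the columns apart.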