Passing columns as arguments to pandas groupby apply function - python

Let's say I have the following dataframe:
import numpy as np
import pandas as pd

a = np.random.rand(10)
b = np.random.rand(10)*10
c = np.random.rand(10)*100
groups = np.array([1,1,2,2,2,2,3,3,4,4])
df = pd.DataFrame({"a": a, "b": b, "c": c, "groups": groups})
I simply want to group the df by groups and apply the following function to columns a and b of each group:
def my_fun(x, y):
    tmp = np.sum(x * y) / np.sum(y)
    return tmp
What I tried is:
df.groupby("groups").apply(my_fun,("a","b"))
But that does not work and gives me this error:
ValueError: Unable to coerce to Series, the length must be 4: given 2
The final output is basically a single number for each group. I can get around the problem with loops, but I think there should be a better approach?
Thanks

Without changing your function, you want to do:
df.groupby("groups").apply(lambda d: my_fun(d["a"],d["b"]))
Output:
groups
1 0.603284
2 0.183289
3 0.828273
4 0.361103
dtype: float64
That said, you can rewrite your function so it takes in a dataframe as the first positional argument:
def myfunc(data, val_col, weight_col):
    return np.sum(data[val_col] * data[weight_col]) / np.sum(data[weight_col])

df.groupby('groups').apply(myfunc, 'a', 'b')
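If you want to avoid apply altogether, the same per-group weighted average can be computed with vectorized operations (a minimal sketch using the columns above):
# vectorized weighted average of "a" with weights "b", per group
weighted_sums = (df["a"] * df["b"]).groupby(df["groups"]).sum()
weight_totals = df["b"].groupby(df["groups"]).sum()
result = weighted_sums / weight_totals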

Related

Python TypeError: cannot convert the series to <class 'int'> when using math.floor() for iloc index lookup value

I'm having an issue where I need a function to find a corresponding value from within a dataframe with multiple rows, which looks similar to:
Value
0 1.2332165631653
1 6.5651324661235
2 2.3651432415454
3 1.6566584651432
4 9.5168743514354
5 ...
My function looks like this:
import math
import pandas as pd

df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')

def dfFunction(A, B):
    Step = 10
    AB = A * B
    ABInt = math.floor(AB / Step)
    dfValue = df1.iloc[ABInt]
    return AB / dfValue
When I input A and B values as int or float, the function works, but when I try to apply the function to df2 (similar to df1 in terms of layout, just with additional columns of floats), I get this error.
I've tried df2.apply(dfFunction(df2.ColumnA, df2.ColumnB), axis = 1) and simply dfFunction(df2.ColumnA, df2.ColumnB).
I fundamentally understand the error, since it's highlighting the math.floor() line, but I can't look up a row index of df1 with a float. Is there another way to structure the function or look up the data value? I'd just use iloc() if the values weren't floats with massive decimal places, but they are means from another portion of the code.
Please let me know if further clarification is needed; I'm only a beginner with Python and Stack :)
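For what it's worth, one way around this (an untested sketch, assuming df1 holds the single Value column shown above and that the computed indices stay within df1's bounds) is to do the flooring elementwise with NumPy instead of calling math.floor on a whole Series:
import numpy as np

Step = 10
AB = df2['ColumnA'] * df2['ColumnB']        # elementwise product, still a Series
ABInt = np.floor(AB / Step).astype(int)     # np.floor works elementwise; math.floor does not
dfValue = df1['Value'].to_numpy()[ABInt]    # positional lookup for every row at once
result = AB / dfValue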

apply with lambda and apply without lambda

I am trying to use the impliedVolatility function in df_spx.apply() while hardcoding the variable inputs S, K, r, price, T, payoff, and c_or_p.
However, it does not work; with the same impliedVolatility function, it only works when I use lambda + apply.
[code link][1]
# first version of code
S = SPX_spot
K = df_spx['strike_price']
r = df_spx['r']
price = df_spx['mid_price']
T = df_spx['T_years']
payoff = df_spx['cp_flag']
c_or_p = df_spx["cp_flag"]
df_spx["iv"] = df_spx.apply(impliedVolatility(c_or_p, S, K, T, r,price),axis=1)
# second version of code
df_spx["impliedvol"] = df_spx.apply(
lambda r: impliedVolatility(r["cp_flag"],
S,
r["strike_price"],
r['T_years'],
r["r"],
r["mid_price"]),
axis = 1)
[1]: https://i.stack.imgur.com/yBfO5.png
You have to give apply a function that it can call, i.e. a callable. In your first example
df_spx.apply(impliedVolatility(c_or_p, S, K, T, r, price), axis=1)
you are passing the result of calling the function as a parameter to apply. That will not work. If you instead wrote
df_spx.apply(impliedVolatility, c_or_p=c_or_p, S=S, K=K, T=T, r=r, price=price, axis=1)
(assuming the function's keyword arguments have those names), or
df_spx.apply(impliedVolatility, args=(c_or_p, S, K, T, r, price), axis=1)
then it might work. Notice we are not calling impliedVolatility inside the apply; we are passing the function itself as an argument.
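As a minimal illustration of the args= form (toy data and a hypothetical scale function, not from the question):
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

def scale(row, factor):
    # apply passes the row as the first argument; args= supplies the rest
    return row["x"] * factor

df.apply(scale, args=(10,), axis=1)   # returns 10.0, 20.0, 30.0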
There is already a pretty good answer, but maybe to give it a different perspective: apply loops over your data and calls the function you provide on each piece of it.
Say you have:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": list("asd")})
df
Out:
a b
0 1 a
1 2 s
2 3 d
If you want to create new data or perform certain work on any of the columns (you could also do it at the entire row level, which by the way is your use case, but let's simplify for now), you might consider using apply. Say you just wanted to multiply every input by two:
def multiply_by_two(val):
    return val * 2

df.b.apply(multiply_by_two)  # case 1
Out:
0 aa
1 ss
2 dd
df.a.apply(multiply_by_two) # case 2
Out:
0 2
1 4
2 6
The first usage example transformed your one-letter strings into two-letter strings, while the second is obvious. You should avoid using apply in the second case, because it is a simple mathematical operation that will be extremely slow compared to df.a * 2. Hence, my rule of thumb: use apply when performing operations with non-numeric objects (case 1). NOTE: there is no actual need for a lambda in this simple case.
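To make the rule of thumb concrete, a quick comparison sketch (exact timings vary by machine and pandas version):
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000))

fast = s * 2                      # vectorized: one call into optimized C code
slow = s.apply(lambda v: v * 2)   # one Python-level call per element, typically far slower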
So what apply does is passing each element of the series to the function.
Now, if you apply on an entire dataframe, the values passed will be a data slice, as a Series. Hence, to properly apply your function you will need to map the inputs. For instance:
def add_2_to_a_multiply_b(b, a):
    return (a + 2) * b

df.apply(lambda row: add_2_to_a_multiply_b(*row), axis=1)  # ERROR: the row unpacks as (df.a, df.b), and you can't add an integer to a string
df.apply(lambda row: add_2_to_a_multiply_b(row['b'], row['a']), axis=1)
Out:
0 aaa
1 ssss
2 ddddd
From this point on you can build more complex implementations, for instance with partial functions:
from functools import partial

def add_to_a_multiply_b(b, a, *, val_to_add):
    return (a + val_to_add) * b

specialized_func = partial(add_to_a_multiply_b, val_to_add=2)
df.apply(lambda row: specialized_func(row['b'], row['a']), axis=1)
Just to stress it again, avoid apply if you are performance-eager:
# 'OK-ISH', does the job... but
def strike_price_minus_mid_price(strike_price, mid_price):
    return strike_price - mid_price

new_data = df.apply(lambda r: strike_price_minus_mid_price(r["strike_price"], r["mid_price"]), axis=1)
vs
# 'BETTER'
new_data = df["strike_price"] - df["mid_price"]

Python PANDAS: Applying a function to a dataframe, with arguments defined within dataframe

I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The value of 'UseFactorA' and 'UseFactorB' are one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed based on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
[Category] [Factor1] [Factor2] [Factor3] [Factor4] [UseFactorA] [UseFactorB]
A          1         2         5         8         'Factor1'    'Factor3'
B          2         7         4         2         'Factor3'    'Factor1'
The 'Result' series should be [0.2, 2].
However, I cannot figure out how to feed the values of UseFactorA and UseFactorB into an index to make this happen. If the columns to use were fixed, I would just write
df['Result'] = df['Factor1']/df['Factor2']
However, when I try
df['Results'] = df[df['UseFactorA']]/df[df['UseFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the sets of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
Here's the one-liner:
df['Results'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works:
df[df['UseFactorA']]
returns a dataframe,
df[df['UseFactorA'][x]]
returns a Series, and
df[df['UseFactorA'][x]][x]
pulls a single value from that Series.
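A vectorized alternative is NumPy integer indexing (a sketch, assuming every UseFactorA/UseFactorB value names an existing column):
import numpy as np

# map each row's UseFactorA/UseFactorB label to a column position
col_a = df.columns.get_indexer(df['UseFactorA'])
col_b = df.columns.get_indexer(df['UseFactorB'])
rows = np.arange(len(df))

vals = df.to_numpy()   # object array, since the frame mixes numbers and strings
df['Result'] = vals[rows, col_a] / vals[rows, col_b]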

how do you pass multiple variables to pandas dataframe to use them with .map to create a new column

To pass multiple variables to a normal python function you can just write something like:
def a_function(date, string, float):
    # do something....
    # convert string to int
    # date = date + (float * int) days
    return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one, two, three):
    return one + two + three

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c': [4,5,6]})
df['d'] = df['a','b','c'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one, two, three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
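(Side note: current pandas exposes vectorized date components through the .dt accessor, so a sketch of the faster route would be:)
components = pandas.DataFrame({'Year': df.date.dt.year,
                               'Month': df.date.dt.month,
                               'Day': df.date.dt.day})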
In this situation, apply was historically the only option, since older pandas had no vectorized way to access the individual date components (the .dt accessor above now provides one). However, in some cases you can use vectorized operations:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis=1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72

Pandas groupby apply function that combines some groups but not others

I'm using pandas groupby on my DataFrame df which has columns type, subtype, and 11 others. I'm then calling an apply with my combine_function (needs a better name) on the groups like:
grouped = df.groupby('type')
reduced = grouped.apply(combine_function)
where my combine_function checks whether the group contains any element with the given subtype, say 1, and looks like:
def combine_function(group):
    if 1 in group.subtype:
        return aggregate_function(group)
    else:
        return group
combine_function can then call aggregate_function, which calculates summary statistics, stores them in the first row, and then reduces the group to that row. It looks like:
def aggregate_function(group):
    first = group.first_valid_index()
    group.value1[group.index == first] = group.value1.mean()
    group.value2[group.index == first] = group.value2.max()
    group.value3[group.index == first] = group.value3.std()
    group = group[(group.index == first)]
    return group
I'm fairly sure this isn't the best way to do this, but it has been giving me the desired results 99.9% of the time on thousands of DataFrames. However, it sometimes throws an error that is somehow related to a group that I don't want to aggregate having exactly 2 rows:
ValueError: Shape of passed values is (13,), indices imply (13, 5)
where, in one example, the groups had sizes:
In [4]: grouped.size()
Out[4]:
type
1 9288
3 7667
5 7604
11 2
dtype: int64
It processed the first three groups fine, then gave the error when it tried to combine everything. If I comment out the line group = group[(group.index == first)], so that I update but don't aggregate, or if I call my aggregate_function on all groups, it works fine.
Does anyone know the proper way to be doing this kind of aggregation of some groups but not others?
Your aggregate_function looks contorted to me. When you aggregate a group, it automatically reduces to one row; you don't need to do that manually. Maybe I am missing the point. (Are you doing something special with the index that I'm not understanding?) But a more normal usage would look like this:
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': np.mean, 'value2': np.max, 'value3': np.std}
df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([df1, df2])
Note: agg_condition is messy because (1) the built-in Python in operator refers to the index of a Series, not its values, and (2) the result has to be reduced to a scalar by any().
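Put together on toy data (a hedged sketch; the type/subtype/value column names follow the question, the values are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type':    [1, 1, 2, 2, 2],
    'subtype': [1, 9, 9, 9, 9],
    'value1':  [1.0, 2.0, 3.0, 4.0, 5.0],
    'value2':  [5.0, 6.0, 7.0, 8.0, 9.0],
    'value3':  [9.0, 1.0, 2.0, 3.0, 4.0],
})

agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': 'mean', 'value2': 'max', 'value3': 'std'}

aggregated = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
untouched = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([aggregated, untouched])   # type 1 collapses to one row; type 2 rows are kept as-is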
