I am trying to use the impliedVolatility function in df_spx.apply() while hardcoding the variable inputs S, K, r, price, T, payoff, and c_or_p.
However, it does not work; with the same impliedVolatility function, it only works when I use a lambda together with apply.
[code link][1]
# first version of code
S = SPX_spot
K = df_spx['strike_price']
r = df_spx['r']
price = df_spx['mid_price']
T = df_spx['T_years']
payoff = df_spx['cp_flag']
c_or_p = df_spx["cp_flag"]
df_spx["iv"] = df_spx.apply(impliedVolatility(c_or_p, S, K, T, r,price),axis=1)
# second version of code
df_spx["impliedvol"] = df_spx.apply(
lambda r: impliedVolatility(r["cp_flag"],
S,
r["strike_price"],
r['T_years'],
r["r"],
r["mid_price"]),
axis = 1)
[1]: https://i.stack.imgur.com/yBfO5.png
You have to give apply a function that it can call, i.e. a callable. In your first example
df_spx.apply(impliedVolatility(c_or_p, S, K, T, r,price), axis=1)
you are passing the result of calling the function as a parameter to apply. That will not work. If you instead wrote
df_spx.apply(impliedVolatility, c_or_p=c_or_p, S=S, K=K, T=T, r=r, price=price, axis=1)
provided the function's parameters have those same names, or if you wrote
df_spx.apply(impliedVolatility, args=(c_or_p, S, K, T, r,price), axis=1)
then it might work. Notice that we are not calling impliedVolatility inside the apply; we are passing the function itself as an argument.
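For illustration, here is a minimal sketch of passing a function object to apply and forwarding extra arguments via args (the column names and the factor argument are made up, not taken from your data):

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

def scaled_sum(row, factor):
    # row is the Series that apply passes for each row; factor arrives via `args`
    return (row["x"] + row["y"]) * factor

# The function itself is passed; it is not called here
df["z"] = df.apply(scaled_sum, args=(2,), axis=1)
# df["z"] is now 22, 44, 66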
There is already a pretty good answer, but maybe I can give it a different perspective: apply is going to loop over your data and call the function you provide on it.
Say you have:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": list("asd")})
df
Out:
a b
0 1 a
1 2 s
2 3 d
If you want to create new data or perform certain work on any of the columns (you could also do it at the entire row level, which btw is your use case, but let's simplify for now) you might consider using apply. Say you just wanted to multiply every input by two:
def multiply_by_two(val):
    return val * 2
df.b.apply(multiply_by_two) # case 1
Out:
0 aa
1 ss
2 dd
df.a.apply(multiply_by_two) # case 2
Out:
0 2
1 4
2 6
The first usage example turned each one-letter string into a string of that letter repeated twice, while the second is obvious. You should avoid using apply in the second case, because it is a simple mathematical operation that will be extremely slow compared to df.a * 2. Hence, my rule of thumb: use apply when performing operations with non-numeric objects (case 1). NOTE: there is no actual need for a lambda in this simple case.
So what apply does is pass each element of the Series to the function.
Now, if you apply on an entire dataframe, the values passed will be data slices (rows or columns) as Series. Hence, to properly apply your function you will need to map the inputs. For instance:
def add_2_to_a_multiply_b(b, a):
    return (a + 2) * b
df.apply(lambda row: add_2_to_a_multiply_b(*row), axis=1) # ERROR because the values are unpacked as (df.a, df.b) and you can't add integers and strings (see `add_2_to_a_multiply_b`)
df.apply(lambda row: add_2_to_a_multiply_b(row['b'], row['a']), axis=1)
Out:
0 aaa
1 ssss
2 ddddd
From this point on you can build more complex implementations, for instance using partial functions:
from functools import partial

def add_to_a_multiply_b(b, a, *, val_to_add):
    return (a + val_to_add) * b

specialized_func = partial(add_to_a_multiply_b, val_to_add=2)
df.apply(lambda row: specialized_func(row['b'], row['a']), axis=1)
Just to stress it again, avoid apply if you are eager for performance:
# 'OK-ISH', does the job... but
def strike_price_minus_mid_price(strike_price, mid_price):
    return strike_price - mid_price

new_data = df.apply(lambda r: strike_price_minus_mid_price(r["strike_price"], r["mid_price"]), axis=1)
vs
# 'BETTER'
new_data = df["strike_price"] - df["mid_price"]
Related
I have two questions, but first I will give the context. I am trying to use a pandas DataFrame with some existing code using a functional programming approach. I basically want to map a function to every row of a DataFrame, expanding the row using the double-asterisk keyword argument notation, where each column name of the DataFrame corresponds to one of the arguments of the existing function.
For example, say I have the following function.
def line(m, x, b):
y = (m * x) + b
return y
And I have a pandas DataFrame
data = [{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}]
df = pd.DataFrame(data)
# Returns
# b m x
# 0 1 1 2
# 1 2 2 3
Ultimately, I want to construct a column in the DataFrame from the results of line applied to each row; something like the following.
# Note that I'm using the list of dicts defined above, not the DataFrame.
results = [line(**datum) for datum in data]
I feel like I should be able to use some combination of DataFrame.apply, a lambda, probably Series.to_dict, and the double-asterisk keyword argument expansion but I can't figure out what is passed to the lambda in the following expression.
df.apply(lambda x: x, axis=1)
# ^
# What is pandas passing to my identity lambda?
I've tried to inspect with type and x.__class__, but both of the following lines throw TypeErrors.
df.apply(lambda x: type(x), axis=1)
df.apply(lambda x: x.__class__, axis=1)
I don't want to write/refactor a new line function that can wrangle some pandas object because I shouldn't have to. Ultimately, I want to end up with a DataFrame with columns for the input data and a column with the corresponding output of the line function.
My two questions are:
How can I pass a row of a pandas DataFrame to a function using keyword-argument expansion, either using the DataFrame.apply method or some other (functional) approach?
What exactly is DataFrame.apply passing to the function that I specify?
Maybe there is some other functional approach I could take that I'm just not aware of, but I figure pandas is a pretty popular library for this kind of thing and that's why I'm trying to use it. Also there are some data (de)serialization issues I'm facing that pandas should make pretty easy vs. writing a more bespoke solution.
Thanks.
Maybe this is what you are looking for.
1)
df.apply(lambda x: line(**x.to_dict()), axis=1)
Result
0 3
1 8
2)
The function for df.apply(..., axis=1) receives a Series representing a row with the column names as index entries.
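If you want to see this for yourself, here is a small sketch (reusing the line function and data from the question) that prints what apply hands to the callback:

import pandas as pd

def line(m, x, b):
    return (m * x) + b

data = [{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}]
df = pd.DataFrame(data)

def peek(row):
    # Each call receives one row as a pandas Series, indexed by the column names
    print(type(row), row.to_dict())
    return line(**row.to_dict())

df["y"] = df.apply(peek, axis=1)
# df["y"] is [3, 8], matching the result above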
Let's say I have the following dataframe:
import numpy as np
import pandas as pd

a = np.random.rand(10)
b = np.random.rand(10)*10
c = np.random.rand(10)*100
groups = np.array([1,1,2,2,2,2,3,3,4,4])
df = pd.DataFrame({"a":a,"b":b,"c":c,"groups":groups})
I simply want to group by the df based on groups and apply the following function to two columns (a and b) of each group:
def my_fun(x, y):
    tmp = np.sum(x * y) / np.sum(y)
    return tmp
What I tried is:
df.groupby("groups").apply(my_fun,("a","b"))
But that does not work and gives me error:
ValueError: Unable to coerce to Series, the length must be 4: given 2
The final output is basically a single number for each group. I can get around the problem by loops but I think there should be a better approach?
Thanks
Without changing your function, you want to do:
df.groupby("groups").apply(lambda d: my_fun(d["a"],d["b"]))
Output:
groups
1 0.603284
2 0.183289
3 0.828273
4 0.361103
dtype: float64
That said, you can rewrite your function so it takes in a dataframe as the first positional argument:
def myfunc(data, val_col, weight_col):
    return np.sum(data[val_col] * data[weight_col]) / np.sum(data[weight_col])
df.groupby('groups').apply(myfunc, 'a', 'b')
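As a quick, reproducible sanity check (the seed and the comparison are my addition; the question's data was unseeded), both spellings return the same per-group weighted averages:

import numpy as np
import pandas as pd

def my_fun(x, y):
    return np.sum(x * y) / np.sum(y)

def myfunc(data, val_col, weight_col):
    return np.sum(data[val_col] * data[weight_col]) / np.sum(data[weight_col])

np.random.seed(0)
df = pd.DataFrame({
    "a": np.random.rand(10),
    "b": np.random.rand(10) * 10,
    "c": np.random.rand(10) * 100,
    "groups": np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4]),
})

via_lambda = df.groupby("groups").apply(lambda d: my_fun(d["a"], d["b"]))
via_rewrite = df.groupby("groups").apply(myfunc, "a", "b")
print(via_lambda.equals(via_rewrite))  # True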
This question is born of this comment thread. Using Pandas 0.20.3.
I'm trying to understand why an apply() operation throws the error:
ValueError: Wrong number of items passed 2, placement implies 1
This specific flavor of Pandas ValueError is not uncommon, but it usually comes from a more obvious attempt to cram a bunch of elements into a data structure that is designed for a lesser capacity. The same thing must be going on here, but I can't figure out why.
Given a data frame with columns of integers A and B:
import pandas as pd
df = pd.DataFrame({'A': [1,2], 'B': [3,4]})
df
A B
0 1 3
1 2 4
I can construct a new column, C, which is a column of lists.
Each list in C contains values from A and B.
C should look like:
C
[1, 3]
[2, 4]
I am choosing to build C using apply() and a list comprehension:
df['C'] = df.apply(lambda x: [val for val in x], axis=1)
(For now, please overlook the possibility that this is not the most elegant way to achieve this goal - it's mainly a route to get to the error I'm confused about.)
This throws the ValueError noted above.
But, I can create lists with more items per row without difficulty:
df['C'] = df.apply(lambda x: [val for val in x]+[1], axis=1)
df
A B C
0 1 3 [1, 3, 1]
1 2 4 [2, 4, 1]
I would have thought that I'd get the same error, just with Wrong number of items passed 3... instead of 2.
I can also create C with fewer items:
df['C'] = df.apply(lambda x: [val for val in x][:1], axis=1)
df
A B C
0 1 3 [1]
1 2 4 [2]
Additionally, C builds when the first row's list length is shorter or longer than [1,3], but fails when the first row's list length matches len([1,3]), even if subsequent list lengths are different:
df['C'] = df.apply(lambda x: [val for val in x if val != 1], axis=1) # this works
df['C'] = df.apply(lambda x: [val for val in x if val != 4], axis=1) # this fails
Given all of these different cases, I don't understand what placement implies 1 is referring to, and why I can't just make the lists in C with the elements of A and B, using this approach.
How am I misinterpreting this error message?
It seems that this behavior results from a mix of (a) .apply() trying to be helpful and (b) abusing .apply() as a means of outputting non-scalar values. It has been fixed in Pandas version 0.21.
I've cobbled together this explanation from various Pandas Github issues pages [1, 2, 3], some of which are also linked to in this answer. It's not really an explanation of why this happens at the implementation level, but it at least substantively answers the question.
Happy to accept any relevant updates/edits.
On .apply() trying to be helpful:
If a multi-dimensional value is returned that has the same shape as the input DataFrame, apply will infer a DataFrame as output:
TomAugspurger: DataFrame.apply tries to infer an output based on the result. The result of your output is inferred to be a DataFrame with the same columns. [ref]
jreback: The issue is .apply has to try to figure out what you are returning and how that maps to the starting data. [ref]
On abusing .apply() as a means of outputting non-scalar values:
In short, don't if you can avoid it. If you must, expect occasionally funny results.
jreback: Note that returning non-scalars is generally not recommended and is not efficiently supported. [ref]
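If the goal is simply a column holding [A, B] lists, here is a sketch of one way to get there without returning non-scalars from apply at all (same toy frame as in the question):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Build the lists from the underlying 2-D array instead of using apply
df['C'] = df[['A', 'B']].values.tolist()
#    A  B       C
# 0  1  3  [1, 3]
# 1  2  4  [2, 4]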
I encountered this lambda expression today and can't understand how it's used:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
The line of code doesn't seem to call the lambda function or pass any arguments into it, so I'm confused how it does anything at all. The purpose of this is to take two columns, CSD and SCHOOL CODE, and combine the entries in each row into a new column, DBN. So does this lambda expression ever get used?
You're writing your results incorrectly to a column. data["class_size"]["DBN"] is not the correct way to select the column to write to. You've also selected a column to use apply with but you'd want that across the entire dataframe.
data["DBN"] = data.apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
the apply method of a pandas Series takes a function as one of its arguments.
here is a quick example of it in action:
import pandas as pd
data = {"numbers":range(30)}
def cube(x):
    return x**3
df = pd.DataFrame(data)
df['squares'] = df['numbers'].apply(lambda x: x**2)
df['cubes'] = df['numbers'].apply(cube)
print(df)
gives:
numbers squares cubes
0 0 0 0
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
...
as you can see, either defining a function (like cube) or using a lambda function works perfectly well.
As has already been pointed out, if you're having problems with your particular piece of code it's that you have data["class_size"]["DBN"] = ... which is incorrect. I was assuming that was an odd typo because you didn't mention getting a key error, which is what that would result in.
if you're confused about this, consider:
def list_apply(func, mylist):
    newlist = []
    for item in mylist:
        newlist.append(func(item))
    return newlist
this is a (not very efficient) function for applying a function to every item in a list. if you used it with cube as before:
a_list = range(10)
print(list_apply(cube, a_list))
you get:
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
this is a simplistic example of how the apply function in pandas is implemented. I hope that helps?
Are you using a multi-index dataframe (i.e. there are column hierarchies)? It's hard to tell without seeing your data, but I'm presuming that is the case, since just using data["class_size"].apply() would yield a Series on a normal dataframe (meaning the lambda wouldn't be able to find the columns you specified, and there would be an error!)
I actually found this answer, which explains the problem of trying to create columns in multi-index dataframes. One confusing thing with multi-index column creation is that you can try to create a column the way you are doing and it will seem to run without any issues, but it won't actually create what you want. Instead, you need to change data["class_size"]["DBN"] = ... to data["class_size", "DBN"] = ... So, in full:
data["class_size","DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
Of course, if it isn't a multi-index dataframe then this won't help, and you should look towards one of the other answers.
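Here is a minimal sketch of that multi-index case; the CSD and SCHOOL CODE values are made up purely to illustrate the column layout:

import pandas as pd

cols = pd.MultiIndex.from_tuples([("class_size", "CSD"), ("class_size", "SCHOOL CODE")])
data = pd.DataFrame([[1, "K123"], [13, "M456"]], columns=cols)

# Creates the new column ('class_size', 'DBN') instead of silently doing nothing
data["class_size", "DBN"] = data["class_size"].apply(
    lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
# ('class_size', 'DBN') now holds '01K123' and '13M456'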
I think {0:02d} formats the "CSD" value as a zero-padded two-digit integer (rather than 2 decimal places). {}{} basically places the 2 values together to form 'DBN'.
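A quick check of what that format string produces (the values here are made up):

print("{0:02d}{1}".format(5, "K123"))   # 05K123
print("{0:02d}{1}".format(13, "M456"))  # 13M456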
As the title says, I've been trying to build a Pandas DataFrame from another df using a for loop, calculating each new column from the last one built.
So far, I've tried :
df = pd.DataFrame(np.arange(10))
df.columns = [10]
df1 = pd.DataFrame(np.arange(10))
df1.columns = [10]
steps = np.linspace(10,1,10,dtype = int)
This works:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-1)
But when I try building df and df1 at the same time like so :
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-df1[i])
    df1[i-1] = df1[i].apply(lambda a: a-1)
It returns a lot of gibberish plus the line:
ValueError : Wrong number of items passed 10, placement implies 1
In this example, I am well aware that I could build df1 first and build df after. But it returns the same error if I try:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-df1[i])
    df1[i-1] = df1[i].apply(lambda a: a-df[i])
That is what I really need in the end.
Any help is much appreciated,
Alex
apply applies a function along an axis that you specify: it can be 0 (the function is applied to each column) or 1 (it is applied to each row). By default, it applies the function to the columns. In your first example:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-1)
Each column is looped over by your for loop, and .apply subtracts 1 from every element of that column (a here is each individual element). The net effect is exactly the same as the following:
for i in steps:
    print(i)
    df[i - 1] = df[i] - 1
Another way to get a feel for .apply is the following. Assume I have this dataframe:
df = pd.DataFrame(np.random.rand(10,4))
df.sum() and df.apply(lambda a: np.sum(a)) yield exactly the same result. It is just a simple example, but you can do more powerful calculations if needed.
Note that .apply is not the fastest method, so try to avoid it if you can.
An example where apply would be useful is if you have a function some_fct() defined that takes int or float as arguments and you would like to apply it to the elements of a dataframe column.
import pandas as pd
import numpy as np
import math
def some_fct(x):
    return math.sin(x) / x
np.random.seed(100)
df = pd.DataFrame(np.random.rand(10,2))
Obviously, some_fct(df[0]) would not work, as the function takes an int or float as its argument while df[0] is a Series. However, using the apply method, you can apply the function to the elements of df[0], which are themselves floats.
df[0].apply(lambda x: some_fct(x))
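As a side note (my addition, reusing df and some_fct from the snippet above), the lambda wrapper is not strictly needed, and because np.sin already works element-wise on a Series there is a fully vectorized alternative:

df[0].apply(some_fct)    # same result, without the lambda
np.sin(df[0]) / df[0]    # vectorized alternative that avoids apply entirely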
Found it, I just needed to drop the .apply!
Example:
df = pd.DataFrame(np.arange(10))
df.columns = [10]
df1 = pd.DataFrame(np.arange(10))
df1.columns = [10]
steps = np.linspace(10,1,10,dtype = int)
for i in steps:
    print(i)
    df[i-1] = df[i] - df1[i]
    df1[i-1] = df1[i] + df[i]
It does exactly what it should!
I don't have enough knowledge about Python to explain why pd.DataFrame().apply() will not correctly use values coming from outside of it.
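For what it's worth, here is a minimal sketch of what may have been going on in the failing version (my reading, not verified against the original data): inside Series.apply the lambda receives one scalar at a time, so subtracting a whole column produces a Series per element, and pandas then cannot place that result into a single column:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10), columns=[10])
df1 = pd.DataFrame(np.arange(10), columns=[10])

# `a` is a single scalar here, while df1[10] is a whole Series,
# so every call returns a length-10 Series and the overall result is a 10x10 DataFrame
result = df[10].apply(lambda a: a - df1[10])
print(result.shape)  # (10, 10)
# Assigning this to a single column (df[i-1] = ...) is what triggered the ValueError above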