Pandas Series.apply - use arguments from another Series? - python

I have the following statement:
>>> df['result'] = df['value'].apply(myfunc, args=(x,y,z))
The Python function myfunc was written before I started using Pandas and is set up to take single values. The arguments x and z are fixed and can easily be passed as a variable or literal, but a column in my DataFrame represents the y parameter, so I'm looking for a way to use each row's own y value (it differs from row to row).
i.e. df['y'] is a series of values that I'd like to send in to myfunc
My workaround is as follows:
values = list(df['value'])
y = list(df['y'])
df['result'] = pd.Series([myfunc(values[i],x,y[i],z) for i in range(0,len(values))])
Any better approaches?
EDIT
Using functools.partial has a gotcha that I had to work out: if your call does not stick to keyword arguments, the remaining arguments are bound positionally and can collide with the keywords already fixed in the partial, producing the 'myfunc() got multiple values for...' error.
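A minimal sketch of the failure mode, using the same argument names as the corrected snippet below (here w stands for a single value from df['value'] and y for the matching value from df['y']):
from functools import partial

myfunc_p = partial(myfunc, arg2=x, arg4=z)
myfunc_p(w, y)            # TypeError: myfunc() got multiple values for argument 'arg2'
                          # w binds to arg1, but y binds positionally to arg2,
                          # which the partial has already fixed
myfunc_p(arg1=w, arg3=y)  # fine: everything is keyword-bound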
I modified the answer from coldspeed:
# Function myfunc takes named arguments arg1, arg2, arg3 and arg4
# The values for arg2 and arg4 don't change so I'll set them when
# defining the partial (assume x and z have values set)
myfunc_p = partial(myfunc, arg2=x, arg4=z)
df['result'] = [myfunc_p(arg1=w, arg3=y) for w, y in zip(df['value'], df['y'])]

You could also apply over the rows with a lambda like so:
df['result'] = df.apply(lambda row: myfunc(row['value'], y=row['y'], x=x, z=z), axis=1)

I think what you're doing is fine. I'd maybe make a couple of improvements:
from functools import partial
myfunc_p = partial(myfunc, x=x, z=z)
df['result'] = [myfunc_p(v, y) for v, y in zip(df['value'], df['y'])]
You don't need to wrap the list in a pd.Series call, and you can clean up your function call by fixing two of the arguments with functools.partial.
There's also the other option using np.vectorize (disclaimer, this does not actually vectorize the function, just hides the loop) for more concise code, but in most cases the list comprehension should be faster.
import numpy as np

myfunc_v = np.vectorize(partial(myfunc, x=x, z=z))
df['result'] = myfunc_v(df['value'], df['y'])
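A side note, not in the original answer: np.vectorize also accepts an excluded argument that keeps the fixed parameters from being broadcast, which avoids the partial entirely. It is still just a hidden loop, and it assumes (like the partial above) that myfunc accepts x and z as keyword arguments:
myfunc_v = np.vectorize(myfunc, excluded={'x', 'z'})
df['result'] = myfunc_v(df['value'], df['y'], x=x, z=z)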

Related

What is `pandas.DataFrame.apply` actually operating on?

I have two questions, but first I will give the context. I am trying to use a pandas DataFrame with some existing code using a functional programming approach. I basically want to map a function to every row of a DataFrame, expanding the row using the double-asterisk keyword argument notation, where each column name of the DataFrame corresponds to one of the arguments of the existing function.
For example, say I have the following function.
def line(m, x, b):
    y = (m * x) + b
    return y
And I have a pandas DataFrame
data = [{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}]
df = pd.DataFrame(data)
# Returns:
#    b  m  x
# 0  1  1  2
# 1  2  2  3
Ultimately, I want to construct a column in the DataFrame from the results of line applied to each row; something like the following.
# Note that I'm using the list of dicts defined above, not the DataFrame.
results = [line(**datum) for datum in data]
I feel like I should be able to use some combination of DataFrame.apply, a lambda, probably Series.to_dict, and the double-asterisk keyword argument expansion but I can't figure out what is passed to the lambda in the following expression.
df.apply(lambda x: x, axis=1)
# ^
# What is pandas passing to my identity lambda?
I've tried to inspect with type and x.__class__, but both of the following lines throw TypeErrors.
df.apply(lambda x: type(x), axis=1)
df.apply(lambda x: x.__class__, axis=1)
I don't want to write/refactor a new line function that can wrangle some pandas object because I shouldn't have to. Ultimately, I want to end up with a DataFrame with columns for the input data and a column with the corresponding output of the line function.
My two questions are:
How can I pass a row of a pandas DataFrame to a function using keyword-argument expansion, either using the DataFrame.apply method or some other (functional) approach?
What exactly is DataFrame.apply passing to the function that I specify?
Maybe there is some other functional approach I could take that I'm just not aware of, but I figure pandas is a pretty popular library for this kind of thing and that's why I'm trying to use it. Also there are some data (de)serialization issues I'm facing that pandas should make pretty easy vs. writing a more bespoke solution.
Thanks.
Maybe this is what you are looking for.
1)
df.apply(lambda x: line(**x.to_dict()), axis=1)
Result
0 3
1 8
2)
The function passed to df.apply(..., axis=1) receives each row as a Series whose index entries are the column names.
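Two shortcuts worth adding (my addition, not part of the original answer): a Series itself supports the mapping protocol, so the to_dict() call can be dropped, and the list-of-records route stays purely functional:
df['y'] = df.apply(lambda row: line(**row), axis=1)
# or, closer to the original list-of-dicts spirit:
df['y'] = [line(**rec) for rec in df.to_dict(orient='records')]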

How do I apply a function over a column?

I have created a function I would like to apply over a given dataframe column. Is there an apply function so that I can create a new column and apply my created function?
Example code:
dat = pd.DataFrame({'title': ['cat', 'dog', 'lion','turtle']})
Manual method that works:
print(calc_similarity(chosen_article, str(dat['title'][1]), model_word2vec))
print(calc_similarity(chosen_article, str(dat['title'][2]), model_word2vec))
Attempt to apply over dataframe column:
dat['similarity'] = calc_similarity(chosen_article, str(dat['title']), model_word2vec)
The issue I have been running into is that the function outputs the same result over the entirety of the newly created column.
I have tried apply() as follows:
dat['similarity'] = dat['title'].apply(lambda x: calc_similarity(chosen_article, str(x), model_word2vec))
and
dat['similarity'] = dat['title'].astype(str).apply(lambda x: calc_similarity(chosen_article, x, model_word2vec))
Both result in a ZeroDivisionError, which I don't understand since I am not passing empty strings.
Function being used:
def calc_similarity(input1, input2, vectors):
    s1words = set(vocab_check(vectors, input1.split()))
    s2words = set(vocab_check(vectors, input2.split()))
    output = vectors.n_similarity(s1words, s2words)
    return output
It sounds like you are having difficulty applying a function while passing additional keyword arguments. Here's how you can do that:
# By default, apply passes each element of the Series as the first
# positional argument; the remaining arguments can be given as kwargs.
dat['similarity'] = dat['title'].apply(
    calc_similarity,
    input2=chosen_article,
    vectors=model_word2vec
)
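A self-contained toy illustrating the same pattern (the function and data here are hypothetical stand-ins, since chosen_article and model_word2vec are not defined in the question):
import pandas as pd

def count_shared_letters(text, other, case_sensitive=False):
    # hypothetical stand-in for calc_similarity
    if not case_sensitive:
        text, other = text.lower(), other.lower()
    return len(set(text) & set(other))

dat = pd.DataFrame({'title': ['cat', 'dog', 'lion', 'turtle']})
dat['shared'] = dat['title'].apply(count_shared_letters, other='catalog')
As for the ZeroDivisionError: a likely cause is that vocab_check filters out every token of some title, so n_similarity receives an empty word set, which gensim rejects with exactly that error. This is an inference, not something stated in the question.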

apply with lambda and apply without lambda

I am trying to use the impliedVolatility function in df_spx.apply() while passing the inputs S, K, r, price, T, payoff, and c_or_p directly.
However, it does not work; with the same function impliedVolatility, only the lambda + apply version works.
[code link][1]
# first version of code
S = SPX_spot
K = df_spx['strike_price']
r = df_spx['r']
price = df_spx['mid_price']
T = df_spx['T_years']
payoff = df_spx['cp_flag']
c_or_p = df_spx["cp_flag"]
df_spx["iv"] = df_spx.apply(impliedVolatility(c_or_p, S, K, T, r,price),axis=1)
# second version of code
df_spx["impliedvol"] = df_spx.apply(
lambda r: impliedVolatility(r["cp_flag"],
S,
r["strike_price"],
r['T_years'],
r["r"],
r["mid_price"]),
axis = 1)
[1]: https://i.stack.imgur.com/yBfO5.png
You have to give apply a callable. In your first example
df_spx.apply(impliedVolatility(c_or_p, S, K, T, r, price), axis=1)
you are passing the result of calling the function to apply, which does not work. If you instead wrote
df_spx.apply(impliedVolatility, c_or_p=c_or_p, S=S, K=K, T=T, r=r, price=price, axis=1)
(assuming the function's keyword arguments have those names), or
df_spx.apply(impliedVolatility, args=(c_or_p, S, K, T, r, price), axis=1)
then it might work. Notice we are not calling impliedVolatility inside the apply; we are passing the function itself as an argument.
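A caveat worth spelling out (my note, not part of the original answer): with axis=1, apply passes the row itself as the first positional argument, so the plain-function forms above only behave as intended if the function's first parameter is meant to receive a row. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

def f(row, scale):
    # the first parameter receives the whole row as a Series
    return (row['x'] + row['y']) * scale

df['z'] = df.apply(f, args=(2,), axis=1)  # z = (x + y) * 2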
There is already a pretty good answer, but maybe this gives a different perspective: apply loops over your data and calls the function you provide on each piece.
Say you have:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": list("asd")})
df
Out:
   a  b
0  1  a
1  2  s
2  3  d
If you want to create new data or perform certain work on any of the columns (you could also do it at the entire row level, which is in fact your use case, but let's simplify for now), you might consider using apply. Say you just wanted to multiply every input by two:
def multiply_by_two(val):
    return val * 2
df.b.apply(multiply_by_two) # case 1
Out:
0 aa
1 ss
2 dd
df.a.apply(multiply_by_two) # case 2
Out:
0 2
1 4
2 6
The first usage example turns each one-letter string into a two-letter string, while the second is obvious. You should avoid apply in the second case, because a simple mathematical operation like that is extremely slow through apply compared to df.a * 2. Hence my rule of thumb: use apply when performing operations on non-numeric objects (case 1). NOTE: no actual need for a lambda in this simple case.
So what apply does is passing each element of the series to the function.
Now, if you apply over an entire DataFrame with axis=1, each value passed to the function is a row slice as a Series. Hence, to apply your function properly you need to map the inputs. For instance:
def add_2_to_a_multiply_b(b, a):
    return (a + 2) * b
# ERROR: *row unpacks in column order as (df.a, df.b), so b gets the
# integer and a gets the string, and (a + 2) fails on a string
df.apply(lambda row: add_2_to_a_multiply_b(*row), axis=1)
df.apply(lambda row: add_2_to_a_multiply_b(row['b'], row['a']), axis=1)
Out:
0 aaa
1 ssss
2 ddddd
From this point on you can build more complex implementations, for instance using partial functions:
def add_to_a_multiply_b(b, a, *, val_to_add):
    return (a + val_to_add) * b

from functools import partial
specialized_func = partial(add_to_a_multiply_b, val_to_add=2)
df.apply(lambda row: specialized_func(row['b'], row['a']), axis=1)
Just to stress it again, avoid apply if you are performance-minded:
# 'OK-ISH', does the job... but
def strike_price_minus_mid_price(strike_price, mid_price):
    return strike_price - mid_price

new_data = df.apply(lambda r: strike_price_minus_mid_price(r["strike_price"], r["mid_price"]), axis=1)
vs
# 'BETTER'
new_data = df["strike_price"] - df["mid_price"]

In Python, rename variables using a parameter of a function

I am creating a function. One input of this function will be a pandas DataFrame, and one of its tasks is to do some operation with two variables of this DataFrame. These two variables are not fixed, and I want the freedom to determine them through parameters passed as inputs to the function fun.
For example, suppose that at some moment the variables I want to use are 'var1' and 'var2' (at another time, I may want to use two other ones). Suppose these variables take values 1, 2, 3, 4 and I want to reduce df by keeping the rows where var1 == 1 and var2 == 1. My function is like this:
def fun(df, val, var=['input_var1', 'input_var2']):
    df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
    # Other operations
    df = df.loc[(df.aux_var1 == val) & (df.aux_var2 == val)]
    # end of operations
    # recover
    df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
    return df
When I call the function fun, I get the following error:
fun(df, var = ['var1','var2'], val = 1)
IndexError: list index out of range
Actually, I want to do other more complex operations and I didn't describe these operations so as not to extend the question. Perhaps the simple example above has a solution that does not need to rename the variables. But maybe this solution doesn't work with the operations I really want to do. So first, I would necessarily like to correct the error when renaming the variables. If you want to give another more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if besides the elegant solution, you offer me the solution about renaming.
Python lists are zero-indexed, i.e. the first element's index is 0.
Just change the lines:
df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
to
df = df.rename(columns={var[0]: 'aux_var1', var[1]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[0], 'aux_var2': var[1]})
respectively.
In this case you are accessing var[2] but a 2-element list in Python has elements 0 and 1. Element 2 does not exist and therefore accessing it is out of range.
As mentioned in other answers, the error you are receiving is due to the zero-indexing of Python lists: to access the first element of the list var, you take index 0 rather than 1, i.e. var[0].
On the topic of renaming, however: you can filter a pandas DataFrame without renaming any columns. You are accessing the column as an attribute of the DataFrame, but you can achieve the same via the __getitem__ method, more commonly used with square brackets, e.g. df[var[0]].
If you wish to have more generality over your function without any renaming happening, I can suggest this:
from functools import reduce

def fun(df, var, val):
    _sub = reduce(
        lambda x, y: x & (df[y] == val),
        var,
        pd.Series(True, index=df.index)  # align the accumulator with df's index
    )
    return df[_sub]
This will work with any number of input column variables. Hope it serves as inspiration for the more complicated operations you intend to do.
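A quick usage sketch with hypothetical data, plus the even shorter all(axis=1) equivalent:
df = pd.DataFrame({'var1': [1, 2, 1], 'var2': [1, 1, 3], 'other': list('abc')})
fun(df, var=['var1', 'var2'], val=1)          # keeps only the first row
df[(df[['var1', 'var2']] == 1).all(axis=1)]   # equivalent one-liner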

How to unpack the columns of a pandas DataFrame to multiple variables

Lists or numpy arrays can be unpacked to multiple variables if the dimensions match. For a 2xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, which is already close to what I'm trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores columns separately in Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns in a single ndarray, which likely involves copying actions and type conversions. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
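For instance, a quick illustration of the dtype point (my example):
df_mixed = pd.DataFrame({'i': [1, 2], 's': ['a', 'b']})
df_mixed.to_numpy().dtype  # object: the per-column dtypes are gone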
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The df.values method shown above is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of chained methods, I have a solution: adding a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    return (
        self[col]
        for col in self.columns
    )
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or, can be used in a method chaining.
Imagine a geo function that has to take a latitude series as its first argument and a longitude series as its second, named do_something_geographical(lat, lon):
df_result = (
    df
    # ... some method chaining ...
    .assign(
        geographic_result=lambda dataframe: do_something_geographical(
            *dataframe[["lat", "lon"]].unpack()
        )
    )
    # ... some method chaining ...
)
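To make that concrete, here is a runnable sketch with a hypothetical stand-in for do_something_geographical:
def do_something_geographical(lat, lon):
    # hypothetical: format each coordinate pair as "lat,lon"
    return lat.astype(str) + "," + lon.astype(str)

df_result = df.assign(
    geographic_result=lambda d: do_something_geographical(*d[["lat", "lon"]].unpack())
)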
