Python lambda function dynamic creation (pandas example) [duplicate] - python

This question already has answers here:
Creating functions (or lambdas) in a loop (or comprehension)
(6 answers)
Closed 6 months ago.
I'd like to create a dictionary that contains lambda functions for the convenient filtering of pandas data frames. When I instantiate each dictionary item line by line, I get the behaviour I want. But when I use a for loop, the filters use the last value of n. Does the lambda function reference the global variable n, and not its value at the time of instantiation? Is my understanding of lambda functions off?
Note, this example is watered down. In my actual project, I use a DateTime index, and the dictionary will have integer keys that filter by year, e.g. df.index.year == 2020, and some string keys that filter by week/weekend, time of day, etc.
import pandas as pd
data = [[1,2],[3,4],[5,6]] # example df
df = pd.DataFrame(index=range(len(data)), data=data)
filts = {}
filts[1] = lambda df: df[df.index == 1] # making a filter dictionary
filts[2] = lambda df: df[df.index == 2] # of lambda funcs
print(filts[1](df)) # works as expected
print(filts[2](df))
filts = {}
for n in range(len(data)):
    filts[n] = lambda df: df[df.index == n] # also tried wrapping n in int
    # n = 0 # changes behaviour
print(filts[0](df)) # print out the results for n = 2
print(filts[1](df)) # same problem as above
# further investigating lambdas
filts = {}
n = 0
filts[n] = lambda df: df[df.index == n] # making a filter dictionary
n = 1
filts[n] = lambda df: df[df.index == n] # of lambda funcs
print(filts[0](df)) # print out the results for n = 1

I'm not sure about the duplicate status of the question, but I'm answering because I've run into this with pandas myself. You can solve your problem using a closure.
Change your loop as follows:
for n in range(len(data)):
    filts[n] = (lambda n: lambda df: df[df.index == n])(n)
What's wrong with OP's approach?
Lambdas keep a reference to the variable, not a copy of its value. So n here refers to the loop variable itself. When you eventually evaluate your lambdas, every one of them (in all the entries of filts) looks up the current value of n, which by then is the final value assigned in the loop. Hence, what you're seeing is expected. The takeaway: "The lambda's closure holds a reference to the variable being used, not its value, so if the value of the variable later changes, the value in the closure also changes." (source)
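Another common workaround (not part of the fix above, but the standard idiom also covered in the linked duplicate) is to bind the current value of n as a default argument, because default values are evaluated when the lambda is defined. A minimal sketch, reusing df and data from the question:
filts = {}
for n in range(len(data)):
    # n=n freezes the current value of n at definition time
    filts[n] = lambda df, n=n: df[df.index == n]
print(filts[0](df))  # rows where index == 0
print(filts[1](df))  # rows where index == 1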

Related

How to improve the performance of a while loop which is processing a large dataframe for each element in a large list?

I have code with the following while loop:
def get_names(key):
    temp = pd.DataFrame(columns=['name','tot_count'])
    df2 = dfx.loc[(dfx['name1'] == key) | (dfx['name2'] == key) | (dfx['name3'] == key) | (dfx['name4'] == key)]
    for c in cl:
        y = df2[[c,'tot_count']]
        y.columns = ['name','tot_count']
        temp = pd.concat([temp,y])
    temp['key'] = key
    return temp
cl = ['name1', 'name2', 'name3', 'name4']
n1 = len(s1) #s1 is the list of keywords
i = 0
d = pd.DataFrame(columns=['name','key'])
while i < n1:
    key = s1[i]
    y = get_names(key)
    d = pd.concat([d,y])
    dfx = dfx.loc[(dfx['name1'] != key) & (dfx['name2'] != key) & (dfx['name3'] != key) & (dfx['name4'] != key)]
    print(i)
    i = i + 1
Here I have thousands of keywords in the list s1. The while loop processes each keyword in s1 via the function get_names, which uses the dataframe dfx; dfx has millions of rows and contains the rows relevant to each keyword. After a keyword has been processed, I no longer need its rows in the main dataframe dfx to be re-processed for another keyword, so I delete them inside the while loop.
Question: The run time of this code is about 2.5 hours. How can I improve the code to make it run faster on the same hardware?
d = pd.concat([d, y]) should almost never be used in a loop. It results in quadratic execution time, because a new, ever-growing dataframe is created on every iteration. You should collect the pieces in a list and concatenate once at the end instead (see this post). The same applies to temp = pd.concat([temp, y]).
Additionally, repeatedly scanning the whole dataframe just to find a key is clearly inefficient; it again leads to quadratic execution, and quadratic execution is a dead end for big-data computation. This can be solved by pre-computing dataframe parts with groupby and storing them in a dictionary whose keys are the searched keys. With that, any fetch runs in constant time instead of linear time. This is a bit more complex with four columns, but the idea is the same: pre-compute by operating on groups rather than on separate items or rows.
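A minimal sketch of both ideas (collect the pieces in a list, and pre-group the rows by key), simplified to a single name column rather than the four in the question, might look like this:
import pandas as pd

# toy stand-ins for dfx and s1 from the question
dfx = pd.DataFrame({'name1': ['a', 'b', 'a', 'c'],
                    'tot_count': [1, 2, 3, 4]})
s1 = ['a', 'b', 'c']

# pre-compute the rows for every key once; lookups are then constant time
groups = {key: grp for key, grp in dfx.groupby('name1')}

parts = []                                  # collect the pieces in a list ...
for key in s1:
    grp = groups.get(key)
    if grp is None:
        continue
    parts.append(grp[['name1', 'tot_count']]
                 .rename(columns={'name1': 'name'})
                 .assign(key=key))

d = pd.concat(parts, ignore_index=True)     # ... and concatenate once at the end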

faster way to run a for loop for a very large dataframe list

I am using two nested for loops to calculate a value from combinations of elements in a list of dataframes. The list consists of a large number of dataframes, and using two for loops takes a considerable amount of time.
Is there a way I can do the operation faster?
The functions I refer to with dummy names are the ones where I calculate the results.
My code looks like this:
conf_list = []
for tr in range(len(trajectories)):
    df_1 = trajectories[tr]
    if len(df_1) == 0:
        continue
    for tt in range(len(trajectories)):
        df_2 = trajectories[tt]
        if len(df_2) == 0:
            continue
        if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
            continue
        df_temp = cartesian_product_basic(df_1, df_2)
        flg, df_temp = another_function(df_temp)
        if flg == 0:
            continue
        flg_h = some_other_function(df_temp)
        if flg_h == 1:
            conf_list.append(1)
My input list consists of around 5000 dataframes (each having several hundred rows) that look like this:
id  x  y  z  time
1   5  7  2  5
What I do is take the cartesian product of each pair of dataframes, and for each couple I calculate another value c. If this value c meets a condition, I add an element to my conf_list so that I can get the final number of couples meeting the requirement.
For further info:
cartesian_product_basic(df_1, df_2) is a function that returns the cartesian product of the two dataframes.
another_function looks like this:
def another_function(df_temp):
    df_temp['z_dif'] = nwh((df_temp['time_x'] == df_temp['time_y']),
                           abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
    df_temp = df_temp.dropna()
    df_temp['vert_conf'] = nwh((df_temp['z_dif'] >= 1000),
                               np.nan, 1)
    df_temp = df_temp.dropna()
    if len(df_temp) == 0:
        flg = 0
    else:
        flg = 1
    return flg, df_temp
and some_other_function looks like this:
def some_other_function(df_temp):
    df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
    df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
    df_temp['hor_dif'] = hypot(df_temp['x_dif'], df_temp['y_dif'])
    df_temp['conf'] = np.where((df_temp['hor_dif'] <= 5),
                               1, np.nan)
    if df_temp['conf'].sum() > 0:
        flg_h = 1
    return flg_h
The following are ways to make your code run faster:
Instead of a for loop, use a list comprehension.
Use built-in functions like map, filter, sum etc.; this will make your code faster.
Avoid repeated attribute ('.') lookups, for example:
import datetime
a = datetime.datetime.now()   # don't use this: two attribute lookups on every call
from datetime import datetime
timenow = datetime.now        # bind the method once
a = timenow()                 # use this
Use C/C++-based libraries like numpy for numerical operations.
Don't convert datatypes unnecessarily.
In infinite loops, use 1 instead of True.
Use built-in libraries.
If the data will not change, convert it to a tuple.
Use string concatenation.
Use multiple assignments.
Use generators.
When using if-else to check a Boolean value, avoid the explicit comparison.
# Instead of the approach below
if a == 1:
    print('a is 1')
else:
    print('a is 0')
# Try this approach
if a:
    print('a is 1')
else:
    print('a is 0')
# This helps because the time spent comparing the two values is saved.
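Applying some of these ideas to the nested loops in the question, here is a minimal sketch. It assumes each unordered pair of trajectories only needs to be evaluated once (the original double loop visits every ordered pair), and it reuses the question's helper functions unchanged:
from itertools import combinations

# drop empty dataframes once, instead of re-checking them in every inner iteration
non_empty = [df for df in trajectories if len(df) > 0]

conf_list = []
for df_1, df_2 in combinations(non_empty, 2):
    # cheap time-overlap test first, so the expensive cartesian product
    # is only built for pairs that can actually interact
    if df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
        continue
    df_temp = cartesian_product_basic(df_1, df_2)
    flg, df_temp = another_function(df_temp)
    if flg == 0:
        continue
    if some_other_function(df_temp) == 1:
        conf_list.append(1)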
Useful references:
Speeding up Python Code: Fast Filtering and Slow Loops
Speed Up Python Code

apply with lambda and apply without lambda

I am trying to use the impliedVolatility function in df_spx.apply() while hardcoding the variable inputs S, K, r, price, T, payoff, and c_or_p.
However, it does not work; with the same function impliedVolatility, it only works when I use a lambda inside apply.
[code link][1]
# first version of code
S = SPX_spot
K = df_spx['strike_price']
r = df_spx['r']
price = df_spx['mid_price']
T = df_spx['T_years']
payoff = df_spx['cp_flag']
c_or_p = df_spx["cp_flag"]
df_spx["iv"] = df_spx.apply(impliedVolatility(c_or_p, S, K, T, r,price),axis=1)
# second version of code
df_spx["impliedvol"] = df_spx.apply(
lambda r: impliedVolatility(r["cp_flag"],
S,
r["strike_price"],
r['T_years'],
r["r"],
r["mid_price"]),
axis = 1)
[1]: https://i.stack.imgur.com/yBfO5.png
You have to give apply a function that it can call, i.e. a callable. In your first example
df_spx.apply(impliedVolatility(c_or_p, S, K, T, r, price), axis=1)
you are passing the result of calling the function as the parameter to apply. That will not work. If you instead wrote
df_spx.apply(impliedVolatility, c_or_p=c_or_p, S=S, K=K, T=T, r=r, price=price, axis=1)
(assuming the function's keyword arguments have those names), or if you wrote
df_spx.apply(impliedVolatility, args=(c_or_p, S, K, T, r, price), axis=1)
then it might work. Notice we are not calling impliedVolatility in the apply; we are passing the function itself as an argument.
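A small self-contained sketch of the difference, with toy_func as a hypothetical stand-in for impliedVolatility (made-up data, just to show the mechanics):
import pandas as pd

df = pd.DataFrame({"strike_price": [100, 110], "mid_price": [5.0, 2.5]})

def toy_func(row, scale):
    # apply(axis=1) passes each row as a Series; extra positional args come from `args`
    return (row["strike_price"] - row["mid_price"]) * scale

# Wrong: this calls toy_func immediately and hands its *result* to apply
# df.apply(toy_func(df, 2), axis=1)   # fails, the result is not callable

# Right: pass the function itself and supply the extra argument via `args`
out = df.apply(toy_func, args=(2,), axis=1)
print(out)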
There is already a pretty good answer, but maybe this gives a different perspective. apply is going to loop over your data and call the function you provide on each piece.
Say you have:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": list("asd")})
df
Out:
a b
0 1 a
1 2 s
2 3 d
If you want to create new data or perform certain work on any of the columns (you could also do it at the entire row level, which btw is your use case, but let's simplify for now) you might consider using apply. Say you just wanted to multiply every input by two:
def multiply_by_two(val):
    return val * 2
df.b.apply(multiply_by_two) # case 1
Out:
0 aa
1 ss
2 dd
df.a.apply(multiply_by_two) # case 2
Out:
0 2
1 4
2 6
The first usage example turned each one-letter string into a string of two equal letters, while the second is obvious. You should avoid using apply in the second case, because it is a simple mathematical operation that will be extremely slow compared to df.a * 2. Hence, my rule of thumb is: use apply when performing operations with non-numeric objects (case 1). NOTE: there is no actual need for a lambda in this simple case.
So what apply does is pass each element of the series to the function.
Now, if you apply on an entire dataframe, each value passed to the function will be a slice of the data as a Series. Hence, to properly apply your function you will need to map the inputs. For instance:
def add_2_to_a_multiply_b(b, a):
    return (a + 2) * b
df.apply(lambda row: add_2_to_a_multiply_b(*row), axis=1) # ERROR because the values are unpacked as (df.a, df.b) and you can't add integers and strings (see `add_2_to_a_multiply_b`)
df.apply(lambda row: add_2_to_a_multiply_b(row['b'], row['a']), axis=1)
Out:
0 aaa
1 ssss
2 ddddd
From this point on you can build more complex implementations, for instance using partial functions:
def add_to_a_multiply_b(b, a, *, val_to_add):
    return (a + val_to_add) * b

from functools import partial
specialized_func = partial(add_to_a_multiply_b, val_to_add=2)
df.apply(lambda row: specialized_func(row['b'], row['a']), axis=1)
Just to stress it again, avoid apply if you care about performance:
# 'OK-ISH', does the job... but
def strike_price_minus_mid_price(strike_price, mid_price):
    return strike_price - mid_price

new_data = df.apply(lambda r: strike_price_minus_mid_price(r["strike_price"], r["mid_price"]), axis=1)
vs
# 'BETTER'
new_data = df["strike_price"] - df["mid_price"]

Proper Formatting for Adjusting a Pandas DF in a Function

New programmer here. I have a pandas dataframe that I adjust based on certain if conditions. I use functions to adjust the values when certain if conditions are met. I use functions because, if a function is used in multiple spots, it's easier to adjust the function code once rather than making the same adjustment several times in different spots in the code. My question focuses on what is considered best practice when making these functions.
So I have included four sample functions below. The first sample function works, but I'm wondering if it's considered poor practice to structure it like that and whether I should instead use one of the other variations. Please let me know what is considered 'proper' and if you have any other input. As a quick side note, I will only ever be using one 'dataframe.' Otherwise I would have passed the dataframe as an input at the very least.
Thank you!
dataframe = pd.DataFrame  # Some dataframe

# version 1: simplest
def adjustdataframe():
    dataframe.iat[0,0] = ...  # Make some adjustment

# version 2: return dataframe
def adjustdataframe():
    dataframe.iat[0,0] = ...  # Make some adjustment
    return dataframe

# version 3: pass df as input but don't explicitly return df
def adjustdataframe(dataframe):
    dataframe.iat[0, 0] = ...  # Make some adjustment

# version 4: pass df as input and return df
def adjustdataframe(dataframe):
    dataframe.iat[0, 0] = ...  # Make some adjustment
    return dataframe
Generally, I think it wouldn't be proper to use versions 1 and 2 in your Python code, because normally* they would throw UnboundLocalError: local variable referenced before assignment. For example, try running this code:
def version_1():
    """ no parameters & no return statements """
    nums = [num**2 for num in nums]

def version_2():
    """ no parameters """
    nums = [num**2 for num in nums]
    return nums

nums = [2,3]
version_1()
version_2()
Versions 3 and 4 are good in this regard since they introduce parameters, but the third function wouldn't change anything: it would change the local variable within the function, but the adjustments never leave the local scope, so nothing changes globally.
def version_3(nums):
    """ no return """
    nums = [num**2 for num in nums]  # local variable

nums = [2,3]  # global variable
version_3(nums)
# would result in an error
assert version_3(nums) == [num**2 for num in nums]
Since version 4 has a return statement, the adjustments made within the local scope are handed back to the caller.
def version_4(nums):
    nums = [num**2 for num in nums]
    return nums

new_nums = version_4(nums)
assert new_nums == [num**2 for num in nums]
# but original `nums` was never changed
nums
So, I believe version_4 to be the best practice.
*normally, in terms of general Python functions; with pandas objects it's different: all four functions will end up changing a variable specifically called dataframe in place (which you usually wouldn't want):
def version_1():
    dataframe.iat[0,0] = 999

def version_2():
    dataframe.iat[0,0] = 999
    return dataframe

dataframe = pd.DataFrame({"values" : [1,2,3,4,5]})
version_1()
dataframe

dataframe = pd.DataFrame({"values" : [1,2,3,4,5]})
version_2()
dataframe
Both of these functions would throw a NameError if your variable is called differently; try running your first or second function without defining a dataframe object beforehand (using df as the variable name, for example):
# restart your kernel - `dataframe` object was never defined
df = pd.DataFrame({"values" : [1,2,3,4,5]})
version_1()
version_2()
With version_3 and version_4, you'd expect different results.
def version_3(dataframe):
    dataframe.iat[0, 0] = 999

def version_4(dataframe):
    dataframe.iat[0, 0] = 999
    return dataframe

df = pd.DataFrame({"values" : [1,2,3,4,5]})
version_3(df)
df

df = pd.DataFrame({"values" : [1,2,3,4,5]})
version_4(df)
df
But the results are the same: your original dataframe will be changed in place.
To avoid it, don't forget to make a copy of your dataframe:
def version_4_withcopy(dataframe):
    df = dataframe.copy()
    df.iat[0, 0] = 999
    return df

dataframe = pd.DataFrame({"values" : [1,2,3,4,5]})
new_dataframe = version_4_withcopy(dataframe)
dataframe
new_dataframe

How to define a variable amount of columns in python pandas apply

I am trying to add columns to a python pandas df using the apply function. However, the number of columns to be added depends on the output of the function used in the apply function.
example code:
number_of_columns_to_be_added = 2
def add_columns(number_of_columns_to_be_added):
    df['n1'], df['n2'] = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
Any idea on how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programmatically?
I'm guessing that the output of zip is a sequence of tuples (one per new column), therefore you could try this:
temp = zip(*df['input'].apply(lambda x : do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n' + str(i)
    df[key] = value
temp will hold all the entries, and then you iterate over temp to assign the values to your dataframe with your generated keys. Hope this matches your original idea.
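Another option (not from the answer above, just a common pattern) is to let apply expand the returned tuples into columns and name them afterwards. Here do_something is a hypothetical stand-in for the question's function:
import pandas as pd

def do_something(x, n):
    # hypothetical: return n derived values per input
    return tuple(x * (i + 1) for i in range(n))

number_of_columns_to_be_added = 2
df = pd.DataFrame({'input': [1, 2, 3]})

new_cols = (df['input']
            .apply(lambda x: do_something(x, number_of_columns_to_be_added))
            .apply(pd.Series))
new_cols.columns = ['n' + str(i) for i in range(1, len(new_cols.columns) + 1)]
df = df.join(new_cols)
print(df)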
