New programmer here. I have a pandas dataframe whose values I adjust when certain if conditions are met, and I do those adjustments inside functions: if a function is used in multiple spots, it's easier to adjust its code once than to make the same change in several places. My question is about what is considered best practice when writing these functions.
So I have included four sample functions below. The first one works, but I'm wondering whether it's considered poor practice to structure it like that and whether I should use one of the other variations instead. Please let me know what is considered 'proper', and share any other input you have. As a quick side note, I will only ever be using one dataframe; otherwise I would have passed the dataframe as an input at the very least.
Thank you!
dataframe = pd.DataFrame(...)  # some dataframe

# version 1: simplest
def adjustdataframe():
    dataframe.iat[0, 0] = ...  # make some adjustment

# version 2: return dataframe
def adjustdataframe():
    dataframe.iat[0, 0] = ...  # make some adjustment
    return dataframe

# version 3: pass df as input but don't explicitly return it
def adjustdataframe(dataframe):
    dataframe.iat[0, 0] = ...  # make some adjustment

# version 4: pass df as input and return it
def adjustdataframe(dataframe):
    dataframe.iat[0, 0] = ...  # make some adjustment
    return dataframe
Generally, I don't think it would be proper to use versions 1 and 2 in Python code, because normally* they would throw UnboundLocalError: local variable referenced before assignment. For example, try running this code:
def version_1():
    """ no parameters & no return statement """
    nums = [num**2 for num in nums]

def version_2():
    """ no parameters """
    nums = [num**2 for num in nums]
    return nums

nums = [2, 3]
version_1()  # UnboundLocalError: assigning to `nums` makes it local to the function
version_2()  # same UnboundLocalError
Versions 3 and 4 are good in this regard since they introduce parameters, but the third function wouldn't change anything: it only rebinds a local variable inside the function, and since the adjustment never leaves the local scope, nothing changes globally.
def version_3(nums):
    """ no return """
    nums = [num**2 for num in nums]  # rebinds the local name only

nums = [2, 3]  # global variable
version_3(nums)

# would fail with AssertionError: version_3 returns None
assert version_3(nums) == [num**2 for num in nums]
Since version 4 has a return statement, the adjustments made in the local scope can be captured by the caller:
def version_4(nums):
    nums = [num**2 for num in nums]
    return nums

new_nums = version_4(nums)
assert new_nums == [num**2 for num in nums]

# but the original `nums` was never changed
nums
So, I believe version_4 to be the best practice.
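Applied back to the original question, a version-4-style function could look like this (a minimal sketch, with the actual adjustment left as a placeholder):

import pandas as pd

def adjust_dataframe(df, value):
    df = df.copy()  # avoid mutating the caller's dataframe; see the note below
    df.iat[0, 0] = value  # placeholder for the real adjustment
    return df

df = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
df = adjust_dataframe(df, 999)  # or equivalently: df = df.pipe(adjust_dataframe, 999)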
*normally, that is, for general Python functions; with pandas objects it's different. Because .iat[...] = ... mutates the existing object rather than rebinding the name, all four functions end up changing a dataframe in place (which you usually don't want), and for versions 1 and 2 that dataframe must be a global variable specifically called dataframe:
def version_1():
    dataframe.iat[0, 0] = 999

def version_2():
    dataframe.iat[0, 0] = 999
    return dataframe

dataframe = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_1()
dataframe

dataframe = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_2()
dataframe
Both of the functions would throw a NameError if your variable is called differently; try running your first or second function without defining a dataframe object beforehand (use df as the variable name, for example):
# restart your kernel - `dataframe` object was never defined
df = pd.DataFrame({"values" : [1,2,3,4,5]})
version_1()
version_2()
With version_3 and version_4, you'd expect different results.
def version_3(dataframe):
    dataframe.iat[0, 0] = 999

def version_4(dataframe):
    dataframe.iat[0, 0] = 999
    return dataframe
df = pd.DataFrame({"values" : [1,2,3,4,5]})
version_3(df)
df
df = pd.DataFrame({"values" : [1,2,3,4,5]})
version_4(df)
df
But the results are the same: your original dataframe will be changed in place.
To avoid it, don't forget to make a copy of your dataframe:
def version_4_withcopy(dataframe):
    df = dataframe.copy()
    df.iat[0, 0] = 999
    return df
dataframe = pd.DataFrame({"values" : [1,2,3,4,5]})
new_dataframe = version_4_withcopy(dataframe)
dataframe
new_dataframe
Related
I have a dataframe with 8 columns, and I would like to turn the code below (I tested that it works on a single column) into a function to map/apply over all 8 columns.
all_adj_noun = []
for i in range(len(bigram_df)):
    if len([bigram_df['adj_noun'][i]]) >= 1:
        for j in range(len(bigram_df['adj_noun'][i])):
            all_adj_noun.append(bigram_df['adj_noun'][i][j])
However, when I tried to define a function, it returns an empty list even though the column is not empty.
def combine_bigrams(df_name, col_name):
    all_bigrams = []
    for i in range(len(df_name)):
        if len([df_name[col_name][i]]) >= 1:
            for j in range(len(df_name[col_name][i])):
                return all_bigrams.append(df_name[col_name][i][j])
I call the function by
combine_bigrams(bigram_df, 'adj_noun')
May I know if there is anything that I may be doing wrong here?
The problem is that you are returning the result of .append, which is None, and you return it on the very first pass through the inner loop, so the function exits immediately.
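For completeness, a minimal fix of the original function is to let both loops finish and return the list once at the end (the if len([...]) >= 1 guard is dropped because wrapping a value in a list always gives length 1, so it was always true):

def combine_bigrams(df_name, col_name):
    all_bigrams = []
    for i in range(len(df_name)):
        for j in range(len(df_name[col_name][i])):
            all_bigrams.append(df_name[col_name][i][j])
    return all_bigrams  # return once, after the loops complete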
However, there is a better (and faster) way to do this. To get a single list of all the values present in the column, you can leverage Series.agg:
col_name = 'adj_noun'
all_bigrams = bigram_df[col_name].agg(sum)
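If each cell holds a list, another common idiom, usually faster than sum-based concatenation, is itertools.chain (a sketch under that assumption):

from itertools import chain

all_bigrams = list(chain.from_iterable(bigram_df['adj_noun']))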
I am using two nested for loops to calculate a value from combinations of elements in a list of dataframes. The list consists of a large number of dataframes, and the two nested loops take a considerable amount of time.
Is there a way I can do the operation faster?
The functions I refer to with dummy names are the ones where I calculate the results.
My code looks like this:
conf_list = []

for tr in range(len(trajectories)):
    df_1 = trajectories[tr]
    if len(df_1) == 0:
        continue
    for tt in range(len(trajectories)):
        df_2 = trajectories[tt]
        if len(df_2) == 0:
            continue
        if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
            continue
        df_temp = cartesian_product_basic(df_1, df_2)
        flg, df_temp = another_function(df_temp)
        if flg == 0:
            continue
        flg_h = some_other_function(df_temp)
        if flg_h == 1:
            conf_list.append(1)
My input list consists of around 5000 dataframes (each having several hundred rows) that look like this:

id  x  y  z  time
1   5  7  2  5
and what I do is take the cartesian product of each combination of two dataframes, and for each couple I calculate another value 'c'. If this value c meets a condition, I append an element to my conf_list so that I can get the final count of couples meeting the requirement.
For further info:
cartesian_product_basic(df_1, df_2) is the function that gets the cartesian product of the two dataframes.
another_function looks like this (nwh appears to be an alias for np.where):

def another_function(df_temp):
    df_temp['z_dif'] = nwh((df_temp['time_x'] == df_temp['time_y']),
                           abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
    df_temp = df_temp.dropna()
    df_temp['vert_conf'] = nwh((df_temp['z_dif'] >= 1000),
                               np.nan, 1)
    df_temp = df_temp.dropna()
    if len(df_temp) == 0:
        flg = 0
    else:
        flg = 1
    return flg, df_temp
and some_other_function looks like this:

def some_other_function(df_temp):
    df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
    df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
    df_temp['hor_dif'] = hypot(df_temp['x_dif'], df_temp['y_dif'])
    df_temp['conf'] = np.where((df_temp['hor_dif'] <= 5),
                               1, np.nan)
    if df_temp['conf'].sum() > 0:
        flg_h = 1
    return flg_h
The following are ways to make your code run faster:
Instead of a for-loop, use a list comprehension (see the timing sketch after this list).
Use built-in functions like map, filter, sum, etc.; these are implemented in C and will make your code faster.
Avoid repeated '.' (attribute) lookups; bind the attribute to a local name instead, for example:
import datetime
a = datetime.datetime.now()  # don't use this: two attribute lookups on every call

from datetime import datetime
now = datetime.now  # bind the method once
a = now()  # use this
Use C/C++-based libraries like numpy for numerical operations.
Don't convert datatypes unnecessarily.
In infinite loops, while 1 used to beat while True on Python 2; in Python 3 they compile to the same bytecode, so use whichever reads better.
Use built-in libraries.
If the data will not change, convert it to a tuple.
Build strings with str.join rather than repeated concatenation.
Use multiple assignment (a, b = 1, 2).
Use generators.
When checking a value that is already Boolean, avoid the equality comparison and test it directly:
# Instead of the approach below
if a == 1:
    print('a is 1')
else:
    print('a is 0')

# try this approach
if a:
    print('a is 1')
else:
    print('a is 0')
# This helps a little, because the time spent comparing the two values is saved.
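As a rough illustration of the first tip, here is a minimal timing sketch using timeit (absolute numbers will vary by machine):

import timeit

setup = "nums = list(range(10_000))"
loop_version = """
squares = []
for n in nums:
    squares.append(n * n)
"""
comprehension_version = "squares = [n * n for n in nums]"

print(timeit.timeit(loop_version, setup=setup, number=1000))           # plain loop
print(timeit.timeit(comprehension_version, setup=setup, number=1000))  # usually noticeably faster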
Useful references:
Speeding up Python Code: Fast Filtering and Slow Loops
Speed Up Python Code
I would like to optimize my code using vectorization.
Below is a simplified version of the code:
import pandas as pd
import numpy as np

np.random.seed(34)

my_var = 0

def func(num):
    global my_var
    fall = my_var - my_var * 0.3
    if num < fall:
        my_var = num
        return 'MARK'
    elif num > my_var:
        my_var = num
        return 'BLANK'
    else:
        return 'NO CHANGE'

data = {
    'COL': np.random.randint(0, 10, size=10)
}
df = pd.DataFrame(data)

results = df.apply(lambda row: func(row['COL']),
                   axis=1)
df['RES'] = results

print(df)
The code above keeps track of the highest number fed into it from the dataframe; if the next number passed in is lower than that by a set percentage, the function returns the string 'MARK', and the number passed in becomes the new tracked value. For that, I used a global variable to keep track of the current highest number.
I would like to find a way to implement this in a vectorized manner to speed up the execution time of the code as the actual data set I need to process is very large.
I have considered using the .where() or select() functions to somehow create a vectorized version of the code, but I have no idea how I could go about achieving this. Is there a way to reference previous values in a dataframe column, or perhaps use variables to store data while vectorizing?
Where possible please provide a sample code to demonstrate your suggestion.
Many thanks.
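A note on why this is hard to vectorize: each step depends on the my_var left behind by the previous step, a sequential dependency that .where()/select() alone cannot express. One commonly suggested workaround, shown here as a minimal sketch assuming numba is installed, is to compile the loop rather than vectorize it:

import numpy as np
from numba import njit  # assumption: numba is available

@njit
def classify(values, drop=0.3):
    # codes: 0 = 'NO CHANGE', 1 = 'BLANK', 2 = 'MARK'
    out = np.empty(len(values), dtype=np.int8)
    my_var = 0.0
    for i in range(len(values)):
        num = values[i]
        fall = my_var - my_var * drop
        if num < fall:
            my_var = num
            out[i] = 2
        elif num > my_var:
            my_var = num
            out[i] = 1
        else:
            out[i] = 0
    return out

codes = classify(df['COL'].to_numpy(np.float64))
df['RES'] = np.array(['NO CHANGE', 'BLANK', 'MARK'])[codes]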
This question already has answers here:
Creating functions (or lambdas) in a loop (or comprehension)
(6 answers)
Closed 6 months ago.
I'd like to create a dictionary that contains lambda functions for conveniently filtering pandas data frames. When I instantiate each dictionary item line by line, I get the behaviour I want. But when I use a for loop, the filters all use the last value of n. Does the lambda function reference the global variable n, and not its value at the time of instantiation? Is my understanding of lambda functions off?
Note, this example is watered down. In my actual project, I use a DateTime index, and the dictionary will have integer keys that filter by year, e.g. df.index.year == 2020, and some string keys that filter by week/weekend, time of day, etc.
import pandas as pd

data = [[1, 2], [3, 4], [5, 6]]  # example df
df = pd.DataFrame(index=range(len(data)), data=data)

filts = {}
filts[1] = lambda df: df[df.index == 1]  # making a filter dictionary
filts[2] = lambda df: df[df.index == 2]  # of lambda funcs

print(filts[1](df))  # works as expected
print(filts[2](df))
filts = {}
for n in range(len(data)):
    filts[n] = lambda df: df[df.index == n]  # also tried wrapping n in int
    # n = 0  # changes behaviour

print(filts[0](df))  # prints the results for n = 2
print(filts[1](df))  # same problem as above
# further investigating lambdas
filts = {}
n = 0
filts[n] = lambda df: df[df.index == n]  # making a filter dictionary
n = 1
filts[n] = lambda df: df[df.index == n]  # of lambda funcs

print(filts[0](df))  # prints the results for n = 1
I'm not sure about the duplicate-ness of the question, but I am answering it because I've run into this with pandas myself. You can solve your problem using closures.
Change your loop as follows:
for n in range(len(data)):
    filts[n] = (lambda n: lambda df: df[df.index == n])(n)
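An equivalent and arguably more readable fix is to bind n as a default argument, so it is evaluated once at definition time rather than looked up at call time:

for n in range(len(data)):
    filts[n] = lambda df, n=n: df[df.index == n]  # the default value captures n now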
What's wrong with OP's approach?
Lambdas maintain a reference to the variable, not a copy of its value. So n here is a reference to the variable being iterated over in the loop, and by the time you evaluate your lambdas, the n inside every lambda in your filts resolves to the final value assigned to n by the loop. Hence, what you're seeing is expected. The takeaway: "The lambda's closure holds a reference to the variable being used, not its value, so if the value of the variable later changes, the value in the closure also changes." (source)
I am creating a function. One input of this function will be a pandas dataframe, and one of its tasks is to do some operation with two variables (columns) of this dataframe. These two variables are not fixed, and I want the freedom to choose them through parameters of the function fun.
For example, suppose that at some moment the variables I want to use are 'var1' and 'var2' (at another time, I may want to use two other variables). Suppose these variables take the values 1, 2, 3, 4 and I want to reduce df by keeping the rows where var1 == 1 and var2 == 1. My function is like this:
def fun(df, var=['input_var1', 'input_var2'], val):
    df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
    # Other operations
    df = df.loc[(df.aux_var1 == val) & (df.aux_var2 == val)]
    # end of operations
    # recover
    df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
    return df
When I use the function fun, I get this error:
fun(df, var=['var1', 'var2'], val=1)
IndexError: list index out of range
Actually, I want to do other, more complex operations that I haven't described so as not to lengthen the question. Perhaps the simple example above has a solution that doesn't require renaming the variables, but maybe that solution wouldn't work with the operations I really want to do. So first, I would like to fix the error in the renaming. If you want to give a more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if, besides the elegant solution, you also show me the fix for the renaming.
Python lists are zero-indexed, i.e. the first element has index 0.
Just change the lines:
df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
to
df = df.rename(columns={var[0]: 'aux_var1', var[1]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[0], 'aux_var2': var[1]})
respectively.
In this case you are accessing var[2], but a 2-element list in Python only has indices 0 and 1. Index 2 does not exist, and accessing it is therefore out of range.
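A quick demonstration:

var = ['var1', 'var2']
var[0]  # 'var1'
var[1]  # 'var2'
var[2]  # IndexError: list index out of range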
As mentioned in the other answers, the error you are receiving is due to the 0-indexing of Python lists: to access the first element of the list var, you take index 0 instead of index 1, i.e. var[0].
On the topic of renaming, however, you can filter a pandas dataframe without renaming any columns. You are accessing the columns as attributes of the dataframe, but you can achieve the same thing via the __getitem__ method, more commonly used with square brackets, e.g. df[var[0]].
If you wish to have more generality over your function without any renaming happening, I can suggest this:
from functools import reduce

def fun(df, var, val):
    _sub = reduce(
        lambda x, y: x & (df[y] == val),
        var,
        pd.Series(True, index=df.index)  # aligned with df's index
    )
    return df[_sub]
This will work with any number of input column variables. Hope this will serve as an inspiration to your more complicated operations you intend to do.
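For what it's worth, a shorter pandas-native equivalent (a sketch under the same assumptions) filters on all the listed columns at once:

def fun(df, var, val):
    # keep rows where every column named in `var` equals `val`
    return df[(df[var] == val).all(axis=1)]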