Best practice for passing Pandas DataFrame to functions - python

I am currently working with a few DataFrames and want to make my code modular, which entails passing DataFrames to functions. I am aware of the mutable nature of DataFrames and some of the 'gotchas' when passing mutable instances to functions. Is there a best practice for passing DataFrames to functions? Should I make a copy within the function and then pass it back? Or should I just make changes to df within the function and return None?
Is option 1 or 2 better? Below is basic code to convey the idea:
Option 1:
def test(df):
    df['col1'] = df['col1'] + 1
    return None

test(df)
Option 2:
def test(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

main_df = test(main_df)

I think Option 1 is the bad way. Why? Because it is not a pure function (it has side effects on a mutable reference argument).
To find the details, search for: pure / deterministic / nondeterministic functions.
So I think the second way is better.
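To see the difference concretely, here is a minimal sketch (function names and data made up for illustration) of the side effect that makes Option 1 impure:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3]})

def impure(df):
    df['col1'] = df['col1'] + 1  # mutates the caller's DataFrame in place

def pure(df):
    out = df.copy()
    out['col1'] = out['col1'] + 1
    return out

impure(df)
print(df['col1'].tolist())   # [2, 3, 4] -- the original changed

df2 = pure(df)
print(df['col1'].tolist())   # still [2, 3, 4] -- untouched this time
print(df2['col1'].tolist())  # [3, 4, 5]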

I use DataFrame.pipe a lot to organize my code, so I'm going to say option 2. pipe takes and returns a DataFrame, and you can chain multiple steps together.
def step1(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

def step2(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

def step3(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

main_df = (main_df.pipe(step1)
                  .pipe(step2)
                  .pipe(step3)
          )
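As a side note, pipe also forwards extra positional and keyword arguments to the piped function, which keeps parameterized steps chainable. A small sketch, with a hypothetical add_to_col step:

def add_to_col(main_df, col, amount=1):
    df = main_df.copy()
    df[col] = df[col] + amount
    return df

main_df = main_df.pipe(add_to_col, 'col1', amount=2)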


Pandas apply with swifter on axis 1 doesn't return

I'm trying to apply the following code (minimal example) to my DataFrame of 2 million rows, but for some reason .apply passes more than one row to the function and breaks my code. I'm not sure what changed, but the code did run before.
def function(row):
    return [row[clm1], row[clm2]]

res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function, axis=1)
Does anyone have an idea, or has anyone seen a similar issue?
Important: without swifter everything works fine, but it is too slow due to the number of rows.
This should work:

def function(row_different_name):
    return [row_different_name[clm1], row_different_name[clm2]]

res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function, axis=1)

Try changing the name of the function parameter from row to some other name.
Based on this previous answer, what you are trying to do should work if you change it like this:

def function(row):
    return [row[clm1], row[clm2]]

res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function, axis=1, result_type='expand')

This is because apply on a Series (a single column) lacks result_type as an argument, while apply on a DataFrame has it.
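For reference, here is a minimal, self-contained sketch (with made-up column names and data) of what result_type='expand' does in plain pandas; swifter's apply mirrors this signature:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

def function(row):
    return [row["a"], row["b"]]

# result_type='expand' turns each returned list into its own set of columns
expanded = df.apply(function, axis=1, result_type='expand')
expanded.columns = ["clm1", "clm2"]
print(expanded)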
axis=1 means the function receives each row, not each column. Is that what you want? Try removing axis=1.

Proper Formatting for Adjusting a Pandas DF in a Function

New programmer here. I have a pandas dataframe that I adjust based on certain if conditions. I use functions to adjust the values when those conditions are met. I use functions because if a function is used in multiple spots, it's easier to adjust the function code once rather than making the same adjustment several times in different spots in the code. My question focuses on what is considered best practice when making these functions.
So I have included four sample functions below. The first sample function works, but I'm wondering if it's considered poor practice to structure it like that, and whether I should instead use one of the other variations. Please let me know what is considered 'proper' and if you have any other input. As a quick side note, I will only ever be using one dataframe; otherwise I would have passed the dataframe as an input at the very least.
Thank you!
dataframe = pd.DataFrame()  # some dataframe (placeholder)

# version 1: simplest
def adjustdataframe():
    dataframe.iat[0, 0] = ...  # make some adjustment

# version 2: return dataframe
def adjustdataframe():
    dataframe.iat[0, 0] = ...  # make some adjustment
    return dataframe

# version 3: pass df as input but don't explicitly return df
def adjustdataframe(dataframe):
    dataframe.iat[0, 0] = ...  # make some adjustment

# version 4: pass df as input and return df
def adjustdataframe(dataframe):
    dataframe.iat[0, 0] = ...  # make some adjustment
    return dataframe
Generally, I think it wouldn't be proper to use version 1 or version 2 in your Python code, because normally* they would throw UnboundLocalError: local variable referenced before assignment. For example, try running this code:
def version_1():
    """ no parameters & no return statements """
    nums = [num**2 for num in nums]

def version_2():
    """ no parameters """
    nums = [num**2 for num in nums]
    return nums

nums = [2, 3]
version_1()
version_2()
Versions 3 and 4 are good in this regard since they introduce parameters, but the third function wouldn't change anything: it would change your local variable within the function, but the adjustments wouldn't take place globally since they never leave the local scope.
def version_3(nums):
    """ no return """
    nums = [num**2 for num in nums]  # local variable

nums = [2, 3]  # global variable
version_3(nums)

# would result in an AssertionError, since version_3 returns None
assert version_3(nums) == [num**2 for num in nums]
Since version 4 has a return statement, the adjustments made within a local scope would take place.
def version_4(nums):
    nums = [num**2 for num in nums]
    return nums

new_nums = version_4(nums)
assert new_nums == [num**2 for num in nums]

# but the original `nums` was never changed
nums
So, I believe version_4 to be the best practice.
*normally: in terms of general Python functions. With pandas objects it's different: all four functions will result in a variable specifically called dataframe being changed in place (which you usually wouldn't want to do):
def version_1():
    dataframe.iat[0, 0] = 999

def version_2():
    dataframe.iat[0, 0] = 999
    return dataframe

dataframe = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_1()
dataframe

dataframe = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_2()
dataframe
Both of these functions would throw a NameError if your variable is called something different; try running your first or second function without defining a dataframe object beforehand (use df as the variable name, for example):
# restart your kernel - the `dataframe` object was never defined
df = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_1()
version_2()
With version_3 and version_4, you'd expect different results.
def version_3(dataframe):
    dataframe.iat[0, 0] = 999

def version_4(dataframe):
    dataframe.iat[0, 0] = 999
    return dataframe

df = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_3(df)
df

df = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
version_4(df)
df
But the results are the same: your original dataframe will be changed in place.
To avoid it, don't forget to make a copy of your dataframe:
def version_4_withcopy(dataframe):
    df = dataframe.copy()
    df.iat[0, 0] = 999
    return df

dataframe = pd.DataFrame({"values": [1, 2, 3, 4, 5]})
new_dataframe = version_4_withcopy(dataframe)
dataframe
new_dataframe

How to unpack the columns of a pandas DataFrame to multiple variables

Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. For a 2xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a, b, c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, which is already close to what I'm trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores its columns as separate Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying data and converting types. Also, the resulting array has a single dtype that must accommodate all the data in the data frame.
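A quick sketch with made-up mixed-dtype data illustrates the point:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1.5, 2.5], "C": ["x", "y"]})
print(df.dtypes)            # int64, float64, object: one dtype per column
print(df.to_numpy().dtype)  # object: a single dtype for everything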
In summary, the approach above loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate over the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data will likely trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The dataframe.values method shown above is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of chained methods, here is a solution: add a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    # returns a generator yielding one Series per column
    return (
        self[col]
        for col in self.columns
    )

pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or, it can be used in method chaining.
Imagine a geo function that takes a latitude Series as its first argument and a longitude Series as its second, named do_something_geographical(lat, lon):
df_result = (
    df
    # ...some method chaining...
    .assign(
        # the * unpacks the generator into the lat and lon arguments
        geographic_result=lambda dataframe: do_something_geographical(
            *dataframe[["lat", "lon"]].unpack()
        )
    )
    # ...some method chaining...
)

Better way to structure a series of df manipulations in your class

How do you better structure the code in your class so that the class returns the df you want, without a main method that calls a lot of other methods in sequential order? I find that I arrive at this structure in a lot of situations, and it seems bad. I have a df that I just keep overwriting with the result of other base functions (which I unit test) until I get what I want.
class A:
    def main(self):
        df = self.load_file_into_df()
        df = self.add_x_columns(df)
        df = self.calculate_y(df)
        df = self.calculate_consequence(df)
        ...
        return df

    def add_x_columns(self, df): ...
    def calculate_y(self, df): ...
    def calculate_consequence(self, df): ...

# now use it somewhere else
df = A().main()
pipe
One feature you may wish to utilize is pd.DataFrame.pipe. This is considered "pandorable" because it facilitates operator chaining.
In my opinion, you should separate reading data into a dataframe from manipulating the dataframe. For example:
class A:
    def main(self):
        df = self.load_file_into_df()
        df = df.pipe(self.add_x_columns)\
               .pipe(self.calculate_y)\
               .pipe(self.calculate_consequence)
        return df
compose
Function composition is not native to Python, but the 3rd party toolz library does offer this feature. This allows you to lazily define chained functions. Note the reversed order of operations, i.e. the last argument of compose is performed first.
from toolz import compose

class A:
    def main(self):
        df = self.load_file_into_df()
        transformer = compose(self.calculate_consequence,
                              self.calculate_y,
                              self.add_x_columns)
        df = df.pipe(transformer)
        return df
In my opinion, compose offers a flexible and adaptable solution. You can, for example, define any number of compositions and apply them selectively or repeatedly at various points in your workflow.
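For example, here is a minimal sketch; the toy step bodies are invented purely for illustration, written as plain functions rather than methods:

import pandas as pd
from toolz import compose

def add_x_columns(df):
    return df.assign(x=1)

def calculate_y(df):
    return df.assign(y=df["x"] + 1)

def calculate_consequence(df):
    return df.assign(consequence=df["y"] * 2)

# reusable compositions; remember compose applies right-to-left
basic_prep = compose(calculate_y, add_x_columns)
full_prep = compose(calculate_consequence, basic_prep)

df = pd.DataFrame({"a": [1, 2]})
df_basic = df.pipe(basic_prep)  # just the first two steps
df_full = df.pipe(full_prep)    # all three steps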

Using a dataframe to construct another in a for loop

As the title says, I've been trying to build a Pandas DataFrame from another df using a for loop, calculating new columns with the last one built.
So far, I've tried :
df = pd.DataFrame(np.arange(10))
df.columns = [10]
df1 = pd.DataFrame(np.arange(10))
df1.columns = [10]
steps = np.linspace(10,1,10,dtype = int)
This works:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a - 1)
But when I try building df and df1 at the same time, like so:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a - df1[i])
    df1[i-1] = df1[i].apply(lambda a: a - 1)
It returns a lot of gibberish plus the line:

ValueError: Wrong number of items passed 10, placement implies 1
In this example, I am well aware that I could build df1 first and df after. But it returns the same error if I try:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a - df1[i])
    df1[i-1] = df1[i].apply(lambda a: a - df[i])
Which is what I really need in the end.
Any help is much appreciated,
Alex
apply applies the function you pass to it. On a DataFrame you can specify an axis: 0 (applying the function to each column, the default) or 1 (applying the function to each row). On a single column such as df[i], which is a Series, apply instead calls the function once per element. In your first example:

for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a - 1)

your for loop walks over the columns, and for each column .apply subtracts 1 from every element, so the whole column is decremented. It is exactly the same as the following:

for i in steps:
    print(i)
    df[i - 1] = df[i] - 1
Another way to see what .apply does is the following. Assume I have this dataframe:

df = pd.DataFrame(np.random.rand(10, 4))

Then df.sum() and df.apply(lambda a: np.sum(a)) yield exactly the same result; with DataFrame.apply along the default axis, a really is an entire column. It is just a simple example, but you can do more powerful calculations if needed.
Note that .apply is not the fastest method, so try to avoid it if you can.
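For instance, a vectorized operation works on the whole column at once instead of calling a Python function once per element; a minimal sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000_000)})

# vectorized: one operation on the whole column (fast)
df["y"] = df["x"] - 1

# apply: a Python-level function call per element (much slower)
df["y_slow"] = df["x"].apply(lambda a: a - 1)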
An example where apply would be useful is if you have a function some_fct() defined that takes int or float as arguments and you would like to apply it to the elements of a dataframe column.
import pandas as pd
import numpy as np
import math

def some_fct(x):
    return math.sin(x) / x

np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 2))
Obviously, some_fct(df[0]) would not work, as the function takes an int or float as its argument and df[0] is a Series. However, using the apply method, you can apply the function to each of the elements of df[0], which are themselves floats.
df[0].apply(lambda x: some_fct(x))
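As a small aside, the lambda isn't needed here; passing the function directly behaves the same:

df[0].apply(some_fct)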
Found it, I just need to drop the .apply! Example:
df = pd.DataFrame(np.arange(10))
df.columns = [10]
df1 = pd.DataFrame(np.arange(10))
df1.columns = [10]
steps = np.linspace(10, 1, 10, dtype=int)

for i in steps:
    print(i)
    df[i-1] = df[i] - df1[i]
    df1[i-1] = df1[i] + df[i]

It does exactly what it should!
I don't have enough knowledge about Python to explain why pd.DataFrame().apply() will not work with data from outside itself.
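My guess at the mechanism, as a minimal self-contained sketch: inside Series.apply the lambda receives one scalar element at a time, so subtracting a whole Series from it produces a Series per element, and apply expands those into a DataFrame that cannot be placed into a single column (hence "Wrong number of items passed"):

import pandas as pd

s = pd.Series([1, 2, 3])
other = pd.Series([10, 20, 30])

# `a` is a single scalar here, so `a - other` is a whole Series for
# every element; Series.apply then expands the results into a DataFrame
out = s.apply(lambda a: a - other)
print(type(out))  # <class 'pandas.core.frame.DataFrame'>, not a Series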
