For example, I'd like to assert that two PySpark DataFrames contain the same data; however, just using == checks that they are the same object. Ideally I'd also like to be able to specify whether order matters or not.
I've tried writing a function that raises an AssertionError but that adds a lot of noise to the pytest output as it shows the traceback from that function.
The other thought I had was to mock the __eq__ method of the DataFrames but I'm not confident that's the right way to go.
Edit:
I considered just using a function that returns True or False instead of an operator; however, that doesn't seem to work with pytest_assertrepr_compare. I'm not familiar enough with how that hook works, so it's possible there is a way to use it with a function instead of an operator.
My current solution is to use a patch to override the DataFrame's __eq__ method. Here's an example with pandas, since it's faster to test with; the idea should apply to any object.
import pandas as pd

# use this import for python3
# from unittest.mock import patch
from mock import patch

def custom_df_compare(self, other):
    # Put the logic for comparing dfs here.
    # Returning True for demonstration purposes.
    return True

@patch("pandas.DataFrame.__eq__", custom_df_compare)
def test_df_equal():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )
    assert df1 == df2
I haven't tried it yet, but I'm planning on adding the patch as a fixture and using autouse so it is applied to all tests automatically.
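For reference, a minimal sketch of what such an autouse fixture could look like (untested; it assumes custom_df_compare from the snippet above, and the fixture name is arbitrary):

import pytest
from mock import patch  # use unittest.mock on Python 3

@pytest.fixture(autouse=True)
def patch_df_eq():
    # Apply the patch for the duration of every test that sees this fixture.
    with patch("pandas.DataFrame.__eq__", custom_df_compare):
        yield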
In order to elegantly handle the "order matters" indicator, I'm playing with an approach similar to pytest.approx, which wraps the value in a new class with its own __eq__, for example:
class SortedDF(object):
    """Indicates that the order of data matters when comparing to another df"""

    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # Put the logic for comparing dfs, including order of data, here.
        # Returning True for demonstration purposes.
        return True
def test_sorted_df():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )
    # Passes because SortedDF.__eq__ is used
    assert SortedDF(df1) == df2
    # Fails because df2's __eq__ method is used
    assert df2 == SortedDF(df2)
The minor issue I haven't been able to resolve is the failure of the second assert, assert df2 == SortedDF(df2). This order works fine with pytest.approx but not here. I've tried reading up on the == operator but haven't been able to figure out how to fix the second case.
To do a raw comparison between the values of the DataFrames (the row order must match exactly), you can do something like this:
import pandas as pd
from pyspark.sql import Row

# assumes an active SparkSession available as `spark`
df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])

pd.testing.assert_frame_equal(df1.toPandas(), df2.toPandas())
If row order shouldn't matter, you can sort both pandas DataFrames by one or more key columns first, using a function like the following:
def assert_frame_equal_with_sort(results, expected, keycolumns):
    results = results.reindex(sorted(results.columns), axis=1)
    expected = expected.reindex(sorted(expected.columns), axis=1)
    results_sorted = results.sort_values(by=keycolumns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=keycolumns).reset_index(drop=True)
    pd.testing.assert_frame_equal(results_sorted, expected_sorted)
df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=3, c=3), Row(a=1, b=2, c=3)])
assert_frame_equal_with_sort(df1.toPandas(), df2.toPandas(), ['b'])
Just use the pandas.DataFrame.equals method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html
For example
assert df1.equals(df2)
assert can be used with anything that returns a boolean, so yes, you can write any custom comparison function to compare two objects, as long as it returns a boolean. However, in this case there is no need for a custom function, as pandas already provides one.
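For illustration, a boolean-returning comparison that ignores row order could look like the sketch below (the function name, the "id" key column, and the expected_df/actual_df names are placeholders, not from the question):

def frames_equal_ignoring_order(left, right, key="id"):
    # Sort both frames by the key column and reset the index so that
    # equals() compares rows regardless of their original order.
    left_sorted = left.sort_values(key).reset_index(drop=True)
    right_sorted = right.sort_values(key).reset_index(drop=True)
    return left_sorted.equals(right_sorted)

assert frames_equal_ignoring_order(expected_df, actual_df)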
You can use one of pytest's hooks, particularly pytest_assertrepr_compare. In there you can define what you want to compare and how; the docs are pretty good and come with examples. Best of luck. :)
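Note that this hook customizes how a failed comparison is reported rather than whether it passes. A minimal conftest.py sketch for DataFrames (the shape-based message is just a placeholder):

# conftest.py
import pandas as pd

def pytest_assertrepr_compare(config, op, left, right):
    # Return a list of lines to use as the failure explanation for df == df asserts.
    if isinstance(left, pd.DataFrame) and isinstance(right, pd.DataFrame) and op == "==":
        return [
            "Comparing two pandas DataFrames:",
            f"   left shape: {left.shape}",
            f"   right shape: {right.shape}",
        ]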
I'm trying to apply simple functions to groups in pandas. I have this dataframe which I can group by type:
df = pandas.DataFrame({"id": ["a", "b", "c", "d"], "v": [1,2,3,4], "type": ["X", "Y", "Y", "Y"]}).set_index("id")
df.groupby("type").mean() # gets the mean per type
I want to apply a function like np.log2 only to the groups before taking the mean of each group. This does not work, since apply is element-wise and type (as well as, potentially, other columns in df in a real scenario) is not numeric:
# fails
df.apply(np.log2).groupby("type").mean()
Is there a way to apply np.log2 only to the groups prior to taking the mean? I thought transform would be the answer, but the problem is that it returns a dataframe that does not have the original type column:
df.groupby("type").transform(np.log2)
v
id
a 0.000000
b 1.000000
c 1.584963
d 2.000000
Variants like grouping and then applying do not work: df.groupby("type").apply(np.log2). What is the correct way to do this?
The problem is that np.log2 cannot deal with the non-numeric type column. Instead, you need to pass just your numeric column. You can do this as suggested in the comments, or define a lambda:
df.groupby('type').apply(lambda x: np.mean(np.log2(x['v'])))
As per comments, I would define a function:
df['w'] = [5, 6, 7, 8]

def foo(x):
    return x._get_numeric_data().apply(axis=0, func=np.log2).mean()

df.groupby('type').apply(foo)
#           v         w
# type
# X  0.000000  2.321928
# Y  1.528321  2.797439
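For reference, a variant of the same idea that avoids the private _get_numeric_data helper (reusing the df with the extra w column from above) might be:

# Take log2 of the numeric columns only, reattach "type", then group and average.
numeric_log = df.select_dtypes(include="number").apply(np.log2)
numeric_log.join(df["type"]).groupby("type").mean()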
I need a tqdm progress bar over a (possibly long) set of merge operations.
In my application, I have a cascade of operations like the following:
data = data.merge(get_data_source1(), on="id", how="left")\
.merge(get_data_source2(), on="id", how="left")\
...
.merge(get_data_sourceN(), on="id", how="left")
It is not relevant what the get_data_source<i> functions do; they pull data from somewhere (for instance, from different files or different DBs), each returns a DataFrame with an "id" column, and each call takes a few seconds.
I would need a single progress bar that advances with N. This is probably feasible by encapsulating each merge operation in a lambda and putting them into an iterable, but that looks like an over-engineered and hard-to-read solution (please correct me if you think I'm wrong).
Also, I'm aware that it is possible to add a progress bar to each merge operation using the progress_apply function (as reported here), but that would generate several (N) short progress bars rather than a single one.
For the sake of emulating a working setup, let's consider this toy example
import pandas as pd
import numpy as np
import time
data = pd.DataFrame(np.random.randint(0,100,size=(100,3)), columns=["id","A", "B"])
def get_data(col):
    time.sleep(1.0)
    return pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=["id", col])
data.merge(get_data("C"), on="id", how="left")\
.merge(get_data("D"), on="id", how="left")\
.merge(get_data("E"), on="id", how="left")\
.merge(get_data("F"), on="id", how="left")\
.merge(get_data("G"), on="id", how="left")\
.merge(get_data("H"), on="id", how="left")
What would be the best way to approach the problem?
I would suggest using functools.reduce.
Here's a snippet on some sample data frames, but it would work with any iterable of data frames; just wrap the iterable with tqdm.
import functools
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
N = 10
columns = [["A", "B"], ["C"], ["D", "E", "F"]]
dfs = [
    pd.DataFrame(
        {
            "key": range(N),
            **{c: np.random.rand(N) for c in cols},
        }
    )
    for cols in columns
]
functools.reduce(lambda x, y: x.merge(y), tqdm(dfs[1:]), dfs[0])
You can create a list of the values that you want to pass to the get_data function, and iterate over this list with tqdm.
import pandas as pd
import numpy as np
import time
import tqdm
data = pd.DataFrame(np.random.randint(0,100,size=(100,3)), columns=["id","A", "B"])
def get_data(col):
    time.sleep(1.0)
    return pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=["id", col])
values = ["C","D","E","F","G","H"]
for i in tqdm.tqdm(values):
    data = data.merge(get_data(i), on="id", how="left")

data
You need to assign the merged dataframe back to data at each step, as in the example above, since merge returns a new dataframe rather than modifying data in place.
EDIT:
As all the get_data functions are different, I suggest, as the question did, creating an iterable of the functions. It is not required to use lambdas, as the example below shows:
functions = [get_data1, get_data2, get_data3]

for func in functions:
    data = func(param1, param2, param3)
This will iterate over all the functions in the list and execute them with the given parameters.
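Putting the two ideas together for the original problem, a single progress bar over the cascade of merges could look roughly like this (the get_data_source functions and the "id" key are the ones from the question):

import tqdm

sources = [get_data_source1, get_data_source2, get_data_sourceN]
for get_source in tqdm.tqdm(sources):
    # One tick of the single progress bar per data source merged in.
    data = data.merge(get_source(), on="id", how="left")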
Let's say I have the following data
import pandas as pd
d = {'index': [0, 1, 2], 'a': [10, 8, 6], 'b': [4, 2, 6],}
data_frame = pd.DataFrame(data=d).set_index('index')
Now what I do is filter this data based on the values of column "b", like this:
new_df = data_frame[data_frame['b']!=4]
new_df1 = data_frame[data_frame['b']==4]
What I want to do, instead of the method above, is to write a function where I can also indicate which comparison operator it should use. Something like this:
def slice(df, column_name):
    df_new = df[df[column_name] != 4]
    return df_new

new_df2 = slice(df=data_frame, column_name='b')
The function above only does the != operation on the data. What I want is for both != and == to somehow be defined in the function, so that when I use it I can indicate which one to apply.
Let me know if my question needs more detailed clarification
You could add a boolean parameter to your function:
def slice(df, column_name, equality=True):
    if equality:
        df_new = df[df[column_name] == 4]
    else:
        df_new = df[df[column_name] != 4]
    return df_new

new_df2 = slice(df=data_frame, column_name='b', equality=True)
By the way, slice is a built-in Python function, so it is probably a good idea to rename yours to something else.
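If you want more than two comparisons later on, the standard operator module lets you pass the comparison itself as an argument. A rough sketch under that assumption (slice_df and the value parameter are illustrative names, not part of the question):

import operator

def slice_df(df, column_name, compare=operator.eq, value=4):
    # compare(df[column_name], value) builds the boolean mask, e.g. df['b'] == 4
    return df[compare(df[column_name], value)]

new_df3 = slice_df(data_frame, 'b', compare=operator.ne)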
I am currently working with a few DataFrames and want to make my code modular. That entails passing DataFrames to functions. I am aware of the mutable nature of DataFrames and some of the 'gotchas' when passing mutable instances to functions. Is there a best practice for passing DataFrames to functions? Should I make a copy within the function and then pass it back? Or should I just make changes to df within the function and return None?
Is option 1 or 2 better? Below is basic code to convey the idea:
Option 1:
def test(df):
    df['col1'] = df['col1'] + 1
    return None

test(df)
Option 2:
def test(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

main_df = test(main_df)
I think Option 1 is the bad way. Why? Because it is not a pure function (it has side effects on a mutable reference argument).
How to google for details: pure / deterministic / nondeterministic functions.
So I think the second way is better.
I use DataFrame.pipe a lot to organize my code, so I'm going to say Option 2. pipe takes and returns a DataFrame, and you can chain multiple steps together.
def step1(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

def step2(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

def step3(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

main_df = (main_df.pipe(step1)
                  .pipe(step2)
                  .pipe(step3)
)
Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. For a 2xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, which is already close to what I'm trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores its columns separately as Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying data and type conversions. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
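A small, self-contained illustration of that single-dtype point (the column names here are made up for the example):

import pandas as pd

mixed = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
print(mixed.dtypes)            # i: int64, f: float64
print(mixed.to_numpy().dtype)  # float64 -- the int column is upcast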
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The dataframe.values method shown above is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of method chaining, I have a solution: add a custom unpacking method to pandas. Note that this may not be a great idea for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    return (
        self[col]
        for col in self.columns
    )

pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or it can be used in method chaining.
Imagine a geo function that takes a latitude series as its first argument and a longitude series as its second, named do_something_geographical(lat, lon):
df_result = (
    df
    .(...some method chaining...)
    .assign(
        geographic_result=lambda dataframe: do_something_geographical(*dataframe[["lat", "lon"]].unpack())
    )
    .(...some method chaining...)
)