pytest assert for pyspark dataframe comparison - python

I have 2 PySpark dataframes, expected_df and actual_df, as shown in the attached file.
In my unit test I am trying to check whether both are equal.
My code for this is:
expected = map(lambda row: row.asDict(), expected_df.collect())
actual = map(lambda row: row.asDict(), actual_df.collect())
assert expected = actual
Both dataframes contain the same rows, but the row order differs, so the assert fails.
What is the best way to compare such dataframes?

You can try pyspark-test
https://pypi.org/project/pyspark-test/
It is inspired by the pandas testing module and built for PySpark.
Usage is simple:
from pyspark_test import assert_pyspark_df_equal
assert_pyspark_df_equal(df_1, df_2)
Apart from just comparing the dataframes, it also accepts many optional parameters, just like the pandas testing module; you can check them in the documentation.
Note:
The datatypes in pandas and PySpark are a bit different, which is why converting with .toPandas and using the pandas testing module directly might not be the right approach.
This package is for unit/integration testing, so it is meant to be used with small dataframes.
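For example, the package documentation describes keyword arguments for relaxing the comparison. The snippet below is only a sketch based on that documentation; the parameter names (check_dtype, order_by) and the 'id' column are assumptions, so verify them against the version you install.
from pyspark_test import assert_pyspark_df_equal

# Sketch only: parameter names assumed from the pyspark-test documentation and
# may differ between versions of the package.
assert_pyspark_df_equal(df_1, df_2, check_dtype=False, order_by=['id'])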

This is done in some of the PySpark documentation:
assert sorted(expected_df.collect()) == sorted(actual_df.collect())
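If the rows contain None values, plain sorted() can raise a TypeError in Python 3 because None is not comparable with other types. A small variant that sorts by an explicit key column avoids that; the "id" column here is a hypothetical unique key, not something from the original question.
def sorted_rows(df, key_col):
    # Sort collected Rows by a single key column, pushing None keys to the end.
    return sorted(df.collect(), key=lambda row: (row[key_col] is None, row[key_col]))

assert sorted_rows(expected_df, "id") == sorted_rows(actual_df, "id")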

We solved this by hashing each row with Spark's hash function and then summing the resultant column.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def hash_df(df):
    """Hashes a DataFrame for comparison.

    Arguments:
        df (DataFrame): A dataframe to generate a hash from

    Returns:
        int: Summed value of hashed rows of an input DataFrame
    """
    # Hash every row into a new hash column
    df = df.withColumn('hash_value', F.hash(*sorted(df.columns))).select('hash_value')
    # Sum the hashes, see https://shortest.link/28YE
    value = df.agg(F.sum('hash_value')).collect()[0][0]
    return value

expected_hash = hash_df(expected_df)
actual_hash = hash_df(actual_df)
assert expected_hash == actual_hash
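One possible refinement, not from the original answer: because a sum of hashes could in principle collide for two different dataframes, also comparing the row counts makes the check a little stricter (a sketch, reusing F from the block above).
def df_signature(df):
    # Combine the row count with the summed hash so that two dataframes only
    # compare equal if both quantities match.
    hashed = df.withColumn('hash_value', F.hash(*sorted(df.columns))).select('hash_value')
    return (df.count(), hashed.agg(F.sum('hash_value')).collect()[0][0])

assert df_signature(expected_df) == df_signature(actual_df)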

Unfortunately this cannot be done without sorting on one of the columns (ideally the key column), because there is no guarantee of the ordering of records in a DataFrame; you cannot predict the order in which the records will appear. The approach below works fine for me:
expected = expected_df.orderBy('period_start_time').collect()
actual = actual_df.orderBy('period_start_time').collect()
assert expected == actual
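If no single column such as period_start_time makes the order deterministic, sorting by every column before collecting is a possible fallback (a minimal sketch):
# Sort by all columns so the collected row order is deterministic on both sides.
sort_cols = expected_df.columns
expected = expected_df.orderBy(*sort_cols).collect()
actual = actual_df.orderBy(*sort_cols).collect()
assert expected == actual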

If the overhead of an additional library such as pyspark_test is a problem, you could try sorting both dataframes by the same columns, converting them to pandas, and using pd.testing.assert_frame_equal.
I know that the .toPandas method for pyspark dataframes is generally discouraged because the data is loaded into the driver's memory (see the pyspark documentation here), but this solution works for relatively small unit tests.
For example:
import pandas as pd

sort_cols = actual_df.columns
pd.testing.assert_frame_equal(
    actual_df.sort(sort_cols).toPandas(),
    expected_df.sort(sort_cols).toPandas()
)

Try using "==" instead of "=":
assert expected == actual

I have two DataFrames with the same row order.
To compare the two I use:
def test_df(df1, df2):
    assert df1.values.tolist() == df2.values.tolist()

Another way to go about this while ensuring sort order would be:
from pandas.testing import assert_frame_equal

def assert_frame_with_sort(results, expected, key_columns):
    results_sorted = results.sort_values(by=key_columns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=key_columns).reset_index(drop=True)
    assert_frame_equal(results_sorted, expected_sorted)
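For the PySpark dataframes from the original question, this helper could be used after a .toPandas() conversion, for example (the key column name is just the one used in an earlier answer and may differ in your data):
assert_frame_with_sort(
    actual_df.toPandas(),
    expected_df.toPandas(),
    key_columns=["period_start_time"],  # assumed unique key column
)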

Related

How to drop row in pandas if column1 = certain value and column 2 = NaN?

I'm trying to do the following: "# drop all rows where tag == train_loop and start is NaN".
Here's my current attempt (thanks Copilot):
# drop all rows where tag == train_loop and start is NaN
# apply filter function to each row
# return True if row should be dropped
def filter_fn(row):
    return row["tag"] == "train_loop" and pd.isna(row["start"])

old_len = len(df)
df = df[~df.apply(filter_fn, axis=1)]
It works well, but I'm wondering if there is a less verbose way.
Using apply is actually a really bad way to do this, since it loops over every row, calling the function you defined in Python. Instead, use vectorized operations that you can call on the entire dataframe, which run optimized code written in C under the hood:
df = df[~((df["tag"] == "train_loop") & df["start"].isnull())]
If your data is large (>~100k rows), then even faster would be to use pandas query methods, where you can write both conditions in one:
df = df.query(
    '~((tag == "train_loop") and (start != start))'
)
This makes use of the fact that NaNs never equal anything, including themselves, so we can use simple logical operators to find NaNs (.isnull() isn't available in the compiled query mini-language). For the query method to be faster, you need to have numexpr installed, which will compile your queries on the fly before they're called on the data.
See the docs on enhancing performance for more info and examples.
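As a small self-contained check (made-up data), the query expression and the boolean-mask version select exactly the same rows:
import numpy as np
import pandas as pd

# Tiny illustrative frame: only the first row has tag == "train_loop" AND a NaN start.
demo = pd.DataFrame({
    "tag": ["train_loop", "train_loop", "eval"],
    "start": [np.nan, 1.0, np.nan],
})
kept_query = demo.query('~((tag == "train_loop") and (start != start))')
kept_mask = demo[~((demo["tag"] == "train_loop") & demo["start"].isnull())]
pd.testing.assert_frame_equal(kept_query, kept_mask)  # both drop only the first row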
You can do
df = df.loc[~(df['tag'].eq('train_loop') & df['start'].isna())]

How to iterate over very big dataframes in python?

I have some code, and my dataframe contains almost 800k rows, so it is impractical to iterate over it using standard methods. I searched a little and found the iterrows() method, but I couldn't understand how to use it. Basically, this is my code; can you help me update it to use iterrows()?
for i in range(len(x["Value"])):
    if x.loc[i, "PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        x.loc[i, "Santral_Type"] = "HES"
    elif x.loc[i, "PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
        x.loc[i, "Santral_Type"] = "TERMIK"
    elif x.loc[i, "PP_Name"] in ['BRS','ÇKL','DPZ']:
        x.loc[i, "Santral_Type"] = "RES"
    else:
        x.loc[i, "Santral_Type"] = "SOLAR"
How to iterate over very big dataframes: in general, you don't. You should apply some sort of vectorized operation to the column as a whole. For example, your case can be handled with map and fillna:
map_dict = {
    'HES': ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
    'TERMIK': ['BND','BND2','TFB','TFB3','TFB4','KNT'],
    'RES': ['BRS','ÇKL','DPZ']
}
inv_map_dict = {x: k for k, v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
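A quick sanity check on a tiny made-up frame, reusing inv_map_dict from above (the name 'XYZ' is just an example of a plant that is not in any of the lists):
import pandas as pd

demo = pd.DataFrame({"PP_Name": ["ARK", "BND", "XYZ"]})
# Names not found in inv_map_dict map to NaN and are then filled with 'SOLAR'.
demo["Santral_Type"] = demo["PP_Name"].map(inv_map_dict).fillna("SOLAR")
print(demo["Santral_Type"].tolist())  # ['HES', 'TERMIK', 'SOLAR']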
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k rows cannot be considered a large table when using standard pandas methods.
I would advise strongly against using iterrows and for loops when you have vectorised solutions available that take advantage of the pandas API.
This is your code adapted with NumPy, which should run much faster than your current method:
import numpy as np

col = 'PP_Name'
conditions = [
    x[col].isin(['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']),
    x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
    x[col].isin(["BRS", "ÇKL", "DPZ"]),
]
outcomes = ["HES", "TERMIK", "RES"]
x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
According to the documentation, df.iterrows() yields a tuple (index, Series) for each row.
You can use it like this:
for row in df.iterrows():
    if row[1]['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        df.loc[row[0], 'Santral_Type'] = "HES"
    # and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
Better to do it as #mcsoini suggested
The simplest method could be .values, for example:
def f(x0, ..., xn):
    return 'hello or some complicated operation'

df['newColumn'] = [f(r[0], r[1], ..., r[n]) for r in df.values]
The drawbacks of this method, as far as I know, are that you cannot refer to the column values by name, only by position, and that there is no information about the index of the df.
The advantage is that it is faster than the iterrows, itertuples, and apply methods.
Hope it helps.
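To make this concrete with the columns from the original question (the combining function here is purely illustrative, not part of the original answer):
def combine(pp_name, value):
    # Hypothetical per-row computation; positions follow the column order selected below.
    return f"{pp_name}:{value}"

x["newColumn"] = [combine(r[0], r[1]) for r in x[["PP_Name", "Value"]].values]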

What is the correct way to sum different dataframe columns in a list in PySpark?

I want to sum different columns in a spark dataframe.
Code
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Why are approaches #2 and #3 not working?
I am on Spark 2.2
Because,
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using Python's built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the PySpark sum function, which takes a column as input, but you are trying to use it at row level.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a dataframe, and you are trying to sum over that dataframe. In this case, I think, you would have to iterate row-wise and apply the sum over it.
TL;DR builtins.sum is just fine.
Following your comments:
Using native python sum() is not benefitting from spark optimization. so whats the spark way of doing it
and
its not a pypark function so it wont be really be completely benefiting from spark right.
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable.
sum reduces the output by taking elements of this list and calling the __add__ method (+). The imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
This gives a Column:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched, and when evaluated it benefits from all Spark optimizations.
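The same reduction can also be written explicitly with functools.reduce, which makes it clear that only Column expressions are being combined and no driver-side data is involved (a sketch of the same idea, not from the original answer):
from functools import reduce
from operator import add

# Build Column<(A.p1 + B.p1)> explicitly and attach it as a new column.
sum_col = reduce(add, [df[col] for col in ["`A.p1`", "`B.p1`"]])
df = df.withColumn('sum1', sum_col)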
Addition of multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Python's built-in sum function works for some folks but gives an error for others (possibly because of a conflict in names).
In your 3rd approach, the expression (inside python's sum function) is returning a PySpark DataFrame.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
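One caveat for the column names in the original question: because they contain dots, the generated SQL expression needs backtick quoting so Spark does not interpret them as struct field accesses (a sketch, reusing expr from the import above):
cols_list = ["A.p1", "B.p1"]
# Quote each name with backticks before joining into the expression string.
expression = ' + '.join(f'`{c}`' for c in cols_list)
df = df.withColumn('sum_cols', expr(expression))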

Efficient way of applying specific function to structured column in Spark data frame?

I have data in a Spark data frame, with a column col that contains structured data of the form:
------ col ------- # Column whose elements are structures
field0 field1 … # StructType with StructFields (variable names and count)
[1,2,3] [4,5] [6] # Each field is of type ArrayType
[1,2] [3] []
…
where the number and the names of the fields are not fixed.
What is the most efficient way of calculating the total number of elements in each row? In the above example, the expected resulting data frame would thus be:
num_elements
6
3
…
There is always the solution of a user defined function:
from pyspark.sql.types import IntegerType

def num_elements(all_arrays_in_row):
    return sum(map(len, all_arrays_in_row))

num_elements = pyspark.sql.functions.udf(num_elements, IntegerType())
data_frame.select(num_elements(data_frame.col)).show()  # Number of elements in each row
Now, I am not sure whether this is generally efficient, because:
Function num_elements() is in Python.
If the fields happen to not be stored together for some reason, the map() forces a fetch of each array before calculating their length.
More generally, a "pure" Spark approach would be more efficient, but it is eluding me. What I tried so far is the following, but this is way more cumbersome than the approach above, and is also not complete:
Get the field names field0, etc. with [field.name for field in data_frame.select("col").schema.fields[0].dataType.fields] (cumbersome).
For each field name, efficiently calculate the size of its array:
sizes_one_field = data_frame.select(pyspark.sql.functions.size(
    data_frame.col.getField(field_name)))
Now, I am stuck at this point because I am not sure how to sum together the 1-column data frames sizes_one_field (there is one for each field name). Plus, maybe there is a way of directly applying the size() function to each field of column col in Spark (through some kind of map?)? Or some completely different approach to getting the total number of elements in each row?
You can try something like the following:
from pyspark.sql import functions as f
result = df.select(sum((f.size(df[col_name]) for col_name in df.columns), f.lit(0)))
This solution uses the pyspark.sql built-in functions and will be executed in an optimized way. For more information about these functions, you can check the PySpark documentation.
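Since the arrays in the original question live inside a single struct column, a small adaptation (a sketch, assuming the struct column is literally named col) first expands the struct with "col.*" and then sums the per-field sizes:
from pyspark.sql import functions as f

expanded = df.select("col.*")  # one array column per struct field
result = expanded.select(
    sum((f.size(expanded[c]) for c in expanded.columns), f.lit(0)).alias("num_elements")
)
result.show()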

How to self-reference column in pandas Data Frame?

In Python's Pandas, I am using the Data Frame as such:
drinks = pandas.read_csv(data_url)
Where data_url is a string URL to a CSV file
When indexing the frame for all "light drinkers" where light drinkers is constituted by 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or to mutate the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement along the lines of your idea is being discussed here. For simple cases, where might be enough. The new API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of self (meaning you will have rows where all entries are NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference self.
I don't know of any way to reference parent objects like self or this in pandas, but perhaps another way of doing what you want that could be considered more DRY is where():
drinks.where(drinks.light_drinker == 1, inplace=True)
