How to include pandas comparison operators in the function? - python

Let's say I have the following data
import pandas as pd
d = {'index': [0, 1, 2], 'a': [10, 8, 6], 'b': [4, 2, 6],}
data_frame = pd.DataFrame(data=d).set_index('index')
Now what I do is filter this data based on the values of column "b", let's say like this:
new_df = data_frame[data_frame['b']!=4]
new_df1 = data_frame[data_frame['b']==4]
What I want to do, instead of the method above, is to write a function where I can also indicate what kind of comparison operator it should use. Something like this:
def slice(df, column_name):
    df_new = df[df[column_name] != 4]
    return df_new
new_df2 = slice(df=data_frame, column_name='b')
The function above only applies the != operation to the data. What I want is for both != and == to be available in the function, so that when I use it I can indicate which one to apply.
Let me know if my question needs more detailed clarification.

You could add a boolean parameter to your function:
def slice(df, column_name, equality=True):
    if equality:
        df_new = df[df[column_name] == 4]
    else:
        df_new = df[df[column_name] != 4]
    return df_new
new_df2 = slice(df=data_frame, column_name='b', equality=True)
By the way, slice is a built-in Python function, so it's probably a good idea to rename yours to something else.
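If you want to support more comparisons than just == and !=, another option is to pass the comparison itself as a callable, for example via Python's built-in operator module. A minimal sketch (the name filter_df is just illustrative, not from the question):
import operator

def filter_df(df, column_name, compare=operator.ne, value=4):
    # compare is any callable taking (series, value) and returning a boolean mask,
    # e.g. operator.eq, operator.ne, operator.lt, operator.ge, ...
    return df[compare(df[column_name], value)]

new_df3 = filter_df(data_frame, 'b', compare=operator.eq)  # rows where b == 4
new_df4 = filter_df(data_frame, 'b', compare=operator.ne)  # rows where b != 4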

How do I correctly use .loc in pandas? [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am a newbie and trying to figure out how to correctly use the .loc function in pandas for slicing a dataframe. Any help is greatly appreciated.
The code is:
df1['Category'] = df[key_column].apply(lambda x: process_df1(x, 'category'))
where df1 is a dataframe,
key_column is a specific column identified to be operated upon
process_df1 is a function defined to run on df1.
The problem is I am trying to avoid the error:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using
.loc[row_indexer,col_indexer] = value instead"
I don't want to ignore / suppress the warnings or set pd.options.mode.chained_assignment = None.
Is there an alternative besides these 2?
I have tried using
df.loc[df1['Category'] = df[key_column].apply(lambda x: process_df1(x, 'category'))]
but it still produces the same error. Am I using the .loc incorrectly?
Apologies if it is a confusing question. For context, df1 and df2 come from slicing df:
df1 = df[:break_index]
df2 = df[break_index:]
Thank you.
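For context, this warning usually means df1 is a view of a slice of df rather than an independent DataFrame. A minimal sketch of the most common fix, assuming df1 and df2 come from slicing df as shown above (key_column and process_df1 as in the question):
# Take explicit copies so the new DataFrames no longer share data with df;
# assigning a new column then no longer triggers SettingWithCopyWarning.
df1 = df[:break_index].copy()
df2 = df[break_index:].copy()

df1['Category'] = df1[key_column].apply(lambda x: process_df1(x, 'category'))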
The apply method applies the function to each element of the Series you call it on (df[key_column] in this case) and returns a new Series.
If you are trying to create a new column based upon a function using another column as input, you can use a list comprehension:
df1['Category'] = [process_df1(x, 'category') for x in df1[key_column]]
NOTE: I'm assuming process_df1 operates on a single value from the key_column column and returns a new value, based on your writing. If that's not the case, please update your question.
Unless you give more details on the source data and your expected results, we won't be able to provide you a clear answer. For now, here's something I just created to help you understand how we can pass two values and get things going.
import pandas as pd
df = pd.DataFrame({'Category': ['fruit', 'animal', 'plant', 'fruit', 'animal', 'plant'],
                   'Good': [27, 82, 32, 91, 99, 67],
                   'Faulty': [10, 5, 12, 8, 2, 12],
                   'Region': ['north', 'north', 'south', 'south', 'north', 'south']})
def shipment(categ, y):
    if (categ, y) == ('fruit', 'a'):
        return 10
    elif (categ, y) == ('fruit', 'b'):
        return 20
    elif (categ, y) == ('animal', 'a'):
        return 30
    elif (categ, y) == ('animal', 'c'):
        return 40
    elif (categ, y) == ('plant', 'a'):
        return 50
    elif (categ, y) == ('plant', 'd'):
        return 60
    else:
        return 99

df['result'] = df['Category'].apply(lambda x: shipment(x, 'a'))
print(df)

Summary of categorical variables pandas

As stated in the title, I want to conduct some summary analysis of categorical variables in pandas, but have not come across a satisfying solution after searching for a while. So I developed the following code as a kind of self-answering question, with the hope that someone out there on SO can help to improve it.
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'x': ['a', 'b', 'b', 'c'],
                        'y': [1, 0, 0, np.nan],
                        'z': ['Jay', 'Jade', 'Jia', ''],
                        'u': [1, 2, 3, 3]})
def cat_var_describe(input_df, var_list):
    df = input_df.copy()
    # dataframe to store the result
    res = pd.DataFrame(columns=['var_name', 'unique_values', 'counts'])
    for var in var_list:
        temp_res = df[var].value_counts(dropna=False).rename_axis('unique_values').reset_index(name='counts')
        temp_res['var_name'] = var
        if var == var_list[0]:
            res = temp_res.copy()
        else:
            res = pd.concat([res, temp_res], axis=0)
    res = res[['var_name', 'unique_values', 'counts']]
    return res
cat_des_test = cat_var_describe(test_df, ['x','y','z','u'])
cat_des_test
Any helpful suggestions will be deeply appreciated.
You can use the pandas DataFrame describe() method.
describe() includes only numerical data by default.
To include categorical variables you must use the include argument.
Using 'object' returns only the non-numerical data:
test_df.describe(include='object')
Using 'all' returns a summary of all columns, with NaN where the statistic is inappropriate for the datatype:
test_df.describe(include='all')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
You can use the unique() method to get a list of individual values for a column, for example:
test_df['x'].unique()
For getting the number of occurrences of values in a column, you can use value_counts():
test_df['x'].value_counts()
A simplified loop over all columns of the DataFrame could look like this:
for col in list(test_df):
    print('variable:', col)
    print(test_df[col].value_counts(dropna=False).to_string())
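If you want that loop to produce a single DataFrame in the same long format as cat_var_describe above, a sketch could collect the pieces with pd.concat:
parts = []
for col in list(test_df):
    counts = (test_df[col]
              .value_counts(dropna=False)
              .rename_axis('unique_values')
              .reset_index(name='counts'))
    counts.insert(0, 'var_name', col)  # add the column name as the first column
    parts.append(counts)
summary = pd.concat(parts, ignore_index=True)
print(summary)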
You can use the describe function:
test_df.describe()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

Retrieve data in Pandas

I am using pandas and uproot to read data from a .root file, and I get a table like the following one:
The aforementioned table is made with the following code:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
df = ttree.pandas.df(branches, flatten=False)
I need to find the maximum value in LepPt, and, once found the maximum, I also need to retrieve the LepLepId of that maximum value.
I have no problem in finding the maximum values:
Pt_l1 = [max(i) for i in df.LepPt]
In this way I get an array with all the maximum values. However, I have to separate such values according to the LepLepId. So I need an array with the maximum LepPt and |LepLepId|=11 and one with the maximum LepPt and |LepLepId|=13.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I made some mock data since you didn't provide yours in any easy format. I think this is what you are looking for.
import pandas as pd
df = pd.DataFrame.from_records(
    [[[1, 2, 3], [4, 5, 6]],
     [[4, 6, 5], [7, 8, 9]]],
    columns=['LepPt', 'LepLepId']
)

df['max_LepPt'] = [max(i) for i in df.LepPt]

def f(row):
    # get the index position of the maximum within the row's LepPt list
    pos = row['LepPt'].index(row['max_LepPt'])
    return row['LepLepId'][pos]

df['same_index_LepLepId'] = df.apply(lambda x: f(x), axis=1)
returns:
       LepPt   LepLepId  max_LepPt  same_index_LepLepId
0  [1, 2, 3]  [4, 5, 6]          3                    6
1  [4, 6, 5]  [7, 8, 9]          6                    8
You could use the awkward.JaggedArray interface for this (one of the dependencies of uproot), which allows you to have irregularly sized arrays.
For this you would need to slightly change the way you load the data, but it allows you to use the same methods you would use with a normal numpy array, namely argmax:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
# branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
branches = ['LepPt', 'LepLepId'] # to save memory, only load what you need
# df = ttree.pandas.df(branches, flatten=False)
a = ttree.arrays(branches) # use awkward array interface
max_pt_idx = a[b'LepPt'].argmax()
max_pt_lepton_id = a[b'LepLepId'][max_pt_idx].flatten()
This is then just a normal numpy array, which you can assign to a column of a pandas dataframe if you want to. It should have the right dimensionality and order. It should also be faster than using the built-in Python functions.
Note that the keys are bytestrings, instead of normal strings and that you will have to take some extra steps if there are events with no leptons (in which case the flatten will ignore those empty events, destroying the alignment).
Alternatively, you can also convert the columns afterwards:
import awkward
df = ttree.pandas.df(branches, flatten=False)
max_pt_idx = awkward.fromiter(df["LepPt"]).argmax()
lepton_id = awkward.fromiter(df["LepLepId"])
df["max_pt_lepton_id"] = lepton_id[max_pt_idx].flatten()
The former will be faster if you don't need the columns again afterwards, otherwise the latter might be better.

How do I make custom comparisons in pytest?

For example, I'd like to assert that two PySpark DataFrames have the same data; however, just using == checks that they are the same object. Ideally I'd also like to be able to specify whether order matters or not.
I've tried writing a function that raises an AssertionError but that adds a lot of noise to the pytest output as it shows the traceback from that function.
The other thought I had was to mock the __eq__ method of the DataFrames but I'm not confident that's the right way to go.
Edit:
I considered just using a function that returns true or false instead of an operator, however that doesn't seem to work with pytest_assertrepr_compare. I'm not familiar enough with how that hook works so it's possible there is a way to use it with a function instead of an operator.
My current solution is to use a patch to override the DataFrame's __eq__ method. Here's an example with Pandas as it's faster to test with, the idea should apply to any object.
import pandas as pd
# use this import for python3
# from unittest.mock import patch
from mock import patch

def custom_df_compare(self, other):
    # Put logic for comparing df's here
    # Returning True for demonstration
    return True

@patch("pandas.DataFrame.__eq__", custom_df_compare)
def test_df_equal():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )
    assert df1 == df2
Haven't tried it yet but am planning on adding it as a fixture and using autouse to use it for all tests automatically.
In order to elegantly handle the "order matters" indicator, I'm playing with an approach similar to pytest.approx, which returns a new object with its own __eq__, for example:
class SortedDF(object):
    "Indicates that the order of data matters when comparing to another df"

    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # Put logic for comparing df's including order of data here
        # Returning True for demonstration purposes
        return True

def test_sorted_df():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )
    # Passes because SortedDF.__eq__ is used
    assert SortedDF(df1) == df2
    # Fails because df2's __eq__ method is used
    assert df2 == SortedDF(df2)
The minor issue I haven't been able to resolve is the failure of the second assert, assert df2 == SortedDF(df2). This order works fine with pytest.approx but doesn't here. I've tried reading up on the == operator but haven't been able to figure out how to fix the second case.
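For what it's worth, the second case most likely comes down to how Python dispatches ==: for a == b, Python calls a.__eq__(b) first and only falls back to b.__eq__(a) if that call returns NotImplemented. pytest.approx works on either side because float.__eq__ returns NotImplemented for types it doesn't know, whereas DataFrame.__eq__ performs an element-wise comparison and doesn't return NotImplemented, so SortedDF.__eq__ on the right-hand side is never consulted. A minimal sketch of the dispatch rule (class name purely illustrative):
class AlwaysEqual:
    def __eq__(self, other):
        # claim equality with anything, regardless of which side we are on
        return True

print(AlwaysEqual() == 1.0)  # True: AlwaysEqual.__eq__ is called directly
print(1.0 == AlwaysEqual())  # True: float.__eq__ returns NotImplemented,
                             # so Python falls back to AlwaysEqual.__eq__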
To do a raw comparison between the values of the DataFrames (must be exact order), you can do something like this:
import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])

pd.testing.assert_frame_equal(df1.toPandas(), df2.toPandas())
If you don't want order to matter, you can do some transformations on the pandas DataFrames to sort by particular key columns first, using the following function:
def assert_frame_equal_with_sort(results, expected, keycolumns):
    results = results.reindex(sorted(results.columns), axis=1)
    expected = expected.reindex(sorted(expected.columns), axis=1)
    results_sorted = results.sort_values(by=keycolumns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=keycolumns).reset_index(drop=True)
    pd.testing.assert_frame_equal(results_sorted, expected_sorted)

df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=3, c=3), Row(a=1, b=2, c=3)])

assert_frame_equal_with_sort(df1.toPandas(), df2.toPandas(), ['b'])
Just use the pandas.DataFrame.equals method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html
For example
assert df1.equals(df2)
assert can be used with anything that evaluates to a boolean, so yes, you can write any custom comparison function to compare two objects, as long as it returns a boolean. However, in this case there is no need for a custom function, as pandas already provides one.
You can use one of pytest's hooks, particularly pytest_assertrepr_compare. In there you can define what you want to compare and how; the docs are pretty good and include examples. Best of luck. :)
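For reference, here is a minimal conftest.py sketch of that hook. Note it only customizes how a failed comparison is reported; the assert expression itself still has to evaluate to a plain boolean, so the DataFrame check below is purely illustrative:
# conftest.py
import pandas as pd

def pytest_assertrepr_compare(config, op, left, right):
    # Provide a custom failure report for DataFrame == DataFrame assertions
    if isinstance(left, pd.DataFrame) and isinstance(right, pd.DataFrame) and op == "==":
        return [
            "DataFrames are not equal:",
            "  left shape:  {}".format(left.shape),
            "  right shape: {}".format(right.shape),
        ]
    # returning None falls back to pytest's default explanation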

Python - Population of PANDAS dataframe column based on conditions met in other dataframes' columns

I have 3 dataframes (df1, df2, df3) which are identically structured (number and labels of rows/columns), but populated with different values.
I want to populate df3 based on values in the associated column/rows in df1 and df2. I'm doing this with a FOR loop and a custom function:
for x in range(len(df3.columns)):
    df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
    if df1.iloc[:, y] != 1 and df2.iloc[:, y] == 0:
        return "NEW"
    elif df2.iloc[:, y] == 2:
        return "OLD"
    else:
        return "NEITHER"
I understand why I get an error message when I run this, but I can't figure out how to apply this function to a series. I could do it row by row with more complex code, but I'm hoping there's a more efficient solution? I fear my approach is flawed.
import numpy as np

v1 = df1.values
v2 = df2.values

df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
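If you later need more than two conditions, numpy.select can be easier to read than nested np.where calls. A sketch under the same assumptions (v1 and v2 as defined above):
conditions = [
    (v1 != 1) & (v2 == 0),  # -> 'NEW'
    v2 == 2,                # -> 'OLD'
]
choices = ['NEW', 'OLD']

# conditions are checked in order; rows matching none of them get the default
df3.loc[:] = np.select(conditions, choices, default='NEITHER')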
Yeah, try to avoid loops in pandas; they're inefficient, and pandas is built to be used with the underlying numpy vectorization.
You want to use the apply function.
Something like:
df3['new_col'] = df3.apply(lambda x: customFunction(x))
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
