Compare Python dataframe columns to variables containing None values - python

I'm comparing a dataframe to multiple variables as follows...
if not ((df['Column1'] == variableA) & (df['Column2'] == variableB) & (df['Column3'] == variableC)).any():
    print("row not in dataframe")
Both my dataframe and the variables may contain None values, but I've noticed that when two None values are compared the result is not True, so I print "row not in dataframe" even though the dataframe row and the variables hold the same (None) values.
Any thoughts on the best way around this issue would be really appreciated. Perhaps I have to convert the None values to a string that compares equal?

The reason is that you cannot assert equality for None with a simple equality operator: in pandas, a missing value (None or NaN) never compares equal to anything, including another missing value.
Consider the following example:
s = pd.Series([1, "a", None])
s == 1
Out[4]:
0 True
1 False
2 False
dtype: bool
s == "a"
Out[5]:
0 False
1 True
2 False
dtype: bool
s == None
Out[6]:
0 False
1 False
2 False
dtype: bool
s.isna()
Out[7]:
0 False
1 False
2 True
dtype: bool
So if you want to count two missing values as "equal", you need to check whether both sides are na.
If you have two series (e.g. two cols of a dataframe), you will need to create a union of results:
d = pd.Series([2, "a", None])
s == d
Out[12]:
0 False
1 True
2 False
dtype: bool
(s == d) | (s.isna() & d.isna())
Out[13]:
0 False
1 True
2 True
dtype: bool
This means that the solution for your problem needs that per-column union, so each column matches its variable either by equality or because both sides are missing:
mask = (
    ((df['Column1'] == variableA) | (df['Column1'].isna() & pd.isna(variableA)))
    & ((df['Column2'] == variableB) | (df['Column2'].isna() & pd.isna(variableB)))
    & ((df['Column3'] == variableC) | (df['Column3'].isna() & pd.isna(variableC)))
)
(Checking only the columns' isna(), without also checking the variables, would wrongly match rows where all three columns are NaN regardless of the variables' values.)
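Putting the pieces together, a minimal runnable sketch of a None-safe row lookup (the `matches` helper and the sample data are hypothetical, just to illustrate the pattern):

```python
import pandas as pd

# Hypothetical helper: two values match if they are equal,
# or if both are missing (None/NaN).
def matches(col, value):
    return (col == value) | (col.isna() & pd.isna(value))

df = pd.DataFrame({
    'Column1': ['x', 'x', None],
    'Column2': [1, 2, None],
    'Column3': [None, 'b', None],
})

variableA, variableB, variableC = 'x', 1, None

row_mask = (
    matches(df['Column1'], variableA)
    & matches(df['Column2'], variableB)
    & matches(df['Column3'], variableC)
)
# Row 0 matches: Column1/Column2 are equal, and Column3 and
# variableC are both missing.
if not row_mask.any():
    print("row not in dataframe")
```

Here `row_mask` is True only for row 0, so nothing is printed.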

Related

Sum different rows in a data frame based on multiple conditions

I have the following data frame:
dataset = {
    'String': ["ABCDEF", "HIJABC", "ABCHIJ", "DEFABC"],
    'Bool': [True, True, False, False],
    'Number': [10, 20, 40, 50]}
df = pd.DataFrame(dataset)
String Bool Number
0 ABCDEF True 10
1 HIJABC True 20
2 ABCHIJ False 40
3 DEFABC False 50
I would like to add the Number of each row where Bool is False to the Number of a row where Bool is True.
A False row is matched to the first True row whose String is not equal to the False row's String with its two halves swapped (by "reverse" I mean this half-swap: the reverse of ABCDEF is DEFABC).
In this case ABCHIJ (Bool False) does not match against ABCDEF, so I sum the numbers: 10 + 40 = 50.
HIJABC (Bool True) is summed with DEFABC (Bool False), giving 20 + 50 = 70.
Expected output:
String Bool Number
0 ABCDEF True 50
1 HIJABC True 70
2 ABCHIJ False 40
3 DEFABC False 50
I hope my explanation was good enough, is there a way to achieve the above outcome ?
One way is like this:
df_true = df[df['Bool'] == True]
df_false = df[df['Bool'] == False]
for i in df_false['String']:
    # Swap the two halves of the False row's string, then take the
    # first True row whose String differs from that swapped form.
    idx = df_true[df_true['String'] != (i[3:] + i[:3])].index[0]
    current_num = df.loc[df.index == idx, 'Number'].values[0]
    added_num = df[df['String'] == i]['Number'].values[0]
    df.loc[df.index == idx, 'Number'] = current_num + added_num
I hope it helps.
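Run end to end against the question's data, the loop produces the expected output; a self-contained reproduction:

```python
import pandas as pd

df = pd.DataFrame({
    'String': ["ABCDEF", "HIJABC", "ABCHIJ", "DEFABC"],
    'Bool':   [True, True, False, False],
    'Number': [10, 20, 40, 50],
})

df_true = df[df['Bool'] == True]
df_false = df[df['Bool'] == False]

for i in df_false['String']:
    # Swap the halves of the False string and pick the first True row
    # whose String differs from the swapped form.
    idx = df_true[df_true['String'] != (i[3:] + i[:3])].index[0]
    current_num = df.loc[df.index == idx, 'Number'].values[0]
    added_num = df[df['String'] == i]['Number'].values[0]
    df.loc[df.index == idx, 'Number'] = current_num + added_num

print(df['Number'].tolist())  # [50, 70, 40, 50]
```

Note that `df_true` is a copy taken before the updates, so the matching is done against the original strings while `df` accumulates the sums.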

Find empty cells in rows of a column | Dataframe pandas

My DataFrame has a column named "Teacher" and I want to know which rows in that column are empty.
Example:
print(df["Teacher"])
0
1
2 Richard
3
4 Richard
Name: Teacher, Length: 5, dtype: object
I know that if I do something like this:
R = ['R']
cond = df['Teacher'].str.startswith(tuple(R))
print(cond)
it returns a boolean for each row of the column, telling me which teachers' names start with "R".
print(cond)
0 False
1 False
2 True
3 False
4 True
Name: Teacher, Length: 5, dtype: object
I want the same for the empty ones: return True when the cell is empty and False when it's not, but I don't know how.
If "empty" means a missing value (None or NaN), use Series.isna:
cond = df['Teacher'].isna()
If "empty" means a string of zero or more spaces (empty or whitespace-only), use Series.str.contains:
cond = df['Teacher'].str.contains(r'^\s*$', na=False)
If "empty" means the empty string, compare against it directly:
cond = df['Teacher'] == ''
import numpy as np
import pandas as pd

df = pd.DataFrame({'Teacher': ['', ' ', None, np.nan, 'Richard']})
cond1 = df['Teacher'].isna()
cond2 = df['Teacher'].str.contains(r'^\s*$', na=False)
cond3 = df['Teacher'] == ''
df = df.assign(cond1=cond1, cond2=cond2, cond3=cond3)
print(df)
Teacher cond1 cond2 cond3
0 False True True
1 False True False
2 None True False False
3 NaN True False False
4 Richard False False False
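If you want to treat all three kinds of "empty" (missing value, empty string, whitespace-only string) as one condition, the masks above can be combined with |; a short sketch using the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Teacher': ['', ' ', None, np.nan, 'Richard']})

# Missing values OR strings that are empty/whitespace-only.
# na=False makes str.contains return False (not NaN) for missing values,
# but those rows are already caught by isna().
empty = df['Teacher'].isna() | df['Teacher'].str.contains(r'^\s*$', na=False)
print(empty.tolist())  # [True, True, True, True, False]
```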

What is the difference between .any() and .any(1)?

I have come across the .any() method several times. I have used it quite a few times to check whether a particular string is contained in a dataframe. In that case it returns an array/dataframe (depending on how I structure it) of Trues and Falses, depending on whether the string matches the values of the cells. I also found the .any(1) method, but I am not sure how or in which cases I should use it.
.any(1) is the same as .any(axis=1), which means look row-wise (across the columns) instead of down each column. (In recent pandas versions, passing the axis positionally is deprecated, so prefer the explicit .any(axis=1).)
With this sample dataframe:
x1 x2 x3
0 1 1 0
1 0 0 0
2 1 0 0
See the different outcomes:
import pandas as pd
df = pd.read_csv('bool.csv')
print(df.any())
>>>
x1 True
x2 True
x3 False
dtype: bool
So .any() checks if any value in a column is True
print(df.any(1))
>>>
0 True
1 False
2 True
dtype: bool
So .any(1) checks if any value in a row is True
The documentation is self-explanatory; however, for the sake of the question:
any() is a Series and DataFrame method. It checks whether any value in the caller object (DataFrame or Series) is truthy (non-zero) and returns True if so; if all values are falsy, it returns False.
Note: with the default skipna=True, NaN values are skipped rather than treated as 0.
Example DataFrame:
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
df.any(axis=1) is equivalent to axis='columns': both reduce across the columns, i.e. per row.
>>> df.any(axis=1)
0 True
1 True
dtype: bool
any is True if at least one value is truthy;
any is False only if all values are falsy.
Here is a nice blog post about any() and all() by Guido van Rossum.
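The NaN behaviour noted above can be checked directly: with the default skipna=True, missing values are simply ignored, so they neither count as truthy nor as 0. A quick check:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan])
# The NaN is skipped and 0 is falsy, so nothing truthy remains.
print(s.any())                    # False
# A Series of only NaN behaves the same once NaN is skipped.
print(pd.Series([np.nan]).any())  # False
```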

How to assert that a pandas data frame filtered on a condition is true

So I have a pytest test checking the results of a query that returns a pandas dataframe.
I want to assert that every value in a particular column col contains a given input as a substring.
The line below gives me the rows (a dataframe) whose column col contains the input. How can I assert that this holds for all rows?
assert result_df[result_df['col'].astype(str).str.contains(category)].bool == True
doesn't work.
Try this:
assert result_df[result_df['col'].astype(str).str.contains(category)].all(axis=None)
Please refer to the pandas docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.all.html
The reason your code doesn't work is that you are testing whether the dataframe object itself is truthy, not whether all of the values in it are.
I believe you need Series.all for check if all values of filtered Series are Trues:
assert result_df['col'].astype(str).str.contains(category).all()
Sample:
result_df = pd.DataFrame({
'col':list('aaabbb')
})
print (result_df)
col
0 a
1 a
2 a
3 b
4 b
5 b
category = 'b'
assert result_df['col'].astype(str).str.contains(category).all()
AssertionError
Detail:
print (result_df['col'].astype(str).str.contains(category))
0 False
1 False
2 False
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
False
category = 'a|b'
assert result_df['col'].astype(str).str.contains(category).all()
print (result_df['col'].astype(str).str.contains(category))
0 True
1 True
2 True
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
True
Found it. assert result_df[result_df['col'].astype(str).str.contains(category)].bool appears to pass, but only because .bool without parentheses is a bound method object, which is always truthy, so that assert can never fail. The same applies to .all without parentheses.
The reliable check is the called form: assert result_df['col'].astype(str).str.contains(category).all() (thanks to @jezrael for suggesting all()).
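The pitfall with the uncalled methods is worth seeing once: a bound method object is always truthy, so an assert on it passes regardless of the data. A minimal demonstration:

```python
import pandas as pd

result_df = pd.DataFrame({'col': list('aaabbb')})
mask = result_df['col'].astype(str).str.contains('b')

# A bound method object is truthy, so this assert can never fail,
# even though not all values contain 'b':
assert mask.all

# The called form actually evaluates the values:
print(mask.all())  # False: the 'a' rows do not contain 'b'
```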

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/pandas and am struggling with extracting the correct data from a pd.DataFrame. What I have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of rows where your list items match your column name in df.
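Both approaches as a runnable sketch (using the question's data; since Value is already boolean, the column itself can serve as the mask, without the == True comparison):

```python
import pandas as pd

data = pd.DataFrame({
    'Position': [1, 2, 3, 4, 5],
    'Letter': ['a', 'f', 'c', 'd', 'k'],
    'Value': [True, False, True, True, False],
})

# Boolean indexing: a boolean column is itself a valid mask.
answer = data[data['Value']]
print(answer['Letter'].tolist())  # ['a', 'c', 'd']

# isin for matching a column against a list of candidate values.
subset = data[data['Letter'].isin(['a', 'k'])]
print(subset['Position'].tolist())  # [1, 5]
```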
