I have come across the .any() method several times. I have used it quite a few times to check whether a particular string is contained in a dataframe. In that case it returns an array/dataframe (depending on how I structure it) of Trues and Falses, depending on whether the string matches the value of each cell. I also found the .any(1) method, but I am not sure how or in which cases I should use it.
.any(1) is the same as .any(axis=1): it reduces across the columns and returns one result per row, instead of one result per column. Note that recent pandas versions (2.0 and later) only accept the axis as a keyword argument, so write .any(axis=1).
With this sample dataframe:
x1 x2 x3
0 1 1 0
1 0 0 0
2 1 0 0
See the different outcomes:
import pandas as pd
df = pd.read_csv('bool.csv')
print(df.any())
>>>
x1 True
x2 True
x3 False
dtype: bool
So .any() checks if any value in a column is True
print(df.any(axis=1))  # written as df.any(1) in older pandas versions
>>>
0 True
1 False
2 True
dtype: bool
So .any(1) checks if any value in a row is True
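The bool.csv file is not shown, so here is a minimal sketch that builds the same sample frame inline (the inline construction is an assumption standing in for the CSV) and passes the axis by keyword:

```python
import pandas as pd

# Same values as the sample dataframe above, built inline instead of read from bool.csv
df = pd.DataFrame({"x1": [1, 0, 1], "x2": [1, 0, 0], "x3": [0, 0, 0]})

per_column = df.any(axis=0)  # one result per column (the default)
per_row = df.any(axis=1)     # one result per row

print(per_column)
print(per_row)
```

The two results match the outputs shown above: x3 is the only all-zero column, and row 1 is the only all-zero row.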
The documentation is self-explanatory; however, for the sake of the question:
any() is a Series and DataFrame method. It checks whether any value in the caller object (DataFrame or Series) is truthy, i.e. not 0, False or empty, and returns True if so. If all values are 0, it returns False.
Note: NaN is not treated as 0. By default (skipna=True) missing values are skipped; with skipna=False they count as True.
Example DataFrame:
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
axis='columns' is an alias for axis=1: each row is reduced across its columns, so the call below gives the same result.
>>> df.any(axis=1)
0 True
1 True
dtype: bool
any is true if at least one is true
any is False if all are False
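The note about NaN can be checked with a small sketch. The two series below are made up for illustration; the skipna parameter is part of the documented any() signature:

```python
import numpy as np
import pandas as pd

only_nan = pd.Series([np.nan, np.nan])
zeros_and_nan = pd.Series([0.0, np.nan])

# Default skipna=True: NaN values are ignored before the check
skipped = bool(only_nan.any())
mixed_skipped = bool(zeros_and_nan.any())

# skipna=False: NaN is not equal to zero, so it counts as True
counted = bool(only_nan.any(skipna=False))
mixed_counted = bool(zeros_and_nan.any(skipna=False))

print(skipped, mixed_skipped, counted, mixed_counted)
```

So a column holding only zeros and NaN is reported as False by default, and as True once skipna=False is passed.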
Here is a nice blog post about any() and all() by Guido van Rossum.
I'm comparing a dataframe to multiple variables as follows...
if not ((df['Column1'] == variableA) & (df['Column2'] == variableB) & (df['Column3'] == variableC)).any():
    print("row not in dataframe")
Both my dataframe and my variables may contain a None value, but I've noticed that when two None values are compared, the result is not True, so "row not in dataframe" is printed even though both the dataframe and the variables hold the same (None) value.
Any thoughts on the best way around this issue would be really appreciated. Perhaps I have to convert the None values to a string that will return True when compared?
The reason is that pandas treats None as a missing value, and an elementwise == comparison against a missing value returns False, even when both sides are None.
Consider the following example:
s = pd.Series([1, "a", None])
s == 1
Out[4]:
0 True
1 False
2 False
dtype: bool
s == "a"
Out[5]:
0 False
1 True
2 False
dtype: bool
s == None
Out[6]:
0 False
1 False
2 False
dtype: bool
s.isna()
Out[7]:
0 False
1 False
2 True
dtype: bool
So if you want two Nones to count as equal, you also need to check whether the values are missing (isna).
If you have two series (e.g. two columns of a dataframe), you need to combine both results:
d = pd.Series([2, "a", None])
s == d
Out[12]:
0 False
1 True
2 False
dtype: bool
(s == d) | (s.isna() & d.isna())
Out[13]:
0 False
1 True
2 True
dtype: bool
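This means that the solution for your problem would be something like the following, where each column is compared on its own and a missing value in the column counts as a match only if the variable is missing too:
mask = (
    ((df['Column1'] == variableA) | (df['Column1'].isna() & pd.isna(variableA)))
    & ((df['Column2'] == variableB) | (df['Column2'].isna() & pd.isna(variableB)))
    & ((df['Column3'] == variableC) | (df['Column3'].isna() & pd.isna(variableC)))
)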
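The null-safe comparison can also be wrapped in a small helper so it does not have to be repeated per column. This is only a sketch under assumptions: the function names null_safe_eq and row_exists, the criteria dictionary and the sample data are hypothetical, and the variables are assumed to be scalars.

```python
import pandas as pd

def null_safe_eq(column, value):
    """Elementwise equality where missing matches missing (scalar value assumed)."""
    if pd.isna(value):
        return column.isna()
    return column == value

def row_exists(frame, criteria):
    """Return True if some row matches every column/value pair in criteria."""
    mask = pd.Series(True, index=frame.index)
    for name, value in criteria.items():
        mask &= null_safe_eq(frame[name], value)
    return bool(mask.any())

# Hypothetical data: the second row holds missing values in two columns
df = pd.DataFrame({
    "Column1": ["a", None],
    "Column2": [1.0, None],
    "Column3": ["x", "y"],
})

found = row_exists(df, {"Column1": None, "Column2": None, "Column3": "y"})
not_found = row_exists(df, {"Column1": "a", "Column2": None, "Column3": "x"})
print(found, not_found)
```

The first lookup succeeds because None matches the missing cells in the second row; the second fails because Column2 holds 1.0 in the only row where Column1 is "a".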
I have this dataframe :
df = pd.DataFrame()
df['Col1'] = [['B'],['A','D','B'],['D','C']]
df['Col2'] = [1,2,4]
df
Col1 Col2
0 [B] 1
1 [A,D,B] 2
2 [D,C] 4
I would like to know whether Col1 contains the list [B,A,D], regardless of the order of the elements (both in the lists inside the column and in the list to check).
I would therefore like to get True as the answer here.
How could I do this?
Thanks
If values are not duplicated you can compare sets:
L = ['B','A','D']
print (df['Col1'].map(set).eq(set(L)))
0 False
1 True
2 False
Name: Col1, dtype: bool
If you want a scalar output (True or False), test whether there is at least one True in the column with Series.any:
print (df['Col1'].map(set).eq(set(['B','A','D'])).any())
True
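If the lists can contain duplicates and the counts matter, comparing sets is no longer enough. A possible sketch, assuming collections.Counter is an acceptable way to compare lists while ignoring order; the extra fourth row and the variable names are made up for illustration:

```python
from collections import Counter

import pandas as pd

# The fourth row is hypothetical and repeats 'A' to show the difference
df = pd.DataFrame({"Col1": [["B"], ["A", "D", "B"], ["D", "C"], ["A", "A", "D", "B"]]})
target = ["B", "A", "D"]

# Set comparison ignores order and duplicates
set_match = df["Col1"].map(lambda items: set(items) == set(target))

# Counter comparison ignores order but respects how often each value occurs
count_match = df["Col1"].map(lambda items: Counter(items) == Counter(target))

print(set_match.tolist())
print(count_match.tolist())
print(bool(count_match.any()))
```

Only the set comparison accepts the fourth row, because it discards the repeated 'A'.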
Use:
l=['B','A','D']
[set(i)==set(l) for i in df['Col1']]
#[False, True, False]
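IIUC, another option is str.get_dummies. Note that this tests whether each row contains every element of l, so a row with additional values would also return True:
l=['B','A','D']
df.Col1.str.join(',').str.get_dummies(',')[l].all(axis=1)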
Out[197]:
0 False
1 True
2 False
dtype: bool
How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN,
so a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.nan, '', 1.0])  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val==val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, so the self-comparison above does not flag None as missing, whereas pd.isnull(None) returns True. Depending on your needs, use == if None should not be treated as NaN, or pd.isnull if it should.
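To make the difference concrete for single values, here is a small sketch that applies both checks to a few scalars (the sample values are chosen only for illustration):

```python
import numpy as np
import pandas as pd

values = [np.nan, None, "text", "", 1.0, float("nan")]

# pd.isna works on scalars of any type and treats None as missing
is_missing = [bool(pd.isna(v)) for v in values]

# The self-comparison trick only detects NaN, not None
equals_itself = [v == v for v in values]

for v, missing, same in zip(values, is_missing, equals_itself):
    print(repr(v), missing, same)
```

Both checks accept strings without raising an error; they disagree only on None.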
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
[None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False, True],
[ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
[False, True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
I have a dataframe that looks like this:
df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"], "No": [1, 1, 2, 3]})
No piece
0 1 piece1
1 1 piece2
2 2 piece3
3 3 piece4
I have a series with an index that corresponds to the "No"-column in the dataframe. It assigns boolean variables to the "No"-values, like so:
s = pd.Series([True, False, True, True])
0 True
1 False
2 True
3 True
dtype: bool
I would like to select those rows from the dataframe where in the series the "No"-value is True. This should result in
No piece
2 2 piece3
3 3 piece4
I've tried a lot of indexing with df["No"].isin(s), or something like df[s["No"] == True]... But it didn't work yet.
I think you need to map the values in the No column to the True/False condition and use the result for subsetting:
df[df.No.map(s)]
# No piece
#2 2 piece3
#3 3 piece4
df.No.map(s)
# 0 False
# 1 False
# 2 True
# 3 True
# Name: No, dtype: bool
You are trying to index into s using df['No'], then use the result as a mask on df itself:
df[s[df['No']].values]
The final mask needs to be extracted as an array using values because the duplicates in the index cause an error otherwise.
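For comparison, here is a runnable sketch of the map approach next to an isin variant. The isin variant is an alternative that does not appear in the answers above; it is shown on the assumption that keeping only the "No" values flagged True in s is the goal.

```python
import pandas as pd

df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"],
                   "No": [1, 1, 2, 3]})
s = pd.Series([True, False, True, True])

# Approach from the answer: look up each "No" value in s and use the result as a mask
mapped = df[df["No"].map(s)]

# Alternative: collect the index labels where s is True, then test membership
wanted = s[s].index
selected = df[df["No"].isin(wanted)]

print(mapped)
print(selected)
```

The isin form also tolerates "No" values that are missing from the index of s, which map would turn into NaN.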
I have a regular DataFrame with a string type (object) column. When I try to filter on the column using the equivalent of a WHERE clause, I get a KeyError when I use the dot notation. When in bracket notation, all is well.
I am referring to these instructions:
df[df.colA == 'blah']
df[df['colA'] == 'blah']
The first gives the equivalent of
KeyError: False
Not posting an example as I cannot reproduce the issue on a bespoke DataFrame built for the purpose of illustration: when I do, both notations yield the same result.
Asking then if there is a difference in the two and why.
The dot notation is just a convenient shortcut for accessing columns compared with the standard brackets. Notably, it doesn't work when the column name is something like sum that is already a DataFrame method. My bet would be that the column name in your real example is running into that issue: the bracket selection works fine, but the dot notation tests whether a method is equal to 'blah'. That comparison yields the plain scalar False, and df[False] then looks for a column named False, which raises KeyError: False.
Quick example below:
In [67]: df = pd.DataFrame(np.arange(10).reshape(5,2), columns=["number", "sum"])
In [68]: df
Out[68]:
number sum
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [69]: df.number == 0
Out[69]:
0 True
1 False
2 False
3 False
4 False
Name: number, dtype: bool
In [70]: df.sum == 0
Out[70]: False
In [71]: df['sum'] == 0
Out[71]:
0 False
1 False
2 False
3 False
4 False
Name: sum, dtype: bool
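The session above stops at the comparison itself. As a sketch of the remaining step, using the same example frame, this shows how the scalar False produced by the dot notation turns into the KeyError from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["number", "sum"])

attribute_mask = df.sum == 0   # compares the bound method DataFrame.sum with 0
bracket_mask = df["sum"] == 0  # compares the column values with 0

try:
    df[attribute_mask]         # same as df[False]: looks for a column named False
    error = None
except KeyError as exc:
    error = exc

print(type(attribute_mask).__name__, type(bracket_mask).__name__)
print(repr(error))
print(len(df[bracket_mask]))
```

The bracket form produces a boolean Series that works as a row mask, while the attribute form produces a single bool that pandas interprets as a column label.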