Pandas Inconsistent Boolean Comparisons - python

I'm running into an issue which probably has an obvious fix. I have a 2x2 dataframe that has lists in it. When I take the first row and compare the entire row against a specific list value I'm looking for, the boolean array that is returned is entirely False. This seems incorrect, since the first value of the row is exactly the value I'm looking for. When I instead compare the singular value in the dataframe, I get True. Why, when doing boolean operations over the entire row, do I get False instead of True for the value in the first column? Thanks in advance!
The following returns False for both values in the first row.
In:
static_levels = pd.DataFrame([[[58, 'Level']], [[24.4, 'Level'], [23.3, 'Level']]], ['Symbol 1', 'Symbol 2'])
print(static_levels.loc['Symbol 1',:]==[58, 'Level'])
Out:
0 False
1 False
Name: Symbol 1, dtype: bool
However, the code below correctly returns True when comparing just the first value in the first row:
In: print(static_levels.loc['Symbol 1',0]==[58, 'Level'])
Out: True
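A likely explanation, with a minimal sketch: when the right-hand side of == is a list whose length matches the row, pandas broadcasts the comparison element-wise, so the first cell is compared against 58 and the second cell against 'Level', rather than each cell being compared against the whole list. To compare every cell against the whole list, one workaround is apply:
import pandas as pd

static_levels = pd.DataFrame([[[58, 'Level']], [[24.4, 'Level'], [23.3, 'Level']]],
                             ['Symbol 1', 'Symbol 2'])

# Compare each cell against the whole list instead of broadcasting across it;
# the first cell holds [58, 'Level'] -> True, the padded second cell -> False
print(static_levels.loc['Symbol 1', :].apply(lambda cell: cell == [58, 'Level']))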

Related

pandas - filter only works on 1 row when used as a stored variable?

I'm having a strange issue using a stored filter with loc.
example df:
Name trail
0 XYZ True
1 A True
2 B True
3 C True
# Trail filter
filter_trail = (df['trail'] == False)
# Set a row to False to check
df.at[3, 'trail'] = False
# use the filter, using loc because I will combine conditions
df.loc[filter_trail]
# I get the expected result
# Test further
df.at[0, 'trail'] = False
# use loc statement from earlier
# result only shows the 1st row i.e. the row with index 3
# No error in terminal
# decide to try dropping the column and setting column
df.drop('trail', axis=1, inplace=True)
df['trail'] = [True, False, False, False]
# run loc
df.loc[filter_trail]
# result still only shows row with index 3
# run without loc
df[filter_trail]
# result still only shows row with index 3
# run
df[df['trail'] == False]
# Get the desired result i.e. row index: 1,2,3
I am not sure what I am doing wrong here. I've never seen this happen before.
filter_trail is not created as a reference to the DataFrame; rather, it is a boolean value calculated from the trail column of the DF at the moment it is defined. It thereby creates a new set of data, calculated from the DF column but not referencing it, so later changes to the column do not update it.
Two fellow Stack Overflow contributors (see the comments above) and I, as a third, tried out your code, and we all received an empty filter_trail.
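A minimal sketch of the practical consequence, assuming the example df above: rebuild the mask after every mutation, since the stored boolean Series is a snapshot of the column, not a live view of it:
import pandas as pd

df = pd.DataFrame({'Name': ['XYZ', 'A', 'B', 'C'], 'trail': [True, True, True, True]})
df.at[3, 'trail'] = False
df.at[0, 'trail'] = False

# Recompute the mask after the changes; it is evaluated eagerly, not lazily
filter_trail = (df['trail'] == False)
print(df.loc[filter_trail])  # now shows the rows with index 0 and 3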

If Condition Based On 2 Columns

Trying to conditionally execute a query, only when ColumnA = 'New' and ColumnB = 'Left' (in each individual row). I know that str.contains() works when I only have one condition; however, I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be successfully applied in my scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    do something...
Very basic example of the dataframe:
ColumnA  ColumnB
New      Left
Used     Right
Scrap    Down
New      Right
First row would be the desired row to carry forward (since it meets the criteria).
You have the right idea; however, your code doesn't express exactly what you want to do.
df1['ColumnA'].str.contains('New')
will return a Series of True and False values corresponding to the indices where the condition holds, not a single True or False saying whether the entire column contains 'New'. To get a single boolean for the whole column, consider something like the following:
'New' in df['ColumnA'].values
If you are trying to do it on a row-by-row basis, then you must use the bitwise operator (&) to combine truth values across Series.
This will return a boolean like you expected. Hopefully this helps (:
Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
A B
0 New Left
1 Used Right
2 Scrap Down
3 New Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
A B C
0 New Left True
1 Used Right False
2 Scrap Down False
3 New Right False
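To tie this back to the if statement in the question: the ValueError arises because a boolean Series has no single truth value. A minimal sketch of one way to gate on the combined mask:
mask = df1['ColumnA'].str.contains('New') & df1['ColumnB'].str.contains('Left')
if mask.any():  # collapse the Series to a single boolean
    # do something, e.g. keep only the matching rows
    result = df1[mask]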

Pandas dataframe reports no matching string when the string is present

Fairly new to Python. This seems like a really simple question, but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually, in a particular column of the dataframe). Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False, however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it equals "time" and False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
0
0 ancestry
1 time
2 history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output that (A == "time").any() returns a Series with one entry per column, indicating whether that column contained "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs the behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
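Since the question mentions a whole list of strings, here is a small sketch (the words list is hypothetical) combining the answers above to test each string for an exact, whole-cell match anywhere in the DataFrame:
words = ['ancestry', 'time', 'missing']  # hypothetical list of strings to check
matches = {w: (A == w).any().any() for w in words}
print(matches)  # {'ancestry': True, 'time': True, 'missing': False}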

Count occurrences of True/False in column of dataframe

Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
Will not work because False has a value of 0, hence a sum of zeroes will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
Use pd.Series.value_counts():
>>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>>> df['boolean_column'].value_counts()
True 3
False 2
Name: boolean_column, dtype: int64
If you want to count False and True separately you can use pd.Series.sum() + ~:
>>> df['boolean_column'].values.sum() # True
3
>>> (~df['boolean_column']).values.sum() # False
2
With Pandas, the natural way is using value_counts:
df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To calculate True or False values separately, don't compare against True / False explicitly, just sum and take the reverse Boolean via ~ to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
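A quick check of that claim:
print(isinstance(True, int))  # True: bool is a subclass of int
print(True + True + False)    # 2: each True counts as 1 when summing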
Alternatively, you can calculate counts using NumPy:
print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
I couldn't find exactly what I needed here. I needed the number of True and False occurrences for further calculations, so I used:
true_count = df['column'].value_counts()[True]
false_count = df['column'].value_counts()[False]
Where df is your DataFrame and column is the column with booleans.
This alternative works for multiple columns and/or rows as well. 
df[df==True].count(axis=0)
Will get you the total amount of True values per column. For row-wise count, set axis=1. 
df[df==True].count().sum()
Adding a sum() in the end will get you the total amount in the entire DataFrame.
You could simply sum:
sum(df["boolean_column"])
This will find the number of "True" elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
Will yield the number of "False" elements.
df.isnull()
returns a boolean mask; True indicates a missing value.
df.isnull().sum()
returns the column-wise sum of True values.
df.isnull().sum().sum()
returns the total number of NA elements.
In case you have a column in a DataFrame with boolean values, or, even more interesting, in case you do not have one but want to find the number of values in a column satisfying a certain condition, you can try something like this (as an example I used <=):
(df['col']<=value).value_counts()
The parentheses make the comparison evaluate first, producing a boolean Series; value_counts() on that Series returns the number of True/False values, which you can use for other calculations as well, indexing by label without creating an additional variable:
(df['col']<=value).value_counts()[False] # for Falses
(df['col']<=value).value_counts()[True] # for Trues
Here is an attempt to be as literal and brief as possible in providing an answer. The value_counts() strategies are probably more flexible in the end. Accumulating with sum and counting with count are different operations, each expressing its own analytical intent, and sum depends on the type of the data.
"Count occurences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64

Keep indexes intact while doing two separate things with series

I'm trying to create a boolean series in which I treat data two different ways. I'm trying to find the local minimum to start a boolean calculation, and anything before that I'd like to return as False. My problem is that the only way I can think of to do that is to essentially split the resulting series in two: one from the start of the group to the row before the minimum, and one from the minimum to the end of the group, finally concatenating them. You can see below that I create a list of False entries, then concatenate that with the boolean series I created starting at the minimum. This is really kludgy, and it doesn't keep the indexes intact.
ser = pd.concat([pd.Series([False] * (argrelextrema(group['B'].values, np.less)[0][:1][0])), (group[argrelextrema(group['B'].values, np.less)[0][:1][0]:].B.diff().shift(-1) <= -1)])
From this:
B
5876 500.2
5877 500.3
5878 500.4
5879 498.3
5880 499.0
5881 512
...
I end up with something like this for example:
1 False
2 False
3 False
5879 True
5880 False
5881 False
...
To fix it, I figured I could reset the indexes starting with the first one of the group, but that seems even more kludge-y.
ser.index = np.arange(group.index[0], group.index[0] + len(ser))
Is there a more elegant way to return False for everything before the minimum and combine that with the boolean series I create, keeping all indexes intact?
You can use the operator & to create one Boolean series instead of joining two series. In other words, your Boolean series should satisfy two conditions:
Index of item is greater than or equal to index of minimum value
AND
Item satisfies your other Boolean calculation (which you will then be able to calculate for all items regardless of position)
B = pd.Series([10,9,7,7,12,14])
# Getting first position of minimum value
minValueIndex = B[B == B.min()].index[0]
# Boolean list for condition 1
isAfterMin = B.index >= minValueIndex
# Boolean list for condition 2: whatever calculation you make, for entire series. Example:
myBoolean = [True, False, True, True, False, True]
# Final Boolean list
ser = isAfterMin & myBoolean
print(ser)
# [False False True True False True]
As you can see, items located before the minimum value are always assigned a False, and the other items are assigned the booleans you calculate.
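Applied to the original argrelextrema setup, a minimal sketch (assuming the group frame shown in the question) that builds one mask and keeps the original index labels intact:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

group = pd.DataFrame({'B': [500.2, 500.3, 500.4, 498.3, 499.0, 512]},
                     index=range(5876, 5882))

# Positional index of the first local minimum, then its label
min_pos = argrelextrema(group['B'].values, np.less)[0][0]
min_label = group.index[min_pos]

# One boolean Series: False before the minimum, the diff condition from there on;
# the original labels (5876...) survive because nothing is concatenated
ser = (group['B'].diff().shift(-1) <= -1) & (group.index >= min_label)
print(ser)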
