I'm trying to create a boolean series in which I treat the data in two different ways. I want to find the local minimum as the starting point of a boolean calculation, and return False for everything before it. My problem is that the only way I can think of to do that is to split the resulting series in two: one part from the start of the group to the row before the minimum, and one from the minimum to the end of the group, finally concatenating them. You can see below that I create a list of False entries, then concatenate it with the boolean series I build starting at the minimum. This is really kludge-y, and it doesn't keep the indexes intact.
ser = pd.concat([
    pd.Series([False] * argrelextrema(group['B'].values, np.less)[0][:1][0]),
    group[argrelextrema(group['B'].values, np.less)[0][:1][0]:].B.diff().shift(-1) <= -1,
])
From this:
B
5876 500.2
5877 500.3
5878 500.4
5879 498.3
5880 499.0
5881 512
...
I end up with something like this for example:
1 False
2 False
3 False
5879 True
5880 False
5881 False
...
To fix it, I figured I could reset the indexes starting with the first one of the group, but that seems even more kludge-y.
ser.index = np.arange(group.index[0], group.index[0] + len(ser))
Is there a more elegant way to return False for everything before the minimum and combine that with the boolean series I create, keeping all indexes intact?
You can use the operator & to create one Boolean series instead of joining two series. In other words, your Boolean series should satisfy two conditions:
Index of item is greater than or equal to index of minimum value
AND
Item satisfies your other Boolean calculation (which you will then be able to calculate for all items regardless of position)
import pandas as pd
B = pd.Series([10, 9, 7, 7, 12, 14])
# Getting first position of minimum value
minValueIndex = B[B == B.min()].index[0]
# Boolean list for condition 1
isAfterMin = B.index >= minValueIndex
# Boolean list for condition 2: whatever calculation you make, for entire series. Example:
myBoolean = [True, False, True, True, False, True]
# Final Boolean list
ser = isAfterMin & myBoolean
print(ser)
# [False False True True False True]
As you can see, items located before the minimum value are always assigned a False, and the other items are assigned the booleans you calculate.
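Applied back to the original example, a sketch of the same idea (assuming scipy's argrelextrema and the group frame from the question):
import numpy as np
from scipy.signal import argrelextrema
# Position, then index label, of the first local minimum in column B
min_pos = argrelextrema(group['B'].values, np.less)[0][0]
min_label = group.index[min_pos]
# One combined mask: False before the minimum, the diff test afterwards
ser = (group.index >= min_label) & (group['B'].diff().shift(-1) <= -1)
Because the mask is built against group.index itself, the original indexes stay intact.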
Having a strange issue with using a filter in loc.
example df:
Name trail
0 XYZ True
1 A True
2 B True
3 C True
# Trail filter
filter_trail = (df['trail'] == False)
# Set a row to False to check
df.at[3, 'trail'] = False
# use the filter, using loc because I will combine conditions
df.loc[filter_trail]
# I get the expected result
# Test further
df.at[0, 'trail'] = False
# use loc statement from earlier
# result only shows the 1st row i.e. the row with index 3
# No error in terminal
# decide to try dropping the column and setting column
df.drop('trail', axis=1, inplace=True)
df['trail'] = [True, False, False, False]
# run loc
df.loc[filter_trail]
# result still only shows row with index 3
# run without loc
df[filter_trail]
# result still only shows row with index 3
# run
df[df['trail'] == False]
# Get the desired result i.e. row index: 1,2,3
I am not sure what I am doing wrong here. Never seen this happen before.
filter_trail is not a live reference to the DataFrame; it is a Boolean Series computed from the trail column at the moment you define it. In other words, it is a new, standalone set of values derived from the column, not a view that tracks later changes to it.
Two fellow Stack Overflow contributors (see the comments above) and I tried your code, and all three of us got an empty filter_trail.
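A minimal sketch of that behaviour, with hypothetical data: the mask is a snapshot taken when it is defined, so later edits to the DataFrame do not change it.
import pandas as pd
df = pd.DataFrame({'trail': [True, True, True, True]})
# Evaluated immediately: every value is True, so the mask is all False
filter_trail = (df['trail'] == False)
print(filter_trail.sum())  # 0 -- an "empty" filter
# Changing the DataFrame afterwards does not update the mask
df.at[3, 'trail'] = False
print(filter_trail.sum())  # still 0
# Rebuild the mask after the edit to pick up the new values
filter_trail = (df['trail'] == False)
print(df.loc[filter_trail])  # now shows row 3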
Trying to conditionally execute a query, only when ColumnA = 'New' and ColumnB = 'Left' (in each individual row). I know that str.contains() works when I only have one condition; however, I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be successfully applied to my scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    # do something...
Very basic example of the dataframe:
ColumnA  ColumnB
New      Left
Used     Right
Scrap    Down
New      Right
First row would be the desired row to carry forward (since it meets the criteria).
You have the right idea; however, your code isn't expressing exactly what you want to do.
df1['ColumnA'].str.contains('New')
will return a Series of True/False values corresponding to the rows where the condition holds, not a single True or False saying whether the entire column contains 'New'. To check the whole column at once, consider something like the following:
'New' in df1['ColumnA'].values
If you are trying to do it on a row-by-row basis, then you must use the bitwise operator & to combine the truth values across Series.
The membership check above returns a plain boolean like you expected; hopefully this helps (:
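For instance, if the goal is to branch only when at least one row meets both conditions, a sketch along these lines avoids the ambiguous-truth-value error by reducing the mask with .any():
mask = df1['ColumnA'].str.contains('New') & df1['ColumnB'].str.contains('Left')
if mask.any():  # collapse the Series of booleans into a single boolean
    matching_rows = df1[mask]  # the rows to carry forward
    # do something...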
Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
A B
0 New Left
1 Used Right
2 Scrap Down
3 New Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
A B C
0 New Left True
1 Used Right False
2 Scrap Down False
3 New Right False
I have a DataFrame in Python called df1 with four dichotomous variables, Ordering_1, Ordering_2, Ordering_3, and Ordering_4, holding True/False values.
I need to create a variable called Clean, which is based on the four other variables. Meaning, when Ordering_1 == True, Clean should equal "Ordering_1"; when Ordering_2 == True, Clean should equal "Ordering_2"; and so on. Clean would then combine all the True values from Ordering_1 through Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone be able to help me do this in Python?
Universal solution if there are multiple True values per row - filter the columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
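For example, with some hypothetical data (note the third row, which has two True values):
import pandas as pd
df = pd.DataFrame({
    'Ordering_1': [True, False, False],
    'Ordering_2': [False, True, True],
    'Ordering_3': [False, False, True],
})
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
print(df['Clean'])
# 0               Ordering_1
# 1               Ordering_2
# 2    Ordering_2,Ordering_3
# Name: Clean, dtype: object
This works because multiplying a string by a boolean keeps the string for True and yields '' for False, and the dot product then concatenates the survivors.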
If there is only one True value per row, you can use the booleans of each column ("Ordering_1", "Ordering_2", etc.) together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can use it to filter on the True rows, in this case only row 0.
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
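Putting it together, a sketch that loops over the four columns (assuming at most one True per row, as stated above):
df1["clean"] = ""
for col in ["Ordering_1", "Ordering_2", "Ordering_3", "Ordering_4"]:
    # Rows where this column holds True get the column's name
    df1.loc[df1[col], "clean"] = col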
I'm running into an issue which probably has an obvious fix. I have a 2x2 DataFrame that contains lists. When I compare an entire row against the specific list value I'm looking for, the boolean array that is returned is entirely False. This seems incorrect, since the first value of the row is exactly the value I'm looking for. When I instead compare the single cell in the DataFrame, I get True. Why does the boolean operation over the entire row give False for the value in the first column? Thanks in advance!
The comparison below returns False for both values in the first row.
In:
static_levels = pd.DataFrame([[[58, 'Level']], [[24.4, 'Level'], [23.3, 'Level']]], ['Symbol 1', 'Symbol 2'])
print(static_levels.loc['Symbol 1',:]==[58, 'Level'])
Out:
0 False
1 False
Name: Symbol 1, dtype: bool
However, the code below correctly returns True when comparing just the first value in the first row:
In: print(static_levels.loc['Symbol 1',0]==[58, 'Level'])
Out: True
Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
Will not work because False has a value of 0, hence a sum of zeroes will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
Use pd.Series.value_counts():
>>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>>> df['boolean_column'].value_counts()
True     3
False    2
Name: boolean_column, dtype: int64
If you want to count False and True separately you can use pd.Series.sum() + ~:
>>> df['boolean_column'].values.sum()  # True
3
>>> (~df['boolean_column']).values.sum()  # False
2
With Pandas, the natural way is using value_counts:
df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To calculate True or False values separately, don't compare against True / False explicitly, just sum and take the reverse Boolean via ~ to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
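A quick illustration of that subclass relationship in plain Python:
print(isinstance(True, int))     # True: bool subclasses int
print(True + True + False)       # 2: True behaves as 1, False as 0
print(sum([True, False, True]))  # 2: so summing counts the Trues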
Alternatively, you can calculate counts using NumPy:
print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
I couldn't find exactly what I needed here. I needed the number of True and False occurrences for further calculations, so I used:
true_count = df['column'].value_counts()[True]
false_count = df['column'].value_counts()[False]
where df is your DataFrame and column is the column with the booleans.
This alternative works for multiple columns and/or rows as well.
df[df==True].count(axis=0)
will get you the total number of True values per column. For a row-wise count, set axis=1.
df[df==True].count().sum()
Adding sum() at the end will get you the total for the entire DataFrame.
You could simply sum:
sum(df["boolean_column"])
This will find the number of "True" elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
Will yield the number of "False" elements.
df.isnull()
returns a boolean value. True indicates a missing value.
df.isnull().sum()
returns column wise sum of True values.
df.isnull().sum().sum()
returns the total number of NA elements.
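For instance, with a small hypothetical frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})
print(df.isnull().sum())        # a: 1, b: 2 -- per-column NA counts
print(df.isnull().sum().sum())  # 3 -- total NA count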
In case you have a column in a DataFrame with boolean values, or, even more interestingly, in case you do not have one but you want to count the values in a column satisfying a certain condition, you can try something like this (using <= as an example):
(df['col']<=value).value_counts()
The parentheses make value_counts() run on the result of the comparison, which is a Series holding the True/False counts. You can access each count by its label, even without creating an additional variable:
(df['col']<=value).value_counts()[False] # for Falses
(df['col']<=value).value_counts()[True] # for Trues
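For example, with a hypothetical numeric column and a threshold of 4:
import pandas as pd
df = pd.DataFrame({'col': [1, 5, 3, 8]})
counts = (df['col'] <= 4).value_counts()
print(counts[True])   # 2 -- rows satisfying the condition
print(counts[False])  # 2 -- rows failing it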
Here is an attempt to be as literal and brief as possible in providing an answer. The value_counts() strategies are probably more flexible in the end. Accumulating with sum and counting with count are different operations, each expressing a distinct analytical intent; sum also depends on the type of the data.
"Count occurrences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64