Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
will not work because False has a value of 0, so a sum of zeros will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
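For reference, a minimal sketch reproducing the failing approach (the toy frame here is made up):
import pandas as pd

df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
# Every selected value is False, i.e. 0, so the sum is always 0:
print(df[df['boolean_column'] == False]['boolean_column'].sum())  # 0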
Use pd.Series.value_counts():
>>> import pandas as pd
>>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>>> df['boolean_column'].value_counts()
True 3
False 2
Name: boolean_column, dtype: int64
If you want to count False and True separately, you can use pd.Series.sum() together with ~:
>>> df['boolean_column'].values.sum()     # count of True
3
>>> (~df['boolean_column']).values.sum()  # count of False
2
With Pandas, the natural way is using value_counts:
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To count True or False values separately, don't compare against True / False explicitly; just sum the series to count True values, and take the inverse Boolean via ~ first to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
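A quick illustration of that subclass relationship, in plain Python:
print(isinstance(True, int))  # True: bool is a subclass of int
print(True + True)            # 2: True behaves as 1 in arithmetic
print(False * 10)             # 0: False behaves as 0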
Alternatively, you can calculate counts using NumPy:
import numpy as np

print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
I couldn't find exactly what I needed here. I needed the number of True and False occurrences for further calculations, so I used:
true_count = df['column'].value_counts()[True]
false_count = df['column'].value_counts()[False]
where df is your DataFrame and column is the column with booleans.
This alternative works for multiple columns and/or rows as well.
df[df==True].count(axis=0)
will get you the total number of True values per column. For a row-wise count, set axis=1.
df[df==True].count().sum()
Adding a sum() at the end will get you the total count for the entire DataFrame.
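As a worked check of both counts (the two-column frame is made up):
import pandas as pd

df = pd.DataFrame({'A': [True, False, True],
                   'B': [False, False, True]})
print(df[df == True].count(axis=0))  # per column: A 2, B 1
print(df[df == True].count().sum())  # entire DataFrame: 3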
You could simply sum:
sum(df["boolean_column"])
This finds the number of True elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
will yield the number of False elements.
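On the toy frame from the question, that gives:
import pandas as pd

df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
print(sum(df['boolean_column']))                              # 3 True values
print(len(df['boolean_column']) - sum(df['boolean_column']))  # 2 False values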
df.isnull()
returns a boolean mask over the DataFrame; True indicates a missing value.
df.isnull().sum()
returns the column-wise sum of the True values.
df.isnull().sum().sum()
returns the total number of NA elements.
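For example, on a small frame with a few NaNs (made-up data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
print(df.isnull())              # boolean mask; True marks a missing value
print(df.isnull().sum())        # per column: a 1, b 2
print(df.isnull().sum().sum())  # total: 3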
If you have a column of boolean values in a DataFrame, or, more interestingly, if you don't but want to count the number of values in a column satisfying a certain condition, you can try something like this (as an example I used <=):
(df['col']<=value).value_counts()
The parentheses make the comparison evaluate first; the result is a Series holding the False/True counts, which you can use for other calculations as well, indexing with [0] for the False count and [1] for the True count, even without creating an additional variable:
(df['col']<=value).value_counts()[0] # count of False
(df['col']<=value).value_counts()[1] # count of True
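As a side note, if one of the two outcomes never occurs, indexing by label raises a KeyError; Series.get sidesteps that (a sketch, assuming the same df and value):
counts = (df['col'] <= value).value_counts()
false_count = counts.get(False, 0)  # 0 if no row is False
true_count = counts.get(True, 0)    # 0 if no row is True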
Here is an attempt at as literal and brief an answer as possible; the value_counts() strategies are probably more flexible in the end. Accumulating (sum) and counting (count) are different operations, each expressing its own analytical intent, and sum depends on the type of the data.
"Count occurrences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64
While creating a data-cleaning project in Python, I found this code:
# let's see if there is any missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing, 2)))
It actually works fine, giving back the % of null values per column in the dataframe, but I'm a little confused about how it works: first we loop over each column of the dataframe, then we take that mean, but the mean of what exactly? The mean, per column, of the quantity of null cells, or what?
Just for reference, I've worked around it with this:
NullValues = df.isnull().sum() / len(df)
print('{} - {}%'.format(col, round(NullValues, 2)))
which gives me back basically the same results, but just to understand the mechanism... I'm still confused about the first block of code.
It's something that becomes very intuitive once you're used to it. The steps leading to this kind of code could be as follows:
To get the percentage of null values, we need to count all null rows, and divide the count by the total number of rows.
So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
The result of df[col].isnull() is a new column consisting of booleans -- True or False.
Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to zero.
So we would be left with df[col].isnull().sum() / len(df[col]).
But summing and dividing by the length is just the arithmetic mean! Therefore, we can shorten this to arrive at the final result: np.mean(df[col].isnull()).
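A quick sanity check of that equivalence (toy column, made-up values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan]})
col = 'a'
print(df[col].isnull().sum() / len(df[col]))  # 0.5
print(np.mean(df[col].isnull()))              # 0.5, the same value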
df[col].isnull() assigns a boolean (True/False) depending on the NA/null state of each value.
np.mean computes the average of the values, with True as 1 and False as 0, which is equivalent to computing the proportion of null values in the column.
np.mean([True, False, False, False])
# equivalent to
np.mean([1, 0, 0, 0])
# 0.25
The first thing that happens is
df[col].isnull()
This creates a vector of bool values, with True where the value is null. So if, for example, the values are [x1, x2, x3, null, x4], it gives the vector [False, False, False, True, False].
The next step is the np.mean function. It calculates the mean of the vector, replacing True with 1 and False with 0, i.e. treating it as the vector [0, 0, 0, 1, 0].
The mean of this vector equals the sum of the nulls divided by the length of the vector, which is exactly the method you are using.
Just a comment: it does not give a percentage; you need to multiply by 100 for that.
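In other words, to print an actual percentage (assuming the loop variables from the question):
print('{} - {:.2f}%'.format(col, pct_missing * 100))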
I have a dataframe in Python called df1 with 4 dichotomous variables, Ordering_1, Ordering_2, Ordering_3, and Ordering_4, holding True/False values.
I need to create a variable called Clean, based on the 4 other variables: when Ordering_1 == True, then Clean == "Ordering_1"; when Ordering_2 == True, then Clean == "Ordering_2"; and so on. Clean would then be a combination of all the true values from Ordering_1 through Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1['Clean'] = df1['Ordering_1'] + df1['Ordering_2'] + df1['Ordering_3'] + df1['Ordering_4']
Would anyone be able to help me do this in Python?
Universal solution if there are multiple Trues per row: filter the columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
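A worked example of that trick, with made-up data:
import pandas as pd

df = pd.DataFrame({'Ordering_1': [True, False, False],
                   'Ordering_2': [False, True, False],
                   'Ordering_3': [False, False, True],
                   'Ordering_4': [False, True, False]})
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
print(df['Clean'])
# 0               Ordering_1
# 1    Ordering_2,Ordering_4
# 2               Ordering_3
# Name: Clean, dtype: object
Each True contributes its column name (plus a comma) to the product, so a row with several Trues yields a comma-separated list.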
If there is only one True value per row, you can use the booleans of each column ("Ordering_1", "Ordering_2", etc.) together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can use that series to filter to the True rows, in this case only row 0.
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1, ["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
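Spelled out as a loop over all four columns, that could look like this (a sketch, assuming the column names from the question):
df1['clean'] = ''
for col in ['Ordering_1', 'Ordering_2', 'Ordering_3', 'Ordering_4']:
    # rows where this column is True get the column's name
    df1.loc[df1[col], 'clean'] = col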
After doing some operations, I get a dataframe with an index and a column of boolean values. I just need the indexes whose boolean value is True. How can I get those?
My output is like this ("AC name" is the index of the output dataframe):
AC name
Agiaon False
Alamnagar False
Alauli True
Alinagar False
...
Ziradei True
Name: Vote percentage, Length: 253, dtype: bool
Considering that the dataframe is df, it would be:
res = df[df['Vote percentage']].index
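If the object is actually a Series (as the printed output suggests), the same idea works with the series acting as its own mask; a sketch with a made-up subset of the data shown above:
import pandas as pd

s = pd.Series({'Agiaon': False, 'Alamnagar': False, 'Alauli': True,
               'Alinagar': False, 'Ziradei': True}, name='Vote percentage')
print(s[s].index)  # Index(['Alauli', 'Ziradei'], dtype='object')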
I have a dataframe with a column named diff. I am able to group this column and get the number of True and False occurrences in the data frame.
df.groupby('diff').size()
returns
diff
True 5101
False 61
dtype: int64
I want to access the count for True, i.e. 5101.
I have already tried
df.groupby('diff').size().loc['True']
It is a Series, so loc can be omitted:
import pandas as pd

s = pd.Series([5101, 61], index=[True, False])
print(s)
True 5101
False 61
dtype: int64
print(s[True])
5101
The answer is:
df_merged.groupby('diff').size().loc[True]
Explanation: note that
df_merged.groupby('diff').size().index
returns
Index([True, False], dtype='object', name='diff')
It's the boolean True, not the string "True"!
Using .loc with a lambda:
s = df.groupby('diff').size().loc[lambda x: x.index]  # the boolean index itself acts as the mask, keeping only the True row
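With that mask, s is a one-row Series holding only the True label, so the count comes out with positional indexing:
print(s)
# diff
# True    5101
# dtype: int64
print(s.iloc[0])  # 5101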
I'm trying to create a boolean series in which I treat the data in two different ways: I find the local minimum as the starting point of a boolean calculation, and anything before it should come back as False. The only way I can think of to do that is to split the resulting series in two, one from the start of the group to the row before the minimum, and one from the minimum to the end of the group, and finally concatenate them. You can see below that I create a list of False entries, then concatenate that with the boolean series I build starting at the minimum. This is really kludgy, and it doesn't keep the indexes intact.
min_pos = argrelextrema(group['B'].values, np.less)[0][:1][0]
ser = pd.concat([pd.Series([False] * min_pos),
                 (group[min_pos:].B.diff().shift(-1) <= -1)])
From this:
B
5876 500.2
5877 500.3
5878 500.4
5879 498.3
5880 499.0
5881 512
...
I end up with something like this for example:
1 False
2 False
3 False
5879 True
5880 False
5881 False
...
To fix it, I figured I could reset the indexes starting with the first one of the group, but that seems even more kludge-y.
ser.index = np.arange(group.index[0], group.index[0] + len(ser))
Is there a more elegant way to return False for everything before the minimum and combine that with the boolean series I create, keeping all indexes intact?
You can use the & operator to create one Boolean series instead of joining two. In other words, your Boolean series should satisfy two conditions:
Index of item is greater than or equal to index of minimum value
AND
Item satisfies your other Boolean calculation (which you will then be able to calculate for all items regardless of position)
import pandas as pd

B = pd.Series([10, 9, 7, 7, 12, 14])
# Get the first position of the minimum value
minValueIndex = B[B == B.min()].index[0]
# Boolean mask for condition 1
isAfterMin = B.index >= minValueIndex
# Boolean mask for condition 2: whatever calculation you make, over the entire series. Example:
myBoolean = [True, False, True, True, False, True]
# Final Boolean mask
ser = isAfterMin & myBoolean
print(ser)
# [False False  True  True False  True]
As you can see, items located before the minimum value are always assigned False, and the other items keep the booleans you calculated.
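As a side note, pandas has a built-in for the first-minimum lookup used above; a possible substitution:
# idxmin returns the label of the first occurrence of the minimum
minValueIndex = B.idxmin()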