After some operations, I end up with a dataframe that has an index and a column of boolean values. I just need the index labels whose value is True. How can I get them?
My output looks like this, where "AC name" is the index of the output:
AC name
Agiaon False
Alamnagar False
Alauli True
Alinagar False
Ziradei True
Name: Vote percentage, Length: 253, dtype: bool
Considering that the dataframe is df, it would be:
res = df[df['Vote percentage']].index
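If what you actually have is the boolean Series itself (as the printed output suggests, since it carries Name: Vote percentage), the same boolean indexing applies directly. A minimal sketch reconstructing a few rows of the sample:
import pandas as pd

s = pd.Series(
    [False, False, True, False, True],
    index=['Agiaon', 'Alamnagar', 'Alauli', 'Alinagar', 'Ziradei'],
    name='Vote percentage')
s.index.name = 'AC name'

res = s[s].index  # keep only the labels whose value is True
print(list(res))  # ['Alauli', 'Ziradei']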
I have a dataframe with a column named diff. I am able to group this column and get the number of True and False occurrences in the data frame.
df.groupby('diff').size()
returns
diff
True 5101
False 61
dtype: int64
I want to access to the value of True, 5101.
I have already tried
df.groupby('diff').size().loc['True']
The result is a Series indexed by booleans, so .loc can be omitted and the bool True used as the key directly:
s = pd.Series([5101, 61], index=[True, False])
print (s)
True 5101
False 61
dtype: int64
print (s[True])
5101
The answer is:
df.groupby('diff').size().loc[True]
Explanation: note that
df.groupby('diff').size().index
returns
Index([True, False], dtype='object', name='diff')
The label is the bool True, not the string "True"!
Using .loc with a lambda works too, applied to the boolean column itself: the callable receives the Series and returns it as its own mask, keeping only the True rows, which you can then count:
s = df['diff'].loc[lambda x: x]
len(s)  # 5101
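A minimal end-to-end sketch with made-up data tying the pieces together:
import pandas as pd

df = pd.DataFrame({'diff': [True, True, False, True]})
counts = df.groupby('diff').size()
print(counts.loc[True])                  # 3, the count of True rows
print(len(df['diff'].loc[lambda x: x]))  # 3 again, via the lambda idiom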
I'm running into an issue which probably has an obvious fix. I have a 2x2 dataframe that contains lists. When I take the first row and compare the entire row against a specific list value I'm looking for, the boolean array that is returned is entirely False. This is incorrect, since the first value of the row is exactly the value I'm looking for. When I instead compare the single value in the dataframe, I get True. Why do I get False instead of True for the first column when doing the boolean comparison over the entire row? Thanks in advance!
The comparison below returns False for both values in the first row.
In:
static_levels = pd.DataFrame([[[58, 'Level']], [[24.4, 'Level'], [23.3, 'Level']]], ['Symbol 1', 'Symbol 2'])
print(static_levels.loc['Symbol 1',:]==[58, 'Level'])
Out:
0 False
1 False
Name: Symbol 1, dtype: bool
However, the code below correctly returns True when comparing just the first value in the first row:
In: print(static_levels.loc['Symbol 1',0]==[58, 'Level'])
Out: True
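A likely explanation, offered as a sketch rather than a definitive answer: comparing a row against a two-element list triggers elementwise broadcasting, so pandas lines the list up with the row and compares position by position ([58, 'Level'] against 58, then NaN against 'Level'), both of which are False. To compare each cell, as a whole, against the list, apply the comparison cell by cell:
import pandas as pd

static_levels = pd.DataFrame(
    [[[58, 'Level']], [[24.4, 'Level'], [23.3, 'Level']]],
    ['Symbol 1', 'Symbol 2'])

# compare each cell (itself a list) against the full target list
mask = static_levels.loc['Symbol 1', :].apply(lambda cell: cell == [58, 'Level'])
print(mask)
# 0     True
# 1    False
# Name: Symbol 1, dtype: bool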
Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
will not work, because False has a value of 0, so a sum of zeros will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
Use pd.Series.value_counts():
>>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>>> df['boolean_column'].value_counts()
True 3
False 2
Name: boolean_column, dtype: int64
If you want to count False and True separately, you can use pd.Series.sum() together with ~:
>>> df['boolean_column'].values.sum()  # True
3
>>> (~df['boolean_column']).values.sum()  # False
2
With Pandas, the natural way is using value_counts:
df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To count True and False values separately, don't compare against True / False explicitly; just sum, and take the inverse Boolean via ~ to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
Alternatively, you can calculate counts using NumPy:
import numpy as np

print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
I couldn't find exactly what I needed here. I needed the number of True and False occurrences for further calculations, so I used:
true_count = df['column'].value_counts()[True]
false_count = df['column'].value_counts()[False]
Where df is your DataFrame and column is the column with booleans.
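One caveat: if the column happens to contain only one of the two values, indexing value_counts() with the missing key raises a KeyError. A safer sketch using .get with a default:
import pandas as pd

df = pd.DataFrame({'column': [True, True, True]})  # no False values at all
counts = df['column'].value_counts()
true_count = counts.get(True, 0)   # 3
false_count = counts.get(False, 0) # 0 instead of a KeyError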
This alternative works for multiple columns and/or rows as well.
df[df==True].count(axis=0)
will get you the number of True values per column. For a row-wise count, set axis=1.
df[df==True].count().sum()
Adding a sum() at the end will get you the total count for the entire DataFrame.
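A small sketch of both calls on a two-column frame (column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [True, False, True], 'b': [False, False, True]})
print(df[df == True].count(axis=0))  # per column: a 2, b 1
print(df[df == True].count().sum())  # total: 3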
You could simply sum:
sum(df["boolean_column"])
This will find the number of "True" elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
Will yield the number of "False" elements.
df.isnull()
returns a boolean DataFrame in which True indicates a missing value.
df.isnull().sum()
returns the column-wise sum of the True values.
df.isnull().sum().sum()
returns the total number of NA elements.
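A tiny sketch of those three calls with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
print(df.isnull())             # boolean mask of missing values
print(df.isnull().sum())       # a 1, b 2
print(df.isnull().sum().sum()) # 3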
If you have a boolean column in a DataFrame, or, more interestingly, if you don't but want to count the number of values in a column satisfying a certain condition, you can try something like this (as an example I used <=):
(df['col']<=value).value_counts()
The parentheses make the comparison evaluate first, producing a boolean Series; calling value_counts() on it returns a Series holding the True/False counts, which you can index directly without creating an additional variable. Because False == 0 and True == 1 as index labels, [0] gives the False count and [1] the True count:
(df['col']<=value).value_counts()[0] # for Falses
(df['col']<=value).value_counts()[1] # for Trues
Here is an attempt at as literal and brief an answer as possible. The value_counts() strategies are probably more flexible in the end. Accumulating with sum and counting with count are different operations, each expressing its own analytical intent, and sum depends on the type of the data.
"Count occurrences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64
Hello, I have a boolean column with True and False values.
When I run a value_counts() like this
df['column'].value_counts()
I receive the following:
True 10718
False 1105
Name: column, dtype: int64
Is there a way to calculate what % of the total is true and what % is false?
Something like this:
True 91%
False 09%
Name: column, dtype: int64
Thank you
You can do it with
df['yourcolumns'].value_counts(normalize=True).mul(100).astype(str)+'%'
I was notified that it is as simple as
df['column'].value_counts(normalize=True)
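With the counts from the question (10718 True and 1105 False out of 11823), the normalized output would look roughly like this:
True     0.907
False    0.093
Name: column, dtype: float64
Multiply by 100 (or use the .mul(100) variant above) to present the shares as percentages.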
I've been working on a DataFrame with User_IDs, DateTime objects and other information, like the following extract:
User_ID;Latitude;Longitude;Datetime
222583401;41.4020375;2.1478710;2014-07-06 20:49:20
287280509;41.3671346;2.0793115;2013-01-30 09:25:47
329757763;41.5453577;2.1175164;2012-09-25 08:40:59
189757330;41.5844998;2.5621569;2013-10-01 11:55:20
624921653;41.5931846;2.3030671;2013-07-09 20:12:20
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30
I've grouped Users with:
g = df.groupby(['User_ID','Datetime'])
and then checked for User_IDs with more than one Datetime entry:
df = df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1)
I've obtained the following boolean Series:
User_ID
189757330 False
222583401 False
287280509 False
329757763 False
414673119 True
624921653 False
Name: Datetime, dtype: bool
which is fine for my purposes, since I want to keep only the User_IDs masked with a True value. Now I would like to keep only the rows whose User_ID is associated with a True value, and write them to a new file with pandas.to_csv, for instance. The expected DataFrame would contain only the User_IDs with more than one Datetime entry:
User_ID;Latitude;Longitude;Datetime
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30
How can I access the boolean value for each User_ID? Thanks for your kind help.
Assign the result of df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1) to a variable so you can perform boolean indexing, then use the index from that result to call isin and filter your original df:
In [366]:
users = df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1)
users
Out[366]:
User_ID
189757330 False
222583401 False
287280509 False
329757763 False
414673119 True
624921653 False
Name: Datetime, dtype: bool
In [367]:
users[users]
Out[367]:
User_ID
414673119 True
Name: Datetime, dtype: bool
In [368]:
users[users].index
Out[368]:
Int64Index([414673119], dtype='int64')
In [361]:
df[df['User_ID'].isin(users[users].index)]
Out[361]:
User_ID Latitude Longitude Datetime
5 414673119 41.555014 2.096583 2014-02-24 20:15:30
6 414673119 41.555014 2.097583 2014-02-24 20:16:30
7 414673119 41.555014 2.098583 2014-02-24 20:17:30
You can then call to_csv on the above as normal.
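A hedged sketch of that last step, assuming the semicolon separator used in the sample and a purely illustrative output path:
result = df[df['User_ID'].isin(users[users].index)]
result.to_csv('filtered_users.csv', sep=';', index=False)  # path and filename are made up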
First, make sure you have no duplicate entries:
df = df.drop_duplicates()
Then, get the number of Datetime entries for each User_ID:
counts = df.groupby('User_ID').Datetime.count()
Finally, keep the rows whose User_ID has a count greater than one:
df[df.User_ID.isin(counts[counts > 1].index)]
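As a side note, the same filter can be written in one step with transform, which avoids building the intermediate counts Series; a sketch under the same assumptions:
df[df.groupby('User_ID')['Datetime'].transform('count') > 1]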