Filtering Boolean in Dataframe - python

I have a df with ±100k rows and 10 columns.
I would like to find/filter the rows that contain between 2 and 4 True values.
For simplicity's sake, let's say I have this df:
A B C D E F
1 True True False False True
2 False True True True False
3 False False False False False
4 True False False False True
5 True False False False False
Expected output:
A B C D E F
1 True True False False True
2 False True True True False
4 True False False False True
I have tried using
df[(df['B']==True) | (df['C']==True) | (df['D']==True)| (df['E']==True)| (df['F']==True)]
But this only eliminates the all-False rows and doesn't work if I want to find rows with at least 2 or 3 Trues.
Can anyone please help? Appreciate it.

Use DataFrame.select_dtypes to keep only the boolean columns, count the Trues per row with sum, and then filter by Series.between in boolean indexing:
df = df[df.select_dtypes(bool).sum(axis=1).between(2,4)]
print (df)
A B C D E F
0 1 True True False False True
1 2 False True True True False
3 4 True False False False True
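For reference, the whole answer can be reproduced end to end; the frame below is rebuilt from the question, treating column A as an ordinary value column rather than the index:

```python
import pandas as pd

# rebuild the question's frame: A holds the row labels, B-F the flags
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [True, False, False, True, True],
    'C': [True, True, False, False, False],
    'D': [False, True, False, False, False],
    'E': [False, True, False, False, False],
    'F': [True, False, False, True, False],
})

# keep rows whose boolean columns hold between 2 and 4 Trues
out = df[df.select_dtypes(bool).sum(axis=1).between(2, 4)]
print(out)  # rows with A == 1, 2 and 4
```

Because select_dtypes(bool) picks only the True/False columns, the non-boolean A column never enters the count.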

Related

How to find the count of same values in a row in a dataframe?

The dataframe is as follows:
a | b | c | d
-------------------------------
TRUE FALSE TRUE TRUE
FALSE FALSE FALSE TRUE
TRUE TRUE TRUE TRUE
TRUE FALSE TRUE FALSE
I need to find the count of the TRUEs in each row.
A final column should contain the count, as follows:
a | b | c | d | count
---------------------------------------
TRUE FALSE TRUE TRUE 3
FALSE FALSE FALSE TRUE 1
TRUE TRUE TRUE TRUE 4
TRUE FALSE TRUE FALSE 2
The logic I tried is:
df.groupby(df.columns.tolist(),as_index=False).size()
But it doesn't work as expected.
Could anyone please help me out here?
Thank you.
Because True values are processed like 1, you can use sum:
df['count'] = df.sum(axis=1)
If TRUEs are strings:
df['count'] = df.eq('TRUE').sum(axis=1)
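Both variants side by side, as a self-contained sketch of the answer above (real booleans vs. 'TRUE'/'FALSE' strings):

```python
import pandas as pd

# real booleans: True is counted as 1 by sum
df = pd.DataFrame({
    'a': [True, False, True, True],
    'b': [False, False, True, False],
    'c': [True, False, True, True],
    'd': [True, True, True, False],
})
df['count'] = df[['a', 'b', 'c', 'd']].sum(axis=1)

# 'TRUE'/'FALSE' strings: build a boolean mask with eq first, then sum it
df2 = pd.DataFrame({
    'a': ['TRUE', 'FALSE', 'TRUE', 'TRUE'],
    'b': ['FALSE', 'FALSE', 'TRUE', 'FALSE'],
    'c': ['TRUE', 'FALSE', 'TRUE', 'TRUE'],
    'd': ['TRUE', 'TRUE', 'TRUE', 'FALSE'],
})
df2['count'] = df2.eq('TRUE').sum(axis=1)
```

Both produce the counts 3, 1, 4, 2 from the question's table.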

Pyspark equivalent of pandas all function

I have a spark dataframe df:
A B C D
True True True True
True False True True
True None True None
True NaN NaN False
True NaN True True
Is there a way in pyspark to get a fifth column based on columns A, B, C, D not having the value False in them, returning an int value of 1 for True and 0 for False? Hence:
A B C D E
True True True True 1
True False True True 0
True None True None 1
True NaN NaN False 0
True NaN True True 1
This can be achieved in a pandas dataframe with df.all(axis=1).astype(int).
Any help for a pyspark equivalent would be most appreciated, please.
I don't have anything to test, but try the code below:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'E',
    (
        (F.greatest(*df.columns) == F.least(*df.columns)) &
        (F.least(*df.columns) == F.lit(True))
    ).cast('int')
)
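For comparison, here is a pandas sketch of the behaviour being asked for, under the assumption that None/NaN should be ignored and only a literal False zeroes the row (the frame is rebuilt from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [True, True, True, True, True],
    'B': [True, False, None, np.nan, np.nan],
    'C': [True, True, True, np.nan, True],
    'D': [True, True, None, False, True],
})

# treat missing values as "not False", then require every column to be truthy
df['E'] = df.fillna(True).all(axis=1).astype(int)
print(df['E'].tolist())  # [1, 0, 1, 0, 1]
```

This reproduces the expected E column from the question: only rows containing an explicit False get 0.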

keep rows and the previous one which are not equal

I have this kind of dataframe :
d = {1 : [False,False,False,False,True],2:[False,True,True,True,False],3 :[True,False,False,False,True]}
df = pd.DataFrame(d)
df :
1 2 3
0 False False True
1 False True False
2 False True False
3 False True False
4 True False True
My goal is to keep rows n and n+1 wherever they differ. In the example df, the result would be:
df_result :
1 2 3
0 False False True
1 False True False
3 False True False
4 True False True
I have already tried df_result = df.ne(df.shift()) and kept only rows with at least one True, but it doesn't keep row 3.
Any idea how I can get the expected result?
Thanks!
I believe you need to compare for not-equal with DataFrame.ne, shifting by 1 and by -1, get at least one mismatch per row with DataFrame.any, and chain both masks with | for bitwise OR:
df_result = df[df.ne(df.shift()).any(axis=1) | df.ne(df.shift(-1)).any(axis=1)]
print (df_result)
1 2 3
0 False False True
1 False True False
3 False True False
4 True False True
Another similar idea:
df_result = df[(df.ne(df.shift()) | df.ne(df.shift(-1))).any(axis=1)]
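Put together, under the question's own setup:

```python
import pandas as pd

d = {1: [False, False, False, False, True],
     2: [False, True, True, True, False],
     3: [True, False, False, False, True]}
df = pd.DataFrame(d)

# a row survives if it differs from the previous row OR from the next row
mask = df.ne(df.shift()).any(axis=1) | df.ne(df.shift(-1)).any(axis=1)
df_result = df[mask]
print(df_result)  # rows 0, 1, 3 and 4 remain
```

Row 2 is the only one dropped: it equals both its neighbours (rows 1 and 3), while row 3 is kept because it differs from row 4.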

Quickly apply multiple filters to pandas dataframe

I am trying to filter this dataframe:
ID fallowDeer woodland fox rabbits
1 0.0 4.056649 2.210927 31.150451
2 0.0 2.267544 1.380185 38.631221
3 0.0 5.602904 1.201781 4.124286
4 0.0 7.377308 7.834358 25.911328
5 0.0 2.129115 1.564075 3.931565
6 0.0 5.988451 1.699852 32.915730
7 0.0 1.427553 3.586585 7.444735
8 0.0 9.857577 8.709137 34.004470
9 0.0 7.468365 1.317117 38.440278
10 0.0 3.902134 4.112038 22.427969
To keep only those rows where each species is between the minimum and maximum values shown below (these are series):
Minimum values
fallowDeer 0
woodland 1
fox 3
rabbits 10
Maximum values
fallowDeer 0
woodland 4
fox 6
rabbits 20
This code works:
accepted_simulations = df[(df['fallowDeer'] <= max_values['fallowDeer']) & (df['fallowDeer'] >= min_values['fallowDeer']) & (df['woodland'] <= max_values['woodland']) & (df['woodland'] >= min_values['woodland']) & (df['fox'] <= max_values['fox']) & (df['fox'] >= min_values['fox']) & (df['rabbits'] <= max_values['rabbits']) & (df['rabbits'] >= min_values['rabbits'])]
However, I am going to be adding many more species/columns in the future, and would like to avoid having to manually check each species against the min/max as I've done here. Is there a way to quickly compare each species to the min/max and filter the dataframe, without having to manually check each one?
You can compare all the columns at once. If min_values and max_values are Series, use Index.intersection to get the column names shared with their indexes, then compare with DataFrame.ge and DataFrame.le:
c1 = df.columns.intersection(min_values.index)
m1 = df[c1].ge(min_values.loc[c1])
print (m1)
fallowDeer woodland fox rabbits
0 True True False True
1 True True False True
2 True True False False
3 True True True True
4 True True False False
5 True True False True
6 True True True False
7 True True True True
8 True True False True
9 True True True True
c2 = df.columns.intersection(max_values.index)
m2 = df[c2].le(max_values.loc[c2])
print (m2)
fallowDeer woodland fox rabbits
0 True False True False
1 True True True False
2 True False True True
3 True False False False
4 True True True True
5 True False True False
6 True True True True
7 True False False False
8 True False True False
9 True True True False
Then it is possible to chain both masks:
m = m1 & m2
print (m)
fallowDeer woodland fox rabbits
0 True False False False
1 True True False False
2 True False False False
3 True False False False
4 True True False False
5 True False False False
6 True True True False
7 True False False False
8 True False False False
9 True True True False
Last, to filter rows where the mask is True in all columns, use DataFrame.all:
accepted_simulations = df[m.all(axis=1)]
print (accepted_simulations)
Empty DataFrame
Columns: [ID, fallowDeer, woodland, fox, rabbits]
Index: []
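The three steps above can be collapsed into a single mask. The thread's own data passes no rows, so the frame below is a small made-up example (same column names, hypothetical values) where exactly one row fits every [min, max] interval:

```python
import pandas as pd

min_values = pd.Series({'fallowDeer': 0, 'woodland': 1, 'fox': 3, 'rabbits': 10})
max_values = pd.Series({'fallowDeer': 0, 'woodland': 4, 'fox': 6, 'rabbits': 20})

# hypothetical data: only the first row satisfies all the bounds
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'fallowDeer': [0.0, 0.0, 1.0],
    'woodland': [2.5, 5.0, 2.0],
    'fox': [4.0, 4.0, 4.0],
    'rabbits': [15.0, 15.0, 15.0],
})

cols = df.columns.intersection(min_values.index)
mask = df[cols].ge(min_values[cols]).all(axis=1) & df[cols].le(max_values[cols]).all(axis=1)
accepted_simulations = df[mask]
```

Adding a new species is then just a matter of adding its bounds to min_values and max_values; the filter never has to be rewritten.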

How to select column names in python pandas by dataframe values?

I have the following dataframe:
df = pandas.DataFrame(numpy.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))
A B C D E
1 False False False False False
2 False False False False False
3 True True False True False
4 False False True True False
5 False False False False False
6 False False False False False
7 False False False False False
8 False False False False False
9 False False False False False
10 False True False True False
For each row I would like to get the column name that is the last one in that row containing True.
If there isn't any, return some reasonable value.
How can I do that?
A one liner:
>>> value = np.nan
>>> df.iloc[:, ::-1].idxmax(axis=1).where(df.any(axis=1), value)
1     NaN
2     NaN
3       D
4       D
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10      D
dtype: object
Explanation:
idxmax returns the label of the maximum value in a row; if there is more than one maximum it returns the first one. We want the last True, so we reverse the column order with df.iloc[:, ::-1] first.
For an all-False row idxmax would still report a column (the first one after reversing), so Series.where combined with df.any(axis=1) replaces those rows with value.
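A fully runnable sketch of the same flip-then-idxmax idea in current pandas; the seed here is arbitrary (not from the original post), chosen only to make the frame reproducible:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, just for reproducibility
df = pd.DataFrame(np.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))

# reverse the columns so idxmax picks the *last* True, then blank all-False rows
last_true = df.iloc[:, ::-1].idxmax(axis=1).where(df.any(axis=1))
```

Rows without any True come out as NaN, which can be swapped for any other sentinel with fillna.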
set up the example data first:
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))
df
looks like:
A B C D E
1 True False False False False
2 False True False False False
3 True False False False True
4 False False False False False
5 False True False False False
6 False False False False False
7 False False False False False
8 False False False True False
9 False False False True False
10 False False True False False
it sounds like what you want to do is get the index # for each true value and then select the max index #. On a single column that might look like the following:
df['A'][df['A']].index.max()
which returns 3. To do this for all the columns, the easiest is to iterate through each column and shove the result in a list:
mylist = []
for col in df.columns:
myval = df[col][df[col]].index.max()
mylist.append(myval)
mylist
that returns:
[3, 5, 10, 9, 3]
the loop logic above returns nan if there is no True value in the column.
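The loop can also be condensed into a dict comprehension with the same semantics, including nan for all-False columns (a sketch, not from the original answer):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))

# per column: the largest index label where the value is True (nan if none)
last_true = pd.Series({col: df[col][df[col]].index.max() for col in df.columns})
```

`df[col][df[col]]` keeps only the True entries of each column, so `.index.max()` is exactly the `myval` computed in the loop above.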
