Conditional DataFrame filter on boolean columns? - python

If I have a DataFrame as so:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
| 456def | TRUE | FALSE |
| 789ghi | TRUE | TRUE |
| 789ghi | FALSE | FALSE |
| 789ghi | FALSE | FALSE |
How do I apply a groupby or some equivalent filter to count the number of unique id values in a subset of the DataFrame that looks like this:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
Meaning, I want to get the number of unique id values where attribute_1 == True for all instances of a given id, BUT attribute_2 must have at least one True.
So, 456def would not be included in the filter because it does not have at least one True for attribute_2.
789ghi would not be included in the filter because not all of its attribute_1 entries are True.

You'll need to groupby twice, once with transform('all') on "attribute_1" and the second time with transform('any') on "attribute_2".
i = df[df.groupby('id').attribute_1.transform('all')]
j = i[i.groupby('id').attribute_2.transform('any')]
print (j)
id attribute_1 attribute_2
0 123abc True True
1 123abc True False
Finally, to get the unique IDs that satisfy this condition, call nunique:
print (j['id'].nunique())
1
This is easiest to do when your attribute_* columns are boolean. If they are strings, fix them first:
df = df.replace({'TRUE': True, 'FALSE': False})
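A minimal end-to-end sketch of the above, assuming the sample frame from the question (string flags and all):

import pandas as pd

df = pd.DataFrame({
    'id': ['123abc', '123abc', '456def', '789ghi', '789ghi', '789ghi'],
    'attribute_1': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE'],
    'attribute_2': ['TRUE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'FALSE'],
})

# convert the string flags to real booleans first
df = df.replace({'TRUE': True, 'FALSE': False})

# keep ids where every attribute_1 is True, then ids with at least one True attribute_2
i = df[df.groupby('id').attribute_1.transform('all')]
j = i[i.groupby('id').attribute_2.transform('any')]
print(j['id'].nunique())  # 1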

Related

Check if a value in a df is in a row after

First of all, English is not my native language, so sorry for my mistakes.
I am looking to automate some tasks done via Excel with Python.
I have a dataframe ordered by date/time, and I want to check if a customer has contacted me again after already having a response.
So I have a dataframe like this:
| Date | Tel |
| ---------- | ------------ |
| 01-01-2023 | +33000000001 |
| 01-01-2023 | +33000000002 |
| 01-01-2023 | +33000000003 |
| 02-01-2023 | +33000000002 |
| 02-01-2023 | +33000000004 |
I'd like to add a TRUE/FALSE column indicating whether the client contacted me again later:
| Date | Tel | Re-contact |
| ------------ | ------------ | ------------ |
| 01-01-2023 | +33000000001 | FALSE |
| 01-01-2023 | +33000000002 | TRUE |
| 01-01-2023 | +33000000003 | FALSE |
| 02-01-2023 | +33000000002 | FALSE |
| 02-01-2023 | +33000000004 | FALSE |
In Excel, I do this action as follows:
COUNTIFS(A2:A$5;A1)>0
And I would get my TRUE/FALSE if the phone number exists further in my list.
I looked at the documentation to see if a value existed in a list, but I couldn't find a way to see if it existed further down. Also, I'm looking for a quick way to calculate it, as I have 100,000 rows in my dataframe.
# I've tried this so far:
length = len(df.index) - 1
i = 1
for i in range(i, length):
    print(i)
    for x in df['number']:
        if x in df['number'][[i+1, length]]:
            df['Re-contact'] = 'TRUE'
        else:
            df['Re-contact'] = 'FALSE'
    i += 1
It feels very wrong to me, and my code takes too much time. I'm looking for a more efficient way to perform what I'm trying to do.
Use pandas.DataFrame.duplicated on the Tel column to find repeated calls:
df['Re-contact'] = df.Tel.duplicated(keep='last')
Date Tel Re-contact
0 01-01-2023 +33000000001 False
1 01-01-2023 +33000000002 True
2 01-01-2023 +33000000003 False
3 02-01-2023 +33000000002 False
4 02-01-2023 +33000000004 False
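For reference, a minimal runnable sketch of the above, with Tel kept as strings so the + prefix survives. Since duplicated is vectorized, it handles the 100,000-row frame without a Python loop:

import pandas as pd

df = pd.DataFrame({
    'Date': ['01-01-2023', '01-01-2023', '01-01-2023', '02-01-2023', '02-01-2023'],
    'Tel': ['+33000000001', '+33000000002', '+33000000003', '+33000000002', '+33000000004'],
})

# keep='last' marks every occurrence except the last as a duplicate,
# i.e. True for rows whose number re-appears further down
df['Re-contact'] = df['Tel'].duplicated(keep='last')
print(df)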

Python. Dataframes. Select rows and apply condition on the selection

python newbie here. I have written the code that solves the issue. However, there should be a much better way of doing it.
I have two Series that come from the same table but, due to an earlier process, I get them as separate sets. (They could be joined into a single DataFrame again, since the entries belong to the same record.)
Ser1:
| id |
| -- |
| 1 |
| 2 |
| 2 |
| 3 |
Ser2:
| section |
| ------- |
| A |
| B |
| C |
| D |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find the entries in Ser1 whose id also appears in df2. Then, check whether the corresponding values in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
| ---|-------- |---------|
| 1 | A | False | # Both id(1) and section(A) are also in df2
| 2 | B | False | # Both id(2) and section(B) are also in df2
| 2 | C | True | # id(2) is in df2 but section(C) is not
| 3 | D | False | # id(3) is not in df2, in that case the result should also be False
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good, but after reading about merging and list comprehensions I am getting confused about the best way to improve it.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
    pd.concat([ser1, ser2], axis=1)
      .assign(result=ser1.isin(df2['id']) != ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
m1 m2 m1!=m2
0 True True False
1 True True False
2 True False True
3 False False False
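Strictly speaking, the stated rule ("id is in df2 but section is not") is m1 & ~m2; the XOR coincides with it here only because no row has its section in df2 without its id also being present. A self-contained sketch of the stricter form, assuming the series are named id and section:

import pandas as pd

ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})

m1 = ser1.isin(df2['id'])        # id present in df2
m2 = ser2.isin(df2['section'])   # section present in df2

# "id in df2 AND section not in df2", written explicitly
out = pd.concat([ser1, ser2], axis=1).assign(result=m1 & ~m2)
print(out)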

Convert a Pandas DataFrame with true/false to a dictionary

I would like to transform a dataframe with the following layout:
| image | finding1 | finding2 | nofinding |
| ------- | -------- | -------- | --------- |
| 039.png | true | false | false |
| 006.png | true | true | false |
| 012.png | false | false | true |
into a dictionary with the following structure:
{
"039.png" : [
"finding1"
],
"006.png" : [
"finding1",
"finding2"
],
"012.png" : [
"nofinding"
]}
IIUC, you could replace False with NA (assuming real boolean False here; for strings use 'false'), then stack to drop the missing values, and use groupby.agg to aggregate into lists before converting to a dictionary:
dic = (df
       .set_index('image')
       .replace({False: pd.NA})
       .stack()
       .reset_index(1)
       .groupby(level='image', sort=False)['level_1'].agg(list)
       .to_dict()
)
output:
{'039.png': ['finding1'],
'006.png': ['finding1', 'finding2'],
'012.png': ['nofinding']}
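An alternative sketch with a row-wise list comprehension, slower on large frames but easy to read, assuming the columns hold real booleans:

import pandas as pd

df = pd.DataFrame({'image': ['039.png', '006.png', '012.png'],
                   'finding1': [True, True, False],
                   'finding2': [False, True, False],
                   'nofinding': [False, False, True]})

# for each row, keep the column names whose flag is True
dic = (df.set_index('image')
         .apply(lambda r: [c for c, v in r.items() if v], axis=1)
         .to_dict())
print(dic)  # {'039.png': ['finding1'], '006.png': ['finding1', 'finding2'], '012.png': ['nofinding']}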

Pandas crosstab dataframe and setting the new columns as True/False/Null based on if they existed or not and based on another column

As the title states, I want to pivot/crosstab my dataframe.
Let's say I have a df that looks like this:
df = pd.DataFrame({'ID': [0, 0, 1, 1, 1],
                   'REV': [0, 0, 1, 1, 1],
                   'GROUP': [1, 2, 1, 2, 3],
                   'APPR': [True, True, None, None, True]})
+----+-----+-------+------+
| ID | REV | GROUP | APPR |
+----+-----+-------+------+
| 0 | 0 | 1 | True |
| 0 | 0 | 2 | True |
| 1 | 1 | 1 | NULL |
| 1 | 1 | 2 | NULL |
| 1 | 1 | 3 | True |
+----+-----+-------+------+
I want to do some kind of pivot so my result of the table looks like
+----+-----+------+------+-------+
| ID | REV | 1 | 2 | 3 |
+----+-----+------+------+-------+
| 0 | 0 | True | True | False |
| 1 | 1 | NULL | NULL | True |
+----+-----+------+------+-------+
Now the values from the GROUP column become their own columns. The value of each of those columns comes from APPR for the True/NULL part, and I want it to be False when the group didn't exist for that ID/REV combination.
This is similar to a question I've asked before, but I wasn't sure how to make that answer work with my new scenario:
Pandas pivot dataframe and setting the new columns as True/False based on if they existed or not
Hope that makes sense!
Have you tried to pivot?
pd.pivot(df, index=['ID','REV'], columns=['GROUP'], values='APPR').fillna(False).reset_index()
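One caveat: .fillna(False) also overwrites the genuine NULLs coming from APPR, while the expected output keeps them. A hedged sketch that fills only the truly missing ID/REV/GROUP combinations (the seen flag is a helper column introduced here for illustration):

import pandas as pd

df = pd.DataFrame({'ID': [0, 0, 1, 1, 1],
                   'REV': [0, 0, 1, 1, 1],
                   'GROUP': [1, 2, 1, 2, 3],
                   'APPR': [True, True, None, None, True]})

appr = df.pivot(index=['ID', 'REV'], columns='GROUP', values='APPR')
# flag which (ID, REV, GROUP) combinations actually existed
seen = df.assign(seen=True).pivot(index=['ID', 'REV'], columns='GROUP', values='seen')

# missing combinations become False; original NULLs in APPR stay NULL
out = appr.mask(seen.isna(), False).reset_index()
print(out)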

Calculate True Positives and False Negatives for Responses Within a DataFrame

I have survey results which I have one-hot encoded. I would like to calculate the sensitivity of each participant's response.
The below is an example of how my DataFrame is structured, whereby:
'Chocolate' and 'Ice-Cream' are correct
'Pizza' and 'None of the Above' are incorrect
Question 1:
| Participant ID | Chocolate | Pizza | Ice-Cream | None of the Above |
| -------------- | --------- | ----- | --------- | ----------------- |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 1 | 0 |
I would like to append a column that contains the sum of true positives and another with the sum of false negatives, to then create another with the sensitivity score (for each participant).
The below is an example of what I am trying to do:
Question 1:
| Participant ID | Chocolate | ... | True Positive | False Negative | ... |
| -------------- | --------- | --- | ------------- | -------------- | --- |
| 1 | 1 | ... | 2 | 0 | ... |
| 2 | 0 | ... | 1 | 1 | ... |
| 3 | 1 | ... | 2 | 1 | ... |
I am not sure where to begin with this! Can anyone help me out?
Thanks a lot!
You could calculate the true positives, false negatives, etc. by using a confusion matrix (e.g. from sklearn). Maybe the following code is useful for you:
import pandas as pd
from sklearn.metrics import confusion_matrix

a = [[1, 1, 1, 0], [0, 0, 1, 0], [1, 0, 1, 0]]        # participants' responses
correct = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 0]]  # the correct answers, one row per participant

df = pd.DataFrame(data=a)
df.columns = ['chocolate', 'pizza', 'icecream', 'none']

for i in range(len(df)):
    pred = a[i]
    true = correct[i]
    tn, fp, fn, tp = confusion_matrix(true, pred).ravel()
    print(f'Nr:{i} true neg:{tn} false pos:{fp} false neg:{fn} true pos:{tp}')
The output is (which you could put in a DataFrame):
Nr:0 true neg:1 false pos:1 false neg:0 true pos:2
Nr:1 true neg:2 false pos:0 false neg:1 true pos:1
Nr:2 true neg:2 false pos:0 false neg:0 true pos:2
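To actually append the columns, here is a hedged vectorized sketch; the key Series is an assumed answer key matching the example above ('Chocolate' and 'Ice-Cream' correct):

import pandas as pd

df = pd.DataFrame({'chocolate': [1, 0, 1], 'pizza': [1, 0, 0],
                   'icecream': [1, 1, 1], 'none': [0, 0, 0]},
                  index=pd.Index([1, 2, 3], name='Participant ID'))

# assumed answer key: 1 = correct option, 0 = incorrect option
key = pd.Series({'chocolate': 1, 'pizza': 0, 'icecream': 1, 'none': 0})

tp = ((df == 1) & (key == 1)).sum(axis=1)  # selected a correct option
fn = ((df == 0) & (key == 1)).sum(axis=1)  # missed a correct option
df['True Positive'] = tp
df['False Negative'] = fn
df['Sensitivity'] = tp / (tp + fn)
print(df)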
