Calculate True Positives and False Negatives for Responses Within a DataFrame - python

I have survey results which I have one-hot encoded. I would like to calculate the sensitivity of each participant's response.
The below is an example of how my DataFrame is structured, whereby:
'Chocolate' and 'Ice-Cream' are correct
'Pizza' and 'None of the Above' are incorrect
| Question 1     | Chocolate | Pizza | Ice-Cream | None of the Above |
| -------------- | --------- | ----- | --------- | ----------------- |
| Participant ID |           |       |           |                   |
| 1              | 1         | 1     | 1         | 0                 |
| 2              | 0         | 0     | 1         | 0                 |
| 3              | 1         | 0     | 1         | 0                 |
I would like to append a column that contains the sum of true positives and another with the sum of false negatives, to then create another with the sensitivity score (for each participant).
The below is an example of what I am trying to do:
| Question 1     | Chocolate | ... | True Positive | False Negative | ... |
| -------------- | --------- | --- | ------------- | -------------- | --- |
| Participant ID |           |     |               |                |     |
| 1              | 1         | ... | 2             | 0              | ... |
| 2              | 0         | ... | 1             | 1              | ... |
| 3              | 1         | ... | 2             | 1              | ... |
I am not sure where to begin with this! Can anyone help me out?
Thanks a lot!

You could calculate the true positives, false negatives, etc. using a confusion matrix (e.g. from scikit-learn). Maybe the following code is useful for you:
import pandas as pd
from sklearn.metrics import confusion_matrix

a = [[1, 1, 1, 0], [0, 0, 1, 0], [1, 0, 1, 0]]
correct = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 0]]

df = pd.DataFrame(data=a)
df.columns = ['chocolate', 'pizza', 'icecream', 'none']

for i in range(len(df)):
    pred = a[i]
    true = correct[i]
    tn, fp, fn, tp = confusion_matrix(true, pred).ravel()
    print(f'Nr:{i} true neg:{tn} false pos:{fp} false neg:{fn} true pos:{tp}')
The output is (which you could put in a DataFrame):
Nr:0 true neg:1 false pos:1 false neg:0 true pos:2
Nr:1 true neg:2 false pos:0 false neg:1 true pos:1
Nr:2 true neg:2 false pos:0 false neg:0 true pos:2

Check if a value in a df is in a row after

First of all, English is not my native language, so sorry for my mistakes.
I am looking to automate some tasks done via Excel with Python.
I have a dataframe ordered by date/time, and I want to check if a customer has contacted me again after already having a response.
So I have a dataframe like this:
| Date | Tel |
| ---------- | ------------ |
| 01-01-2023 | +33000000001 |
| 01-01-2023 | +33000000002 |
| 01-01-2023 | +33000000003 |
| 02-01-2023 | +33000000002 |
| 02-01-2023 | +33000000004 |
I'd like to add a TRUE/FALSE column indicating whether the client has contacted me again later:
| Date | Tel | Re-contact |
| ------------ | ------------ | ------------ |
| 01-01-2023 | +33000000001 | FALSE |
| 01-01-2023 | +33000000002 | TRUE |
| 01-01-2023 | +33000000003 | FALSE |
| 02-01-2023 | +33000000002 | FALSE |
| 02-01-2023 | +33000000004 | FALSE |
In Excel, I do this action as follows:
COUNTIFS(A2:A$5;A1)>0
And I would get my TRUE/FALSE if the phone number exists further in my list.
I looked at the documentation to see if a value existed in a list, but I couldn't find a way to see if it existed further down. Also, I'm looking for a quick way to calculate it, as I have 100,000 rows in my dataframe.
# I've tried this so far:
length = len(df.index) - 1
i = 1
for i in range(i, length):
    print(i)
    for x in df['number']:
        if x in df['number'][[i+1, length]]:
            df['Re-contact'] = 'TRUE'
        else:
            df['Re-contact'] = 'FALSE'
    i += 1
It feels very wrong to me, and my code takes too much time. I'm looking for a more efficient way to perform what I'm trying to do.
Use pandas.DataFrame.duplicated over the Tel column to find repeated calls:
df['Re-contact'] = df.Tel.duplicated(keep='last')
Date Tel Re-contact
0 01-01-2023 33000000001 False
1 01-01-2023 33000000002 True
2 01-01-2023 33000000003 False
3 02-01-2023 33000000002 False
4 02-01-2023 33000000004 False
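One caveat: duplicated(keep='last') encodes "contacted again later" only because the rows are already in chronological order. A sketch that makes the ordering explicit before flagging (using the Date and Tel columns from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01-01-2023', '01-01-2023', '01-01-2023', '02-01-2023', '02-01-2023'],
    'Tel': ['+33000000001', '+33000000002', '+33000000003', '+33000000002', '+33000000004'],
})

# Sort chronologically first so "later" is well defined, then flag every
# call that has another call from the same number somewhere after it.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values('Date', kind='stable')
df['Re-contact'] = df['Tel'].duplicated(keep='last')
print(df)
```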

Python. Dataframes. Select rows and apply condition on the selection

Python newbie here. I have written code that solves the issue; however, there should be a much better way of doing it.
I have two Series that come from the same table, but due to some earlier process I get them as separate sets. (They could be joined into a single DataFrame again, since the entries belong to the same record.)
Ser1 Ser2
| id | | section |
| ---| |-------- |
| 1 | | A |
| 2 | | B |
| 2 | | C |
| 3 | | D |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find those entries in Ser1 which match the same id in df2. Then, check whether the values in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
| ---|-------- |---------|
| 1 | A | False | # Both id(1) and section(A) are also in df2
| 2 | B | False | # Both id(2) and section(B) are also in df2
| 2 | C | True | # id(2) is in df2 but section(C) is not
| 3 | D | False | # id(3) is not in df2, in that case the result should also be False
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good, but after reading about merging and list comprehensions I am confused about what the best way to improve it would be.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
pd.concat([ser1, ser2], axis=1)
.assign(result=ser1.isin(df2['id'])!=ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
m1 m2 m1!=m2
0 True True False
1 True True False
2 True False True
3 False False False
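Note that XOR happens to match the sample data; read literally, the requirement ("the id is in df2 and the section is not") is m1 & ~m2. The two differ only when an id is absent from df2 while its section value appears there. A sketch of the stricter form, built from the data in the question:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})

# True only when the id exists in df2 AND the section does not
out = pd.concat([ser1, ser2], axis=1).assign(
    result=ser1.isin(df2['id']) & ~ser2.isin(df2['section']))
print(out)
```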

Pandas keeping certain rows based on strings in other rows

I have the following dataframe
+-------+------------+
| index | keep       |
+-------+------------+
| 0     | not useful |
| 1     | start_1    |
| 2     | useful     |
| 3     | end_1      |
| 4     | not useful |
| 5     | start_2    |
| 6     | useful     |
| 7     | useful     |
| 8     | end_2      |
+-------+------------+
There are two pairs of marker strings (start_1/end_1 and start_2/end_2) that indicate that the rows between them are the only relevant ones in the data. Hence, in the dataframe above, the output dataframe would be composed only of the rows at index 2, 6, and 7 (since 2 is between start_1 and end_1, and 6 and 7 are between start_2 and end_2).
import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks
Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
index keep sections in_section
2 2 useful 0 1
6 6 useful 0 1
7 7 useful 0 1
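An equivalent approach builds the mask in one pass from the cumulative counts of start and end markers (a sketch on the same df as above):

```python
import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful",
              "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)

starts = df['keep'].str.startswith('start')
ends = df['keep'].str.startswith('end')
# Inside a section: one more start than end seen so far,
# excluding the start marker row itself.
inside = (starts.cumsum() - ends.cumsum()).eq(1) & ~starts
res = df[inside]
print(res)
```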

Conditional DataFrame filter on boolean columns?

If I have a DataFrame as so:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
| 456def | TRUE | FALSE |
| 789ghi | TRUE | TRUE |
| 789ghi | FALSE | FALSE |
| 789ghi | FALSE | FALSE |
How do I apply a groupby or some equivalent filter to count the unique number of id elements in a subset of the DataFrame that looks like this:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
Meaning, I want to get the number of unique id values where attribute_1 == True for all instances of a given id, BUT attribute_2 has at least one True.
So, 456def would not be included in the filter because it does not have at least one True for attribute_2.
789ghi would not be included in the filter because not all of its attribute_1 entries are True.
You'll need to groupby twice, once with transform('all') on "attribute_1" and the second time with transform('any') on "attribute_2".
i = df[df.groupby('id').attribute_1.transform('all')]
j = i[i.groupby('id').attribute_2.transform('any')]
print (j)
id attribute_1 attribute_2
0 123abc True True
1 123abc True False
Finally, to get the unique IDs that satisfy this condition, call nunique:
print (j['id'].nunique())
1
This is easiest to do when your attribute_* columns are boolean. If they are strings, fix them first:
df = df.replace({'TRUE': True, 'FALSE': False})
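The two transforms can also be combined into a single boolean mask (a sketch on the sample data, already converted to real booleans):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['123abc', '123abc', '456def', '789ghi', '789ghi', '789ghi'],
    'attribute_1': [True, True, True, True, False, False],
    'attribute_2': [True, False, False, True, False, False],
})

g = df.groupby('id')
# Keep rows whose group has all attribute_1 True and any attribute_2 True
mask = g.attribute_1.transform('all') & g.attribute_2.transform('any')
print(df[mask]['id'].nunique())
```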

Creating a truth table for any expression in Python

I am attempting to create a program that, when run, will ask for the boolean expression and the variables, and then create a truth table for whatever is entered. I need to use a class, and this is what I have so far. I am not sure where to go from here.
from itertools import product

class Boolean(object):
    def __init__(self, statement, vars):
        self.exp = statement
        self.vars = vars

    def __call__(self, statement, vars):
        pass  # this is where I am stuck

def main():
    expression = raw_input('Give an expression:')
    vars = raw_input('Give names of variables:')
    variables = vars.split(' ')
    b = Boolean(expression, variables)

if __name__ == "__main__":
    main()
I have a library that does exactly what you want!
Check out the github repo or find it here on pypi.
The readme describes how everything works, but here's a quick example:
from truths import Truths
print(Truths(['a', 'b', 'x', 'd'], ['(a and b)', 'a and b or x', 'a and (b or x) or d']))
+---+---+---+---+-----------+--------------+---------------------+
| a | b | x | d | (a and b) | a and b or x | a and (b or x) or d |
+---+---+---+---+-----------+--------------+---------------------+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
+---+---+---+---+-----------+--------------+---------------------+
Hope this helps!
You could simply define any boolean function right in Python.
Consider the following example:
def f(w, x, y, z):
    return (x and y) and (w or z)
I've written a snippet that takes any function f and returns its truth table:
import pandas as pd
from itertools import product

def truth_table(f):
    argcount = f.__code__.co_argcount
    values = [list(x) + [f(*x)] for x in product([False, True], repeat=argcount)]
    return pd.DataFrame(values, columns=list(f.__code__.co_varnames[:argcount]) + [f.__name__])
Using this will yield (as nicely formatted HTML if you're in a Jupyter/IPython notebook):
truth_table(f)
w x y z f
0 False False False False False
1 False False False True False
2 False False True False False
3 False False True True False
4 False True False False False
5 False True False True False
6 False True True False False
7 False True True True True
8 True False False False False
9 True False False True False
10 True False True False False
11 True False True True False
12 True True False False False
13 True True False True False
14 True True True False True
15 True True True True True
Cheers.
You probably want to do something like this:
from itertools import product

for p in product((True, False), repeat=len(variables)):
    # Map each variable in variables to the corresponding value in p
    # Apply the boolean operators to variables that now have values
    # Add the result of each application to a column in the truth table
    pass
But the inside of the for loop is the hardest part, so good luck.
This is an example of what you would be iterating over in the case of three variables:
>>> list(product((True, False), repeat=3))
[(True, True, True), (True, True, False), (True, False, True), (True, False, False), (False, True, True), (False, True, False), (False, False, True), (False, False, False)]
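To make the loop body concrete: one common approach (fine for a homework-style program, though eval is unsafe for untrusted input) is to evaluate the expression string against a dict mapping variable names to the current truth values. A sketch, with a hypothetical expression standing in for the user's input:

```python
from itertools import product

expression = 'a and (b or not c)'  # hypothetical user input
variables = ['a', 'b', 'c']

rows = []
for p in product((True, False), repeat=len(variables)):
    env = dict(zip(variables, p))                   # map names to values
    rows.append(p + (eval(expression, {}, env),))   # evaluate and record

for row in rows:
    print(row)
```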
If you don't mind providing the number of variables of the function (I think it is possible to get it automatically, but I don't know how at the moment), I have the following:
from itertools import product

B = (0, 1)
varNames = ["r", "s", "t", "u", "v", "w", "x", "y", "z"]

def booltable(f, n):
    vars = varNames[-n:]
    header = vars + ["f"]
    table = [*reversed([*map(lambda input: [*input, f(*input)], product(B, repeat=len(vars)))])]
    return [header] + table
This puts the 1's at the top (like I prefer); if you want the 0's at the top instead, drop the reversed:
    table = [*map(lambda input: [*input, f(*input)], product(B, repeat=len(vars)))]
Here is an example of how to use it, functions are defined using bitwise operations.
x | y - bitwise or of x and y
x ^ y - bitwise exclusive or of x and y
x & y - bitwise and of x and y
~x - the bits of x inverted
# Example function
def aBooleanFunction(x, y, z):
    return (x | y) ^ ~(x ^ z) % 2

# Run
display(booltable(aBooleanFunction, 3))
The output is:
[['x', 'y', 'z', 'f'],
[1, 1, 1, 0],
[1, 1, 0, 1],
[1, 0, 1, 0],
[1, 0, 0, 1],
[0, 1, 1, 1],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]
You can then parse this to a table in whatever format you want using, for example, tabulate.
