I am trying to compare the column values in each row of a dataframe against a predefined list of dictionaries and filter on the result. My row-wise comparison in pandas did not work and raised a TypeError. I suspect I need to convert the dataframe into a list of dictionaries, compare it with the list of rule dictionaries, and then convert back to a dataframe with the new columns added, but this still does not give my desired output. Can anyone suggest a workaround? How can this be done easily in Python?
working minimal example
import pandas as pd
indf=pd.DataFrame.from_dict(indf_dict)
indf_lst=indf.to_dict(orient='records')
matches=[]
for each in rules_list:
    for row in indf_lst:
        if row in each:
            matches.append(row)
I first tried a pandas approach to check the column values of every row against rules_list, but that attempt was not successful. I then converted the indf dataframe to a list of dictionaries and compared the dictionaries, but I get a type error as follows:
TypeError                                 Traceback (most recent call last)
Input In [11], in <cell line: 12>()
     12 for each in rules_list:
     13     for row in indf_lst:
---> 14         if row in each:
     15             matches.append(row)

TypeError: unhashable type: 'dict'
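The error comes from the expression `row in each`: applied to a dictionary, `in` looks the left operand up among the keys, so Python tries to hash `row`, which is itself a dictionary. The following is a minimal sketch of that behaviour and of one possible replacement, assuming the intent is "every code named in the rule equals the row's value". The sample `row` and `rule` and the helper `rule_matches` are hypothetical names used only for illustration.

```python
row = {"code0": "40", "code3": "518", "code6": "594"}
rule = {"code0": "40", "code3": "518", "rules_desc": "rules12"}

# `x in some_dict` searches the keys, which requires hashing x.
# A dict is not hashable, hence the TypeError from the question.
try:
    row in rule
except TypeError as exc:
    error_message = str(exc)

def rule_matches(row, rule, desc_key="rules_desc"):
    """Return True if every code column named in the rule equals the row's value."""
    return all(row.get(key) == value
               for key, value in rule.items() if key != desc_key)

hit = rule_matches(row, rule)
```

The helper compares values key by key instead of asking whether one dictionary is contained in the other, so nothing has to be hashed.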
objective
I need to compare the columns of every row with the list of dictionaries rules_list and add new columns that show whether a match was found. How can this be done in Python?
updated desired output
Here is my desired output, where I want to add two new columns when the column values match an entry in the rules_list that I defined.
output={'code0':{0:('5'),1:'nan',2:('98'),3:('98'),4:'nan',5:('15'),6:('40'),7:('52'),8:('52'),9:('40'),10:('52'),11:('52'),12:('58')},'code1':{0:('Agr','Serv'),1:('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'),2:('ATT','NC'),3:('ATT','VA','NC'),4:('VA','HC','NIH','ATT','COL','UCL'),5:('Agr'),6:'nan',7:('NC'),8:('NC'),9:('VA'),10:('NC'),11:('NC'),12:('CE')},'code2':{0:'nan',1:'nan',2:('103','104','105','106','31'),3:('104','105'),4:'nan',5:('5'),6:'nan',7:('109'),8:('109'),9:('11'),10:('109'),11:('109'),12:('109')},'code3':{0:('90'),1:'nan',2:('810'),3:('810'),4:'nan',5:('58'),6:('518'),7:('610','620','682','642','621','611'),8:('620','682','642','611'),9:('113','174','131','115'),10:('612','790','110'),11:('612','110'),12:('423','114')},'code4':{0:('1'),1:'nan',2:('computerscience'),3:('computerscience'),4:'nan',5:('fishing'),6:'nan',7:('biology'),8:('biology'),9:'nan',10:('biology'),11:('biology'),12:'nan'},'code5':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:'nan',7:'nan',8:'nan',9:('11','19','31'),10:('12','16','18','19'),11:('12','18','19'),12:('31')},'code6':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:('594'),7:('712','479','297','639','452','172'),8:('712','479','297'),9:('164','157','388','158'),10:('285','295','236','239','269','284','237'),11:('285','295','237'),12:('372','238')},'isHit':{0:False,1:True,2:True,3:True,4:True,5:False,6:True,7:True,8:True,9:True,10:True,11:True,12:True},'rules_desc':{0:'None',1:'rules1',2:'rules2',3:'rules2',4:'rules1',5:'None',6:'rules12',7:'rules21',8:'rules21',9:'rules4',10:'rules3',11:'rules3',12:'rules5'}}
outdf=pd.DataFrame.from_dict(output)
How can I map the value of each dataframe cell to the list of dictionaries like this? Should I handle it in pandas, or convert everything into lists and then compare? Any thoughts are welcome, and anything close to the desired output above would be fine.
The code below should do what you are asking for, but I have not yet verified that it really does. I have tried to name the variables descriptively to make it easier to understand what the code does and how it works.
In the first step, the code transforms the list of rule dictionaries into a list of (code, code value) tuples for each rule, so that the final loop that checks for a hit is easier to put together, understand, maintain and debug.
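As a hedged alternative for this first step, the sketch below builds the tuple lists without modifying the input. In the code further down, `pop` removes 'rules_desc' from the original rule dictionaries, and rules that share a name (for example the several 'rules2' entries) end up merged into one long list. The function name `split_rules` and the shortened `sample_rules` are assumptions made for this example.

```python
def split_rules(rules, desc_key="rules_desc"):
    """Turn each rule dict into (rule_name, [(code_name, code_value), ...])."""
    result = []
    for spec in rules:
        name = spec[desc_key]  # read the name, do not pop it
        items = [(k, v) for k, v in spec.items() if k != desc_key]
        result.append((name, items))
    return result

# Shortened, illustrative excerpt of the rules from the question.
sample_rules = [
    {"code0": "40", "code3": "518", "rules_desc": "rules12"},
    {"code0": "98", "code1": ("ATT", "NC"), "rules_desc": "rules2"},
    {"code0": "98", "code1": ("ATT", "VA", "NC"), "rules_desc": "rules2"},
]
pairs = split_rules(sample_rules)
```

Keeping one entry per rule means two rules with the same description are still tested separately.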
In the second step, the code turns the data dictionary into a list of rows using pandas, as in the code from the question.
There is probably also a pandas way of transforming the list of dictionaries in the first step, so if you read this and know how to do it with pandas, I would be glad to hear about it.
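One possible pandas route, shown here only as a sketch: passing the list of rule dictionaries to `pd.DataFrame` gives one row per rule and one column per code, with NaN where a rule does not mention a code. The two-rule `sample_rules` list is a shortened, illustrative excerpt.

```python
import pandas as pd

sample_rules = [
    {"code1": ("VA", "HC"), "rules_desc": "rules1"},
    {"code0": "40", "code3": "518", "rules_desc": "rules12"},
]

# One row per rule; codes a rule does not set become NaN.
rules_df = pd.DataFrame(sample_rules)

# Recover the (code, value) items of the first rule by dropping the NaN cells.
first_rule_items = rules_df.drop(columns="rules_desc").iloc[0].dropna().to_dict()
```

Whether this table form is more convenient than the tuple lists depends on how the matching is done afterwards.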
Maybe the entire task can be accomplished in two or three lines of pandas. With the variable naming and the loops provided here, it should be easier for a reader to come up with such code and perhaps post another, better answer.
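A sketch of what a more pandas-centred version might look like, using `DataFrame.apply` with `axis=1`. It rests on assumptions that the question does not state outright: a rule matches when, for every code it names, the row's values are a non-empty subset of the rule's values, and the strings 'nan' and '' count as missing. The helpers `as_tuple` and `first_matching_rule` and the small sample data are hypothetical.

```python
import pandas as pd

def as_tuple(value):
    """Normalize a cell to a tuple; 'nan', '' and NaN count as empty."""
    if isinstance(value, tuple):
        return value
    if value in ("nan", "") or pd.isna(value):
        return ()
    return (value,)

def first_matching_rule(row, rules, desc_key="rules_desc"):
    """Name of the first rule whose every code covers the row's values, else None."""
    for rule in rules:
        codes = {k: v for k, v in rule.items() if k != desc_key}
        if all(as_tuple(row[k]) and set(as_tuple(row[k])) <= set(as_tuple(v))
               for k, v in codes.items()):
            return rule[desc_key]
    return None

sample_rules = [
    {"code1": ("VA", "HC", "NIH"), "rules_desc": "rules1"},
    {"code0": "40", "code3": "518", "rules_desc": "rules12"},
]
df = pd.DataFrame({
    "code0": ["5", "nan", "40"],
    "code1": [("Agr", "Serv"), ("VA", "HC"), "nan"],
    "code3": ["90", "nan", "518"],
})

# Extra keyword arguments to apply() are forwarded to the function.
df["rules_desc"] = df.apply(first_matching_rule, axis=1, rules=sample_rules)
df["isHit"] = df["rules_desc"].notna()
```

The row-wise `apply` is still a Python loop under the hood, so this is mainly a way to keep the result attached to the dataframe rather than a speed-up.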
from pprint import pprint
import pandas as pd
from collections import defaultdict
# ----------------------------------------------------------------------
rules_list=rules_dict=[{'code1':('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'),'rules_desc':'rules1'},{'code0':('40'),'code3':('518'),'code6':('594'),'rules_desc':'rules12'},{'code0':('98'),'code1':('ATT','NC'),'code2':('103','104','105','106','31'),'code3':('810'),'code4':('computerscience'),'rules_desc':'rules2'},{'code0':('98'),'code1':('ATT','VA','NC'),'code2':('104','105','106','31'),'code4':('computerscience'),'rules_desc':'rules2'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('610','620','682','642','621','611'),'code4':('biology'),'code6':('712','479','297','639','452','172'),'rules_desc':'rules2'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('396','340','394','393','240'),'code4':('biology'),'code5':('12','18'),'rules_desc':'rules2'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('612','790','110'),'code4':('biology'),'code5':('12','16','18','19'),'code6':('285','295','236','239','269','284','237'),'rules_desc':'rules3'},{'code0':('52'),'code1':('NC'),'code2':('109'),'code3':('730','320','350','379','812','374'),'code4':('biology'),'code5':('12','18','19'),'rules_desc':'rules3'},{'code0':('40'),'code1':('VA'),'code2':('11'),'code3':('113','174','131','115'),'code5':('11','19','31'),'code6':('164','157','388','158'),'rules_desc':'rules4'},{'code0':('58'),'code1':('CE'),'code2':('109'),'code3':('423','114'),'code5':('31'),'code6':('372','238'),'rules_desc':'rules5'}]
# codeNname     : 'code0', 'code1', ..., 'code6'
# ruleNname     : 'rules1', 'rules12', 'rules2', ..., 'rules5'
# ruleDescrKey  : 'rules_desc'
# dictRulesSpec : dictionary { codeNname: value, ..., ruleDescrKey: ruleNname }
# dictCodes     : dictionary { codeNname: value, codeNname: value, ... }
# Rules         : list [ dictRulesSpec, dictRulesSpec, ... ]
# dictRules     : { ruleNname: [ (codeNname, codeNnameValue), ... ], ... }
Rules = rules_list
ruleDescrKey = 'rules_desc'
dictRules = defaultdict(list)
for dictRulesSpec in Rules:
    ruleNname = dictRulesSpec.pop(ruleDescrKey)
    # dictRulesSpec without ruleDescrKey item has only Codes as keys, so:
    dictCodes = dictRulesSpec
    for codeNname, codeNnameValue in dictCodes.items():
        dictRules[ruleNname].append( (codeNname, codeNnameValue) )
print(f'{Rules=}')
print(f'{dictRules=}')
print(' ------------- ')
# ----------------------------------------------------------------------
indf_dict={'code0':{0:('5'),1:'nan',2:('98'),3:('98'),4:'',5:('15'),6:('40'),7:('52'),8:('52'),9:('40'),10:('52'),11:('52'),12:('58')},'code1':{0:('Agr','Serv'),1:('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'),2:('ATT','NC'),3:('ATT','VA','NC'),4:('VA','HC','NIH','ATT','COL','UCL'),5:('Agr'),6:'nan',7:('NC'),8:('NC'),9:('VA'),10:('NC'),11:('NC'),12:('CE')},'code2':{0:'nan',1:'nan',2:('103','104','105','106','31'),3:('104','105'),4:'nan',5:('5'),6:'nan',7:('109'),8:('109'),9:('11'),10:('109'),11:('109'),12:('109')},'code3':{0:('90'),1:'nan',2:('810'),3:('810'),4:'nan',5:('58'),6:('518'),7:('610','620','682','642','621','611'),8:('620','682','642','611'),9:('113','174','131','115'),10:('612','790','110'),11:('612','110'),12:('423','114')},'code4':{0:('1'),1:'nan',2:('computerscience'),3:('computerscience'),4:'nan',5:('fishing'),6:'nan',7:('biology'),8:('biology'),9:'nan',10:('biology'),11:('biology'),12:'nan'},'code5':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:'nan',7:'nan',8:'nan',9:('11','19','31'),10:('12','16','18','19'),11:('12','18','19'),12:'31'},'code6':{0:'nan',1:'nan',2:'nan',3:'nan',4:'nan',5:'nan',6:'594',7:('712','479','297','639','452','172'),8:('712','479','297'),9:('164','157','388','158'),10:('285','295','236','239','269','284','237'),11:('285','295','237'),12:('372','238')}}
dictDataRowsByCodeNname = indf_dict
df_dictDataRowsByCodeNname = pd.DataFrame.from_dict(dictDataRowsByCodeNname)
print(f'{dictDataRowsByCodeNname=}')
listDataRowsByRow = df_dictDataRowsByCodeNname.to_dict(orient='records')
print(f'{listDataRowsByRow=}')
print(' ------------- ')
isHit_Column = []
rules_desc_Column = []
# The loop below tests for only one hit within the rule ...
for dctDataRow in listDataRowsByRow:
    isHit = False
    for ruleNname, listTuplesCodeNnameValue in dictRules.items():
        if isHit:
            break
        for codeNname, codeNnameValue in listTuplesCodeNnameValue:
            if isHit:
                break
            else:
                if dctDataRow[codeNname] == codeNnameValue:
                    isHit = True
                    bckpRuleNname = ruleNname
                    break
    rules_desc_Column.append( bckpRuleNname if isHit else None)
    isHit_Column.append(isHit)
print(f'{rules_desc_Column = }')
print(f'{isHit_Column = }')
print('================================')
df_dictDataRowsByCodeNname['isHit'] = isHit_Column
df_dictDataRowsByCodeNname['rules_desc'] = rules_desc_Column
print(df_dictDataRowsByCodeNname)
print('================================')
isHit_Column = []
rules_desc_Column = []
# The loop below tests for all hits within the rule and
# lists all rules that apply in case of hits:
for dctDataRow in listDataRowsByRow:
    lstRulesWithHits = []
    for ruleNname, listTuplesCodeNnameValue in dictRules.items():
        ruleItemsWithHits = 0
        for codeNname, codeNnameValue in listTuplesCodeNnameValue:
            if dctDataRow[codeNname] == codeNnameValue:
                ruleItemsWithHits += 1
        if ruleItemsWithHits == len(listTuplesCodeNnameValue):
            lstRulesWithHits.append(ruleNname)
    isHit = bool(lstRulesWithHits)
    rules_desc_Column.append( lstRulesWithHits if isHit else None)
    isHit_Column.append(isHit)
df_dictDataRowsByCodeNname['isHit'] = isHit_Column
df_dictDataRowsByCodeNname['rules_desc'] = rules_desc_Column
print(df_dictDataRowsByCodeNname)
print('================================')
which gives:
Rules=[{'code1': ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL')}, {'code0': '40', 'code3': '518', 'code6': '594'}, {'code0': '98', 'code1': ('ATT', 'NC'), 'code2': ('103', '104', '105', '106', '31'), 'code3': '810', 'code4': 'computerscience'}, {'code0': '98', 'code1': ('ATT', 'VA', 'NC'), 'code2': ('104', '105', '106', '31'), 'code4': 'computerscience'}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('610', '620', '682', '642', '621', '611'), 'code4': 'biology', 'code6': ('712', '479', '297', '639', '452', '172')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('396', '340', '394', '393', '240'), 'code4': 'biology', 'code5': ('12', '18')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('612', '790', '110'), 'code4': 'biology', 'code5': ('12', '16', '18', '19'), 'code6': ('285', '295', '236', '239', '269', '284', '237')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('730', '320', '350', '379', '812', '374'), 'code4': 'biology', 'code5': ('12', '18', '19')}, {'code0': '40', 'code1': 'VA', 'code2': '11', 'code3': ('113', '174', '131', '115'), 'code5': ('11', '19', '31'), 'code6': ('164', '157', '388', '158')}, {'code0': '58', 'code1': 'CE', 'code2': '109', 'code3': ('423', '114'), 'code5': '31', 'code6': ('372', '238')}]
dictRules=defaultdict(<class 'list'>, {'rules1': [('code1', ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL'))], 'rules12': [('code0', '40'), ('code3', '518'), ('code6', '594')], 'rules2': [('code0', '98'), ('code1', ('ATT', 'NC')), ('code2', ('103', '104', '105', '106', '31')), ('code3', '810'), ('code4', 'computerscience'), ('code0', '98'), ('code1', ('ATT', 'VA', 'NC')), ('code2', ('104', '105', '106', '31')), ('code4', 'computerscience'), ('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('610', '620', '682', '642', '621', '611')), ('code4', 'biology'), ('code6', ('712', '479', '297', '639', '452', '172')), ('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('396', '340', '394', '393', '240')), ('code4', 'biology'), ('code5', ('12', '18'))], 'rules3': [('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('612', '790', '110')), ('code4', 'biology'), ('code5', ('12', '16', '18', '19')), ('code6', ('285', '295', '236', '239', '269', '284', '237')), ('code0', '52'), ('code1', 'NC'), ('code2', '109'), ('code3', ('730', '320', '350', '379', '812', '374')), ('code4', 'biology'), ('code5', ('12', '18', '19'))], 'rules4': [('code0', '40'), ('code1', 'VA'), ('code2', '11'), ('code3', ('113', '174', '131', '115')), ('code5', ('11', '19', '31')), ('code6', ('164', '157', '388', '158'))], 'rules5': [('code0', '58'), ('code1', 'CE'), ('code2', '109'), ('code3', ('423', '114')), ('code5', '31'), ('code6', ('372', '238'))]})
-------------
dictDataRowsByCodeNname={'code0': {0: '5', 1: 'nan', 2: '98', 3: '98', 4: '', 5: '15', 6: '40', 7: '52', 8: '52', 9: '40', 10: '52', 11: '52', 12: '58'}, 'code1': {0: ('Agr', 'Serv'), 1: ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL'), 2: ('ATT', 'NC'), 3: ('ATT', 'VA', 'NC'), 4: ('VA', 'HC', 'NIH', 'ATT', 'COL', 'UCL'), 5: 'Agr', 6: 'nan', 7: 'NC', 8: 'NC', 9: 'VA', 10: 'NC', 11: 'NC', 12: 'CE'}, 'code2': {0: 'nan', 1: 'nan', 2: ('103', '104', '105', '106', '31'), 3: ('104', '105'), 4: 'nan', 5: '5', 6: 'nan', 7: '109', 8: '109', 9: '11', 10: '109', 11: '109', 12: '109'}, 'code3': {0: '90', 1: 'nan', 2: '810', 3: '810', 4: 'nan', 5: '58', 6: '518', 7: ('610', '620', '682', '642', '621', '611'), 8: ('620', '682', '642', '611'), 9: ('113', '174', '131', '115'), 10: ('612', '790', '110'), 11: ('612', '110'), 12: ('423', '114')}, 'code4': {0: '1', 1: 'nan', 2: 'computerscience', 3: 'computerscience', 4: 'nan', 5: 'fishing', 6: 'nan', 7: 'biology', 8: 'biology', 9: 'nan', 10: 'biology', 11: 'biology', 12: 'nan'}, 'code5': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan', 5: 'nan', 6: 'nan', 7: 'nan', 8: 'nan', 9: ('11', '19', '31'), 10: ('12', '16', '18', '19'), 11: ('12', '18', '19'), 12: '31'}, 'code6': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan', 5: 'nan', 6: '594', 7: ('712', '479', '297', '639', '452', '172'), 8: ('712', '479', '297'), 9: ('164', '157', '388', '158'), 10: ('285', '295', '236', '239', '269', '284', '237'), 11: ('285', '295', '237'), 12: ('372', '238')}}
listDataRowsByRow=[{'code0': '5', 'code1': ('Agr', 'Serv'), 'code2': 'nan', 'code3': '90', 'code4': '1', 'code5': 'nan', 'code6': 'nan'}, {'code0': 'nan', 'code1': ('VA', 'HC', 'NIH', 'SAP', 'AUS', 'HOL', 'ATT', 'COL', 'UCL'), 'code2': 'nan', 'code3': 'nan', 'code4': 'nan', 'code5': 'nan', 'code6': 'nan'}, {'code0': '98', 'code1': ('ATT', 'NC'), 'code2': ('103', '104', '105', '106', '31'), 'code3': '810', 'code4': 'computerscience', 'code5': 'nan', 'code6': 'nan'}, {'code0': '98', 'code1': ('ATT', 'VA', 'NC'), 'code2': ('104', '105'), 'code3': '810', 'code4': 'computerscience', 'code5': 'nan', 'code6': 'nan'}, {'code0': '', 'code1': ('VA', 'HC', 'NIH', 'ATT', 'COL', 'UCL'), 'code2': 'nan', 'code3': 'nan', 'code4': 'nan', 'code5': 'nan', 'code6': 'nan'}, {'code0': '15', 'code1': 'Agr', 'code2': '5', 'code3': '58', 'code4': 'fishing', 'code5': 'nan', 'code6': 'nan'}, {'code0': '40', 'code1': 'nan', 'code2': 'nan', 'code3': '518', 'code4': 'nan', 'code5': 'nan', 'code6': '594'}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('610', '620', '682', '642', '621', '611'), 'code4': 'biology', 'code5': 'nan', 'code6': ('712', '479', '297', '639', '452', '172')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('620', '682', '642', '611'), 'code4': 'biology', 'code5': 'nan', 'code6': ('712', '479', '297')}, {'code0': '40', 'code1': 'VA', 'code2': '11', 'code3': ('113', '174', '131', '115'), 'code4': 'nan', 'code5': ('11', '19', '31'), 'code6': ('164', '157', '388', '158')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('612', '790', '110'), 'code4': 'biology', 'code5': ('12', '16', '18', '19'), 'code6': ('285', '295', '236', '239', '269', '284', '237')}, {'code0': '52', 'code1': 'NC', 'code2': '109', 'code3': ('612', '110'), 'code4': 'biology', 'code5': ('12', '18', '19'), 'code6': ('285', '295', '237')}, {'code0': '58', 'code1': 'CE', 'code2': '109', 'code3': ('423', '114'), 'code4': 'nan', 'code5': '31', 'code6': ('372', '238')}]
-------------
rules_desc_Column = [None, 'rules12', 'rules3', 'rules3', None, None, 'rules2', 'rules3', 'rules3', 'rules2', 'rules3', 'rules3', 'rules3']
isHit_Column = [False, True, True, True, False, False, True, True, True, True, True, True, True]
================================
    code0                                        code1  ...  isHit  rules_desc
0       5                                  (Agr, Serv)  ...  False        None
1     nan  (VA, HC, NIH, SAP, AUS, HOL, ATT, COL, UCL)  ...   True     rules12
2      98                                    (ATT, NC)  ...   True      rules3
3      98                                (ATT, VA, NC)  ...   True      rules3
4                         (VA, HC, NIH, ATT, COL, UCL)  ...  False        None
5      15                                          Agr  ...  False        None
6      40                                          nan  ...   True      rules2
7      52                                           NC  ...   True      rules3
8      52                                           NC  ...   True      rules3
9      40                                           VA  ...   True      rules2
10     52                                           NC  ...   True      rules3
11     52                                           NC  ...   True      rules3
12     58                                           CE  ...   True      rules3

[13 rows x 9 columns]
================================
    code0                                        code1  ...  isHit  rules_desc
0       5                                  (Agr, Serv)  ...  False        None
1     nan  (VA, HC, NIH, SAP, AUS, HOL, ATT, COL, UCL)  ...   True    [rules1]
2      98                                    (ATT, NC)  ...  False        None
3      98                                (ATT, VA, NC)  ...  False        None
4                         (VA, HC, NIH, ATT, COL, UCL)  ...  False        None
5      15                                          Agr  ...  False        None
6      40                                          nan  ...   True   [rules12]
7      52                                           NC  ...  False        None
8      52                                           NC  ...  False        None
9      40                                           VA  ...   True    [rules4]
10     52                                           NC  ...  False        None
11     52                                           NC  ...  False        None
12     58                                           CE  ...   True    [rules5]

[13 rows x 9 columns]
================================
P.S. The first of the two final loops in the code above does NOT accumulate hits into a list of all rules that apply. In other words, the search stops at the first rule item that gives a hit.
The second final loop tests all rule items and collects the rules which give hits in a list.
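The difference between the two loops can be reduced to `any` versus `all` over the items of a single rule. This is only an illustrative sketch: `rule_items` and `row` are made-up sample values in the same shape as the entries of `dictRules` and `listDataRowsByRow`.

```python
rule_items = [("code0", "40"), ("code3", "518"), ("code6", "594")]
row = {"code0": "40", "code3": "90", "code6": "nan"}

# First loop's behaviour: a single agreeing item is enough for a hit.
any_item_hits = any(row[code] == value for code, value in rule_items)

# Second loop's behaviour: every item of the rule must agree.
all_items_hit = all(row[code] == value for code, value in rule_items)
```

For this sample row the first test reports a hit because 'code0' agrees, while the second does not because 'code3' and 'code6' differ.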
Perhaps this will get you started. The only tricky thing here is the all function. What I'm saying here is, "for every key and value in this particular rule, if the value is found in the list of values for the corresponding key in our data row, and that's true for EVERY part of this rule, then it is a winner".
When you have nested data like this, pandas is not the right tool. You could probably make it work, but this is way easier.
A key point here is that you need to search the VALUES in your data dictionary. Right? You have {0:'5',2:'98'...}, but we don't care about 0 and 2. We only care about the strings.
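To make the point about keys and values concrete, here is a small sketch. The dictionary `column` is a made-up excerpt in the same shape as one column of the data.

```python
column = {0: "5", 2: "98", 6: "40"}

# `in` on a dict checks the keys (0, 2, 6), not the stored strings.
found_as_key = "98" in column

# The strings live in the values, so that is where to search.
found_as_value = "98" in column.values()
```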
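# indf_dict is column-oriented ({column: {row_index: value}}), so the whole
# dictionary is treated as the single data "row" that each rule is checked against.
for row in [indf_dict]:
    for rno,rule in enumerate(rules_list):
        print("New rule", rno)
        # 'rules_desc' is not a data column, so "if key in row" skips it.
        match = all( val in row[key].values() for key,val in rule.items() if key in row)
        if match:
            print("Rule", rno, "matches")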
Output:
New rule 0
Rule 0 matches
New rule 1
Rule 1 matches
New rule 2
Rule 2 matches
New rule 3
New rule 4
Rule 4 matches
New rule 5
New rule 6
Rule 6 matches
New rule 7
New rule 8
Rule 8 matches
New rule 9
Rule 9 matches