Remove item from nested dictionaries if specified key contains None values - python

I have a list of dictionaries, and I am trying to remove any dictionary in which the value of a certain key is None.
item_dict = [
{'code': 'aaa0000',
'id': 415294,
'index_range': '10-33',
'location': 'A010',
'type': 'True'},
{'code': 'bbb1458',
'id': 415575,
'index_range': '30-62',
'location': None,
'type': 'True'},
{'code': 'ccc3013',
'id': 415575,
'index_range': '14-59',
'location': 'C041',
'type': 'True'}
]
for item in item_dict:
    filtered = dict((k,v) for k,v in item.iteritems() if v is not None)
# Output Results
# Item - aaa0000 is missing
# {'index_range': '14-59', 'code': 'ccc3013', 'type': 'True', 'id': 415575, 'location': 'C041'}
In my example, the output is missing one of the dictionaries, and if I create a new list and append filtered to it, item bbb1458 is included in the list as well.
How can I rectify this?

[item for item in item_dict if None not in item.values()]
Each item in this list is a dictionary, and a dictionary is kept only if None does not appear among its values.

You can create a new list using a list comprehension, filtering on the condition that all values are not None:
item_dict = [
{'code': 'aaa0000',
'id': 415294,
'index_range': '10-33',
'location': 'A010',
'type': 'True'},
{'code': 'bbb1458',
'id': 415575,
'index_range': '30-62',
'location': None,
'type': 'True'},
{'code': 'ccc3013',
'id': 415575,
'index_range': '14-59',
'location': 'C041',
'type': 'True'}
]
filtered = [d for d in item_dict if all(value is not None for value in d.values())]
print(filtered)
#[{'index_range': '10-33', 'id': 415294, 'location': 'A010', 'type': 'True', 'code': 'aaa0000'}, {'index_range': '14-59', 'id': 415575, 'location': 'C041', 'type': 'True', 'code': 'ccc3013'}]
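Both answers above drop a dictionary when any of its values is None. If only one particular key should be checked, as the title suggests, a minimal sketch could look like the following. The sample data and the helper name drop_if_none are made up for illustration. As a side note, the loop in the question reassigns filtered on every iteration, so only the last dictionary survives, and its inner expression removes None-valued keys rather than whole dictionaries.

```python
# Hypothetical sample data mirroring the structure in the question.
items = [
    {'code': 'aaa0000', 'location': 'A010'},
    {'code': 'bbb1458', 'location': None},
    {'code': 'ccc3013', 'location': 'C041'},
    {'code': 'ddd9999'},  # 'location' key missing entirely
]

def drop_if_none(records, key):
    """Return the dictionaries whose value for `key` is not None.

    A dictionary that lacks `key` is dropped as well, because
    dict.get() returns None for a missing key.
    """
    return [record for record in records if record.get(key) is not None]

kept = drop_if_none(items, 'location')
print([record['code'] for record in kept])
```

Whether a missing key should count as None is a design choice; use record[key] instead of record.get(key) if a missing key should raise an error.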

Related

How to flatten nested dict formatted '_source' column of csv, into dataframe

I have a CSV with 500+ rows where one column, "_source", is stored as JSON. I want to extract that into a pandas dataframe, with each key as its own column.
The data is a 1 MB JSON file of online social media data (Facebook, Twitter, web crawls, etc.). There are approximately 528 separate rows of posts/tweets/text, each containing many dictionaries inside dictionaries. I am attaching a few steps from my Jupyter notebook below to give a more complete picture. I need to turn all key-value pairs of the nested dictionaries into columns of a dataframe.
Thank you so much this will be a huge help!!!
I have tried changing it to a dataframe by doing this
source = pd.DataFrame.from_dict(source, orient='columns')
And it returns something like this... I thought it might unpack the dictionary but it did not.
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
below is the shape
#source.shape (528, 1)
Below is what an actual "_source" row looks like when stretched out. There are many dictionaries and key:value pairs, and each key needs to be its own column. Thanks! The actual links have been altered/scrambled for privacy reasons.
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Before you post, please make sure the code actually works for the data attached. Thanks!
I tried the code below, but it failed with a syntax error that I could not figure out.
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
^
SyntaxError: invalid syntax
Whoever can help me with this will be a saint!
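The SyntaxError comes from source_data.[_source]: a dot cannot be followed by a bracket, and the column name has to be a quoted string, so the column would be selected as source_data['_source']. Separately, the row printed above uses single quotes and None, which is Python literal syntax rather than JSON, so json.loads may still reject the cells once the syntax is fixed. Here is a minimal sketch of a tolerant parser, assuming each cell is a string; the sample cell and the name parse_cell are made up for illustration.

```python
import ast
import json

# Hypothetical cell value, shortened from the row shown in the question.
cell = "{'uid': 'ac0f', 'meta': {'description': None}, 'doc': {'status_code': 200}}"

def parse_cell(text):
    """Parse one cell, trying JSON first and Python literal syntax second."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Single quotes and None are Python syntax, not JSON.
        return ast.literal_eval(text)

record = parse_cell(cell)
print(record['doc']['status_code'])
```

With pandas, such a function could be applied per cell, for example source_data['_source'].apply(parse_cell), before the nested dictionaries are flattened.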
I had to do something like that a while back. Basically, I used a function that completely flattens the JSON to identify the keys that will become columns, then iterated through the flattened keys to reconstruct a row and add each row to a "results" dataframe. With the data you provided, it created a row with 52 columns, and looking through it, it appears that every key ended up in its own column. Anything nested, for example 'meta': {'rule_matcher': [{'atribs': {'website': ...]}, gets a column name built from its nested keys, such as atribs_website in the output below; a '.' in a column name marks a nested list level, as in results.rule_type.
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Code:
import re
import pandas as pd

def flatten_json(y):
    """Flatten nested dicts/lists into one dict; keys are joined with '_'."""
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # Leaf value: drop the trailing '_' from the accumulated name.
            out[name[:-1]] = x
    flatten(y)
    return out

flat = flatten_json(data_source)
results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        # The first list index found in the key selects the row.
        row_idx = re.findall(r'_(\d+)_', item)[0]
    except IndexError:
        # No list index in the key: handled after the loop.
        special_cols.append(item)
        continue
    column = re.findall(r'_\d+_(.*)', item)[0]
    column = re.sub(r'_\d+_', '.', column)
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value
# Keys without a list index are copied to every row.
for item in special_cols:
    results[item] = flat[item]
Output:
print (results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
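For comparison, here is a minimal sketch that keeps the full path of every value and joins the levels with '.', so list positions stay visible in the column name. It is an alternative to the code above, not a restatement of it; the record and the function name flatten are made up for illustration.

```python
def flatten(value, prefix=''):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        children = enumerate(value)
    else:
        # Leaf value: store it under the path built so far.
        return {prefix: value}
    for key, child in children:
        path = f'{prefix}.{key}' if prefix else str(key)
        flat.update(flatten(child, path))
    return flat

# Hypothetical record, shortened from the data in the question.
record = {
    'uid': 'ac0f',
    'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res'}}]},
    'doc': {'links': [], 'status_code': 200},
}

row = flatten(record)
print(row)
```

Empty lists and dicts produce no key at all, which matches the behaviour of flatten_json above. A list of such rows, one per record, could then be passed to pd.DataFrame to get one column per path.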

Iterate through a list of key and value pairs and get specific key and value using Python

I have a list like this.
data = [{
'category': 'software',
'code': 110,
'actual': '["5.1.4"]',
'opened': '2018-10-16T09:18:12Z',
'component_type': 'update',
'event': 'new update available',
'current_severity': 'info',
'details': '',
'expected': None,
'id': 10088862,
'component_name': 'Purity//FA'
},
{
'category': 'software',
'code': 67,
'actual': None,
'opened': '2018-10-18T01:14:45Z',
'component_type': 'host',
'event': 'misconfiguration',
'current_severity': 'critical',
'details': '',
'expected': None,
'id': 10088898,
'component_name': 'pudc-vm-001'
},
{
'category': 'array',
'code': 42,
'actual': None,
'opened': '2018-11-22T22:27:29Z',
'component_type': 'hardware',
'event': 'failure',
'current_severity': 'warning',
'details': '' ,
'expected': None,
'id': 10089121,
'component_name': 'ct1.eth15'
}]
I want to iterate over this and get only category, component_type, event and current_severity.
I tried a for loop, but it fails with "too many values to unpack", obviously.
for k, v, b, n in data:
    print(k, v, b, n)  # do something
I essentially want a list that is filtered to have only category, component_type, event and current_severity, so that I can use the same for loop to get my four key-value pairs.
Or is there a better way to do it? Please help me out.
Note: The number of stanzas in the list is not fixed; it keeps changing, and there might be more than three.
You have a list of dictionaries; a simple way to iterate over it is
category = [x['category'] for x in data]
which gives the values of the category key:
['software', 'software', 'array']
Do the same for component_type, event and current_severity and you're good to go
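If the goal is to keep the four-variable for loop from the question, one option is operator.itemgetter, which returns a tuple when given several keys. This is a minimal sketch with shortened, made-up records; the name pick is chosen for illustration.

```python
from operator import itemgetter

# Hypothetical records with the same keys as the data in the question.
data = [
    {'category': 'software', 'code': 110, 'component_type': 'update',
     'event': 'new update available', 'current_severity': 'info'},
    {'category': 'array', 'code': 42, 'component_type': 'hardware',
     'event': 'failure', 'current_severity': 'warning'},
]

# itemgetter with several keys returns a tuple of the matching values.
pick = itemgetter('category', 'component_type', 'event', 'current_severity')

rows = [pick(entry) for entry in data]
for category, component_type, event, severity in rows:
    print(category, component_type, event, severity)
```

Note that itemgetter raises KeyError if a dictionary lacks one of the keys, so this fits best when every stanza is known to have all four.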
If you know that every dict in your list has at least the keys you are trying to extract, you can use dict[key]. For safety, however, I prefer dict.get(key, default), as in this example:
out = [
{
'category': elm.get('category'),
'component_type': elm.get('component_type'),
'event': elm.get('event'),
'current_severity': elm.get('current_severity')
} for elm in data
]
print(out)
Output:
[{'category': 'software',
'component_type': 'update',
'current_severity': 'info',
'event': 'new update available'},
{'category': 'software',
'component_type': 'host',
'current_severity': 'critical',
'event': 'misconfiguration'},
{'category': 'array',
'component_type': 'hardware',
'current_severity': 'warning',
'event': 'failure'}]
For more information about when to use dict.get() instead of dict[key], see this answer.
With this you get a new list containing only the keys you are interested in:
new_list = [{
    'category': stanza['category'],
    'component_type': stanza['component_type'],
    'event': stanza['event'],
    'current_severity': stanza['current_severity'],
} for stanza in data]

Sort and return all of nested dictionaries based on specified key value

I am trying to rearrange the contents of a nested dictionary based on the value of a specified key.
dict_entries = {
'entries': {
'AzP746r3Nl': {
'uniqueID': 'AzP746r3Nl',
'index': 2,
'data': {'comment': 'First Plastique Mat.',
'created': '17/01/19 10:18',
'project': 'EMZ',
'name': 'plastique_varA',
'version': '1'},
'name': 'plastique_varA',
'text': 'plastique test',
'thumbnail': '/Desktop/mat/plastique_varA/plastique_varA.jpg',
'type': 'matEntry'
},
'Q2tch2xm6h': {
'uniqueID': 'Q2tch2xm6h',
'index': 0,
'data': {'comment': 'Camino from John Inds.',
'created': '03/01/19 12:08',
'project': 'EMZ',
'name': 'camino_H10a',
'version': '1'},
'name': 'camino_H10a',
'text': 'John Inds : Camino',
'thumbnail': '/Desktop/chips/camino_H10a/camino_H10a.jpg',
'type': 'ChipEntry'
},
'ZeqCFCmHqp': {
'uniqueID': 'ZeqCFCmHqp',
'index': 1,
'data': {'comment': 'Prototype Bleu.',
'created': '03/01/19 14:07',
'project': 'EMZ',
'name': 'bleu_P23y',
'version': '1'},
'name': 'bleu_P23y',
'text': 'Bleu : Prototype',
'thumbnail': '/Desktop/chips/bleu_P23y/bleu_P23y.jpg',
'type': 'ChipEntry'
}
}
}
In the nested dictionary above, I am trying to sort the entries by the name key and by the created key (one function for each), and once the entries are sorted, the index value should be updated accordingly as well.
So far I am only able to query the values of the said key(s):
for item in dict_entries.get('entries').values():
    # The key that I am targeting
    tar_key = item['name']
but this only gives me the value of the name key, and I am unsure about my next step: sorting by the value of the name key while capturing and rearranging all the contents of the nested dictionaries.
This is my desired output (if checking by name):
{'entries': {
'ZeqCFCmHqp': {
'uniqueID': 'ZeqCFCmHqp',
'index': 1,
'data': {'comment': 'Prototype Bleu.',
'created': '03/01/19 14:07',
'project': 'EMZ',
'name': 'bleu_P23y',
'version': '1'},
'name': 'bleu_P23y',
'text': 'Bleu : Prototype',
'thumbnail': '/Desktop/chips/bleu_P23y/bleu_P23y.jpg',
'type': 'ChipEntry'
},
'Q2tch2xm6h': {
'uniqueID': 'Q2tch2xm6h',
'index': 0,
'data': {'comment': 'Camino from John Inds.',
'created': '03/01/19 12:08',
'project': 'EMZ',
'name': 'camino_H10a',
'version': '1'},
'name': 'camino_H10a',
'text': 'John Inds : Camino',
'thumbnail': '/Desktop/chips/camino_H10a/camino_H10a.jpg',
'type': 'ChipEntry'
},
'AzP746r3Nl': {
'uniqueID': 'AzP746r3Nl',
'index': 2,
'data': {'comment': 'First Plastique Mat.',
'created': '17/01/19 10:18',
'project': 'EMZ',
'name': 'plastique_varA',
'version': '1'},
'name': 'plastique_varA',
'text': 'plastique test',
'thumbnail': '/Desktop/mat/plastique_varA/plastique_varA.jpg',
'type': 'matEntry'
}
}
}
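One possible approach, as a minimal sketch: sort the items of the inner dictionary with a key function and rebuild the dictionary, relying on the fact that dicts preserve insertion order in Python 3.7+. The data below is a trimmed-down, made-up version of the structure above, the function names are chosen for illustration, and the date format is an assumption (day/month/year). Unlike the desired output shown above, this sketch renumbers index after sorting, as the description asks.

```python
from datetime import datetime

# Hypothetical, trimmed-down version of the structure in the question.
dict_entries = {'entries': {
    'AzP746r3Nl': {'index': 2, 'name': 'plastique_varA',
                   'data': {'created': '17/01/19 10:18'}},
    'Q2tch2xm6h': {'index': 0, 'name': 'camino_H10a',
                   'data': {'created': '03/01/19 12:08'}},
    'ZeqCFCmHqp': {'index': 1, 'name': 'bleu_P23y',
                   'data': {'created': '03/01/19 14:07'}},
}}

def by_name(entry):
    return entry['name']

def by_created(entry):
    # Assumes day/month/year with a two-digit year, e.g. '17/01/19 10:18'.
    return datetime.strptime(entry['data']['created'], '%d/%m/%y %H:%M')

def sort_entries(container, sort_key):
    """Return a new container whose entries are ordered by sort_key."""
    ordered = sorted(container['entries'].items(),
                     key=lambda pair: sort_key(pair[1]))
    entries = {}
    for position, (unique_id, entry) in enumerate(ordered):
        # Copy each entry so the input is left untouched, then renumber it.
        entries[unique_id] = {**entry, 'index': position}
    return {'entries': entries}

sorted_by_name = sort_entries(dict_entries, by_name)
sorted_by_created = sort_entries(dict_entries, by_created)
print(list(sorted_by_name['entries']))
print(list(sorted_by_created['entries']))
```

Parsing created into a datetime matters because comparing the raw strings would sort by day of month first.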

compare two different length lists of dictionaries in python

I want to compare the two lists of dictionaries below. The Name key is common to both.
If a Name appears in both lists, I want to do some other processing with the data.
PerfData = [
{'Name': 'abc', 'Type': 'Ex1', 'Access': 'N1', 'perfStatus':'Latest Perf', 'Comments': '07/12/2017 S/W Version'},
{'Name': 'xyz', 'Type': 'Ex1', 'Access': 'N2', 'perfStatus':'Latest Perf', 'Comments': '11/12/2017 S/W Version upgrade failed'},
{'Name': 'efg', 'Type': 'Cust1', 'Access': 'A1', 'perfStatus':'Old Perf', 'Comments': '11/10/2017 S/W Version upgrade failed, test data is active'}
]
beatData = [
{'Name': 'efg', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.632'},
{'Name': 'abc', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.896'}
]
Thanks
Rajeev
l = [{'name': 'abc'}, {'name': 'xyz'}]
k = [{'name': 'a'}, {'name': 'abc'}]
[i['name'] for i in l for f in k if i['name'] == f['name']]
Hope the above logic works for you.
The answer above doesn't assign the result to a variable. If you want to print it, adding the following would work:
result = [i['name'] for i in l for f in k if i['name'] == f['name']]
print(result)
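The nested comprehension compares every pair of dictionaries. When the lists grow, or when the matching records themselves are needed rather than just the names, one alternative is to index one list by Name first. This is a minimal sketch with shortened, made-up versions of the two lists; it assumes each Name appears at most once in beat_data.

```python
# Hypothetical, shortened versions of the two lists in the question.
perf_data = [
    {'Name': 'abc', 'Type': 'Ex1', 'perfStatus': 'Latest Perf'},
    {'Name': 'xyz', 'Type': 'Ex1', 'perfStatus': 'Latest Perf'},
    {'Name': 'efg', 'Type': 'Cust1', 'perfStatus': 'Old Perf'},
]
beat_data = [
    {'Name': 'efg', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.632'},
    {'Name': 'abc', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.896'},
]

# Index one list by Name so each lookup is a single dict access.
beat_by_name = {beat['Name']: beat for beat in beat_data}

matches = []
for perf in perf_data:
    beat = beat_by_name.get(perf['Name'])
    if beat is not None:
        # Merge the two records; keys from beat win on a name clash.
        matches.append({**perf, **beat})

print([match['Name'] for match in matches])
```

The merged dictionaries are only one way to "do other stuff" with a match; the loop body is the place to put whatever processing is needed.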

Convert dictionary lists to multi-dimensional list of dictionaries

I've been trying to convert the following:
data = {'title':['doc1','doc2','doc3'], 'name':['test','check'], 'id':['ddi5i'] }
to:
[{'title':'doc1', 'name': 'test', 'id': 'ddi5i'},
{'title':'doc2', 'name': 'test', 'id': 'ddi5i'},
{'title':'doc3', 'name': 'test', 'id': 'ddi5i'},
{'title':'doc1', 'name': 'check', 'id': 'ddi5i'},
{'title':'doc2', 'name': 'check', 'id': 'ddi5i'},
{'title':'doc3', 'name': 'check', 'id': 'ddi5i'}]
I've tried various options (list comprehensions, pandas and custom code) but nothing seems to work. For example, the following:
pandas.DataFrame(data).to_dict('list')
throws an error because, since it tries to map the lists, all of them have to be of the same length. Besides, the output would only be uni-dimensional which is not what I'm looking for.
itertools.product may be what you're looking for here, and it can be applied to the values of your data to get appropriate value groupings for the new dicts. Something like
list(dict(zip(data, ele)) for ele in product(*data.values()))
Demo
>>> from itertools import product
>>> list(dict(zip(data, ele)) for ele in product(*data.values()))
[{'id': 'ddi5i', 'name': 'test', 'title': 'doc1'},
{'id': 'ddi5i', 'name': 'test', 'title': 'doc2'},
{'id': 'ddi5i', 'name': 'test', 'title': 'doc3'},
{'id': 'ddi5i', 'name': 'check', 'title': 'doc1'},
{'id': 'ddi5i', 'name': 'check', 'title': 'doc2'},
{'id': 'ddi5i', 'name': 'check', 'title': 'doc3'}]
How this works becomes clear once you look at the raw product:
>>> list(product(*data.values()))
[('test', 'doc1', 'ddi5i'),
('test', 'doc2', 'ddi5i'),
('test', 'doc3', 'ddi5i'),
('check', 'doc1', 'ddi5i'),
('check', 'doc2', 'ddi5i'),
('check', 'doc3', 'ddi5i')]
and now it is just a matter of zipping back into a dict with the original keys.
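The order of the combinations depends on the order in which the keys are iterated, and the demo output above reflects whatever key order the dict had at the time. To get exactly the order requested in the question (all titles for 'test', then all titles for 'check'), the keys can be listed explicitly. This is a minimal sketch; the names order and combos are chosen for illustration.

```python
from itertools import product

data = {'title': ['doc1', 'doc2', 'doc3'], 'name': ['test', 'check'], 'id': ['ddi5i']}

# product() cycles the rightmost iterables fastest, so placing 'name'
# before 'title' keeps each name together while the titles cycle.
order = ('name', 'title', 'id')

combos = [dict(zip(order, values))
          for values in product(*(data[key] for key in order))]

for combo in combos:
    print(combo)
```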
