parse multilevel json to string with condition - python

I have this nested json item that I just want to flatten out to a comma-separated string (e.g. parkinson:5,billy mays:4) so I can store it in a database if needed for future analysis. I wrote the function below but am wondering if there's a more elegant way using a list comprehension (or something else). I found this post but I'm not sure how to adapt it for my needs (Python - parse JSON values by multilevel keys).
Data looks like this:
{'persons':
    [{'name': 'parkinson', 'sentiment': '5'},
     {'name': 'knott david', 'sentiment': 'none'},
     {'name': 'billy mays', 'sentiment': '4'}],
 'organizations':
    [{'name': 'piper jaffray companies', 'sentiment': 'none'},
     {'name': 'marketbeat.com', 'sentiment': 'none'},
     {'name': 'zacks investment research', 'sentiment': 'none'}],
 'locations': []
}
Here's my code:
def parse_entities(data):
    results = ''
    for category in data.keys():
        # for c_id, category in enumerate(data.keys()):
        entity_data = data[category]
        for e_id, entity in enumerate(entity_data):
            if not entity_data[e_id]['sentiment'] == 'none':
                results = results + (data[category][e_id]['name'] + ":" +
                                     data[category][e_id]['sentiment'] + ",")
    return results

Firstly, the most important thing to make your code shorter and nicer to look at is to use your own variables. Be aware that entity_data = data[category] and entity = entity_data[e_id]. So you can write entity['name'] instead of data[category][e_id]['name'].
Secondly, if you want something like
for category in data.keys():
    entity_data = data[category]
you can make it shorter and easier to read by changing it to
for category, entity_data in data.items():
But you don't even need that here, you can just use the data.values() iterator to get the values. When combining these improvements your code looks like this:
def parse_entities(data):
    results = ''
    for entity_data in data.values():
        for entity in entity_data:
            if entity['sentiment'] != 'none':
                results += entity['name'] + ":" + entity['sentiment'] + ","
    return results
(I have also changed results = results + ... to results += ... and if not entity['sentiment'] == 'none' to if entity['sentiment'] != 'none', because it is shorter and doesn't lower the readability)
Once you have this, it is much easier to make it even shorter and more elegant by using a list comprehension:
def parse_entities(data):
    return ",".join([entity['name'] + ":" + entity['sentiment']
                     for entity_data in data.values()
                     for entity in entity_data
                     if not entity['sentiment'] == 'none'])
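As a quick check, the two styles can be run side by side on the sample data from the question. This is only a sketch: the names `parse_entities_loop` and `parse_entities_join` are made up here to keep both versions in one file. It also shows one behavioural difference worth knowing about: the string-concatenation loop leaves a trailing comma, while `",".join(...)` does not.

```python
def parse_entities_loop(data):
    # Loop version: appends "name:sentiment," for every qualifying entity.
    results = ''
    for entity_data in data.values():
        for entity in entity_data:
            if entity['sentiment'] != 'none':
                results += entity['name'] + ":" + entity['sentiment'] + ","
    return results


def parse_entities_join(data):
    # Comprehension version: join() only puts commas *between* items.
    return ",".join(entity['name'] + ":" + entity['sentiment']
                    for entity_data in data.values()
                    for entity in entity_data
                    if entity['sentiment'] != 'none')


data = {
    'persons': [
        {'name': 'parkinson', 'sentiment': '5'},
        {'name': 'knott david', 'sentiment': 'none'},
        {'name': 'billy mays', 'sentiment': '4'},
    ],
    'organizations': [
        {'name': 'piper jaffray companies', 'sentiment': 'none'},
        {'name': 'marketbeat.com', 'sentiment': 'none'},
        {'name': 'zacks investment research', 'sentiment': 'none'},
    ],
    'locations': [],
}

print(repr(parse_entities_loop(data)))  # 'parkinson:5,billy mays:4,'
print(repr(parse_entities_join(data)))  # 'parkinson:5,billy mays:4'
```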

Maybe something like this will work?
def parse_entities(data):
    results = []
    for category in data.keys():
        results += list(map(lambda x: '{0}:{1}'.format(x['name'], x['sentiment']),
                            filter(lambda i: i['sentiment'] != 'none', data[category])))
    return ','.join(results)

if __name__ == '__main__':
    print(parse_entities(data))
With the output looking like this
parkinson:5,billy mays:4

This might be a way to do it, even though using a 'proper library' may make more sense, depending on your actual use case.
data = {
    'persons':
        [{'name': 'parkinson', 'sentiment': '5'},
         {'name': 'knott david', 'sentiment': 'none'},
         {'name': 'billy mays', 'sentiment': '4'}],
    'organizations':
        [{'name': 'piper jaffray companies', 'sentiment': 'none'},
         {'name': 'marketbeat.com', 'sentiment': 'none'},
         {'name': 'zacks investment research', 'sentiment': 'none'}],
    'locations': []
}
import itertools
# equivalent: dicts = itertools.chain.from_iterable(data.values())
dicts = itertools.chain(*data.values())
pairs = [":".join([d['name'], d['sentiment']])
         for d in dicts if d['sentiment'] != 'none']
result = ",".join(pairs)
print(result)
# parkinson:5,billy mays:4
# short, but less readable version
result = ",".join([":".join([d['name'], d['sentiment']])
                   for d in itertools.chain(*data.values())
                   if d['sentiment'] != 'none'])

This is a problem where we need to perform three separate tasks:
1. Filter out unqualified rows of data
2. Flatten the dict of lists into a simple list
3. Transform each dictionary object into a simple tuple, ready for formatting
Here is the code:
def parse_entities(data):
    new_data = [
        (row['name'], row['sentiment'])    # 3. Transform
        for rows in data.values()          # 2. Flatten
        for row in rows                    # 2. Flatten
        if row['sentiment'] != 'none'      # 1. Filter
    ]
    # e.g., new_data = [('parkinson', '5'), ('billy mays', '4')]
    return ','.join('{}:{}'.format(*row) for row in new_data)
#
# test code
#
data = {
'locations': [],
'organizations': [
{'name': 'piper jaffray companies', 'sentiment': 'none'},
{'name': 'marketbeat.com', 'sentiment': 'none'},
{'name': 'zacks investment research', 'sentiment': 'none'}
],
'persons': [
{'name': 'parkinson', 'sentiment': '5'},
{'name': 'knott david', 'sentiment': 'none'},
{'name': 'billy mays', 'sentiment': '4'}
],
}
print(parse_entities(data))
Output:
parkinson:5,billy mays:4

Here's a generator expression that does it:
data = {'persons': [
{'name': 'parkinson', 'sentiment': '5'},
{'name': 'knott david', 'sentiment': 'none'},
{'name': 'billy mays', 'sentiment': '4'}],
'organizations': [
{'name': 'piper jaffray companies', 'sentiment': 'none'},
{'name': 'marketbeat.com', 'sentiment': '99'},
{'name': 'zacks investment research', 'sentiment': 'none'}],
'locations': []
}
results = ','.join(entity['name'] + ':' + entity['sentiment']
                   for category, entity_data in data.items()
                   for entity in entity_data if entity['sentiment'] != 'none')
print(results) # -> parkinson:5,billy mays:4,marketbeat.com:99
Note: I changed the sample data slightly to make sure it handled data in more than one category the same as your code.
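One detail that matters for a filter like this: strings should be compared by value (`==` / `!=`), not by identity (`is` / `is not`). Whether two equal strings happen to be the same object is an implementation detail, and recent CPython versions emit a SyntaxWarning for `is` against a literal. The sketch below assumes the data arrives as JSON text (the `raw` string is made up for illustration) and filters by value.

```python
import json

# Hypothetical JSON payload; after json.loads the strings are ordinary
# runtime objects, so only a value comparison is reliable.
raw = ('{"persons": [{"name": "parkinson", "sentiment": "5"}, '
       '{"name": "knott david", "sentiment": "none"}], "locations": []}')
data = json.loads(raw)

results = ','.join(
    entity['name'] + ':' + entity['sentiment']
    for entity_data in data.values()
    for entity in entity_data
    if entity['sentiment'] != 'none'   # value comparison, not "is not"
)
print(results)  # parkinson:5
```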

Related

From list to nested dictionary

I have these lists:
data = ['man', 'man1', 'man2']
key = ['name', 'id', 'sal']
man_res = ['Alexandra', 'RST01', '$34,000']
man1_res = ['Santio', 'RST009', '$45,000']
man2_res = ['Rumbalski', 'RST50', '$78,000']
The expected output is a nested dictionary:
{'man': {'name': 'Alexandra', 'id': 'RST01', 'sal': '$34,000'},
 'man1': {'name': 'Santio', 'id': 'RST009', 'sal': '$45,000'},
 'man2': {'name': 'Rumbalski', 'id': 'RST50', 'sal': '$78,000'}}
An easy way would be to use a pandas DataFrame:
import pandas as pd
df = pd.DataFrame([man_res, man1_res, man2_res], index=data, columns=key)
print(df)
df.to_dict(orient='index')
Output:
           name      id      sal
man   Alexandra   RST01  $34,000
man1     Santio  RST009  $45,000
man2  Rumbalski   RST50  $78,000
{'man': {'name': 'Alexandra', 'id': 'RST01', 'sal': '$34,000'},
 'man1': {'name': 'Santio', 'id': 'RST009', 'sal': '$45,000'},
 'man2': {'name': 'Rumbalski', 'id': 'RST50', 'sal': '$78,000'}}
Or you could manually merge them using dict + zip:
d = dict(zip(
    data,
    (dict(zip(key, res)) for res in (man_res, man1_res, man2_res))
))
d
{'man': {'name': 'Alexandra', 'id': 'RST01', 'sal': '$34,000'},
 'man1': {'name': 'Santio', 'id': 'RST009', 'sal': '$45,000'},
 'man2': {'name': 'Rumbalski', 'id': 'RST50', 'sal': '$78,000'}}
# save the result lists in a 2D list
all_man_res = []
all_man_res.append(man_res)
all_man_res.append(man1_res)
all_man_res.append(man2_res)
print(all_man_res)
# add them to the output dict
output = {}
for i in range(len(data)):
    person = data[i]
    details = {}
    for j in range(len(key)):
        value = key[j]
        details[value] = all_man_res[i][j]
    output[person] = details
output
The pandas DataFrame answer provided by NoThInG makes the most intuitive sense. If you want to use only the built-in Python tools, you can do
info_list = [dict(zip(key, man)) for man in (man_res, man1_res, man2_res)]
output = dict(zip(data, info_list))
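The same idea can also be written as a single dict comprehension. This is a minimal, self-contained sketch using the lists from the question; the name `all_res` is introduced here only for illustration.

```python
data = ['man', 'man1', 'man2']
key = ['name', 'id', 'sal']
man_res = ['Alexandra', 'RST01', '$34,000']
man1_res = ['Santio', 'RST009', '$45,000']
man2_res = ['Rumbalski', 'RST50', '$78,000']

# all_res is a hypothetical helper name: the result lists in the same order as data
all_res = [man_res, man1_res, man2_res]

# Outer zip pairs each person with their values; inner zip pairs field names with values.
output = {person: dict(zip(key, res)) for person, res in zip(data, all_res)}
print(output['man1'])  # {'name': 'Santio', 'id': 'RST009', 'sal': '$45,000'}
```

Note that `zip` stops at the shortest input, so a missing result list or field would be dropped silently rather than raising an error.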

Writing a GET functionality for json-like data in Python

I am working on a coding challenge for self-development and I came across a question where I am given an input like this:
add {"id":1,"last":"Doe","first":"John","location":{"city":"Oakland","state":"CA","postalCode":"94607"},"active":true}
add {"id":2,"last":"Doe","first":"Jane","location":{"city":"San Francisco","state":"CA","postalCode":"94105"},"active":true}
add {"id":3,"last":"Black","first":"Jim","location":{"city":"Spokane","state":"WA","postalCode":"99207"},"active":true}
add {"id":4,"last":"Frost","first":"Jack","location":{"city":"Seattle","state":"WA","postalCode":"98204"},"active":false}
get {"location":{"state":"WA"},"active":true}
get {"id":1}
get {"active":true}
delete {"active":true}
get {}
And what I am doing is adding the entries that start with add to a list called database = []:
json_input = []
database = []
for line in sys.stdin:
    json_input.append(line.split("', "))
for i in range(0, len(json_input)):
    if json_input[i][0] == 'add':
        database.append(json_input[i][1])
What I want to do is print every entry that matches what follows get, and delete every entry that matches what follows delete. This is where I am stuck. Currently, this is what json_input looks like (database is empty):
[
['add {"id":1,"last":"Doe","first":"John","location":{"city":"Oakland","state":"CA","postalCode":"94607"},"active":true}\n'],
['add {"id":2,"last":"Doe","first":"Jane","location":{"city":"San Francisco","state":"CA","postalCode":"94105"},"active":true}\n'],
['add {"id":3,"last":"Black","first":"Jim","location":{"city":"Spokane","state":"WA","postalCode":"99207"},"active":true}\n'],
['add {"id":4,"last":"Frost","first":"Jack","location":{"city":"Seattle","state":"WA","postalCode":"98204"},"active":false}\n'],
['get {"location":{"state":"WA"},"active":true}\n'], ['get {"id":1}\n'],
['get {"active":true}\n'], ['delete {"active":true}\n'],
['get {}']
]
Perhaps an easy-to-read way to handle this would be a simple class that maintains a list of records. You can add methods for the various commands you want to handle. Then it's just a matter of defining the methods and processing the input to pass to the methods. Here's a possible way (without any frills like error checking):
import json
raw_data = '''add {"id":1,"last":"Doe","first":"John","location":{"city":"Oakland","state":"CA","postalCode":"94607"},"active":true}
add {"id":2,"last":"Doe","first":"Jane","location":{"city":"San Francisco","state":"CA","postalCode":"94105"},"active":true}
add {"id":3,"last":"Black","first":"Jim","location":{"city":"Spokane","state":"WA","postalCode":"99207"},"active":true}
add {"id":4,"last":"Frost","first":"Jack","location":{"city":"Seattle","state":"WA","postalCode":"98204"},"active":false}
get {"location":{"state":"WA"},"active":true}
get {"id":1}
get {"active":true}
delete {"active":true}
get {}'''
class Data:
    @staticmethod
    def matches(obj, query):
        # A non-dict query is a leaf value: compare it directly.
        if not isinstance(query, dict):
            return obj == query
        # Otherwise every key in the query must match recursively.
        return all(Data.matches(obj.get(key), q) for key, q in query.items())

    def __init__(self):
        self.data = []

    def add(self, record):
        self.data.append(record)

    def get(self, query):
        for item in self.data:
            if Data.matches(item, query):
                print(item)

    def delete(self, query):
        self.data = [record for record in self.data if not Data.matches(record, query)]

data = Data()
for line in raw_data.split('\n'):
    # Split "command {json}" into the command name and the JSON payload.
    command, line = line.split(None, 1)
    command = getattr(data, command)
    command(json.loads(line))
This will print the active record from WA, then the record with id 1, then all the active:True records. After deleting the active:True records, it prints everything that is left (the result of the {} query), which is only the active:False record:
{'id': 3, 'last': 'Black', 'first': 'Jim', 'location': {'city': 'Spokane', 'state': 'WA', 'postalCode': '99207'}, 'active': True}
{'id': 1, 'last': 'Doe', 'first': 'John', 'location': {'city': 'Oakland', 'state': 'CA', 'postalCode': '94607'}, 'active': True}
{'id': 1, 'last': 'Doe', 'first': 'John', 'location': {'city': 'Oakland', 'state': 'CA', 'postalCode': '94607'}, 'active': True}
{'id': 2, 'last': 'Doe', 'first': 'Jane', 'location': {'city': 'San Francisco', 'state': 'CA', 'postalCode': '94105'}, 'active': True}
{'id': 3, 'last': 'Black', 'first': 'Jim', 'location': {'city': 'Spokane', 'state': 'WA', 'postalCode': '99207'}, 'active': True}
{'id': 4, 'last': 'Frost', 'first': 'Jack', 'location': {'city': 'Seattle', 'state': 'WA', 'postalCode': '98204'}, 'active': False}
If this were a test or a serious coding challenge, you would probably want to look carefully at matches() to make sure it properly handles edge cases (I didn't do that).
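One such edge case is a query that expects a nested object where the record has a scalar or no value at all, so there is nothing to call `.get()` on. Below is one possible, more defensive variant. It is a sketch, not the only reasonable behaviour: the name `matches_safe` is made up here, and treating a missing key as "no match" is an assumption.

```python
def matches_safe(obj, query):
    """Return True if obj satisfies query; dicts in query are matched recursively."""
    if not isinstance(query, dict):
        # Leaf value in the query: plain equality.
        return obj == query
    if not isinstance(obj, dict):
        # The query expects a nested object, but the record has a scalar or None here.
        return False
    # Assumption: a key that is absent from the record never matches.
    return all(key in obj and matches_safe(obj[key], q) for key, q in query.items())


records = [
    {"id": 3, "location": {"city": "Spokane", "state": "WA"}, "active": True},
    {"id": 5, "location": "unknown", "active": True},   # scalar instead of an object
    {"id": 6, "active": False},                         # no location at all
]

query = {"location": {"state": "WA"}}
print([r["id"] for r in records if matches_safe(r, query)])  # [3]
print([r["id"] for r in records if matches_safe(r, {})])     # [3, 5, 6]
```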

understanding nested python dict comprehension

I am getting along with dict comprehensions and trying to understand how the below 2 dict comprehensions work:
select_vals = ['name', 'pay']
test_dict = {'data': [{'name': 'John', 'city': 'NYC', 'pay': 70000}, {'name': 'Mike', 'city': 'NYC', 'pay': 80000}, {'name': 'Kate', 'city': 'Houston', 'pay': 65000}]}
dict_comp1 = [{key: item[key] for key in select_vals } for item in test_dict['data'] if item['pay'] > 65000 ]
The above line gets me
[{'name': 'John', 'pay': 70000}, {'name': 'Mike', 'pay': 80000}]
dict_comp2 = [{key: item[key]} for key in select_vals for item in test_dict['data'] if item['pay'] > 65000 ]
The above line gets me
[{'name': 'John'}, {'name': 'Mike'}, {'pay': 70000}, {'pay': 80000}]
How do the two outputs differ when written as for loops? When I write it as a for loop:
dict_comp3 = []
for key in select_vals:
    for item in test_dict['data']:
        if item['pay'] > 65000:
            dict_comp3.append({key: item[key]})
print(dict_comp3)
The above line gets me same as dict_comp2
[{'name': 'John'}, {'name': 'Mike'}, {'pay': 70000}, {'pay': 80000}]
How do I get the same output as dict_comp1 with a for loop?
The select_vals iteration should be the inner one:
result = []
for item in test_dict['data']:
    if item['pay'] > 65000:
        aux = {}
        for key in select_vals:
            aux[key] = item[key]
        result.append(aux)
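To make the correspondence explicit, here is a small self-contained check. The names `loop1` and `loop2` are introduced just for this sketch. In dict_comp1 the inner dict comprehension is the element expression, so it runs once per item. In dict_comp2 both `for` clauses belong to the list comprehension and nest left to right, so each (key, item) pair produces its own one-entry dict.

```python
select_vals = ['name', 'pay']
test_dict = {'data': [{'name': 'John', 'city': 'NYC', 'pay': 70000},
                      {'name': 'Mike', 'city': 'NYC', 'pay': 80000},
                      {'name': 'Kate', 'city': 'Houston', 'pay': 65000}]}

dict_comp1 = [{key: item[key] for key in select_vals}
              for item in test_dict['data'] if item['pay'] > 65000]
dict_comp2 = [{key: item[key]}
              for key in select_vals
              for item in test_dict['data'] if item['pay'] > 65000]

# Loop form of dict_comp1: one dict per item, filled by the inner key loop.
loop1 = []
for item in test_dict['data']:
    if item['pay'] > 65000:
        loop1.append({key: item[key] for key in select_vals})

# Loop form of dict_comp2: the leftmost "for" becomes the outermost loop.
loop2 = []
for key in select_vals:
    for item in test_dict['data']:
        if item['pay'] > 65000:
            loop2.append({key: item[key]})

print(loop1 == dict_comp1, loop2 == dict_comp2)  # True True
```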

compare two different length lists of dictionaries in python

I want to compare the two lists of dictionaries below. The Name key is common to both lists.
If a Name matches in both lists, I want to do some further processing with the data.
PerfData = [
{'Name': 'abc', 'Type': 'Ex1', 'Access': 'N1', 'perfStatus':'Latest Perf', 'Comments': '07/12/2017 S/W Version'},
{'Name': 'xyz', 'Type': 'Ex1', 'Access': 'N2', 'perfStatus':'Latest Perf', 'Comments': '11/12/2017 S/W Version upgrade failed'},
{'Name': 'efg', 'Type': 'Cust1', 'Access': 'A1', 'perfStatus':'Old Perf', 'Comments': '11/10/2017 S/W Version upgrade failed, test data is active'}
]
beatData = [
{'Name': 'efg', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.632'},
{'Name': 'abc', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.896'}
]
Thanks
Rajeev
l = [{'name': 'abc'}, {'name': 'xyz'}]
k = [{'name': 'a'}, {'name': 'abc'}]
[i['name'] for i in l for f in k if i['name'] == f['name']]
I hope the above logic works for you.
The answer provided doesn't assign the result to a variable. If you want to print it, the following would work:
result = [i['name'] for i in l for f in k if i['name'] == f['name']]
print(result)
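If the goal is to work with the matching records rather than just their names, one option is to index one list by Name and look each record of the other list up in it. The sketch below uses trimmed copies of the lists from the question; `beat_by_name` and `matched` are names made up for illustration, and it assumes each Name appears at most once in beatData.

```python
PerfData = [
    {'Name': 'abc', 'Type': 'Ex1', 'perfStatus': 'Latest Perf'},
    {'Name': 'xyz', 'Type': 'Ex1', 'perfStatus': 'Latest Perf'},
    {'Name': 'efg', 'Type': 'Cust1', 'perfStatus': 'Old Perf'},
]
beatData = [
    {'Name': 'efg', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.632'},
    {'Name': 'abc', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.896'},
]

# Index beatData by Name so each lookup is a dict access instead of a nested loop.
beat_by_name = {row['Name']: row for row in beatData}

matched = []
for perf in PerfData:
    beat = beat_by_name.get(perf['Name'])
    if beat is not None:
        # "Other stuff" goes here; this sketch simply merges the two records.
        matched.append({**perf, **beat})

print([m['Name'] for m in matched])  # ['abc', 'efg']
```

The lists can have different lengths; names that appear in only one of them (such as 'xyz') are simply skipped.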

Merging similar dictionaries in a list together

New to python here. I've been pulling my hair for hours and still can't figure this out.
I have a list of dictionaries:
[ {'FX0XST001.MID5': '195', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
{'FX0XST001.MID13': '4929', 'Name': 'Firmicutes', 'Taxonomy ID': '1239','Type': 'phylum'},
{'FX0XST001.MID6': '826', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
.
.
.
.
{'FX0XST001.MID6': '125', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
{'FX0XST001.MID25': '70', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
{'FX0XST001.MID40': '40', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'} ]
I want to merge the dictionaries in the list based on their Type, Name, and Taxonomy ID
[ {'FX0XST001.MID5': '195', 'FX0XST001.MID13': '4929', 'FX0XST001.MID6': '826', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'}
.
.
.
.
{'FX0XST001.MID6': '125', 'FX0XST001.MID25': '70', 'FX0XST001.MID40': '40', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'}]
I have the data structure setup like this because I need to write the data to CSV using csv.DictWriter later.
Would anyone kindly point me to the right direction?
You can use the groupby function for this:
http://docs.python.org/library/itertools.html#itertools.groupby
from itertools import groupby
keyfunc = lambda row: (row['Type'], row['Taxonomy ID'], row['Name'])
data = sorted(data, key=keyfunc)

# Option 1: merge the matching rows into one item, so you end up with what you wanted
result = []
for k, g in groupby(data, keyfunc):
    item = {}
    for row in g:
        item.update(row)
    result.append(item)

# Option 2 (use instead of option 1): add the matched rows as subitems of a parent
# dictionary, which might come in handy if you need to work with just the parts
# that are different
result = []
for k, g in groupby(data, keyfunc):
    item = {'Type': k[0], 'Taxonomy ID': k[1], 'Name': k[2], 'matches': []}
    for row in g:
        del row['Type']
        del row['Taxonomy ID']
        del row['Name']
        item['matches'].append(row)
    result.append(item)
Make some test data:
list_of_dicts = [
{"Taxonomy ID":1, "Name":"Bob", "Type":"M", "hair":"brown", "eyes":"green"},
{"Taxonomy ID":1, "Name":"Bob", "Type":"M", "height":"6'2''", "weight":200},
{"Taxonomy ID":2, "Name":"Alice", "Type":"F", "hair":"black", "eyes":"hazel"},
{"Taxonomy ID":2, "Name":"Alice", "Type":"F", "height":"5'7''", "weight":145}
]
I think this (below) is a neat trick using reduce that improves upon the other groupby solution.
import itertools
from functools import reduce  # needed on Python 3, where reduce is not a builtin
def key_func(elem):
    return (elem["Taxonomy ID"], elem["Name"], elem["Type"])
output_list_of_dicts = [reduce(lambda x, y: x.update(y) or x, list(val))
                        for key, val in itertools.groupby(list_of_dicts, key_func)]
Then print the output:
for elem in output_list_of_dicts:
    print(elem)
This prints (the key order may differ depending on the Python version):
{'eyes': 'green', 'Name': 'Bob', 'weight': 200, 'Taxonomy ID': 1, 'hair': 'brown', 'height': "6'2''", 'Type': 'M'}
{'eyes': 'hazel', 'Name': 'Alice', 'weight': 145, 'Taxonomy ID': 2, 'hair': 'black', 'height': "5'7''", 'Type': 'F'}
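A caveat for any groupby-based approach: itertools.groupby only groups consecutive items with equal keys, so input that is not already ordered by the key needs to be sorted first. The test data above happens to be grouped already. The sketch below uses made-up rows (`rows` and `name_key` are illustrative names) to show the difference.

```python
from itertools import groupby

# Hypothetical rows where the two 'Bob' entries are NOT adjacent.
rows = [
    {'Name': 'Bob', 'hair': 'brown'},
    {'Name': 'Alice', 'hair': 'black'},
    {'Name': 'Bob', 'weight': 200},
]

def name_key(row):
    return row['Name']

# Without sorting, 'Bob' shows up as two separate groups.
unsorted_groups = [k for k, _ in groupby(rows, name_key)]

# Sorting by the same key first makes equal keys adjacent.
sorted_groups = [k for k, _ in groupby(sorted(rows, key=name_key), name_key)]

print(unsorted_groups)  # ['Bob', 'Alice', 'Bob']
print(sorted_groups)    # ['Alice', 'Bob']
```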
FYI, Python Pandas is far better for this sort of aggregation, especially when dealing with file I/O to .csv or .h5 files, than the itertools stuff.
Perhaps the easiest thing to do would be to create a new dictionary, indexed by a (Type, Name, Taxonomy ID) tuple, and iterate over your dictionary, storing values by (Type, Name, Taxonomy ID). Use a default dict to make this easier. For example:
from collections import defaultdict
grouped = defaultdict(dict)
# iterate over items and store:
for entry in list_of_dictionaries:
    grouped[(entry["Type"], entry["Name"], entry["Taxonomy ID"])].update(entry)
# now you have everything stored the way you want in values, and you don't
# need the dict anymore
grouped_entries = grouped.values()
This is a bit hackish, especially because you end up overwriting "Type", "Name", and "Taxonomy ID" every time you use update, but since your dict keys are variable, that might be the best you can do. This will get you at least close to what you need.
Even better would be to do this on your initial import and skip intermediate steps (unless you actually need to transform the data beforehand). Plus, if you could get at the only varying field, you could change the update to just: grouped[(type, name, taxonomy_id)][key] = value where key and value are something like: 'FX0XST001.MID5', '195'
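A rough sketch of that last idea follows. It assumes the import step can yield one (type, name, taxonomy id, sample key, value) tuple per measurement; the `raw_rows` layout is invented here for illustration and is not taken from the question.

```python
from collections import defaultdict

# Hypothetical import format: one tuple per measurement.
raw_rows = [
    ('phylum', 'Firmicutes', '1239', 'FX0XST001.MID5', '195'),
    ('phylum', 'Firmicutes', '1239', 'FX0XST001.MID13', '4929'),
    ('phylum', 'Acidobacteria', '57723', 'FX0XST001.MID6', '125'),
    ('phylum', 'Firmicutes', '1239', 'FX0XST001.MID6', '826'),
]

# Store only the varying field per group; the shared fields live in the dict key.
grouped = defaultdict(dict)
for type_, name, taxonomy_id, key, value in raw_rows:
    grouped[(type_, name, taxonomy_id)][key] = value

# Rebuild flat rows (e.g. for csv.DictWriter) by adding the shared fields back once.
grouped_entries = [
    {'Type': type_, 'Name': name, 'Taxonomy ID': taxonomy_id, **samples}
    for (type_, name, taxonomy_id), samples in grouped.items()
]

for entry in grouped_entries:
    print(entry)
```

Because the grouping is keyed by a dict rather than done with groupby, the input does not need to be sorted.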
from itertools import groupby
data = [ {'FX0XST001.MID5': '195', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type':'phylum'},
{'FX0XST001.MID13': '4929', 'Name': 'Firmicutes', 'Taxonomy ID': '1239','Type': 'phylum'},
{'FX0XST001.MID6': '826', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
{'FX0XST001.MID6': '125', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
{'FX0XST001.MID25': '70', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
{'FX0XST001.MID40': '40', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'} ,]
kk = ('Name', 'Taxonomy ID', 'Type')
def key(item):
    return tuple(item[k] for k in kk)
result = []
data = sorted(data, key=key)
for k, g in groupby(data, key):
    result.append(dict((i, j) for d in g for i, j in d.items()))
print(result)
