Counting Items in Python from a JSON file - python

I'm trying to search a data file, for example Yelp.json. It has businesses in it in LA, Boston, DC.
I wrote this:
# Python 2
# read json
with open('updated_data.json') as facts_data:
    data = json.load(facts_data)
# return every unique locality along with how often it occurs
locality = []
unique_locality = []
# Load items into lists
for item in data:
    locality.append(data["payload"]["locality"])
    if data["payload"]["locality"] not in unique_locality:
        print unique_locality.append(data["payload"]["locality"])
# Loops over unique_locality and count from locality
print "Unique Locality Count:", unique_locality, locality.count(data["payload"]["locality"])
But I get an answer of "Portsmouth 1", which means it is not listing all the cities and might not even be giving all the counts. My goal for this section is to search that JSON file and have it say "DC: 10 businesses, LA: 20 businesses, Boston: 2 businesses." Each payload is a grouping of info about a single business, and "locality" is just the city. So I want it to find how many unique cities there are and then how many businesses are in each city. One payload could be Starbucks in LA, another could be Starbucks in DC, and another could be Chipotle in LA.
Example of JSON file (JSONlite.com says it's valid):
"payload": {
"existence_full": 1,
"geo_virtual": "[\"56.9459720|-2.1971226|20|within_50m|4\"]",
"latitude": "56.945972",
"locality": "Stonehaven",
"_records_touched": "{\"crawl\":8,\"lssi\":0,\"polygon_centroid\":0,\"geocoder\":0,\"user_submission\":0,\"tdc\":0,\"gov\":0}",
"address": "The Lodge, Dunottar",
"email": "dunnottarcastle@btconnect.com",
"existence_ml": 0.5694238217658721,
"domain_aggregate": "",
"name": "Dunnottar Castle",
"search_tags": ["Dunnottar Castle Aberdeenshire", "Dunotter Castle"],
"admin_region": "Scotland",
"existence": 1,
"category_labels": [
["Landmarks", "Buildings and Structures"]
],
"post_town": "Stonehaven",
"region": "Kincardineshire",
"review_count": "719",
"geocode_level": "within_50m",
"tel": "01569 762173",
"placerank": 65,
"longitude": "-2.197123",
"placerank_ml": 37.27916073464469,
"fax": "01330 860325",
"category_ids_text_search": "",
"website": "http://www.dunnottarcastle.co.uk",
"status": "1",
"geocode_confidence": "20",
"postcode": "AB39 2TL",
"category_ids": [108],
"country": "gb",
"_geocode_quality": "4",
"uuid": "3867aaf3-12ab-434f-b12b-5d627b3359c3"
},
"payload": {
"existence_full": 1,
"geo_virtual": "[\"56.237480|-5.073578|20|within_50m|4\"]",
"latitude": "56.237480",
"locality": "Inveraray",
"_records_touched": "{\"crawl\":11,\"lssi\":0,\"polygon_centroid\":0,\"geocoder\":0,\"user_submission\":0,\"tdc\":0,\"gov\":0}",
"address": "Cherry Park",
"email": "enquiries@inveraray-castle.com",
"longitude": "-5.073578",
"domain_aggregate": "",
"name": "Inveraray Castle",
"admin_region": "Scotland",
"search_tags": ["Inveraray Castle Tea Room", "Inverary Castle"],
"existence": 1,
"category_labels": [
["Social", "Food and Dining", "Restaurants"]
],
"region": "Argyll",
"review_count": "532",
"geocode_level": "within_50m",
"tel": "01499 302203",
"placerank": 67,
"post_town": "Inveraray",
"placerank_ml": 41.19978087352266,
"fax": "01499 302421",
"category_ids_text_search": "",
"website": "http://www.inveraray-castle.com",
"status": "1",
"geocode_confidence": "20",
"postcode": "PA32 8XE",
"category_ids": [347],
"country": "gb",
"_geocode_quality": "4",
"existence_ml": 0.7914881102847783,
"uuid": "8278ab80-2cd1-4dbd-9685-0d0036b681eb"
},

If your JSON looks something like
{"payload":{ CONTENT_A }, "payload":{ CONTENT_B }, ..., "payload":{ CONTENT_LAST }}
it is a syntactically valid JSON string, but when you parse it with json.loads, each repeated "payload" key overwrites the previous one, so the result is
{"payload":{ CONTENT_LAST }}
That is why you end up with one city and a count of one.
You can verify this behaviour on this online json parser http://json.parser.online.fr/ by checking JS eval field.
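The overwriting can also be checked directly in Python. The sketch below is only an illustration: the string raw is a made-up miniature of the file, and keep_duplicates is a hypothetical helper name, not part of the json module. It relies on the object_pairs_hook parameter of json.loads, which receives every object as a list of (key, value) pairs before any duplicates are collapsed.

```python
import json

# Hypothetical miniature of the file: one object that repeats the "payload" key.
raw = '{"payload": {"locality": "Stonehaven"}, "payload": {"locality": "Inveraray"}}'

# Default behaviour: the last duplicate key wins.
default_result = json.loads(raw)
print(default_result)

def keep_duplicates(pairs):
    """Return the raw (key, value) pairs if a key repeats, otherwise a normal dict."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        return pairs
    return dict(pairs)

# The outer object has repeated keys, so it comes back as a list of pairs.
payloads = [value for _, value in json.loads(raw, object_pairs_hook=keep_duplicates)]
print(payloads)
```

If the real file is shaped like this, payloads would be the list of business dictionaries that the counting code needs, without editing the file by hand.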
In this case, one way to preprocess your JSON string is to drop the redundant "payload" keys and put the content dictionaries directly in a list. The JSON string then has the following format:
[{CONTENT_A}, {CONTENT_B}, ..., {CONTENT_LAST}]
Assume your JSON string is now a list of payload dictionaries and you have loaded it with data = json.loads(json_str).
As you iterate through the payloads, build a lookup table along the way.
This handles repeated cities automatically, since businesses in the same city are appended to the same list.
city_business_map = {}
for payload in data:
    city = payload['locality']
    business = payload['name']
    if city not in city_business_map:
        city_business_map[city] = []
    city_business_map[city].append(business)
Then later on, you can easily present the solution by
for city, business_list in city_business_map.items():
    print city, len(business_list)
If you want to count the unique businesses in each city, initialize each value to a set instead of a list.
If that is overkill, skip the list or set and just keep an integer counter for each key.
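The per-key counter idea can be sketched with collections.Counter from the standard library. The data list here is a made-up sample that stands in for the parsed list of payload dictionaries; the city names and the output wording are assumptions taken from the question's description.

```python
from collections import Counter

# Hypothetical sample standing in for the parsed list of payload dictionaries.
data = [
    {"name": "Starbucks", "locality": "LA"},
    {"name": "Starbucks", "locality": "DC"},
    {"name": "Chipotle", "locality": "LA"},
]

# Counter tallies how many payloads mention each city.
city_counts = Counter(payload["locality"] for payload in data)

# most_common() lists the cities from most to fewest businesses.
for city, count in city_counts.most_common():
    print("{}: {} businesses".format(city, count))
```

len(city_counts) then gives the number of unique cities.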

Related

Python: Iterate JSON and remove items with specific criteria

I am trying to filter out data from API JSON response with Python and I get weird results. I would be glad if somebody can guide me how to deal with the situation.
The main idea is to remove irrelevant data in the JSON and keep only the data that is associated with particular people which I hold in a list.
Here is a snip of the JSON file:
{
"result": [
{
"number": "Number1",
"short_description": "Some Description",
"assignment_group": {
"display_value": "Some value",
"link": "https://some_link.com"
},
"incident_state": "Closed",
"sys_created_on": "2020-03-30 11:51:24",
"priority": "4 - Low",
"assigned_to": {
"display_value": "John Doe",
"link": "https://some_link.com"
}
},
{
"number": "Number2",
"short_description": "Some Description",
"assignment_group": {
"display_value": "Some value",
"link": "https://some_link.com"
},
"incident_state": "Closed",
"sys_created_on": "2020-03-10 11:07:13",
"priority": "4 - Low",
"assigned_to": {
"display_value": "Tyrell Greenley",
"link": "https://some_link.com"
}
},
{
"number": "Number3",
"short_description": "Some Description",
"assignment_group": {
"display_value": "Some value",
"link": "https://some_link.com"
},
"incident_state": "Closed",
"sys_created_on": "2020-03-20 10:23:35",
"priority": "4 - Low",
"assigned_to": {
"display_value": "Delmar Vachon",
"link": "https://some_link.com"
}
},
{
"number": "Number4",
"short_description": "Some Description",
"assignment_group": {
"display_value": "Some value",
"link": "https://some_link.com"
},
"incident_state": "Closed",
"sys_created_on": "2020-03-30 11:51:24",
"priority": "4 - Low",
"assigned_to": {
"display_value": "Samual Isham",
"link": "https://some_link.com"
}
}
]
}
Here is the Python code:
users_test = ['Ahmad Wickert', 'Dick Weston', 'Gerardo Salido', 'Rosendo Dewey', 'Samual Isham']
# Load JSON file
with open('extract.json', 'r') as input_file:
    input_data = json.load(input_file)
# Create a function to clear the data
def clear_data(data, users):
    """Filter out the data and leave only records for the names in the users_test list"""
    for elem in data:
        print(elem['assigned_to']['display_value'] not in users)
        if elem['assigned_to']['display_value'] not in users:
            print('Removing {} from JSON as not present in list of names.'.format(elem['assigned_to']['display_value']))
            data.remove(elem)
        else:
            print('Keeping the record for {} in JSON.'.format(elem['assigned_to']['display_value']))
    return data
cd = clear_data(input_data['result'], users_test)
And here is the output, which seems to iterate through only 2 of the items in the file:
True
Removing John Doe from JSON as not present in list of names.
True
Removing Delmar Vachon from JSON as not present in list of names.
Process finished with exit code 0
It seems that the problem is more or less related to the .remove() method however I don't find any other suitable solution to delete these particular items that I do not need.
Here is the output of the iteration without applying the remove() method:
True
Removing John Doe from JSON as not present in list of names.
True
Removing Tyrell Greenley from JSON as not present in list of names.
True
Removing Delmar Vachon from JSON as not present in list of names.
False
Keeping the record for Samual Isham in JSON.
Process finished with exit code 0
Note: I have left the check for the name visible on purpose.
I would appreciate any ideas to sort out the situation.
If you don't need to log info about people you are removing you could simply try
filtered = [i for i in data['result'] if i['assigned_to']['display_value'] in users_test]
users_test = ['Ahmad Wickert', 'Dick Weston', 'Gerardo Salido', 'Rosendo Dewey', 'Samual Isham']
solution = []
for user in users_test:
    print(user)
    for value in data['result']:
        if user == value['assigned_to']['display_value']:
            solution.append(value)
print(solution)
For more efficient code, as asked by @NomadMonad:
solution = list(filter(lambda x: x['assigned_to']['display_value'] in users_test, data['result']))
You are modifying a list while iterating through it at the same time, which makes the loop skip elements. Check out this blog post which describes this behavior.
A safer way to do this is to iterate over a copy of the list and delete from the original list:
import copy
def clear_data(data, users):
    """Filter out the data and leave only records for the names in the users_test list"""
    for elem in copy.deepcopy(data):  # iterate over a copy; deepcopy also copies the nested dicts
        if elem['assigned_to']['display_value'] not in users:
            data.remove(elem)  # removes the first item equal to elem from the original list
    return data
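To see why the original loop only visits two records, here is a minimal sketch with a trimmed-down, made-up version of the "result" list (only the assigned_to names are kept). It reproduces the skipping and then shows slice assignment as one possible way to filter the same list object in place; the variable names are illustrative.

```python
records = [
    {"assigned_to": {"display_value": "John Doe"}},
    {"assigned_to": {"display_value": "Tyrell Greenley"}},
    {"assigned_to": {"display_value": "Delmar Vachon"}},
    {"assigned_to": {"display_value": "Samual Isham"}},
]
users = ["Samual Isham"]

# Removing while iterating: each removal shifts the remaining items left,
# so the loop's internal index skips the item that moved into the freed slot.
buggy = list(records)
visited = []
for elem in buggy:
    visited.append(elem["assigned_to"]["display_value"])
    if elem["assigned_to"]["display_value"] not in users:
        buggy.remove(elem)
print(visited)

# Slice assignment replaces the contents of the same list object,
# so nothing is skipped and other references to the list see the change.
fixed = list(records)
fixed[:] = [e for e in fixed if e["assigned_to"]["display_value"] in users]
print(fixed)
```

The visited list matches the two names printed in the question's output, and the buggy list still contains a record that should have been removed.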

Parse JSON data with varying parent keys using python

Here is my working example
jsonData = {
"3": {
"map_id": "1",
"marker_id": "3",
"title": "Your title here",
"address": "456 Example Ave",
"desc": "Description",
"pic": "",
"icon": "",
"linkd": "",
"lat": "3.14",
"lng": "-22.98",
"anim": "0",
"retina": "0",
"category": "1",
"infoopen": "0",
"other_data": ["0"]
},
"4": {
"map_id": "1",
"marker_id": "4",
"title": "Title of Place",
"address": "123 Main St, City, State",
"desc": "insert description",
"pic": "",
"icon": "",
"linkd": "",
"lat": "1.23",
"lng": "-4.56",
"anim": "0",
"retina": "0",
"category": "0",
"infoopen": "0",
"other_data": ["0"]
}
}
I am having such a hard time getting the title and address keys. Here is what I have tried:
for each in testJson:
    print(each["title"])
and I get the following error TypeError: string indices must be integers. I don't understand why this isn't working.
I have tried so many variations to get the key data, but I just can't get it to work. I can't really change the raw JSON either because my real JSON data is a huge file. I have looked on stackoverflow for a similarly formatted JSON example (e.g., here) but have come up short. I assume there is something wrong with the way my JSON is formatted, because I have parsed JSON before with the above code without any problems.
You're getting that error because you're looping over the keys, which are strings and don't have title and address entries. Those entries live in the inner dictionaries, which are the values of the outer dictionary.
Here is how you can iterate over the dict.values() instead:
for value in jsonData.values():
    print(value["title"], value["address"])
Which will give you the title and address:
Your title here 456 Example Ave
Title of Place 123 Main St, City, State
If you want to find out which key you're iterating over, you can loop over the (key, value) tuples from dict.items():
for key, value in jsonData.items():
    print(f"key = {key}, title = {value['title']}, address = {value['address']}")
Which will show the key with the address and title:
key = 3, title = Your title here, address = 456 Example Ave
key = 4, title = Title of Place, address = 123 Main St, City, State
for… in loops iterate over the keys of dictionaries, not their values. If you want to iterate over their values, you can either use .values():
for value in someDict.values():
…or you can iterate over items(), which will give you the key and the value as a tuple:
for key, value in someDict.items():
You get the error message because when you try to get title out of each, each is actually the key, i.e. "3" or "4". Python will let you get individual characters out of a string with things like someString[0] to get the first character, but a string cannot be indexed with another string, so someString['title'] raises a TypeError.
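The difference between keys and values can be checked with a short sketch. The raw string is a trimmed-down, hypothetical version of the data above, and markers, first_key and places are illustrative names.

```python
import json

# Hypothetical, trimmed-down version of the data above, parsed from a JSON string.
raw = (
    '{"3": {"title": "Your title here", "address": "456 Example Ave"},'
    ' "4": {"title": "Title of Place", "address": "123 Main St, City, State"}}'
)
markers = json.loads(raw)

# Iterating the dict directly yields the keys, which are plain strings.
first_key = next(iter(markers))
try:
    first_key["title"]
except TypeError as exc:
    print("indexing the key failed:", exc)

# Iterating the values yields the inner dictionaries.
places = [(m["title"], m["address"]) for m in markers.values()]
print(places)
```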

How to extract objects from nested lists from a Json file with Python?

I have a response that I receive from Lobbyview in the form of json. I tried to put it in data frame to access only some variables, but with no success. How can I access only some variables such as the id and the committees in a format exportable to .dta ? Here is the code I have tried.
import requests, json
query = {"naics": "424430"}
results = requests.post('https://www.lobbyview.org/public/api/reports',
                        data = json.dumps(query))
print(results.json())
import pandas as pd
b = pd.DataFrame(results.json())
_id = data["_id"]
committee = data["_source"]["specific_issues"][0]["bills_by_algo"][0]["committees"]
An observation of the json looks like this:
"_score": 4.421936,
"_type": "object",
"_id": "5EZUMbQp3hGKH8Uq2Vxuke",
"_source":
{
"issue_codes": ["CPT"],
"received": 1214320148,
"client_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
"amount": 240000,
"client":
{
"legal_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
"name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
"naics": null,
"gvkey": null,
"ticker": "Unlisted",
"id": null,
"bvdid": "US131283992L"},
"specific_issues": [
{
"text": "H.R. 34, H.R. 1908, H.R. 2336, H.R. 3093 S. 522, S. 681, S. 1145, S. 1745",
"bills_by_algo": [
{
"titles": ["To amend title 35, United States Code, to provide for patent reform.", "Patent Reform Act of 2007", "Patent Reform Act of 2007", "Patent Reform Act of 2007"],
"top_terms": ["Commerce", "Administrative fees"],
"sponsor":
{
"firstname": "Howard",
"district": 28,
"title": "rep",
"id": 400025
},
"committees": ["House Judiciary"],
"introduced": 1176868800,
"type": "HR", "id": "110_HR1908"},
{
"titles": ["To amend title 35, United States Code, relating to the funding of the United States Patent and Trademark Office."],
"top_terms": ["Commerce", "Administrative fees"],
"sponsor":
{
"firstname": "Howard",
"district": 28,
"title": "rep",
"id": 400025
},
"committees": ["House Judiciary"],
"introduced": 1179288000,
"type": "HR",
"id": "110_HR2336"
}],
"gov_entities": ["U.S. House of Representatives", "Patent and Trademark Office (USPTO)", "U.S. Senate", "UNDETERMINED", "U.S. Trade Representative (USTR)"],
"lobbyists": ["Valente, Thomas Silvio", "Wamsley, Herbert C"],
"year": 2007,
"issue": "CPT",
"id": "S4nijtRn9Q5NACAmbqFjvZ"}],
"year": 2007,
"is_latest_amendment": true,
"type": "MID-YEAR AMENDMENT",
"id": "1466CDCD-BA3D-41CE-B7A1-F9566573611A",
"alternate_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION"
},
"_index": "collapsed"}
Since the data you need is nested quite deeply in the JSON response, you have to loop through it and collect the values in a temporary list. To understand the response better, I would advise you to inspect the JSON structure with a tool such as this online JSON viewer. Not every entry contains the necessary data, so I catch the resulting KeyError with try and except. To make sure that each id stays matched with its committees, I add them to the list as small dicts. This list can then be read into pandas with ease. Saving to .dta requires converting the lists in the committees column to strings; alternatively, you might want to save as .csv for a more generally usable format.
import requests, json
import pandas as pd
query = {"naics": "424430"}
results = requests.post(
    "https://www.lobbyview.org/public/api/reports", data=json.dumps(query)
)
json_response = results.json()["result"]
# to save the JSON response
# with open("data.json", "w") as outfile:
#     json.dump(results.json()["result"], outfile)
resulting_data = []
# loop through the response
for data in json_response:
    # try to find entries with specific issues, bills_by_algo and committees
    try:
        # loop through the special issues
        for special_issue in data["specific_issues"]:
            _id = special_issue["id"]
            # loop through the bills_by_algo's
            for x in special_issue["bills_by_algo"]:
                # append the id and committees in a dict
                resulting_data.append({"id": _id, "committees": x["committees"]})
    except KeyError as e:
        print(e, "not found in entry.")
        continue
# create a DataFrame
df = pd.DataFrame(resulting_data)
# export of list objects in the column is not supported by .dta, therefore we convert
# to strings with ";" as delimiter
df["committees"] = ["; ".join(map(str, l)) for l in df["committees"]]
print(df)
df.to_stata("result.dta")
Results in
id committees
0 D8BxG5664FFb8AVc6KTphJ House Judiciary
1 D8BxG5664FFb8AVc6KTphJ Senate Judiciary
2 8XQE5wu3mU7qvVPDpUWaGP House Agriculture
3 8XQE5wu3mU7qvVPDpUWaGP Senate Agriculture, Nutrition, and Forestry
4 kzZRLAHdMK4YCUQtQAdCPY House Agriculture
.. ... ...
406 ZxXooeLGVAKec9W2i32hL5 House Agriculture
407 ZxXooeLGVAKec9W2i32hL5 Senate Agriculture, Nutrition, and Forestry; H...
408 ZxXooeLGVAKec9W2i32hL5 House Appropriations; Senate Appropriations
409 ahmmafKLfRP8wZay9o8GRf House Agriculture
410 ahmmafKLfRP8wZay9o8GRf Senate Agriculture, Nutrition, and Forestry
[411 rows x 2 columns]
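The same flattening can be tried offline with only the standard library. This is a sketch under assumptions: records is a small made-up list shaped like the "specific_issues" part of the response, and the CSV is written to an in-memory buffer rather than a real file. It uses dict.get with a default instead of try and except to skip entries that lack a level.

```python
import csv
import io

# Hypothetical records shaped like the "specific_issues" part of the response above.
records = [
    {"specific_issues": [
        {"id": "issue-1", "bills_by_algo": [
            {"committees": ["House Judiciary"]},
            {"committees": ["House Judiciary", "Senate Judiciary"]},
        ]},
    ]},
    {"client_name": "entry without specific_issues"},
]

rows = []
for record in records:
    # .get() with a default skips entries that lack a level instead of raising KeyError.
    for issue in record.get("specific_issues", []):
        for bill in issue.get("bills_by_algo", []):
            committees = "; ".join(bill.get("committees", []))
            rows.append({"id": issue["id"], "committees": committees})

# Write the flat rows as CSV text; the in-memory buffer stands in for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "committees"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The rows list has the same shape as resulting_data after the join step, so it could also be passed to pd.DataFrame.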

How to take two keys with a same name in JSON

The code below already takes "street": "Manhattan street 15", but how I can take "PL 300" since they have the same name?
My current python code:
contact_info = dict(business_id=business_id,
                    name=business_info['name'],
                    street=address['street'],
                    post_code=address['postCode'],
                    city=address['city'],
                    website=address['website'],
                    phone=address['phone'],
                    register_date=register_date
                    )
And this is the JSON format:
"addresses": [
{
"street": "Manhattan street 15",
"postCode": "53100",
"type": 1,
"city": "Monaco",
"country": "MC",
"website": null,
"phone": null,
"fax": null,
"registrationDate": "2014-11-17",
"endDate": null
},
{
"street": "PL 300",
"postCode": "00089",
"type": 2,
"city": "Halic",
"country": "Hc",
"website": null,
"phone": null,
"fax": null,
"registrationDate": "2014-11-17",
"endDate": null
}
]
The JSON you have posted is an array of objects, so you first have to pick the object you want to read the street from:
address = addresses[1]
street = address['street']
You can also iterate over the array.
It seems that address is a list with two dicts, so:
address[0]['street'] # will give you the street in the first dict
address[1]['street'] # will give you the street in the second dict
import json
with open('your.json') as f:  # json.loads expects a JSON string, so read the file with json.load
    business_info = json.load(f)
streets = [address['street'] for address in business_info['addresses']]
Try:
from urllib2 import urlopen
import json
url = 'http://example.com'
response = urlopen(url)
json_obj = json.load(response)
for i in json_obj['addresses']:
    print i['street']
It should work. It'll print all the street names within the addresses array.
For other values, you need to specify those key names as I did for street.
It's a JSON array with two contacts, therefore json["addresses"][0]["street"] and json["addresses"][1]["street"] are different.
import json
contact_infos = []
parsed_json = json.loads(json_string)
for addr in parsed_json["addresses"]:
    contact_infos.append(
        dict(
            business_id=9999,
            name="Jason Derulo",
            street=addr["street"],
            post_code=addr["postCode"],
            city=addr["city"],
            website=addr["website"],
            phone=addr["phone"],
            register_date=addr["registrationDate"]
        )
    )
# A list of two contact infos
print(contact_infos)
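If positions in the array are not guaranteed, one alternative is to look the addresses up by their "type" field. This is only a sketch: it assumes each "type" value appears at most once per business, which the question does not confirm, and business and street_by_type are illustrative names.

```python
# Hypothetical parsed data with the same shape as the "addresses" array above.
business = {
    "addresses": [
        {"street": "Manhattan street 15", "type": 1},
        {"street": "PL 300", "type": 2},
    ]
}

# Index the addresses by their "type" value so each street can be looked up directly.
street_by_type = {addr["type"]: addr["street"] for addr in business["addresses"]}

print(street_by_type[1])
print(street_by_type[2])
```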

Why am I unable to print JSON attributes in Python? What am I doing wrong?

I have the following code:
jobs = {"24": {"wage": "empty", "phone": "empty", "title": "sapfh", "description": "sod", "time": "twelve"}, "20": {"wage": "987g", "phone": "iudg", "time": "twelve", "description": "fdgsdfg", "title": "sfgji"}, "21": {"wage": "987g", "phone": "iudg", "title": "sfgji", "description": "fdgsdfg", "time": "twelve"}, "22": {"wage": "987g", "phone": "iudg", "time": "twelve", "description": "fdgsdfg", "title": "sfgji"}, "23": {"wage": "987g", "phone": "iudg", "title": "sfgji", "description": "fdgsdfg", "time": "twelve"}, "24": {"wage": "empty", "phone": "empty", "time": "twelve", "description": "sod", "title": "sapfh"}}
for job in jobs:
    print job["title"]
But it won't print out the title each time. I just get TypeError: string indices must be integers, not str but if I put 0 instead of "title" it just outputs the first character of the number (so all 2s).
When you iterate over a dictionary, you iterate over its keys. This means that your current code is iterating through the keys of jobs (which are strings).
You should use dict.values instead:
for val in jobs.values():
    print val["title"]
Now, the code is iterating through the values of jobs, which are the dictionaries.
If you want to have the keys and the values, you can use dict.items:
for key, val in jobs.items():
    print val["title"]
When you use for/in to iterate over a dictionary, it iterates over the dictionary's keys. As such, the iteration variable (job in this case) will contain each of the dictionary's keys in turn: in this case, it'll contain "24", "20", "21", etc.
You want to iterate over the dictionary values (each of the job dictionaries). You can then retrieve the title property of each. To do so, loop like this instead:
for job in jobs.values():
    print job["title"]
If you want both the keys and values, you can use iteritems as follows:
for job_key, job in jobs.iteritems():
    print "Job key: ", job_key
    print "Job title: ", job["title"]  # or jobs[job_key]["title"]
Note also that jobs is a Python dictionary literal, not JSON (there's actually no JSON involved at all). It also contains two "24" keys: Python accepts this, but the later entry silently replaces the earlier one, so the dictionary ends up with only five entries.
