Parsing child nodes from JSON file with Python

I'm trying to parse specific child nodes from a JSON file using Python.
I know similar questions have been asked and answered before, but I simply haven't been able to translate those solutions to my own problem (disclaimer: I'm not a developer).
This is the beginning of my JSON file (each new "entry" starts at "_index"):
{
"took": 83,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"failed": 0
},
"hits": {
"total": 713628,
"max_score": 1.3753585,
"hits": [{
"_index": "offentliggoerelser-prod-20161006",
"_type": "offentliggoerelse",
"_id": "urn:ofk:oid:5135592",
"_score": 1.3753585,
"_source": {
"cvrNummer": 89986915,
"indlaesningsId": "AUzWhUXw3pscZq1LGK_z",
"sidstOpdateret": "2015-04-20T10:53:09.154Z",
"omgoerelse": false,
"regNummer": null,
"offentliggoerelsestype": "regnskab",
"regnskab": {
"regnskabsperiode": {
"startDato": "2014-01-01",
"slutDato": "2014-12-31"
}
},
"indlaesningsTidspunkt": "2015-04-20T11:10:53.529Z",
"sagsNummer": "X15-AA-66-TA",
"dokumenter": [{
"dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzdlL2I5L2U2LzlkLzIxN2EtNDA1OC04Yjg0LTAwZGJlNzUwMjU3Yw.pdf",
"dokumentMimeType": "application/pdf",
"dokumentType": "AARSRAPPORT"
}, {
"dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzk0LzNlL2RjL2Q4L2I1NjUtNGJjZC05NzJmLTYyMmE4ZTczYWVhNg.xhtml",
"dokumentMimeType": "application/xhtml+xml",
"dokumentType": "AARSRAPPORT"
}, {
"dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzc5LzM3LzUwLzMxL2NjZWQtNDdiNi1hY2E1LTgxY2EyYjRmOGYzMw.xml",
"dokumentMimeType": "application/xml",
"dokumentType": "AARSRAPPORT"
}],
"offentliggoerelsesTidspunkt": "2015-04-20T10:53:09.075Z"
}
},
More specifically, I'm trying to extract all "dokumentUrl" where "dokumentMimeType" is equal to "application/xhtml+xml".
When I use something simple like this:
import json
from pprint import pprint
with open('output.json') as data_file:
    data = json.load(data_file)
pprint(data['hits']['hits'][0]['_source']['dokumenter'][1]['dokumentUrl'])
I get the first URL that matches my criteria. But how do I create a list of all URLs (all 713,628 of them) from the file that meet the criteria mentioned above, and export it to a CSV file?
I should probably mention that my end goal is to create a program that can loop through my list of URLs and scrape them (I'll save that for another post!).

Hopefully I understand this right, and @roganjosh has a similar idea. You can loop through the specific parts which contain lists of useful things. So, we can do something like:
myURL = []
hits = data['hits']['hits']
for hit in hits:
    # Making the assumption here that you want all of the URLs associated with a given document
    document = hit['_source']['dokumenter']
    for url in document:
        if url['dokumentMimeType'] == "application/xhtml+xml":
            myURL.append(url['dokumentUrl'])
Again, I am hoping that I understand your JSON schema enough that this does what you want it to. At least it should get you close.
Also just saw another part of your question regarding CSV outputting.
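For the CSV part, a minimal sketch could look like the one below. It assumes data has the structure shown in the question. The helper names collect_urls and write_urls_csv, the example.com URLs, and the tiny inline sample are made up for illustration, and the in-memory buffer only stands in for a real file.

```python
import csv
import io

def collect_urls(data, mime_type="application/xhtml+xml"):
    """Return every dokumentUrl whose dokumentMimeType equals mime_type."""
    urls = []
    for hit in data["hits"]["hits"]:
        # .get() with a default guards against hits without a document list
        for doc in hit["_source"].get("dokumenter", []):
            if doc.get("dokumentMimeType") == mime_type:
                urls.append(doc["dokumentUrl"])
    return urls

def write_urls_csv(urls, csv_file):
    """Write a header row followed by one URL per row to an open text file."""
    writer = csv.writer(csv_file)
    writer.writerow(["dokumentUrl"])
    writer.writerows([url] for url in urls)

# Tiny stand-in for the real file (hypothetical URLs)
sample = {"hits": {"hits": [{"_source": {"dokumenter": [
    {"dokumentUrl": "http://example.com/a.pdf",
     "dokumentMimeType": "application/pdf"},
    {"dokumentUrl": "http://example.com/a.xhtml",
     "dokumentMimeType": "application/xhtml+xml"},
]}}]}}

urls = collect_urls(sample)

# In-memory buffer for the demo; for a real file use
# open("urls.csv", "w", newline="", encoding="utf-8") instead
buffer = io.StringIO(newline="")
write_urls_csv(urls, buffer)
print(buffer.getvalue())
```

Passing an open file object rather than a path keeps the writer easy to try out without touching the disk. With the real data you would call collect_urls(data) on the dictionary returned by json.load.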

Related

How to parse specific data from JSON request

I'm getting into coding, and I'm wondering how I'd go about retrieving the data for "tag_id": 4 specifically.
I know how to get the data for status, but how would I go about getting specific data if there are multiple entries?
r = requests.get('url.com', headers = user_agent).json()
event = (r['status'])
print(event)
//////////////////
{
"status": "SUCCESS",
"status_message": "blah blah blah",
"pri_tag": [
{
"tag_id": 1,
"name": "Tag1"
},
{
"tag_id": 2,
"name": "Tag2"
},
{
"tag_id": 3,
"name": "Tag3"
},
{
"tag_id": 4,
"name": "Tag4"
}
]
}
The for loop answer is sufficient, but this is a good chance to learn how to use list comprehensions, which are ubiquitous and "pythonic":
desired_tag_name = [tag["name"] for tag in r["pri_tag"] if tag["tag_id"] == 4]
List comprehensions are advantageous for readability (I know it may not seem so the first time you look at one) and because they are often somewhat faster than the equivalent loop that appends to a list.
There is plenty of documentation and blog posts out there to understand the syntax better, and I don't prefer any particular one over another.
I think you're looking for something like:
tags = r["pri_tag"]  # r is the full parsed response; event only holds r['status']
for tag in tags:
    if tag['tag_id'] == 4:
        print(tag['name'])
Output:
Tag4
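If you only ever expect one matching entry, another option is next() on a generator expression, which stops at the first match and can return a default when nothing matches. This is just a sketch: the response dictionary below mirrors the JSON from the question, and find_tag_name is a made-up helper name.

```python
# Stand-in for the parsed response from the question
response = {
    "status": "SUCCESS",
    "status_message": "blah blah blah",
    "pri_tag": [
        {"tag_id": 1, "name": "Tag1"},
        {"tag_id": 2, "name": "Tag2"},
        {"tag_id": 3, "name": "Tag3"},
        {"tag_id": 4, "name": "Tag4"},
    ],
}

def find_tag_name(payload, tag_id):
    """Return the name of the first tag with the given id, or None."""
    matches = (tag["name"] for tag in payload["pri_tag"]
               if tag["tag_id"] == tag_id)
    return next(matches, None)

print(find_tag_name(response, 4))   # Tag4
print(find_tag_name(response, 99))  # None
```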

How can I skip the top hierarchy of json file and recreate it as a new json file?

I created an API call to download a json file which looks like this:
{
"_v": "19.10",
"_type": "store_result",
"count": 30,
"data": [
{
"_type": "store",
"address1": "46 Fre...",
.....,
},
{
"_type": "store",
"address1": "915 ....',
.....,
},
{
.....,
.....,
}]
I want to add this data to a table but am having trouble loading it with the top hierarchy. How can I recreate this file with just the JSON objects in "data":[] while skipping the top 3 lines ("_v": "19.10", "_type": "store_result", "count": 30,)?
Since no example code was posted, something like the following might suit you.
import requests
data = requests.get(yourURL).json()  # yourURL is a placeholder for your API endpoint
results = data['data']
for result in results:
    # Iterates over the JSON objects in the data array
    # You could write each one to a file here instead
    print(result)
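To actually recreate the file with only the objects under "data", one possible approach is to serialize just that list. The payload below is a made-up stand-in for the downloaded response, and records_as_json and stores_only.json are hypothetical names.

```python
import json

# Hypothetical stand-in for the downloaded response
payload = {
    "_v": "19.10",
    "_type": "store_result",
    "count": 2,
    "data": [
        {"_type": "store", "address1": "1 Example St"},
        {"_type": "store", "address1": "2 Example Ave"},
    ],
}

def records_as_json(payload, key="data"):
    """Serialize only the list stored under key, dropping the wrapper fields."""
    return json.dumps(payload[key], indent=2)

stores_json = records_as_json(payload)
print(stores_json)

# To save it, write the string to a file of your choice, for example:
# with open("stores_only.json", "w", encoding="utf-8") as out_file:
#     out_file.write(stores_json)
```

The resulting text is a plain JSON array of store objects, which most table loaders handle more easily than the wrapped response.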

Accessing nested objects with python

I have a response that I receive from Foursquare in the form of JSON. I have tried to access certain parts of the object but have had no success. How would I access, say, the address of the object? Here is the code that I have tried.
url = 'https://api.foursquare.com/v2/venues/explore'
params = dict(client_id=foursquare_client_id,
              client_secret=foursquare_client_secret,
              v='20170801', ll=''+lat+','+long+'',
              query=mealType, limit=100)
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)
msg = '{} {}'.format("Restaurant Address: ",
                     data['response']['groups'][0]['items'][0]['venue']['location']['address'])
print(msg)
Here is an example of json response:
"items": [
{
"reasons": {
"count": 0,
"items": [
{
"summary": "This spot is popular",
"type": "general",
"reasonName": "globalInteractionReason"
}
]
},
"venue": {
"id": "412d2800f964a520df0c1fe3",
"name": "Central Park",
"contact": {
"phone": "2123106600",
"formattedPhone": "(212) 310-6600",
"twitter": "centralparknyc",
"instagram": "centralparknyc",
"facebook": "37965424481",
"facebookUsername": "centralparknyc",
"facebookName": "Central Park"
},
"location": {
"address": "59th St to 110th St",
"crossStreet": "5th Ave to Central Park West",
"lat": 40.78408342593807,
"lng": -73.96485328674316,
"labeledLatLngs": [
{
"label": "display",
"lat": 40.78408342593807,
"lng": -73.96485328674316
}
],
the full response can be found here
Like so
addrs=data['items'][2]['location']['address']
Your code (at least as far as loading and accessing the object) looks correct to me. I loaded the JSON from a file (since I don't have your Foursquare id) and it worked fine. You are correctly using object/dictionary keys and array positions to navigate to what you want. However, you misspelled "address" in the line where you drill down to the data. Adding the missing 'a' made it work. I'm also correcting the typo in the URL you posted.
I answered this assuming that the example JSON you linked to is what is stored in data. If that isn't the case, a relatively easy way to see exactly what Python has stored in data is to import pprint and use it like so: pprint.pprint(data).
You could also start an interactive python shell by running the program with the -i switch and examine the variable yourself.
data["items"][2]["location"]["address"]
This will access the address for you.
You can go to any level of nesting by using an integer index for an array and a string index for a dict.
In your case, items is an array:
# items[int index]
items[0]
Now items[0] is a dictionary, so we access it by string index:
items[0]['location']
Again this is an object, so we use a string index:
items[0]['location']['address']
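When a path mixes dict keys and list positions this deeply, a small helper that walks the path and returns a default on any missing step can make experimenting less painful. The sketch below is not part of any library: dig is a made-up name, and the data dictionary is a trimmed stand-in that follows the path used in the question's own code.

```python
def dig(obj, *path, default=None):
    """Follow string keys (dicts) and integer indexes (lists) in order."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, out-of-range index, or wrong container type
            return default
    return obj

# Trimmed stand-in for the Foursquare response
data = {"response": {"groups": [{"items": [
    {"venue": {"name": "Central Park",
               "location": {"address": "59th St to 110th St"}}},
]}]}}

address = dig(data, "response", "groups", 0, "items", 0,
              "venue", "location", "address")
missing = dig(data, "response", "groups", 0, "items", 5,
              "venue", "location", "address")

print(address)  # 59th St to 110th St
print(missing)  # None
```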

TypeError: string indices must be integers // working with JSON as dict in python

Okay, so I've been banging my head on this for the last 2 days, with no real progress. I am a beginner with python and coding in general, but this is the first issue I haven't been able to solve myself.
So I have this long JSON-formatted file with about 7000 entries from the YouTube API.
Right now I want a short script to print certain info ('videoId') for a certain dictionary key (referred to as 'key'):
My script:
import json
f = open ('path file.txt', 'r')
s = f.read()
trailers = json.loads(s)
print(trailers['key']['Items']['id']['videoId'])
# print(trailers['key']['videoId']) gives the same response
Error:
print(trailers['key']['Items']['id']['videoId'])
TypeError: string indices must be integers
It does work when I want to print all the information for the dictionary key:
This script works
import json
f = open ('path file.txt', 'r')
s = f.read()
trailers = json.loads(s)
print(trailers['key'])
Also print(type(trailers)) results in class 'dict', as it's supposed to.
My JSON File is formatted like this and is from the youtube API, youtube#searchListResponse.
{
"kind": "youtube#searchListResponse",
"etag": "",
"nextPageToken": "",
"regionCode": "",
"pageInfo": {
"totalResults": 1000000,
"resultsPerPage": 1
},
"items": [
{
"kind": "youtube#searchResult",
"etag": "",
"id": {
"kind": "youtube#video",
"videoId": ""
},
"snippet": {
"publishedAt": "",
"channelId": "",
"title": "",
"description": "",
"thumbnails": {
"default": {
"url": "",
"width": 120,
"height": 90
},
"medium": {
"url": "",
"width": 320,
"height": 180
},
"high": {
"url": "",
"width": 480,
"height": 360
}
},
"channelTitle": "",
"liveBroadcastContent": "none"
}
}
]
}
What other information is needed to be given for you to understand the problem?
The following code gives me all the videoIds from the provided sample data (which in fact contains only empty IDs):
import json
with open('sampledata', 'r') as datafile:
    data = json.loads(datafile.read())
print([item['id']['videoId'] for item in data['items']])
Perhaps you can try this with more data.
Hope this helps.
I didn't really look into the YouTube API, but looking at the code and the sample you gave, it seems you missed a [0]. In the structure of the JSON, the items key holds a list.
import json
f = open ('json1.json', 'r')
s = f.read()
trailers = json.loads(s)
print(trailers['items'][0]['id']['videoId'])
I've not used JSON before at all, but as far as I understand it is basically imported as dicts containing more dicts, lists and so on, where applicable.
So when you do type(trailers) you get type dict. Then you index that dict with trailers['key']. If you check the type of that, it should also be a dict if things work correctly. Working through the items at each level like this should eventually locate your error.
Python's error says you are indexing a string with a string key; strings only accept integer indices, while you are treating the value as a dict. So you need to find out at which level you are getting a string rather than a dict.
Edit to add an example: if your dict contains a string under the key 'item', then you get a string in return, not a new dict that you can index further. items in the JSON, for example, seems to be a list with dicts in it, not a dict itself.
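To make the type-checking idea concrete, here is a small sketch that inspects each level before indexing further and then reproduces the error on purpose. The raw string is a cut-down, made-up version of the search response, and "abc123" is a placeholder videoId.

```python
import json

# Cut-down stand-in for the YouTube search response
raw = '{"items": [{"id": {"kind": "youtube#video", "videoId": "abc123"}}]}'
trailers = json.loads(raw)

# Check the type at each level before indexing further
print(type(trailers).__name__)              # dict
print(type(trailers["items"]).__name__)     # list
print(type(trailers["items"][0]).__name__)  # dict

# items is a list, so loop over it (or pick an index) before using keys
video_ids = [item["id"]["videoId"] for item in trailers["items"]]
print(video_ids)  # ['abc123']

# Indexing a string with a string key reproduces the error from the question
try:
    trailers["items"][0]["id"]["videoId"]["oops"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

The exact wording of the message differs slightly between Python versions, but it always starts with "string indices must be integers".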

Error while parsing json from IBM watson using python

I am trying to parse out a JSON download using python and here is the download that I have:
{
"document_tone":{
"tone_categories":[
{
"tones":[
{
"score":0.044115,
"tone_id":"anger",
"tone_name":"Anger"
},
{
"score":0.005631,
"tone_id":"disgust",
"tone_name":"Disgust"
},
{
"score":0.013157,
"tone_id":"fear",
"tone_name":"Fear"
},
{
"score":1.0,
"tone_id":"joy",
"tone_name":"Joy"
},
{
"score":0.058781,
"tone_id":"sadness",
"tone_name":"Sadness"
}
],
"category_id":"emotion_tone",
"category_name":"Emotion Tone"
},
{
"tones":[
{
"score":0.0,
"tone_id":"analytical",
"tone_name":"Analytical"
},
{
"score":0.0,
"tone_id":"confident",
"tone_name":"Confident"
},
{
"score":0.0,
"tone_id":"tentative",
"tone_name":"Tentative"
}
],
"category_id":"language_tone",
"category_name":"Language Tone"
},
{
"tones":[
{
"score":0.0,
"tone_id":"openness_big5",
"tone_name":"Openness"
},
{
"score":0.571,
"tone_id":"conscientiousness_big5",
"tone_name":"Conscientiousness"
},
{
"score":0.936,
"tone_id":"extraversion_big5",
"tone_name":"Extraversion"
},
{
"score":0.978,
"tone_id":"agreeableness_big5",
"tone_name":"Agreeableness"
},
{
"score":0.975,
"tone_id":"emotional_range_big5",
"tone_name":"Emotional Range"
}
],
"category_id":"social_tone",
"category_name":"Social Tone"
}
]
}
}
I am trying to parse out 'tone_name' and 'score' from the above file and I am using following code:
import urllib
import json
url = urllib.urlopen('https://watson-api-explorer.mybluemix.net/tone-analyzer/api/v3/tone?version=2016-05-19&text=I%20am%20happy')
data = json.load(url)
for item in data['document_tone']:
    print item["tone_name"]
I keep running into an error saying that tone_name is not defined.
As jonrsharpe said in a comment:
data['document_tone'] is a dictionary, but 'tone_name' is a key in dictionaries much further down the structure.
You need to access the dictionary that tone_name is in. If I am understanding the JSON correctly, tone_name is a key within tones, within tone_categories, within document_tone. You would then want to change your code to go to that level, like so:
for item in data['document_tone']['tone_categories']:
    # item is an anonymous dictionary
    for thing in item['tones']:
        print(thing['tone_name'])
More than one for loop is needed because of the mix of lists and dictionaries in the file. 'tone_categories' is a list of dictionaries, so the outer loop visits each one of those. The inner loop then iterates through the list 'tones', which is in each category and full of more dictionaries. Those dictionaries are the ones that contain 'tone_name', so it prints the value of 'tone_name'.
If this does not work, let me know. I was unable to test it since I could not get the rest of the code to work on my computer.
You are incorrectly walking the structure. The root node has a single document_tone key, the value of which only has the tone_categories key. Each of the categories has a list of tones and its name. Here is how you would print it out (adjust as needed):
for cat in data['document_tone']['tone_categories']:
    print('Category:', cat['category_name'])
    for tone in cat['tones']:
        print('-', tone['tone_name'])
The result of this is:
Category: Emotion Tone
- Anger
- Disgust
- Fear
- Joy
- Sadness
Category: Language Tone
- Analytical
- Confident
- Tentative
Category: Social Tone
- Openness
- Conscientiousness
- Extraversion
- Agreeableness
- Emotional Range
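Since the question asks for both 'tone_name' and 'score', the same walk can collect the two values together. This is only a sketch: tone_scores is a made-up helper name, and sample is a shortened copy of the response from the question.

```python
def tone_scores(data):
    """Return (tone_name, score) pairs from every tone category."""
    return [
        (tone["tone_name"], tone["score"])
        for category in data["document_tone"]["tone_categories"]
        for tone in category["tones"]
    ]

# Shortened copy of the response shown in the question
sample = {"document_tone": {"tone_categories": [
    {"tones": [
        {"score": 0.044115, "tone_id": "anger", "tone_name": "Anger"},
        {"score": 1.0, "tone_id": "joy", "tone_name": "Joy"},
    ],
     "category_id": "emotion_tone",
     "category_name": "Emotion Tone"},
]}}

pairs = tone_scores(sample)
for name, score in pairs:
    print(name, score)
```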
