Python extract json structure from html page

Python extract json structure from html page - python

in python i'm reading an html page content which contains a lot of stuff.
To do this i read the webpage as string by this way:
url = 'https://myurl.com/'
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')
if i print the reddit_data i can see correctly the whole html contents.
Now, inside it there's a structure like json that i would like to read and extract some fields from that.
Below the structure:
"dealDetails" : {
"f240141a" : {
"egressUrl" : "https://ccc.com",
"title" : "ZZZ",
"type" : "ghi",
},
"5f9ab246" : {
"egressUrl" : "https://www.bbb.com/",
"title" : "YYY",
"type" : "def",
},
"2bf6723b" : {
"egressUrl" : "https://www.aaa.com//",
"title" : "XXX",
"type" : "abc",
},
}
What i want to do is: find the dealDetails field and then for each f240141a 5f9ab246 2bf6723b
get the egressURL, title and type values.
Thanks

Try this,
[nested_dict['egressUrl'] for nested_dict in reddit_data['dealDetails'].keys()]
To access the values of JSON, you can consider as dictionary and use the same syntax to access values as well.
Edit-1:
Make sure your type of reddit_data is a dictionary.
if type(reddit_data) is str.
You need to do..
import ast
reddit_data = ast.literal_eval(reddit_data)
OR
import json
reddit_data = json.loads(reddit_data)

If you just wanted to know how to access the egressURL, title and the type. You might just wanna read the answer below! Be careful however, cause the following code won't work unless you converted your HTML file reddit_data in something like a dictionary ( Modified shaik moeed's answer a tiny bit to also return title and type) :
[(i['egressUrl'], i['title'], i['type']) for i in reddit_data['dealDetails'].keys()]
However, If I got it right, the part you're missing is the conversion from HTML to a JSON friendly file. What I personally use, even though it's quite unpopular, is the eval function
dictionary = eval(reddit_data)
This will convert the whole file into a dictionary, I recommend that you only use it on the part of the text that 'looks' like a dictionary! (One of the reason eval is unpopular, is because it won't convert strings like 'true'/'false' to Python's True/False, be careful with that :) )
Hope that helped!

Related

Python JSON formatting

So I have a problem with JSON. The data that I receive from an API looks like this.
{
"weather":[
{
"id":804,
"main":"Clouds",
"description":"overcast clouds",
"icon":"04d"
}
]
}
I can't read the data in the weather cell tough because it's wrapped between those '[ ]'. But if I try to create a JSON file but remove the "[ ]" and try to read it, it works. What can I do? Please help!

If I do the following it works just fine:
import json
data = '''{
"weather":[
{
"id":804,
"main":"Clouds",
"description":"overcast clouds",
"icon":"04d"
}
]
}'''
dict_data = json.loads(data)
print(dict_data.get("weather")[0].get("main"))
>>> "Clouds"
It works as expected. Because it is a list, you have to target the first item, here another dict, which holds the information you want.

Extract nested JSON values

I am trying to figure out on how to convert the following JSON object in to a dataframe
[
{
"id":"123",
"sources":[
{
"name":"ABC",
"first":"2020-02-26T03:19:23.247Z",
"last":"2020-02-26T03:19:23.247Z"
},
{
"name":"XYZ",
"first":"2020-02-26T03:19:23.247Z",
"last":"2020-02-26T03:19:23.247Z"
}
]
}
]
The dataframe should appear like this.
id ABC.first ABC.last XYZ.first XYZ.last
123 2020-02-26.. 2020-02-26.. 2020-02-26.. 2020-02-26T03:19:23.247Z
Thanks in advance for your help

Python includes the useful json module. You can find some documentation for it here.
To use it , you'll need to import it into your Python script using the import json command.
From there, you can use the json.loads() method and pass it a some JSON. The loads() method returns a Python dictionary which then lets you access the properties of your JSON object by name. Some example code might look like:
import json
# some JSON:
personJSON = '{"name":"John", "age":30, "city":"New York", "address": {"street": "Fake Street", "streetNumber": "123"}}'
# parse personJSON:
personDict = json.loads(personJSON)
# the result is a Python dictionary:
print(personDict["name"]) # Will print "John"
print(personDict["address"]["street"]) # Will print "Fake Street"
Good luck!

How to get a value from my JSON file, in python

I want you to ask for your help.
I need to browse folder with json files and I need to search for one attribute with specific value.
Here is my code:
def listDir():
fileNames = os.listdir(FOLDER_PATH)
for fileName in fileNames:
print(fileName)
with open('C:\\Users\\Kamčo\\Desktop\\UBLjsons\\' + fileName, 'r') as json_file:
data = json.load(json_file)
data_json = json.dumps(data, indent=2, sort_keys=True)
print(data_json)
for line in data_json:
if line['ID'] == 'Kamilko':
print("try") #just to be sure it accessed to this
I am getting this error: TypeError: string indices must be integers
I also tried to search for a solution here but it didnt help me.
Here is my JSON
{
"Invoice": {
"#xmlns": "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2",
"#xmlns:cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
"#xmlns:cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
"ID": "Kamilko",
"IssueDate": "2020-02-09",
"OrderReference": {
"ID": "22"
},
"InvoiceLine": {
"Price": {
"PriceAmount": {
"#currencyID": "EUR",
"#text": "23.50"
},
"BaseQuantity": {
"#unitCode": "C62",
"#text": "1"
}
}
}
}
}
do you have any idea how to do it?

You've loaded your file in using json.load(...). That'll convert the JSON data into a Python dictionary that you can use to access elements:
if data["Invoice"]["OrderReference"]["ID"] == 22:
print("try")
Note that you might want to check the relevant keys exist along the way, in case the structure of your file varies, or you could catch the KeyError that'll come up if the key doesn't exist, using try/except.
Some more background:
When you then call json.dumps(...), you're taking that handy python structure and converting it back into a hard-to-understand string again. I don't think you want or need to do this.
The specific error you have is because dumps has created a string. You're then trying to access an element of that string using the [ ] operator. Strings can only be indexed using integers, e.g. mystr[4], so Python doesn't understand what you're asking it to do with data_json["ID"].

JSON Parse an element inside an element in Python

I have a JSON text grabbed from an API of a website:
{"result":"true","product":{"made":{"Taiwan":"Taipei","HongKong":"KongStore","Area":"Asia"}}}
I want to capture "Taiwan" and "Taipei" but always fail.
Here is my code:
import json
weather = urllib2.urlopen('url')
wjson = weather.read()
wjdata = json.loads(wjson)
print wjdata['product']['made'][0]['Taiwan']
I always get the following error:
Keyword 0 error
Whats the correct way to parse that json?

You are indexing an array where there are none.
The JSON is the following:
{
"result":"true",
"product": {
"made": {
"Taiwan":"Taipei",
"HongKong":"KongStore",
"Area":"Asia"
}
}
}
And the above contains no arrays.
You are assuming the JSON structure to be something like this:
{
"result":"true",
"product": {
"made": [
{"Taiwan":"Taipei"},
{"HongKong":"KongStore"},
{"Area":"Asia"}
]
}
}
From a brief look at the doc pages for the json package, I found this conversion table: Conversion table using json.loads
It tells us that a JSON object translates to a dict. And a dict has a method called keys, which returns a list of the keys.
I suggest you try something like this:
#... omitted code
objectKeys = wjdata['product']['made'].keys()
# You should now have a list of the keys stored in objectKeys.
for key in objectKeys:
print key
if key == 'Taiwan':
print 'Eureka'
I haven't tested the above code, but I think you get the gist here :)

wjdata['product']['made']['Taiwan'] works

Confused by Python returning JSON as string instead of literal

I've done some coding in RoR, and in Rails, when I return a JSON object via an API call, it returns as
{ "id" : "1", "name" : "Dan" }.
However in Python (with Flask and Flask-SQLAlchemy), when I return a JSON object via json.dumps or jsonpickle.encode it is returned as
"{ \"id\" : \"1\", \"name\": \"Dan\" }" which seems very unwieldily as it can't easily be parsed on the other end (by an iOS app in this case - Obj-C).
What am I missing here, and what should I do to return it as a JSON literal, rather than a JSON string?
This is what my code looks like:
people = models.UserRelationships.query.filter_by(user_id=user_id, active=ACTIVE_RECORD)
friends = people.filter_by(friends=YES)
json_object = jsonpickle.encode(friends.first().as_dict(), unpicklable=False, keys=True)
print(json_object) # this prints here, i.e. { "id" : "1", "name" : "Dan" }
return json_object # this returns "{ \"id\" : \"1\", \"name\": \"Dan\" }" to the browser

What is missing in your understanding here is that when you use the JSON modules in Python, you're not working with a JSON object. JSON is by definition just a string that matches a certain standard.
Lets say you have the string:
friends = '{"name": "Fred", "id": 1}'
If you want to work with this data in python, you will want to load it into a python object:
import json
friends_obj = json.loads(friends)
At this point friends_obj is a python dictionary.
If you want to convert it (or any other python dictionary or list) then this is where json.dumps comes in handy:
friends_str = json.dumps(friends_obj)
print friends_str
'{"name": "Fred", "id": 1}'
However if we attempt to "dump" the original friends string you'll see you get a different result:
dumped_str = json.dumps(friends)
print dumped_str
'"{\\"name\\": \\"Fred\\", \\"id\\": 1}"'
This is because you're basically attempting to encode an ordinary string as JSON and it is escaping the characters. I hope this helps make sense of things!
Cheers

Looks like you are using Django here, in which case do something like
from django.utils import simplejson as json
...
return HttpResponse(json.dumps(friends.first().as_dict()))

This is almost always a sign that you're double-encoding your data somewhere. For example:
>>> obj = { "id" : "1", "name" : "Dan" }
>>> j = json.dumps(obj)
>>> jj = json.dumps(j)
>>> print(obj)
{'id': '1', 'name': 'Dan'}
>>> print(j)
{"id": "1", "name": "Dan"}
>>> print(jj)
"{\"id\": \"1\", \"name\": \"Dan\"}"
Here, jj is a perfectly valid JSON string representation—but it's not a representation of obj, it's a representation of the string j, which is useless.
Normally you don't do this directly; instead, either you started with a JSON string rather than an object in the first place (e.g., you got it from a client request or from a text file), or you called some function in a library like requests or jsonpickle that implicitly calls json.dumps with an already-encoded string. But either way, it's the same problem, with the same solution: Just don't double-encode.

You should be using flask.jsonify, which will not only encode correctly, but also set the content-type headers accordingly.
people = models.UserRelationships.query.filter_by(user_id=user_id, active=ACTIVE_RECORD)
friends = people.filter_by(friends=YES)
return jsonify(friends.first().as_dict())

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python extract json structure from html page - python

Related

Python JSON formatting

Extract nested JSON values

How to get a value from my JSON file, in python

JSON Parse an element inside an element in Python

Confused by Python returning JSON as string instead of literal

Categories

Resources