How to extract JSON from script with Python?


I am parsing a scraped HTML page that contains a script with JSON inside. This JSON contains all the information I am looking for, but I can't figure out how to extract it as valid JSON.
Minimal example:
my_string = '''
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{ *placeholder representing valid JSON inside* }
);
})()
'''
The json inside is valid according to jsonlinter.
The result should be loaded into a dictionary:
import json
import re
my_json = re.findall(r'.*(?={\").*', my_string)[0]  # extract json
data = json.loads(my_json)
# print(data)
regex: https://regex101.com/r/r0OYZ0/1
This attempt results in:
>>> data = json.loads(my_json)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<console>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
How can the JSON be extracted and loaded from the string with Python 3.7.x?

You can try to extract it with this regex. It covers a very simple case and might not handle all possible JSON variations:
my_string = '''
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{"tst":{"f":3}}
);
})()
'''
import re
# \s* allows for the newline between "push(" and the JSON, and before ")"
result = re.findall(r"push\(\s*([{\[].*:.*[}\]])\s*\)", my_string)[0]
result
>>> '{"tst":{"f":3}}'
To parse it into a dictionary:
import json
dictionary = json.loads(result)
type(dictionary)
>>>dict
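If the regex turns out to be too fragile (for example when the JSON contains a ")" inside a string), one alternative is to let the json module find the end of the object itself. This is only a sketch: the helper name extract_pushed_json and the "push(" marker are my own assumptions about the page, and the sample payload is made up.

```python
import json

def extract_pushed_json(script_text, marker="push("):
    """Decode the first JSON value that follows `marker` in the script text."""
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so advance past it by hand
    while script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value and ignores whatever trails it
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj

my_string = '''
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{"tst": {"f": 3}, "note": "a ) inside a string"}
);
})()
'''

data = extract_pushed_json(my_string)
print(data["tst"]["f"])
```

Because the decoder stops at the end of the first complete value, the trailing ");" and the rest of the script do not need to be stripped.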

Have a look at the below. Note that { *placeholder representing valid JSON inside* } has to be a valid JSON.
my_string = '''
<script>
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{"foo":["bar1", "bar2"]}
);
})()
</script>
'''
import re, json
my_json = re.findall(r'.*(?={\").*', my_string)[0].strip()
data = json.loads(my_json)
print(data)
Output:
{'foo': ['bar1', 'bar2']}

The my_string provided here is not valid JSON. For valid JSON, you can use json.loads(JSON_STRING)
import json
d = json.loads('{"test":2}')
print(d) # Prints the dictionary `{'test': 2}`

Related

python beautiful-soap json - scrape one page but not the other similar ones

I'm trying to scrape a nutrition website, and the following code works:
import requests
from bs4 import BeautifulSoup
import json
import re
page = requests.get("https://nutritiondata.self.com/facts/nut-and-seed-products/3071/1")
soup = BeautifulSoup(page.content, 'html.parser')
scripts = soup.find_all("script")
for script in scripts:
    if 'foodNutrients = ' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('foodNutrients =')[-1]
        jsonStr = jsonStr.rsplit('fillSpanValues')[0]
        jsonStr = jsonStr.rsplit(';',1)[0]
        jsonStr = "".join(jsonStr.split())
        valid_json = re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', jsonStr)
        jsonObj = json.loads(valid_json)
        # These are in terms of 100 grams. I also calculated for per serving
        g_per_serv = int(jsonObj['FOODSERVING_WEIGHT_1'].split('(')[-1].split('g')[0])
        for k, v in jsonObj.items():
            if k == 'NUTRIENT_0':
                conv_v = (float(v)*g_per_serv)/100
                print ('%s : %s (per 100 grams) | %s (per serving %s' %(k, round(float(v)), round(float(conv_v)), jsonObj['FOODSERVING_WEIGHT_1'] ))
But when I try to use it on other, almost identical pages on the same domain, it fails. For example, if I use
page = requests.get("https://nutritiondata.self.com/facts/vegetables-and-vegetable-products/2383/2")
I get the error
Traceback (most recent call last):
File "scrape_test_2.py", line 20, in <module>
jsonObj = json.loads(valid_json)
File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5446 (char 5445)
Looking at the source code of both pages, they seem identical in the sense that they both have
<script type="text/javascript">
<!--
foodNutrients = { NUTRIENT_142: ........
which is the part being scraped.
I've been looking at this all day. Does anyone know how to make this script work for both pages? What is the problem here?

I would switch to using hjson which allows unquoted keys and simply extract the entire foodNutrients variable and parse rather than manipulating strings over and over.
Your error:
Your code currently fails because at least one of the source arrays has a different number of elements, so the regex you use to sanitize the text no longer fits. Let's examine only the first known occurrence...
In first url, before you use regex to clean you have:
aifr:"[ -35, -10 ]"
after:
"aifr":"[-35,-10]"
In second you start with a different length array:
aifr:"[ 163, 46, 209, 179, 199, 117, 11, 99, 7, 5, 82 ]"
after regex replace, instead of:
"aifr":"[ 163, 46, 209, 179, 199, 117, 11, 99, 7, 5, 82 ]"
you have:
"aifr":"[163,"46",209,"179",199,"117",11,"99",7,"5",82]"
i.e. invalid json. No more nicely delimited key:value pairs.
Nutshell:
Use hjson; it's easier. Or update the regex to handle variable-length arrays.
import requests, re, hjson
urls = ['https://nutritiondata.self.com/facts/nut-and-seed-products/3071/1','https://nutritiondata.self.com/facts/vegetables-and-vegetable-products/2383/2']
p = re.compile(r'foodNutrients = (.*?);')
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        jsonObj = hjson.loads(p.findall(r.text)[0])
        serving_weight = jsonObj['FOODSERVING_WEIGHT_1']
        g_per_serv = int(serving_weight.split('(')[-1].split('g')[0])
        nutrient_0 = jsonObj['NUTRIENT_0']
        conv_v = float(nutrient_0)*g_per_serv/100
        print('%s : %s (per 100 grams) | %s (per serving %s' %(nutrient_0, round(float(nutrient_0)), round(float(conv_v)), serving_weight))
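For the "update the regex" route, here is a rough sketch that quotes only bare keys (an identifier directly followed by a colon) and leaves the values alone, so the length of the arrays inside the strings no longer matters. The sample text is a made-up stand-in for the real foodNutrients object, and the pattern would still misfire if a string value itself contained something like ", word:", so treat it as an assumption-laden fallback rather than a general parser.

```python
import json
import re

# A bare key: "{" or "," followed by an identifier and then a colon
KEY_RE = re.compile(r'([{,])\s*([A-Za-z_]\w*)\s*:')

def quote_bare_keys(js_object_text):
    """Wrap unquoted object keys in double quotes; values are untouched."""
    return KEY_RE.sub(r'\1"\2":', js_object_text)

# Hypothetical sample shaped like the page's foodNutrients object
sample = '{ NUTRIENT_0: "553", aifr: "[ 163, 46, 209 ]", FOODSERVING_WEIGHT_1: "1 oz (28g)" }'

obj = json.loads(quote_bare_keys(sample))
g_per_serv = int(obj['FOODSERVING_WEIGHT_1'].split('(')[-1].split('g')[0])
print(obj['aifr'], g_per_serv)
```

Numbers inside the array string are never touched because the pattern requires a key to start with a letter or underscore and to be followed by a colon.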

Python breaks parsing json with characters \"

I'm trying to parse a JSON string that contains an escape character (of some sort, I guess):
{
"publisher": "\"O'Reilly Media, Inc.\""
}
The parser works if I remove the \" characters from the string.
The exceptions raised by different parsers are:
json
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 17 column 20 (char 392)
ujson
ValueError: Unexpected character in found when decoding object value
How do I make the parser handle these escaped characters?
update:
ps. json is imported as ujson in this example
This is what my ide shows
The comma was just added accidentally; there is no trailing comma at the end of the JSON, and the JSON is valid.
the string definition.

You almost certainly did not define properly escaped backslashes. If you define the string properly the JSON parses just fine:
>>> import json
>>> json_str = r'''
... {
... "publisher": "\"O'Reilly Media, Inc.\""
... }
... ''' # raw string to prevent the \" from being interpreted by Python
>>> json.loads(json_str)
{u'publisher': u'"O\'Reilly Media, Inc."'}
Note that I used a raw string literal to define the string in Python; if I did not, the \" would be interpreted by Python and a regular " would be inserted. You'd have to double the backslash otherwise:
>>> print '\"'
"
>>> print '\\"'
\"
>>> print r'\"'
\"
Reencoding the parsed Python structure back to JSON shows the backslashes re-appearing, with the repr() output for the string using the same double backslash:
>>> json.dumps(json.loads(json_str))
'{"publisher": "\\"O\'Reilly Media, Inc.\\""}'
>>> print json.dumps(json.loads(json_str))
{"publisher": "\"O'Reilly Media, Inc.\""}
If you did not escape the \ escape you'll end up with unescaped quotes:
>>> json_str_improper = '''
... {
... "publisher": "\"O'Reilly Media, Inc.\""
... }
... '''
>>> print json_str_improper
{
"publisher": ""O'Reilly Media, Inc.""
}
>>> json.loads(json_str_improper)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 3 column 20 (char 22)
Note that the \" sequences now are printed as ", the backslash is gone!
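The same comparison in Python 3, as a small self-contained sketch (the variable names good and bad are mine): the raw literal keeps the backslashes and parses, while the plain literal loses them and fails.

```python
import json

# Raw literal: the two characters \" reach the JSON parser unchanged
good = r'''{"publisher": "\"O'Reilly Media, Inc.\""}'''

# Plain literal: Python turns \" into " before json ever sees the text
bad = '''{"publisher": "\"O'Reilly Media, Inc.\""}'''

parsed = json.loads(good)
print(parsed["publisher"])

try:
    json.loads(bad)
    bad_parsed = True
except json.JSONDecodeError as exc:
    bad_parsed = False
    print("plain literal failed:", exc.msg)
```

If the JSON comes from a file or an HTTP response rather than a literal in your source code, no extra escaping is needed, because Python's string-literal rules never apply to it.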

Your JSON is invalid. If you have questions about your JSON objects, you can always validate them with JSONlint. In your case you have an object
{
"publisher": "\"O'Reilly Media, Inc.\"",
}
and you have an extra comma indicating that something else should be coming. So JSONlint yields
Parse error on line 2:
...edia, Inc.\"", }
---------------------^
Expecting 'STRING'
which would begin to help you find where the error was.
Removing the comma for
{
"publisher": "\"O'Reilly Media, Inc.\""
}
yields
Valid JSON
Update: I'm keeping the stuff in about JSONlint as it may be helpful to others in the future. As for your well formed JSON object, I have
import json
d = {
"publisher": "\"O'Reilly Media, Inc.\""
}
print "Here is your string parsed."
print(json.dumps(d))
yielding
Here is your string parsed.
{"publisher": "\"O'Reilly Media, Inc.\""}
Process finished with exit code 0

json.loads not replacing apostrophe

I have a JSON object that I am loading, replacing single quotes with double quotes as I do so. The syntax for this is:
response = json.loads(response.text.replace("'", '"'))
Within my data I have key/value pairs that take the format:
"name":"John O'Shea"
This is causing me to get the following traceback:
Traceback (most recent call last):
File "C:\Python27\Whoscored\Test.py", line 204, in <module>
response = json.loads(response.text.replace("'", '"').replace(',,', ','))
File "C:\Python27\lib\json\__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting ',' delimiter: line 1 column 7751 (char 7750)
I don't actually want to replace the apostrophe in a name such as the one above, but I would have thought that my json.loads statement would have converted my key/value pair to this:
"name":"John O"Shea"
I'm assuming this would also fail however. What I need to know is:
1) Why is my json.loads statement not replacing the apostrophes in my string during the load?
2) What is the best way to escape the apostrophes within my string so that they do not cause an error, but are still displayed in the load?
I have used a JSON tester on my larger string to confirm that there are no other errors that would stop the object from loading correctly, and there are none.
Thanks

JSON uses " as its string delimiter, so response.text.replace("'", '"') just corrupts the data. JSON escapes double quotes inside strings as \", so this should parse:
response = json.loads(response.text.replace("'", '\\"'))
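It is worth checking whether any replacement is needed at all. An apostrophe is an ordinary character inside a JSON string, so if the response is already valid JSON (as the JSON tester suggests), it can be loaded untouched. A small sketch with a made-up payload standing in for response.text:

```python
import json

# Hypothetical stand-in for response.text
text = '{"name": "John O\'Shea"}'

# No replacement: the apostrophe survives as-is
player = json.loads(text)
print(player["name"])

# Blind replacement produces "John O"Shea", which is no longer valid JSON
corrupted = text.replace("'", '"')
try:
    json.loads(corrupted)
    corrupted_parsed = True
except json.JSONDecodeError:
    corrupted_parsed = False
```

Note that replacing the apostrophe with \" does parse, but the name then comes back as John O"Shea rather than John O'Shea.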

If your JSON consumer can't handle special (non-ASCII) characters, it is better to convert them to Unicode escapes. For example, in Java:
private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // skip the second char of a surrogate pair
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp);
        } else {
            // zero-pad to four hex digits, as JSON's \uXXXX escape requires;
            // code points above U+FFFF would need a surrogate pair instead
            retStr.append(String.format("\\u%04x", cp));
        }
    }
    return retStr.toString();
}
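Since the question is about Python, note that the standard json module already does this kind of escaping when it serializes: json.dumps uses ensure_ascii=True by default. A minimal sketch, with a made-up record:

```python
import json

record = {"name": "Zoë O'Shea"}  # hypothetical sample data

ascii_only = json.dumps(record)                  # ensure_ascii=True is the default
as_is = json.dumps(record, ensure_ascii=False)   # keep non-ASCII characters literally

print(ascii_only)
print(as_is)
```

Both forms decode back to the same Python object, and the apostrophe is left alone in either case.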

How to convert json response into Python list

I get the JSON response by requests.get
req = requests.get(SAMPLE_SCHEDULE_API)
and convert it into dictionary
data = json.loads(req.text)["data"]
When I tried to convert the string into Python dict,
I got ValueError: malformed node or string:
ast.literal_eval(data)
I have no idea how to do this task.
code snippets
def schedules(cls, start_date=None, end_date=None):
    import ast
    req = requests.get(SAMPLE_SCHEDULE_API)
    data = json.loads(req.text)["data"]
    ast.literal_eval(data)
    return pd.DataFrame(json.loads(req.text)["data"])
JSON response
{
status: "ok",
version: "v1",
data: "[
{"_id":"2015-01-28","end_date":"2015-01-28","estimated_release":1422453600000,"is_projection":false,"is_statement":true,"material_link":null,"start_date":"2015-01-27"},
{"_id":"2015-03-18","end_date":"2015-03-18","estimated_release":1426687200000,"is_projection":false,"is_statement":false,"material_link":null,"start_date":"2015-03-17"},
{"_id":"2015-04-29","end_date":"2015-04-29","estimated_release":1430316000000,"is_projection":false,"is_statement":false,"material_link":null,"start_date":"2015-04-28"},
{"_id":"2015-06-17","end_date":"2015-06-17","estimated_release":1434549600000,"is_projection":false,"is_statement":false,"material_link":null,"start_date":"2015-06-16"},
{"_id":"2015-07-29","end_date":"2015-07-29","estimated_release":1438178400000,"is_projection":false,"is_statement":false,"material_link":null,"start_date":"2015-07-28"}]"
}
Detail error message
Traceback (most recent call last):
File "fomc.py", line 25, in <module>
schedules = FOMC.schedules()
File "fomc.py", line 21, in schedules
ast.literal_eval(data)
File "/usr/local/Cellar/python3/3.3.2/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ast.py", line 86, in literal_eval
return _convert(node_or_string)
File "/usr/local/Cellar/python3/3.3.2/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ast.py", line 58, in _convert
return list(map(_convert, node.elts))
File "/usr/local/Cellar/python3/3.3.2/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ast.py", line 63, in _convert
in zip(node.keys, node.values))
File "/usr/local/Cellar/python3/3.3.2/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ast.py", line 62, in <genexpr>
return dict((_convert(k), _convert(v)) for k, v
File "/usr/local/Cellar/python3/3.3.2/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ast.py", line 85, in _convert
raise ValueError('malformed node or string: ' + repr(node))
ValueError: malformed node or string: <_ast.Name object at 0x10a19c990>

You have encoded the data twice (which would strictly not be necessary). You just need to decode the data again with json.loads:
def schedules(cls, start_date=None, end_date=None):
    req = requests.get(SAMPLE_SCHEDULE_API)
    data_json = json.loads(req.text)["data"]
    data = json.loads(data_json)
    return pd.DataFrame(data)
Do note that ast.literal_eval is for Python literals, whereas json.loads is for JSON, which closely follows JavaScript syntax; the differences include true, false and null vs. True, False and None. The former are the JavaScript syntax used in JSON (and thus you would need json.loads); the latter are Python literals, for which you would use ast.literal_eval.
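To see the double encoding in isolation, here is a sketch that builds a response body of the same shape and decodes it in two steps. The body is constructed by hand with trimmed, made-up fields; it is not the real API output.

```python
import json

# Build a body shaped like the API response: "data" holds a JSON *string*
inner = [{"_id": "2015-01-28", "is_projection": False, "material_link": None}]
body = json.dumps({"status": "ok", "version": "v1", "data": json.dumps(inner)})

outer = json.loads(body)          # first decode: "data" is still a str
rows = json.loads(outer["data"])  # second decode: now a list of dicts

print(type(outer["data"]).__name__, type(rows).__name__)
print(rows[0]["is_projection"], rows[0]["material_link"])
```

After the second decode, JSON's false and null have become Python's False and None, which is exactly the conversion ast.literal_eval cannot do.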

As the response is already in JSON format, you do not need to encode it. Try an approach like this:
req = requests.get(SAMPLE_SCHEDULE_API)
data_str = req.json().get('data')
json_data = json.loads(data_str)
The json() method parses the response body as JSON and returns the resulting Python object.

The field "data" is a string, not a list. The content of that string seems to be JSON, too, so you have JSON encapsulated in JSON for some reason. If you can, fix that so that you only encode as JSON once. If that doesn't work, you can retrieve that field and decode it separately.

JSON ValueError: Unterminated string

My script works, but it sometimes crashes with this error:
Traceback (most recent call last):
File "planetafm.py", line 6, in <module>
songs = json.loads(json_data)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 32 (char 31)
For example, this response causes it:
rdsData({"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix)","artist":"Lana Del Rey","startDate":"2014-09-07 21:48:51","duration":"2014-09-07 21:48:51"}})
sourcecode:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
json_data = re.match('rdsData\((.*?)\)', response.content).group(1)
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')
Why is that JSON invalid? How can I fix this?
Thanks for answers!

Your regexp stops at the closing parenthesis inside the text, "(Radio Mix)". You can fix it by anchoring the pattern with $ at the end:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
print response.content
json_data = re.match('rdsData\((.*?)\)$', response.content).group(1)
print json_data
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')

Your method of extracting is flawed; your expression terminates at the first ) character:
>>> import re
>>> import requests
>>> url = "http://rds.eurozet.pl/reader/var/planeta.json"
>>> r = requests.get(url)
>>> re.match('rdsData\((.*?)\)', r.content).group(1)
'{"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix'
Rather than use a regular expression, just partition the value out using str.partition() and str.rpartition():
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
json_data = response.content.partition('(')[-1].rpartition(')')[0]
songs = json.loads(json_data)
Demo:
>>> json_data = r.content.partition('(')[-1].rpartition(')')[0]
>>> json.loads(json_data)['now']
{u'duration': u'2014-09-07 21:48:51', u'startDate': u'2014-09-07 21:48:51', u'artist': u'Lana Del Rey', u'id': u'0052-55', u'title': u'Summertime Sadness (Radio Mix)'}
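A Python 3 flavoured sketch of the same partition approach, using the payload from the question as a hard-coded string so it runs without the network. With requests on Python 3 you would pass response.text (a str) rather than response.content (bytes) to the helper; the name strip_jsonp is my own.

```python
import json

def strip_jsonp(text):
    """Return what sits between the first '(' and the last ')'."""
    return text.partition('(')[-1].rpartition(')')[0]

# Payload copied from the question
payload = ('rdsData({"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix)",'
           '"artist":"Lana Del Rey","startDate":"2014-09-07 21:48:51",'
           '"duration":"2014-09-07 21:48:51"}})')

songs = json.loads(strip_jsonp(payload))
print(songs['now']['artist'] + " - " + songs['now']['title'])
```

Taking the first "(" and the last ")" means parentheses inside the song title cannot cut the JSON short.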
