My script works, but sometimes crashes with this error:
Traceback (most recent call last):
File "planetafm.py", line 6, in <module>
songs = json.loads(json_data)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 32 (char 31)
For example, this response triggers it:
rdsData({"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix)","artist":"Lana Del Rey","startDate":"2014-09-07 21:48:51","duration":"2014-09-07 21:48:51"}})
Source code:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
json_data = re.match('rdsData\((.*?)\)', response.content).group(1)
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')
Why is that JSON invalid? How can I fix this?
Thanks for any answers!
Your regexp has a problem with the closing bracket inside the title text. You can fix it by anchoring the pattern with $:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
print response.content
json_data = re.match('rdsData\((.*?)\)$', response.content).group(1)
print json_data
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')
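The $ anchor forces the lazy (.*?) group to stretch all the way to the last closing parenthesis. A greedy group achieves the same thing; here is a sketch of the same script with only the regex changed:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
# greedy (.*) backtracks to the last ')', so the ')' inside the title no longer truncates the match
json_data = re.match(r'rdsData\((.*)\)', response.content).group(1)
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')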
Your method of extracting is flawed; your expression terminates at the first ) character:
>>> import re
>>> import requests
>>> url = "http://rds.eurozet.pl/reader/var/planeta.json"
>>> r = requests.get(url)
>>> re.match('rdsData\((.*?)\)', r.content).group(1)
'{"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix'
Rather than use a regular expression, just partition the value out using str.partition() and str.rpartition():
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
json_data = response.content.partition('(')[-1].rpartition(')')[0]
songs = json.loads(json_data)
Demo:
>>> json_data = r.content.partition('(')[-1].rpartition(')')[0]
>>> json.loads(json_data)['now']
{u'duration': u'2014-09-07 21:48:51', u'startDate': u'2014-09-07 21:48:51', u'artist': u'Lana Del Rey', u'id': u'0052-55', u'title': u'Summertime Sadness (Radio Mix)'}
I am parsing a scraped HTML page that contains a script with JSON inside. This JSON contains all the info I am looking for, but I can't figure out how to extract it as valid JSON.
Minimal example:
my_string = '''
(function(){
    window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
    window.__PRELOADED_STATE__.push(
        { *placeholder representing valid JSON inside* }
    );
})()
'''
The JSON inside is valid according to a JSON linter.
The result should be loaded into a dictionary:
import json
import re

my_json = re.findall(r'.*(?={\").*', my_string)[0]  # extract the JSON line
data = json.loads(my_json)
# print(data)
regex: https://regex101.com/r/r0OYZ0/1
This attempt results in:
>>> data = json.loads(my_json)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<console>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
How can the JSON be extracted and loaded from the string with Python 3.7.x?
You can try extracting it with this regex; it covers a very simple case and might not handle all possible JSON variations:
my_string = '''
(function(){
    window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
    window.__PRELOADED_STATE__.push(
        {"tst":{"f":3}}
    );
})()
'''
import re

result = re.findall(r"push\(\s*([{\[].*?[}\]])\s*\)", my_string, re.S)[0]
result
>>> '{"tst":{"f":3}}'
To parse it into a dictionary now:
import json

dictionary = json.loads(result)
type(dictionary)
>>> <class 'dict'>
Have a look at the below. Note that { *placeholder representing valid JSON inside* } has to be valid JSON.
my_string = '''
<script>
(function(){
    window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
    window.__PRELOADED_STATE__.push(
        {"foo":["bar1", "bar2"]}
    );
})()
</script>
'''
import re, json
my_json = re.findall(r'.*(?={\").*', my_string)[0].strip()
data = json.loads(my_json)
print(data)
Output:
{'foo': ['bar1', 'bar2']}
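If the markup ever contains other braces that confuse the lookahead regex, a different sketch (not from either answer above) is to let the JSON decoder find the end of the object itself: json.JSONDecoder().raw_decode() parses a single JSON value and ignores whatever follows it.
import json

my_string = '''
<script>
(function(){
    window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
    window.__PRELOADED_STATE__.push(
        {"foo":["bar1", "bar2"]}
    );
})()
</script>
'''

decoder = json.JSONDecoder()
start = my_string.index('{"')                   # first character of the pushed object
data, _ = decoder.raw_decode(my_string[start:])  # trailing ");})()" etc. is ignored
print(data)  # {'foo': ['bar1', 'bar2']}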
The my_string provided here is not valid JSON on its own. Once you have extracted a valid JSON string, you can use json.loads(JSON_STRING):
import json
d = json.loads('{"test":2}')
print(d) # Prints the dictionary `{'test': 2}`
I am trying to load a dict from a text file (file.txt) using json.loads(). I can save the dict, but I can't get it back. I have two scripts: one that saves the dict and one that receives it. The receiving script waits until the file has data, but when it does, it errors:
Traceback (most recent call last):
File "C:/Users/User/Desktop/receiver.py", line 9, in <module>
d = json.loads(file.read())
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\json\__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here are my full scripts, if that helps:
RECEIVE.PY
import json

d = {}
while True:
    with open('file.txt', 'r') as file:
        if file.read():
            d = json.loads(file.read())  # It errors here
            file.close()
            print('Data found in this file !')
            break
        else:
            print('No data in this file..')
print(str(d))
SENDER.PY
import json
import time

d = {
    'Hello': {
        'Guys': True,
        'World': False,
    },
}

time.sleep(5)

with open('file.txt', 'w') as file:
    file.write(json.dumps(d))
    file.close()
print(d['Hello']['Guys'])
You call file.read() twice, so the first call reads all the data and the second one returns an empty string. Just store the result in a variable:
import json

d = {}
while True:
    with open('file.txt', 'r') as file:
        data = file.read()
        if data:
            d = json.loads(data)
            # you also don't need to close the file due to the with statement
            print('Data found in this file !')
            break
        else:
            print('No data in this file..')
print(str(d))
Adding onto Aplet's answer above, you can call file.seek(0) to reset the file object's position to the start of the file after reading it:
import json

d = {}
while True:
    with open('file.txt', 'r') as file:
        if file.read():
            file.seek(0)  # rewind so the next read() sees the data again
            d = json.loads(file.read())  # no longer errors: the position was reset
            file.close()
            print('Data found in this file !')
            break
        else:
            print('No data in this file..')
print(str(d))
Aplet's answer is probably the better way to do it, but this is also a possible way.
For more info see the docs: https://docs.python.org/3/tutorial/inputoutput.html
I'm trying to scrape a nutrition website, and the following code works:
import requests
from bs4 import BeautifulSoup
import json
import re

page = requests.get("https://nutritiondata.self.com/facts/nut-and-seed-products/3071/1")
soup = BeautifulSoup(page.content, 'html.parser')
scripts = soup.find_all("script")
for script in scripts:
    if 'foodNutrients = ' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('foodNutrients =')[-1]
        jsonStr = jsonStr.rsplit('fillSpanValues')[0]
        jsonStr = jsonStr.rsplit(';',1)[0]
        jsonStr = "".join(jsonStr.split())
        valid_json = re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', jsonStr)
        jsonObj = json.loads(valid_json)

        # These are in terms of 100 grams. I also calculated for per serving
        g_per_serv = int(jsonObj['FOODSERVING_WEIGHT_1'].split('(')[-1].split('g')[0])

        for k, v in jsonObj.items():
            if k == 'NUTRIENT_0':
                conv_v = (float(v)*g_per_serv)/100
                print ('%s : %s (per 100 grams) | %s (per serving %s' %(k, round(float(v)), round(float(conv_v)), jsonObj['FOODSERVING_WEIGHT_1'] ))
but when I try to use it on other, almost identical webpages on the same domain, it does not. For example, if I use
page = requests.get("https://nutritiondata.self.com/facts/vegetables-and-vegetable-products/2383/2")
I get the error
Traceback (most recent call last):
File "scrape_test_2.py", line 20, in <module>
jsonObj = json.loads(valid_json)
File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5446 (char 5445)
Looking at the source code for both pages, they seem identical in the sense that both have
<script type="text/javascript">
<!--
foodNutrients = { NUTRIENT_142: ........
which is the part being scraped.
I've been looking at this all day. Does anyone know how to make this script work for both pages? What is the problem here?
I would switch to using hjson, which allows unquoted keys, and simply extract the entire foodNutrients variable and parse it, rather than manipulating strings over and over.
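A minimal illustration of what hjson tolerates (this assumes the third-party hjson package, installed with pip install hjson; the sample object is made up):
import hjson

# unquoted keys are fine for hjson, but would make json.loads choke
print(hjson.loads('{ NUTRIENT_0: "76", aifr: "[ -35, -10 ]" }'))
# OrderedDict([('NUTRIENT_0', '76'), ('aifr', '[ -35, -10 ]')])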
Your error:
Currently yours is failing because at least one of the source arrays has a different number of elements, so your sanitizing regex is inappropriate. Let's examine only the first known occurrence...
In the first URL, before you use the regex to clean it, you have:
aifr:"[ -35, -10 ]"
after:
"aifr":"[-35,-10]"
In the second, you start with an array of a different length:
aifr:"[ 163, 46, 209, 179, 199, 117, 11, 99, 7, 5, 82 ]"
after the regex replace, instead of:
"aifr":"[163,46,209,179,199,117,11,99,7,5,82]"
you have:
"aifr":"[163,"46",209,"179",199,"117",11,"99",7,"5",82]"
i.e. invalid JSON. No more nicely delimited key:value pairs.
Nutshell:
Use hjson; it's easier. Or update the regex appropriately to handle variable-length arrays.
import requests, re, hjson

urls = ['https://nutritiondata.self.com/facts/nut-and-seed-products/3071/1','https://nutritiondata.self.com/facts/vegetables-and-vegetable-products/2383/2']
p = re.compile(r'foodNutrients = (.*?);')

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        jsonObj = hjson.loads(p.findall(r.text)[0])
        serving_weight = jsonObj['FOODSERVING_WEIGHT_1']
        g_per_serv = int(serving_weight.split('(')[-1].split('g')[0])
        nutrient_0 = jsonObj['NUTRIENT_0']
        conv_v = float(nutrient_0)*g_per_serv/100
        print('%s : %s (per 100 grams) | %s (per serving %s' %(nutrient_0, round(float(nutrient_0)), round(float(conv_v)), serving_weight))
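If you'd rather stay with the standard library, one possible way to "update the regex" is to quote only identifiers that sit in key position (directly preceded by { or , and followed by :), so the numbers inside the array-valued strings are left alone. This is only a sketch; the raw string below is a made-up stand-in for the page's real foodNutrients object:
import json, re

raw = 'foodNutrients = { NUTRIENT_0: "76", aifr: "[ 163, 46, 209 ]", FOODSERVING_WEIGHT_1: "1.0 oz (28 g)" };'

obj_text = re.search(r'foodNutrients = (.*?);', raw).group(1)

# quote identifiers in key position only, leaving the digits inside "[ 163, 46, ... ]" untouched
quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', obj_text)

data = json.loads(quoted)
print(data['FOODSERVING_WEIGHT_1'])  # 1.0 oz (28 g)
This still assumes no string value happens to contain something that looks like an unquoted key, which is exactly the kind of edge case hjson spares you from.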
I downloaded a zip file from https://clinicaltrials.gov/AllPublicXML.zip, which contains over 200k XML files (most are < 10 KB in size), to a directory (see 'dirpath_zip' in the CODE below) that I created on Ubuntu 16.04 (on DigitalOcean). What I'm trying to accomplish is loading all of these into MongoDB (also installed in the same location as the zip file).
I ran the CODE below twice, and it consistently failed when processing the 15988th file.
I've googled around and tried reading other posts regarding this particular error, but couldn't find a way to solve the issue. Actually, I'm not really sure what the problem is... any help is much appreciated!
CODE:
import re
import sys  # needed for sys.exit() in timestamper()
import json
import zipfile
import pymongo
import datetime
import xmltodict
from bs4 import BeautifulSoup
from pprint import pprint as ppt

def timestamper(stamp_type="regular"):
    if stamp_type == "regular":
        timestamp = str(datetime.datetime.now())
    elif stamp_type == "filename":
        timestamp = str(datetime.datetime.now()).replace("-", "").replace(":", "").replace(" ", "_")[:15]
    else:
        sys.exit("ERROR [timestamper()]: unexpected 'stamp_type' (parameter) encountered")
    return timestamp

client = pymongo.MongoClient()
db = client['ctgov']
coll_name = "ts_" + timestamper(stamp_type="filename")
coll = db[coll_name]

dirpath_zip = '/glbdat/ctgov/all/alltrials_20180402.zip'
z = zipfile.ZipFile(dirpath_zip, 'r')

i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        soup = BeautifulSoup(z.read(xmlfile), 'lxml')
        json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
        coll.insert_one(json_study)
    i += 1
ERROR MESSAGE:
Traceback (most recent call last):
File "zip_to_mongo_alltrials.py", line 38, in <module>
soup = BeautifulSoup(z.read(xmlfile), 'lxml')
File "/usr/local/lib/python3.5/dist-packages/bs4/__init__.py", line 225, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/usr/local/lib/python3.5/dist-packages/bs4/builder/_lxml.py", line 118, in prepare_markup
for encoding in detector.encodings:
File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 264, in encodings
self.chardet_encoding = chardet_dammit(self.markup)
File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 34, in chardet_dammit
return chardet.detect(s)['encoding']
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 30, in detect
u.feed(aBuf)
File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 128, in feed
if prober.feed(aBuf) == constants.eFoundIt:
File "/usr/lib/python3/dist-packages/chardet/charsetgroupprober.py", line 64, in feed
st = prober.feed(aBuf)
File "/usr/lib/python3/dist-packages/chardet/hebrewprober.py", line 224, in feed
aBuf = self.filter_high_bit_only(aBuf)
File "/usr/lib/python3/dist-packages/chardet/charsetprober.py", line 53, in filter_high_bit_only
aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
File "/usr/lib/python3.5/re.py", line 182, in sub
return _compile(pattern, flags).sub(repl, string, count)
MemoryError
Try moving the reading from the file and the insert into the db into a separate method.
Also add gc.collect() for garbage collection.
import gc

# uses z, coll and the imports already set up in the question's script
def read_xml_insert(xmlfile):
    soup = BeautifulSoup(z.read(xmlfile), 'lxml')
    json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
    coll.insert_one(json_study)

i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        read_xml_insert(xmlfile)
    i += 1
    gc.collect()
I am making an API call and the response contains Unicode characters. Writing this response to a file throws the following error:
'ascii' codec can't encode character u'\u2019' in position 22462
I've tried all combinations of decode and encode ('utf-8').
Here is the code:
url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null" :
json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
try:
resp = requests.get(url,
auth=(api_user, api_pwd),
headers={'Content-Type': 'application/json'})
except requests.exceptions.RequestException as e:
print "|********************************************************|"
print e
return "Error: {}".format(e)
print "|********************************************************|"
sys.exit(1)
try:
total_records_extracted = total_records_extracted + rec_cnt
jsonfh = open(json_filename, 'w')
inter = resp.text
string_e = inter#.decode('utf-8')
final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
encoded_data = final.encode('utf-8')
cleaned_data = json.loads(encoded_data)
json.dump(cleaned_data, jsonfh, indent=None)
jsonfh.close()
except ValueError as e:
tb = traceback.format_exc()
print tb
print "|********************************************************|"
print e
print "|********************************************************|"
sys.exit(1)
A lot of developers have faced this issue, and a lot of places suggest using .decode('utf-8') or putting # -*- coding: utf-8 -*- at the top of the Python file.
It is still not helping.
Can someone help me with this issue?
Here is the trace:
Traceback (most recent call last):
File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
cleaned_data = json.loads(encoded_data)
File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)
|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)
inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')
The text property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.
You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content and passing them straight to json.loads.
What's more, Requests has a shortcut method to do the same: resp.json().
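As a rough sketch (the URL, credentials and output filename below are placeholders, not the actual API details from the question), the whole decode/replace/encode dance then collapses to:
import json
import requests

resp = requests.get("https://example.com/some/api/endpoint.json",  # placeholder URL
                    auth=("api_user", "api_pwd"),                  # placeholder credentials
                    headers={'Content-Type': 'application/json'})

cleaned_data = resp.json()  # requests decodes the raw bytes for you

with open("output.json", "w") as jsonfh:
    json.dump(cleaned_data, jsonfh, indent=None)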
final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes, and incorrectly unescape others. Your actual error has nothing to do with Unicode at all; it's that this replacement is mangling the input. For example, consider the input JSON:
{"message": "Open the file C:\\newfolder\\text.txt"}
after replacement:
{"message": "Open the file C:\ ewfolder\ ext.txt"}
which is clearly not valid JSON.
Instead of trying to operate on the JSON-encoded string, you should let json decode the input and then filter any strings you have in the structured output. This may involve using a recursive function to walk down into each level of the data looking for strings to filter, e.g.:
def clean(data):
    if isinstance(data, basestring):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

cleaned_data = clean(resp.json())
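A quick sanity check of the helper on made-up data (hypothetical values, Python 2 syntax to match the basestring check above, and assuming the clean() definition just shown):
sample = {u'note': u'line one\nline two', u'tags': [u'a\tb', u'c\rd']}
print clean(sample)
# {u'note': u'line one line two', u'tags': [u'a b', u'c d']}  (key order may vary)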