Python breaks parsing json with characters \"

I'm trying to parse a JSON string that contains an escaped character (the \" sequence):
{
"publisher": "\"O'Reilly Media, Inc.\""
}
The parser works fine if I remove the \" sequences from the string.
The exceptions raised by the different parsers are:
json
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 17 column 20 (char 392)
ujson
ValueError: Unexpected character in found when decoding object value
How do I make the parser handle these escaped characters?
Update:
P.S. json is imported as ujson in this example.
[Screenshot: the string definition as shown in my IDE]
The trailing comma was added accidentally while writing this question; the actual JSON has no trailing comma and is valid.

You almost certainly did not define properly escaped backslashes. If you define the string properly the JSON parses just fine:
>>> import json
>>> json_str = r'''
... {
... "publisher": "\"O'Reilly Media, Inc.\""
... }
... ''' # raw string to prevent the \" from being interpreted by Python
>>> json.loads(json_str)
{u'publisher': u'"O\'Reilly Media, Inc."'}
Note that I used a raw string literal to define the string in Python; if I did not, the \" would be interpreted by Python and a regular " would be inserted. You'd have to double the backslash otherwise:
>>> print '\"'
"
>>> print '\\"'
\"
>>> print r'\"'
\"
Reencoding the parsed Python structure back to JSON shows the backslashes re-appearing, with the repr() output for the string using the same double backslash:
>>> json.dumps(json.loads(json_str))
'{"publisher": "\\"O\'Reilly Media, Inc.\\""}'
>>> print json.dumps(json.loads(json_str))
{"publisher": "\"O'Reilly Media, Inc.\""}
If you do not escape the backslash itself, you end up with unescaped quotes:
>>> json_str_improper = '''
... {
... "publisher": "\"O'Reilly Media, Inc.\""
... }
... '''
>>> print json_str_improper
{
"publisher": ""O'Reilly Media, Inc.""
}
>>> json.loads(json_str_improper)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 3 column 20 (char 22)
Note that the \" sequences are now printed as a plain "; the backslash is gone!
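The same behaviour can be reproduced on Python 3 with a short sketch. The variable names (raw_doc, doubled_doc, cooked_doc) are invented for illustration; only json.loads and json.JSONDecodeError come from the standard library.

```python
import json

# Raw literal: the backslashes reach the JSON parser untouched.
raw_doc = r'''{"publisher": "\"O'Reilly Media, Inc.\""}'''

# Regular literal with doubled backslashes: the same text as raw_doc.
doubled_doc = '''{"publisher": "\\"O'Reilly Media, Inc.\\""}'''

# Regular literal with single backslashes: Python turns \" into " before
# the JSON parser ever sees it, so the document is no longer valid JSON.
cooked_doc = '''{"publisher": "\"O'Reilly Media, Inc.\""}'''

parsed = json.loads(raw_doc)

try:
    json.loads(cooked_doc)
    cooked_error = None
except json.JSONDecodeError as exc:
    cooked_error = exc

print(parsed["publisher"])     # the value, including its literal double quotes
print(raw_doc == doubled_doc)  # both spellings produce the same string
print(cooked_error)            # the parse error for the mangled document
```

If the JSON comes from a file or a network response rather than a Python literal, none of this applies: the backslashes are already real characters in the data.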

Your JSON is invalid. If you have questions about your JSON objects, you can always validate them with JSONlint. In your case you have an object
{
"publisher": "\"O'Reilly Media, Inc.\"",
}
and you have an extra comma indicating that something else should be coming. So JSONlint yields
Parse error on line 2:
...edia, Inc.\"", }
---------------------^
Expecting 'STRING'
which would begin to help you find where the error was.
Removing the comma for
{
"publisher": "\"O'Reilly Media, Inc.\""
}
yields
Valid JSON
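A similar check can be done locally without JSONlint. The helper below is a minimal sketch for Python 3; check_json is a hypothetical name, and the exact error wording differs between Python versions, so only the position is worth relying on.

```python
import json

def check_json(text):
    """Return None if text parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno, colno and msg are attributes of json.JSONDecodeError.
        return "line %d column %d: %s" % (exc.lineno, exc.colno, exc.msg)
    return None

with_comma = r'''{
"publisher": "\"O'Reilly Media, Inc.\"",
}'''
without_comma = r'''{
"publisher": "\"O'Reilly Media, Inc.\""
}'''

print(check_json(with_comma))     # describes where the parser gave up
print(check_json(without_comma))  # None: the document is valid
```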
Update: I'm keeping the stuff in about JSONlint as it may be helpful to others in the future. As for your well formed JSON object, I have
import json
d = {
"publisher": "\"O'Reilly Media, Inc.\""
}
print("Here is your string parsed.")
print(json.dumps(d))
yielding
Here is your string parsed.
{"publisher": "\"O'Reilly Media, Inc.\""}
Process finished with exit code 0

Related

How to extract JSON from script with Python?

I am parsing a scraped html page that contains a script with JSON inside. This JSON contains all info I am looking for but I can't figure out how to extract a valid JSON.
Minimal example:
my_string = '''
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{ *placeholder representing valid JSON inside* }
);
})()
'''
The json inside is valid according to jsonlinter.
The result should be loaded into a dictionary:
import json
import re
my_json = re.findall(r'.*(?={\").*', my_string)[0]  # extract json
data = json.loads(my_json)
# print(data)
regex: https://regex101.com/r/r0OYZ0/1
This try results in:
>>> data = json.loads(my_json)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<console>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
How can the JSON be extracted and loaded from the string with Python 3.7.x?

You can try extracting it with the regex below. It handles only a very simple case and might not cover all possible JSON variations:
my_string = '''
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{"tst":{"f":3}}
);
})()
'''
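import re
# \s* allows for the line breaks between push( and the JSON, and before the closing )
result = re.findall(r"push\(\s*([{\[].*\:.*[}\]])\s*\)", my_string)[0]
result
>>> '{"tst":{"f":3}}'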
to parse it to dictionary now:
import json
dictionary = json.loads(result)
type(dictionary)
>>>dict

Have a look at the below. Note that { *placeholder representing valid JSON inside* } has to be a valid JSON.
my_string = '''
<script>
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{"foo":["bar1", "bar2"]}
);
})()
</script>
'''
import re, json
my_json = re.findall(r'.*(?={\").*', my_string)[0].strip()
data = json.loads(my_json)
print(data)
Output:
{'foo': ['bar1', 'bar2']}
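A regex has to guess where the JSON ends. As an alternative sketch (not what the answers above use), the standard library's json.JSONDecoder.raw_decode can start at a given offset and stop at the end of the first complete JSON value. The helper name extract_pushed_json and the "push(" marker are assumptions based on the example script.

```python
import json

def extract_pushed_json(script_text, marker="push("):
    """Decode the first JSON value that follows marker in script_text."""
    start = script_text.index(marker) + len(marker)
    # raw_decode() does not skip leading whitespace, so do it by hand.
    while start < len(script_text) and script_text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value

sample = '''
(function(){
window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
window.__PRELOADED_STATE__.push(
{"foo": ["bar1", "bar2"], "note": "a ) or } inside a string is fine"}
);
})()
'''

data = extract_pushed_json(sample)
print(data["foo"])
```

Because the JSON parser itself decides where the value ends, brackets or parentheses inside string values do not confuse the extraction.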

The my_string provided here is not valid JSON. For valid JSON, you can use json.loads(JSON_STRING)
import json
d = json.loads('{"test":2}')
print(d) # Prints the dictionary `{'test': 2}`

Unicode API response throwing error ''ascii' codec can't encode character u'\u2019' in position 22462'

I am making an API call and the response has unicode characters. Loading this response into a file throws the following error:
'ascii' codec can't encode character u'\u2019' in position 22462
I've tried all combinations of decode and encode ('utf-8').
Here is the code:
url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null" :
    json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
    try:
        resp = requests.get(url,
                            auth=(api_user, api_pwd),
                            headers={'Content-Type': 'application/json'})
    except requests.exceptions.RequestException as e:
        print "|********************************************************|"
        print e
        return "Error: {}".format(e)
        print "|********************************************************|"
        sys.exit(1)
    try:
        total_records_extracted = total_records_extracted + rec_cnt
        jsonfh = open(json_filename, 'w')
        inter = resp.text
        string_e = inter#.decode('utf-8')
        final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
        encoded_data = final.encode('utf-8')
        cleaned_data = json.loads(encoded_data)
        json.dump(cleaned_data, jsonfh, indent=None)
        jsonfh.close()
    except ValueError as e:
        tb = traceback.format_exc()
        print tb
        print "|********************************************************|"
        print e
        print "|********************************************************|"
        sys.exit(1)
Many developers have run into this issue. Many sources suggest using .decode('utf-8') or putting # -*- coding: utf-8 -*- at the top of the Python file.
That still does not help.
Can someone help me with this issue?
Here is the trace:
Traceback (most recent call last):
File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
cleaned_data = json.loads(encoded_data)
File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)
|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)

inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')
The text property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.
You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content and passing them straight to json.loads.
What's more, Requests has a shortcut method to do the same: resp.json().
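As a rough illustration for Python 3.6 or later, where json.loads accepts bytes directly: the raw_body value below is a made-up stand-in for resp.content, so the sketch runs without a network call.

```python
import json

# Stand-in for resp.content: UTF-8 encoded bytes as they arrive over HTTP.
raw_body = '{"comment": "it\u2019s fine"}'.encode("utf-8")

# json.loads detects UTF-8, UTF-16 or UTF-32 input when given bytes.
decoded = json.loads(raw_body)

# With the default ensure_ascii=True, json.dumps emits \uXXXX escapes,
# so the output is pure ASCII and safe to write with any codec.
text_out = json.dumps(decoded)

print(decoded["comment"])
print(text_out)
```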
final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes, and incorrectly unescape others. Your actual error is nothing to do with Unicode at all, it's that this replacement is mangling the input. For example consider the input JSON:
{"message": "Open the file C:\\newfolder\\text.txt"}
after replacement:
{"message": "Open the file C:\ ewfolder\ ext.txt"}
which is clearly not valid JSON.
Instead of trying to operate on the JSON-encoded string, you should let json decode the input and then filter any strings you have in the structured output. This may involve using a recursive function to walk down into each level of the data looking for strings to filter. For example:
def clean(data):
    if isinstance(data, basestring):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

cleaned_data = clean(resp.json())
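The difference between the two approaches can be checked with a Python 3 sketch (str replaces basestring there). The body string is an invented sample, not data from the real API.

```python
import json

def clean(data):
    """Recursively replace newlines, tabs and carriage returns in strings."""
    if isinstance(data, str):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for key, value in data.items()}
    return data

# Sample JSON text: escaped backslashes in a path, plus a real \n escape.
body = r'{"message": "Open the file C:\\newfolder\\text.txt", "note": "two\nlines"}'

# Approach from the question: edit the encoded text, then parse.
mangled = body.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')
try:
    json.loads(mangled)
    mangled_error = None
except json.JSONDecodeError as exc:
    mangled_error = exc

# Approach from this answer: parse first, then clean the decoded strings.
cleaned = clean(json.loads(body))

print(mangled_error)  # the parser rejects the mangled text
print(cleaned)
```

The path keeps its backslashes because, after decoding, it contains a backslash followed by the letter n rather than a newline character.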

json.loads not replacing apostrophe

I have a json object that I am loading and replacing single with double quotes as I do. The syntax for this is:
response = json.loads(response.text.replace("'", '"'))
Within my data I have key/value pairs that take the format:
"name":"John O'Shea"
This is causing me to get the following traceback:
Traceback (most recent call last):
File "C:\Python27\Whoscored\Test.py", line 204, in <module>
response = json.loads(response.text.replace("'", '"').replace(',,', ','))
File "C:\Python27\lib\json\__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting ',' delimiter: line 1 column 7751 (char 7750)
I don't actually want to replace the apostrophe in a name such as the one above, but I would have thought that my json.loads statement would have converted my key/value pair to this:
"name":"John O"Shea"
I'm assuming this would also fail however. What I need to know is:
1) Why is my json.loads statement not replacing the apostrophes in my string during the load?
2) What is the best way to escape the apostrophes within my string so that they do not cause an error, but are still displayed in the load?
I have used a JSON tester on my larger string to confirm that there are no other errors that would stop the object from loading correctly, and there are none.
Thanks

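JSON uses " as the string delimiter, so response.text.replace("'", '"') just corrupts the document: "John O'Shea" becomes "John O"Shea", which ends the string early. An apostrophe needs no escaping inside a JSON string, so if the response is already valid JSON you can drop the replace() call entirely. If you really do want every apostrophe turned into a double quote, note that JSON escapes quotes inside strings as \", so this should work: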
response = json.loads(response.text.replace("'", '\\"'))
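A small Python 3 sketch shows that an apostrophe inside a double-quoted JSON string parses as-is, while the blanket replacement breaks it. The body value is a made-up sample modelled on the key/value pair in the question.

```python
import json

body = '{"name": "John O\'Shea"}'

# Valid JSON: the apostrophe is an ordinary character inside the string.
player = json.loads(body)

# The replacement from the question turns it into a premature closing quote.
corrupted = body.replace("'", '"')
try:
    json.loads(corrupted)
    corrupted_error = None
except json.JSONDecodeError as exc:
    corrupted_error = exc

round_trip = json.dumps(player)

print(player["name"])
print(corrupted_error)
print(round_trip)
```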

If your JSON consumer cannot handle special characters, it is better to convert them to Unicode escapes (Java):
private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // skip the second half of a surrogate pair
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp);
        } else {
            // a JSON \u escape takes exactly four hex digits per UTF-16 unit
            for (char unit : Character.toChars(cp)) {
                retStr.append(String.format("\\u%04x", (int) unit));
            }
        }
    }
    return retStr.toString();
}

JSON ValueError: Unterminated string

My script works, but sometimes it crashes with this error:
Traceback (most recent call last):
File "planetafm.py", line 6, in <module>
songs = json.loads(json_data)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 32 (char 31)
For example, this JSON triggers it:
rdsData({"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix)","artist":"Lana Del Rey","startDate":"2014-09-07 21:48:51","duration":"2014-09-07 21:48:51"}})
sourcecode:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
json_data = re.match('rdsData\((.*?)\)', response.content).group(1)
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')
Why is that JSON invalid? How can I fix this?
Thanks for answers!

Your regexp stops at the first closing parenthesis, which here sits inside the song title. You can fix it by adding $ to the regexp, so that the match has to run to the end of the response:
import requests, json, re
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
print response.content
json_data = re.match('rdsData\((.*?)\)$', response.content).group(1)
print json_data
songs = json.loads(json_data)
print (songs['now']['artist'] + " - " + songs['now']['title']).encode('utf-8')

Your method of extracting is flawed; your expression terminates at the first ) character:
>>> import re
>>> import requests
>>> url = "http://rds.eurozet.pl/reader/var/planeta.json"
>>> r = requests.get(url)
>>> re.match('rdsData\((.*?)\)', r.content).group(1)
'{"now":{"id":"0052-55","title":"Summertime Sadness (Radio Mix'
Rather than use a regular expression, just partition the value out using str.partition() and str.rpartition():
url = "http://rds.eurozet.pl/reader/var/planeta.json"
response = requests.get(url)
json_data = response.content.partition('(')[-1].rpartition(')')[0]
songs = json.loads(json_data)
Demo:
>>> json_data = r.content.partition('(')[-1].rpartition(')')[0]
>>> json.loads(json_data)['now']
{u'duration': u'2014-09-07 21:48:51', u'startDate': u'2014-09-07 21:48:51', u'artist': u'Lana Del Rey', u'id': u'0052-55', u'title': u'Summertime Sadness (Radio Mix)'}
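For reference, here is the same partition approach as a self-contained Python 3 sketch. It uses the sample response from the question instead of a live request, and strip_jsonp is a hypothetical helper name. On Python 3, response.content is bytes, so either use response.text or pass byte separators such as b'('.

```python
import json

def strip_jsonp(payload):
    """Return the text between the first '(' and the last ')' of payload."""
    return payload.partition('(')[-1].rpartition(')')[0]

payload = ('rdsData({"now":{"id":"0052-55",'
           '"title":"Summertime Sadness (Radio Mix)",'
           '"artist":"Lana Del Rey",'
           '"startDate":"2014-09-07 21:48:51",'
           '"duration":"2014-09-07 21:48:51"}})')

songs = json.loads(strip_jsonp(payload))
line = songs['now']['artist'] + " - " + songs['now']['title']
print(line)
```

Parentheses inside the title no longer matter, because only the outermost pair is used to cut the wrapper away.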

How to pass Unicode string as argument to urllib.urlencode()

I'm using Microsoft's free translation service to translate some Hindi characters to English. They don't provide an API for Python, but I borrowed code from: tinyurl.com/dxh6thr
I'm trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we
The 'hindi.txt' file is saved in unicode charset.
>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>
The response shows that the Translator detected 'en', instead of 'hi' (for Hindi). When I check the encoding, it shows as 'string':
>>> type(hindi_string)
<type 'str'>
For reference, here is content of 'hindi.txt':
हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।
I'm not sure if using string.encode or string.decode applies here. If it does, what do I need to encode/decode from/to? What is the best method to pass a Unicode string as a urllib.urlencode argument? How can I ensure that the actual Hindi characters are passed as the argument?
Thank you.
** Additional Information **
I tried using codecs.open() as suggested, but I get the following error:
>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\codecs.py", line 671, in read
return self.reader.read(size)
File "C:\Python27\lib\codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is the repr(hindi_string) output:
>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"

Your file is utf-16, so you need to decode the content before sending it:
hindi_string = open('hindi.txt').read().decode('utf-16')
data = { 'text' : hindi_string.encode('utf-8') }
...
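On Python 3 the same idea can be sketched as follows. The temporary file merely stands in for hindi.txt, and the translator request itself is left out; urllib.parse.urlencode percent-encodes str values as UTF-8 by default.

```python
import os
import tempfile
from urllib.parse import urlencode, parse_qs

hindi_text = "हाय, कैसे आप आज कर रहे हैं।"

# Recreate a file like hindi.txt: UTF-16 with a byte order mark.
path = os.path.join(tempfile.mkdtemp(), "hindi.txt")
with open(path, "w", encoding="utf-16") as f:
    f.write(hindi_text)

# Decode while reading by naming the file's real encoding.
with open(path, encoding="utf-16") as f:
    loaded = f.read()

# The query string is plain ASCII; the Hindi text is UTF-8 percent-encoded.
query = urlencode({"text": loaded})

# Decoding the query string again gives back the original characters.
round_trip = parse_qs(query)["text"][0]

print(query)
print(round_trip == hindi_text)
```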

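You could try opening the file with codecs.open and decoding it as UTF-8:
import codecs
with codecs.open('hindi.txt', encoding='utf-8') as f:
    hindi_text = f.read()
If the file is not actually UTF-8 (the traceback in the question shows it starting with the bytes 0xff 0xfe, a UTF-16 byte order mark), pass that encoding instead, for example encoding='utf-16'.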
