python3 - json.loads for a string that contains " in a value - python

I'm trying to convert a string that contains a dict into a dict object using json.
But the data contains a " inside a value.
Example:
string = '{"key1":"my"value","key2":"my"value2"}'
js = json.loads(string, strict=False)
It raises json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 13 (char 12), because " is the string delimiter and the unescaped " ends the value early.
What is the best way to achieve my goal?
The solution I have found is to perform several .replace calls on the string, replacing each legitimate " with a placeholder pattern until only the illegal " remains, then replacing the placeholders back with the legitimate ".
After that I can use json.loads and then replace the remaining placeholder with the illegal ".
But there must be another way.
Example:
string = '{"key1":"my"value","key2":"my"value2"}'
string = string.replace('{"','__pattern_1')
string = string.replace('"}','__pattern_2')
...
...
string = string.replace('"','__pattern_42')
string = string.replace('__pattern_1','{"')
string = string.replace('__pattern_2','"}')
...
...
js = json.loads(string,strict=False)

This should work. I simply replace all the expected double quotes with placeholders, remove the unwanted double quotes, and then convert the placeholders back. Note that this drops the inner quotes rather than keeping them.
import re
import json

def fix_json_string(st):
    # protect the structural quotes with placeholders
    st = re.sub(r'","', "!!", st)
    st = re.sub(r'":"', "--", st)
    st = re.sub(r'{"', "{{", st)
    st = re.sub(r'"}', "}}", st)
    # whatever quotes remain are the unwanted ones
    st = st.replace('"', '')
    # restore the structural quotes
    st = re.sub(r'}}', '"}', st)
    st = re.sub(r'{{', '{"', st)
    st = re.sub(r'--', '":"', st)
    st = re.sub(r'!!', '","', st)
    return st

broken_string = '{"key1":"my"value","key2":"my"value2"}'
fixed_string = fix_json_string(broken_string)
print(fixed_string)
js = json.loads(fixed_string)
print(js)
Output -
{"key1":"myvalue","key2":"myvalue2"} # str
{'key1': 'myvalue', 'key2': 'myvalue2'} # parsed into a dict

The variable string is not a valid JSON string.
The correct string should be:
string = '{"key1":"my\\"value","key2":"my\\"value2"}'

The problem is that the string is not valid JSON.
In '{"key1": "my"value", "key2": "my"value2"}' the value of key1 ends at "my", and the extra characters value" that follow violate the format.
You can escape the inner quotes; valid JSON would look like:
{"key1": "my\"value", "key2": "my\"value2"}
Since you are defining it as a Python string literal, you then need to escape the backslashes as well:
string = '{"key1": "my\\"value", "key2": "my\\"value2"}'
There is a lot of educational material online on character escaping. I recommend checking it out if something is not clear.
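As a small illustration of the escaping described above, letting json.dumps build the string avoids hand-written escapes altogether. This is only a sketch; the dict below is an assumed stand-in for wherever the data actually originates.

```python
import json

# Assumed source data: values that contain a literal double quote.
data = {"key1": 'my"value', "key2": 'my"value2'}

# json.dumps escapes the inner quotes as \" automatically.
encoded = json.dumps(data)
print(encoded)  # {"key1": "my\"value", "key2": "my\"value2"}

# The escaped text parses back to the original dict.
decoded = json.loads(encoded)
print(decoded == data)  # True
```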
Edit: If you insist on fixing the string in code (which I don't recommend, see comment), you can do something like this:
import re
import json

string = '{"key1":"my"value","key2":"my"value2"}'
# find the contents of keys and values, assuming that the real closing
# double quote of a key/value is followed by one of: , } : ]
m = re.finditer(r'"([^:]+?)"(?:[,}:\]])', string)
new_string = string
for i in reversed(list(m)):
    ss, se = i.span(1)  # group 1 holds the content
    # escape the double quotes inside the content and put everything back together
    # note: this is not efficient; a large number of replacements would need a different concatenation approach
    new_string = new_string[:ss] + new_string[ss:se].replace('"', '\\"') + new_string[se:]
json.loads(new_string)
This assumes that the real closing double quotes are followed by one of , : } ]. In other cases this won't work.
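The same idea can be wrapped in a function using re.sub with a callback. This is a sketch under the same assumption (a real closing quote is always followed by one of , : } ]). The name escape_inner_quotes is made up for illustration, and a lookahead is used so the delimiter itself is not consumed. Values that contain a colon, and empty strings, are not handled.

```python
import json
import re

# A quoted token whose closing quote is followed by , } : or ]
# (the lookahead checks the delimiter without consuming it).
_STRING_TOKEN = re.compile(r'"([^:]+?)"(?=[,}:\]])')

def escape_inner_quotes(text):
    """Escape double quotes found inside keys and values (hypothetical helper)."""
    def repair(match):
        content = match.group(1).replace('"', '\\"')
        return '"' + content + '"'
    return _STRING_TOKEN.sub(repair, text)

broken = '{"key1":"my"value","key2":"my"value2"}'
repaired = escape_inner_quotes(broken)
print(repaired)              # {"key1":"my\"value","key2":"my\"value2"}
print(json.loads(repaired))  # {'key1': 'my"value', 'key2': 'my"value2'}
```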

Related

Replacing Unicode character / Python / Django

Since I'm pretty much forced to replace some Unicode characters in a string returned by an OCR service, the only way I found to do it is to replace them one by one. This is done using the following code:
def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106' , '\u0106')
    ...
    ...
    mystr = mystr.replace(r'\u017a' , '\u017a')
    mystr = mystr.replace(r'\u017c' , '\u017c')
    return mystr
I know this might be confusing. The string returned by the OCR API contains literal character sequences: for example, "\u017a" is not a single mapped Unicode character but the six characters "\", "u", "0", "1", "7", "a". This can't be changed on my end.
The above solution is very messy and unprofessional. However, if I try to loop through all the characters that I want to "replace", it doesn't seem to do anything:
def recode(mystr):
    for foo in ['\u0106','\u0118','\u0141', ...... , '\u017a','\u017c']:
        mystr = mystr.replace(r'%s' % foo, foo)
    return mystr
Why is the foo string not read as raw text in this case, when in the first scenario it works properly? What is the difference?
The reason foo is not read as raw text is that the r prefix only matters when a string literal is parsed: it affects how the source code is read, not the resulting object. The elements of your list are already single characters, so r'%s' % foo just produces that same character, and replace() swaps it for itself.
As a solution to what you want to do, you can try something like this:
bar = r"\u0104"
mystr = mystr.replace(bar, chr(int(bar[2:], 16)))
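Generalizing the snippet above, a single regular expression can handle every literal \uXXXX sequence without listing them one by one. This is a minimal sketch, assuming the input only contains four-digit \u escapes (surrogate pairs are not combined).

```python
import re

# A literal backslash, the letter u, and exactly four hex digits.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def recode(mystr):
    # Turn each six-character escape into the single character it names.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), mystr)

sample = r'\u0104\u017c and plain text'
converted = recode(sample)
print(converted)  # Ąż and plain text
```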
This is an X-Y problem. The API is returning literal Unicode escape sequences. Maybe the response is actually JSON and the OP should be calling json.loads() on the returned data; if not, you can use the unicode_escape codec to translate the escape codes. That codec requires a byte string, so the text may need to be encoded via ascii or latin1 first:
def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106' , '\u0106')
    mystr = mystr.replace(r'\u017a' , '\u017a')
    mystr = mystr.replace(r'\u017c' , '\u017c')
    return mystr

def recode2(s):
    return s.encode('latin1').decode('unicode_escape')

s = r'\u0104\u017c\u0106\u017a\u017c'
print(s)
print(recode(s))
print(recode2(s))
Output:
\u0104\u017c\u0106\u017a\u017c
ĄżĆźż
ĄżĆźż
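If the API response does turn out to be JSON, as the answer speculates, json.loads decodes the \u escapes while parsing. The payload below is hypothetical, since the question does not show the real response format.

```python
import json

# Hypothetical response body; note the raw string keeps the backslashes literal.
payload = r'{"text": "\u0104\u017c\u0106"}'

result = json.loads(payload)
print(result["text"])  # ĄżĆ
```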

Python: Convert string representative of a string array to a list

I am trying to convert a string representation of a string array, whose items contain double quotes, single quotes and commas, into a Python list when that array is passed to an API endpoint via Postman. (I am using Python 3.6.)
Ex:
value passed in postman
"data":["bacsr "attd" and, fhhh'''","'gehh', uujf "hjjj"",",,,hhhhh,, ","1"]
element 1 = "bacsr "attd" and, fhhh'''"
element 2 = "'gehh', uujf "hjjj""
element 3 = ",,,hhhhh,, "
element 4 = "1"
What I tried and failed:
post_values = request.data['data']
post_values = ast.literal_eval(post_values)
Gives this error:
During handling of the above exception (invalid syntax (<unknown>, line 1)), another exception occurred:
How can I convert this in to a 4 element list with relevant string escapes?
When you write "bacsr "attd" and, fhhh'''", the string starts at the first double quote and ends at the second one, so attd falls outside the string.
To use quotes inside the string, you must put a \ before each of them, like this:
"bacsr \"attd\" and, fhhh\'\'\'"
Without the \, Python assumes the string has ended and doesn't know what attd is.
PS. Sorry for my English, I'm not fluent.
Hope this is clear enough.
import re

data = """\
"data":["bacsr "attd" and, fhhh'''","'gehh', uujf "hjjj"",",,,hhhhh,, ","1"]\
"""
data = data.replace('[','').replace(']','')
# regular expression to split out quoted or unquoted tokens in data string into individual groups
pat = re.compile(r'(?:")?([^"]*)(?:(?(1)"|))')
groups = [*filter(None, pat.split(data))]
l = ['']
for token in groups[2:]:
    if token == ',':
        l.append('')
    else:
        l[-1] += token
post_values = {groups[0] : l} # construct the result dict
print(post_values)
print()
for v in post_values['data']:
    print(v)
output:
{'data': ["bacsr attd and, fhhh'''", "'gehh', uujf hjjj", ',,,hhhhh,, ', '1']}
bacsr attd and, fhhh'''
'gehh', uujf hjjj
,,,hhhhh,,
1
Note: element 2 is not the same as what you gave (the inner double quotes are lost), but I couldn't achieve that.
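If the client can be changed to send properly escaped JSON, as the first answer suggests, no custom parsing is needed. The sketch below builds such a body with json.dumps to show what the escaped form looks like, then parses it back into the four elements. Whether the Postman request can be changed this way is an assumption.

```python
import json

# The four intended elements, written as ordinary Python strings.
elements = [
    "bacsr \"attd\" and, fhhh'''",
    "'gehh', uujf \"hjjj\"",
    ",,,hhhhh,, ",
    "1",
]

# The body a client would need to send: inner double quotes become \".
body = json.dumps({"data": elements})
print(body)

# Parsing the valid body yields the original list, quotes included.
post_values = json.loads(body)["data"]
print(len(post_values))  # 4
```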

Python: Can't turn string into JSON

For the past few hours, I've been fighting to get a string into a JSON dict. I've tried everything from json.loads(... which throws an error:
requestInformation = json.loads(entry["request"]["postData"]["text"])
# throws this error
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
to stripping out the slashes using a medley of re.sub('\\','',mystring) ,mystring.sub(... to no effect. My problem string looks like so
'{items:[{n:\\'PackageChannel.GetUnitsInConfigurationForUnitType\\',ps:[{n:\\'unitType\\',v:"ActionTemplate"}]}]}'
The origin of this string is that it's a HAR dump from Google Chrome. I think those backslashes are from it being escaped somewhere along the way because the bulk of the HAR file doesn't contain them, but they do appear commonly in any field labeled "text".
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
EDIT I eventually gave up on turning the text above into JSON and instead opted for regex. Sometimes the slashes showed up, sometimes they didn't based on what I was viewing the text in and that made it difficult to work with.
The json module wants a string where the keys are also wrapped in double quotes, so the string below would work:
mystring = '{"items":[{"n":"PackageChannel.GetUnitsInConfigurationForUnitType", "ps":[{"n":"unitType","v":"ActionTemplate"}]}]}'
myjson = json.loads(mystring)
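To make the difference concrete, here is a short sketch with two made-up strings that differ only in whether the keys are quoted. The first is rejected, the second parses.

```python
import json

# Illustrative inputs (not the asker's exact data).
unquoted = '{items: [{n: "unitType"}]}'
quoted = '{"items": [{"n": "unitType"}]}'

error = None
try:
    json.loads(unquoted)
except json.JSONDecodeError as exc:
    # Unquoted keys are not valid JSON, so parsing fails here.
    error = str(exc)
    print("rejected:", error)

parsed = json.loads(quoted)
print(parsed["items"][0]["n"])  # unitType
```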
This function should remove the backslashes and put double quotes around your keys.
import json, re

def make_jsonable(mystring):
    # we'll use this regex to find any key that doesn't contain any of: {}[]",
    key_regex = r'([,\[{](\s+)?[^"{},\[\]]+(\s+)?:)'
    mystring = mystring.replace("\\", "")  # remove any backslashes
    mystring = mystring.replace("'", '"')  # replace single quotes with doubles
    match = re.search(key_regex, mystring)
    while match:
        start_index = match.start(0)
        end_index = match.end(0)
        key = mystring[start_index+1:end_index-1].strip()
        print(key)  # show each key as it gets quoted
        mystring = '%s"%s"%s' % (mystring[:start_index+1], key, mystring[end_index-1:])
        match = re.search(key_regex, mystring)
    return mystring
I couldn't test it directly on the first string you wrote, because the double/single quotes don't match up, but it works on the one in the last code sample.
You'll need an r prefix before the JSON string literal, or replace every \ with \\.
This works:
import json
validasst_json = r'''{
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
}'''
txt = json.loads(validasst_json)
print(txt["postData"]['mimeType'])
print(txt["postData"]['text'])

Convert escaped utf-8 string to utf in python 3

I have a py3 string that includes escaped UTF-8 sequences, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct UTF-8 string (which in the example would be "Company®", since the escaped sequence is c2 ae). I've tried:
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
"\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace (
"\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: CompanyÂ®
(wrong, since the characters are treated separately, but they should be treated together).
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can I achieve that programmatically (i.e. starting from a py3 str)?
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However, it will fail if there is a triple quote ''' in the input string itself.
The following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. Because the pattern is a capturing group, the result alternates: items at even indices (0, 2, ...) are ASCII parts, e.g. "Company"; items at odd indices each correspond to one escaped UTF-8 byte, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re

REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)

def convert(estr):
    def split(estr):
        for i, substr in enumerate(REGEXP.split(estr)):
            if i % 2:
                yield bytes.fromhex(substr[-2:])
            elif substr:
                yield bytes(substr, 'ascii')
    return b''.join(split(estr)).decode('utf-8')

test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized. Ascii parts do not need encode/decode and consecutive hex codes should be concatenated.
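One possible shape for that optimization, offered as a sketch and not as the answerer's code: match each run of consecutive escapes as a unit, decode the whole run at once, and leave the plain text untouched. The names ESCAPED_RUN, ESCAPED_BYTE and convert_runs are invented here.

```python
import re

# One or more consecutive escapes, each two backslashes + ffffff + two hex digits.
ESCAPED_RUN = re.compile(r'(?:\\\\ffffff[0-9a-f]{2})+', flags=re.I)
# A single escape, capturing only its two hex digits.
ESCAPED_BYTE = re.compile(r'\\\\ffffff([0-9a-f]{2})', flags=re.I)

def convert_runs(estr):
    def decode_run(match):
        # Concatenate the hex digits of the run and decode them together,
        # so multi-byte UTF-8 sequences are interpreted as one character.
        hex_digits = ''.join(ESCAPED_BYTE.findall(match.group(0)))
        return bytes.fromhex(hex_digits).decode('utf-8')
    # Plain text between runs is never encoded or decoded.
    return ESCAPED_RUN.sub(decode_run, estr)

test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = convert_runs(test_in)
print(test_out)  # Company®
```

Because the plain parts are left as str, this variant also accepts non-ASCII text between the escapes, which the bytes(substr, 'ascii') call in the version above would reject.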

Proper way to deal with string which looks like json object but it is wrapped with single quote

By definition, a JSON string is wrapped in double quotes.
In fact:
json.loads('{"v":1}') #works
json.loads("{'v':1}") #doesn't work
But how do I deal with the second statement?
I'm looking for a solution different from eval or replace.
Thanks.
If you get malformed JSON, why don't you just replace the single quotes with double quotes before calling json.loads?
If you cannot fix the other side you will have to convert invalid JSON into valid JSON. I think the following treats escaped characters properly:
import json
import re

def fixEscapes(value):
    # Replace \' by '
    value = re.sub(r"[^\\]|\\.", lambda match: "'" if match.group(0) == "\\'" else match.group(0), value)
    # Replace " by \"
    value = re.sub(r"[^\\]|\\.", lambda match: '\\"' if match.group(0) == '"' else match.group(0), value)
    return value

input = "{'vt\"e\\'st':1}"
input = re.sub(r"'(([^\\']|\\.)+)'", lambda match: '"%s"' % fixEscapes(match.group(1)), input)
print(json.loads(input))
Not sure if I got your requirements right, but are you looking for something like this?
def fix_json(string_):
    if string_[0] == string_[-1] == "'":
        return '"' + string_[1:-1] + '"'
    return string_
Example usage:
>>> fix_json("'{'key':'val\"'...cd'}'")
"{'key':'val"'...cd'}"
EDIT: it seems that the humour I tried to have in making the example above is not self-explanatory. So, here's another example:
>>> fix_json("'This string has - I'm sure - single quotes delimiters.'")
"This string has - I'm sure - single quotes delimiters."
These examples show how the "replacement" only happens at the ends of the string, not within it.
You could also achieve the same with a regular expression, of course, but if you are just checking the first and last characters of a string, I find plain string indexing more readable.
Unfortunately you have to do this:
with open('filename.json', 'rb') as f:
    data = eval(f.read())
Done!
This works, but apparently people don't like the eval function. Let me know if you find a better approach. I used this on some Twitter data...
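One alternative that avoids both eval and replace, assuming the text is really a Python literal and not JSON, is ast.literal_eval from the standard library. It only accepts literals (dicts, lists, strings, numbers, True/False/None), so JSON keywords such as true, false and null are not understood. The sample text below is made up.

```python
import ast
import json

# Made-up single-quoted input, including an escaped apostrophe.
text = "{'v': 1, 'name': 'it\\'s'}"

# Evaluates literals only; arbitrary expressions are rejected.
value = ast.literal_eval(text)
print(value)  # {'v': 1, 'name': "it's"}

# Once parsed, json.dumps produces valid double-quoted JSON.
as_json = json.dumps(value)
print(as_json)  # {"v": 1, "name": "it's"}
```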
