I ended up with a dictionary serialized as a string. I wanted to convert this string to a dictionary, but it raises an error.
data = '{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}'
After checking it, I found that one value contains a second pair of double quotes inside its own double quotes.
data = {
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"", # this value
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}
I want to turn "Alte Mühle" into the single-quoted 'Alte Mühle' or just Alte Mühle. I tried converting the dictionary to a str and using string.replace(), but it didn't work. Since the value is dynamic, I can't hard-code the replacement, i.e.
string.replace('"Alte Mühle"', 'Alte Mühle') # will only change this value
Is there any way to get rid of these inner quotes in general?
Not enough rep to comment, so I'm assuming you are starting with a bunch of string literals you typed manually into your code. If not, there are other ways to handle this or it may have not been an issue to start with.
Here is a solution that doesn't require manually searching for problem strings. Enclose your dictionary string literal in triple quotes (either """ or ''' is permitted) instead of a single ' or ". This prevents the interpreter from getting confused by a ' or " inside the string literal.
data = """{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}"""
Next, the double quote problem can be handled using regular expressions (the re module). I have to leave this as an exercise as I am on a phone, but the idea is to match each string value with a pattern along the lines of ": "(.+?)", anchored at the end of the line, replace every " inside the captured group with ', and then put the corrected substring back in place of the old one.
Finally, to interpret it as a dictionary, call ast.literal_eval(...) on the corrected string (a version of eval(...) made safer by only interpreting literals). Requires the standard library ast import.
Consider weighing this effort against manually fixing your strings, or against loading the strings or key/value pairs from a database, which avoids these string literal issues altogether.
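A minimal sketch of the three steps above, assuming every key/value pair sits on its own line as in the question. The name fix_inner_quotes and the exact pattern are illustrative assumptions, not part of the original answer, and the pattern will not cope with values that span several lines.

```python
import ast
import re

data = """{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}"""

# Group 1: the colon and opening quote of a string value.
# Group 2: everything up to the LAST quote on the line (the raw value).
# Group 3: the closing quote plus an optional comma and closing brace.
VALUE_RE = re.compile(r'(:[ \t]*")(.*)("[ \t]*,?[ \t]*\}?[ \t]*)$', re.MULTILINE)


def fix_inner_quotes(text):
    """Replace double quotes inside string values with single quotes."""
    return VALUE_RE.sub(
        lambda m: m.group(1) + m.group(2).replace('"', "'") + m.group(3),
        text,
    )


fixed = fix_inner_quotes(data)
result = ast.literal_eval(fixed)  # safe: only evaluates Python literals
print(result["name"])
```

Numeric values such as "lat" are left untouched because the pattern only matches values that start with a double quote.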
Python Escape Double quote character and convert the string to json
I have tried escaping the double quotes with escape characters, but that didn't work either.
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
Loading fails with the error Expecting ',' delimiter: line 1 column 180 (char 179)
The expected output is JSON string
The correct JSON string, with escaped quotes should look like this:
[{
"Attribute": "color",
"Keywords": "green",
"AttributeComments": null
}, {
"Attribute": " season",
"Keywords": ["Holly Berry"],
"AttributeComments": null
}, {
"Attribute": " size",
"Keywords": "20\"x30\"",
"AttributeComments": null
}, {
"Attribute": " unit",
"Keywords": "1",
"AttributeComments": null
}]
Edit:
You can use a regular expression to correct the string in Python, resulting in valid JSON:
import re
import json
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
pattern = r'"Keywords":"(\d+)"x(\d+)""'
correctedString = re.sub(pattern, r'"Keywords": "\g<1>x\g<2>"', raw_string)
print(json.loads(correctedString))
Output (under Python 2, hence the u'' prefixes):
[{u'Keywords': u'green', u'Attribute': u'color', u'AttributeComments': None}, {u'Keywords': [u'Holly Berry'], u'Attribute': u' season', u'AttributeComments': None}, {u'Keywords': u'20x30', u'Attribute': u' size', u'AttributeComments': None}, {u'Keywords': u'1', u'Attribute': u' unit', u'AttributeComments': None}]
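The substitution above drops the inner quotes, so the value becomes 20x30. If the inch marks should survive, a variant can escape them instead. This is only a sketch: it assumes the broken values always look like digits, a quote, x, digits, a quote, and the names BROKEN_SIZE_RE and escape_size are made up for illustration.

```python
import json
import re

raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'

# Assumption: the only malformed values have the shape "<digits>"x<digits>"".
BROKEN_SIZE_RE = re.compile(r'"(\d+)"x(\d+)""')


def escape_size(match):
    # Rebuild the value as "20\"x30\"" so both inner quotes are escaped.
    return '"{}\\"x{}\\""'.format(match.group(1), match.group(2))


corrected = BROKEN_SIZE_RE.sub(escape_size, raw_string)
new_data = json.loads(corrected)
print(new_data[2]["Keywords"])
```

Using a function as the replacement avoids having to double-escape backslashes in a replacement template.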
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20x30","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
First of all, change the key-value pair "Keywords":"20"x30"" to "Keywords":"20x30".
The JSON in your code is invalid. If this JSON was not written by you but generated by some other source, check that source. You can check whether JSON is valid using JSONLint: just paste your JSON there.
As for your code:
import json
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20x30","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
new_data is a list. If you check the type of its first element using print(type(new_data[0])), you'll find it is the dict you wanted.
EDIT: Since you say you are fetching this JSON from a database, check whether all the JSON documents there carry this type of formatting error. If so, you'll want to find out where they are being generated. Your options are either to correct the problem at the source, or, if this is a one-off problem, to correct it manually by adding escape characters. I strongly suggest the former.
This code:
import json
s = '{ "key1": "value1", "key2": "value2", }'
json.loads(s)
produces this error in Python 2:
ValueError: Expecting property name: line 1 column 16 (char 15)
Similar result in Python 3:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 16 (char 15)
If I remove that trailing comma (after "value2"), I get no error. But my code will process many different JSON documents, so I can't do it manually. Is it possible to set up the parser to ignore such trailing commas?
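Another option is to parse it as YAML; YAML accepts valid JSON but also accepts all sorts of variations. Use safe_load, since recent PyYAML versions require an explicit Loader argument for plain yaml.load.
import yaml
s = '{ "key1": "value1", "key2": "value2", }'
yaml.safe_load(s)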
The JSON specification doesn't allow trailing commas. The parser raises because it encounters a token that is invalid at that position.
You might be interested in using a different parser for those files, eg. a parser built for JSON5 spec which allows such syntax.
It could be that this data stream is JSON5, in which case there's a parser for that: https://pypi.org/project/json5/
This situation can be alleviated by a regex substitution that looks for ", }, and replaces it with " }, allowing for any amount of whitespace between the quotes, comma and close-curly.
>>> import re
>>> s = '{ "key1": "value1", "key2": "value2", }'
>>> re.sub(r"\"\s*,\s*\}", "\" }", s)
'{ "key1": "value1", "key2": "value2" }'
Giving:
>>> import json
>>> s2 = re.sub(r"\"\s*,\s*\}", "\" }", s)
>>> json.loads(s2)
{'key1': 'value1', 'key2': 'value2'}
EDIT: as commented, this is not a good practice unless you are confident your JSON data contains only simple words, and this change is not corrupting the data-stream further. As I commented on the OP, the best course of action is to repair the up-stream data source. But sometimes that's not possible.
I wrote a regex that finds and removes every comma followed by ] or } in the JSON, while skipping commas that sit inside strings.
It seems to work fine and is fast.
import re, json
s = r'''
[
123, true, false, null,
{
"\n\\\",]\\": "\n\\\",]\\",
"\n\\\",}\\": "\n\\\",}\\",
},
]
'''
r = json.loads(re.sub(r'("(?:\\?.)*?")|,\s*([]}])', r'\1\2', s))
print(r) # [123, True, False, None, {'\n\\",]\\': '\n\\",]\\', '\n\\",}\\': '\n\\",}\\'}]
That's because an extra , is invalid according to the JSON standard:
An object is an unordered set of name/value pairs. An object begins
with { (left brace) and ends with } (right brace). Each name is
followed by : (colon) and the name/value pairs are separated by ,
(comma).
If you really need this, you could wrap python's json parser with jsoncomment. But I would try to fix JSON in the origin.
I suspect it doesn't parse because "it's not json", but you could pre-process strings, using regular expression to replace , } with } and , ] with ]
How about using the following regex?
s = re.sub(r",\s*}", "}", s)
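To illustrate the caveat raised in the answers above, here is a small sketch comparing the naive substitution with a string-aware one. The sample input and the helper name strip_trailing_commas are assumptions made up for this example; the string-aware pattern follows the same idea as the earlier answer (match whole string literals first so commas inside them are left alone).

```python
import json
import re

# Either a complete JSON string literal (group 1), or a comma followed
# only by whitespace and a closing bracket/brace (group 2).
TRAILING_COMMA_RE = re.compile(r'("(?:[^"\\]|\\.)*")|,\s*([\]}])')


def strip_trailing_commas(text):
    """Drop trailing commas outside of string literals."""
    # String literals are put back unchanged; for a trailing comma only
    # the closing bracket/brace is kept.
    return TRAILING_COMMA_RE.sub(lambda m: m.group(1) or m.group(2), text)


s = '{ "key1": "a, }", "key2": "value2", }'

naive = re.sub(r",\s*}", "}", s)    # also rewrites the ", }" inside "a, }"
careful = strip_trailing_commas(s)  # leaves string contents alone

print(json.loads(naive))
print(json.loads(careful))
```

Both results parse, but the naive version silently changes the value of key1, which is why repairing the upstream source is still the preferable fix.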
For the past few hours, I've been fighting to get a string into a JSON dict. I've tried everything from json.loads(... which throws an error:
requestInformation = json.loads(entry["request"]["postData"]["text"])
//throws this error
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
to stripping out the slashes using a medley of re.sub('\\','',mystring) ,mystring.sub(... to no effect. My problem string looks like so
'{items:[{n:\\'PackageChannel.GetUnitsInConfigurationForUnitType\\',ps:[{n:\\'unitType\\',v:"ActionTemplate"}]}]}'
The origin of this string is that it's a HAR dump from Google Chrome. I think those backslashes are from it being escaped somewhere along the way because the bulk of the HAR file doesn't contain them, but they do appear commonly in any field labeled "text".
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
EDIT I eventually gave up on turning the text above into JSON and instead opted for regex. Sometimes the slashes showed up, sometimes they didn't based on what I was viewing the text in and that made it difficult to work with.
The json module wants a string in which the keys are also wrapped in double quotes,
so the string below would work:
mystring = '{"items":[{"n":"PackageChannel.GetUnitsInConfigurationForUnitType", "ps":[{"n":"unitType","v":"ActionTemplate"}]}]}'
myjson = json.loads(mystring)
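A quick way to see the difference, as a sketch with a shortened, made-up payload (the variable names are illustrative only):

```python
import json

# Unquoted keys and single-quoted strings: a JavaScript-style object, not JSON.
bad = "{items:[{n:'unitType',v:\"ActionTemplate\"}]}"
# Same data with every key and string wrapped in double quotes.
good = '{"items":[{"n":"unitType","v":"ActionTemplate"}]}'

try:
    json.loads(bad)
    bad_parsed = True
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    bad_parsed = False
    print("rejected:", exc)

parsed = json.loads(good)
print(parsed["items"][0]["v"])
```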
This function should remove the double backslashes and put double quotes around your keys.
import json, re

def make_jsonable(mystring):
    # we'll use this regex to find any key that doesn't contain any of: {}[]",
    key_regex = r'([,\[{](\s+)?[^"{},\[\]]+(\s+)?:)'
    mystring = re.sub(r"\\", "", mystring)  # remove any backslashes
    mystring = mystring.replace("'", '"')  # replace single quotes with doubles
    match = re.search(key_regex, mystring)
    while match:
        start_index = match.start(0)
        end_index = match.end(0)
        key = mystring[start_index+1:end_index-1].strip()
        print(key)  # show each bare key as it is found
        # wrap the bare key in double quotes
        mystring = '%s"%s"%s' % (mystring[:start_index+1], key, mystring[end_index-1:])
        match = re.search(key_regex, mystring)
    return mystring
I couldn't directly test it on the first string you wrote, the double/single quotes don't match up, but on the one in the last code sample it works.
You'll need an r prefix before the JSON string literal (making it a raw string), or you must replace every \ with \\
This works:
import json
validasst_json = r'''{
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
}'''
txt = json.loads(validasst_json)
print(txt["postData"]['mimeType'])
print(txt["postData"]['text'])
Is there any way to use a regular expression in Python to remove every , (comma) that follows a closing curly brace }?
Data is of the following format in a file - abc.json
{
"Key1":"value1",
"Key2":"value2"
},
{
"Key1":"value3",
"Key2":"value4"
},
{
"Key1":"value5",
"Key2":"value6"
}
This should result in following -
{
"Key1":"value1",
"Key2":"value2"
}
{
"Key1":"value3",
"Key2":"value4"
}
{
"Key1":"value5",
"Key2":"value6"
}
As you can see, the , (comma) has been removed after every closing brace }.
Would be helpful if this can be achieved via jq as well, apart from python REGEX
Test Source: https://regex101.com/r/wT6uU2/1
import re
p = re.compile(r'},')
test_str = u"{\n\"Key1\":\"value1\",\n\"Key2\":\"value2\"\n},\n\n{\n\"Key1\":\"value3\",\n\"Key2\":\"value4\"\n},\n\n{\n\"Key1\":\"value5\",\n\"Key2\":\"value6\"\n}"
re.findall(p, test_str)
But use a replacement instead of a search:
re.sub(p, '}', test_str)  # replace }, -> }
This works:
import re
s="""{
"Key1":"value1",
"Key2":"value2"
},
{
"Key1":"value3",
"Key2":"value4"
},
{
"Key1":"value5",
"Key2":"value6"
}"""
pattern=re.compile(r'(?P<data>{.*?}),', re.S)
print(pattern.findall(s))
s1=pattern.sub(r'\g<data>', s)
print(s1)
If you intend to process the resulting JSON in jq, it's probably easier to wrap it in brackets [{...}, {...}] to make it a JSON array. Then, you can use .[] in jq to unwrap the array.
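The same bracket-wrapping idea also works directly in Python. A minimal sketch, assuming the file content is a comma-separated sequence of objects as shown in the question (the text is inlined here instead of being read from abc.json):

```python
import json

text = """{
"Key1":"value1",
"Key2":"value2"
},

{
"Key1":"value3",
"Key2":"value4"
},

{
"Key1":"value5",
"Key2":"value6"
}"""

# Wrapping the comma-separated objects in [ ] turns them into one valid JSON array.
records = json.loads("[" + text + "]")

for record in records:
    print(json.dumps(record))
```

This avoids editing the commas at all, because the parser consumes them as ordinary array separators.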
Before you even consider other options, you really should go back to the source that generated that file and make sure it actually outputs valid json.
That said, you could use jq to manipulate the contents as a raw string to add brackets, then parse the result as an array and spit out its contents.
$ jq -Rs '"[\(.)]" | fromjson[]' abc.json