JSON string parser without literals

How can I check whether a string like {:[{},{}]}, which contains no literals, can be represented as a JSON object?
The input comes with the following constraints:
1. A JSON object should start with '{' and end with '}'.
2. The key and value should be separated by a ':'.
3. A ',' suggests an additional JSON property.
4. An array consists only of JSON objects. It cannot contain a "key":"value" pair by itself.
It is to be interpreted like this:
{
"Key": [{
"Key": "Value"
}, {
"Key": "Value"
}]
}

The syntax spec for JSON can be found here.
It indicates that the [{},{}] is legal, because [] has to contain 0 or more elements separated by ,, and {} is a legal element. However, the first part of your example is NOT valid - the : must have a string in front of it. While it is legal for it to be an empty string, it's not legal for it to be null, and the interpretation of a totally missing element is ambiguous.
So {"":[{},{}]} is legal, but {:[{},{}]} is not.
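As a quick check of that conclusion, both strings can be handed to Python's standard json module. This is only a sketch: the helper name is_valid_json is made up for illustration, while json.loads and json.JSONDecodeError come from the standard library.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(is_valid_json('{"":[{},{}]}'))  # True: the key is an empty string
print(is_valid_json('{:[{},{}]}'))    # False: the key is missing entirely
```

Note that this tests strict JSON validity, not the looser literal-free grammar described in the question.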

Related

Python regex parse file name with underscore separated fields

I have the following format which parameterises a file name.
"{variable}_{domain}_{GCMsource}_{scenario}_{member}_{RCMsource}_{RCMversion}_{frequency}_{start}-{end}_{fid}.nc"
e.g.
"pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"
(Note that {start}-{end} is hyphen-separated rather than underscore-separated.)
The various fields are always separated by underscores and contain a predictable (but variable) format. In the example file name I have left out the final {fid} field as I would like that to be optional.
I'd like to use regex in python to parse such a file name to give me a dict or similar with keys for the field names in the format string and the corresponding values of the parsed file name. e.g.
{
"variable": "pr",
"domain": "EUR-11",
"GCMsource": "CNRM-CERFACS-CNRM-CM5",
"scenario": "rcp45",
"member": "r1i1p1",
"RCMsource": "CLMcom-CCLM4-8-17",
"RCMversion": "v1",
"frequency": "day",
"start": "20060101",
"end": "20101231",
"fid": None
}
The regex pattern for each field can be constrained depending on the field, e.g.
"domain" is always 3 letters - 2 numbers
"member" is always rWiXpY where W, X and Y are numbers.
"scenario" always contains the letters "rcp" followed by 2 numbers.
"start" and "end" are always 8 digit numbers (YYYYMMDD)
There are never underscores within a field, underscores are only used to separate fields.
Note that I have used https://github.com/r1chardj0n3s/parse with some success, but I don't think it is flexible enough for my needs (when parsing other filenames with similar formats, the patterns often get confused with one another).
It would be great if the answer can explain some regex principles which will allow me to do this.

Documentation for regular expressions in Python: https://docs.python.org/3/howto/regex.html#regex-howto
Named groups in Python regular expressions:
https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups
import re
test_string = "pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"
pattern = r"""
(?P<variable>[a-zA-Z0-9]+)_                      # fields never contain underscores
(?P<domain>[a-zA-Z]{3}-\d{2})_                   # 3 letters, a hyphen, 2 digits
(?P<GCMsource>[A-Z0-9]+(?:-[A-Z0-9]+)*)_         # hyphen-separated upper-case chunks
(?P<scenario>rcp\d{2})_                          # rcp followed by 2 digits
(?P<member>r\d+i\d+p\d+)_                        # rWiXpY
(?P<RCMsource>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)_   # hyphen-separated chunks
(?P<RCMversion>[a-zA-Z0-9]+)_
(?P<frequency>[a-zA-Z0-9-]+)_
(?P<start>\d{8})-                                # YYYYMMDD
(?P<end>\d{8})
(?:_(?P<fid>[a-zA-Z0-9]+))?                      # optional trailing field with its underscore
\.nc                                             # escaped dot, so only a literal .nc matches
"""
re_object = re.compile(pattern, re.VERBOSE)  # VERBOSE ignores whitespace and allows comments in the pattern
search_result = re_object.fullmatch(test_string)  # the whole file name must match
print(search_result.groupdict())
# result:
"""
{'variable': 'pr', 'domain': 'EUR-11', 'GCMsource': 'CNRM-CERFACS-CNRM-CM5', 'scenario': 'rcp45', 'member': 'r1i1p1', 'RCMsource': 'CLMcom-CCLM4-8-17', 'RCMversion': 'v1', 'frequency': 'day', 'start': '20060101', 'end': '20101231', 'fid': None}
"""
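Since the question mentions several similar file name formats, one possible way to reuse the idea is to build the regex from the format string itself. The sketch below is not part of the original answer: the names FIELD_PATTERNS, DEFAULT_PATTERN and compile_template are made up for illustration, and the optional {fid} field is left out to keep it short. Fields without a specific pattern fall back to "anything except an underscore".

```python
import re

# Field-specific patterns taken from the constraints in the question;
# any field not listed falls back to "one or more non-underscore characters".
FIELD_PATTERNS = {
    "domain": r"[a-zA-Z]{3}-\d{2}",
    "scenario": r"rcp\d{2}",
    "member": r"r\d+i\d+p\d+",
    "start": r"\d{8}",
    "end": r"\d{8}",
}
DEFAULT_PATTERN = r"[^_]+"

def compile_template(template):
    """Turn a '{field}_{field}' style template into a compiled regex."""
    parts = []
    pos = 0
    for placeholder in re.finditer(r"\{(\w+)\}", template):
        # Literal text between placeholders (underscores, the hyphen, ".nc").
        parts.append(re.escape(template[pos:placeholder.start()]))
        field_name = placeholder.group(1)
        field_pattern = FIELD_PATTERNS.get(field_name, DEFAULT_PATTERN)
        parts.append("(?P<{}>{})".format(field_name, field_pattern))
        pos = placeholder.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))

template = ("{variable}_{domain}_{GCMsource}_{scenario}_{member}_"
            "{RCMsource}_{RCMversion}_{frequency}_{start}-{end}.nc")
file_name = "pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"

match = compile_template(template).fullmatch(file_name)
fields = match.groupdict() if match else None
print(fields)
```

Because fullmatch returns None when any field violates its pattern, a file name in a different format is rejected rather than silently mis-parsed.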

List of dictionaries where value has two double quoted values

I ended up with a list of dictionaries as a string. I wanted to convert this string to a dictionary, but it gives an error.
data = '{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}'
After checking it, I found a value that is enclosed in double quotes twice.
data = {
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"", # this value
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}
I want to turn "Alte Mühle" into the single-quoted 'Alte Mühle' or just Alte Mühle. I tried converting the dictionary to a str and using the string.replace() function, but it didn't work. Since the value is dynamic, I can't just change it in a static way, i.e.
string.replace('"Alte Mühle"', 'Alte Mühle') # will only change this value
Is there any way to get rid of this?

Not enough rep to comment, so I'm assuming you are starting with a bunch of string literals you typed manually into your code. If not, there are other ways to handle this, or it may not have been an issue to start with.
Here is a solution that doesn't require manually searching for problem strings. Enclose your dictionary string literal in triple quotes (either """ or ''' is permitted) instead of a single ' or ". This prevents the interpreter from getting confused by ' or " inside the string literal.
data = """{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}"""
Next, the double quote problem can be handled using regular expressions (re). I have to leave this as an exercise as I am on a phone, but the idea is to find each string value with a per-line pattern along the lines of ":\s*"(.+)", and replace every " inside the captured value with '. Find this pattern, modify the substring, then replace the old substring with the corrected one.
Finally, to interpret it as a dictionary, call ast.literal_eval(...) on the corrected string (a version of eval(...) made safer by only interpreting literals). Requires the standard library ast import.
Consider comparing this workload vs manually fixing your strings or loading the strings or key/value pairs from a database, avoiding these string literal issues altogether.
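The exercise left open above could be approached line by line, roughly as follows. This is a sketch under assumptions that are not in the original answer: each key/value pair sits on its own line, string values contain no backslashes, and the names LINE and fix_line are made up for illustration.

```python
import ast
import re

data = """{
"address": "Ludwig-Wolf-Straß 1, 75181 Pforzheim Eutingen",
"lat": 48.90962790,
"lng": 8.74648390,
"name": "Psychiatrische Tagesklinik Pforzheim "Alte Mühle"",
"path": "appportrait7e29d81c345927b0start",
"color" : "yellow",
"zIndex": "30",}"""

# Group 1: key, colon and opening quote. Group 2: the raw value.
# Group 3: closing quote plus an optional comma and closing brace.
LINE = re.compile(r'^(\s*"[^"]*"\s*:\s*")(.*)("\s*,?\s*\}?\s*)$')

def fix_line(line):
    """Replace double quotes inside a string value with single quotes."""
    m = LINE.match(line)
    if m is None:
        return line  # not a string-valued line (e.g. a number or a lone brace)
    head, value, tail = m.groups()
    return head + value.replace('"', "'") + tail

fixed = "\n".join(fix_line(line) for line in data.splitlines())
parsed = ast.literal_eval(fixed)  # tolerates the trailing comma, unlike json.loads
print(parsed["name"])
```

Because the value group is greedy, it extends to the last quote on the line, so the stray quotes end up inside the captured value rather than terminating it.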

Converting string containing double quotes to json

Python Escape Double quote character and convert the string to json
I have tried escaping the double quotes with escape characters, but that didn't work either.
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
Loading fails with the error: Expecting ',' delimiter: line 1 column 180 (char 179)
The expected output is a JSON string.

The correct JSON string, with escaped quotes should look like this:
[{
"Attribute": "color",
"Keywords": "green",
"AttributeComments": null
}, {
"Attribute": " season",
"Keywords": ["Holly Berry"],
"AttributeComments": null
}, {
"Attribute": " size",
"Keywords": "20\"x30\"",
"AttributeComments": null
}, {
"Attribute": " unit",
"Keywords": "1",
"AttributeComments": null
}]
Edit:
You can use a regular expression to correct the string in Python, resulting in valid JSON:
import re
import json
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
# raw strings keep \d and \g<1> from being treated as string escapes
pattern = r'"Keywords":"(\d+)"x(\d+)""'
correctedString = re.sub(pattern, r'"Keywords": "\g<1>x\g<2>"', raw_string)
print(json.loads(correctedString))
Output (from Python 2; Python 3 prints the same data without the u prefixes):
[{u'Keywords': u'green', u'Attribute': u'color', u'AttributeComments': None}, {u'Keywords': [u'Holly Berry'], u'Attribute': u' season', u'AttributeComments': None}, {u'Keywords': u'20x30', u'Attribute': u' size', u'AttributeComments': None}, {u'Keywords': u'1', u'Attribute': u' unit', u'AttributeComments': None}]
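The pattern above is tied to the Keywords value and drops the inch marks. A more general variant could escape stray quotes instead of removing them. The sketch below rests on a heuristic that is an assumption, not something from the original answers: the JSON is compact (no whitespace around separators), and a quote counts as structural only when it directly follows one of { [ : , or directly precedes one of } ] : ,. The names STRAY_QUOTE and escape_stray_quotes are made up for illustration.

```python
import json
import re

raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20"x30"","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'

# Matches a double quote that neither follows { [ : , nor precedes } ] : ,
STRAY_QUOTE = re.compile(r'(?<![{\[:,])"(?![}\]:,])')

def escape_stray_quotes(text):
    """Prefix every stray double quote with a backslash."""
    return STRAY_QUOTE.sub(lambda m: '\\"', text)

repaired = escape_stray_quotes(raw_string)
data = json.loads(repaired)
print(data[2]["Keywords"])  # 20"x30"
```

The heuristic breaks down if a value itself contains sequences such as ", or ":, so fixing the data at its source remains the more reliable option.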

raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20x30","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)

First of all, change the key-value pair "Keywords":"20"x30"" to "Keywords":"20x30".
The formatting in your code is invalid. If this JSON was not written by you but generated by some other source, check that source. You can check whether JSON is valid using JSONLint: just paste your JSON there.
As for your code:
import json
raw_string = '[{"Attribute":"color","Keywords":"green","AttributeComments":null},{"Attribute":" season","Keywords":["Holly Berry"],"AttributeComments":null},{"Attribute":" size","Keywords":"20x30","AttributeComments":null},{"Attribute":" unit","Keywords":"1","AttributeComments":null}]'
new_data = json.loads(raw_string)
new_data is a list. If you check the type of its first element using print(type(new_data[0])), you'll find it is the dict you wanted.
EDIT: Since you say you are fetching this JSON from a database, check whether all the JSON documents there carry this type of formatting error. If they do, find out where they are being generated. Your options are either to correct the problem at the source or, if this is a one-off problem, to correct the data manually by adding escape characters. I strongly suggest the former.

Concatenating strings in Python with single and double quotes is not giving the right result

How can I concatenate strings in Python that contain single quotes and double quotes?
a = "{'requests': [{ "
c = '"image" : {"source" : '
a+c
is giving a "\" before the single quote
{\'requests\': [{ "image" : {"source" :
I need an output like this
{'requests': [{ "image" : {"source" :

This has nothing to do with what the string actually looks like; try printing it:
print(a+c)
It is simply a matter of how Python represents the string in an interactive session (or when repr()d).
See also: Built-in Functions: repr() in the Python docs.
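A small sketch of the difference, using the strings from the question (the variable name joined is only for illustration):

```python
a = "{'requests': [{ "
c = '"image" : {"source" : '
joined = a + c

print(repr(joined))  # what the interactive prompt echoes: includes \' escapes
print(joined)        # the actual characters: no backslashes

# The backslashes exist only in the repr, not in the string itself.
print("\\" in joined)        # False
print("\\" in repr(joined))  # True
```

The string contains both kinds of quote, so repr() has to escape one of them to produce a valid literal; the stored data is unaffected.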

Extract the data specified in brackets '[ ]' from a string message in python

I want to extract fields from the log message below.
Example:
Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]
I need to extract the data given in the brackets [ ] for "Affected columns", "reason" and "Details".
What would be an efficient way to extract these fields in Python?
Note: I can modify the log message format if needed.

If you are free to change the log format, it's easiest to use a common data format - I'd recommend JSON for such data. It is structured, but lightweight enough to write it even from custom bash scripts. The json module allows you to directly convert it to native python objects:
import json # python has a default parser
# assume this is your log message
log_line = '{"Ignoring entry" : {"Affected columns": [1, 3], "reason" : "some reason", "Details": {}}}'
data = json.loads(log_line)
print("Columns to ignore:", data["Ignoring entry"]["Affected columns"])
If you want to work with the current format, you'll have to work with str methods or the re module.
For example, you could do this:
log_msg = "Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]"
def parse_log_line(log_line):
    if log_line.startswith("Ignoring entry"):
        log_data = {}
        for element in log_line.split(',')[1:]:  # parse all elements but the header
            key, _, value = element.partition('[')  # partition() returns (head, separator, tail)
            if not value.endswith(']'):
                raise ValueError('Malformed Content. Expected %r to end with "]"' % element)
            log_data[key.strip()] = value[:-1]  # drop the closing bracket
        return log_data
    raise ValueError('Unrecognized log line type')
Many parsing tasks are handled most compactly by the re module, which lets you use regular expressions. They are very powerful, but difficult to maintain if you are not used to them. In your case, the following would work:
import re
log_data = {key: value for key, value in re.findall(r',\s*(.+?)\s*\[(.+?)\]', log_line)}
The regular expression works like this:
, a literal comma, separating your entries
\s* an arbitrary sequence of whitespace after the comma, before the next element
(.+?) the shortest sequence of characters that lets the rest of the pattern match (the key, captured via '()')
\s* an arbitrary sequence of whitespace between key and value
\[ a literal [
(.+?) the shortest sequence of characters before the closing bracket (the value, captured via '()')
\] a literal ]
The symbols *, + and ? mean "zero or more", "one or more", and, when placed after another quantifier, "as few as possible".
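To see the pattern at work on the sample line, it can be wrapped in a small function. The names FIELD and parse_fields are made up for this sketch, and splitting the column list on ':' is an assumption about what the sample message means.

```python
import re

log_msg = "Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]"

# comma, optional whitespace, lazy key, optional whitespace, [value]
FIELD = re.compile(r",\s*(.+?)\s*\[(.+?)\]")

def parse_fields(log_line):
    """Return a dict mapping each field name to its bracketed value."""
    return {key: value for key, value in FIELD.findall(log_line)}

fields = parse_fields(log_msg)
print(fields)
# {'Affected columns': 'column1:column2', 'reason': 'some reason', 'Details': 'some entry details'}

columns = fields["Affected columns"].split(":")
print(columns)  # ['column1', 'column2']
```

Compiling the pattern once is a minor convenience when many log lines are processed; re.findall with the pattern string works equally well for a single line.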
