Parse non-standard semicolon separated "JSON" - python

I have a non-standard "JSON" file to parse. Each item is semicolon separated instead of comma separated. I can't simply replace ; with , because there might be some value containing ;, ex. "hello; world". How can I parse this into the same structure that JSON would normally parse it?
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
...
}

Use the Python tokenize module to transform the text stream to one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even including semicolons. The tokenizer presents strings as whole tokens, and 'raw' semicolons are in the stream as single token.OP tokens for you to replace:
import tokenize
import json
corrected = []
with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])
data = json.loads(''.join(corrected))
This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } allowed, although you could even track the last comma added and remove it again if the next non-newline token is a closing brace.
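The trailing-separator idea mentioned above could be sketched roughly as follows, for Python 3. This is only an illustration: the name loads_semicolon_json is made up here, the function reads from a string rather than a file, and it assumes the input tokenizes cleanly with Python's tokenizer. Instead of removing a comma after the fact, it delays emitting the comma until the next meaningful token is known.

```python
import io
import json
import tokenize

# Token types that carry no JSON content and can be dropped.
SKIP = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER}

def loads_semicolon_json(text):
    """Parse ';'-separated JSON-like text, tolerating a ';' before ']' or '}'.

    Hypothetical helper: a sketch, not a hardened parser.
    """
    out = []
    pending_comma = False  # a ';' was seen but its ',' is not emitted yet
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.OP and tok.string == ';':
            pending_comma = True
            continue
        if pending_comma:
            # Only emit the comma if another element follows it.
            if tok.string not in (']', '}'):
                out.append(',')
            pending_comma = False
        out.append(tok.string)
    return json.loads(''.join(out))

sample = '''{
"client" : "someone";
"server" : ["s1"; "s2";];
"content" : "hello; world";
}
'''
print(loads_semicolon_json(sample))
```

String tokens keep their quotes, so a value such as "}" is never mistaken for a closing brace.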
Demo:
>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
... "client" : "someone";
... "server" : ["s1"; "s2"];
... "timestamp" : 1000000;
... "content" : "hello; world"
... }
... ''')
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print ''.join(corrected)
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{u'content': u'hello; world', u'timestamp': 1000000, u'client': u'someone', u'server': [u's1', u's2']}
Inter-token whitespace was dropped, but could be re-instated by paying attention to the tokenize.NL tokens and the (lineno, start) and (lineno, end) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.

You can do some odd things and get it (probably) right.
Because JSON strings cannot contain raw control characters such as \t, you can replace every ; with \t, (a tab followed by a comma). Outside strings the tab is just whitespace, so the file parses correctly as long as your JSON parser can load non-strict JSON (as Python's can).
Afterwards, you only need to dump your data back to JSON, where each tab inside a string now appears as the escape \t, so you can replace every \t, with ; and use a normal JSON parser to finally load the correct object.
Some sample code in Python:
data = '''{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world"
}'''
import json
# strict=False lets the decoder accept raw tabs inside strings
dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
# dumping escapes each tab as the two characters \t
enc = json.dumps(dec)
# turn the escaped "\t," inside strings back into ";"
out = json.loads(enc.replace('\\t,', ';'))

Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is to determine the current "state" (whether we are escaping a character, in a string, list, dictionary, etc), and replace ';' by ',' when in a certain state.
I don't know if this is the proper way to write it; there is probably a way to make it shorter, but I don't have enough programming skill to write an optimal version.
I tried to comment as much as I could:
def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }
    # these two variables represent the current state of the parser
    escaping = False
    state = list()
    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        else:
            if c == "\\":
                # character is a backslash, set the escaping flag for the next character
                escaping = True
            elif state and c == state[-1]:
                # character is expected closing token, update state
                state.pop()
            elif c in STATES:
                # character is known opening token, update state
                state.append(STATES[c])
            elif c == ';' and state == ['}']:
                # this is the delimiter we want to change
                c = ','
        yield c
    assert not state, "unexpected end of file"
def filter_text(text):
    return ''.join(filter_characters(text))
Testing with :
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
...
}
Returns :
{
"client" : "someone",
"server" : ["s1"; "s2"],
"timestamp" : 1000000,
"content" : "hello; world",
...
}
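A smaller variant of the same idea, as a rough sketch: if the only state that matters is whether we are inside a double-quoted string, the stack can be dropped. Unlike the generator above, this version also converts the separators inside lists. The name semicolons_to_commas is made up for this example, and the sketch assumes there is no trailing ; before a closing bracket.

```python
import json

def semicolons_to_commas(text):
    """Replace ';' with ',' everywhere except inside double-quoted strings."""
    out = []
    in_string = False  # currently inside a "..." string
    escaped = False    # previous character was a backslash inside a string
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # escaped character, copy it as-is
            elif ch == '\\':
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True         # opening quote
        elif ch == ';':
            ch = ','                 # separator outside any string
        out.append(ch)
    return ''.join(out)

fixed = semicolons_to_commas(
    '{"client": "someone"; "server": ["s1"; "s2"]; "content": "hello; world"}'
)
print(json.loads(fixed))
```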

Pyparsing makes it easy to write a string transformer. Write an expression for the string to be changed, and add a parse action (a parse-time callback) to replace the matched text with what you want. If you need to avoid some cases (like quoted strings or comments), then include them in the scanner, but just leave them unchanged. Then, to actually transform the string, call scanner.transformString.
(It wasn't clear from your example whether you might have a ';' after the last element in one of your bracketed lists, so I added a term to suppress these, since a trailing ',' in a bracketed list is also invalid JSON.)
sample = """
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
}"""
from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json
SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString
scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))
prints:
{
"client" : "someone",
"server" : ["s1", "s2"],
"timestamp" : 1000000,
"content" : "hello; world"}
{'content': 'hello; world', 'timestamp': 1000000, 'client': 'someone', 'server': ['s1', 's2']}

Related

Regular Expression to remove selective string

I'm looking to remove a particular fragment that appears inside a JSON string.
For example, my JSON string is:
{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\", \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\", \"\": {\"updated_at\": \"2020-09-21T10:17:48.307874Z\", \"updated_by\": \"Def Ghi<def_ghi#uuvvww.com>\"}}
}]
}
I want to remove: \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}
Expected output :
{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\"}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\"}
}]
}
I tried ( \\"\\": {\\"updated_\w+)(.*)(>\\")
and used this in my code:
import re
line = re.sub(r"updated_\w+(.*)(.com>)", '', json_str)
But it also matches everything in between, because there are two occurrences of "": {"updated_at\ and "updated_by",
and it leaves the special characters "": {""} behind.
How can I completely remove \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}?
Try this:
\{\"updated_at[^{]+\}
This matches from the relevant opening { to the relevant closing } by allowing any character except { to occur one or more times in between.
With a Python JSON string I was able to remove the unwanted fields as below.
The substitution removes the unwanted empty key together with the comma in front of it, so each object is left closed by its own } and the JSON stays intact.
Regex: ,\s*\\"\\":\s*\{\\"updated_at[^{}]+\}
import re
# raw string, so the backslashes of the embedded JSON are kept literally
json_str = r'''{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\", \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\", \"\": {\"updated_at\": \"2020-09-21T10:17:48.307874Z\", \"updated_by\": \"Def Ghi<def_ghi#uuvvww.com>\"}}
}]
}'''
# [^{}]+ cannot cross a brace, so each match stops at the inner closing }
line = re.sub(r',\s*\\"\\":\s*\{\\"updated_at[^{}]+\}', '', json_str)
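As an alternative to regular expressions, the same cleanup could be done by parsing. This is a minimal sketch that assumes the outer document and the embedded "Conf" string are both complete, valid JSON (the sample in the question is truncated, so that is an assumption). The helper name strip_empty_keys is hypothetical; the "rows" and "Conf" keys come from the question.

```python
import json

def strip_empty_keys(outer_json):
    """Drop the "" entry from every object stored in each row's "Conf" string."""
    doc = json.loads(outer_json)
    for row in doc["rows"]:
        conf = json.loads(row["Conf"])   # "Conf" holds JSON encoded as a string
        for entry in conf:
            entry.pop("", None)          # remove the empty key if present
        row["Conf"] = json.dumps(conf)   # re-encode as a JSON string
    return json.dumps(doc)

# Small made-up document with the same shape as the one in the question.
conf = json.dumps([
    {"name": "state", "": {"updated_at": "2020-09-16T06:33:07.684504Z"}},
    {"name": "stClu", "": {"updated_at": "2020-09-21T10:17:48.307874Z"}},
])
doc = json.dumps({"tableName": "avzConf", "rows": [{"Comp": "mster", "Conf": conf}]})
print(strip_empty_keys(doc))
```

Because the data is parsed rather than pattern-matched, nested braces and repeated occurrences need no special handling.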

format a json and then open it with the json.load () [duplicate]

I currently have JSON in the below format.
Some of the Key values are NOT properly formatted as they are missing double quotes (")
How do I fix these key values to have double-quotes on them?
{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}
Required:
{
"Name": "test",
"Address": "xyz",
"Age": 40,
"Info": "test"
}
Using the below post, I was able to find such key values in the above INVALID JSON.
However, I could NOT find an efficient way to replace these found values with double-quotes.
s = "Example: String"
out = re.findall(r'\w+:', s)
How to Escape Double Quote inside JSON
Using Regex:
import re
data = """{ Name: "test", Address: "xyz"}"""
print(re.sub(r"(\w+):", r'"\1":', data))
Output:
{ "Name": "test", "Address": "xyz"}
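A slightly more conservative pattern is sketched below. It only quotes an identifier that directly follows { or , so a colon inside a quoted value such as a timestamp is normally left alone. The names BARE_KEY and quote_bare_keys are made up for this example, and the approach can still misfire on string values that themselves contain text like ", word:".

```python
import json
import re

# A bare key: an identifier right after '{' or ',' and right before ':'.
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)')

def quote_bare_keys(text):
    """Wrap unquoted object keys in double quotes (heuristic, not a parser)."""
    return BARE_KEY.sub(r'\1"\2"\3', text)

dirty = '''{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}'''
print(json.loads(quote_bare_keys(dirty)))
```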
You can use PyYaml. Since JSON is a subset of Yaml, pyyaml may overcome the lack of quotes.
Example
import yaml
dirty_json = """
{
key: "value",
"key2": "value"
}
"""
yaml.load(dirty_json, yaml.SafeLoader)
I faced a few more issues in my JSON, so I thought I'd share the final solution that worked for me.
import re
jsonStr = re.sub(r"((?=\D)\w+):", r'"\1":', jsonStr)
jsonStr = re.sub(r": ((?=\D)\w+)", r':"\1"', jsonStr)
The first line fixes the missing double quotes around a key, i.e.
Name: "test"
The second line fixes the missing double quotes around a value, i.e. "Info": test
The above also avoids double-quoting inside date timestamps that contain : (colon).
You can use an online formatter. I know most of them throw an error when double quotes are missing, but the one below seems to handle it nicely!
JSON Formatter
The regex approach can be brittle. I suggest you find a library that can parse the JSON text that is missing quotes.
For example, in Kotlin 1.4, the standard way to parse a JSON string is using Json.decodeFromString. However, you can use Json { isLenient = true }.decodeFromString to relax the requirements for quotes. Here is a complete example in JUnit.
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import org.junit.jupiter.api.Assertions
import org.junit.jupiter.api.Test
@Serializable
data class Widget(val x: Int, val y: String)
class JsonTest {
    @Test
    fun `Parsing Json`() {
        val w: Widget = Json.decodeFromString("""{"x":123, "y":"abc"}""")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
    @Test
    fun `Parsing Json missing quotes`() {
        // Json.decodeFromString("{x:123, y:abc}") failed to decode due to missing quotes
        val w: Widget = Json { isLenient = true }.decodeFromString("{x:123, y:abc}")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
}

How to remove the first and last portion of a string in Python?

How can I cut from such a (JSON) string everything before and including the first [ and everything after and including the last ] with Python?
{
"Customers": [
{
"cID": "w2-502952",
"soldToId": "34124"
},
...
...
],
"status": {
"success": true,
"message": "Customers: 560",
"ErrorCode": ""
}
}
I want to end up with only
{
"cID" : "w2-502952",
"soldToId" : "34124",
}
...
...
String manipulation is not the way to do this. You should parse your JSON into Python and extract the relevant data using normal data structure access.
import json
obj = json.loads(data)
relevant_data = obj["Customers"]
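In addition to @Daniel Rosman's answer, if you want all the lists from the JSON:
import json
result = []
obj = json.loads(data)
for value in obj.values():
    if isinstance(value, list):
        # extend() adds every element; append(*value) fails unless the list has exactly one item
        result.extend(value)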
While I agree that Daniel's answer is the absolute best way to go, if you must use string slicing, you can try .find() and .rfind():
# string holds the JSON text, however you are loading it
start = string.find('[') + 1   # just past the first [
end = string.rfind(']')        # position of the last ]
customers = string[start:end]
print(customers)
The output will be everything between the first [ and the last ] bracket.
If you really want to do this via string manipulation (which I don't recommend), you can do it this way:
start = s.find('[') + 1
finish = s.rfind(']')  # rfind, so that we stop at the last ]
inner = s[start : finish]

In JSON output, force every opening curly brace to appear in a new separate line

With json.dumps(some_dict,indent=4,sort_keys=True) in my code:
I get something like this:
{
    "a": {
        "x":1,
        "y":2
    },
    "b": {
        "z":3,
        "w":4
    }
}
But I want something like this:
{
    "a":
    {
        "x":1,
        "y":2
    },
    "b":
    {
        "z":3,
        "w":4
    }
}
How can I force each opening curly brace to appear at the beginning of a new separate line?
Do I have to write my own JSON serializer, or is there a special argument that I can use when calling json.dumps?
You can use a regular expression replacement on the result.
better_json = re.sub(r'^((\s*)".*?":)\s*([\[{])', r'\1\n\2\3', json, flags=re.MULTILINE)
The first capture group matches everything up to the : after the property name, the second capture group matches the whitespace before the property name, and the third capture group captures the { or [ before the object or array. The whitespace is then copied after the newline, so that the indentation will match properly.
DEMO
Building on Barmar's excellent answer, here's a more complete demo showing how you can convert and customize your JSON in Python:
import json
import re
# JSONifies dict and saves it to file
def save(data, filename):
    with open(filename, "w") as write_file:
        write_file.write(jsonify(data))
# Converts Python dict to a JSON string. Indents with tabs and puts opening
# braces on their own line.
def jsonify(data):
    default_json = json.dumps(data, indent = '\t')
    better_json = re.sub(
        r'^((\s*)".*?":)\s*([\[{])',
        r'\1\n\2\3',
        default_json,
        flags=re.MULTILINE
    )
    return better_json
# Sample data for demo
data = {
    "president":
    {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}
filename = 'test.json'
# Demo
print("Here's your pretty JSON:")
print(jsonify(data))
print()
print('Saving to file:', filename)
save(data, filename)

python parse bind configuration with nested curly braces

I'm trying to automatically parse an existing bind configuration, consisting of multiple of these zone definitions:
zone "domain.com" {
type slave;
file "sec/domain.com";
masters {
11.22.33.44;
55.66.77.88;
};
allow-transfer {
"acl1";
"acl2";
};
};
Note that the number of elements in masters and in allow-transfer may differ. I tried my way around splitting this using re.split() and failed horribly due to the nested curly braces.
My goal is a dictionary for each of these entries.
Thanks in advance for any help!
This should do the trick, where 'st' is a string of all your zone definitions:
import re
# re.split's third positional argument is maxsplit, so no flag is passed here
zone_def = re.split('zone', st)
big_dict = {}
for zone in zone_def:
    if len(zone) > 0:
        zone_name = re.search(r'(".*?")', zone)
        sub_dicts = re.finditer(r'([\w]+) ({.*?})', zone, re.DOTALL)
        big_dict[zone_name.group(1)] = {}
        for sub_dict in sub_dicts:
            big_dict[zone_name.group(1)][sub_dict.group(1)] = sub_dict.group(2).replace(' ', '')
        sub_types = re.finditer(r'([\w]+) (.*?);', zone)
        for sub_type in sub_types:
            big_dict[zone_name.group(1)][sub_type.group(1)] = sub_type.group(2)
big_dict will then hold a dictionary of zone definitions. Each zone definition will have the domain/url as its key. Every key/value in the zone definition is a string.
This is the output for the above example:
{'"domain.com"': {'transfer': '{\n"acl1";\n"acl2";\n}', 'masters': '{\n11.22.33.44;\n55.66.77.88;\n}', 'type': 'slave', 'file': '"sec/domain.com"'}}
And this is the output if you were to have a second identical zone, with key "sssss.com".
{'"sssss.com"': {'transfer': '{\n"acl1";\n"acl2";\n}', 'masters': '{\n11.22.33.44;\n55.66.77.88;\n}', 'type': 'slave', 'file': '"sec/domain.com"'},'"domain.com"': {'transfer': '{\n"acl1";\n"acl2";\n}', 'masters': '{\n11.22.33.44;\n55.66.77.88;\n}', 'type': 'slave', 'file': '"sec/domain.com"'}}
You will have to do some further stripping to make it more readable.
One way is to install and use the regex module instead of the re module. The problem is that the re module cannot deal with an arbitrary level of nested brackets:
#!/usr/bin/python
import regex
data = '''zone "domain.com" {
type slave;
file "sec/domain.com";
masters {
11.22.33.44; { toto { pouet } glups };
55.66.77.88;
};
allow-transfer {
"acl1";
"acl2";
};
}; '''
pattern = r'''(?V1xi)
(?:
\G(?<!^)
|
zone \s (?<zone> "[^"]+" ) \s* {
) \s*
(?<key> \S+ ) \s+
(?<value> (?: ({ (?> [^{}]++ | (?4) )* }) | "[^"]+" | \w+ ) ; )
'''
matches = regex.finditer(pattern, data)
for m in matches:
    if m.group("zone"):
        print("\n" + m.group("zone"))
    print(m.group("key") + "\t" + m.group("value"))
You can find more information about this module by following this link: https://pypi.python.org/pypi/regex
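If installing a third-party module is not an option, nested braces can also be handled with a small tokenizer and a recursive function using only the standard library. The following is a rough sketch under several assumptions: the names TOKEN_RE, parse_block and parse_bind are made up here, comments in the configuration are not handled, and a block that mixes bare items with key/value statements keeps only the key/value statements.

```python
import re

# A token is a quoted string, one of { } ; or a run of other non-space characters.
TOKEN_RE = re.compile(r'"[^"]*"|[{};]|[^\s{};"]+')

def parse_block(tokens):
    """Consume tokens up to the matching '}' (or the end of input)."""
    entries, items, words = {}, [], []
    for tok in tokens:
        if tok == '}':
            break
        if tok == '{':
            # The words before '{' name the nested block, e.g. zone domain.com
            entries[' '.join(words)] = parse_block(tokens)
            words = []
        elif tok == ';':
            if len(words) == 1:
                items.append(words[0])                   # bare list element
            elif words:
                entries[words[0]] = ' '.join(words[1:])  # "key value;" statement
            words = []
        else:
            words.append(tok.strip('"'))
    # A block is returned as a dict of statements, or as a list of bare items.
    return entries or items

def parse_bind(text):
    """Parse BIND-style nested configuration text into dicts and lists."""
    return parse_block(iter(TOKEN_RE.findall(text)))

config = '''zone "domain.com" {
    type slave;
    file "sec/domain.com";
    masters { 11.22.33.44; 55.66.77.88; };
    allow-transfer { "acl1"; "acl2"; };
};'''
print(parse_bind(config))
```

The recursion shares one token iterator, so each nested call picks up exactly where its caller stopped and returns once its own closing brace is consumed.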
