Custom format to JSON - python

How can I convert the following line (not sure what format this is) to JSON format?
[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]
where Root, Key3, Key3_3 denote complex elements.
to
{
    "root": {
        "key1": "value1",
        "key2": "value2",
        "key3": {
            "key3_1": "value3_1",
            "key3_2": "value3_2",
            "key3_3": {
                "key3_3_1": "value3_3_1"
            }
        },
        "key4": "value4"
    }
}
I am looking for the approach, not a solution. If you are down-voting this question, please comment on why you are doing so.

Let x be a string holding the above serialization.
First, let's strip the type names Root, Key3 and Key3_3:
# String fragments like "root=Root [" need to become "root=[".
# To achieve this, we match the regex pattern "\w+ [":
# it matches EVERY place in the input string where a word (letters,
# digits or underscores) is followed by a single space and then "[",
# i.e. "Root [", "Key3 [" and "Key3_3 [" are all matched, as is any
# other name of that shape you can think of.
# We replace each such fragment with "[" (which we will later replace with "{"),
# giving us the transformation "root=Root [" => "root=["
import re
o = re.compile(r'\w+ [[]')
y = o.sub('[', x)
Then, let's split the resulting string into words and non-words:
# Here we split the string into two lists, one containing the words
# and the other containing the runs of non-word characters between them.
# The idea is to split / recombine the source string with quotes around all our words.
w = re.compile(r'\W+')
nw = re.compile(r'\w+')
words = w.split(y)[1:-1]  # drop the empty strings at both ends
nonwords = nw.split(y)  # contiguous runs of non-word characters, i.e. not A-Za-z0-9_
struct = '"{}"'.join(nonwords)  # skeleton of the final output, with quoted placeholders for the words
almost_there = struct.format(*words)  # insert the words into the skeleton
And finally, replace the square brackets with curly ones, and = with ::
jeeson = almost_there.replace(']', '}').replace('=', ':').replace('[', '{')
# '{"root":{"key1":"value1", "key2":"value2", "key3":{"key3_1":"value3_1", "key3_2":"value3_2", "key3_3":{"key3_3_1":"value3_3_1"}}, "key4":"value4"}}'

I had to spend around two hours on this, but I think I have something that will work for all the cases based on the format you provided; if not, I'm sure it will only need a minor change. Even though you asked only for the idea, since I coded it up anyway, here's the Python code.
import json

def to_json(cust_str):
    left_indices = []
    levels = {}
    level = 0
    # First pass: record where every '[' is and how many dicts
    # open at each nesting level.
    for i, char in enumerate(cust_str):
        if char == '[':
            level += 1
            left_indices.append(i)
            if level in levels:
                levels[level] += 1
            else:
                levels[level] = 1
        elif char == ']':
            level -= 1
    level = max(levels.keys())
    value_stack = []
    while True:
        left_index = left_indices.pop()
        right_index = cust_str.find(']', left_index) + 1
        values = {}
        pairs = cust_str[left_index:right_index][1:-1].split(',')
        if levels[level] > 0:
            for pair in pairs:
                pair = pair.split('=')
                values[pair[0].strip()] = pair[1]
        else:
            level -= 1
            for pair in pairs:
                pair = pair.split('=')
                if pair[1][-1] == ' ':
                    # A trailing space means this value was a nested dict.
                    values[pair[0].strip()] = value_stack.pop()
                else:
                    values[pair[0].strip()] = pair[1]
        value_stack.append(values)
        levels[level] -= 1
        cust_str = cust_str[:left_index] + cust_str[right_index:]
        if levels[1] == 0:
            return json.dumps(values)

if __name__ == '__main__':
    # Data in custom format
    cust_str = '[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]'
    # Data in JSON format
    json_str = to_json(cust_str)
    print(json_str)
The idea is that we map how many dicts open at each nesting level of the custom format, and along with that we keep track of the indices of every [ character in the given string. We then start from the innermost dict representation by popping the stack containing the [ (left) indices and parsing each one. As each dict is parsed, we remove it from the string and continue. The rest you can probably read in the code.
I ran it for the data you gave and the result is as follows.
{
    "root": {
        "key2": "value2",
        "key3": {
            "key3_2": "value3_2",
            "key3_3": {
                "key3_3_1": "value3_3_1"
            },
            "key3_1": "value3_1"
        },
        "key1": "value1",
        "key4": "value4"
    }
}
Just to make sure it works for more general cases, I used this custom string.
[root=Root [key1=value1, key2=Key2 [key2_1=value2_1], key3=Key3 [key3_1=value3_1, key3_2=Key3_2 [key3_2_1=value3_2_1], key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]
And parsed it.
{
    "root": {
        "key2": {
            "key2_1": "value2_1"
        },
        "key3": {
            "key3_2": {
                "key3_2_1": "value3_2_1"
            },
            "key3_3": {
                "key3_3_1": "value3_3_1"
            },
            "key3_1": "value3_1"
        },
        "key1": "value1",
        "key4": "value4"
    }
}
Which, as far as I can see, is how it should be parsed. Also, remember not to strip the values, since the logic depends on the trailing whitespace that marks values holding nested dicts (if that makes any sense).

Related

Using Regular Expressions to Parse Based on Unique Character Sequence

I'm hoping to get some Python assistance with parsing out a column in a DataFrame that has a unique character sequence.
Each record can have a variable number of parameter name/value pairings.
The only way to determine where each name/value pairing ends is by looking for an equals sign and then finding the most immediate preceding comma. This gets a little tricky as some of the values will contain commas, so splitting on a comma won't always yield clean results.
Example below:
String
NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506
DESCRIPTION=TEST,TEST,TEST
End Result:
String1
String2
String3
String4
String5
String6
NAME=TEST
TEST ID=1234
ENTRY DESCR=Verify
ENTRY CLASS=CCD
TRACE NO=124313523,12414
ENTRY DATE=210506
DESCRIPTION=TEST,TEST,TEST
Thanks in advance for your help!
This can certainly be done with Regexs, but for a quick and dirty parser I would do it manually.
First a test suite:
import pytest

@pytest.mark.parametrize(
    "encoded,parsed",
    [
        ("X=Y", {"X": "Y"}),
        ("DESCRIPTION=TEST,TEST,TEST", {"DESCRIPTION": "TEST,TEST,TEST"}),
        (
            "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506",
            {
                "NAME": "TEST",
                "TEST ID": "1234",
                "ENTRY DESCR": "Verify",
                "ENTRY CLASS": "CCD",
                "TRACE NO": "124313523,12414",
                "ENTRY DATE": "210506",
            },
        ),
    ],
)
def test_parser(encoded, parsed):
    assert parser(encoded) == parsed
You'll need to pip install pytest if you don't already have it.
Then a parser:
def parser(encoded: str) -> dict[str, str]:
    parsed = {}
    val = []
    for token in reversed(encoded.split("=")):
        if val:
            *vals, token = token.split(",")
            parsed[token] = ",".join(val)
            val = vals
        else:
            val = token.split(",")
    return parsed
This is not a 'proper' parser (i.e. the traditional token, lex, parse) but handles this format. It works as follows:
step backwards through all the something=val pairs.
split the val (this is strictly pointless, but see below)
split something at the last comma (using a * expression to collect all the other components)
add a new entry into the parsed dict, joining the val back up again with commas
Note that this would work just as well with val = [token]. But you probably don't want a parser which returns a format which in turn needs parsing. You probably want it to turn ,-separated values into a list of appropriate types. Currently you have three types: strs, ints and a date. Thus ",".join(val) could profitably be replaced with [convert(x) for x in val]. convert might look something like this:
from datetime import date, datetime
from typing import Union

def convert(x: str) -> Union[date, int, str]:
    for candidate in (
        lambda x: datetime.strptime(x, "%y%m%d").date(),
        lambda x: int(x),
        lambda x: x,
    ):
        try:
            return candidate(x)
        except ValueError:
            pass
This would then be used by doing something like this in the parser:
converted = [convert(x) for x in val]
if len(converted) == 1:
    converted = converted[0]
parsed[token] = converted
However, this conversion function has a problem: it falsely identifies plain numbers as dates. How exactly to fix this depends on the input data. Perhaps the date-parsing function can stay context-agnostic and just check for a 6-digit input before parsing (or manually split the str and pass the pieces to datetime.date). Perhaps the decision needs to be made in the parser, based on whether the word "DATE" is in the key.
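One possible sketch of that fix (the exactly-six-digits check is an assumption about this particular data, not a general rule):

```python
from datetime import date, datetime
from typing import Union

def convert(x: str) -> Union[date, int, str]:
    # Only try date parsing on exactly-six-digit strings such as "210506",
    # so shorter numbers like "1234" are never misread as dates.
    if len(x) == 6 and x.isdigit():
        try:
            return datetime.strptime(x, "%y%m%d").date()
        except ValueError:
            pass
    try:
        return int(x)
    except ValueError:
        return x
```

A six-digit trace number would still be misread, so for messier data the key-based decision ("DATE" in the key) may be the safer route.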
If you really want to use regexs, have a look at negative lookaheads.
You can do this:
def parse(s: str) -> dict:
    # Split by "=" and by ","
    raw = [x.split(",") for x in s.split("=")]
    # Keys are the last element of each row, besides the last
    keys = [k[-1] for k in raw[:-1]]
    # Values are all the elements before the last, shifted by one
    values = [",".join(k[:-1]) for k in raw[1:-1]] + [",".join(raw[-1])]
    return dict(zip(keys, values))
If we try:
s1 = "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506"
s2 = "DESCRIPTION=TEST,TEST,TEST"
print(parse(s1))
print(parse(s2))
We get:
{'NAME': 'TEST',
 'TEST ID': '1234',
 'ENTRY DESCR': 'Verify',
 'ENTRY CLASS': 'CCD',
 'TRACE NO': '124313523,12414',
 'ENTRY DATE': '210506'}
{'DESCRIPTION': 'TEST,TEST,TEST'}
Thanks for the suggestions, everyone! I wound up figuring out a way using RegEx and did this in two lines of code.
import re
s1 = "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506"
regex = re.compile(',(?=[^,]+=)')
regex.split(s1)
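If you then want the pieces as a dict (my addition, not part of the original answer), one more split on the first = of each piece is enough:

```python
import re

s1 = "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506"

# Split only on commas that are directly followed by a key and an "=",
# so commas inside values survive.
pairs = re.split(r',(?=[^,]+=)', s1)
parsed = dict(p.split('=', 1) for p in pairs)
print(parsed['TRACE NO'])  # 124313523,12414
```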

Python add multiple strings to another string with indexes single time

I have a long text, and a list of dict objects holding indexes into this long text. I want to insert some strings at these indexes. If I do it in a loop, the indexes shift and I must recalculate them every time, which is very confusing. Is there a way to insert different strings at different indexes in a single pass?
My sample data:
main_str = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'
My indexes list:
indexes_list = [
    {
        "type": "first_type",
        "endOffset": 5,
        "startOffset": 0,
    },
    {
        "type": "second_type",
        "endOffset": 22,
        "startOffset": 16,
    }
]
My main purpose: I want to wrap the text at the given indexes in <span> tags with some color styles based on the types. After that I render it in a template directly. Do you have another suggestion?
For example, I want to create this data from the above main_str and indexes_list (please ignore the color part of the styles; I provide it dynamically from the value of type in indexes_list):
new_str = '<span style="color:#FFFFFF">Lorem</span> Ipsum is <span style="color:#FFFFFF">simply</span> dummy text of the printing and typesetting industry.'
Create a new str to avoid changing main_str:
main_str = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'
indexes_list = [
    {
        "type": "first_type",
        "startOffset": 0,
        "endOffset": 5,
    },
    {
        "type": "second_type",
        "startOffset": 16,
        "endOffset": 22,
    }
]

new_str = ""
index = 0
for i in indexes_list:
    start = i["startOffset"]
    end = i["endOffset"]
    new_str += main_str[index:start] + "<span>" + main_str[start:end] + "</span>"
    index = end
new_str += main_str[index:]
print(new_str)
Here is a solution without any imperative for loops. It still uses plenty of looping for the list comprehensions.
# Get all the indices and label them as starts or ends.
starts = [(o['startOffset'], True) for o in indexes_list]
ends = [(o['endOffset'], False) for o in indexes_list]
# Sort everything, with a sentinel at the very end so the tail of the string is kept...
all_indices = sorted(starts + ends) + [(len(main_str), False)]
# ...so it is possible to zip together adjacent pairs and extract substrings.
pieces = [
    (s[1], main_str[s[0]:e[0]])
    for s, e in zip(all_indices, all_indices[1:])
]
# And then join all the pieces together with a bit of conditional formatting.
formatted = ''.join([
    f"<span>{part}</span>" if is_start else part
    for is_start, part in pieces
])
formatted
# '<span>Lorem</span> Ipsum is s<span>imply </span>dummy text of the printing and typesetting industry.'
Also, although you said you do not want for loops, it is important to note that you do not have to do any index modification if you do the updates in reverse order.
def update_str(s, spans):
    for lookup in sorted(spans, reverse=True, key=lambda o: o['startOffset']):
        start = lookup['startOffset']
        end = lookup['endOffset']
        before, span, after = s[:start], s[start:end], s[end:]
        s = f'{before}<span>{span}</span>{after}'
    return s

update_str(main_str, indexes_list)
# '<span>Lorem</span> Ipsum is s<span>imply </span>dummy text of the printing and typesetting industry.'
The unvisited insertion indices won't change if you iterate backwards. This is true for all such problems. It sometimes even lets you modify sequences during iteration if you're careful (not that I'd ever recommend it).
You can find all insertion points from the dict, sort them backwards, and then do the insertion. For example:
items = ['<span ...>', '</span>']
keys = ['startOffset', 'endOffset']
insertion_points = [(d[key], item) for d in indexes_list for key, item in zip(keys, items)]
insertion_points.sort(reverse=True)
for index, content in insertion_points:
    main_str = main_str[:index] + content + main_str[index:]
The reason not to do that is that it's inefficient. For reasonable sized text that's not a huge problem, but keep in mind that you are chopping up and reallocating an ever increasing string with each step.
A much more efficient approach would be to chop up the entire string once at all the insertion points. Adding list elements at the right places with the right content would be much cheaper that way, and you would only have to rejoin the whole thing once:
items = ['<span ...>', '</span>']
keys = ['startOffset', 'endOffset']
insertion_points = [(d[key], item) for d in indexes_list for key, item in zip(keys, items)]
insertion_points.sort()
last = 0
chopped_str = []
for index, content in insertion_points:
    chopped_str.append(main_str[last:index])
    chopped_str.append(content)
    last = index
chopped_str.append(main_str[last:])
main_str = ''.join(chopped_str)
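For reference, here is the chop-and-join idea run end to end on the question's data (self-contained, with a plain <span> tag standing in for the styled one):

```python
main_str = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'
indexes_list = [
    {"type": "first_type", "startOffset": 0, "endOffset": 5},
    {"type": "second_type", "startOffset": 16, "endOffset": 22},
]

items = ['<span>', '</span>']
keys = ['startOffset', 'endOffset']
insertion_points = sorted(
    (d[key], item) for d in indexes_list for key, item in zip(keys, items)
)

# Chop the string once at every insertion point, interleaving the tags.
last = 0
chopped_str = []
for index, content in insertion_points:
    chopped_str.append(main_str[last:index])
    chopped_str.append(content)
    last = index
chopped_str.append(main_str[last:])

print(''.join(chopped_str))
# <span>Lorem</span> Ipsum is s<span>imply </span>dummy text of the printing and typesetting industry.
```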

Update last character of string with a value

I have two strings:
input = "12.34.45.362"
output = "2"
I want to be able to replace the 362 in input by 2 from output.
Thus the final result should be 12.34.45.2. I am unsure on how to do it. Any help is appreciated.
You can use a simple regex for this:
import re
input_ = "12.34.45.362"
output = "2"
input_ = re.sub(r"\.\d+$", f".{output}", input_)
print(input_)
Output:
12.34.45.2
Notice that I also changed input to input_, so we're not shadowing the built-in input() function.
You can also use a simpler but slightly less robust pattern, which doesn't take the period into account at all and just replaces the digits at the end:
import re
input_ = "12.34.45.362"
output = "2"
input_ = re.sub(r"\d+$", output, input_)
print(input_)
Output:
12.34.45.2
Just in case you want to do this for any string of form X.Y.Z.W where X, Y, Z, and W may be of non-constant length:
new_result = ".".join(your_input.split(".")[:-1]) + "." + output
s.join will join a collection into a single string, with the string s between each element. s.split will turn a string into a list, splitting on the given character .. Slicing the list (l[:-1]) gives you all but the last element, and finally string concatenation (assuming output is a str) gives you your result.
Breaking it down step-by-step:
your_input = "12.34.45.362"
your_input.split(".") # == ["12", "34", "45", "362"]
your_input.split(".")[:-1] # == ["12", "34", "45"]
".".join(your_input.split(".")[:-1]) # == "12.34.45"
".".join(your_input.split(".")[:-1]) + "." + output # == "12.34.45.2"
If you are trying to split on the last ., just do a right split, take everything before it, and use string formatting:
i = "12.34.45.362"
r = "{}.2".format(i.rsplit(".", 1)[0])
output
'12.34.45.2'

Regex Python find everything between four characters

I have a string that holds data. And I want everything in between ({ and })
"({Simple Data})"
Should return "Simple Data"
Or regex:
import re
s = '({Simple Data})'
print(re.search(r'\({([^})]+)', s).group(1))
Output:
Simple Data
You could try the following:
^\({(.*)}\)$
Group 1 will contain Simple Data.
See an example on regexr.
If the brackets are always positioned at the beginning and the end of the string, then you can do this:
l = "({Simple Data})"
print(l[2:-2])
Which resulst in:
"Simple Data"
In Python you can access single characters and substrings via the [] operator. With this you can take the sequence of characters starting at the third one (index 2) up to, but not including, the second-to-last (index -2).
You could try this regex, (?s)\(\{(.*?)\}\), which simply captures the contents between the delimiters.
Beware though, this doesn't account for nesting.
If nesting is a concern, the best you can do with the standard Python re engine
is to get the inner nest only, using this regex:
\(\{((?:(?!\(\{|\}\)).)*)\}\)
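A quick sketch of that inner-nest behaviour (the test strings are mine):

```python
import re

# Tempered dot: consume any character that does not start "({" or "})",
# so only an innermost pair of delimiters can ever match.
inner = re.compile(r'\(\{((?:(?!\(\{|\}\)).)*)\}\)')

print(inner.findall('({A ({B}) C})'))    # ['B']
print(inner.findall('({Simple Data})'))  # ['Simple Data']
```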
Hereby I designed a tokenizer aiming at nested data. OP should check it out.
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('DATA', r'[ \t]*[\w]+[\w \t]*'),
        ('SKIP', r'[ \t\f\v]+'),
        ('NEWLINE', r'\n|\r\n'),
        ('BOUND_L', r'\(\{'),
        ('BOUND_R', r'\}\)'),
        ('MISMATCH', r'.'),
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        else:
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

statements = '''
({Simple Data})
({
Parent Data Prefix
({Nested Data (Yes)})
Parent Data Suffix
})
'''

queue = collections.deque()
for token in tokenize(statements):
    if token.typ == 'DATA' or token.typ == 'MISMATCH':
        queue.append(token.value)
    elif token.typ == 'BOUND_L' or token.typ == 'BOUND_R':
        print(''.join(queue))
        queue.clear()
Output of this code should be:
Simple Data
Parent Data Prefix
Nested Data (Yes)
Parent Data Suffix

Parse Key Value Pairs in Python

So I have a key value file that's similar to JSON's format but it's different enough to not be picked up by the Python JSON parser.
Example:
"Matt"
{
"Location" "New York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat" "1"
}
}
Is there any easy way to parse this text file and store the values into an array such that I could access the data using a format similar to Matt[Items][Banana]? There is only to be one pair per line and a bracket should denote going down a level and going up a level.
You could use re.sub to 'fix up' your string and then parse it. As long as the format is always either a single quoted string or a pair of quoted strings on each line, you can use that to determine where to place commas and colons.
import re

s = """"Matt"
{
    "Location" "New York"
    "Age" "22"
    "Items"
    {
        "Banana" "2"
        "Apple" "5"
        "Cat" "1"
    }
}"""

# Put a colon after the first string in every line
s1 = re.sub(r'^\s*(".+?")', r'\1:', s, flags=re.MULTILINE)
# add a comma if the last non-whitespace character in a line is " or }
s2 = re.sub(r'(["}])\s*$', r'\1,', s1, flags=re.MULTILINE)
Once you've done that, you can use ast.literal_eval to turn it into a Python dict. I use that over JSON parsing because it allows for trailing commas, without which the decision of where to put commas becomes a lot more complicated:
import ast
data = ast.literal_eval('{' + s2 + '}')
print(data['Matt']['Items']['Banana'])
# 2
Not sure how robust this approach is outside of the example you've posted, but it does support escaped characters and deeper levels of structured data. It's probably not going to be fast enough for large amounts of data.
The approach converts your custom data format to JSON using a (very) simple parser to add the required colons and braces, the JSON data can then be converted to a native Python dictionary.
import json

# Define the data that needs to be parsed
data = '''
"Matt"
{
    "Location" "New \\"York"
    "Age" "22"
    "Items"
    {
        "Banana" "2"
        "Apple" "5"
        "Cat"
        {
            "foo" "bar"
        }
    }
}
'''

# Convert the data from custom format to JSON
json_data = ''

# Define parser states
state = 'OUT'
key_or_value = 'KEY'

for c in data:
    # Handle quote characters
    if c == '"':
        json_data += c
        if state == 'IN':
            state = 'OUT'
            if key_or_value == 'KEY':
                key_or_value = 'VALUE'
                json_data += ':'
            elif key_or_value == 'VALUE':
                key_or_value = 'KEY'
                json_data += ','
        else:
            state = 'IN'
    # Handle braces
    elif c == '{':
        if state == 'OUT':
            key_or_value = 'KEY'
        json_data += c
    elif c == '}':
        # Strip trailing comma and add closing brace and comma
        json_data = json_data.rstrip().rstrip(',') + '},'
    # Handle escaped characters
    elif c == '\\':
        state = 'ESCAPED'
        json_data += c
    else:
        json_data += c

# Strip trailing comma
json_data = json_data.rstrip().rstrip(',')

# Wrap the data in braces to form a dictionary
json_data = '{' + json_data + '}'

# Convert from JSON to native Python
converted_data = json.loads(json_data)
print(converted_data['Matt']['Items']['Banana'])
