Pyparsing: Parsing semi-JSON nested plaintext data to a list - python

I have a bunch of nested data in a format that loosely resembles JSON:
company="My Company"
phone="555-5555"
people=
{
    person=
    {
        name="Bob"
        location="Seattle"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        location="Seattle"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
places=
{
...
}
There are many different parameters with varying levels of depth--this is just a very small subset.
It also might be worth noting that when a new sub-array is created that there is always an equals sign followed by a line break followed by the open bracket (as seen above).
Is there any simple looping or recursion technique for converting this data to a system-friendly data format such as arrays or JSON? I want to avoid hard-coding the names of properties. I am looking for something that will work in Python, Java, or PHP. Pseudo-code is fine, too.
I appreciate any help.
EDIT: I discovered the Pyparsing library for Python and it looks like it could be a big help. I can't find any examples for how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?
EDIT 2: Okay, here is a working solution in Pyparsing:
from pyparsing import (Word, alphas, alphanums, Suppress, Forward, Group,
                       Regex, quotedString, removeQuotes, Dict, ZeroOrMore,
                       OneOrMore)

def parse_file(fileName):
    #get the input text file
    file = open(fileName, "r")
    inputText = file.read()

    #define the elements of our data pattern
    name = Word(alphas, alphanums+"_")
    EQ,LBRACE,RBRACE = map(Suppress, "={}")
    value = Forward() #this tells pyparsing that values can be recursive
    entry = Group(name + EQ + value) #this is the basic name-value pair

    #define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    quotedString.setParseAction(removeQuotes)

    #declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE) #we will turn the output into a Dictionary

    #declare the types that might be contained in our data value - string, real, int, or the struct we declared
    value << (quotedString | struct | real | integer)

    #parse our input text and return it as a Dictionary
    result = Dict(OneOrMore(entry)).parseString(inputText)
    return result.dump()
This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. Also, there are \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.

Parsing an arbitrarily nested structure can be done with pyparsing by defining a placeholder to hold the nested part, using the Forward class. In this case, you are just parsing simple name-value pairs, where the value could itself be a nested structure containing name-value pairs.
name :: word of alphanumeric characters
entry :: name '=' value
struct :: '{' entry* '}'
value :: real | integer | quotedstring | struct
This translates to pyparsing almost verbatim. To define value, which can recursively contain values, we first create a Forward() placeholder, which can be used as part of the definition of entry. Then once we have defined all the possible types of values, we use the '<<' operator to insert this definition into the value expression:
EQ,LBRACE,RBRACE = map(Suppress,"={}")
name = Word(alphas, alphanums+"_")
value = Forward()
entry = Group(name + EQ + value)
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
quotedString.setParseAction(removeQuotes)
struct = Group(LBRACE + ZeroOrMore(entry) + RBRACE)
value << (quotedString | struct | real | integer)
The parse actions on real and integer will convert these elements from strings to float or ints at parse time, so that the values can be used as their actual types immediately after parsing (no need to post-process to do string-to-other-type conversion).
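As a quick illustration of my own (not part of the original answer), the converted token can be used as a number right away:
print(integer.parseString("42")[0] + 1)   # the token is already an int, so this prints 43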
Your sample is a collection of one or more entries, so we use that to parse the total input:
result = OneOrMore(entry).parseString(sample)
We can access the parsed data as a nested list, but it is not so pretty to display. This code uses pprint to pretty-print a formatted nested list:
from pprint import pprint
pprint(result.asList())
Giving:
[['company', 'My Company'],
['phone', '555-5555'],
['people',
[['person',
[['name', 'Bob'],
['location', 'Seattle'],
['settings', [['size', 1], ['color', 'red']]]]],
['person',
[['name', 'Joe'],
['location', 'Seattle'],
['settings', [['size', 2], ['color', 'blue']]]]]]]]
Notice that all the strings are just strings with no enclosing quotation marks, and the ints are actual ints.
We can do just a little better than this, by recognizing that the entry format actually defines a name-value pair suitable for accessing like a Python dict. Our parser can do this with just a few minor changes:
Change the struct definition to:
struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)
and the overall parser to:
result = Dict(OneOrMore(entry)).parseString(sample)
The Dict class treats the parsed contents as a name followed by a value, which can be done recursively. With these changes, we can now access the data in result like elements in a dict:
print result['phone']
or like attributes in an object:
print result.company
Use the dump() method to view the contents of a structure or substructure:
for person in result.people:
    print person.dump()
    print
prints:
['person', ['name', 'Bob'], ['location', 'Seattle'], ['settings', ['size', 1], ['color', 'red']]]
- location: Seattle
- name: Bob
- settings: [['size', 1], ['color', 'red']]
  - color: red
  - size: 1

['person', ['name', 'Joe'], ['location', 'Seattle'], ['settings', ['size', 2], ['color', 'blue']]]
- location: Seattle
- name: Joe
- settings: [['size', 2], ['color', 'blue']]
  - color: blue
  - size: 2
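On the follow-up in EDIT 2 about json.dump(result) wrapping the file contents in double quotes: dump() returns a formatted string, so serializing that string just produces one big quoted JSON string, \n characters included. A sketch of one way around it (my addition; with recent pyparsing versions asDict() converts nested results too, though it keeps only one of the repeated person keys, which the final solution further down addresses):
import json
result = Dict(OneOrMore(entry)).parseString(sample)
with open("output.json", "w") as out:          # output file name is an assumption
    json.dump(result.asDict(), out, indent=2)  # serialize the parsed structure, not the dump() string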

There is no "simple" way, but there are harder and not-so-hard ways. If you don't want to hardcode things, then at some point you're going to have to parse it as a structured format. That would involve parsing each line one-by-one, tokenizing it appropriately (for example, separating the key from the value correctly), and then determining how you want to deal with the line.
You may need to store your data in an intermediary format such as a (parse) tree in order to account for the arbitrary nesting relationships (represented by indents and braces), and then after you have finished parsing the data, take your resulting tree and then go through it again to get your arrays or JSON.
There are libraries available, such as ANTLR, that handle some of the manual work of figuring out how to write the parser.
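A rough sketch (my own code, not a specific library) of that line-by-line approach for the format in the question, relying on the stated convention that a sub-block is always a name= line followed by a line holding only the opening brace:
def parse_lines(lines):
    root = {}
    stack = [root]        # stack of dicts, one per open brace
    pending_key = None    # name seen on a "name=" line, waiting for "{"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "{":
            child = {}
            stack[-1][pending_key] = child   # note: a repeated key overwrites the first
            stack.append(child)
        elif line == "}":
            stack.pop()
        elif line.endswith("="):
            pending_key = line[:-1]
        else:
            key, _, value = line.partition("=")
            stack[-1][key] = value.strip('"')   # values kept as plain strings in this sketch
    return root

with open("data.txt") as f:   # file name is an assumption
    tree = parse_lines(f)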

Take a look at this code (pseudo_json is assumed to hold your input text):
import re
still_not_valid_json = re.sub (r'(\w+)=', r'"\1":', pseudo_json ) #1
this_one_is_tricky = re.compile ('("|\d)\n(?!\s+})', re.M)
that_one_is_tricky_too = re.compile ('(})\n(?=\s+\")', re.M)
nearly_valid_json = this_one_is_tricky.sub (r'\1,\n', still_not_valid_json) #2
nearly_valid_json = that_one_is_tricky_too.sub (r'\1,\n', nearly_valid_json) #3
valid_json = '{' + nearly_valid_json + '}' #4
You can convert your pseudo_json into parseable JSON via some substitutions:
1. Replace '=' with ':'
2. Add missing commas between a simple value (like 2 or "Joe") and the next field
3. Add missing commas between the closing brace of a complex value and the next field
4. Wrap the whole thing in braces
There is still an issue: in your example the 'people' dictionary contains two identical keys, 'person', so after parsing only one of them remains. This is what I got after parsing:
{u'phone': u'555-5555', u'company': u'My Company', u'people': {u'person': {u'settings': {u'color': u'blue', u'size': 2}, u'name': u'Joe', u'location': u'Seattle'}}}
If only you could replace the second occurrence of 'person=' with 'person1=' and so on...
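Alternatively, one way to keep the duplicate person keys (my own sketch, not part of the original answer) is to load the converted text with an object_pairs_hook that collects repeated keys into a list:
import json

def merge_duplicates(pairs):
    out = {}
    for key, value in pairs:
        if key in out:
            if not isinstance(out[key], list):
                out[key] = [out[key]]
            out[key].append(value)
        else:
            out[key] = value
    return out

parsed = json.loads(valid_json, object_pairs_hook=merge_duplicates)
# parsed["people"]["person"] is now a list holding both Bob and Joe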

Replace the '=' with ':' and add in the missing trailing commas, then just read it as JSON.

Okay, I came up with a final solution that actually transforms this data into a JSON-friendly dict, as I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists and then loops through the lists and transforms them into JSON. This lets me overcome the issue where Pyparsing's asDict() method was not able to handle cases where the same object has two properties with the same name. To determine whether a list is a plain list or a property/value pair, the prependPropertyToken method adds the string __property__ in front of property names when Pyparsing detects them.
# (assumes module-level imports: from pyparsing import *, and import json)
def parse_file(self, fileName):
    #get the input text file
    file = open(fileName, "r")
    inputText = file.read()

    #define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
    no = CaselessKeyword("no").setParseAction(replaceWith(False))
    quotedString.setParseAction(removeQuotes)
    unquotedString = Word(alphanums+"_-?\"")
    comment = Suppress("#") + Suppress(restOfLine)
    EQ,LBRACE,RBRACE = map(Suppress, "={}")
    data = (real | integer | yes | no | quotedString | unquotedString)

    #define structures
    value = Forward()
    object = Forward()
    dataList = Group(OneOrMore(data))
    simpleArray = (LBRACE + dataList + RBRACE)
    propertyName = Word(alphanums+"_-.").setParseAction(self.prependPropertyToken)
    property = dictOf(propertyName + EQ, value)
    properties = Dict(property)
    object << (LBRACE + properties + RBRACE)
    value << (data | object | simpleArray)
    dataset = properties.ignore(comment)

    #parse it
    result = dataset.parseString(inputText)

    #turn it into a JSON-like object
    dict = self.convert_to_dict(result.asList())
    return json.dumps(dict)

def convert_to_dict(self, inputList):
    dict = {}
    for item in inputList:
        #determine the key and value to be inserted into the dict
        dictval = None
        key = None
        if isinstance(item, list):
            try:
                key = item[0].replace("__property__","")
                if isinstance(item[1], list):
                    try:
                        if item[1][0].startswith("__property__"):
                            dictval = self.convert_to_dict(item)
                        else:
                            dictval = item[1]
                    except AttributeError:
                        dictval = item[1]
                else:
                    dictval = item[1]
            except IndexError:
                dictval = None
        #determine whether to insert the value into the key or to merge the value with existing values at this key
        if key:
            if key in dict:
                if isinstance(dict[key], list):
                    dict[key].append(dictval)
                else:
                    old = dict[key]
                    new = [old]
                    new.append(dictval)
                    dict[key] = new
            else:
                dict[key] = dictval
    return dict

def prependPropertyToken(self, t):
    return "__property__" + t[0]

Related

Using Regular Expressions to Parse Based on Unique Character Sequence

I'm hoping to get some Python assistance with parsing out a column in a DataFrame that has a unique character sequence.
Each record can have a variable number of parameter name/value pairings.
The only way to determine where each name/value pairing ends is by looking for an equals sign and then finding the nearest preceding comma. This gets a little tricky, as some of the values will contain commas, so using a comma to parse won't always yield clean results.
Example below:
String
NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506
DESCRIPTION=TEST,TEST,TEST
End Result:
String1: NAME=TEST
String2: TEST ID=1234
String3: ENTRY DESCR=Verify
String4: ENTRY CLASS=CCD
String5: TRACE NO=124313523,12414
String6: ENTRY DATE=210506
and, for the second record:
String1: DESCRIPTION=TEST,TEST,TEST
Thanks in advance for your help!
This can certainly be done with Regexs, but for a quick and dirty parser I would do it manually.
First a test suite:
import pytest
@pytest.mark.parametrize(
    "encoded,parsed",
    [
        ("X=Y", {"X": "Y"}),
        ("DESCRIPTION=TEST,TEST,TEST", {"DESCRIPTION": "TEST,TEST,TEST"}),
        (
            "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506",
            {
                "NAME": "TEST",
                "TEST ID": "1234",
                "ENTRY DESCR": "Verify",
                "ENTRY CLASS": "CCD",
                "TRACE NO": "124313523,12414",
                "ENTRY DATE": "210506",
            },
        ),
    ],
)
def test_parser(encoded, parsed):
    assert parser(encoded) == parsed
You'll need to pip install pytest if you don't already have it.
Then a parser:
def parser(encoded: str) -> dict[str, str]:
    parsed = {}
    val = []
    for token in reversed(encoded.split("=")):
        if val:
            *vals, token = token.split(",")
            parsed[token] = ",".join(val)
            val = vals
        else:
            val = token.split(",")
    return parsed
This is not a 'proper' parser (i.e. the traditional tokenize, lex, parse pipeline), but it handles this format. It works as follows:
step backwards through all the something=val pairs.
split the val (this is strictly pointless, but see below)
split something at the last comma (using a * expression to collect all the other components)
add a new entry into the parsed dict, joining the val back up again with commas
Note that this would work just as well with val = [token]. But you probably don't want a parser which returns a format which in turn needs parsing. You probably want it to turn comma-separated values into a list of appropriate types. Currently you have three types: strs, ints and a date. Thus ",".join(val) could profitably be replaced with [convert(x) for x in val]. convert might look something like this:
from datetime import date, datetime
from typing import Union

def convert(x: str) -> Union[date, int, str]:
    for candidate in (
        lambda x: datetime.strptime(x, "%y%m%d").date(),
        lambda x: int(x),
        lambda x: x,
    ):
        try:
            return candidate(x)
        except ValueError:
            pass
This would then be used by doing something like this in the parser:
converted = [convert(x) for x in val]
if len(converted) == 1:
converted = converted[0]
parsed[token] = converted
However, this conversion function has a problem: it falsely identifies one of the plain numbers as a date. How exactly to fix this depends on the input data. Perhaps the date parsing function can be context-agnostic, and just check for a 6-digit input before parsing (or manually split the str and pass the pieces to datetime.date). Perhaps the decision needs to be made in the parser, based on whether the word "DATE" is in the key.
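A sketch of the first idea, assuming (my assumption about the data) that dates are always exactly six digits in %y%m%d form:
from datetime import date, datetime
from typing import Union

def convert(x: str) -> Union[date, int, str]:
    if len(x) == 6 and x.isdigit():   # a 6-digit plain number would still be read as a date if it is a valid %y%m%d
        try:
            return datetime.strptime(x, "%y%m%d").date()
        except ValueError:
            pass
    try:
        return int(x)
    except ValueError:
        return x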
If you really want to use regexs, have a look at negative lookaheads.
You can do this:
def parse(s: str) -> dict:
    # Split by "=" and by ","
    raw = [x.split(",") for x in s.split("=")]
    # Keys are the last element of each row, besides the last
    keys = [k[-1] for k in raw[:-1]]
    # Values are all the elements before the last, shifted by one
    values = [",".join(k[:-1]) for k in raw[1:-1]] + [",".join(raw[-1])]
    return dict(zip(keys, values))
If we try:
s1 = "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506"
s2 = "DESCRIPTION=TEST,TEST,TEST"
print(parse(s1))
print(parse(s2))
We get:
>>> {'NAME': 'TEST',
'TEST ID': '1234',
'ENTRY DESCR': 'Verify',
'ENTRY CLASS': 'CCD',
'TRACE NO': '124313523,12414',
'ENTRY DATE': '210506'}
>>> {'DESCRIPTION': 'TEST,TEST,TEST'}
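Since the strings live in a DataFrame column, the same function can be applied row by row; the column name here is my assumption:
import pandas as pd

df = pd.DataFrame({"String": [s1, s2]})
df["parsed"] = df["String"].apply(parse)
print(df["parsed"][0]["TRACE NO"])   # 124313523,12414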
Thanks for the suggestions, everyone! I wound up figuring out a way using RegEx and did it in two lines of code.
s1 = "NAME=TEST,TEST ID=1234,ENTRY DESCR=Verify,ENTRY CLASS=CCD,TRACE NO=124313523,12414,ENTRY DATE=210506"
regex=re.compile(',(?=[^,]+=)')
regex.split(s1)
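For reference (my own run of the pattern, not from the original post), regex.split(s1) yields:
['NAME=TEST', 'TEST ID=1234', 'ENTRY DESCR=Verify', 'ENTRY CLASS=CCD', 'TRACE NO=124313523,12414', 'ENTRY DATE=210506']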

Fastest way to tokenize signal?

I need to find the fastest way to tokenize a signal. The signal is of the form:
identifier:value identifier:value identifier:value ...
identifier only consists of alphanumerics and underscores. identifier is separated from previous value by a space. Value may contain alphanumerics, various braces/brackets and spaces.
e.g.
signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 module_id:0x0001 module_sub_id:0x0016 timestamp:0xcc557366 debug_words:[0x0006 0x0006 0x0000 0x0000 0x0000 0x0000 0xcc55 0x70a9 0x4c55 0x7364 0x0000 0x0000] sequence_number:0x0174
The best I've come up with is below. Ideally I'd like to halve the time it takes. I've tried various things with regexes but they're no better. Any suggestions?
# Convert data to dictionary. Expect data to be something like
# parameter_1:a b c d parameter_2:false parameter_3:0xabcd parameter_4:-56
# Split at colons. First part will be just parameter name, last will be just value
# everything in between will be <parameter name><space><value>
parts1 = data.split(":")
parts2 = []
for part in parts1:
    # Copy first and last 'as is'
    if part in (parts1[0], parts1[-1]):
        parts2.append(part)
    # Split everything in between at last space (don't expect parameter names to contain spaces)
    else:
        parts2.extend(part.rsplit(' ', 1))
# Expect to now have [parameter name, value, parameter name, value, ...]. Convert to a dict
self.data_dict = {}
for i in range(0, len(parts2), 2):
    self.data_dict[parts2[i]] = parts2[i + 1]
I have optimized your solution a little:
1) Removed the check from the loop.
2) Changed the dictionary creation code to build the pairs from a single list.
parts1 = data.split(":")
parts2 = []
parts2.append(parts1.pop(0))
for part in parts1[0:-1]:
    parts2.extend(part.rsplit(' ', 1))
parts2.append(parts1.pop())
data_dict = {k: v for k, v in zip(parts2[::2], parts2[1::2])}
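To check whether the change actually helps, a rough timing harness (my own sketch; the shortened sample string and iteration count are assumptions):
import timeit

data = ("signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 "
        "module_id:0x0001 timestamp:0xcc557366 sequence_number:0x0174")

def tokenize(data):
    parts1 = data.split(":")
    parts2 = [parts1.pop(0)]
    for part in parts1[0:-1]:
        parts2.extend(part.rsplit(' ', 1))
    parts2.append(parts1.pop())
    return {k: v for k, v in zip(parts2[::2], parts2[1::2])}

print(timeit.timeit(lambda: tokenize(data), number=100000))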

How to parse a string that looks like JSON with lots of embedded classes in python?

I have a string that lists the properties of a request event.
My string looks like:
requestBody: {
    propertyA = 1
    propertyB = 2
    propertyC = {
        propertyC1 = 1
        propertyC2 = 2
    }
    propertyD = [
        { propertyD1 = { propertyD11 = 1}},
        { propertyD1 = [ {propertyD21 = 1, propertyD22 = 2},
                         {propertyD21 = 3, propertyD22 = 4}]}
    ]
}
I have tried to replace the "=" with ":" so that I can put it into a JSON reader in python, but JSON also requires that keys and string values be wrapped in double quotes and that a "," separate each KV pair. This then became a little complicated to implement. What are some better approaches to parsing this into a python dictionary with exactly the same structure (e.g. embedded dictionaries are also preserved)?
Question:
If I were to write a full parser, what's the main pattern that I should tackle? Storing parenthesis in a stack until the parenthesis completes?
This is a nice case for using pyparsing, especially since it adds the issue of recursive structuring.
The short answer is the following parser (processes everything after the leading requestBody :):
LBRACE,RBRACE,LBRACK,RBRACK,EQ = map(Suppress, "{}[]=")
NL = LineEnd().setName("NL")
# define special delimiter for lists and objects, since they can be
# comma-separated or just newline-separated
list_delim = NL | ','
list_delim.leaveWhitespace()
# use a parse action to convert numeric values to ints or floats at parse time
def convert_number(t):
    try:
        return int(t[0])
    except ValueError:
        return float(t[0])
number = Word(nums, nums+'.').addParseAction(convert_number)
qs = quotedString
# forward-declare value, since it will be defined recursively
obj_value = Forward()
ident = Word(alphas, alphanums+'_')
obj_property = Group(ident + EQ + obj_value)
# use Dict wrapper to auto-define nested properties as key-values
obj = Group(LBRACE + Dict(Optional(delimitedList(obj_property, delim=list_delim))) + RBRACE)
obj_array = Group(LBRACK + Optional(delimitedList(obj, delim=list_delim)) + RBRACK)
# now assign to previously-declared obj_value, using '<<=' operator
obj_value <<= obj_array | obj | number | qs
# parse the data
res = obj.parseString(sample)[0]
# convert the result to a dict
import pprint
pprint.pprint(res.asDict())
gives
{'propertyA': 1,
'propertyB': 2,
'propertyC': {'propertyC1': 1, 'propertyC2': 2},
'propertyD': {'propertyD1': {'propertyD11': 1},
'propertyD2': {'propertyD21': 3, 'propertyD22': 4}}}

Parsing named nested expressions with pyparsing

I'm trying to parse some data using pyparsing that looks (more or less) like this:
User.Name = Dave
User.Age = 42
Date = 2015/01/01
Begin Component List
    Begin Component 2
        1 some data = a value
        2 another key = 999
    End Component 2
    Begin Another Component
        a.key = 42
    End Another Component
End Component List
Begin MoreData
    Another = KeyPair
End MoreData
I've found some similar examples, but I've not done very well for myself.
parsing file with curley brakets
Parse line data until keyword with pyparsing
Here's what I have so far, but I keep hitting an error similar to: pyparsing.ParseException: Expected "End" (at char 26), (line:5, col:1)
from pyparsing import *
data = '''Begin A
hello
world
End A
'''
opener = Literal('Begin') + Word(alphas)
closer = Literal('End') + Word(alphas)
content = Combine(OneOrMore(~opener
                            + ~closer
                            + CharsNotIn('\n', exact=1)))
expr = nestedExpr(opener=opener, closer=closer, content=content)
parser = expr
res = parser.parseString(data)
print(res)
It's important that the words after "Begin" are captured, as these are the names of the dictionaries, as well as the key-value pairs. Where there is a number after the opener, e.g. "Begin Component 2", the "2" is the number of pairs, which I don't need (presumably this is used by the original software?). Similarly, I don't need the numbers in the list (the "1" and "2").
Is nestedExpr the correct approach to this?

How to get a dictionary key value from a string that contains a dictionary?

I have a string that contains a dictionary:
data = 'IN.Tags.Share.handleCount({"count":17737,"fCnt":"17K","fCntPlusOne":"17K","url":"www.test.com\\/"});'
How can I get the value of the dictionary element count? (In my case 17737.)
P.S. Maybe I need to delete IN.Tags.Share.handleCount from the string before getting the dictionary, e.g. with
k = data.replace("IN.Tags.Share.handleCount", ""), but the problem is that the '()' remains after the delete.
Thanks
import re, ast
data = 'IN.Tags.Share.handleCount({"count":17737,"fCnt":"17K","fCntPlusOne":"17K","url":"www.test.com\/"});'
m = re.match('.*({.*})', data)
d = ast.literal_eval(m.group(1))
print d['count']
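Since the text between the parentheses is already valid JSON (the keys are double-quoted), json.loads works just as well; a small sketch of that alternative:
import json, re

data = 'IN.Tags.Share.handleCount({"count":17737,"fCnt":"17K","fCntPlusOne":"17K","url":"www.test.com\\/"});'
payload = re.search(r'\((.*)\)', data).group(1)
print(json.loads(payload)["count"])   # 17737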
