Stop PyYAML from converting a YAML item to a list - python

I am trying to load the following yaml:
yaml_string = """
key:
- [HELLO]
- another string
- another
"""
yaml.safe_load(yaml_string)  # returns {'key': [['HELLO'], 'another string', 'another']}
and the result is a list containing the HELLO string. I want to load it as a string value instead, so that:
type(yaml.safe_load(yaml_string).get("key")[0])
<class 'str'>
Since the YAML describes commands that are formatted this way, they need to be read as strings, not as sequences. Basically I want to be able to read strings that start and end with brackets. As explained in the comment underneath, unfortunately it is not possible to add quotes, since the YAML files were created by a Java app using Jackson, which had no problem turning the YAML into an object and treating entries that start and end with brackets as strings. The files are too many for users to start adding quotes.
Is this possible?
EDIT: Added a more complete example

Surround [HELLO] with quotes:
import yaml
yaml_string = """
key:
- "[HELLO]"
- another string
- another
"""
print(yaml.safe_load(yaml_string))
outputs
{'key': ['[HELLO]', 'another string', 'another']}

If you want the string "HELLO" as a result, then remove the [...] around it in the YAML:
yaml_string = """
key:
- HELLO
"""
print(yaml.safe_load(yaml_string))
# {'key': ['HELLO']}
If you want the string "[HELLO]" (instead of a list containing the string "HELLO"), then add quotes in the YAML:
yaml_string = """
key:
- "[HELLO]"
"""
print(yaml.safe_load(yaml_string))
# {'key': ['[HELLO]']}

The [] syntax is part of the YAML syntax. If you created this with a program and those are supposed to be strings, the program you used did not implement YAML correctly, as the strings would have to be quoted.
You can try out the following experimental Perl script to add quotes around [...]. This relies on the assumption that your documents do not use flow style sequences that should be real sequences. Also it might not work for all cases.
It will definitely not work if the string only has an opening [ but not a closing one.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use YAML::LibYAML::API::XS;

my $yaml = <<'EOM';
key:
- [HELLO]
- another string
- another
- [HELLO2]
EOM

my @lines = split /(?<=\n)/, $yaml;
my @events;
YAML::LibYAML::API::XS::parse_string_events($yaml, \@events);
while (my $event = shift @events) {
    if ($event->{name} eq 'sequence_start_event' and $event->{style} == 2) {
        my $start = $event->{start};
        my $line = $start->{line};
        my $column = $start->{column};
        # Add single quote before [
        substr($lines[ $line ], $column, 0) = "'";
        # find the next matching `]`
        while (my $event = shift @events) {
            if ($event->{name} eq 'sequence_end_event') {
                my $end = $event->{end};
                my $line = $end->{line};
                # Add single quote after ]
                # add 1 because we modified the line already and added one char
                my $column = $end->{column} + 1;
                substr($lines[ $line ], $column, 0) = "'";
                last;
            }
        }
    }
}
$yaml = join '', @lines;
say $yaml;
You might be able to do the same with Python if you have an interface to the libyaml API; a sketch follows after the output below.
Output:
key:
- '[HELLO]'
- another string
- another
- '[HELLO2]'
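PyYAML itself exposes the same event API as yaml.parse, and each event's start_mark/end_mark carries line and column attributes, so a rough Python equivalent needs no separate libyaml binding. This is only a sketch, under the same assumptions as the Perl script above (each flow sequence sits on a single line, every [ has a matching ], and flow sequences are not nested):
import yaml

yaml_string = """\
key:
- [HELLO]
- another string
- another
- [HELLO2]
"""

lines = yaml_string.splitlines(keepends=True)
events = list(yaml.parse(yaml_string))

i = 0
while i < len(events):
    event = events[i]
    # flow_style is True for [...] sequences, False for block sequences
    if isinstance(event, yaml.SequenceStartEvent) and event.flow_style:
        start = event.start_mark
        # add a single quote before the opening [
        line = lines[start.line]
        lines[start.line] = line[:start.column] + "'" + line[start.column:]
        # skip ahead to the matching sequence end
        while not isinstance(events[i], yaml.SequenceEndEvent):
            i += 1
        end = events[i].end_mark
        # end_mark points just past the ]; add 1 because we already
        # inserted one character into this line
        col = end.column + 1
        line = lines[end.line]
        lines[end.line] = line[:col] + "'" + line[col:]
    i += 1

print("".join(lines), end="")
It produces the same output as the Perl version.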

Related

Parsing Regular expression from YAML file adds extra \

I have a bunch of regular expressions I am using to scrape a lot of specific fields from a text document. They all work fine when used directly inside the Python script.
But I thought of putting them in a YAML file and reading from there. Here's how it looks:
# Document file for Regular expression patterns for a company invoice
---
issuer: ABCCorp
fields:
  invoice_number: INVOICE\s*(\S+)
  invoice_date: INVOICE DATE\s*(\S+)
  cusotmer_id: CUSTOMER ID\s*(\S+)
  origin: ORIGIN\s*(.*)ETD
  destination: DESTINATION\s*(.*)ETA
  sub_total: SUBTOTAL\s*(\S+)
  add_gst: SUBTOTAL\s*(\S+)
  total_cost: TOTAL USD\s*(\S+)
  description_breakdown: (?s)(DESCRIPTION\s*GST IN USD\s*.+?TOTAL CHARGES)
  package_details_fields: (?s)(WEIGHT\s*VOLUME\s*.+?FLIGHT|ROAD REFERENCE)
  mawb_hawb: (?s)((FLIGHT|ROAD REFERENCE).*(MAWB|MASTER BILL)\s*.+?GOODS COLLECTED FROM)
When I retrieve it using PyYAML in Python, it adds a string quote around each pattern (which is OK, as I can add r'' later), but I see it is also adding an extra \ in between the regex. That would make the regex go wrong when used in code now:
import os
import yaml

with open(os.path.join(TEMPLATES_DIR, "regex_template.yml")) as f:
    my_dict = yaml.safe_load(f)
print(my_dict)
{'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)', 'cusotmer_id': 'CUSTOMER ID\\s*(\\S+)', 'origin': 'ORIGIN\\s*(.*)ETD', 'destination': 'DESTINATION\\s*(.*)ETA', 'sub_total': 'SUBTOTAL\\s*(\\S+)', 'add_gst': 'SUBTOTAL\\s*(\\S+)', 'total_cost': 'TOTAL USD\\s*(\\S+)', 'description_breakdown': '(?s)(DESCRIPTION\\s*GST IN USD\\s*.+?TOTAL CHARGES)', 'package_details_fields': '(?s)(WEIGHT\\s*VOLUME\\s*.+?FLIGHT|ROAD REFERENCE)', 'mawb_hawb'
How do I read back exactly the regex I have in the YAML file? Does every string written in a YAML file get a quotation mark around it when read in Python, because it is a string?
EDIT:
The main regex in the YAML file is:
INVOICE\s*(\S+)
Output in dict is:
'INVOICE\\s*(\\S+)'
This is too long to do as a comment.
The backslash character is used to escape special characters. For example:
'\n': newline
'\a': alarm
When you use it before a letter that has no special meaning it is just taken to be a backslash character:
'\s': backslash followed by 's'
But to be sure, whenever you want to enter a backslash character in a string and not have it interpreted as the start of an escape sequence, you double it up:
'\\s': also a backslash followed by a 's'
'\\a': a backslash followed by a 'a'
If you use a r'' type literal, then a backslash is never interpreted as the start of an escape sequence:
r'\a': a backslash followed by 'a' (not an alarm character)
r'\n': a backslash followed by 'n' (not a newline -- however, when used in a regex it will match a newline)
Now here is the punchline:
When you print out these Python objects, such as:
d = {'x': 'ab\sd'}
print(d)
Python will print the string representation of the dictionary, and the string will print as 'ab\\sd'. If you just did:
print('ab\sd')
You would see ab\sd. Quite a difference.
Why the difference? See if this makes sense:
d = {'x': 'ab\ncd'}
print(d)
print('ab\ncd')
Results:
{'x': 'ab\ncd'}
ab
cd
The bottom line is that when you print a Python object other than a string, it prints a representation of the object showing how you would have created it. And if the object contains a string and that string contains a backslash, you would have doubled up on that backslash when entering it.
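A quick way to convince yourself that the loaded string contains single backslashes (a small check, not from the original answer, using one pattern from the question):
import yaml

doc = r"invoice_number: INVOICE\s*(\S+)"
loaded = yaml.safe_load(doc)

print(loaded)                    # {'invoice_number': 'INVOICE\\s*(\\S+)'}  <- the repr doubles the backslashes
print(loaded['invoice_number'])  # INVOICE\s*(\S+)                          <- the actual string is unchanged
assert loaded['invoice_number'] == r'INVOICE\s*(\S+)'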
Update
To process your my_dict: Since you did not provide the complete value of my_dict, I can only use a truncated version for demo purposes. But this will demonstrate that my_dict has perfectly good regular expressions:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
fields = my_dict['fields']
invoice_number_re = fields['invoice_number']
m = re.search(invoice_number_re, 'blah-blah INVOICE 12345 blah-blah')
print(m[1])
Prints:
12345
If you are going to be using the same regular expressions over and over again, then it is best to compile them:
import re

my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
# compile the strings to regular expressions
fields = my_dict['fields']
for k, v in fields.items():
    fields[k] = re.compile(v)
invoice_number_re = fields['invoice_number']
m = invoice_number_re.search('blah-blah INVOICE 12345 blah-blah')
print(m[1])
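Putting the two together, loading the YAML file from the question and compiling every field in one pass could look like this (a sketch; regex_template.yml is the filename from the question, and the TEMPLATES_DIR path handling is omitted):
import re
import yaml

with open('regex_template.yml') as f:
    my_dict = yaml.safe_load(f)

# compile all field patterns up front
patterns = {name: re.compile(pat) for name, pat in my_dict['fields'].items()}

m = patterns['invoice_number'].search('blah-blah INVOICE 12345 blah-blah')
print(m.group(1))  # 12345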

Extract email addresses from academic curly braces format

I have a file where each line contains a string that represents one or more email addresses.
Multiple addresses can be grouped inside curly braces as follows:
{name.surname, name2.surname2}@something.edu
Which means both addresses name.surname@something.edu and name2.surname2@something.edu are valid (this format is commonly used in scientific papers).
Moreover, a single line can also contain curly brackets multiple times. Example:
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
results in:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
Any suggestion on how I can parse this format to extract all email addresses? I'm trying with regexes but I'm currently struggling.
Pyparsing is a PEG parser that gives you an embedded DSL to build up parsers that can read through expressions like this, with resulting code that is more readable (and maintainable) than regular expressions, and flexible enough to add afterthoughts (wait, some parts of the email can be in quotes?).
pyparsing uses '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how this all rolls together below:
import pyparsing as pp

LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')

# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE
                    + pp.Group(pp.delimitedList(email_part))('names')
                    + RBRACE
                    + '@'
                    + email_part('trailing'))

# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
    return ["{}@{}".format(name, t.trailing) for name in t.names]
compressed_email.addParseAction(expand_compressed_email)

# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '@' + email_part)

# the complete list parser looks for a comma-delimited list of compressed
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)
pyparsing parsers come with a runTests method to test your parser against various test strings:
tests = """\
# original test string
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
# a tricky email containing a quoted string
{x.y, z.k}@edu.com, "{a, b}"@domain.com
# just a plain email
plain_old_bob@uni.elsewhere
# mixed list of plain and compressed emails
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
"""
email_list_parser.runTests(tests)
Prints:
# original test string
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com']
# a tricky email containing a quoted string
{x.y, z.k}@edu.com, "{a, b}"@domain.com
['x.y@edu.com', 'z.k@edu.com', '"{a, b}"@domain.com']
# just a plain email
plain_old_bob@uni.elsewhere
['plain_old_bob@uni.elsewhere']
# mixed list of plain and compressed emails
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com', 'plain_old_bob@uni.elsewhere']
DISCLOSURE: I am the author of pyparsing.
Note
I'm more familiar with JavaScript than Python, and the basic logic is the same regardless (the difference is syntax), so I've written my solutions here in JavaScript. Feel free to translate them to Python.
The Issue
This question is a bit more involved than a simple one-line script or regular expression, but depending on the specific requirements you may be able to get away with something rudimentary.
For starters, parsing an e-mail is not trivially boiled down to a single regular expression. This website has several examples of regular expressions that will match "many" e-mails, but explains the trade-offs (complexity versus accuracy) and goes on to include the RFC 5322 standard regular expression that should theoretically match any e-mail, followed by a paragraph on why you shouldn't use it. However, even that regular expression assumes that a domain name taking the form of an IP address can only consist of a tuple of four integers ranging from 0 to 255; it doesn't allow for IPv6.
Even something as simple as:
{a, b}@domain.com
Could get tripped up because technically, according to the e-mail address specification, an e-mail address can contain ANY ASCII characters when they are surrounded by quotes. The following is a valid (single) e-mail address:
"{a, b}"@domain.com
To accurately parse an e-mail would require that you read the characters one letter at a time and build a finite state machine to track whether you are within a double-quote, within a curly brace, before the @, after the @, parsing a domain name, parsing an IP, etc. In this way you could tokenize the address, locate your curly brace token, and parse it independently.
Something Rudimentary
Regular expressions are not the way to go for 100% accuracy and support for all e-mails, *especially* if you want to support more than one e-mail on a single line. But we'll start with them and try to build from there.
You've probably tried a regular expression like:
/\{(([^,]+),?)+\}@(\w+\.)+[A-Za-z]+/
Match a single curly brace...
Followed by one or more instances of:
One or more non-comma characters...
Followed by zero or one commas
Followed by a single closing curly brace...
Followed by a single @
Followed by one or more instances of:
One or more "word" characters...
Followed by a single .
Followed by one or more alpha characters
This should match something roughly of the form:
{one, two}@domain1.domain2.toplevel
This handles validating, next is the issue of extracting all valid e-mails. Note that we have two sets of parenthesis in the name portion of the e-mail address that are nested: (([^,]+),?). This causes a problem for us. Many regular expression engines don't know how to return matches in this case. Consider what happens when I run this in JavaScript using my Chrome developer console:
var regex = /\{(([^,]+),?)+\}@(\w+\.)+[A-Za-z]+/
var matches = "{one, two}@domain.com".match(regex)
Array(4) [ "{one, two}@domain.com", " two", " two", "domain." ]
Well that wasn't right. It found two twice, but didn't find one once! To fix this, we need to eliminate the nesting and do this in two steps.
var regexOne = /\{([^}]+)\}@(\w+\.)+[A-Za-z]+/
var matches = "{one, two}@domain.com".match(regexOne)
Array(3) [ "{one, two}@domain.com", "one, two", "domain." ]
Now we can use the match and parse that separately:
// Note: It's important that this be a global regex (the /g modifier) since we expect the pattern to match multiple times
var regexTwo = /([^,]+,?)/g
var nameMatches = matches[1].match(regexTwo)
Array(2) [ "one,", " two" ]
Now we can trim these and get our names:
nameMatches = nameMatches.map(name => name.replace(/[,\s]/g, ""))
nameMatches
Array(2) [ "one", "two" ]
For constructing the "domain" part of the e-mail, we'll need similar logic for everything after the @, since it has the same potential for repeats that the name part had. Our final code (in JavaScript) may look something like this (you'll have to convert to Python yourself):
function getEmails(input)
{
    var emailRegex = /([^@]+)@(.+)/;
    var emailParts = input.match(emailRegex);
    var name = emailParts[1];
    var domain = emailParts[2];
    var nameList;
    if (/\{.+\}/.test(name))
    {
        // The name takes the form "{...}"
        var nameRegex = /([^,]+,?)/g;
        var nameParts = name.match(nameRegex);
        nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
    }
    else
    {
        // The name is not surrounded by curly braces
        nameList = [name];
    }
    return nameList.map(name => `${name}@${domain}`);
}
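A quick sanity check of the function (output traced by hand, so treat it as illustrative):
getEmails("{one, two}@domain.com")
// Array(2) [ "one@domain.com", "two@domain.com" ]
getEmails("plain@domain.com")
// Array(1) [ "plain@domain.com" ]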
Multi-email Lines
This is where things start to get tricky, and we need to accept a little less accuracy if we don't want to build a full-on lexer / tokenizer. Because our e-mails contain commas (within the name field) we can't accurately split on commas unless we know whether a given comma is inside curly braces. With my knowledge of regular expressions, I don't know if this can be easily done. It may be possible with lookahead or lookbehind operators, but someone else will have to fill me in on that.
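For what it's worth, a negative lookahead gets close on simple inputs: split on commas that are not followed by a closing brace before any opening brace. This is a sketch only; it assumes well-formed braces and no braces inside quoted names:
var line = "{a, b}@c.com, d@e.com, {f, g}@h.com";
// a comma is a real separator only if no "}" closes over it,
// i.e. the text after it reaches "{" or end-of-string before any "}"
var parts = line.split(/,(?![^{]*\})/);
console.log(parts);
// [ "{a, b}@c.com", " d@e.com", " {f, g}@h.com" ]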
What can be easily done with regular expressions, however, is finding a block of text containing a post-at-sign comma. Something like: @[^@{]+?,
In the string a@b.com, c@d.com this would match the entire phrase @b.com, - but the important thing is that it gives us a place to split our string. The tricky bit is then finding out how to split your string here. Something along the lines of this will work most of the time:
var emails = "a@b.com, c@d.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(2) [ "a", " c@d.com" ]
split[0] = split[0] + matches[0] // Add back in what we split on
This has a potential bug should you have two e-mails in the list with the same domain:
var emails = "a@b.com, c@b.com, d@e.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(3) [ "a", " c", " d@e.com" ]
split[0] = split[0] + matches[0]
console.log(split) // Array(3) [ "a@b.com", " c", " d@e.com" ]
But again, without building a lexer / tokenizer we're accepting that our solution will only work for most cases and not all.
However since the task of splitting one line into multiple e-mails is easier than diving into the e-mail, extracting a name, and parsing the name: we may be able to write a really stupid lexer for just this part:
var inBrackets = false
var emails = "{a, b}@c.com, d@e.com"
var split = []
var lastSplit = 0
for (var i = 0; i < emails.length; i++)
{
    if (inBrackets && emails[i] === "}")
        inBrackets = false;
    if (!inBrackets && emails[i] === "{")
        inBrackets = true;
    if (!inBrackets && emails[i] === ",")
    {
        split.push(emails.substring(lastSplit, i))
        lastSplit = i + 1 // Skip the comma
    }
}
split.push(emails.substring(lastSplit))
console.log(split)
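For the sample string above, this logs (traced by hand):
// Array(2) [ "{a, b}@c.com", " d@e.com" ]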
Once again, this won't be a perfect solution because an e-mail address may exist like the following:
","@domain.com
But, for 99% of use cases, this simple lexer will suffice and we can now build a "usually works but not perfect" solution like the following:
function getEmails(input)
{
    var emailRegex = /([^@]+)@(.+)/;
    var emailParts = input.match(emailRegex);
    var name = emailParts[1];
    var domain = emailParts[2];
    var nameList;
    if (/\{.+\}/.test(name))
    {
        // The name takes the form "{...}"
        var nameRegex = /([^,]+,?)/g;
        var nameParts = name.match(nameRegex);
        nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
    }
    else
    {
        // The name is not surrounded by curly braces
        nameList = [name];
    }
    return nameList.map(name => `${name}@${domain}`);
}

function splitLine(line)
{
    var inBrackets = false;
    var split = [];
    var lastSplit = 0;
    for (var i = 0; i < line.length; i++)
    {
        if (inBrackets && line[i] === "}")
            inBrackets = false;
        if (!inBrackets && line[i] === "{")
            inBrackets = true;
        if (!inBrackets && line[i] === ",")
        {
            split.push(line.substring(lastSplit, i));
            lastSplit = i + 1;
        }
    }
    split.push(line.substring(lastSplit));
    return split;
}

var line = "{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com";
var emails = splitLine(line);
var finalList = [];
for (var i = 0; i < emails.length; i++)
{
    finalList = finalList.concat(getEmails(emails[i]));
}
console.log(finalList);
// Outputs: [ "a.b@uni.somewhere", "c.d@uni.somewhere", "e.f@uni.somewhere", "x.y@edu.com", "z.k@edu.com" ]
If you want to try and implement the full lexer / tokenizer solution, you can look at the simple / dumb lexer I built as a starting point. The general idea is that you have a state machine (in my case I only had two states: inBrackets and !inBrackets) and you read one letter at a time but interpret it differently based on your current state.
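The note at the top invites translating to Python; here is a rough port of splitLine and getEmails, with the same limitations (a sketch, not part of the original answer):
import re

def get_emails(part):
    # split "name@domain"; names may be "{a, b}" style
    name, domain = re.match(r"([^@]+)@(.+)", part).groups()
    if re.search(r"\{.+\}", name):
        names = [re.sub(r"[{},\s]", "", n) for n in name.split(",")]
    else:
        names = [name.strip()]
    return ["{}@{}".format(n, domain) for n in names]

def split_line(line):
    # split on commas that are not inside curly braces
    in_brackets = False
    parts, last_split = [], 0
    for i, ch in enumerate(line):
        if in_brackets and ch == "}":
            in_brackets = False
        if not in_brackets and ch == "{":
            in_brackets = True
        if not in_brackets and ch == ",":
            parts.append(line[last_split:i])
            last_split = i + 1
    parts.append(line[last_split:])
    return parts

line = "{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com"
result = [email for part in split_line(line) for email in get_emails(part)]
print(result)
# ['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com']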
A quick solution using re:
test with one text line:
import re

line = '{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, {z.z, z.a}@edu.com'
com = re.findall(r'(@[^,\n]+),?', line)   # trap @xx.yyy
adrs = re.findall(r'{([^}]+)}', line)     # trap all inside { }
result = []
for i in range(len(adrs)):
    s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
    result = result + s.split(',')
for r in result:
    print(r)
output in list result:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
z.z@edu.com
z.a@edu.com
test with a text file:
import io
import re

data = io.StringIO(u'''\
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, {z.z, z.a}@edu.com
{a.b, c.d, e.f}@uni.anywhere
{x.y, z.k}@adi.com, {z.z, z.a}@du.com
''')
result = []
for line in data:
    com = re.findall(r'(@[^,\n]+),?', line)
    adrs = re.findall(r'{([^}]+)}', line)
    for i in range(len(adrs)):
        s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
        result = result + s.split(',')
for r in result:
    print(r)
output in list result:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
z.z@edu.com
z.a@edu.com
a.b@uni.anywhere
c.d@uni.anywhere
e.f@uni.anywhere
x.y@adi.com
z.k@adi.com
z.z@du.com
z.a@du.com

decode Google translate json response in python [duplicate]

I would like to parse JSON-like strings. Their only difference from normal JSON is the presence of contiguous commas in arrays. When there are two such commas, it implicitly means that null should be inserted in between. Example:
JSON-like: ["foo",,,"bar",[1,,3,4]]
Javascript: ["foo",null,null,"bar",[1,null,3,4]]
Decoded (Python): ["foo", None, None, "bar", [1, None, 3, 4]]
The native json.JSONDecoder class doesn't allow me to change the behavior of the array parsing. I can only modify the parser for objects (dicts), ints, floats and strings (by giving kwargs functions to JSONDecoder(); please see the doc).
So, does it mean I have to write a JSON parser from scratch? The Python code of json is available, but it's quite a mess. I would prefer to use its internals instead of duplicating its code!
Since what you're trying to parse isn't JSON per se, but rather a different language that's very much like JSON, you may need your own parser.
Fortunately, this isn't as hard as it sounds. You can use a Python parser generator like pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to fit your needs.
A small & simple workaround to try out:
1. Convert JSON-like data to strings.
2. Replace ",," with ",null,".
3. Convert it back to whatever your representation is.
Let JSONDecoder() do the heavy lifting.
Steps 1. & 3. can be omitted if you already deal with strings.
(And if converting to string is impractical, update your question with this info!)
You can do the comma replacement of Lattyware's/przemo_li's answers in one pass by using a lookbehind expression, i.e. "replace all commas that are preceded by just a comma":
>>> import re
>>> s = '["foo",,,"bar",[1,,3,4]]'
>>> re.sub(r'(?<=,)\s*,', ' null,', s)
'["foo", null, null,"bar",[1, null,3,4]]'
Note that this will work for small things where you can assume there aren't consecutive commas in string literals, for example. In general, regular expressions aren't enough to handle this problem, and Taymon's approach of using a real parser is the only fully correct solution.
It's a hackish way of doing it, but one solution is to simply do some string modification on the JSON-ish data to get it in line before parsing it.
import re
import json

not_quite_json = '["foo",,,"bar",[1,,3,4]]'
not_json = True
while not_json:
    not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)
Which leaves us with:
'["foo", null, null, "bar",[1, null, 3,4]]'
We can then do:
json.loads(not_quite_json)
Giving us:
['foo', None, None, 'bar', [1, None, 3, 4]]
Note that it's not as simple as a replace, as the replacement also inserts commas that can need replacing. Given this, you have to loop through until no more replacements can be made. Here I have used a simple regex to do the job.
I've had a look at Taymon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs.
It works well at simulating JavaScript eval(), but fails in one situation: trailing commas. There should be an optional trailing comma (see the tests below), but I can't find any proper way to implement this.
from pyparsing import *

TRUE = Keyword("true").setParseAction(replaceWith(True))
FALSE = Keyword("false").setParseAction(replaceWith(False))
NULL = Keyword("null").setParseAction(replaceWith(None))

jsonString = dblQuotedString.setParseAction(removeQuotes)
jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                     Optional('.' + Word(nums)) +
                     Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))

jsonObject = Forward()
jsonValue = Forward()

# black magic begins
commaToNull = Word(',,', exact=1).setParseAction(replaceWith(None))
jsonElements = ZeroOrMore(commaToNull) + Optional(jsonValue) + ZeroOrMore((Suppress(',') + jsonValue) | commaToNull)
# black magic ends

jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL)
memberDef = Group(jsonString + Suppress(':') + jsonValue)
jsonMembers = delimitedList(memberDef)
jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))
jsonComment = cppStyleComment
jsonObject.ignore(jsonComment)

def convertNumbers(s, l, toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)
jsonNumber.setParseAction(convertNumbers)

def test():
    tests = (
        '[1,2]',      # ok
        '[,]',        # ok
        '[,,]',       # ok
        '[ , , , ]',  # ok
        '[,1]',       # ok
        '[,,1]',      # ok
        '[1,,2]',     # ok
        '[1,]',       # failure, I got [1, None], I should have [1]
        '[1,,]',      # failure, I got [1, None, None], I should have [1, None]
    )
    for test in tests:
        results = jsonArray.parseString(test)
        print(results.asList())
For those looking for something quick and dirty to convert general JS objects (to dicts): part of the page of one real site gives me such an object that I'd like to tackle. There are 'new' constructs for dates, and it's all on one line with no spaces in between, so two lines suffice:
from re import sub

data = sub(r'new Date\(([^)]*)\)', r'\1', data)
data = sub(r'([,{])(\w*):', r'\1"\2":', data)
Then json.loads() worked fine. Your mileage may vary :)
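To illustrate the round trip, here is a made-up one-liner of the kind described (the input is hypothetical, not from the actual site):
import json
from re import sub

data = '{foo:1,bar:new Date(1234567890),baz:"x"}'
data = sub(r'new Date\(([^)]*)\)', r'\1', data)  # strip the Date constructor, keep its argument
data = sub(r'([,{])(\w*):', r'\1"\2":', data)    # put quotes around the bare keys
print(json.loads(data))
# {'foo': 1, 'bar': 1234567890, 'baz': 'x'}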

Parsing a lightweight language in Python

Say I define a string in Python like the following:
my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
I would like to parse that string in Python in a way that allows me to index the different structures of the language.
For example, the output could be a dictionary parsing_result that allows me to index the different elements in a structured manner.
For example, the following:
parsing_result['names']
would hold a list of strings: ['name1', 'name2']
whereas parsing_result['options'] would hold a dictionary so that:
parsing_result['something']['options']['opt2'] holds the string "text"
parsing_result['something_else']['options']['opt1'] holds the string "58"
My first question is: How do I approach this problem in Python? Are there any libraries that simplify this task?
For a working example, I am not necessarily interested in a solution that parses the exact syntax I defined above (although that would be fantastic), but anything close to it would be great.
Update
It looks like the generally right solution is using a parser and a lexer such as ply (thank you @Joran), but the documentation is a bit intimidating. Is there an easier way of getting this done when the syntax is lightweight?
I found this thread where the following regular expression is provided to partition a string around outer commas:
r = re.compile(r'(?:[^,(]|\([^)]*\))+')
r.findall(s)
But this assumes that the grouping characters are () (and not {}). I am trying to adapt this, but it doesn't look easy.
I highly recommend pyparsing:
The pyparsing module is an alternative approach to creating and
executing simple grammars, vs. the traditional lex/yacc approach, or
the use of regular expressions.
The Python representation of the grammar is quite
readable, owing to the self-explanatory class names, and the use of
'+', '|' and '^' operator definitions. The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.
Sample code (Hello world from the pyparsing docs):
from pyparsing import Word, alphas
greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
hello = "Hello, World!"
print (hello, "->", greet.parseString( hello ))
Output:
Hello, World! -> ['Hello', ',', 'World', '!']
Edit: Here's a solution to your sample language:
from pyparsing import *
import json

identifier = Word(alphas + nums + "_")
expression = identifier("lhs") + Suppress("=") + identifier("rhs")
struct_vals = delimitedList(Group(expression | identifier))
structure = Group(identifier + nestedExpr(opener="{", closer="}", content=struct_vals("vals")))
grammar = delimitedList(structure)

my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parse_result = grammar.parseString(my_string)
result_list = parse_result.asList()

def list_to_dict(l):
    d = {}
    for struct in l:
        d[struct[0]] = {}
        for ident in struct[1]:
            if len(ident) == 2:
                d[struct[0]][ident[0]] = ident[1]
            elif len(ident) == 1:
                d[struct[0]][ident[0]] = None
    return d

print(json.dumps(list_to_dict(result_list), indent=2))
Output: (pretty printed as JSON)
{
  "something_else": {
    "opt1": "58",
    "name3": null
  },
  "something": {
    "opt1": "2",
    "opt2": "text",
    "name2": null,
    "name1": null
  }
}
Use the pyparsing API as your guide to exploring the functionality of pyparsing and understanding the nuances of my solution. I've found that the quickest way to master this library is trying it out on some simple languages you think up yourself.
As stated by @Joran Beasley, you'd really want to use a parser and a lexer. They are not easy to wrap your head around at first, so you'd want to start off with a very simple tutorial on them.
If you are really trying to write a light weight language, then you're going to want to go with parser/lexer, and learn about context-free grammars.
If you are really just trying to write a program to strip data out of some text, then regular expressions would be the way to go.
If this is not a programming exercise, and you are just trying to get structured data in text format into python, check out JSON.
Here is a test of the regular expression, modified to react to {} instead of ():
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
print(r.findall(s))
You'll get a list of separate 'named blocks' as a result:
['something{name1, name2, opt1=2, opt2=text}', ' something_else{name3, opt1=58}']
I've made better code that can parse your simple example. You should, for example, catch exceptions to detect a syntax error, and restrict the valid block names and parameter names further:
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
rblock = re.compile(r'\s*(\w+)\s*{(.*)}\s*')
rparam = re.compile(r'\s*([^=\s]+)\s*(=\s*([^,]+))?')

blocks = r.findall(s)
for block in blocks:
    resb = rblock.match(block)
    blockname = resb.group(1)
    blockargs = resb.group(2)
    print("block name=", blockname)
    print("args:")
    for arg in re.split(",", blockargs):
        resp = rparam.match(arg)
        paramname = resp.group(1)
        paramval = resp.group(3)
        if paramval is None:
            print("param name =\"{0}\" no value".format(paramname))
        else:
            print("param name =\"{0}\" value=\"{1}\"".format(paramname, str(paramval)))
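Running it prints (output reconstructed by hand from the code above, so treat it as illustrative):
block name= something
args:
param name ="name1" no value
param name ="name2" no value
param name ="opt1" value="2"
param name ="opt2" value="text"
block name= something_else
args:
param name ="name3" no value
param name ="opt1" value="58"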

Python: Regex question / CSV parsing / Psycopg nested arrays

I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working on returns records that can have nested arrays as values. Psycopg only parses the outer array of such values.
My first approach was splitting the string on commas, but then I ran into the problem that some strings within the result also contain commas, which renders the entire approach unusable.
My next attempt was using regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).
Currently, this is my code:
import re

text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile(r'\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()
The result of this should be:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']
Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, uuid's, (signed) integers and (signed) decimals.
Am I using a wrong approach? Or can anyone point me in the right direction?
Thanks in advance!
Python's native csv lib should do a good job. Have you tried it already?
http://docs.python.org/library/csv.html
From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.
If you can do ASSERTIONS, this will get you on the right track.
This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match. But your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser, then iterate over the results for validation and fixup (yes, you have fixup stipulated in your example).
The two main parsing regexes are these:
strips the delimiter quotes too, and only $2 contains data; use in a while loop, global context
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
my preferred one; does not strip quotes, only captures $1; can be used to capture into an array or in a while loop, global context
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/
This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)
use strict; use warnings;

my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';

my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;

my $rxExpanded = qr/
    (?!}$)          # ASSERT ahead: NOT a } plus end
    (?:^{?|,)       # Boundary: Start of string plus { OR comma
    \s*             # 0 or more whitespace
    ( ".*?" | .*?)  # Capture "Quoted" or non-quoted data
    \s*             # 0 or more whitespace
    (?=,|}$)        # Boundary ASSERT ahead: Comma OR } plus end
/x;

my ($newstring, $success) = ('[', 0);
for my $field ($str =~ /$rx/g)
{
    my $tmp = $field;
    $success = 1;
    if ( $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
        $tmp = "'$tmp'";
    }
    $newstring .= "$tmp,";
}
if ( $success ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}
Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
It seemed that the CSV approach was the easiest to implement:
def parsePsycopgSQLArray(input):
    import csv
    import cStringIO
    input = input.strip("{")
    input = input.strip("}")
    buffer = cStringIO.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')
    return reader.next()  # There can only be one row

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
    result = parsePsycopgSQLArray(text)
    print result
Thanks for the responses, they were most helpful!
Improved upon Dirk's answer. This handles escape characters better, as well as the empty-array case. One less strip call as well:
import csv
from StringIO import StringIO

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python
    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return reader.next()
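A quick check with the value from the question (Python 2, to match reader.next() above):
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
print restore_str_array(text)
# ['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', '398547', '85.5', '-9.2', ' 62fe6393-00f7-418d-b0b3-7116f6d5cf10']
print restore_str_array('{}')
# [] -- the empty-array case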
