pyparsing: How to handle "{ foo bar \n }" formatted stream?

I'm hoping someone can point out a way to get pyparsing to handle the following stream of data:
"text { \n line1 line1\n line2 line2\n \n }"
where the information between the braces is just a blob of strings to be parsed further later. The best I've been able to manage is to use SkipTo with a failOn argument:
line = SkipTo(LineEnd(), failOn=(LineStart()+LineEnd())|'}') + LineEnd().suppress()
nxos_clause = "with" + output_file + "{" + OneOrMore(line.setDebug()) + "}"
Debug shows
Match {SkipTo:(LineEnd) Suppress:(LineEnd)} at loc 76(4,1)
Exception raised:Found expression {{LineStart LineEnd} | "}"} (at char 94), (line:4, col:19)
(1, 'failed parse:', 'Expected "}" (at char 77), (line:4, col:2)')
The output I am looking for would be
"{", "line1 line1", "line2 line2", "}"
I know this is dead simple to do manually. I am looking to build a more complex grammar once I get the simple stuff working...

If newlines are significant, you'll need to remove them from the pyparsing set of default whitespace characters.
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' ')
To suppress empty lines, define an expression that matches empty lines, and match and suppress them, before testing for the expression that matches lines that may have content:
test = "text { \n line1 line1\n line2 line2\n \n }"
NL = LineEnd().suppress()
LBRACE,RBRACE = map(Literal, "{}")
emptyLine = Suppress(Empty() + NL)
line = SkipTo(NL) + NL
nxos_clause = "text" + LBRACE + OneOrMore(~RBRACE + (emptyLine | line)) + RBRACE
Note also the negative lookahead (~RBRACE) inside the OneOrMore; without it, the closing right brace would be consumed as a valid non-empty line.
Now parse the whole input string:
print(nxos_clause.parseString(test))
Gives:
['text', '{', 'line1 line1', 'line2 line2', '}']

Related

Parsing keyword next to special character (pyparsing)

Using pyparsing, how can I match a keyword immediately before or after a special character (like "{" or "}")? The code below shows that my keyword "msg" is not matched unless it is preceded by whitespace (or at start):
import pyparsing as pp
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
messageKw = pp.Keyword("msg")
messageExpr = pp.Forward()
messageExpr << messageKw + openBrace +\
    pp.ZeroOrMore(messageExpr) + closeBrace
try:
    result = messageExpr.parseString("msg { msg { } }")
    print(result.dump(), "\n")
    result = messageExpr.parseString("msg {msg { } }")
    print(result.dump())
except pp.ParseException as pe:
    print(pe, "\n", "Text: ", pe.line)
I'm sure there's a way to do this, but I have been unable to find it.
Thanks in advance
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
should be:
openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
or even just:
openBrace = pp.Suppress("{")
closeBrace = pp.Suppress("}")
(Most pyparsing classes will auto-promote a string argument "arg" to Literal("arg").)
When I have parsers with many punctuation marks, rather than have a big ugly chunk of statements like this, I'll collapse them down to something like:
OBRACE, CBRACE, OPAR, CPAR, SEMI, COMMA = map(pp.Suppress, "{}();,")
The problem you are seeing is that Keyword looks at the immediately-surrounding characters, to make sure that the current string is not being accidentally matched when it is really embedded in a larger identifier-like string. In Keyword('{'), this will only work if there is no adjoining character that could be confused as part of a larger word. Since '{' itself is not really a typical keyword character, using Keyword('{') is not a good use of that class.
Only use Keyword for strings that could be misinterpreted as identifiers. For matching characters that are not in the set of typical keyword characters (by "keyword characters" I mean alphanumerics + '_'), use Literal.
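The boundary check described above can be illustrated in plain Python. This is only a rough sketch of the idea, not pyparsing's actual implementation: the name keyword_matches_at and the IDENT_CHARS set are invented for this example, and pyparsing's real default set of identifier characters may differ slightly.

```python
import string

# Assumed set of "keyword characters" for this illustration only.
IDENT_CHARS = set(string.ascii_letters + string.digits + "_")


def keyword_matches_at(text, pos, word, ident_chars=IDENT_CHARS):
    """Return True if `word` occurs at `pos` and is not glued to an
    identifier-like character on either side (hypothetical helper)."""
    if not text.startswith(word, pos):
        return False
    end = pos + len(word)
    # The character just before the match must not look like part of a word.
    before_ok = pos == 0 or text[pos - 1] not in ident_chars
    # The character just after the match must not look like part of a word.
    after_ok = end >= len(text) or text[end] not in ident_chars
    return before_ok and after_ok


# "{" followed by a space passes the check ...
print(keyword_matches_at("msg { msg { } }", 4, "{"))
# ... but "{" immediately followed by "m" does not, which is why
# Keyword("{") rejects "msg {msg { } }".
print(keyword_matches_at("msg {msg { } }", 4, "{"))
```

Under this check the keyword "msg" itself is fine right after "{", since "{" is not an identifier character; it is the brace treated as a keyword that fails.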

Picking up field value using Python regex

This is an example of two lines in a file that I am trying to pick up information from.
...
{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE},
...
What I want is to find the 3rd field of this data structure for a given string that matches the 1st field, i.e.
SubtitleSettings_REPOSITORY => REPOSITORY
PREFERRED_SUBTITLE_LANGUAGE => SUBTITLE_LANGUAGE
The regex in my Python code can only handle the second line; it cannot cope with the first. How can I improve it?
import re
...
#field is given a value in previous code, can be "SubtitleSettings_REPOSITORY", or "PREFERRED_SUBTITLE_LANGUAGE"
match = re.search(field+'"[, \t]+(\w+)[, \t]+(\w+)', src_file.read(), re.M|re.I)
return_value = match.group(2)
You can insert (?:\(\w+\))?, which allows (and ignores) an optional word in parentheses there:
match = re.search(field+'"[, \t]+(\w+)[, \t]+(?:\(\w+\))?(\w+)', line, re.M|re.I)
With this, the line matches and you get 'REPOSITORY' as desired.
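As a quick check, here is a small self-contained sketch that applies this pattern to both sample lines. The helper name third_field and the SAMPLE string are made up for the example; re.escape is added as a precaution in case a field name ever contains regex metacharacters.

```python
import re

# The two sample lines from the question, inlined so the sketch runs as-is.
SAMPLE = '''\
    { "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
    { "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE},
'''


def third_field(field, text):
    """Return the 3rd field of the entry whose 1st field is `field`,
    skipping an optional cast such as (int32_t); None if not found."""
    pattern = re.escape(field) + r'"[, \t]+(\w+)[, \t]+(?:\(\w+\))?(\w+)'
    match = re.search(pattern, text)
    return match.group(2) if match else None


print(third_field("SubtitleSettings_REPOSITORY", SAMPLE))  # REPOSITORY
print(third_field("PREFERRED_SUBTITLE_LANGUAGE", SAMPLE))  # SUBTITLE_LANGUAGE
```

Returning None for a missing field avoids the AttributeError that match.group(2) would raise when re.search finds nothing.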
import re
with open("input.txt") as f:
    pattern = r"\{ \"(.+)\",.+,(.+)\}"
    for line in f:
        first, third = re.findall(pattern, line.strip())[0]
        print(first.strip(), "=>", third.strip())
prints
SubtitleSettings_REPOSITORY => (int32_t)REPOSITORY
PREFERRED_SUBTITLE_LANGUAGE => SUBTITLE_LANGUAGE
where input.txt contains
{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE}
Breakdown:
\{ \"(.+)\" matches strings with the structure { + space + " + text + " and extracts text
,.+,(.+)\} matches strings with the structure , + text1 + , + text2 + } and extracts text2

Check strings in a for loop for multiple regexs

I'm tracing log files for someone and they are a complete mess (no line breaks and no separators), so I wrote a few simple regexes to tidy them up. The logging codes are now neatly separated in a list, with each code's string attached to it in a sub-dict, like this:
Dict [
0 : [LOGCODE_53 : 'The string etc etc']
]
Since that was fairly easy, I planned to add some log recognition as well. I can match the LOGCODE, but the codes don't comply with any standard, and different LOGCODEs often carry the same output strings.
So I wrote a few regex matches to detect what each log entry is about. My question is: what is a sensible way to detect a large variety of string patterns? There may be around 110 different types of strings, and they differ so much that a single "super-match" is not possible. How can I run ~110 regexes over a string to find out its intent, and so index it in a logical register?
In other words: "take this $STRING, test all the $REGEXes in this $LIST against it, and tell me which $REGEX(es) (by index) match the string".
My code:
import re
# Open, read and close the log file
with open('000000df.log') as f:
    text = f.read()
matches = re.findall(r'00([a-zA-Z0-9]{2})::((?:(?!00[a-zA-Z0-9]{2}::).)+)', text)
print('Matches: ' + str(len(matches)))
print('==========================================================================================')
for match in matches:
    submatching = re.findall(r'(.*?)\'s (.*?) connected (.*?) with ZZZ device (.*?)\.', match[1])
    print(match[0] + ' >>> ' + match[1])
    if submatching:  # skip entries that are not of this log type
        print(match[0] + ' >>> ' + ', '.join(submatching[0]))
re.match and re.search return None when a pattern doesn't match, and re.findall returns an empty list, so you can simply iterate over your candidate regular expressions and test each one:
tests = [
    re.compile(r'...'),
    re.compile(r'...'),
    re.compile(r'...'),
    re.compile(r'...')
]
for test in tests:
    matches = test.findall(your_string)
    if matches:
        print(test.pattern, 'works')
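To get back the indexes of the matching patterns, as the question asks, the loop can be wrapped in a small helper. This is a minimal sketch: the entries in PATTERNS and the name matching_indexes are placeholders invented for illustration, standing in for the real ~110 expressions.

```python
import re

# Hypothetical patterns; the real list would hold one entry per log type.
PATTERNS = [
    re.compile(r"(.*?)'s (.*?) connected (.*?) with ZZZ device (.*?)\."),
    re.compile(r"timeout after (\d+) ms"),
    re.compile(r"connected"),
]


def matching_indexes(text, patterns=PATTERNS):
    """Return the index of every pattern that matches somewhere in text."""
    return [i for i, pattern in enumerate(patterns) if pattern.search(text)]


print(matching_indexes("Bob's phone connected quickly with ZZZ device 42."))  # [0, 2]
print(matching_indexes("timeout after 30 ms"))  # [1]
print(matching_indexes("nothing recognizable"))  # []
```

Compiling the patterns once up front keeps the per-string cost down to one search per pattern, which is usually acceptable for a list of this size.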

Regex replace with correct spacing

I need help with a regex problem involving Chinese characters in Python.
"拉柏多公园" is the correct form of the word, but in a text I found "拉柏 多公 园". What regex should I use to replace it with the correct form?
import re
name = "拉柏多公园"
line = "whatever whatever it is then there comes a 拉柏 多公 园 sort of thing"
line2 = "whatever whatever it is then there comes another拉柏 多公 园 sort of thing"
line3 = "whatever whatever it is then there comes yet another 拉柏 多公 园sort of thing"
line4 = "whatever whatever it is then there comes a拉柏 多公 园sort of thing"
firstchar = "拉"
lastchar = "园"
I need to replace the strings in the lines so that the output looks like this:
line = "whatever whatever it is then there comes a 拉柏多公园 sort of thing"
line2 = "whatever whatever it is then there comes another 拉柏多公园 sort of thing"
line3 = "whatever whatever it is then there comes yet another 拉柏多公园 sort of thing"
line4 = "whatever whatever it is then there comes a 拉柏多公园 sort of thing"
I tried the following, but the regex is badly structured:
reline = line.replace (r"firstchar*lastchar", name) #
reline2 = reline.replace (" ", " ")
print reline2
Can someone help me correct my regex?
Thanks
(I assume you're using Python 3, since you're using Unicode characters in regular strings. For Python 2, add u before each string literal.)
Python 3
import re
name = "拉柏多公园"
# the string of Chinese characters, with any number of spaces interspersed.
# The regex will match any surrounding spaces.
regex = r"\s*拉\s*柏\s*多\s*公\s*园\s*"
So you can replace each string with
reline = re.sub(regex, ' ' + name + ' ', line)
Python 2
# -*- coding: utf-8 -*-
import re
name = u"拉柏多公园"
# the string of Chinese characters, with any number of spaces interspersed.
# The regex will match any surrounding spaces.
regex = ur"\s*拉\s*柏\s*多\s*公\s*园\s*"
So you can replace each string with
reline = re.sub(regex, u' ' + name + u' ', line)
Discussion
The result will be surrounded by spaces. More generally, if you want it to work at the start or end of the line, or before commas or periods, you'll have to replace ' ' + name + ' ' with something more sophisticated.
Edit: fixed. Of course, you have to use the re library function.
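Rather than typing the pattern by hand, it can be built from the name itself. The following Python 3 sketch assumes the same approach as above (optional whitespace between and around the characters); the helper names spaced_pattern and normalize are made up for this example.

```python
import re


def spaced_pattern(name):
    """Build a regex matching `name` with optional whitespace between
    its characters and on either side (hypothetical helper)."""
    parts = [re.escape(ch) for ch in name]
    return re.compile(r"\s*" + r"\s*".join(parts) + r"\s*")


def normalize(line, name):
    """Replace any spaced-out occurrence of `name` with ' name '."""
    # A function is used as the replacement so that `name` is inserted
    # literally, without backslash processing.
    return spaced_pattern(name).sub(lambda m: " " + name + " ", line)


name = "拉柏多公园"
print(normalize("there comes a拉柏 多公 园sort of thing", name))
# there comes a 拉柏多公园 sort of thing
```

As noted in the discussion above, a match at the very start or end of a line still picks up an extra space, so a final strip() or a more careful replacement may be needed there.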

Python Regular expression must strip whitespace except between quotes

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This matches anything between quotes, but I need the opposite: leave those matches alone and match the whitespace outside them.
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        # even-numbered items lie outside the quotes
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
print(stripwhite('This is a string with some "text in quotes."'))
Here is a one-liner version, based on @kindall's idea, that does not use regex at all. First split on ", then split() every other item and re-join the pieces, which takes care of the whitespace:
stripWS = lambda txt: '"'.join(it if i % 2 else ''.join(it.split())
                               for i, it in enumerate(txt.split('"')))
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
import shlex
print(" ".join(shlex.split('Hello "world this is" a test')))
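If the goal is to drop whitespace outside quotes while keeping the quoted parts intact, one possible variant is to call shlex.split with posix=False, which leaves the quote characters attached to quoted tokens, and then join without a separator. This is a sketch under those assumptions: strip_unquoted_whitespace is a made-up name, shlex also treats single quotes as quote characters, and quotes embedded in the middle of a word are not recognized as quotes in this mode.

```python
import shlex


def strip_unquoted_whitespace(text):
    """Remove whitespace outside double-quoted sections (hypothetical helper).

    In non-POSIX mode shlex keeps the surrounding quotes on each quoted
    token, so joining the tokens with no separator preserves them.
    """
    tokens = shlex.split(text, posix=False)
    return "".join(tokens)


print(strip_unquoted_whitespace('Hello "world this is" a test'))
# Hello"world this is"atest
```

An unbalanced quote makes shlex raise ValueError, which may or may not be the behavior wanted for malformed input.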
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code:
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here is a somewhat longer version that checks for a quote without a matching pair. It only deals with one style of start and end string (adaptable, for example, with start, end = '()'):
start, end = '"', '"'
for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''
    while start in test:
        clean, _, test = test.partition(start)
        clean = clean.replace(' ', '') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError('Missing end quote %s' % end)
        else:
            clean += inside + tag  # whitespace inside the quotes is kept
        result += clean
    result += test.replace(' ', '')
    print(result)
