In PyParsing, how to stop a Regex from consuming the entire string - python

I'm trying to write a function parse such that, for example,
assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
The string consists of a fixed prefix file://, followed by a file name (which can consist of one or more of any character), followed by a colon and a string representing access flags.
Here is one implementation using regular expressions:
import re
def parse(string):
    SCHEME = r"file://" # File prefix
    PATH_PATTERN = r"(?P<path>.+)" # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>[rwab+0-9]+)" # The letters r, w, a, b, a '+' symbol, or any digit
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$" # The full pattern including the end of line character
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']
I would prefer to use PyParsing, however, because it typically gives more detailed error messages if the string doesn't match the expression (rather than re.match which simply returns None), and I would eventually like to make the flags optional.
Following Paul McGuire's answer in python regex in pyparsing, I made the following attempt:
from pyparsing import Word, alphas, nums, StringEnd, Regex, FollowedBy, Suppress, Literal
def parse(string):
    scheme = Literal("file://")
    path = Regex(".+")
    flags = Word(alphas + nums + "+")
    expression = Suppress(scheme) + (~(Suppress(":") + flags + StringEnd()) + path("path") + Suppress(":") + flags("flags") + StringEnd())
    tokens = expression.parseString(string)
    return tokens['path'], tokens['flags']
In the second part of the expression, I'm basically trying the negative lookahead (~suffix + path + suffix), where suffix is ":" + flags + StringEnd(). However, when trying to parse "file://foo:bar.txt:r+", I run into the following error:
pyparsing.ParseException: Expected ":" (at char 21), (line:1, col:22)
Since the string is 21 characters long, I interpret this as that the Regex has 'consumed' the entire string so that the suffix is no longer 'found'.
How can I fix the parse method using pyparsing?
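One way to see what is going on, sketched with the standard re module rather than with pyparsing itself: a pyparsing Regex element matches its pattern at the current position and keeps whatever it matched, so the elements after it cannot make it give characters back. The negative lookahead is also tested only once, at the position where the path starts, not at every character. The variable names below are only for illustration.

```python
import re

s = "file://foo:bar.txt:r+"
start = len("file://")  # position where the path begins

# ".+" on its own, as in Regex(".+"): it runs to the end of the string.
path_only = re.compile(r".+")
path_end = path_only.match(s, start).end()
print(path_end, len(s))  # 21 21

# A negative lookahead in front is checked once, at the start position.
# The text there is "foo...", not the ":flags" suffix, so it passes and
# ".+" still consumes everything.
guarded = re.compile(r"(?!:[A-Za-z0-9+]+$).+")
guarded_end = guarded.match(s, start).end()
print(guarded_end)  # 21

# Inside ONE regular expression, ".+" can backtrack to leave room for
# the suffix, which is why the pure re version in the question works.
whole = re.compile(r"(.+):([A-Za-z0-9+]+)$")
parts = whole.match(s, start).groups()
print(parts)  # ('foo:bar.txt', 'r+')
```

So the suffix either has to live inside the same regular expression as the path, or the path has to be built from pieces that each check for the suffix before consuming more input.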

Try this:
import re
s = "file://foo:bar.txt:r+"
path, flags = re.match(r'.*//(.*):(.*)$', s).groups()
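The question also asks for clearer errors and, eventually, optional flags. Staying with the re module, a minimal sketch could look like the following. The name FILE_RESOURCE and the wording of the error message are assumptions made for this example, not part of the original code.

```python
import re

# Lazy path, then an optional ":flags" suffix anchored at the end.
FILE_RESOURCE = re.compile(
    r"file://(?P<path>.+?)(?::(?P<flags>[rwab+0-9]+))?$"
)


def parse(string):
    """Return (path, flags); flags is None when no flags are given."""
    match = FILE_RESOURCE.match(string)
    if match is None:
        # re.match only returns None, so raise something more descriptive.
        raise ValueError(
            "expected 'file://<path>[:<flags>]', got %r" % (string,)
        )
    return match.group("path"), match.group("flags")


print(parse("file://foo:bar.txt:r+"))  # ('foo:bar.txt', 'r+')
print(parse("file://foo:bar.txt"))     # ('foo:bar.txt', None)
```

One caveat of optional flags: a file that is really named "foo:r" cannot be told apart from the path "foo" with flags "r", so the format itself becomes ambiguous.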

Related

what regular expression can extract data I need?

I have a string
url = '//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'
I would like to extract the number 528341191030 between the first two \u. I tried this:
m = re.search('\?id\u\d+d(\d+?)\u', url)
if m:
    print m.group(1)
But it doesn't work. What is wrong with my solution?
Are you sure you need regex?
Here is a solution using split:
url.split("\u")[1].split("d")[-1]
'528341191030'
In terms of what is wrong with your regex, "\" is a special character, so you should use "\\" for backslash (so "\\\u" instead of "\u"):
m = re.search('\?id\\\u\d+d(\d+?)\\\u', url)
if m:
    print m.group(1)
Gives: 528341191030
Docs:
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal.
Or, use raw string notation:
m = re.search(r"\?id\\u\d+d(\d+?)\\u", url)
if m:
    print m.group(1)
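For readers on Python 3, here is a small self-contained check of the raw-string version. It assumes the URL really contains literal backslash-u sequences (as it would if it were scraped from escaped text), which is why the URL itself is written as a raw string here.

```python
import re

# Assumption: the data holds a literal backslash followed by "u",
# so the URL is written as a raw string.
url = r'//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'

# Both spellings produce the same pattern text: two backslashes and a "u".
assert '\\\\u' == r'\\u'

m = re.search(r'\?id\\u\d+d(\d+?)\\u', url)
item_id = m.group(1) if m else None
print(item_id)  # 528341191030
```

The raw string keeps the pattern readable: each \\ in it is the regular-expression escape for one literal backslash.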
Well, you could always try this (not super elegant but works):
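first = url.find('\\u') + 2  # index just past the first "\u"
m = ""
for i in range(first, len(url)):
    # Stop at the next "\u" sequence
    if url[i] == '\\' and url[i + 1:i + 2] == 'u':
        break
    m += url[i]
    # Everything up to and including the 'd' of "003d" is discarded
    if url[i] == 'd':
        m = ""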
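A better way is to parse the URL and read the query string values. Note that this relies on Python 3, where the \u003d and \u0026 escapes in a plain string literal are decoded to = and &: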
url = '//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'
import urllib.parse as urlparse
print ( urlparse.parse_qs(urlparse.urlparse(url).query) )
print ( urlparse.parse_qs(urlparse.urlparse(url).query)['id'] )
Output:
{'id': ['528341191030'], 'ns': ['1'], 'abbucket': ['0']}
['528341191030']
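If the string holds literal backslash-u sequences instead (for example because it was read from a file or a page source), the escapes have to be decoded before the URL can be parsed. A minimal sketch, assuming the text is pure ASCII apart from the escapes:

```python
from urllib.parse import parse_qs, urlparse

# Raw string: the backslashes are really part of the data here.
raw = r'//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'

# Turn "\u003d" into "=" and "\u0026" into "&".
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)  # //item.taobao.com/item.htm?id=528341191030&ns=1&abbucket=0#detail

query = parse_qs(urlparse(decoded).query)
item_id = query['id'][0]
print(item_id)  # 528341191030
```

After decoding, no regular expression is needed at all; the query string parser does the work.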

Parsing keyword next to special character (pyparsing)

Using pyparsing, how can I match a keyword immediately before or after a special character (like "{" or "}")? The code below shows that my keyword "msg" is not matched unless it is preceded by whitespace (or at start):
import pyparsing as pp
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
messageKw = pp.Keyword("msg")
messageExpr = pp.Forward()
messageExpr << messageKw + openBrace +\
    pp.ZeroOrMore(messageExpr) + closeBrace
try:
    result = messageExpr.parseString("msg { msg { } }")
    print result.dump(), "\n"
    result = messageExpr.parseString("msg {msg { } }")
    print result.dump()
except pp.ParseException as pe:
    print pe, "\n", "Text: ", pe.line
I'm sure there's a way to do this, but I have been unable to find it.
Thanks in advance
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
should be:
openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
or even just:
openBrace = pp.Suppress("{")
closeBrace = pp.Suppress("}")
(Most pyparsing classes will auto-promote a string argument "arg" to Literal("arg").)
When I have parsers with many punctuation marks, rather than have a big ugly chunk of statements like this, I'll collapse them down to something like:
OBRACE, CBRACE, OPAR, CPAR, SEMI, COMMA = map(pp.Suppress, "{}();,")
The problem you are seeing is that Keyword looks at the immediately-surrounding characters, to make sure that the current string is not being accidentally matched when it is really embedded in a larger identifier-like string. In Keyword('{'), this will only work if there is no adjoining character that could be confused as part of a larger word. Since '{' itself is not really a typical keyword character, using Keyword('{') is not a good use of that class.
Only use Keyword for strings that could be misinterpreted as identifiers. For matching characters that are not in the set of typical keyword characters (by "keyword characters" I mean alphanumerics + '_'), use Literal.
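To make the boundary check concrete, here is a rough plain-Python sketch of the rule described above. It is not pyparsing's actual code: keyword_matches and IDENT_CHARS are hypothetical names, and the character set follows the "alphanumerics + '_'" description given in this answer.

```python
import string

# "Keyword characters" as described above (an assumption of this sketch).
IDENT_CHARS = string.ascii_letters + string.digits + "_"


def keyword_matches(text, pos, keyword, ident_chars=IDENT_CHARS):
    """Keyword-style match: the text must not touch identifier characters."""
    if not text.startswith(keyword, pos):
        return False
    end = pos + len(keyword)
    before_ok = pos == 0 or text[pos - 1] not in ident_chars
    after_ok = end >= len(text) or text[end] not in ident_chars
    return before_ok and after_ok


# The "{" at index 4 is followed by a space: accepted.
print(keyword_matches("msg { msg { } }", 4, "{"))  # True
# The "{" at index 4 is followed by "m": rejected, as in the question.
print(keyword_matches("msg {msg { } }", 4, "{"))   # False
```

A Literal-style match would be just the startswith test, which is why it accepts "{msg".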

The elegant way to replace specific characters in Python

I have strings that are unpredictable in terms of character content, but I know that every string contains exactly one character '*'.
How can I replace the two characters after the '*' with a string that is not hard-coded? The replacement is a computed checksum converted to a hex string:
checksum_str = str(hex(csum).lstrip('0x'))
You want something like:
star_pos = my_string.find('*')
my_string = my_string[:star_pos] + '*' + checksum_str + my_string[star_pos + 3:]
You can do it with a regular expression:
import re
my_string = re.sub(r'(?<=\*)..', checksum_str, my_string, 1)
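One thing to watch in the checksum line from the question: lstrip('0x') removes any leading '0' and 'x' characters, not the literal prefix "0x", so a checksum of 0 turns into an empty string. A minimal sketch that always produces exactly two hex digits, assuming the checksum fits in one byte (replace_after_star is a hypothetical helper name):

```python
def replace_after_star(text, csum):
    """Replace the two characters after the single '*' with a 2-digit hex checksum."""
    # Assumption: a one-byte checksum, rendered as two lowercase hex digits.
    checksum_str = format(csum & 0xFF, "02x")
    star_pos = text.index("*")  # raises ValueError if there is no '*'
    return text[:star_pos + 1] + checksum_str + text[star_pos + 3:]


print(replace_after_star("abc*XYdef", 10))    # abc*0adef
print(replace_after_star("abc*XYdef", 0x4A))  # abc*4adef
print(repr(hex(0).lstrip('0x')))              # '' (the pitfall mentioned above)
```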

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single one: re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if the string starts with this prefix
s.endswith(";") # True if the string ends with this suffix
s[6:-1].strip().split(', ') # list of tokens separated by ", " (strip() drops the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Each group is [prefix, digits, comma]; keep the digits
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
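The version above accepts anything between "start:" and ";". If the list itself should be verified as well, as the question asks, one possible two-step sketch is to validate the whole list with one pattern and then pull out the codes with findall. LINE and parse_codes are names made up for this example, and the tolerance for extra whitespace is an assumption.

```python
import re

# "start:", a comma-separated list of codes (optional leading "c"), then ";".
LINE = re.compile(r"start:\s*((?:c?\d+\s*,\s*)*c?\d+)\s*;")


def parse_codes(line):
    """Return the list of codes, or None if the line is not well formed."""
    match = LINE.match(line)
    if match is None:
        return None
    return re.findall(r"c?\d+", match.group(1))


print(parse_codes("start: c12354, c3456, 34526; other stuff"))
# ['c12354', 'c3456', '34526']
print(parse_codes("start: c12354, oops; other stuff"))  # None
```

Here the codes keep their "c" prefix; use r"c?(\d+)" in the findall call to get only the digits.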

Keyword Matching in Pyparsing: non-greedy slurping of tokens

Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and then waiting for the _, which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]
I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but once that pair is repeated, it is exactly what gets us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
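For comparison, the same "underscore, but only if the Parameter does not follow" idea can be written as a plain regular expression with a negative lookahead. This is only an analogue of the pyparsing expression above, built with the standard re module; ENTRY and the group names are made up for this sketch.

```python
import re

ENTRY = re.compile(
    r"(?P<ident>[A-Za-z0-9]+"                 # like Word(alphanums)
    r"(?:_(?!SPEED_[XYZ])[A-Za-z0-9]+)*)"     # like ZeroOrMore('_' + ~Parameter + Word(alphanums))
    r"_(?P<param>SPEED_[XYZ])"                # the suppressed '_' plus the Parameter
    r"\s+(?P<value>\d+)"                      # the Value
)

m = ENTRY.match("ABC_123_SPEED_X 123")
fields = m.groups() if m else None
print(fields)  # ('ABC_123', 'SPEED_X', '123')

# No whitespace is allowed inside the identifier, as with Combine:
print(ENTRY.match("ABC _123_SPEED_X 123"))  # None
```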
If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print tokens # prints: ['ABC_123', 'SPEED_X', '123']
If that is not the case, but the identifier has a fixed length, you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)
You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print entry.parseString("APC_123_SPEED_X 123")
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.
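One possible adjustment, sketched in plain Python: instead of counting underscores, split on the known parameter names at the end of the token. PARAMETERS and split_on_known_param are hypothetical names for this example; inside a parse action the function would be called with tokens[0].

```python
# Assumption: the full set of parameter names is known in advance.
PARAMETERS = ("SPEED_X", "SPEED_Y", "SPEED_Z")


def split_on_known_param(token):
    """Split 'IDENT_PARAM' into [ident, param] for any known parameter."""
    for param in PARAMETERS:
        suffix = "_" + param
        # The identifier part must not be empty.
        if token.endswith(suffix) and len(token) > len(suffix):
            return [token[:-len(suffix)], param]
    raise ValueError("no known parameter at the end of %r" % (token,))


print(split_on_known_param("ABC_123_SPEED_X"))  # ['ABC_123', 'SPEED_X']
print(split_on_known_param("A_B_C_D_SPEED_Z"))  # ['A_B_C_D', 'SPEED_Z']
```

This keeps working when the identifier contains any number of underscores, at the cost of listing the parameters explicitly.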
This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?":
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't look ahead or backtrack on its own. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.
