Pyparsing: Detect tokens with a specific ending

I wonder what I am doing wrong here. Maybe someone can give me a hint on this problem.
I want to detect certain tokens using pyparsing that terminate with the string _Init.
As an example, I have the following lines stored in text
one
two_Init
threeInit
four_foo_Init
five_foo_bar_Init
I want to extract the following lines:
two_Init
four_foo_Init
five_foo_bar_Init
Currently, I have reduced my problem to the following lines:
import pyparsing as pp
ident = pp.Word(pp.alphas, pp.alphanums + "_")
ident_init = pp.Combine(ident + pp.Literal("_Init"))
for detected, s, e in ident_init.scanString(text):
    print(detected)
Using this code there are no results. If I remove the "_" from the Word statement, then I can at least detect the lines that end in _Init, but the result isn't complete:
['two_Init']
['foo_Init']
['bar_Init']
Does anyone have any idea what I am doing wrong here?

The problem is that you want to accept '_' as long as it is not the '_' in the terminating '_Init'. Here are two pyparsing solutions: one is more "pure" pyparsing, the other just says the heck with it and uses an embedded regex.
samples = """\
one
two_Init
threeInit
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init"""
from pyparsing import Combine, OneOrMore, Word, alphas, alphanums, Literal, WordEnd, Regex
# implement explicit lookahead: allow '_' as part of your Combined OneOrMore,
# as long as it is not followed by "Init" and the end of the word
option1 = Combine(OneOrMore(Word(alphas, alphanums) |
                            '_' + ~(Literal("Init") + WordEnd()))
                  + "_Init")
# sometimes regular expressions and their implicit lookahead/backtracking do
# make things easier
option2 = Regex(r'\b[a-zA-Z_][a-zA-Z0-9_]*_Init\b')
for expr in (option1, option2):
    print('\n'.join(t[0] for t in expr.searchString(samples)))
    print()
Both options print:
two_Init
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init
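To see why the expression from the question finds nothing, it may help to look at what the greedy Word consumes on its own. The following is only a minimal sketch, assuming Python 3 and an installed pyparsing; the names greedy_ident, consumed and matches are made up for illustration.

```python
import pyparsing as pp

text = "one\ntwo_Init\nthreeInit\nfour_foo_Init\nfive_foo_bar_Init"

# The Word from the question allows '_' in its body, so it swallows the
# trailing "_Init" as part of the identifier.
greedy_ident = pp.Word(pp.alphas, pp.alphanums + "_")
consumed = greedy_ident.parseString("two_Init")[0]
print(consumed)

# pyparsing does not backtrack into the Word, so the Literal that follows
# never gets a chance to match "_Init" and nothing is reported.
ident_init = pp.Combine(greedy_ident + pp.Literal("_Init"))
matches = [tokens[0] for tokens, start, end in ident_init.scanString(text)]
print(matches)
```

This is the behaviour the explicit negative lookahead in option1 works around: the underscore is only accepted when it is not the start of the final "_Init".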

Related

Regex to exclude words followed by space

I tried a lot of solutions but can't get this Regex to work.
The string-
"Flow Control None"
I want to exclude "Flow Control" plus the blank space, and only return whatever is on the right.

Since you have tagged your question with #python and #regex, I'll outline a simple solution to your problem using these tools. Furthermore, the other two answers don't really tackle the exact problem of matching "whatever is on the right" of your "Flow Control " prefix.
First, start by importing the re builtin module (read the docs).
import re
Define the pattern you want to match. Here, we're matching "whatever is on the right" ((?P<suffix>.+)$) of ^Flow Control .
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
Grab the match for a given string (e.g. "Flow Control None")
suffix = pattern.search("Flow Control None").group("suffix")
print(suffix) # Out: None
Hopefully, this complete working example will also help you:
import re
def get_suffix(text: str):
    pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
    matches = pattern.search(text)
    return matches.group("suffix") if matches else None
examples = [
    "Flow Control None",
    "Flow Control None None",
    "Flow Control None",
    "Flow Control ",
]
for example in examples:
    suffix = get_suffix(text=example)
    if suffix:
        print(f"Matched: {repr(suffix)}")
    else:
        print(f"No matches for: {repr(example)}")

Use split like so:
my_str = 'Flow Control None'
out_str = my_str.split()[-1]
# 'None'
Or use re.findall:
import re
out_str = re.findall(r'^.*\s(\S+)$', my_str)[0]

If you really want a purely regex solution try this: (?<= )[a-zA-Z]*$
The (?<= ) matches a single ' ' but doesn't include it in the match. [a-zA-Z]* matches anything from a to z or A to Z any number of times. $ matches the end of the line.
You could also try replacing the * with a + if you want to ensure that your match has at least one letter (* will produce a 0-length match if your string ends in a space, + will match nothing).
But it may be clearer to do something like
data = "Flow Control None"
split = data.split(' ')
split[len(split) - 1] # returns "None"
EDIT data.split(' ')[-1] also returns "None"
or
data[data.rfind(' ') + 1:] # returns "None"
that don't involve regexes at all.
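The difference between * and + described in this answer can be checked with a short script. This is a sketch using only the standard re module; the variable names are chosen here for illustration.

```python
import re

star = re.compile(r"(?<= )[a-zA-Z]*$")
plus = re.compile(r"(?<= )[a-zA-Z]+$")

normal = "Flow Control None"
trailing_space = "Flow Control "

# Both variants find the last word when there is one.
star_normal = star.search(normal).group()
# With '*', a string ending in a space yields a zero-length match ...
star_trailing = star.search(trailing_space).group()
# ... while '+' finds no match at all.
plus_trailing = plus.search(trailing_space)

print(repr(star_normal), repr(star_trailing), plus_trailing)
```

If an empty result should be treated as "no match", the + variant is probably the safer choice, since it returns None instead of an empty string.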

regex in python - how to understand this ip label without parentheses

I have this code to check if a string is a valid IPv4 address:
import re
def is_ip4(IP):
    label = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
    pattern = re.compile("(" + label + "\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
It works fine, but if I remove the parentheses from the label, like this:
import re
def is_ip4(IP):
    label = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
    pattern = re.compile("(" + label + "\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
it reports "2090.1.11.0" and "20.1.11.0" as valid IPs, but not "2.1.11.0". I'm a bit confused about the difference between the versions with and without parentheses. Can someone explain this to me? Thanks.

The reason you need the parentheses is because of the two-step process you're using. By itself, the parentheses don't do anything (other than capturing in a group). But you're also doing this:
pattern = re.compile("(" + label + "\.){3}" + label + "$")
The label regex is copied twice, first for three repetitions followed by a period. That copy is fine (almost), because in the statement, it is enclosed in parentheses once more. However, the second copy is outside any parentheses, so you end up with a regex like (simplified):
pattern == '(a|ab|abc\.){3}a|ab|abc$'
This matches if either (a|ab|abc\.){3}a matches, or ab, or abc$, because the alternation splits the whole pattern at every top-level |. With parentheses, it would be like:
pattern == '((a|ab|abc)\.){3}(a|ab|abc)$'
So, although the parentheses appear superfluous, they are not for two reasons. They are keeping the period separate from the last option abc and they are keeping the final choices together and apart from the first bit.
However, you shouldn't be doing this in the first place. Just use:
from ipaddress import IPv4Address
def is_ip4(ip):
    # IPv4Address rejects IPv6 input; ip_address() would accept both versions
    try:
        IPv4Address(ip)
        return True
    except ValueError:
        return False
No installation required; ipaddress is part of the standard library.
The reason you get a match for '2090.1.11.0' is because matching it to this:
'([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$'
Comes down to matching it to this:
'([0-9]){3}[0-9]'
This works because [0-9] is the first option in the 'or' expression in parentheses, repeated three times, and the second [0-9] is the first option in the 'or' expression after the {3}.
Note that the $ you put in to ensure the entire string is matched is lumped in with the final 'or' option, so it doesn't do anything here.
Try running the below and note the identical first match:
import re
print(re.findall('([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$', '2090.1.11.0'))
print(re.findall('([0-9]){3}[0-9]', '2090.1.11.0'))
(ignore the second match on the first line, not as relevant)
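The precedence issue described in this answer can also be checked side by side. This is a small sketch that reuses the concatenation from the question; build_pattern and the variable names are hypothetical helpers introduced here, not part of the original code.

```python
import re

LABEL = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"

def build_pattern(label):
    # Same concatenation as in the question; only the label differs.
    return re.compile("(" + label + r"\.){3}" + label + "$")

without_parens = build_pattern(LABEL)
with_parens = build_pattern("(" + LABEL + ")")

samples = ["2090.1.11.0", "20.1.11.0", "2.1.11.0"]
# Map each sample to (matches without parentheses, matches with parentheses).
results = {
    ip: (bool(without_parens.match(ip)), bool(with_parens.match(ip)))
    for ip in samples
}
for ip, (bare, grouped) in results.items():
    print(f"{ip:12} without parentheses: {bare!s:5} with parentheses: {grouped}")
```

Only the parenthesized label rejects "2090.1.11.0" and accepts "2.1.11.0", which matches the behaviour reported in the question.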

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.

In Python, this isn't possible with a single regular expression: each capture of a group overwrites the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, using re.findall('c?[0-9]+', text).
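The two steps could look roughly like this. It is a sketch only: the outer pattern start:([^;]*); is one plausible way to do the first step, and the names outer and codes are illustrative.

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# Step 1: verify the prefix and the terminator, capturing everything between.
outer = re.match(r"start:([^;]*);", text)

# Step 2: collect every code from the captured part.
codes = re.findall(r"c?[0-9]+", outer.group(1)) if outer else []
print(codes)
```

Here re.match doubles as the validation step: if the string does not start with start: or has no ;, outer is None and the list of codes stays empty.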

You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if the string starts with this prefix
s.endswith(";") # True if the string ends with this suffix
s[len("start:"):-1].strip().split(', ') # list of tokens separated by ", " (strip removes the space after "start:")

This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # each group is [c-or-empty, digits, comma-or-empty]
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.

import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

Keyword Matching in Pyparsing: non-greedy slurping of tokens

Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and waiting for the _, which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]

I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
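Putting the pieces of this answer together gives something like the following. Treat it as a sketch under the assumption of Python 3 and an installed pyparsing; the names parsed and space_rejected are illustrative only.

```python
from pyparsing import (Combine, Literal, ParseException, Suppress, Word,
                       ZeroOrMore, alphanums, nums)

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
# An underscore only belongs to the identifier if no Parameter follows it.
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Value = Word(nums)
Entry = Identifier + Suppress('_') + Parameter + Value

parsed = Entry.parseString('ABC_123_SPEED_X 123').asList()
print(parsed)

# Combine requires the identifier parts to be contiguous, so a space inside
# the identifier makes the whole entry fail to parse.
try:
    Entry.parseString('ABC _123_SPEED_X 123')
    space_rejected = False
except ParseException:
    space_rejected = True
print(space_rejected)
```

The negative lookahead ~Parameter consumes no input, so the '_' before SPEED_X is left for the Suppress('_') in Entry.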

If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print(tokens) # prints: ['ABC_123', 'SPEED_X', '123']
If that is not the case, but the identifier length is fixed, you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)

You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print(entry.parseString("APC_123_SPEED_X 123"))
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.

This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?"
>>> from functools import reduce  # reduce is no longer a builtin in Python 3
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.

Using pyparsing to parse a word escape-split over multiple lines

I'm trying to parse words which can be broken up over multiple lines with a backslash-newline combination ("\\\n") using pyparsing. Here's what I have done:
from pyparsing import *
continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))
The output I get is ['super'], while the expected output is ['super', 'cali', 'fragi', 'listic']. Better still would be all of them joined as one word (which I think I can just do with multi_line_word.setParseAction(lambda t: ''.join(t))).
I tried looking at this code in pyparsing helper, but it gives me an error, maximum recursion depth exceeded.
EDIT 2009-11-15: I realized later that pyparsing gets a little generous with regards to white space, and that leads to some poor assumptions that what I thought I was parsing for was a lot looser. That is to say, we want to see no white space between any of the portions of the word, the escape, and the EOL character.
I realized that the little example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should be able to match what I intuitively think of as an escape-split word, and only an escape-split word. They will not match a basic word that is not escape-split. We can (and I believe should) use a different grammatical construct for that. This keeps it all tidy having the two separate.
import unittest
import pyparsing
# Assumes you named your module 'multiline.py'
import multiline
class MultiLineTests(unittest.TestCase):
    def test_continued_ending(self):
        case = '\\\n'
        expected = ['\\', '\n']
        result = multiline.continued_ending.parseString(case).asList()
        self.assertEqual(result, expected)
    def test_continued_ending_space_between_parse_error(self):
        case = '\\ \n'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.continued_ending.parseString,
            case
        )
    def test_split_word(self):
        cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
        expected = ['shiny']
        for case in cases:
            result = multiline.split_word.parseString(case).asList()
            self.assertEqual(result, expected)
    def test_split_word_no_escape_parse_error(self):
        case = 'shiny'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.split_word.parseString,
            case
        )
    def test_split_word_space_parse_error(self):
        cases = ('shiny \\', 'shiny\r\\', 'shiny\t\\', 'shiny\\ ')
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.split_word.parseString,
                case
            )
    def test_multi_line_word(self):
        cases = (
            'shiny\\',
            'shi\\\nny',
            'sh\\\ni\\\nny\\\n',
            ' shi\\\nny\\',
            'shi\\\nny ',
            'shi\\\nny captain'
        )
        expected = ['shiny']
        for case in cases:
            result = multiline.multi_line_word.parseString(case).asList()
            self.assertEqual(result, expected)
    def test_multi_line_word_spaces_parse_error(self):
        cases = (
            'shi \\\nny',
            'shi\\ \nny',
            'sh\\\n iny',
            'shi\\\n\tny',
        )
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.multi_line_word.parseString,
                case
            )
if __name__ == '__main__':
    unittest.main()

After poking around for a bit more, I came upon this help thread where there was this notable bit:
I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...
With that, I got the idea to change these two lines
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
To
multi_line_word = ZeroOrMore(split_word) + word
This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].
Next, I added a parse action that would join these tokens together:
multi_line_word.setParseAction(lambda t: ''.join(t))
This gives a final output of ['supercalifragilistic'].
The take home message I learned is that one doesn't simply walk into Mordor.
Just kidding.
The take home message is that one can't simply implement a one-to-one translation of BNF with pyparsing. Some tricks with using the iterative types should be called into use.
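Assembled into one script, the iterative version described so far might look like the sketch below. It assumes Python 3 and an installed pyparsing, uses LineEnd() (the class behind the lineEnd helper from the question), and does not yet enforce the stricter whitespace rules from the unit tests; the name joined is illustrative.

```python
from pyparsing import LineEnd, Literal, Suppress, Word, ZeroOrMore, alphas

# A backslash followed by the end of the line marks a continued word.
continued_ending = Literal('\\') + LineEnd()
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Iteration instead of BNF-style recursion: any number of continued
# fragments, followed by the final fragment.
multi_line_word = ZeroOrMore(split_word) + word
multi_line_word.setParseAction(lambda t: ''.join(t))

joined = multi_line_word.parseString('super\\\ncali\\\nfragi\\\nlistic').asList()
print(joined)
```

Because the repetition is expressed with ZeroOrMore, no Forward is needed and the order of alternatives no longer matters.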
EDIT 2009-11-25: To compensate for the more strenuous test cases, I modified the code to the following:
no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))
This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).

You are pretty close with your code. Any of these mods would work:
# '|' means MatchFirst, so the bare word alternative matched first and parsing stopped there;
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)
# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))
# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)
# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))
As you found in your pyparsing googling, BNF->pyparsing translations should be done with a special view to using pyparsing features in place of BNF, um, shortcomings. I was actually in the middle of composing a longer answer, going into more of the BNF translation issues, but you have already found this material (on the wiki, I assume).
