Regular expression to parse word structure


I'm trying to build my first non-trivial regular expression (for use in Python), but struggling.
Let us assume that a word in language X (NOT English) is a sequence of minimal 'structures'. Each 'structure' could be:
An independent vowel (basically one letter of the alphabet)
A consonant (one letter of the alphabet)
A consonant followed by a right-attaching vowel
A left-attaching vowel followed by a consonant
(Certain left-attaching vowels) followed by a consonant followed by (certain right-attaching vowels)
For example this word of 3 characters:
<a consonant><a left-attaching vowel><an independent vowel>
is not a valid word, and should not match the regex, because there is no consonant to the right of the left-attaching vowel.
I know all the Unicode ranges - the Unicode ranges for consonants, independent vowels, left-attaching vowels and so on.
Here is what I have so far:
WordPattern = (
ur'('
ur'[\u0985-\u0994]|'
ur'[\u0995-\u09B9]|'
ur'[\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]'
ur')+'
)
It's not working. Apart from getting it to work, I have three specific problems:
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges, for code readability and to prevent typing Unicode ranges multiple times.
(This seems very difficult) The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
Any help would be appreciated. This seems very complex to a beginner!

The appropriate tool for morphological analysis of languages with non-trivial morphology is "finite state transducers". There are robust implementations that you can track down and use (one by Xerox Parc). There's one that has python bindings (for using as an external library). Google it.
FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. It's complex machinery, so if your goals are simple (e.g., syllabification for purposes of hyphenation) you may want to look for something simpler. There are machine-learning algorithms that will "learn" hyphenation, for example. If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.
Now for your algorithm, in case you really only need a trivial implementation: Since any vowel or consonant could be independent, your rules are ambiguous: They allow "ab" to be parsed as "a-b". Such ambiguities mean that a regexp approach will probably never work, but you may get better results if you put the longer regexps first, so they are used in preference to the short ones when both would apply. But really you need to build a parser (by hand or using a module) and try different things in steps. It's backwards from what you imagined: Set up a loop that uses different regexps, and "consumes" the string in steps.
However, it seems to me that what you are describing is essentially syllabification. And the near-universal rule of syllabification is this: A syllable consists of a core vowel, plus as many preceding ("onset") consonants as the rules of the language allow, plus any following consonants that cannot belong to the next syllable. The rule is called "maximize onset", and it has the consequence that it's easier to parse your syllables backwards (from the end of the word). Try it out.
PS. You probably know this, but if you put the following as the second line in your scripts you can embed Bengali in your regexps:
# -*- coding: utf-8 -*-
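A minimal sketch of the "loop that consumes the string in steps" idea, assuming Python 3 and the character ranges and structure ordering given in the question. The names STRUCTURES and split_structures are made up for illustration. The loop is greedy and never backtracks, which happens to be enough for these five structures but may not be for a longer list.

```python
import re

# Character classes built from the Unicode ranges quoted in the question.
INDEPENDENT_VOWEL = "[\u0985-\u0994]"
CONSONANT = "[\u0995-\u09B9]"
LEFT_VOWEL = "[\u09BF\u09C7\u09C8]"
RIGHT_VOWEL = "[\u09BE\u09C0-\u09C4]"

# Longest structures first, so they win over the shorter ones.
STRUCTURES = [
    LEFT_VOWEL + CONSONANT + RIGHT_VOWEL,
    LEFT_VOWEL + CONSONANT,
    CONSONANT + RIGHT_VOWEL,
    CONSONANT,
    INDEPENDENT_VOWEL,
]
COMPILED = [re.compile(p) for p in STRUCTURES]


def split_structures(word):
    """Return the list of structures in word, or None if it is not valid."""
    pos = 0
    pieces = []
    while pos < len(word):
        for rx in COMPILED:
            m = rx.match(word, pos)
            if m:
                pieces.append(m.group())
                pos = m.end()
                break
        else:
            # No structure matches at this position: reject the word.
            return None
    return pieces
```

With this sketch, the invalid example from the question (consonant, left-attaching vowel, independent vowel) is rejected because nothing can start at the left-attaching vowel.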

I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
Use the re.VERBOSE flag when compiling the regex.
pattern = re.compile(r"""(
[\u0985-\u0994] # comment to explain what this is
| [\u0995-\u09B9]
# etc.
)
""", re.VERBOSE)
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges
You can construct an RE from ordinary Python strings:
>>> subpatterns = {"vowel": "[aeiou]", "consonant": "[^aeiou]"}
>>> "{consonant}{vowel}+{consonant}*".format(**subpatterns)
'[^aeiou][aeiou]+[^aeiou]*'
The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
I'm not sure if I get what you mean, but... suppose you have a list of (uncompiled) REs, say, patterns, then you can compute their union with
re.compile("(%s)" % "|".join(patterns))
Be careful with special characters when constructing REs this way and use re.escape where necessary.
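Putting the three pieces together might look like the sketch below. It assumes Python 3 and the ranges from the question; RANGES, STRUCTURES, WORD and is_word are illustrative names, not anything from the original code. Note that str.format treats braces specially, so a regex quantifier such as {2} would have to be written {{2}} in these templates.

```python
import re

# Named character classes, so each Unicode range is typed only once.
RANGES = {
    "ivowel": "[\u0985-\u0994]",          # independent vowels
    "cons": "[\u0995-\u09B9]",            # consonants
    "lvowel": "[\u09BF\u09C7\u09C8]",     # left-attaching vowels
    "rvowel": "[\u09BE\u09C0-\u09C4]",    # right-attaching vowels
}

# The list of permissible structures; extend it by appending templates.
STRUCTURES = [
    "{lvowel}{cons}{rvowel}",
    "{lvowel}{cons}",
    "{cons}{rvowel}",
    "{cons}",
    "{ivowel}",
]

# Union of all structures, repeated one or more times.
WORD = re.compile(
    "(?:" + "|".join(s.format(**RANGES) for s in STRUCTURES) + ")+"
)


def is_word(text):
    """True if the whole of text is a sequence of permitted structures."""
    return WORD.fullmatch(text) is not None
```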

Related

competing regular expressions (race condition)

I'm trying to use python PLY (lex/yacc) to parse a language called 'GRBL'.
GRBL looks something like this:
G00 X0.0 Y0.0 Z-1.0
G01 X1.0
..
The 'G' Codes tell a machine to 'go' (or move) and the coordinates say where.
LEX requires us to specify a unique regular expression for every possible 'token'.
So in this case I need a regex that will clearly define 'G00' and one that will clearly define 'G01' etc.
Obviously one's first thought would be r'G00' etc.
However G code is imprecise. The G can be upper or lower case, there can be leading zeros etc.
(g0, G00, g001 etc.)
So something for G00 may be as simple as:
r'[Gg]{1}0*'
And for G01 we could have
r'[Gg]{1}0*1'
But this does not work. G00 parses correctly, but G01 gives:
LexToken(G00,'G0',3,21)
Illegal character '1'
That is, lex thinks that G01 is a G0 token and doesn't know what to do with the '1'.
Which is clearly some sort of greedy matching problem.
Unfortunately I can't use the "$" terminator to specify that the string must 'end' with a "1"
I realise this might seem simple to some, but I've been at this for 3 hours and can't get it to work! Does anyone know how to address this problem?

Note: There's pretty well no reason to write {1} in a regular expression. It means that the previous element should be repeated exactly once, which is what would have happened without the repetition operator. So all it does is to obfuscate the regular expression (and slow down matching).
But that's not your problem. Your problem is likely the order in which Ply applies the regular expressions. Ply creates a single massive Python regular expression by concatenating all patterns into a set of alternatives:
(pattern1)|(pattern2)|(pattern3)|...|(patternz)
The order in which the patterns are inserted is important because Python "regular" expressions use an ordered alternation operator (making them actually irregular in mathematical terms, but that's a side issue). So once some alternative matches, the following ones are not even tried.
The Ply manual defines the ordering:
All tokens defined by functions are added in the same order as they appear in the lexer file.
Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
I'm guessing that you're using functions, so that the patterns are in order of appearance in the file, because your second pattern, which is longer, would be applied first if they were defined as strings. But without seeing your actual file, it's very hard to know for sure.
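The effect of ordered alternation can be reproduced with plain re, without Ply. This is only a sketch of the combined-pattern idea using the two patterns from the question; the variable names are made up.

```python
import re

# Same two alternatives, in both possible orders.
short_first = re.compile(r"([Gg]0*)|([Gg]0*1)")
long_first = re.compile(r"([Gg]0*1)|([Gg]0*)")

# The first alternative that matches wins, even if a later one
# would have matched more of the input.
g01_short = short_first.match("G01").group()
g01_long = long_first.match("G01").group()
g00_long = long_first.match("G00").group()

print(g01_short, g01_long, g00_long)
```

With the shorter pattern first, only "G0" is consumed from "G01", which leaves the stray "1" that the lexer complains about.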
In any case, conventional wisdom for Ply lexers is to use as few patterns as possible, preferring to map keywords to tokens with dictionaries. In the case of GRBL one possibility might be to use [Gg][0-9]+(\.[0-9]+)? as the pattern and then extract the index in the semantic action.
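As a rough illustration of "one pattern, then extract the index", here is a sketch that uses plain re rather than a Ply token function. The helper name g_index and the (major, minor) return shape are assumptions made for this example.

```python
import re

# One pattern for every G word; the groups capture the numeric parts.
G_WORD = re.compile(r"[Gg]([0-9]+)(?:\.([0-9]+))?")


def g_index(text):
    """Return (major, minor) for a G word; minor is None if absent."""
    m = G_WORD.fullmatch(text)
    if m is None:
        raise ValueError("not a G word: %r" % text)
    minor = m.group(2)
    # int() discards leading zeros, so g001 and G1 give the same index.
    return int(m.group(1)), (int(minor) if minor is not None else None)
```

In a Ply lexer the same conversion would presumably go in the body of the token function, but that part is not shown here.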

Your pattern [Gg]{1}0* is too general: it matches G00, but it also matches the G0 at the start of G01 https://regex101.com/r/gkz0Wb/1
In the second case you are left with the single character 1.
You will have to make the pattern more specific, for example by requiring a whitespace character at the end: [Gg]{1}0*\s
https://regex101.com/r/dsSFaG/1

Is a single big regex more efficient than a bunch of smaller ones?

I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?

It depends what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
This pattern is very inefficient because it starts with an alternation: in the worst case (no match), each alternative has to be tested at every position in the string.
(* Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
Most of the positions where the match isn't possible are quickly discarded before the alternation.
Also, when you write a single pattern, obviously, the string is parsed only once.
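Whether the lookahead actually helps depends on the regex engine, so it is worth timing both forms on your own data. The sketch below, written for Python's re module, only checks that the two patterns find the same codes and provides a small timing helper; the sample text and the name time_pattern are invented for the example.

```python
import re
import timeit

PLAIN = re.compile(r"UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}")
GUARDED = re.compile(
    r"(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})"
)

# Made-up sample: two valid codes and one that is too short.
sample = "xx UKabcdefghij yy PLabcdefg zz DEabc " * 50


def time_pattern(pattern, text, number=200):
    """Seconds taken to run findall over text `number` times."""
    return timeit.timeit(lambda: pattern.findall(text), number=number)


print("plain  :", time_pattern(PLAIN, sample))
print("guarded:", time_pattern(GUARDED, sample))
```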

Finding a simpler Python RegEx for a string that contains each character at least once

I'm working on a small project and have need for Regular Expressions that accept strings that contain each character in a given alphabet at least once.
So for the alphabet {J, K, L} I would need a RegEx that accepts strings containing J one or more times AND K one or more times, AND L one or more times, in any order, with any amount of duplicate characters before, after, or in-between.
I'm pretty inexperienced with RegEx and so have trouble finding "lateral thinking" solutions to many problems. My first approach to this was therefore pretty brute-force: I took each possible "base" string, for example,
JKL, JLK, KJL, KLJ, LKJ, LJK
and allow for any string that could be built up from one of those starting points. However the resulting regular expression* (despite working) ends up being very long and containing a lot of redundancy. Not to mention this approach becomes completely untenable once the alphabet has more than a handful of characters.
I spent a few hours trying to find a more elegant approach, but I have yet to find one that still accepts every possible string. Is there a method or technique I could be using to get this done in a way that's more elegant and scalable (to larger alphabets)?
*For reference, my regular expression for the listed example:
((J|K|L)*J(J|K|L)*K(J|K|L)*L(J|K|L)*)|
((J|K|L)*J(J|K|L)*L(J|K|L)*K(J|K|L)*)|
((J|K|L)*K(J|K|L)*J(J|K|L)*L(J|K|L)*)|
((J|K|L)*K(J|K|L)*L(J|K|L)*J(J|K|L)*)|
((J|K|L)*L(J|K|L)*J(J|K|L)*K(J|K|L)*)|
((J|K|L)*L(J|K|L)*K(J|K|L)*J(J|K|L)*)

This is a typical use-case for a lookahead. You can simply use ^(?=[^J]*J)(?=[^K]*K)(?=[^L]*L) to check all your conditions. If your string also must contain only these characters, you can append [JKL]+$ to it.
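For larger alphabets the lookaheads can be generated rather than typed. A minimal sketch, assuming Python 3; the function name contains_all and its only_alphabet flag are made up for illustration.

```python
import re


def contains_all(alphabet, only_alphabet=False):
    """Compile a regex that requires every character of alphabet."""
    # One lookahead per character: skip anything that is not the
    # character, then require the character itself.
    lookaheads = "".join(
        "(?=[^{0}]*{0})".format(re.escape(ch)) for ch in alphabet
    )
    tail = ""
    if only_alphabet:
        # Optionally insist that no other characters appear.
        tail = "[{}]+$".format("".join(re.escape(ch) for ch in alphabet))
    return re.compile("^" + lookaheads + tail)


rx = contains_all("JKL")
strict = contains_all("JKL", only_alphabet=True)
print(bool(rx.match("LLKJ")), bool(rx.match("JJKK")))
```

The pattern grows linearly with the alphabet, instead of factorially as with the list of permutations.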

If using regex is not a requirement you could also check for the characters individually:
text = ...
alphabet = 'JKL'
assert all([character in text for character in alphabet])
Or if you do not want to allow characters that are not in the alphabet:
assert set(alphabet) == set(text)

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.

tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day.
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
1. Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
2. Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
3. First get a regex working that just captures the round bracket numbers and the dotted numbers.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
4. Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regexes that start getting horribly complicated.
5. Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
6. Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
7. Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you don't, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future, pasting in different patterns in place of X.
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
8. Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
9. Test it!
10. Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
11. Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
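For readers working in Python rather than .Net: Python's re module only accepts fixed-width lookbehind, so the "(?<=\d+\)|...)" form above will not compile there. One possible workaround, sketched below under that assumption, is to match the delimiter normally and put the "(?:(?!X).)*" part in a capturing group. DELIM, BETWEEN and pieces are names invented for this example.

```python
import re

# X from the walkthrough: a round bracket number or a dotted number
# that is not the start of a decimal like 6.99.
DELIM = r"\d+\)|\d+\.(?!\d)"

# Delimiter, then capture every character that does not start
# another delimiter.
BETWEEN = re.compile(r"(?:%s)((?:(?!%s).)*)" % (DELIM, DELIM))

test = ('ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another '
        'one and 3. yet another case 4)5) 6)10.')

# findall returns the contents of the single capturing group.
pieces = BETWEEN.findall(test)
print(pieces)

# The simpler route from the addendum: split and drop the leading text.
print(re.split(DELIM, test)[1:])
```

Both calls should print the same list, which is a handy cross-check when experimenting with the longer pattern.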

Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])

Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.

I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
demo
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
Demo 2

import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")  # raw string keeps the backslashes intact
print([x for x in regex.split(s) if x])  # drop the empty string before "1)"
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

What is the reason behind the advice that the substrings in regex should be ordered based on length?

longest first
>>> p = re.compile('supermanutd|supermanu|superman|superm|super')
shortest first
>>> p = re.compile('super|superm|superman|supermanu|supermanutd')
Why is the longest first regex preferred?

Alternatives in regexes are tested in the order you provide, so if the first branch matches, the engine doesn't check the other branches. This doesn't matter if you only need to test for a match, but it does matter if you want to extract text based on the match.
You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:
supermanutd
supermanu
superman
superm
then with your first Rx you'll get:
>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']
but with second Rx:
>>> regex.findall(string)
[u'super', u'super', u'super', u'super', u'super']
Test your regexes with http://www.pythonregex.com/
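The same comparison as a self-contained Python 3 sketch, with the alternatives sorted by length in code instead of by hand. The helper name union and the sample text are made up for illustration.

```python
import re

words = ["super", "superm", "superman", "supermanu", "supermanutd"]
text = "supermanutd supermanu superman superm"


def union(alternatives):
    """Join literal alternatives into one pattern, in the given order."""
    return re.compile("|".join(re.escape(w) for w in alternatives))


shortest_first = union(words)
longest_first = union(sorted(words, key=len, reverse=True))

print(shortest_first.findall(text))
print(longest_first.findall(text))
```

With the shortest alternative first, every match is just "super"; sorting by descending length recovers the full words.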

As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.
However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.

I'd guess it's because they're matched in that order, and it's faster to match shorter substrings. As an extreme example, a match against a single letter | a huge string will perform much better if the single letter (which is probably going to be responsible for the majority of matches anyway) is tested against first.
But in practice you should measure, not guess. If you need to have a performant regexp, test variations against representative test data.
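A minimal way to measure, using the standard timeit module. The test data here is invented and almost certainly not representative of yours, so treat the numbers as a template for your own experiment rather than as a result.

```python
import re
import timeit

# Invented data: many single letters, with a long literal in between.
huge = "x" * 200
text = ("a" * 50 + huge) * 20

single_first = re.compile("a|" + huge)
huge_first = re.compile(huge + "|a")


def seconds(pattern, number=200):
    """Time `number` runs of findall over the sample text."""
    return timeit.timeit(lambda: pattern.findall(text), number=number)


timings = {
    "single letter first": seconds(single_first),
    "huge string first": seconds(huge_first),
}
for label, value in timings.items():
    print(label, value)
```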

The advice to which you refer is contingent on the regex engine attempting to match the components of the alternation in strictly left-to-right order, as is documented for the Python re module.
Sorting substrings in descending length order is just a special case of a wider problem when you are trying to extract a series of tokens. The general principle is that you put the more specialised sub-regexes first. For example, you are writing the lexical analysis for a formula parser. You have a "float constant" subregex and an "int constant" subregex. Your first attempt at the float subregex is likely to also match int constants. If so, you have two choices: (1) write a more complicated float subregex that doesn't match int constants (2) put your int subregex first.
