I have multiple filters for files (I'm using Python). Some of them are glob filters, and some are regular expressions. I have both case-sensitive and case-insensitive globs and regexes. I can transform a glob into a regular expression with fnmatch.translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression? The combined expression would, of course, be compiled as case sensitive.
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Having R1,R2,R3 regexes I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (case-insensitive glob). But I have another glob that looks like this: "*abc*" (case sensitive). How do I combine, in code, the two regexes produced by fnmatch.translate when one is case insensitive while the other is case sensitive?
Unfortunately, the regex ability you describe requires either ordinal modifiers or a modifier span. Here is what they look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
In older versions of Python, both fail to parse in re. (Since Python 3.6, re does support scoped modifier spans like (?i:...); mid-pattern ordinal modifiers remain unsupported, as inline global flags such as (?i) are only allowed at the start of a pattern.) If you are stuck on an older version, the closest thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
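If you do want to generate letter groups mechanically, a small helper will do it (a sketch; nocase_literal is a name I made up):

```python
import re

def nocase_literal(s):
    # Hypothetical helper: expand a literal string into letter groups,
    # e.g. "case" -> "[Cc][Aa][Ss][Ee]"; uncased characters are escaped.
    return "".join(
        "[%s%s]" % (c.upper(), c.lower()) if c.upper() != c.lower()
        else re.escape(c)
        for c in s
    )
```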
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that behaves equivalently without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
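As a rough sketch of that idea (deliberately simplified: it handles only *, ? and ordinary characters, and skips the [...] character-class branch that the real translate source also covers):

```python
import re

def translatenocase(pat):
    # A simplified fnmatch.translate variant: every cased character
    # becomes a two-character class like [tT]; no [...] classes handled.
    res = []
    for c in pat:
        if c == "*":
            res.append(".*")
        elif c == "?":
            res.append(".")
        elif c.lower() != c.upper():
            res.append("[%s%s]" % (c.lower(), c.upper()))
        else:
            res.append(re.escape(c))
    return "".join(res) + r"\Z"
```

Combining with a case-sensitive result is then just the alternation you already use, e.g. "(%s)|(%s)" % (translatenocase('*.txt'), fnmatch.translate('*abc*')).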
I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?
It depends what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
this pattern is very inefficient because it starts with an alternation: in the worst case (no match), each alternative must be tested at every position in the string.
(Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
Most of the positions where the match isn't possible are quickly discarded before the alternation.
Also, when you write a single pattern, obviously, the string is parsed only once.
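A rough way to see the effect in Python (illustrative only; Python's re is not PCRE, but the cheap first-character filter helps for the same reason):

```python
import re
import timeit

text = "x" * 200000  # worst case: no match anywhere

plain = re.compile(r"UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}")
prefiltered = re.compile(r"[UDP][KEL](?:(?<=UK)[A-Za-z]{10}"
                         r"|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})")

for pat in (plain, prefiltered):
    # Both find the same matches; only the scanning work differs.
    t = timeit.timeit(lambda: pat.findall(text), number=5)
    print(pat.pattern[:20], "%.3f s" % t)
```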
I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I have figured out so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy, which means they want to match as many characters as possible. You can add ? after a quantifier to make it lazy, so that it matches as few characters as possible. For example, \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
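A minimal demo of the difference:

```python
import re

html = "<dd><p>first</dd></dl> junk <dd><p>second</dd></dl>"

# Greedy: [\D3]+ runs as far as it can, swallowing the closing tags too.
print(re.findall(r"<dd><p>([\D3]+)</dd></dl>", html))

# Lazy: [\D3]+? stops as soon as the rest of the pattern can match.
print(re.findall(r"<dd><p>([\D3]+?)</dd></dl>", html))
```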
I am looking for a solution to match a single string against a set of wildcard strings. For example
>>> match("ab", ["a*", "b*", "*", "c", "*b"])
["a*", "*", "*b"]
The order of the output is of no importance.
I will have in the order of 10^4 wildcard strings to match against and I will do around ~10^9 match calls. This means I will probably have to rewrite my code like so:
>>> matcher = prepare(["a*", "b*", "*", "c", "*b"])
>>> for line in lines: yield matcher.match(line)
["a*", "*", "*b"]
I've started writing a trie implementation in Python that handles wildcards, and I just need to get those corner cases right. Still, I am curious to hear: how would you solve this? Are there any Python libraries out there that would let me solve this faster?
Some insights so far:
Named groups in (Python) re regular expressions will not help me here, since they'll only return one match.
pyparsing seems like an awesome library, but is sparsely documented and does not, as I see it, support matching multiple patterns.
You could use the FilteredRE2 class from the re2 library, with help from an Aho-Corasick algorithm implementation (or similar). From the re2 docs:
Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.

The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kind of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.
Seems like the Aho-Corasick algorithm would work. esmre seems to do what I'm looking for. I got this information from this question.
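For a pure-Python starting point, here is a sketch of the prepare/match idea (Matcher and its literal pre-filter are my own crude imitation of the FilteredRE2 approach: each glob is compiled once, and a glob is only tried when its longest wildcard-free fragment appears in the name):

```python
import fnmatch
import re

class Matcher:
    def __init__(self, globs):
        self.compiled = []
        for g in globs:
            # Longest fragment of the glob containing no wildcard syntax.
            literal = max(re.split(r"[*?\[\]]", g), key=len)
            self.compiled.append((g, literal, re.compile(fnmatch.translate(g))))

    def match(self, name):
        # Skip patterns whose required literal fragment is absent.
        return [g for g, lit, rx in self.compiled
                if (not lit or lit in name) and rx.match(name)]
```

The pre-filter only pays off when most globs contain distinctive literal text; for 10^4 patterns and 10^9 calls you would still want a real Aho-Corasick implementation for the filtering step.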
If I have:
statement = "(2*(3+1))*2"
I want to be able to handle multiple parentheses within parentheses for a math reader I'm writing. Perhaps I'm going about this the wrong way, but my goal was to recursively go deeper into the parentheses until there were none, and then I would perform the math operations. Thus, I would first want to focus on
"(2*(3+1))"
then focus on
"(3+1)"
I hoped to do this by assigning the focus value to the start index and the end index of the regex match. I have yet to figure out how to find the end index, but first I'm more interested in getting the regex to match at all. The expression
r"\(.+\)"
failed to match. I wanted it to read as "any one or more characters contained within a set of parentheses". Could someone explain why the above expression will not match the above statement in Python?
I love regular expressions. I use them all the time.
Don't use regular expressions for this.
You want an actual parser that will actually parse your math expressions. You might want to read this:
http://effbot.org/zone/simple-top-down-parsing.htm
Once you have actually parsed the expression, it's trivial to walk the parse tree and compute the result.
EDIT: @Lattyware suggested pyparsing, which should also be a good way to go, and might be easier than the effbot solution linked above.
https://github.com/pyparsing/pyparsing
Here's a direct link to the pyparsing sample code for a four-function algebraic expression evaluator:
http://pyparsing.wikispaces.com/file/view/fourFn.py
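To give a feel for what such a parser does, here is a minimal recursive-descent sketch for the four operations and parentheses (an illustration of the technique, not the effbot or pyparsing code; no unary minus):

```python
import re

def evaluate(expr):
    # Tokenize: integers and the operator/paren symbols.
    tokens = re.findall(r"\d+|[()+\-*/]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def atom():        # number | '(' expression ')'
        if peek() == "(":
            eat()
            val = expression()
            eat()      # the closing ')'
            return val
        return int(eat())

    def term():        # atom (('*'|'/') atom)*
        val = atom()
        while peek() in ("*", "/"):
            op, rhs = eat(), atom()
            val = val * rhs if op == "*" else val / rhs
        return val

    def expression():  # term (('+'|'-') term)*
        val = term()
        while peek() in ("+", "-"):
            op, rhs = eat(), term()
            val = val + rhs if op == "+" else val - rhs
        return val

    return expression()
```

The grammar structure (expression → term → atom) is what makes * bind tighter than +, and the recursion into atom is what handles arbitrarily nested parentheses.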
for what it's worth, here's a little more context:
regular expressions are called "regular" because they're associated with regular grammars, and regular grammars cannot describe (an unlimited number of) nested parentheses (they can describe a bunch of random parentheses, but cannot make them match in neat pairs).
one way to understand this is to understand that regular expressions can (modulo some details which i will explain at the end) be converted to deterministic finite automatons. which sounds intimidating but really just means that they can be converted into lists of "rules", where the rules depend on what you matched, and describe what you can match.
for example, the regular expression ab*c can be converted to:
1. at the start, you can only match a. then go to 2.
2. now, you can match b and go back to 2, or match c and go to 3.
3. you're done! the match was a success!
and that is a "deterministic finite automaton".
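in python, those three rules become a small table-driven matcher (a sketch; states are just the rule numbers above):

```python
# the rules for ab*c as a transition table
RULES = {
    1: {"a": 2},
    2: {"b": 2, "c": 3},
}
ACCEPTING = {3}

def dfa_match(s):
    state = 1
    for ch in s:
        state = RULES.get(state, {}).get(ch)
        if state is None:
            return False      # no rule applies: reject
    return state in ACCEPTING
```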
anyway, the interesting part of this is that if you sit down and try to make something like that for matching pairs of parentheses you can't! try it. you can match a finite number by making more and more rules, but you can't write a general set of rules that match an unlimited number of parentheses (i should add that the rules have to be of the form "if you match X go to Y").
now obviously you could modify that in various ways. you could allow more complex rules (like extending them to let you keep a count of the parentheses), and you could then get something that worked as you expect. but it wouldn't be a regular grammar.
given that regular expressions are limited in this way, why are they used rather than something more complex? it turns out that they're something of a sweet spot - they can do a lot, while remaining fairly simple and efficient. more complex grammars (kinds of rules) can be more powerful, but are also harder to implement, and have more problems with efficiency.
final disclaimer and promised extra details: in practice many regular expressions these days actually are more powerful than this (and should not really be called "regular expressions"). but the above is still the basic explanation of why you should not use a regexp for this.
ps jesse's suggested solution gets round this by using a regexp multiple times; the argument here is for a single use of the regexp.
I probably agree with steveha, and don't recommend regex for this, but to answer your question specifically, you need unescaped parens to pull out results groups (your pattern only has escaped parens):
>>> re.match(r"\((.+)\)", "(2*(3+1))*2").group(1)
'2*(3+1)'
If you go that route, you could iterate over the match results until you run out of matches, and then reverse the results list to work inside out.
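A sketch of that inside-out idea, applying the regexp repeatedly (the function names are mine; only + and * are handled, for brevity): match an innermost group, i.e. one with no parentheses inside it, replace it with its value, and repeat until none remain.

```python
import re

def flat_eval(expr):
    # Evaluate a parenthesis-free expression with only '+' and '*'
    # (products bind tighter than sums).
    total = 0
    for term in expr.split("+"):
        prod = 1
        for factor in term.split("*"):
            prod *= int(factor)
        total += prod
    return total

def eval_parens(statement):
    # [^()]+ guarantees the matched group contains no nested parentheses.
    inner = re.compile(r"\(([^()]+)\)")
    while True:
        m = inner.search(statement)
        if m is None:
            return flat_eval(statement)
        statement = (statement[:m.start()]
                     + str(flat_eval(m.group(1)))
                     + statement[m.end():])
```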
I'm trying to build my first non-trivial regular expression (for use in Python), but struggling.
Let us assume that a word in language X (NOT English) is a sequence of minimal 'structures'. Each 'structure' could be:
An independent vowel (basically one letter of the alphabet)
A consonant (one letter of the alphabet)
A consonant followed by a right-attaching vowel
A left-attaching vowel followed by a consonant
(Certain left-attaching vowels) followed by a consonant followed by (certain right-attaching vowels)
For example this word of 3 characters:
<a consonant><a left-attaching vowel><an independent vowel>
is not a valid word, and should not match the regex, because there is no consonant to the right of the left-attaching vowel.
I know all the Unicode ranges - the Unicode ranges for consonants, independent vowels, left-attaching vowels and so on.
Here is what I have so far:
WordPattern = (
ur'('
ur'[\u0985-\u0994]|'
ur'[\u0995-\u09B9]|'
ur'[\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]'
ur')+'
)
It's not working. Apart from getting it to work, I have three specific problems:
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges, for code readability and to prevent typing Unicode ranges multiple times.
(This seems very difficult) The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
Any help would be appreciated. This seems very complex to a beginner!
The appropriate tool for morphological analysis of languages with non-trivial morphology is a "finite state transducer". There are robust implementations that you can track down and use (one by Xerox PARC). There's one that has Python bindings (for use as an external library). Google it.
FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. It's complex machinery, so if your goals are simple (e.g., syllabification for purposes of hyphenation) you may want to look for something simpler. There are machine-learning algorithms that will "learn" hyphenation, for example. If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.
Now for your algorithm, in case you really only need a trivial implementation: Since any vowel or consonant could be independent, your rules are ambiguous: They allow "ab" to be parsed as "a-b". Such ambiguities mean that a regexp approach will probably never work, but you may get better results if you put the longer regexps first, so they are used in preference to the short ones when both would apply. But really you need to build a parser (by hand or using a module) and try different things in steps. It's backwards from what you imagined: Set up a loop that uses different regexps, and "consumes" the string in steps.
However, it seems to me that what you are describing is essentially syllabification. And the near-universal rule of syllabification is this: A syllable consists of a core vowel, plus as many preceding ("onset") consonants as the rules of the language allow, plus any following consonants that cannot belong to the next syllable. The rule is called "maximize onset", and it has the consequence that it's easier to parse your syllables backwards (from the end of the word). Try it out.
PS. You probably know this, but if you put the following as the second line in your scripts you can embed Bengali in your regexps:
# -*- coding: utf-8 -*-
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
Use the re.VERBOSE flag when compiling the regex.
pattern = re.compile(r"""(
[\u0985-\u0994] # comment to explain what this is
| [\u0995-\u09B9]
# etc.
)
""", re.VERBOSE)
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges
You can construct an RE from ordinary Python strings:
>>> subpatterns = {"vowel": "[aeiou]", "consonant": "[^aeiou]"}
>>> "{consonant}{vowel}+{consonant}*".format(**subpatterns)
'[^aeiou][aeiou]+[^aeiou]*'
The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
I'm not sure if I get what you mean, but... suppose you have a list of (uncompiled) REs, say, patterns, then you can compute their union with
re.compile("(%s)" % "|".join(patterns))
Be careful with special characters when constructing REs this way and use re.escape where necessary.
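Putting the two ideas together for the word pattern (a sketch: the range names and the exact structure list are my guesses at the question's intent, ordered longest-first so a longer structure is preferred over its prefix):

```python
import re

# Hypothetical names for the Unicode ranges in the question.
parts = {
    "ind_vowel":   "[\u0985-\u0994]",
    "consonant":   "[\u0995-\u09B9]",
    "right_vowel": "(?:\u09BE|[\u09C0-\u09C4])",
    "left_vowel":  "[\u09BF\u09C7\u09C8]",
}

# The permissible structures as templates; extend this list as needed.
structures = [
    "{left_vowel}{consonant}{right_vowel}",
    "{left_vowel}{consonant}",
    "{consonant}{right_vowel}",
    "{ind_vowel}",
    "{consonant}",
]

word_re = re.compile("(?:%s)+\\Z" % "|".join(s.format(**parts) for s in structures))
```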