Python findall, regexp with "|" and grouping - python

I'm trying to implement Pig Latin with Python.
I want to match strings that begin with a consonant or "qu" (regardless of case) in order to find the first letters, so at first I was doing:
first_letters = re.findall(r"^[^aeiou]+|^[qQ][uU]", "qualification")
It didn't work (it finds only "q"), so I figured that I had to add the q to the first character class:
first_letters = re.findall(r"^[^aeiouq]+|^[qQ][uU]", "qualification")
So that works (it finds "qu" and not only "q")!
But playing around I found myself with this :
first_letters = re.findall(r"{^[^aeiou]+}|{^[qQ][uU]}", "qualification")
that didn't work because it is the same as the first expression I tried I think.
But finally this also worked :
first_letters = re.findall(r"{^[^aeiou]+}|(^[qQ][uU])", "qualification")
and I don't know why.
Can someone tell me why?

You should put qu before [^aeiou], because otherwise the character class matches "q" first and the qu alternative is never tried. Besides that, [Qq][Uu] is not needed; just pass the case-insensitive flag:
first_letters = re.findall(r"^(qu|[^aeiou]+)", "qualification", re.I)
Given that you're probably going to match the rest of the string as well, this would be more practical:
start, rest = re.findall(r"^(qu|[^aeiou]+)?(.+)", word, re.I)[0]
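As a rough sketch of how that second pattern could be used, here is a small helper. The name split_onset is made up for illustration and is not part of the original answer; it only wraps the pattern shown above.

```python
import re

# Hypothetical helper (name invented for this example): split a word into
# its leading "qu" or consonant cluster and the remainder of the word.
ONSET = re.compile(r"^(qu|[^aeiou]+)?(.+)", re.I)

def split_onset(word):
    # findall returns one (group 1, group 2) tuple per match; an optional
    # group that did not take part in the match comes back as ''.
    start, rest = ONSET.findall(word)[0]
    return start, rest

print(split_onset("qualification"))  # ('qu', 'alification')
print(split_onset("blogie"))         # ('bl', 'ogie')
print(split_onset("apple"))          # ('', 'apple')
```

Because "qu" is listed first, it wins over the character class whenever both could match at the start of the word.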

Reverse the order of the rules:
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "qualification")
['qu']
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "boogie")
['b']
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "blogie")
['bl']
In your first case, the first regex ^[^aeiou]+ matches the q. In the second case, since you've added q to the first part, the regex engine examines the second expression and matches qu.
In your other cases, I don't think the first expression does what you think it does (i.e. the ^ character inside the braces), so it's the second expression which matches again.
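The first part of your 3rd and 4th patterns, {^[^aeiou]+}, tries to match a literal curly brace {, followed by a start-of-line, followed by one or more non-vowel characters, followed by a literal closing curly brace }. A start-of-line can never occur immediately after a { (even with re.MULTILINE, ^ only matches at the start of the string or right after a newline), so that part is technically valid but unable to match any input. In the 3rd pattern the second alternative is wrapped in braces as well, so nothing matches at all; in the 4th, (^[qQ][uU]) is an ordinary capturing group, which is why it finds "qu".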
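To see all four attempts from the question side by side, something like the following can be run. It is only a sketch that replays the question's own patterns; the variable names are arbitrary.

```python
import re

word = "qualification"

# The four patterns from the question, in the order they were tried.
patterns = [
    r"^[^aeiou]+|^[qQ][uU]",       # class comes first and grabs "q"
    r"^[^aeiouq]+|^[qQ][uU]",      # class can no longer match "q"
    r"{^[^aeiou]+}|{^[qQ][uU]}",   # braces are literal characters here
    r"{^[^aeiou]+}|(^[qQ][uU])",   # only the second alternative can match
]

results = [re.findall(p, word) for p in patterns]
for p, r in zip(patterns, results):
    print(p, "->", r)
# ^[^aeiou]+|^[qQ][uU] -> ['q']
# ^[^aeiouq]+|^[qQ][uU] -> ['qu']
# {^[^aeiou]+}|{^[qQ][uU]} -> []
# {^[^aeiou]+}|(^[qQ][uU]) -> ['qu']
```

Note that in the last case findall returns the contents of the capturing group, which happens to be the whole match.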

The | runs left to right, and stops at the first success. So, that's why you found only q with the first expression, and qu with the second.
Not sure what your final regex does, particularly with regard to the {} expression. The part after the | will match in qualification, though. Perhaps that's what you are seeing.
You might find the re.I (ignore case) flag useful.

Related

python re.search() find none

import re
s="XX.1.1. Accidents"
pattern = re.compile(r'\d|[a-zA-Z]\.\s([a-zA-Z]\S+)')
match=pattern.search(s)
if match:
    print(match.group(1))
The output is None. However, I think it should have been "Accidents". Can someone tell me why?
Your | is messing with it: since it isn't inside a group, the pattern matches either \d alone or all of [a-zA-Z]\.\s([a-zA-Z]\S+). This is an issue because search() returns the leftmost match, which here is a single digit matched by \d, so group 1 never takes part in the match and group(1) is None.
If you use (?:\d|[a-zA-Z])\.\s([a-zA-Z]\S+), it'll work properly and you'll receive Accidents.
You can put the whole first part in square brackets, then search for the space, and then for the characters and the rest:
pattern = re.compile(r'[a-zA-Z\d.]*\s([a-zA-Z]\S+)')
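A quick way to check the difference between the original pattern and the version with the non-capturing group is sketched below (variable names are only for illustration):

```python
import re

s = "XX.1.1. Accidents"

# Original: the | splits the whole pattern into "\d" OR "letter. word".
original = re.compile(r"\d|[a-zA-Z]\.\s([a-zA-Z]\S+)")
# Fixed: the | is confined to the first character by a non-capturing group.
grouped = re.compile(r"(?:\d|[a-zA-Z])\.\s([a-zA-Z]\S+)")

m1 = original.search(s)
print(repr(m1.group(0)), m1.group(1))  # '1' None

m2 = grouped.search(s)
print(repr(m2.group(0)), m2.group(1))  # '1. Accidents' Accidents
```

Printing group(0) alongside group(1) makes it visible that the original pattern did match, just not the part that was expected.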

Add multiplication signs (*) between coefficients

I have a program in which a user inputs a function, such as sin(x)+1. I'm using ast to try to determine if the string is 'safe' by whitelisting components as shown in this answer. Now I'd like to parse the string to add multiplication (*) signs between coefficients without them.
For example:
3x-> 3*x
4(x+5) -> 4*(x+5)
sin(3x)(4) -> sin(3x)*(4) (sin is already in globals, otherwise this would be s*i*n*(3x)*(4))
Are there any efficient algorithms to accomplish this? I'd prefer a pythonic solution (i.e. no complex regexes; not because they're unpythonic, but just because I don't understand them as well and want a solution I can understand. Simple regexes are OK.)
I'm very open to using sympy (which looks really easy for this sort of thing) under one condition: safety. Apparently sympy uses eval under the hood. I've got pretty good safety with my current (partial) solution. If anyone has a way to make sympy safer with untrusted input, I'd welcome this too.
A regex is easily the quickest and cleanest way to get the job done in vanilla python, and I'll even explain the regex for you, because regexes are such a powerful tool it's nice to understand.
To accomplish your goal, use the following statement:
import re
# <code goes here, set 'thefunction' variable to be the string you're parsing>
re.sub(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()", r"\1*\2", thefunction)
I know it's a bit long and complicated, but a different, simpler solution doesn't make itself immediately obvious without even more hacky stuff than what's gone into the regex here. But, this has been tested against all three of your test cases and works out precisely as you want.
As a brief explanation of what's going on here: The first parameter to re.sub is the regular expression, which matches a certain pattern. The second is the thing we're replacing it with, and the third is the actual string to replace things in. Every time our regex sees a match, it removes it and plugs in the substitution, with some special behind-the-scenes tricks.
A more in-depth analysis of the regex follows:
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\() : Matches a number or a function call, followed by a variable or parentheses.
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\))) : Group 1. Note: Parentheses delimit a Group, which is sort of a sub-regex. Capturing groups are indexed for future reference; groups can also be repeated with modifiers (described later). This group matches a number or a function call.
(?:\d+) : Non-capturing group. Any group with ?: immediately after the opening parenthesis will not assign an index to itself, but still act as a "section" of the pattern. Ex. A(?:bc)+ will match "Abcbcbcbc..." and so on, but you cannot access the "bcbcbcbc" match with an index. However, without this group, writing "Abc+" would match "Abcccccccc..."
\d : Matches any numerical digit once. A regex of \d on its own will match, separately, "1", "2", and "3" of "123".
+ : Matches the previous element one or more times. In this case, the previous element is \d, any digit. In the previous example, \d+ on "123" will successfully match "123" as a single element. This is vital to our regex, to make sure that multi-digit numbers are properly registered.
| : Pipe character, and in a regex, it effectively says or: "a|b" will match "a" OR "b". In this case, it separates "a number" and "a function call"; match a number OR a function call.
(?:[a-zA-Z]\w*\(\w+\)) : Matches a function call. Also a non-capturing group, like (?:\d+).
[a-zA-Z] : Matches the first letter of the function call. There is no modifier on this because we only need to ensure the first character is a letter; A123 is technically a valid function name.
\w : Matches any alphanumeric character or an underscore. After the first letter is ensured, the following characters could be letters, numbers, or underscores and still be valid as a function name.
* : Matches the previous element 0 or more times. While initially seeming unnecessary, the star character effectively makes an element optional. In this case, our modified element is \w, but a function doesn't technically need any more than one character; A() is a valid function name. A would be matched by [a-zA-Z], making \w unnecessary. On the other end of the spectrum, there could be any number of characters following the first letter, which is why we need this modifier.
\( : This is important to understand: this is not another group. The backslash here acts much like an escape character would in a normal string. In a regex, any time you preface a special character, such as parentheses, +, or * with a backslash, it uses it like a normal character. \( matches an opening parenthesis, for the actual function call part of the function.
\w+ : Matches a number, letter or underscore one or more times. This ensures the function actually has a parameter going into it.
\) : Like \(, but matches a closing parenthesis
((?:[a-zA-Z]\w*)|\() : Group 2. Matches a variable, or an opening parenthesis.
(?:[a-zA-Z]\w*) : Matches a variable. This is the exact same as our function name matcher. Note that it sits in a non-capturing group, which makes the two sides of the OR that follows easy to see. Strictly speaking the group is optional here: | has the lowest precedence of any regex operator, so [a-zA-Z]\w*|\( would already mean "a letter followed by more word characters OR an opening parenthesis". The explicit group just spells out what the OR applies to.
| : Or character. Matches (?:[a-zA-Z]\w*) or \(.
\( : Matches an opening parenthesis. Once we have checked if there is an opening parenthesis, we don't need to check anything beyond it for the purposes of our regex.
Now, remember our two groups, group one and group two? These are used in the substitution string, "\1*\2". The substitution string is not a true regex, but it still has certain special characters. In this case, \<number> will insert the group of that number. So our substitution string is saying: "Put group 1 in (which is either our function call or our number), then put in an asterisk (*), then put in our second group (either a variable or a parenthesis)"
I think that about sums it up!
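For anyone who wants to try it, the statement above can be wrapped in a small function and run against the three examples from the question. The function name add_multiplication is invented here for illustration.

```python
import re

# Same regex as in the answer: (number or simple function call) followed by
# (a name or an opening parenthesis).
PATTERN = re.compile(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()")

def add_multiplication(expr):
    # Hypothetical wrapper: insert "*" between group 1 and group 2.
    return PATTERN.sub(r"\1*\2", expr)

for expr in ["3x", "4(x+5)", "sin(3x)(4)"]:
    print(expr, "->", add_multiplication(expr))
# 3x -> 3*x
# 4(x+5) -> 4*(x+5)
# sin(3x)(4) -> sin(3x)*(4)
```

Keep in mind that the function-call part only accepts \w+ between the parentheses, so a call with an operator in its argument, such as sin(x+1)(4), is left unchanged by this regex.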

Python Regex: Question mark (?) doesn't match in middle of string

I bumped into the problem while playing around in Python: when I create a random string, let's say "test 1981", the following Python call returns with an empty string.
>>> re.search('\d?', "test 1981").group()
''
I was wondering why this is. I was reading through some other posts, and it seems that it has to do with greedy vs. non-greedy operators. Is it that the '?' checks to see if the first value is a digit, and if it's not, it takes the easier, quicker path and just outputs nothing?
Any clarification would help. Thanks!
Your pattern matches a digit or the empty string. The engine starts at the first character and tries to match a digit; that fails, so it tries the other option the quantifier allows, the empty string, and voilà, a match is found before the first character.
I think you expected it to move on and try the next character, but it doesn't: it first tries everything the quantifier allows at the first position, and that is zero or one digit.
The optional quantifier only makes sense in combination with a required part. Say you want a digit followed by an optional one:
>>> re.search('\d\d?', "test 1981").group()
'19'
Otherwise your pattern always matches.
The regex
\d?
simply means that it should optionally (?) match a single digit (\d).
If you use something like this, it will work as you expect (match a single digit anywhere in the string):
\d
re.search('\d?', "test 1981").group() greedily matches the first match of the pattern (0 or 1 digits) it can find. In this case that's zero digits. Note that re.search('\d?', "1981 test").group() actually matches the string '1' at the beginning of the string. What you're probably looking for here is re.search('\d+', "test 1981").group(), which finds the whole string 1981 no matter where it is.
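The three variants discussed in these answers can be compared directly. This is just a small sketch; looking at span() shows where each match was found.

```python
import re

text = "test 1981"

optional = re.search(r"\d?", text)   # zero or one digit
single = re.search(r"\d", text)      # exactly one digit
run = re.search(r"\d+", text)        # one or more digits

print(repr(optional.group()), optional.span())  # '' (0, 0)
print(single.group(), single.span())            # 1 (5, 6)
print(run.group(), run.span())                  # 1981 (5, 9)
```

The empty match at position 0 is a perfectly valid match for \d?, so search never gets as far as the digits.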

Phrase matching using regex and Python

I have some short phrases that I want to match on. I used a regex as follows:
(^|)(piston|piston ring)( |$)
Using the above, regex.match("piston ring") matches on "piston". If I change the regex so that the longer phrase "piston ring" comes first, then it works as expected.
I was surprised by this behavior as I was assuming that the greedy nature of regex would try to match the longest string "for free."
What am I missing? Can somebody explain this? Thanks!
When using alternation (|) in regular expressions, each option is attempted in order from left to right until a match can be found. So in your example since a match can be made with piston, piston ring will never be attempted.
A better way to write this regex would be something like this:
(^|)(piston( ring)?)( |$)
This will attempt to match 'piston', and then immediately attempt to match ' ring', with the ? making it optional. Alternatively just make sure your longer options occur at the beginning of the alternation.
You may also want to consider using a word boundary, \b, instead of (^|) and ( |$).
from http://www.regular-expressions.info/alternation.html (first Google result):
the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters
one exception:
the POSIX standard mandates that the longest match be returned, regardless if the regex engine is implemented using an NFA or DFA algorithm.
possible solutions:
piston( ring)?
(piston ring|piston) (put the longest before)
That's the behaviour of alternation. It tries to match the first alternative, "piston", and if that succeeds it is done.
That means it will not try all alternatives; it will finish with the first one that matches.
You can find more details here on regular-expressions.info
What could also be interesting for you are word boundaries \b. I think what you are looking for is
\bpiston(?: ring)?\b
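Here is a short sketch of the behaviour described in these answers, using simplified versions of the patterns (the leading (^|) group is left out, and the sample sentences are made up):

```python
import re

text = "piston ring"

# Shorter alternative first: it succeeds, so the longer one is never tried.
short_first = re.match(r"(piston|piston ring)( |$)", text)
# Longer alternative first: it gets its chance before the shorter one.
long_first = re.match(r"(piston ring|piston)( |$)", text)
# Word boundaries with an optional " ring" suffix.
bounded = re.search(r"\bpiston(?: ring)?\b", "a piston ring set")

print(short_first.group(1))  # piston
print(long_first.group(1))   # piston ring
print(bounded.group())       # piston ring
print(re.search(r"\bpiston(?: ring)?\b", "pistons"))  # None
```

The last line shows what \b buys you: "pistons" is rejected because there is no word boundary right after "piston".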
Edit2: It wasn't clear if your test data contained pipes or not. I saw the pipes in the regex and assumed you are searching for pipe-delimited text. Oh well... not sure if the below helps.
Using regex to match text that's pipe delimited will need more alternations to pick up the beginning and ending columns.
What about another approach?
text='start piston|xxx|piston ring|xxx|piston cast|xxx|piston|xxx|stock piston|piston end'
j=re.split(r'\|',text)
k = [ x for x in j if x.find('piston') >= 0 ]
['start piston', 'piston ring', 'piston cast', 'piston', 'stock piston', 'piston end']
k = [ x for x in j if x.startswith('piston') ]
['piston ring', 'piston cast', 'piston', 'piston end']
k = [ x for x in j if x == 'piston' ]
['piston']
j=re.split(r'\|',text)
if 'piston ring' in j:
    print(True)
True
Edit: To clarify - take this example:
text2='piston1|xxx|spiston2|xxx|piston ring|xxx|piston3'
I add '.' to match anything to show the items matched
re.findall('piston.',text2)
['piston1', 'piston2', 'piston ', 'piston3']
To make it more accurate, you will need to use look-behind assertion.
This guarantees you match '|piston' but doesn't include the pipe in the result
re.findall('(?<=\|)piston.',text2)
['piston ', 'piston3']
To stop at the first following pipe instead of matching greedily, use the lazy form .*? followed by the stop character.
Add grouping parens to exclude the pipe from the result. The lazy .*? consumes as few characters as possible and only grows until the rest of the pattern (here the escaped pipe after the group) can match, so the pipe acts as the stop sentinel. This works, but it ignores the last column.
re.findall('(?<=\|)(piston.*?)\|',text2)
['piston ring']
When you add grouping you can now just specify starts with an escaped pipe
re.findall('\|(piston.*?)\|',text2)
['piston ring']
To search the last column as well, add this non-capturing group (?:\||$), which means match on a pipe (needs to be escaped) or (|) the end ($) of the string.
The non-capturing group (?:x1|x2) doesn't get included in the result, and as a bonus the engine doesn't have to store what it matched.
re.findall('\|(piston.*?)(?:\||$)',text2)
['piston ring', 'piston3']
Finally, to handle the beginning of the string, add another alternation much like the previous one for the end-of-string match
re.findall('(?:\||^)(piston.*?)(?:\||$)',text2)
['piston1', 'piston ring', 'piston3']
Hope it helps. :)

How to use ? and ?: and : in REGEX for Python?

I understand that
* = "zero or more"
? = "zero or more" ...what's the difference?
Also, ?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
As Manu already said, ? means "zero or one time". It is the same as {0,1}.
And by ?:, you probably meant (?:X), where X is some other string. This is called a "non-capturing group".
Normally when you wrap parentheses around something, you capture what is matched inside those parentheses. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you do: .(?:.).(.) only the fourth character is stored in group 1; whatever (?:.) matches is matched, but not "remembered".
A little demo:
import re
m = re.search(r'.(.).(.)', '1234')
print(m.group(1))
print(m.group(2))
# output:
# 2
# 4
m = re.search(r'.(?:.).(.)', '1234')
print(m.group(1))
# output:
# 4
You might ask yourself: "why use this non-capturing group at all?". Well, sometimes you want to make an OR between two strings. For example, to match the string "www.google.com" or "www.yahoo.com", you could do: www\.google\.com|www\.yahoo\.com, but shorter would be: www\.(google|yahoo)\.com of course. But if you're not going to do something useful with what is being captured by this group (the string "google" or "yahoo"), you might as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo", your app/script will run faster. Of course, it wouldn't make much difference with relatively small strings, but when your regex and string(s) get larger, it probably will.
And for a better example to use non-capturing groups, see Chris Lutz's comment below.
?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
If that’s indeed what your book says, then I advise getting a better book.
Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:
a(.)c
This is a regular expression that matches any three-letter string starting with a and ending with c. The middle character is (more or less) arbitrary. Since you put it in parentheses, you can capture it:
m = re.search('a(.)c', 'abcdef')
print(m.group(1))
This will print b, since m.group(1) captures the content of the first parentheses (group(0) captures the whole hit, here abc).
Now, consider this regular expression:
a(?:.)c
No capture is made here – this is what ?: after an opening parenthesis means. That is, the following code will fail:
m = re.search('a(?:.)c', 'abcdef')
print(m.group(1))
It raises an IndexError, because there is no group 1!
? = zero or one
you use (?:) for grouping w/o saving the group in a temporary variable as you would with ()
? does not mean "zero or more", it means "zero or one".
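To make the difference between ? and * concrete, and to show what (?:...) changes, here is a small sketch. The sample words are invented for the example.

```python
import re

words = ["color", "colour", "colouur"]

# u? allows zero or one "u"; u* allows zero or more.
zero_or_one = [w for w in words if re.fullmatch(r"colou?r", w)]
zero_or_more = [w for w in words if re.fullmatch(r"colou*r", w)]
print(zero_or_one)   # ['color', 'colour']
print(zero_or_more)  # ['color', 'colour', 'colouur']

# A capturing group is stored; a non-capturing group is not.
capturing = re.search(r"www\.(google|yahoo)\.com", "www.yahoo.com")
non_capturing = re.search(r"www\.(?:google|yahoo)\.com", "www.yahoo.com")
print(capturing.groups())      # ('yahoo',)
print(non_capturing.groups())  # ()
```

Both URL patterns match the same text; the only difference is whether the alternative that matched is kept in a numbered group.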
