Python - REGEX issue with RE using function re.compile + search - python

I'm using regex library 're' in Python (2.7) to validate a flight number.
I've had no issues with expected outputs using a really helpful online editor here: http://regexr.com/
My results on regexr.com are: http://imgur.com/nB0QDug
My code is:
import re
test1 = 'ba116'
###Referencelink: http://academe.co.uk/2014/01/validating-flight-codes/
p = re.compile('/^([a-z][a-z]|[a-z][0-9]|[0-9][a-z])[a-z]?[0-9]{1,4}[a-z]?$/g')
m = p.search(test1) # p.match() to find from start of string only
if m:
    print 'It works!: ', m.group() # group(1...n) for capture groups
else:
    print 'Did not work'
I'm unsure why I get the 'Did not work' output when regexr shows one match (as expected).
I tried a much simpler regex, and the results were correct, so it seems either my regex string is invalid, or I'm using re.compile (or perhaps the if statement) incorrectly?
'ba116' is valid, and should match.

Python's re.compile is treating your leading / and trailing /g as part of the regular expression, not as delimiters and modifiers. This produces a compiled RE that will never match anything, since you have ^ with stuff before it and $ with stuff after it.
The first argument to re.compile should be a string containing only the stuff you would put inside the slashes in a language that had /.../ regex notation. The g modifier corresponds to calling the findall method on the compiled RE; in this case it appears to be unnecessary. (Some of the other modifiers, e.g. i, s, m, correspond to values passed to the second argument to re.compile.)
So this is what your code should look like:
import re
test1 = 'ba116'
###Referencelink: http://academe.co.uk/2014/01/validating-flight-codes/
p = re.compile(r'^([a-z][a-z]|[a-z][0-9]|[0-9][a-z])[a-z]?[0-9]{1,4}[a-z]?$')
m = p.search(test1) # p.match() to find from start of string only
if m:
    print 'It works!: ', m.group() # group(1...n) for capture groups
else:
    print 'Did not work'
The r immediately before the open quote makes no difference for this regular expression, but if you needed to use backslashes in the RE it would save you from having to double all of them.
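As a rough illustration of the points above, here is a minimal Python 3 sketch that compiles the pattern both with and without the JavaScript-style delimiters. The names FLIGHT_RE, BROKEN_RE and is_flight_number are made up for this example, and re.IGNORECASE is added only to show where an "i" modifier would go; the original code did not use it.

```python
import re

# Pattern body only: no leading "/" and no trailing "/g".
# The "i" modifier of /.../i notation becomes a flag argument instead.
FLIGHT_RE = re.compile(
    r'^([a-z][a-z]|[a-z][0-9]|[0-9][a-z])[a-z]?[0-9]{1,4}[a-z]?$',
    re.IGNORECASE,
)

# The same pattern with the delimiters left in by mistake.
BROKEN_RE = re.compile(
    '/^([a-z][a-z]|[a-z][0-9]|[0-9][a-z])[a-z]?[0-9]{1,4}[a-z]?$/g'
)


def is_flight_number(text):
    """Return True if text looks like a flight number (hypothetical helper)."""
    return FLIGHT_RE.search(text) is not None


print(is_flight_number('ba116'))   # True
print(is_flight_number('BA116'))   # True, because of re.IGNORECASE
print(BROKEN_RE.search('ba116'))   # None: a literal "/" can never precede "^"
```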

Related

Python regular expression works in online tester, but not in program [duplicate]

What is the difference between the search() and match() functions in the Python re module?
I've read the Python 2 documentation (Python 3 documentation), but I never seem to remember it. I keep having to look it up and re-learn it. I'm hoping that someone will answer it clearly with examples so that (perhaps) it will stick in my head. Or at least I'll have a better place to return with my question and it will take less time to re-learn it.
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
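So if you need to match at the beginning of the string, use match; to match the entire string, use match with a trailing $ in the pattern. Because match only tries the starting position, it can fail fast where search would keep scanning. Otherwise use search.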
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
import re
string_with_newlines = """something
someotherthing"""
print re.match('some', string_with_newlines)  # matches
print re.match('someother', string_with_newlines)  # won't match
print re.match('^someother', string_with_newlines, re.MULTILINE)  # also won't match
print re.search('someother', string_with_newlines)  # finds something
print re.search('^someother', string_with_newlines, re.MULTILINE)  # also finds something
m = re.compile('thing$', re.MULTILINE)
print m.match(string_with_newlines)  # no match
print m.match(string_with_newlines, pos=4)  # matches
# The flags were already given to re.compile; the second argument of a
# compiled pattern's search() is pos, so re.MULTILINE must not be passed here.
print m.search(string_with_newlines)  # also matches
search ⇒ find something anywhere in the string and return a match object.
match ⇒ find something at the beginning of the string and return a match object.
match is much faster than search, so instead of doing regex.search("word") you can do regex.match((.*?)word(.*?)) and gain tons of performance if you are working with millions of samples.
This comment from @ivan_bilan under the accepted answer above got me wondering whether such a hack actually speeds anything up, so let's find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time
LENGTH = 10
LIST_SIZE = 1000000
def generate_word():
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word
wordlist = [generate_word() for _ in range(LIST_SIZE)]
start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)
start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, ..., 10M words) which gave me the following plot:
As you can see, searching for the pattern 'python' is faster than matching the pattern '(.*?)python(.*?)'.
Python is smart. Avoid trying to be smarter.
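For anyone who wants to repeat the comparison on a smaller scale, here is a minimal Python 3 sketch using timeit. The list size, the fixed seed, the guaranteed hit 'xxpythonxx' and the helper names are all assumptions made for this example, not part of the test suite above, and the timings will vary from machine to machine.

```python
import random
import re
import string
import timeit

random.seed(0)  # fixed seed so every run uses the same words
words = [''.join(random.choice(string.ascii_lowercase) for _ in range(10))
         for _ in range(10000)]
words.append('xxpythonxx')  # guarantee at least one hit

search_re = re.compile('python')
match_re = re.compile('(.*?)python(.*?)')


def hits_with_search():
    return [w for w in words if search_re.search(w)]


def hits_with_match():
    return [w for w in words if match_re.match(w)]


# Both approaches must select the same words, otherwise comparing
# their timings would be meaningless.
same = hits_with_search() == hits_with_match()
print('same results:', same)
print('search:', timeit.timeit(hits_with_search, number=5))
print('match: ', timeit.timeit(hits_with_match, number=5))
```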
re.search scans the whole string for the pattern, whereas re.match does not scan: it can only match the pattern at the start of the string.
You can refer to the example below to understand how re.match and re.search work:
import re
a = "123abc"
t = re.match("[a-z]+",a)
t = re.search("[a-z]+",a)
re.match will return None, but re.search will return a match object for 'abc'.
The difference is, re.match() misleads anyone accustomed to Perl, grep, or sed regular expression matching, and re.search() does not. :-)
More soberly, as John D. Cook remarks, re.match() "behaves as if every pattern has ^ prepended." In other words, re.match('pattern') roughly equals re.search('^pattern') (strictly, re.search(r'\Apattern'), since ^ also matches after a newline in MULTILINE mode). So it anchors a pattern's left side. But it does not anchor a pattern's right side: that still requires a terminating $.
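A short Python 3 sketch of the anchoring behaviour just described; the sample strings are invented for illustration, and re.fullmatch (available since Python 3.4) is shown as one way to anchor both ends.

```python
import re

text = 'abc123'

# match anchors the left side only: trailing characters are allowed.
left_only = re.match(r'[a-z]+', text)
print(left_only.group())                    # abc

# A terminating "$" is still needed to anchor the right side.
print(re.match(r'[a-z]+$', text))           # None

# fullmatch anchors both ends without any explicit anchors.
print(re.fullmatch(r'[a-z]+', text))        # None
print(re.fullmatch(r'[a-z]+\d+', text).group())  # abc123

# Where match and "^" differ: in MULTILINE mode "^" also matches
# after a newline, but match still only tries position 0.
two_lines = 'a\nb'
print(re.match(r'^b', two_lines, re.MULTILINE))           # None
print(re.search(r'^b', two_lines, re.MULTILINE).group())  # b
print(re.search(r'\Ab', two_lines, re.MULTILINE))         # None
```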
Frankly given the above, I think re.match() should be deprecated. I would be interested to know reasons it should be retained.
Much shorter:
search scans through the whole string.
match scans only the beginning of the string.
The following example shows it:
>>> import re
>>> a = "123abc"
>>> print re.match("[a-z]+", a)
None
>>> re.search("[a-z]+", a).group()
'abc'
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
Quick answer
re.search('test', ' test') # returns a truthy match object (the search may start at any index)
re.match('test', ' test') # returns None (matching is only attempted at index 0)
re.match('test', 'test') # returns a truthy match object (match at index 0)

Need to Escape the Character After Special Characters in Python's regex?

I have the following python code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex2 = re.compile('(.+?) class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
m0 = regex0.match(line)
m1 = regex1.match(line)
m2 = regex2.match(line)
if m0:
    print 'regex0 is good'
else:
    print 'regex0 is no good'
if m1:
    print 'regex1 is good'
else:
    print 'regex1 is no good'
if m2:
    print 'regex2 is good'
else:
    print 'regex2 is no good'
The output is
regex0 is good
regex1 is no good
regex2 is good
I don't quite understand why I need to escape the character 'v' after "(.+?)" in regex0. If I don't escape it, which gives regex1, the matching fails. However, for the space right after "(.+?)" in regex2, I don't have to escape anything.
Any idea?
Thanks in advance.
So, there are some issues with your approach.
The ones that contribute to your specific complaint are:
You do not mark the regexp string as raw (r' prefix) - that makes the Python compiler replace some "\"-prefixed sequences inside the string before they even reach the re.compile call.
"\v" happens to be one such sequence - it is replaced by a vertical tab character ("\x0b").
You use the "re.VERBOSE" flag - that tells the regexp engine to ignore any unescaped whitespace character in the pattern. "\v", being a vertical tab, is one character in this class and is ignored.
So, there is your match for regex0: the letter "v" is never seen as such. The same flag explains the other two results: the space before class is dropped as well, so regex1 effectively looks for "vclass" (which is not in the line), while regex0 and regex2 both reduce to (.+?)class="fieldLabel">name.+? and match.
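The effect of the two points above can be checked with a few lines of Python 3. This is only a sketch: the pattern named fixed (a raw string with the space escaped) is one possible repair added for illustration, not something from the question.

```python
import re

line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
flags = re.VERBOSE | re.UNICODE

# Python replaces "\v" in a normal string literal before re ever sees it.
print('\v' == '\x0b')            # True

regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', flags)
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', flags)
# Escaping the space keeps it significant under re.VERBOSE.
fixed = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', flags)

print(bool(regex0.match(line)))  # True: the tab and the space are both ignored
print(bool(regex1.match(line)))  # False: the pattern is effectively "...vclass=..."
print(bool(fixed.match(line)))   # True
```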
Now, for the possible fixes to your approach, in the order that you should try them:
1) Don't use regular expressions to parse HTML. Really. There are a lot of packages that do a good job of parsing HTML, and failing those you can use the stdlib's own HTMLParser (html.parser in Python 3);
2) If possible, use Python 3 instead of Python 2 - you will be bitten by the first non-ASCII character inside your HTML body if you go on with the naive approach of treating Python 2 strings as "real life" text. Python 3's automatic encoding handling (and the explicit settings available to you when it is not automatic) avoids that.
If you are not going to change either of those anyway, try regex.findall instead of regex.match - it returns a list of matching strings and could retrieve the attributes you are looking for at once, without searching from the beginning of the file or depending on line breaks inside the HTML.
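To make the first suggestion concrete, here is a minimal Python 3 sketch built on the standard library's html.parser. The class name FieldLabelParser and the sample markup (a closed-off version of the line from the question) are assumptions for this example; it also assumes the label div contains no nested div.

```python
from html.parser import HTMLParser


class FieldLabelParser(HTMLParser):
    """Collect the text inside every <div class="fieldLabel"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._in_label = False  # True while inside a fieldLabel div

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'fieldLabel' in classes:
            self._in_label = True

    def handle_data(self, data):
        if self._in_label and data.strip():
            self.labels.append(data.strip())

    def handle_endtag(self, tag):
        # Assumes no nested div inside the label.
        if tag == 'div':
            self._in_label = False


parser = FieldLabelParser()
parser.feed('<div class="fieldRow jr_name"><div class="fieldLabel">name</div></div>')
parser.close()
print(parser.labels)  # ['name']
```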
\v is an escape sequence (vertical tab), both in ordinary Python string literals and in Python regexes; you can read about it here:
https://docs.python.org/2/library/re.html
Python regexes are usually written as r'your regex', where "r" marks a raw string. (https://docs.python.org/3/reference/lexical_analysis.html)
In a pattern, a backslash changes the meaning of the character after it: s is just the letter "s", while \s matches any whitespace character. A raw string stops Python from interpreting those backslashes before the regex engine sees them.
A raw string alone is not enough here, though: with re.VERBOSE the unescaped space after v is ignored, so it has to be escaped (or the flag dropped). The line below is the solution you need, I believe.
regex1 = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)

How to find a specific character in a string and put it at the end of the string

I have this string:
'Is?"they'
I want to find the question mark (?) in the string, and put it at the end of the string. The output should look like this:
'Is"they?'
I am using the following regular expression in python 2.7. I don't know why my regex is not working.
import re
regs = re.sub('(\w*)(\?)(\w*)', '\\1\\3\\2', 'Is?"they')
print regs
Is?"they # this is the output of my regex.
Your regex does match, but not the way you want: " is not in the \w character class, so the third group matches the empty string and the ? is swapped with nothing. You would need to change it to something like:
regs = re.sub('(\w*)(\?)(["\w]*)', '\\1\\3\\2', 'Is?"they')
As shown here, " is not captured by \w. Hence, it would probably be best to just use a .:
>>> import re
>>> re.sub("(.*)(\?)(.*)", r'\1\3\2', 'Is?"they')
'Is"they?'
>>>
. matches any character in a regex (except newlines, unless re.DOTALL is set).
Also, you'll notice that I used a raw-string for the second argument of re.sub. Doing so is cleaner than having all those backslashes.
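To see what each group actually captures, a small Python 3 sketch can print the groups for the patterns discussed in this thread. The variable names are invented for this example, and the second pattern (with " added to the character class) is just one possible repair.

```python
import re

text = 'Is?"they'

# Original pattern: it matches, but group 3 is empty because '"' is not a \w character.
original = re.search(r'(\w*)(\?)(\w*)', text).groups()
print(original)   # ('Is', '?', '')

# Letting group 3 also accept '"' captures the rest of the string.
widened = re.search(r'(\w*)(\?)(["\w]*)', text).groups()
print(widened)    # ('Is', '?', '"they')

# Reordering the groups moves the question mark to the end.
moved = re.sub(r'(.*)(\?)(.*)', r'\1\3\2', text)
print(moved)      # Is"they?
```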

Do Python regexes support something like Perl's \G?

I have a Perl regular expression (shown here, though understanding the whole thing isn't hopefully necessary to answering this question) that contains the \G metacharacter. I'd like to translate it into Python, but Python doesn't appear to support \G. What can I do?
Try these:
import re
re.sub()
re.findall()
re.finditer()
for example:
# Finds all words of length 3 or 4
s = "the quick brown fox jumped over the lazy dogs."
print re.findall(r'\b\w{3,4}\b', s)
# prints ['the','fox','over','the','lazy','dogs']
Python does not have the /g modifier for its regexes, and so does not have the \G regex token. A pity, really.
You can use re.match to match anchored patterns. re.match will only match at the beginning (position 0) of the text; a compiled pattern's match method also accepts a pos argument, so you can anchor at a position you specify.
def match_sequence(pattern,text,pos=0):
    pat = re.compile(pattern)
    match = pat.match(text,pos)
    while match:
        yield match
        if match.end() == pos:
            break # infinite loop otherwise
        pos = match.end()
        match = pat.match(text,pos)
This yields matches only while each one starts exactly where the previous one ended (beginning at the given position), and it stops at the first gap.
>>> for match in match_sequence(r'[^\W\d]+|\d+',"he11o world!"):
...     print match.group()
...
he
11
o
I know I'm little late, but here's an alternative to the \G approach:
import re
def replace(match):
    if match.group(0)[0] == '/': return match.group(0)
    else: return '<' + match.group(0) + '>'
source = '''http://a.com http://b.com
//http://etc.'''
pattern = re.compile(r'(?m)^//.*$|http://\S+')
result = re.sub(pattern, replace, source)
print(result)
output (via Ideone):
<http://a.com> <http://b.com>
//http://etc.
The idea is to use a regex that matches both kinds of string: a URL or a commented line. Then you use a callback (delegate, closure, embedded code, etc.) to find out which one you matched and return the appropriate replacement string.
As a matter of fact, this is my preferred approach even in flavors that do support \G. Even in Java, where I have to write a bunch of boilerplate code to implement the callback.
(I'm not a Python guy, so forgive me if the code is terribly un-pythonic.)
Don't try to put everything into one expression, as it becomes very hard to read, translate (as you see for yourself) and maintain.
import re
# text_block is the input text; note that lines starting with // are dropped from the result
lines = [re.sub(r'http://[^\s]+', r'<\g<0>>', line) for line in text_block.splitlines() if not line.startswith('//')]
print '\n'.join(lines)
Python is not usually at its best when you translate literally from Perl; it has its own programming patterns.

Using Regex Plus Function in Python to Encode and Substitute

I'm trying to substitute something in a string in python and am having some trouble. Here's what I'd like to do.
For a given comment in my posting:
"here are some great sites that i will do cool things with! https://stackoverflow.com/it's a pig & http://google.com"
I'd like to use python to make the strings like this:
"here are some great sites that i will do cool things with! http%3A//stackoverflow.com & http%3A//google.com
Here's what I have so far...
import re
import urllib
def getExpandedURL(url):
    encoded_url = urllib.quote(url)
    return ""+encoded_url+""
text = '<text from above>'
url_pattern = re.compile('(http.+?[^ ]+)', re.I | re.S | re.M)
url_iterator = url_pattern.finditer(text)
for matched_url in url_iterator:
    getExpandedURL(matched_url.groups(1)[0])
But this is where i'm stuck. I've previously seen things on here like this: Regular Expressions but for Writing in the Match but surely there's got to be a better way than iterating through each match and doing a position replace on them. The difficulty here is that it's not a straight replace, but I need to do something specific with each match before replacing it.
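I think you want url_pattern.sub(getExpandedURL, text). Note that the callable is passed a match object rather than a string, so getExpandedURL would have to call url.group(0) (or similar) before quoting.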
re.sub(pattern, repl, string, count=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl. repl can be either a string or a callable; if a callable, it's passed the match object and must return a replacement string to be used.
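A minimal Python 3 sketch of that approach follows. URL_RE, encode_url and the sample text are made up for this example, the URL pattern is a simplified assumption rather than the one from the question, and urllib.parse.quote is the Python 3 location of urllib.quote.

```python
import re
from urllib.parse import quote  # Python 3 home of Python 2's urllib.quote

# Simplified, assumed URL pattern: scheme followed by non-whitespace characters.
URL_RE = re.compile(r'https?://[^\s]+', re.I)


def encode_url(match):
    """Called by sub() once per match; receives a match object, not a string."""
    return quote(match.group(0))


text = 'see http://google.com and https://stackoverflow.com now'
result = URL_RE.sub(encode_url, text)
print(result)
# see http%3A//google.com and https%3A//stackoverflow.com now
```

Because sub() handles the iteration and the splicing, there is no need to loop over finditer() and replace by position.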
