How can I match 'suck' only if not part of 'honeysuckle'?
Using lookbehind and lookahead I can match suck if not 'honeysuck' or 'suckle', but it also fails to catch something like 'honeysucker'; here the expression should match, because it doesn't end in le:
re.search(r'(?<!honey)suck(?!le)', 'honeysucker')
You need to nest the lookaround assertions:
>>> import re
>>> regex = re.compile(r"(?<!honey(?=suckle))suck")
>>> regex.search("honeysuckle")
>>> regex.search("honeysucker")
<_sre.SRE_Match object at 0x00000000029B6370>
>>> regex.search("suckle")
<_sre.SRE_Match object at 0x00000000029B63D8>
>>> regex.search("suck")
<_sre.SRE_Match object at 0x00000000029B6370>
An equivalent solution would be suck(?!(?<=honeysuck)le).
here is a solution without using regular expressions:
s = s.replace('honeysuckle','')
and now:
re.search('suck',s)
and this would work for any of these strings : honeysuckle sucks, this sucks and even regular expressions suck.
I believe you should separate your exceptions in a different Array, just in case in the future you wish to add a different rule. This will be easier to read, and will be faster in the future to change if needed.
My suggestion in Ruby is:
words = ['honeysuck', 'suckle', 'HONEYSUCKER', 'honeysuckle']
EXCEPTIONS = ['honeysuckle']
def match_suck word
if (word =~ /suck/i) != nil
# should not match any of the exceptions
return true unless EXCEPTIONS.include? word.downcase
end
false
end
words.each{ |w|
puts "Testing match of '#{w}' : #{match_suck(w)}"
}
>>>string = 'honeysucker'
>>>print 'suck' in string
True
Related
I'm learning about regular expression. I don't know how to combine different regular expression to make a single generic regular expression.
I want to write a single regular expression which works for multiple cases. I know this is can be done with naive approach by using or " | " operator.
I don't like this approach. Can anybody tell me better approach?
You need to compile all your regex functions. Check this example:
import re
re1 = r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*'
re2 = '\d*[/]\d*[A-Z]*\d*\s[A-Z]*\d*[A-Z]*'
re3 = '[A-Z]*\d+[/]\d+[A-Z]\d+'
re4 = '\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*'
sentences = [string1, string2, string3, string4]
for sentence in sentences:
generic_re = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).findall(sentence)
To findall with an arbitrary series of REs all you have to do is concatenate the list of matches which each returns:
re_list = [
'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*', # re1 in question,
...
'\d+[/]\d+[A-Z]*\d+\s\d+[A-z]\s[A-Z]*', # re4 in question
]
matches = []
for r in re_list:
matches += re.findall( r, string)
For efficiency it would be better to use a list of compiled REs.
Alternatively you could join the element RE strings using
generic_re = re.compile( '|'.join( re_list) )
I see lots of people are using pipes, but that seems to only match the first instance. If you want to match all, then try using lookaheads.
Example:
>>> fruit_string = "10a11p"
>>> fruit_regex = r'(?=.*?(?P<pears>\d+)p)(?=.*?(?P<apples>\d+)a)'
>>> re.match(fruit_regex, fruit_string).groupdict()
{'apples': '10', 'pears': '11'}
>>> re.match(fruit_regex, fruit_string).group(0)
'10a,11p'
>>> re.match(fruit_regex, fruit_string).group(1)
'11'
(?= ...) is a look ahead:
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
.*?(?P<pears>\d+)p
find a number followed a p anywhere in the string and name the number "pears"
You might not need to compile both regex patterns. Here is a way, let's see if it works for you.
>>> import re
>>> text = 'aaabaaaabbb'
>>> A = 'aaa'
>>> B = 'bbb'
>>> re.findall(A+B, text)
['aaabbb']
>>>
further read read_doc
If you need to squash multiple regex patterns together the result can be annoying to parse--unless you use P<?> and .groupdict() but doing that can be pretty verbose and hacky. If you only need a couple matches then doing something like the following could be mostly safe:
bucket_name, blob_path = tuple(item for item in matches.groups() if item is not None)
Trying to write a Regex expression in Python to match strings.
I want to match input that starts as first, first?21313 but not first.
So basically, I don't want to match to anything that has . the period character.
I've tried word.startswith(('first[^.]?+')) but that doesn't work. I've also tried word.startswith(('first.?+')) but that hasn't worked either. Pretty stumped here
import re
def check(word):
regexp = re.compile('^first([^\..])+$')
return regexp.match(word)
And if you dont want the dot:
^first([^..])+$
(first + allcharacter except dot and first cant be alone).
You really don't need regex for this at all.
word.startswith('first') and word.find('.') == -1
But if you really want to take the regex route:
>>> import re
>>> re.match(r'first[^.]*$', 'first')
<_sre.SRE_Match object; span=(0, 5), match='first'>
>>> re.match(r'first[^.]*$', 'first.') is None
True
I want to use regex to search in a file for this expression:
time:<float> s
I only want to get the float number.
I'm learning about regex, and this is what I did:
astr = 'lalala time:1.5 s\n'
p = re.compile(r'time:(\d+).*(\d+)')
m = p.search(astr)
Well, I get time:1.5 from m.group(0)
How can I directly just get 1.5 ?
I'm including some extra python-specific materiel since you said you're learning regex. As already mentioned the simplest regex for this would certainly be \d+\.\d+ in various commands as described below.
Something that threw me off with python initially was getting my head around the return types of various re methods and when to use group() vs. groups().
There are several methods you might use:
re.match()
re.search()
re.findall()
match() will only return an object if the pattern is found at the beginning of the string.
search() will find the first pattern and top.
findall() will find everything in the string.
The return type for match() and search() is a match object, __Match[T], or None, if a match isn't found. However the return type for findall() is a list[T]. These different return types obviously have ramifications for how you get the values out of your match.
Both match and search expose the group() and groups() methods for retrieving your matches. But when using findall you'll want to iterate through your list or pull a value with an enumerator. So using findall:
>>>import re
>>>easy = re.compile(r'123')
>>>matches = easy.findall(search_me)
>>>for match in matches: print match
123
If you're using search() or match(), you'll want to use .group() or groups() to retrieve your match depending on how you've set up your regular expression.
From the documentation, "The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are."
Therefore if you have no groups in your regex, as shown in the following example, you wont get anything back:
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'123')
>>>matches = easy.search(search_me)
>>>print matches.groups()
()
Adding a "group" to your regular expression enables you to use this:
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'(123)')
>>>matches = easy.search(search_me)
>>>print matches.groups()
('123',)
You don't have to specify groups in your regex. group(0) or group() will return the entire match even if you don't have anything in parenthesis in your expression. --group() defaults to group(0).
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'123')
>>>matches = easy.search(search_me)
>>>print matches.group(0)
123
If you are using parenthesis you can use group to match specific groups and subgroups.
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'((1)(2)(3))')
>>>matches = easy.search(search_me)
>>>print matches.group(1)
>>>print matches.group(2)
>>>print matches.group(3)
>>>print matches.group(4)
123
1
2
3
I'd like to point as well that you don't have to compile your regex unless you care to for reasons of usability and/or readability. It won't improve your performance.
>>>import re
>>>search_me = '123abc'
>>>#easy = re.compile(r'123')
>>>#matches = easy.search(search_me)
>>>matches = re.search(r'123', search_me)
>>>print matches.group()
Hope this helps! I found sites like debuggex helpful while learning regex. (Although sometimes you have to refresh those pages; I was banging my head for a couple hours one night before I realized that after reloading the page my regex worked just fine.) Lately I think you're served just as well by throwing sandbox code into something like wakari.io, or an IDE like PyCharm, etc., and observing the output. http://www.rexegg.com/ is also a good site for general regex knowledge.
You could do create another group for that. And I would also change the regex slightly to allow for numbers that don't have a decimal separator.
re.compile(r'time:((\d+)(\.?(\d+))?')
Now you can use group(1) to capture the match of the floating point number.
I think the regex you actually want is something more like:
re.compile(r'time:(\d+\.\d+)')
or even:
re.compile(r'time:(\d+(?:\.\d+)?)') # This one will capture integers too.
Note that I've put the entire time into 1 grouping. I've also escaped the . which means any character in regex.
Then, you'd get 1.5 from m.group(1) -- m.group(0) is the entire match. m.group(1) is the first submatch (parenthesized grouping), m.group(2) is the second grouping, etc.
example:
>>> import re
>>> p = re.compile(r'time:(\d+(?:\.\d+)?)')
>>> p.search('time:34')
<_sre.SRE_Match object at 0x10fa77d50>
>>> p.search('time:34').group(1)
'34'
>>> p.search('time:34.55').group(1)
'34.55'
I want to check either given words contain special character or not.
so below is my python code
The literal 'a#bcd' has '#', so it will be matchd and it's ok.
but 'a1bcd' has no special character. but it was filtered too!!
import re
regexp = re.compile('[~`!##$%^&*()-_=+\[\]{}\\|;:\'\",.<>/?]+')
if regexp.search('a#bcd') :
print 'matched!! nich catch!!'
if regexp.search('a1bcd') :
print 'something is wrong here!!!'
result :
python ../special_char.py
matched!! nich catch!!
something is wrong here!!!
I have no idea why it works like above..someone help me..T_T;;;
thanks~
Move the dash in you regular expression to the start of the [] group, like this:
regexp = re.compile('[-~`!##$%^&*()_=+\[\]{}\\|;:\'\",.<>/?]+')
Where you had the dash, it was read with the surrounding characters as )-_ and since it is inside [] it is interpreted as asking to match a range from ) to _. If you move the dash to just after the [ it has no special meaning and instead matches itself.
Here's an interactive session showing the specific problem there was in your regular expression:
>>> import re
>>> print re.search('[)-_]', 'abcd')
None
>>> print re.search('[)-_]', 'a1b')
<_sre.SRE_Match object at 0x7f71082247e8>
>>> print re.search('[)-_]', 'a1b').group(0)
1
After fixing it:
>>> print re.search('[-)_]', 'a1b')
None
Unless there's some reason not visible in your question, I'd also say that the final + is not needed.
re will be relatively slow for this
I'd suggest trying
specialchars = '''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?'''
len(word) != len(word.translate(None, specialchars))
or
set(word) & set(specialchars)
Trying to understand regular expressions and I am on the repetitions part: {m, n}.
I have this code:
>>> p = re.compile('a{1}b{1, 3}')
>>> p.match('ab')
>>> p.match('abbb')
As you can see both the strings are not matching the pattern. Why is this happening?
You shouldn't put a space after the comma, and the {1} is redundant.
Try
p = re.compile('a{1}b{1,3}')
...and mind the space.
Remove the extra whitespace in b.
Change:
p = re.compile('a{1}b{1, 3}')
to:
p = re.compile('a{1}b{1,3}')
^ # no whitespace
and all should be well.
You are seeing some re behaviour that is very "dark corner", nigh on a bug (or two).
# Python 2.7.1
>>> import re
>>> pat = r"b{1, 3}\Z"
>>> bool(re.match(pat, "bb"))
False
>>> bool(re.match(pat, "b{1, 3}"))
True
>>> bool(re.match(pat, "bb", re.VERBOSE))
False
>>> bool(re.match(pat, "b{1, 3}", re.VERBOSE))
False
>>> bool(re.match(pat, "b{1,3}", re.VERBOSE))
True
>>>
In other words, the pattern "b{1, 3}" matches the literal text "b{1, 3}" in normal mode, and the literal text "b{1,3}" in VERBOSE mode.
The "Law of Least Astonishment" would suggest either (1) the space in front of the 3 was ignored and it matched "b", "bb", or "bbb" as appropriate [preferable] or (2) an exception at compile time.
Looking at it another way: Two possibilities: (a) The person who writes "{1, 3}" is imbued with the spirit of PEP8 and believes it is prescriptive and applies everywhere (b) The person who writes that has tested re undocumented behaviour and actually wants to match the literal text "b{1, 3}" and perversely wants to use r"b{1, 3}" instead of explicitly escaping: r"b\{1, 3}". Seems to me that (a) is much more probable than (b), and re should act accordingly.
Yet another perspective: When the space is reached, it has already parsed {, a string of digits, and a comma i.e. well into the {m,n} "operator" ... to silently ignore an unexpected character and treat it as though it was literal text is mind-boggling, perlish, etc.
Update Bug report lodged.
Do not insert spaces between { and }.
p = re.compile('a{1}b{1,3}')
You can compile the regex with VERBOSE flag, this means most whitespace in the regex would be ignored. I think this is a very good practice to describe complex regular expressions in a more readable manner.
See here for details...
Hope this helps...