I need to validate a version number consisting of 'v' followed by a positive integer, and nothing else,
e.g. "v4" or "v1004".
I have:
import re
pattern = "\Av(?=\d+)\W"
m = re.match(pattern, "v303")
if m is None:
    print("noMatch")
else:
    print("match")
But this doesn't work. Removing the \A and \W makes it match v303, but then it also matches v30G, for example.
Thanks
Pretty straightforward. First, put anchors on your pattern:
"^patternhere$"
Now, let's put together the pattern:
"^v\d+$"
That should do it.
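A minimal sketch of that pattern in use, assuming Python 3.4+ so that re.fullmatch is available (it plays the role of the ^ and $ anchors). The helper name is_valid_version is hypothetical. Note that \d+ also accepts "v0" and "v007"; whether those count as valid is an assumption about the requirement.

```python
import re

# \d+ accepts any run of digits, including "0" and leading zeros.
# Use r"v[1-9]\d*" instead if the number must be strictly positive (assumption).
VERSION_RE = re.compile(r"v\d+")

def is_valid_version(text):
    """Return True if text is 'v' followed only by digits (hypothetical helper)."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return VERSION_RE.fullmatch(text) is not None

for candidate in ["v4", "v1004", "v30G", "v", "xv4", "v4\n"]:
    print(repr(candidate), is_valid_version(candidate))
```

Unlike a pattern ending in $, fullmatch also rejects a trailing newline such as "v4\n".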
I think you may want \b (word boundary) rather than \A (start of string) and \W (non-word character). You also don't need the lookahead, (?=...).
Try: "\bv(\d+)" if you need to capture the int, "\bv\d+" if you don't.
Edit: You probably want to use raw string syntax for Python regexes, r"\bv\d+\b", since "\b" is a backspace character in a regular string.
Edit 2: Although + is "greedy", the trailing \b is still needed to reject input such as "v30G"; without it, "\bv\d+" simply matches the "v30" prefix.
Simply use
\bv\d+\b
Or enclose it in anchors, ^\bv\d+\b$, to match the entire string.
Related
I'm trying to check if a string is a number, so the regex "\d+" seemed good. However, that regex also matches "78.46.92.168:8000" for some reason, which I do not want. Here is a little bit of code:
import re

class Foo():
    _rex = re.compile("\d+")

    def bar(self, string):
        m = self._rex.match(string)
        if m is not None:
            doStuff()
And doStuff() is called when the IP address is entered. I'm kind of confused: how does "." or ":" match "\d"?
\d+ matches any positive number of digits within your string, so it matches the first 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()
There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use (with s being the string to check)
re.match(r'\d+$', s) # re.match anchors the match at the start of the string, so $ is what remains to add
or, to avoid matching before a final \n in the string:
re.match(r'\d+\Z', s) # \Z will only match at the very end of the string
The same works with re.search, which does not anchor the match at the start of the string and therefore needs the ^ or \A start-of-string anchor:
re.search(r'^\d+$', s)
re.search(r'\A\d+\Z', s)
Note that \A is an unambiguous string start anchor, its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
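The difference between these anchors can be checked with a short sketch. The helper names below are hypothetical and the sample strings are made up for illustration.

```python
import re

def ends_with_dollar(s):
    # $ matches at the end of the string or just before a final "\n"
    return re.match(r"\d+$", s) is not None

def ends_with_big_z(s):
    # \Z matches only at the very end of the string
    return re.match(r"\d+\Z", s) is not None

for s in ["1234", "1234\n", "12a4"]:
    print(repr(s), ends_with_dollar(s), ends_with_big_z(s))

# With re.MULTILINE, ^ and $ also match at line breaks; \A and \Z do not change.
text = "abc\n123"
multiline_caret = bool(re.search(r"^\d+$", text, re.MULTILINE))   # second line matches
multiline_big_a = bool(re.search(r"\A\d+\Z", text, re.MULTILINE)) # whole string is not all digits
print(multiline_caret, multiline_big_a)
```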
Python 3
All the options described in the section above work, plus one more useful method, re.fullmatch (also present in the PyPI regex module):
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile(r"\d+")
if _rex.fullmatch(s):
    doStuff()
re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: compiling the pattern as r"\d+$" and calling _rex.match(string) would work.
To be more explicit, you could also compile r"^\d+$" (the ^ is redundant with match()), or drop match() altogether and use _rex.search(string) with the pattern r"^\d+$".
\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and exhibits different behaviour in re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search("^\d+\Z",s)
>>> s="1234"
>>> re.search("^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>
Change it from \d+ to ^\d+$
I have a pattern I want to search for in my message.
The patterns are:
1. "aaa-b3-c"
2. "a3-b6-c"
3. "aaaa-bb-c"
I know how to search for one of the patterns, but how do I search for all 3?
Also, how do you identify and extract dates in this format: 5/21 or 5/21/2019.
found = re.findall(r'.{3}-.{2}-.{1}', message)
Try this :
found = re.findall(r'a{2,4}-b{2}-c', message)
You could use
a{2,4}-bb-c
as a pattern.
Now you need to check the match for truthiness:
match = re.search(pattern, string)
if match:
    pass  # do something here
As of Python 3.8 you can use the walrus operator, as in
if (match := re.search(pattern, string)) is not None:
    pass  # do something here
try this:
re.findall(r'a.*-b.*-c',message)
The first part could use the quantifier {2,4} instead of {3}. The dot matches any character except a newline, whereas [a-zA-Z0-9] matches only an uppercase or lowercase letter a-z or a digit:
\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b
You could add word boundaries \b or anchors ^ and $ on either side if the characters should not be part of a longer word.
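As a quick check, the word-boundary pattern above can be run against a sample message containing all three forms from the question. The message text is made up for illustration.

```python
import re

# 2-4 alphanumerics, a hyphen, 2 alphanumerics, a hyphen, 1 alphanumeric
TOKEN_RE = re.compile(r"\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b")

# Hypothetical sample input
message = "codes: aaa-b3-c, a3-b6-c and aaaa-bb-c were seen"

found = TOKEN_RE.findall(message)
print(found)
```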
For the dates, you could use \d with a quantifier to match the digits, plus an optional group to match the part with / and 4 digits:
\d{1,2}/\d{2}(?:/\d{4})?
Note that the pattern does not validate the date itself. Perhaps this page can help you create or customize a more specific date format.
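A small sketch of the date pattern with re.findall, which also shows that it accepts strings that are not real dates. The sample text is made up.

```python
import re

# One or two digits, "/", two digits, optionally "/" and a four-digit year
DATE_RE = re.compile(r"\d{1,2}/\d{2}(?:/\d{4})?")

# Hypothetical sample input
text = "Due 5/21, shipped 5/21/2019."
dates = DATE_RE.findall(text)
print(dates)

# The pattern checks the shape only: "13/45" is not a real date but still matches.
print(DATE_RE.findall("13/45"))
```

If real calendar validation is needed, one option is to pass each candidate to datetime.strptime afterwards and discard the ones that raise ValueError.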
Here, we might write three expressions, one per pattern, each matching its input from left to right, and connect them using logical ORs. If we had more patterns, we could simply add them to the alternation, similar to:
([a-z]+-[a-z]+[0-9]+-[a-z]+)
([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])
([a-z]+-[a-z]+-[a-z])
which would add to:
([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z])
Then, we might want to bound it with start and end chars:
^([a-z]+-[a-z]+[0-9]+-[a-z]+)$|^([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])$|^([a-z]+-[a-z]+-[a-z])$
or
^(([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z]))$
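In Python, the same alternation can be assembled from the three sub-patterns. This is a sketch assuming Python 3.4+, where re.fullmatch stands in for the ^ and $ anchors; the helper name matches_any is hypothetical.

```python
import re

# The three sub-expressions from above, one per input shape
PATTERNS = [
    r"[a-z]+-[a-z]+[0-9]+-[a-z]+",
    r"[a-z]+[0-9]+-[a-z]+[0-9]+-[a-z]",
    r"[a-z]+-[a-z]+-[a-z]",
]

# Wrap each alternative in a non-capturing group and join them with |
COMBINED = re.compile("|".join("(?:%s)" % p for p in PATTERNS))

def matches_any(text):
    """Return True if the whole string matches one of the patterns."""
    return COMBINED.fullmatch(text) is not None

for candidate in ["aaa-b3-c", "a3-b6-c", "aaaa-bb-c", "aaa-b3-", "xx aaa-b3-c"]:
    print(repr(candidate), matches_any(candidate))
```

Adding a fourth pattern only requires appending it to the list.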
RegEx
If this expression isn't what you want, it can be modified on regex101.com.
I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance for an original string:
##%%.Hol$a.A.$%
I would like to get .Hol$a.A., i.e. the punctuation removed from the beginning and end but not from the middle of the word.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ..&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all punctuation (except the dot) just at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the '
How to accomplish the goal using Regex?
Use this simple and easily adaptable regex:
[\w.].*[\w.]
It will match exactly your desired result, nothing more.
[\w.] matches any alphanumeric character and the dot
.* matches any character (except newline normally)
[\w.] matches any alphanumeric character and the dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
Check this regex out on regex101.com
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
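The search above raises AttributeError when nothing matches, and the pattern needs at least two allowed characters. A hedged variant that handles both cases could look like this; the helper name strip_outer_punctuation and the optional group are additions for illustration, not part of the original answer.

```python
import re

# An allowed character, optionally followed by anything up to the last
# allowed character. The optional group lets a single character match too.
EDGE_RE = re.compile(r"[\w.](?:.*[\w.])?")

def strip_outer_punctuation(text):
    """Drop leading/trailing characters that are not \\w or a dot (hypothetical helper)."""
    match = EDGE_RE.search(text)
    # search() returns None when the string has no allowed character at all
    return match.group(0) if match else ""

for sample in ["##%%.Hol$a.A.$%", "'Barnes&Nobles'", "#a#", "$%#"]:
    print(repr(sample), "->", repr(strip_outer_punctuation(sample)))
```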
Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
mystr = res.group(1)
This will strip everything before the first dot and after the last dot in the expression.
Warning: res is None if the string doesn't match, so check for that before calling group().
So I have a regex that tells whether a string is an integer.
regex = '^(0|[1-9][0-9]*)$'
import re
bool(re.search(regex, '42\n'))
returns True, which it is not supposed to do.
Where does the problem come from?
From the documentation:
'$'
Matches the end of the string or just before the newline at the end of the string
Try \Z instead.
Also, any time you find yourself writing a regular expression that starts with ^ or \A and ends with $ or \Z, if your intent is to only match the entire string, you should probably use re.fullmatch() instead of re.search() (and omit the boundary markers from the regex). Or if you're using a version of Python that's too old to have re.fullmatch(), (you really need to upgrade but) you can use re.match() and omit the beginning-of-string boundary marker.
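A minimal sketch of the re.fullmatch approach for the integer check, assuming Python 3.4+. The helper name is_integer is hypothetical.

```python
import re

# Same pattern as in the question, without the ^ and $ boundary markers
INTEGER_RE = re.compile(r"0|[1-9][0-9]*")

def is_integer(text):
    """Return True only if the entire string is a non-negative integer literal."""
    return INTEGER_RE.fullmatch(text) is not None

for candidate in ["42", "42\n", "0", "007", ""]:
    print(repr(candidate), is_integer(candidate))
```

Because fullmatch must consume the whole string, the trailing newline in '42\n' makes the check fail, which is the behaviour the question asked for.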
The regex in the question matches the start of the string, a number, and the end of the string. Because $ also matches just before a newline at the end of the string, '42\n' matches, which is why it returns True.
Wrapping the pattern in \b, as in r'\b^(0|[1-9][0-9]*)$\b', does not change this, and "!" is not a negation operator in Python regexes (it matches a literal "!"). To make the check return False for '42\n', replace $ with \Z:
Refer https://docs.python.org/2/library/re.html
regex = r'^(0|[1-9][0-9]*)\Z'
bool(re.search(regex, '42\n')) => (Returns False)
Yeah, $ matching before a \n at the end of the string is kind of a trap/inconsistency. Check out my list of regex traps for Python: http://www.cofoh.com/advanced-regex-tutorial-python/traps
I want to strip all non-alphanumeric characters EXCEPT the hyphen from a string (python).
How can I change this regular expression to match any non-alphanumeric char except the hyphen?
re.compile('[\W_]')
Thanks.
You could just use a negated character class instead:
re.compile(r"[^a-zA-Z0-9-]")
This will match anything that is not in the alphanumeric ranges or a hyphen. It also matches the underscore, as per your current regex.
>>> r = re.compile(r"[^a-zA-Z0-9-]")
>>> s = "some#%te_xt&with--##%--5 hy-phens *#"
>>> r.sub("",s)
'sometextwith----5hy-phens'
Notice that this also replaces spaces (which may certainly be what you want).
Edit: SilentGhost has suggested that it is likely cheaper for the engine to process with a quantifier, in which case you can simply use:
re.compile(r"[^a-zA-Z0-9-]+")
The + will simply cause any runs of consecutively matched characters to all match (and be replaced) at the same time.
\w matches alphanumerics, add in the hyphen, then negate the entire set: r"[^\w-]". Note that \w also matches the underscore, so this version keeps underscores.
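The two character classes differ in how they treat the underscore, which a short comparison makes visible. This sketch reuses the sample string from the earlier answer; the variable names are made up.

```python
import re

s = "some#%te_xt&with--##%--5 hy-phens *#"

# Explicit ASCII ranges: the underscore is not listed, so it is removed.
explicit = re.sub(r"[^a-zA-Z0-9-]+", "", s)

# \w includes the underscore (and, in Python 3, non-ASCII letters and digits),
# so the underscore survives here.
with_w = re.sub(r"[^\w-]+", "", s)

print(explicit)
print(with_w)
```

To strip the underscore with the \w version as well, add it back as an alternative, for example r"[^\w-]+|_+".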