I'm fairly new to using regular expressions in general. And I'm having trouble coming up with one that will suit my purpose.
I've tried this
line1 = 'REQ-1234'
match = re.match(r'^REQ-\d', line1, re.I)
This will work as long as the string is not something like
'REQ-1234 and then there is more stuff'
Is there a way to specify that there must not be anything after 'REQ-' except numbers? The other requirement is that 'REQ-1234' must be the only thing in the string. I think the caret symbol takes care of that though.
You need to add a + quantifier after \d to match 1 or more digits, and then add $ anchor to require the end of string position after these digits:
match = re.match(r'REQ-\d+$', line1, re.I)
                         ^^
Note that ^ is redundant since you are using re.match that anchors the match at the string start.
To match a req- that may be followed with digits, replace + (1 or more repetitions) with * quantifier (0 or more repetitions).
Note that with Python 3.4 or later, you may use re.fullmatch without explicit anchors; r'REQ-\d+' or r'REQ-\d*' will do.
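As a rough illustration of the anchoring described above, here is a minimal sketch. The helper name is_req_id and the sample strings are made up for this example, and it assumes Python 3.4 or later so that re.fullmatch is available.

```python
import re

def is_req_id(text):
    """Return True if text is exactly 'REQ-' followed by one or more digits."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return re.fullmatch(r'REQ-\d+', text, re.I) is not None

print(is_req_id('REQ-1234'))                               # True
print(is_req_id('req-1234'))                               # True, because of re.I
print(is_req_id('REQ-1234 and then there is more stuff'))  # False
print(is_req_id('REQ-'))                                   # False, \d+ needs a digit
```

One difference worth knowing: $ also matches just before a trailing newline, so re.match(r'REQ-\d+$', 'REQ-1234\n') succeeds, while re.fullmatch rejects that string.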
I'm trying to use regex in order to find misused operators in my program.
Specifically, I'm trying to find whether some operators (say %, $ and #) were used without digits on both sides.
Here are some examples of misuse:
'5%'
'%5'
'5%+3'
'5%%'
Is there a way to do that with a single re.search?
I know I can use + for at least one, or * for at least zero,
but looking at:
([^\d]*)(%)([^\d]*)
I would like to find cases where at least one of group(1) and group(3) exist,
since inserting % with digits on its both sides is a good use of the operator.
I know I could use:
match = re.search(r'[^\d\.]+[#$%]', user_request)
if match:
    return 'Illegal use of ' + match.group()
match = re.search(r'[#$%][^\d\.]+', user_request)
if match:
    return 'Illegal use of ' + match.group()
But I would prefer to do so with a single re.search line.
And also - when I use [^\d.], does this include the beginning and the end of the string? Or only different chars?
Thank you :)
You might use an alternation with a negative lookahead and a negative lookbehind to assert what is before and what is after is not a digit:
(?<!\d)[#$%]|[#$%](?!\d)
That will match:
(?<!\d) Negative lookbehind to check what is on the left is not a digit
[#$%] Character class, match one of #, $ or %
| Or
[#$%] Character class, match one of #, $ or %
(?!\d) Negative lookahead to check what is on the right is not a digit
For example:
match = re.search(r'(?<!\d)[#$%]|[#$%](?!\d)', user_request)
if match:
    return 'Illegal use of ' + match.group()
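To check the lookaround pattern against the examples from the question, a small sketch along these lines may help. The name find_misuse and the two "valid" sample strings are assumptions added for illustration.

```python
import re

# An operator is misused if it lacks a digit on its left OR on its right.
MISUSE = re.compile(r'(?<!\d)[#$%]|[#$%](?!\d)')

def find_misuse(user_request):
    """Return the first misused operator, or None if every operator has digits on both sides."""
    match = MISUSE.search(user_request)
    return match.group() if match else None

for sample in ['5%', '%5', '5%+3', '5%%', '5%3', '12#7$3']:
    print(sample, '->', find_misuse(sample))
```

The first four samples each report '%', while '5%3' and '12#7$3' report None because every operator there sits between two digits.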
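[^\d.] matches any single character that is not a digit or a literal dot. The ^ at the start of a character class negates what it contains. A negated class still has to consume one character, so it does not match at the very beginning or end of the string where there is no character to consume; it only matches an actual character, which may be the first or last one in the string.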
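The difference between a negated class and a lookaround at the edges of a string can be seen in a short sketch. The variable names and sample strings below are illustrative only.

```python
import re

# A negated class must consume one real character, so it cannot match
# "nothing" before the first character or after the last one.
no_char_before = re.search(r'[^\d.]%', '%5')       # None: nothing precedes '%'
char_before = re.search(r'[^\d.]%', 'a%5')         # matches 'a%'
no_char_after = re.search(r'%[^\d.]', '5%')        # None: nothing follows '%'

# A negative lookbehind asserts on a position, so the string start qualifies.
at_start = re.search(r'(?<![\d.])%', '%5')         # matches '%'

print(no_char_before, char_before, no_char_after, at_start)
```

This is why the single-pattern version uses lookarounds: they also catch an operator that is the first or last character.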
I am basically trying to match string pattern(wildcard match)
Please carefully look at this -
*(star) - means exactly one word.
This is not a regex pattern... it is a convention.
So, if there are patterns like -
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word having no dots
key.* - 'key.' precedes exactly one word.
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in a pattern, I have to replace it with a regex that means exactly one word (characters a-zA-Z0-9_). Can anyone please help me do this in Python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern ])
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
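A variation on the same idea, swapping in re.escape and re.fullmatch instead of sre_parse (that module is deprecated as of Python 3.11), might look like the sketch below. The function names wildcard_to_regex and wildcard_match are made up for this example and are not part of the answer above.

```python
import re

def wildcard_to_regex(pattern):
    """Compile a pattern in which '*' stands for exactly one dot-free word."""
    # Escape each literal piece, then join the pieces with \w+ where '*' stood.
    pieces = [re.escape(piece) for piece in pattern.split('*')]
    return re.compile(r'\w+'.join(pieces))

def wildcard_match(pattern, text):
    # fullmatch anchors both ends, so extra text on either side is rejected.
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match('*.key', 'door.key'))                # True
print(wildcard_match('*.key', 'brown.door.key'))          # False
print(wildcard_match('*.key.*', 'brown.key.door'))        # True
print(wildcard_match('*.key.*', 'brown.iron.key.door'))   # False
```

Using fullmatch rather than re.match also rejects inputs such as 'door.keys', which a start-anchored match alone would accept.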
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word, meaning a string made up entirely of non-whitespace characters. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "one or more 'word characters'", like the other answers is better.
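The practical difference between the two approaches shows up on inputs with spaces or an empty word. The sketch below compares them; the names loose and strict and the sample strings are illustrative assumptions.

```python
import re

loose = re.compile(r'[^.]*\.key')   # any run of non-dot characters, possibly empty
strict = re.compile(r'\w+\.key')    # one or more word characters

for text in ['door.key', 'my door.key', '.key', 'brown.door.key']:
    # Columns: input, loose result, strict result
    print(repr(text), bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```

Both accept 'door.key' and reject 'brown.door.key', but only the [^.]* version accepts 'my door.key' and the bare '.key'.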
I am new to python
I want to extract specific pattern in python 3.5
pattern: digit character digit
characters can be * + - / x X
how can I do this?
I have tried to use pattern [0-9\*/+-xX\0-9] but it returns either of the characters present in a string.
example: 2*3 or 2x3 or 2+3 or 2-3 should be matched
but asdXyz should not
You may use
[0-9][*+/xX-][0-9]
Or to match a whole string:
^[0-9][*+/xX-][0-9]$
In Python 3.x, you may discard the ^ (start of string anchor) and $ (end of string anchor) if you use the pattern in re.fullmatch:
if re.fullmatch(r'[0-9][*+/xX-][0-9]', '5+5'):
    print('5+5 string found!')
if re.fullmatch(r'[0-9][*+/xX-][0-9]', '5+56'):
    print('5+56 string found!')
# => 5+5 string found!
The re.match() function anchors the match at the start of the string only; to rule out trailing text as well, add a $ anchor or use re.fullmatch().
A digit can be matched with \d.
The operator can be matched with [x/+\-], which matches exactly one of x, /, +, or - (the hyphen is escaped because an unescaped hyphen inside a character class can denote a range). Add * and X to the class to cover the remaining operators from the question.
The last digit can be matched with \d.
Putting parentheses around each part allows the parts to be extracted as separate subgroups.
For example:
>>> re.match(r'(\d)([x/+\-])(\d)', '3/4').groups()
('3', '/', '4')
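Combining the capture groups with the full operator set from the question and a whole-string match could look like the sketch below. The names EXPR and parse_expr are hypothetical. For context, in the question's original class the fragment +-x is read as a range from '+' to 'x', which covers the digits and most letters, so that class matches far more than intended.

```python
import re

# Operators from the question: * + - / x X. The hyphen comes last in the
# class so it is taken literally without needing a backslash.
EXPR = re.compile(r'(\d)([*+/xX-])(\d)')

def parse_expr(text):
    """Return (left, operator, right) if text is exactly digit-operator-digit, else None."""
    match = EXPR.fullmatch(text)
    return match.groups() if match else None

print(parse_expr('2*3'))     # ('2', '*', '3')
print(parse_expr('2x3'))     # ('2', 'x', '3')
print(parse_expr('asdXyz'))  # None
print(parse_expr('5+56'))    # None
```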
I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
however, some cases have shortened to 01-234-5 instead of 01-234-0005 when the last four digits are 3 zeros followed by a non-zero digit.
Since there doesn't seem to be any uniformity in formatting, I had to account for a few different separator characters or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6 digit patterns then check for a separator with python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?, means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma; the closing \b already matches at the boundary between the last digit and the punctuation.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
#    ^ exactly nine digits
#         ^ or
#                            ^ sep not optional
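To see which inputs this pattern accepts, here is a small sketch. The sample text is invented for illustration; finditer with group(1) is used so that only the full number is collected rather than a tuple that also contains the separator.

```python
import re

PATTERN = re.compile(r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b')

text = '01-234-5678, 23:456:7, 012345678, 01234-5, 01-234-56'
found = [m.group(1) for m in PATTERN.finditer(text)]
print(found)  # ['01-234-5678', '23:456:7', '012345678']
```

The shortened form is accepted only when a separator is present: '01234-5' is skipped because the first separator is missing, and '01-234-56' is skipped because its last group has two digits rather than one or four.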
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the entire thing to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matches a run of 1, 2, 3 or 4 digits.
A string with no separator, e.g. "012340008", will also match the regex above, because [-:\s]? matches 0 or 1 times.
HTH
I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to find only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as little as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of matching digits. Note that the latter also matches digits in other writing systems, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions (at least one digit at the end).
$ matches only the end of the input.
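The greedy versus non-greedy behaviour and the [0-9] versus \d difference can both be checked directly. This is a minimal sketch for Python 3 str patterns; the variable names are illustrative.

```python
import re

s = '99-my-name-is-John-Smith-6376827-%^-1-2-767980716'

lazy = re.match(r'.*?([0-9]+)$', s).group(1)
greedy = re.match(r'.*([0-9]+)$', s).group(1)
print(lazy)    # 767980716
print(greedy)  # 6, because .* grabbed everything but the last digit

# '\u0d68' is the Malayalam digit two.
unicode_digit = re.fullmatch(r'\d', '\u0d68') is not None          # True
ascii_only = re.fullmatch(r'\d', '\u0d68', re.ASCII) is not None   # False
bracket_class = re.fullmatch(r'[0-9]', '\u0d68') is not None       # False
print(unicode_digit, ascii_only, bracket_class)
```

If only ASCII digits should count, either [0-9] or \d with the re.ASCII flag gives that behaviour.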
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print(re.findall('^.*-([0-9]+)$', s))
# => ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or more simply just match the digits followed at the end of the string '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ is used to match one or more digits
$ is used to match at the end of string.
So, your code should be: -
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the below regex
\d+$
$ depicts the end of string..
\d is a digit
+ matches the preceding token one or more times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
# Run the match once and reuse the result
match = re.match('.*?([0-9]+)$', W)
if match is None:
    last_digits = "None"
else:
    last_digits = match.group(1)
print("Last digits of " + W + " are " + last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.
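Wrapping that pattern in a function that tolerates strings without trailing digits might look like the following sketch. The function name last_digits is an assumption for this example, and [0-9] is used so that only ASCII digits are accepted.

```python
import re

def last_digits(text):
    """Return the run of digits at the end of text, or None if there is none."""
    match = re.search(r'[0-9]+$', text)
    return match.group() if match else None

print(last_digits('99-my-name-is-John-Smith-6376827-%^-1-2-767980716'))  # 767980716
print(last_digits('no digits at the end'))                               # None
print(last_digits('abc-12x'))                                            # None
```

Unlike splitting on '-', this returns None for 'abc-12x' instead of handing back the non-numeric tail '12x'.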