I want to add an optional part to my python expression:
myExp = re.compile("(.*)_(\d+)\.(\w+)")
so that
if my string is abc_34.txt, result.group(2) is 34
if my string is abc_2034.txt, result.group(2) is still 34
I tried myExp = re.compile("(.*)_[20](\d+)\.(\w+)")
but result.group(2) is 034 for the case of abc_2034.txt
Thanks F.J.
But I want to extend your solution and add a suffix,
so that if I pass abc_203422.txt, result.group(2) is still 34.
I tried re.compile("(.*)_(?:20)?(\d+)(?:22)?.(\w+)")
but I get 3422 instead of 34
strings = [
    "abc_34.txt",
    "abc_2034.txt",
]
for string in strings:
    first_part, ext = string.split(".")
    prefix, number = first_part.split("_")
    # keep only the last two digits of the number
    print(prefix, number[-2:], ext)
--output:--
abc 34 txt
abc 34 txt
import re

strings = [
    "abc_34.txt",
    "abc_2034.txt",
]
pattern = r"""
    ([^_]*)   # Match any character except an underscore, 0 or more times, captured in group 1
    _         # followed by an underscore
    \d*       # followed by a digit, 0 or more times, greedy
    (\d{2})   # followed by a digit, twice, captured in group 2
    [.]       # followed by a period
    (.*)      # followed by any character, 0 or more times, captured in group 3
"""
regex = re.compile(pattern, flags=re.X)  # ignore whitespace and comments in regex
for string in strings:
    md = regex.match(string)
    if md:
        print(md.group(1), md.group(2), md.group(3))
--output:--
abc 34 txt
abc 34 txt
myExp = re.compile(r"(.*)_(?:20)?(\d+)\.(\w+)")
The ?: at the beginning of the group containing 20 makes this a non-capturing group, and the ? after that group makes it optional. So (?:20)? means "optionally match 20".
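For the follow-up with an optional suffix, the attempt returns 3422 because (\d+) is greedy and takes the trailing 22 before (?:22)? gets a chance to match it. One possible sketch, assuming the only prefix is 20 and the only suffix is 22 as in the question, is to make the captured digits lazy. The names my_exp and extract_number are made up for this illustration:

```python
import re

# Lazy \d+? lets the optional (?:22)? suffix claim the trailing digits.
# Assumption: the only prefix is "20" and the only suffix is "22".
my_exp = re.compile(r"(.*)_(?:20)?(\d+?)(?:22)?\.(\w+)")

def extract_number(filename):
    """Return group 2 of the match, or None if the name does not match."""
    result = my_exp.match(filename)
    return result.group(2) if result else None

for name in ["abc_34.txt", "abc_2034.txt", "abc_203422.txt"]:
    print(name, extract_number(name))
```

Note that this is ambiguous for numbers that genuinely end in 22: the lazy group gives up those digits whenever it can, so whether that is acceptable depends on your file names.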
Not sure if this is what you're looking for, but ? is the regex quantifier for "0 or 1 times". Alternatively, {0,2} allows up to two optional digits, as in [0-9]{0,2}, though that is a bit hacky. I will think more on it.
I am trying to match Australian phone numbers. The numbers can start with 0, +61 or 61, followed by 2, 3, 4, 5, 7 or 8, and then an 8-digit number.
txt = "My phone number is 0412345677 or +61412345677 or 61412345677"
find_ph = re.find_all(r'(0|\+61|61)[234578]\d{8}', text)
find_ph
returns
['0', '61']
But I want it to return
['0412345677', '+61412345677' or '61412345677']
Can you please point me in the right direction?
>>> pattern = r'((?:0|\+61|61)[234578]\d{8})'
>>> find_ph = re.findall(pattern, txt)
>>> print(find_ph)
['0412345677', '+61412345677', '61412345677']
The problem you had was that the parentheses around just the prefix part were telling the findall function to return only those characters, while still matching all the rest. (Incidentally, it's findall not find_all, and your string was in the variable txt not text.)
Instead, make that a non-capturing group with (?:0|\+61|61). Now findall returns the whole of the string that matches the entire pattern.
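To see the difference side by side, here is a small sketch using the txt string from the question. The variable names prefixes, numbers and numbers_via_iter are only for illustration; the finditer variant is an alternative if you want to keep the capturing group for other reasons:

```python
import re

txt = "My phone number is 0412345677 or +61412345677 or 61412345677"

# With one capturing group, findall returns only what that group captured.
prefixes = re.findall(r'(0|\+61|61)[234578]\d{8}', txt)

# With a non-capturing group and no other groups, findall returns whole matches.
numbers = re.findall(r'(?:0|\+61|61)[234578]\d{8}', txt)

# finditer with group(0) gives the whole match regardless of grouping.
numbers_via_iter = [m.group(0)
                    for m in re.finditer(r'(0|\+61|61)[234578]\d{8}', txt)]

print(prefixes)
print(numbers)
print(numbers_via_iter)
```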
You can use a non-capturing group:
import re
re.findall(r"(?:0|\+61|61)\d+", txt)
['0412345677', '+61412345677', '61412345677']
One Solution
re.findall(r'(?:0|61|\+61)[234578]\d{8}', txt)
# ['0412345677', '+61412345677', '61412345677']
Explanation
(?:0|61|\+61) Non-capturing group for 0, 61 or +61
[234578] followed by one digit out of 2, 3, 4, 5, 7, 8
\d{8} followed by 8 digits
Having a string like this: aa5f5 aa5f5, I try to split the tokens where a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I try to prevent tokens with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that I only came up with this ugly loop:
import re
sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        # tokens prefixed with $ are kept as they are
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
But I could not find a way to do this with a single-line regexp. I need help with this, thank you.
You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches $ and any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or it matches a digit that is preceded by a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement), else, the space is inserted before the matched digit (see else rf' {x.group()}').
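If the lambda is hard to read, the same replacement logic can be sketched as a named function. The names keep_or_space and split_digits are made up for this example and are not part of the re module:

```python
import re

def keep_or_space(match):
    """Return $-prefixed tokens unchanged; otherwise put a space before the digit."""
    if match.group(1):            # alternative 1 matched: a token starting with $
        return match.group(1)
    return ' ' + match.group()    # alternative 2 matched: a digit after a non-digit

def split_digits(sentence):
    # re.sub calls keep_or_space once per match and uses its return value
    return re.sub(r'(\$\S*)|(?<=\D)\d', keep_or_space, sentence)

print(split_digits('$aa5f5 aa5f5'))
```

Because the $ alternative comes first and consumes the whole token, the digits inside it are never seen by the second alternative.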
With the PyPI regex module, you may do it in a simpler way:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).
This is not something a regular expression does well. If it can be done, it will be a complex regex that is hard to understand, and a new developer joining your team will not understand it right away. It's better to keep it the way you already wrote it. For the regex part, the following code will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'
I need to extract the real issue number from my file name. There are 2 patterns:
if there is no leading number in the file name, then the first number we read is the issue number. For example
asdasd 213.pdf ---> 213
abcd123efg456.pdf ---> 123
however, sometimes there is a leading number in the file name, which is just the index of the file, so I have to ignore/skip it first. For example
123abcd 4567sdds.pdf ---> 4567, since 123 is ignored
890abcd 123efg456.pdf ---> 123, since 890 is ignored
I want to learn whether it is possible to write only one regular expression to implement this. Currently, my solution involves 2 steps:
if there is a leading number, remove it
find the number in the remaining string
or in Python code
import re
reNumHeading = re.compile(r'^\d{1,}', re.IGNORECASE | re.VERBOSE) # to find leading number
reNum = re.compile(r'\d{1,}', re.IGNORECASE | re.VERBOSE) # to find number
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    if reNumHeading.match(test):
        span = reNumHeading.match(test).span()
        stripTest = test[span[1]:]
    else:
        stripTest = test
    result = reNum.findall(stripTest)
    if result:
        print(result[0])
thanks
You can use ? quantifier to define optional pattern
>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
...
213
123
4567
123
(?:^\d+)? a non-capturing group with the ? quantifier, used to optionally match digits at the start of the line
since + is greedy, all the starting digits will be matched
.*? matches any number of characters minimally (because we need the first run of digits)
(\d+) the required digits, captured in group 1
re.search returns a re.Match object from which you can get various details
[1] on the re.Match object gives you the string captured by the first capturing group
use .group(1) if you are on an older version of Python (before 3.6) that doesn't support the [1] syntax
See also: Reference - What does this regex mean?
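If you want to reuse this, the pattern above could be wrapped in a small helper. This is only a sketch: ISSUE_RE and issue_number are hypothetical names, and returning None for names without any digits is my assumption about what you would want:

```python
import re

# Optional leading index, then the first run of digits after it.
ISSUE_RE = re.compile(r'(?:^\d+)?.*?(\d+)')

def issue_number(filename):
    """Return the issue number as a string, or None if no digits are found."""
    m = ISSUE_RE.search(filename)
    return m.group(1) if m else None

for name in ['asdasd 213.pdf', 'abcd123efg456.pdf',
             '123abcd 4567sdds.pdf', '890abcd 123efg456.pdf']:
    print(name, '--->', issue_number(name))
```

One caveat: if the only number in the name is the leading one (for example 123.pdf), the optional group backtracks and the helper returns just a fragment of that number, so you may want to handle that case separately.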
Just match digits \d+ that follow a non-digit \D:
import re
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))
Output:
4567
213
123
123
I wrote a regex match pattern in Python, but re.match() does not capture groups after the | alternation operator.
Here is the pattern:
pattern = r"00([1-9]\d) ([1-9]\d) ([1-9]\d{5})|\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"
I feed the pattern with a qualified string: "+12 34 567890":
strng = "+12 34 567890"
pattern = r"00([1-9]\d) ([1-9]\d) ([1-9]\d{5})|\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"
m = re.match(pattern, strng)
print(m.group(1))
None is printed.
But if I delete the part before the | alternation operator
strng = "+12 34 567890"
pattern = r"\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"
m = re.match(pattern, strng)
print(m.group(1))
It can capture all 3 groups:
12
34
567890
Thanks so much for your thoughts!
'|' has nothing to do with the index of a group; groups are always numbered from left to right, by their opening parenthesis, in the regex itself.
In your original regex, there are 6 groups:
In [270]: m.groups()
Out[270]: (None, None, None, '12', '34', '567890')
The matching part is the second part, thus you need:
In [271]: m.group(4)
Out[271]: '12'
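If you would rather keep the original six-group pattern, one workaround is to drop the groups that did not take part in the match. This is a sketch, and captured_parts is a name I made up for it:

```python
import re

pattern = r"00([1-9]\d) ([1-9]\d) ([1-9]\d{5})|\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"

def captured_parts(text):
    """Return only the groups from the alternative that actually matched."""
    m = re.match(pattern, text)
    if m is None:
        return None
    # Groups belonging to the other alternative are None, so filter them out.
    return [g for g in m.groups() if g is not None]

print(captured_parts("+12 34 567890"))
print(captured_parts("0012 34 567890"))
```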
You want to support two different patterns, one with 00 and the other with + at the start. You may merge the alternatives using a non-capturing group:
import re
strng = "+12 34 567890"
pattern = r"(?:00|\+)([1-9]\d) ([1-9]\d) ([1-9]\d{5})$"
m = re.match(pattern, strng)
if m:
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
See the regex demo and the Python demo yielding
12
34
567890
The regex at the regex testing site is prepended with ^ (start of string) because re.match only matches at the start of the string. The whole pattern now matches:
^ - start of string (implicit in re.match)
(?:00|\+) - a 00 or + substrings
([1-9]\d) - Capturing group 1: a digit from 1 to 9 and then any digit
' ' - a space (replace with \s to match any single whitespace char)
([1-9]\d) - Capturing group 2: a digit from 1 to 9 and then any digit
' ' - a space (replace with \s to match any single whitespace char)
([1-9]\d{5}) - Capturing group 3: a digit from 1 to 9 and then any 5 digits
$ - end of string.
Remove $ if you do not need to match the end of the string right after the number.
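As a variation, and only as a sketch, the same merged pattern can be used with re.fullmatch instead of a trailing $. PHONE_RE and parse_phone are hypothetical names. The one behavioural difference I am aware of is that $ also matches just before a trailing newline, while fullmatch requires the pattern to cover the entire string:

```python
import re

# Same merged pattern as above, without the trailing $.
PHONE_RE = re.compile(r"(?:00|\+)([1-9]\d) ([1-9]\d) ([1-9]\d{5})")

def parse_phone(text):
    """Return the three captured parts as a tuple, or None if text does not match."""
    m = PHONE_RE.fullmatch(text)
    return m.groups() if m else None

for candidate in ["+12 34 567890", "0012 34 567890", "+12 34 567890 ext"]:
    print(repr(candidate), parse_phone(candidate))
```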
Suppose I have a sentence that contains an age and a time:
import re
text = "I am 21 and work at 3:30"
answer= re.findall(r'\b\d{2}\b', text)
print(answer)
The issue is that it gives me not only the 21, but also the 30 (since it looks for 2 digits). How do I avoid this so that it only picks up standalone numbers and not digits next to the non-alphanumeric characters that lead to the issue? I tried to use [0-99] instead of the {} braces but that didn't seem to help.
Using \s\d{2}\s will give you only 2 digit combinations with spaces around them (before and after).
Or if you want to match without trailing whitespace: \s\d{2}
That's because : is considered a non-word character when you match the empty string at a word boundary with \b. In regex terms, a word for \b is \w+.
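A quick sketch of both points from the question, using its text string. The variable names with_boundaries and single_digits are only for illustration:

```python
import re

text = "I am 21 and work at 3:30"

# ':' is not a word character, so there is a word boundary between ':' and '30'.
with_boundaries = re.findall(r'\b\d{2}\b', text)

# [0-99] is a character class: the range 0-9 plus a redundant 9,
# so it matches exactly one digit, not the numbers 0 to 99.
single_digits = re.findall(r'[0-99]', text)

print(with_boundaries)
print(single_digits)
```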
You can check for digits with space or start/end of input line around:
(?:^|\s)(\d{2})(?:\s|$)
Example:
In [85]: text = "I am 21 and work at 3:30"
...: re.findall(r'(?:^|\s)(\d{2})(?:\s|$)', text)
Out[85]: ['21']
You can use a negative lookbehind (?<!...) and a negative lookahead (?!...) to isolate and capture only 2 (two) digits.
Regex: (?<!\S)\d{2}(?!\S)
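One practical difference between the lookaround version and a pattern that consumes the surrounding whitespace shows up when two-digit numbers are adjacent. The sketch below compares the two on a made-up input, "21 22 23", which is my own example and not from the question:

```python
import re

adjacent = "21 22 23"

# Consuming version: the space after 21 is part of the first match,
# so it is no longer available to introduce 22.
consuming = re.findall(r'(?:^|\s)(\d{2})(?:\s|$)', adjacent)

# Lookaround version: the assertions check the neighbours without consuming them.
lookaround = re.findall(r'(?<!\S)\d{2}(?!\S)', adjacent)

print(consuming)
print(lookaround)
print(re.findall(r'(?<!\S)\d{2}(?!\S)', "I am 21 and work at 3:30"))
```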
You can use the following regex:
^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)
that will match all the 21 in the following strings:
I am 21 and work at 3:30
21
abc 12:23
12345
I am 21
21 am I
demo: https://regex101.com/r/gP1KSf/1
Explanations:
^\d{2}$ match 2 digits only string or
(?<=\s)\d{2}(?=\s) 2 digits surrounded by space class char or
(?<=\s)\d{2}$ 2 digits at the end of the string, preceded by a space class char
^\d{2}(?=\s) 2 digits at the beginning of the string and followed by a space class char
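To check the four alternatives against the sample strings listed above, something like the following sketch can be used. The names LONG, SHORT, samples and results are mine. On these samples the shorter (?<!\S)\d{2}(?!\S) pattern gives the same results, though I have not verified that the two are equivalent for every possible input:

```python
import re

LONG = re.compile(r'^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)')
SHORT = re.compile(r'(?<!\S)\d{2}(?!\S)')

samples = [
    "I am 21 and work at 3:30",
    "21",
    "abc 12:23",
    "12345",
    "I am 21",
    "21 am I",
]

# Map each sample string to the list of standalone two-digit numbers found in it.
results = {s: LONG.findall(s) for s in samples}

for s in samples:
    print(repr(s), results[s], SHORT.findall(s))
```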