I need to extract the real issue number from my file names. There are two patterns:
If there is no leading number in the file name, the first number encountered is the issue number. For example:
asdasd 213.pdf ---> 213
abcd123efg456.pdf ---> 123
However, sometimes there is a leading number in the file name that is just the index of the file, so it has to be skipped first. For example:
123abcd 4567sdds.pdf ---> 4567, since 123 is ignored
890abcd 123efg456.pdf ---> 123, since 890 is ignored
Is it possible to do this with a single regular expression? Currently, my solution involves two steps:
if there is a leading number, remove it
find the first number in the remaining string
Or, in Python code:
import re
reNumHeading = re.compile(r'^\d+')  # to find the leading number
reNum = re.compile(r'\d+')  # to find a number
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    heading = reNumHeading.match(test)
    if heading:
        # skip the leading index
        stripTest = test[heading.end():]
    else:
        stripTest = test
    result = reNum.findall(stripTest)
    if result:
        print(result[0])
thanks
You can use the ? quantifier to make a pattern optional:
>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
...
213
123
4567
123
(?:^\d+)? a non-capturing group with the ? quantifier, used to optionally match digits at the start of the line
since + is greedy, all the leading digits will be matched
.*? matches as few characters as possible (because we need the first run of digits after that)
(\d+) captures the required digits
re.search returns a re.Match object from which you can get various details
[1] on the re.Match object gives you the string captured by the first capturing group
use .group(1) if you are on a Python version older than 3.6, which doesn't support the [1] syntax
See also: Reference - What does this regex mean?
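A minimal sketch of the same pattern wrapped in a helper, so the no-match case does not raise. The names ISSUE_RE and issue_number are hypothetical, not from the answer above:

```python
import re

# Optional leading index, then as few characters as possible,
# then the first run of digits, which is captured.
ISSUE_RE = re.compile(r'(?:^\d+)?.*?(\d+)')

def issue_number(file_name):
    """Return the issue number as a string, or None if no number is found."""
    m = ISSUE_RE.search(file_name)
    return m.group(1) if m else None

for name in ['asdasd 213.pdf', 'abcd123efg456.pdf',
             '123abcd 4567sdds.pdf', '890abcd 123efg456.pdf']:
    print(name, '->', issue_number(name))
```

One caveat: for a name that contains only a leading number, such as '123.pdf', the regex backtracks into the optional group and returns '3', so such names may need a separate check.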
Just match digits \d+ that follow a non-digit \D:
import re
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))
Output:
4567
213
123
123
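Note that re.search returns None when no run of digits follows a non-digit, in which case res.group(1) raises AttributeError. A small sketch with a guard, where first_number_after_text is a hypothetical helper name:

```python
import re

def first_number_after_text(file_name):
    """Return the first digit run that follows a non-digit, or None."""
    res = re.search(r'\D(\d+)', file_name)
    return res.group(1) if res else None

print(first_number_after_text('123abcd 4567sdds.pdf'))  # 4567
print(first_number_after_text('123.pdf'))                # None
```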
Related
I am trying to find all occurrences of either "_"+digit or "^"+digit, using the regex ((_\^)[1-9])
The groups I'd expect back, e.g. for "X_2ZZZY^5", would be [('_2'), ('^5')], but instead I am getting [('_2', '_'), ('^5', '^')]
Is my regex incorrect? Or is my expectation of what gets returned incorrect?
Many thanks
** my original re used (_|\^) this was incorrect, and should have been (_\^) -- question has been amended accordingly
You have 2 groups in your regex, so you're getting 2 groups back for each match. You also need to match at least 1 digit after the special character.
Try this:
([_\^][1-9]+)
Demand at least 1 digit (1-9) following the special characters _ or ^, placed inside a single capture group:
import re
text = "X_2ZZZY^5"
pattern = r"([_\^][1-9]{1,})"
regex = re.compile(pattern)
res = re.findall(regex, text)
print(res)
Returning:
['_2', '^5']
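The shape of what re.findall returns depends on how many capturing groups the pattern has, which is what the question ran into. A short sketch; the two-group pattern below is an assumed reconstruction that reproduces the tuples from the question, not necessarily the asker's exact regex:

```python
import re

text = "X_2ZZZY^5"

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(([_^])[1-9])", text)

# One capturing group: findall returns the text of that group.
one_group = re.findall(r"([_^][1-9])", text)

# No capturing groups: findall returns the whole match.
no_groups = re.findall(r"[_^][1-9]", text)

print(two_groups)  # [('_2', '_'), ('^5', '^')]
print(one_group)   # ['_2', '^5']
print(no_groups)   # ['_2', '^5']
```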
I am validating MAC addresses. Below is the code:
import re
input = """
abc
xyz
ff:ff:ff:ff:ff:ff
ff::ff::ff::ff::ff::ff
ff-ff-ff-ff-ff-ff
"""
regex = re.compile("([A-Fa-f0-9]{2}?-::){5}[A-Fa-f0-9]{2}")
mac_address = re.findall(regex, input)
print(mac_address)
actual output:
[] # Empty list
Expected output:
["ff:ff:ff:ff:ff:ff", "ff::ff::ff::ff::ff::ff", "ff-ff-ff-ff-ff-ff"]
My reasoning behind the regex:
1) a MAC address consists of hex digits, hence [A-Fa-f0-9]
2) each part must contain two characters, hence {2}
3) there is an optional separator (one colon (:), two colons (::), or a dash), hence ?-::
4) steps 1, 2 and 3 must match 5 times, hence {5}
5) finally, [A-Fa-f0-9]{2} must match the last two characters
Could anyone please point out what I am doing wrong here?
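The re.M ("multi-line") flag used below is not what fixes this: it only changes how ^ and $ behave, and the pattern uses neither, so the flag is harmless but optional. The real problem is the pattern itself. ?-:: does not mean "optionally one of -, : or ::"; the {2}? is a lazy quantifier and -:: is matched literally, so nothing in your input matches. A similar regex (five occurrences of "two hex digits optionally followed by either :, ::, or -", followed by two more hex digits) works: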
>>> inp = """
... abc
... xyz
... ff:ff:ff:ff:ff:ff
... ff::ff::ff::ff::ff::ff
... ff-ff-ff-ff-ff-ff
... """
>>> regex = re.compile(r'(?:[A-Fa-f0-9]{2}(?:\:|\:\:|-)?){5}[A-Fa-f0-9]{2}', flags=re.M)
>>> mac_address = re.findall(regex, inp)
>>> print(mac_address)
['ff:ff:ff:ff:ff:ff', 'ff::ff::ff::ff::ff::ff', 'ff-ff-ff-ff-ff-ff']
In your pattern you are matching, 5 times, two characters from [A-Fa-f0-9] followed by the literal text -::
In your example data, the separator can be ::, : or -
To match that, you can use a non-capturing group with an alternation that matches either :: or, using a character class, one of : or -, like (?:::|[:-])
Note that ?-:: is not the notation for making a character optional.
If you want the separator to be optional, you can make the group optional: (?:::|[:-])? (this would also match ffffffffffff, which is not in the example data)
\b(?:[A-Fa-f0-9]{2}(?:::|[:-])){5}[A-Fa-f0-9]{2}\b
For example
import re
input = """
abc
xyz
ff:ff:ff:ff:ff:ff
ff::ff::ff::ff::ff::ff
ff-ff-ff-ff-ff-ff
"""
regex = re.compile(r"\b(?:[A-Fa-f0-9]{2}(?:::|[:-])){5}[A-Fa-f0-9]{2}\b")
mac_address = re.findall(regex, input)
print(mac_address)
Output
['ff:ff:ff:ff:ff:ff', 'ff::ff::ff::ff::ff::ff', 'ff-ff-ff-ff-ff-ff']
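If the goal is to validate a single candidate string rather than to find addresses inside a larger text, re.fullmatch can be used with the same building blocks. A rough sketch; the names HEX_PAIR, SEP, STRICT, LOOSE and is_mac are hypothetical:

```python
import re

HEX_PAIR = r"[A-Fa-f0-9]{2}"
SEP = r"(?:::|[:-])"

# Separator required between pairs, as in the pattern above.
STRICT = re.compile("(?:" + HEX_PAIR + SEP + "){5}" + HEX_PAIR)
# Separator optional, which also accepts twelve bare hex digits.
LOOSE = re.compile("(?:" + HEX_PAIR + SEP + "?){5}" + HEX_PAIR)

def is_mac(candidate, pattern=STRICT):
    """True if the whole string is a MAC address under the given pattern."""
    return pattern.fullmatch(candidate) is not None

for line in ["abc", "ff:ff:ff:ff:ff:ff", "ff::ff::ff::ff::ff::ff",
             "ff-ff-ff-ff-ff-ff", "ffffffffffff"]:
    print(line, is_mac(line), is_mac(line, LOOSE))
```

Both patterns accept mixed separators within one address (for example ff:ff-ff:ff:ff:ff), which may or may not be what you want.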
What I'm trying to do is extract only the digits from dollar figures.
Format of Input
...
$1,289,868
$62,000
$421
...
Desired Output
...
1289868
62000
421
...
The regular expression that I was using to extract only the digits and commas is:
r'\d+(,\d+){0,}'
which of course outputs...
...
1,289,868
62,000
421
...
What I'd like to do is convert the output to an integer (int(...)), but obviously this won't work with the commas. I'm sure I could figure this out on my own, but I'm running really short on time right now.
I know I can simply use r'\d+', but this obviously separates each chunk into separate matches...
You can't match discontinuous texts within one match operation. You can't put a regex into re.findall against 1,345,456 to receive 1345456. You will need to first match the strings you need, and then post-process them within code.
A regex you may use to extract the numbers themselves:
re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
Alternatively, you may use a bit more general regex to be used with re.findall:
r'\$(\d+(?:,\d+)*)'
Note that re.findall will only return the captured part of the string (the one matched with the (...) part in the regex).
Details
\$ - a dollar sign
(\d{1,3}(?:,\d{3})*) - Capturing group 1:
    \d{1,3} - 1 to 3 digits (if \d+ is used, 1 or more digits)
    (?:,\d{3})* - 0 or more sequences of
        , - a comma
        \d{3} - 3 digits (or if \d+ is used, 1 or more digits).
Python code sample (with the commas removed):
import re
s = """$1,289,868
$62,000
$421"""
result = [x.replace(",", "") for x in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(result) # => ['1289868', '62000', '421']
Using re.sub
Example:
import re
s = """$1,289,868
$62,000
$421"""
print([int(i) for i in re.sub(r'[^0-9\s]', "", s).splitlines()])
Output:
[1289868, 62000, 421]
You don't need regex for this.
int(''.join(filter(str.isdigit, "$1,000,000")))
works just fine.
If you did want to use regex for some reason:
int(''.join(re.findall(r"\d", "$1,000,000")))
If you know how to extract the numbers with comma groupings, the easiest thing to do is just transform that into something int can handle:
for match in matches:
    i = int(match.replace(',', ''))
For example, if match is '1,289,868', then match.replace(',', '') is '1289868', and obviously int(<that>) is 1289868.
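Putting the extraction and the conversion together, as a sketch: it assumes the amounts are $-prefixed with comma-separated thousands groups, and dollar_amounts is a hypothetical helper name:

```python
import re

def dollar_amounts(text):
    """Return every $-prefixed, comma-grouped amount in text as an int."""
    matches = re.findall(r'\$(\d{1,3}(?:,\d{3})*)', text)
    return [int(match.replace(',', '')) for match in matches]

s = "$1,289,868\n$62,000\n$421"
print(dollar_amounts(s))  # [1289868, 62000, 421]
```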
You don't need regex for this. Plain string operations are enough:
>>> string = '$1,289,868\n$62,000\n$421'
>>> [w.lstrip('$').replace(',', '') for w in string.splitlines()]
['1289868', '62000', '421']
Alternatively, you can use locale.atoi to convert a string of digits with commas to an int (setlocale raises locale.Error if the en_US.UTF8 locale is not available on your system):
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
>>> list(map(lambda x: locale.atoi(x.lstrip('$')), string.splitlines()))
[1289868, 62000, 421]
I have tested the following code on http://regexpal.com/ and it correctly matches the string I want. I want to find 16 digit numbers which occur in blocks of 4 with a space in the middle, so I wrote the following regex:
\d{4}(\s\d{4}){3}
i.e. match 4 digits, then match three repetitions of a space followed by four digits. On regexpal, this correctly matches:
test1234 message1234 5678 1234 5678
In Python, however, I run the following code:
>>> import re
>>> p = re.compile('\d{4}(\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
[' 5678']
>>>
I don't understand why it is matching the second instance of '5678' and why it is not matching the block of numbers as I would expect.
A raw string is the recommended way to define a regex, but the problem here comes mainly from how the findall method works. When the pattern contains a capturing group, re.findall returns the text captured by the group instead of the whole match. Your regex \d{4}(\s\d{4}){3} does match the 16-digit number, but the repeated group keeps only its last repetition: the last four digits plus the preceding space. You need to turn the capturing group into a non-capturing group:
p = re.compile(r'\d{4}(?:\s\d{4}){3}')
Example:
>>> import re
>>> p = re.compile(r'\d{4}(\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
[' 5678']
>>> p = re.compile(r'\d{4}(?:\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
['1234 5678 1234 5678']
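If the capturing group has to stay for some other reason, the whole match is still available through finditer and group(0). A small sketch of that alternative:

```python
import re

p = re.compile(r'\d{4}(\s\d{4}){3}')  # capturing group left in place
text = 'test1234 message1234 5678 1234 5678'

# group(0) is always the whole match, regardless of capturing groups.
whole = [m.group(0) for m in p.finditer(text)]
# group(1) holds only the last repetition of the repeated group.
last = [m.group(1) for m in p.finditer(text)]

print(whole)  # ['1234 5678 1234 5678']
print(last)   # [' 5678']
```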
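You should also either prefix your pattern string with an r or escape your backslashes. This does not change the findall result here, because \d and \s are not string escape sequences and pass through unchanged, but it avoids the invalid-escape warnings that newer Python versions emit: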
p = re.compile(r'\d{4}(\s\d{4}){3}')
or
p = re.compile('\\d{4}(\\s\\d{4}){3}')
I want to add an optional part to my Python regular expression:
myExp = re.compile("(.*)_(\d+)\.(\w+)")
so that
if my string is abc_34.txt, result.group(2) is 34
if my string is abc_2034.txt, result.group(2) is still 34
I tried myExp = re.compile("(.*)_[20](\d+)\.(\w+)")
but my result.group(2) is 034 for the case of abc_2034.txt
Thanks F.J.
But I want to expand your solution and add a suffix,
so that if I put abc_203422.txt, result.group(2) is still 34
I tried "(.*)_(?:20)?(\d+)(?:22)?.(\w+)"
but I get 3422 instead of 34
strings = [
    "abc_34.txt",
    "abc_2034.txt",
]
for string in strings:
    # split off the extension, then the prefix; keep the last two digits
    first_part, ext = string.split(".")
    prefix, number = first_part.split("_")
    print(prefix, number[-2:], ext)
--output:--
abc 34 txt
abc 34 txt
import re
strings = [
    "abc_34.txt",
    "abc_2034.txt",
]
pattern = r"""
    ([^_]*)    #Match not an underscore, 0 or more times, captured in group 1
    _          #followed by an underscore
    \d*        #followed by a digit, 0 or more times, greedy
    (\d{2})    #followed by a digit, twice, captured in group 2
    [.]        #followed by a period
    (.*)       #followed by any character, 0 or more times, captured in group 3
"""
regex = re.compile(pattern, flags=re.X)  #ignore whitespace and comments in regex
for string in strings:
    md = re.match(regex, string)
    if md:
        print(md.group(1), md.group(2), md.group(3))
--output:--
abc 34 txt
abc 34 txt
myExp = re.compile(r"(.*)_(?:20)?(\d+)\.(\w+)")
The ?: at the beginning of the group containing 20 makes this a non-capturing group, the ? after that group makes it optional. So (?:20)? means "optionally match 20".
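For the follow-up in the question (an optional 22 suffix), one possible approach, offered as a sketch rather than a definitive answer, is to make the digit group lazy so that the optional suffix gets a chance to match. The name my_exp is hypothetical:

```python
import re

# Lazy (\d+?) takes as few digits as possible, so the optional
# suffix (?:22)? can match before the dot.
my_exp = re.compile(r"(.*)_(?:20)?(\d+?)(?:22)?\.(\w+)")

for name in ["abc_34.txt", "abc_2034.txt", "abc_203422.txt"]:
    print(name, "->", my_exp.match(name).group(2))
```

The pattern cannot tell a suffix from real digits: a number that genuinely ends in 22, such as abc_122.txt, yields 1.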
Not sure if this is what you're looking for, but ? is the regex symbol for "0 or 1 times". Alternatively, [0-9]{0,2} matches up to two optional digits, which is a bit hacky. I will think more on it.