Python: Unexpected regex behaviour for {n} match

I have tested the following code on http://regexpal.com/ and it correctly matches the string I want. I want to find 16 digit numbers which occur in blocks of 4 with a space in the middle, so I wrote the following regex:
\d{4}(\s\d{4}){3}
i.e. match 4 numbers, then match three repeating sets of a space followed by four numbers. On regexpal, this correctly matches:
test1234 message1234 5678 1234 5678
In Python, however, I run the following code:
>>> import re
>>> p = re.compile('\d{4}(\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
[' 5678']
>>>
I don't understand why it is matching the second instance of '5678' and why it is not matching the block of numbers as I would expect.

A raw string is the recommended way to define a regex, but the main problem here is how re.findall works. When the pattern contains a capturing group, re.findall returns the text captured by the group instead of the whole match. Your regex \d{4}(\s\d{4}){3} does match the 16-digit number, but the repeated group keeps only its last capture: the final four digits plus the preceding space. Turn the capturing group into a non-capturing group:
p = re.compile(r'\d{4}(?:\s\d{4}){3}')
Example:
>>> import re
>>> p = re.compile(r'\d{4}(\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
[' 5678']
>>> p = re.compile(r'\d{4}(?:\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
['1234 5678 1234 5678']
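The same difference can be seen without changing the pattern: a match object always exposes the whole match as group(0), so re.finditer is an alternative to rewriting the group. A minimal sketch, reusing the pattern and input from the question (the variable names are only illustrative):

```python
import re

text = 'test1234 message1234 5678 1234 5678'
pattern = re.compile(r'\d{4}(\s\d{4}){3}')  # still has a capturing group

# findall returns the group's content; a repeated group keeps its last capture
captured = pattern.findall(text)
print(captured)  # [' 5678']

# finditer yields match objects; group(0) is the full match
whole = [m.group(0) for m in pattern.finditer(text)]
print(whole)  # ['1234 5678 1234 5678']
```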

You need to either prefix your string with an r or escape your backslashes:
p = re.compile(r'\d{4}(\s\d{4}){3}')
or
p = re.compile('\\d{4}(\\s\\d{4}){3}')
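As a hedged illustration of what the raw-string prefix changes: the two spellings above produce the same pattern string, so for \d and \s the choice does not affect matching. It does matter for escapes that Python itself interprets, such as \b. The sample text below is made up for the demonstration:

```python
import re

raw = r'\d{4}(\s\d{4}){3}'
escaped = '\\d{4}(\\s\\d{4}){3}'
print(raw == escaped)  # True: both spell the same pattern

# Where it matters: '\b' is a backspace character in a normal string,
# but a word-boundary assertion when it reaches the regex engine intact.
with_raw = re.findall(r'\bcat\b', 'cat concat cat')
without_raw = re.findall('\bcat\b', 'cat concat cat')
print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```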

Related

How to code a regex pattern for multiple identical characters in python3?

I have a long string of the following form:
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI..."
it is a concatenation of random strings interspersed by strings of consecutive F letters:
ASOGH
FFFFFFFFFFFFFFFFFFF
GFIOSG
FFFFFFFF
URHDHREEK
FFFFFF
IIIEI
The number of consecutive F letters is not fixed, but there will be more than 5,
and let's assume five F letters will not appear consecutively in the random strings.
I want to extract only random strings to get the following list:
random_strings = ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
I imagine there is a simple regex expression that would solve this task:
random_strings = joined_string.split('WHAT_TO_TYPE_HERE?')
Question: how to code a regex pattern for multiple identical characters?
I would use re.split for this task in the following way:
import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split('F{5,}',joined_string)
print(parts)
output
['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
F{5,} denotes 5 or more F
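One thing to watch with re.split: if the input starts or ends with a run of F, the result contains empty strings at those positions. A small sketch that filters them out; the helper name split_on_f_runs and its min_run parameter are hypothetical, not part of the answer above:

```python
import re

def split_on_f_runs(s, min_run=5):
    """Split s on runs of at least min_run consecutive F, dropping empty pieces."""
    parts = re.split(r'F{%d,}' % min_run, s)
    return [p for p in parts if p]

sample = "FFFFFFASOGHFFFFFFFGFIOSGFFFFFF"  # F runs at both ends
print(re.split(r'F{5,}', sample))   # ['', 'ASOGH', 'GFIOSG', '']
print(split_on_f_runs(sample))      # ['ASOGH', 'GFIOSG']
```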
You can split on F{5,} and put it in a capture group so that the separator text is also part of the result:
import re
s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print( re.split(r'(F{5,})', s) )
Output:
['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
I would use a regex find all approach here:
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(parts)
This prints:
['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
The regex pattern here can be explained as:
F{2,} match any group of 2 or more consecutive F's (first)
| OR, that failing
(?:
[A-EG-Z] match any non F character
| OR
F(?!F) match a single F (not followed by an F)
)+ all of these, one or more times
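The breakdown above can be written directly into the pattern with re.VERBOSE, which ignores whitespace and # comments inside the regex. This is only a sketch of the same pattern; the final filtering step that keeps just the random strings is an addition, not part of the answer:

```python
import re

joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"

pattern = re.compile(r"""
    F{2,}          # a run of two or more F
    |              # or, failing that,
    (?:
        [A-EG-Z]   # any uppercase letter other than F
        |
        F(?!F)     # a single F not followed by another F
    )+             # one or more times
""", re.VERBOSE)

parts = pattern.findall(joined_string)
print(parts)

# Runs of F always start with 'FF'; everything else is a random string
random_strings = [p for p in parts if not p.startswith('FF')]
print(random_strings)  # ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
```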

skip leading number in regular expression?

I need to extract the real issue number in my file name. There are 2 patterns:
if there is no leading number in the file name, then the number, which we read first, is the issue number. For example
asdasd 213.pdf ---> 213
abcd123efg456.pdf ---> 123
however, sometimes there is a leading number in the file name, which is just the index of file, so I have to ignore/skip it firstly. For example
123abcd 4567sdds.pdf ---> 4567, since 123 is ignored
890abcd 123efg456.pdf ---> 123, since 890 is ignored
I want to learn whether it is possible to write a single regular expression that does this. Currently, my solution involves 2 steps:
if there is a leading number, remove it
find the number in the remaining string
or in Python code
import re
reNumHeading = re.compile('^\d{1,}', re.IGNORECASE | re.VERBOSE) # to find leading number
reNum = re.compile('\d{1,}', re.IGNORECASE | re.VERBOSE) # to find number
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    if reNumHeading.match(test):
        span = reNumHeading.match(test).span()
        stripTest = test[span[1]:]
    else:
        stripTest = test
    result = reNum.findall(stripTest)
    if result:
        print(result[0])
thanks
You can use the ? quantifier to define an optional pattern:
>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
...
213
123
4567
123
(?:^\d+)? here a non-capturing group and ? quantifier is used to optionally match digits at start of line
since + is greedy, all the starting digits will be matched
.*? match any number of characters minimally (because we need the first match of digits)
(\d+) the required digits to be captured
re.search returns a re.Match object from which you can get various details
[1] on the re.Match object will give you string captured by first capturing group
use .group(1) if you are on older version of Python that doesn't support [1] syntax
See also: Reference - What does this regex mean?
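For reuse, the pattern above could be wrapped in a small function that also copes with file names containing no digits at all, where re.search returns None. A minimal sketch; the names ISSUE_RE and issue_number are hypothetical:

```python
import re

# Optional leading index, then the first run of digits after it
ISSUE_RE = re.compile(r'(?:^\d+)?.*?(\d+)')

def issue_number(filename):
    """Return the issue number as a string, or None if no digits are found."""
    m = ISSUE_RE.search(filename)
    return m.group(1) if m else None

for name in ['asdasd 213.pdf', 'abcd123efg456.pdf',
             '123abcd 4567sdds.pdf', '890abcd 123efg456.pdf', 'notes.pdf']:
    print(name, '->', issue_number(name))
```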
Just match digits \d+ that follow a non-digit \D:
import re
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))
Output:
4567
213
123
123

Extract Only Digits from Dollar Figures

What I'm trying to do is extract only the digits from dollar figures.
Format of Input
...
$1,289,868
$62,000
$421
...
Desired Output
...
1289868
62000
421
...
The regular expression that I was using to extract only the digits and commas is:
r'\d+(,\d+){0,}'
which of course outputs...
...
1,289,868
62,000
421
...
What I'd like to do is convert the output to an integer (int(...)), but obviously this won't work with the commas. I'm sure I could figure this out on my own, but I'm running really short on time right now.
I know I can simply use r'\d+', but this obviously separates each chunk into separate matches...
You can't match discontinuous texts within one match operation. You can't put a regex into re.findall against 1,345,456 to receive 1345456. You will need to first match the strings you need, and then post-process them within code.
A regex you may use to extract the numbers themselves
re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
Alternatively, you may use a bit more general regex to be used with re.findall:
r'\$(\d+(?:,\d+)*)'
Note that re.findall will only return the captured part of the string (the one matched with the (...) part in the regex).
Details
\$ - a dollar sign
(\d{1,3}(?:,\d{3})*) - Capturing group 1:
\d{1,3} - 1 to 3 digits (if \d+ is used, 1 or more digits)
(?:,\d{3})* - 0 or more sequences of
, - a comma
\d{3} - 3 digits (or if \d+ is used, 1 or more digits).
Python code sample (with removing commas):
import re
s = """$1,289,868
$62,000
$421"""
result = [x.replace(",", "") for x in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(result) # => ['1289868', '62000', '421']
Using re.sub
Ex:
import re
s = """$1,289,868
$62,000
$421"""
print([int(i) for i in re.sub(r'[^0-9\s]', "", s).splitlines()])
Output:
[1289868, 62000, 421]
You don't need regex for this.
int(''.join(filter(str.isdigit, "$1,000,000")))
works just fine.
If you did want to use regex for some reason:
int(''.join(re.findall(r"\d", "$1,000,000")))
If you know how to extract the numbers with comma groupings, the easiest thing to do is just transform that into something int can handle:
for match in matches:
    i = int(match.replace(',', ''))
For example, if match is '1,289,868', then match.replace(',', '') is '1289868', and obviously int(<that>) is 1289868.
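Put together with the extraction step, this might look like the sketch below. It assumes the pattern from the question with its group made non-capturing, because re.findall would otherwise return only the group's content instead of the whole number:

```python
import re

text = "$1,289,868\n$62,000\n$421"

# (?:...) keeps findall returning the full match rather than the group
matches = re.findall(r'\d+(?:,\d+)*', text)
print(matches)  # ['1,289,868', '62,000', '421']

values = [int(m.replace(',', '')) for m in matches]
print(values)   # [1289868, 62000, 421]
```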
You don't need regex for this. Plain string operations should be enough:
>>> string = '$1,289,868\n$62,000\n$421'
>>> [w.lstrip('$').replace(',', '') for w in string.splitlines()]
['1289868', '62000', '421']
Or alternatively, you can use locale.atoi to convert string of digits with commas to int
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
>>> list(map(lambda x: locale.atoi(x.lstrip('$')), string.splitlines()))
[1289868, 62000, 421]

Using Python's regex .match() method to get the string before and after an underscore

I have the following code:
tablesInDataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
for table in tablesInDataset:
    tableregex = re.compile("\d{8}")
    tablespec = re.match(tableregex, table)
    everythingbeforedigits = tablespec.group(0)
    digits = tablespec.group(1)
My regex should only return the string if it contains 8 digits after an underscore. Once it returns the string, I want to use .match() to get two groups using the .group() method. The first group should contain a string with all of the characters before the digits and the second should contain a string with the 8 digits.
What is the correct regex to get the results I am looking for using .match() and .group()?
Use capture groups:
>>> import re
>>> s = "henry_jones_12345678"
>>> pat = re.compile(r'(?P<name>.*)_(?P<number>\d{8})')
>>> pat.findall(s)
[('henry_jones', '12345678')]
You get the nice feature of named groups, if you want it:
>>> match = pat.match(s)
>>> match.groupdict()
{'name': 'henry_jones', 'number': '12345678'}
tableregex = re.compile(r"(.*)_(\d{8})")
I think this pattern should match what you need: (.*?_)(\d{8}).
First group includes everything up to the 8 digits, including the underscore. Second group is the 8 digits.
If you don't want the underscore included, use this instead: (.*?)_(\d{8})
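Applied to the list from the question, a pattern of this shape could be used as in the sketch below. Two points are assumptions on my part rather than something the answers state: re.fullmatch is used so that names with extra trailing characters are rejected, and names that do not match are simply skipped. Note that group(0) is the whole match, so the two parts are group(1) and group(2):

```python
import re

tables_in_dataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
table_re = re.compile(r'(.*)_(\d{8})')

results = []
for table in tables_in_dataset:
    m = table_re.fullmatch(table)  # None unless the whole name fits
    if m is None:
        continue
    # group(0) is the whole match; the parenthesised groups start at 1
    results.append((m.group(1), m.group(2)))

print(results)  # [('henry_jones', '12345678')]
```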
Here you go:
import re
tablesInDataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
rx = re.compile(r'^(\D+)_(\d{8})$')
matches = [(match.groups()) \
for item in tablesInDataset \
for match in [rx.search(item)] \
if match]
print(matches)
Better than any dot-star-soup :)

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
['AbAD']
I guess the above is due to the fact that the last D in the first regex is not available for matching any more? Is there any way to turn off this kind of matching.
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
[]
Basically what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. But ['AbAD'] should be matched, but it is not getting matched. Any ideas?
This happens because the matches overlap. Put your regex inside a lookahead with a capturing group to handle overlapping matches:
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
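To see why the lookahead finds both, it can help to print where each match starts. In this sketch (variable names are illustrative) the overall match is empty, because a lookahead consumes nothing, and the text comes from group 1:

```python
import re

s = 'AbADaDD'
found = [(m.start(), m.group(0), m.group(1))
         for m in re.finditer(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)]

for start, whole, captured in found:
    print(start, repr(whole), captured)
# 0 '' AbAD
# 3 '' DaDD
```

The second match starts at index 3, inside the text of the first one, which is only possible because the first match did not consume those characters.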
For the 2nd one, you should use negative lookahead and lookbehind assertion like below,
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] has to consume a character, and there is no character before the first A for it to consume. The negative lookbehind (?<![A-Z]) checks the same condition but does not consume anything: it only asserts that the match is not preceded by an uppercase letter, which is also true at the start of the string. That's why your pattern doesn't produce any match.
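A small sketch of the difference between a consuming negated class and a zero-width lookbehind, using made-up two-character inputs:

```python
import re

# At the start of the string there is nothing for [^A-Z] to consume
consuming_at_start = re.findall(r'[^A-Z]Ab', 'Ab')
lookbehind_at_start = re.findall(r'(?<![A-Z])Ab', 'Ab')
print(consuming_at_start)   # []
print(lookbehind_at_start)  # ['Ab']

# With a leading space, [^A-Z] matches but the space becomes part of the match
consuming_with_space = re.findall(r'[^A-Z]Ab', ' Ab')
lookbehind_with_space = re.findall(r'(?<![A-Z])Ab', ' Ab')
print(consuming_with_space)   # [' Ab']
print(lookbehind_with_space)  # ['Ab']
```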
The problem with your regex is that it consumes the string as it progresses, leaving nothing for the second match. Use a lookahead to make sure it does not consume the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex again do the same.
print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))
For more insight into the first issue:
1) After the first match, the remaining string is aDD, because the first part has already been consumed.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it is not part of any match.
1st issue,
You should use this pattern,
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
2nd issue
You should use,
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']
