Python Regex - How to Get Positions and Values of Matches

How can I get the start and end positions of all matches using the re module? For example given the pattern r'[a-z]' and the string 'a1b2c3d4' I'd want to get the positions where it finds each letter. Ideally, I'd like to get the text of the match back too.

import re
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print(m.start(), m.group())

Taken from the Regular Expression HOWTO:
span() returns both start and end indexes in a single tuple. Since the
match method only checks if the RE matches at the start of a string,
start() will always be zero. However, the search method of RegexObject
instances scans through the string, so the match may not start at zero
in that case.
>>> p = re.compile('[a-z]+')
>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<re.Match object; span=(4, 11), match='message'>
>>> m.group()
'message'
>>> m.span()
(4, 11)
Combine that with:
In Python 2.2, the finditer() method is also available, returning a sequence of MatchObject instances as an iterator.
>>> p = re.compile( ... )
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
You should be able to do something along these lines:
import re
for match in re.finditer(r'[a-z]', 'a1b2c3d4'):
    print(match.span())

For Python 3.x:
from re import finditer
for match in finditer("pattern", "string"):
    print(match.span(), match.group())
For each hit in the string, this prints one line containing a (start, end) tuple, where end is the index just past the last matched character, followed by the matched text.
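Applied to the pattern and string from the question, the same idea can be sketched as follows. The helper name find_positions is made up for this illustration; it is not part of the re module.

```python
import re

def find_positions(pattern, text):
    """Return a list of (start, end, matched_text) tuples, one per match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

hits = find_positions(r"[a-z]", "a1b2c3d4")
for start, end, value in hits:
    # end is exclusive, so "a1b2c3d4"[start:end] reproduces the matched text
    print(start, end, value)
```

Collecting the tuples in a list is handy when you need the positions more than once, since the iterator returned by finditer can only be consumed a single time.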

Note that span() and group() take a group index when the regex has multiple capture groups; index 0 is the whole match:
import re
regex_with_3_groups = r"([a-z])([0-9]+)([A-Z])"
# `string` is the text you want to search
for match in re.finditer(regex_with_3_groups, string):
    for idx in range(0, 4):
        print(match.span(idx), match.group(idx))
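A runnable version of the group-index loop, using a made-up input string (sample is an assumption chosen only to produce two matches):

```python
import re

sample = "x7Y and q42Z"  # hypothetical input for illustration
regex_with_3_groups = r"([a-z])([0-9]+)([A-Z])"

group_spans = []
for match in re.finditer(regex_with_3_groups, sample):
    # Index 0 is the whole match; indexes 1..3 are the three capture groups.
    for idx in range(0, 4):
        group_spans.append((idx, match.span(idx), match.group(idx)))
        print(idx, match.span(idx), match.group(idx))
```

Each match contributes four entries: the overall span first, then the span of every group inside it.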

Related

Python regex that matches any word that contains exactly n digits, but can contain other characters too

e.g. if n=10, then the regex:
Should match:
(123)456-7890
(123)456-(7890)
a1b2c3ddd4e5ff6g7h8i9jj0k
But should not match:
(123)456-789
(123)456-(78901)
etc.
Note: I'm strictly looking for a regex and that is a hard constraint.
======================================
Edit: Other constraints
I am looking for a solution of the form:
regex = re.compile(r'?????????')
where:
regex.findall(s)
... returns a non-empty array for s in ['(123)456-7890','(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k']
and returns an empty array for s in ['(123)456-789', '(123)456-(78901)']
The regex ^\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*$ matches every line that contains exactly 10 digits. To make it work for n digits, build the pattern as "^" + r"\D*\d" * n + r"\D*$":
import re
n=10
regex = "^" + r"\D*\d" * n + r"\D*$"  # raw strings keep \D and \d intact
numbers='''(123)456-7890
(123)456-(7890)
a1b2c3ddd4e5ff6g7h8i9jj0k
(123)456-789
(123)456-(78901)'''
matches=re.findall(regex,numbers,re.M)
print(matches)
Or for a single match:
pattern = re.compile("^" + r"\D*\d" * n + r"\D*$")
m = pattern.match('(123)456-7890')
print(m.group(0) if m else None)  # (123)456-7890, or None if there is no match
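One caveat when running this pattern over multi-line text with re.M: \D also matches a newline, so a match can run across line boundaries and combine the digits of two adjacent lines. A minimal sketch of a stricter variant, assuming each candidate sits on its own line, replaces \D with [^\d\n]. The helper name n_digit_line_pattern and the test string are made up for this illustration.

```python
import re

def n_digit_line_pattern(n):
    # [^\d\n] means "neither a digit nor a newline", so a match cannot span lines.
    return re.compile(r"^(?:[^\d\n]*\d){%d}[^\d\n]*$" % n, re.MULTILINE)

two_short_lines = "12345\n67890"  # two lines with 5 digits each

# The \D-based pattern from above, compiled with MULTILINE.
loose = re.compile("^" + r"\D*\d" * 10 + r"\D*$", re.MULTILINE)

loose_hits = loose.findall(two_short_lines)
strict_hits = n_digit_line_pattern(10).findall(two_short_lines)
print(loose_hits)   # the two lines are joined into one match
print(strict_hits)  # no line has 10 digits on its own
```

On the five sample strings from the question both variants return the same three lines; the difference only shows up when short lines happen to add up to n digits.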
Alternatively, strip all non-digit characters from the input string and count what is left:
import re
def ensure_digits(s, limit=10):
    return len(re.sub(r'\D+', '', s)) == limit
print(ensure_digits('(123)456-(7890)', 10)) # True
print(ensure_digits('a1b2c3ddd4e5ff6g7h8i9jj0k', 10)) # True
print(ensure_digits('(123)456-(78901)', 10)) # False
\D+ - matches one or more non-digit characters
Version for a list of words:
def ensure_digits(words_lst, limit=10):
    pat = re.compile(r'\D+')
    return [w for w in words_lst if len(pat.sub('', w)) == limit]
print(ensure_digits(['(123)456-7890','(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k'], 10))
print(ensure_digits(['(123)456-789', '(123)456-(78901)'], 10))
prints consecutively:
['(123)456-7890', '(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k']
[]
You can use string formatting to inject the number of digits n into your pattern. You also need the MULTILINE flag so that ^ and $ match at the start and end of every line.
import re
txt = """(123)456-7890
(123)456-(7890)
a1b2c3ddd4e5ff6g7h8i9jj0k
(123)456-789
(123)456-(78901)"""
n = 10
rgx = re.compile(r"^(?:\D*\d\D*){%d}$" % n, re.MULTILINE)
result = rgx.findall(txt)
print(result)
Prints:
['(123)456-7890', '(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k']
This expression should validate exactly 10 digits:
^(?:\D*\d|\d\D*){10}\D*$
where 10 can simply be replaced with a variable n.
The expression is explained in the top right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against sample inputs.
Test
import re
print(re.findall(r"^(?:\D*\d|\d\D*){10}\D*$", "a1b2c3ddd4e5ff6g7h8i9jj0k"))
Output
['a1b2c3ddd4e5ff6g7h8i9jj0k']

How to get the first number from span=(2494, 2516) here?

I want to cut a text from the point where my regex expression is found to the end of the text. The position may vary, so I need that number as a variable.
The position can already be seen in the result of studentnrRegex.search(text):
>>> studentnrRegex = re.compile(r'(Studentnr = 18\d\d\d\d\d\d\d\d)')
>>> start = studentnrRegex.search(text)
>>> start
<_sre.SRE_Match object; span=(2494, 2516), match='Studentnr = 1825010243'>
>>> myText = text[2494:]
>>> myText
'Studentnr = 1825010243\nTEXT = blablabla
Can I get the start position as a variable directly from my variable start, in this case 2494?
The match object returned by calling .search() has .start() and .end() methods that return the starting and ending positions of the match.
studentnrRegex = re.compile(r'(Studentnr = 18\d\d\d\d\d\d\d\d)')
m = studentnrRegex.search(text)
start = m.start()  # m is None if nothing matched, so check it first in real code
print(text[start:])
You can accomplish the same thing with a different regex that matches the student number and everything after it. This will save you the trouble of doing the slice:
studentnrRegex = re.compile(r'(Studentnr = 18\d{8}).*', re.DOTALL)
m = studentnrRegex.search(text)
print(m.group())
The {8} matches 8 repeats of the \d and the .* matches all remaining characters until the end of the string (including newlines) as long as the re.DOTALL flag is specified. The full match is group 0, which is the default value for the .group() method of the match object. You can access the student number as m.group(1).
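Since .search() returns None when nothing matches, calling .start() directly can raise AttributeError. A minimal sketch that guards against that case follows; the helper name text_from_match and the sample text are made up for this illustration.

```python
import re

def text_from_match(pattern, text):
    """Return text from the start of the first match to the end, or None if no match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return text[m.start():]

# Hypothetical stand-in for the real document.
sample_text = "header line\nStudentnr = 1825010243\nTEXT = blablabla"
tail = text_from_match(r"Studentnr = 18\d{8}", sample_text)
print(tail)
```

The same function works wherever the match happens to sit, because the slice start comes from the match object rather than from a hard-coded number.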

get empty list when finding last match

I want to find the last word between slashes in a url. For example, find "nika" in "/gallery/haha/nika/7907/08-2015"
I wrote this in my python code:
>>> text = '/gallery/haha/nika/7907/08-2015'
>>> re.findall(r'/[a-zA-Z]*/$', text)
but I got an empty list:
[]
And if I delete that dollar sign:
>>> re.findall(r'/[a-zA-Z]*/', text)
The returned list is not empty, but '/haha/' is missing:
['/gallery/', '/nika/']
Does anybody know why?
Use lookarounds as in
re.findall(r'(?<=/)[a-zA-Z]*(?=/)', text)
$ anchors the pattern at the end of the string. Your text ends in /08-2015, not in /letters/, so nothing matches and you get an empty list.
haha is missing because each match consumes both of its slashes, so after /gallery/ no leading / is left for haha. Lookarounds are zero-width assertions: they do not consume the /, so every word is found.
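The difference between the consuming pattern and the lookaround version can be checked side by side. This is a small sketch using the text from the question; the variable names are chosen for illustration.

```python
import re

text = '/gallery/haha/nika/7907/08-2015'

# Each match consumes its closing slash, so the next word has no opening slash left.
consuming = re.findall(r'/[a-zA-Z]*/', text)

# Lookbehind and lookahead only check for the slashes without consuming them.
lookaround = re.findall(r'(?<=/)[a-zA-Z]*(?=/)', text)

# The last alphabetic segment is simply the last element, if there is one.
last_word = lookaround[-1] if lookaround else None

print(consuming)
print(lookaround)
print(last_word)
```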
You don't need regex for this,
>>> s = "/gallery/haha/nika/7907/08-2015"
>>> for i in reversed(s.split('/')):
...     if i.isalpha():
...         print(i)
...         break
...
nika
or
>>> [i for i in s.split('/') if i.isalpha()][-1]
'nika'
>>>
or
>>> j = s.split('/')
>>> [i for i in j if i.isalpha()][-1]
'nika'
I want to find the last word between slashes...
To get the last one, you can always put a greedy .* in front to eat up everything before it:
^.*/([a-zA-Z]*)/
and capture the wanted part in group 1 ($1). See the test at regex101.
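In Python, the $1 capture corresponds to group(1) on the match object. A minimal sketch with the text from the question:

```python
import re

text = '/gallery/haha/nika/7907/08-2015'

# The greedy .* first grabs the whole string, then backtracks until the
# remainder "/letters/" can still match, which yields the last such word.
m = re.search(r'^.*/([a-zA-Z]*)/', text)
last_word = m.group(1) if m else None
print(last_word)
```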

How can I find the index of a non-ASCII character in a Python string?

Python has str.find() and str.rfind() to get the index of a substring in a string.
There is also re.search(regex, string) to find the first occurrence of a pattern in a string, but that function returns a match object.
So I wonder how to merge the two: use a regex to search the string and return the first index as a plain integer, not a match object.
Example:
string = "abcdeÿÿaaaabbbÿÿcccdddÿÿeeeÿÿ"
print(custom(string))
Result:
5
The non-ASCII range is [^\x20-\x7E]. How can I implement this function?
If you want to combine the two functions, pass the text matched by re.search (group(0)) to find:
>>> g = "abcdeÿÿaaaabbbÿÿcccdddÿÿeeeÿÿ"
>>> import re
>>> g.find(re.search(r'[^\x20-\x7E]',g).group(0))
5
But if you just want the index, the match object returned by re.search has a start() method that returns the index where the match begins:
>>> re.search(r'[^\x20-\x7E]',g).start()
5
You can also do it without a regex, by comparing each character against the same range:
>>> next(i for i, j in enumerate(g) if not '\x20' <= j <= '\x7e')
5
"MatchObjects" have a start method you can use:
import re
def custom(s):
    mat = re.search(r'[^\x20-\x7E]', s)
    if mat:
        return mat.start()
    return -1  # no match found
string = "abcdeÿÿaaaabbbÿÿcccdddÿÿeeeÿÿ"
print(custom(string)) # 5
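If you need every position rather than just the first, the same character class can be combined with finditer. This is a sketch only; the names OUTSIDE_PRINTABLE_ASCII and all_indices are made up here. Note that [^\x20-\x7E] also flags ASCII control characters such as tabs and newlines, since it is the complement of the printable ASCII range.

```python
import re

# Complement of printable ASCII (space through tilde), as given in the question.
OUTSIDE_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7E]')

def all_indices(s):
    """Return the index of every character outside printable ASCII."""
    return [m.start() for m in OUTSIDE_PRINTABLE_ASCII.finditer(s)]

sample = "abcdeÿÿaaaabbbÿÿcccdddÿÿeeeÿÿ"
indices = all_indices(sample)
print(indices)
```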

Regex findall start() and end() ? Python

I'm trying to get the start and end positions of a query in a sequence using re.findall:
import re
sequence = 'aaabbbaaacccdddeeefff'
query = 'aaa'
findall = re.findall(query, sequence)
# ['aaa', 'aaa']
How do I get something like findall.start() or findall.end()?
I would like to get
start = [0, 6]
end = [2, 8]
I know that
search = re.search(query, sequence)
print(search.start(), search.end())
# 0 3
would give me only the first instance.
Use re.finditer:
>>> import re
>>> sequence = 'aaabbbaaacccdddeeefff'
>>> query = 'aaa'
>>> r = re.compile(query)
>>> [[m.start(),m.end()] for m in r.finditer(sequence)]
[[0, 3], [6, 9]]
From the docs:
Return an iterator yielding MatchObject instances over all
non-overlapping matches for the RE pattern in string. The string is
scanned left-to-right, and matches are returned in the order found.
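To get separate start and end lists in the shape the question asks for, the match objects can be unpacked as below. This is a sketch: re.escape is added on the assumption that query is meant as literal text rather than a pattern, and the inclusive end is simply end() minus one, because end() itself points one past the last matched character.

```python
import re

sequence = 'aaabbbaaacccdddeeefff'
query = 'aaa'

# re.escape treats query as literal text (an assumption about the intent).
matches = list(re.finditer(re.escape(query), sequence))

starts = [m.start() for m in matches]
ends_exclusive = [m.end() for m in matches]      # slice-style end
ends_inclusive = [m.end() - 1 for m in matches]  # index of the last matched character

print(starts, ends_exclusive, ends_inclusive)
```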
You can't. findall is a convenience function that, as the docs say, returns "a list of strings". If you want a list of MatchObjects, you can't use findall.
However, you can use finditer. If you're just iterating over the matches with for match in re.findall(…):, you can write for match in re.finditer(…): the same way, except that you get MatchObject values instead of strings. If you actually need a list, just use matches = list(re.finditer(…)).
Use finditer instead of findall. This gives you back an iterator yielding MatchObject instances and you can get start/end from the MatchObject.
