return all matches in regex in Python - python

>>> import re
>>> p=re.compile('(a(.)c)d')
Why does the following return only 'abcd' and not also 'aecd'? If I want both, what should I do? And if I want only 'aecd', what should I do?
>>> m=p.match('abcdeaecd')
>>> m.group()
'abcd'
>>> m.groups()
('abc', 'b')
Thanks!

You can simplify your regex like this:
import re
p = re.compile(r'a.cd')
Then use findall to get all the matches:
print(p.findall('abcdeaecd'))
# ['abcd', 'aecd']
Otherwise, you can keep your original regex and iterate over the matches with finditer:
print([item.group() for item in p.finditer('abcdeaecd')])
# ['abcd', 'aecd']

You will want to use finditer instead of match:
ms = p.finditer('abcdeaecd')
for m in ms:
    # do something with m.group() or m.groups()
    print(m.group())
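To tie these answers back to the original question, here is a minimal sketch, assuming Python 3. The names text, first, all_full and second are illustrative and not from the original posts. It shows why match returns only 'abcd', how to get every match, and one possible way to get only 'aecd'.

```python
import re

p = re.compile(r'(a(.)c)d')
text = 'abcdeaecd'

# match() only tries at the start of the string and returns a single match.
first = p.match(text)
print(first.group())  # abcd

# finditer() yields a match object for every non-overlapping match.
all_full = [m.group() for m in p.finditer(text)]
print(all_full)  # ['abcd', 'aecd']

# findall() on a pattern with groups returns tuples of the groups,
# not the full matched text.
print(p.findall(text))  # [('abc', 'b'), ('aec', 'e')]

# To get only the second match, one option is to search again
# starting where the first match ended.
second = p.search(text, first.end())
print(second.group())  # aecd
```

Because the original pattern contains capture groups, findall returns group tuples, which is why the first answer either drops the groups or switches to finditer.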

Related

Regular expressions (regex) - How to split a string by the first X digits appear in it?

I've been struggling to find the right regex (Python) to cover my requirement:
I want to split a string according to the first place in which there are 6 digits.
For example -
stringA = 'abcdf123456789'
Will ideally be cut into -
StringB='abcdf123456'
StringC='789'
So far - This is the solution I came up with:
x = re.split("(?=[0-9])", stringA)
and then loop over the results while counting the characters.
Your help will be greatly appreciated!
import re
stringA = 'abcdf123456789'
result = re.split("(?<=[0-9]{6})",stringA,maxsplit=1)
print(result)
# ['abcdf123456', '789']
Using a lookbehind:
>>> stringA = 'abcdf123456789'
>>> re.split(r'(?<=\d{6})', stringA, maxsplit=1)
['abcdf123456', '789']
You may use this code with 2 capture groups:
>>> import re
>>> stringA = 'abcdf123456789'
>>> [(stringB,stringC)] = re.findall(r'(.*?\d{6})(.*)', stringA)
>>> print (stringB)
abcdf123456
>>> print (stringC)
789
You can split on 6 digits with maxsplit=1, and capture the group you split on, then you can build your strings easily:
import re
stringA = 'abcdf123456789'
split = re.split(r'(\d{6})', stringA, maxsplit=1)
# split is now ['abcdf', '123456', '789']
stringB = ''.join(split[:2])
stringC = split[2]
print(stringB)
print(stringC)
# abcdf123456
# 789
import re
stringA = 'abcdf123456789'
index = re.search(r'\d{6}', stringA).end()
stringB = stringA[:index]
stringC = stringA[index:]
You can also use findall() with capture groups (the parentheses) to pull out the two parts you need:
import re
stringA = 'abcdf123456789'
pattern = r"([\D]*\d{6})(.*)"
result = re.findall(pattern, stringA)
print(result)
#output [('abcdf123456', '789')]
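The answers above assume the string always contains six consecutive digits. The search-based version, for example, raises AttributeError when it does not, because re.search returns None. The lookbehind versions also rely on re.split accepting a pattern that matches an empty string, which requires Python 3.7 or later. A minimal sketch of a more defensive variant follows; the helper name split_after_first_six_digits and its fallback behaviour are assumptions made for illustration.

```python
import re

def split_after_first_six_digits(s):
    """Split s right after the first run of six digits.

    Hypothetical helper: if no such run exists, return the whole
    string and an empty remainder instead of raising an error.
    """
    m = re.search(r'\d{6}', s)
    if m is None:
        return s, ''
    return s[:m.end()], s[m.end():]

print(split_after_first_six_digits('abcdf123456789'))  # ('abcdf123456', '789')
print(split_after_first_six_digits('abc12'))           # ('abc12', '')
```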

Regex with repeated captures of the same pattern does not work correctly, only the first and last are matched

My regex:
联系人[::]\s{1,2}([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*
Test string:
联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大
Code
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> re.findall(p, s)
[('啊啊', '实打实大')]
# finditer
>>> for i in re.finditer(p, s):
...     print(i.groups())
...
('啊啊', '实打实大')
You can test it at https://regex101.com/ (regex101 could not save the regex at the time, so the original post included screenshots of the matches).
I want every item separated by [\s,,、] to be captured, but only the first and the last are returned. I can't see anything wrong with my regex, yet the result is wrong. This has had me stuck for half an hour...
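As I mentioned in my comments, you need to use re.search (to get a single match only) or re.finditer (to get multiple matches) and access the captures of the corresponding group (in your case, captures(2)). Note that captures() is provided by the third-party regex module; a repeated group otherwise reports only its last capture through groups():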
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> res = []
>>> for x in re.finditer(p, s):
...     res.append(x.captures(2))
...
>>> print(res)
[['实打实大', '好说歹说', '实打实', '实打实大']]
>>> m = re.search(p, s)
>>> if m:
...     print(m.captures(2))
...
['实打实大', '好说歹说', '实打实', '实打实大']
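If installing the third-party regex module is not an option, a two-step approach with the standard re module gives a similar result: first capture everything after the label, then split that text on the separators. This is only a sketch under stated assumptions. The simplified patterns below are not the asker's original regex, and the fullwidth colon and comma in the character classes are assumptions about the intended input.

```python
import re

s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'

# Step 1: capture the whole list that follows the label.
m = re.search(r'联系人[::]\s*(.+)', s)

# Step 2: split the captured text on runs of separators.
names = re.split(r'[\s,,、]+', m.group(1).strip()) if m else []

print(names)  # ['啊啊', '实打实大', '好说歹说', '实打实', '实打实大']
```

Unlike captures(2) above, this list also includes the first name, because the whole tail is split at once.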

mess with regex python

I am trying the following code, but it seems I am doing something wrong.
import re
lista = ["\\hola\\01\\02Jan\\05\\03",
         "\\hola\\01\\02Dem\\12",
         "\\hola\\01\\02March\\12\\04"]
for l in lista:
    m = re.search("\\\\\d{2,2}\\\\\d{2,2}[a-zA-Z]+\\\\\d{2,2}\s", l)
    if m:
        print (m.group(0))
The result should be the second string.
I have tried it without \s, but then the pattern matches all of the strings.
You can try this regex:
>>> import re
>>> lista = [r"\hola\01\02Jan\05\03", r"\hola\01\02Dem\12", r"\hola\01\02March\12\04"]
>>> for l in lista:
...     m = re.search(r"\\\d{2}\\\d{2}[a-zA-Z]+\\\d{2}$", l)
...     if m:
...         print(m.group())
...
Output:
\01\02Dem\12
Use the r"..." form to declare both the regex and the input as raw strings.
Use the anchor $ so that the match must end at the end of the string. The \s in your pattern requires a whitespace character, which none of the strings contains, so it never matched.
You can use the following code without regex, which keeps only the strings that contain exactly four backslashes:
>>> for l in lista:
...     totalNo = l.count('\\')
...     if totalNo == 4:
...         print(l)
...
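The two approaches above can be checked against each other. The following is a minimal sketch, assuming Python 3; the names pattern, by_regex and by_count are illustrative. It collects the full strings selected by the anchored regex and by the backslash count and shows that both pick only the second entry.

```python
import re

lista = [r"\hola\01\02Jan\05\03",
         r"\hola\01\02Dem\12",
         r"\hola\01\02March\12\04"]

# Two digits, two digits plus letters, two digits, then end of string.
pattern = re.compile(r"\\\d{2}\\\d{2}[a-zA-Z]+\\\d{2}$")

# Strings where the anchored pattern is found.
by_regex = [l for l in lista if pattern.search(l)]

# Strings with exactly four backslashes (the non-regex approach).
by_count = [l for l in lista if l.count('\\') == 4]

print(by_regex)
print(by_regex == by_count)  # True
```

The counting approach only works because the inputs share a fixed layout; the regex also checks that each segment has the expected digits and letters.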

regular expression in python between two words

I am trying to extract a value from each string in this list:
l1 = [u'/worldcup/archive/southafrica2010/index.html', u'/worldcup/archive/germany2006/index.html', u'/worldcup/archive/edition=4395/index.html', u'/worldcup/archive/edition=1013/index.html', u'/worldcup/archive/edition=84/index.html', u'/worldcup/archive/edition=76/index.html', u'/worldcup/archive/edition=68/index.html', u'/worldcup/archive/edition=59/index.html', u'/worldcup/archive/edition=50/index.html', u'/worldcup/archive/edition=39/index.html', u'/worldcup/archive/edition=32/index.html', u'/worldcup/archive/edition=26/index.html', u'/worldcup/archive/edition=21/index.html', u'/worldcup/archive/edition=15/index.html', u'/worldcup/archive/edition=9/index.html', u'/worldcup/archive/edition=7/index.html', u'/worldcup/archive/edition=5/index.html', u'/worldcup/archive/edition=3/index.html', u'/worldcup/archive/edition=1/index.html']
I'm trying to use a regular expression, starting with something like this:
m = re.search(r"\d+", l)
print m.group()
but I want the value between "archive/" and "/index.html".
I googled and tried something like (?<=archive/\/index.html).*(?=\/index.html:)
but it didn't work for me. How can I get a result list like this?
result = ['germany2006','edition=4395','edition=1013' , ...]
If you know for sure that the pattern will always match, you can use this:
import re
print([re.search("archive/(.*?)/index.html", l).group(1) for l in l1])
Or you can simply split, like this:
print([l.rsplit("/", 2)[-2] for l in l1])
The code below shows one way to solve your problem:
>>> import re
>>> p = '/worldcup/archive/southafrica2010/index.html'
>>> r = re.compile('archive/(.*?)/index.html')
>>> m = r.search(p)
>>> m.group(1)
'southafrica2010'
Look-arounds are what you need. Use them like this:
>>> [re.search(r"(?<=archive/).*?(?=/index.html)", s).group() for s in l1]
[u'southafrica2010', u'germany2006', u'edition=4395', u'edition=1013', u'edition=84', u'edition=76', u'edition=68', u'edition=59', u'edition=50', u'edition=39', u'edition=32', u'edition=26', u'edition=21', u'edition=15', u'edition=9', u'edition=7', u'edition=5', u'edition=3', u'edition=1']
The following regular expression can solve this, assuming s is a string from your list; the value is then available as m.group():
m = re.search(r'(?<=archive/).+(?=/index\.html)', s)
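Most of the answers above call .group() directly on the result of re.search, which raises AttributeError for any entry that does not match. A minimal sketch that skips such entries instead, assuming Python 3, is shown below. The shortened list paths and its last, non-matching entry are made up for illustration.

```python
import re

# A shortened copy of the list from the question, plus one entry that
# does not fit the pattern (added here purely for illustration).
paths = [
    '/worldcup/archive/southafrica2010/index.html',
    '/worldcup/archive/germany2006/index.html',
    '/worldcup/archive/edition=4395/index.html',
    '/worldcup/index.html',
]

# Capture whatever sits between "archive/" and "/index.html".
pattern = re.compile(r'archive/(.*?)/index\.html')

result = []
for path in paths:
    m = pattern.search(path)
    if m:  # skip entries that do not contain the expected structure
        result.append(m.group(1))

print(result)  # ['southafrica2010', 'germany2006', 'edition=4395']
```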

Python string split decimals from end of string

I use nlst on an FTP server, which returns the directory names as a list. The format of the returned list is as follows:
[xyz123,abcde345,pqrst678].
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123, i.e. split the string at the beginning of the integer part. Any help with this would be appreciated!
>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']
For example, using the re module:
>>> import re
>>> a = ['xyz123','ABCDE345','pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in a:
...     m = re.match(regex, item)
...     alpha, digits = m.groups()
...     print(alpha, digits)
...
xyz 123
ABCDE 345
pqRst 678
Use the regular expression module re:
import re
def splitEntry(entry):
    # re.search is needed here: re.match only matches at the start of the string
    firstDecMatch = re.search(r"\d+$", entry)
    alpha, numeric = "", ""
    if firstDecMatch:
        pos = firstDecMatch.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else:  # no digits found at end of string
        alpha = entry
    return (alpha, numeric)
Note that the regular expression is r"\d+$", which matches all the digits at the end of the string. If the string has digits in the first part, it will not count those, e.g. xy3zzz134 -> "xy3zzz","134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course, it's still a problem if the filename itself ends with numbers.
Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']
If you don't want to use regex, then you can do something like this. Note that I have not tested this, so there could be a bug or typo somewhere.
items = ["xyz123", "abcde345", "pqrst678"]
newlist = []
for item in items:
    for pos in range(len(item)):
        # split at the first digit found
        if item[pos].isdigit():
            newlist.append([item[:pos], item[pos:]])
            break
>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]
I don't think it's that difficult without re:
>>> s="xyz123"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']
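Several of the answers above fail or silently skip an entry when it has no digits at all, and some split at the first digit rather than at the trailing run. The following is a minimal sketch of one way to cover those cases with a single pattern, assuming Python 3.4 or later for re.fullmatch. The helper name split_trailing_digits is made up for illustration.

```python
import re

def split_trailing_digits(entry):
    """Split entry into (prefix, trailing_digits).

    Hypothetical helper: the lazy first group takes as little as
    possible, so the second group gets the whole run of digits at the
    end. Either part may be empty.
    """
    m = re.fullmatch(r'(.*?)(\d*)', entry, re.DOTALL)
    return m.group(1), m.group(2)

for name in ['xyz123', 'ABCDE345', 'xy3zzz134', 'nodigits']:
    print(split_trailing_digits(name))
```

Digits in the middle of a name stay in the first part, so 'xy3zzz134' splits into 'xy3zzz' and '134', and a name without trailing digits comes back with an empty second part.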
