return all matches in regex in Python


>>> import re
>>> p=re.compile('(a(.)c)d')
Why does the following return only 'abcd' and not also 'aecd'? How can I return both? And what should I do if I want only 'aecd'?
>>> m=p.match('abcdeaecd')
>>> m.group()
'abcd'
>>> m.groups()
('abc', 'b')
Thanks!

You can simplify your regex like this:
import re
p = re.compile(r'a.cd')
Then use findall to get all the matches:
print(p.findall('abcdeaecd'))
# ['abcd', 'aecd']
Alternatively, keep your original regex and iterate over the matches:
print([item.group() for item in p.finditer('abcdeaecd')])
# ['abcd', 'aecd']

You will want to use finditer instead of match:
ms = p.finditer('abcdeaecd')
for m in ms:
    # do something with m.group() or m.groups()
    print(m.group(), m.groups())
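The question also asks how to get only 'aecd', which the answers above do not cover. A minimal sketch comparing the calls on the question's own pattern; the variable names are illustrative only:

```python
import re

p = re.compile(r'(a(.)c)d')
text = 'abcdeaecd'

# match() only tries at the start of the string, so it yields one result.
first = p.match(text).group()

# With capture groups in the pattern, findall() returns the groups,
# not the whole matched text.
groups_only = p.findall(text)

# finditer() yields match objects, so the full match stays available.
all_matches = [m.group() for m in p.finditer(text)]

# Two ways to get only the later match: take the last item,
# or start a new search where the first match ended.
last_only = all_matches[-1]
skip_first = p.search(text, p.match(text).end()).group()

print(first, groups_only, all_matches, last_only, skip_first)
```

Note that the pos argument used in the last call is available on compiled pattern objects, not on the module-level re.search function.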

Related

Regular expressions (regex) - How to split a string by the first X digits appear in it?

I've been struggling to find the right regex (Python) to cover my requirement:
I want to split a string according to the first place in which there are 6 digits.
For example -
stringA = 'abcdf123456789'
Will ideally be cut into -
StringB='abcdf123456'
StringC='789'
So far - This is the solution I came up with:
x = re.split("(?=[0-9])", stringA)
And then loop over the results while counting the characters.
Your help will be greatly appreciated!

import re
stringA = 'abcdf123456789'
result = re.split("(?<=[0-9]{6})",stringA,maxsplit=1)
print(result)
# ['abcdf123456', '789']

Using a lookbehind:
>>> stringA = 'abcdf123456789'
>>> re.split(r'(?<=\d{6})', stringA, maxsplit=1)
['abcdf123456', '789']

You may use this code with 2 capture groups:
>>> import re
>>> stringA = 'abcdf123456789'
>>> [(stringB,stringC)] = re.findall(r'(.*?\d{6})(.*)', stringA)
>>> print (stringB)
abcdf123456
>>> print (stringC)
789

You can split on 6 digits with maxsplit=1, and capture the group you split on, then you can build your strings easily:
import re
stringA = 'abcdf123456789'
split = re.split(r'(\d{6})', stringA, maxsplit=1)
# split is now ['abcdf', '123456', '789']
stringB = ''.join(split[:2])
stringC = split[2]
print(stringB)
print(stringC)
# abcdf123456
# 789

import re
stringA = 'abcdf123456789'
index = re.search(r'\d{6}', stringA).end()
stringB = stringA[:index]
stringC = stringA[index:]
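The search-based approach above raises AttributeError when the string has no run of 6 digits, because re.search returns None. A minimal sketch that wraps it in a helper; the function name and the choice to return an empty second part are assumptions, not part of the original answers. It also avoids splitting on an empty match, which, as far as I know, re.split only supports from Python 3.7 onward.

```python
import re

def split_after_first_digits(text, count=6):
    """Split text right after the first run of `count` consecutive digits.

    Hypothetical helper: returns (text, '') when no such run exists.
    """
    m = re.search(r'\d{%d}' % count, text)
    if m is None:
        return text, ''
    return text[:m.end()], text[m.end():]

print(split_after_first_digits('abcdf123456789'))
print(split_after_first_digits('abc123'))
```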

You can use findall() with capture groups, i.e. '()', to pull out the parts you need:
import re
stringA = 'abcdf123456789'
pattern = r"([\D]*\d{6})(.*)"
result = re.findall(pattern, stringA)
print(result)
#output [('abcdf123456', '789')]

Regex multiple same pattern/repeated captures not work correctly, only match first and last

My regex:
联系人[:：]\s{1,2}([^\s,，、]+)(?:[\s,，、]{1,2}([^\s,，、]+))*
Test string:
联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大
Code
>>> import regex as re
>>> p = r'联系人[:：]\s*([^\s,，、]+)(?:[\s,，、]{1,2}([^\s,，、]+))*'
>>> s = '联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大'
>>> re.findall(p, s)
[('啊啊', '实打实大')]
# finditer
>>> for i in re.finditer(p, s):
...     print(i.groups())
...
('啊啊', '实打实大')
Matches:
You can test it here https://regex101.com/
(regex101 can't save regex now, so I have to post above pics)
I want every item separated by [\s,，、] to be captured, but only the first and last are matched. I can't see anything wrong with my regex, yet the result is wrong; I have been stuck on this for half an hour...

As I mentioned in my comments, you need to use re.search (to get a single match only) or re.finditer (to get multiple matches) and access the captures of the repeated group through captures(), which the third-party regex module provides (in your case, it is captures(2)):
>>> import regex as re
>>> p = r'联系人[:：]\s*([^\s,，、]+)(?:[\s,，、]{1,2}([^\s,，、]+))*'
>>> s = '联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大'
>>> res = []
>>> for x in re.finditer(p, s):
...     res.append(x.captures(2))
...
>>> print(res)
[['实打实大', '好说歹说', '实打实', '实打实大']]
>>> m = re.search(p, s)
>>> if m:
...     print(m.captures(2))
...
['实打实大', '好说歹说', '实打实', '实打实大']
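captures() exists only in the third-party regex package; the standard-library re module keeps just the last capture of a repeated group, which is why findall showed only two items. A minimal sketch of one workaround using re alone, assuming it is acceptable to split on any run of the separators (looser than the {1,2} limit in the original pattern):

```python
import re

s = '联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大'

# Step 1: find the label and capture everything that follows it.
m = re.search(r'联系人[:：]\s*(.*)', s)

# Step 2: split the remainder on runs of whitespace or the separator characters.
names = re.split(r'[\s,，、]+', m.group(1).strip()) if m else []

print(names)
```

Unlike captures(2), this list also includes the first name, since it is no longer held in a separate group.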

mess with regex python

I am trying the following code, but it seems I am doing something wrong.
import re
lista = ["\\hola\\01\\02Jan\\05\\03",
         "\\hola\\01\\02Dem\\12",
         "\\hola\\01\\02March\\12\\04"]
for l in lista:
    m= re.search("\\\\\d{2,2}\\\\\d{2,2}[a-zA-Z]+\\\\\d{2,2}\s",l)
    if m:
        print (m.group(0))
The result should be the second string.
I have tried without \s, but then the pattern matches all the strings.

You can try this regex:
>>> import re
>>> lista = [r"\hola\01\02Jan\05\03", r"\hola\01\02Dem\12", r"\hola\01\02March\12\04"]
>>> for l in lista:
...     m = re.search(r"\\\d{2,2}\\\d{2,2}[a-zA-Z]+\\\d{2}$", l)
...     if m:
...         print(m.group())
...
Output:
\01\02Dem\12
Use the r"..." form to declare both the regex and the input as raw strings.
Use the anchor $ to avoid matching unwanted input.
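The two tips can be checked in a short script. This is a sketch using the same three strings; the compiled pattern and the list comprehension are just one way to arrange it:

```python
import re

lista = [r"\hola\01\02Jan\05\03",
         r"\hola\01\02Dem\12",
         r"\hola\01\02March\12\04"]

# In a raw pattern, \\ matches one literal backslash and \d matches one digit.
# The trailing $ rejects strings that continue after the last two digits.
pattern = re.compile(r"\\\d{2}\\\d{2}[a-zA-Z]+\\\d{2}$")

matches = [m.group() for m in map(pattern.search, lista) if m]
print(matches)
```

Without the $ anchor, the same pattern would also match a prefix of the first and third strings, which is the behaviour described in the question.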

You can use the following code without regex:
>>> for l in lista:
...     totalNo = l.count('\\')
...     if totalNo == 4:
...         print(l)

regular expression in python between two words

I am trying to extract a value from each string in this list:
l1 = [u'/worldcup/archive/southafrica2010/index.html', u'/worldcup/archive/germany2006/index.html', u'/worldcup/archive/edition=4395/index.html', u'/worldcup/archive/edition=1013/index.html', u'/worldcup/archive/edition=84/index.html', u'/worldcup/archive/edition=76/index.html', u'/worldcup/archive/edition=68/index.html', u'/worldcup/archive/edition=59/index.html', u'/worldcup/archive/edition=50/index.html', u'/worldcup/archive/edition=39/index.html', u'/worldcup/archive/edition=32/index.html', u'/worldcup/archive/edition=26/index.html', u'/worldcup/archive/edition=21/index.html', u'/worldcup/archive/edition=15/index.html', u'/worldcup/archive/edition=9/index.html', u'/worldcup/archive/edition=7/index.html', u'/worldcup/archive/edition=5/index.html', u'/worldcup/archive/edition=3/index.html', u'/worldcup/archive/edition=1/index.html']
I'm starting off with a regular expression like this:
m = re.search(r"\d+", l)
print(m.group())
but I want the value between "archive/" and "/index.html".
I googled and tried something like (?<=archive/\/index.html).*(?=\/index.html:)
but it didn't work for me. How can I get my result list as:
result = ['germany2006','edition=4395','edition=1013' , ...]

If you know for sure that the pattern will always match, you can use this:
import re
print([re.search("archive/(.*?)/index.html", l).group(1) for l in l1])
Or you can simply split like this:
print([l.rsplit("/", 2)[-2] for l in l1])

The code below should solve your problem:
>>> import re
>>> p = '/worldcup/archive/southafrica2010/index.html'
>>> r = re.compile('archive/(.*?)/index.html')
>>> m = r.search(p)
>>> m.group(1)
'southafrica2010'

Lookarounds are what you need. Use them like this:
>>> [re.search(r"(?<=archive/).*?(?=/index.html)", s).group() for s in l1]
[u'southafrica2010', u'germany2006', u'edition=4395', u'edition=1013', u'edition=84', u'edition=76', u'edition=68', u'edition=59', u'edition=50', u'edition=39', u'edition=32', u'edition=26', u'edition=21', u'edition=15', u'edition=9', u'edition=7', u'edition=5', u'edition=3', u'edition=1']

The regular expression
m = re.search(r'(?<=archive\/).+(?=\/index.html)', s)
solves this, assuming s is a string from your list.
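All of the answers above call .group() directly on the result of re.search, which fails with AttributeError if a URL does not fit the pattern. A minimal sketch that skips non-matching entries instead; the third URL is a hypothetical example added for illustration, and the dot in index.html is escaped so it matches only a literal dot:

```python
import re

# A short excerpt of l1, plus one hypothetical URL that does not fit the pattern.
urls = ['/worldcup/archive/southafrica2010/index.html',
        '/worldcup/archive/edition=4395/index.html',
        '/worldcup/other/page.html']

pattern = re.compile(r'archive/(.*?)/index\.html')

# Keep only the entries where search() found something.
result = [m.group(1) for m in map(pattern.search, urls) if m]
print(result)
```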

Python string split decimals from end of string

I use nlst on an FTP server, which returns the directory names as a list in the following format:
[xyz123,abcde345,pqrst678]
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123, i.e. split the string at the beginning of the integer part. Any help on this will be appreciated!

>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']

For example, using the re module:
>>> import re
>>> a = ['xyz123','ABCDE345','pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in a:
...     m = re.match(regex, item)
...     alpha, num = m.groups()
...     print(alpha, num)
...
xyz 123
ABCDE 345
pqRst 678

Use the regular expression module re:
import re
def splitEntry(entry):
    # search (not match), so the pattern can be found at the end of the string
    firstDecMatch = re.search(r"\d+$", entry)
    alpha, numeric = "", ""
    if firstDecMatch:
        pos = firstDecMatch.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else:  # no decimals found at end of string
        alpha = entry
    return (alpha, numeric)
Note that the regular expression is `\d+$`, which matches all decimal digits at the end of the string. Digits earlier in the string are not counted, e.g. xy3zzz134 -> "xy3zzz", "134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course, it is still a problem if the filename itself ends with numbers.

Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']

If you don't want to use regex, then you can do something like this. Note that I have not tested this, so there could be a bug or typo somewhere.
items = ["xyz123", "abcde345", "pqrst678"]
newlist = []
for item in items:
    for char in range(0, len(item)):
        if item[char].isnumeric():
            newlist.append([item[:char], item[char:]])
            break

>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]
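The findall(...)[0] form above raises IndexError for an entry with no trailing digits. A minimal sketch of a variant that tolerates such entries, assuming an empty numeric part is an acceptable result; the last two entries are hypothetical examples added for illustration:

```python
import re

entries = ['xyz123', 'ABCDE345', 'pqRst678', 'xy3zzz134', 'nodigits']

# Lazy prefix plus an optional run of digits; fullmatch() forces the digit
# group to sit at the very end of the string.
pattern = re.compile(r'(.*?)(\d*)')

parts = [pattern.fullmatch(e).groups() for e in entries]
print(parts)
```

Because . does not match a newline by default, this sketch assumes each entry is a single line, which should hold for directory names.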

I don't think it's that difficult without re:
>>> s="xyz123"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']
