Regular expression in Python between two words

I am trying to extract a value from each string in this list:
l1 = [u'/worldcup/archive/southafrica2010/index.html', u'/worldcup/archive/germany2006/index.html', u'/worldcup/archive/edition=4395/index.html', u'/worldcup/archive/edition=1013/index.html', u'/worldcup/archive/edition=84/index.html', u'/worldcup/archive/edition=76/index.html', u'/worldcup/archive/edition=68/index.html', u'/worldcup/archive/edition=59/index.html', u'/worldcup/archive/edition=50/index.html', u'/worldcup/archive/edition=39/index.html', u'/worldcup/archive/edition=32/index.html', u'/worldcup/archive/edition=26/index.html', u'/worldcup/archive/edition=21/index.html', u'/worldcup/archive/edition=15/index.html', u'/worldcup/archive/edition=9/index.html', u'/worldcup/archive/edition=7/index.html', u'/worldcup/archive/edition=5/index.html', u'/worldcup/archive/edition=3/index.html', u'/worldcup/archive/edition=1/index.html']
I started with a regular expression like the one below:
m = re.search(r"\d+", l)
print m.group()
but I want the value between "archive/" and "/index.html".
I googled and tried something like (?<=archive/\/index.html).*(?=\/index.html:)
but it didn't work for me. How can I get a result list like this?
result = ['germany2006','edition=4395','edition=1013' , ...]

If you know for sure that the pattern will always match, you can use this:
import re
print([re.search(r"archive/(.*?)/index\.html", l).group(1) for l in l1])
Or you can simply split the string, like this:
print([l.rsplit("/", 2)[-2] for l in l1])
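
The list comprehension above raises AttributeError if any string fails to match, because re.search returns None in that case. Here is a minimal sketch that skips non-matching entries, assuming that silently dropping them is acceptable. The helper name extract_editions and the sample list are hypothetical, not part of the question.

```python
import re

# Compiled once; the dot is escaped so it matches a literal "."
ARCHIVE_RE = re.compile(r"archive/(.*?)/index\.html")

def extract_editions(paths):
    """Return the text between 'archive/' and '/index.html' for each matching path."""
    results = []
    for path in paths:
        match = ARCHIVE_RE.search(path)
        if match:  # re.search returns None when nothing matches
            results.append(match.group(1))
    return results

sample = [
    "/worldcup/archive/southafrica2010/index.html",
    "/worldcup/archive/edition=4395/index.html",
    "/worldcup/about.html",  # hypothetical non-matching entry
]
print(extract_editions(sample))  # ['southafrica2010', 'edition=4395']
```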

The code below should solve your problem:
>>> import re
>>> p = '/worldcup/archive/southafrica2010/index.html'
>>> r = re.compile('archive/(.*?)/index.html')
>>> m = r.search(p)
>>> m.group(1)
'southafrica2010'

Look-arounds are what you need. Use them like this:
>>> [re.search(r"(?<=archive/).*?(?=/index.html)", s).group() for s in l1]
[u'southafrica2010', u'germany2006', u'edition=4395', u'edition=1013', u'edition=84', u'edition=76', u'edition=68', u'edition=59', u'edition=50', u'edition=39', u'edition=32', u'edition=26', u'edition=21', u'edition=15', u'edition=9', u'edition=7', u'edition=5', u'edition=3', u'edition=1']

The regular expression
m = re.search(r'(?<=archive\/).+(?=\/index.html)', s)
can solve this, assuming that s is a string from your list.
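
One difference between the two look-around answers is the quantifier: .+ is greedy, while .*? or .+? is lazy. For the URLs in the question both give the same result, but they diverge when the closing delimiter occurs more than once. A small sketch, using a made-up string (not from the question) that contains the delimiters twice:

```python
import re

# Hypothetical input in which "archive/" and "/index.html" both appear twice
s = "/archive/a/index.html?next=/archive/b/index.html"

# Greedy: runs to the LAST "/index.html"
greedy = re.search(r"(?<=archive/).+(?=/index\.html)", s).group()
# Lazy: stops at the FIRST "/index.html"
lazy = re.search(r"(?<=archive/).+?(?=/index\.html)", s).group()

print(greedy)  # a/index.html?next=/archive/b
print(lazy)    # a
```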

Related

Regex multiple same pattern/repeated captures not work correctly, only match first and last

My regex:
联系人[::]\s{1,2}([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*
Test string:
联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大
Code
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> re.findall(p, s)
[('啊啊', '实打实大')]
# finditer
>>> for i in re.finditer(p, s):
...     print(i.groups())
...
('啊啊', '实打实大')
You can test it at https://regex101.com/ (regex101 could not save the regex at the time, so the match screenshots are not reproduced here).
I want all groups split by [\s,,、], but only the first and last are matched. I don't see anything wrong with my regex, yet the result is wrong; this has had me stuck for half an hour...
As I mentioned in my comments, you need to use re.search (to get a single match only) or re.finditer (to get multiple matches) and access the corresponding group captures (in your case, captures(2)). Note that captures() is provided by the third-party regex module, not by the standard re module:
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> res = []
>>> for x in re.finditer(p, s):
...     res.append(x.captures(2))
...
>>> print(res)
[['实打实大', '好说歹说', '实打实', '实打实大']]
>>> m = re.search(p, s)
>>> if m:
...     print(m.captures(2))
...
['实打实大', '好说歹说', '实打实', '实打实大']
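
With the standard re module a repeated group keeps only its last capture, which is why findall showed just the first and last names. If installing the regex module is not an option, a different technique is a two-step approach: capture everything after the label, then collect the names with findall. This is a sketch, not the answer's method; the full-width colon and comma in the character classes are an assumption about what the original pattern intended.

```python
import re  # standard library, not the third-party regex module

s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'

# Step 1: capture everything that follows the label and its colon.
# (The full-width colon ":" is an assumed alternative to the ASCII one.)
m = re.search(r'联系人[::]\s*(.+)', s)

# Step 2: collect every run of characters that are not separators.
names = re.findall(r'[^\s,,、]+', m.group(1)) if m else []
print(names)  # ['啊啊', '实打实大', '好说歹说', '实打实', '实打实大']
```

Unlike the original pattern, this does not limit the separators between names to one or two characters.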

Remove a pattern in all strings in a list

For example, if I have a list of strings
alist=['a_name1_1', 'a_name1_2', 'a_name1_3']
How do I get this:
alist_changed = ['a_n1_1', 'a_n1_2', 'a_n1_3']
alist_changed = [s.replace("ame", "") for s in alist]
If you need something that is actually pattern-based, you can use Python's re module and substitute the matched pattern with what you want.
import re
alist=['a_name1_1', 'a_name1_2', 'a_name1_3']
alist_changed = []
pattern = r'_\w*_'
for x in alist:
    # Replace only the first "_..._" run in each string
    y = re.sub(pattern, '_n1_', x, 1)
    #print(y)
    alist_changed.append(y)
print(alist_changed)
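
The substitution above hard-codes '_n1_' as the replacement, so it only works when every string contains name1. A hedged alternative, assuming the goal is to drop "ame" whenever it is directly followed by a digit, uses a lookahead so that the digit itself is kept:

```python
import re

alist = ['a_name1_1', 'a_name1_2', 'a_name1_3']

# (?=\d) asserts that a digit follows "ame" without consuming it
alist_changed = [re.sub(r'ame(?=\d)', '', s) for s in alist]
print(alist_changed)  # ['a_n1_1', 'a_n1_2', 'a_n1_3']
```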

mess with regex python

I am trying the following code, but it seems that I am doing something wrong.
import re
lista = ["\\hola\\01\\02Jan\\05\\03",
"\\hola\\01\\02Dem\\12",
"\\hola\\01\\02March\\12\\04"]
for l in lista:
    m = re.search("\\\\\d{2,2}\\\\\d{2,2}[a-zA-Z]+\\\\\d{2,2}\s", l)
    if m:
        print(m.group(0))
The result should be the second string only.
I have tried it without \s, but then the pattern matches all of the strings.
You can try this regex:
>>> import re
>>> lista = [r"\hola\01\02Jan\05\03", r"\hola\01\02Dem\12", r"\hola\01\02March\12\04"]
>>> for l in lista:
...     m = re.search(r"\\\d{2,2}\\\d{2,2}[a-zA-Z]+\\\d{2}$", l)
...     if m:
...         print(m.group())
...
Output:
\01\02Dem\12
Use r"..." form to declare a regex and input as raw string
Use anchor $ to avoid matching unwanted input
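
To illustrate both points, here is a small sketch that runs the same pattern with and without the $ anchor, and checks how a raw-string pattern compares with its ordinary-string spelling. The variable names unanchored and anchored are only for illustration.

```python
import re

lista = [r"\hola\01\02Jan\05\03", r"\hola\01\02Dem\12", r"\hola\01\02March\12\04"]

# Same pattern, with and without the end-of-string anchor
unanchored = re.compile(r"\\\d{2}\\\d{2}[a-zA-Z]+\\\d{2}")
anchored = re.compile(r"\\\d{2}\\\d{2}[a-zA-Z]+\\\d{2}$")

print([bool(unanchored.search(l)) for l in lista])  # [True, True, True]
print([bool(anchored.search(l)) for l in lista])    # [False, True, False]

# A raw string passes backslashes through unchanged; an ordinary string
# literal needs each of them doubled to express the same pattern.
print(r"\\\d{2}" == "\\\\\\d{2}")  # True
```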
You can use the following code without regex:
>>> for l in lista:
...     totalNo = l.count('\\')
...     if totalNo == 4:
...         print(l)

return all matches in regex in Python

>>> import re
>>> p=re.compile('(a(.)c)d')
Why does the following return only 'abcd' and not also 'aecd'? What should I do to return both? And what should I do to return only 'aecd'?
>>> m=p.match('abcdeaecd')
>>> m.group()
'abcd'
>>> m.groups()
('abc', 'b')
Thanks!
You can simplify your RegEx, like this
import re
p=re.compile(r'a.cd')
And use re.findall to get all the matches, like this
print(p.findall('abcdeaecd'))
# ['abcd', 'aecd']
Otherwise you can use your RegEx itself and iterate over the matches like this
print([item.group() for item in p.finditer('abcdeaecd')])
# ['abcd', 'aecd']
You will want to use finditer instead of match:
ms = p.finditer('abcdeaecd')
for m in ms:
    print(m.group())  # or inspect m.groups()
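
To tie the answers back to the question: match only tries the pattern at the start of the string, so it can return at most one match. The sketch below compares the calls on the original pattern and shows two possible ways to get only 'aecd' (taking the last finditer result, or starting the search at a later offset). Both are illustrations under the assumption that 'aecd' is wanted because it is the second match.

```python
import re

p = re.compile(r'(a(.)c)d')
text = 'abcdeaecd'

# match: anchored at position 0, so only the first occurrence is found
print(p.match(text).group())        # abcd

# findall with groups returns a tuple of the groups, not the whole match
print(p.findall(text))              # [('abc', 'b'), ('aec', 'e')]

# finditer yields one match object per occurrence
matches = [m.group() for m in p.finditer(text)]
print(matches)                      # ['abcd', 'aecd']

# Only 'aecd': take the last match, or search from a later position
print(matches[-1])                  # aecd
print(p.search(text, 1).group())    # aecd
```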

Python string split decimals from end of string

I use nlst on a ftp server which returns directories in the form of lists. The format of the returned list is as follows:
[xyz123,abcde345,pqrst678].
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123, i.e. split the string at the beginning of the integer part. Any help on this will be appreciated!
>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']
For example, using the re module:
>>> import re
>>> a = ['xyz123','ABCDE345','pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in a:
...     m = re.match(regex, item)
...     alpha, digits = m.groups()  # separate names, so the list a is not overwritten
...     print(alpha, digits)
...
xyz 123
ABCDE 345
pqRst 678
Use the regular expression module re:
import re
def splitEntry(entry):
    # re.search scans the whole string; re.match would only try position 0
    firstDecMatch = re.search(r"\d+$", entry)
    alpha, numeric = "", ""
    if firstDecMatch:
        pos = firstDecMatch.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else:  # no decimals found at end of string
        alpha = entry
    return (alpha, numeric)
Note that the regular expression is `\d+$`, which matches all digits at the end of the string. If the string has digits in the first part, it will not count those, e.g. xy3zzz134 -> "xy3zzz", "134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course it's still a problem if the filename itself ends with numbers.
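
The same "trailing digits only" behaviour can also be sketched with a single pattern and re.fullmatch, which is a different technique from the function above. The lazy (.*?) takes as little as possible and (\d*) then takes the run of digits that reaches the end of the string. The name split_trailing_digits is hypothetical, and the sketch assumes entries contain no newlines.

```python
import re

def split_trailing_digits(entry):
    """Split entry into (prefix, trailing_digits); digits may be empty."""
    # The pattern can always match: both groups are allowed to be empty.
    match = re.fullmatch(r"(.*?)(\d*)", entry)
    return match.group(1), match.group(2)

print(split_trailing_digits("xyz123"))     # ('xyz', '123')
print(split_trailing_digits("xy3zzz134"))  # ('xy3zzz', '134')
print(split_trailing_digits("abc"))        # ('abc', '')
```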
Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']
If you don't want to use regex, then you can do something like this. Note that I have not tested this so there could be a bug or typo somewhere.
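items = ["xyz123", "abcde345", "pqrst678"]  # renamed so the built-in list is not shadowed
newlist = []
for item in items:
    for char in range(0, len(item)):
        if item[char].isnumeric():
            # Split at the first numeric character, then move to the next item
            newlist.append([item[:char], item[char:]])
            break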
>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]
I don't think it's that difficult without re:
>>> s="xyz123"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']
