RegEx: Find all digits after certain string - python

I am trying to get all the digits that follow the word classes (or its variations) in the following string:
Accepted for all the goods and services in classes 16 and 41.
expected output:
16
41
I have multiple strings that follow this pattern, and some others such as:
classes 5 et 30 # expected output 5, 30
class(es) 32,33 # expected output 32, 33
class 16 # expected output 16
Here is what I have tried so far: https://regex101.com/r/eU7dF6/3
(class[\(es\)]*)([and|et|,|\s]*(\d{1,}))+
But I am able to get only the last matched digit i.e. 41 in the above example.

I suggest grabbing all the substrings with numbers after class or classes/class(es) and then getting all the numbers from those:
import re
p = re.compile(r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*\d+)+')
test_str = "Accepted for all the goods and services in classes 16 and 41."
results = [re.findall(r"\d+", x) for x in p.findall(test_str)]
print([x for l in results for x in l])
# => ['16', '41']
See IDEONE demo
As the \G construct is not supported by the Python re module, and re gives you no access to the captures stack, your approach cannot work with re.
However, you can do it your way with the PyPI regex module.
>>> import regex
>>> test_str = "Accepted for all the goods and services in classes 16 and 41."
>>> rx = r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*(?P<num>\d+))+'
>>> res = []
>>> for x in regex.finditer(rx, test_str):
...     res.extend(x.captures("num"))
...
>>> print(res)
['16', '41']
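To see why a single re pattern returns only 41, here is a minimal sketch. The simplified pattern is an assumption made for illustration, not the asker's exact regex: a capturing group placed inside a repeated group is overwritten on every repetition, so only the last value is kept.

```python
import re

text = "Accepted for all the goods and services in classes 16 and 41."

# Simplified, hypothetical pattern: the capturing group (\d+) sits inside a repeated group.
m = re.search(r'class(?:es)?(?:\s*(?:and|et|,)?\s*(\d+))+', text)

whole_match = m.group(0)  # the full matched substring
last_number = m.group(1)  # only the capture from the final repetition survives
print(whole_match)
print(last_number)
```

The whole match spans both numbers, but group(1) holds only the last one, which is why a second findall pass (or the regex module's captures()) is needed.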

You can do it in 2 steps. The regex engine remembers only the last match of a repeated capturing group.
import re
x="""Accepted for all the goods and services in classes 16 and 41."""
print(re.findall(r"\d+",re.findall(r"class[\(es\)]*\s*(\d+(?:(?:and|et|,|\s)*\d+)*)",x)[0]))
Output:['16', '41']
If you don't want strings, use:
import ast
print(list(map(ast.literal_eval,re.findall(r"\d+",re.findall(r"class[\(es\)]*\s*(\d+(?:(?:and|et|,|\s)*\d+)*)",x)[0]))))
Output:[16, 41]
If you have to do it in one regex, use the regex module:
import ast
import regex
x="""Accepted for all the goods and services in classes 16 and 41."""
print([ast.literal_eval(i) for i in regex.findall(r"class[\(es\)]*|\G(?:and|et|,|\s)*(\d+)",x,regex.VERSION1) if i])
Output:[16, 41]
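The two-step idea can be wrapped in a small helper and checked against the variants listed in the question. This is only a sketch: the name extract_class_numbers is hypothetical, and the outer pattern is the one from the first answer.

```python
import re

# Step 1 pattern (taken from the first answer): "class", "classes" or "class(es)"
# followed by one or more numbers separated by "and", "et", commas or whitespace.
CLASS_RUN = re.compile(r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*\d+)+')

def extract_class_numbers(text):
    """Hypothetical helper: return every number that follows a 'class...' keyword."""
    numbers = []
    for run in CLASS_RUN.findall(text):
        # Step 2: pull the individual digit runs out of each matched substring.
        numbers.extend(re.findall(r'\d+', run))
    return numbers

for sample in ["classes 5 et 30", "class(es) 32,33", "class 16"]:
    print(sample, "->", extract_class_numbers(sample))
```

Because the outer pattern contains no capturing groups, findall returns the whole matched substring, which the inner findall then splits into numbers.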

Related

Regex: match one pattern and exclude another pattern

I have a regular expression that matches the phone numbers:
import re
phones = re.findall(r'[+(]?[0-9][0-9 \-()]{8,}[0-9]', text)
It shows good accuracy in a large raw text dataset.
But sometimes it matches unwanted results (ranges of years and random IDs).
Ranges of years:
'2012 - 2017'
'(2011 - 2013'
'1999 02224'
'2019 2010-2015'
'2018-2018 (5'
'2004 -2009'
'1) 2005-2006'
'2011 2020'
Random ids:
'5 5 5 5'
'100032479008252'
'100006711277302'
I have ideas on how to solve these problems.
Limit the total number of digits to 12 digits.
Limit the total number of characters to 16 characters.
Remove the ranges of years (19**|20** - 19**|20**).
But I do not know how to implement these ideas and make them as exceptions in my regular expression.
Some examples that a regular expression should catch are presented below:
380-956-425979
+38(097)877-43-88
+38(050) 284-24-20
(097) 261-60-52
380-956-425979
(068)1850063
0975533222
I suggest you write different patterns for different phone structures. I'm not sure about your phone number structures, but this matches your examples:
import re
test = '''380-956-425979
+38(097)877-43-88
+38(050) 284-24-20
(097) 261-60-52
380-956-425979
(068)1850063
0975533222'''
solution = test.split("\n")
p1 = r"\+?\d{3}\-\d{3}\-\d{6}"
p2 = r"\+?(?:\d{2})?\(\d{3}\) ?\d{3}\-\d{2}\-\d{2}"
p3 = r"\+?\d{3}\-\d{3}\-\d{6}"  # identical to p1, so it adds nothing to the alternation
p4 = r"\+?(?:\(\d{3}\)|\d{3})\d{7}"
result = re.findall(f'{p1}|{p2}|{p3}|{p4}', test)
print(solution)
print(result)
You could do it in Python directly:
import re
if re.match("condition", "teststring") and not re.match("not-condition", "teststring"):
    print("Match!")
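As a rough sketch of the "match, then exclude" idea applied to the question, the candidates found by the original pattern can be filtered in plain Python. The names find_phones, YEAR_RANGE and max_digits are hypothetical, and the year-range pattern is an assumption that may need tuning on real data. The 16-character limit from the question is left out because some of the valid examples, such as +38(050) 284-24-20, are longer than that.

```python
import re

# The candidate pattern from the question.
PHONE = re.compile(r'[+(]?[0-9][0-9 \-()]{8,}[0-9]')
# Assumed exclusion pattern: two years (19xx or 20xx) separated by a dash and/or spaces.
YEAR_RANGE = re.compile(r'(?:19|20)\d{2}\s*[-\s]\s*(?:19|20)\d{2}')

def find_phones(text, max_digits=12):
    """Hypothetical helper: keep candidates that pass the exclusion rules."""
    results = []
    for candidate in PHONE.findall(text):
        digits = re.sub(r'\D', '', candidate)
        if len(digits) > max_digits:      # drop long random IDs
            continue
        if YEAR_RANGE.search(candidate):  # drop ranges of years
            continue
        results.append(candidate)
    return results

sample = "call 380-956-425979 or +38(097)877-43-88, years 2012 - 2017, id 100032479008252"
print(find_phones(sample))
```

Keeping the exclusions outside the main regex makes each rule easy to test and adjust on its own.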

python regex: extract string from escaped sequences

I do not get it. Why do people downvote this without explanation? What mistake did I make?
How to extract Apple Recipe, 3, pages, 29.4KB from the following string?
'\r\n\t\t\t\t\t\r\n\t\t\t\t\tApple Recipe\r\r\n\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t3\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\tpages\r\n
\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\t29.4KB\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t'
I've tried re.compile('\w+') but can only get results like:
Apple
Recipe
29
.
4
KB
However, I want to get them together as they are, not separately. For example, I want to get Apple Recipe together but not as two separate tokens.
data = """\r\n\t\t\t\t\t\r\n\t\t\t\t\tApple Recipe\r\r\n\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t3\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\tpages\r\n
\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\t29.4KB\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t"""
import re
g = re.findall(r'[^\r\n\t]+', data)
print(g)
Prints:
['Apple Recipe', '3', 'pages', '29.4KB']
The [^\r\n\t]+ matches any run of one or more characters that are not \r, \n or \t, so spaces inside a token such as "Apple Recipe" are kept.
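An equivalent way to express the same idea is to split on the unwanted characters instead of matching the wanted ones. This is a minimal sketch on a shortened, made-up version of the input string:

```python
import re

# Shortened sample in the same shape as the question's data (an assumption for brevity).
data = "\r\n\t\t\tApple Recipe\r\r\n\t\t3\r\n\t\tpages\r\n\t\t29.4KB\r\n\t"

# Split on runs of \r, \n and \t, then drop the empty strings left at both ends.
parts = [p for p in re.split(r'[\r\n\t]+', data) if p]
print(parts)
```

Both variants treat the space as part of a token, which is what keeps "Apple Recipe" together.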
txt = """\r\n\t\t\t\t\t\r\n\t\t\t\t\tApple Recipe\r\r\n\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t3\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\tpages\r\n
\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\t29.4KB\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t"""
import re
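output = re.findall(r'\w+(?:\.\w+)?', txt)
print(output)
This gives the output below. Note that 'Apple' and 'Recipe' are still separate tokens, because \w does not match the space between them: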
['Apple', 'Recipe', '3', 'pages', '29.4KB']

How do I change strings based on some rules?

I have the following text. Each line has two phrases separated by "\t":
RoadTunnel RouteOfTransportation
LaunchPad Infrastructure
CyclingLeague SportsLeague
Territory PopulatedPlace
CurlingLeague SportsLeague
GatedCommunity PopulatedPlace
I want to add _ between the words. The result should be:
Road_Tunnel Route_Of_Transportation
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
There are no cases such as "ABTest" or "aBTest", but there are cases with three words together, such as "RouteOfTransportation". I tried several ways but did not succeed.
One of my tries is:
textProcessed = re.sub(r"([A-Z][a-z]+)(?=([A-Z][a-z]+))", r"\1_", text)
But there is no result
Use a regular expression and re.sub.
>>> import re
>>> s = '''LaunchPad Infrastructure
... CyclingLeague SportsLeague
... Territory PopulatedPlace
... CurlingLeague SportsLeague
... GatedCommunity PopulatedPlace'''
>>> subbed = re.sub('([A-Z][a-z]+)([A-Z])', r'\1_\2', s)
>>> print(subbed)
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
edit: Here's another one, since your test cases don't cover enough to be sure what exactly you want:
>>> re.sub('([a-zA-Z])([A-Z])([a-z])', r'\1_\2\3', 'ABThingThing')
'AB_Thing_Thing'
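Combining re.findall and str.join (apply this to each phrase separately, because joining across the tab separator would insert an extra underscore after the tab):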
>>> "_".join(re.findall(r"[A-Z]{1}[^A-Z]*", text))
Depending on your needs, a slightly different solution can be this:
import re
result = re.sub(r"([a-zA-Z])(?=[A-Z])", r"\1_", s)
It will insert a _ before any upper case letter that follows another letter (whether it is upper or lower case).
"TheRabbit IsBlue" => "The_Rabbit Is_Blue"
"ABThing ThingAB" => "A_B_Thing Thing_A_B"
It does not support special chars.
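For the three-word case from the question, a zero-width variant may be worth considering. Patterns that consume the following capital letter, such as ([A-Z][a-z]+)([A-Z]), cannot match two boundaries in a row and turn RouteOfTransportation into Route_OfTransportation. The sketch below uses only lookarounds, so nothing is consumed; the function name add_underscores is hypothetical.

```python
import re

def add_underscores(text):
    """Hypothetical helper: insert "_" at every lowercase-to-uppercase boundary."""
    # Zero-width match: a position preceded by a lowercase letter
    # and followed by an uppercase letter.
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '_', text)

lines = "RoadTunnel\tRouteOfTransportation\nTerritory\tPopulatedPlace"
print(add_underscores(lines))
```

Tabs and newlines are left untouched because they are never part of a lowercase-to-uppercase boundary.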

Parsing string outside parenthetical expression

I have the following text:
s1 = 'Promo Tier 77 (4.89 USD)'
s2 = 'Promo (11.50 USD) Tier 1 Titles Only'
From this I want to pull out the number that is not included in the parenthetical. It would be:
s1 --> '77'
s2 --> '1'
I am currently using the weak regex re.findall('\s\d+\s',s1). What would be the correct regex? Something like re.findall('\d+',s1) but excluding anything within the parenthetical.
>>> re.findall('\d+',s1)
['77', '4', '89'] # two of these numbers are within the parenthetical.
# I only want '77'
One approach I find useful is the alternation operator: place what you want to exclude on the left side (in effect saying "throw this away, it's garbage") and place what you want to match in a capturing group on the right side.
Then you can combine this with filter or use a list comprehension to remove the empty list items that the regular expression engine picks up from the expression on the left side of the alternation operator.
>>> import re
>>> s = """Promo (11.50 USD) Tier 1 Titles Only
... Promo (11.50 USD) (10.50 USD, 11.50 USD) Tier 5
... Promo Tier 77 (4.89 USD)"""
>>> list(filter(None, re.findall(r'\([^)]*\)|(\d+)', s)))
['1', '5', '77']
You could make a temporary string that has the parenthesis section removed, then run your code. I used a space so that numbers before and after the missing string section can't be joined.
>>> import re
>>> s = 'Promo Tier 77 (11.50 USD) Tier 1 Titles Only'
>>> temp = re.sub(r'\(.*?\)', ' ', s)
>>> temp
'Promo Tier 77   Tier 1 Titles Only'
>>> re.findall(r'\d+', temp)
['77', '1']
And you could of course shorten this to a single line.
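A possible single-line form, wrapped in a function for reuse. This is a sketch and the name numbers_outside_parens is hypothetical:

```python
import re

def numbers_outside_parens(s):
    """Hypothetical helper: return the digit runs that are not inside (...)."""
    # Replace each parenthetical with a space, then collect the remaining digit runs.
    return re.findall(r'\d+', re.sub(r'\(.*?\)', ' ', s))

s1 = 'Promo Tier 77 (4.89 USD)'
s2 = 'Promo (11.50 USD) Tier 1 Titles Only'
print(numbers_outside_parens(s1))
print(numbers_outside_parens(s2))
```

This assumes the parentheses are balanced and not nested.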
Do some splitting on your strings, e.g.:
import re
s1 = "Promo Tier 77 (4.89 USD)"
for ss in s1.split(")"):  # each piece ends where a parenthetical closes
    outside = ss.split("(")[0]  # keep only the text before any open parenthesis
    for number in re.findall(r"\d+", outside):
        print(number)
(\b\d+\b)(?=(?:[^()]*\([^)]*\))*[^()]*$)
Try this. Grab the capture. See demo:
http://regex101.com/r/gT6kI4/7

Regular expression help

I am trying to create a regex in Python 3 that matches 7 characters (eg. >AB0012) separated by an unknown number of characters then matching another 6 characters(eg. aaabbb or bbbaaa). My input string might look like this:
>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
This is the regex that I have come up with:
matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)
print(matches)
The output I am trying to produce would look like this:
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
I read through the Python documentation, but I couldn't find how to match an unknown distance between two portions of a regex. Is there some sort of wildcard character that would allow me to complete my regex? Thanks in advance for the help!
EDIT:
If I use *? in my code like this:
mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)
My output looks like this:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]
*The second and third items in the list are missing the >CD00192 and >ZP01990, respectively. How can I have the regex include these characters in the list?
Here's an approach without regular expressions. Split on ">" (your data starts from the 2nd element onwards). Since you don't care what the first 7 characters are, check the 8th to 13th characters of each piece (i[7:13]). Note that this only detects a motif that immediately follows the 7-character header.
>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
...     if i[7:13] in ["aaabbb", "bbbaaa"]:
...         print(">" + i[:13])
...
>CD00192aaabbb
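The slicing approach only looks right after the header. A possible extension that scans the whole record, still without regular expressions, is sketched below. The function name find_motifs, its parameters and the shortened sample string are all assumptions made for illustration. Unlike a regex scan, this version also reports overlapping motifs.

```python
def find_motifs(data, motifs=("aaabbb", "bbbaaa"), header_len=7):
    """Hypothetical helper: pair each header with every motif found in its record."""
    results = []
    for record in data.split(">")[1:]:  # element 0 is the text before the first ">"
        header, body = record[:header_len], record[header_len:]
        for i in range(len(body)):
            for motif in motifs:
                if body.startswith(motif, i):  # motif begins at offset i
                    results.append((">" + header, motif))
    return results

sample = ">AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP01990mmaaabbbcc"
print(find_motifs(sample))
```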
I have some code that also gives the positions.
Here's the simple version of this code:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
dic = OrderedDict()
# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]
# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 8 aaabbb
41 18 bbbaaa
52 29 bbbaaa
62 39 aaabbb
69 ZP01990
95 27 aaabbb
136 SE45789
148 13 aaabbb
172 37 bbbaaa
The same code, in a functional style, using generators:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))
dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)
# map() is lazy in Python 3, so itertools.imap is no longer needed
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))
.
EDIT
I measured the execution times.
For each version I timed the creation of the dictionary and the display separately.
The code using generators (the second) displays the result 7.4 times faster (0.020 seconds) than the other one (0.148 seconds).
But, surprisingly to me, the code with generators takes 47% more time (0.000718 seconds) than the other (0.000489 seconds) to compute the dictionary.
.
EDIT 2
Another way to do it:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')
def collect(ch):
    li = []
    dic = OrderedDict()
    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            # A new header: store the hits collected for the previous one, if any
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    if li:
        dic[(stprec, g1prec)] = li
    return dic
dic = collect(ch)
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 aaabbb
41 bbbaaa
52 bbbaaa
62 aaabbb
69 ZP01990
95 aaabbb
136 SE45789
148 aaabbb
172 bbbaaa
This code computes dic in 0.00040 seconds and displays it in 0.0321 seconds.
.
EDIT 3
To answer your question: the only option is to keep the current header ('CD00192', 'ZP01990', 'SE45789', etc.) under a name while scanning. (I don't like to say "in a variable" in Python, because there are no variables in Python, but you can read "under a name" as if I had written "in a variable".)
For that, you must use finditer().
Here's the code for this solution:
import re
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')
matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        head = g1  # remember the most recent header
    else:
        matches.append((head, g2))
print(matches)
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]
My previous code samples are more complicated because they record the positions and gather the 'aaabbb' and 'bbbaaa' values of each header ('CD00192', 'ZP01990', 'SE45789', etc.) in a list.
Zero or more characters can be matched using *, so a* would match "", "a", "aa" etc. + matches one or more characters.
You may also want to make the quantifier (+ or *) lazy by using +? or *?.
See regular-expressions.info for more details.
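The difference between a greedy and a lazy quantifier can be seen on a shortened, made-up record. This is a sketch only; the sample string is an assumption.

```python
import re

s = ">CD00192aaabbbllyybbbaaayy"

# Greedy .* runs to the end of the string and backtracks, so the LAST motif is captured.
greedy = re.search(r'(>.{7}).*(aaabbb|bbbaaa)', s).groups()
# Lazy .*? expands one character at a time, so the FIRST motif is captured.
lazy = re.search(r'(>.{7}).*?(aaabbb|bbbaaa)', s).groups()

print(greedy)
print(lazy)
```

Either way, one match yields one motif per header, so collecting every motif still needs a loop such as finditer().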
Try this:
>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]
