regex matching multiple repeating groups

regex matching multiple repeating groups - python

I have the following string:
s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
I would like to parse the statuses and counts after "workorders". I've tried the following regex:
r = r"workorders:( (\d+) (\w+),?)*"
but this only returns the last group. How can I return all groups?
p.s. I know I could do this in python, but was wondering if there's a pure regex solution
>>> s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
>>> r = r"workorders:( (\d+) (\w+),?)*"
>>> re.findall(r, s)
[(' 134 completed', '134', 'completed')]
>>>
output should be close to
[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]

For the text in the example, you could try it like this:
(?:(\d+) (\w+)(?=,|$))+
Explanation
A non capturing group (?:
A capturing group for one or more digits (\d+)
A white space
A capturing group for one or more word characters (\w+)
A positive lookhead which asserts that what follows is either a comma or the end of the string (?=,|$)
Close the non capturing group and repeat that one or more times )+
Demo
That would give you:
[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]

this should work for your particular case:
re.findall('[:,] (\d+)', s)

In my experience, I found it better to use regex after you process the string as much as possible; regex on an arbitrary string will only cause headaches.
In your case, try splitting on ':' (or even workorders:) and getting the stuff after to get only the counts of statuses. After that, it's easy to get the counts for each status.
s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134
completed"
statuses = s.split(':') #['3434 garbage workorders', ' 138 waiting, 2 running, 3 failed, 134 completed']
statusesStr = ''.join(statuses[1]) # ' 138 waiting, 2 running, 3 failed, 134 completed'
statusRe = re.compile("(\d+)\s*(\w+)")
statusRe.findall(statusesStr) #[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]
Edit: changed expression to meet desired outcome and more robust

Answer that will only look at regex that are after :
re.findall(r'(?: )\d+ \w+')

This will give you your output exactly.
map = re.findall(r'(\d+) ([A-Za-z]+)', s.split("workorders:")[1])
You can then bust this init.
x = {v: int(k) for k, v in map}

Related

Parsing semi structured text strings in Python

I am trying to parse pseudo-English scripts, and want to convert it into another machine readable language.
However the script have been written by many people in the past, and each had their own style of writing.
some Examples would be:
On Device 1 Set word 45 and 46 to hex 331
On Device 1 set words 45 and 46 bits 3..7 to 280
on Device 1 set word 45 to oct 332
on device 1 set speed to 60kts Words 3-4 to hex 34
(there are many more different ways used in the source text)
The issue is its not always logical nor consistent
I have looked at Regexp, and matching certain words. This works out ok, but when I need to know the next word (e.g in 'Word 24' I would match for 'Word' then try to figure out if the next token is a number or not). In the case of 'Words' i need to look for the words to set, as well as their values.
in example 1, it should produce to Set word 45 to hex 331 and Set word 46 to hex 331
or if possible Set word 45 to hex 331 and word 46 to hex 331
i tried using the findall method on re - that would only give me the matched words, and then i have to try to find out the next word (i.e value) manually
alternatively, i could split the string using a space and process each word manually, then be able to do something like
assuming list is
['On', 'device1:', 'set', 'Word', '1', '', 'to', '88', 'and', 'word', '2', 'to', '2151']
for i in range (0,sp.__len__()):
rew = re.search("[Ww]ord", sp[i])
if rew:
print ("Found word, next val is ", sp[i+1])
is there a better way to do what i want? i looked a little bit into tokenizing, but not sure that would work as the language is not structured in the first place.

I suggest you develop a program that gradually explores the syntax that people have used to write the scripts.
E.g., each instruction in your examples seems to break down into a device-part and a settings-part. So you could try matching each line against the regex ^(.+) set (.+). If you find lines that don't match that pattern, print them out. Examine the output, find a general pattern that matches some of them, add a corresponding regex to your program (or modify an existing regex), and repeat. Proceed until you've recognized (in a very general way) every line in your input.
(Since capitalization appears to be inconsistent, you can either do case-insensitive matches, or convert each line to lowercase before you start processing it. More generally, you may find other 'normalizations' that simplify subsequent processing. E.g., if people were inconsistent about spaces, you can convert every run of whitespace characters into a single space.)
(If your input has typographical errors, e.g. someone wrote "ste" for "set", then you can either change the regex to allow for that (... (set|ste) ...), or go to (a copy of) the input file and just fix the typo.)
Then go back to the lines that matched ^(.+) set (.+), print out just the first group for each, and repeat the above process for just those substrings.
Then repeat the process for the second group in each "set" instruction. And so on, recursively.
Eventually, your program will be, in effect, a parser for the script language. At that point, you can start to add code to convert each recognized construct into the output language.
Depending on your experience with Python, you can find ways to make the code concise.

Depending on what you actually want from these strings, you could use a parser, e.g. parsimonious:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
grammar = Grammar(
r"""
command = set operand to? number (operator number)* middle? to? numsys? number
operand = (~r"words?" / "speed") ws
middle = (~r"[Ww]ords" / "bits")+ ws number
to = ws "to" ws
number = ws ~r"[-\d.]+" "kts"? ws
numsys = ws ("oct" / "hex") ws
operator = ws "and" ws
set = ~"[Ss]et" ws
ws = ~r"\s*"
"""
)
class HorribleStuff(NodeVisitor):
def __init__(self):
self.cmds = []
def generic_visit(self, node, visited_children):
pass
def visit_operand(self, node, visited_children):
self.cmds.append(('operand', node.text))
def visit_number(self, node, visited_children):
self.cmds.append(('number', node.text))
examples = ['Set word 45 and 46 to hex 331',
'set words 45 and 46 bits 3..7 to 280',
'set word 45 to oct 332',
'set speed to 60kts Words 3-4 to hex 34']
for example in examples:
tree = grammar.parse(example)
hs = HorribleStuff()
hs.visit(tree)
print(hs.cmds)
This would yield
[('operand', 'word '), ('number', '45 '), ('number', '46 '), ('number', '331')]
[('operand', 'words '), ('number', '45 '), ('number', '46 '), ('number', '3..7 '), ('number', '280')]
[('operand', 'word '), ('number', '45 '), ('number', '332')]
[('operand', 'speed '), ('number', '60kts '), ('number', '3-4 '), ('number', '34')]

How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example but I have much longer output which is response from an api call.
I want to extract all names (ana, dab, grey) and put in a separate list.
how can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u',output)
but this didn't work because it says "incomplete sequence u"
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in loop to extract names
Thanks

The \u object is not a letter and it cannot be matched. It is a part of a Unicode sequence. The following regex works, but it is kind of quirky. It looks for the beginning of each line, except for the first one, until the first space.
output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = "\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.M | re.I)
#['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']

I use regexr.com and play around with the regular expression until I get it right and then covert that into Python.
https://regexr.com/
I'm assuming the \n is the newline character here and I'll bet your \u error is caused by a line break. To use the multiline match in Python, you need to use that flag when you compile.
\n(.*)\n - this will be greedy and grab as many matches as possible (In the example it would grab the entire \nAna through 54\n
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re
a = re.compile("\\n(.*)\\n", re.MULTILINE)
for responses in a.match(source):
match = responses.split("\n")
# match[0] should be " Ana \u00315 \u003163 47"
# match[1] should be " Deb \u00323 \u003155 60" etc.

Regex to find name in sentence

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use Python regex to find the name "Oubre Jr." ,"Nurkic" and "Nurkic", "Wall".
p = r'\s*(\w+?)\s[(]'
use this pattern,
I can find "['Nurkic', 'Wall']", but in sentence 1, I just can find ['Nurkic'], missed "Oubre Jr."
Who can help me?

You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Captures 1 uppercase letter
[a-z] - Captures 1 lowercase letter
[\s\.a-z]* - Captures spaces (' '), periods ('.') or lowercase letters 0+ times
(?=\s\() - Captures the main pattern if it is only followed by ' (' string
str = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall( r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', str )
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1

Here is one approach:
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
results = re.findall( r'([A-Z][\w+'](?: [JS][r][.]?)?)(?= \([A-Z]+\))', line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.

You need a pattern that you can match - for your sentence you cou try to match things before (XXX) and include a list of possible "suffixes" to include as well - you would need to extract them from your sources
import re
suffs = ["Jr."] # append more to list
rsu = r"(?:"+"|".join(suffs)+")? ?"
# combine with suffixes
regex = r"(\w+ "+rsu+")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
matches = re.finditer(regex, test_str, re.MULTILINE)
names = []
for matchNum, match in enumerate(matches,1):
for groupNum in range(0, len(match.groups())):
names.extend(match.groups(groupNum))
print(names)
Output:
['Oubre Jr.', 'Nurkic ', 'Nurkic ', 'Wall ']
This should work as long as you do not have Names with non-\w in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as starting point.
Explanation:
r"(?:"+"|".join(suffs)+")? ?" --> all items in the list suffs are strung together via | (OR) as non grouping (?:...) and made optional followed by optional space.
r"(\w+ "+rsu+")\(\w{3}\)" --> the regex looks for any word characters followed by optional suffs group we just build, followed by literal ( then three word characters followed by another literal )

Apply Regex to df add values in new column

This is my dataset:
BlaBla 128 MB EE
ADTD 6 gb DTS
EEEDC 2GB RS
STA 12MB DFA
BBNB 32 mb YED
From this data set I would like to extract the number of MB/GB and the unit MB/GB. Therefore I have created the following Regex:
(\d*)\s?(MB|GB)
The code that I have created so that the regex will be applied to my df is:
pattern = re.compile(r'(\d*)\s?(MB|GB)')
invoice_df['mbs'] = invoice_df['Rate Plan'].apply(lambda x: pattern.search(x).group(1))
invoice_df['unit'] = invoice_df['Rate Plan'].apply(lambda x: pattern.search(x).group(2))
However when applying the regex to my df it give the following error message:
AttributeError: 'NoneType' object has no attribute 'group'
What can I do to solve this?

Since some of the entries may have no match, the re.search fails (returns no match) for them. You need to account for those cases inside the lambda:
apply(lambda x: pattern.search(x).group(1) if pattern.search(x) else "")
I also advise to use
(?i)(\d+)\s*([MGK]B)
It will find 1+ digits (\d+, Group 1) followed with 0+ whitespaces (\s*) and will match KB, GB, MB into Group 2 (([MGK]B)) in a case-insensitive way.

You just need to check that something has been found before requesting the groups :
import re
inputs = ["BlaBla 128 MB EE",
"ADTD 6 gb DTS",
"EEEDC 2GB RS",
"STA 12MB DFA",
"BBNB 32 mb YED",
"Nothing to find here"]
pattern = re.compile("(\d+)\s*([MG]B)", re.IGNORECASE)
for input in inputs:
match = re.search(pattern, input)
if match:
mbs = match.group(1)
unit = match.group(2)
print (mbs, unit.upper())
else:
print "Nothing found for : %r" % input
# ('128', 'MB')
# ('6', 'GB')
# ('2', 'GB')
# ('12', 'MB')
# ('32', 'MB')
# Nothing found for : 'Nothing to find here'
With your code :
pattern = re.compile("(\d+)\s*([MG]B)", re.IGNORECASE)
match = re.search(pattern, invoice_df['Rate Plan'])
if match:
invoice_df['mbs'] = match.group(1)
invoice_df['unit'] = match.group(2)
It's more readable than a lambda IMHO, and it doesn't execute the search twice.

Parsing string outside parenthetical expression

I have the following text:
s1 = 'Promo Tier 77 (4.89 USD)'
s2 = 'Promo (11.50 USD) Tier 1 Titles Only'
From this I want to pull out the number that is not included in the parenthetical. It would be:
s1 --> '77'
s2 --> '1'
I am currently using the weak regex re.findall('\s\d+\s',s1). What would be the correct regex? Something like re.findall('\d+',s1) but excluding anything within the parenthetical.
>>> re.findall('\d+',s1)
['77', '4', '89'] # two of these numbers are within the parenthetical.
# I only want '77'

One way that I find useful is to use the alternation operator in context placing what you want to exclude on the left side, (saying throw this away, it's garbage) and place what you want to match in a capturing group on the right side.
Then you can combine this with filter or use a list comprehension to remove the empty list items that the regular expression engine picks up from the expression on the left side of the alternation operator.
>>> import re
>>> s = """Promo (11.50 USD) Tier 1 Titles Only
Promo (11.50 USD) (10.50 USD, 11.50 USD) Tier 5
Promo Tier 77 (4.89 USD)"""
>>> filter(None, re.findall(r'\([^)]*\)|(\d+)', s))
['1', '5', '77']

You could make a temporary string that has the parenthesis section removed, then run your code. I used a space so that numbers before and after the missing string section can't be joined.
>>> import re
>>> s = 'Promo Tier 77 (11.50 USD) Tier 1 Titles Only'
>>> temp = re.sub(r'\(.*?\)', ' ', s)
Promo Tier 77 Tier 1 Titles Only
>>> re.findall('\d+', temp)
['77', '1']
And you could of course shorten this to a single line.

Do some splitting on your strings. eg pseudocode
s1 = "Promo Tier 77 (4.89 USD)"
s = s1.split(")")
for ss in s :
if "(" in ss: # check for the open brace
if the number in ss.split("(")[0]: # split at the open brace and do your regex
print the number

(\b\d+\b)(?=(?:[^()]*\([^)]*\))*[^()]*$)
Try this.Grab the capture.See demo.
http://regex101.com/r/gT6kI4/7

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.