python: regex - catch variable number of groups

I have a string that looks like:
TABLE_ENTRY.0[hex_number]= <FIELD_1=hex_number, FIELD_2=hex_number..FIELD_X=hex>
TABLE_ENTRY.1[hex_number]= <FIELD_1=hex_number, FIELD_2=hex_number..FIELD_Y=hex>
The number of fields is unknown and varies from entry to entry. I want to capture
each entry separately, with all of its fields and their values.
I came up with:
([A-Z_0-9\.]+\[0x[0-9]+\]=)(0x[0-9]+|0):\s+<(([A-Z_0-9]+)=(0x[0-9]+|0))
which matches the table entry and the first field, but I don't know how to account for a variable number of fields.
for input:
ENTRY_0[0x130]=0: <FIELD_0=0, FIELD_1=0x140... FIELD_2=0xff3>
output should be:
ENTRY 0:
FIELD_0=0
FIELD_1=0x140
FIELD_2=ff3
ENTRY 1:
...

In short, it's impossible to do all of this in the re engine alone. You cannot generate more groups dynamically; a repeated group keeps only its last match, so everything ends up in one group. You should re-parse the results like so:
import re
input_str = ("TABLE_ENTRY.0[0x1234]= <FIELD_1=0x1234, FIELD_2=0x1234, FIELD_3=0x1234>\n"
"TABLE_ENTRY.1[0x1235]= <FIELD_1=0x1235, FIELD_2=0x1235, FIELD_3=0x1235>")
results = {}
for match in re.finditer(r"([A-Z_0-9\.]+\[0x[0-9A-F]+\])=\s+<([^>]*)>", input_str):
    fields = match.group(2).split(", ")
    results[match.group(1)] = dict(f.split("=") for f in fields)
>>> results
{'TABLE_ENTRY.0[0x1234]': {'FIELD_2': '0x1234', 'FIELD_1': '0x1234', 'FIELD_3': '0x1234'}, 'TABLE_ENTRY.1[0x1235]': {'FIELD_2': '0x1235', 'FIELD_1': '0x1235', 'FIELD_3': '0x1235'}}
The output will just be a large dict mapping each table entry to a dict of its fields.
It's also rather convenient, as you can do this:
>>> results["TABLE_ENTRY.0[0x1234]"]["FIELD_2"]
'0x1234'
I personally suggest stripping off "TABLE_ENTRY" since it's repetitive, but that's up to you.

Use a repeated capture group to match a variable number of fields:
([A-Z_0-9\.]+\[0x[0-9]+\]=)\s+<(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*([A-Z_0-9]+)=(0x[0-9]+|0)
The following part matches any number of fields, each with a trailing comma and optional whitespace:
(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*
And ([A-Z_0-9]+)=(0x[0-9]+|0) will match the last field.
Demo: https://regex101.com/r/gP3oO6/1
Note: if you don't want some of the groups, it's better to make them non-capturing by adding ?: right after the opening parenthesis ((?: ...)). Also note that the (0x[0-9]+|0):\s+ part is extra in your regex (based on your input pattern).
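A minimal sketch of how this can be combined in Python (the sample line below is assumed, based on the question's format): because the re module only keeps the last repetition of a repeated group, grab the whole <...> body with one pattern and then pull the individual fields out of it with a second pass:
import re

line = "ENTRY_0[0x130]=0: <FIELD_0=0, FIELD_1=0x140, FIELD_2=0xff3>"

entry_re = re.compile(r"([A-Z_0-9\.]+)\[0x[0-9a-fA-F]+\]=(?:0x[0-9a-fA-F]+|0):\s*<([^>]*)>")
field_re = re.compile(r"([A-Z_0-9]+)=(0x[0-9a-fA-F]+|0)")

m = entry_re.search(line)
if m:
    # dict of FIELD name -> value for this entry
    fields = dict(field_re.findall(m.group(2)))
    print(m.group(1), fields)
# ENTRY_0 {'FIELD_0': '0', 'FIELD_1': '0x140', 'FIELD_2': '0xff3'}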

Related

Extract values from String using Python

I am getting a string
name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"
I need to get the values Mathew, Thomas, PR123T, male.
Also, if the string doesn't have a value for zipcode, it should not assign any value to the string.
I am a newbie to Python. Please help.
You need to use the .split() function that is available on every string. First split by comma ,, then split each part by = and take the element at index 1.
Once this is done, you need to .join() the elements with a comma , again.
def split_my_fields(input_string):
    if 'zipcode=""' not in input_string:
        output = ', '.join(e.split('=')[1].replace('"', '') for e in input_string.split(','))
        print(f'Output is {output}')
        return output
    else:
        print('Zipcode is empty.')

split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output:
>>> split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output is Mathew, Thomas, PR123T, male
'Mathew, Thomas, PR123T, male'
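With an empty zipcode the else branch is taken instead (a quick check using the function above):
>>> split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="",gender="male"')
Zipcode is empty.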
In fact, my dear friend, you can use the parse library:
>>> from parse import *
>>> parse("name={},lastname={},zipcode={},gender={}","name='Mathew',lastname='Thomas',zipcode='PR123T',gender='male'")
<Result ("'Mathew'", "'Thomas'", "'PR123T'", "'male'") {}>
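Note that the result above keeps the quote characters; if you put the quotes into the format string they are stripped (a small sketch, assuming the parse library is installed and the input uses double quotes as in the question):
>>> from parse import parse
>>> text = 'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"'
>>> parse('name="{}",lastname="{}",zipcode="{}",gender="{}"', text)
<Result ('Mathew', 'Thomas', 'PR123T', 'male') {}>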
You can use named groups and create dictionary with keys corresponding to the group names:
import re
text = 'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"'
expr = re.compile(r'^(name="(\s+)?(?P<name>.*?)(\s+)?")?,?(lastname="(\s+)?(?P<lastname>.*?)(\s+)?")?,?(zipcode="(\s+)?(?P<zipcode>.*?)(\s+)?")?,?(gender="(\s+)?(?P<gender>.*?)(\s+)?")?$')
match = expr.search(text).groupdict()
print(match['name']) # Mathew
print(match['lastname']) # Thomas
print(match['zipcode']) # PR123T
print(match['gender']) # male
The pattern will catch all characters between the double quotes and strip the whitespace around them. For an empty zipcode value it will return an empty string (the same applies to the other named groups). It will also handle missing key-value pairs as long as the order in which the keys appear stays the same (e.g. text = 'name="Mathew",lastname="Thomas",gender="male"').
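For instance, dropping the zipcode pair entirely still matches (a quick check with the expr compiled above); the unmatched group comes back as None rather than an empty string:
text2 = 'name="Mathew",lastname="Thomas",gender="male"'
print(expr.search(text2).groupdict())
# {'name': 'Mathew', 'lastname': 'Thomas', 'zipcode': None, 'gender': 'male'}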

Extract named group regex pattern from a compiled regex in Python

I have a regex in Python that contains several named groups. However, patterns that match one group can be missed if previous groups have matched because overlaps don't seem to be allowed. As an example:
import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')
x = re.findall(myRegex,myText)
print(x)
Produces the output:
[('AAA', '')]
The 'long' group does not find a match because 'AAA' was used-up in finding a match for the preceding 'short' group.
I've tried to find a method to allow overlapping but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:
for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***, myText)
Is it possible to extract the regex for each named group?
Ultimately, I'd like to produce a dictionary output (or similar) like:
{'short':'AAA',
'long':'AAAaoasgosaegnsBBB'}
Any and all suggestions would be gratefully received.
There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of this other answer but somewhat simpler. It will work provided that a) your patterns are always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.
The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed by (?P<, the beginning of a named group. It compiles each subpattern and uses the groupindex attribute to extract the name.
def nameToMatches(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result
With your given text and pattern, returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value.
If you only want one match per pattern, you can make it a bit simpler still:
def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result
This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your givens. If one of the named groups doesn't match at all, it will be absent from the dict.
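A quick check with the question's text and pattern (re imported and nameToMatch defined as above):
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myPattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
print(nameToMatch(myPattern, myText))
# {'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}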
There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing but basically it splits the original regex into its component parts and runs each group regex separately on the original text.
import re
myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr) # This is actually no longer needed
print("Full regex with multiple groups")
print(myRegexStr)
# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)
print("\nList of separate regexes")
print(mySepRegexesList)
# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)
# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g, r in mySepRegexDict.items():
    m = re.findall(re.compile(r), myTextStr)
    myGroupRegexOutput[g] = m[0]
print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)
The resulting output is:
Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))
List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]
Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}
Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
This might be useful to someone somewhere.

Regular Expression in Python 3

I am new here and have just started using regular expressions in my Python code. I have a string which has 6 commas inside. One of the commas falls between two quotation marks. I want to get rid of the quotation marks and the last comma.
The input:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
I want this output:
string = 'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The output of my code:
string = 'Fruits,Pear,**CherryApple**,Orange,Cherry'
here is my code in python:
if re.search('"', string):
    matches = re.findall(r'\"(.+?)\"', string)
    matches1 = re.sub(",", "", matches[0])
    string = re.sub(matches[0], matches1, string)
    string = re.sub('"', '', string)
My problem is that I want to restrict the code so it only affects the last bit ("Cherry,"), but unfortunately it also affects words in the middle (Cherry,Apple) that have the same text as the one between the quotation marks! That reduces the number of commas (from 6 to 4) as it merges two fields (Cherry,Apple), and I want to be left with 5 commas.
fullString = '2000-04-24 12:32:00.000,22186CBD0FDEAB049C60513341BA721B,0DDEB5,COMP,Cherry Corp.,DE,100,0.57,100,31213C678CC483768E1282A9D8CB524C,365.00000,business,acquisitions-mergers,acquisition-bid,interest,acquiree,fact,,,,,,,,,,,,,acquisition-interest-acquiree,Cherry Corp. Gets Buyout Offer From Chairman President,FULL-ARTICLE,B5569E,Dow Jones Newswires,0.04,-0.18,0,0,1,0,0,0,0,1,1,5,RPA,DJ,DN20000424000597,"Cherry Corp. Gets Buyout Offer From Chairman President,"\n'
Many Thanks in advance
For your task you don't need regular expressions, just use replace and strip:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
new_string = string.replace('"', '').strip(',')
The best way would be to use the newer regex module where (*SKIP)(*FAIL) is supported:
import regex as re
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
# parts
rx = re.compile(r'"[^"]+"(*SKIP)(*FAIL)|,')
def cleanse(match):
    rxi = re.compile(r'[",]+')
    return rxi.sub('', match)
parts = [cleanse(match) for match in rx.split(string)]
print(parts)
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', 'Cherry']
Here you match anything between double quotes and throw it away afterwards, thus only commas outside quotes are used for the split operation. The rest is a list comprehension with a cleaning function.
See a demo on regex101.com.
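For reference, the raw split (using the rx compiled above) keeps the quoted field intact, which is exactly what the cleanse step then strips:
print(rx.split(string))
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', '"Cherry,"']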
Why not simply use this:
>>> ans_string = string.replace('"', '')[0:-1]
Output:
>>> ans_string
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
This is preferable for the sake of simplicity and low algorithmic complexity.
You might consider using the csv module to do this.
Example:
>>> import csv
>>> s = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
>>> ','.join([e.replace(',', '') for row in csv.reader([s]) for e in row])
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The csv module will strip the quotes but keep the commas in each quoted field. Then you can just remove the comma that was kept.
This will take care of any modifications desired (remove , for example) on a field by field basis. The fields with quotes and commas could be any field in the string.
If your content is in a csv file, you would do something like this (in pseudo code):
with open(file, 'rb') as csv_fo:
    # modify(field) stands for what you want to do to each field...
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row
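In Python 3 the csv module expects the file opened in text mode with newline=''; here is a sketch of the same loop (data.csv and modify() are placeholders):
import csv

def modify(field):
    # placeholder per-field cleanup, e.g. drop commas kept inside quoted fields
    return field.replace(',', '')

with open('data.csv', newline='') as csv_fo:
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row
        print(','.join(new_row))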

Regular expression for multiple occurrences in Python

I need to parse lines having multiple language codes as below
008800002 Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$<nld>
008800002 being an id
Bruxelles-Nord$Brüssel Nord$ being name one
deu being language one
$Brussel Noord$ being name two
nld being language two.
So, the idea is that a name and language can appear N number of times. I need to collect them all.
The language code in <> is 3 characters in length (fixed),
and all names end with a $ sign.
I tried this one, but it is not giving the expected output:
x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)(?P<lang_code>(?:[<]\S{0,4}))', flags=re.UNICODE)
I have no idea how to get repeated elements.
It takes Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.
Do it in two steps. First separate ID from name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and stuff them into a dict.
import re
line = u"008800002 Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$<nld>"
m = re.search("(\d+)\s+(.*)", line, re.UNICODE)
id = m.group(1)
names = {}
for m in re.finditer("(.*?)<(.*?)>", m.group(2), re.UNICODE):
    names[m.group(2)] = m.group(1)
print(id, names)
\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>
Try this. Just grab the captures. See the demo:
http://regex101.com/r/hS3dT7/4
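A quick check of that pattern with re.findall; each tuple holds the three capture groups, with empty strings for the alternative that did not take part in the match:
import re

line = "008800002 Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$<nld>"
print(re.findall(r"\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>", line))
# [('008800002', '', ''), ('', 'Bruxelles-Nord$Brüssel Nord$', 'deu'), ('', '$Brussel Noord$', 'nld')]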

Matching alternative regexps in Python

I'm using Python to parse a file in search for e-mail addresses, but I can't figure out what the syntax for alternative regexps should be. Here's the code:
addresses = []
pattern = '(\w+)@(\w+\.com)|(\w+)@(it.\w+\.com)'
for line in file:
matches = re.findall(pattern,line)
for m in matches:
address = '%s@%s' % m
addresses.append(address)
So I want to find addresses that look like john@company.com or john@it.company.com, but the above code doesn't work because either the first two groups are empty or the last two groups are empty. What is the correct solution? I need to use groups to store the user name (before @) and server name (after @) separately.
EDIT: Matching email addresses is only an example. What I'm trying to find out is how to match different regexps that have only one thing in common - they match two groups.
(\w+)@((?:it\.)?\w+\.com)
You want to capture the part after the @ whether it's example.com or it.example.com, so you put both options inside the same capture group. But since they share a similar format, you can condense (it\.\w+\.com|\w+\.com) to just ((it\.)?\w+\.com)
The (?: ) makes that parens a non-capturing group, so it won't take part in your matched groups. There will be one match for the first (\w+), and one match for the whole ((?:it\.)?\w+\.com) after the @. That's two matches total, plus the default group-0 match for the full string.
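A short check of the condensed pattern against the two addresses from the question:
import re

pattern = re.compile(r'(\w+)@((?:it\.)?\w+\.com)')
for address in ('john@company.com', 'john@it.company.com'):
    m = pattern.search(address)
    print(m.group(1), m.group(2))
# john company.com
# john it.company.com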
EDIT: To answer your new question, all you have to do is follow the grouping I used, but stop before you condense it.
If your test cases are:
1) example@abcdef
2) example@123456
You could write your regex as such: (\w+)@([a-zA-Z]+|\d+), which would always have the part before the @ in group 1, and the part after in group 2. Notice that there are only two pairs of parens, and the |("or") operator appears inside of the second parens group.
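And a quick check of that two-group pattern:
import re

pattern = re.compile(r'(\w+)@([a-zA-Z]+|\d+)')
print(pattern.search('example@abcdef').groups())   # ('example', 'abcdef')
print(pattern.search('example@123456').groups())   # ('example', '123456')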
I once found a well-written email regex here; it was built for extracting a wide range of valid email addresses from a generic string, so it should also be able to do what you're looking for.
Example:
>>> email_regex = re.compile("""((([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>@\[\]\.]|\\[ \\"])*")\.)*([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>@\[\]\.]|\\[ \\"])*"))@((([a-zA-Z0-9]([a-zA-Z0-9]*(\-[a-zA-Z0-9]*)*)?\.)*[a-zA-Z]+|\[((0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\]|\[[Ii][Pp][vV]6(:[0-9a-fA-F]{0,4}){6}\]))""")
>>>
>>> m = email_regex.search('john@it.company.com')
>>> m.group(0)
'john@it.company.com'
>>> m.group(1)
'john'
>>> m.group(7)
'it.company.com'
>>>
>>> n = email_regex.search('john@company.com')
>>> n.group(0)
'john@company.com'
>>> n.group(1)
'john'
>>> n.group(7)
'company.com'
