Python regex: parse a file name with underscore-separated fields

I have the following format which parameterises a file name.
"{variable}_{domain}_{GCMsource}_{scenario}_{member}_{RCMsource}_{RCMversion}_{frequency}_{start}-{end}_{fid}.nc"
e.g.
"pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"
(Note that {start}-{end} is meant to be hyphen-separated instead of underscore-separated.)
The various fields are always separated by underscores and contain a predictable (but variable) format. In the example file name I have left out the final {fid} field as I would like that to be optional.
I'd like to use regex in python to parse such a file name to give me a dict or similar with keys for the field names in the format string and the corresponding values of the parsed file name. e.g.
{
"variable": "pr",
"domain": "EUR-11",
"GCMsource": "CNRM-CERFACS-CNRM-CM5",
"scenario": "rcp45",
"member": "r1i1p1",
"RCMsource": "CLMcom-CCLM4-8-17",
"RCMversion": "v1",
"frequency": "day",
"start": "20060101",
"end": "20101231",
"fid": None
}
The regex pattern for each field can be constrained depending on the field, e.g.:
"domain" is always 3 letters, a hyphen, then 2 numbers (e.g. "EUR-11")
"member" is always rWiXpY where W, X and Y are numbers.
"scenario" always contains the letters "rcp" followed by 2 numbers.
"start" and "end" are always 8 digit numbers (YYYYMMDD)
There are never underscores within a field, underscores are only used to separate fields.
Note that I have used https://github.com/r1chardj0n3s/parse with some success but I don't think it is flexible enough for my needs (trying to parse other similar filenames with similar formats can often get confused with one another).
It would be great if the answer can explain some regex principles which will allow me to do this.

document for regular expression in python: https://docs.python.org/3/howto/regex.html#regex-howto
named group in regular expression in python:
https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups
import re
test_string = """pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"""
pattern = r"""
(?P<variable>\w+)_
(?P<domain>[a-zA-Z]{3}-\d{2})_
(?P<GCMsource>([A-Z0-9]+[-]?)+)_
(?P<scenario>rcp\d{2})_
(?P<member>([rip]\d)+)_
(?P<RCMsource>([a-zA-Z0-9]-?)+)_
(?P<RCMversion>[a-zA-Z0-9]+)_
(?P<frequency>[a-zA-Z-0-9]+)_
(?P<start>\d{8})-
(?P<end>\d{8})
(?:_(?P<fid>[a-zA-Z0-9]+))?
\.nc$
"""
re_object = re.compile(pattern, re.VERBOSE)  # VERBOSE ignores whitespace, so the pattern can span multiple lines
search_result = re_object.match(test_string)
print(search_result.groupdict())
# result:
"""
{'variable': 'pr', 'domain': 'EUR-11', 'GCMsource': 'CNRM-CERFACS-CNRM-CM5', 'scenario': 'rcp45', 'member': 'r1i1p1', 'RCMsource': 'CLMcom-CCLM4-8-17', 'RCMversion': 'v1', 'frequency': 'day', 'start': '20060101', 'end': '20101231', 'fid': None}
"""
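Building on the answer above, one way to keep the per-field constraints maintainable is to assemble the full pattern from a mapping of field names to sub-patterns. This is only a sketch; the `FIELD_PATTERNS` dict and `build_pattern` helper are illustrative names, not part of any library:

```python
import re

# Hypothetical mapping: each underscore-separated field gets its own sub-pattern.
FIELD_PATTERNS = {
    "variable": r"\w+",
    "domain": r"[a-zA-Z]{3}-\d{2}",
    "GCMsource": r"[A-Za-z0-9-]+",
    "scenario": r"rcp\d{2}",
    "member": r"r\d+i\d+p\d+",
    "RCMsource": r"[A-Za-z0-9-]+",
    "RCMversion": r"[A-Za-z0-9]+",
    "frequency": r"[A-Za-z0-9]+",
}

def build_pattern(fields):
    # Join the underscore-separated fields, then append the hyphen-separated
    # start-end pair, the optional fid, and the anchored extension.
    parts = [f"(?P<{name}>{pat})" for name, pat in fields.items()]
    return (
        "_".join(parts)
        + r"_(?P<start>\d{8})-(?P<end>\d{8})"
        + r"(?:_(?P<fid>[A-Za-z0-9]+))?\.nc$"
    )

name = "pr_EUR-11_CNRM-CERFACS-CNRM-CM5_rcp45_r1i1p1_CLMcom-CCLM4-8-17_v1_day_20060101-20101231.nc"
m = re.match(build_pattern(FIELD_PATTERNS), name)
print(m.groupdict())
```

Changing a single field's constraint then only touches one dict entry, and variants of the naming scheme can be handled by passing a different mapping.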

Related

Regular expression to capture different lines

I'm trying to find a better way to capture variable values from a file that stores some information, but I'm running into problems with line breaks and spaces. For example, the DataSetList variable can store a value in two different ways:
Input
Templates = <
item
Name = 'fruits'
TemplateList = '7,12'
end>
Surveys = <
item
ID = 542
Name = 'apple'
end
item
ID = 872
Name = 'banana'
DataSetList = '873,887,971,1055'
PluginInfo = {something}
end
item
ID = 437
Name = 'cherry'
DataSetList =
'438,452,536,620,704,788,1143,1179,1563,1647,1731,1839,1875,1851,' +
'1863,2060,2359,2443,2469,2620'
PluginInfo = {something}
end>
The only way I've found to capture the values of the variables ID, Name and DataSetList that are stored in an 'item ... end' block is (my approach):
Expression
ID[\s\=]*(?P<UID>\d*)\s*Name[\s\=]*'(?P<Name>.*)'\s*DataSetList[\s\=]*(?P<DataSetList>(?:'[\d\,]*'[\s\+]*)*)
ID[\s\=]*(?P<UID>\d*) # capture ID
\s* # match spaces
Name[\s\=]*'(?P<Name>.*)' # capture Name
\s* # match spaces
DataSetList[\s\=]*(?P<DataSetList>(?:'[\d\,]*'[\s\+]*)*) # capture DataSetList
My approach output
{'UID': '872',
'Name': 'banana',
'DataSetList': "'873,887,971,1055'\n "}
{'UID': '437',
'Name': 'cherry',
'DataSetList': "'438,452,536,620,704,788,1143,1179,1563,1647,1731,1839,1875,1851,' +\n '1863,2060,2359,2443,2469,2620'\n "}
Problem
I don't think my approach is good, because the named capturing group DataSetList also captures spaces, line breaks and the literal +, and so requires post-processing of the values.
Any approaches or ideas to improve my regular expression would be very helpful. Unfortunately, my knowledge of regex isn't as deep as I would like it to be, and it's very interesting to see how this is done in other ways.
You can improve the regex a bit.
ID[\s=]*(?P<UID>\d*)\s*Name[\s=]*'(?P<Name>.*)'\s*DataSetList[\s=]*(?P<DataSetList>'(?:[\d,]|'[\s+]*')*')
This gets rid of the unnecessary = and , escapes. The last part now won't match the whitespace after the final bit of the DataSetList.
I can't see a nice way to avoid having to post-process the DataSetList, if you stick to regular expressions.
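For the post-processing itself, a small cleanup pass is usually enough. A sketch (the variable names are illustrative): pull the digit runs out of the captured chunk and discard the quoting/concatenation noise:

```python
import re

# A captured DataSetList value, complete with quotes, '+' and line breaks.
raw = "'438,452,536,' +\n    '620,704'\n  "

# Keep only the numbers; quotes, '+', commas and whitespace are all noise here.
ids = re.findall(r"\d+", raw)
print(ids)  # ['438', '452', '536', '620', '704']
```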
If you need to do anything more complicated with this, I'd advise moving away from regexes. They are great for simple things, but it looks like in this case you'd be better off with a proper parser. If none already exists for the language you have here, you can use a parsing library such as Lark to create one without too much difficulty.

Parsing Regular expression from YAML file adds extra \

I have a bunch of regular expression I am using to scrape lot of specific fields from a text document. Those all work fine when used directly inside the python script.
But I thought of putting them in a YAML file and reading from there. Here's how it looks:
# Document file for Regular expression patterns for a company invoice
---
issuer: ABCCorp
fields:
invoice_number: INVOICE\s*(\S+)
invoice_date: INVOICE DATE\s*(\S+)
cusotmer_id: CUSTOMER ID\s*(\S+)
origin: ORIGIN\s*(.*)ETD
destination: DESTINATION\s*(.*)ETA
sub_total: SUBTOTAL\s*(\S+)
add_gst: SUBTOTAL\s*(\S+)
total_cost: TOTAL USD\s*(\S+)
description_breakdown: (?s)(DESCRIPTION\s*GST IN USD\s*.+?TOTAL CHARGES)
package_details_fields: (?s)(WEIGHT\s*VOLUME\s*.+?FLIGHT|ROAD REFERENCE)
mawb_hawb: (?s)((FLIGHT|ROAD REFERENCE).*(MAWB|MASTER BILL)\s*.+?GOODS COLLECTED FROM)
When I retrieve it using PyYAML in Python, it adds quotes around each pattern (which is OK, as I can add r'' later), but I see it also adds an extra \ within the regex. That would make the regex wrong when used in code:
import os
import yaml

with open(os.path.join(TEMPLATES_DIR, "regex_template.yml")) as f:
    my_dict = yaml.safe_load(f)
print(my_dict)
{'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)', 'cusotmer_id': 'CUSTOMER ID\\s*(\\S+)', 'origin': 'ORIGIN\\s*(.*)ETD', 'destination': 'DESTINATION\\s*(.*)ETA', 'sub_total': 'SUBTOTAL\\s*(\\S+)', 'add_gst': 'SUBTOTAL\\s*(\\S+)', 'total_cost': 'TOTAL USD\\s*(\\S+)', 'description_breakdown': '(?s)(DESCRIPTION\\s*GST IN USD\\s*.+?TOTAL CHARGES)', 'package_details_fields': '(?s)(WEIGHT\\s*VOLUME\\s*.+?FLIGHT|ROAD REFERENCE)', 'mawb_hawb'
How do I read the regex as I have it in the YAML file? Does every string written in a YAML file get quotation marks around it when read in Python, because it is a string?
EDIT:
The main regex in yaml file is:
INVOICE\s*(\S+)
Output in dict is:
'INVOICE\\s*(\\S+)'
This is too long to do as a comment.
The backslash character is used to escape special characters. For example:
'\n': newline
'\a': alarm
When you use it before a letter that has no special meaning it is just taken to be a backslash character:
'\s': backslash followed by 's'
But to be sure, whenever you want to enter a backslash character in a string and not have it interpreted as the start of an escape sequence, you double it up:
'\\s': also a backslash followed by a 's'
'\\a': a backslash followed by a 'a'
If you use a r'' type literal, then a backslash is never interpreted as the start of an escape sequence:
r'\a': a backslash followed by 'a' (not an alarm character)
r'\n': a backslash followed by 'n' (not a newline; however, when used in a regex it will match a newline)
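A quick way to convince yourself of the above (a minimal check, nothing more):

```python
# '\s' has no special meaning as an escape, so the backslash survives.
plain = '\\s'   # explicitly doubled backslash
raw = r'\s'     # raw literal: backslash taken literally
print(len(raw), plain == raw)  # both are two characters: backslash + 's'
```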
Now here is the punchline:
When you print out these Python objects, such as:
d = {'x': 'ab\sd'}
print(d)
Python will print the string representation of the dictionary and the string will print:
'ab\\sd'. If you just did:
print('ab\sd')
You would see ab\sd. Quite a difference.
Why the difference? See if this makes sense:
d = {'x': 'ab\ncd'}
print(d)
print('ab\ncd')
Results:
{'x': 'ab\ncd'}
ab
cd
The bottom line is that when you print a Python object other than a string, it prints a representation of the object showing how you would have created it. And if the object contains a string and that string contains a backslash, you would have doubled up on that backslash when entering it.
Update
To process your my_dict: Since you did not provide the complete value of my_dict, I can only use a truncated version for demo purposes. But this will demonstrate that my_dict has perfectly good regular expressions:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
fields = my_dict['fields']
invoice_number_re = fields['invoice_number']
m = re.search(invoice_number_re, 'blah-blah INVOICE 12345 blah-blah')
print(m[1])
Prints:
12345
If you are going to be using the same regular expressions over and over again, then it is best to compile them:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
#compile the strings to regular expressions
fields = my_dict['fields']
for k, v in fields.items():
    fields[k] = re.compile(v)
invoice_number_re = fields['invoice_number']
m = invoice_number_re.search('blah-blah INVOICE 12345 blah-blah')
print(m[1])

re groupdict: Is it possible to specify the value type?

I have the following regex to decompose a tyre spec into sub-elements, which need to be returned as a dict. Its numeric elements need to be returned as int.
Here is an input example:
tyre_specs = '255/45W17'
The desired output:
tyre_details = {'width': 255, 'profile': 45, 'rating': 'W', 'rim': 17}
I capture each element using a regular expression pattern with named capture groups that match the desired output dict keys. I then use groupdict to generate my output dict. However, all the values are strings, so I need to further process the relevant values to cast them as int.
My function, see below, works. However I was wondering if is there a better way to do this. Is there for instance a way to enforce the type of some specific matching groups?
If not, is this approach "pythonic"?
Here is my function
import re
def tyre_details(tyre_size):
    pattern = r'(?P<width>\d{3})\/(?P<profile>\d{2})(?P<rating>[A-Z]{1,2})(?P<rim>\d{2})'
    try:
        details = re.match(pattern, tyre_size).groupdict()
    except AttributeError:
        raise ValueError('Input does not conform to the usual tyre size nomenclature "Width/ProfileRatingRim"')
    int_keys = set('width profile rim'.split())
    for key in int_keys:
        details[key] = int(details[key])
    return details
Edit:
Added exception handling for when the input string doesn't match; I raise this as a ValueError.
Defined the keys to be cast as a set instead of a list.
Removed a redundant try/except clause.
I would first check if the regex matched. If it did, then the match.groups() can be dereferenced directly into variables and used to build the final dictionary object:
import re
def tyre_details(tyre_size):
    pattern = r'(\d{3})/(\d{2})([A-Z]{1,2})(\d{2})'
    m = re.match(pattern, tyre_size)
    details = {}
    if m:
        width, profile, rating, rim = m.groups()
        details = {"width": int(width), "profile": int(profile), "rating": rating, "rim": int(rim)}
    return details
tyre_specs = '255/45W17'
print( tyre_details(tyre_specs) )
# => {'width': 255, 'profile': 45, 'rating': 'W', 'rim': 17}
See the Python demo
There is no need for named groups with this approach, and you do not need any try/except or other checks when casting str to int because the groups in question only match digits, see (\d{3}), (\d{2}) and (\d{2}).
If you need a full string match, replace re.match with re.fullmatch, and in case the match can appear anywhere in the string, use re.search.
Note / is not any special regex metacharacter, do not escape it in the pattern.
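If you do want to keep the named groups, a middle ground (just a sketch; the `CONVERTERS` mapping is an illustrative name, not a library feature) is to declare which groups should be cast and apply the mapping over `groupdict()`:

```python
import re

# Illustrative mapping from group name to the type it should be cast to;
# groups not listed here stay as strings.
CONVERTERS = {"width": int, "profile": int, "rim": int}

def tyre_details(tyre_size):
    pattern = r'(?P<width>\d{3})/(?P<profile>\d{2})(?P<rating>[A-Z]{1,2})(?P<rim>\d{2})'
    m = re.fullmatch(pattern, tyre_size)
    if m is None:
        raise ValueError('Input does not conform to "Width/ProfileRatingRim"')
    return {k: CONVERTERS.get(k, str)(v) for k, v in m.groupdict().items()}

print(tyre_details('255/45W17'))
# {'width': 255, 'profile': 45, 'rating': 'W', 'rim': 17}
```

This keeps the self-documenting group names while still producing typed values, and adding a new typed field only means touching the pattern and the mapping.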

python: regex - catch variable number of groups

I have a string that looks like:
TABLE_ENTRY.0[hex_number]= <FIELD_1=hex_number, FIELD_2=hex_number..FIELD_X=hex>
TABLE_ENTRY.1[hex_number]= <FIELD_1=hex_number, FIELD_2=hex_number..FIELD_Y=hex>
The number of fields is unknown and varies from entry to entry. I want to capture each entry separately with all of its fields and their values.
I came up with:
([A-Z_0-9\.]+\[0x[0-9]+\]=)(0x[0-9]+|0):\s+<(([A-Z_0-9]+)=(0x[0-9]+|0))
which matches the table entry and the first field, but I don't know how to account for a variable number of fields.
for input:
ENTRY_0[0x130]=0: <FIELD_0=0, FIELD_1=0x140... FIELD_2=0xff3>
output should be:
ENTRY 0:
FIELD_0=0
FIELD_1=0x140
FIELD_2=ff3
ENTRY 1:
...
In short, it's impossible to do all of this in the re engine alone. You cannot generate more groups dynamically; a repeated group only keeps its last match. You should re-parse the results like so:
import re
input_str = ("TABLE_ENTRY.0[0x1234]= <FIELD_1=0x1234, FIELD_2=0x1234, FIELD_3=0x1234>\n"
"TABLE_ENTRY.1[0x1235]= <FIELD_1=0x1235, FIELD_2=0x1235, FIELD_3=0x1235>")
results = {}
for match in re.finditer(r"([A-Z_0-9\.]+\[0x[0-9A-F]+\])=\s+<([^>]*)>", input_str):
    fields = match.group(2).split(", ")
    results[match.group(1)] = dict(f.split("=") for f in fields)
>>> results
{'TABLE_ENTRY.0[0x1234]': {'FIELD_2': '0x1234', 'FIELD_1': '0x1234', 'FIELD_3': '0x1234'}, 'TABLE_ENTRY.1[0x1235]': {'FIELD_2': '0x1235', 'FIELD_1': '0x1235', 'FIELD_3': '0x1235'}}
The output will just be a large dict mapping each table entry to a dict of its fields.
It's also rather convenient, as you may do this:
>>> results["TABLE_ENTRY.0[0x1234]"]["FIELD_2"]
'0x1234'
I personally suggest stripping off "TABLE_ENTRY" as it's repetitive, but as you wish.
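If the field values should end up as integers rather than "0x..." strings, a follow-up pass over the nested dict works; `int(value, 16)` accepts the "0x" prefix. A sketch with a hard-coded sample in the same shape as the result above:

```python
# Sample result in the same shape as the finditer-based answer produces.
results = {'TABLE_ENTRY.0[0x1234]': {'FIELD_1': '0x1234', 'FIELD_2': '0x1235'}}

# int(x, 16) parses both '0x1234' and bare hex digits like 'ff3'.
as_ints = {entry: {field: int(value, 16) for field, value in fields.items()}
           for entry, fields in results.items()}
print(as_ints)  # {'TABLE_ENTRY.0[0x1234]': {'FIELD_1': 4660, 'FIELD_2': 4661}}
```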
Use a repeated capture group to match the variable number of fields:
([A-Z_0-9\.]+\[0x[0-9]+\]=)\s+<(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*([A-Z_0-9]+)=(0x[0-9]+|0)
The following part matches any number of fields, each with a trailing comma and optional whitespace:
(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*
And ([A-Z_0-9]+)=(0x[0-9]+|0) will match the last field.
Demo: https://regex101.com/r/gP3oO6/1
Note: if you don't need some groups, it is better to make them non-capturing by adding ?: right after the opening parenthesis ((?: ...)). Also note that (0x[0-9]+|0):\s+ is extra in your regex (based on your input pattern).

Extract the data specified in brackets '[ ]' from a string message in python

I want to extract fields from below Log message.
Example:
Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]
I need to extract the data specified in the brackets [ ] for "Affected columns", "reason" and "Details".
What would be the efficient way to extract these fields in Python?
Note: I can modify the log message format if needed.
If you are free to change the log format, it's easiest to use a common data format - I'd recommend JSON for such data. It is structured, but lightweight enough to write it even from custom bash scripts. The json module allows you to directly convert it to native python objects:
import json # python has a default parser
# assume this is your log message
log_line = '{"Ignoring entry" : {"Affected columns": [1, 3], "reason" : "some reason", "Details": {}}}'
data = json.loads(log_line)
print("Columns to ignore:", data["Ignoring entry"]["Affected columns"])
If you want to work with the current format, you'll have to work with str methods or the re module.
For example, you could do this:
log_msg = "Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]"
def parse_log_line(log_line):
    if log_line.startswith("Ignoring entry"):
        log_data = {}
        for element in log_line.split(',')[1:]:  # parse all elements but the header
            key, _, value = element.partition('[')
            if not value.endswith(']'):
                raise ValueError('Malformed content. Expected %r to end with "]"' % element)
            log_data[key.strip()] = value[:-1]
        return log_data
    raise ValueError('Unrecognized log line type')
Many parsing tasks are best compactly handled by the re module. It allows you to use regular expressions. They are very powerful, but difficult to maintain if you are not used to it. In your case, the following would work:
log_data = {key: value for key, value in re.findall(r',\s*(.+?)\s*\[(.+?)\]', log_msg)}
The re works like this:
, a literal comma, separating the entries
\s* any amount of whitespace after the comma, before the next element
(.+?) the shortest possible sequence of characters (the key, captured via the parentheses)
\s* any amount of whitespace between key and value
\[ a literal [
(.+?) the shortest possible sequence of characters before the closing bracket (the value, captured via the parentheses)
\] a literal ]
The quantifiers * and + mean "zero or more" and "one or more" repetitions; a ? placed after a quantifier makes it non-greedy, i.e. it matches as few characters as possible.
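Putting the re variant together on the sample line (a quick self-contained check):

```python
import re

log_msg = ("Ignoring entry, Affected columns [column1:column2], "
           "reason[some reason], Details[some entry details]")

# Each match yields a (key, value) pair; the dict comprehension collects them.
log_data = {key: value
            for key, value in re.findall(r',\s*(.+?)\s*\[(.+?)\]', log_msg)}
print(log_data)
# {'Affected columns': 'column1:column2', 'reason': 'some reason', 'Details': 'some entry details'}
```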
