How can I parse a sequence of binary digits in Python?
The following is an example of what I am trying to do.
I have a sequence of binary digits, for example
sequence = '1110110100110111011011110101100101100'
and I need to parse it and extract the data fields.
Say the above sequence contains start, id, data and end fields.
start is a 2-bit field, id is an 8-bit field, the data field can vary from 1 to 8192 bits, and end is a 4-bit field.
After parsing, I'm expecting the output as follows:
result = {'start': '11',
          'id': '10110100',
          'data': '11011101101111010110010',
          'end': '1100',
          }
I'm using this in one of my applications.
I'm able to parse the sequence using a regex, but the problem is that the regex must be written by the user. So as an alternative I'm using a BNF grammar, as grammars are more readable.
I tried solving this using Python's parsimonious and pyparsing parsers, but I am not able to find a solution for the fields with variable length.
The grammar I wrote for parsimonious is as follows:
grammar = """sequence = start id data end
start = ~"[01]{2}"
id = ~"[01]{8}"
data = ~"[01]{1,8192}"
end = ~"[01]{4}"
"""
Since the data field is of variable length and the parser is greedy, the above sequence does not match the grammar: the parser pulls the end field's bits into the data field.
I just simplified my problem to the above example.
Let me describe the full problem. There are 3 kinds of packets (let's call them Token, Handshake and Data packets). Token and Handshake packets are of a fixed length, and a Data packet is variable length. (The example shown above is a Data packet.)
The input consists of a continuous stream of bits. Each packet beginning is marked by the "start" pattern and packet end is marked by the "end" pattern. Both of these are fixed bit patterns.
Example Token packet grammar:
start - 2 bits, id - 8 bits, address - 7bits, end - 4bits
111011010011011101100
Example Handshake packet grammar:
start - 2 bits, id - 8bits, end - 4 bits
11101101001100
Example top level rule:
packet = tokenpacket | datapacket | handshakepacket
If there were only one type of packet, then slicing would work. But when we start parsing, we do not know which packet we will finally end up matching. This is why I thought of using a grammar, as the problem is very similar to language parsing.
Can we make the slicing approach work in this case, where we have 3 different packet types to be parsed?
What's the best way to solve this problem?
Thanks in advance,
This will do, just use slicing for this job:
def binParser(data):
    result = {}
    result["start"] = data[:2]
    result["id"] = data[2:10]    # the id field is 8 bits, so it ends at index 10
    result["end"] = data[-4:]
    result["data"] = data[10:-4]
    return result
You will get the correct data from the string.
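To extend this to the three packet types from the full problem, the same slicing idea can dispatch on the total packet length, since Token and Handshake packets have fixed sizes. Here is a minimal sketch, assuming (as in the question's examples) that start is always 11 and end is always 1100; note that length alone cannot distinguish a Data packet that happens to be 21 bits long from a Token packet, so in practice the id field would have to disambiguate:
START, END = '11', '1100'
TOKEN_LEN = 2 + 8 + 7 + 4      # start + id + address + end = 21 bits
HANDSHAKE_LEN = 2 + 8 + 4      # start + id + end = 14 bits

def packetParser(bits):
    # Reject anything not bracketed by the fixed start/end patterns.
    if not (bits.startswith(START) and bits.endswith(END)):
        raise ValueError('not a packet: %r' % bits)
    if len(bits) == HANDSHAKE_LEN:
        return {'type': 'handshake', 'start': bits[:2],
                'id': bits[2:10], 'end': bits[-4:]}
    if len(bits) == TOKEN_LEN:
        return {'type': 'token', 'start': bits[:2], 'id': bits[2:10],
                'address': bits[10:17], 'end': bits[-4:]}
    return {'type': 'data', 'start': bits[:2], 'id': bits[2:10],
            'data': bits[10:-4], 'end': bits[-4:]}

print(packetParser('1110110100110111011011110101100101100'))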
Presumably, there will only ever be one variable-length field, so you can allow this by defining a distance from the start of the sequence and a distance from the end, e.g.
rules = {'start': (None, 2), 'id': (2, 10),
'data': (10, -4), 'end': (-4, None)}
and then use slicing:
sequence = '1110110100110111011011110101100101100'
result = dict((k, sequence[v[0]:v[1]]) for k, v in rules.items())
This gives:
result == {'id': '10110100',
'end': '1100',
'data': '11011101101111010110010',
'start': '11'}
Since you mentioned pyparsing in the tags, here is how I would go about it using pyparsing. This uses Daniel Sanchez's binParser for post-processing.
from pyparsing import Word

# Post-processing of the matched bit string into its fields.
def binParser(m):
    data = m[0]
    return {'start': data[:2],
            'id': data[2:10],
            'end': data[-4:],
            'data': data[10:-4]}

# At least 14 characters for the required fields; attach the processor.
bin_sequence = Word('01', min=14).setParseAction(binParser)
sequence = '1110110100110111011011110101100101100'
print(bin_sequence.parseString(sequence)[0])
This could then be used as part of a larger parser.
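For the three-packet problem, one sketch (untested against a live stream, and assuming the fixed start 11 and end 1100 patterns from the question's examples) is to try pyparsing Regex alternatives in order. Named groups become result names, and the non-greedy data field leaves the final four bits for end:
import pyparsing as pp

token     = pp.Regex(r'(?P<start>11)(?P<id>[01]{8})(?P<address>[01]{7})(?P<end>1100)$')
handshake = pp.Regex(r'(?P<start>11)(?P<id>[01]{8})(?P<end>1100)$')
data      = pp.Regex(r'(?P<start>11)(?P<id>[01]{8})(?P<data>[01]{1,8192}?)(?P<end>1100)$')

# MatchFirst: the fixed-length packets are tried before the variable one.
packet = token | handshake | data

result = packet.parseString('1110110100110111011011110101100101100')
print(result['id'], result['data'], result['end'])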
Related
I can't find a solution to this, so I'm asking here. I have a string that consists of several lines and in the string I want to increase exactly one number by one.
For example:
[CENTER]
[FONT=Courier New][COLOR=#00ffff][B][U][SIZE=4]{title}[/SIZE][/U][/B][/COLOR][/FONT]
[IMG]{cover}[/IMG]
[IMG]IMAGE[/IMG][/CENTER]
[QUOTE]
{description_de}
[/QUOTE]
[CENTER]
[IMG]IMAGE[/IMG]
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
[IMG]IMAGE[/IMG]
[spoiler]
[spoiler=720p]
[CODE=rich][color=Turquoise]
{mediaInfo1}
[/color][/code]
[/spoiler]
[spoiler=1080p]
[CODE=rich][color=Turquoise]
{mediaInfo2}
[/color][/code]
[/spoiler]
[/spoiler]
[hide]
[IMG]IMAGE[/IMG]
[/hide]
[/CENTER]
I'm getting this string from a request and I want to increment the episode by 1. So from 01/5 to 02/5.
What is the best way to make this possible?
I tried to solve this via regex but failed miserably.
Assuming the number you want to change is always after a given pattern, e.g. "Episodes: [/B]", you can use this code:
def increment_episode_num(request_string, episode_pattern="Episodes: [/B]"):
    idx = request_string.find(episode_pattern) + len(episode_pattern)
    episode_count = int(request_string[idx:idx+2])
    return request_string[:idx] + f"{(episode_count+1):0>2}" + request_string[idx+2:]
For example, given your string:
req_str = """[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
"""
res = increment_episode_num(req_str)
print(res)
which gives you the desired output:
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]02/5
As @Barmar suggested in the comments, and following the example from the documentation of re, also formatting to get the right amount of zeroes as padding:
import re

pattern = r"(?<=Episodes: \[/B\])[\d]+?(?=/\d)"

def add_one(matchobj):
    number = str(int(matchobj.group(0)) + 1)
    return "{0:0>2}".format(number)

re.sub(pattern, add_one, request)
The pattern uses a look-ahead and a look-behind to capture only the number that corresponds to Episodes. It works whether the input is in the format 01/5 or 1/5, but it always returns the format 01/5. Of course, you can expand the function so it recognizes the original format, or even so it can add numbers other than 1.
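For example, reusing pattern and add_one from above on the unpadded form:
request = "[B]Episodes: [/B]1/5"
print(re.sub(pattern, add_one, request))   # -> [B]Episodes: [/B]02/5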
I'm trying to create a phylogenetic tree by making a .phy file from my data.
I have a dataframe
ndf=
ESV trunc
1 esv1 TACGTAGGTG...
2 esv2 TACGGAGGGT...
3 esv3 TACGGGGGG...
7 esv7 TACGTAGGGT...
I checked the length of the elements of the column "trunc":
length_checker = np.vectorize(len)
arr_len = length_checker(ndf['trunc'])
The resulting arr_len gives the same length (=253) for all the elements.
I saved this dataframe as a .phy file, which looks like this:
23 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
This is similar to the file used in this tutorial.
However, when I run the command
aln = AlignIO.read('msa.phy', 'phylip')
I get "ValueError: Sequences must all be the same length"
I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!
Thanks
Generally, phylip is the fiddliest format in phylogenetics between different programs. There is strict phylip format, relaxed phylip format, etc. It is not easy to know which separator is being used, a space character and/or a carriage return.
I think that you appear to have left a space between the name of the taxon (i.e. the sequence label) and the sequence itself, viz.
2. esv2
Phylip format is watching for the space between the label and the sequence data; in this example the sequence would be 3bp long. The use of a "." is generally not a great idea as well, and the integer doesn't appear to denote a line number.
The other issue is that you could/should try keeping the sequence on the same line as the label, removing the carriage return, viz.
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
Sometimes a carriage return does work (this could be relaxed phylip format), but the traditional format uses a space character " ". I always maintained a uniform number of spaces to preserve the alignment ... not sure if that is needed.
Note that if your taxon name exceeds 10 characters you will need relaxed phylip format, and that format is generally a good idea in any case.
The final solution, if all else fails, is to convert to fasta, import as fasta, and then convert to phylip. If all this fails ... post back, there's more trouble-shooting to do.
Fasta format removes the "23 253" header, and then each sequence looks like this:
>esv2
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
There is always a carriage return between ">esv2" and the sequence. In addition, ">" always prefixes the label (taxon name), without any space. You can simply convert via regex, or "re" in Python. As a perl one-liner it would be something like s/^([a-z]+[0-9]+)/>$1/g. I'm pretty sure there will be an online website that will do this.
You then simply replace "phylip" with "fasta" in your import command. Once imported, you can ask Biopython to convert to whatever format you want, and it should not have any problem.
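For the conversion step, a rough Python sketch (the file names msa.phy and msa.fasta are just placeholders for yours): drop the count header and rewrite each "label sequence" line as a two-line fasta record.
import re

with open('msa.phy') as src, open('msa.fasta', 'w') as dst:
    next(src)                      # drop the "23 253" count header
    for line in src:
        m = re.match(r'\s*(\S+)\s+(\S+)\s*$', line)
        if m:                      # label + sequence on one line
            dst.write('>%s\n%s\n' % m.groups())
After that, AlignIO.read('msa.fasta', 'fasta') should import cleanly.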
First, please read the answer to How to make good reproducible pandas examples. In the future, please provide a minimal reproducible example.
Secondly, Michael G is absolutely correct that phylip is a format that is very peculiar about its syntax.
The code below will allow you to generate a phylogenetic tree from your Pandas dataframe.
First some imports and let's recreate your dataframe.
import pandas as pd
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO
data = {'ESV' : ['esv1', 'esv2', 'esv3'],
'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']
}
ndf = pd.DataFrame.from_dict(data)
print(ndf)
Output:
ESV trunc
0 esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
1 esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
2 esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...
Next, write the phylip file in the correct format.
with open("test.phy", 'w') as f:
    f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
    for row in ndf.iterrows():
        f.write("{:10} {}\n".format(*row[1].to_list()))
Output of test.phy:
3 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
Now we can start with the creation of our phylogenetic tree.
# Read the sequences and align
aln = AlignIO.read('test.phy', 'phylip')
print(aln)
Output:
SingleLetterAlphabet() alignment with 3 rows and 253 columns
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3
Calculate the distance matrix:
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
print(dm)
Output:
esv1 0
esv2 0.3003952569169961 0
esv3 0.6086956521739131 0.6245059288537549 0
Construct the phylogenetic tree using the UPGMA algorithm and draw the tree in ascii:
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
Phylo.draw_ascii(tree)
Output:
________________________________________________________________________ esv3
_|
| ___________________________________ esv2
|____________________________________|
|___________________________________ esv1
Or make a nice plot of the tree:
Phylo.draw(tree)
Output: a matplotlib figure of the tree (not reproduced here).
I used to decode AIS messages with this package (Python) https://github.com/schwehr/noaadata/tree/master/ais until I started getting a new format of the messages.
As you may know, AIS messages mostly come in two types: one part (one message) or two parts (multi message). Message #5 always comes in two parts. Example:
!AIVDM,2,1,1,A,55?MbV02;H;s<HtKR20EHE:address#hidden#Dn2222222216L961O5Gf0NSQEp6ClRp8,0*1C
!AIVDM,2,2,1,A,88888888880,2*25
I used to decode this just fine using the following piece of code:
nmeamsg = fields.split(',')
if nmeamsg[0] != '!AIVDM':
    return
total = eval(nmeamsg[1])
part = eval(nmeamsg[2])
aismsg = nmeamsg[5]
nmeastring = string.join(nmeamsg[0:-1], ',')
bv = binary.ais6tobitvec(aismsg)
msgnum = int(bv[0:6])
--
elif (total > 1):
    # Multi Slot Messages: 5,6,8,12,14,17,19,20?,21,24,26
    global multimsg
    if total == 2:
        if msgnum == 5:
            if nmeastring.count('!AIVDM') == 2 and len(nmeamsg) == 13:  # make sure there are two parts concatenated together
                aismsg = nmeamsg[5] + nmeamsg[11]
                bv = binary.ais6tobitvec(aismsg)
                msg5 = ais_msg_5.decode(bv)
                print "message5 :", msg5
                return msg5
Now I'm getting a new format of the messages:
!SAVDM,2,1,7,A,55#0hd01sq`pQ3W?O81L5#E:1=0U8U#000000016000006H0004m8523k#Dp,0*2A,1410825672
!SAVDM,2,2,7,A,4hC`2U#C`40,2*76,1410825672,1410825673
Note: the number at the last index is the time in epoch format.
I tried to adjust my code to decode this new format. I succeeded in decoding messages with one part; my problem is the multi-message type.
nmeamsg = fields.split(',')
if nmeamsg[0] != '!AIVDM' and nmeamsg[0] != '!SAVDM':
    return
total = eval(nmeamsg[1])
part = eval(nmeamsg[2])
aismsg = nmeamsg[5]
nmeastring = string.join(nmeamsg[0:-1], ',')
dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[7])))
bv = binary.ais6tobitvec(aismsg)
msgnum = int(bv[0:6])
The decoder can't combine the two lines into one, so decoding fails because message #5 should contain two strings, not one. The error I get is in these lines:
if nmeastring.count('!SAVDM') == 2 and len(nmeamsg) == 13:
    aismsg = nmeamsg[5] + nmeamsg[11]
Where len(nmeamsg) is always 8 (second line) and nmeastring.count('!SAVDM') is always 1
I hope I explained this clearly so someone can let me know what I'm missing here.
UPDATE
Okay, I think I found the reason. I pass messages from a file to the script line by line:
for line in file:
    i = i + 1
    try:
        doais(line)
Where message #5 should be passed as two lines. Any idea on how I can accomplish that?
UPDATE
I did it by modifying the code a little bit:
for line in file:
    i = i + 1
    try:
        nmeamsg = line.split(',')
        aismsg = nmeamsg[5]
        bv = binary.ais6tobitvec(aismsg)
        msgnum = int(bv[0:6])
        print msgnum
        if nmeamsg[0] != '!AIVDM' and nmeamsg[0] != '!SAVDM':
            print "wrong format"
        total = eval(nmeamsg[1])
        if total == 1:
            dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[8])))
            doais(line, msgnum, dbtimestring, aismsg)
        if total == 2:  # Multi-line messages
            lines = line + file.next()
            nmeamsg = lines.split(',')
            dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[15])))
            aismsg = nmeamsg[5] + nmeamsg[12]
            doais(lines, msgnum, dbtimestring, aismsg)
Be aware that noaadata is my old research code; libais is my production library that is in use for NOAA's ERMA and WhaleAlert.
I usually make decoding a two pass process. First, join the multi-line messages; I refer to this as normalization (ais_normalize.py). You have several issues in this step. First, the two component lines have different timestamps on the right of the second string. By the USCG's old metadata standard, the last one matters, so my code will assume that these two lines are not related. Second, you don't have the required station id field.
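The joining step itself is easy to sketch. This is not what ais_normalize.py does internally, just a minimal illustration: group the parts by the sequential message id in field 3 and concatenate the payloads in field 5.
def join_multipart(lines):
    """Yield one complete AIS payload per message (minimal sketch)."""
    pending = {}
    for line in lines:
        f = line.split(',')
        if f[0] not in ('!AIVDM', '!SAVDM'):
            continue
        total, part, seq_id = int(f[1]), int(f[2]), f[3]
        parts = pending.setdefault(seq_id, {})
        parts[part] = f[5]         # the payload field
        if len(parts) == total:    # all parts seen: emit and forget
            yield ''.join(parts[i] for i in sorted(parts))
            del pending[seq_id]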
Where are you getting the SA from in SAVDM? What device ("talker" in the NMEA vocab) is receiving these messages?
If you're in Ruby, I can recommend the NMEA and AIS decoder ruby gem that I wrote, available on github. It's based on the unofficial AIS spec at catb.org, which is maintained by one of Kurt's colleagues.
It handles combining of multipart messages, reads from streams, and supports a large number of NMEA and AIS messages. Decoding the 50 binary subtypes of AIS messages 6 and 8 is presently in development.
To handle the nonstandard lines you posted:
!SAVDM,2,1,7,A,55#0hd01sq`pQ3W?O81L5#E:1=0U8U#000000016000006H0004m8523k#Dp,0*2A,1410825672
!SAVDM,2,2,7,A,4hC`2U#C`40,2*76,1410825672,1410825673
It would be necessary to add a new parse rule that accepts fields after the checksum, but aside from that it should go smoothly. In other words, you'd copy the parser line here:
| BANG DATA CSUM { result = NMEAPlus::AISMessageFactory.create(val[0], val[1], val[2]) }
and have something like
| BANG DATA CSUM COMMA DATA { result = NMEAPlus::AISMessageFactory.create(val[0], val[1], val[2], val[4]) }
What do you do with those extra timestamp(s)? It almost looks like they've been appended by whatever software is doing the logging, rather than being part of the actual message.
I'm trying to write a small regexp checker for mail with the following conditions:
1) domain name between 2 and 128 symbols (numbers, alphabet and .-) = /^[a-z0-9_.-]{2,128}$/
2) minus symbol not at the beginning or the end of the login or domain name = /^[^-]|[^-]$/
3) account name not less than 64 symbols = /^.{64,}$/
4) two dots together are disallowed = /^([^.]|([^.]).[^.])*$/
5) if double quotes exist in the string they must come in pairs
6) ! , : may occur between double quotes
What could I use from regexps to meet these conditions and bring them together?
Do not write your own regular expression for email addresses. Thousands of people have re-invented this wheel. Here is one example, from http://code.iamcal.com/php/rfc822/full_regexp.txt
(((?:(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*
))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+
)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00
-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x0
9]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\
x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x2
0\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\
x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7
f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\
x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?(
?:(?:[\x41-\x5a\x61-\x7a]|[\x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x7e])+
(?:\x2e(?:[\x41-\x5a\x61-\x7a]|[\x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x
7e])+)*)(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]
+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x0
9]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\
x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20
\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x
20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[
\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x2
1-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-
\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*)
)?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*)))
)?))|((?:(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09
]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x
09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[
\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x2
0\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\
x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?
[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x
21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00
-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*
))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))
))?\x22(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*)
)?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21\x23-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0
b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))+(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:
[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x22(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(
?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)
|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-
\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(
?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?
:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28
(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?
:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\
x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\
x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x2
0\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?))|((?:(?:(?:(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x
09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20
\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x
27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f
]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x
29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?
(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*)
)?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x
09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+
)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|
(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?(?:(?:[\x41-\x5a\x61-\x7a]|[\x30-\x39]|[\x21\x23-\x27
\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x7e]))+(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+
)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09
]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x
2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f])))
)*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))
*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\
x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?
:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x
0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?
:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[
\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?)|(?:(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)
|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]
+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2
A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))
*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*
(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x
28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0
b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:
[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\
x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?\x22(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[
\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21\x23-\x5b\x5d-\
x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))+(?:(?:(?:[\
x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x22(?:(?:(?:(?:(?:
[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(
?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x0
8\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x
7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(
?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]
+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x0
9]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7
e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x2
0\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\
x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?))(?:\x2e(?:(?:(?:(?:
(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(
?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:
[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x
0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x
20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[
\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?
:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5
b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(
?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:
(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?(?:(?:[\x41-
\x5a\x61-\x7a]|[\x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x7e]))+(?:(?:(?:(
?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?
:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x0
1-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x
0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x
09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20
\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x
20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5
d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?
:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[
\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?)|(?:(?:(?:(?:(?
:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:
(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01
-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0
e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x0
9]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\
x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x2
0\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d
-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:
[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\
x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?\x22(?:(?:(?:(?:[
\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x
0b\x0c\x0e-\x1f\x7f]|[\x21\x23-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*
\x0d*)|(?:\x5c[\x00-\x7f]))))+(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0
d\x0a)[\x20\x09]+)*))?\x22(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\
x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?
:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\
x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(
?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]
*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x0
9]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\
x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x
0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\
x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0
a)[\x20\x09]+)*))))?)))*)))\x40(((?:(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x0
9]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\
x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\
x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\
x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:
[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(
?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x0
8\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x
7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(
?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:
(?:\x0d\x0a)[\x20\x09]+)*))))?(?:(?:[\x41-\x5a\x61-\x7a]|[\x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d\x2f\
x3d\x3f\x5e\x5f\x60\x7b-\x7e])+(?:\x2e(?:[\x41-\x5a\x61-\x7a]|[\x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d
\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x7e])+)*)(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20
\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x
20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5
d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?
:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:
(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(
?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-
\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e
-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09
]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+
(?:(?:\x0d\x0a)[\x20\x09]+)*))))?))|((?:(?:(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x2
0\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\
x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x
5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(
?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?
:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:
(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01
-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0
e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x0
9]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]
+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?\x5b(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09
]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x5a\x5e-\x7e])|(?:\x5
c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:
\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x5d(?:(?:(?:(?:(?:[\x20\x09]*(
?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]
*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0
e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d
*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0
a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\
x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0
d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(
?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x
0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d
\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?))|((?:(?:(?:(?:(?:(?:(?:[\x20\x0
9]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\
x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0
c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*
\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0
d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\
x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?
:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\
x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(
?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:
\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?(?:(?:[\x41-\x5a\x61-\x7a]|[\
x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x7e]))+(?:(?:(?:(?:(?:[\x20\x09]*(
?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]
*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0
e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d
*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0
a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\
x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0
d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(
?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x
0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0d
\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?)(?:\x2e(?:(?:(?:(?:(?:(?:[\x20\x
09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20
\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x
0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a
*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x
0d\x0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:
\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(
?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:
\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*
(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?
:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?(?:(?:[\x41-\x5a\x61-\x7a]|[
\x30-\x39]|[\x21\x23-\x27\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b-\x7e]))+(?:(?:(?:(?:(?:[\x20\x09]*
(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09
]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x
0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0
d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x
0a)[\x20\x09]+)*))?\x29))*(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d
\x0a)[\x20\x09]+)*))?(?:\x28(?:(?:(?:(?:[\x20\x09]*(?:\x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x
0d\x0a)[\x20\x09]+)*))?(?:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|[\x21-\x27\x2A-\x5b\x5d-\x7e])|(?:\x5c
(?:\x0a*\x0d*[\x00-\x09\x0b\x0c\x0e-\x7f]\x0a*\x0d*)|(?:\x5c[\x00-\x7f]))))*(?:(?:(?:[\x20\x09]*(?:\
x0d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))?\x29)|(?:(?:(?:[\x20\x09]*(?:\x0
d\x0a))?[\x20\x09]+)|(?:[\x20\x09]+(?:(?:\x0d\x0a)[\x20\x09]+)*))))?))*)))
use this /\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z/i and make some changes accordingly if you want, but this is a standard one that works in most cases.
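In Python the same pattern can be applied like this (\A and \Z anchor the whole string, and re.IGNORECASE stands in for the trailing /i):
import re

EMAIL_RE = re.compile(r'\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z', re.IGNORECASE)

print(bool(EMAIL_RE.match('user@example.com')))    # True
print(bool(EMAIL_RE.match('user@@example.com')))   # False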
You can build your own regex using this tool:
http://www.txt2re.com/
This tool is very cool. Hope this helps you.
I am trying to parse the output of a statistical program (Mplus) using Python.
The format of the output (example here) is structured in blocks, sub-blocks, columns, etc., where the whitespace and breaks are very important. Depending on e.g. the options requested, you get an additional (sub)block or column here or there.
Approaching this using regular expressions has been a PITA and completely unmaintainable. I have been looking into parsers as a more robust solution, but
am a bit overwhelmed by all the possible tools and approaches;
have the impression that they are not well suited for this kind of output.
E.g. LEPL has something called line-aware parsing, which seems to go in the right direction (whitespace, blocks, ...) but is still geared to parsing programming syntax, not output.
Suggestions on which direction to look would be appreciated.
Yes, this is a pain to parse. You don't -- however -- actually need very many regular expressions. Ordinary split may be sufficient for breaking this document into manageable sequences of strings.
These are a lot of what I call "Head-Body" blocks of text. You have titles, a line of "--"'s and then data.
What you want to do is collapse a "head-body" structure into a generator function that yields individual dictionaries.
def get_means_intercepts_thresholds( source_iter ):
    """Precondition: Current line is a "MEANS/INTERCEPTS/THRESHOLDS" line"""
    head = source_iter.next().strip().split()
    junk = source_iter.next().strip()
    assert set( junk ) == set( [' ','-'] )
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break
        raw_data = line.strip().split()
        data = dict( zip( head, map( float, raw_data[1:] ) ) )
        yield int(raw_data[0]), data

def get_slopes( source_iter ):
    """Precondition: Current line is a "SLOPES" line"""
    head = source_iter.next().strip().split()
    junk = source_iter.next().strip()
    assert set( junk ) == set( [' ','-'] )
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break
        raw_data = line.strip().split()
        data = dict( zip( head, map( float, raw_data[1:] ) ) )
        yield raw_data[0], data
The point is to consume the head and the junk with one set of operations.
Then consume the rows of data which follow using a different set of operations.
Since these are generators, you can combine them with other operations.
def get_estimated_sample_statistics( source_iter ):
    """Precondition: at the ESTIMATED SAMPLE STATISTICS line"""
    for line in source_iter:
        if len(line.strip()) == 0: continue
        assert line.strip() == "MEANS/INTERCEPTS/THRESHOLDS"
        for data in get_means_intercepts_thresholds( source_iter ):
            yield data
        # get_means_intercepts_thresholds stops when it reaches the
        # "SLOPES" header line, so the slope rows follow immediately.
        for data in get_slopes( source_iter ):
            yield data
        break
Something like this may be better than regular expressions.
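For instance, since a file object is itself a line iterator, the combined generator can be driven straight from a file (mplus_output.txt is a hypothetical name):
with open('mplus_output.txt') as source:
    for label, values in get_estimated_sample_statistics(source):
        print label, values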
Based on your example, what you have is a bunch of different, nested sub-formats that, individually, are very easily parsed. What can be overwhelming is the sheer number of formats and the fact that they can be nested in different ways.
At the lowest level you have a set of whitespace-separated values on a single line. Those lines combine into blocks, and how the blocks combine and nest within each other is the complex part. This type of output is designed for human reading and was never intended to be "scraped" back into machine-readable form.
First, I would contact the author of the software and find out if there is an alternate output format available, such as XML or CSV. If done correctly (i.e. not just the print-format wrapped in clumsy XML, or with commas replacing whitespace), this would be much easier to handle. Failing that I would try to come up with a hierarchical list of formats and how they nest. For example,
ESTIMATED SAMPLE STATISTICS begins a block
Within that block MEANS/INTERCEPTS/THRESHOLDS begins a nested block
The next two lines are a set of column headings
This is followed by one (or more?) rows of data, with a row header and data values
And so on. If you approach each of these problems separately, you will find that it's tedious but not complex. Think of each of the above steps as modules that test the input to see if it matches and if it does, then call other modules to test further for things that can occur "inside" the block, backtracking if you get to something that doesn't match what you expect (this is called "recursive descent" by the way).
Note that you will have to do something like this anyway, in order to build an in-memory version of the data (the "data model") on which you can operate.
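A bare-bones sketch of that recursive-descent idea, with hypothetical block names; each function consumes lines from a shared iterator and returns the piece of the data model it recognizes:
def parse_output(lines):
    doc = {}
    it = iter(lines)
    for line in it:
        if line.strip() == 'ESTIMATED SAMPLE STATISTICS':
            doc['sample_statistics'] = parse_sample_statistics(it)
    return doc

def parse_sample_statistics(it):
    block = {}
    for line in it:
        name = line.strip()
        if name == 'MEANS/INTERCEPTS/THRESHOLDS':
            block['means'] = parse_table(it)
        elif name:                 # an unrecognized header ends this block
            break
    return block

def parse_table(it):
    header = next(it).split()      # the column headings
    next(it)                       # the "-----" underline
    rows = []
    for line in it:
        if not line.strip():       # a blank line ends the table
            break
        cells = line.split()
        rows.append(dict(zip(header, cells[1:])))
    return rows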
My suggestion is to do a rough massaging of the lines into a more useful form. Here are some experiments with your data:
from __future__ import print_function
from itertools import groupby
import string

counter = 0

statslist = [ statsblocks.split('\n')
              for statsblocks in open('mlab.txt').read().split('\n\n')
              ]
print(len(statslist), 'blocks')

def blockcounter(line):
    global counter
    if not line[0]:
        counter += 1
    return counter

blocklist = [ [block, list(stats)] for block, stats in groupby(statslist, blockcounter)]

for blockno, block in enumerate(blocklist):
    print(120 * '=')
    for itemno, line in enumerate(block[1:][0]):
        if len(line) < 4 and any(line[-1].endswith(c) for c in string.letters):
            print('\n** DATA %i, HEADER (%r)**' % (blockno, line[-1]))
        else:
            print('\n** DATA %i, item %i, length %i **' % (blockno, itemno, len(line)))
        for ind, subdata in enumerate(line):
            if '___' in subdata:
                print('    *** Numeric data starts: ***')
            else:
                if 6 < len(subdata) < 16:
                    print('** TYPE: %s **' % subdata)
            print('%3i : %s' % (ind, subdata))
You could try PyParsing. It enables you to write a grammar for what you want to parse, and it has examples beyond parsing programming languages. But I agree with Jim Garrison that your case doesn't seem to call for a real parser, because writing the grammar would be cumbersome. I would try a brute-force solution, e.g. splitting lines at whitespace. It's not foolproof, but we can assume the output is correct, so if a line has n headers, the next line will have exactly n values.
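The brute-force pairing is only a couple of lines (the header and value strings here are made up for illustration):
header_line = "Variable     X1      X2      X3"
value_line  = "1            0.123   4.567   8.910"
row = dict(zip(header_line.split()[1:],
               map(float, value_line.split()[1:])))
print(row)   # {'X1': 0.123, 'X2': 4.567, 'X3': 8.91}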
It turns out that tabular program output like this was one of my earliest applications of pyparsing. Unfortunately, that exact example dealt with a proprietary format that I can't publish, but there is a similar example posted here: http://pyparsing.wikispaces.com/file/view/dictExample2.py.