I used to decode AIS messages with this Python package https://github.com/schwehr/noaadata/tree/master/ais until I started receiving the messages in a new format.
As you may know, AIS messages mostly come in two forms: single part (one sentence) or multiple parts (a multi-sentence message). Message #5 always comes in two parts. Example:
!AIVDM,2,1,1,A,55?MbV02;H;s<HtKR20EHE:address#hidden#Dn2222222216L961O5Gf0NSQEp6ClRp8,0*1C
!AIVDM,2,2,1,A,88888888880,2*25
I used to decode this just fine using the following piece of code:
nmeamsg = fields.split(',')
if nmeamsg[0] != '!AIVDM':
    return
total = eval(nmeamsg[1])
part = eval(nmeamsg[2])
aismsg = nmeamsg[5]
nmeastring = string.join(nmeamsg[0:-1],',')
bv = binary.ais6tobitvec(aismsg)
msgnum = int(bv[0:6])
--
elif (total>1):
    # Multi Slot Messages: 5,6,8,12,14,17,19,20?,21,24,26
    global multimsg
    if total==2:
        if msgnum==5:
            if nmeastring.count('!AIVDM')==2 and len(nmeamsg)==13: # make sure there are two parts concatenated together
                aismsg = nmeamsg[5]+nmeamsg[11]
                bv = binary.ais6tobitvec(aismsg)
                msg5 = ais_msg_5.decode(bv)
                print "message5 :",msg5
                return msg5
Now I'm getting a new format of the messages:
!SAVDM,2,1,7,A,55#0hd01sq`pQ3W?O81L5#E:1=0U8U#000000016000006H0004m8523k#Dp,0*2A,1410825672
!SAVDM,2,2,7,A,4hC`2U#C`40,2*76,1410825672,1410825673
Note: the number in the last field is the time in epoch format.
I tried to adjust my code to decode this new format. I succeeded in decoding single-part messages; my problem is with the multi-part type.
nmeamsg = fields.split(',')
if nmeamsg[0] != '!AIVDM' and nmeamsg[0] != '!SAVDM':
    return
total = eval(nmeamsg[1])
part = eval(nmeamsg[2])
aismsg = nmeamsg[5]
nmeastring = string.join(nmeamsg[0:-1],',')
dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[7])))
bv = binary.ais6tobitvec(aismsg)
msgnum = int(bv[0:6])
The decoder can't combine the two lines into one, so decoding fails because message #5 should be built from two payload strings, not one. The check that fails is in these lines:
if nmeastring.count('!SAVDM')==2 and len(nmeamsg)==13:
    aismsg = nmeamsg[5]+nmeamsg[11]
Here len(nmeamsg) is always 8 (the second line) and nmeastring.count('!SAVDM') is always 1.
I hope I explained this clearly so someone can let me know what I'm missing here.
UPDATE
Okay, I think I found the reason. I pass messages from the file to the script line by line:
for line in file:
    i=i+1
    try:
        doais(line)
But message #5 should be passed as two lines. Any idea how I can accomplish that?
UPDATE
I did it by modifying the code a little bit:
for line in file:
    i=i+1
    try:
        nmeamsg = line.split(',')
        aismsg = nmeamsg[5]
        bv = binary.ais6tobitvec(aismsg)
        msgnum = int(bv[0:6])
        print msgnum
        if nmeamsg[0] != '!AIVDM' and nmeamsg[0] != '!SAVDM':
            print "wrong format"
        total = eval(nmeamsg[1])
        if total == 1:
            dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[8])))
            doais(line,msgnum,dbtimestring,aismsg)
        if total == 2: #Multi-line messages
            lines= line+file.next()
            nmeamsg = lines.split(',')
            dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[15])))
            aismsg = nmeamsg[5]+nmeamsg[12]
            doais(lines,msgnum,dbtimestring,aismsg)
Be aware that noaadata is my old research code. libais is my production library that is in use for NOAA's ERMA and WhaleAlert.
I usually make decoding a two-pass process. First, join the multi-line messages; I refer to this as normalization (ais_normalize.py). You have several issues in this step. First, the two component lines have different timestamps (look at the right end of the second string). By the old USCG metadata standard, the last one matters, so my code will assume that these two lines are not related. Second, you don't have the required station id field.
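To illustrate the joining step, here is a minimal sketch (this is not ais_normalize.py; the join_multipart name and the field positions are assumptions based on the sample lines above):

def join_multipart(lines):
    """Yield complete AIS payloads, joining multi-part !AIVDM/!SAVDM sentences."""
    pending = {}  # (sequence id, channel) -> {part number: payload fragment}
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] not in ('!AIVDM', '!SAVDM'):
            continue
        total, part = int(fields[1]), int(fields[2])
        payload = fields[5]
        if total == 1:
            yield payload
            continue
        key = (fields[3], fields[4])        # sequential message id + channel
        pending.setdefault(key, {})[part] = payload
        if len(pending[key]) == total:      # all parts seen, emit the joined payload
            parts = pending.pop(key)
            yield ''.join(parts[i] for i in sorted(parts))

Each joined payload could then be handed to binary.ais6tobitvec() exactly as in the single-part case.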
Where are you getting the SA from in SAVDM? What device ("talker" in the NMEA vocab) is receiving these messages?
If you're in Ruby, I can recommend the NMEA and AIS decoder Ruby gem that I wrote, available on GitHub. It's based on the unofficial AIS spec at catb.org, which is maintained by one of Kurt's colleagues.
It handles combining of multipart messages, reads from streams, and supports a large set of NMEA and AIS messages. Decoding the 50 binary subtypes of AIS messages 6 and 8 is presently in development.
To handle the nonstandard lines you posted:
!SAVDM,2,1,7,A,55#0hd01sq`pQ3W?O81L5#E:1=0U8U#000000016000006H0004m8523k#Dp,0*2A,1410825672
!SAVDM,2,2,7,A,4hC`2U#C`40,2*76,1410825672,1410825673
It would be necessary to add a new parse rule that accepts fields after the checksum, but aside from that it should go smoothly. In other words, you'd copy the parser line here:
| BANG DATA CSUM { result = NMEAPlus::AISMessageFactory.create(val[0], val[1], val[2]) }
and have something like
| BANG DATA CSUM COMMA DATA { result = NMEAPlus::AISMessageFactory.create(val[0], val[1], val[2], val[4]) }
What do you do with those extra timestamp(s)? It almost looks like they've been appended by whatever software is doing the logging, rather than being part of the actual message.
Related
I am making a program that should be able to extract the notes, rests, and chords from a certain MIDI file and write the respective pitch (in MIDI tone numbers, which go from 0 to 127) of the notes and chords to a CSV file for later use.
For this project, I am using the Python library music21.
from music21 import *
import pandas as pd
#SETUP
path = r"Pirates_TheCarib_midi\1225766-Pirates_of_The_Caribbean_Medley.mid"
#create a function for parsing and extracting the notes
def extract_notes(path):
    stm = converter.parse(path)
    treble = stm[0] #access the first part (if there is only one part)
    bass = stm[1]
    #note extraction
    notes_treble = []
    notes_bass = []
    for thisNote in treble.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_treble.append(indiv_note) #append the note's name, pitch and offset
    for thisNote in bass.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_bass.append(indiv_note) #add the notes to the bass
    return notes_treble, notes_bass
#write to csv
def to_csv(notes_array):
    df = pd.DataFrame(notes_array, index=None, columns=None)
    df.to_csv("attempt1_v1.csv")
#using the functions
notes_array = extract_notes(path)
#to_csv(notes_array)
#DEBUGGING
stm = converter.parse(path)
print(stm.parts)
Here is the link to the score I am using as a test.
https://musescore.com/user/1699036/scores/1225766
When I run the extract_notes function, it returns two empty arrays, and the line
print(stm.parts)
returns
<music21.stream.iterator.StreamIterator for Score:0x1b25dead550 #:0>
I am confused as to why it does this. The piece should have two parts, treble and bass. How can I get each note, chord and rest into an array so I can put it in a csv file?
Here is a small snippet showing how I did it. I needed to get all notes, chords, and rests for a specific instrument, so I first iterated through the parts to find that instrument, then checked what type each element is and appended it.
You can call this method like this:
notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")
where keyboard_instruments is a list of instrument names:
keyboard_instruments = ["KeyboardInstrument", "Piano", "Harpsichord", "Clavichord", "Celesta", ]
from music21 import converter, instrument, note, chord, stream

def get_notes_chords_rests(instrument_type, path):
    try:
        midi = converter.parse(path)
        parts = instrument.partitionByInstrument(midi)
        note_list = []
        for music_instrument in range(len(parts)):
            if parts.parts[music_instrument].id in instrument_type:
                for element_by_offset in stream.iterator.OffsetIterator(parts[music_instrument]):
                    for entry in element_by_offset:
                        if isinstance(entry, note.Note):
                            note_list.append(str(entry.pitch))
                        elif isinstance(entry, chord.Chord):
                            note_list.append('.'.join(str(n) for n in entry.normalOrder))
                        elif isinstance(entry, note.Rest):
                            note_list.append('Rest')
        return note_list
    except Exception as e:
        print("failed on ", path)
        pass
P.S. It is important to use a try block because a lot of MIDI files on the web are corrupted.
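As a usage sketch (my own addition, reusing the pandas approach from the question; the output file name is just a placeholder):

import pandas as pd

notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")

# One row per event: a pitch name ('C4'), a chord as normal-order digits ('4.7.11'), or 'Rest'.
pd.DataFrame({"event": notes}).to_csv("notes_v1.csv", index=False)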
I am trying to read a log file and compare certain values against preset thresholds. My code manages to log the raw data with the first for loop in my function.
I have added print statements to try and figure out what was going on and I've managed to deduce that my second for loop never "happens".
This is my code:
def smartTest(log, passed_file):
    # Threshold values based on averages, subject to change if need be
    RRER = 5
    SER = 5
    OU = 5
    UDMA = 5
    MZER = 5
    datafile = passed_file
    # Log the raw data
    log.write('=== LOGGING RAW DATA FROM SMART TEST===\r\n')
    for line in datafile:
        log.write(line)
    log.write('=== END OF RAW DATA===\r\n')
    print 'Checking SMART parameters...',
    log.write('=== VERIFYING SMART PARAMETERS ===\r\n')
    for line in datafile:
        if 'Raw_Read_Error_Rate' in line:
            line = line.split()
            if int(line[9]) < RRER and datafile == 'diskOne.txt':
                log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK ONE OK!\r\n" %int(line[9]))
            elif int(line[9]) < RRER and datafile == 'diskTwo.txt':
                log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK TWO OK!\r\n" %int(line[9]))
            else:
                print 'FAILED'
                log.write("WARNING: Raw_Read_Error_Rate SMART parameter is: %s. Value over threshold!\r\n" %int(line[9]))
                rcode = mbox(u'Attention!', u'One or more hardrives may need replacement.', 0x30)
This is how I am calling this function:
dataOne = diskOne()
smartTest(log, dataOne)
print 'Disk One Done'
diskOne() looks like this:
def diskOne():
    if os.path.exists(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl"):
        os.chdir(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl")
        os.system("Smartctl -a /dev/csmi0,0 > C:\Dejero\Installation-Scripts\diskOne.txt")
        # Store file in variable
        os.chdir(r"C:\Dejero\Installation-Scripts")
        datafile = open('diskOne.txt', 'rb')
        return datafile
    else:
        log.write('Smart utility not found.\r\n')
I have tried googling similar issues to mine and have found none. I tried moving my first for loop into diskOne() but the same issue occurs. There is no syntax error and I am just not able to see the issue at this point.
It is not skipping your second loop. You need to seek the position back: after the first loop has read the whole file, the file offset sits at the end of the file, so the second loop has nothing left to read. This can be done easily by adding the line
datafile.seek(0)
Before the second loop.
Ref: Documentation
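For illustration, here is a trimmed-down sketch of the function above with the fix in place (the threshold checks are stubbed out):

def smartTest(log, passed_file):
    datafile = passed_file
    # First pass: copy the raw SMART output into the log.
    log.write('=== LOGGING RAW DATA FROM SMART TEST===\r\n')
    for line in datafile:
        log.write(line)
    log.write('=== END OF RAW DATA===\r\n')
    # Rewind: after the first loop the file offset is at end-of-file,
    # so without this the second loop would iterate zero times.
    datafile.seek(0)
    # Second pass: parse values and compare against thresholds as before.
    for line in datafile:
        if 'Raw_Read_Error_Rate' in line:
            pass  # threshold checks go here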
I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x

def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district

def splitforstos(string):
    for itemind,item in enumerate(string): # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])

def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
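For instance, a minimal sketch of that idea (the LevyRecord name and field list are mine, chosen to match the loose variables above):

from collections import namedtuple

# Illustrative record type; one field per value pulled out of a district entry.
LevyRecord = namedtuple(
    "LevyRecord",
    "filename county district vote1 vote2 mills votetype numyears yearcom yeardue reason",
)

# filename, county, x, vote2 and mills come from the surrounding analyze() loop, as above.
data = LevyRecord(
    filename=filename,
    county=county,
    district=x[0],
    vote1=x[1],
    vote2=vote2,
    mills=mills,
    votetype=x[4],
    numyears=x[6],
    yearcom=x[8],
    yeardue=x[10],
    reason=x[11],
)

Since a namedtuple is still a tuple, openfile.writerow(data) keeps working, but the fields can now be read back as data.district, data.mills, and so on.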
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
    match= p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
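For example, one such function might look like this (my own illustration, reusing the hypothetical LevyRecord type sketched above):

def match_long_form(row):
    """Return a LevyRecord if this row looks like the 12-field layout, else None."""
    if len(row) <= 8:
        return None
    try:
        vote2, mills = row[2].rsplit(' ', 1)   # same last-space split as in the question
        return LevyRecord(
            filename=filename, county=county,  # from the surrounding scope, as before
            district=row[0], vote1=row[1], vote2=vote2, mills=mills,
            votetype=row[4], numyears=row[6], yearcom=row[8],
            yeardue=row[10], reason=row[11],
        )
    except (IndexError, ValueError):
        return None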
I've been searching through this website and have seen multiple references to time deltas, but haven't quite found what I'm looking for.
Basically, I have a list of messages that are received by a comms server and I want to calculate the latency time between each message out and in. It looks like this:
161336.934072 - TMsg out: [O] enter order. RefID [123] OrdID [4568]
161336.934159 - TMsg in: [A] accepted. ordID [456] RefNumber [123]
Mixed in with these are other messages as well; however, I only want to capture the difference between the out messages and the in messages with the same RefID.
So far, to sort out which messages in the main log are T messages, I've been doing this, but it's really inefficient; I don't need to be making new files every time:
big_file = open('C:/Users/kdalton/Documents/Minicomm.txt', 'r')
small_file1 = open('small_file1.txt', 'w')
for line in big_file:
    if 'T' in line: small_file1.write(line)
big_file.close()
small_file1.close()
How do I calculate the time deltas between the two messages and sort out these messages from the main log?
First of all, don't write out the raw log lines. Secondly use a dict.
tdeltas = {} # this is an empty dict
if "T" in line:
    # parse the Refid and timestamp out of the line
    if Refid in tdeltas:
        tdeltas[Refid] = timestamp - tdeltas[Refid]
    else:
        tdeltas[Refid] = timestamp
Then at the end, convert to a list and print
allRefids = sorted(tdeltas.keys())
for k in allRefids:
    print k + ": " + str(tdeltas[k]) + " secs"
You may want to convert your dates into time objects from the datetime module and then use timedelta objects to store in the dict. Probably not worth it for this task but it is worthwhile to learn how to use the datetime module.
Also, I have glossed over parsing the Refid from the input string, and the possible issue of converting the times from string to float and back.
Actually, just storing deltas will cause confusion if you ever have a Refid that is not accepted. If I were doing this for real, I would store a tuple in the value with the start datetime, end datetime and the delta. For a new record it would look like this: (161336.934072,0,0) and after the acceptance was detected it would look like this: (161336.934072,161336.934159,.000087). If the logging activity was continuous, say a global ecommerce site running 24x7, then I would periodically scan the dict for any entries with a non-zero delta, report them, and delete them. Then I would take the remaining values, sort them on the start datetime, then report and delete any where the start datetime is too old because that indicates failed transactions that will never complete.
Also, in a real ecommerce site, I might consider using something like Redis or Memcache as an external dict so that reporting and maintenance can be done by another server/application.
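A rough sketch of that tuple-based bookkeeping (my own illustration; parsing the RefID and timestamp out of each line is left to the caller):

tdeltas = {}  # RefID -> (start_time, end_time, delta)

def record_out(ref_id, ts):
    # New transaction: only the start time is known so far.
    tdeltas[ref_id] = (ts, 0, 0)

def record_in(ref_id, ts):
    start, _, _ = tdeltas[ref_id]
    tdeltas[ref_id] = (start, ts, ts - start)

record_out('123', 161336.934072)
record_in('123', 161336.934159)

# Periodic sweep: report completed transactions; anything left with a zero delta
# and an old start time is a transaction that never completed.
for ref_id, (start, end, delta) in tdeltas.items():
    if delta:
        print ref_id, "took", delta, "secs"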
This generator function returns a tuple containing the id and the difference in timestamps between the out and in messages. (If you want to do something more complex with the time difference, check out datetime.timedelta). Note that this assumes out messages always appear before in messages.
def get_time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if len(e) == 11 and " ".join(e[2:5]) == "TMsg out: [O]":
            ts[e[8]] = e[0]  # store timestamp for id
        elif len(e) == 10 and " ".join(e[2:5]) == "TMsg in: [A]":
            in_ts, ref_id = e[0], e[9]
            # Raises KeyError if out msg not seen yet. Handle if required.
            out_ts = ts.pop(ref_id)  # get ts for this id
            yield (ref_id[1:-1], float(in_ts) - float(out_ts))
You can now get a list out of it:
>>> INFILE = 'C:/Users/kdalton/Documents/Minicomm.txt'
>>> list(get_time_deltas(INFILE))
[('123', 8.699999307282269e-05), ('1233', 0.00028700000257231295)]
Or write it to a file:
>>> with open("out.txt", "w") as outfile:
... for id, td in get_time_deltas(INFILE):
... outfile.write("Msg %s took %f seconds\n", (id, td))
Or chain it into a more complex workflow.
Update:
(in response to looking at the actual data)
Try this instead:
def get_time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]  # store timestamp for id
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)  # get ts for this id
            # TODO: handle case where out_ts = None (no id found)
            yield (ref_id[1:-1], float(in_ts) - float(out_ts))
INFILE = 'C:/Users/kdalton/Documents/Minicomm.txt'
print list(get_time_deltas(INFILE))
Changes in this version:
The number of fields is not as stated in the sample input posted in the question, so the check on the number of fields was removed.
The ordID of the in messages is the field that matches the RefID of the out messages.
Used OuchMsg instead of TMsg.
Update 2
To get an average of the deltas:
deltas = [d for _, d in get_time_deltas(INFILE)]
average = sum(deltas) / len(deltas)
Or, if you have previously generated a list containing all the data, we can reuse it instead of reparsing the file:
data = list(get_time_deltas(INFILE))
# ... use data for some other operation ...
# calculate average using the list
average = sum(d for _, d in data) / len(data)
I am trying to parse the output of a statistical program (Mplus) using Python.
The format of the output (example here) is structured in blocks, sub-blocks, columns, etc., where the whitespace and breaks are very important. Depending on, e.g., the options requested, you get an additional (sub)block or column here or there.
Approaching this using regular expressions has been a PITA and completely unmaintainable. I have been looking into parsers as a more robust solution, but
am a bit overwhelmed by all the possible tools and approaches;
have the impression that they are not well suited for this kind of output.
E.g. LEPL has something called line-aware parsing, which seems to go in the right direction (whitespace, blocks, ...) but is still geared to parsing programming syntax, not output.
Suggestions on which direction to look would be appreciated.
Yes, this is a pain to parse. You don't -- however -- actually need very many regular expressions. Ordinary split may be sufficient for breaking this document into manageable sequences of strings.
These are a lot of what I call "Head-Body" blocks of text. You have titles, a line of "--"'s and then data.
What you want to do is collapse a "head-body" structure into a generator function that yields individual dictionaries.
def get_means_intercepts_thresholds( source_iter ):
    """Precondition: Current line is a "MEANS/INTERCEPTS/THRESHOLDS" line"""
    head= source_iter.next().strip().split()
    junk= source_iter.next().strip()
    assert set( junk ) == set( [' ','-'] )
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break
        raw_data= line.strip().split()
        data = dict( zip( head, map( float, raw_data[1:] ) ) )
        yield int(raw_data[0]), data
def get_slopes( source_iter ):
    """Precondition: Current line is a "SLOPES" line"""
    head= source_iter.next().strip().split()
    junk= source_iter.next().strip()
    assert set( junk ) == set( [' ','-'] )
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break
        raw_data= line.strip().split()
        data = dict( zip( head, map( float, raw_data[1:] ) ) )
        yield raw_data[0], data
The point is to consume the head and the junk with one set of operations.
Then consume the rows of data which follow using a different set of operations.
Since these are generators, you can combine them with other operations.
def get_estimated_sample_statistics( source_iter ):
    """Precondition: at the ESTIMATED SAMPLE STATISTICS line"""
    for line in source_iter:
        if len(line.strip()) == 0: continue
        assert line.strip() == "MEANS/INTERCEPTS/THRESHOLDS"
        for data in get_means_intercepts_thresholds( source_iter ):
            yield data
        while True:
            if len(line.strip()) == 0: continue
            if line.strip() != "SLOPES": break
            for data in get_slopes( source_iter ):
                yield data
Something like this may be better than regular expressions.
Based on your example, what you have is a bunch of different, nested sub-formats that, individually, are very easily parsed. What can be overwhelming is the sheer number of formats and the fact that they can be nested in different ways.
At the lowest level you have a set of whitespace-separated values on a single line. Those lines combine into blocks, and how the blocks combine and nest within each other is the complex part. This type of output is designed for human reading and was never intended to be "scraped" back into machine-readable form.
First, I would contact the author of the software and find out if there is an alternate output format available, such as XML or CSV. If done correctly (i.e. not just the print-format wrapped in clumsy XML, or with commas replacing whitespace), this would be much easier to handle. Failing that I would try to come up with a hierarchical list of formats and how they nest. For example,
ESTIMATED SAMPLE STATISTICS begins a block
Within that block MEANS/INTERCEPTS/THRESHOLDS begins a nested block
The next two lines are a set of column headings
This is followed by one (or more?) rows of data, with a row header and data values
And so on. If you approach each of these problems separately, you will find that it's tedious but not complex. Think of each of the above steps as modules that test the input to see if it matches and if it does, then call other modules to test further for things that can occur "inside" the block, backtracking if you get to something that doesn't match what you expect (this is called "recursive descent" by the way).
Note that you will have to do something like this anyway, in order to build an in-memory version of the data (the "data model") on which you can operate.
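A very small sketch of that recursive-descent idea (my own illustration; only the block names come from the outline above, everything else is assumed):

def parse_output(lines):
    """Top level: collect every ESTIMATED SAMPLE STATISTICS block."""
    model = []
    i = 0
    while i < len(lines):
        if lines[i].strip() == "ESTIMATED SAMPLE STATISTICS":
            block, i = parse_sample_statistics(lines, i + 1)
            model.append(block)
        else:
            i += 1
    return model

def parse_sample_statistics(lines, i):
    """Nested block: try each known sub-block parser in turn."""
    block = {}
    while i < len(lines) and lines[i].strip():
        if lines[i].strip() == "MEANS/INTERCEPTS/THRESHOLDS":
            block["means"], i = parse_table(lines, i + 1)
        else:
            i += 1  # unrecognized line: skip (or backtrack/raise, as you prefer)
    return block, i

def parse_table(lines, i):
    """Lowest level: a line of column headings followed by rows of values."""
    header = lines[i].split()
    rows = []
    i += 1
    while i < len(lines) and lines[i].strip():
        rows.append(dict(zip(header, lines[i].split())))
        i += 1
    return rows, i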
My suggestion is to do rough massaging of the lines into a more useful form. Here are some experiments with your data:
from __future__ import print_function
from itertools import groupby
import string
counter = 0
statslist = [ statsblocks.split('\n')
for statsblocks in open('mlab.txt').read().split('\n\n')
]
print(len(statslist), 'blocks')
def blockcounter(line):
    global counter
    if not line[0]:
        counter += 1
    return counter
blocklist = [ [block, list(stats)] for block, stats in groupby(statslist, blockcounter)]
for blockno,block in enumerate(blocklist):
    print(120 * '=')
    for itemno,line in enumerate(block[1:][0]):
        if len(line)<4 and any(line[-1].endswith(c) for c in string.letters) :
            print('\n** DATA %i, HEADER (%r)**' % (blockno,line[-1]))
        else:
            print('\n** DATA %i, item %i, length %i **' % (blockno, itemno, len(line)))
        for ind,subdata in enumerate(line):
            if '___' in subdata:
                print(' *** Numeric data starts: ***')
            else:
                if 6 < len(subdata)<16:
                    print( '** TYPE: %s **' % subdata)
                print('%3i : %s' %( ind, subdata))
You could try PyParsing. It enables you to write a grammar for what you want to parse, and it has examples beyond parsing programming languages. But I agree with Jim Garrison that your case doesn't seem to call for a real parser, because writing the grammar would be cumbersome. I would try a brute-force solution, e.g. splitting lines at whitespace. It's not foolproof, but we can assume the output is correct, so if a line has n headers, the next line will have exactly n values.
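A tiny sketch of that brute-force pairing (my own illustration; the header and value strings here are made up, not taken from the actual Mplus output):

# Pair a line of n whitespace-separated headers with the following line of n values.
header_line = "        ESTIMATE      S.E.      EST./S.E."
value_line  = "           0.897     0.052          17.25"

headers = header_line.split()
values = [float(v) for v in value_line.split()]
assert len(headers) == len(values)
row = dict(zip(headers, values))
print(row)   # -> {'ESTIMATE': 0.897, 'S.E.': 0.052, 'EST./S.E.': 17.25}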
It turns out that tabular program output like this was one of my earliest applications of pyparsing. Unfortunately, that exact example dealt with a proprietary format that I can't publish, but there is a similar example posted here: http://pyparsing.wikispaces.com/file/view/dictExample2.py .