I'm parsing through a decent sized xml file, and I ran into a problem. For some reason I cannot extract data even though I have done the exact same thing on different xml files before.
Here's a snippet of my code: (rest of the program, I've tested and they work fine)
EDIT: changed to include a testing try&except block
def parseXML():
file = open(str(options.drugxml),'r')
data = file.read()
file.close()
dom = parseString(data)
druglist = dom.getElementsByTagName('drug')
with codecs.open(str(options.csvdata),'w','utf-8') as csvout, open('DrugTargetRel.csv','w') as dtout:
for entry in druglist:
count = count + 1
try:
drugtype = entry.attributes['type'].value
print count
except:
print count
print entry
drugidObj = entry.getElementsByTagName('drugbank-id')[0]
drugid = drugidObj.childNodes[0].nodeValue
drugnameObj = entry.getElementsByTagName('name')[0]
drugname = drugnameObj.childNodes[0].nodeValue
targetlist = entry.getElementsByTagName('target')
for target in targetlist:
targetid = target.attributes['partner'].value
dtout.write((','.join((drugid,targetid)))+'\n')
csvout.write((','.join((drugid,drugname,drugtype)))+'\n')
In case you're wondering what the XML file's schema roughly looks like, here's a rough god-awful sketch of the levels:
<drugs>
<drug type='something' ...>
<drugbank-id>
<name>
...
<targets>
<target partner='something'>
Those that I typed in here, I need to extract from the XML file and stick it in csv files (as the code above shows), and the code has worked for different xml files before, not sure why it's not working on this one. I've gotten KeyError on 'type', I've also gotten indexing errors on line that extracts drugid even though EVERY drug has a drugid. What am I screwing up here?
EDIT: the stuff I'm extracting are guaranteed to be in each drug.
For anyone who cares, here's the link to the XML file I'm parsing:
http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip
EDIT: After implementing a try & except block (see above) here's what I found out:
In the schema, there are sections called "drug interactions" that also have a subfield called drug. So like this:
<drugs>
<drug type='something' ...>
<drugbank-id>
<name>
...
<targets>
<target partner='something'>
<drug-interactions>
<drug>
I think that my line druglist = dom.getElementsByTagName('drug') is unintentionally picking those up as well -- I don't know how I could fix this... any suggestions?
Basically when parsin an xml, you can't rely on the fact that you know the structure. It's a good practise to find out structure in code.
So everytime you access elements or attributes, check before if there any. In your code it means following:
Make sure there's an attribute 'type' on a drug element:
drugtype = entry.attributes['type'].value if entry.attributes.has_key('type') else 'defaulttype'
Make sure getElementsByTagName doesn't returns empty array before accessing its elements:
drugbank-id = entry.getElementsByTagName('drugbank-id')
drugidObj = drugbank-id[0] if drugbank-id else None
Also before accessing childnodes make sure there are any:
if drugidObj.hasChildNodes:
drugid = drugidObj.childNodes[0].nodeValue
Or use for loop to loop through them.
And when you call getElementsByTagName on the drugs elemet it returns all elements including the nested ones. To get only drug elements, which are dirrect children of drugs element you have to use childNodes attribute.
I had a feeling that maybe there was something weird happening due to running out of memory or something, so I rewrote the parser using an iterator over each drug and tried it out and got the program to complete without raising an exception.
Basically what I'm doing here is, instead of loading the entire XML file into memory, I parse the XML file for the beginning and end of each <drug> and </drug> tag. Then I parse that with the minidom each time.
The code might be a little fragile as I assume that each <drug> and </drug> pair are on their own lines. Hopefully it helps more than it harms though.
#!python
import codecs
from xml.dom import minidom
class DrugBank(object):
def __init__(self, filename):
self.fp = open(filename, 'r')
def __iter__(self):
return self
def next(self):
state = 0
while True:
line = self.fp.readline()
if state == 0:
if line.strip().startswith('<drug '):
lines = [line]
state = 1
continue
if line.strip() == '</drugs>':
self.fp.close()
raise StopIteration()
if state == 1:
lines.append(line)
if line.strip() == '</drug>':
return minidom.parseString("".join(lines))
with codecs.open('csvout.csv', 'w', 'utf-8') as csvout, open('dtout.csv', 'w') as dtout:
db = DrugBank('drugbank.xml')
for dom in db:
entry = dom.firstChild
drugtype = entry.attributes['type'].value
drugidObj = entry.getElementsByTagName('drugbank-id')[0]
drugid = drugidObj.childNodes[0].nodeValue
drugnameObj = entry.getElementsByTagName('name')[0]
drugname = drugnameObj.childNodes[0].nodeValue
targetlist = entry.getElementsByTagName('target')
for target in targetlist:
targetid = target.attributes['partner'].value
dtout.write((','.join((drugid,targetid)))+'\n')
csvout.write((','.join((drugid,drugname,drugtype)))+'\n')
An interesting read that might help you out further is here:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Related
I have a question in relation to comparing directories. I have requested a transfer of data from one company server , I received this data in 8 portable hard drives as it was not possible due to large volume to put it on one. each of them contains about 1TB data. Now, I want to make sure that all the files that were on that company server were fully transferred , and if there are any files missing I want to be able to detect it and ask for them. The issue is that the only thing I received from the company is one txt file inn which there is detailed directory structure saved in a tree format. In principle great I could just look one by one through it but due to large amount of data that is just not achievable. I can generate the same directory list out of every single one of the 8 drives that I received. but how can I ten compare this one file into those 8 files? I tried different python comparison codes to parse through line by line but it does not work as this compares them string by string(line by line) but the are not in string format , they are in tree style format. Anyone has any suggestions how to do it?is there a way to convert the file of a tree format into a string format to then run it in Python program and compare? Or should I request (not sure if thats possible) another file with directories saved in different structure format than tree? if yes how this should be generated?
I tried to parse it through lists in python
Your task consists of two subtasks:
Prepare data;
Compare trees.
Task 2 can be solved by this Tree class:
from collections import defaultdict
class Tree:
def __init__(self):
self._children = defaultdict(Tree)
def __len__(self):
return len(self._children)
def add(self, value: str):
value = value.removeprefix('/').removesuffix('/')
if value == '':
return
first, rest = self._get_values(value)
self._children[first].add(rest)
def remove(self, value: str):
value = value.removeprefix('/').removesuffix('/')
if value == '':
return
first, rest = self._get_values(value)
self._children[first].remove(rest)
if len(self._children[first]) == 0:
del self._children[first]
def _get_values(self, value):
values = value.split('/', 1)
first, rest = values[0], ''
if len(values) == 2:
rest = values[1]
return first, rest
def __iter__(self):
if len(self._children) == 0:
yield ''
return
for key, child in self._children.items():
for value in child:
if value != '':
yield key + '/' + value
else:
yield key
def main():
tree = Tree()
tree.add('a/b/c')
tree.add('a/b/d')
tree.add('b/b/e')
tree.add('b/c/f')
tree.add('c/b/e')
# duplicates are okay
tree.add('b/c/f')
# leading and trailing slashes are okay
tree.add('b/c/f/')
tree.add('/b/c/f')
print('Before removing:', [x for x in tree])
tree.remove('a/b/c')
tree.remove('a/b/d')
tree.remove('b/b/e')
# it's okay to remove non-existent values
tree.remove('this/path/does/not/exist')
# it will not remove root-level b folder, because it's not empty
tree.remove('b')
print('After removing:', [x for x in tree])
if __name__ == '__main__':
main()
Output:
Before removing: ['a/b/c', 'a/b/d', 'b/b/e', 'b/c/f', 'c/b/e']
After removing: ['b/c/f', 'c/b/e']
So, your algorithm is as follows:
Build tree of file paths on company server (let's call it A-tree);
Build trees of file paths on portable hard drives (let's call them B-trees);
Remove from the A-tree file paths that exist in B-trees;
Print content of the A-tree.
Now, all you need - is to build these trees, which is task 1 from the answer beginning. And how it will be - depends on the data, that you have in your .txt file.
This is more of a theoretical question to understand objects, garbage collection and performance of Python better.
Lets say i have a ton of XML files and want to iterate over each one, get all the tags, store them in a dict, increase counters for each tag etc. When i do this, the first, lets say 15k iterations, process really quick but afterwards the script slows down significantly, while the memory usage, CPU load etc. are fine. Why is that? Do i write hidden objects each iteration which are not cleaned up, can i do something to improve it? I tried to use regex instead of ElementTree but it wasnt worth the effort since i only want to extract first level tags and it would make it more complex.
Unfortunately i cannot give a reproducible example without providing the XML files, however this is my code:
import os
import datetime
import xml.etree.ElementTree as ElementTree
start_time = datetime.datetime.now()
original_implemented_tags = os.path.abspath("/path/to/file")
required_tags = {}
optional_tags = {}
new_tags = {}
# read original tags
for _ in open(original_implemented_tags, "r"):
if "#XmlElement(name =" in _:
_xml_attr = _.split('"')[1]
if "required = true" in _:
required_tags[_xml_attr] = 1 # i set this to 1 so i can use if dict.get(_xml_attr) (0 returns False)
else:
optional_tags[_xml_attr] = 1
# read all XML files from nested folder containing XML dumps and other files
clinical_trial_root_dir = os.path.abspath("/path/to/dump/folder")
xml_files = []
for root, dirs, files in os.walk(clinical_trial_root_dir):
xml_files.extend([os.path.join(root, _) for _ in files if os.path.splitext(_)[-1] == '.xml'])
# function for parsing a file and extract unique tags
def read_via_etree(file):
_root = ElementTree.parse(file).getroot()
_main_tags = list(set([_.tag for _ in _root.findall("./")])) # some tags occur twice
for _attr in _main_tags:
# if tag doesnt exist in original document, increase counts in new_tags
if _attr not in required_tags.keys() and _attr not in optional_tags.keys():
if _attr not in new_tags.keys():
new_tags[_attr] = 1
else:
new_tags[_attr] += 1
# otherwise, increase counts in either one of required_tags or optional_tags
if required_tags.get(_attr):
required_tags[_attr] += 1
if optional_tags.get(_attr):
optional_tags[_attr] += 1
# actual parsing with indicator
for idx, xml in enumerate(xml_files):
if idx % 1000 == 0:
print(f"Analyzed {idx} files")
read_via_etree(xml)
# undoing the initial 1
for k in required_tags.keys():
required_tags[k] -= 1
for k in optional_tags.keys():
optional_tags[k] -= 1
print(f"Done parsing {len(xml_files)} documents in {datetime.datetime.now() - start_time}")
Example of one XML file:
<parent_element>
<tag_i_need>
<tag_i_dont_need>Some text i dont need</tag_i_dont_need>
</tag_i_need>
<another_tag_i_need>Some text i also dont need</another_tag_i_need>
</parent_element>
After the helpful comments i added a timestamp to my loop indicating how much time is passed since the last 1k documents and flushed the sys.stdout:
import sys
loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
if idx % 1000 == 0:
print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}")
sys.stdout.flush()
loop_timer = datetime.datetime.now()
read_via_etree(xml)
I think it makes sense now since the XML files vary in size, and due the fact that the standard output stream is buffered. Thanks to Albert Winestein
I've figured out how to get data from a single XML file into a row on a CSV. I'd like to iterate this across a number of files in a directory so that the data from each XML file is extracted to a new row on the CSV. I've done some searching and I get the gist of having to create a loop (perhaps using the OS module) but the specifics are lost on me.
This script does the extraction for a single XML file.
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("[PATH/FILE.xml]")
root = tree.getroot()
test_file = open('PATH','w',newline='')
csvwriter = csv.writer(test_file)
header = []
count = 0
for trial in root.iter('[XML_ROOT]'):
item_info = []
if count == 0:
item_ID = trial.find('itemid').tag
header.append(item_ID)
data_1 = trial.find('data1').tag
header.append(data_1)
csvwriter.writerow(header)
count = count + 1
item_ID = trial.find('itemid').text
item_info.append(item_ID)
data_1 = trial.find('data1').text
trial_info.append(data_1)
csvwriter.writerow(item_info)
test_file.close()
Now I need to figure out what to do to it to iterate.
Edit:
Here is an example of an XML file i'm using. Just for testing i'm pulling out actrnumber as item_id and stage as data_1. Eventually I'll need to figure out the most sensible way to create arrays for the nested data. For instance in the outcomes node, nesting the data, probably in an array for primaryOutcome and all secondaryOutcome instances.
<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="1">
<stage>Registered</stage>
<submitdate>6/07/2005</submitdate>
<approvaldate>7/07/2005</approvaldate>
<actrnumber>ACTRN12605000001695</actrnumber>
<trial_identification>
<studytitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas</studytitle>
<scientifictitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas with the primary objective tumour response</scientifictitle>
<utrn />
<trialacronym>ABC trial</trialacronym>
<secondaryid>National Clinical Trials Registry: NCTR570</secondaryid>
</trial_identification>
<conditions>
<healthcondition>Adenocarcinoma of the gallbladder or intra/extrahepatic bile ducts</healthcondition>
<conditioncode>
<conditioncode1>Cancer</conditioncode1>
<conditioncode2>Biliary tree (gall bladder and bile duct)</conditioncode2>
</conditioncode>
</conditions>
<interventions>
<interventions>Gemcitabine delivered as fixed dose-rate infusion with cisplatin</interventions>
<comparator>Single arm trial</comparator>
<control>Uncontrolled</control>
<interventioncode>Treatment: drugs</interventioncode>
</interventions>
<outcomes>
<primaryOutcome>
<outcome>Objective tumour response.</outcome>
<timepoint>Measured every 6 weeks during study treatment, and post treatment.</timepoint>
</primaryOutcome>
<secondaryOutcome>
<outcome>Tolerability and safety of treatment</outcome>
<timepoint>Prior to each cycle of treatment, and at end of treatment</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Duration of response</outcome>
<timepoint>Prior to starting every second treatment cycle, then 6 monthly for 12 months, then as clinically indicated</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Time to treatment failure</outcome>
<timepoint>Assessed at end of treatment</timepoint>
</secondaryOutcome>
...
</ANZCTR_Trial>
Simply generalize your process in a method and iterate across files with os.listdir assuming all XML files reside in same folder. And be sure to use context manager using with to better manage the open/close file process.
Also, your header parsing is redundant since you name the very tags that you extract: itemid and data1. Node names likely stay the same so can be hard-coded while text values differ, requiring parsing. Below uses list comprehension for a more streamlined collection of data within XML files and across XML files. This also separates the XML parsing and CSV writing.
# GENERALIZED METHOD
def proc_xml(xml_path):
full_path = os.path.join('/path/to/xml/folder', xml_path)
print(full_path)
tree = ET.parse(full_path)
root = tree.getroot()
item_info = [[trial.find('itemid').text, trial.find('data1').text] \
for trial in root.iter('[XML_ROOT]')][0]
return item_info
# NESTED LIST OF XML DATA PER FILE
xml_data_lst = [proc_xml(f) for f in os.listdir('/path/to/xml/folder') \
if f.endswith('.xml')]
# WRITE TO CSV FILE
with open('/path/to/final.csv', 'w', newline='') as test_file:
csvwriter = csv.writer(test_file)
# HEADERS
csvwriter.writerow(['itemid', 'data1'])
# DATA ROWS
for i in xml_data_lst:
csvwriter.writerow(i)
While .find gets you the next match, .findall should return a list of all of them. So you could do something like this:
extracted_IDs = []
item_IDs = trial.findall('itemid')
for id_tags in item_IDs:
extracted_IDs.append(id_tag.text)
Or, to do the same thing in one line:
extracted_IDs = [item.text for item in trial.findall('itemid')]
Likewise, try:
extracted_data = [item.text for item in trial.findall('data1')]
If you have an equal number of both, and if the row you want to write each time is in the form of [<itemid>,<data1>] paired sets, then you can just make a combined set like this:
combined_pairs = [(extracted_IDs[i], extracted_data[i]) for i in range(len(extracted_IDs))]
I am making a program that should be able to extract the notes, rests, and chords from a certain midi file and write the respective pitch (in midi tone numbers - they go from 0-127) of the notes and chords to a csv file for later use.
For this project, I am using the Python Library "Music21".
from music21 import *
import pandas as pd
#SETUP
path = r"Pirates_TheCarib_midi\1225766-Pirates_of_The_Caribbean_Medley.mid"
#create a function for taking parsing and extracting the notes
def extract_notes(path):
stm = converter.parse(path)
treble = stm[0] #access the first part (if there is only one part)
bass = stm[1]
#note extraction
notes_treble = []
notes_bass = []
for thisNote in treble.getElementsByClass("Note"):
indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
notes_treble.append(indiv_note) # print's the note and the note's
offset
for thisNote in bass.getElementsByClass("Note"):
indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
notes_bass.append(indiv_note) #add the notes to the bass
return notes_treble, notes_bass
#write to csv
def to_csv(notes_array):
df = pd.DataFrame(notes_array, index=None, columns=None)
df.to_csv("attempt1_v1.csv")
#using the functions
notes_array = extract_notes(path)
#to_csv(notes_array)
#DEBUGGING
stm = converter.parse(path)
print(stm.parts)
Here is the link to the score I am using as a test.
https://musescore.com/user/1699036/scores/1225766
When I run the extract_notes function, it returns two empty arrays and the line:
print(stm.parts)
it returns
<music21.stream.iterator.StreamIterator for Score:0x1b25dead550 #:0>
I am confused as to why it does this. The piece should have two parts, treble and bass. How can I get each note, chord and rest into an array so I can put it in a csv file?
Here is small snippet how I did it. I needed to get all notes, chords and rests for specific instrument. So at first I iterated through part and found specific instrument and afterwards check what kind of type note it is and append it.
you can call this method like
notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")
where keyboard_instruments is list of instruments.
keyboard_nstrument = ["KeyboardInstrument", "Piano", "Harpsichord", "Clavichord", "Celesta", ]
def get_notes_chords_rests(instrument_type, path):
try:
midi = converter.parse(path)
parts = instrument.partitionByInstrument(midi)
note_list = []
for music_instrument in range(len(parts)):
if parts.parts[music_instrument].id in instrument_type:
for element_by_offset in stream.iterator.OffsetIterator(parts[music_instrument]):
for entry in element_by_offset:
if isinstance(entry, note.Note):
note_list.append(str(entry.pitch))
elif isinstance(entry, chord.Chord):
note_list.append('.'.join(str(n) for n in entry.normalOrder))
elif isinstance(entry, note.Rest):
note_list.append('Rest')
return note_list
except Exception as e:
print("failed on ", path)
pass
P.S. It is important to use try block because a lot of midi files on the web are corrupted.
I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
toggle=False
for i,j in enumerate(x):
if j=="\"":
toggle=not toggle
if toggle==True:
if j==",":
x=x[:i]+" "+x[i+1:]
return x
def districtatize(x):
#list indexes of entries starting with "for" or "to" of length >5
indices=[1]
for i,j in enumerate(x):
if len(j)>2:
if j[:2]=="to":
indices.append(i)
if len(j)>3:
if j[:3]==" to" or j[:3]=="for":
indices.append(i)
if len(j)>5:
if j[:5]==" \"for" or j[:5]==" \'for":
indices.append(i)
if len(j)>4:
if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
indices.append(i)
if len(indices)==1:
return [x[0],x[1:len(x)-1]]
new=[x[0],x[1:indices[1]+1]]
z=1
while z<len(indices)-1:
new.append(x[indices[z]+1:indices[z+1]+1])
z+=1
return new
#should return a list of lists. First entry will be county
#each successive element in list will be list by district
def splitforstos(string):
for itemind,item in enumerate(string): # take all exception cases that didn't get processed
splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
if len(splitfor)>1:
print "\n\n\nfor detected\n\n"
string.remove(item)
string.insert(itemind,splitfor[0])
string.insert(itemind+1,splitfor[1])
elif len(splitto)>1:
print "\n\n\nto detected\n\n"
string.remove(item)
string.insert(itemind,splitto[0])
string.insert(itemind+1,splitto[1])
def analyze(x):
#input should be a string of content
#target values are nomills,levytype,term,yearcom,yeardue
clean=cleancommas(x)
countylist=clean.split(',')
emptystrip=filter(lambda a: a != '',countylist)
empt2strip=filter(lambda a: a != ' ', emptystrip)
singstrip=filter(lambda a: a != '\' \'',empt2strip)
quotestrip=filter(lambda a: a !='\" \"',singstrip)
splitforstos(quotestrip)
distd=districtatize(quotestrip)
print '\n\ndistrictized\n\n',distd
county = distd[0]
for x in distd[1:]:
if len(x)>8:
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
else:
print "x\n\n",x
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
special=x[5]
splitspec=special.split(' ')
try:
forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
numyears=splitspec[forind+1]
yearcom=splitspec[forind+6]
except:
forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
numyears=None
yearcom=splitspec[forind+2]
yeardue=str(x[6])[-4:]
reason=x[7]
data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
print "data other", data
openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
openfile.writerow(data)
# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
try:
data = contents[y:separators[x+1]]
except:
data = contents[y:]
analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
district= x[0],
vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
match= p(data)
if match:
return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).