Removing and replacing specific nodes in an XML file - python

I have been working on a project that analyses a musical score and removes specific notes from it. Now that my code produces the information I need, I have to edit the original XML score with it. I am doing this in Python and have already used Minidom, so I would like to stick with it (I know this was perhaps a silly choice, as a lot of the posts on here recommend other XML parsers because of Minidom's not-so-friendly interface).
So say in my original XML file I have a musical piece made up of just 10 notes. The XML format for a note is shown below:
<note>
<pitch>
<step>E</step>
<alter>-1</alter>
<octave>5</octave>
</pitch>
<duration>72</duration>
</note>
So this would be repeated 10 times for each note value. Now that I have done my analysis I want to remove 5 of these notes. By remove I mean replace with a rest (as it is a musical score after all and it has a shape to conform to). So the format for a rest in an XML file is shown below:
<note>
<rest/>
<duration>72</duration>
</note>
So all that I have to do is remove the pitch tag and replace it with a rest tag. However I am unsure on how to go about this, haven't really found anything from my searching that seems similar.
I am not too bothered about finding where the notes to be removed are, as I have written a quick test harness to show how I would go about that below in Python (xml_format is essentially just a list of dictionaries containing my new information). It contains the same number of notes as the original XML file, with the only difference being that some of them are now marked for being removed. So the original file could have notes like : G, Bb, D, C, G, F, G, D, Bb and the xml_format would have G, Bb, D, REMOVE, G, REMOVE, G, D, Bb etc.
At the moment the function just returns a, to check that the correct number of notes is being removed.
def remove_notes(xml_format, filename):
    doc = minidom.parse(filename)
    count = 0
    a = 0
    note = doc.getElementsByTagName("note")
    for item in note:
        if xml_format[count]['step'] == 'Remove':
            a = a + 1
            # THEN REMOVE THE ENTIRE PITCH TAG, REPLACE WITH REST
        count = count + 1
        # ELSE DON'T DO ANYTHING
    return a
So basically I am just looking for some assistance in the kind of syntax or code that could be used to remove a specific node at a specific point and then be replaced with a new node, before being written to a new file. Thank you very much for any help and I do hope that this is something which is possible (the logic doesn't seem complicated, but who knows what is possible)!

For every <note> node that is marked for removal, you need to:
Create a new <rest/> node
Locate the <pitch> node
Replace the <pitch> node with the new <rest/> node
Here is the code:
from xml.dom import minidom

def remove_notes(xml_format, filename):
    doc = minidom.parse(filename)
    count = 0
    a = 0
    note_nodes = doc.getElementsByTagName("note")
    for note_node in note_nodes:
        if xml_format[count]['step'] == 'Remove':
            a = a + 1
            # Create a <rest/> node
            rest_node = note_node.ownerDocument.createElement('rest')
            # Locate the <pitch> node
            pitch_node = note_node.getElementsByTagName('pitch')[0]
            # Replace pitch with rest
            note_node.replaceChild(rest_node, pitch_node)
        count = count + 1
        # ELSE DON'T DO ANYTHING
    # Finished with the loop, doc now contains the replaced nodes, we need
    # to write it to a file
    return a
Please note that you will need to write the changes to a new file or your changes will be lost.
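For the writing step, a minidom document can serialize itself with writexml() (or toxml() if you want a string). Here is a minimal sketch, assuming a hypothetical two-note score; the <part> wrapper and the output file name are made up for illustration and are not from the question:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical two-note score; a real score file has more structure around <note>.
score = (
    "<part>"
    "<note><pitch><step>E</step><alter>-1</alter><octave>5</octave></pitch>"
    "<duration>72</duration></note>"
    "<note><pitch><step>G</step><octave>4</octave></pitch>"
    "<duration>72</duration></note>"
    "</part>"
)

doc = minidom.parseString(score)

# Replace the <pitch> of the first note with an empty <rest/> element.
first_note = doc.getElementsByTagName("note")[0]
pitch_node = first_note.getElementsByTagName("pitch")[0]
first_note.replaceChild(doc.createElement("rest"), pitch_node)

# Write the modified document to a new file (the path is an assumption).
out_path = os.path.join(tempfile.gettempdir(), "score_with_rests.xml")
with open(out_path, "w", encoding="utf-8") as out_file:
    doc.writexml(out_file)
```

In remove_notes you would do the same with the doc object after the loop has finished, passing in whatever output path suits your project.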

Related

iterating through a XML file using Python

I am pretty new to Python, just finishing up a college class on it, and am also working through a book called "Head First Python", so I have a good basic understanding of the language but I need some help with a task that is a bit over my head. I have an XML file that my company's CAD software reads in to assign the correct material to a 3D solid model, in our case assigning a material name and density. However, this XML has its units in lbf.s^2/in^4 but it's been requested that they be in lbm/in^3 by one of our customers.
Can I import the XML into Python and iterate through it to change the values?
The first would be the material unit name itself, I would need to iterate through the XML and replace every instance of this:
<PropertyData property="MassDensity_0">
With this:
<PropertyData property="MassDensity_4">
Then for every instance found of MassDensity_0 I would need to multiply the density value by a conversion factor to get the correct density value in the new units. The data values look like the one below:
<PropertyData property="MassDensity_0">
<Data format="exponential">
7.278130e-04</Data>
</PropertyData>
Does it make sense to attempt this in Python? There are over a hundred materials in this file, and editing them manually would be very tedious and time-consuming. I'm hoping Python can do the heavy lifting here.
I appreciate any assistance you can provide and thank you in advance!!!
This looks like a task for the built-in module xml.etree.ElementTree. After loading the XML from a file or string, you can alter it and save the changed tree to a new file. It supports a subset of XPath, which should let you select the elements to change. Consider the following simple example:
import xml.etree.ElementTree as ET
xml_string = '<?xml version="1.0"?><catalog><product id="p1"><price>100</price></product><product id="p2"><price>120</price></product><product id="p3"><price>150</price></product></catalog>'
root = ET.fromstring(xml_string) # now root is <catalog>
price3 = root.find('product[@id="p3"]/price') # get price of product with id p3
price3.text = str(int(price3.text) + 50) # .text is str: convert to int, add 50, then convert back to str
output = ET.tostring(root)
print(output)
Output:
b'<catalog><product id="p1"><price>100</price></product><product id="p2"><price>120</price></product><product id="p3"><price>200</price></product></catalog>'
Note that output is bytes and as such can be written to a file opened in binary mode. Consult the docs for more information.
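Applied to the density question, the same approach might look like the sketch below. Two things in it are assumptions rather than facts from the question: the <Materials> wrapper element is hypothetical, and the conversion factor of 386.088 (standard gravity in in/s^2) should be checked against whatever your customer expects.

```python
import xml.etree.ElementTree as ET

# Assumed conversion factor from lbf.s^2/in^4 to lbm/in^3; verify before use.
CONVERSION_FACTOR = 386.088

# Hypothetical wrapper element around the PropertyData block from the question.
xml_string = """<Materials>
  <PropertyData property="MassDensity_0">
    <Data format="exponential">
      7.278130e-04</Data>
  </PropertyData>
</Materials>"""

root = ET.fromstring(xml_string)

for prop in root.iter("PropertyData"):
    if prop.get("property") == "MassDensity_0":
        # Rename the unit tag, then rescale the value stored in <Data>.
        prop.set("property", "MassDensity_4")
        data = prop.find("Data")
        # float() ignores the surrounding whitespace and newline in the text.
        data.text = f"{float(data.text) * CONVERSION_FACTOR:e}"

print(ET.tostring(root, encoding="unicode"))
```

For a real file you would load it with ET.parse(path) and save it with tree.write(new_path) instead of working on a string.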
I'm sure it can and probably should be done with a special xml module. But rather for educational purposes here is the straightforward verbose Python solution:
import re
xml_in = \
'''<PropertyData property="MassDensity_0">
<Data format="exponential">
7.278130e-04
</Data>
</PropertyData>
<PropertyData property="MassDensity_0">
<Data format="exponential">
7.278130e-04
</Data>
</PropertyData>
'''
# remove spaces after ">" and before "<"
xml_in = re.sub(r">\s*", ">", xml_in)
xml_in = re.sub(r"\s*<", "<", xml_in)
# split the xml by the mask
mask = "<PropertyData property=\"MassDensity_0\"><Data format=\"exponential\">"
chunks = xml_in.split(mask)
# change numbers
result = []
for chunk in chunks:
    try:
        splitted_chunk = chunk.split("<") # split the chunk by "<"
        num = float(splitted_chunk[0]) # get the number
        num *= 2 # <--- change the number
        num = f"{num:e}" # get e-notation
        new_chunk = "<".join([num] + splitted_chunk[1:]) # make the new chunk
        result.append(new_chunk) # add the new chunk to the result list
    except ValueError:
        result.append(chunk) # if there is no number add the chunk as is
# assemble the xml back from the chunks and the mask
xml_out = mask.join(result)
# output
print(">\n<".join(xml_out.split("><")))
Output:
<PropertyData property="MassDensity_0">
<Data format="exponential">1.455626e-03</Data>
</PropertyData>
<PropertyData property="MassDensity_0">
<Data format="exponential">1.455626e-03</Data>
</PropertyData>

Loop function across multiple XML files in directory so each XML becomes a row in a CSV

I've figured out how to get data from a single XML file into a row on a CSV. I'd like to iterate this across a number of files in a directory so that the data from each XML file is extracted to a new row on the CSV. I've done some searching and I get the gist of having to create a loop (perhaps using the OS module) but the specifics are lost on me.
This script does the extraction for a single XML file.
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("[PATH/FILE.xml]")
root = tree.getroot()
test_file = open('PATH','w',newline='')
csvwriter = csv.writer(test_file)
header = []
count = 0
for trial in root.iter('[XML_ROOT]'):
    item_info = []
    if count == 0:
        item_ID = trial.find('itemid').tag
        header.append(item_ID)
        data_1 = trial.find('data1').tag
        header.append(data_1)
        csvwriter.writerow(header)
        count = count + 1
    item_ID = trial.find('itemid').text
    item_info.append(item_ID)
    data_1 = trial.find('data1').text
    item_info.append(data_1)
    csvwriter.writerow(item_info)
test_file.close()
Now I need to figure out what to do to it to iterate.
Edit:
Here is an example of an XML file I'm using. Just for testing I'm pulling out actrnumber as item_id and stage as data_1. Eventually I'll need to figure out the most sensible way to create arrays for the nested data. For instance in the outcomes node, nesting the data, probably in an array for primaryOutcome and all secondaryOutcome instances.
<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="1">
<stage>Registered</stage>
<submitdate>6/07/2005</submitdate>
<approvaldate>7/07/2005</approvaldate>
<actrnumber>ACTRN12605000001695</actrnumber>
<trial_identification>
<studytitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas</studytitle>
<scientifictitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas with the primary objective tumour response</scientifictitle>
<utrn />
<trialacronym>ABC trial</trialacronym>
<secondaryid>National Clinical Trials Registry: NCTR570</secondaryid>
</trial_identification>
<conditions>
<healthcondition>Adenocarcinoma of the gallbladder or intra/extrahepatic bile ducts</healthcondition>
<conditioncode>
<conditioncode1>Cancer</conditioncode1>
<conditioncode2>Biliary tree (gall bladder and bile duct)</conditioncode2>
</conditioncode>
</conditions>
<interventions>
<interventions>Gemcitabine delivered as fixed dose-rate infusion with cisplatin</interventions>
<comparator>Single arm trial</comparator>
<control>Uncontrolled</control>
<interventioncode>Treatment: drugs</interventioncode>
</interventions>
<outcomes>
<primaryOutcome>
<outcome>Objective tumour response.</outcome>
<timepoint>Measured every 6 weeks during study treatment, and post treatment.</timepoint>
</primaryOutcome>
<secondaryOutcome>
<outcome>Tolerability and safety of treatment</outcome>
<timepoint>Prior to each cycle of treatment, and at end of treatment</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Duration of response</outcome>
<timepoint>Prior to starting every second treatment cycle, then 6 monthly for 12 months, then as clinically indicated</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Time to treatment failure</outcome>
<timepoint>Assessed at end of treatment</timepoint>
</secondaryOutcome>
...
</ANZCTR_Trial>
Simply generalize your process in a function and iterate across files with os.listdir, assuming all XML files reside in the same folder. And be sure to use a context manager (with) to better manage opening and closing the file.
Also, your header parsing is redundant since you name the very tags that you extract: itemid and data1. Node names likely stay the same so can be hard-coded while text values differ, requiring parsing. Below uses list comprehension for a more streamlined collection of data within XML files and across XML files. This also separates the XML parsing and CSV writing.
import csv
import os
import xml.etree.ElementTree as ET

# GENERALIZED METHOD
def proc_xml(xml_path):
    full_path = os.path.join('/path/to/xml/folder', xml_path)
    print(full_path)
    tree = ET.parse(full_path)
    root = tree.getroot()
    item_info = [[trial.find('itemid').text, trial.find('data1').text]
                 for trial in root.iter('[XML_ROOT]')][0]
    return item_info

# NESTED LIST OF XML DATA PER FILE
xml_data_lst = [proc_xml(f) for f in os.listdir('/path/to/xml/folder')
                if f.endswith('.xml')]

# WRITE TO CSV FILE
with open('/path/to/final.csv', 'w', newline='') as test_file:
    csvwriter = csv.writer(test_file)
    # HEADERS
    csvwriter.writerow(['itemid', 'data1'])
    # DATA ROWS
    for i in xml_data_lst:
        csvwriter.writerow(i)
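With the tag names from the sample file filled in, the same idea could be sketched as below. This is only an illustration: the function names trial_row and folder_to_csv are made up here, and it assumes each file has <ANZCTR_Trial> as its root element with actrnumber and stage as direct children, as in the example above.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def trial_row(xml_path):
    """Return [actrnumber, stage] for one trial file."""
    root = ET.parse(str(xml_path)).getroot()
    # findtext returns None if the tag is missing instead of raising.
    return [root.findtext("actrnumber"), root.findtext("stage")]


def folder_to_csv(xml_folder, csv_path):
    """Write one CSV row per *.xml file found in xml_folder."""
    rows = [trial_row(path) for path in sorted(Path(xml_folder).glob("*.xml"))]
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["actrnumber", "stage"])
        writer.writerows(rows)
    return rows
```

You would call it as folder_to_csv('/path/to/xml/folder', '/path/to/final.csv'). Sorting the paths just keeps the row order stable between runs.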
While .find gets you the first match, .findall returns a list of all of them. So you could do something like this:
extracted_IDs = []
item_IDs = trial.findall('itemid')
for id_tag in item_IDs:
    extracted_IDs.append(id_tag.text)
Or, to do the same thing in one line:
extracted_IDs = [item.text for item in trial.findall('itemid')]
Likewise, try:
extracted_data = [item.text for item in trial.findall('data1')]
If you have an equal number of both, and if the row you want to write each time is in the form of [<itemid>,<data1>] paired sets, then you can just make a combined set like this:
combined_pairs = [(extracted_IDs[i], extracted_data[i]) for i in range(len(extracted_IDs))]
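The pairing step can also be written with zip. As a rough sketch on the nested outcomes data from the question (the XML here is a trimmed copy of the sample, and pairing outcome with timepoint is my guess at what you would want in a row):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <outcomes> section from the sample file.
xml_string = """<ANZCTR_Trial requestNumber="1">
  <outcomes>
    <secondaryOutcome>
      <outcome>Tolerability and safety of treatment</outcome>
      <timepoint>Prior to each cycle of treatment, and at end of treatment</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Time to treatment failure</outcome>
      <timepoint>Assessed at end of treatment</timepoint>
    </secondaryOutcome>
  </outcomes>
</ANZCTR_Trial>"""

trial = ET.fromstring(xml_string)
secondary = trial.findall("outcomes/secondaryOutcome")

outcomes = [node.findtext("outcome") for node in secondary]
timepoints = [node.findtext("timepoint") for node in secondary]

# zip pairs the two lists item by item and stops at the shorter one.
combined_pairs = list(zip(outcomes, timepoints))
```

Note that zip silently drops leftovers if the two lists differ in length, so check the lengths first if that matters for your data.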

(Blender) (Python)How can I animate the factor value in the mix node with Python code?

What I want is a way to handle the 'factor' value in the mixRGB node like a normal object, like for example a cube, so with fcurves, fmodifiers and so on.
All this via Python code made in the Text Editor
The first step is to find the mix node you want. Within a material you can access each node by name. The first mixRGB node is named 'Mix', and following mix nodes will have a numerical extension added to the name. The name may also be changed manually by the user (or a python script). By showing the properties region (press N) you can see the name of the active node in the node properties.
To adjust the fac value you alter the default_value of the fac input. To keyframe the mix factor you tell the fac input to insert a keyframe with a data_path of default_value
import bpy
cur_frame = bpy.context.scene.frame_current
mat_nodes = bpy.data.materials['Material'].node_tree.nodes
mix_factor = mat_nodes['Mix.002'].inputs['Fac']
mix_factor.default_value = 0.5
mix_factor.keyframe_insert('default_value', frame=cur_frame)
Of course you may specify any frame number for the keyframe not just the current frame.
If you have many mix nodes, you can loop over the nodes and add each mix shader to a list
mix_nodes = [n for n in mat_nodes if n.type == 'MIX_RGB']
You can then loop over them and keyframe as desired.
for m in mix_nodes:
    m.inputs['Fac'].default_value = 0.5
    m.inputs['Fac'].keyframe_insert('default_value', frame=cur_frame)
Finding the fcurves after adding them is awkward for nodes. While you tell the input socket to insert a keyframe, the fcurve is stored in the node_tree so after keyframe_insert() you would use
bpy.data.materials['Material'].node_tree.animation_data.action.fcurves.find()
Knowing the data path you want to search for can be tricky, as the data path for the Fac input of node Mix.002 will be nodes["Mix.002"].inputs[0].default_value
If you want to find an fcurve after adding it to adjust values or add modifiers you will most likely find it easier to keep a list of them as you add the keyframes. After keyframe_insert() the new fcurve should be at
material.node_tree.animation_data.action.fcurves[-1]

Python: Parsing through XML with mini dom

I'm parsing through a decent sized xml file, and I ran into a problem. For some reason I cannot extract data even though I have done the exact same thing on different xml files before.
Here's a snippet of my code: (rest of the program, I've tested and they work fine)
EDIT: changed to include a testing try&except block
def parseXML():
    file = open(str(options.drugxml),'r')
    data = file.read()
    file.close()
    dom = parseString(data)
    druglist = dom.getElementsByTagName('drug')
    count = 0
    with codecs.open(str(options.csvdata),'w','utf-8') as csvout, open('DrugTargetRel.csv','w') as dtout:
        for entry in druglist:
            count = count + 1
            try:
                drugtype = entry.attributes['type'].value
                print count
            except:
                print count
                print entry
            drugidObj = entry.getElementsByTagName('drugbank-id')[0]
            drugid = drugidObj.childNodes[0].nodeValue
            drugnameObj = entry.getElementsByTagName('name')[0]
            drugname = drugnameObj.childNodes[0].nodeValue
            targetlist = entry.getElementsByTagName('target')
            for target in targetlist:
                targetid = target.attributes['partner'].value
                dtout.write((','.join((drugid,targetid)))+'\n')
            csvout.write((','.join((drugid,drugname,drugtype)))+'\n')
In case you're wondering what the XML file's schema roughly looks like, here's a rough god-awful sketch of the levels:
<drugs>
    <drug type='something' ...>
        <drugbank-id>
        <name>
        ...
        <targets>
            <target partner='something'>
Those that I typed in here, I need to extract from the XML file and stick it in csv files (as the code above shows), and the code has worked for different xml files before, not sure why it's not working on this one. I've gotten KeyError on 'type', I've also gotten indexing errors on line that extracts drugid even though EVERY drug has a drugid. What am I screwing up here?
EDIT: the stuff I'm extracting are guaranteed to be in each drug.
For anyone who cares, here's the link to the XML file I'm parsing:
http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip
EDIT: After implementing a try & except block (see above) here's what I found out:
In the schema, there are sections called "drug interactions" that also have a subfield called drug. So like this:
<drugs>
    <drug type='something' ...>
        <drugbank-id>
        <name>
        ...
        <targets>
            <target partner='something'>
        <drug-interactions>
            <drug>
I think that my line druglist = dom.getElementsByTagName('drug') is unintentionally picking those up as well -- I don't know how I could fix this... any suggestions?
Basically, when parsing XML, you can't rely on the fact that you know the structure. It's good practice to check the structure in code.
So every time you access elements or attributes, check first whether there are any. In your code this means the following:
Make sure there's an attribute 'type' on a drug element:
drugtype = entry.getAttribute('type') if entry.hasAttribute('type') else 'defaulttype'
Make sure getElementsByTagName doesn't return an empty list before accessing its elements:
drugbank_ids = entry.getElementsByTagName('drugbank-id')
drugidObj = drugbank_ids[0] if drugbank_ids else None
Also, before accessing child nodes, make sure there are any:
if drugidObj is not None and drugidObj.hasChildNodes():
    drugid = drugidObj.childNodes[0].nodeValue
Or use a for loop to loop through them.
And when you call getElementsByTagName on the drugs element, it returns all matching elements, including the nested ones. To get only the drug elements that are direct children of the drugs element, you have to use the childNodes attribute.
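The childNodes filtering could look roughly like this. The XML below is a made-up miniature of the structure described in the question (the ids, names and types are placeholders, not real DrugBank data):

```python
from xml.dom import minidom

# Made-up miniature of the structure: one <drug> is nested under <drug-interactions>.
xml_string = """<drugs>
  <drug type="small molecule">
    <drugbank-id>DB00001</drugbank-id>
    <name>Example A</name>
    <drug-interactions>
      <drug><name>Nested</name></drug>
    </drug-interactions>
  </drug>
  <drug type="biotech">
    <drugbank-id>DB00002</drugbank-id>
    <name>Example B</name>
  </drug>
</drugs>"""

dom = minidom.parseString(xml_string)
drugs_root = dom.documentElement

# childNodes also contains whitespace text nodes, so keep only <drug> elements.
top_level_drugs = [
    node for node in drugs_root.childNodes
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "drug"
]

# For comparison: getElementsByTagName also picks up the nested <drug>.
all_drugs = dom.getElementsByTagName("drug")

print(len(top_level_drugs), len(all_drugs))
```

Iterating over top_level_drugs instead of druglist means every entry has the type attribute and the drugbank-id child, as long as the file really follows the sketched structure.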
I had a feeling that maybe there was something weird happening due to running out of memory or something, so I rewrote the parser using an iterator over each drug and tried it out and got the program to complete without raising an exception.
Basically what I'm doing here is, instead of loading the entire XML file into memory, I parse the XML file for the beginning and end of each <drug> and </drug> tag. Then I parse that with the minidom each time.
The code might be a little fragile as I assume that each <drug> and </drug> pair are on their own lines. Hopefully it helps more than it harms though.
#!python
import codecs
from xml.dom import minidom
class DrugBank(object):
    def __init__(self, filename):
        self.fp = open(filename, 'r')
    def __iter__(self):
        return self
    def next(self):
        state = 0
        while True:
            line = self.fp.readline()
            if not line:
                # end of file reached without seeing </drugs>
                self.fp.close()
                raise StopIteration()
            if state == 0:
                if line.strip().startswith('<drug '):
                    lines = [line]
                    state = 1
                    continue
                if line.strip() == '</drugs>':
                    self.fp.close()
                    raise StopIteration()
            if state == 1:
                lines.append(line)
                if line.strip() == '</drug>':
                    return minidom.parseString("".join(lines))
    __next__ = next  # Python 3 looks for __next__ instead of next
with codecs.open('csvout.csv', 'w', 'utf-8') as csvout, open('dtout.csv', 'w') as dtout:
    db = DrugBank('drugbank.xml')
    for dom in db:
        entry = dom.firstChild
        drugtype = entry.attributes['type'].value
        drugidObj = entry.getElementsByTagName('drugbank-id')[0]
        drugid = drugidObj.childNodes[0].nodeValue
        drugnameObj = entry.getElementsByTagName('name')[0]
        drugname = drugnameObj.childNodes[0].nodeValue
        targetlist = entry.getElementsByTagName('target')
        for target in targetlist:
            targetid = target.attributes['partner'].value
            dtout.write((','.join((drugid,targetid)))+'\n')
        csvout.write((','.join((drugid,drugname,drugtype)))+'\n')
An interesting read that might help you out further is here:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x
def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district
def splitforstos(string):
    for itemind,item in enumerate(string): # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])
def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)
# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district= x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
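To make that concrete, here is a small sketch with collections.namedtuple. The name LevyRecord, the choice of fields and the sample row are all made up for illustration; the index positions are copied from the long-row branch of your analyze function.

```python
from collections import namedtuple

# Hypothetical record type; pick field names that match your own data.
LevyRecord = namedtuple(
    "LevyRecord",
    ["district", "vote1", "votetype", "numyears", "yearcom", "yeardue", "reason"],
)


def parse_long_row(x):
    """Build a record straight from the cell positions used in the question."""
    return LevyRecord(
        district=x[0],
        vote1=x[1],
        votetype=x[4],
        numyears=x[6],
        yearcom=x[8],
        yeardue=x[10],
        reason=x[11],
    )


# Made-up row with 12 cells, only to show the access pattern.
row = ["District 1", "120", "80 2.5", "", "renewal", "",
       "5", "", "2008", "", "2012", "roads"]
record = parse_long_row(row)
print(record.district, record.yeardue)
```

Because a named tuple is still a tuple, csv.writer.writerow(record) works without any conversion.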
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
    match= p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
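A stripped-down sketch of that dispatch idea follows. Everything in it is hypothetical: the Levy type has only three fields to keep it short, and the two matcher functions are loosely modelled on the two branches of your analyze function rather than being a faithful port.

```python
from collections import namedtuple

# Hypothetical, shortened record type.
Levy = namedtuple("Levy", ["district", "numyears", "yeardue"])


def match_long_row(row):
    """Rows where every value sits in its own cell (12 or more cells)."""
    if len(row) >= 12:
        return Levy(district=row[0], numyears=row[6], yeardue=row[10])
    return None


def match_short_row(row):
    """Rows with exactly 8 cells, where the due year ends cell 6."""
    if len(row) == 8:
        return Levy(district=row[0], numyears=None, yeardue=str(row[6])[-4:])
    return None


MATCHERS = (match_long_row, match_short_row)


def first_match(row, matchers=MATCHERS):
    """Return the record from the first matcher that accepts the row, else None."""
    for matcher in matchers:
        match = matcher(row)
        if match:
            return match
    return None
```

When a new file layout turns up you add one more small function to MATCHERS instead of growing the if/else inside analyze, and rows for which first_match returns None can be logged for manual review.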
