This is more of a theoretical question to understand objects, garbage collection and performance of Python better.
Let's say I have a ton of XML files and want to iterate over each one, get all the tags, store them in a dict, increase counters for each tag, etc. When I do this, the first, let's say 15k, iterations process really quickly, but afterwards the script slows down significantly, while memory usage, CPU load, etc. are fine. Why is that? Am I creating hidden objects on each iteration that are not cleaned up, and can I do something to improve it? I tried to use regex instead of ElementTree, but it wasn't worth the effort since I only want to extract first-level tags and it would make things more complex.
Unfortunately I cannot give a reproducible example without providing the XML files, but this is my code:
import os
import datetime
import xml.etree.ElementTree as ElementTree
start_time = datetime.datetime.now()
original_implemented_tags = os.path.abspath("/path/to/file")
required_tags = {}
optional_tags = {}
new_tags = {}
# read original tags
for _ in open(original_implemented_tags, "r"):
if "#XmlElement(name =" in _:
_xml_attr = _.split('"')[1]
if "required = true" in _:
            required_tags[_xml_attr] = 1  # set to 1 so "if dict.get(_xml_attr)" is truthy (0 would evaluate as False)
else:
optional_tags[_xml_attr] = 1
# read all XML files from nested folder containing XML dumps and other files
clinical_trial_root_dir = os.path.abspath("/path/to/dump/folder")
xml_files = []
for root, dirs, files in os.walk(clinical_trial_root_dir):
xml_files.extend([os.path.join(root, _) for _ in files if os.path.splitext(_)[-1] == '.xml'])
# function for parsing a file and extracting unique tags
def read_via_etree(file):
_root = ElementTree.parse(file).getroot()
_main_tags = list(set([_.tag for _ in _root.findall("./")])) # some tags occur twice
for _attr in _main_tags:
        # if the tag doesn't exist in the original document, increase counts in new_tags
if _attr not in required_tags.keys() and _attr not in optional_tags.keys():
if _attr not in new_tags.keys():
new_tags[_attr] = 1
else:
new_tags[_attr] += 1
# otherwise, increase counts in either one of required_tags or optional_tags
if required_tags.get(_attr):
required_tags[_attr] += 1
if optional_tags.get(_attr):
optional_tags[_attr] += 1
# actual parsing with indicator
for idx, xml in enumerate(xml_files):
if idx % 1000 == 0:
print(f"Analyzed {idx} files")
read_via_etree(xml)
# undoing the initial 1
for k in required_tags.keys():
required_tags[k] -= 1
for k in optional_tags.keys():
optional_tags[k] -= 1
print(f"Done parsing {len(xml_files)} documents in {datetime.datetime.now() - start_time}")
Example of one XML file:
<parent_element>
<tag_i_need>
<tag_i_dont_need>Some text i dont need</tag_i_dont_need>
</tag_i_need>
<another_tag_i_need>Some text i also dont need</another_tag_i_need>
</parent_element>
After the helpful comments, I added a timestamp to my loop indicating how much time has passed since the last 1k documents, and flushed sys.stdout:
import sys
loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
if idx % 1000 == 0:
print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}")
sys.stdout.flush()
loop_timer = datetime.datetime.now()
read_via_etree(xml)
I think it makes sense now, since the XML files vary in size and the standard output stream is buffered. Thanks to Albert Winestein.
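As an aside, the seed-with-1-and-subtract-later bookkeeping can be dropped if the tag names are kept in sets and the counts in collections.Counter objects. A minimal sketch of that variant (it reuses the ElementTree import above and assumes the required/optional tag names have been read into the two sets from the tag file, just like required_tags and optional_tags):
import collections
required = set()        # assumed filled from the original tag file
optional = set()        # assumed filled from the original tag file
required_counts = collections.Counter()
optional_counts = collections.Counter()
new_counts = collections.Counter()
def read_via_etree(file):
    root = ElementTree.parse(file).getroot()
    for tag in {child.tag for child in root.findall("./")}:  # some tags occur twice per file
        if tag in required:
            required_counts[tag] += 1
        elif tag in optional:
            optional_counts[tag] += 1
        else:
            new_counts[tag] += 1
Missing keys in a Counter simply read as 0, so there is nothing to undo at the end.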
I've figured out how to get data from a single XML file into a row on a CSV. I'd like to iterate this across a number of files in a directory so that the data from each XML file is extracted to a new row on the CSV. I've done some searching and I get the gist of having to create a loop (perhaps using the OS module) but the specifics are lost on me.
This script does the extraction for a single XML file.
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("[PATH/FILE.xml]")
root = tree.getroot()
test_file = open('PATH','w',newline='')
csvwriter = csv.writer(test_file)
header = []
count = 0
for trial in root.iter('[XML_ROOT]'):
item_info = []
if count == 0:
item_ID = trial.find('itemid').tag
header.append(item_ID)
data_1 = trial.find('data1').tag
header.append(data_1)
csvwriter.writerow(header)
count = count + 1
item_ID = trial.find('itemid').text
item_info.append(item_ID)
data_1 = trial.find('data1').text
    item_info.append(data_1)
csvwriter.writerow(item_info)
test_file.close()
Now I need to figure out what to do to it to iterate.
Edit:
Here is an example of an XML file I'm using. Just for testing, I'm pulling out actrnumber as item_id and stage as data_1. Eventually I'll need to figure out the most sensible way to create arrays for the nested data, for instance in the outcomes node: probably an array for primaryOutcome and all secondaryOutcome instances.
<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="1">
<stage>Registered</stage>
<submitdate>6/07/2005</submitdate>
<approvaldate>7/07/2005</approvaldate>
<actrnumber>ACTRN12605000001695</actrnumber>
<trial_identification>
<studytitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas</studytitle>
<scientifictitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas with the primary objective tumour response</scientifictitle>
<utrn />
<trialacronym>ABC trial</trialacronym>
<secondaryid>National Clinical Trials Registry: NCTR570</secondaryid>
</trial_identification>
<conditions>
<healthcondition>Adenocarcinoma of the gallbladder or intra/extrahepatic bile ducts</healthcondition>
<conditioncode>
<conditioncode1>Cancer</conditioncode1>
<conditioncode2>Biliary tree (gall bladder and bile duct)</conditioncode2>
</conditioncode>
</conditions>
<interventions>
<interventions>Gemcitabine delivered as fixed dose-rate infusion with cisplatin</interventions>
<comparator>Single arm trial</comparator>
<control>Uncontrolled</control>
<interventioncode>Treatment: drugs</interventioncode>
</interventions>
<outcomes>
<primaryOutcome>
<outcome>Objective tumour response.</outcome>
<timepoint>Measured every 6 weeks during study treatment, and post treatment.</timepoint>
</primaryOutcome>
<secondaryOutcome>
<outcome>Tolerability and safety of treatment</outcome>
<timepoint>Prior to each cycle of treatment, and at end of treatment</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Duration of response</outcome>
<timepoint>Prior to starting every second treatment cycle, then 6 monthly for 12 months, then as clinically indicated</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Time to treatment failure</outcome>
<timepoint>Assessed at end of treatment</timepoint>
</secondaryOutcome>
...
</ANZCTR_Trial>
Simply generalize your process into a method and iterate across files with os.listdir, assuming all XML files reside in the same folder. Also, be sure to use a context manager (with) to better manage the open/close file process.
Additionally, your header parsing is redundant since you name the very tags that you extract: itemid and data1. Node names likely stay the same and so can be hard-coded, while text values differ and require parsing. Below, a list comprehension gives a more streamlined collection of data within and across XML files. This also separates the XML parsing from the CSV writing.
import os
import csv
import xml.etree.ElementTree as ET
# GENERALIZED METHOD
def proc_xml(xml_path):
    full_path = os.path.join('/path/to/xml/folder', xml_path)
    print(full_path)
    tree = ET.parse(full_path)
    root = tree.getroot()
    item_info = [[trial.find('itemid').text, trial.find('data1').text]
                 for trial in root.iter('[XML_ROOT]')][0]
    return item_info
# NESTED LIST OF XML DATA PER FILE
xml_data_lst = [proc_xml(f) for f in os.listdir('/path/to/xml/folder')
                if f.endswith('.xml')]
# WRITE TO CSV FILE
with open('/path/to/final.csv', 'w', newline='') as test_file:
csvwriter = csv.writer(test_file)
# HEADERS
csvwriter.writerow(['itemid', 'data1'])
# DATA ROWS
for i in xml_data_lst:
csvwriter.writerow(i)
While .find gets you the first match, .findall returns a list of all of them. So you could do something like this:
extracted_IDs = []
item_IDs = trial.findall('itemid')
for id_tag in item_IDs:
    extracted_IDs.append(id_tag.text)
Or, to do the same thing in one line:
extracted_IDs = [item.text for item in trial.findall('itemid')]
Likewise, try:
extracted_data = [item.text for item in trial.findall('data1')]
If you have an equal number of both, and if the row you want to write each time is a [<itemid>, <data1>] pair, then you can make a combined list of pairs like this:
combined_pairs = [(extracted_IDs[i], extracted_data[i]) for i in range(len(extracted_IDs))]
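As a small aside, zip does the same pairing in one step (assuming, as above, the two lists are equal in length and aligned):
combined_pairs = list(zip(extracted_IDs, extracted_data))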
I've written a script which reads different files and searches for molecule IDs in big SDF databases (about 4.0 GB each).
The idea of this script is to copy every molecule from a list of IDs (about 287212 molecules) from my original databases to a new one, in such a way that there is only a single copy of each molecule (in this case, the first copy encountered).
I've written this script:
import re
import sys
import os
def sdf_grep (molname,files):
filin = open(files, 'r')
filine= filin.readlines()
for i in range(0,len(filine)):
if filine[i][0:-1] == molname and filine[i][0:-1] not in past_mol:
past_mol.append(filine[i][0:-1])
iterate = 1
while iterate == 1:
if filine[i] == "$$$$\n":
filout.write(filine[i])
iterate = 0
break
else:
filout.write(filine[i])
i = i+1
else:
continue
filin.close()
mol_dock = os.listdir("test")
listmol = []
past_mol = []
imp_listmol = open("consensus_sorted_surflex.txt", 'r')
filout = open('test_ini.sdf','wa')
for line in imp_listmol:
listmol.append(line.split('\t')[0])
print 'list ready... reading files'
imp_listmol.close()
for f in mol_dock:
print 'reading '+f
for molecule in listmol:
if molecule not in past_mol:
sdf_grep(molecule , 'test/'+f)
print len(past_mol)
filout.close()
It works perfectly, but it is very slow... too slow for what I need. Is there a way to rewrite this script so that it reduces the computation time?
Thank you very much.
The main problem is that you have three nested loops: over the SDF files, over the molecules, and the file parsing in the inner call. That smells like trouble - I mean, quadratic complexity. You should move the parsing of the huge files out of the inner loop and use a set or dictionary for the molecules.
Something like this:
For each sdf file
For each line, if it is molecule definition
Check in dictionary of unfound molecules
If present, process it and remove from dictionary of unfound molecules
This way, you will parse each sdf file exactly once, and with each found molecule, speed will further increase.
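A minimal sketch of that single-pass approach (it reuses mol_dock and listmol from the question and assumes each record runs from its ID line to the next "$$$$" line):
wanted = set(listmol)                       # molecule IDs still to be found
with open('test_ini.sdf', 'w') as filout:
    for fname in mol_dock:
        with open(os.path.join('test', fname)) as filin:
            keep = False
            block = []
            for line in filin:
                name = line.rstrip('\n')
                if not keep and name in wanted:
                    keep = True
                    wanted.discard(name)    # only the first copy encountered is kept
                if keep:
                    block.append(line)
                    if name == '$$$$':      # end of the current record
                        filout.write(''.join(block))
                        block = []
                        keep = False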
Let past_mol be a set, rather than a list. That will speed up
filine[i][0:-1] not in past_mol
since checking membership in a set is O(1), while checking membership in a list is O(n).
Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write.
It is generally better not to allow functions to modify global variables. sdf_grep modifies the global variable past_mol.
By adding past_mol to the arguments of sdf_grep you make it explicit that sdf_grep depends on the existence of past_mol (otherwise, sdf_grep is not really a standalone function).
If you pass past_mol in as a third argument to sdf_grep, then Python will make a new local variable named past_mol which will point to the same object as the global variable past_mol. Since that object is a set and a set is a mutable object, past_mol.add(sline) will affect the global variable past_mol as well.
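A tiny illustration of that mechanism (purely for demonstration, not part of the original code):
def remember(name, seen):
    seen.add(name)            # mutates the caller's set through the local name
past_mol = set()
remember('mol_001', past_mol)
print(past_mol)               # the set now contains 'mol_001'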
As an added bonus, Python looks up local variables faster than global variables:
def using_local():
x = set()
for i in range(10**6):
x
y = set
def using_global():
for i in range(10**6):
y
In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop
In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop
sdf_grep can be simplified greatly if you use a variable (let's call it found) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname and ends with "$$$$"):
import re
import sys
import os
def sdf_grep(molname, files, past_mol):
chunk = []
found = False
with open(files, 'r') as filin:
for line in filin:
sline = line.rstrip()
if sline == molname and sline not in past_mol:
found = True
past_mol.add(sline)
            elif found and sline == '$$$$':
chunk.append(line)
found = False
if found:
chunk.append(line)
    return ''.join(chunk)
def main():
past_mol = set()
with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
listmol = [line.split('\t')[0] for line in imp_listmol]
print 'list ready... reading files'
with open('test_ini.sdf', 'wa') as filout:
for f in os.listdir("test"):
print 'reading ' + f
for molecule in listmol:
if molecule not in past_mol:
filout.write(sdf_grep(molecule, os.path.join('test/', f), past_mol))
print len(past_mol)
if __name__ == '__main__':
main()
I'm parsing through a decent-sized XML file, and I ran into a problem. For some reason I cannot extract data, even though I have done the exact same thing on different XML files before.
Here's a snippet of my code (I've tested the rest of the program and it works fine):
EDIT: changed to include a testing try/except block
def parseXML():
file = open(str(options.drugxml),'r')
data = file.read()
file.close()
dom = parseString(data)
druglist = dom.getElementsByTagName('drug')
with codecs.open(str(options.csvdata),'w','utf-8') as csvout, open('DrugTargetRel.csv','w') as dtout:
for entry in druglist:
count = count + 1
try:
drugtype = entry.attributes['type'].value
print count
except:
print count
print entry
drugidObj = entry.getElementsByTagName('drugbank-id')[0]
drugid = drugidObj.childNodes[0].nodeValue
drugnameObj = entry.getElementsByTagName('name')[0]
drugname = drugnameObj.childNodes[0].nodeValue
targetlist = entry.getElementsByTagName('target')
for target in targetlist:
targetid = target.attributes['partner'].value
dtout.write((','.join((drugid,targetid)))+'\n')
csvout.write((','.join((drugid,drugname,drugtype)))+'\n')
In case you're wondering what the XML file's schema roughly looks like, here's a rough god-awful sketch of the levels:
<drugs>
<drug type='something' ...>
<drugbank-id>
<name>
...
<targets>
<target partner='something'>
Those are the pieces I need to extract from the XML file and stick into CSV files (as the code above shows). The code has worked for different XML files before; I'm not sure why it's not working on this one. I've gotten a KeyError on 'type', and I've also gotten indexing errors on the line that extracts drugid, even though EVERY drug has a drugid. What am I screwing up here?
EDIT: the stuff I'm extracting is guaranteed to be in each drug.
For anyone who cares, here's the link to the XML file I'm parsing:
http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip
EDIT: After implementing a try/except block (see above), here's what I found out:
In the schema, there are sections called "drug interactions" that also have a subfield called drug. So like this:
<drugs>
<drug type='something' ...>
<drugbank-id>
<name>
...
<targets>
<target partner='something'>
<drug-interactions>
<drug>
I think that my line druglist = dom.getElementsByTagName('drug') is unintentionally picking those up as well -- I don't know how I could fix this... any suggestions?
Basically, when parsing an XML file, you can't rely on the fact that you know the structure. It's good practice to find out the structure in code.
So every time you access elements or attributes, check first whether there are any. In your code that means the following:
Make sure there's an attribute 'type' on a drug element:
drugtype = entry.attributes['type'].value if entry.attributes.has_key('type') else 'defaulttype'
Make sure getElementsByTagName doesn't return an empty list before accessing its elements:
drugbank_ids = entry.getElementsByTagName('drugbank-id')
drugidObj = drugbank_ids[0] if drugbank_ids else None
Also, before accessing childNodes, make sure there are any:
if drugidObj.hasChildNodes():
    drugid = drugidObj.childNodes[0].nodeValue
Or use a for loop to iterate through them.
And when you call getElementsByTagName on the drugs element, it returns all matching descendants, including the nested ones. To get only the drug elements which are direct children of the drugs element, you have to use the childNodes attribute.
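A minimal sketch of that filtering with minidom (dom is the parsed document from the question; the variable names are illustrative):
drugs_elem = dom.getElementsByTagName('drugs')[0]
druglist = [node for node in drugs_elem.childNodes
            if node.nodeType == node.ELEMENT_NODE and node.nodeName == 'drug']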
I had a feeling that maybe there was something weird happening due to running out of memory or something, so I rewrote the parser using an iterator over each drug and tried it out and got the program to complete without raising an exception.
Basically what I'm doing here is, instead of loading the entire XML file into memory, I parse the XML file for the beginning and end of each <drug> and </drug> tag. Then I parse that with the minidom each time.
The code might be a little fragile as I assume that each <drug> and </drug> pair are on their own lines. Hopefully it helps more than it harms though.
#!python
import codecs
from xml.dom import minidom
class DrugBank(object):
def __init__(self, filename):
self.fp = open(filename, 'r')
def __iter__(self):
return self
def next(self):
state = 0
while True:
line = self.fp.readline()
if state == 0:
if line.strip().startswith('<drug '):
lines = [line]
state = 1
continue
if line.strip() == '</drugs>':
self.fp.close()
raise StopIteration()
if state == 1:
lines.append(line)
if line.strip() == '</drug>':
return minidom.parseString("".join(lines))
with codecs.open('csvout.csv', 'w', 'utf-8') as csvout, open('dtout.csv', 'w') as dtout:
db = DrugBank('drugbank.xml')
for dom in db:
entry = dom.firstChild
drugtype = entry.attributes['type'].value
drugidObj = entry.getElementsByTagName('drugbank-id')[0]
drugid = drugidObj.childNodes[0].nodeValue
drugnameObj = entry.getElementsByTagName('name')[0]
drugname = drugnameObj.childNodes[0].nodeValue
targetlist = entry.getElementsByTagName('target')
for target in targetlist:
targetid = target.attributes['partner'].value
dtout.write((','.join((drugid,targetid)))+'\n')
csvout.write((','.join((drugid,drugname,drugtype)))+'\n')
An interesting read that might help you out further is here:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
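For reference, xml.etree.ElementTree.iterparse gives a similar streaming effect without assuming that each <drug>...</drug> pair sits on its own lines. A rough sketch (tag and attribute names are taken from the question and not verified against the current DrugBank schema):
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse('drugbank.xml', events=('end',)):
    # per the question, the nested <drug> references lack the type attribute
    if elem.tag == 'drug' and 'type' in elem.attrib:
        drugid = elem.findtext('drugbank-id')
        drugname = elem.findtext('name')
        # ... write drugid, drugname and elem.attrib['type'] to the CSVs here ...
        elem.clear()    # discard the element's children to keep memory use low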
Is there a limit to memory for Python? I've been using a Python script to calculate average values from a file which is a minimum of 150 MB big.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to the python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")
files = [file_A1_B1,file_A2_B2,file_A1_B2,file_A2_B1]
for u in files:
line = u.readlines()
list_of_lines = []
for i in line:
values = i.split('\t')
list_of_lines.append(values)
count = 0
for j in list_of_lines:
count +=1
for k in range(0,count):
list_of_lines[k].remove('\n')
length = len(list_of_lines[0])
print_counter = 4
for o in range(0,length):
total = 0
for p in range(0,count):
number = float(list_of_lines[p][o])
total = total + number
average = total/count
print average
if print_counter == 4:
file_write.write(str(average)+'\n')
print_counter = 0
print_counter +=1
file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second - hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years - most not too major. This is so that if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical RAM and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
from itertools import izip_longest
except ImportError: # Python 3
from itertools import zip_longest as izip_longest
GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
"A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w') # left in, but nothing written
for file_name in input_file_names:
with open(file_name, 'r') as input_file:
print('processing file: {}'.format(file_name))
totals = []
for count, fields in enumerate((line.split('\t') for line in input_file), 1):
totals = [sum(values) for values in
izip_longest(totals, map(float, fields), fillvalue=0)]
averages = [total/count for total in totals]
for print_counter, average in enumerate(averages):
print(' {:9.4f}'.format(average))
if print_counter % GROUP_SIZE == 0:
file_write.write(str(average)+'\n')
file_write.write('\n')
file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
It is better to iterate over each line:
for current_line in u:
    do_something_with(current_line)
This is the recommended approach.
Later in your script, you're doing some very strange things, like first counting all the items in a list and then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much more easily.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removing of \ns, etc. for you.
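For example, a minimal sketch of reading one of these files with csv (the file name is taken from the question; the averaging logic is left out):
import csv

with open("A1_B1_100000.txt", "r") as input_file:
    for fields in csv.reader(input_file, delimiter='\t'):
        numbers = [float(value) for value in fields if value != '']  # skip an empty field left by a trailing tab, if any
        # ... update running totals with numbers here ...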
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On Jython 2.5 it crashes earlier:
239000 [MiB]
Probably I can configure Jython to use more memory (it uses the limits from the JVM).
Test app:
import sys
sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
fill_size = 1003
if sys.version.startswith('3'):
fill_size = 497
print(fill_size)
MiB = 0
while True:
s = str(i).zfill(fill_size)
sl.append(s)
if i == 0:
try:
sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
except AttributeError:
pass
i += 1
if i % 1024 == 0:
MiB += 1
if MiB % 25 == 0:
sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read it line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
for line in u: # This will iterate over each line in the file
# Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but you also laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A1_B1,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
table = []
for aline in afile:
values = aline.split('\t')
values.remove('\n') # why?
table.append(values)
row_count = len(table)
row0length = len(table[0])
print_counter = 4
for column_index in range(row0length):
column_total = 0
for row_index in range(row_count):
number = float(table[row_index][column_index])
column_total = column_total + number
column_average = column_total/row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1
file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages, and (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
for row_count, aline in enumerate(afile, start=1):
values = aline.split('\t')
values.remove('\n') # why?
fvalues = map(float, values)
if row_count == 1:
row0length = len(fvalues)
column_index_range = range(row0length)
column_totals = fvalues
else:
assert len(fvalues) == row0length
for column_index in column_index_range:
column_totals[column_index] += fvalues[column_index]
print_counter = 4
for column_index in column_index_range:
column_average = column_totals[column_index] / row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1
I have two 3 GB text files; each file has around 80 million lines, and they share 99.9% identical lines (file A has 60,000 unique lines, file B has 80,000 unique lines).
How can I quickly find those unique lines in the two files? Are there any ready-to-use command-line tools for this? I'm using Python, but I guess it's less likely that I'll find an efficient Pythonic method to load the files and compare them.
Any suggestions are appreciated.
If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.
I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).
Notes:
1. I only store each line's hash to save space (and time, if paging might occur).
2. Because of the above, I only print out line numbers; if you need the actual lines, you'd just need to read the files in again.
3. I assume that the hash function results in no collisions. This is nearly, but not perfectly, certain.
4. I import hashlib because the built-in hash() function is too short to avoid collisions.
import sys
import hashlib
file = []
lines = []
for i in range(2):
# open the files named in the command line
file.append(open(sys.argv[1+i], 'r'))
# stores the hash value and the line number for each line in file i
lines.append({})
# assuming you like counting lines starting with 1
counter = 1
while 1:
# assuming default encoding is sufficient to handle the input file
line = file[i].readline().encode()
if not line: break
hashcode = hashlib.sha512(line).hexdigest()
lines[i][hashcode] = sys.argv[1+i]+': '+str(counter)
counter += 1
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
With 60,000 or 80,000 unique lines you could just create a dictionary for each unique line, mapping it to a number. mydict["hello world"] => 1, etc. If your average line is around 40-80 characters this will be in the neighborhood of 10 MB of memory.
Then read each file, converting it to an array of numbers via the dictionary. Those arrays will still fit in memory on most machines (two files of roughly 80 million lines at 8 bytes per entry is on the order of 1.3 GB). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.
EDIT:
In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.
#!/usr/bin/python
class Reader:
def __init__(self, file):
self.count = 0
self.dict = {}
self.file = file
def readline(self):
line = self.file.readline()
if not line:
return None
if self.dict.has_key(line):
return self.dict[line]
else:
self.count = self.count + 1
self.dict[line] = self.count
return self.count
if __name__ == '__main__':
print "Type Ctrl-D to quit."
import sys
r = Reader(sys.stdin)
result = 'ignore'
while result:
result = r.readline()
print result
If I understand correctly, you want the lines of these files without duplicates. This does the job:
uniqA = set(open('fileA', 'r'))
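Presumably the same is done for the second file, and the set difference then yields the lines unique to each. A short sketch completing the idea (note that this holds every distinct line of both files in memory, which may be several gigabytes here):
uniqB = set(open('fileB', 'r'))
only_in_A = uniqA - uniqB     # lines that appear only in fileA
only_in_B = uniqB - uniqA     # lines that appear only in fileB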
Python has difflib, which claims to be quite competitive with other diff utilities; see:
http://docs.python.org/library/difflib.html
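A quick generic usage sketch of difflib (not tuned for 3 GB inputs; it holds both line lists in memory):
import difflib
import sys

with open('fileA') as fa, open('fileB') as fb:
    for line in difflib.unified_diff(fa.readlines(), fb.readlines(),
                                     fromfile='fileA', tofile='fileB'):
        sys.stdout.write(line)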