Python - List appending getting slower? - python

I have a problem with my Python code. I am processing a big (3.5 GB) JSON file that contains scores, and I need it in chunks of 21984 scores (all the scores for one query). The code works, but my test set has 4000 queries. The first 10 go fast, but after that the time needed for this part of the code keeps growing: after 5 hours I'm at 500 queries. The prints were for logging, and the problem seems to lie in translating the lines or appending them to the list. Does anyone know how to make this faster, or see what causes the slowdown?
def getscorebatch(number):
    print('Creating Batch..')
    batch_temp = list()
    with open(json_file_name, 'r') as FileObj:
        print("Creating slice...")
        lines_gen = islice(FileObj, (21894 * number), ((21894 * number) + 21894))
        print("Appending slice...")
        for line in lines_gen:
            line = line.translate({ord(c): None for c in ':",}{ \n'})
            batch_temp.append(line)
    return batch_temp
UPDATE: I tried to implement your suggestions, and it is way faster! Thank you so much. I'm fairly new to generators, so what I do not understand now is how to get the correct chunk with this code. This gives me the first chunk every time:
def generator(file_to_read):
    c = 0
    while c < 21894:
        data = file_to_read.readline()
        c += 1
        if not data:
            break
        yield data
def getscorebatch(number):
    print('Creating Batch..')
    batch_temp = [0]*22000
    with open(json_file_name, 'r') as FileObj:
        gen_file = generator(FileObj)
        batch_temp = [line.translate(line.maketrans("", "", REMOVE)) for line in gen_file]
    print(len(batch_temp))
    return batch_temp
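One possible way to get successive chunks is to open the file once and keep pulling slices from the same file object, so that each batch starts where the previous one stopped. This is only a minimal sketch: iter_batches is a hypothetical helper name, REMOVE is assumed to hold the same characters that the first version strips, and batches are assumed to be consumed in order.

```python
from itertools import islice

CHUNK_SIZE = 21894        # lines per query, taken from the code above
REMOVE = ':",}{ \n'       # assumed: same characters as in the first version


def iter_batches(file_obj, chunk_size=CHUNK_SIZE):
    """Yield successive lists of cleaned lines from an already open file."""
    table = str.maketrans("", "", REMOVE)   # build the translation table once
    while True:
        # islice continues from the current file position, so nothing is re-read
        batch = [line.translate(table) for line in islice(file_obj, chunk_size)]
        if not batch:
            return
        yield batch
```

Looping over iter_batches(FileObj) inside a single with open(...) block reads the file once in total, instead of skipping 21894 * number lines again for every query.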

Related

Meaningful IO error messages without overhead in Python

I have the following dilemma. I am parsing huge CSV files, which theoretically can contain invalid records, with python. To be able to fix an issue quickly I would like to see the line numbers in the error messages. However, as I am parsing many files and errors are very rare, I do not want my error handling adding overheads to the main pipeline. That is why I would not like to use enumerate or a similar approach.
In a nutshell, I am looking for a get_line_number function to work like this:
with open('file.csv', 'r') as f:
    for line in f:
        try:
            process(line)
        except:
            line_no = get_line_number(f)
            raise RuntimeError('Error while processing the line ' + line_no)
However, this seems to be complicated, as f.tell() will not work in this loop.
EDIT:
It seems like overheads are quite significant. In my real world case (which is painful, as the files are lists of pretty short records: single floats, int-float pairs or string-int pairs; the file.csv is about 800MB large and has around 80M lines), it is about 2.5 seconds per file read for enumerate. For some reason, fileinput is extremely slow.
import timeit
s = """
with open('file.csv', 'r') as f:
    for line in f:
        pass
"""
print(timeit.repeat(s, number = 10, repeat = 3))
s = """
with open('file.csv', 'r') as f:
    for idx, line in enumerate(f):
        pass
"""
print(timeit.repeat(s, number = 10, repeat = 3))
s = """
count = 0
with open('file.csv', 'r') as f:
    for line in f:
        count += 1
"""
print(timeit.repeat(s, number = 10, repeat = 3))
setup = """
import fileinput
"""
s = """
for line in fileinput.input('file.csv'):
    pass
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
outputs
[45.790788270998746, 44.88589363079518, 44.93949336092919]
[70.25306860171258, 70.28569177398458, 70.2074502906762]
[75.43606997421011, 74.39759518811479, 75.02027251804247]
[325.1898657102138, 321.0400970801711, 326.23809849238023]
EDIT 2:
Getting close to the real-world scenario. The try-except clause is outside of the loop to reduce the overhead.
import timeit
setup = """
def process(line):
    if float(line) < 0.5:
        outliers += 1
"""
s = """
outliers = 0
with open('file.csv', 'r') as f:
    for line in f:
        process(line)
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
s = """
outliers = 0
with open('file.csv', 'r') as f:
    try:
        for idx, line in enumerate(f):
            process(line)
    except ValueError:
        # str() is needed: a string cannot be concatenated with an int
        raise RuntimeError('Invalid value in line ' + str(idx + 1)) from None
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
outputs
[244.9097429071553, 242.84596176538616, 242.74369075801224]
[293.32093235617504, 274.17732743313536, 274.00854821596295]
So, in my case, overhead from the enumerate is around 10%.
Do use enumerate
for line_ref, line in enumerate(f):
    line_no = line_ref + 1 # enumerate starts at zero
It's not adding any significant overhead. The work involved in getting records out of a file vastly exceeds the work involved in keeping a counter, and the tuple assignment in the for statement is just a name-binding, not an extra copy of the data referred to by line.
Replacement update:
Made a mistake in generating my test file. Have now pretty much confirmed the first timing test added to the question.
Personally I'd regard a 10% overhead on a worst(ish)-case file with 10-byte records as completely acceptable, given that the alternative is not knowing which of 80 million records were in error.
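Putting the enumerate advice together with the error message from the question could look roughly like this. It is a sketch only: process_file is a hypothetical wrapper, process is whatever per-line function you already have, and ValueError is assumed to be the exception it raises on a bad record.

```python
def process_file(path, process):
    """Run process() on every line; report the 1-based line number on failure."""
    with open(path, 'r') as f:
        try:
            # start=1 makes line_no match what a text editor shows
            for line_no, line in enumerate(f, start=1):
                process(line)
        except ValueError as exc:
            raise RuntimeError(
                'Error while processing line {}: {!r}'.format(line_no, line)
            ) from exc
```

Keeping the try block outside the loop means the only per-line cost is the counter that enumerate maintains.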
If you are sure adding debugging info is too much overhead (I do not want to argue on that topic), you could implement two versions of the function: a high-performance one, and one with thorough checking and verbose debugging. The basic idea is:
try:
    func_quick(args)
except Exception:
    func_verbose(args)
The drawback is that processing starts over when an error occurs. But if you have to correct the error manually anyway, a penalty of several seconds in that case should not hurt. Also, func_verbose() doesn't have to stop at the first error; it can check the whole file and list all errors.
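A minimal sketch of this two-pass idea follows. The names func_quick and func_verbose come from the snippet above, but their signatures, the run wrapper, and the choice of ValueError as the failure type are assumptions made for illustration.

```python
def func_quick(path, process):
    """Fast path: no line counting at all."""
    with open(path) as f:
        for line in f:
            process(line)


def func_verbose(path, process):
    """Slow path: re-read the file and collect every failing line number."""
    errors = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            try:
                process(line)
            except ValueError as exc:
                errors.append((line_no, str(exc)))
    return errors


def run(path, process):
    """Return a list of (line_no, message); empty if the file is clean."""
    try:
        func_quick(path, process)
    except ValueError:
        return func_verbose(path, process)
    return []
```

Note that this only makes sense if process has no side effects that would be harmful to repeat on the second pass.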
The standard library fileinput module processes large files memory-efficiently and provides a built-in line number counter. It also automatically picks up the filenames to read from the command line arguments. In older Python versions there is no simple way to use it with a context manager; since Python 3.2, fileinput.input() can itself be used in a with statement.
As for performance, you'd need to test it in comparison with other approaches.
import fileinput
for line in fileinput.input():
    try:
        process(line)
    except:
        line_no = fileinput.filelineno()
        # filelineno() returns an int, so convert it before concatenating
        raise RuntimeError('Error while processing the line ' + str(line_no))
BTW I'd recommend catching only relevant exceptions, probably a custom one, otherwise you'll mask out unanticipated exceptions.

Python compare every line in file with all others

I am implementing a statistical program and have created a performance bottleneck and was hoping that I could obtain some help from the community to possibly point me in the direction of optimization.
I am creating a set for each row in a file and finding its intersection with the set of every other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop (O(n²)), and the files coming into the program are typically just over 20,000 lines long. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at disposal and a significantly quick 4-core Intel i7 processor. I have noticed no significant difference in memory use by copying the list1 and using a second list for comparison instead of opening the file again(maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly to the HDD which is slower but noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a certain datatype or maybe better understands multiprocessing in Python and that they might be able to help me speed things up. I appreciate any help and I hope my code isn't too poorly written.
import ast, sys, os, shutil
list1 = []
end = 0
filterValue = 3
# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue) , "w") as outfile:
with open(arg2 + arg1 + "/file", "r") as infile:
# create a list of sets of rows in file
for row in infile:
list1.append(set(ast.literal_eval(row)))
infile.seek(0)
for row in infile:
# if file only has one row, no comparisons need to be made
if not(len(list1) == 1):
# get the first set from the list and...
set1 = set(ast.literal_eval(row))
# ...find the intersection of every other set in the file
for i in range(0, len(list1)):
# don't compare the set with itself
if not(pos == i):
set2 = list1[i]
set3 = set1.intersection(set2)
# if the two sets have less than 3 items in common
if(len(set3) < filterValue):
# and you've reached the end of the file
if(i == len(list1)):
# append the row in outfile
outfile.write(row)
# increase position in infile
pos += 1
else:
break
else:
outfile.write(row)
Sample input would be a file with this format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
The output file if this were the input file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...because the two sets removed contained three or more of the same user id's.
This answer is not about how to split code in functions, name variables etc. It's about faster algorithm in terms of complexity.
I'd use a dictionary. Will not write exact code, you can do it yourself.
Sets = dict()
for rowID, row in enumerate(Rows):
    for userID in row:
        if Sets.get(userID) is None:
            Sets[userID] = set()
        Sets[userID].add(rowID)
So, now we have a dictionary, which can be used to quickly obtain rownumbers of rows containing given userID.
BadRows = set()
for rowID, row in enumerate(Rows):
    Intersections = dict()
    for userID in row:
        for rowID_cmp in Sets[userID]:
            if rowID_cmp != rowID:
                Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
    # Now Intersections contains info about how many "times"
    # row numbered rowID_cmp intersects current row
    filteredOut = False
    for rowID_cmp in Intersections:
        if Intersections[rowID_cmp] >= filterValue:
            BadRows.add(rowID_cmp)
            filteredOut = True
    if filteredOut:
        BadRows.add(rowID)
Having rownumbers of all filtered out rows saved to BadRows, now we do iteration one last time:
for rowID, row in enumerate(Rows):
    if rowID not in BadRows:
        # output row
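This works in 3 scans. The cost is driven by the number of row pairs that share at least one user ID rather than by all n² pairs, so it is much faster than the nested loop when most rows overlap with only a few others. You may have to rework the iteration over Rows, because it's a file in your case, but that doesn't change much.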
Not sure about python syntax and details, but you get the idea behind my code.
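For reference, the same inverted-index idea can be written as runnable Python roughly as follows. This is a sketch rather than the answerer's exact code: filter_rows is a hypothetical name, rows is assumed to be an in-memory list of iterables of user IDs, and the function returns the indices of the rows to keep.

```python
from collections import defaultdict


def filter_rows(rows, filter_value=3):
    """Return indices of rows sharing fewer than filter_value IDs with every other row."""
    rows = [set(r) for r in rows]

    # Inverted index: user ID -> indices of the rows that contain it
    rows_by_user = defaultdict(set)
    for row_id, row in enumerate(rows):
        for user in row:
            rows_by_user[user].add(row_id)

    bad = set()
    for row_id, row in enumerate(rows):
        shared = defaultdict(int)       # other row -> number of IDs in common
        for user in row:
            for other in rows_by_user[user]:
                if other != row_id:
                    shared[other] += 1
        if any(n >= filter_value for n in shared.values()):
            bad.add(row_id)

    return [i for i in range(len(rows)) if i not in bad]
```

With the sample input from the question (rows 0 and 2 share userID1, userID2 and userID3), only the second and fourth rows are kept.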
First of all, please pack your code into functions which each do one thing well.
def get_data(*args):
    # get the data.
    ...
def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...
def loop_over_some_result(result):
    # insert assertions so that you don't end up looping in infinity:
    assert result is not None
    ...
def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1,L2)
    ...
if __name__ == "__main__":
    myfunc()
then you can easily profile the code using:
if __name__ == "__main__":
    import cProfile
    cProfile.run('myfunc()')
which gives you invaluable insight into your code behaviour and allows you to track down logical bugs. For more on cProfile, see How can you profile a python script?
An option to track down a logical flaw (we're all humans, right?) is to use a timeout function in a decorator like this (python2) or this (python3):
Hereby myfunc can be changed to:
def get_data(*args):
    # get the data.
    ...
def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...
def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    #timeout(10) # seconds <---- the clever bit!
    intersects = find_intersections_sets(L1,L2)
    ...
...where the timeout operation will raise an error if it takes too long.
Here is my best guess:
import ast
def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data
def get_ast_set(line):
    return set(ast.literal_eval(line))
def less_than_x_in_common(set1, set2, limit=3):
    return len(set1.intersection(set2)) < limit
def check_infile(datafile, savefile, filtervalue=3):
    lines = get_data(datafile)
    list1 = [get_ast_set(row) for row in lines]
    outlist = []
    for pos, row in enumerate(list1):
        # keep a row only if it has fewer than `filtervalue` items
        # in common with every other row
        if all(less_than_x_in_common(row, other, limit=filtervalue)
               for i, other in enumerate(list1) if i != pos):
            # write back the original text line, not the set
            outlist.append(lines[pos])
    with open(savefile, 'w') as fo:
        fo.writelines(outlist)
if __name__ == "__main__":
    datafile = str(arg2 + arg1 + "/file")
    savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
    check_infile(datafile, savefile)

Python: performance issues with islice

With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.
import csv
import itertools
from collections import deque
import time
my_queue = deque()
start_row = 500004
stop_row = start_row + 50000
with open('test.csv', 'rb') as fin:
    #load into csv's reader
    csv_f = csv.reader(fin)
    #start logging time for performance
    start = time.time()
    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4])*float(row[10]))
    #stop logging time
    end = time.time()
    #display performance
    print "Initial queue populating time: %.2f" % (end-start)
For example, a start_row of 4 will execute in 1s but a start_row of
500004 will take 11s
That is islice being intelligent. Or lazy, depending on which term you prefer.
Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).
Line 4? Finding line 4 is fast, your computer just needs to find 3 \n. Line 500004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some other sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)
In python
lines_I_want = [(start1, stop1), (start2, stop2),...]
with open(filename) as f:
    for i,j in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)
                if not lines_I_want: #list is empty
                    break
            else:
                #j is a line I want. Do something
                pass
And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
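The single-pass idea can be packaged as a small generator. The sketch below is one possible version, not the answerer's code: iter_ranges is a hypothetical name, and the ranges are assumed to be non-overlapping, half-open (start, stop) pairs of 0-based line indices.

```python
def iter_ranges(lines, ranges):
    """Yield (index, line) for every line whose 0-based index falls inside one
    of the half-open (start, stop) ranges, reading `lines` only once."""
    pending = sorted(ranges)          # assumed non-overlapping
    if not pending:
        return
    current = 0                       # position of the active range in `pending`
    for i, line in enumerate(lines):
        # move past every range that ends at or before this line
        while i >= pending[current][1]:
            current += 1
            if current == len(pending):
                return                # nothing left to collect: stop reading
        if i >= pending[current][0]:
            yield i, line
```

Passing an open file (or a csv.reader wrapped around it) as lines lets you collect all ranges in one sweep, and the generator stops reading as soon as the last range is exhausted.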
The problem with using islice() for what you're doing is that it iterates through all the lines before the first one you want before returning anything. Obviously, the larger the starting row, the longer this will take. Another problem is that you're using a csv.reader to read these lines, which likely incurs unnecessary overhead, since one line of the csv file is usually one row of it. The only time that's not true is when the csv file has string fields that contain embedded newline characters, which in my experience is uncommon.
If this is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.
Another advantage of this approach is that it would allow you to process the chunks in parallel, which I suspect is the real problem you're trying to solve, based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, the following has been written to be compatible with doing that, if that's the case.
import csv
from itertools import islice
import os
import sys
def open_binary_mode(filename, mode='r'):
    """ Open a file proper way (depends on Python version). """
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
              dict(mode=mode, newline=''))
    return open(filename, **kwargs)
def split(infilename, num_chunks):
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration: # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read
            num_rows = 0
            bytes_read = 0
    return chunks
chunks = split('sample_simple.csv', num_chunks=4)
for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)

Upper memory limit?

Is there a limit to memory for python? I've been using a python script to calculate the average values from a file which is a minimum of 150mb big.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to the python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20GB) the minimum size of the a file is 150mb
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)
    count = 0
    for j in list_of_lines:
        count +=1
    for k in range(0,count):
        list_of_lines[k].remove('\n')
    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second. Hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years—most not too major. This is so if folks use it as template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError: # Python 3
    from itertools import zip_longest as izip_longest
GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w') # left in, but nothing written
for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print(' {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
    file_write.write('\n')
file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
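As an illustration of the csv suggestion, column averages can be accumulated one row at a time, roughly as below. This is a sketch under assumptions: column_averages is a hypothetical helper, every row is assumed to have the same number of numeric fields, and empty fields (such as the one produced by a trailing tab) are skipped.

```python
import csv


def column_averages(path):
    """Return the per-column averages of a tab-separated file of numbers."""
    totals = []
    row_count = 0
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            # a trailing tab yields an empty last field; ignore it
            values = [float(field) for field in row if field.strip()]
            if not values:
                continue              # skip blank lines
            if not totals:
                totals = [0.0] * len(values)
            for index, value in enumerate(values):
                totals[index] += value
            row_count += 1
    return [total / row_count for total in totals] if row_count else []
```

Only one row and one list of running totals are held in memory, so the file size no longer matters.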
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
I can probably configure Jython to use more memory (it takes its limits from the JVM).
Test app:
import sys
sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read it line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u: # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n') # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on some 10,000 lines in each file.
I have a variable whose value is say a=.3344
From the file I want the row number of the row whose first column is closest to this variable. For example, it should give row_num='3', as .3432 is closest to it.
I have tried loading the first column into a list, comparing the variable to each element, and taking the index of the closest one.
This method is very time-consuming and slows down my model. I need something very quick, as it will be called at least 1000 times.
I want a method with the least overhead. Can anyone tell me how this can be done very fast?
As the file size is at most 100 KB, can this be done directly, without loading it into a list or anything similar? If yes, how?
Any method quicker than the one above is welcome; I am desperate to improve the speed, so please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')
#root = r'c:\begpython\wavnk'
header = 6
for lst in lists:
    save = database_index[lst]
    #print save
    index, base,abs2, _ , abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/'+ base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using: it copies the file into a list. Please suggest better alternatives.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0,0)
        self._line_length = len(file.readline())
        file.seek(0,2)
        # integer division, so that len() also gets an int on Python 3
        self._len = file.tell() // self._line_length
    def __len__(self):
        return self._len
    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
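import bisect
def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    # bisect_left returns the first index whose value is >= target,
    # so the closest value sits at index i - 1 or at index i
    if i == 0:
        return 0
    if i == len(fw):
        return len(fw) - 1
    if target - fw[i - 1] <= fw[i] - target:
        return i - 1
    else:
        return i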
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_value = None
    for row_num, row in enumerate(file('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)
print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
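A minimal sketch of that suggestion, assuming the first column is sorted in ascending order and has already been read into a list of floats once (closest_index is a hypothetical helper name; it returns a 0-based index):

```python
import bisect


def closest_index(values, target):
    """Return the 0-based index of the element of sorted `values` nearest to target."""
    if not values:
        raise ValueError('values must not be empty')
    i = bisect.bisect_left(values, target)   # first index with values[i] >= target
    if i == 0:
        return 0
    if i == len(values):
        return len(values) - 1
    # the nearest value is one of the two neighbours of the insertion point
    if target - values[i - 1] <= values[i] - target:
        return i - 1
    return i
```

Because the list is built once and each lookup is a binary search, calling this a thousand times on a 10,000-line file costs very little compared with rescanning the file for every query.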
