I am implementing a statistical program and have run into a performance bottleneck. I am hoping the community can point me in the direction of an optimization.
I create a set for each row in a file and find the intersection of that set with the set of every other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is the nested for loop (O(n²)), and the files coming into the program are typically just over 20,000 lines long, which means roughly 4×10⁸ pairwise comparisons. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at my disposal and a fairly fast 4-core Intel i7 processor. I have noticed no significant difference in memory use between copying list1 into a second list for comparison and opening the file again (maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly from the disk, which is slower, but I noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a particular data type, or understand multiprocessing in Python better than I do, and might be able to help me speed things up. I appreciate any help, and I hope my code isn't too poorly written.
import ast, sys, os, shutil

list1 = []
end = 0
filterValue = 3

# creates output file with filterValue appended to name
# (arg1 and arg2 are set elsewhere in the script)
with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets of rows in file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))

        pos = 0
        infile.seek(0)
        for row in infile:
            # if file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
                # get the first set from the list and...
                set1 = set(ast.literal_eval(row))
                # ...find the intersection of every other set in the file
                for i in range(0, len(list1)):
                    # don't compare the set with itself
                    if not(pos == i):
                        set2 = list1[i]
                        set3 = set1.intersection(set2)
                        # if the two sets have less than 3 items in common
                        if(len(set3) < filterValue):
                            # and you've reached the end of the file
                            if(i == len(list1) - 1):
                                # append the row in outfile
                                outfile.write(row)
                        else:
                            break
            else:
                outfile.write(row)
            # increase position in infile
            pos += 1
Sample input would be a file with this format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
The output file if this were the input file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...because the two sets removed contained three or more of the same user IDs.
This answer is not about how to split code into functions, name variables, etc. It's about a faster algorithm in terms of complexity.
I'd use a dictionary. I won't write exact code; you can do it yourself (a runnable sketch follows at the end of this answer).
Sets = dict()
for rowID, row in enumerate(Rows):
    for userID in row:
        if Sets.get(userID) is None:
            Sets[userID] = set()
        Sets[userID].add(rowID)
So now we have a dictionary which can be used to quickly obtain the row numbers of rows containing a given userID.
BadRows = set()
for rowID, row in enumerate(Rows):
    Intersections = dict()
    for userID in row:
        for rowID_cmp in Sets[userID]:
            if rowID_cmp != rowID:
                Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
    # Now Intersections contains info about how many "times"
    # row numbered rowID_cmp intersects the current row
    filteredOut = False
    for rowID_cmp in Intersections:
        if Intersections[rowID_cmp] >= filterValue:
            BadRows.add(rowID_cmp)
            filteredOut = True
    if filteredOut:
        BadRows.add(rowID)
Having the row numbers of all filtered-out rows saved to BadRows, we now iterate one last time:
for rowID, row in enumerate(Rows):
    if rowID not in BadRows:
        # output row
This works in three scans and in O(n log n) time. You may have to rework the iteration over the Rows array, because in your case it's a file, but that doesn't really change much.
I'm not sure about the Python syntax and details, but you get the idea behind my code.
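For concreteness, here is one possible runnable version of the idea above. This is just a sketch, reusing the question's input format and filter value; the function name, file paths, and the defaultdict convenience are my own additions:
import ast
from collections import defaultdict

def filter_rows(in_path, out_path, filter_value=3):
    # parse every row into a set of userIDs
    with open(in_path) as infile:
        rows = [set(ast.literal_eval(line)) for line in infile]

    # inverted index: userID -> set of row numbers containing it
    sets = defaultdict(set)
    for row_id, row in enumerate(rows):
        for user_id in row:
            sets[user_id].add(row_id)

    bad_rows = set()
    for row_id, row in enumerate(rows):
        # count how many userIDs this row shares with each overlapping row
        intersections = defaultdict(int)
        for user_id in row:
            for row_id_cmp in sets[user_id]:
                if row_id_cmp != row_id:
                    intersections[row_id_cmp] += 1
        filtered_out = False
        for row_id_cmp, shared in intersections.items():
            if shared >= filter_value:
                bad_rows.add(row_id_cmp)
                filtered_out = True
        if filtered_out:
            bad_rows.add(row_id)

    # final scan: write every row that was not filtered out
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for row_id, line in enumerate(infile):
            if row_id not in bad_rows:
                outfile.write(line)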
First of all, please pack your code into functions, each of which does one thing well.
def get_data(*args):
    # get the data.
    ...

def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...

def loop_over_some_result(result):
    # insert assertions so that you don't end up looping in infinity:
    assert result is not None
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1, L2)
    ...

if __name__ == "__main__":
    myfunc()
then you can easily profile the code using:
if __name__ == "__main__":
import cProfile
cProfile.run('myfunc()')
which gives you invaluable insight into your code's behaviour and lets you track down logical bugs. For more on cProfile, see How can you profile a Python script?
An option for tracking down a logical flaw (we're all human, right?) is to use a timeout function in a decorator like this (Python 2) or this (Python 3).
With that, myfunc can be changed to:
def get_data(*args):
    # get the data.
    ...

def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    #timeout(10) # seconds <---- the clever bit!
    intersects = find_intersections_sets(L1, L2)
    ...
...where the timeout operation will raise an error if it takes too long.
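The linked implementations are not reproduced here, but a common signal-based variant looks roughly like the following sketch (Unix-only, since it relies on SIGALRM; treat it as an illustration rather than the exact code behind those links):
import signal
from functools import wraps

def timeout(seconds):
    """Raise an error if the decorated function runs longer than `seconds`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise RuntimeError('call timed out after %d seconds' % seconds)
            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator

@timeout(10)
def find_intersections_sets(list1, list2):
    ...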
Here is my best guess:
import ast

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def less_than_x_in_common(set1, set2, limit=3):
    return len(set1.intersection(set2)) < limit

def check_infile(datafile, savefile, filtervalue=3):
    lines = get_data(datafile)
    list1 = [get_ast_set(row) for row in lines]
    outlines = []
    for pos, row in enumerate(list1):
        # keep the row only if it shares fewer than filtervalue elements
        # with every other row in the file
        if all(less_than_x_in_common(row, other, limit=filtervalue)
               for i, other in enumerate(list1) if i != pos):
            outlines.append(lines[pos])
    with open(savefile, 'w') as fo:
        fo.writelines(outlines)

if __name__ == "__main__":
    # arg1, arg2 and filterValue are defined elsewhere, as in the question
    datafile = str(arg2 + arg1 + "/file")
    savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
    check_infile(datafile, savefile)
Related
I have a large (30GB) file consisting of random terms and sentences. I have two separate lists of words and phrases I want to apply to it, marking (or alternatively filtering) the rows in which a term from either list appears.
If a term from list X appears in a row of the large file, mark the row X; if a term from list Y appears, mark it Y. When done, output the rows marked X to one file, and the rows marked Y to a separate file. My problem is that both of my lists are 1,500 terms long, and going through the file line by line takes a while.
After fiddling around for a while, I've arrived at my current method, which filters each chunk on whether it contains a term. My issue is that it is very slow, and I am wondering if there is a way to speed up my script. I am using pandas to process the file in chunks of 1 million rows; it takes around 3 minutes to process a chunk with my current method:
import pandas as pd

white_text_file = open('lists/health_whitelist_final.txt', "r")
white_list = white_text_file.read().split(',\n')

black_text_file = open('lists/health_blacklist_final.txt', "r")
black_list = black_text_file.read().split(',\n')

# chunksize and VERTICAL are defined elsewhere in the script
for chunk in pd.read_csv('final_cleaned_corpus.csv', chunksize=chunksize, names=['Keyword']):
    print("Chunk")
    chunk_y = chunk[chunk['Keyword'].str.contains('|'.join(white_list), na=False)]
    chunk_y.to_csv(VERTICAL+'_y_list.csv', mode='a', header=None)
    chunk_x = chunk[chunk['Keyword'].str.contains('|'.join(black_list), na=False)]
    chunk_x.to_csv(VERTICAL+'_x_list.csv', mode='a', header=None)
My first attempt was less pythonic, but the aim was to break the loop the first time an element appeared; this way was slower than my current one:
def x_or_y(keyword):
    #print(keyword)
    iden = ''
    for item in y_list:
        if item in keyword:
            iden = 'Y'
            break
    for item in x_list:
        if item in keyword:
            iden = 'X'
            break
    return iden
Is there a faster way I'm missing here?
If I understand correctly, you need to do the following:
fileContent = ['yes','foo','junk','yes','foo','junk']
x_list = ['yes','no','maybe-so']
y_list = ['foo','bar','fizzbuzz']

def x_or_y(keyword):
    if keyword in x_list:
        return 'X'
    if keyword in y_list:
        return 'Y'
    return ''

results = map(x_or_y, fileContent)
print(list(results))
Here is an example: https://repl.it/F566
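One caveat: the snippet above tests exact membership (keyword in x_list), while the question's str.contains does substring matching. If substring semantics are needed, a likely win is compiling each alternation pattern once, outside the chunk loop, instead of rebuilding it per chunk. A sketch, with hypothetical list contents standing in for the files read from disk:
import re

# hypothetical stand-ins for the white/black lists read from disk
white_list = ['diabetes', 'blood pressure']
black_list = ['astrology']

# build each pattern once; re.escape guards terms containing regex metacharacters
white_re = re.compile('|'.join(map(re.escape, white_list)))
black_re = re.compile('|'.join(map(re.escape, black_list)))

def x_or_y(keyword):
    if black_re.search(keyword):
        return 'X'
    if white_re.search(keyword):
        return 'Y'
    return ''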
I have one very large text file (27GB) that I am attempting to make smaller by removing lines that are duplicated in a second database, which consists of several files of a more reasonable size (500MB-2GB). I have some functional code; what I am wondering is whether there is any way I can optimize it to run faster in wall-clock time. At the moment, on a small test run with a 1.5GB input and a 500MB filter, this takes about 75 seconds to complete.
I've gone through many iterations of this idea, and this one is currently the best for time. If anyone has ideas for a better logical structure for the filter, I'd love to hear them. Past attempts, all worse than this one: loading the filter into a set and cycling through the input searching for duplicates (about half as fast as this); loading the input into a set and running the filter through a difference_update (almost as fast as this, but doing the reverse of what I wanted); and loading both input and filter into sets in chunks and taking set differences (a horrible, horrible idea that might have worked if my filters were smaller, so that I didn't have to split them).
So, those are all the things I've tried. All of these processes max out the CPU, and my final version runs at about 25-50% disk I/O; the filter and output are on one physical disk, and the input is on another. I am running a dual core and have no idea whether this particular script can be threaded; I've never done any multithreading, so if that's a possibility I'd love to be pointed in the right direction.
Information about the data: as said earlier, the input is many times larger than the filter. I am expecting a very small percentage of duplication. The data is in lines, all of which are under 20 ASCII characters long. The files are all sorted.
I've already changed the order of the three logical statements, based on the expectation that unique input lines will be the majority, then unique filter lines, then duplicates; in the 'best' case of having no duplicates at all, this saved me about 10% of the time.
Any suggestions?
def sortedfilter(input, filter, output):
    file_input = open(input, 'r')
    file_filter = open(filter, 'r')
    file_output = open(output, 'w')
    inline = file_input.next()
    filterline = file_filter.next()
    try:
        while inline and filterline:
            if inline < filterline:
                file_output.write(inline)
                inline = file_input.next()
                continue
            if inline > filterline:
                filterline = file_filter.next()
                continue
            if inline == filterline:
                filterline = file_filter.next()
                inline = file_input.next()
    except StopIteration:
        file_output.writelines(file_input.readlines())
    finally:
        file_filter.close()
        file_input.close()
        file_output.close()
I'd be interested in knowing how this compares; it's basically just playing with the order of iteration a bit.
def sortedfilter(in_fname, filter_fname, out_fname):
    with open(in_fname) as inf, open(filter_fname) as fil, open(out_fname, 'w') as outf:
        ins = inf.next()
        try:
            for fs in fil:
                while ins < fs:
                    outf.write(ins)
                    ins = inf.next()
                while ins == fs:
                    ins = inf.next()
        except StopIteration:
            # reached end of inf before end of fil
            pass
        else:
            # reached end of fil first; pass the current line and
            # the rest of inf through
            outf.write(ins)
            outf.writelines(inf.readlines())
You can do the string comparison operation only once per line by executing cmp(inline, filterline) (note that cmp exists in Python 2 only; it was removed in Python 3):
-1 means inline < filterline
0 means inline == filterline
+1 means inline > filterline
This might get you a few extra percent. So something like:
while inline and filterline:
    comparison = cmp(inline, filterline)
    if comparison == -1:
        file_output.write(inline)
        inline = file_input.next()
        continue
    if comparison == 1:
        filterline = file_filter.next()
        continue
    if comparison == 0:
        filterline = file_filter.next()
        inline = file_input.next()
Seeing that your input files are sorted, the following should work. heapq's merge produces a sorted stream from sorted inputs, upon which groupby operates; groups whose lengths are greater than 1 are discarded. This stream-based approach has relatively low memory requirements:
from itertools import groupby, repeat, izip
from heapq import merge
from operator import itemgetter

with open('input.txt', 'r') as file_input, open('filter.txt', 'r') as file_filter, open('output.txt', 'w') as file_output:
    file_input_1 = izip(file_input, repeat(1))
    file_filter_1 = izip(file_filter, repeat(2))
    gen = merge(file_input_1, file_filter_1)
    gen = ((k, list(g)) for (k, g) in groupby(gen, key=itemgetter(0)))
    gen = (k for (k, g) in gen if len(g) == 1 and g[0][1] == 1)
    for line in gen:
        file_output.write(line)
Consider opening your files as binary so no conversion to unicode needs to be made:
with open(in_fname,'rb') as inf, open(filter_fname,'rb') as fil, open(out_fname, 'wb') as outf:
Is there a limit to memory for Python? I've been using a Python script to calculate average values from a file that is a minimum of 150MB big.
Depending on the size of the file, I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20GB); the minimum size of a file is 150MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count += 1

    for k in range(0, count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications over the years to improve its implementation, most not too major, so that if folks use it as a template it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals is kept, one for each field. When a file has been fully read, the average value of each field can be calculated by dividing the corresponding total by the count of lines read. Once that is done, the averages can be printed out and some of them written to one of the output files. I've also made a conscious effort to use very descriptive variable names to keep it understandable.
try:
    from itertools import izip_longest
except ImportError:    # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print('  {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
        file_write.write('\n')

file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()), which will of course fail if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better to iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script you're doing some very strange things, like first counting all the items in a list and then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much more easily.
This is one of the advantages of high-level languages like Python (as opposed to C, where you do have to do these housekeeping tasks yourself): let Python handle the iteration for you, and collect in memory only what you actually need at any given time.
Also, since it seems that you're processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removal of \n's and so on for you.
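For instance, a minimal sketch using one of the question's file names:
import csv

with open("A1_B1_100000.txt") as input_file:
    for fields in csv.reader(input_file, delimiter='\t'):
        # fields is a list of column strings; csv strips the line ending for you
        numbers = [float(field) for field in fields if field]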
Python can use all the memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about 1959167 [MiB]. On Jython 2.5 it crashes earlier, at about 239000 [MiB]; I could probably configure Jython to use more memory, since it takes its limits from the JVM (e.g. via -Xmx).
Test app:
import sys

sl = []
i = 0
# some magic: 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless your machine has huge amounts of memory.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # this iterates over each line in the file
        # read values from the line and do the necessary calculations
Not only are you reading the whole of each file into memory, but you also laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)

    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages, and (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1
I have two 3GB text files; each file has around 80 million lines, and they share 99.9% identical lines (file A has 60,000 unique lines, file B has 80,000 unique lines).
How can I quickly find those unique lines in the two files? Is there any ready-to-use command-line tool for this? I'm using Python, but I guess it's unlikely that there's an efficient Pythonic method to load the files and compare.
Any suggestions are appreciated.
If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.
I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).
Notes:
1. I only store each line's hash to save space (and time, if paging might occur).
2. Because of the above, I only print out line numbers; if you need the actual lines, you'd just need to read the files in again.
3. I assume that the hash function produces no collisions. This is nearly, but not perfectly, certain.
4. I import hashlib because the built-in hash() function is too short to avoid collisions.
import sys
import hashlib

file = []
lines = []
for i in range(2):
    # open the files named in the command line
    file.append(open(sys.argv[1+i], 'r'))
    # stores the hash value and the line number for each line in file i
    lines.append({})
    # assuming you like counting lines starting with 1
    counter = 1
    while 1:
        # assuming default encoding is sufficient to handle the input file
        line = file[i].readline().encode()
        if not line: break
        hashcode = hashlib.sha512(line).hexdigest()
        lines[i][hashcode] = sys.argv[1+i]+': '+str(counter)
        counter += 1

unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
With 60,000 or 80,000 unique lines you could just create a dictionary entry for each unique line, mapping it to a number: mydict["hello world"] => 1, etc. If your average line is around 40-80 characters, this will be in the neighborhood of 10 MB of memory.
Then read each file, converting it to an array of numbers via the dictionary. Those will fit easily in memory (2 files of 8 bytes * 3GB / 60k lines is less than 1 MB of memory). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.
EDIT:
In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.
#!/usr/bin/python

class Reader:
    def __init__(self, file):
        self.count = 0
        self.dict = {}
        self.file = file

    def readline(self):
        line = self.file.readline()
        if not line:
            return None
        if self.dict.has_key(line):
            return self.dict[line]
        else:
            self.count = self.count + 1
            self.dict[line] = self.count
            return self.count

if __name__ == '__main__':
    print "Type Ctrl-D to quit."
    import sys
    r = Reader(sys.stdin)
    result = 'ignore'
    while result:
        result = r.readline()
        print result
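To apply this to the two-file comparison described above, one possible continuation (a sketch, in the same Python 2 style; file names hypothetical) shares a single line-to-number dictionary across both files and then diffs the resulting number lists:
mydict = {}

def line_numbers(path):
    numbers = []
    for line in open(path):
        if line not in mydict:
            mydict[line] = len(mydict) + 1
        numbers.append(mydict[line])
    return numbers

nums_a = line_numbers('fileA')
nums_b = line_numbers('fileB')

# numbers that occur in exactly one of the files
only_a = set(nums_a) - set(nums_b)
only_b = set(nums_b) - set(nums_a)

# invert the dictionary to recover the actual differing lines
by_number = dict((number, text) for text, number in mydict.items())
for n in sorted(only_a):
    print 'only in fileA:', by_number[n].rstrip()
for n in sorted(only_b):
    print 'only in fileB:', by_number[n].rstrip()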
If I understand correctly, you want the lines of these files without duplicates. This does the job:
uniqA = set(open('fileA', 'r'))
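Presumably the same goes for the other file; a sketch of the complete round trip, with hypothetical output file names:
uniqB = set(open('fileB', 'r'))

# lines that appear in one file but not the other
with open('uniqueA.txt', 'w') as out:
    out.writelines(sorted(uniqA - uniqB))
with open('uniqueB.txt', 'w') as out:
    out.writelines(sorted(uniqB - uniqA))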
Python has difflib, which claims to be quite competitive with other diff utilities; see http://docs.python.org/library/difflib.html
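A minimal sketch of pulling unique lines out of difflib's output (file names hypothetical; note that difflib keeps both sequences in memory, so this suits the smaller end of the sizes discussed here):
import difflib

a = open('fileA').readlines()
b = open('fileB').readlines()

# n=0 suppresses context lines, leaving only the differing lines
for line in difflib.unified_diff(a, b, n=0):
    if line.startswith('-') and not line.startswith('---'):
        print('only in fileA: ' + line[1:].rstrip())
    elif line.startswith('+') and not line.startswith('+++'):
        print('only in fileB: ' + line[1:].rstrip())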
I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on some 10,000 lines in each file.
I have a variable whose value is, say, a=.3344.
From the file, I want to get the row number of the row whose first column is closest to this variable; for example, it should give row_num='3', as .3432 is closest to it.
I have tried loading the first column's elements into a list and then comparing the variable to each element to get the index number.
Doing it that way is very time consuming and slows my model... I want a very quick method, as this needs to be called some 1,000 times at minimum...
I want a method with the least overhead; can anyone please tell me how this can be done very quickly?
As the file size is at most 100KB, can this be done directly without loading it into any list? If yes, how can it be done?
Any method quicker than the one mentioned above is welcome, but I am desperate to improve the speed -- please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6

# lists, database_index, using_data and the join_cost_index_* dicts
# are defined elsewhere in the script
for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, copying the file into a list. Please suggest better alternatives to this.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() / self._line_length

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    # bisect_left puts target between fw[i - 1] and fw[i], so the closest
    # value is one of those two
    if i > 0 and (i == len(fw) or target - fw[i - 1] < fw[i] - target):
        return i - 1
    else:
        return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
Is the data in the file sorted in numerical order? Are all the lines the same length? If not, the simplest approach is best: reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_value = None
    for row_num, row in enumerate(file('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted, then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to a particular line (you can't do this in a normal text file with lines of different lengths), which in turn enables a binary search.
A binary search would be massively faster than a linear search. A linear search will on average read 5,000 lines of a 10,000-line file each time, whereas a binary search would on average read only log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
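A sketch of that suggestion, assuming the first column is already sorted in ascending order (file name hypothetical):
import bisect

# read the first column into a list once; reuse it for every query
with open('numbers.txt') as f:
    values = [float(line.split()[0]) for line in f]

def closest_row_num(target):
    i = bisect.bisect_left(values, target)
    # the closest value is either at the insertion point or just before it
    candidates = [j for j in (i - 1, i) if 0 <= j < len(values)]
    return min(candidates, key=lambda j: abs(values[j] - target))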