Reading Huge File in Python

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.
My current approach is included below. It takes 30 seconds. There must be more efficient Python-fu to get this down to a couple of seconds at most.
# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
    line = map(int, line.split())
    while line[0] == list[pointer].key:
        list[pointer].value = line[1]
        pointer += 1
    while line[0] > list[pointer].key:
        pointer += 1
    if pointer >= len(list) - 1:
        break # end of list; -1 is due to sentinel
Coded binary search + seek solution (thanks kigurai!):
entries = 24935502   # number of entries
width = 18           # fixed width of an entry in the file padded with spaces
                     # at the end of each line

for i, search in enumerate(list): # list contains the list of search keys
    left, right = 0, entries - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) / 2
        fin.seek(mid * width)
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None # for when search key is not found
    search.result = value # store the result of the search

If you only need 200 of 50 million lines, then reading all of it into memory is a waste. I would sort the list of search keys and then apply binary search to the file using seek() or something similar. This way you would not read the entire file to memory which I think should speed things up.

Slight optimization of S.Lott's answer:
from collections import defaultdict

keyValues = defaultdict(list)
targetKeys = # some list of keys as strings
for line in fin:
    key, value = line.split()
    if key in targetKeys:
        keyValues[key].append(value)
Since we're using a dictionary rather than a list, the keys don't have to be numbers. This saves the map() operation and a string-to-integer conversion for each line. If you want the keys to be numbers, do the conversion at the end, when you only have to do it once for each key, rather than for each of 50 million lines.
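For illustration, here is a minimal sketch of that final conversion step; keyValues and the sample data below are assumptions standing in for the dictionary built by the loop above.

from collections import defaultdict

keyValues = defaultdict(list)
keyValues['42'].extend(['7', '9'])    # hypothetical sample data from the scan

# One-time conversion after the scan, instead of calling int() on every line:
numeric = {int(k): [int(v) for v in vals] for k, vals in keyValues.items()}
print(numeric)    # {42: [7, 9]}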

It's not clear what "list[pointer]" is all about. Consider this, however.
from collections import defaultdict

keyValues = defaultdict(list)
targetKeys = # some list of keys
for line in fin:
    key, value = map(int, line.split())
    if key in targetKeys:
        keyValues[key].append(value)

I would use memory-mapping: http://docs.python.org/library/mmap.html.
This way you can use the file as if it's stored in memory, but the OS decides which pages should actually be read from the file.
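As a rough sketch (not the original poster's code), memory-mapping the sorted file and resynchronising on newlines might look like this; the file name and the offset choice are placeholders.

import mmap

with open('data.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        mm.seek(mm.size() // 2)   # jump to an arbitrary byte offset
        mm.readline()             # discard the possibly partial line
        line = mm.readline()      # a complete "key value" line
        if line:
            key, value = map(int, line.split())
            print(key, value)
    finally:
        mm.close()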

Here is a recursive binary search on the text file
import os, stat

class IntegerKeyTextFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.f = open(self.filename, 'r')
        self.getStatinfo()

    def getStatinfo(self):
        self.statinfo = os.stat(self.filename)
        self.size = self.statinfo[stat.ST_SIZE]

    def parse(self, line):
        key, value = line.split()
        k = int(key)
        v = int(value)
        return (k, v)

    def __getitem__(self, key):
        return self.findKey(key)

    def findKey(self, keyToFind, startpoint=0, endpoint=None):
        "Recursively search a text file"
        if endpoint is None:
            endpoint = self.size
        currentpoint = (startpoint + endpoint) // 2
        while True:
            self.f.seek(currentpoint)
            if currentpoint != 0:
                # may not start at a line break! Discard.
                baddata = self.f.readline()
            linestart = self.f.tell()
            keyatpoint = self.f.readline()
            if not keyatpoint:
                # read returned empty - end of file
                raise KeyError('key %d not found'%(keyToFind,))
            k, v = self.parse(keyatpoint)
            if k == keyToFind:
                print 'key found at ', linestart, ' with value ', v
                return v
            if endpoint == startpoint:
                raise KeyError('key %d not found'%(keyToFind,))
            if k > keyToFind:
                return self.findKey(keyToFind, startpoint, currentpoint)
            else:
                return self.findKey(keyToFind, currentpoint, endpoint)
A sample text file created in jEdit seems to work:
>>> i = integertext.IntegerKeyTextFile('c:\\sampledata.txt')
>>> i[1]
key found at 0 with value 345
345
It could definitely be improved by caching found keys and using the cache to determine future starting seek points.

If you have any control over the format of the file, the "sort and binary search" responses are correct. The detail is that this only works with records of a fixed size and offset (well, I should say it only works easily with fixed length records).
With fixed length records, you can easily seek() around the sorted file to find your keys.
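A minimal sketch of that idea, assuming every line is padded to the same width (18 bytes here, as in the question's coded solution); read_record and the width constant are illustrative, not part of the original answer.

WIDTH = 18   # bytes per padded line, including the newline

def read_record(f, i):
    """Return the (key, value) pair stored in record number i."""
    f.seek(i * WIDTH)
    return tuple(map(int, f.readline().split()))

# e.g. read_record(fin, 12345) lands exactly on the 12346th line,
# which is what makes binary search over the file possible.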

One possible optimization is to do a bit of buffering using the sizehint option in file.readlines(..). This allows you to load multiple lines in memory totaling to approximately sizehint bytes.

You need to implement binary search using seek()

Related

How to save and load a large dictionary to storage in python?

I have a 1.5GB dictionary that takes about 90 seconds to calculate, so I want to save it to storage once and load it every time I want to use it again. This creates two challenges:
Loading the file has to take less than 90 seconds.
As RAM is limited (in PyCharm) to ~4GB, it cannot be memory-intensive.
I also need it to be utf-8 capable.
I have tried solutions such as pickle but they always end up throwing a Memory Error.
Notice that my dictionary is made of strings, and thus solutions like in this post do not apply.
Things I do not care about:
Saving time (as long as it's not more than ~20 minutes, as I'm looking to do it once).
How much space it takes in storage to save the dictionary.
How can I do that? thanks
Edit:
I forgot to mention it's a dictionary containing sets, so json.dump() doesn't work as it can't handle sets.
If the dict consumes a lot of memory because it has many items, you could try dumping many smaller dicts and combining them with update:
mk_pickle.py
import pickle

CHUNKSIZE = 10  # You will make this number of course bigger

def mk_chunks(d, chunk_size):
    chunk = {}
    ctr = chunk_size
    for key, val in d.items():
        chunk[key] = val
        ctr -= 1
        if ctr == 0:
            yield chunk
            ctr = chunk_size
            chunk = {}
    if chunk:
        yield chunk

def dump_big_dict(d):
    with open("dump.pkl", "wb") as fout:
        for chunk in mk_chunks(d, CHUNKSIZE):
            pickle.dump(chunk, fout)

# For testing:
N = 1000
big_dict = dict()
for n in range(N):
    big_dict[n] = "entry_" + str(n)

dump_big_dict(big_dict)
read_dict.py
import pickle

d = {}
with open("dump.pkl", "rb") as fin:
    while True:
        try:
            small_dict = pickle.load(fin)
        except EOFError:
            break
        d.update(small_dict)
You could try to generate and save it in parts across several files. I mean: generate some key-value pairs, store them in a file with pickle, delete the dict from memory, then continue until all key-value pairs are exhausted.
Then, to load the whole dict, use dict.update for each part; but that could also run into memory trouble, so instead you can make a class derived from dict which reads the corresponding file on demand according to the key (I mean overriding __getitem__), something like this:
import pickle

class Dict(dict):
    '''assuming the keys are separated into files by alphabetical order,
    each file name taken from its first key'''
    filenames = ['key1', 'key1000', 'key2000']

    def __init__(self):
        super().__init__()
        self.dict = {}

    def __getitem__(self, key):
        if key in self.dict:
            return self.dict[key]
        else:
            del self.dict  # destroy the old part before the new one is loaded
            with open(self.getFileName(key), 'rb') as f:
                self.dict = pickle.load(f)
            return self.dict[key]

    def getFileName(self, key):
        if key in self.filenames:
            return key
        else:
            A = list(sorted(self.filenames + [key]))
            return A[A.index(key) - 1]
Keep in mind that smaller dicts will load faster, so you should experiment to find the right number of files.
You can also let more than one part reside in memory at a time, depending on available memory.

How to use multiprocess in csv.DictReader?

This is a script to calculate a histogram, and I find that the csv lib takes most of the time. How can I run it in parallel?
The input file samtools.depth.gz is 14 GB and contains about 3 billion lines.
SamplesList = ('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D')

from collections import Counter

cDepthCnt = {key:Counter() for key in SamplesList}
cDepthStat = {key:[0,0] for key in SamplesList} # x and x^2

RecordCnt,MaxDepth = inStat('samtools.depth.gz')
print('xxx')

def inStat(inDepthFile):
    import gzip
    import csv
    RecordCnt = 0
    MaxDepth = 0
    with gzip.open(inDepthFile, 'rt') as tsvfin:
        tsvin = csv.DictReader(tsvfin, delimiter='\t', fieldnames=('ChrID','Pos')+SamplesList )
        for row in tsvin:
            RecordCnt += 1
            for k in SamplesList:
                theValue = int(row[k])
                if theValue > MaxDepth:
                    MaxDepth = theValue
                cDepthCnt[k][theValue] += 1
                cDepthStat[k][0] += theValue
                cDepthStat[k][1] += theValue * theValue
    return RecordCnt,MaxDepth
There are ways to read a huge file in chunks and distribute them as lists, like https://stackoverflow.com/a/30294434/159695 :
bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)
However, csv.DictReader only accepts file handles.
There is a way to split to temporary files at https://gist.github.com/jbylund/c37402573a896e5b5fc8 , I wonder whether I can use fifo to do it on-the-fly.
I just found that csv.DictReader accepts any object which supports the iterator protocol and returns a string each time its next() method is called; file objects and list objects are both suitable.
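A tiny check of that point (the sample lines below are made up): csv.DictReader is perfectly happy with a plain list of strings instead of a file handle.

import csv

SamplesList = ('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D')   # as in the question
sample_lines = ['chr1\t100\t3\t5\t0\t2\n', 'chr1\t101\t4\t6\t1\t2\n']

reader = csv.DictReader(sample_lines, delimiter='\t',
                        fieldnames=('ChrID', 'Pos') + SamplesList)
for row in reader:
    print(row['ChrID'], row['Pos'], row['Sample_A'])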
I have modified inStat() to accept lines. Would you please help me complete statPool()?
def statPool(inDepthFile):
    import gzip
    RecordCnt = 0
    MaxDepth = 0
    cDepthCnt = {key:Counter() for key in SamplesList}
    cDepthStat = {key:[0,0,0,0,0] for key in SamplesList} # x and x^2
    with gzip.open(inDepthFile, 'rt') as tsvfin:
        while True:
            lines = tsvfin.readlines(ChunkSize)
            if not lines:
                break
            with Pool(processes=4) as pool:
                res = pool.apply_async(inStat,[lines])
                iRecordCnt,iMaxDepth,icDepthCnt,icDepthStat = res.get()
            RecordCnt += iRecordCnt
            if iMaxDepth > MaxDepth:
                MaxDepth = iMaxDepth
            for k in SamplesList:
                cDepthCnt[k].update(icDepthCnt[k])
                cDepthStat[k][0] += icDepthStat[k][0]
                cDepthStat[k][1] += icDepthStat[k][1]
    return RecordCnt,MaxDepth,cDepthCnt,cDepthStat
I think asyncio.Queue seems to be a good way to pipe chunks to multiple csv.DictReader workers.
Looking up things in global scope takes longer than looking up stuff in local scope.
You do a lot of lookups - I suggest changing your code to:
cDepthCnt = {key:Counter() for key in SamplesList}
cDepthStat = {key:[0,0] for key in SamplesList} # x and x^2

RecordCnt,MaxDepth = inStat('samtools.depth.gz', cDepthCnt, cDepthStat)
print('xxx')

def inStat(inDepthFile, depthCount, depthStat):
    # use the local depthCount, depthStat
to speed that part up by some.
Running parallelized when accessing the same keys over and over will introduce locks on those values to avoid mishaps - locking/unlocking takes time as well. You would have to see if it is faster.
All you do is summing up values - you could partition your data, use the 4 parts for 4 (times 2) different dictionaries, and afterwards add the 4 dicts up into your global one to avoid locks.
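A rough sketch of that partition-and-merge idea (not the poster's code): each worker tallies its own chunk into private Counters and the results are added up afterwards, so nothing is shared and nothing needs locking. The sample chunks and column layout below are assumptions.

from collections import Counter
from multiprocessing import Pool

SamplesList = ('Sample_A', 'Sample_B')   # a hypothetical two-sample subset

def count_chunk(lines):
    # Tally one chunk with private Counters; no shared state, so no locks.
    counts = {k: Counter() for k in SamplesList}
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        for i, k in enumerate(SamplesList):
            counts[k][int(fields[2 + i])] += 1   # columns after ChrID, Pos
    return counts

def merge(partials):
    total = {k: Counter() for k in SamplesList}
    for part in partials:
        for k in SamplesList:
            total[k].update(part[k])
    return total

if __name__ == '__main__':
    chunks = [['chr1\t1\t3\t5\n', 'chr1\t2\t4\t6\n'],   # stand-ins for real chunks
              ['chr1\t3\t3\t5\n']]
    with Pool(processes=2) as pool:
        merged = merge(pool.map(count_chunk, chunks))
    print(merged)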

Python compare every line in file with all others

I am implementing a statistical program and have created a performance bottleneck and was hoping that I could obtain some help from the community to possibly point me in the direction of optimization.
I am creating a set for each row in a file and finding the intersection of that set with the set of every other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop (O(n²)) and the files coming into the program are typically just over 20,000 lines long. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at my disposal and a significantly quick 4-core Intel i7 processor. I have noticed no significant difference in memory use by copying list1 and using a second list for comparison instead of opening the file again (maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly to the HDD, which is slower, but noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a certain datatype or maybe better understands multiprocessing in Python and that they might be able to help me speed things up. I appreciate any help and I hope my code isn't too poorly written.
import ast, sys, os, shutil

list1 = []
end = 0
filterValue = 3

# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets of rows in file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))

        infile.seek(0)
        for row in infile:
            # if file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
                # get the first set from the list and...
                set1 = set(ast.literal_eval(row))
                # ...find the intersection of every other set in the file
                for i in range(0, len(list1)):
                    # don't compare the set with itself
                    if not(pos == i):
                        set2 = list1[i]
                        set3 = set1.intersection(set2)
                        # if the two sets have less than 3 items in common
                        if(len(set3) < filterValue):
                            # and you've reached the end of the file
                            if(i == len(list1)):
                                # append the row in outfile
                                outfile.write(row)
                                # increase position in infile
                                pos += 1
                        else:
                            break
            else:
                outfile.write(row)
Sample input would be a file with this format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
The output file if this were the input file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...because the two sets removed contained three or more of the same user IDs.
This answer is not about how to split code into functions, name variables, etc. It's about a faster algorithm in terms of complexity.
I'd use a dictionary. I will not write exact code; you can do it yourself.
Sets = dict()
for rowID, row in enumerate(Rows):
    for userID in row:
        if Sets.get(userID) is None:
            Sets[userID] = set()
        Sets[userID].add(rowID)
So, now we have a dictionary which can be used to quickly obtain the row numbers of rows containing a given userID.
BadRows = set()
for rowID, row in enumerate(Rows):
    Intersections = dict()
    for userID in row:
        for rowID_cmp in Sets[userID]:
            if rowID_cmp != rowID:
                Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
    # Now Intersections contains info about how many "times"
    # row numbered rowID_cmp intersects the current row
    filteredOut = False
    for rowID_cmp in Intersections:
        if Intersections[rowID_cmp] >= filterValue:
            BadRows.add(rowID_cmp)
            filteredOut = True
    if filteredOut:
        BadRows.add(rowID)
Having saved the row numbers of all filtered-out rows to BadRows, we now iterate one last time:
for rowID, row in enumerate(Rows):
    if rowID not in BadRows:
        # output row
This works in 3 scans and in O(n log n) time. Maybe you'd have to rework iterating the Rows array, because it's a file in your case, but that doesn't really change much.
Not sure about Python syntax and details, but you get the idea behind my code.
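For convenience, here is one possible runnable rendering of that sketch, assuming Rows is a list of userID lists already read from the file; the names follow the sketch above and are otherwise illustrative.

def filter_rows(Rows, filterValue=3):
    # Inverted index: userID -> set of row numbers containing it.
    Sets = {}
    for rowID, row in enumerate(Rows):
        for userID in row:
            Sets.setdefault(userID, set()).add(rowID)

    BadRows = set()
    for rowID, row in enumerate(Rows):
        # Count how many userIDs each other row shares with this one.
        Intersections = {}
        for userID in row:
            for rowID_cmp in Sets[userID]:
                if rowID_cmp != rowID:
                    Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
        filteredOut = False
        for rowID_cmp, shared in Intersections.items():
            if shared >= filterValue:
                BadRows.add(rowID_cmp)
                filteredOut = True
        if filteredOut:
            BadRows.add(rowID)

    return [row for rowID, row in enumerate(Rows) if rowID not in BadRows]

# With the question's sample input this keeps the 2nd and 4th rows:
rows = [['userID1', 'userID2', 'userID3'],
        ['userID5', 'userID3', 'userID9'],
        ['userID10', 'userID2', 'userID3', 'userID1'],
        ['userID8', 'userID20', 'userID11', 'userID1']]
print(filter_rows(rows))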
First of all, please pack your code into functions which each do one thing well.
def get_data(*args):
    # get the data.

def find_intersections_sets(list1, list2):
    # do the intersections part.

def loop_over_some_result(result):
    # insert assertions so that you don't end up looping in infinity:
    assert result is not None
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1, L2)
    ...

if __name__ == "__main__":
    myfunc()
then you can easily profile the code using:
if __name__ == "__main__":
    import cProfile
    cProfile.run('myfunc()')
which gives you invaluable insight into your code behaviour and allows you to track down logical bugs. For more on cProfile, see How can you profile a python script?
An option to track down a logical flaw (we're all humans, right?) is to use a timeout function in a decorator like this (python2) or this (python3):
Hereby myfunc can be changed to:
def get_data(*args):
    # get the data.

def find_intersections_sets(list1, list2):
    # do the intersections part.

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    #timeout(10) # seconds <---- the clever bit!
    intersects = find_intersections_sets(L1, L2)
    ...
...where the timeout operation will raise an error if it takes too long.
Here is my best guess:
import ast

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def less_than_x_in_common(set1, set2, limit=3):
    if len(set1.intersection(set2)) < limit:
        return True
    else:
        return False

def check_infile(datafile, savefile, filtervalue=3):
    list1 = [get_ast_set(row) for row in get_data(datafile)]
    outlist = []
    for row in list1:
        if any([less_than_x_in_common(set(row), set(i), limit=filtervalue) for i in outlist]):
            outlist.append(row)
    with open(savefile, 'w') as fo:
        fo.writelines(outlist)

if __name__ == "__main__":
    datafile = str(arg2 + arg1 + "/file")
    savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
    check_infile(datafile, savefile)

Python Linear Search Better Efficiency

I've got a question regarding Linear Searching in Python. Say I've got the base code of
for l in lines:
    for f in search_data:
        if my_search_function(l[1], [f[0], f[2]]):
            print "Found it!"
            break
in which we want to determine where in search_data the value stored in l[1] exists. Say my_search_function() looks like this:
def my_search_function(search_key, search_values):
    for s in search_values:
        if search_key in s:
            return True
    return False
Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.
for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) / 2 if len(search_data) % 2 == 0 else (len(search_data) - 1) / 2
    found = False
    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0], search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index + 1
        negative_index = negative_index - 1
However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing speed in half as I'm working with large amounts of CSV and the processing time for one file is > 00:15 which is unacceptable as I'm processing batches of 30+ files.

Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!
Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.
Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.
Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)
Code to illustrate the second approach:
lines = ['AS12', 'AS123', 'AS123J', 'AS123JK', 'AS124']
search_keys = ['AS123', 'AS125']

try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration: # all keys processed
    pass
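For completeness, here is a possible sketch of the first (bisect-based) approach, under the same assumption of prefix rather than arbitrary-substring matching; flattening the strings into (string, original index) pairs is illustrative.

import bisect

lines = ['AS12', 'AS123J', 'AS124', 'AS123JK', 'AS123']
search_keys = ['AS123', 'AS125']

flat = sorted((s, i) for i, s in enumerate(lines))   # keep a reference to the original index

for key in sorted(search_keys):
    pos = bisect.bisect_left(flat, (key, -1))
    # All strings starting with `key` sit in one contiguous run from `pos`.
    while pos < len(flat) and flat[pos][0].startswith(key):
        s, original_index = flat[pos]
        print('Line {} ({}) matches {}'.format(original_index, s, key))
        pos += 1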
It depends on the details of the problem.
For instance if you search for complete words, you could create a hashtable on searchable elements, and the final search would be a simple lookup.
Filling the hashtable is pseudo-linear.
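A minimal illustration of that idea for exact (complete-word) matches; the sample rows below are hypothetical.

rows = [('row0', 'AS123JK'), ('row1', 'AS123'), ('row2', 'AS125')]

# One-time, roughly linear build of the lookup table: value -> row number.
index = {value: i for i, (_, value) in enumerate(rows)}

for key in ('AS123', 'AS999'):
    hit = index.get(key)
    if hit is not None:
        print(key, '-> row', hit)
    else:
        print(key, '-> not found')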
Ultimately, I broke down and implemented binary search on my multidimensional lists by sorting with the sorted() function and a lambda as the key argument. Here is the first-pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were.
def binary_search(master_row, source_data, master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2
        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
            if search([source_data[middle_pos][source_search_index]], [master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
            lower_bound = middle_pos + 1
        elif source_data[middle_pos][source_search_index] > master_row[master_search_index]:
            if search([master_row[master_search_index]], [source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
            upper_bound = middle_pos - 1
        else:
            if len(source_data[middle_pos][source_search_index]) > 5:
                return {"result": True, "index": middle_pos}
            else:
                break
and then where we actually make the Binary Search call
# where master_copy is the first multidimensional list, data_copy is the second
# the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy, m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count + 1
                break
        if found:
            break
Here's the info for sorting a multidimensional list: Python Sort Multidimensional Array Based on 2nd Element of Subarray

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on, for some 10,000 lines in each file.
I have a variable whose value is, say, a = .3344.
From the file I want to get the row number of the row whose first column is closest to this variable... for example, it should give row_num = '3', as .3432 is closest to it.
I have tried loading the first column's elements into a list and then comparing the variable to each element and getting the index number.
Doing it this way is very time-consuming and slows my model... I want a very quick method, as this needs to be called some 1000 times minimum...
I want a method with the least overhead that is very quick; can anyone please tell me how it can be done very fast?
As the file size is at most 100 KB, can this be done directly without loading it into any list or anything? If yes, how can it be done?
Any method quicker than the one mentioned above is welcome, but I am desperate to improve the speed -- please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6

for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, copying the file into a list. Please give better alternatives to this.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() / self._line_length

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    if fw[i + 1] - target < target - fw[i]:
        return i + 1
    else:
        return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_value = None
    for row_num, row in enumerate(file('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
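A small sketch of that idea; the sample values mirror the question's data and are assumed to be sorted, as stated.

import bisect

values = [0.2323, 0.2327, 0.3432, 0.4543]   # first column, already sorted
a = 0.3344

i = bisect.bisect_left(values, a)
# The closest entry is either values[i] or values[i - 1].
candidates = [j for j in (i - 1, i) if 0 <= j < len(values)]
row_num = min(candidates, key=lambda j: abs(values[j] - a))
print(row_num, values[row_num])   # 2 0.3432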
