Efficient Python way to process two huge files?

I am working on a problem where I have to find if a number falls within a certain range. However, the problem is complicated due to the fact that the files I am dealing with have hundreds of thousands of lines.
Below I try to explain the problem in as simple a language as possible.
Here is a brief description of my input files :
File Ranges.txt has some ranges whose min and max are tab separated.
10 20
30 40
60 70
This can have about 10,000,000 such lines with ranges.
NOTE: The ranges never overlap.
File Numbers.txt has a list of numbers and some values associated with each number.
12 0.34
22 0.14
34 0.79
37 0.87
And so on. Again there are hundreds of thousands of such lines with numbers and their associated values.
What I wish to do is take every number from Numbers.txt and check if it falls within any of the ranges in Ranges.txt.
For all such numbers that fall within a range, I have to get a mean of their associated values (ie a mean per range).
For example, in Numbers.txt above, two numbers, 34 and 37, fall within the range 30-40 in Ranges.txt, so for the range 30-40 I have to calculate the mean of their associated values (i.e. the mean of 0.79 and 0.87), which is 0.83.
My final output file should be the Ranges.txt but with the mean of the associated values of all numbers falling within each range. Something like :
Output.txt
10 20 <mean>
30 40 0.83
60 70 <mean>
and so on.
Would appreciate any help and ideas on how this can be written efficiently in Python.

Obviously you need to run each line from Numbers.txt against each line from Ranges.txt.
You could just iterate over Numbers.txt, and, for each line, iterate over Ranges.txt. But this will take forever, reading the whole Ranges.txt file millions of times.
You could read both of them into memory, but that will take a lot of storage, and it means you won't be able to do any processing until you've finished reading and preprocessing both files.
So, what you want to do is read Ranges.txt into memory once and store it as, say, a list of pairs of ints instead, but read Numbers.txt lazily, iterating over the list for each number.
This kind of thing comes up all the time. In general, you want to make the bigger collection into the outer loop, and make it as lazy as possible, while the smaller collection goes into the inner loop, and is pre-processed to make it as fast as possible. But if the bigger collection can be preprocessed more efficiently (and you have enough memory to store it!), reverse that.
And speaking of preprocessing, you can do a lot better than just reading into a list of pairs of ints. If you sort the ranges, you can find the closest range without going over by bisecting, and then check just that one range (about 18 steps), instead of checking each range exhaustively (100000 steps).
This is a bit of a pain with the stdlib, because it's easy to make off-by-one errors when using bisect, but there are plenty of ActiveState recipes to make it easier (including one linked from the official docs), not to mention third-party modules like blist or bintrees that give you a sorted collection in a simple OO interface.
So, something like this pseudocode:
with open('ranges.txt') as f:
    # tuples, so the ranges can be sorted and used as dict keys
    ranges = sorted(tuple(map(int, line.split())) for line in f)
range_values = {}
with open('numbers.txt') as f:
    rows = (line.split() for line in f)
    for number, value in rows:
        number, value = int(number), float(value)
        # pseudocode: use the sorted ranges to find the range r containing number (skip the row if there is none)
        range_values.setdefault(r, []).append(value)
with open('output.txt', 'w') as f:
    for r, values in sorted(range_values.items()):
        mean = sum(values) / len(values)
        f.write('{} {} {}\n'.format(r[0], r[1], mean))
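The lookup step that the pseudocode leaves open could be filled in with the standard library's bisect module. This is only a minimal sketch, assuming the ranges never overlap and both bounds are inclusive; find_range and starts are illustrative names, not part of the original answer.

```python
import bisect

def find_range(ranges, starts, number):
    """Return the (low, high) pair containing number, or None.

    ranges must be sorted and non-overlapping; starts holds the low
    bound of each range in the same order (both are assumptions).
    """
    # Index of the last range whose start is <= number
    i = bisect.bisect_right(starts, number) - 1
    if i >= 0 and number <= ranges[i][1]:
        return ranges[i]
    return None

ranges = sorted([(30, 40), (10, 20), (60, 70)])
starts = [low for low, _ in ranges]
```

Building starts once up front keeps each lookup at a single bisection plus one comparison.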
By the way, if the parsing turns out to be any more complicated than just calling split on each line, I'd suggest using the csv module… but it looks like that won't be a problem here.
What if you can't fit Ranges.txt into memory, but can fit Numbers.txt? Well, you can sort that, then iterate over Ranges.txt, find all of the matches in the sorted numbers, and write the results out for that range.
This is a bit more complicated, because you have to bisect_left and bisect_right and iterate over everything in between. But that's the only way in which it's any harder. (And here, a third-party class will help even more. For example, with a bintrees.FastRBTree as your sorted collection, it's just sorted_number_tree[low:high].)
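With plain lists, the bisect_left / bisect_right combination might look like the sketch below. It assumes the numbers are sorted, that values is a parallel list in the same order, and that both range bounds are inclusive; values_in_range is a hypothetical helper name.

```python
import bisect

def values_in_range(sorted_numbers, values, low, high):
    """Return the values whose number n satisfies low <= n <= high."""
    lo = bisect.bisect_left(sorted_numbers, low)    # first index with n >= low
    hi = bisect.bisect_right(sorted_numbers, high)  # first index with n > high
    return values[lo:hi]

numbers = [12, 22, 34, 37]
values = [0.34, 0.14, 0.79, 0.87]
matches = values_in_range(numbers, values, 30, 40)
mean = sum(matches) / len(matches) if matches else None
```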
If the ranges can overlap, you need to be a bit smarter—you have to find the closest range without going over the start, and the closest range without going under the end, and check everything in between. But the main trick there is the exact same one used for the last version. The only other trick is to keep two copies of ranges, one sorted by the start value and one by the end, and you'll need to have one of them be a map to indices in the other instead of just a plain list.

The naive approach would be to read Numbers.txt into some structure in number order, then read each line of Ranges, use a binary search to find the lowest number in the range, and then read through the numbers higher than that to find all those within the range, so that you can produce the corresponding line of output.
I assume the problem is that you can't have all of Numbers in memory.
So you could do the problem in phases, where each phase reads a portion of Numbers in, then goes through the process outlined above, but using an annotated version of Ranges, where each line includes the COUNT of the values that have produced that mean so far, and writes out a similarly annotated version.
Obviously, the initial pass will not have an annotated version of Ranges, and the final pass will not produce one.
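The reason for carrying the count along is that a stored mean can only be updated correctly if you know how many values produced it. A small sketch of that update step, with merge_mean as a hypothetical helper name:

```python
def merge_mean(old_mean, old_count, new_values):
    """Combine a stored (mean, count) with the values found in this phase."""
    new_count = old_count + len(new_values)
    if new_count == 0:
        return 0.0, 0  # no values seen yet for this range
    # old_mean * old_count recovers the sum of the earlier values
    total = old_mean * old_count + sum(new_values)
    return total / new_count, new_count

# Two phases contributing to the range 30-40 from the question
mean, count = merge_mean(0.0, 0, [0.79])
mean, count = merge_mean(mean, count, [0.87])
```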

It looks like your data in both the files are already sorted. If not, first sort them by an external tool or using Python.
Then, you can go through the two files in parallel. You read a number from the Numbers.txt file, and see if it is in a range in the Ranges.txt file, reading as many lines from that file as needed to answer that question. Then read the next number from Numbers.txt, and repeat. The idea is similar to merging two sorted arrays, and should run in O(n+m) time, where n and m are the sizes of the files. If you need to sort the files first, the run time is O(n lg(n) + m lg(m)). Here is a quick program I wrote to implement this:
import sys
from collections import Counter

class Gen(object):
    __slots__ = ('rfp', 'nfp', 'mn', 'mx', 'num', 'val', 'd', 'n')

    def __init__(self, ranges_filename, numbers_filename):
        self.d = Counter()  # sum of associated values keyed by range
        self.n = Counter()  # number of elements keyed by range
        self.rfp = open(ranges_filename)
        self.nfp = open(numbers_filename)
        # Read the first number (with its value) and the first range
        self.num, self.val = self.read_number()
        self.mn, self.mx = [int(x) for x in next(self.rfp).split()]  # Current range

    def read_number(self):
        # Each line of the numbers file is "<number> <value>"
        num, val = next(self.nfp).split()
        return float(num), float(val)

    def go(self):
        while True:
            if self.mx < self.num:
                # The current range ends below the number: move to the next range
                try:
                    self.mn, self.mx = [int(x) for x in next(self.rfp).split()]
                except StopIteration:
                    break
            else:
                if self.mn <= self.num <= self.mx:
                    self.d[(self.mn, self.mx)] += self.val
                    self.n[(self.mn, self.mx)] += 1
                try:
                    self.num, self.val = self.read_number()
                except StopIteration:
                    break
        self.nfp.close()
        self.rfp.close()
        return self.d, self.n

def run(ranges_filename, numbers_filename):
    r = Gen(ranges_filename, numbers_filename)
    d, n = r.go()
    for mn, mx in sorted(d):
        s, N = d[(mn, mx)], n[(mn, mx)]
        if s:
            av = s/N
        else:
            av = 0
        sys.stdout.write('%d %d %.3f\n' % (mn, mx, av))
On files with 10,000,000 numbers in each of the files, the above runs in about 1.5 minutes on my computer without the output part.

Related

Efficient algorithm for only keeping distant values

I have a list of values which might look something like this: [500,501,809,702,808,807,703,502,499] and I would like to only keep the first instance of each number within a certain distance. In other words, I'd like to get the list: [500,809,702] because the other numbers are within a certain distance from those numbers. So it would keep 500, skip 501 because it's too close, keep 809 because it's far away from the already selected values, keep 702, etc.
Here's my current solution:
import numpy as np
vals = ... #the original data
result = []
tolerance = 50
for i in vals:
    if not len(np.where(np.abs(np.array(result) - i) < tolerance)[0]):
        result.append(i)
This works fine, but it's too slow for my purposes (I'm dealing with 2.4 million elements in the list). Is there an efficient solution to this problem? Thanks!
EDIT: Just to clarify, I need to keep the first element of each group, not the smallest element (i.e. [499, 702, 807] would not be a valid result in the above example), so sorting it might not help so much.

vals = [500,501,809,702,808,807,703,502,499]
close_set = set()
tolerance = 5
result = []
for e in vals:
    if e in close_set:
        continue
    else:
        result.append(e)
        close_set.update(range(e-tolerance, e+tolerance+1))
print(result) # [500, 809, 702]
This should be pretty fast (I tested it on a list of 1,000,000 elements and it took ~3 seconds). For each element in the list, you check to see if a close value has been seen before by checking for membership in the set of close numbers, which is O(1). If it's not, you add it to your results and then update the set of close numbers.

A better solution is to use a SortedSet from http://www.grantjenks.com/docs/sortedcontainers/index.html.
Before inserting an element, use irange to check for any existing values within +- tolerance of it. If nothing is there, add the element.
This solution should be at least an order of magnitude faster than the close_set approach already suggested, and an order of magnitude better as well on memory usage. Plus it will work for floats as well as integers if you should need that.
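The same "look at the sorted neighbours before inserting" idea can be sketched with only the standard library, using bisect on a plain sorted list in place of SortedSet. This is a stand-in, not the sortedcontainers API: list.insert is O(n), so SortedSet will likely scale better on very large inputs. keep_distant is a hypothetical name, and "too close" is taken to mean a difference strictly less than tolerance, as in the question's code.

```python
import bisect

def keep_distant(vals, tolerance):
    kept_sorted = []  # kept values in sorted order, for neighbour lookups
    result = []       # kept values in their original order
    for v in vals:
        i = bisect.bisect_left(kept_sorted, v)
        # Only the nearest kept value on each side can be within tolerance
        near_left = i > 0 and v - kept_sorted[i - 1] < tolerance
        near_right = i < len(kept_sorted) and kept_sorted[i] - v < tolerance
        if not (near_left or near_right):
            kept_sorted.insert(i, v)
            result.append(v)
    return result

vals = [500, 501, 809, 702, 808, 807, 703, 502, 499]
kept = keep_distant(vals, 50)
```

Because it compares values instead of enumerating integers, this also works for floats.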

Can I set a max file size using the filterwriter in Python?

I have a rather simple question. I have a very large list defined in Python, and if I output it to one text file, the file grows to about 200 MB, which I cannot open easily.
I was wondering whether Python has any option to set a maximum size for a file being written and to create a new file once that size is exceeded.
To summarize:
Current situation: 1 file (200mb)
Desired situation: 8 files (25mb each)
Code so far:
file = open("output_users.txt", "w")
file.write("Total number of users: " + str(len(user_id)))
file.write(str(user_id))
file.close()

There isn't a built-in way to do that in open(). What I would suggest is that you break up your data into several chunks, then open a different file per chunk. E.g., say you have just over ten thousand items (I use integers here for simplicity, but they could be user records or whatever you're working with) to process. You could split them into ten chunks like so, using the itertools module's groupby function to make your job a bit easier:
import itertools
original_data = range(10003) # Note how this is *not* divisible by 10
num_chunks = 10
length_of_one_chunk = len(original_data) // num_chunks
chunked_data = []
def keyfunc(t):
    # Given a tuple of (index, data_item), return the index
    # divided by N where N is the length of one chunk. This
    # will produce the value 0 for the first N items, then 1
    # for the next N items, and so on, making this very
    # suitable for passing into itertools.groupby.
    # Note the // operator, which means integer division
    return (t[0] // length_of_one_chunk)
for n, chunk in itertools.groupby(enumerate(original_data), keyfunc):
    # Drop the index added by enumerate so each chunk holds only the data items
    chunked_data.append([item for index, item in chunk])
This will produce a chunked_data list with a length of 11; each of its elements is a list of data items (in this case, they're just integers). The first ten items of chunked_data will all have N items, where N is the value of length_of_one_chunk (in this case, precisely 1000). The last element of chunked_data will be a list of the 3 leftover items that didn't fit evenly across the other lists; you could write them to a separate file, or just append them to the end of the last file.
If you change the range(10003) to range(10027), then N would be 1002 and the last element would have 7 leftover items. And so on.
Then you just run chunked_data through a for loop, and for each list inside it, process the data normally, opening a new file each time. And you'll have your 10 files (or 8, or whatever you set num_chunks to).
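If the goal is a cap on file size rather than a fixed number of chunks, one possible approach is to count bytes while writing and start a new file when the next line would exceed the cap. The helper name write_in_parts, the file naming scheme, and the one-item-per-line layout are all assumptions made for this sketch.

```python
import os

def write_in_parts(lines, directory, prefix, max_bytes):
    """Write lines to prefix_0.txt, prefix_1.txt, ... in directory.

    Each file stays at or under max_bytes, except that a single line
    longer than max_bytes still gets a file of its own.
    Returns the list of paths written, in order.
    """
    paths = []
    out = None
    written = 0
    try:
        for line in lines:
            data = (line + "\n").encode("utf-8")
            # Start a new file at the beginning, or when this line would overflow
            if out is None or (written and written + len(data) > max_bytes):
                if out is not None:
                    out.close()
                path = os.path.join(directory, "%s_%d.txt" % (prefix, len(paths)))
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(data)
            written += len(data)
    finally:
        if out is not None:
            out.close()
    return paths
```

For the case in the question, something like write_in_parts(map(str, user_id), ".", "output_users", 25 * 1024 * 1024) would give files of at most 25 MB each, assuming user_id is a list of items that each fit on one line.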

Function awfully slow

I was looking for historical data from our Brazilian stock market and found it at Bovespa's website.
The problem is the format the data is in is terrible, it is mingled with all sorts of other information about any particular stock!
So far so good! A great opportunity to test my fresh python skills (or so I thought)!
I managed to "organize/parse" pretty much all of the data with a few lines of code,
and then stumbled on a very annoying fact about the data. The very information I needed, stock prices(open, high, low, close), had no commas and was formatted like this: 0000000011200, which would be equivalent to 11 digits before the decimal comma.
So basically 0000000011200 = 112,00... You get the gist..
I wrote a few lines of code to edit that and then the nightmare kicked in.
The whole data set is around 358K rows long, and with my current script, the deeper it runs into the list, the longer each edit takes.
Here is the code snippet I used for that:
#profile
def dataFix(datas):
    x = 0
    for entry in datas:
        for i in range(9, 16):
            data_org[datas.index(entry)][i] = entry[i][:11]+'.'+entry[i][11:]
        x += 1
        print x
Would anyone mind shining some light into this matter?

datas.index(entry)
There's your problem. datas.index(entry) requires Python to go through the datas list one element at a time, searching for entry. It's an incredibly slow way to do things, slower the bigger the list is, and it doesn't even work, because duplicate elements are always found at their first occurrence instead of the occurrence you're processing.
If you want to use the indices of the elements in a loop, use enumerate:
for index, entry in enumerate(datas):
    ...
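Applied to the function from the question, the enumerate version might look like the sketch below. It assumes data_org has the same shape as datas and passes it in as a parameter instead of relying on a global; data_fix is an illustrative name.

```python
def data_fix(datas, data_org):
    """Insert a decimal point after the 11th digit of columns 9..15."""
    for index, entry in enumerate(datas):
        # index comes straight from enumerate, so no list search is needed
        for i in range(9, 16):
            data_org[index][i] = entry[i][:11] + '.' + entry[i][11:]

# Two identical rows: datas.index(entry) would find the first one both times
row = ['x'] * 9 + ['0000000011200'] * 7
datas = [list(row), list(row)]
data_org = [list(r) for r in datas]
data_fix(datas, data_org)
```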

First, it is probably easier to convert the prices directly to a more usable format.
For example, the Decimal type lets you do calculations easily without losing precision.
Secondly, I think you don't even need the index and can just use append.
Thirdly, say hello to list comprehensions and slices :P
from decimal import Decimal
data_org = []
for entries in datas:
    data_org.append([Decimal(entry).scaleb(-2) for entry in entries[9:16]])
or even:
data_org = [[Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas]
or in a generator form:
data_org = ([Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas)
or, if you want to keep the text form:
data_org = [['.'.join((entry[:-2], entry[-2:])) for entry in entries[9:16]] for entries in datas]
(Replacing [:11] with [:-2] makes this independent of the input length: it always takes the last 2 digits as the decimals.)

Program running too slow! : Suggest Algorithmic/Implementation Optimization

I have a huge Python list (A) of lists. The length of list A is around 90,000. Each inner list contains around 700 tuples of (datetime.date, string). Now, I am analyzing this data. I take a window of size x over each inner list, where x = len(inner list) * (some fraction <= 1), and I save each ordered pair (a, b) where a occurs before b in that window (the inner lists are sorted with respect to time). I move this window up to the last element, adding one element at one end and removing one from the other, which takes O(window size) time per step because I only consider the new tuples. My code:
for i in xrange(window_size):
    j = i+1
    while j<window_size:
        check_and_update(cur, my_list[i][1], my_list[j][1],log)
        j=j+1
i=1
while i<=len(my_list)-window_size:
    j=i
    k=i+window_size-1
    while j<k:
        check_and_update(cur, my_list[j][1], my_list[k][1],log)
        j+=1
    i += 1
Here cur is a sqlite3 database cursor, my_list is a list containing the tuples (I run this code for every list in A), and log is an open log file. In check_and_update() I look up the tuple in my database and insert it if it does not exist, along with its total number of occurrences so far. Code:
def check_and_update(cur,start,end,log):
    t = str(start)+":"+ str(end)
    cur.execute("INSERT OR REPLACE INTO Extra (tuple,count)\
    VALUES ( ? , coalesce((SELECT count +1 from Extra WHERE tuple = ?),1))",[t,t])
As expected, this number of tuples is HUGE, and I have previously experimented with a dictionary, which eats up memory quite fast. So I resorted to SQLite3, but now it is too slow. I have tried indexing, but it did not help. My program is probably spending way too much time querying and updating the database. Do you have any optimization ideas for this problem? Perhaps a change of algorithm or a different approach or tool. Thank you!
Edit: My goal here is to find the total number of tuples of strings that occur within the window grouped by the number of different innerlists they occur in. I extract this information with this query:
for i in range(1,size+1):
    # pass the parameter as a 1-tuple; str(i) would be treated as a sequence of characters
    cur.execute('select * from Extra where count = ?', (i,))
    #other stuff
For Example ( I am ignoring the date entries and will write them as 'dt'):
My_list = [
    [ ( dt,'user1') , (dt, 'user2'), (dt, 'user3') ],
    [ ( dt,'user3') , (dt, 'user4')],
    [ ( dt,'user2') , (dt, 'user3'), (dt,'user1') ]
]
here if I take fraction = 1 then, results:
only 1 occurrence in window: 5 (user 1-2,1-3,3-4,2-1,3-1)
only 2 occurrence in window: 2 (user 2-3)

Let me get this straight.
You have up to about 22 billion potential tuples (for 90000 lists, any of 700, any of the following entries, on average 350), which might be fewer depending on the window size. You want to find, by number of inner lists that they appear in, how many tuples there are.
Data of this size has to live on disk. The rule for data that lives on disk due to size is, "Never randomly access, instead generate and then sort."
So I'd suggest that you write out each tuple to a log file, one tuple per line. Sort that file. Now all instances of any given tuple are in one place. Then run through the file, and for each tuple emit the count of how many times it appears in (that is how many inner lists it is in). Sort that second file. Now run through that file, and you can extract how many tuples appeared 1x, 2x, 3x, etc.
If you have multiple machines, it is easy to convert this into a MapReduce. (Which is morally the same approach, but you get to parallelize a lot of stuff.)
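On a small scale, the generate / sort / count pipeline can be sketched in memory as below. For the real data the sorted() call would be replaced by an external sort of the log file (for instance the Unix sort command), and the function would stream over the sorted file. The sketch assumes each pair is written at most once per inner list; pair_count_histogram is a hypothetical name.

```python
from collections import Counter
from itertools import groupby

def pair_count_histogram(pair_lines):
    """Map 'number of occurrences' -> 'how many distinct pairs have it'.

    pair_lines is an iterable of strings such as 'user1:user2',
    one per occurrence of that pair.
    """
    # After sorting, all occurrences of a pair are adjacent
    per_pair = (sum(1 for _ in group) for _, group in groupby(sorted(pair_lines)))
    return Counter(per_pair)

# Pairs from the three inner lists in the question's example
lines = [
    'user1:user2', 'user1:user3', 'user2:user3',   # first inner list
    'user3:user4',                                 # second inner list
    'user2:user3', 'user2:user1', 'user3:user1',   # third inner list
]
histogram = pair_count_histogram(lines)
```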

Apache Hadoop is one of the MapReduce implementations that is suited for this kind of problem.

Is there any built-in way to get the length of an iterable in python?

For example, files, in Python, are iterable - they iterate over the lines in the file. I want to count the number of lines.
One quick way is to do this:
lines = len(list(open(fname)))
However, this loads the whole file into memory (at once). This rather defeats the purpose of an iterator (which only needs to keep the current line in memory).
This doesn't work:
lines = len(line for line in open(fname))
as generators don't have a length.
Is there any way to do this short of defining a count function?
def count(i):
    c = 0
    for el in i: c += 1
    return c
To clarify, I understand that the whole file will have to be read! I just don't want it in memory all at once

Short of iterating through the iterable and counting the number of iterations, no. That's what makes it an iterable and not a list. This isn't really even a python-specific problem. Look at the classic linked-list data structure. Finding the length is an O(n) operation that involves iterating the whole list to find the number of elements.
As mcrute mentioned above, you can probably reduce your function to:
def count_iterable(i):
    return sum(1 for e in i)
Of course, if you're defining your own iterable object you can always implement __len__ yourself and keep an element count somewhere.

If you need a count of lines you can do this, I don't know of any better way to do it:
line_count = sum(1 for line in open("yourfile.txt"))

The cardinality package provides an efficient count() function and some related functions to count and check the size of any iterable: http://cardinality.readthedocs.org/
import cardinality
it = some_iterable(...)
print(cardinality.count(it))
Internally it uses enumerate() and collections.deque() to move all the actual looping and counting logic to the C level, resulting in a considerable speedup over for loops in Python.

I've used this redefinition for some time now:
def len(thingy):
    try:
        return thingy.__len__()
    except AttributeError:
        return sum(1 for item in iter(thingy))

It turns out there is an implemented solution for this common problem. Consider using the ilen() function from more_itertools.
more_itertools.ilen(iterable)
An example of printing a number of lines in a file (we use the with statement to safely handle closing files):
# Example
import more_itertools
with open("foo.py", "r") as f:
    print(more_itertools.ilen(f))
# Output: 433
This example returns the same result as solutions presented earlier for totaling lines in a file:
# Equivalent code
with open("foo.py", "r") as f:
    print(sum(1 for line in f))
# Output: 433

Absolutely not, for the simple reason that iterables are not guaranteed to be finite.
Consider this perfectly legal generator function:
def forever():
    while True:
        yield "I will run forever"
Attempting to calculate the length of this function with len([x for x in forever()]) will clearly not work.
As you noted, much of the purpose of iterators/generators is to be able to work on a large dataset without loading it all into memory. The fact that you can't get an immediate length should be considered a tradeoff.

Because apparently the duplication wasn't noticed at the time, I'll post an extract from my answer to the duplicate here as well:
There is a way to perform meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when the iterable is short), while maintaining fixed memory overhead behavior (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs.
# On Python 2 only, get a zip that lazily generates results instead of returning a list
try:
    from future_builtins import zip
except ImportError:
    pass  # Python 3: the built-in zip is already lazy
from collections import deque
from itertools import count
def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    deque(zip(it, cnt), 0)  # cnt must be second zip arg to avoid advancing too far
    # Since count is 0-based, the next value is the count
    return next(cnt)
Like len(list(it)), ilen(it) performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding byte code execution per loop is usually the key to performance in CPython.
Rather than repeat all the performance numbers here, I'll just point you to my answer with the full perf details.

For filtering, this variation can be used:
sum(is_good(item) for item in iterable)
which can be naturally read as "count good items" and is shorter and simpler (although perhaps less idiomatic) than:
sum(1 for item in iterable if is_good(item))
Note: The fact that True evaluates to 1 in numeric contexts is specified in the docs (https://docs.python.org/3.6/library/stdtypes.html#boolean-values), so this coercion is not a hack (as opposed to some other languages like C/C++).
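One caveat worth illustrating: the shorter form only counts correctly when the predicate returns a real bool. A predicate that returns some other truthy value gets summed as that value. In the sketch below, is_good and truthy_not_bool are hypothetical predicates made up for the example.

```python
def is_good(item):
    return item % 2 == 0  # a comparison, so the result is True or False

def truthy_not_bool(item):
    return item  # truthy for non-zero ints, but not a bool

items = [1, 2, 3, 4, 6]

count_a = sum(is_good(item) for item in items)       # True counts as 1
count_b = sum(1 for item in items if is_good(item))  # same count, filter form

miscount = sum(truthy_not_bool(item) for item in items)          # adds the values themselves
safe_count = sum(bool(truthy_not_bool(item)) for item in items)  # bool() restores counting
```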

Well, if you think about it, how do you propose to find the number of lines in a file without reading the whole file for newlines? Sure, you can find the size of the file, and if you can guarantee that the length of every line is x, you can get the number of lines in the file. But unless you have some kind of constraint, I fail to see how this can work at all. Also, since iterables can be infinitely long...

I did a test between the two common procedures in some code of mine, which finds how many graphs on n vertices there are, to see which method of counting elements of a generated list goes faster. Sage has a generator graphs(n) which generates all graphs on n vertices. I created two functions which obtain the length of a list obtained by an iterator in two different ways and timed each of them (averaging over 100 test runs) using the time.time() function. The functions were as follows:
def test_code_list(n):
    l = graphs(n)
    return len(list(l))
and
def test_code_sum(n):
    S = sum(1 for _ in graphs(n))
    return S
Now I time each method
import time
t0 = time.time()
for i in range(100):
    test_code_list(5)
t1 = time.time()
avg_time = (t1-t0)/100
print 'average list method time = %s' % avg_time
t0 = time.time()
for i in range(100):
    test_code_sum(5)
t1 = time.time()
avg_time = (t1-t0)/100
print "average sum method time = %s" % avg_time
average list method time = 0.0391882109642
average sum method time = 0.0418473792076
So computing the number of graphs on n=5 vertices this way, the list method is slightly faster (although 100 test runs isn't a great sample size). But when I increased the length of the list being computed by trying graphs on n=7 vertices (i.e. changing graphs(5) to graphs(7)), the result was this:
average list method time = 4.14753051996
average sum method time = 3.96504004002
In this case the sum method was slightly faster. All in all, the two methods are approximately the same speed but the difference MIGHT depend on the length of your list (it might also just be that I only averaged over 100 test runs, which isn't very high -- would have taken forever otherwise).
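For anyone wanting to repeat this kind of comparison, the standard library's timeit module handles the repetition and is less sensitive to one-off noise than a hand-rolled time.time() loop. The sketch below uses a plain generator as a stand-in for Sage's graphs(n), so the names items, count_with_list and count_with_sum are assumptions and the numbers it prints will not match the ones above.

```python
import timeit

def items(n):
    # Stand-in for Sage's graphs(n): any generator works for the comparison
    return (i for i in range(n))

def count_with_list(n):
    return len(list(items(n)))

def count_with_sum(n):
    return sum(1 for _ in items(n))

if __name__ == "__main__":
    for func in (count_with_list, count_with_sum):
        # Best of 5 repeats, each timing 10 calls
        best = min(timeit.repeat(lambda: func(10000), number=10, repeat=5))
        print("%s: %.6f s per call" % (func.__name__, best / 10))
```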
