Where do you use the generator feature in your Python code?

I have studied the generator feature and I think I understand it, but I would like to know where I could apply it in my code.
I have in mind the following example, which I read in the book "Python Essential Reference":
# tail -f
import time

def tail(f):
    f.seek(0, 2)
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line
Do you have any other effective examples where generators are the best tool for the job, like tail -f?
How often do you use the generator feature, and in which kind of functionality/part of a program do you usually apply it?

I use them a lot when I implement scanners (tokenizers) or when I iterate over data containers.
Edit: here is a demo tokenizer I used for a C++ syntax highlighting program:
whitespace = ' \t\r\n'
operators = '~!%^&*()-+=[]{};:\'"/?.,<>\\|'

def scan(s):
    "returns a token and a state/token id"
    words = {0:'', 1:'', 2:''} # normal, operator, whitespace
    state = 2 # I pick ws as first state
    for c in s:
        if c in operators:
            if state != 1:
                yield (words[state], state)
                words[state] = ''
                state = 1
            words[state] += c
        elif c in whitespace:
            if state != 2:
                yield (words[state], state)
                words[state] = ''
                state = 2
            words[state] += c
        else:
            if state != 0:
                yield (words[state], state)
                words[state] = ''
                state = 0
            words[state] += c
    yield (words[state], state)
Usage example:
>>> it = scan('foo(); i++')
>>> it.next()
('', 2)
>>> it.next()
('foo', 0)
>>> it.next()
('();', 1)
>>> it.next()
(' ', 2)
>>> it.next()
('i', 0)
>>> it.next()
('++', 1)
>>>

Use a generator whenever your code would either produce an unlimited number of values or, more generally, whenever too much memory would be consumed by building the whole list up front.
Also use one if it is likely that you won't iterate over the whole generated list (and the list is very large); there is no point in generating every value first (and waiting for that generation) if it is not used.
My latest encounter with generators was when I implemented a linear recurrence sequence (LRS), e.g. the Fibonacci sequence.
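As an illustration (a minimal sketch of my own, not code from the book), a Fibonacci generator can produce as many values as the caller wants while keeping only two numbers in memory:

def fibonacci():
    # Yields the Fibonacci sequence indefinitely; the caller decides when to stop.
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

fib = fibonacci()
print([next(fib) for _ in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]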

In all cases where I have algorithms that read anything, I use generators exclusively.
Why?
Layering in filtering, mapping and reduction rules is so much easier in a context of multiple generators.
Example:
def discard_blank(source):
    for line in source:
        if len(line) == 0:
            continue
        yield line

def clean_end(source):
    for line in source:
        yield line.rstrip()

def split_fields(source):
    for line in source:
        yield line.split()

def convert_pos(tuple_source, position):
    for line in tuple_source:
        # wrap the converted value in a list so the slices concatenate cleanly
        yield line[:position] + [int(line[position])] + line[position+1:]

with open('somefile', 'r') as source:
    data = convert_pos(split_fields(discard_blank(clean_end(source))), 0)
    total = 0
    for l in data:
        print l
        total += l[0]
    print total
My preference is to use many small generators so that a small change is not disruptive to the entire process chain.

In general, to separate data acquisition (which might be complicated) from consumption. In particular:
to concatenate the results of several B-tree queries: the db part generates and executes the queries, yield-ing records from each one, while the consumer only sees single data items arriving.
buffering (read-ahead): the generator fetches data in blocks and yields single elements from each block. Again, the consumer is shielded from the gory details; a small sketch of this pattern follows below.
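To make the read-ahead idea concrete, here is a minimal sketch of my own (not the answerer's actual code); fetch_block is a hypothetical callable that returns a list of records, or an empty list when the source is exhausted:

def buffered(fetch_block, block_size=1024):
    # Read-ahead generator: fetches whole blocks, hands out single records.
    while True:
        block = fetch_block(block_size)   # hypothetical block-fetching callable
        if not block:
            return                        # source exhausted
        for record in block:
            yield record

The consumer simply iterates over buffered(fetch_block) and never sees the block boundaries.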
Generators can also work as coroutines. You can pass data into them using nextval = g.send(data) on the 'consumer' side and data = yield nextval on the generator side. In this case the generator and its consumer 'swap' values. You can even make yield raise an exception within the generator context: g.throw(exc) does that.
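For completeness, a small coroutine sketch along those lines (my own illustration, not from the answer above; it works on Python 2.5+ and Python 3):

def running_average():
    # Coroutine: receives numbers via send() and yields back the running average.
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average    # 'average' goes out to the caller, the sent value comes in
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)                # prime the coroutine: run it up to the first yield
print(avg.send(10))      # 10.0
print(avg.send(20))      # 15.0
# avg.throw(ValueError)  # would raise ValueError inside the generator, at the yield

Note that the coroutine must be primed with one next() call before the first send().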

Related

Python: itertools.product consuming too much resources

I've created a Python script that generates a list of words by permutation of characters. I'm using itertools.product to generate my permutations. My character list is composed of letters and digits: 01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ. Here is my code:
#!/usr/bin/python
import itertools, hashlib, math

class Words:

    chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'

    def __init__(self, size):
        self.make(size)

    def getLenght(self, size):
        res = []
        for i in range(1, size+1):
            res.append(math.pow(len(self.chars), i))
        return sum(res)

    def getMD5(self, text):
        m = hashlib.md5()
        m.update(text.encode('utf-8'))
        return m.hexdigest()

    def make(self, size):
        file = open('res.txt', 'w+')
        res = []
        i = 1
        for i in range(1, size+1):
            prod = list(itertools.product(self.chars, repeat=i))
            res = res + prod
        j = 1
        for r in res:
            text = ''.join(r)
            md5 = self.getMD5(text)
            res = text+'\t'+md5
            print(res + ' %.3f%%' % (j/float(self.getLenght(size))*100))
            file.write(res+'\n')
            j = j + 1
        file.close()

Words(3)
This script works fine for lists of words with at most 4 characters. If I try 5 or 6 characters, my computer consumes 100% of the CPU and 100% of the RAM, and it freezes.
Is there a way to restrict the use of those resources or optimize this heavy processing?
Does this do what you want?
I've made all the changes in the make method:
def make(self, size):
    # file is a builtin name in Python 2, so use file_ instead.
    # Also, use a with statement for files used in only a small block;
    # it handles file closure even if an error is raised.
    with open('res.txt', 'w+') as file_:
        for i in range(1, size+1):
            prod = itertools.product(self.chars, repeat=i)
            for j, r in enumerate(prod):
                text = ''.join(r)
                md5 = self.getMD5(text)
                res = text+'\t'+md5
                print(res + ' %.3f%%' % ((j+1)/float(self.getLenght(size))*100))
                file_.write(res+'\n')
Be warned this will still take a very long time and write a huge output file, but it will no longer exhaust your memory.
EDIT: As noted by Padraic, there is no file builtin in Python 3, and since it is considered a "bad builtin" anyway, shadowing it is not too worrying. Still, I'll name it file_ here.
EDIT2:
To explain why this works so much faster and better than the previous, original version, you need to know how lazy evaluation works.
Say we have a simple expression as follows (Python 3; use xrange in Python 2):
a = [i for i in range(10**12)]
This immediately tries to build a list of a trillion elements in memory, overflowing your memory.
So we can use a generator expression to solve this:
a = (i for i in range(10**12))
Here, none of the values have been evaluated; we have only given the interpreter instructions on how to evaluate them. We can then iterate through the items one by one and work on each separately, so almost nothing is in memory at a given time (only one integer at a time). This makes the seemingly impossible task very manageable.
The same is true with itertools: it allows you to do memory-efficient, fast operations by using iterators rather than lists or arrays to do operations.
In your example, you have 62 characters and want the Cartesian product with 5 repeats, i.e. 62**5 (nearly a billion elements, or over 30 gigabytes of RAM). This is prohibitively large.
In order to solve this, we can use iterators.
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
for i in itertools.product(chars, repeat=5):
    print(i)
Here, only a single item from the Cartesian product is in memory at a given time, which makes it very memory efficient.
However, if you evaluate the full iterator using list(), it exhausts the iterator and adds the results to a list, meaning the nearly one billion combinations are suddenly in memory again. We don't need all the elements in memory at once, just one; that is the power of iterators.
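If you only ever need a bounded number of results, itertools.islice lets you take a slice of the lazy stream without building the full list (a small sketch I am adding for illustration, not part of the original answer):

import itertools

chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
product = itertools.product(chars, repeat=5)    # lazy: nothing is generated yet

# Materialize only the first five combinations; the rest are never created.
for combo in itertools.islice(product, 5):
    print(''.join(combo))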
Here are links to the itertools module and another explanation on iterators in Python 2 (mostly true for 3).

Delete element of list in pool.map() python

processPool.map(parserMethod, ((inputFile[line:line + chunkSize], sharedQueue) for line in xrange(0, lengthOfFile, chunkSize)))
Here, I am passing control to parserMethod with a tuple of params inputFile[line:line + chunkSize] and a sharedQueue.
Can anyone tell me how I can delete the elements of inputFile[line:line + chunkSize] after it is passed to the parserMethod ?
Thanks !
del inputFile[line:line + chunkSize]
will remove those items. However, your map is stepping through the entire file, which makes me wonder: are you trying to remove them as they're parsed? That requires the map or the parser to alter an input argument, which invites trouble.
If you're only trying to save memory, it's a little late: you have already loaded the entire file into inputFile. If you only need to clean up after the parsing, then use the extreme form of delete, once, after the parsing is finished:
del inputFile[:]
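For reference, a tiny illustration of the two forms of slice deletion mentioned above (the data is made up for the example):

data = list(range(10))
del data[2:5]     # removes the items at indexes 2, 3 and 4, in place
print(data)       # [0, 1, 5, 6, 7, 8, 9]
del data[:]       # empties the list in place; other references see the change
print(data)       # []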
If you want to reduce the memory requirement up front, you have to back up a step. Instead of putting the entire file into a list, try building a nice input pipeline. You didn't post the context of this code, so I'm going to use a generic case with a couple of name assumptions:
def line_chunk_stream(input_stream, chunk_size):
    # Generator to return a stream of parsing units,
    # <chunk_size> lines each.
    # To make sure you could check the logic here,
    # I avoided several Pythonic short-cuts.
    line_count = 0
    parse_chunk = []
    for line in input_stream:
        line_count += 1
        parse_chunk.append(line)
        if line_count % chunk_size == 0:
            yield parse_chunk
            parse_chunk = []  # start a fresh list; don't empty the one already handed out
    if parse_chunk:
        yield parse_chunk     # don't drop a final partial chunk

input_stream = open("source_file", 'r')
parse_stream = line_chunk_stream(input_stream, chunk_size)
parserMethod(parse_stream)
I hope that at least one of these solves your underlying problem.

OCaml equivalent of Python generators

The French Sécurité Sociale identification numbers end with a two-digit check code. I have verified that every possible common transcription error can be detected, and I found some other kinds of errors (e.g., rolling three consecutive digits) that may go undetected.
def check_code(number):
    return 97 - int(number) % 97

def single_digit_generator(number):
    for i in range(len(number)):
        for wrong_digit in "0123456789":
            yield number[:i] + wrong_digit + number[i+1:]

def roll_generator(number):
    for i in range(len(number) - 2):
        yield number[:i] + number[i+2] + number[i] + number[i+1] + number[i+3:]
        yield number[:i] + number[i+1] + number[i+2] + number[i] + number[i+3:]

def find_error(generator, number):
    control = check_code(number)
    for wrong_number in generator(number):
        if number != wrong_number and check_code(wrong_number) == control:
            return (number, wrong_number)

assert find_error(single_digit_generator, "0149517490979") is None
assert find_error(roll_generator, "0149517490979") == ('0149517490979', '0149517499709')
My Python 2.7 code (working fragment above) makes heavy use of generators. I was wondering how I could adapt them to OCaml. I surely can write a function that maintains some internal state, but I'm looking for a purely functional solution. Should I study the Lazy module, which I'm not too familiar with? I'm not asking for code, just directions.
You may simply define a generator as a stream, using an extension of the language:
let range n = Stream.from (fun i -> if i < n then Some i else None);;
The for syntactic construct cannot be used with that, but there's a set of functions provided by the Stream module to check the state of the stream and iterate on its elements.
try
  let r = range 10 in
  while true do
    Printf.printf "next element: %d\n" @@ Stream.next r
  done
with Stream.Failure -> ();;
Or more simply:
Stream.iter (Printf.printf "next element: %d\n") @@ range 10;;
You may also use a special syntax provided by the camlp4 preprocessor:
Stream.iter (Printf.printf "next element: %d\n") [< '11; '3; '19; '52; '42 >];;
Other features include creating streams out of lists, strings, bytes or even channels. The API documentation succinctly describes the different possibilities.
The special syntax lets you compose them, so as to put back elements before or after, but it can be somewhat unintuitive at first:
let dump = Stream.iter (Printf.printf "next element: %d\n");;
dump [< range 10; range 20 >];;
will produce the elements of the first stream, and then pick up the second stream at the element whose rank comes right after the rank of the last yielded element; thus in this case it will appear as if only the second stream got iterated.
To get all the elements, you could construct a stream of 'a Stream.t and then recursively iterate on each:
Stream.iter dump [< '(range 10); '(range 20) >];;
These would produce the expected output.
I recommend reading the old book on OCaml (available online) for a better introduction on the topic.
The Core library provides generators in a Python style; see the Sequence module.
Here is an example, taken from one of my projects:
open Core_kernel.Std

let intersections tab (x : mem) : _ seq =
  let open Sequence.Generator in
  let init = return () in
  let m = fold_intersections tab x ~init ~f:(fun addr x gen ->
      gen >>= fun () -> yield (addr,x)) in
  run m

Optimizing removing duplicates in large files in Python

I have one very large text file (27GB) that I am attempting to make smaller by removing lines that are duplicated in a second database that has several files of a more reasonable size (500MB-2GB). I have some functional code; what I am wondering is: is there any way I can optimize this code to run faster in wall-clock time? At the moment, on a small test run with a 1.5GB input and a 500MB filter, this takes about 75 seconds to complete.
I've gone through many iterations of this idea, and this one is currently the best for time. If anyone has ideas for a better logical structure for the filter, I'd love to hear them. Past attempts have all been worse than this one: loading the filter into a set and cycling through the input searching for duplicates (about half as fast as this), loading the input into a set and running the filter through a difference_update (almost as fast as this, but doing the reverse of what I wanted), and loading both input and filter into sets in chunks and taking set differences (a horrible idea that might have worked if my filters were smaller, so that I didn't have to split them).
So, those are all the things I've tried. All of these processes max out on CPU, and my final version runs at about 25-50% disk I/O, filter and output are on one physical disk, input is on another. I am running a dual core and have no idea if this particular script can be threaded, never done any multithreading before so if that's a possibility I'd love to be pointed in the right direction.
Information about the data! As said earlier, the input is many times larger than the filter. I am expecting a very small percentage of duplication. The data is in lines, all of which are under 20 ASCII characters long. The files are all sorted.
I've already changed the order of the three logical statements, based on the expectation that unique input lines will be the majority of the lines, then unique filter, then duplicates, which on the 'best' case of having no duplicates at all, saved me about 10% time.
Any suggestions?
def sortedfilter(input, filter, output):
    file_input = open(input, 'r')
    file_filter = open(filter, 'r')
    file_output = open(output, 'w')
    inline = file_input.next()
    filterline = file_filter.next()
    try:
        while inline and filterline:
            if inline < filterline:
                file_output.write(inline)
                inline = file_input.next()
                continue
            if inline > filterline:
                filterline = file_filter.next()
                continue
            if inline == filterline:
                filterline = file_filter.next()
                inline = file_input.next()
    except StopIteration:
        file_output.writelines(file_input.readlines())
    finally:
        file_filter.close()
        file_input.close()
        file_output.close()
I'd be interested in knowing how this compares; it's basically just playing with the order of iteration a bit.
def sortedfilter(in_fname, filter_fname, out_fname):
    with open(in_fname) as inf, open(filter_fname) as fil, open(out_fname, 'w') as outf:
        ins = inf.next()
        try:
            for fs in fil:
                while ins < fs:
                    outf.write(ins)
                    ins = inf.next()
                while ins == fs:
                    ins = inf.next()
        except StopIteration:
            # reached end of inf before end of fil
            pass
        else:
            # reached end of fil first; write the current line and pass the rest of inf through
            outf.write(ins)
            outf.writelines(inf.readlines())
You can do the string comparison only once per line by calling cmp(inline, filterline):
-1 means inline < filterline
0 means inline == filterline
+1 means inline > filterline
This might get you a few extra percent. So something like:
while inline and filterline:
    comparison = cmp(inline, filterline)
    if comparison == -1:
        file_output.write(inline)
        inline = file_input.next()
        continue
    if comparison == 1:
        filterline = file_filter.next()
        continue
    if comparison == 0:
        filterline = file_filter.next()
        inline = file_input.next()
Seeing that your input files are sorted, the following should work. heapq's merge produces a sorted stream from sorted inputs, upon which groupby operates; groups whose length is greater than 1 are discarded. This stream-based approach should have relatively low memory requirements.
from itertools import groupby, repeat, izip
from heapq import merge
from operator import itemgetter

with open('input.txt', 'r') as file_input, open('filter.txt', 'r') as file_filter, open('output.txt', 'w') as file_output:
    file_input_1 = izip(file_input, repeat(1))
    file_filter_1 = izip(file_filter, repeat(2))
    gen = merge(file_input_1, file_filter_1)
    gen = ((k, list(g)) for (k, g) in groupby(gen, key=itemgetter(0)))
    gen = (k for (k, g) in gen if len(g) == 1 and g[0][1] == 1)
    for line in gen:
        file_output.write(line)
Consider opening your files as binary so no conversion to unicode needs to be made:
with open(in_fname,'rb') as inf, open(filter_fname,'rb') as fil, open(out_fname, 'wb') as outf:

Parsing blockbased program output using Python

I am trying to parse the output of a statistical program (Mplus) using Python.
The format of the output (example here) is structured in blocks, sub-blocks, columns, etc., where the whitespace and line breaks are very important. Depending on, e.g., the options requested, you get an additional (sub)block or column here or there.
Approaching this using regular expressions has been a PITA and completely unmaintainable. I have been looking into parsers as a more robust solution, but
am a bit overwhelmed by all the possible tools and approaches;
have the impression that they are not well suited for this kind of output.
E.g. LEPL has something called line-aware parsing, which seems to go in the right direction (whitespace, blocks, ...) but is still geared to parsing programming syntax, not output.
Suggestions on which direction to look in would be appreciated.
Yes, this is a pain to parse. You don't -- however -- actually need very many regular expressions. Ordinary split may be sufficient for breaking this document into manageable sequences of strings.
These are a lot of what I call "Head-Body" blocks of text. You have titles, a line of "--"'s and then data.
What you want to do is collapse a "head-body" structure into a generator function that yields individual dictionaries.
def get_means_intercepts_thresholds( source_iter ):
    """Precondition: Current line is a "MEANS/INTERCEPTS/THRESHOLDS" line"""
    head= source_iter.next().strip().split()
    junk= source_iter.next().strip()
    assert set( junk ) == set( [' ','-'] )
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break
        raw_data= line.strip().split()
        data = dict( zip( head, map( float, raw_data[1:] ) ) )
        yield int(raw_data[0]), data

def get_slopes( source_iter ):
    """Precondition: Current line is a "SLOPES" line"""
    head= source_iter.next().strip().split()
    junk= source_iter.next().strip()
    assert set( junk ) == set( [' ','-'] )
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break
        raw_data= line.strip().split()
        data = dict( zip( head, map( float, raw_data[1:] ) ) )
        yield raw_data[0], data
The point is to consume the head and the junk with one set of operations.
Then consume the rows of data which follow using a different set of operations.
Since these are generators, you can combine them with other operations.
def get_estimated_sample_statistics( source_iter ):
    """Precondition: at the ESTIMATED SAMPLE STATISTICS line"""
    for line in source_iter:
        if len(line.strip()) == 0: continue
        assert line.strip() == "MEANS/INTERCEPTS/THRESHOLDS"
        for data in get_means_intercepts_thresholds( source_iter ):
            yield data
        while True:
            if len(line.strip()) == 0: continue
            if line.strip() != "SLOPES": break
            for data in get_slopes( source_iter ):
                yield data
Something like this may be better than regular expressions.
Based on your example, what you have is a bunch of different, nested sub-formats that, individually, are very easily parsed. What can be overwhelming is the sheer number of formats and the fact that they can be nested in different ways.
At the lowest level you have a set of whitespace-separated values on a single line. Those lines combine into blocks, and how the blocks combine and nest within each other is the complex part. This type of output is designed for human reading and was never intended to be "scraped" back into machine-readable form.
First, I would contact the author of the software and find out if there is an alternate output format available, such as XML or CSV. If done correctly (i.e. not just the print-format wrapped in clumsy XML, or with commas replacing whitespace), this would be much easier to handle. Failing that I would try to come up with a hierarchical list of formats and how they nest. For example,
ESTIMATED SAMPLE STATISTICS begins a block
Within that block MEANS/INTERCEPTS/THRESHOLDS begins a nested block
The next two lines are a set of column headings
This is followed by one (or more?) rows of data, with a row header and data values
And so on. If you approach each of these problems separately, you will find that it's tedious but not complex. Think of each of the above steps as modules that test the input to see if it matches and, if it does, call other modules to test further for things that can occur "inside" the block, backtracking if you get to something that doesn't match what you expect (this is called "recursive descent", by the way; a small sketch in this style appears below).
Note that you will have to do something like this anyway, in order to build an in-memory version of the data (the "data model") on which you can operate.
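To make that concrete, here is a minimal sketch of the recursive-descent style (my own illustration; parse_row, parse_block and the blank-line convention are assumptions, not the actual Mplus format):

def parse_row(line, headers):
    # Lowest level: a row label followed by whitespace-separated numeric values.
    fields = line.split()
    return fields[0], dict(zip(headers, [float(v) for v in fields[1:]]))

def parse_block(lines, title):
    # A block: a known title line, a header line, then data rows until a blank line.
    assert next(lines).strip() == title
    headers = next(lines).split()
    rows = {}
    for line in lines:
        if not line.strip():
            break                        # a blank line closes the block
        label, values = parse_row(line, headers)
        rows[label] = values
    return rows

# Usage sketch:
# lines = iter(open('mplus_output.txt'))
# block = parse_block(lines, 'MEANS/INTERCEPTS/THRESHOLDS')

Each such function consumes exactly its own block and leaves the iterator positioned for the next one.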
My suggestion is to do a rough massaging of the lines into a more useful form. Here are some experiments with your data:
from __future__ import print_function
from itertools import groupby
import string

counter = 0

statslist = [ statsblocks.split('\n')
              for statsblocks in open('mlab.txt').read().split('\n\n')
              ]
print(len(statslist), 'blocks')

def blockcounter(line):
    global counter
    if not line[0]:
        counter += 1
    return counter

blocklist = [ [block, list(stats)] for block, stats in groupby(statslist, blockcounter)]

for blockno, block in enumerate(blocklist):
    print(120 * '=')
    for itemno, line in enumerate(block[1:][0]):
        if len(line) < 4 and any(line[-1].endswith(c) for c in string.letters):
            print('\n** DATA %i, HEADER (%r)**' % (blockno, line[-1]))
        else:
            print('\n** DATA %i, item %i, length %i **' % (blockno, itemno, len(line)))
            for ind, subdata in enumerate(line):
                if '___' in subdata:
                    print(' *** Numeric data starts: ***')
                else:
                    if 6 < len(subdata) < 16:
                        print('** TYPE: %s **' % subdata)
                print('%3i : %s' % (ind, subdata))
You could try PyParsing. It enables you to write a grammar for what you want to parse, and it has examples beyond parsing programming languages. But I agree with Jim Garrison that your case doesn't seem to call for a real parser, because writing the grammar would be cumbersome. I would try a brute-force solution, e.g. splitting lines at whitespace. It's not foolproof, but we can assume the output is correct, so if a line has n headers, the next line will have exactly n values (a small sketch of that idea follows below).
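As an illustration of that brute-force pairing (the two sample lines are made up for the example, not real Mplus output):

header_line = "   ESTIMATE     S.E.  EST./S.E.  P-VALUE"   # hypothetical header row
value_line  = "      0.512    0.034      15.06    0.000"   # hypothetical value row

headers = header_line.split()
values = [float(v) for v in value_line.split()]
assert len(headers) == len(values)    # n headers -> exactly n values
print(dict(zip(headers, values)))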
It turns out that tabular program output like this was one of my earliest applications of pyparsing. Unfortunately, that exact example dealt with a proprietary format that I can't publish, but there is a similar example posted here: http://pyparsing.wikispaces.com/file/view/dictExample2.py .
