Here’s a quick example of what I’m trying to do and the error I’m getting:
for symbol in itertools.product(list_a, repeat=8):
    list_b.append(symbol)
Afterwards, I’m also excluding combinations from that list like so:
for combination in list_b:
    valid_b = True
    for symbols in range(len(list_exclude)):
        if list_exclude[symbols] in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)
I’ve heard somehow chunking the process might help, not sure how that could be done here though.
I’m using multiprocessing for this as well.
When I run it, I get a “MemoryError”.
How would you go about it?
Don't pre-compute anything, especially not the first full list:
import itertools

def symbols(lst, exclude):
    for symbol in map(''.join, itertools.product(lst, repeat=8)):
        if any(map(symbol.__contains__, exclude)):
            continue
        yield symbol
Now use the generator wherever you need to lazily evaluate the elements. Keep in mind that since it's pre-filtering the data, even list(symbols(list_a, list_exclude)) will be much cheaper than what you originally wrote.
Here is a breakdown of what happens:
itertools.product returns a lazy iterator. That means it produces each output element on demand, without retaining references to previously produced items. Each element it returns is a tuple containing some combination of the input elements.
Since you want to compare strings, you need to convert the tuples. Hence, ''.join. Mapping it onto each of the tuples that itertools.product produces converts those elements into strings. For example:
>>> ''.join(('$', '$', '&', '&', '♀', '#', '%', '$'))
'$$&&♀#%$'
Filtering each symbol thus created can be done by checking whether any of the items in exclude are contained in it. You can do this with something like
[ex in symbol for ex in exclude]
The operation ... in symbol is implemented via the magic method symbol.__contains__. You can therefore map that method to every element of exclude.
Since the first element of exclude that is contained in symbol invalidates it, you don't need to check the remainder. This is called short-circuiting, and it is implemented in the any function. Notice that because map is lazy, the remaining elements will not actually be computed once a match is found. This is different from using a list comprehension, which pre-computes all the elements.
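You can see the short-circuiting by mapping a small helper with a visible side effect (check here is made up purely for the demonstration; Python 3 session):

>>> def check(x):
...     print('checking', x)
...     return x > 1
...
>>> any(map(check, [0, 2, 5, 7]))
checking 0
checking 2
True
>>> any([check(x) for x in [0, 2, 5, 7]])
checking 0
checking 2
checking 5
checking 7
True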
Putting yield into your function turns it into a generator function. That means that when you call symbols(...), it returns a generator object that you can iterate over. This object does not pre-compute anything until you call next on it. So if you write the data to a file (for example), only the current result will be in memory at once. It may take a long time to write out a large number of results but your memory usage should not spike at all from it.
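For instance, a minimal sketch of streaming the filtered results straight to disk (assuming list_a and list_exclude are defined as in your question; the file name is just a placeholder):

with open('symbols.txt', 'w') as outfile:
    for symbol in symbols(list_a, list_exclude):
        outfile.write(symbol + '\n')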
This little change alone could save you a fair bit of RAM:
for combination in list_b:
    valid_b = True
    for symbols in list_exclude:
        if symbols in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)
I need to turn a list of various entities into strings. So far I use:
all_ents_dead = []  # converted to strings
for i in all_ents:
    all_ents_dead.append(str(i))
Is there an optimized way of doing that?
EDIT: I then need to find which of these contain certain string. So far I have:
matching = [s for s in all_ents_dead if "GROUPS" in s]
Whenever you see the pattern name = [] followed by name.append() in a loop, consider using a list comprehension. A list comprehension builds a list directly from a loop, without the repeated list.append() lookups and calls, making it faster:
all_ents_dead = [str(i) for i in all_ents]
This directly echoes the code you had, but with the expression inside all_ents_dead.append(...) moved to the front of the for loop.
If you don't actually need a list, but only need to iterate over the str() conversions you should consider lazy conversion options. You can turn the list comprehension in to a generator expression:
all_ents_dead = (str(i) for i in all_ents)
or, when only applying a function, the faster alternative in the map() function:
all_ents_dead = map(str, all_ents) # assuming Python 3
both of which lazily apply str() as you iterate over the resulting object. This helps avoid creating a new list object where you don't actually need one, saving on memory. Do note that a generator expression can be slower however; if performance is at stake consider all options based on input sizes, memory constraints and time trials.
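If performance matters for your data, a rough way to time the options yourself (the integer input here is just a stand-in for your entities):

import timeit

setup = "all_ents = list(range(10000))"
print(timeit.timeit("[str(i) for i in all_ents]", setup=setup, number=100))
print(timeit.timeit("list(map(str, all_ents))", setup=setup, number=100))
print(timeit.timeit("list(str(i) for i in all_ents)", setup=setup, number=100))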
For your specific search example, you could just embed the map() call:
matching = [s for s in map(str, all_ents) if "GROUPS" in s]
which would produce a list of matching strings, without creating an intermediary list of string objects that you then don't use anywhere else.
Use the map() function. This will take your existing list, run a function on each item, and return a new list/iterator (see below) with the result of the function applied on each element.
all_ents_dead = map(str, all_ents)
In Python 3+, map() will return an iterator, while in Python 2 it will return a list. An iterator can have optimisations you desire since it generates the values when demanded, and not all at once (as opposed to a list).
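For example, in a Python 3 session (the exact memory address will differ):

>>> m = map(str, [1, 2, 3])
>>> m
<map object at 0x7f...>
>>> list(m)
['1', '2', '3']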
What I have is a dictionary of words and I'm generating objects that contain
(1) Original word (e.g. cats)
(2) Alphabetized word (e.g. acst)
(3) Length of the word
Without knowing the length of the longest word, is it possible to create an array (or, in Python, a list) such that, as I scan through the dictionary, it will append an object with x chars into a list in array[x]?
For example, when I encounter the word "a", it will append the generated object to the list at array[1]. Next, for aardvark, it will append the generated object to the list at array[8], etc.
I thought about creating an array of size 1 and then adding on to it, but I'm not sure how it would work.
For example: for the first word, a, it will append it to the list stored in array[1]. However, for the next word, aardvark, how am I supposed to check for/generate more spots in the list until it hits 8? If I append to array, I need to give the append function an argument. But I can't give it just any argument, since I don't want to change previously entered values (e.g. 'a' in array[1]).
I'm trying to optimize my code for an assignment, so the alternative is going through the list a second time after I've determined the longest word. However, I think it would be better to do it as I alphabetize the words and create the objects such that I don't have to go through the lengthy dictionary twice.
Also, quick question about syntax: listOfStuff[x].append(y) will initialize/append to the list within listOfStuff at the value x with the value y, correct?
Store the lengths as keys in a dict rather than as indexes in a list. This is really easy if you use a defaultdict from the collections module - your algorithm will look like this:
from collections import defaultdict
results = defaultdict(list)
for word in words:
    results[len(word)].append(word)
This ties in to your second question: listOfStuff[x].append(y) will append to a list that already exists at listOfStuff[x]. It will not create a new one if that hasn't already been initialised to a (possibly empty) list. If x isn't a valid index into the list (e.g., x=3 into a listOfStuff of length 2), you'll get an IndexError. If the index exists but holds something other than a list, you will probably get an AttributeError.
Using a dict takes care of the first problem for you - assigning to a non-existent dict key is always valid. Using a defaultdict extends this idea to also reading from a non-existent key - it will insert a default value given by calling the function you give the defaultdict when you create it (in this case, we gave it list, so it calls it and gets an empty list) into the dict the first time you use it.
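A quick interactive illustration (Python 3 repr shown):

>>> from collections import defaultdict
>>> results = defaultdict(list)
>>> results[8].append('aardvark')
>>> results[1].append('a')
>>> results
defaultdict(<class 'list'>, {8: ['aardvark'], 1: ['a']})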
If you can't use collections for some reason, the next best way is still to use dicts - they have a method called setdefault that works similarly to defaultdicts. You can use it like this:
results = {}
for word in words:
    results.setdefault(len(word), []).append(word)
As you can see, setdefault takes two arguments: a key and a default value. If the key already exists in the dict, setdefault just returns its current value as if you'd done results[key]. If that would be an error, however, it inserts the second argument into the dictionary at that key, and then returns it. This is a little bit clunkier to use than defaultdict, but when your default value is an empty list it is otherwise the same (defaultdict is better when your default is expensive to create, however, since it only calls the factory function as needed, whereas you need to build the default up front to pass into setdefault).
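The same example with a plain dict and setdefault:

>>> results = {}
>>> results.setdefault(8, []).append('aardvark')
>>> results.setdefault(1, []).append('a')
>>> results
{8: ['aardvark'], 1: ['a']}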
It is technically possible to do this with nested lists, but it is ugly. You have to:
Detect the case that the list isn't big enough
Figure out how many more elements the list needs
Grow the list to that size
The most Pythonic way to do the first bit is to catch the error (something you could also do with dicts if setdefault and defaultdict didn't exist). The whole thing looks like this:
results = []
for word in words:
    try:
        results[len(word)]
    except IndexError:
        # Grow the list so that the new highest index is len(word)
        new_length = len(word) + 1
        difference = new_length - len(results)
        results.extend([] for _ in range(difference))
    finally:
        results[len(word)].append(word)
Stay with dicts to avoid this kind of mess. Lists are specifically optimised for the case where the exact numeric index of an element isn't meaningful outside the list, which doesn't match your use case. This type of code is really common when there is a mismatch between what your code needs to do and what the data structures you're using are good at, and it is worth learning as early as possible how to avoid it.
I've tried searching for an answer to this question and read a lot about decorators and global variables, but have not found anything that exactly fits the problem at hand: I want to make every N-length permutation over an alphabet A, fxn(A, N). I will pass the function two arguments: A and N. It will make a dummy result of length N. Then, with N nested for loops, it will update each index of the result with every element of A, starting from the innermost loop. So fxn('01', 4) will produce
1111, 1110, 1101, 1100, 1011, 1010, 1001, 1000,
0111, 0110, 0101, 0100, 0011, 0010, 0001, 0000
It is straightforward to do this if you know how many nested loops you will need (N; although for more than 4 it starts to get really messy and cumbersome). However, if you want to make all arbitrary-length sequences using A, then you need some way to automate this looping behavior. In particular, I also want this function to act as a generator to avoid having to store all these values in memory, such as in a list. To start, it needs to initialize the first loop and keep initializing nested loops, each changing a single value (the index to update), N-1 times. It will then yield the value of the innermost loop.
The straightforward way to do fxn('01',4) would be:
tempresult = [''] * 4  # the dummy result of length N
for i in alphabet:
    tempresult[0] = i
    for i in alphabet:
        tempresult[1] = i
        for i in alphabet:
            tempresult[2] = i
            for i in alphabet:
                tempresult[3] = i
                yield tempresult
Basically, how can I extend this to an arbitrary-length list or string and still get each nested loop to update the appropriate index? I know there is probably a permutation function as part of numpy that will do this, but I haven't been able to come across one. Any advice would be appreciated.
You don't actually want permutations here, but the cartesian product of alphabet*alphabet*alphabet*alphabet. Which you can write as:
itertools.product(alphabet, repeat=4)
Or, if you want to get strings back instead of tuples:
map(''.join, itertools.product(alphabet, repeat=4))
(In 2.x, if you want this to return a lazy iterator instead of a list, as your original code does, use itertools.imap instead of map.)
If you want to do this with numpy, the best way I could think of is to use a recursive function that tiles and repeats for each factor, but this answer has a better implementation, which you can copy from there, or apparently pull out of scikit-learn as sklearn.utils.extmath.cartesian, and then just do this:
cartesian([alphabet]*4)
Of course that gives you a 2D array of single-digit strings; you still need one more step to flatten it to a 1D array of N-digit strings, and numpy will slow you down there more than it speeds you up in the product calculation, so… unless you actually needed a numpy array anyway, I'd stick with itertools here.
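If you do want to experiment with the numpy route anyway, here is a rough meshgrid-based sketch of such a cartesian helper (simpler, and likely slower, than the implementations referenced above):

import numpy as np

def cartesian(arrays):
    # One grid per input array; stacking and flattening gives one row
    # per combination.
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

rows = cartesian([list('01')] * 4)        # shape (16, 4) array of single characters
strings = [''.join(row) for row in rows]  # flatten each row back into an N-digit string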
You can also look at how the itertools.permutations function works.
Just a fundamental question regarding Python and the .join() method:
import difflib

file1 = open(f1, "r")
file2 = open(f2, "r")
file3 = open("results", "w")
diff = difflib.Differ()
result = diff.compare(file1.read(), file2.read())
file3.write("".join(result))
The above snippet of code yields a nice output stored in a file called "results", in string format, showing the differences between the two files line by line. However, I notice that if I just print result without using .join(), the interpreter shows a message that includes a memory address. After trying to write result to the file without using .join(), I was informed that only strings and character buffers may be written, not generator objects. So based off of all the evidence that I have gathered, please correct me if I am wrong:
result = diff.compare(file1.read(),file2.read()) <---- result is a generator object?
result is a list of strings, with result itself being the reference to the first string?
.join() takes a memory address and points to the first, and then iterates over the rest of the addresses of strings in that structure?
A generator object is an object that returns a pointer?
I apologize if my questions are unclear, but I basically wanted to ask the python veterans if my deductions were correct. My question is less about the observable results, and more so about the inner workings of python. I appreciate all of your help.
join is a method of strings. That method takes any iterable and iterates over it and joins the contents together. (The contents have to be strings, or it will raise an exception.)
If you attempt to write the generator object directly to the file, you will just get the generator object itself, not its contents. join "unrolls" the contents of the generator.
You can see what is going on with a simple, explicit generator:
def gen():
    yield 'A'
    yield 'B'
    yield 'C'
>>> g = gen()
>>> print g
<generator object gen at 0x0000000004BB9090>
>>> print ''.join(g)
ABC
The generator doles out its contents one at a time. If you try to look at the generator itself, it doesn't dole anything out and you just see it as "generator object". To get at its contents, you need to iterate over them. You can do this with a for loop, with the next function, or with any of various other functions/methods that iterate over things (str.join among them).
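Continuing the session above (Python 2 syntax, to match the earlier output):

>>> g = gen()
>>> next(g)
'A'
>>> for letter in g:
...     print letter
...
B
C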
When you say that result "is a list of strings" you are getting close to the idea. A generator (or iterable) is sort of like a "potential list". Instead of actually being a list of all its contents all at once, it lets you peel off each item one at a time.
None of the objects is a "memory address". The string representation of a generator object (like that of many other objects) includes a memory address, so if you print it (as above) or write it to a file, you'll see that address. But that doesn't mean that object "is" that memory address, and the address itself isn't really usable as such. It's just a handy identifying tag so that if you have multiple objects you can tell them apart.
I have a config file that contains a list of strings. I need to read these strings in order and store them in memory and I'm going to be iterating over them many times when certain events take place. Since once they're read from the file I don't need to add or modify the list, a tuple seems like the most appropriate data structure.
However, I'm a little confused about the best way to first construct the tuple, since it's immutable. Should I parse the strings into a list and then put them in a tuple? Is that wasteful? Is there a way to get them into a tuple directly, without the overhead of copying/destroying the tuple every time I add a new element?
As you said, you're going to read the data gradually - so a tuple isn't a good idea after all, as it's immutable.
Is there a reason for not using a simple list for holding the strings?
Since your data is changing, I am not sure you need a tuple. A list should do fine.
Look at the following, which should provide you further information. Creating a tuple is much faster than creating a list, but if you are going to modify elements every now and then, a tuple may not make much sense:
Are tuples more efficient than lists in Python?
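If you want to measure the construction cost yourself, a rough micro-benchmark (numbers vary by machine):

import timeit

# In CPython, a tuple literal of constants can be constant-folded and reused,
# while a list literal is rebuilt on every execution.
print(timeit.timeit("('a', 'b', 'c', 'd')"))
print(timeit.timeit("['a', 'b', 'c', 'd']"))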
I wouldn't worry about the overhead of first creating a list and then a tuple from that list. My guess is that the overhead will turn out to be negligible if you measure it.
On the other hand, I would stick with the list and iterate over that instead of creating a tuple. Tuples should be used for struct-like data and lists for sequences of data, which is what your data sounds like to me.
with open("config") as infile:
    config = tuple(infile)
You may want to try using chained generators to create your tuple. You can use the generators to perform multiple filtering and transformation operations on your input without creating intermediate lists. All of the generator processing is delayed until iteration. In the example below the processing/iteration all happens on the last line.
Like so:
f = open('settings.cfg')
# Keep only lines that look like "key: value" (contain a colon and are more
# than two characters long) and split them into stripped (key, value) tuples.
step1 = (tuple(i.strip() for i in l.split(':', 1)) for l in f if len(l) > 2 and ':' in l)
# For keys containing 'Tag' whose value has commas, split the value into a
# list; otherwise keep the value string as-is.
step2 = ((l[0], ',' in l[1] and 'Tag' in l[0] and l[1].split(',') or l[1]) for l in step1)
t = tuple(step2)
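For example, given a hypothetical settings.cfg containing the two lines "Name: server1" and "Tags: a,b,c" (contents invented purely for illustration), t would come out as (('Name', 'server1'), ('Tags', ['a', 'b', 'c'])).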