Python optimization: [a, b, c] -> [str(a), str(b), str(c)]

I need to turn a list of various entities into strings. So far I use:
all_ents_dead = []  # converted to strings
for i in all_ents:
    all_ents_dead.append(str(i))
Is there an optimized way of doing that?
EDIT: I then need to find which of these contain certain string. So far I have:
matching = [s for s in all_ents_dead if "GROUPS" in s]

Whenever you have a name = [], then name.append() in a loop pattern, consider using a list comprehension. A list comprehension builds a list from a loop, without having to use list.append() lookups and calls, making it faster:
all_ents_dead = [str(i) for i in all_ents]
This directly echoes the code you had, but with the expression inside all_ents_dead.append(...) moved to the front of the for loop.
If you don't actually need a list, but only need to iterate over the str() conversions, you should consider lazy conversion options. You can turn the list comprehension into a generator expression:
all_ents_dead = (str(i) for i in all_ents)
or, when only applying a function, the faster alternative is the map() function:
all_ents_dead = map(str, all_ents) # assuming Python 3
both of which lazily apply str() as you iterate over the resulting object. This avoids creating a new list object where you don't actually need one, saving memory. Do note that a generator expression can be slower, however; if performance is at stake, weigh the options against your input sizes and memory constraints, and run time trials.
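As a rough illustration (a sketch, not a rigorous benchmark; the input here is hypothetical and results vary by machine), timeit shows where the cost is paid:
import timeit

all_ents = list(range(10000))  # hypothetical input

# Eager: both build the full list of strings up front
print(timeit.timeit(lambda: [str(i) for i in all_ents], number=100))
print(timeit.timeit(lambda: list(map(str, all_ents)), number=100))

# Lazy: creating the generator expression is near-instant;
# the str() calls only happen later, as you iterate
print(timeit.timeit(lambda: (str(i) for i in all_ents), number=100))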
For your specific search example, you could just embed the map() call:
matching = [s for s in map(str, all_ents) if "GROUPS" in s]
which would produce a list of matching strings, without creating an intermediary list of string objects that you then don't use anywhere else.

Use the map() function. This will take your existing list, run a function on each item, and return a new list/iterator (see below) with the result of the function applied to each element.
all_ents_dead = map(str, all_ents)
In Python 3+, map() will return an iterator, while in Python 2 it will return a list. An iterator gives you the laziness you want: it generates values on demand rather than all at once (as a list would).
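A quick interactive sketch of the Python 3 behaviour (the memory address shown is illustrative):
>>> m = map(str, [1, 2, 3])
>>> m
<map object at 0x7f9b2c1d0a30>
>>> next(m)          # values are produced on demand
'1'
>>> list(m)          # consuming the rest exhausts the iterator
['2', '3']
>>> list(m)          # a second pass yields nothing
[]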


RAM overload - itertools

Here's a quick example of what I'm trying to do and the error I'm getting:
for symbol in itertools.product(list_a, repeat=8):
    list_b.append(symbol)
I’m also afterwards excluding combinations from that list like so:
for combination in list_b:
    valid_b = True
    for symbols in range(len(list_exclude)):
        if list_exclude[symbols] in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)
I’ve heard somehow chunking the process might help, not sure how that could be done here though.
I’m using multiprocessing for this as well.
When I run it I get “MemoryError”
How would you go about it?
Don't pre-compute anything, especially not the first full list:
import itertools

def symbols(lst, exclude):
    for symbol in map(''.join, itertools.product(lst, repeat=8)):
        if any(map(symbol.__contains__, exclude)):
            continue
        yield symbol
Now use the generator as you need to lazily evaluate the elements. Keep in mind that since it's pre-filtering the data, even list(symbols(list_a, list_exclude)) will be much cheaper than what you originally wrote.
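For example (a sketch; the output file name is hypothetical), you can stream the results straight to disk so that only one symbol is ever held in memory:
with open('symbols.txt', 'w') as f:
    for symbol in symbols(list_a, list_exclude):
        f.write(symbol + '\n')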
Here is a breakdown of what happens:
itertools.product returns a lazy iterator. That means it produces each output without retaining a reference to any previous items. Each element it returns is a tuple containing some combination of the input elements.
Since you want to compare strings, you need to convert the tuples. Hence, ''.join. Mapping it onto each of the tuples that itertools.product produces converts those elements into strings. For example:
>>> ''.join(('$', '$', '&', '&', '♀', '#', '%', '$'))
'$$&&♀#%$'
Filtering each symbol thus created can be done by checking if any of the items in exclude are contained in it. You can do this with something like
[ex in symbol for ex in exclude]
The operation ... in symbol is implemented via the magic method symbol.__contains__. You can therefore map that method to every element of exclude.
Since the first element of exclude that is contained in symbol invalidates it, you don't need to check the remainder. This is called short-circuiting, and it is implemented in the any function. Notice that because map is lazy, the remaining elements will not actually be computed once a match is found. This is different from using a list comprehension, which pre-computes all the elements.
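You can watch the short-circuiting happen with a small sketch (the values here are made up):
symbol = '$$&&#%$'
exclude = ['$$', '&&', '%%']

def contains(ex):
    print('checking', ex)      # shows which elements get evaluated
    return ex in symbol

any(map(contains, exclude))    # prints only "checking $$", returns True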
Putting yield into your function turns it into a generator function. That means that when you call symbols(...), it returns a generator object that you can iterate over. This object does not pre-compute anything until you call next on it. So if you write the data to a file (for example), only the current result will be in memory at once. It may take a long time to write out a large number of results but your memory usage should not spike at all from it.
This little change I made could save you a bit of RAM usage.
for combination in list_b:
    valid_b = True
    for symbols in list_exclude:
        if symbols in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)

Create new list using other list to look up values in dictionary

Consider the below situation. I have a list:
feature_dict = vectorizer.get_feature_names()
which just has some strings, all of which are internal identifiers, completely meaningless. I also have a dictionary (it is filled in a different part of the code):
phoneDict = dict()
This dictionary has mentioned identifiers as keys, and values assigned to them are, well, good values which mean something.
I want to create a new list preserving the order of original list (this is crucial) but replacing each element with the value from dictionary. So I thought about creating new list by applying a function to each element of list but with no luck.
I tried to create a function:
def fastMap(x):
    return phoneDict[x]
And then map it:
map(fastMap, feature_dict)
It just returns me
<map object at 0x0000000017DFBD30>
Nothing else
Anyone tried to solve similar problem?
Just convert the result to list:
list(map(fastMap, feature_dict))
Why? map() returns an iterator, see https://docs.python.org/3/library/functions.html#map:
map(function, iterable, ...)
Return an iterator that applies function to every item of iterable, yielding the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted. For cases where the function inputs are already arranged into argument tuples, see itertools.starmap().
which you can convert to a list with list()
Note: in Python 2, map() returns a list, but this was changed in Python 3 to return an iterator
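For completeness, the same ordered result can be built with a plain list comprehension, skipping the helper function entirely (a sketch using your names):
new_list = [phoneDict[x] for x in feature_dict]
# identical to list(map(fastMap, feature_dict)), order preserved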

Which is faster and more efficient for iterating over a large list: a generator expression or itertools.chain?

I have a large list of strings and I want to iterate over it. I want to figure out the best way to iterate over the list. I have tried the following approaches:
Generator Expression: g = (x for x in list)
Itertools.chain: ch = itertools.chain(list)
Is there another approach, better than these two, for list iteration?
The fastest way is just to iterate over the list. If you already have a list, layering more iterators/generators isn't going to speed anything up.
A good old for item in a_list: is going to be just as fast as any other option, and definitely more readable.
Iterators and generators are for when you don't already have a list sitting around in memory. itertools.count() for instance just generates a single number at a time; it's not working off of an existing list of numbers.
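A minimal sketch of that idea: itertools.islice can take the first few values from the endless itertools.count() without ever building a list:
from itertools import count, islice

evens = (n for n in count() if n % 2 == 0)  # infinite, fully lazy
print(list(islice(evens, 5)))               # [0, 2, 4, 6, 8]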
Another possible use is when you're chaining a number of operations - your intermediate steps can create iterators/generators rather than creating intermediate lists. For instance, if you're wanting to chain a lookup for each item in the list with a sum() call, you could use a generator expression for the output of the lookups, which sum() would then consume:
total_inches_of_snow = sum(inches_of_snow(date) for date in list_of_dates)
This allows you to avoid creating an intermediate list with all of the individual inches of snow and instead just generate them as sum() consumes them, thus saving memory.

Joining a list of object values into a string using list comprehensions in Python 2.7

I have a list of objects. In a single line I would like to create a string that contains a specific variable of each object in the list, separated by commas.
Right now I'm able to achieve this using a combination of list comprehensions and map like so:
','.join(map(str, [instance.public_dns_name for instance in instances]))
or using lambda:
','.join(map(str, [(lambda(i): i.public_dns_name)(instance) for instance in instances]))
Each instance object has a "public_dns_name" variable that returns the host name. This returns a string like this:
host1,host2,host3,host4
Is it possible to achieve the same thing using only the list comprehension?
You can't use just a list comprehension, you'll still need to use join.
It's more efficient to use a generator expression:
','.join(str(instance.public_dns_name) for instance in instances)
A list comprehension would look like this:
','.join([str(instance.public_dns_name) for instance in instances])
The difference is that it creates the entire list in memory before joining, whereas the generator expression will create the components as they are joined.
I'm not sure what you mean by "only the list comprehension", you should still use join ultimately, but the process can be far less convoluted:
','.join(str(instance.public_dns_name) for instance in instances)
No need to create a lambda function here. And remember that join takes any iterable, so you don't have to create a list just to pass it to join.

Parsing an indeterminate amount of data into a Python tuple

I have a config file that contains a list of strings. I need to read these strings in order and store them in memory and I'm going to be iterating over them many times when certain events take place. Since once they're read from the file I don't need to add or modify the list, a tuple seems like the most appropriate data structure.
However, I'm a little confused about the best way to first construct the tuple since it's immutable. Should I parse them into a list then put them in a tuple? Is that wasteful? Is there a way to get them into a tuple first without the overhead of copying/destroying the tuple every time I add a new element?
As you said, you're going to read the data gradually - so a tuple isn't a good idea after all, as it's immutable.
Is there a reason for not using a simple list for holding the strings?
Since your data is changing, I am not sure you need a tuple. A list should do fine.
Look at the following, which should provide further information. Assigning a tuple is much faster than assigning a list. But if you are trying to modify elements every now and then, creating a tuple may not make sense.
Are tuples more efficient than lists in Python?
I wouldn't worry about the overhead of first creating a list and then a tuple from that list. My guess is that the overhead will turn out to be negligible if you measure it.
On the other hand, I would stick with the list and iterate over that instead of creating a tuple. Tuples should be used for struct like data and list for lists of data, which is what your data sounds like to me.
with open("config") as infile:
config = tuple(infile)
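Note that iterating over a file yields each line with its trailing newline; if you want the strings cleaned up (assuming that is what you need), feed tuple() a generator expression:
with open("config") as infile:
    config = tuple(line.strip() for line in infile)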
You may want to try using chained generators to create your tuple. You can use the generators to perform multiple filtering and transformation operations on your input without creating intermediate lists. All of the generator processing is delayed until iteration. In the example below the processing/iteration all happens on the last line.
Like so:
with open('settings.cfg') as f:
    # keep only lines that look like "key: value" and split on the first colon
    step1 = (tuple(i.strip() for i in l.split(':', 1)) for l in f if len(l) > 2 and ':' in l)
    # split comma-separated values, but only for keys containing 'Tag'
    step2 = ((l[0], ',' in l[1] and 'Tag' in l[0] and l[1].split(',') or l[1]) for l in step1)
    t = tuple(step2)
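As a hypothetical trace: if settings.cfg contained the two lines Host: example.com and Tags: a,b,c, then step1 would yield ('Host', 'example.com') and ('Tags', 'a,b,c'), and step2 would split only the comma-separated 'Tag' value, so t would end up as (('Host', 'example.com'), ('Tags', ['a', 'b', 'c'])).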
