I would like to create a set of identities using all combinations of three uppercase letters while avoiding for loops to save computation time. I would like to have identities that range from ID_AAA to ID_ZZZ.
I can do this using for loops:
from string import ascii_uppercase

IDs = []
for id_first_letter in ascii_uppercase:
    for id_second_letter in ascii_uppercase:
        for id_third_letter in ascii_uppercase:
            IDs.append('ID_' + id_first_letter + id_second_letter + id_third_letter)
But of course I would like to simplify the code here. I have tried to use the map function but the best I could come up with was this:
from string import ascii_uppercase

IDs = list(map(lambda x, y, z: 'ID_' + x + y + z, ascii_uppercase, ascii_uppercase, ascii_uppercase))
This iterates over all three strings in lockstep, so I can only get ID_AAA, ID_BBB, ..., ID_ZZZ; all three letters are always the same as a consequence. Can I fine-tune this approach in order to iterate one letter at a time or do I need to use a totally different approach?
You are just reimplementing itertools.product.
from itertools import product
from string import ascii_uppercase
IDs = [''.join(["ID_", *t]) for t in product(ascii_uppercase, repeat=3)]
or
IDs = [f'ID_{x}{y}{z}' for x, y, z in product(ascii_uppercase, repeat=3)]
depending on your preference for constructing a string from a trio produced by product.
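Either way you should get the full 26³ = 17,576 IDs; a quick sanity check:

from itertools import product
from string import ascii_uppercase

IDs = [f'ID_{x}{y}{z}' for x, y, z in product(ascii_uppercase, repeat=3)]
print(len(IDs))  # 17576 == 26 ** 3
print(IDs[:3])   # ['ID_AAA', 'ID_AAB', 'ID_AAC']
print(IDs[-1])   # ID_ZZZ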
"Can I fine-tune this approach in order to iterate one letter at a time or do I need to use a totally different approach?"
Yes... it will still be 3 nested loops. It doesn't matter whether you use map, loops or comprehensions; those are just syntactic sugar here. You have to go through all the possibilities, no shortcuts imho.
You can use e.g. product from itertools to make it more readable though:
from itertools import product
from string import ascii_uppercase
IDs = []
for a, b, c in product(ascii_uppercase, repeat=3):
    IDs.append(f'ID_{a}{b}{c}')
You are essentially doing this:
>>> [f'ID_{x}{y}{z}' for x in ascii_uppercase for y in ascii_uppercase for z in ascii_uppercase]
['ID_AAA', 'ID_AAB', 'ID_AAC', ... 'ID_ZZX', 'ID_ZZY', 'ID_ZZZ']
Which is otherwise known as a Cartesian Product.
You can also just use product from itertools:
>>> from itertools import product
>>> [f'ID_{x}{y}{z}' for x,y,z in product(ascii_uppercase, repeat=3)]
['ID_AAA', 'ID_AAB', 'ID_AAC', ... 'ID_ZZX', 'ID_ZZY', 'ID_ZZZ']
And don't forget that it is better to refactor your code so that you don't generate the entire list if it is not necessary.
You can do this:
>>> ids_gen = (f'ID_{x}{y}{z}' for x in ascii_uppercase for y in ascii_uppercase for z in ascii_uppercase)
>>> next(ids_gen)
'ID_AAA' # and so on until exhausted...
or this:
>>> ids_gen = (f'ID_{x}{y}{z}' for x, y, z in product(ascii_uppercase, repeat=3))
>>> next(ids_gen)
'ID_AAA'
to generate one at a time.
(From comments, you can also do this generator:
from itertools import product, starmap
ids_gen = starmap("ID_{}{}{}".format, product(ascii_uppercase, repeat=3))
which is pretty cool...)
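To pull just a handful of IDs from any of these generators without exhausting them, a minimal sketch using itertools.islice:

from itertools import islice
print(list(islice(ids_gen, 3)))  # ['ID_AAA', 'ID_AAB', 'ID_AAC']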
Related
I know how to create a new list based on the values of an existing list, e.g. casting:
numspec = [float(x) for x in textspec]
Now I have a list of numbers where I need to subtract a value based on the index in the list. I have calculated a and b values and ended up doing:
peakadj = []
for i in range(len(peakvalues)):
    val = peakvalues[i] - (i*a + b)
    peakadj.append(val)
This works, but I don't like the feel of it, is there any more pythonic way of doing this?
Use the builtin enumerate function and a list comprehension.
peakadj = [val-(i*a+b) for i, val in enumerate(peakvalues)]
Perhaps faster:
from itertools import count
peakadj = [val-iab for val, iab in zip(peakvalues, count(b, a))]
Or:
from itertools import count
from operator import sub
peakadj = [*map(sub, peakvalues, count(b, a))]
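A quick check that all three variants agree, using made-up values for a, b and peakvalues (hypothetical data, just for illustration):

from itertools import count
from operator import sub

a, b = 0.5, 2.0                       # hypothetical slope and offset
peakvalues = [10.0, 12.5, 9.0, 14.0]  # hypothetical data

v1 = [val - (i*a + b) for i, val in enumerate(peakvalues)]
v2 = [val - iab for val, iab in zip(peakvalues, count(b, a))]
v3 = [*map(sub, peakvalues, count(b, a))]

assert v1 == v2 == v3
print(v1)  # [8.0, 10.0, 6.0, 10.5]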
I have a string array, for example ['a_text', 'b_text', 'ab_text', 'a_text']. I would like to get the number of objects that contain each prefix, such as ['a_', 'b_', 'ab_'], so the number of 'a_' objects would be 2.
So far I've been counting each by filtering the array, e.g. num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter, since I am filtering the array for each prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop, so that may make it faster.
Also, perhaps instead of the filter I could use a list comprehension to make it faster?
You can use collections.Counter with a regular expression (if all of your strings have prefixes):
import re
from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:
arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it, by splitting at the underscore. You need to check that the underscore exists, or else the whole word will be taken as a prefix (without the check, 'a' would be counted the same as 'a_a').
from collections import defaultdict
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def count_prefixes(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
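Running it on the 1,000,000-item sample array above gives:

print(dict(count_prefixes(array)))
# {'a_': 500000, 'b_': 250000, 'ab_': 250000}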
The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.
from timeit import timeit
setup = """
from collections import Counter, defaultdict
import re
array = ['a_text', 'b_text', 'ab_text', 'a_text']
def with_defaultdict(arr):
counts = defaultdict(int)
for item in arr:
if '_' in item:
counts[item.split('_')[0] + '_'] += 1
return counts
def with_counter(arr):
matches = [re.match(r'^.*?_', i) for i in arr]
return Counter([i.group() for i in matches if i])
"""
for method in ('with_defaultdict', 'with_counter'):
    print(timeit('{}(array)'.format(method), setup=setup, number=1))
Timing results:
0.4836089063341265
1.3238173544676142
If I'm understanding what you're asking for, it seems like you really want to use Regular Expressions (Regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's a matter of using them. An online regex tester makes it easy to craft and test your regex.
You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():
from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}

counter = Counter()
for word in arr:
    before, delim, _ = word.partition('_')
    prefix = before + delim
    if prefix in prefixes:
        counter[prefix] += 1

print(counter)
Which Outputs:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
I only have two sentences that I want to produce variations of and compute the Levenshtein distance between, but when trying to produce this list with itertools even my 64GB RAM machine gets overloaded.
Is there a way to limit this, even if I have to limit it to a certain number of combinations?
Here is my code so far:
from __future__ import print_function
import itertools
import sys

in_file = sys.argv[1]

X = []
with open(in_file) as f:
    lis = list(f)
    X.append([' '.join(x) for x in itertools.product(*map(set, zip(*map(str.split, lis))))])

for x in X:
    print(x)
The problem is not with itertools: itertools works lazily and produces iterables. The problem is that you first want to put all these elements in a list, so all the combinations have to exist at the same time. This obviously requires more memory than doing it iteratively, since in the latter case the memory of a previous combination can be reused.
If you thus want to print all combinations, without storing them, you can use:
with open(in_file) as f:
    lis = list(f)
    for x in itertools.product(*map(set, zip(*map(str.split, lis)))):
        print(' '.join(x))
In case you want to store them, you can limit the number by using itertools.islice:
from itertools import islice, product

X = []
with open(in_file) as f:
    lis = list(f)
    X += [' '.join(x) for x in islice(product(*map(set, zip(*map(str.split, lis)))), 1000000)]

Here we thus limit the number of products to 1,000,000.
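As a minimal sketch of why this works: product builds its result lazily, so consuming only a slice of it never materialises the rest:

from itertools import islice, product

gen = product('AB', repeat=2)  # lazy: nothing has been generated yet
print(list(islice(gen, 2)))    # [('A', 'A'), ('A', 'B')]
print(next(gen))               # ('B', 'A') -- iteration resumes where islice stopped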
I have used the following script to generate all combinations:
import itertools

x = 7
my_list = range(1, 11)

for k in [x]:
    for sublist in itertools.combinations(my_list, k):
        print(sublist)
For the second part I will take 6 random elements from range(1, 11). Let's call them my_second_list.
I need to generate the minimum number of combinations of my_list in order to obtain at least one combination that includes, let's say, 5 elements from my_second_list.
Any ideas on how to do that?
import itertools

x = 7
my_list = range(1, 11)

for k in [x]:  # isn't this just 7?
    your_output = (combination for combination in itertools.combinations(my_list, k)
                   if all(element in combination for element in [1, 2, 3, 4, 5]))
It's ugly as heck, but that's how I'd do it (if I'm understanding your question correctly: you're trying to get only those combinations that contain a certain subset of items, right?). If you want all combinations BEFORE and INCLUDING the first combination that contains the subset of items, I'd do:
accumulator = list()
subset = [1, 2, 3, 4, 5]  # or whatever

for k in [x]:
    for combination in itertools.combinations(my_list, k):
        accumulator.append(combination)
        if all(el in combination for el in subset):
            break
Depending on your exact use case you may want to consider defining subset as a set (e.g. {1,2,3,4,5}) and do subset.issubset(set(combination)) but it's hard to tell if that's better or not without doing some profiling.
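For example, the set-based test would look like this (note that issubset accepts any iterable, so the set(combination) conversion is optional):

subset = {1, 2, 3, 4, 5}
print(subset.issubset((1, 2, 3, 4, 5, 6, 7)))  # True
print(subset.issubset((1, 2, 3, 6, 7, 8, 9)))  # False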
I have a dictionary of Counters, e.g.:
from collections import Counter, defaultdict
numbers = defaultdict(Counter)
numbers['a']['first'] = 1
numbers['a']['second'] = 2
numbers['b']['first'] = 3
I want to get the sum: 1+2+3 = 6
What would be the simplest / most idiomatic way to do this in Python 3?
Use a nested comprehension:
sum(x for counter in numbers.values() for x in counter.values())
Or sum first the counters (starting with an empty one), and then their values:
sum(sum(numbers.values(), Counter()).values())
Or first each counter's values, and then the intermediate results:
sum(sum(c.values()) for c in numbers.values())
Or use chain:
from itertools import chain
sum(chain.from_iterable(d.values() for d in numbers.values()))
I prefer the first way.
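A quick self-contained check that all four expressions produce the same total on the example data from the question:

from collections import Counter, defaultdict
from itertools import chain

numbers = defaultdict(Counter)
numbers['a']['first'] = 1
numbers['a']['second'] = 2
numbers['b']['first'] = 3

assert sum(x for counter in numbers.values() for x in counter.values()) == 6
assert sum(sum(numbers.values(), Counter()).values()) == 6
assert sum(sum(c.values()) for c in numbers.values()) == 6
assert sum(chain.from_iterable(d.values() for d in numbers.values())) == 6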
In terms of performance, use .itervalues() in Python 2.x; that avoids building an intermediary list (applies to all solutions here). In Python 3, .values() already returns a lazy view, so no change is needed there.
sum(chain.from_iterable(d.itervalues() for d in numbers.itervalues()))