I only have two sentences that I want to produce variations of and compute the Levenshtein distance between, but when I try to build the list of variations with itertools, even my 64 GB RAM machine gets overloaded.
Is there a way to limit this, even if I have to cap it at a certain number of combinations?
Here is my code so far:
from __future__ import print_function
import itertools
import sys
in_file = sys.argv[1]
X = []
with open(in_file) as f:
    lis = list(f)
X.append([' '.join(x) for x in itertools.product(*map(set, zip(*map(str.split, lis))))])
for x in X:
    print(x)
The problem is not with itertools: itertools works lazily, producing items one at a time. The problem is that you then put all these elements in a list, so all the combinations have to exist at the same time. That obviously requires more memory than processing them iteratively, where the memory of a previous combination can be reused.
If you thus want to print all combinations, without storing them, you can use:
with open(in_file) as f:
    lis = list(f)
for x in itertools.product(*map(set, zip(*map(str.split, lis)))):
    print(' '.join(x))
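Since the end goal is computing Levenshtein distances, the same streaming approach lets you score each variant as product yields it, keeping only what you need. A minimal sketch, assuming a hypothetical reference sentence and using difflib.SequenceMatcher from the standard library as a stand-in scorer (swap in a real Levenshtein function if you have one):
from difflib import SequenceMatcher
from itertools import product
import sys

in_file = sys.argv[1]
reference = "an example reference sentence"  # hypothetical; use your own

with open(in_file) as f:
    lis = list(f)

best = None
for x in product(*map(set, zip(*map(str.split, lis)))):
    candidate = ' '.join(x)
    score = SequenceMatcher(None, reference, candidate).ratio()
    # keep only the best candidate seen so far: O(1) memory
    if best is None or score > best[0]:
        best = (score, candidate)

print(best)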
In case you want to store them, you can limit the number by using itertools.islice:
from itertools import islice, product
X = []
with open(in_file) as f:
    lis = list(f)
X += [' '.join(x) for x in islice(product(*map(set, zip(*map(str.split, lis)))), 1000000)]
Here we thus limit the number of products to 1,000,000.
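Note that islice pulls items lazily, so even a huge Cartesian product is never materialized. A quick self-contained check with hypothetical toy data, just to illustrate:
from itertools import islice, product
toy = [range(100)] * 5  # the full product would have 10**10 tuples
print(list(islice(product(*toy), 3)))  # only three tuples are ever produced
# [(0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 0, 2)]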
I would like to create a set of identities using all combinations of three uppercase letters while avoiding for loops to save computation time. I would like to have identities that range from ID_AAA to ID_ZZZ.
I can do this using for loops:
from string import ascii_uppercase
IDs = []
for id_first_letter in ascii_uppercase:
    for id_second_letter in ascii_uppercase:
        for id_third_letter in ascii_uppercase:
            IDs.append('ID_' + id_first_letter + id_second_letter + id_third_letter)
But of course I would like to simplify the code here. I have tried to use the map function but the best I could come up with was this:
from string import ascii_uppercase
IDs = list(map(lambda x, y, z: 'ID_' + x + y + z, ascii_uppercase, ascii_uppercase, ascii_uppercase))
This is iterating among all letters at the same time, so I can only get ID_AAA, ID_BBB, ..., ID_ZZZ. All three letters are always the same as a consequence. Can I fine-tune this approach in order to iterate one letter at a time or do I need to use a totally different approach?
You are just reimplementing itertools.product.
from itertools import product
from string import ascii_uppercase
IDs = [''.join(["ID_", *t]) for t in product(ascii_uppercase, repeat=3)]
or
IDs = [f'ID_{x}{y}{z}' for x, y, z in product(ascii_uppercase, repeat=3)]
depending on your preference for constructing a string from a trio produced by product.
"Can I fine-tune this approach in order to iterate one letter at a time or do I need to use a totally different approach?"
Yes... it will be three independent loops. It doesn't matter whether you use map, loops or comprehensions; that's just syntactic sugar here. You have to go through all the possibilities either way, no shortcuts imho.
You can use e.g. product from itertools to make it more readable though:
from itertools import product
from string import ascii_uppercase
IDs = []
for a, b, c in product(ascii_uppercase, repeat=3):
    IDs.append(f'ID_{a}{b}{c}')
You are essentially doing this:
>>> [f'ID_{x}{y}{z}' for x in ascii_uppercase for y in ascii_uppercase for z in ascii_uppercase]
['ID_AAA', 'ID_AAB', 'ID_AAC', ... 'ID_ZZX', 'ID_ZZY', 'ID_ZZZ']
Which is otherwise known as a Cartesian Product.
You can also just use product from itertools:
>>> from itertools import product
>>> [f'ID_{x}{y}{z}' for x,y,z in product(ascii_uppercase, repeat=3)]
['ID_AAA', 'ID_AAB', 'ID_AAC', ... 'ID_ZZX', 'ID_ZZY', 'ID_ZZZ']
And don't forget that it's better to refactor your code so that you don't generate the entire list when it's not necessary.
You can do this:
>>> ids_gen=(f'ID_{x}{y}{z}' for x in ascii_uppercase for y in ascii_uppercase for z in ascii_uppercase)
>>> next(ids_gen)
'ID_AAA' # and so on until exhausted...
or this:
>>> ids_gen=(f'ID_{x}{y}{z}' for x,y,z in product(ascii_uppercase, repeat=3))
>>> next(ids_gen)
'ID_AAA'
to generate one at a time.
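If you need the first N IDs rather than one at a time, itertools.islice works on these generators too; a small sketch:
from itertools import islice, product
from string import ascii_uppercase
ids_gen = (f'ID_{x}{y}{z}' for x, y, z in product(ascii_uppercase, repeat=3))
print(list(islice(ids_gen, 5)))  # take the first five without building all 17576
# ['ID_AAA', 'ID_AAB', 'ID_AAC', 'ID_AAD', 'ID_AAE']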
(From comments, you can also do this generator:
from itertools import product, starmap
ids_gen=starmap("ID_{}{}{}".format, product(ascii_uppercase, repeat=3))
which is pretty cool...)
I have a 1m+ row dataset, and each row has a combination of lower/uppercase letters, symbols and numbers. I am looking to clean this data and keep only the last instance where a lowercase letter and a number are beside each other. For speed, my current plan is to hold this data as an array of strings and then use a .findall operation to keep the letter/number combo I'm looking for.
Here is something along the lines of what I am trying to do:
Input
list = Array(["Nd4","0-0","Nxe4","e8+","e4g2"])
newList = list.findall('[a-z]\d')[len(list.findall('[a-z]\d')) - 1]
Expected Output from newList
newList = ("d4","","e4","e8","g2")
It is not recommended to use "list" as a variable name, since it shadows a built-in function.
import re
import numpy as np
lists = np.array(["Nd4","0-0","Nxe4","e8+","e4g2"])
def findall(i, pattern=r'[a-z1-9]+'):
    return re.findall(pattern, i)[0] if re.findall(pattern, i) else ""
newList = [findall(i) for i in lists]
# OR if you want to return an array
newList = np.array(list(map(findall,lists)))
# >>> ['d4', '', 'xe4', 'e8', 'e4g2']
This may not be the prettiest way, but I think it gets the job done!
import re
import numpy as np
lists = np.array(["Nd4","0-0","Nxe4","e8+","e4g2"])
def function(i):
    try:
        return re.findall(r'[a-z]\d', i)[len(re.findall(r'[a-z]\d', i)) - 1]
    except IndexError:  # no match found
        return ""
newList = [function(i) for i in lists]
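Both snippets call re.findall twice per string; a slightly tighter variant of the same idea (a sketch, not from either answer) compiles the pattern once, runs it once per string, and takes the last hit with [-1]:
import re
import numpy as np

lists = np.array(["Nd4", "0-0", "Nxe4", "e8+", "e4g2"])
pattern = re.compile(r'[a-z]\d')  # a lowercase letter followed by a digit

def last_match(s):
    matches = pattern.findall(s)           # run the regex once per string
    return matches[-1] if matches else ""  # last hit, or "" when absent

newList = [last_match(i) for i in lists]
# ['d4', '', 'e4', 'e8', 'g2'] -- matches the expected output above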
I wanted to know which would be better to return: a list of elements or a generator which yields each element from that list. After checking the size of each, the generator was much smaller.
from sys import getsizeof
my_list = [i for i in range(100)]
my_gen = (i for i in my_list)
print(getsizeof(my_list)) # prints "452"
print(getsizeof(my_gen)) # prints "56"
How does the generator take up so much less memory?
Check the Generators documentation: a generator is a pattern that produces items on the fly. Because items are produced lazily, the yield and yield from keywords can handle even infinite data streams.
As chepner guessed, the size of a generator is not affected by the size of the list.
from sys import getsizeof
short_list = [i for i in range(10)]
long_list = [i for i in range(100)]
print(getsizeof(short_list)) # prints '92'
print(getsizeof(long_list)) # prints '452'
gen_short_list = (i for i in short_list)
gen_long_list = (i for i in long_list)
print(getsizeof(gen_short_list)) # prints '56'
print(getsizeof(gen_long_list)) # prints '56'
Besides that, the size of whatever the generator will eventually yield does not matter either: getsizeof measures only the generator object itself. So, there we go.
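To make the point concrete, the generator stays the same size even when it is backed by an infinite iterator; a quick sketch:
from sys import getsizeof
from itertools import count
gen_infinite = (i for i in count())  # backed by an endless iterator
print(getsizeof(gen_infinite))       # same small constant as above ('56' on the asker's build)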
I have a string array for example [a_text, b_text, ab_text, a_text]. I would like to get the number of objects that contain each prefix such as ['a_', 'b_', 'ab_'] so the number of 'a_' objects would be 2.
So far I've been counting each by filtering the array, e.g. num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter, since I am filtering the array once per prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop, so that may make it faster.
Also perhaps instead of the filter I could use list comprehension to make it faster?
You can use collections.Counter with a regular expression (if all of your strings have prefixes):
from collections import Counter
import re
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:
arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check that the underscore exists, or else a word without one would have its whole text counted as a prefix (you couldn't distinguish 'a' from 'a_a').
from collections import defaultdict
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def count_prefixes(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.
from timeit import timeit
setup = """
from collections import Counter, defaultdict
import re
array = ['a_text', 'b_text', 'ab_text', 'a_text']
def with_defaultdict(arr):
counts = defaultdict(int)
for item in arr:
if '_' in item:
counts[item.split('_')[0] + '_'] += 1
return counts
def with_counter(arr):
matches = [re.match(r'^.*?_', i) for i in arr]
return Counter([i.group() for i in matches if i])
"""
for method in ('with_defaultdict', 'with_counter'):
print(timeit('{}(array)'.format(method), setup=setup, number=1))
Timing results:
0.4836089063341265
1.3238173544676142
If I'm understanding what you're asking for, it seems like you really want to use Regular Expressions (Regex). They're built for just this sort of pattern-matching use. I don't know Python, but I can see that regular expressions are supported, so it's a matter of using them. An online regex tester makes it easy to craft and test your regex.
You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}
counter = Counter()
for word in arr:
    before, delim, _ = word.partition('_')
    prefix = before + delim
    if prefix in prefixes:
        counter[prefix] += 1
print(counter)
Which Outputs:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
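Since Counter accepts any iterable, the explicit loop can also be folded into a generator expression; a compact sketch of the same partition idea:
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}
counter = Counter(
    before + delim                     # reassemble the prefix, e.g. 'a' + '_'
    for before, delim, _ in (w.partition('_') for w in arr)
    if before + delim in prefixes      # skip words without a known prefix
)
print(counter)  # Counter({'a_': 2, 'b_': 1, 'ab_': 1})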
I'm basically trying to do this (pseudo code, not valid python):
limit = 10
results = [xml_to_dict(artist) for artist in xml.findall('artist') while limit--]
So how could I code this in a concise and efficient way?
The XML file can contain anything between 0 and 50 artists, and I can't control how many to get at a time, and AFAIK, there's no XPATH expression to say something like "get me up to 10 nodes".
Thanks!
Are you using lxml? You could use XPath to limit the items in the query level, e.g.
>>> from lxml import etree
>>> from io import StringIO
>>> xml = etree.parse(StringIO('<foo><bar>1</bar><bar>2</bar><bar>4</bar><bar>8</bar></foo>'))
>>> [bar.text for bar in xml.xpath('bar[position()<=3]')]
['1', '2', '4']
You could also use itertools.islice to limit any iterable, e.g.
>>> from itertools import islice
>>> [bar.text for bar in islice(xml.iterfind('bar'), 3)]
['1', '2', '4']
>>> [bar.text for bar in islice(xml.iterfind('bar'), 5)]
['1', '2', '4', '8']
Assuming that xml is an ElementTree object, the findall() method returns a list, so just slice that list:
limit = 10
limited_artists = xml.findall('artist')[:limit]
results = [xml_to_dict(artist) for artist in limited_artists]
For everyone else who found this question because they were trying to limit items returned from an infinite generator:
from itertools import takewhile
ltd = takewhile(lambda x: x[0] < MY_LIMIT, enumerate( MY_INFINITE_GENERATOR ))
# ^ This is still an iterator.
# If you want to materialize the items, e.g. in a list, do:
ltd_m = list( ltd )
# If you don't want the enumeration indices, you can strip them as usual:
ltd_no_enum = [ v for i,v in ltd_m ]
EDIT: Actually, islice is a much better option.
limit = 10
limited_artists = [artist for artist in xml.findall('artist')][:limit]
results = [xml_to_dict(artist) for artist in limited_artists]
The generator below avoids the issues of slicing: it doesn't change the order of operations, and it doesn't construct a new list, which can matter for large lists if you're filtering the list comprehension.
def first(it, count):
    it = iter(it)
    for _ in range(count):
        try:
            yield next(it)
        except StopIteration:  # underlying iterable ran out early
            return             # just end; raising StopIteration in a generator is an error (PEP 479)

print([i for i in first(range(1000), 5)])
It also works properly with generator expressions, where slicing will fall over due to memory use:
exp = (i for i in first(range(1000000000), 10000000))
for i in exp:
    print(i)
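For completeness, itertools.islice (mentioned in the edit above) does the same job as the hand-rolled first, including for generator expressions; a brief sketch:
from itertools import islice
for i in islice(range(1000000000), 10000000):  # lazily yields only the first ten million
    print(i)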