Concatenate Big List Elements Efficiently - python

I want to make a list of elements where each element starts with 4 numbers and ends with 4 letters with every possible combination. This is my code
import itertools
def char_range(c1, c2):
"""Generates the characters from `c1` to `c2`"""
for c in range(ord(c1), ord(c2)+1):
yield chr(c)
chars =list()
nums =list()
for combination in itertools.product(char_range('a','b'),repeat=4):
chars.append(''.join(map(str, combination)))
for combination in itertools.product(range(10),repeat=4):
nums.append(''.join(map(str, combination)))
c = [str(x)+y for x,y in itertools.product(nums,chars)]
for dd in c:
print(dd)
This runs fine but when I use a bigger range of characters, such as (a-z) the program hogs the CPU and memory, and the PC becomes unresponsive. So how can I do this in a more efficient way?

The documentation of itertools says that "it is roughly equivalent to nested for-loops in a generator expression". So itertools.product is never an enemy of memory, but if you store its results in a list, that list is. Therefore:
for element in itertools.product(...):
print element
is okay, but
myList = [element for itertools.product(...)]
or the equivalent loop of
for element in itertools.product(...):
myList.append(element)
is not! So you want itertools to generate results for you, but you don't want to store them, rather use them as they are generated. Think about this line of your code:
c = [str(x)+y for x,y in itertools.product(nums,chars)]
Given that nums and chars can be huge lists, building another gigantic list of all combinations on top of them is definitely going to choke your system.
Now, as mentioned in the comments, if you replace all the lists that are too fat to fit into the memory with generators (functions that just yield), memory is not going to be a concern anymore.
Here is my full code. I basically changed your lists of chars and nums to generators, and got rid of the final list of c.
import itertools
def char_range(c1, c2):
"""Generates the characters from `c1` to `c2`"""
for c in range(ord(c1), ord(c2)+1):
yield chr(c)
def char(a):
for combination in itertools.product(char_range(str(a[0]),str(a[1])),repeat=4):
yield ''.join(map(str, combination))
def num(n):
for combination in itertools.product(range(n),repeat=4):
yield ''.join(map(str, combination))
def final(one,two):
for foo in char(one):
for bar in num(two):
print str(bar)+str(foo)
Now let's ask what every combination of ['a','b'] and range(2) is:
final(['a','b'],2)
Produces this:
0000aaaa
0001aaaa
0010aaaa
0011aaaa
0100aaaa
0101aaaa
0110aaaa
0111aaaa
1000aaaa
1001aaaa
1010aaaa
1011aaaa
1100aaaa
1101aaaa
1110aaaa
1111aaaa
0000aaab
0001aaab
0010aaab
0011aaab
0100aaab
0101aaab
0110aaab
0111aaab
1000aaab
1001aaab
1010aaab
1011aaab
1100aaab
1101aaab
1110aaab
1111aaab
0000aaba
0001aaba
0010aaba
0011aaba
0100aaba
0101aaba
0110aaba
0111aaba
1000aaba
1001aaba
1010aaba
1011aaba
1100aaba
1101aaba
1110aaba
1111aaba
0000aabb
0001aabb
0010aabb
0011aabb
0100aabb
0101aabb
0110aabb
0111aabb
1000aabb
1001aabb
1010aabb
1011aabb
1100aabb
1101aabb
1110aabb
1111aabb
0000abaa
0001abaa
0010abaa
0011abaa
0100abaa
0101abaa
0110abaa
0111abaa
1000abaa
1001abaa
1010abaa
1011abaa
1100abaa
1101abaa
1110abaa
1111abaa
0000abab
0001abab
0010abab
0011abab
0100abab
0101abab
0110abab
0111abab
1000abab
1001abab
1010abab
1011abab
1100abab
1101abab
1110abab
1111abab
0000abba
0001abba
0010abba
0011abba
0100abba
0101abba
0110abba
0111abba
1000abba
1001abba
1010abba
1011abba
1100abba
1101abba
1110abba
1111abba
0000abbb
0001abbb
0010abbb
0011abbb
0100abbb
0101abbb
0110abbb
0111abbb
1000abbb
1001abbb
1010abbb
1011abbb
1100abbb
1101abbb
1110abbb
1111abbb
0000baaa
0001baaa
0010baaa
0011baaa
0100baaa
0101baaa
0110baaa
0111baaa
1000baaa
1001baaa
1010baaa
1011baaa
1100baaa
1101baaa
1110baaa
1111baaa
0000baab
0001baab
0010baab
0011baab
0100baab
0101baab
0110baab
0111baab
1000baab
1001baab
1010baab
1011baab
1100baab
1101baab
1110baab
1111baab
0000baba
0001baba
0010baba
0011baba
0100baba
0101baba
0110baba
0111baba
1000baba
1001baba
1010baba
1011baba
1100baba
1101baba
1110baba
1111baba
0000babb
0001babb
0010babb
0011babb
0100babb
0101babb
0110babb
0111babb
1000babb
1001babb
1010babb
1011babb
1100babb
1101babb
1110babb
1111babb
0000bbaa
0001bbaa
0010bbaa
0011bbaa
0100bbaa
0101bbaa
0110bbaa
0111bbaa
1000bbaa
1001bbaa
1010bbaa
1011bbaa
1100bbaa
1101bbaa
1110bbaa
1111bbaa
0000bbab
0001bbab
0010bbab
0011bbab
0100bbab
0101bbab
0110bbab
0111bbab
1000bbab
1001bbab
1010bbab
1011bbab
1100bbab
1101bbab
1110bbab
1111bbab
0000bbba
0001bbba
0010bbba
0011bbba
0100bbba
0101bbba
0110bbba
0111bbba
1000bbba
1001bbba
1010bbba
1011bbba
1100bbba
1101bbba
1110bbba
1111bbba
0000bbbb
0001bbbb
0010bbbb
0011bbbb
0100bbbb
0101bbbb
0110bbbb
0111bbbb
1000bbbb
1001bbbb
1010bbbb
1011bbbb
1100bbbb
1101bbbb
1110bbbb
1111bbbb
Which is the exact result you are looking for. Each element of this result is generated on the fly, hence never creates a memory problem. You can now try and see that much bigger operations such as final(['a','z'],10) are CPU-friendly.

Related

Printing Items From a Weaved List in a For Loop

I am trying to write a short python script for use in a genome assembly. I have generated a long list of compressed files, that are listed alphabetically into List1 and List2.
List1=[R1_aa.fq.gz, R1_ab.fq.gz, R1_ac.fq.gz]
List2=[R2_aa.fq.gz, R2_ab.fq.gz, R2_ac.fq.gz]
Where both lists go all the way to R1/2_bv.fq.gz The first part of my script needs to generate pe# for as many items as there are in the list. In the above examples of my List1 and List2, it would be pe1 pe2 pe3. This is easily done, and I am not having an issue in my script. Where I encounter my problem, is in the second half, where I need to generate text that says where to locate my files in the lists. For instance:
pe1="Path/To_File/R1.aa.fq.gz Path/To_File/R2.aa.fq.gz", pe2="Path/To_File/R1.ab.fq.gz Path/To_file/R2.ab.fq.gz, and so on.
Below is a portion of my script.
List1 = file1in.read().split('\n')
List2 = file2in.read().split('\n')
CombinedList = []
for i,j in zip(List1, List2)
CombinedList.append([i,j])
for i in range(len(CombinedList)//2):
print("pe"+str(i+1), file=Output)
for i in range(len(CombinedList)//2):
print("pe"+str(i+1)+'="'f'{Path}/To_List1/'+CombinedList[i]+" "+f'{Path}/To_list2/'+CombinedList[i+1], file=Output)
exit()
What I am instead getting as an output is the following:
pe1="/Users/devonboland/Desktop/Test RGA/PU_L/PairedUnmapped_R1_split_aa.fq.gz /Users/devonboland/Desktop/Test RGA/PU_R/PairedUnmapped_R2_split_aa.fq.gz"
pe2="/Users/devonboland/Desktop/Test RGA/PU_L/PairedUnmapped_R2_split_aa.fq.gz /Users/devonboland/Desktop/Test RGA/PU_R/PairedUnmapped_R1_split_ab.fq.gz"
pe3="/Users/devonboland/Desktop/Test RGA/PU_L/PairedUnmapped_R1_split_ab.fq.gz /Users/devonboland/Desktop/Test RGA/PU_R/PairedUnmapped_R2_split_ab.fq.gz"
I have been at this small script for over 2 weeks now and have gotten nowhere fast, I would appreciate any help offered!
Based on comments and some lucky guessing, this is what you seem to be looking for.
List1 = ["R1_aa.fq.gz", "R1_ab.fq.gz", "R1_ac.fq.gz"]
List2 = ["R2_aa.fq.gz", "R2_ab.fq.gz", "R2_ac.fq.gz"]
# No need to .append, just create the list
CombinedList = list(zip(List1, List2))
# Apparently
Path = "/Users/devonboland/Desktop/Test RGA"
PeList = []
for i in range(len(CombinedList)):
PeList.append(
f'pe{str(i+1)}="{Path}/PU_L/{CombinedList[i][0]}'
f' {Path}/PU_R/{CombinedList[i][1]}"')
print(", ".join(PeList)
Notice how the items of CombinedList are pairs of items, and the length of the list is simply the number of pairs. Notice also how you can use braces inside an f-string to refer to variables, like you were already doing in some places but not in others. And of course there's no need to exit() at the end of a script; Python will naturally stop executing it when it reaches the end.
... Actually there is no need to zip the lists into pairs, just loop over one and fetch items at the same index from the other in the same loop.
for i in range(len(List1)):
PeList.append(
f'pe{str(i+1)}="{Path}/PU_L/{List1[i]}'
f' {Path}/PU_R/{List2[i]}"')

How to filter elements of Cartesian product following specific ordering conditions

I have to generate multiple reactions with different variables. They have 3 elements. Let's call them B, S and H. And they all start with B1. S can be appended to the element if there is at least one B. So it can be B1S1 or B2S2 or B2S1 etc... but not B1S2. The same goes for H. B1S1H1 or B2S2H1 or B4S1H1 but never B2S2H3. The final variation would be B5S5H5. I tried with itertools.product. But I don't know how to get rid of the elements that don't match my condition and how to add the next element. Here is my code:
import itertools
a = list(itertools.product([1, 2, 3, 4], repeat=4))
#print (a)
met = open('random_dat.dat', 'w')
met.write('Reactions')
met.write('\n')
for i in range(1,256):
met.write('\n')
met.write('%s: B%sS%sH%s -> B%sS%sH%s' %(i, a[i][3], a[i][2], a[i][1], a[i][3], a[i][2], a[i][1]))
met.write('\n')
met.close()
Simple for loops will do what you want:
bsh = []
for b in range(1,6):
for s in range(1,b+1):
for h in range(1,b+1):
bsh.append( f"B{b}S{s}H{h}" )
print(bsh)
Output:
['B1S1H1', 'B2S1H1', 'B2S1H2', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S1H2', 'B3S1H3',
'B3S2H1', 'B3S2H2', 'B3S2H3', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S1H2',
'B4S1H3', 'B4S1H4', 'B4S2H1', 'B4S2H2', 'B4S2H3', 'B4S2H4', 'B4S3H1', 'B4S3H2',
'B4S3H3', 'B4S3H4', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S1H2',
'B5S1H3', 'B5S1H4', 'B5S1H5', 'B5S2H1', 'B5S2H2', 'B5S2H3', 'B5S2H4', 'B5S2H5',
'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S3H4', 'B5S3H5', 'B5S4H1', 'B5S4H2', 'B5S4H3',
'B5S4H4', 'B5S4H5', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
Thanks to #mikuszefski for pointing out improvements.
Patrick his answer in list comprehension style
bsh = [f"B{b}S{s}H{h}" for b in range(1,5) for s in range(1,b+1) for h in range(1,b+1)]
Gives
['B1S1H1',
'B2S1H1',
'B2S1H2',
'B2S2H1',
'B2S2H2',
'B3S1H1',
'B3S1H2',
'B3S1H3',
'B3S2H1',
'B3S2H2',
'B3S2H3',
'B3S3H1',
'B3S3H2',
'B3S3H3',
'B4S1H1',
'B4S1H2',
'B4S1H3',
'B4S1H4',
'B4S2H1',
'B4S2H2',
'B4S2H3',
'B4S2H4',
'B4S3H1',
'B4S3H2',
'B4S3H3',
'B4S3H4',
'B4S4H1',
'B4S4H2',
'B4S4H3',
'B4S4H4']
I would implement your "use itertools.product and get rid off unnecessary elements" solution following way:
import itertools
a = list(itertools.product([1,2,3,4,5],repeat=3))
a = [i for i in a if (i[1]<=i[0] and i[2]<=i[1] and i[2]<=i[0])]
Note that I assumed last elements needs to be smaller or equal than any other. Note that a is now list of 35 tuples each holding 3 ints. So you need to made strs of them for example using so-called f-string:
a = [f"B{i[0]}S{i[1]}H{i[2]}" for i in a]
print(a)
output:
['B1S1H1', 'B2S1H1', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S2H1', 'B3S2H2', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S2H1', 'B4S2H2', 'B4S3H1', 'B4S3H2', 'B4S3H3', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S2H1', 'B5S2H2', 'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S4H1', 'B5S4H2', 'B5S4H3', 'B5S4H4', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
However you might also use another methods of formatting instead of f-string if you wish.

Speeding up a function that outputs products with repeats in Python

I've written a function which returns a list of all possible 'products' of a string with repeats (for a spell check program):
def productRepeats(string):
comboList = []
for item in [p for p in product(string, repeat = len(list(string)))]:
if item not in comboList:
comboList.append("".join(item))
return list(set(comboList))
Therefore, when print(productRepeats("PIP")) is inputted, he output is (I don't care what the order is):
['PII', 'IIP', 'PPI', 'IPI', 'IPP', 'PPP', 'PIP', 'III']
However, if I try anything greater than 5 digits (PIIIIP), it takes about 30 seconds to output, even though there are only 64 ways
Is there any way I could speed this up, as getting the list for the string 'GERAWUHGP', for instance, takes well over half an hour?
Eliminate duplicates before calling product()
product(seq, repeat=len(seq)) will produce duplicate results if & only if seq contains any duplicate elements; e.g., product('ABC', repeat=3) will have no duplicates, but product('ABA', repeat=3) will have some duplicates because A will be selected more than once (and that'll be compounded by the fact that 'ABA' is used as an argument three times). Filter out any duplicates from string first and then pass the result to product, and you'll be able to completely drop the post-product duplicate check, so you can just return the result of product directly:
def productRepeats(string):
return product(set(string), repeat=len(string))
There are a couple of tricks you can use:
Use a list comprehension or map to perform your iteration.
As #jwodder explains, use set(string) to avoid checking for duplicates at a later stage.
Here's a demo. I'm seeing a ~900x improvement for "hello":
from itertools import product
def productRepeats(string):
comboList = []
for item in [p for p in product(string, repeat = len(list(string)))]:
if item not in comboList:
comboList.append("".join(item))
return list(set(comboList))
def productRepeats2(string):
return list(map(''.join, product(set(string), repeat=len(string))))
assert set(productRepeats2('hello')) == set(productRepeats('hello'))
%timeit productRepeats('hello') # 127 ms
%timeit productRepeats2('hello') # 143 µs

Python: itertools.product consuming too much resources

I've created a Python script that generates a list of words by permutation of characters. I'm using itertools.product to generate my permutations. My char list is composed by letters and numbers 01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ. Here is my code:
#!/usr/bin/python
import itertools, hashlib, math
class Words:
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
def __init__(self, size):
self.make(size)
def getLenght(self, size):
res = []
for i in range(1, size+1):
res.append(math.pow(len(self.chars), i))
return sum(res)
def getMD5(self, text):
m = hashlib.md5()
m.update(text.encode('utf-8'))
return m.hexdigest()
def make(self, size):
file = open('res.txt', 'w+')
res = []
i = 1
for i in range(1, size+1):
prod = list(itertools.product(self.chars, repeat=i))
res = res + prod
j = 1
for r in res:
text = ''.join(r)
md5 = self.getMD5(text)
res = text+'\t'+md5
print(res + ' %.3f%%' % (j/float(self.getLenght(size))*100))
file.write(res+'\n')
j = j + 1
file.close()
Words(3)
This script works fine for list of words with max 4 characters. If I try 5 or 6 characters, my computer consumes 100% of CPU, 100% of RAM and freezes.
Is there a way to restrict the use of those resources or optimize this heavy processing?
Does this do what you want?
I've made all the changes in the make method:
def make(self, size):
with open('res.txt', 'w+') as file_: # file is a builtin function in python 2
# also, use with statements for files used on only a small block, it handles file closure even if an error is raised.
for i in range(1, size+1):
prod = itertools.product(self.chars, repeat=i)
for j, r in enumerate(prod):
text = ''.join(r)
md5 = self.getMD5(text)
res = text+'\t'+md5
print(res + ' %.3f%%' % ((j+1)/float(self.get_length(size))*100))
file_.write(res+'\n')
Be warned this will still chew up gigabytes of memory, but not virtual memory.
EDIT: As noted by Padraic, there is no file keyword in Python 3, and as it is a "bad builtin", it's not too worrying to override it. Still, I'll name it file_ here.
EDIT2:
To explain why this works so much faster and better than the previous, original version, you need to know how lazy evaluation works.
Say we have a simple expression as follows (for Python 3) (use xrange for Python 2):
a = [i for i in range(1e12)]
This immediately evaluates 1 trillion elements into memory, overflowing your memory.
So we can use a generator to solve this:
a = (i for i in range(1e12))
Here, none of the values have been evaluated, just given the interpreter instructions on how to evaluate it. We can then iterate through each item one by one and do work on each separately, so almost nothing is in memory at a given time (only 1 integer at a time). This makes the seemingly impossible task very manageable.
The same is true with itertools: it allows you to do memory-efficient, fast operations by using iterators rather than lists or arrays to do operations.
In your example, you have 62 characters and want to do the cartesian product with 5 repeats, or 62**5 (nearly a billion elements, or over 30 gigabytes of ram). This is prohibitively large."
In order to solve this, we can use iterators.
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
for i in itertools.product(chars, repeat=5):
print(i)
Here, only a single item from the cartesian product is in memory at a given time, meaning it is very memory efficient.
However, if you evaluate the full iterator using list(), it then exhausts the iterator and adds it to a list, meaning the nearly one billion combinations are suddenly in memory again. We don't need all the elements in memory at once: just 1. Which is the power of iterators.
Here are links to the itertools module and another explanation on iterators in Python 2 (mostly true for 3).

How to check python codes by reduction?

import numpy
def rtpairs(R,T):
for i in range(numpy.size(R)):
o=0.0
for j in range(T[i]):
o +=2*(numpy.pi)/T[i]
yield R[i],o
R=[0.0,0.1,0.2]
T=[1,10,20]
for r,t in genpolar.rtpairs(R,T):
plot(r*cos(t),r*sin(t),'bo')
This program is supposed to be a generator, but I would like to check if i'm doing the right thing by first asking it to return some values for pheta (see below)
import numpy as np
def rtpairs (R=None,T=None):
R = np.array(R)
T = np.array(T)
for i in range(np.size(R)):
pheta = 0.0
for j in range(T[i]):
pheta += (2*np.pi)/T[i]
return pheta
Then
I typed import omg as o in the prompt
x = [o.rtpairs(R=[0.0,0.1,0.2],T=[1,10,20])]
# I tried to collect all values generated by the loops
It turns out to give me only one value which is 2 pi ... I have a habit to check my codes in the half way through, Is there any way for me to get a list of angles by using the code above? I don't understand why I must use a generator structure to check (first one) , but I couldn't use normal loop method to check.
Normal loop e.g.
x=[i for i in range(10)]
x=[0,1,2,3,4,5,6,7,8,9]
Here I can see a list of values I should get.
return pheta
You switched to return instead of yield. It isn't a generator any more; it's stopping at the first return. Change it back.
x = [o.rtpairs(R=[0.0,0.1,0.2],T=[1,10,20])]
This wraps the rtpairs return value in a 1-element list. That's not what you want. If you want to extract all elements from a generator and store them in a list, call list on the generator:
x = list(o.rtpairs(R=[0.0,0.1,0.2],T=[1,10,20]))

Categories