Finding overlapping bits in a set of binary strings - python

I am working on a project in which I need to generate several identifiers for combinatorial pooling of different molecules. To do so, I assign each molecule an n-bit string (where n is the number of pools I have; in this case, 79 pools) and each string has 4 "on" bits (4 bits equal to 1) corresponding to the pools that molecule will appear in. Next, I want to pare down the number of strings so that no two molecules appear together in more than two pools (in other words, the number of overlapping bits between any two strings can be no greater than 2).
To do this, I: 1) compiled a list of all n-bit strings with k "on" bits, 2) generated a list of lists where each element is a list of the indices at which the bits are on (using re.finditer), and 3) iterated through the list to compare strings, adding only strings that meet my criteria to my final list of strings.
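For context, steps 1 and 2 might look roughly like this (this sketch is illustrative and not shown in the question; itertools.combinations is an assumption, and the names strings and bit_list match the comparison code below):

import itertools
import re

n, k = 79, 4  # 79 pools, 4 "on" bits per string (~1.5M combinations)
strings = []
bit_list = []
for on_bits in itertools.combinations(range(n), k):
    s = ''.join('1' if i in on_bits else '0' for i in range(n))
    strings.append(s)
    bit_list.append([m.start() + 1 for m in re.finditer('1', s)])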
The code I use to compare strings:
drug_strings = []  # To store suitable strings for combinatorial pooling rules

class badString(Exception): pass

for k in range(len(strings)):
    bit_current = bit_list[k]
    try:
        for bits in bit_list[:k]:
            intersect = set.intersection(set(bit_current), set(bits))
            if len(intersect) > 2:
                raise badString()  # pass on to next iteration if string has overlaps in previous set
        drug_strings.append(strings[k])
    except badString:
        pass
However, this code takes forever to run. I am running this with n=79-bit strings with k=4 "on" bits per string (~1.5M possible strings) so I assume that the long runtime is because I am comparing each string to every previous string. Is there an alternative/smarter way to go about doing this? An algorithm that would work faster and be more robust?
EDIT: I realized that a simpler way to approach this problem, instead of identifying the entire subset of strings that would be suitable for my project, was to randomly sample the larger set of n-bit strings with k "on" bits, store only the strings that fit my criteria, and then, once I have enough suitable strings, simply take as many as I need from those. New code is as follows:
import re
import numpy as np

# strings_77 is the list of candidate bit strings built earlier
my_strings = []
my_bits = []
for k in range(2000):
    random = np.random.randint(0, len(strings_77))
    string = strings_77.pop(random)
    bits = [m.start()+1 for m in re.finditer('1', string)]
    if all(len(set(bits) & set(my_bit)) <= 2
           for my_bit in my_bits[:k]):
        my_strings.append(string)
        my_bits.append(bits)
Now I only have to compare against strings I've already pulled (at most 1999 previous strings instead of up to 1 million). It runs much more quickly this way. Thanks for the help!

Raising exceptions is expensive. A complex data structure is created and the stack has to be unwound. In fact, setting up a try/except block is expensive.
Really you want to check that all intersections have length less than or equal to two, and then append. There is no need for exceptions.
for k in range(len(strings)):
    bit_current = bit_list[k]
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(strings[k])
Also, instead of indexing into strings and bit_list, you can iterate over both at the same time with zip. You still need the index for the bit_list slice, so use enumerate:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)
You can also avoid re-creating the set of bit_current on every inner comparison:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    bit_set = set(bit_current)
    if all(len(bit_set & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)

Some minor things that I would improve in your code that may be causing some overhead:
Move set(bit_current) outside the inner loop.
Remove the raise/except machinery.
Since you have the condition if len(intersect) > 2:, you could implement the intersection yourself so that it stops as soon as that condition is met. That way you avoid unnecessary computation.
So the code would become:
for k in range(len(strings)):
    bit_current = set(bit_list[k])
    intersect = []
    for bits in bit_list[:k]:
        intersect = []
        b = set(bits)
        for i in bit_current:
            if i in b:
                intersect.append(i)
                if len(intersect) > 2:
                    break
        if len(intersect) > 2:
            break
    if len(intersect) <= 2:
        drug_strings.append(strings[k])

Related

Finding common string in list and displaying them

I am trying to create a function compare(lst1, lst2) which compares each element in a list, returns every common element in a new list, and shows the percentage of how common it is. All the elements in the list are going to be strings. For example, the function should return:
lst1 = AAAAABBBBBCCCCCDDDD
lst2 = ABCABCABCABCABCABCA
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The parts of the list which are not similar will simply be returned as x.
I am having trouble completing this function without Python's set and zip methods. I am not allowed to use them for this task and have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'
print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421
You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < len(A), len(B) (remember Python sequences are 0-indexed). Your problem statement requires:
1. If character i in A is identical to character i in B, yield that character.
2. Otherwise, yield some placeholder to denote the mismatch.
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do when i is beyond the end of the shorter string. You might represent those positions with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally-long strings (for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string).
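Putting those hints together, a hedged sketch that uses only a for loop over indices (no zip or set); the function name compare comes from the question, everything else is illustrative:

def compare(lst1, lst2):
    shorter = min(len(lst1), len(lst2))
    longer = max(len(lst1), len(lst2))
    common = ''
    matches = 0
    for i in range(longer):
        if i < shorter and lst1[i] == lst2[i]:
            common = common + lst1[i]   # characters agree at this position
            matches += 1
        else:
            common = common + 'x'       # mismatch, or past the end of the shorter string
    return common, matches / longer     # similarity relative to the longer string

strand, similarity = compare('AAAAABBBBBCCCCCDDDD', 'ABCABCABCABCABCABCA')
print(strand)        # AxxAxxxBxxxCxxCxxxx
print(similarity)    # 0.2631...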

Re-generate a random index until the indexed element meets the condition n times in Python

I know there have been a few similar questions regarding loop and random numbers, but I can't seem to find a solution for my problem.
Say I have a fixed list of numbers from my dataset and a threshold that the number has to meet:
x = (7,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41)
threshold= 25
I need to randomly pick a number from this list. Unfortunately, I cannot just loop directly over my original list; I'm forced to randomly pick an index of the list first and then find my number. So, for example, if I randomly generate the index 1, I get x[1], which is 11.
The final result I need is to find numbers that are greater than the threshold at least 3 times and put all the resulting numbers in a list, at which point my loop can stop. (The indices cannot repeat.)
As an example, a possible final result would be (27, 29, 31) (the results can be in any format). I'm thinking of something like this to start, but I really need help to proceed.
Unless you are particularly concerned about the memory usage of creating an additional filtered list, the simplest would probably be to start by doing this:
filtered = [i for i in x if i > threshold]
You can then choose three samples from this filtered list (after import random). The following will potentially choose the same item more than once:
random.choices(filtered, k=3)
or if you want to avoid choosing the same item more than once:
random.sample(filtered, k=3)
Each of the above functions will output a list. Use tuple(....) on the output if you need to convert it to a tuple.
First, a clarification. Do you need to pick a random element from the list each iteration, or do you need to pick a different random element from the list each time? I.e., can the same index be picked twice? You're doing the latter.
Second, you want to use range(len(x)). You don't want to hardwire the length of x into your code, and you want index 0 to be a possibility. random.shuffle() may be a better choice.
Lastly, you want to do something like:
result = []
for ....
    if select >= threshold:
        result.append(select)
        if len(result) >= 3: break
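A sketch combining those suggestions (indices from range(len(x)), random.shuffle() so that indices never repeat, and the accumulation loop); the variable names follow the question:

import random

x = (7, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41)
threshold = 25

indices = list(range(len(x)))
random.shuffle(indices)          # a random order of indices, none repeated

result = []
for i in indices:
    select = x[i]
    if select >= threshold:
        result.append(select)
        if len(result) >= 3:
            break

print(result)                    # e.g. [27, 41, 29]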
If we assume the following constraints:
We are not allowed to loop over the original list (including list comprehension)
We are only allowed to access one member of the original list at a time through its index
We must pick 3 distinct members of the list that are greater or equal to the threshold
The following code should satisfy all of them:
import random

x = [7, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
threshold = 25
result_index = []
while len(result_index) < 3:
    index = random.randrange(0, len(x))
    if x[index] >= threshold and index not in result_index:
        result_index.append(index)
result = [x[a] for a in result_index]
Here is how this works:
In the loop, we store indices, not the numbers themselves.
For each index we check 2 conditions: there is a number there that is bigger or equal to the threshold and we haven't seen this index before.
If the conditions are satisfied, we save the index, not the number!
Repeat until we have 3 indices.
Build new list by getting numbers from those indices directly.

need to decrease the run time of my program

I had a question where I had to find contiguous substrings of a string, and the condition was that the first and last letters of the substring had to be the same. I tried doing it, but the runtime exceeded the time limit for the question for several test cases. I tried using map for one for loop, but I have no idea what to do for the nested for loop. Can anyone please help me decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())

def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i, length):
            list.append(string[i:j + 1])
    return list

substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)
If I understand the question properly (please let me know if that's not the case), try this:
I'm not sure this will speed up the runtime, but I believe this algorithm may, especially for longer strings, since it eliminates the nested loop. Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished, you should have a data structure storing all the different locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i, for example: after doing what I said above, you look it up in the hashmap and see that it occurs at indices 3 and 6, meaning you can do something like substring(3, 6) to get it.
Not the best code, but it seems a reasonable starting point... you may be able to eliminate a loop with some creative thinking:
import string
import itertools

my_string = 'helloilovetocode'
mappings = dict()
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char

for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1]+1)]
The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections

def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices

def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check each possible substring, when you can determine the actual matching pairs in a single, linear-time pass. If you only want the count, it can be determined easily in O(MN) time for a string of length N with M unique characters (given the number of occurrences of a character, you can mathematically figure out how many qualifying substrings it produces). Of course, in the worst case (all characters are the same), your code will have the same complexity as ours, but in the average case yours is much worse, since the nested for loop makes it O(n^2).
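For the count alone, a minimal sketch of that counting idea (it assumes single-character substrings qualify, as they do in the original code, and sums c*(c+1)//2 over the occurrence count c of each character):

import collections

def count_same_end_substrings(s):
    counts = collections.Counter(s)   # occurrences of each character
    # a character appearing c times yields c*(c+1)//2 substrings whose first and last letters match
    return sum(c * (c + 1) // 2 for c in counts.values())

print(count_same_end_substrings('helloilovetocode'))  # 28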

Condensing a long if statement to a short one automatically

Let's say we have an if statement in Python of the form:
if a == 1 or a == 2 or a == 3 ... or a == 100000
for a large number of comparisons, all connected with or.
What would a good algorithm be to compress that into a smaller if statement?
eg for the above
if a >= 1 and a <= 100000
Sometimes there will be a pattern in the numbers and sometimes they will be completely random so the algorithm must deal well with both cases.
Can anyone suggest a decent algorithm that will efficiently condense an if statement of this form?
Edit: The goal is to have the resulting if statement be as short as possible. The efficiency of evaluating the if statement is secondary to length.
If there is no pattern in your numbers, you can use a tuple and a membership test:
if a in (1, 2, 3, ... 100000)
You sort your "compare list", then traverse it to extract runs of consecutive integers as separate intervals. For intervals of length 1 (i.e. single numbers) you perform an == test, and for larger intervals you can perform chained <=/>= comparisons.
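A rough sketch of that run-extraction step (the helper name and the list-of-runs representation are illustrative, not from the answer):

def to_intervals(values):
    # Collapse integers into [low, high] runs of consecutive values.
    intervals = []
    for v in sorted(values):
        if intervals and v == intervals[-1][1] + 1:
            intervals[-1][1] = v          # extend the current run
        else:
            intervals.append([v, v])      # start a new run
    return intervals

# to_intervals([1, 2, 3, 7, 10, 11]) -> [[1, 3], [7, 7], [10, 11]]
# Runs of length 1 become an == test; longer runs become a chained low <= a <= high test.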
Maintain a sorted array of the numbers to compare against and perform a binary search on it whenever you want to check a. If a exists in the array, the statement is true, else false. Each query is O(log n).
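A hedged sketch of that binary-search idea using the standard bisect module (the names are illustrative):

import bisect

numbers = sorted([1, 2, 3, 7, 9, 100000])    # the sorted "compare list"

def matches(a, numbers=numbers):
    i = bisect.bisect_left(numbers, a)        # O(log n) lookup
    return i < len(numbers) and numbers[i] == a

if matches(7):
    pass  # whatever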
A quick and easy way is:
if a in set(numbers):
    whatever
Also, given
x = random.sample(xrange(100), 30)
you could build up a list of ranges (x needs to be sorted, and the list of ranges initialised):
import random

ranges = []
for i in sorted(x):
    if ranges and i - 1 in ranges[-1]:
        ranges[-1].append(i)
    else:
        if ranges: ranges[-1] = ranges[-1][0], ranges[-1][-1]
        ranges.append([i])
tidy the ranges up and then loop through them:
for r in ranges:
    if r[0] <= a <= r[1]:
        whatever
What works in the least space for the if statement is going to be a test for membership in a set of values. (Don't use a tuple or a list for this - this is what sets are for):
if a in set_of_numbers:
    blah ...
If you know from your wider problem that, when expressed as ranges of integers, you will usually have few enough ranges to compare, then you can use code from the Rosetta Code Range extraction task to create the ranges, with a different print routine printi to format the output as an if statement:
def printi(ranges):
    print( 'if %s:' %
           ' or '.join( (('%i<=a<=%i' % r) if len(r) == 2 else 'a==%i' % r)
                        for r in ranges ) )
For the Rosetta code example numbers it will produce the following:
if 0<=a<=2 or a==4 or 6<=a<=8 or a==11 or a==12 or 14<=a<=25 or 27<=a<=33 or 35<=a<=39:
Tradeoffs
For a few, large ranges, a large set would have to be created in your source, versus a much more compact if statement comparing explicit ranges.
For many scattered ranges, the if statement becomes very long; the set solution would be easier to maintain - still long, but probably easier to scan.
I don't know your full problem, but if the integers come in a file that is easy to parse, probably the best way to handle this is to have your program parse that file and create an appropriate set, or list of ranges, on the fly for your if statement.

Python, Huge Iteration Performance Problem

I'm iterating through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word running my program overnight.
The function below takes a list containing dictionaries, where each dictionary contains every possible 20-character sequence and its location from one of the 5-million-character words.
If anybody has an idea how to optimize this I would be really thankful, I don't have a clue how to continue...
here's a sample of my code:
def findUnique(list):
    # Takes a list with dictionaries and compares each element in the dictionaries
    # with the others and puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList = []
    listlength = len(list)
    s = 0
    valuelist = []
    for i in list:
        j = i.values()
        valuelist.append(j)
    while s < listlength:
        currdic = list[s]
        dic = {}
        for key in currdic:
            currval = currdic[key]
            test = True
            n = 0
            while n < listlength:
                if n != s:
                    if currval in valuelist[n]:  # this is where it takes too much time
                        n = listlength
                        test = False
                    else:
                        n += 1
                else:
                    n += 1
            if test:
                dic[key] = currval
        dicList.append(dic)
        s += 1
    return dicList
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last:  # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else:  # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique

# or find all locations for each slice:
import collections

def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5M characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main-memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. After reducing either the slice length or the word length (since I used completely random words, which reduces duplicates and increases memory use) enough to fit within main memory, it ran in under a minute. This situation, plus the O(n**2) behaviour of your original code, will take forever, and is why algorithmic time and space complexity are both important.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
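For illustration, a minimal sketch of counting distinct 20-base sequences with such a 4-way trie (plain nested lists stand in for real trie nodes; this is an assumed implementation, not the answerer's code):

BASE_INDEX = {'a': 0, 'c': 1, 'g': 2, 't': 3}    # the hardcoded [acgt] -> [0123] mapping

def count_distinct_kmers(seq, k=20):
    root = [None, None, None, None]               # each trie node has 4 children
    distinct = 0
    for start in range(len(seq) - k + 1):
        node = root
        created = False
        for ch in seq[start:start + k]:
            i = BASE_INDEX[ch]
            if node[i] is None:
                node[i] = [None, None, None, None]
                created = True                     # a new node means this path was unseen
            node = node[i]
        if created:                                # new leaf => previously unseen k-mer
            distinct += 1
    return distinct

print(count_distinct_kmers('acgtacgtacgtacgtacgtacgt'))  # 4 distinct 20-base sequences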
Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you use a hashed value of the string. This would save the cost of storing the extra copy of the strings as keys (at worst, 20 times the storage of an individual "word").
import collections

def hashed_slices(seq, length, hasher=None):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
Now we can combine all the hashes:
unique = []  # Unique words in first string
# create a dictionary of hash values -> word index -> start positions
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts):
    for h, starts in hashed.iteritems():
        # We only care about the first word
        if h in hashed_starts[0]:
            all_hashed[h][i] = starts
# Now check all hashes
for starts_by_word in all_hashed.itervalues():
    if len(starts_by_word) == 1:
        # if there's only one word for the hash, it's obviously valid
        unique.extend(words[0][i:i+20] for i in starts_by_word[0])
    else:
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems():
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems():
            if word_idx != 0:
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
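As a sketch of the 40-bit DNA hash mentioned above (assuming the input is restricted to the characters acgt; it simply packs each base into 2 bits, so distinct 20-base slices can never collide):

BASE_BITS = {'a': 0, 'c': 1, 'g': 2, 't': 3}

def dna_hash(kmer):
    h = 0
    for ch in kmer:
        h = (h << 2) | BASE_BITS[ch]   # 2 bits per base
    return h                            # fits in 40 bits for a 20-base slice

print(dna_hash('acgt' * 5))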
Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20-character slices starting with one particular letter. For DNA, this cuts the requirements down by 75%.
import collections

# letters is the alphabet (e.g. 'acgt'); words is the list of input words
seqlen = 20
seqtrie = collections.defaultdict(list)  # sequence -> list of word ids it occurs in
maxlength = max([len(word) for word in words])
for startletter in letters:
    for letterid in range(maxlength):
        for wordid, word in enumerate(words):
            if letterid < len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    if wordid not in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.
