Let's say we have an if statement in Python of the form:
if a == 1 or a == 2 or a == 3 or ... or a == 100000
for a large number of comparisons, all connected with or.
What would a good algorithm be to compress that into a smaller if statement?
E.g. for the above:
if a >= 1 and a <= 100000
Sometimes there will be a pattern in the numbers and sometimes they will be completely random so the algorithm must deal well with both cases.
Can anyone suggest a decent algorithm that will efficiently condense an if statement of this form?
Edit: The goal is to have the resulting if statement be as short as possible. The efficiency of evaluating the if statement is secondary to length.
If there is no pattern in your numbers, you can use a tuple and a membership test:
if a in (1,2,3,... 100000)
Sort your "compare list", then traverse it to extract runs of consecutive integers as separate intervals. For intervals of length 1 (i.e. single numbers) you perform an == test, and for larger intervals you perform chained <= / >= comparisons.
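A minimal sketch of that approach (the helper names to_intervals and to_condition are my own, not from the question):

```python
# Sort the numbers, then walk the sorted list collecting maximal runs
# of consecutive integers; emit == for singletons and a chained
# comparison for longer runs.
def to_intervals(numbers):
    nums = sorted(set(numbers))
    intervals = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            prev = n
        else:
            intervals.append((start, prev))
            start = prev = n
    intervals.append((start, prev))
    return intervals

def to_condition(numbers, var="a"):
    parts = []
    for lo, hi in to_intervals(numbers):
        if lo == hi:
            parts.append("%s == %d" % (var, lo))
        else:
            parts.append("%d <= %s <= %d" % (lo, var, hi))
    return " or ".join(parts)

print(to_condition([1, 2, 3, 5, 9, 10]))
# 1 <= a <= 3 or a == 5 or 9 <= a <= 10
```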
Maintain a sorted array of the numbers to compare against and perform a binary search on it whenever you want to check a. If a exists in the array then the statement is true, else false. Each query is O(log n).
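As an illustration, a sketch of that binary-search check using the stdlib bisect module (the names here are mine):

```python
import bisect

# Keep the comparison values in one sorted list; each membership
# check is then an O(log n) binary search.
numbers = sorted([7, 3, 99, 42, 15])

def contains(sorted_nums, a):
    i = bisect.bisect_left(sorted_nums, a)
    return i < len(sorted_nums) and sorted_nums[i] == a

print(contains(numbers, 42))  # True
print(contains(numbers, 8))   # False
```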
A quick and easy way is:
if a in set(numbers):
    whatever
Alternatively, given
x = random.sample(xrange(100), 30)
you could build up a list of ranges:
x.sort()  # the runs can only be found in sorted order
ranges = []
for i in x:
    if ranges and i - 1 in ranges[-1]:
        ranges[-1].append(i)
    else:
        if ranges:
            ranges[-1] = ranges[-1][0], ranges[-1][-1]
        ranges.append([i])
if ranges:
    ranges[-1] = ranges[-1][0], ranges[-1][-1]  # tidy up the final range
and then loop through them:
for r in ranges:
    if r[0] <= a <= r[1]:
        whatever
What works in the least space for the if statement is going to be a test for membership in a set of values. (Don't use a tuple or a list for this - this is what sets are for):
if a in set_of_numbers:
    blah ...
If you know from your wider problem that, when expressed as ranges of integers, you will usually have sufficiently few ranges to compare, then you can use code from the Rosetta Code Range extraction task to create the ranges, with a different print routine printi to format the output as an if statement:
def printi(ranges):
    print( 'if %s:' %
           ' or '.join( (('%i<=a<=%i' % r) if len(r) == 2 else 'a==%i' % r)
                        for r in ranges ) )
For the Rosetta code example numbers it will produce the following:
if 0<=a<=2 or a==4 or 6<=a<=8 or a==11 or a==12 or 14<=a<=25 or 27<=a<=33 or 35<=a<=39:
Tradeoffs
For a few but large ranges, a large set would have to be created in your source, against the more compact creation of an if statement comparing explicit ranges.
For many scattered ranges the if statement becomes very long; the set solution would be easier to maintain - still long, but probably easier to scan.
I don't know your full problem, but if the integers come in a file that is easy to parse, probably the best way to handle this is to make your program parse that file and create an appropriate set or list of ranges on the fly for your if statement.
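For instance, assuming a file with one integer per line (the filename "numbers.txt" is hypothetical), the set could be built on the fly like this:

```python
# Build the membership set from a simple text file;
# "numbers.txt" is a made-up name for illustration.
def load_numbers(path):
    with open(path) as f:
        return {int(line) for line in f if line.strip()}

# set_of_numbers = load_numbers("numbers.txt")
# if a in set_of_numbers:
#     blah ...
```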
Related
My understanding of list and set in Python is mainly that a list allows duplicates, keeps items in order, and has positional information. I found that when searching for whether an element is in a list, the runtime is much faster if I convert the list to a set first. For example, I wrote code to find the longest consecutive sequence in a list. Using a list of the numbers 0 to 9999 as an example, the longest consecutive sequence is 10000. While using a list:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        ## Search if number + 1 is also in the list ##
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for above code is "Duration: 0:00:01.481939"
With only one line added (the third row below) to convert the list to a set:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
nums = set(nums)
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        ## Search if number + 1 is also in the set (was a list) ##
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for the above code using a set is now "Duration: 0:00:00.005138", many times shorter than searching through a list. Could anyone help me understand the reason for that? Thank you!
This is a great question.
The issue with arrays is that there is no smarter way to search some array a than comparing every element one by one.
Sometimes you'll get lucky and get a match on the first element of a.
Sometimes you'll get unlucky and not find a match until the last element of a, or perhaps none at all.
On average, you'll have to search half the elements of the array each time.
This is said to have a "time complexity" of O(len(a)), or colloquially, O(n). This means the time taken by the algorithm (searching for a value in the array) is linearly proportional to the size of the input (the number of elements in the array to be searched). This is why it's called "linear search". Oh, your array got 2x bigger? Well, your searches just got 2x slower. 1000x bigger? 1000x slower.
Arrays are great, but they're 💩 for searching if the element count gets too high.
Sets are clever. In Python, they're implemented as if they were a dictionary with keys and no values. Like dictionaries, they're backed by a data structure called a hash table.
A hash table uses the hash of a value as a quick way to get a "summary" of an object. This "summary" is then used to narrow down its search, so it only needs to linearly search a very small subset of all the elements. Searching in a hash table has a time complexity of O(1) on average. Notice that there's no "n" or len(the_set) in there. That's because the time taken to search in a hash table does not grow as the size of the hash table grows. So it's said to have constant time complexity.
By analogy, you only search the dairy aisle when you're looking for milk. You know the hash value of milk (say, its aisle) is "dairy" and not "deli", so you don't have to waste any time searching the other aisles.
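A quick, hedged way to see this difference in practice with timeit (the sizes and repeat counts here are arbitrary):

```python
import timeit

# Membership testing: O(n) scan through a list vs. O(1) average-case
# hash lookup in a set.
nums_list = list(range(100000))
nums_set = set(nums_list)

# Search for the worst-case element (the last one) repeatedly.
list_time = timeit.timeit(lambda: 99999 in nums_list, number=100)
set_time = timeit.timeit(lambda: 99999 in nums_set, number=100)
print("list:", list_time, "set:", set_time)  # the set is far faster
```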
A natural follow-up question is "then why don't we always use sets?". Well, there's a trade-off:

* As you mentioned, sets can't contain duplicates, so if you want to store two of something, they're a non-starter.
* Sets are unordered (dicts preserve insertion order since Python 3.7, but sets still don't), so if you care about the order of elements, they won't do, either.
* Sets generally have a larger CPU/memory overhead, which adds up when using many sets containing small numbers of elements.
* Also, because of fancy CPU features (like CPU caches and branch prediction), linear searching through small arrays can actually be faster than the hash-based look-up in sets.
I'd recommend you do some further reading into data structures and algorithms. This stuff is quite language-independent. Now that you know that set and dict use a hash table behind the scenes, you can look up resources that cover hash tables in any language, and that should help. There are also some Python-centric resources, like https://www.interviewcake.com/concept/python/hash-map
I am trying to create a function compare(lst1, lst2) which compares each element of two lists, returns every common element in a new list, and shows the percentage of how similar they are. All the elements in the lists are going to be strings. For example, the function should return:
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The positions at which the lists are not similar will simply be returned as x.
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'

common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'

print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421
You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < min(len(A), len(B)) (remember Python sequences are 0-indexed). Your problem statement requires:
1. If character i in A is identical to character i in B, yield that character
2. Otherwise, yield some placeholder to denote the mismatch
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do once i runs past the end of the shorter string. You might think to represent those positions with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally long strings; for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string.
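Putting those hints together, one possible sketch that uses only a for loop over indices (no set, no zip; the function name compare is from the question, the details are my own):

```python
def compare(s1, s2):
    # Compare position by position; any index past the end of the
    # shorter string counts as a mismatch ('x').
    n = max(len(s1), len(s2))
    common = ""
    matches = 0
    for i in range(n):
        if i < len(s1) and i < len(s2) and s1[i] == s2[i]:
            common += s1[i]
            matches += 1
        else:
            common += "x"
    return common, matches / n

print(compare('AAAAABBBBBCCCCCDDDD', 'ABCABCABCABCABCABCA'))
# ('AxxAxxxBxxxCxxCxxxx', 0.2631578947368421)
```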
I am working on a project in which I need to generate several identifiers for combinatorial pooling of different molecules. To do so, I assign each molecule an n-bit string (where n is the number of pools I have; in this case, 79 pools), and each string has 4 "on" bits (4 bits equal to 1) corresponding to which pools that molecule will appear in. Next, I want to pare down the number of strings such that no two molecules appear in the same pool more than twice (in other words, the number of overlapping bits between two strings can be no greater than 2).
To do this, I: 1) compiled a list of all n-bit strings with k "on" bits, 2) generated a list of lists where each element is a list of the indices at which a bit is on (using re.finditer), and 3) iterated through the list to compare strings, adding only strings that meet my criteria to my final list of strings.
The code I use to compare strings:
drug_strings = []  # To store suitable strings for combinatorial pooling rules

class badString(Exception): pass

for k in range(len(strings)):
    bit_current = bit_list[k]
    try:
        for bits in bit_list[:k]:
            intersect = set.intersection(set(bit_current), set(bits))
            if len(intersect) > 2:
                raise badString()  # pass on to next iteration if string overlaps with a previous one
        drug_strings.append(strings[k])
    except badString:
        pass
However, this code takes forever to run. I am running this with n=79-bit strings with k=4 "on" bits per string (~1.5M possible strings) so I assume that the long runtime is because I am comparing each string to every previous string. Is there an alternative/smarter way to go about doing this? An algorithm that would work faster and be more robust?
EDIT: I realized that a simpler way to approach this problem, instead of identifying the entire subset of suitable strings, is to randomly sample the larger set of n-bit strings with k "on" bits, store only the strings that fit my criteria, and then, once I have enough suitable strings, simply take as many as I need from those. The new code is as follows:
my_strings = []
my_bits = []
for k in range(2000):
    random = np.random.randint(0, len(strings_77))
    string = strings_77.pop(random)
    bits = [m.start() + 1 for m in re.finditer('1', string)]
    if all(len(set(bits) & set(my_bit)) <= 2
           for my_bit in my_bits[:k]):
        my_strings.append(string)
        my_bits.append(bits)
Now I only have to compare against strings I've already pulled (at most 1999 previous strings instead of up to 1 million). It runs much more quickly this way. Thanks for the help!
Raising exceptions is expensive. A complex data structure is created and the stack has to be unwound. In fact, setting up a try/except block is expensive.
Really, you want to check that every intersection has length at most two, and then append. There is no need for exceptions.
for k in range(len(strings)):
    bit_current = bit_list[k]
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(strings[k])
Also, instead of having to index into strings and bit_list, you can iterate over all the parts you need at the same time. You still need the index for the bit_list slice:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)
You can also avoid re-creating the bit_current set on each inner iteration:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    bit_set = set(bit_current)
    if all(len(bit_set & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)
Some minor things that I would improve in your code that may be causing some overhead:
* move set(bit_current) outside the inner loop;
* remove the raise/except part;
* since you have the condition if len(intersect) > 2:, you could implement the intersection yourself so that it stops as soon as that condition is met. That way you avoid unnecessary computation.
So the code would become:
for k in range(len(strings)):
    bit_current = set(bit_list[k])
    intersect = []
    for bits in bit_list[:k]:
        intersect = []
        b = set(bits)
        for i in bit_current:
            if i in b:
                intersect.append(i)
                if len(intersect) > 2:
                    break
        if len(intersect) > 2:
            break
    if len(intersect) <= 2:
        drug_strings.append(strings[k])
Sorry if this is too simple; I searched elsewhere but nobody had pointed out this specific problem. I'd like to learn to write Python in a compact way! So, to this end, I'm trying to use one-line (i.e., short) loops instead of multi-line loops, specifically for loops. The problem arises when I try to use a one-line if and else inside a one-line loop. It just doesn't seem to work. Consider the following, for example:
numbers = ... # an arbitrary array of integer numbers
over_30 = [number if number > 30 for number in numbers]
This is problematic since a one-line if needs an else following it. Even when I add an else to the above script (after the if): over_30 = [number if number > 30 else continue for number in numbers], it turns into just another Python error.
I know that the problem is actually with the one-line if and else, because Python needs a value to assign to the left-hand side. But is there a work-around for this specific use-case of the schema above?
They are different syntaxes. The one you are looking for is:
over_30 = [number for number in numbers if number > 30]
This is a conditional list comprehension. The else clause is actually a non-conditional list comprehension, combined with a ternary expression:
over_30 = [number if number > 30 else 0 for number in numbers]
Here you are computing the ternary expression (number if number > 30 else 0) for each number in the numbers iterable.
continue won't work, since this is a ternary expression, which has to return a value:
val1 if condition else val2
You can try this way:
over_30 = [number for number in numbers if number > 30]
I am working through another coding interview question.
So I'm building a feature for choosing two movies whose total runtimes will equal the exact flight length.
The question asks the following:
Write a function that takes an integer flight_length (in minutes) and a list of integers movie_lengths (in minutes) and returns a boolean indicating whether there are two numbers in movie_lengths whose sum equals flight_length.
I first thought we could nest two loops (the outer choosing first_movie_length, the inner choosing second_movie_length). That'd give us a runtime of O(n^2).
But is it possible that we can do better?
I have the following solution:
def can_two_movies_fill_flight(movie_lengths, flight_length):
    # movie lengths we've seen so far
    movie_lengths_seen = set()
    for first_movie_length in movie_lengths:
        matching_second_movie_length = flight_length - first_movie_length
        if matching_second_movie_length in movie_lengths_seen:
            return True
        movie_lengths_seen.add(first_movie_length)
    # we never found a match, so return False
    return False
I think this solution gives me O(n) time and O(n) space.
Could I use a hash map here instead?
We could sort movie_lengths first; then we could use binary search to find second_movie_length in O(log n) time instead of O(n) time.
But sorting would cost O(n log n), and we can do even better than that.
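For completeness, the sort-plus-binary-search variant mentioned above could be sketched like this (using the stdlib bisect module; this is an illustration, not the recommended solution):

```python
import bisect

def can_two_movies_fill_flight_sorted(movie_lengths, flight_length):
    lengths = sorted(movie_lengths)              # O(n log n)
    for i, first in enumerate(lengths):
        target = flight_length - first
        j = bisect.bisect_left(lengths, target)  # O(log n) per lookup
        if j == i:                               # don't reuse the same movie
            j += 1
        if j < len(lengths) and lengths[j] == target:
            return True
    return False

print(can_two_movies_fill_flight_sorted([60, 120, 30, 90], 150))  # True
print(can_two_movies_fill_flight_sorted([80, 40], 80))            # False
```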
My Solution:
Using a set as our data structure.
We make one pass through movie_lengths, treating each item as the first_movie_length. At each iteration:
1. See if there is a matching_second_movie_length we've seen already (stored in our movie_lengths_seen set) that is equal to flight_length - first_movie_length. If there is, we short-circuit and return True.
2. Keep our movie_lengths_seen set up to date by throwing in the current first_movie_length.
def can_two_movies_fill_flight(movie_lengths, flight_length):
    # movie lengths we've seen so far
    movie_lengths_seen = set()
    for first_movie_length in movie_lengths:
        matching_second_movie_length = flight_length - first_movie_length
        if matching_second_movie_length in movie_lengths_seen:
            return True
        movie_lengths_seen.add(first_movie_length)
    return False
We know users won't watch the same movie twice because we check movie_lengths_seen for matching_second_movie_length before we've put first_movie_length in it!
Efficiency and algorithmic complexity: O(n) time and O(n) space. Note that while optimizing runtime we added a bit of space cost.