Merge Sort on Python: Unusual pattern of result obtained - python

I have an unsorted array of 10,000 integers from 0 to 9,999. I wanted to apply merge sort to this array, so I wrote the following code:
import sys
def merge_sort(data):
    result = []
    if len(data) <= 1:
        return data
    else:
        mid = int(len(data)/2)
        left = data[:mid]
        right = data[mid:]
        sorted_left = merge_sort(left)
        sorted_right = merge_sort(right)
        i = j = k = 0
        total_len = len(sorted_left) + len(sorted_right)
        for k in range(0, total_len):
            if i < len(sorted_left) and j < len(sorted_right):
                if sorted_left[i] < sorted_right[j]:
                    result.append(sorted_left[i])
                    i = i+1
                    k = k+1
                elif sorted_left[i] > sorted_right[j]:
                    result.append(sorted_right[j])
                    j = j+1
                    k = k+1
            elif i < len(sorted_left):
                result.append(sorted_left[i])
                i = i+1
                k = k+1
            elif j < len(sorted_right):
                result.append(sorted_right[j])
                j = j+1
                k = k+1
            else:
                sys.exit("There is some issue with the code")
        return result
print merge_sort(data)
When I sort this data, I get the correct order except for a few entries. For example, towards the end I get this kind of result:
[...'9989', '999', '9990', '9991', '9992', '9993', '9994', '9995', '9996', '9997', '9998', '9999']
As you can see, '999' is in the wrong place. It happens in other places too, e.g. '995' appears between '9949' and '9950'. Does anybody have any idea why this is happening?
P.S. I ran this code in debug mode and it ran without errors, producing these results.

You are ordering strings: '9989' < '999' < '9990'. If you want to order integers, you'll have to convert your input list to integers.

Your data is in strings, not numbers. To convert to integers, use:
data = [int(x) for x in data]
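To see the difference concretely, here is a small sketch that sorts a few of the values from the question both ways. The variable names are made up for illustration, and the built-in sorted() stands in for the merge sort:

```python
# Sample values from the question's output, stored as strings
data = ['9990', '999', '9989', '995', '9950', '9949']

# Strings are compared character by character (lexicographic order)
as_strings = sorted(data)

# Converting to int first gives numeric order
as_ints = sorted(int(x) for x in data)

print(as_strings)  # → ['9949', '995', '9950', '9989', '999', '9990']
print(as_ints)     # → [995, 999, 9949, 9950, 9989, 9990]
```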
Python 2 will "compare" a wide variety of objects, even objects of different types. For example:
>>> "a" < "ab"
True
>>> None < "a"
True
>>> 1 < "a"
True
If you compare such items, Python 2 will not object. (Python 3 raises a TypeError for the last two comparisons.)
Comparison in python
For integers and strings, python has built-in methods for comparisons. For objects that you create, you can define your own comparison methods. Methods that you can define for your objects that python will use for comparison include:
object.__lt__(self, other)
object.__le__(self, other)
object.__eq__(self, other)
object.__ne__(self, other)
object.__gt__(self, other)
object.__ge__(self, other)
Defining your own comparison methods for your objects gives you great flexibility.
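As a minimal sketch of such methods, the hypothetical Version class below (not part of the question) compares dotted version strings numerically. It defines only __eq__ and __lt__ and lets functools.total_ordering derive the other four methods:

```python
from functools import total_ordering

@total_ordering
class Version:
    """Hypothetical example class: compares '1.10'-style strings numerically."""

    def __init__(self, text):
        self.text = text
        # '1.10' becomes (1, 10), so tuples compare number by number
        self.parts = tuple(int(p) for p in text.split('.'))

    def __eq__(self, other):
        return self.parts == other.parts

    def __lt__(self, other):
        return self.parts < other.parts

    def __repr__(self):
        return "Version(%r)" % self.text

versions = [Version('1.10'), Version('1.9'), Version('1.2')]
ordered = sorted(versions)  # sorted() relies on __lt__
print(ordered)  # → [Version('1.2'), Version('1.9'), Version('1.10')]
```

As plain strings, '1.10' would sort before '1.9', which is the same effect seen in the question.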

Is your data coming in as strings or integers? Based on your sample output, they are strings.
In such a case, '1' comes just before '10'. If you're expecting integers, then you could convert to int to do the sort.
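Separately from the string issue, note that the merge loop in the question appends nothing when the two candidates are equal, so duplicates would be dropped. That does not show up with 10,000 distinct values. Below is a hedged sketch of a merge sort that converts to integers first and keeps ties by using <=. It is one possible rewrite, not the asker's code:

```python
def merge_sort(data):
    """Return a new sorted list; equal elements are kept (stable merge)."""
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    left = merge_sort(data[:mid])
    right = merge_sort(data[mid:])

    result = []
    i = j = 0
    # Take the smaller head each time; <= keeps ties instead of skipping them
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One side is exhausted; the rest of the other side is already sorted
    result.extend(left[i:])
    result.extend(right[j:])
    return result

raw = ['9990', '999', '9989', '995', '999']
print(merge_sort([int(x) for x in raw]))  # → [995, 999, 999, 9989, 9990]
```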

Related

List comprehension with if else conditions - Python

I've been trying to figure out how to convert the main for loop in this code to a list comprehension for efficiency. I've seen some examples, but none of them seem to cover this sort of scenario. Any help is appreciated!
key = 'abcdefghij'
def getChunks(key):
    ''' Dividing the key into byte sized chunks'''
    remainder = ''
    chunks = []
    currVal = ''
    for char in key:
        if len(currVal) == 8:
            chunks.append(currVal)
            currVal = hex(ord(char))[2:]
        else:
            currVal += hex(ord(char))[2:]
    if len(currVal) == 8:
        chunks.append(currVal)
    elif currVal:
        remainder = currVal
    return (chunks, remainder)
print(getChunks(key))
The desired output divides the input string/key into byte-sized chunks of hexadecimal values, plus any remainder, in the following format:
>> (['61626364', '65666768'], '696a')
Oh and this too:
for i in range(1, self.hashCount+1):
    h = hash(item, i) % self.bitArraySize # Generate Hash
    # set the bit True in bit_array
    self.bitArray[h] = True
for i in range(1, self.hashCount+1):
    h = hash(item, i) % self.bitArraySize
    if self.bitArray[h] == False:
        return False
None of these should be list comprehensions. List comprehensions should not have side-effects (they're a functional programming construct and violating functional programming expectations leads to unmaintainable/unreadable code). In all cases, your loops aren't just building a new list element by element in order, they're also making stateful changes and/or building the list out of order.
Side-note: if self.bitArray[h] == False: is a slower, unPythonic way to spell if not self.bitArray[h]:; comparing to True and False is almost always the wrong way to go, per the PEP8 style guide:
Don't compare boolean values to True or False using ==:
# Correct:
if greeting:
# Wrong:
if greeting == True:
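For the second loop, which only tests bits and changes nothing, a side-effect-free alternative is all() with a generator expression. The sketch below is illustrative only: the BloomFilter class, its method names, and the hashlib-based _hash helper are assumptions standing in for the question's two-argument hash(item, i), and it assumes the original method returns True once the loop finishes.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter; attribute names mirror the question."""

    def __init__(self, bitArraySize, hashCount):
        self.bitArraySize = bitArraySize
        self.hashCount = hashCount
        self.bitArray = [False] * bitArraySize

    def _hash(self, item, seed):
        # Assumed stand-in for the question's hash(item, i)
        digest = hashlib.sha256(("%d:%s" % (seed, item)).encode()).hexdigest()
        return int(digest, 16)

    def add(self, item):
        # This loop mutates state, so it stays a plain for loop
        for i in range(1, self.hashCount + 1):
            self.bitArray[self._hash(item, i) % self.bitArraySize] = True

    def check(self, item):
        # all() stops at the first unset bit, like the early `return False`
        return all(
            self.bitArray[self._hash(item, i) % self.bitArraySize]
            for i in range(1, self.hashCount + 1)
        )

bf = BloomFilter(bitArraySize=64, hashCount=3)
bf.add('apple')
print(bf.check('apple'))  # → True
```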
For question #1:
key = 'abcdefghij'
def getChunks(key):
    ''' Dividing the key into byte sized chunks'''
    hex_string = key.encode().hex()
    length = len(hex_string)
    sep = length%8
    return [hex_string[i:i+8] for i in range(0, length-sep, 8)], hex_string[-sep:] if sep !=0 else ''
print(getChunks(key))

how to generate a set of similar strings in python

I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in a source string (i.e. the string from which the similar strings are generated), the number of strings to generate, and a threshold, so that the similarity among the strings in the generated set is greater than the threshold. What Python package(s) should I use to achieve that? Or any ideas on how to implement this?
I think you can think of the problem in another way (reversed).
Given a string, say it is sittin.
Given a threshold (edit distance), say it is k.
Then you apply combinations of different "edits" in k-steps.
For example, let's say k = 2. And assume the allowed edit modes you have are:
delete one character
add one character
substitute one character with another one.
Then the logic is something like below:
input = 'sittin'
for num in 1 ... n: # suppose you want to have n strings generated
    my_input_ = input
    # suppose the edit distance should be smaller or equal to k;
    # but greater or equal to one
    for i in 1 ... randint(1, k):
        pick a random edit mode from (delete, add, substitute)
        do it! and update my_input_
If you need to stick with a pre-defined dictionary, that adds some complexity but it is still doable. In this case, the edit must be valid.
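The pseudocode above can be sketched in runnable Python roughly as follows. The function names, the lowercase alphabet, and the seed are assumptions made for this example. Because each single edit changes the Levenshtein distance by at most one, every result is within distance k of the source, though a result may repeat or even equal the source.

```python
import random
import string

def random_edit(s, alphabet):
    """Apply one random delete, add, or substitute edit to s."""
    # An empty string can only grow
    modes = ["add"] if not s else ["delete", "add", "substitute"]
    mode = random.choice(modes)
    if mode == "delete":
        idx = random.randrange(len(s))
        return s[:idx] + s[idx + 1:]
    if mode == "add":
        idx = random.randrange(len(s) + 1)
        return s[:idx] + random.choice(alphabet) + s[idx:]
    idx = random.randrange(len(s))
    return s[:idx] + random.choice(alphabet) + s[idx + 1:]

def similar_strings(source, n, k, alphabet=string.ascii_lowercase):
    """Return n strings, each produced by 1..k random edits of source."""
    results = []
    for _ in range(n):
        candidate = source
        for _ in range(random.randint(1, k)):
            candidate = random_edit(candidate, alphabet)
        results.append(candidate)
    return results

def levenshtein(a, b):
    """Dynamic-programming edit distance, used here only to check results."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(0)  # fixed seed only so repeated runs give the same strings
variants = similar_strings("sittin", n=5, k=2)
print([(v, levenshtein("sittin", v)) for v in variants])
```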
Borrowing heavily from the pseudocode in @greeness's answer, I thought I would include the code I used to do this for DNA sequences.
This may not be your exact use case but I think it should be easily adaptable.
import random
dna = set(["A", "C", "G", "T"])
class Sequence(str):
    def mutate(self, d, n):
        mutants = set([self])
        while len(mutants) < n:
            k = random.randint(1, d)
            for _ in range(k):
                mutant_type = random.choice(["d", "s", "i"])
                if mutant_type == "i":
                    mutants.add(self.insertion(k))
                elif mutant_type == "d":
                    mutants.add(self.deletion(k))
                elif mutant_type == "s":
                    mutants.add(self.substitute(k))
        return list(mutants)
    def deletion(self, n):
        if n >= len(self):
            return ""
        chars = list(self)
        i = 0
        while i < n:
            idx = random.choice(range(len(chars)))
            del chars[idx]
            i += 1
        return "".join(chars)
    def insertion(self, n):
        chars = list(self)
        i = 0
        while i < n:
            idx = random.choice(range(len(chars)))
            new_base = random.choice(list(dna))
            chars.insert(idx, new_base)
            i += 1
        return "".join(chars)
    def substitute(self, n):
        idxs = random.sample(range(len(self)), n)
        chars = list(self)
        for i in idxs:
            new_base = random.choice(list(dna.difference(chars[i])))
            chars[i] = new_base
        return "".join(chars)
To use this you can do the following
s = Sequence("AAAAA")
d = 2 # max edit distance
n = 5 # number of strings in result
s.mutate(d, n)
>>> ['AAA', 'GACAAAA', 'AAAAA', 'CAGAA', 'AACAAAA']

Efficient way to get every integer vectors summing to a given number [duplicate]

I've been working on some quick and dirty scripts for doing some of my chemistry homework, and one of them iterates through lists of a constant length where all the elements sum to a given constant. For each, I check if they meet some additional criteria and tack them on to another list.
I figured out a way to meet the sum criteria, but it looks horrendous, and I'm sure there's some type of teachable moment here:
# iterate through all 11-element lists where the elements sum to 8.
for a in range(8+1):
    for b in range(8-a+1):
        for c in range(8-a-b+1):
            for d in range(8-a-b-c+1):
                for e in range(8-a-b-c-d+1):
                    for f in range(8-a-b-c-d-e+1):
                        for g in range(8-a-b-c-d-e-f+1):
                            for h in range(8-a-b-c-d-e-f-g+1):
                                for i in range(8-a-b-c-d-e-f-g-h+1):
                                    for j in range(8-a-b-c-d-e-f-g-h-i+1):
                                        k = 8-(a+b+c+d+e+f+g+h+i+j)
                                        x = [a,b,c,d,e,f,g,h,i,j,k]
                                        # see if x works for what I want
Here's a recursive generator that yields the lists in lexicographic order. Leaving exact as True gives the requested result where every sum==limit; setting exact to False gives all lists with 0 <= sum <= limit. The recursion takes advantage of this option to produce the intermediate results.
def lists_with_sum(length, limit, exact=True):
    if length:
        for l in lists_with_sum(length-1, limit, False):
            gap = limit-sum(l)
            for i in range(gap if exact else 0, gap+1):
                yield l + [i]
    else:
        yield []
Generic, recursive solution:
def get_lists_with_sum(length, my_sum):
    if my_sum == 0:
        return [[0 for _ in range(length)]]
    if not length:
        return [[]]
    elif length == 1:
        return [[my_sum]]
    else:
        lists = []
        for i in range(my_sum+1):
            rest = my_sum - i
            sublists = get_lists_with_sum(length-1, rest)
            for sl in sublists:
                sl.insert(0, i)
                lists.append(sl)
        return lists
print get_lists_with_sum(11, 8)
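As a non-recursive alternative to the two answers above, the same lists can be enumerated with the "stars and bars" construction: choose where length-1 separators go among total+length-1 slots, and read each part as the gap between neighbouring separators. This is a different technique from the answers, sketched here with itertools.combinations; the function name compositions is made up for this example.

```python
from itertools import combinations
from math import comb

def compositions(length, total):
    """Yield every list of `length` non-negative ints that sums to `total`."""
    if length == 0:
        if total == 0:
            yield []
        return
    slots = total + length - 1
    # Each choice of bar positions corresponds to exactly one list
    for bars in combinations(range(slots), length - 1):
        parts = []
        prev = -1
        for bar in bars:
            parts.append(bar - prev - 1)  # stars between two bars
            prev = bar
        parts.append(slots - prev - 1)    # stars after the last bar
        yield parts

small = list(compositions(3, 2))
print(small)

# The count matches the binomial coefficient C(total+length-1, length-1)
count = sum(1 for _ in compositions(11, 8))
print(count, comb(18, 10))  # → 43758 43758
```

Being a generator, it never holds all 43758 lists in memory at once, which suits the "check each one, keep a few" loop in the question.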

How does Python handle multiple conditions in a list comprehension?

I was trying to create a list comprehension from a function that I had and I came across an unexpected behavior. Just for a better understanding, my function gets an integer and checks which of its digits divides the integer exactly:
# Full function
divs = list()
for i in str(number):
    digit = int(i)
    if digit > 0 and number % digit == 0:
        divs.append(digit)
return len(divs)
# List comprehension
return len([x for x in str(number) if x > 0 and number % int(x) == 0])
The problem is that, if I give a 1012 as an input, the full function returns 3, which is the expected result. The list comprehension returns a ZeroDivisionError: integer division or modulo by zero instead. I understand that it is because of this condition:
if x > 0 and number % int(x) == 0
In the full function, the multiple condition is handled from the left to the right, so it is fine. In the list comprehension, I do not really know, but I was guessing that it was not handled in the same way.
Until I tried with a simpler function:
# Full function
positives = list()
for i in numbers:
    if i > 0 and 20 % i ==0:
        positives.append(i)
return positives
# List comprehension
return [i for i in numbers if i > 0 and 20 % i == 0]
Both of them worked. So I am thinking that maybe it has something to do with the number % int(x)? This is just curiosity on how this really works? Any ideas?
The list comprehension is different, because you compare x > 0 without converting x to int. In Py2, mismatched types will compare in an arbitrary and stupid but consistent way, which in this case sees all strs (the type of x) as greater than all int (the type of 0) meaning that the x > 0 test is always True and the second test always executes (see Footnote below for details of this nonsense). Change the list comprehension to:
[x for x in str(number) if int(x) > 0 and number % int(x) == 0]
and it will work.
Note that you could simplify a bit further (and limit redundant work and memory consumption) by importing a Py3 version of map at the top of your code (from future_builtins import map), and using a generator expression with sum, instead of a list comprehension with len:
return sum(1 for i in map(int, str(number)) if i > 0 and number % i == 0)
That only calls int once per digit, and constructs no intermediate list.
Footnote: 0 is a numeric type, and all numeric types are "smaller" than everything except None, so a str is always greater than 0. In non-numeric cases, it would be comparing the string type names, so dict < frozenset < list < set < str < tuple, except oops, frozenset and set compare "naturally" to each other, so you can have non-transitive relationships; frozenset() < [] is true, [] < set() is true, but frozenset() < set() is false, because the type specific comparator gets invoked in the final version. Like I said, arbitrary and confusing; it was removed from Python 3 for a reason.
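For readers on Python 3, here is a short sketch of how the same code behaves there. The function name count_dividing_digits is made up for this example. The corrected condition short-circuits exactly as in the full function, and the original str-to-int comparison fails loudly instead of silently evaluating to True:

```python
def count_dividing_digits(number):
    """Count the digits of number that divide it evenly."""
    # int(x) > 0 is evaluated first; `and` short-circuits, so the modulo
    # is never computed for a zero digit.
    return len([x for x in str(number) if int(x) > 0 and number % int(x) == 0])

print(count_dividing_digits(1012))  # → 3

# In Python 3, ordering a str against an int raises TypeError
try:
    outcome = '1' > 0
except TypeError as exc:
    outcome = exc
print(type(outcome).__name__)  # → TypeError
```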
You should say int(x) > 0 in the list comprehension

Memoryerror with too big list

I'm writing a script in Python, and I have to create a pretty big list containing exactly 248956422 integers. Some of the "0"s in this list will be changed to 1, 2 or 3, because I have 8 lists: 4 with the start positions of genes and 4 with their end positions.
I have to iterate over "anno" several times, because the numbers replacing 0 can change in later iterations.
"anno" then has to be written to a file to create an annotation file.
My question: how can I split this up, or do it on the fly, so that I don't get a MemoryError, while still replacing "0" with other values and 1, 2, 3 with others?
Maybe by rewriting the file? I'm waiting for your advice; please ask if what I wrote is not clear.
# Example: if whole_st_gen contains 177 and whole_end_gen contains 200,
# I need to put "1" at positions 177 to 200.
# These lists can have 1,000,000+ elements; the start/end lists of the
# same kind always have equal length.
whole_st_gen = []
whole_end_gen = []
whole_st_ex = []
whole_end_ex = []
whole_st_mr = []
whole_end_mr = []
whole_st_nc = []
whole_end_nc = []  # in the real script these lists contain values
length = 248956422
anno = ['0' for i in range(0,length)]  # here I get the MemoryError
# then I wanted to do something like:
for j in range(0, len(whole_st_gen)):
    for y in range(whole_st_gen[j],whole_end_gen[j]):
        anno[y]='1'
You might be better off determining the value of each element in anno on the fly:
def anno():
    for idx in xrange(248956422):
        elm = "0"
        for j in range(0, len(whole_st_gen)):
            if whole_st_gen[j] <= idx < whole_end_gen[j]:
                elm = "1"
        for j in range(0, len(whole_st_ex)):
            if whole_st_ex[j] <= idx < whole_end_ex[j]:
                elm = "2"
        for j in range(0, len(whole_st_mr)):
            if whole_st_mr[j] <= idx < whole_end_mr[j]:
                elm = "3"
        for j in range(0, len(whole_st_nc)):
            if whole_st_nc[j] <= idx < whole_end_nc[j]:
                elm = "4"
        yield elm
Then you just iterate using for elm in anno().
I got an edit proposal from the OP suggesting one function for each of whole_*_gen, whole_st_ex and so on, something like this:
def anno_ex():
    for idx in xrange(248956422):
        elm = "0"
        for j in range(0, len(whole_st_ex)):
            if whole_st_ex[j] <= idx <= whole_end_ex[j]:
                elm = "2"
        yield elm
That's of course doable, but it will only result in the changes from whole_*_ex applied and one would need to combine them afterwards when writing to file which may be a bit awkward:
for a, b, c, d in zip(anno_st(), anno_ex(), anno_mr(), anno_nc()):
    if d != "0":
        write_to_file(d)
    elif c != "0":
        write_to_file(c)
    elif b != "0":
        write_to_file(b)
    else:
        write_to_file(a)
However if you only want to apply some of the change sets you could write a function that takes them as parameters:
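def anno(*args):
    for idx in xrange(248956422):
        elm = "0"
        for st, end, tag in args:
            for j in range(0, len(st)):
                # compare against the j-th start position, not the whole list
                if st[j] <= idx < end[j]:
                    elm = tag
        # yield the value chosen for this position
        yield elm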
And then call by supplying the lists (for example with only the two first changes):
for tag in anno((whole_st_gen, whole_end_gen, "1"),
                (whole_st_ex, whole_end_ex, "2")):
    write_to_file(tag)
You could use a bytearray object to have a much more compact memory representation than a list of integers:
anno = bytearray(b'\0' * 248956422)
print(anno[0]) # → 0
anno[0] = 2
print(anno[0]) # → 2
print(anno.__sizeof__()) # → 248956447 (on my computer)
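Building on the bytearray idea, the range updates from the question can be done with slice assignment, and the result can be turned into text just before writing. This is a sketch under assumptions: the length is shrunk to 1000 so it runs instantly, the start/end values are invented, and ranges are treated as half-open like the question's range() loop.

```python
length = 1000  # stand-in for 248956422, kept small for the example
anno = bytearray(length)  # every position starts as 0

# Invented positions, in the style of whole_st_gen / whole_end_gen
whole_st_gen = [177, 500]
whole_end_gen = [200, 510]

for start, end in zip(whole_st_gen, whole_end_gen):
    # Slice assignment marks the half-open range [start, end) in one step
    anno[start:end] = b'\x01' * (end - start)

# Map byte values 0..3 to the characters '0'..'3' before writing to a file
table = bytes.maketrans(bytes(range(4)), b'0123')
text = anno.translate(table).decode('ascii')
print(text[175:202])
```

Because the bytearray is mutable, later passes can overwrite the same positions with 2 or 3 in the same way, which a generator cannot do.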
Instead of creating a list with a list comprehension, I suggest creating an iterator with a generator expression, which produces the values on demand instead of keeping all of them in memory. Also, since the loop variable i is never used, you can name it _ to mark it as a throwaway variable.
anno = ('0' for _ in range(0,length)) # In python 2.X use xrange() instead of range()
But note that an iterator is a one-shot iterable: you cannot use it again after iterating over it once. If you want to use it multiple times, you can create N independent iterators from it using itertools.tee().
Also note that you cannot change it in place. If you want to change some elements based on a condition, you can create a new iterator by iterating over your iterator and applying the condition in a generator expression.
For example:
new_anno = (do_something(i) for i in anno if some_condition(i))  # do_something and some_condition are placeholders
