Why is a bitwise operator needed in this powerset generator? - python

I am currently following MITx's 6.00.2x, and we are asked to come up with a variant of the power set generator shown at the bottom.
But before I can work on the variant, I need to understand what is going on in the given generator. Specifically:
What does (i >> j) % 2 == 1, and in fact the whole for j in range(N): block, do? I understand that i >> j shifts the binary representation of i right by j places and returns the resulting integer. But I have no idea why binary is needed in a powerset generator in the first place, let alone why this conditional is necessary.
I understand that for any given set A of cardinality n, the cardinality of its powerset is 2**n, because each of the n members of A is either in a given subset or not, which gives 2 choices repeated n times.
Is that what for i in range(2**N): is doing, i.e. going over the 2**N subsets and either including or not including each member of the set?
I tried running it with items=['apple','banana','orange'] and items=[1,2,3], and both returned an empty list, which makes it all the more confusing.
def powerSet(items):
    # generate all combinations of N items, items is a Python list
    N = len(items)
    # enumerate the 2**N possible combinations
    for i in range(2**N):
        combo = []
        for j in range(N):
            # test bit jth of integer i
            if (i >> j) % 2 == 1:
                combo.append(items[j])
        return combo

The algorithm starts from the observation that any subset of {1,...,N} can be seen as a function f:{1,...,N}->{0,1}, i.e. its characteristic function. How does it work? If A is a subset of {1,...,N}, then f is given by f(x)=0 if x is not in A and f(x)=1 otherwise.
A second observation is that any function f:{1,...,N}->{0,1} can be encoded as a binary number of N bits: the j-th bit is 1 if f(j)=1 and 0 otherwise.
So if we want to generate all subsets of {1,...,N}, it is enough to generate all binary numbers of length N. How many such numbers are there? 2**N, of course. Since every number between 0 and 2**N - 1 (-1 because we count from 0) corresponds to exactly one subset of {1,...,N}, we can simply loop through them. That's where the for i in range(2**N): loop comes from.
But we are not dealing with subsets of {1,...,N} directly; we have some list items of length N. So if A is a subset of {1,...,N}, encoded as a number i between 0 and 2**N - 1, how do we convert it to a subset of items? Again, we use the fact that a 1 bit means "is in the set" and a 0 bit means "is not in the set". That's where (i >> j) % 2 == 1 comes from. It simply means "the j-th bit of i is 1", which in turn means "the j-th element of items belongs in the subset".
There's a slight issue with your code. You should maybe yield instead of return:
def powerSet(items):
    N = len(items)
    for i in range(2**N):
        combo = [] # <-- this is our subset
        for j in range(N):
            if (i >> j) % 2 == 1:
                combo.append(items[j])
        yield combo # <-- here we yield it to caller
subsets = list(powerSet(["apple", "banana", "pear"]))
Here's an example of this binary encoding of subsets. Say you have a list
["apple", "banana", "pear"]
It has 3 elements so we are looking at numbers of (binary) length 3. So here are all possible subsets and their encodings in the "loop" order:
000 == []
001 == ["apple"]
010 == ["banana"]
011 == ["apple", "banana"]
100 == ["pear"]
101 == ["apple", "pear"]
110 == ["banana", "pear"]
111 == ["apple", "banana", "pear"]
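The table above can be reproduced with a short sketch. The name power_set is only illustrative, and (i >> j) & 1 is used here as an equivalent of (i >> j) % 2 for non-negative i:

```python
def power_set(items):
    """Yield every subset of items, one per integer in range(2**n)."""
    n = len(items)
    for i in range(2 ** n):
        # bit j of i decides whether items[j] is included
        yield [items[j] for j in range(n) if (i >> j) & 1]

items = ["apple", "banana", "pear"]
for i, subset in enumerate(power_set(items)):
    # format(i, "03b") prints i as a 3-digit binary string
    print(format(i, "03b"), subset)
```

Note that the rightmost binary digit is bit 0, which is why 001 selects "apple", the first element.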

Your code was basically creating new lists in every loop and not saving the previous results.
Here is the corrected code to get all combinations:
def powerSet(items):
    # generate all combinations of N items, items is a Python list
    N = len(items)
    # This will store the complete set of combinations
    outer_combo = []
    # enumerate the 2**N possible combinations
    for i in range(2**N):
        # This will store the intermediate sets
        inner_combo = []
        for j in range(N):
            # test bit jth of integer i
            if (i >> j) % 2 == 1:
                inner_combo.append(items[j])
        # Uncomment below to understand each step
        # print(inner_combo)
        # Add the intermediate set to final result
        outer_combo.append(inner_combo)
    return outer_combo
print(powerSet([1,2,3]))
# Output : [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
Now let's come to your points:
You are generating all numbers from 0 to (2**N)-1. So, in our example [1, 2, 3], i takes the values 0,1,2,3,4,5,6,7.
The binary representations of these values are 000, 001, 010, 011, 100, 101, 110, 111 respectively.
Using i>>j you shift the binary representation of i right by j places, so that bit j of i ends up in the rightmost position.
Then (i>>j)%2==1 checks whether that rightmost bit is 1, i.e. whether bit j of i is set.
The second loop, for j in range(N):, helps in two ways. First, N is not only the number of elements in the list but also the number of relevant bits to test with (i>>j)%2==1. An integer can hold more bits than that, but since i never exceeds (2**N)-1, only the lowest N bits can be set. Second, it tests each of those N bit positions in turn, so every 1 bit in i is found.
Here is an example. Take i=5, i.e. 101. Now j can have the values 0, 1, 2. In the first case, when j=0, (i>>j)%2==1 is True since the bit at position 0 is 1. So items[0], i.e. 1, is appended to the intermediate combination, and we have [1] so far. Next, j=1 and (i>>j)%2==1 is False since the bit at position 1 is 0, so no element is added. Finally, when j=2, (i>>j)%2==1 is True since the bit at position 2 is 1. Hence items[2], i.e. 3, is added to the intermediate result, and the set becomes [1, 3].
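The i=5 walk-through can be traced step by step with a small sketch. The variable names shifted and bit are only illustrative:

```python
items = [1, 2, 3]
i = 5  # binary 101
combo = []
for j in range(len(items)):
    shifted = i >> j   # bit j of i is now the rightmost bit
    bit = shifted % 2  # 1 if that bit is set, else 0
    print(f"j={j}: i >> j = {shifted} (binary {shifted:b}), bit = {bit}")
    if bit == 1:
        combo.append(items[j])
print(combo)  # [1, 3]
```

The shifted values are 5, 2 and 1, so the tested bits are 1, 0 and 1, matching the description above.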

Related

When making comparison between the elements in a list, how to efficiently iterate and improve the time complexity from O(n^2)?

I have a list where I would like to compare each element of the list with each other. I know we can do that using a nested loop but the time complexity is O(n^2). Is there any option to improve the time complexity and make the comparisons efficient?
For example:
I have a list where I would like to count the differing digits between each pair of elements. Consider the list array=['100','110','010','011','100']. array[0] is the same as array[4] (i.e. 100 and 100), array[0] differs from array[1] in 1 digit (i.e. 100 and 110), and array[0] differs from array[3] in 3 digits (i.e. 100 and 011). Two elements count as similar if they are identical or differ in exactly 1 digit. I would like to return a list in which each entry is the number of other elements similar to the corresponding input element (i.e. difference in digits <= 1).
For the input list array=['100','110','010','011','100'], my expected output is [2,3,2,1,2]. In the output list, output[0] indicates that array[0] is similar to array[1] and array[4] (i.e. for 100 there are 2 other similar elements in the list, 110 and 100).
This is my code, which works but is a very inefficient O(n^2):
def diff(a,b):
    difference= [i for i in range(len(a)) if a[i]!=b[i]]
    return len(difference)
def find_similarity_int(array):
    # write your code in Python 3.6
    res=[0]*len(array)
    string=[]
    for n in array:
        string.append(str(n))
    for i in range(0,len(string)):
        for j in range(i+1,len(string)):
            count=diff(string[i],string[j])
            if(count<=1):
                res[i]=res[i]+1
                res[j]=res[j]+1
    return res
input_list=['100','110','010','011','100']
output=find_similarity_int(input_list)
print("The similarity metrics for the given list is : ",output)
Output:
The similarity metrics for the given list is : [2, 3, 2, 1, 2]
Could anyone please suggest an efficient way to make the comparison, preferably with just 1 loop? Thanks!
If the values are binary digits only, you can get an O(n*m) solution (where m is the width of the values) using a multiset (Counter from collections). Using the count of each value in the multiset, for each number add up the counts of the values that differ from it in exactly one bit, plus the number of its duplicates:
from collections import Counter
def simCount(L):
    counts = Counter(L) # multiset of distinct values / count
    result = []
    for n in L:
        r = counts[n]-1 # duplicates
        for i,b in enumerate(n): # 1 bit changes
            r += counts[n[:i]+"01"[b=="0"]+n[i+1:]] # count others
        result.append(r) # sum of similars
    return result
Output:
A = ['100','110','010','011','100']
print(simCount(A)) # [2, 3, 2, 1, 2]
To avoid the string manipulations on every item, you can convert them to integers and use bitwise operators to make the 1-bit changes:
from collections import Counter
def simCount(L):
    bits = [1<<i for i in range(len(L[0]))] # bit masks
    L = [int(n,2) for n in L] # numeric values
    counts = Counter(L) # multiset n:count
    result = []
    for n in L:
        result.append(counts[n]-1) # duplicates
        for b in bits: # 1 bit changes
            result[-1] += counts[b^n] # sum similars
    return result
A = ['100','110','010','011','100']
print(simCount(A)) # [2, 3, 2, 1, 2]
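One way to gain confidence in the Counter approach is to cross-check it against a brute-force pairwise count on random input. The following is a sketch under the assumption that all values are binary strings of equal width; the names sim_count_bruteforce and sim_count_bitwise are illustrative, not from the answers above:

```python
import random
from collections import Counter

def sim_count_bruteforce(values):
    """O(n^2 * m) reference: compare every pair digit by digit."""
    result = [0] * len(values)
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            differing = sum(a != b for a, b in zip(values[i], values[j]))
            if differing <= 1:
                result[i] += 1
                result[j] += 1
    return result

def sim_count_bitwise(values):
    """O(n * m): look up each one-bit neighbour in a multiset."""
    masks = [1 << i for i in range(len(values[0]))]
    numbers = [int(v, 2) for v in values]
    counts = Counter(numbers)
    result = []
    for n in numbers:
        total = counts[n] - 1          # other copies of the same value
        for mask in masks:
            total += counts[n ^ mask]  # values differing in exactly one bit
        result.append(total)
    return result

random.seed(0)
sample = [format(random.getrandbits(4), "04b") for _ in range(200)]
assert sim_count_bitwise(sample) == sim_count_bruteforce(sample)
```

Looking up a missing key in a Counter returns 0 without inserting it, so the neighbour lookups do not grow the multiset.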

Generate combinations such that the total is always 100 and uses a defined jump value

I am looking to generate a list of combinations such that the total of each combination is always 100. The combinations have to be generated based on a jump value (similar to the step used in range or a loop).
The number of elements in each combination is given by the length of parent_list. If the parent list has 10 elements, each list in the output needs to have 10 elements.
parent_list=['a','b','c', 'd']
jump=15
sample of expected output is
[[15,25,25,35],[30,50,10,10],[20,15,20,45]]
I used the solution given in this question, but it doesn't give the option to add the Jump parameter. Fractions are allowed too.
This program finds all combinations of n positive integers whose sum is total such that at least one of them is a multiple of jump. It works in a recursive way, setting jump to 1 if the current sequence already contains an element that's a multiple of the original jump.
def find_sum(n, total, jump, seq=()):
    if n == 0:
        if total == 0 and jump == 1: yield seq
        return
    for i in range(1, total+1):
        yield from find_sum(n - 1, total - i, jump if i % jump else 1, seq + (i,))
for seq in find_sum(4, 100, 15):
    print(seq)
There are still a lot of solutions.

Using greedy algorithm within two lists in Python

We observe a particular data sample explained by an int value n and two lists A and B, where the two lists contain integer element or elements ranging from 1 to n, and the elements in each list aren't repeated. (There could be the same element in both lists, however.)
n represents the size of the observed sample.
Elements in A represent the numbers that are 'taken out' from the sample. Hence, if n=5 and A=[2,3], the size of our resulting sample would be 3.
Elements in B represent the numbers that are 'put back into' the sample. The maximum size of the resulting sample cannot exceed n.
However, the elements in B can only be put back in if and only if there is an element in A that is either equal to the element in B, or one less or greater than the element in B. For example, if n=5, A=[2,3], B=[4], the size of our sample would be 4, as there exists an element in A that is one less than the element in B.
Finally, the elements in B are only considered once if they are 'put back in'. If n=5, A=[2,3,5], B=[3,4], even though the elements in B satisfy the condition twice each, the size of the resulting sample would still be 4.
Some of the test cases are given:
n    A         B            return
5    [2, 4]    [1, 3, 5]    5
5    [2, 4]    [3]          4
3    [3]       [1]          2
I'm aware that this is a type of a greedy algorithm (which I am not super familiar with), but I also tried the following:
def solution(n, A, B):
    count = n - len(A)
    for i in range(len(B)):
        if B[i]-1 in A:
            count += 1
        elif B[i]+1 in A:
            count += 1
        elif B[i] in A:
            count += 1
        else:
            count += 0
    if n > count:
        answer = count
    else:
        answer = n
    return answer
While this seemingly works, it doesn't take into account that the elements in B cannot be considered once they are put back in already. Is there any edit I can make to my code, and how would this problem be optimally solved?
I guess the key was to use set() to first get the elements that do not overlap between the two lists, and then remove the elements of A that can be put back (which is done similarly to my initial code).
def solution(n, A, B):
    B_uniq = set(B)-set(A)
    A_uniq = set(A)-set(B)
    for i in B_uniq:
        if i-1 in A_uniq:
            A_uniq.remove(i-1)
        elif i+1 in A_uniq:
            A_uniq.remove(i+1)
    return n-len(A_uniq)
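The greedy choice depends on the order in which the elements of B are visited, and Python does not guarantee any particular iteration order for a set. A sketch that makes the order explicit by visiting the candidates in ascending order and preferring the smaller neighbour is shown below. The name solution_sorted is illustrative, and the checks use the test cases from the question:

```python
def solution_sorted(n, A, B):
    """Greedy variant that visits the put-back candidates in ascending order."""
    taken = set(A) - set(B)  # taken out and not restored by an equal element of B
    spare = set(B) - set(A)  # elements of B still available to restore a neighbour
    for b in sorted(spare):
        if b - 1 in taken:    # prefer the smaller neighbour first
            taken.remove(b - 1)
        elif b + 1 in taken:
            taken.remove(b + 1)
    return n - len(taken)

print(solution_sorted(5, [2, 4], [1, 3, 5]))  # 5
print(solution_sorted(5, [2, 4], [3]))        # 4
print(solution_sorted(3, [3], [1]))           # 2
```

Visiting in ascending order and trying b - 1 before b + 1 leaves the larger neighbour free for later candidates, which an arbitrary visiting order could use up too early.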

maximal subset of integers that are not evenly divisible by k

This is a practice problem I am solving:
Given a set S of distinct integers, print the size of a maximal subset S' of S where the sum of any 2 numbers in S' is not evenly divisible by k.
My approach was to limit the problem to the subset S[0...i] where 0 < i <= n-1 and determine the length of the maximal subset for that subproblem, then extend the subproblem by 1. I know there is a different approach to this problem but I am confused why my solution does not work.
ex) for n = 10, k = 5, and S = [770528134, 663501748, 384261537, 800309024, 103668401, 538539662, 385488901, 101262949, 557792122, 46058493]
dp = [0 for _ in range(n)]
dp[0] = 1
for i in range(1, n):
    flag = 0
    for j in range(i):
        if s[j] == "#":
            pass
        elif (not s[j] == "#") and (s[j] + s[i])%k==0:
            dp[i] = dp[i-1]
            flag = 1
            s[i] = "#"
            break
    if not flag == 1:
        dp[i] = dp[i-1] + 1
print(dp[-1])
The output should be 6 but my function prints 5. What I try to do is iterate j from 0 to i and check whether (s[j] + s[i])%k==0 for any j < i. If so, including s[i] in S' would be wrong, so I mark s[i] with a # to indicate that it is not in S'.
Your lack of comments and explanatory names makes your code very hard to follow, so I do not understand it. (Your example using a list when you talk of sets, and the use of both s and S for your "set", do not help.) However, the basic idea of your algorithm is flawed: this problem for a given set cannot be solved by extending the solution for a proper subset.
For example, take k=3, set S=[1,4,2,5,8]. For the first three elements [1,4,2], the solution is [1,4]. For the first four elements [1,4,2,5], the solution is either [1,4] or [2,5]. For the entire set, the solution is [2,5,8]. You see there is no "path" from the solution from the first three elements through the first five: you need to "restart" at either the first four or the entire set.
A solution that does work partitions the entire set S into equivalence classes where the elements in each class have the same remainder when divided by k. Examining these equivalence classes gives the final result. Let me know if you need more details. Note that you will need to decide clearly if any 2 numbers in S' means any 2 distinct numbers in S': this changes what you do at one or two of the equivalence classes.
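The equivalence-class idea can be sketched as follows. This is one possible reading, assuming k is a positive integer and that "any 2 numbers" means any 2 distinct elements of S'; the function name is illustrative:

```python
from collections import Counter

def max_non_divisible_subset(numbers, k):
    """Size of a largest subset in which no two distinct elements sum to a multiple of k."""
    counts = Counter(x % k for x in numbers)  # size of each remainder class
    size = min(counts[0], 1)                  # two multiples of k always clash
    for r in range(1, k // 2 + 1):
        if r == k - r:
            size += min(counts[r], 1)         # class k/2 clashes with itself
        else:
            # classes r and k - r clash with each other, so keep the larger one
            size += max(counts[r], counts[k - r])
    return size

S = [770528134, 663501748, 384261537, 800309024, 103668401,
     538539662, 385488901, 101262949, 557792122, 46058493]
print(max_non_divisible_subset(S, 5))                # 6
print(max_non_divisible_subset([1, 4, 2, 5, 8], 3))  # 3
```

For the question's example the remainder classes 1, 2, 3, 4 have sizes 2, 3, 2, 3, so the result is max(2, 3) + max(3, 2) = 6.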

What is fastest way to determine numbers are within specific range of each other in Python?

I have a list of numbers as follows:
L = [ 1430185458, 1430185456, 1430185245, 1430185246, 1430185001 ]
I am trying to determine which numbers are within 2 of each other. The list will be unsorted when I receive it.
If a number is within 2 of another number in the list, I have to return "1" at the same position in which that number was received.
I was able to achieve the desired result, but the code runs very slowly. My approach involves sorting the list and iterating over it twice with two pointers, comparing successive elements. I will have millions of records coming in as separate lists.
I am just trying to find the best possible approach to this problem.
Edit - Apologies, I was away for a while. The list can have any number of elements, from 1 to n. The idea is to return either 0 or 1 at the same position in which each number was received. I cannot post the actual code I implemented, but here is pseudocode.
a. create new list as list of list with second part as 0 for each element. We assume that there are no numbers within range of 2 of each other.
[[1430185458,0], [1430185456,0], [1430185245,0], [1430185246,0], [1430185001,0]]
b. sort original list
c. compare first element to second, second to third and so on until end of list is reached and whenever difference is less than or equal to 2 update corresponding second elements in step a to 1.
[[1430185458,1], [1430185456,1], [1430185245,1], [1430185246,1], [1430185001,0]]
The goal is to be fast, so that presumably means an O(N) algorithm. Building an NxN difference matrix is O(N^2), so that's not good at all. Sorting is O(N*log(N)), so that's out, too. Assuming average case O(1) behavior for dictionary insert and lookup, the following is an O(N) algorithm. It rips through a list of a million random integers in a couple of seconds.
def in_range (numbers) :
    result = [0] * len(numbers)
    index = {}
    for idx, number in enumerate(numbers) :
        for offset in range(-2,3) :
            match_idx = index.get(number+offset)
            if match_idx is not None :
                result[match_idx] = result[idx] = 1
        index[number] = idx
    return result
Update
I have to return "1" at exact same position number was received in.
The update to the question asks for a list of the form [[1,1],[2,1],[5,0]] given an input of [1,2,5]. I didn't do that. Instead, my code returns [1,1,0] given [1,2,5]. It's about 15% faster to produce that simple 0/1 list compared to the [[value,in_range],...] list. The desired list can easily be created using zip:
zip(numbers,in_range(numbers)) # Lazy iterator (Python 3)
list(zip(numbers,in_range(numbers))) # List of (value,in_range) tuples
I think this does what you need (process() modifies the list L). Very likely it's still optimizable, though:
def process(L):
    s = [(v,k) for k,v in enumerate(L)]
    s.sort()
    j = 0
    for i,v_k in enumerate(s):
        v = v_k[0]
        while j < i and v-s[j][0]>2:
            j += 1
        while j < i:
            L[s[j][1]] = 1
            L[s[i][1]] = 1
            j += 1
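As written, process() writes 1 into the matched positions of L but leaves every other position holding its original number rather than 0. A sketch of the same sort-based idea that returns a fresh 0/1 list is shown below. The name flag_close and the max_gap parameter are illustrative:

```python
def flag_close(numbers, max_gap=2):
    """Return a 0/1 list marking numbers within max_gap of another list entry."""
    # positions of the input, ordered by the value they hold
    order = sorted(range(len(numbers)), key=numbers.__getitem__)
    flags = [0] * len(numbers)
    # in sorted order, the closest value to any entry is one of its neighbours
    for prev, cur in zip(order, order[1:]):
        if numbers[cur] - numbers[prev] <= max_gap:
            flags[prev] = flags[cur] = 1
    return flags

L = [1430185458, 1430185456, 1430185245, 1430185246, 1430185001]
print(flag_close(L))  # [1, 1, 1, 1, 0]
```

This is O(N*log(N)) because of the sort, so it trades the dictionary-based O(N) average case for a version that does not depend on hashing.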
