0-1 Knapsack: Find Solution Set in Space-Optimised Implementation - python

I want to solve a 0-1 Knapsack problem with a maximum weight of ~200k and over 100k elements, and I eventually need the item set itself rather than only the optimal profit.
Researching 0-1 Knapsack, I read that a common way to solve this problem is via dynamic programming: build a table containing optimal solutions for subproblems, thus splitting the original problem into smaller parts, and later backtrack over the table to determine the item set. The maximum profit alone, without regard for the items taken, can be calculated in a memory-efficient manner (as outlined here).
The obvious issue here is that for the dimensions I have in mind, this approach would consume more memory than is feasible (requiring O(n*W) space, with n being the number of elements and W being the maximum capacity). Researching further I found mention (here for example, also see "Knapsack Problems" by Kellerer, Pferschy and Pisinger) of a memory efficient way to solve 0-1 Knapsack.
We start by splitting the item set into two subsets of roughly equal size. We treat both subsets as their own knapsack problem with the original maximum weight W and determine the last row of the maximum-profit calculation for both subsets in the memory-efficient way (detailed above).
The next step is to find out how to optimally split the capacity between the two subsets. To do this, we look for the pair of capacities w1 and w2 whose entries in the two rows give the maximum combined profit. As I understand it, it is critical to maintain w1 + w2 = W, so I iterate through the first row and pair each index with the index at the opposite end of the second row. My current implementation for this step looks like this:
def split(weights, values, n, w, i):
    # s1 is the bigger subset size if n is not even
    s1 = n // 2 + (n & 1)
    s2 = n // 2
    row1 = maximum_profit(weights, values, s1, w)
    row2 = maximum_profit(weights[s1:], values[s1:], s2, w)
    max_profits_for_capacity = [x + y for x, y in zip(row1, row2[::-1])]
    max_profits = max(max_profits_for_capacity)
    optimal_weight_index = max_profits_for_capacity.index(max_profits)
    c1 = row1[optimal_weight_index]
    c2 = row2[w - optimal_weight_index - 1]
c1 and c2 are then the maximum profits for each of the subsets while maintaining c1 + c2 = W. With these values we recurse into each of the subsets:
    split(weights[:s1], values[:s1], s1, c1, i)
    split(weights[s1:], values[s1:], s2, c2, i + s1)
This is where the descriptions lose me. Eventually this code will recurse to n == 1 with a value of w. How do I determine if an element is included given an item index i and a maximum (local) capacity w?
I can provide a small example data set to illustrate the workings of my code in detail and where it goes wrong. Thank you very much.

First, I guess you have a mistake in how you describe c and w: you treat c1 and c2 as capacities, but you actually take them from the profit lists.
As for the question: the return value of your split function defines what kind of answer you get.
Since you take the split all the way down to n == 1 and you want the indices of the items picked into the knapsack, you can simply return [0] or [1] at this step:
if n == 1:
    if weights[0] <= w:
        return [1]
    return [0]
[1] means the item is picked into the resulting set, [0] means it is not.
Then concatenate the partial results in the other recursion steps of your split function, like:
def split(..):
    ..
    # list concatenation
    return split(weights[:s1], values[:s1], s1, c1, i) + split(weights[s1:], values[s1:], s2, c2, i + s1)
As a result you will get a list of size n (the number of items you split over) consisting of zeroes and ones.
The total complexity would be:
O(nW log n) time, since we keep splitting until n == 1
O(W) memory, since we only ever store a part of the resulting list while recursing
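Putting the pieces together, a rough and untested sketch of the whole procedure could look like the following. It assumes non-negative values, uses a plausible stand-in for the question's maximum_profit (the single-row DP), drops the index parameter i since the 0/1 list already encodes positions, and passes the two capacities (not the two profits) down into the recursion, per the first remark above.
def maximum_profit(weights, values, n, w):
    # last DP row: best profit with the first n items for every capacity 0..w
    row = [0] * (w + 1)
    for i in range(n):
        for c in range(w, weights[i] - 1, -1):
            row[c] = max(row[c], row[c - weights[i]] + values[i])
    return row

def split(weights, values, n, w):
    # returns a 0/1 list of length n marking the items of an optimal solution
    if n == 0:
        return []
    if n == 1:
        return [1] if weights[0] <= w else [0]
    s1 = n // 2 + (n & 1)
    s2 = n // 2
    row1 = maximum_profit(weights[:s1], values[:s1], s1, w)
    row2 = maximum_profit(weights[s1:], values[s1:], s2, w)
    # capacity c for the first half is paired with capacity w - c for the second half
    best_c = max(range(w + 1), key=lambda c: row1[c] + row2[w - c])
    return (split(weights[:s1], values[:s1], s1, best_c) +
            split(weights[s1:], values[s1:], s2, w - best_c))
The resulting 0/1 list can be turned into item indices with e.g. [i for i, taken in enumerate(split(weights, values, n, W)) if taken].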

Related

String sorting problem with code execution time limit

I was recently trying to solve a HackerEarth problem. The code worked on the sample inputs and some custom inputs that I gave. But, when I submitted, it showed errors for exceeding the time limit. Can someone explain how I can make the code run faster?
Problem Statement: Cyclic shift
A large binary number is represented by a string A of size N and comprises 0s and 1s. You must perform a cyclic shift on this string. The cyclic shift operation is defined as follows:
If the string A is [A0, A1,..., An-1], then after performing one cyclic shift, the string becomes [A1, A2,..., An-1, A0].
You performed the shift an infinite number of times and each time you recorded the value of the binary number represented by the string. The maximum binary number formed after performing the operation (possibly 0 times) is B. Your task is to determine the number of cyclic shifts that must be performed such that the value represented by the string A is equal to B for the Kth time.
Input format:
First line: A single integer T denoting the number of test cases
For each test case:
First line: Two space-separated integers N and K
Second line: A denoting the string
Output format:
For each test case, print a single line containing one integer that represents the number of cyclic shift operations performed such that the value represented by string A is equal to B for the Kth time.
Code:
import math

def value(s):
    u = len(s)
    d = 0
    for h in range(u):
        d = d + (int(s[u-1-h]) * math.pow(2, h))
    return d

t = int(input())
for i in range(t):
    x = list(map(int, input().split()))
    n = x[0]
    k = x[1]
    a = input()
    v = 0
    for j in range(n):
        a = a[1:] + a[0]
        if value(a) > v:
            b = a
            v = value(a)
    ctr = 0
    cou = 0
    while ctr < k:
        a = a[1:] + a[0]
        cou = cou + 1
        if a == b:
            ctr = ctr + 1
    print(cou)
In the problem, the constraint on n is 0 <= n <= 1e5. In the function value(), you are calculating an integer from a binary string whose length can go up to 1e5, so the integer you compute can be as large as pow(2, 1e5). This is surely impractical.
As mentioned by Prune, you must use an efficient algorithm for finding the subsequence, say sub1, whose repetitions make up the given string A. If you solve this by brute force, the time complexity will be O(n*n); since the maximum value of n is 1e5, the time limit will be exceeded, so use an efficient algorithm.
I can't do much with the code you posted, since you obfuscated it with meaningless variables and a lack of explanation. When I scan it, I get the impression that you've taken the straightforward approach of doing a single-digit shift in a long-running loop. You count iterations until you hit B for the Kth time.
This is easy to understand, but cumbersome and inefficient.
Since the cycle repeats every N iterations, you gain no new information from repeating that process. All you need to do is find where in the series of N iterations you encounter B ... which could be multiple times.
In order for B to appear multiple times, A must consist of a particular sub-sequence of bits, repeated 2 or more times. For instance, 101010 or 011011. You can detect this with a simple addition to your current algorithm: at each iteration, check to see whether the current string matches the original. The first time you hit this, simply compute the repetition factor as rep = len(a) / j. At this point, exit the shifting loop: the present value of b is the correct one.
Now that you have b and its position in the first j rotations, you can directly compute the needed result without further processing.
I expect that you can finish the algorithm and do the coding from here.
Ah -- taken as a requirements description, the wording of your problem suggests that B is a given. If not, then you need to detect the largest value.
To find B, append A to itself. Find the A-length string with the largest value. You can hasten this by finding the longest string of 1s, applying other well-known string-search algorithms for the value-trees after the first 0 following those largest strings.
Note that, while you iterate over A, you look for the first place in which you repeat the original value: this is the desired repetition length, which drives the direct-computation phase in the first part of my answer.
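As a concrete (untested) sketch of that direct computation, assuming my reading of the problem is right: the smallest period d of A is found via the doubled string, the shift count in 1..d that yields the maximum rotation B is located, and since occurrences of B repeat every d shifts, the Kth occurrence follows by arithmetic. The rotation comparison below is still worst-case quadratic; a linear-time maximum-rotation routine (e.g. Booth's algorithm) would remove that bottleneck.
def kth_shift_count(a, k):
    n = len(a)
    doubled = a + a
    d = doubled.find(a, 1)            # smallest period: rotating by d gives a again
    # among the d distinct rotations, pick the shift count (1..d) with the largest value;
    # all strings have length n, so lexicographic comparison equals numeric comparison
    best_shift = max(range(1, d + 1), key=lambda i: doubled[i:i + n])
    return best_shift + (k - 1) * d   # occurrences of B repeat every d shifts
For example, kth_shift_count("00111", 2) returns 7: B = "11100" first appears after 2 shifts and again after 7.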

Generating n binary vectors where each vector has a Hamming distance of d from every other vector

I'm trying to generate n binary vectors of some arbitrary length l, where each vector i has a Hamming distance of d (where d is even) from every other vector j. I'm not sure if there are any theoretical relationships between n, l, and d, but I'm wondering if there are any implementations for this task. My current implementation is shown below. Sometimes I am successful, other times the code hangs, which indicates either a) it's not possible to find n such vectors given l and d, or b) the search takes a long time especially for large values of l.
My questions are:
Are there any efficient implementations of this task?
What kind of theoretical relationships exist between n, l, and d?
import numpy as np

def get_bin(n):
    return ''.join([str(np.random.randint(0, 2)) for _ in range(n)])

def hamming(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def generate_codebook(n, num_codes, d):
    codebooks = []
    seen = []
    while len(codebooks) < num_codes:
        code = get_bin(n)
        if code in seen:
            continue
        else:
            if len(codebooks) == 0:
                codebooks.append(code)
                print(len(codebooks), code)
            else:
                if all(hamming(code, x) == d for x in codebooks):
                    codebooks.append(code)
                    print(len(codebooks), code)
            seen.append(code)
    codebook_vectorized = [[int(b) for b in x] for x in codebooks]
    return np.array(codebook_vectorized)
Example:
codebook = generate_codebook(4,3,2)
codebook
1 1111
2 1001
3 0101
Let's build a graph G where every L-bit binary vector v is a vertex, and there is an edge (vi, vj) only when the Hamming distance between vi and vj is equal to d. Now we need to find a clique of size n in this graph.
Clique is a subset of vertices of an undirected graph such that every
two distinct vertices in the clique are adjacent.
The task of finding a clique of given size in an arbitrary graph is NP-complete. You can read about this problem and some algorithms in this wikipedia article.
There are many special cases of this problem. For example, for perfect graphs there is a polynomial algorithm. Don't know if it is possible to show that our graph is one of these special cases.
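As a brute-force illustration of this formulation (only feasible for small l, since the graph has 2^l vertices), one could build the graph with networkx and scan its maximal cliques. The function name and parameters here are made up for the example:
from itertools import combinations, product
import networkx as nx

def codebook_via_clique(l, n, d):
    vectors = [''.join(bits) for bits in product('01', repeat=l)]
    G = nx.Graph()
    G.add_nodes_from(vectors)
    for u, v in combinations(vectors, 2):
        if sum(c1 != c2 for c1, c2 in zip(u, v)) == d:   # Hamming distance == d
            G.add_edge(u, v)
    for clique in nx.find_cliques(G):                    # iterates over maximal cliques
        if len(clique) >= n:
            return clique[:n]
    return None                                          # no clique of size n exists
For instance, codebook_via_clique(4, 3, 2) returns three 4-bit vectors that are pairwise at distance 2.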
Not a real solution, but more of a partial discussion about the relationship between l, d and n and the process of generating vectors. In any case, you may consider posting the question (or a similar one, in more formal terms) to Mathematics Stack Exchange. I have been reasoning as I was writing, but I hope I didn't make a mistake.
Let's say we have l = 6. Since the Hamming distance depends only on position-wise differences, you can start by putting one arbitrary vector in your set (if there are solutions, some may not include it, but at least one should). So let's begin with an initial v1 = 000000. Now, if d = 6 then obviously n can only be 1 or 2 (with 111111). If d = 1, you will find that n can also only be 1 or 2; for example, you could add 000001, but any other possible vector will have a distance of 2 or more to at least one of the vectors you already have.
Let's say d = 4. You need to change 4 positions and keep the other 2, so you have 4-combinations from a 6-element set, which is 15 choices: 001111, 010111, etc. You can see now that the binomial coefficient C(l, d) plus 1 is an upper bound for n. Let's pick v2 = 001111, and say that the kept positions are T = [1, 2] and the changed ones are S = [3, 4, 5, 6]. Now to go on, we could consider making changes to v2; however, in order to keep the right distances we must follow these rules:
We must make 4 changes to v2.
If we change a position in S, we must make another change in a position in T (and vice versa). Otherwise, the distance to v1 would not be kept.
Logically, if d were odd you would be done now (only sets of two elements could be formed), but fortunately you already said that your distance numbers are even. So we divide our number by two, which is 2, and need to pick 2 elements from S, C(4, 2) = 6, and 2 elements from T, C(2, 2) = 1, giving us 6 * 1 = 6 options. You should note now that C(d, d/2) * C(l - d, d/2) + 2 is a new, lower upper bound for n, if d is even.
Let's pick v3 = 111100. v3 now has four kinds of positions: positions that have changed with respect to both v1 and v2, P1 = [1, 2]; positions that have not changed with respect to either v1 or v2, P2 = [] (none in this case); positions that have changed with respect to v1 but not with respect to v2, P3 = [3, 4]; and positions that have changed with respect to v2 but not with respect to v1, P4 = [5, 6].
Same deal, we need 4 changes, but now each change we make to a P1 position must imply a change in a P2 position, and each change we make to a P3 position must imply a change in a P4 position. The only remaining option is v4 = 110011, and that would be it; the maximum n would be 4.
So, thinking about the problem from a combinatoric point of view, after each change you will have an exponentially increasing number of "types of positions" (2 after the first change, 4 after the second, 8, 16...) defined in terms of whether they are equal or not in each of the previously added vectors, and these can be arranged in couples through a "symmetry" or "complement" relationship. On each step, you can (I think, and this is the part of this reasoning that I am less sure about) greedily choose a set of changes from these couples and compute the sizes of the "types of positions" for the next step. If this is all correct, you should be able to write an algorithm based on this to generate and/or count the possible sets of vectors for some particular l and d and n if given.

The fastest way to find 2 numbers from two lists that in sum equal to x

My code:
n = 3
a1 = 0
b1 = 10
a2 = 2
b2 = 2
if b1 > n:
    b1 = n
if b2 > n:
    b2 = n
diap1 = [x for x in range(a1, b1+1)]
diap2 = [x for x in range(a2, b2+1)]

def pairs(d1, d2, n):
    res = 0
    same = 0
    sl1 = sorted(d1)
    sl2 = sorted(d2)
    for i in sl1:
        for j in sl2:
            if i+j == n and i != j:
                res += 1
            elif i+j == n and i == j:
                same += 1
    return res + same

result = pairs(diap1, diap2, n)
print(result)
NOTE: n, a1, b1, a2, b2 can change. The code should find 2 numbers from the 2 lists (1 from each) that sum to n. For example, the pairs (a, b) and (b, a) are different, but (a, a) and (a, a) are the same pair. So the output of my code is correct; for the code above it's 1 (the pair (1, 2)), but for big inputs it takes too much time. How can I optimize it to run faster?
Use set() for fast lookup...
setd2 = set(d2)
Don't try all possible number pairs. Once you fix on a number from the first list, say i, just see if (n-i) is in the second set.
for i in sl1:
    if (n - i) in setd2:
        pass  # found match
    else:
        pass  # no match in setd2 for i
In the following way you can work fastest and find the two numbers whose sum is equal to n, storing them in a list of tuples as well:
s1 = set(list1)
s2 = set(list2)
nums = []
for item in s1:
    if n - item in s2:
        nums.append((item, n - item))
The accepted answer is really easy to understand and implement, but I just had to share this method. You can see your question is the same as this one.
This answer in particular is interesting because you do not need the extra space of building the sets. I'm including the algorithm here in my answer.
If the arrays are sorted you can do it in linear time and constant storage.
Start with two pointers, one pointing at the smallest element of A, the other pointing to the largest element of B.
Calculate the sum of the pointed to elements.
If it is smaller than k increment the pointer into A so that it points to the next largest element.
If it is larger than k decrement the pointer into B so that it points to the next smallest element.
If it is exactly k you've found a pair. Move one of the pointers and keep going to find the next pair.
If the arrays are initially unsorted then you can first sort them then use the above algorithm.
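For completeness, here is a minimal sketch of that two-pointer scan adapted to the counting variant asked about above; it assumes the values within each list are distinct, which holds for the ranges used in the question.
def count_pairs_two_pointers(d1, d2, n):
    a, b = sorted(d1), sorted(d2)
    i, j = 0, len(b) - 1
    count = 0
    while i < len(a) and j >= 0:
        s = a[i] + b[j]
        if s < n:
            i += 1          # too small: try a larger element from the first list
        elif s > n:
            j -= 1          # too large: try a smaller element from the second list
        else:
            count += 1      # found a pair summing to n
            i += 1
            j -= 1
    return count
For the example in the question (diap1 = [0, 1, 2, 3], diap2 = [2], n = 3) this returns 1, matching the original pairs() function.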
Thank you for clearly defining your question and for providing the code example that you are attempting to optimize.
Utilizing two key definitions from your question and the notation you provided, I limited my optimization attempt to the use of lists, and added the ability to randomly change the values associated with n, a1, b1, a2 and b2.
In order to show the optimization results, I created a module which uses the random.randint function to create a variety of list sizes and the timeit.Timer function to capture the amount of time your original pairs() function takes, as well as my suggested optimization in the pairs2() function.
In the pairs2() function, you will note that each iteration loop contains a break statement. These eliminate needless iteration through each list once the desired criterion is met. You should note that as the size of the lists grows, the pairs2() vs. pairs() time improves.
Test module code:
import random
from timeit import Timer

max_value = 10000
n = random.randint(1, max_value)
a1 = random.randint(0, max_value)
b1 = random.randint(1, max_value+1)
a2 = random.randint(0, max_value)
b2 = random.randint(1, max_value+1)
if b1 > n:
    b1 = n
if b2 > n:
    b2 = n
if a1 >= b1:
    a1 = random.randint(0, b1-1)
if a2 >= b2:
    a2 = random.randint(0, b2-1)
diap1 = [x for x in range(a1, b1)]
diap2 = [x for x in range(a2, b2)]
print("Length diap1 =", len(diap1))
print("Length diap2 =", len(diap2))

def pairs(d1, d2, n):
    res = 0
    same = 0
    sl1 = sorted(d1)
    sl2 = sorted(d2)
    for i in sl1:
        for j in sl2:
            if i+j == n and i != j:
                res += 1
            elif i+j == n and i == j:
                same += 1
    return res + same

def pairs2(d1, d2, n):
    res = 0
    same = 0
    sl1 = sorted(d1)
    sl2 = sorted(d2)
    for i in sl1:
        for j in sl2:
            if i+j == n and i != j:
                res += 1
                break
            elif i+j == n and i == j:
                same += 1
                break
        if res+same > 0:
            break
    return res + same

if __name__ == "__main__":
    result = 0
    timer = Timer("result = pairs(diap1, diap2, n)",
                  "from __main__ import diap1, diap2, n, pairs")
    print("pairs_time = ", timer.timeit(number=1), "result =", result)
    result = 0
    timer = Timer("result = pairs2(diap1, diap2, n)",
                  "from __main__ import diap1, diap2, n, pairs2")
    print("pairs2_time = ", timer.timeit(number=1), "result =", result)
If you pull a value from the first list and then search the second list for a value that makes the sum match the searched total, you can take a few shortcuts. For example, if the sum is too small, all values from the second list that are smaller than or equal to the one you just tried will also not give the right sum. Similarly if the sum is too large.
Using this info, I'd use following steps:
Set up two heaps, one minimum heap, one maximum heap.
Look at the top elements of each heap:
If the sum matches the searched value, you are done.
If the sum exceeds the searched value, remove the value from the maximum heap.
If the sum is less than the searched value, remove the value from the minimum heap.
If either heap is empty, there is no solution.
Note that using a heap is an optimization over sorting the two sequences right away. However, if you often have the case that there is no match, sorting the numbers before running the algorithm might be the faster approach. The reason is that a good sorting algorithm will outperform the implicit sorting through the heaps, not in asymptotic complexity but by some constant factors.
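A rough sketch of that heap-driven search, returning one matching pair or None: Python's heapq only provides min-heaps, so the max-heap is simulated by negating values. Whether this beats sorting depends on the data, as noted above.
import heapq

def find_pair_with_heaps(list1, list2, target):
    min_heap = list(list1)
    max_heap = [-x for x in list2]
    heapq.heapify(min_heap)
    heapq.heapify(max_heap)
    while min_heap and max_heap:
        a, b = min_heap[0], -max_heap[0]   # smallest of list1, largest of list2
        s = a + b
        if s == target:
            return a, b
        elif s > target:
            heapq.heappop(max_heap)        # b is too large for every remaining a
        else:
            heapq.heappop(min_heap)        # a is too small for every remaining b
    return None                            # no pair sums to target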

k-greatest double selection

Imagine you have two sacks (A and B) with N and M balls respectively in it. Each ball with a known numeric value (profit). You are asked to extract (with replacement) the pair of balls with the maximum total profit (given by the multiplication of the selected balls).
The best extraction is obvious: Select the greatest valued ball from A as well as from B.
The problem comes when you are asked to give the 2nd or kth best selection. Following the previous approach you should select the greatest valued balls from A and B without repeating selections.
This can be clumsily solved by calculating the value of every possible selection and ordering them (example in Python):
def solution(A, B, K):
    if K < 1:
        return 0
    pool = []
    for a in A:
        for b in B:
            pool.append(a*b)
    pool.sort(reverse=True)
    if K > len(pool):
        return 0
    return pool[K-1]
This works but its worst-case time complexity is O(N*M*log(N*M)), and I bet there are better solutions.
I reached a solution based on a table where A and B elements are sorted from higher value to lower and each of these values has associated an index representing the next value to test from the other column. Initially this table would look like:
The first element from A is 25 and it has to be tested (index 2 select from b = 0) against 20 so 25*20=500 is the first greatest selection and, after increasing the indexes to check, the table changes to:
Using these indexes we have a swift way to get the best selection candidates:
25 * 20 = 500 #first from A and second from B
20 * 20 = 400 #second from A and first from B
I tried to code this solution:
def solution(A, B, K):
    if K < 1:
        return 0
    sa = sorted(A, reverse=True)
    sb = sorted(B, reverse=True)
    for k in xrange(K):
        i = xfrom
        j = yfrom
        if i >= n and j >= n:
            ret = 0
            break
        best = None
        while i < n and j < n:
            selected = False
            # From left
            nexti = i
            nextj = sa[i][1]
            a = sa[nexti][0]
            b = sb[nextj][0]
            if best is None or best[2] < a*b:
                selected = True
                best = [nexti, nextj, a*b, 'l']
            # From right
            nexti = sb[j][1]
            nextj = j
            a = sa[nexti][0]
            b = sb[nextj][0]
            if best is None or best[2] < a*b:
                selected = True
                best = [nexti, nextj, a*b, 'r']
            # Keep looking?
            if not selected or abs(best[0]-best[1]) < 2:
                break
            i = min(best[:2]) + 1
            j = i
            print("Continue with: ", best, selected, i, j)
        # go, go, go
        print(best)
        if best[3] == 'l':
            dx[best[0]][1] = best[1] + 1
            dy[best[1]][1] += 1
        else:
            dx[best[0]][1] += 1
            dy[best[1]][1] = best[0] + 1
        if dx[best[0]][1] >= n:
            xfrom = best[0] + 1
        if dy[best[1]][1] >= n:
            yfrom = best[1] + 1
        ret = best[2]
    return ret
But it did not work for the online Codility judge (did I mention this is part of the solution to an already expired Codility challenge, Sillicium 2014?).
My questions are:
Is the second approach an unfinished good solution? If that is the case, any clue on what I may be missing?
Do you know any better approach for the problem?
You need to maintain a priority queue.
You start with (sa[0], sb[0]), then move onto (sa[0], sb[1]) and (sa[1], sb[0]). If (sa[0] * sb[1]) > (sa[1] * sb[0]), can we say anything about the comparative sizes of (sa[0], sb[2]) and (sa[1], sb[0])?
The answer is no. Thus we must maintain a priority queue, and after removing each (sa[i], sb[j]) (such that sa[i] * sb[j] is the biggest in the queue), we must add to the priority queue (sa[i - 1], sb[j]) and (sa[i], sb[j - 1]), and repeat this k times.
Incidentally, I gave this algorithm as an answer to a different question. The algorithm may seem to be different at first, but essentially it's solving the same problem.
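A small sketch of this priority-queue approach, as my own illustration rather than the answerer's code: it assumes non-negative profits and 1 <= K <= N*M, and simulates the max-heap with negated products since heapq is a min-heap.
import heapq

def kth_best_product(A, B, K):
    sa = sorted(A, reverse=True)
    sb = sorted(B, reverse=True)
    heap = [(-sa[0] * sb[0], 0, 0)]          # max-heap of (negated product, i, j)
    seen = {(0, 0)}
    best = None
    for _ in range(K):
        neg, i, j = heapq.heappop(heap)
        best = -neg                          # the value popped on iteration K is the answer
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(sa) and nj < len(sb) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-sa[ni] * sb[nj], ni, nj))
    return best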
I'm not sure I understand the "with replacement" bit...
...but assuming this is in fact the same as "How to find pair with kth largest sum?", then the key to the solution is to consider the matrix S of all the sums (or products, in your case), constructed from A and B (once they are sorted) -- this paper (referenced by @EvgenyKluev) gives this clue.
(You want A*B rather than A+B... but the answer is the same -- though negative numbers complicate but (I think) do not invalidate the approach.)
An example shows what is going on:
for A = (2, 3, 5, 8, 13)
and B = (4, 8, 12, 16)
we have the (notional) array S, where S[r, c] = A[r] + B[c], in this case:
6 ( 2+4), 10 ( 2+8), 14 ( 2+12), 18 ( 2+16)
7 ( 3+4), 11 ( 3+8), 15 ( 3+12), 19 ( 3+16)
9 ( 5+4), 13 ( 5+8), 17 ( 5+12), 21 ( 5+16)
12 ( 8+4), 16 ( 8+8), 20 ( 8+12), 24 ( 8+16)
17 (13+4), 21 (13+8), 25 (13+12), 29 (13+16)
(As the referenced paper points out, we don't need to construct the array S, we can generate the value of an item in S if or when we need it.)
The really interesting thing is that each column of S contains values in ascending order (of course), so we can extract the values from S in descending order by doing a merge of the columns (reading from the bottom).
Of course, merging the columns can be done using a priority queue (heap) -- hence the max-heap solution. The simplest approach being to start the heap with the bottom row of S, marking each heap item with the column it came from. Then pop the top of the heap, and push the next item from the same column as the one just popped, until you pop the kth item. (Since the bottom row is sorted, it is a trivial matter to seed the heap with it.)
The complexity of this is O(k log n) -- where 'n' is the number of columns. The procedure works equally well if you process the rows... so if there are 'm' rows and 'n' columns, you can choose the smaller of the two!
NB: the complexity is not O(k log k)... and since for a given pair of A and B the 'n' is constant, O(k log n) is really O(k)!
If you want to do many probes for different 'k', then the trick might be to cache the state of the process every now and then, so that future 'k's can be done by restarting from the nearest check-point. In the limit, one would run the merge to completion and store all possible values, for O(1) lookup !
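A sketch of that column merge for the sum version (products work the same way for non-negative inputs): the heap is seeded with the bottom row of S, and each pop is replaced by the next entry up the same column, so finding the kth item costs O(k log n).
import heapq

def kth_largest_sum(A, B, k):
    A, B = sorted(A), sorted(B)
    m, n = len(A), len(B)
    # seed the max-heap with the bottom row of S: S[m-1][c] = A[m-1] + B[c]
    heap = [(-(A[m - 1] + B[c]), m - 1, c) for c in range(n)]
    heapq.heapify(heap)
    value = None
    for _ in range(k):
        neg, r, c = heapq.heappop(heap)
        value = -neg
        if r > 0:                            # push the next entry up the same column
            heapq.heappush(heap, (-(A[r - 1] + B[c]), r - 1, c))
    return value
With the example above, k = 1, 2, 3, 4 give 29, 25, 24, 21.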

Generate "random" matrix of certain rank over a fixed set of elements

I'd like to generate matrices of size mxn and rank r, with elements coming from a specified finite set, e.g. {0,1} or {1,2,3,4,5}. I want them to be "random" in some very loose sense of that word, i.e. I want to get a variety of possible outputs from the algorithm with distribution vaguely similar to the distribution of all matrices over that set of elements with the specified rank.
In fact, I don't actually care that it has rank r, just that it's close to a matrix of rank r (measured by the Frobenius norm).
When the set at hand is the reals, I've been doing the following, which is perfectly adequate for my needs: generate matrices U of size mxr and V of nxr, with elements independently sampled from e.g. Normal(0, 2). Then U V' is an mxn matrix of rank r (well, <= r, but I think it's r with high probability).
If I just do that and then round to binary / 1-5, though, the rank increases.
It's also possible to get a lower-rank approximation to a matrix by doing an SVD and taking the first r singular values. Those values, though, won't lie in the desired set, and rounding them will again increase the rank.
This question is related, but accepted answer isn't "random," and the other answer suggests SVD, which doesn't work here as noted.
One possibility I've thought of is to make r linearly independent row or column vectors from the set and then get the rest of the matrix by linear combinations of those. I'm not really clear, though, either on how to get "random" linearly independent vectors, or how to combine them in a quasirandom way after that.
(Not that it's super-relevant, but I'm doing this in numpy.)
Update: I've tried the approach suggested by EMS in the comments, with this simple implementation:
import random
import numpy as np

des_rank = 3  # target rank (the rank of the real-valued factorization)
real = np.dot(np.random.normal(0, 1, (10, 3)), np.random.normal(0, 1, (3, 10)))
bin = (real > .5).astype(int)
rank = np.linalg.matrix_rank(bin)
niter = 0
while rank > des_rank:
    cand_changes = np.zeros((21, 5))
    for n in range(20):
        i, j = random.randrange(bin.shape[0]), random.randrange(bin.shape[1])
        v = 1 - bin[i, j]
        x = bin.copy()
        x[i, j] = v
        x_rank = np.linalg.matrix_rank(x)
        cand_changes[n, :] = (i, j, v, x_rank, max((rank + 1e-4) - x_rank, 0))
    cand_changes[-1, :] = (0, 0, bin[0, 0], rank, 1e-4)
    cdf = np.cumsum(cand_changes[:, -1])
    cdf /= cdf[-1]
    i, j, v, rank, score = cand_changes[np.searchsorted(cdf, random.random()), :]
    bin[int(i), int(j)] = int(v)
    niter += 1
    if niter % 1000 == 0:
        print(niter, rank)
It works quickly for small matrices but falls apart for e.g. 10x10 -- it seems to get stuck at rank 6 or 7, at least for hundreds of thousands of iterations.
It seems like this might work better with a better (ie less-flat) objective function, but I don't know what that would be.
I've also tried a simple rejection method for building up the matrix:
def fill_matrix(m, n, r, vals):
    assert m >= r and n >= r
    trans = False
    if m > n:  # more columns than rows I think is better
        m, n = n, m
        trans = True
    get_vec = lambda: np.array([random.choice(vals) for i in range(n)])
    vecs = []
    n_rejects = 0
    # fill in r linearly independent rows
    while len(vecs) < r:
        v = get_vec()
        if np.linalg.matrix_rank(np.vstack(vecs + [v])) > len(vecs):
            vecs.append(v)
        else:
            n_rejects += 1
    print("have {} independent ({} rejects)".format(r, n_rejects))
    # fill in the rest of the dependent rows
    while len(vecs) < m:
        v = get_vec()
        if np.linalg.matrix_rank(np.vstack(vecs + [v])) > len(vecs):
            n_rejects += 1
            if n_rejects % 1000 == 0:
                print(n_rejects)
        else:
            vecs.append(v)
    print("done ({} total rejects)".format(n_rejects))
    m = np.vstack(vecs)
    return m.T if trans else m
This works okay for e.g. 10x10 binary matrices with any rank, but not for 0-4 matrices or much larger binaries with lower rank. (For example, getting a 20x20 binary matrix of rank 15 took me 42,000 rejections; with 20x20 of rank 10, it took 1.2 million.)
This is clearly because the space spanned by the first r rows is too small a portion of the space I'm sampling from, e.g. {0,1}^10, in these cases.
We want the intersection of the span of the first r rows with the set of valid values.
So we could try sampling from the span and looking for valid values, but since the span involves real-valued coefficients that's never going to find us valid vectors (even if we normalize so that e.g. the first component is in the valid set).
Maybe this can be formulated as an integer programming problem, or something?
My friend, Daniel Johnson who commented above, came up with an idea but I see he never posted it. It's not very fleshed-out, but you might be able to adapt it.
If A is m-by-r and B is r-by-n and both have rank r then AB has rank r. Now, we just have to pick A and B such that AB has values only in the given set. The simplest case is S = {0,1,2,...,j}.
One choice would be to make A binary with appropriate row/col sums that guaranteed the correct rank, and B with column sums adding to no more than j (so that each term in the product is in S) and row sums picked to cause rank r (or at least encourage it, as rejection can be used).
I just think that we can come up with two independent sampling schemes on A and B that are less complicated and quicker than trying to attack the whole matrix at once. Unfortunately, all my matrix sampling code is on the other computer. I know it generalized easily to allowing entries in a bigger set than {0,1} (i.e. S), but I can't remember how the computation scaled with m*n.
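To make the idea slightly more concrete, here is a loose, rejection-based sketch of it; all names and parameters are hypothetical. A is a random binary m-by-r matrix resampled until it has full column rank, and each column of B distributes at most j units across its r entries, so every entry of A @ B lies in S = {0, 1, ..., j}; the rank-r requirement on the product is again handled by rejection rather than guaranteed.
import numpy as np

def sample_low_rank_matrix(m, n, r, j, max_tries=1000, seed=None):
    # sketch only: rejection sampling, not a guaranteed construction
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        A = rng.integers(0, 2, size=(m, r))              # binary factor
        if np.linalg.matrix_rank(A) < r:
            continue
        B = np.zeros((r, n), dtype=int)
        for col in range(n):
            for _ in range(rng.integers(0, j + 1)):      # spread at most j units per column
                B[rng.integers(r), col] += 1
        M = A @ B                                        # entries lie in {0, ..., j}
        if np.linalg.matrix_rank(M) == r:
            return M
    return None                                          # give up after max_tries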
I am not sure how useful this solution will be, but you can construct a matrix that will allow you to search for the solution on another matrix with only 0 and 1 as entries. If you search randomly on the binary matrix, it is equivalent to randomly modifying the elements of the final matrix, but it is possible to come up with some rules to do better than a random search.
If you want to generate an m-by-n matrix over the element set E with elements ei, 0<=i<k, you start off with the m-by-k*m matrix, A:
Clearly, this matrix has rank m. Now, you can construct another matrix, B, that has 1s at certain locations to pick the elements from the set E. The structure of this matrix is:
Each Bi is a k-by-n matrix. So, the size of AB is m-by-n and rank(AB) is min(m, rank(B)). If we want the output matrix to have only elements from our set, E, then each column of Bi has to have exactly one element set to 1, and the rest set to 0.
If you want to search for a certain rank on B randomly, you need to start off with a valid B with max rank, and rotate a random column j of a random Bi by a random amount. This is equivalent to changing row i, column j of A*B to a random element from our set, so it is not a very useful method.
However, you can do certain tricks with the matrices. For example, if k is 2, and there are no overlaps on first rows of B0 and B1, you can generate a linearly dependent row by adding the first rows of these two sub-matrices. The second row will also be linearly dependent on rows of these two matrices. I am not sure if this will easily generalize to k larger than 2, but I am sure there will be other tricks you can employ.
For example, one simple method to generate at most rank k (when m is k+1) is to get a random valid B0, keep rotating all rows of this matrix up to get B1 to Bm-2, set first row of Bm-1 to all 1, and the remaining rows to all 0. The rank cannot be less than k (assuming n > k), because B_0 columns have exactly 1 nonzero element. The remaining rows of the matrices are all linear combinations (in fact exact copies for almost all submatrices) of these rows. The first row of the last submatrix is the sum of all rows of the first submatrix, and the remaining rows of it are all zeros. For larger values of m, you can use permutations of rows of B0 instead of simple rotation.
Once you generate one matrix that satisfies the rank constraint, you may get away with randomly shuffling the rows and columns of it to generate others.
How about like this?
import numpy as np
from sklearn.decomposition import NMF

rank = 30
n1 = 100; n2 = 100
model = NMF(n_components=rank, init='random', random_state=0)
U = model.fit_transform(np.random.randint(1, 5, size=(n1, n2)))
V = model.components_
M = np.around(U) @ np.around(V)
