I am given the following statistics of an array:
Length
Minimum
Maximum
Average
Median
Quartiles
I am supposed to recreate a list with more or less the same statistics. I know that the list for which the statistics were calculated is not normally distributed.
My first idea was to just brute-force it by creating lists of random numbers in the given range and hoping that one would fit. The benefit of this method is that it works; the obvious downside is its inefficiency.
So I'm looking for a more efficient way to solve this problem. I hope that someone can help...
P.S. Currently I only use numpy but I'm not limited to it.
Edit 1:
As an example input and output were requested:
An input could look as follows:
statistics = {
    'length': 200,
    'minimum_value': 5,
    'maximum_value': 132,
    'mean': 30,
    'median': 22,
    'Q1': 13,
    'Q3': 68
}
The desired output would then look like this:
similar_list = function_to_create_similar_list(statistics)
len(similar_list) # should be roughly 200
min(similar_list) # should be roughly 5
max(similar_list) # should be roughly 132
np.mean(similar_list) # should be roughly 30
np.median(similar_list) # should be roughly 22
np.quantile(similar_list, 0.25) # should be roughly 13
np.quantile(similar_list, 0.75) # should be roughly 68
function_to_create_similar_list is the function I want to define
Edit 2:
My first edit was not enough, I'm sorry for that. Here is my current code:
import random
import numpy as np

def get_statistics(input_list):
    output = {}
    output['length'] = len(input_list)
    output['minimum_value'] = min(input_list)
    output['maximum_value'] = max(input_list)
    output['mean'] = np.mean(input_list)
    output['median'] = np.median(input_list)
    output['q1'] = np.quantile(input_list, 0.25)
    output['q3'] = np.quantile(input_list, 0.75)
    return output

def recreate_similar_list(statistics, maximum_deviation=0.1):
    sufficient_list_was_found = False
    while True:
        candidate_list = [random.uniform(statistics['minimum_value'], statistics['maximum_value'])
                          for _ in range(statistics['length'])]
        candidate_statistics = get_statistics(candidate_list)
        sufficient_list_was_found = True
        for key in statistics.keys():
            if np.abs(statistics[key] - candidate_statistics[key]) / statistics[key] > maximum_deviation:
                sufficient_list_was_found = False
                break
        if sufficient_list_was_found:
            return candidate_list

example_input_list_1 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,10]
recreated_list_1 = recreate_similar_list(get_statistics(example_input_list_1), 0.3)
print(recreated_list_1)
print(get_statistics(recreated_list_1))

example_input_list_2 = [1,1,1,1,3,3,4,4,4,4,4,5,18,19,32,35,35,42,49,68]
recreated_list_2 = recreate_similar_list(get_statistics(example_input_list_2), 0.3)
print(recreated_list_2)
print(get_statistics(recreated_list_2))
The first example finds a solution; that was no surprise to me. The second one does not (or not in reasonable time). That also did not surprise me, as the lists generated in recreate_similar_list are uniformly distributed. Still, both examples represent the real task. (Keep in mind that I only get the statistics, not the list.)
I hope this is now a sufficient example.
Your existing solution is interesting, but effectively a bogo-solution. There are direct solutions possible that do not need to rely on random-and-check.
The easy-ish part is to create the array of a correct length, and place all five min/max/quartiles in their appropriate positions (this only works for a somewhat simple interpretation of the problem and has limitations).
The trickier part is to choose "fill values" between the quartiles. These fill values can be identical within one interquartile section, because the only things that matter are the sum and bounds. One fairly straightforward way is linear programming, via Scipy's scipy.optimize.linprog. It's typically used for bounded linear algebra problems and this is one. For parameters we use:
Zeros for c, the minimization coefficients, because we don't care about minimization
For A_eq, the equality constraint matrix, we pass a matrix of element counts. This is a 1x4 matrix because there are four interquartile sections, each potentially with a slightly different element count. In your example these will each be close to 50.
For b_eq, the equality constraint right-hand-side vector, we calculate the desired sum of all interquartile sections based on the desired mean.
For bounds we pass the bounds of each interquartile section.
One tricky aspect is that this assumes easily-divided sections, and a quantile calculation using the lower method. But there are at least thirteen methods! Some will be more difficult to target with an algorithm than others. Also, lower introduces statistical bias. I leave solving these edge cases as an exercise to the reader. But the example works:
import numpy as np
from scipy.optimize import linprog


def solve(length: int, mean: float,
          minimum_value: float, q1: float, median: float, q3: float,
          maximum_value: float) -> np.ndarray:
    sections = (np.arange(5)*(length - 1))//4
    sizes = np.diff(sections) - 1
    quartiles = np.array((minimum_value, q1, median, q3, maximum_value))

    # (quartiles + sizes @ x)/length = mean
    # sizes @ x = mean*length - quartiles
    result = linprog(c=np.zeros_like(sizes),
                     A_eq=sizes[np.newaxis, :],
                     b_eq=np.array((mean*length - quartiles.sum(),)),
                     bounds=np.stack((quartiles[:-1], quartiles[1:]), axis=1),
                     method='highs')
    if not result.success:
        raise ValueError(result.message)

    x = np.empty(length)
    x[sections] = quartiles
    for i, inner in enumerate(result.x):
        i0, i1 = sections[i: i+2]
        x[i0+1: i1] = inner
    return x


def summarise(x: np.ndarray) -> dict[str, float]:
    q0, q1, q2, q3, q4 = np.quantile(
        a=x, q=np.linspace(0, 1, num=5), method='lower')
    return {'length': len(x), 'mean': x.mean(),
            'minimum_value': q0, 'q1': q1, 'median': q2, 'q3': q3, 'maximum_value': q4}


def test() -> None:
    statistics = {'length': 200, 'mean': 30,  # 27.7 - 58.7 are solvable
                  'minimum_value': 5, 'q1': 13, 'median': 22, 'q3': 68, 'maximum_value': 132}
    x = solve(**statistics)
    for k, v in summarise(x).items():
        assert np.isclose(v, statistics[k])


if __name__ == '__main__':
    test()
Related
I have an ordered list that I want to select items from, but with a decreasing probability, and I'd like the steepness of that decrease to be adjustable. I have been able to select the first n items easily in Python using:
list = [1,2,3,4,5,6,7,8 ... 46]
subset = list[0:3]
and a random sample from that list as:
list = [1,2,3,4,5,6,7,8 ... 46]
subset = random.sample(list, k =3)
But I'm not sure how to apply different probabilities within that range, i.e. a greater probability of selecting 1, 2, or 3 compared to 44, 45, and 46, with the probability continually decreasing. This would also be for different subset sizes such as 5, 7, 10, and 16. Any ideas would be appreciated. I want this to be done without replacement.
Numpy's random.choice supports picking samples with weights:
from numpy.random import choice
pop = range(1, 47)
weights = [1/(idx+1) for idx in range(len(pop))]
sw = sum(weights)
weights = [w/sw for w in weights] # weights need to sum to 1
k = 3
print(choice(pop, k, p=weights, replace=False))
You can change the shape of the probabilities by constructing the weights list with a different function.
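For example, a sketch using exponentially decaying weights instead of 1/(idx+1); the steepness knob here is my own, hypothetical parameter that controls how quickly the probability drops off:

import numpy as np
from numpy.random import choice

pop = list(range(1, 47))
steepness = 0.15  # hypothetical knob: larger means the probability drops off faster
weights = np.exp(-steepness * np.arange(len(pop)))
weights = weights / weights.sum()  # weights need to sum to 1

print(choice(pop, 5, p=weights, replace=False))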
Your question is quite broad, so I have developed quite a broad solution. It's probably not optimal in terms of computation time or storage space, but it does solve the stated problem, and it is useful as a learning device to understand how such an algorithm might work:
import random as r

def vectorTotal(vec):
    output = 0
    for x in vec:
        output += x
    return output

def reduce(vec):
    output = []
    for x in vec:
        output.append(x/vectorTotal(vec))
    return output

xValues = [1,2,3,4,5,6,7,8,9,10]
pValues = [10,6,8,1,8,2,3,9,1,20]

probabilities = reduce(pValues)

def pickWithProb(options, probs):
    rand1 = r.uniform(0,1)
    threshold = 0
    for i in range(len(options)):
        threshold += probs[i]
        if threshold > rand1:
            return options[i]

pickWithProb(xValues, probabilities)
So the thing I called xValues is the list of options; I just used 1 through 10 so it was easy to keep track of everything. pValues is the relative likelihood of each option being chosen, but it doesn't have to be a valid probability distribution. reduce turns pValues into a valid probability distribution, and vectorTotal was used to implement reduce.
In order to actually use this, it may be helpful to have some kind of function to actually generate pValues, something like:
def generatePValues(xValues):
    output = []
    for x in xValues:
        output.append(2**-x)
    return output
This will work but it depends on your desired distribution I guess.
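A minimal usage sketch (my own wiring of the helpers above, not part of the original code), drawing a without-replacement subset of size 3 by simply re-drawing duplicates:

xValues = list(range(1, 47))
probabilities = reduce(generatePValues(xValues))

subset = set()
while len(subset) < 3:  # crude "without replacement": keep drawing until 3 distinct values
    subset.add(pickWithProb(xValues, probabilities))
print(subset)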
list = [1,2,3,4,5,6,7,8 ... 46]
import random

def get_random_weighted_value(list):
    # Make this value lower to decrease the chance of each value being picked
    probability_value = 0.5
    for value in list:
        if random.random() < probability_value/value:
            return value
        # You can also add to the probability value here;
        # this will make the probabilities closer together, e.g.:
        # probability_value = 0.5 + (0.3*value)
This walks through the list and returns each value with a per-check probability of probability_value/value, given that no earlier value was returned (for probability_value = 0.5):
1: 50%, 2: 25%, 3: ~16.7%, ...
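A quick empirical sanity check (my own test harness, assuming the function above) of the first few overall selection frequencies; note these come out lower than the per-check probabilities, because later values are only reached when the earlier ones were not returned:

from collections import Counter

values = list(range(1, 47))
draws = [get_random_weighted_value(values) for _ in range(100_000)]
counts = Counter(d for d in draws if d is not None)  # the function can fall through and return None
for v in (1, 2, 3):
    print(v, counts[v] / len(draws))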
I have a sparse 60000x10000 matrix M where each element is either a 1 or a 0. Each column in the matrix is a different combination of signals (i.e. 1s and 0s). I want to choose five column vectors from M and take the Hadamard (i.e. element-wise) product of them; I call the resulting vector the strategy vector. After this step, I compute the dot product of this strategy vector with a target vector (that does not change). The target vector is filled with 1s and -1s such that having a 1 in a specific row of the strategy vector is either rewarded or penalised.
Is there some heuristic or linear algebra method that I could use to help me pick the five vectors from the matrix M that result in a high dot product? I don't have any experience with Google's OR tools nor Scipy's optimization methods so I am not too sure if they can be applied to my problem. Advice on this would be much appreciated! :)
Note: the five column vectors given as the solution do not need to be optimal; I'd rather have something that does not take months or years to run.
First of all, thanks for a good question. I don't get to practice numpy that often. Also, I don't have much experience in posting to SE, so any feedback, code critique, and opinions relating to the answer are welcome.
This was an attempt at finding an optimal solution at first, but I didn't manage to deal with the complexity. The algorithm should, however, give you a greedy solution that might prove to be adequate.
Colab Notebook (Python code + Octave validation)
Core Idea
Note: During runtime, I've transposed the matrix. So, the column vectors in the question correspond to row vectors in the algorithm.
Notice that you can multiply the target with one vector at a time, effectively getting a new target, but with some 0s in it. These will never change, so you can filter out some computations by removing those rows (columns, in the algorithm) entirely from further computations, both from the target and the matrix. You're then left with a valid target again (only 1s and -1s in it).
That's the basic idea of the algorithm. Given:
n: number of vectors you need to pick
b: number of best vectors to check
m: complexity of matrix operations to check one vector
Do an exponentially-complex O(m*b^n) depth-first search, but decrease the cost of the calculations in deeper layers by reducing the target/matrix size, while cutting down a few search paths with some heuristics.
Heuristics used
The best score achieved so far is known in every recursion step. Compute an optimistic vector (turn -1 to 0) and check what scores can still be achieved. Do not search in levels where the score cannot be surpassed.
This is useless if the best vectors in the matrix have 1s and 0s equally distributed. The optimistic scores are just too high. However, it gets better with more sparsity.
Ignore duplicates. Basically, do not check duplicate vectors in the same layer. Because we reduce the matrix size, the chance for ending up with duplicates increases in deeper recursion levels.
Further Thoughts on Heuristics
The most valuable ones are those that eliminate vector choices at the start. There's probably a way to find vectors that are worse-or-equal than others, with respect to their effects on the target. Say, if v1 only differs from v2 by an extra 1, and the target has a -1 in that row, then v1 is worse-or-equal than v2.
The problem is that we need to find more than 1 vector, and can't readily discard the rest. If we have 10 vectors, each worse-or-equal than the one before, we still have to keep 5 at the start (in case they're still the best option), then 4 in the next recursion level, 3 in the following, etc.
Maybe it's possible to produce a tree and pass it on into the recursion? Still, that doesn't help trim down the search space at the start... Maybe it would help to only consider 1 or 2 of the vectors in the worse-or-equal chain? That would explore more diverse solutions, but doesn't guarantee a better result.
Warning: Note that the MATRIX and TARGET in the example are in int8. If you use these directly for the dot product, it will overflow. Though I think all operations in the algorithm create new variables, so they are not affected.
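A minimal illustration of that overflow (not part of the algorithm itself), showing why the dtype needs widening before taking a raw dot product:

import numpy as np

a = np.ones(300, dtype=np.int8)
b = np.ones(300, dtype=np.int8)

print(a @ b)                                    # 44: the int8 accumulator wraps around (300 mod 256)
print(a.astype(np.int32) @ b.astype(np.int32))  # 300, after widening the dtype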
Code
import numpy as np

# Given:
TARGET = np.random.choice([1, -1], size=60000).astype(np.int8)
MATRIX = np.random.randint(0, 2, size=(10000, 60000), dtype=np.int8)

# Tunable - increase to search more vectors, at the cost of time.
# Performs better if the best vectors in the matrix are sparse
MAX_BRANCHES = 3  # can give more for sparser matrices

# Usage
score, picked_vectors_idx = pick_vectors(TARGET, MATRIX, 5)

# Function
def pick_vectors(init_target, init_matrix, vectors_left_to_pick: int, best_prev_result=float("-inf")):
    assert vectors_left_to_pick >= 1
    if init_target.shape == (0, ) or len(init_matrix.shape) <= 1 or init_matrix.shape[0] == 0 or init_matrix.shape[1] == 0:
        return float("inf"), None

    target = init_target.copy()
    matrix = init_matrix.copy()

    neg_matrix = np.multiply(target, matrix)
    neg_matrix_sum = neg_matrix.sum(axis=1)

    if vectors_left_to_pick == 1:
        picked_id = np.argmax(neg_matrix_sum)
        score = neg_matrix[picked_id].sum()
        return score, [picked_id]
    else:
        sort_order = np.argsort(neg_matrix_sum)[::-1]
        sorted_sums = neg_matrix_sum[sort_order]
        sorted_neg_matrix = neg_matrix[sort_order]
        sorted_matrix = matrix[sort_order]

        best_score = best_prev_result
        best_picked_vector_idx = None

        # Heuristic 1 (H1) - optimistic target.
        # Set a maximum score that can still be achieved
        optimistic_target = target.copy()
        optimistic_target[target == -1] = 0
        if optimistic_target.sum() <= best_score:
            # This check can be removed - the scores are too high at this point
            return float("-inf"), None

        # Heuristic 2 (H2) - ignore duplicates
        vecs_tried = set()

        # MAIN GOAL: for picked_id, picked_vector in enumerate(sorted_matrix):
        for picked_id, picked_vector in enumerate(sorted_matrix[:MAX_BRANCHES]):
            # H2
            picked_tuple = tuple(picked_vector)
            if picked_tuple in vecs_tried:
                continue
            else:
                vecs_tried.add(picked_tuple)

            # Discard picked vector
            new_matrix = np.delete(sorted_matrix, picked_id, axis=0)

            # Discard matrix and target rows where vector is 0
            ones = np.argwhere(picked_vector == 1).squeeze()
            new_matrix = new_matrix[:, ones]
            new_target = target[ones]
            if len(new_matrix.shape) <= 1 or new_matrix.shape[0] == 0:
                return float("-inf"), None

            # H1: Do not compute if best score cannot be improved
            new_optimistic_target = optimistic_target[ones]
            optimistic_matrix = np.multiply(new_matrix, new_optimistic_target)
            optimistic_sums = optimistic_matrix.sum(axis=1)
            optimistic_viable_vector_idx = optimistic_sums > best_score
            if optimistic_sums.max() <= best_score:
                continue
            new_matrix = new_matrix[optimistic_viable_vector_idx]

            score, next_picked_vector_idx = pick_vectors(new_target, new_matrix, vectors_left_to_pick - 1, best_prev_result=best_score)
            if score <= best_score:
                continue

            # Convert idx of trimmed-down matrix into sorted matrix IDs
            for i, returned_id in enumerate(next_picked_vector_idx):
                # H1: Loop until you hit the required number of 'True'
                values_passed = 0
                j = 0
                while True:
                    value_picked: bool = optimistic_viable_vector_idx[j]
                    if value_picked:
                        values_passed += 1
                        if values_passed - 1 == returned_id:
                            next_picked_vector_idx[i] = j
                            break
                    j += 1
                # picked_vector index
                if returned_id >= picked_id:
                    next_picked_vector_idx[i] += 1

            best_score = score
            # Convert from sorted matrix to input matrix IDs before returning
            matrix_id = sort_order[picked_id]
            next_picked_vector_idx = [sort_order[x] for x in next_picked_vector_idx]
            best_picked_vector_idx = [matrix_id] + next_picked_vector_idx

        return best_score, best_picked_vector_idx
Maybe it's too naive, but the first thing that occurs to me is to choose the 5 columns with the shortest distance to the target:
import scipy.sparse
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def sparse_prod_axis0(A):
    """Sparse equivalent of np.prod(arr, axis=0)
    From https://stackoverflow.com/a/44321026/3381305
    """
    valid_mask = A.getnnz(axis=0) == A.shape[0]
    out = np.zeros(A.shape[1], dtype=A.dtype)
    out[valid_mask] = np.prod(A[:, valid_mask].A, axis=0)
    return np.matrix(out)

def get_strategy(M, target, n=5):
    """Guess n best vectors.
    """
    dists = np.squeeze(pairwise_distances(X=M, Y=target))
    idx = np.argsort(dists)[:n]
    return sparse_prod_axis0(M[idx])

# Example data.
M = scipy.sparse.rand(m=6000, n=1000, density=0.5, format='csr').astype('bool')
target = np.atleast_2d(np.random.choice([-1, 1], size=1000))

# Try it.
strategy = get_strategy(M, target, n=5)
result = strategy @ target.T
It strikes me that you could add another step of taking the top few percent from the M–target distances and check their mutual distances — but this could be quite expensive.
I have not checked how this compares to an exhaustive search.
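For what it's worth, here is a rough sketch of that extra step (my own interpretation, with a hypothetical shortlist parameter): shortlist the rows closest to the target, then greedily keep the ones that are mutually far apart, and feed the returned rows to sparse_prod_axis0 as before.

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def get_diverse_strategy_idx(M, target, shortlist=50, n=5):
    """Pick n rows of M that are close to the target but mutually far apart."""
    dists = np.squeeze(pairwise_distances(X=M, Y=target))
    shortlist_idx = np.argsort(dists)[:shortlist]   # rows closest to the target
    mutual = pairwise_distances(M[shortlist_idx])   # distances within the shortlist
    picked = [0]                                    # start from the single closest row
    while len(picked) < n:
        scores = mutual[picked].mean(axis=0)        # average distance to what is already picked
        scores[picked] = -1.0                       # never re-pick a chosen row
        picked.append(int(np.argmax(scores)))
    return shortlist_idx[picked]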
numpy.nanpercentile is extremely slow.
So I wanted to use cupy.nanpercentile instead, but cupy.nanpercentile is not implemented yet.
Does someone have a solution for it?
I also had a problem with np.nanpercentile being very slow for my datasets. I found a workaround that lets you use the standard np.percentile, and it can also be applied to many other libraries.
This one should solve your problem, and it also works a lot faster than np.nanpercentile:
import numpy as np

arr = np.array([[np.nan, 2, 3, 1, 2, 3],
                [np.nan, np.nan, 1, 3, 2, 1],
                [4, 5, 6, 7, np.nan, 9]])

mask = (arr >= np.nanmin(arr)).astype(int)
count = mask.sum(axis=1)
groups = np.unique(count)
groups = groups[groups > 0]

p90 = np.zeros((arr.shape[0]))
for g in range(len(groups)):
    pos = np.where(count == groups[g])
    values = arr[pos]
    values = np.nan_to_num(values, nan=(np.nanmin(arr) - 1))
    values = np.sort(values, axis=1)
    values = values[:, -groups[g]:]
    p90[pos] = np.percentile(values, 90, axis=1)
So instead of taking the percentile with the NaNs, it groups the rows by the amount of valid data, takes the percentile of each group separately, and then puts everything back together. This also works for 3D arrays; just use y_pos and x_pos instead of pos, and watch out for which axis you are calculating over.
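A quick sanity check (my addition) that the grouped computation agrees with np.nanpercentile on the example array:

print(np.allclose(p90, np.nanpercentile(arr, 90, axis=1)))  # True on the example above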
import random
import numpy as np
import cupy as cp

def testset_gen(num):
    init = []
    for i in range(num):
        a = random.randint(65, 122)  # Dummy name
        b = random.randint(1, 100)   # Dummy value: 11~100 and 10% of nan
        if b < 11:
            b = np.nan               # 10% = nan
        init.append([a, b])
    return np.array(init)

np_testset = testset_gen(30000000)   # 468,751KB
cp_testset = cp.asarray(np_testset)  # same data moved to the GPU

def f1_np(arr, num):
    return np.percentile(arr[:, 1], num)

# 55.0, 0.523902416229248 sec
print(f1_np(np_testset, 50))

def cupy_nanpercentile(arr, num):
    return len(cp.where(arr > num)[0]) / (len(arr) - cp.sum(cp.isnan(arr))) * 100

# 55.548758317136446, 0.3640251159667969 sec
# 43% faster
# If You need same result, use int(). But You lose saved time.
print(cupy_nanpercentile(cp_testset[:, 1], 50))
I can't imagine how the test could take a few days; on my computer that would mean a trillion lines of data or more. Because of this, I can't reproduce the problem due to lack of resources.
Here's an implementation with numba. After it's been compiled it is more than 7x faster than the numpy version.
Right now it is set up to take the percentile along the first axis, however it could be changed easily.
import numba
import numpy as np

@numba.jit(nopython=True, cache=True)
def nan_percentile_axis0(arr, percentiles):
    """Faster implementation of np.nanpercentile

    This implementation always takes the percentile along axis 0.
    Uses numba to speed up the calculation by more than 7x.

    Function is equivalent to np.nanpercentile(arr, <percentiles>, axis=0)

    Params:
        arr (np.array): Array to calculate percentiles for
        percentiles (np.array): 1D array of percentiles to calculate

    Returns:
        (np.array) Array with first dimension corresponding to
            values as passed in percentiles
    """
    shape = arr.shape
    arr = arr.reshape((arr.shape[0], -1))
    out = np.empty((len(percentiles), arr.shape[1]))
    for i in range(arr.shape[1]):
        out[:, i] = np.nanpercentile(arr[:, i], percentiles)
    shape = (out.shape[0], *shape[1:])
    return out.reshape(shape)
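Example usage (my own check, assuming the decorated function compiles with your installed numba version), comparing against plain numpy on a small array with NaNs:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 20, 20))
data[data < 0.1] = np.nan
percentiles = np.array([10.0, 50.0, 90.0])

fast = nan_percentile_axis0(data, percentiles)
slow = np.nanpercentile(data, percentiles, axis=0)
print(np.allclose(fast, slow))  # should print True if both paths agree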
I have been tasked with implementing a local (non-interactive) differential privacy mechanism. I am working with a large database of census data. The only sensitive attribute is "Number of children" which is a numerical value ranging from 0 to 13.
I decided to go with the Generalized Random Response mechanism as it seems like the most intuitive method. This mechanism is described here and presented here.
After loading each value into an array (ignoring the other attributes for now), I perform the perturbation as follows.
import math
import random

d = 14   # values may range from 0 to 13
eps = 1  # epsilon level of privacy
p = math.exp(eps) / (math.exp(eps) + d - 1)
q = 1 / (math.exp(eps) + d - 1)

p_dataset = []

for row in dataset:
    coin = random.random()
    if coin <= p:
        p_dataset.append(row)
    else:
        p_dataset.append(random.randint(0, 13))
Unless I have misinterpreted the definition, I believe this will guarantee epsilon differential privacy on p_dataset.
However, I am having difficulty understanding how the aggregator must interpret this dataset. Following the presentation above, I attempted to implement a method for estimating the number of individuals who answered a particular value.
v = 0   # we are estimating the number of individuals in the dataset who answered 0
nv = 0  # number of users in the perturbed dataset who answered the value
n = len(p_dataset)  # total number of users

for row in p_dataset:
    if row == v:
        nv += 1

Iv = nv * p + (n - nv) * q
estimation = (Iv - (n * q)) / (p - q)
I do not know if I have correctly implemented the method described, as I do not completely understand what it is doing and cannot find a clear definition.
Regardless, I used this method to estimate the total number of individuals who answered each value in the dataset, with epsilon ranging from 1 to 14, and then compared this to the actual values. The results are below (please excuse the formatting).
As you can see, the utility of the dataset suffers greatly for low values of epsilon. Additionally, when executed multiple times, there was relatively little deviation in estimations, even for small values of epsilon.
For example, when estimating the number of participants who answered 0 and using an epsilon of 1, all estimations seemed to be centered around 1600, with the largest distance between estimations being 100. Considering the actual value of this query is 5969, I am led to believe that I may have implemented something incorrectly.
Is this the expected behaviour of the Generalized Random Response mechanism, or have I made a mistake in my implementation?
I think that when giving a false answer, we cannot directly use p_dataset.append(random.randint(0, 13)), because that range still contains the true answer.
max_v = 13
min_v = 0

for row in dataset:  # row is the true value from the dataset
    coin = random.random()
    if coin <= p:
        p_dataset.append(row)
    else:
        ans = []
        if row == min_v:
            ans = np.arange(min_v + 1, max_v + 1).tolist()
        elif row == max_v:
            ans = np.arange(min_v, max_v).tolist()
        else:
            a = np.arange(min_v, row).tolist()
            b = np.arange(row + 1, max_v + 1).tolist()
            ans.extend(a)
            ans.extend(b)
        # note on the original approach: the false answer should be drawn from values
        # other than the true one, but random.randint(0, 13) still includes the true value
        p_dataset.append(random.sample(ans, 1)[0])
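A more compact way to get the same exclusion (a sketch of my own, not the answer's code) is to sample directly from the candidate values with the true answer removed:

import random

d = 14  # values range from 0 to 13

def perturb(row, p):
    # Generalized random response: keep the true value with probability p,
    # otherwise return a uniformly random value other than the true one.
    if random.random() <= p:
        return row
    return random.choice([v for v in range(d) if v != row])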
The subset sum problem is well-known for being NP-complete, but there are various tricks to solve versions of the problem somewhat quickly.
The usual dynamic programming algorithm requires space that grows with the target sum. My question is: can we reduce this space requirement?
I am trying to solve a subset sum problem with a modest number of elements but a very large target sum. The number of elements is too large for the exponential time algorithm (and shortcut method) and the target sum is too large for the usual dynamic programming method.
Consider this toy problem that illustrates the issue. Given the set A = [2, 3, 6, 8], find the number of subsets that sum to target = 11. Enumerating all subsets, we see the answer is 2: (3, 8) and (2, 3, 6).
The dynamic programming solution gives the same result, of course - ways[11] returns 2:
def subset_sum(A, target):
    ways = [0] * (target + 1)
    ways[0] = 1
    ways_next = ways[:]
    for x in A:
        for j in range(x, target + 1):
            ways_next[j] += ways[j - x]
        ways = ways_next[:]
    return ways[target]
Now consider targeting the sum target = 1100 with the set A = [200, 300, 600, 800]. Clearly there are still 2 solutions: (300, 800) and (200, 300, 600). However, the ways array has grown by a factor of 100.
Is it possible to skip over certain weights when filling out the dynamic programming storage array? For my example problem I could compute the greatest common divisor of the input set and then divide all items by that constant, but this won't work for my real application.
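For concreteness, here is what that GCD reduction looks like on the toy problem (a sketch only; it reuses the subset_sum function above):

from math import gcd
from functools import reduce

A = [200, 300, 600, 800]
target = 1100

g = reduce(gcd, A + [target])             # 100 here
A_small = [x // g for x in A]             # [2, 3, 6, 8]
target_small = target // g                # 11
print(subset_sum(A_small, target_small))  # 2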
This SO question is related, but those answers don't use the approach I have in mind. The second comment by Akshay on this page says:
...in the cases where n is very small (eg. 6) and sum is very large
(eg. 1 million) then the space complexity will be too large. To avoid
large space complexity n HASHTABLES can be used.
This seems closer to what I'm looking for, but I can't seem to actually implement the idea. Is this really possible?
Edited to add: A smaller example of a problem to solve. There is 1 solution.
target = 5213096522073683233230240000
A = [2316931787588303659213440000,
1303274130518420808307560000,
834095443531789317316838400,
579232946897075914803360000,
425558899761116998631040000,
325818532629605202076890000,
257436865287589295468160000,
208523860882947329329209600,
172333769324749858949760000,
144808236724268978700840000,
123386899930738064691840000,
106389724940279249657760000,
92677271503532146368537600,
81454633157401300519222500,
72153585080604612224640000,
64359216321897323867040000,
57762842349846905631360000,
52130965220736832332302400,
47284322195679666514560000,
43083442331187464737440000,
39418499221729173786240000,
36202059181067244675210000,
33363817741271572692673536,
30846724982684516172960000,
28604096143065477274240000,
26597431235069812414440000,
24794751591313594450560000,
23169317875883036592134400,
21698632766175580575360000,
20363658289350325129805625,
19148196591638873216640000,
18038396270151153056160000,
17022355990444679945241600]
A real problem is:
target = 262988806539946324131984661067039976436265064677212251086885351040000
A = [116883914017753921836437627140906656193895584300983222705282378240000,
65747201634986581032996165266759994109066266169303062771721337760000,
42078209046391411861117545770726396229802410348353960173901656166400,
29220978504438480459109406785226664048473896075245805676320594560000,
21468474003260924418937523352411426647858372626711204170357987840000,
16436800408746645258249041316689998527266566542325765692930334440000,
12987101557528213537381958571211850688210620477887024745031375360000,
10519552261597852965279386442681599057450602587088490043475414041600,
8693844844295746252297013588993057072273225278585528961549928960000,
7305244626109620114777351696306666012118474018811451419080148640000,
6224587137040149683597270084426981690799173128454727836375984640000,
5367118500815231104734380838102856661964593156677801042589496960000,
4675356560710156873457505085636266247755823372039328908211295129600,
4109200102186661314562260329172499631816641635581441423232583610000,
3639983481521748430892521260443459881470796742937193786669693440000,
3246775389382053384345489642802962672052655119471756186257843840000,
2914003396564502206448583502127866774917064428556368433095682560000,
2629888065399463241319846610670399764362650646772122510868853510400,
2385386000362324935437502594712380738650930291856800463373109760000,
2173461211073936563074253397248264268068306319646382240387482240000,
1988573206351200938616141104476672789688204647842814753019927040000,
1826311156527405028694337924076666503029618504702862854770037160000,
1683128361855656474444701830829055849192096413934158406956066246656,
1556146784260037420899317521106745422699793282113681959093996160000,
1443011284169801504153550952356872298690068941987447193892375040000,
1341779625203807776183595209525714165491148289169450260647374240000,
1250838556670374906691960338012080744048823137584838292922165760000,
1168839140177539218364376271409066561938955843009832227052823782400,
1094646437211014876720019400903392201607763016346356924399106560000,
1027300025546665328640565082293124907954160408895360355808145902500,
965982760477305139144112620999228563585913919842836551283325440000,
909995870380437107723130315110864970367699185734298446667423360000,
858738960130436976757500934096457065914334905068448166814319513600,
811693847345513346086372410700740668013163779867939046564460960000,
768411414287644482489363509326632509674989232073666182868912640000,
728500849141125551612145875531966693729266107139092108273920640000,
691620793004461075955252231602997965644352569828303092930664960000,
657472016349865810329961652667599941090662661693030627717213377600,
625791330255672395317036671188673352614551016483550865168079360000,
596346500090581233859375648678095184662732572964200115843277440000,
568931977371436071675467087219123799753953628290345594563299840000,
543365302768484140768563349312066067017076579911595560096870560000,
519484062301128541495278342848474027528424819115480989801255014400,
497143301587800234654035276119168197422051161960703688254981760000,
476213321032044045508347054897310957784092466595223632570186240000,
456577789131851257173584481019166625757404626175715713692509290000,
438132122515529069774235170457376054037925971973698044293020160000,
420782090463914118611175457707263962298024103483539601739016561664,
404442609057972047876946806715939986830088526993021531852188160000,
389036696065009355224829380276686355674948320528420489773499040000,
374494562534633427030238036407319297168052779889230688624970240000,
360752821042450376038387738089218074672517235496861798473093760000,
347753793771829850091880543559722282890929011143421158461997158400,
335444906300951944045898802381428541372787072292362565161843560000,
323778155173833578494287055791985197213007158728485381455075840000,
312709639167593726672990084503020186012205784396209573230541440000,
302199145693704480473409550206308504954053507241841138853071360000,
292209785044384804591094067852266640484738960752458056763205945600,
282707666261699891568916593460940582033071824431295083135592960000,
273661609302753719180004850225848050401940754086589231099776640000,
265042888929147215048611399412486748738992254650755607041456640000,
256825006386666332160141270573281226988540102223840088952036475625,
248983485481605987343890803377079267631966925138189113455039385600,
241495690119326284786028155249807140896478479960709137820831360000,
234340660761814501342824380545368657996226388663143017230461440000,
227498967595109276930782578777716242591924796433574611666855840000,
220952578483466770957349011608519198854244960871423861446658560000,
214684740032609244189375233524114266478583726267112041703579878400,
208679870295533683104133831435857945991878646837700655494453760000,
202923461836378336521593102675185167003290944966984761641115240000,
197401994025105141026072179446079922264038329650750423033879040000,
192102853571911120622340877331658127418747308018416545717228160000,
187014262428406274938300203425450649910232934881573156328451805184,
182125212285281387903036468882991673432316526784773027068480160000,
177425404985627474536673746714144021883127046501745489011223040000,
172905198251115268988813057900749491411088142457075773232666240000,
168555556186474170249629649778586749838977769381324948621621760000,
164368004087466452582490413166899985272665665423257656929303344400]
In the particular comment you linked to, the suggestion is to use a hashtable to only store values which actually arise as a sum of some subset. In the worst case, this is exponential in the number of elements, so it is basically equivalent to the brute force approach you already mentioned and ruled out.
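To make that concrete, a minimal sketch of the hashtable variant (my own illustration): memory grows with the number of distinct reachable sums rather than with the target, but that number can still blow up exponentially in the number of elements.

def subset_sum_dict(A, target):
    ways = {0: 1}  # reachable sum -> number of subsets producing it
    for x in A:
        new_ways = dict(ways)
        for s, count in ways.items():
            if s + x <= target:
                new_ways[s + x] = new_ways.get(s + x, 0) + count
        ways = new_ways
    return ways.get(target, 0)

print(subset_sum_dict([2, 3, 6, 8], 11))            # 2
print(subset_sum_dict([200, 300, 600, 800], 1100))  # 2, without a length-1101 array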
In general, there are two parameters to the problem - the number of elements in the set and the size of the target sum. Naive brute force is exponential in the first, while the standard dynamic programming solution is exponential in the second. This works well when one of the parameters is small, but you already indicated that both parameters are too big for an exponential solution. Therefore, you are stuck with the "hard" general case of the problem.
Most NP-Complete problems have some underlying graph whether implicit or explicit. Using graph partitioning and DP, it can be solved exponential in the treewidth of the graph but only polynomial in the size of the graph with treewidth held constant. Of course, without access to your data, it is impossible to say what the underlying graph might look like or whether it is in one of the classes of graphs that have bounded treewidths and hence can be solved efficiently.
Edit: I just wrote the following code to show what I meant by reducing it mod small numbers. The following code solves your first problem in less than a second, but it doesn't work on the larger problem (though it does reduce it to n=57, log(t)=68).
target = 5213096522073683233230240000
A = [2316931787588303659213440000,
1303274130518420808307560000,
834095443531789317316838400,
579232946897075914803360000,
425558899761116998631040000,
325818532629605202076890000,
257436865287589295468160000,
208523860882947329329209600,
172333769324749858949760000,
144808236724268978700840000,
123386899930738064691840000,
106389724940279249657760000,
92677271503532146368537600,
81454633157401300519222500,
72153585080604612224640000,
64359216321897323867040000,
57762842349846905631360000,
52130965220736832332302400,
47284322195679666514560000,
43083442331187464737440000,
39418499221729173786240000,
36202059181067244675210000,
33363817741271572692673536,
30846724982684516172960000,
28604096143065477274240000,
26597431235069812414440000,
24794751591313594450560000,
23169317875883036592134400,
21698632766175580575360000,
20363658289350325129805625,
19148196591638873216640000,
18038396270151153056160000,
17022355990444679945241600]
import itertools, time
from math import gcd
from functools import reduce

def gcd_r(seq):
    return reduce(gcd, seq)

def miniSolve(t, vals):
    vals = [x for x in vals if x and x <= t]
    for k in range(len(vals)):
        for sub in itertools.combinations(vals, k):
            if sum(sub) == t:
                return sub
    return None

def tryMod(n, state, answer):
    t, vals, mult = state
    mods = [x % n for x in vals if x % n]
    if (t % n or mods) and sum(mods) < n:
        print('Filtering with', n)
        print(t.bit_length(), len(vals))
    else:
        return state

    newvals = list(vals)
    tmod = t % n
    if not tmod:
        for x in vals:
            if x % n:
                newvals.remove(x)
    else:
        if len(set(mods)) != len(mods):
            # don't want to deal with the complexity of multisets for now
            print('skipping', n)
        else:
            mini = miniSolve(tmod, mods)
            if mini is None:
                return None
            mini = set(mini)
            for x in vals:
                mod = x % n
                if mod:
                    if mod in mini:
                        t -= x
                        answer.add(x*mult)
                    newvals.remove(x)

    g = gcd_r(newvals + [t])
    t = t//g
    newvals = [x//g for x in newvals]
    mult *= g
    return (t, newvals, mult)

def solve(t, vals):
    answer = set()
    mult = 1
    for d in itertools.count(2):
        if not t:
            return answer
        elif not vals or t < min(vals):
            return None  # no solution
        res = tryMod(d, (t, vals, mult), answer)
        if res is None:
            return None
        t, vals, mult = res
        if len(vals) < 23:
            break
        if (d % 10000) == 0:
            print('d', d)

    # don't want to deal with the complexity of multisets for now
    assert(len(set(vals)) == len(vals))
    rest = miniSolve(t, vals)
    if rest is None:
        return None
    answer.update(x*mult for x in rest)
    return answer

start_t = time.time()
answer = solve(target, A)
assert(answer <= set(A) and sum(answer) == target)
print(answer)