Actually, I already have a partial answer to this question, but I'm wondering whether this small piece of greedy code can be generalized into something closer to the optimal solution.
How I met this problem (not relevant to the problem itself, but maybe interesting):
I receive a large collection of objects (a set of profiles of dykes; each dyke keeps more or less the same shape along its length), and I can group them by a property (the name of the dyke). The output of my program goes to an external program that we have to invoke by hand (don't ask me why) and which can't recover from failures (one mistake stops the whole batch).
In the application where I'm using this, there's no hard requirement on the number of bins nor on the maximum size of the bins. What I try to do is:
- keep the number of groups low (invoke the program few times),
- keep the sets small (reduce the damage if a batch fails),
- keep similar things together (a failure in a group is probably a failure in the whole group).
I did not have much time, so I wrote a small greedy function that groups sets together.
A colleague suggested I could add some noise to the data to explore the neighbourhood of the approximate solution I find, and we were wondering how far from optimal the solutions found are.
Not that it is relevant to the original task, which doesn't need a truly optimal solution, but I thought I would share the question with the community and see what comments come out of it.
def group_to_similar_sizes(orig, max_size=None, max_factor=None):
    """group orig list in sections that do not overflow max(orig) (or given max_size).

    return list of grouped indices, plus max effective length.

    >>> group_to_similar_sizes([1, 3, 7, 13])
    ([[2, 1, 0], [3]], 13)
    >>> group_to_similar_sizes([2, 9, 9, 11, 12, 19, 19, 22, 22, ])
    ([[3, 1], [4, 2], [5], [6, 0], [7], [8]], 22)

    result is independent of original ordering
    >>> group_to_similar_sizes([9, 19, 22, 12, 19, 9, 2, 22, 11, ])
    ([[3, 1], [4, 2], [5], [6, 0], [7], [8]], 22)

    you can specify a desired max size
    >>> group_to_similar_sizes([2, 9, 9, 11, 12, 19, 19, 22, 22, ], 50)
    ([[3, 2, 1], [6, 5, 4], [8, 7, 0]], 50)

    if the desired max size is too small, it still influences the way we make groups.
    >>> group_to_similar_sizes([1, 3, 7, 13], 8)
    ([[1], [2, 0], [3]], 13)
    >>> group_to_similar_sizes([2, 9, 9, 11, 12, 19, 19, 22, 22, ], 20)
    ([[1], [3, 2], [4, 0], [5], [6], [7], [8]], 22)

    max size can be adjusted by a multiplication factor
    >>> group_to_similar_sizes([9, 19, 22, 12, 5, 9, 2, 22, 11, ], max_factor=0.75)
    ([[2], [3], [4, 1], [5, 0], [6], [7], [8]], 22)
    >>> group_to_similar_sizes([9, 19, 22, 12, 5, 9, 2, 22, 11, ], max_factor=1.5)
    ([[2, 1], [6, 5], [7, 3, 0], [8, 4]], 33)
    """
    ordered = sorted(orig)
    max_size = max_size or ordered[-1]
    if max_factor is not None:
        max_size = int(max_size * max_factor)
    orig_ordered = list(ordered)
    todo = set(range(len(orig)))
    effective_max = 0
    result = []
    ## while we still have unassigned items
    while ordered:
        ## choose the largest item, make it a member of a new group,
        ## then check which remaining items we can still put in its bin
        candidate_i = len(ordered) - 1
        candidate = ordered.pop()
        if candidate_i not in todo:
            continue
        todo.remove(candidate_i)
        group = [candidate_i]
        group_size = candidate
        for j in sorted(todo, reverse=True):
            if orig_ordered[j] + group_size <= max_size:
                group.append(j)
                group_size += orig_ordered[j]
                todo.remove(j)
        result.insert(0, group)
        effective_max = max(group_size, effective_max)
    return result, effective_max
I like your colleague's idea of adding some noise to the data, but maybe it's better to make a few swaps in ordered after you call ordered = sorted(orig)?
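A minimal sketch of that swap idea (the helper name `swap_neighbours` is hypothetical): perturb the sorted order before grouping, so repeated greedy runs explore neighbouring solutions instead of re-adding noise to the data itself.

```python
import random

def swap_neighbours(ordered, n_swaps=2, seed=None):
    """Return a copy of `ordered` with a few adjacent pairs swapped.

    Sketch of the suggestion above: feeding a slightly perturbed
    order into the greedy grouper explores nearby solutions.
    """
    rng = random.Random(seed)
    out = list(ordered)
    for _ in range(n_swaps):
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

# e.g. pass this to the grouper in place of sorted(orig)
perturbed = swap_neighbours(sorted([2, 9, 9, 11, 12, 19, 19, 22, 22]), n_swaps=3, seed=1)
```

Running this several times with different seeds (or swap counts) and keeping the best grouping found gives a cheap local search around the greedy solution.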
I have an algorithm problem to solve.
Description: I have a Python dictionary that contains the following:
modules = {"auth_provider_1": [3, 4, 17, 19],
           "auth_provider_2": [1, 6, 8, 10, 13, 14, 16, 18],
           "auth_provider_3": [0, 7, 11, 12, 15],
           "auth_provider_4": [2, 5, 9],
           "cont_provider_1": [4, 14],
           "cont_provider_2": [8, 9, 13, 15, 16, 17],
           "cont_provider_3": [2, 3, 5, 10, 11, 18],
           "cont_provider_4": [0, 1, 6, 7, 12, 19]}
There are two types of modules: auth_provider and cont_provider.
Each type has 4 different providers. For example, we have 4 auth_providers: auth_provider_1, auth_provider_2, auth_provider_3 and auth_provider_4.
Each provider has a list that contains which users are using that provider.
Each user is using only one auth provider and only one cont provider.
For example:
user 3 is using auth_provider_1 and cont_provider_3,
user 1 is using auth_provider_2 and cont_provider_4, and so on.
Problem: We want to check all providers with a minimal group of users. For example, if we check which providers users 0, 2, 4, 8 are using, we will be checking all the available providers. The same goes for users 0, 2, 4, 16 and for 0, 2, 4, 13, etc.
What I tried: Making a list of provider names sorted by their list length. For example:
sorted_list = ['cont_provider_1', 'auth_provider_4', 'auth_provider_1', 'auth_provider_3', 'cont_provider_2', 'cont_provider_3', 'cont_provider_4', 'auth_provider_2']
Then iterating over this sorted list, searching for each element of its user list in the modules dictionary, and whenever a user is found in a list, removing the key (provider name) of that list from the sorted list.
For example, the first element of the sorted list is cont_provider_1, which has the smallest list (only 4 and 14); I wanted to start from the smallest one because I thought it would make more sense. Then I search for cont_provider_1's users in the modules dictionary.
But when user 4 is found in both auth_provider_1 and cont_provider_1, the iteration somehow gets stuck and gives me this answer:
sorted_list_after_iteration_is_over = ['auth_provider_4']
min_users = [4, 14, 0, 7, 11, 12, 15]
Question: What would be the best algorithm for this problem? Am I on the right track? Where am I going wrong? Any suggestions or help?
Here is my whole code:
modules = {"auth_provider_1": [3, 4, 17, 19],
           "auth_provider_2": [1, 6, 8, 10, 13, 14, 16, 18],
           "auth_provider_3": [0, 7, 11, 12, 15],
           "auth_provider_4": [2, 5, 9],
           "cont_provider_1": [4, 14],
           "cont_provider_2": [8, 9, 13, 15, 16, 17],
           "cont_provider_3": [2, 3, 5, 10, 11, 18],
           "cont_provider_4": [0, 1, 6, 7, 12, 19]}

providers_sorted_list = sorted(modules, key=lambda key: len(modules[key]))
# ['cont_provider_1', 'auth_provider_4', 'auth_provider_1', 'auth_provider_3', 'cont_provider_2', 'cont_provider_3', 'cont_provider_4', 'auth_provider_2']

test_users = []
for provider in providers_sorted_list:
    search_list = modules[provider]
    for user in search_list:
        for key, val in modules.items():
            if user in val:
                if not user in test_users:
                    test_users.append(user)
                if key in providers_sorted_list:
                    providers_sorted_list.remove(key)

print(providers_sorted_list)
print(test_users)
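(Note, on where the code above goes wrong: it removes entries from providers_sorted_list while the outer for loop is iterating over it, which makes the iteration skip elements.) Since each user covers exactly one auth provider and one cont provider, this is a set-cover problem, and a standard greedy heuristic is easy to write: repeatedly pick the user who covers the most still-uncovered providers. A sketch under that framing (greedy set cover is not guaranteed to find the true minimum):

```python
modules = {"auth_provider_1": [3, 4, 17, 19],
           "auth_provider_2": [1, 6, 8, 10, 13, 14, 16, 18],
           "auth_provider_3": [0, 7, 11, 12, 15],
           "auth_provider_4": [2, 5, 9],
           "cont_provider_1": [4, 14],
           "cont_provider_2": [8, 9, 13, 15, 16, 17],
           "cont_provider_3": [2, 3, 5, 10, 11, 18],
           "cont_provider_4": [0, 1, 6, 7, 12, 19]}

# invert the mapping: user -> set of providers that user exercises
user_providers = {}
for provider, users in modules.items():
    for user in users:
        user_providers.setdefault(user, set()).add(provider)

uncovered = set(modules)
chosen = []
while uncovered:
    # greedy: take the user covering the most still-uncovered providers
    best = max(user_providers, key=lambda u: len(user_providers[u] & uncovered))
    chosen.append(best)
    uncovered -= user_providers[best]

print(chosen)  # a small user group that together exercises every provider
```

The loop always terminates because every provider has at least one user, so while anything is uncovered some user covers at least one uncovered provider. For a guaranteed-minimal answer you would need an exact search (e.g. trying all user subsets in increasing size), since minimum set cover is NP-hard in general.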
Edited with a clearer example, and included the solution.
I'd like to slice an arbitrary-dimensional array, pinning the first n dimensions and keeping the remaining ones. In addition, I'd like to be able to store the n pinning indices in a variable. For example:
Q = np.arange(24).reshape(2, 3, 4) # array to be sliced
# array([[[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]],
# [[12, 13, 14, 15],
# [16, 17, 18, 19],
# [20, 21, 22, 23]]])
Q[0, 1, ...] # this is what I want manually
# array([4, 5, 6, 7])
# but programmatically:
s = np.array([0, 1])
Q[s, ...] # this doesn't do what I want: it uses both s[0] and s[1] along the 0th dimension of Q
# array([[[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]],
# [[12, 13, 14, 15],
# [16, 17, 18, 19],
# [20, 21, 22, 23]]])
np.take(Q, s) # this unravels the indices and takes the s[i]th elements of Q
# array([0, 1])
Q[tuple(s)] # this works! Thank you kwin
# array([4, 5, 6, 7])
Is there a clean way to do this?
You could do this:
Q[tuple(s)]
Or this:
np.take(Q, s)
(These outputs refer to the random array used in the original version of the question; with the arange example above, Q[tuple(s)] gives array([4, 5, 6, 7]), while np.take(Q, s) flattens Q first and gives array([0, 1]).)
I'm afraid I don't have a great intuition for exactly why the tuple version of s works differently from indexing with s itself. The other thing I intuitively tried is Q[*s], but that's a syntax error (star-unpacking inside subscripts only became valid syntax in Python 3.11).
I am not sure what output you want but there are several things you can do.
If you want the output to be like this:
array([[[0.46988733, 0.19062458],
[0.69307707, 0.80242129],
[0.36212295, 0.2927196 ],
[0.34043998, 0.87408959],
[0.5096636 , 0.37797475]],
[[0.98322049, 0.00572271],
[0.06374176, 0.98195354],
[0.63195656, 0.44767722],
[0.61140211, 0.58889763],
[0.18344186, 0.9587247 ]]])
Q[list(s)] should work. np.array([Q[i] for i in s]) also works.
If you want the output to be like this:
array([0.58383736, 0.80486868])
Then, as @kwinkunks mentioned, you could use Q[tuple(s)] or np.take(Q, s).
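The difference between the forms can be shown directly with the arange example from the question: a tuple triggers basic multidimensional indexing (one index per leading axis), while an integer array triggers advanced indexing along the first axis only, and np.take without an axis works on the flattened array.

```python
import numpy as np

Q = np.arange(24).reshape(2, 3, 4)
s = np.array([0, 1])

# tuple -> one index per leading axis, same as Q[0, 1]
print(Q[tuple(s)])        # [4 5 6 7]

# array -> advanced indexing along axis 0: stacks rows 0 and 1 of Q
print(Q[s].shape)         # (2, 3, 4)

# np.take with no axis argument flattens Q, then picks elements 0 and 1
print(np.take(Q, s))      # [0 1]
```

So tuple(s) is the right conversion whenever s holds one index per pinned dimension.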
Consider the below list of 2x2 tables and CMH (Cochran–Mantel–Haenszel) test results. We are trying to determine whether each specific centre was associated with the success of the treatment [data from Agresti, Categorical Data Analysis, second edition].
import statsmodels.api as sm

tables = [
    [[11, 10], [25, 27]],
    [[16, 22], [4, 10]],
    [[14, 7], [5, 12]],
    [[2, 1], [14, 16]],
    [[6, 0], [11, 12]],
    [[1, 0], [10, 10]],
    [[1, 1], [4, 8]],
    [[4, 6], [2, 1]]]

cmh = sm.stats.contingency_tables.StratifiedTable(tables=tables)
print(cmh.test_null_odds())
# pvalue ~ 0.012
# statistic ~ 6.38
The tables parameter of StratifiedTable can also take a numpy array of shape 2 x 2 x k, where each slice along k is one of the contingency tables.
I've been unable to wrap my head around the array reshaping, given the 8 x 2 x 2 shape that the list of lists more intuitively offers (at least for me).
Any thoughts on how to re-run this same test with an ndarray?
UPDATE: I've tried to reshape my tables variable in numpy, as suggested in a comment below, to a 2 x 2 x k ndarray using a transpose. The TypeError below is raised when running the same test:
TypeError: No loop matching the specified signature and casting was found for ufunc true_divide
Note: in R the following matrix would return the desired output:
data = array(c(11, 10, 25, 27, 16, 22, 4, 10,
               14, 7, 5, 12, 2, 1, 14, 16,
               6, 0, 11, 12, 1, 0, 10, 10,
               1, 1, 4, 8, 4, 6, 2, 1),
             c(2, 2, 8))
mantelhaen.test(data, correct=F)
Just referencing @Josef's comment as the answer. I had missed / not accounted for a dtype conversion:
"Your example worked for me with the transpose, .T. It looks like you have a separate problem with the dtype. Use float: tables = np.asarray(tables).T.astype(float). This was recently fixed: github.com/statsmodels/statsmodels/pull/7279"
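Putting that comment together, a sketch of the conversion (pure-numpy part only; note that transposing a 2x2 table leaves its odds ratio unchanged, which is why the transposed slices are still valid inputs for this test):

```python
import numpy as np

tables = [[[11, 10], [25, 27]],
          [[16, 22], [4, 10]],
          [[14, 7], [5, 12]],
          [[2, 1], [14, 16]],
          [[6, 0], [11, 12]],
          [[1, 0], [10, 10]],
          [[1, 1], [4, 8]],
          [[4, 6], [2, 1]]]

# the list of lists comes out as shape (8, 2, 2); transpose to the
# (2, 2, k) layout StratifiedTable expects, and force float to avoid
# the integer true_divide TypeError mentioned above
arr = np.asarray(tables).T.astype(float)
print(arr.shape)          # (2, 2, 8)

# arr[:, :, k] is the k-th table, transposed; e.g. slice 0:
print(arr[:, :, 0])       # [[11. 25.]
                          #  [10. 27.]]

# then, as before (requires statsmodels):
# import statsmodels.api as sm
# cmh = sm.stats.contingency_tables.StratifiedTable(tables=arr)
# print(cmh.test_null_odds())
```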
The following example illustrates my question clearly.
Suppose there is an array arr:
>>import numpy as np
>>from skimage.util.shape import view_as_blocks
>>arr=np.array([[1,2,3,4,5,6,7,8],[1,2,3,4,5,6,7,8],[9,10,11,12,13,14,15,16],[17,18,19,20,21,22,23,24]])
>>arr
array([[ 1, 2, 3, 4, 5, 6, 7, 8],
[ 1, 2, 3, 4, 5, 6, 7, 8],
[ 9, 10, 11, 12, 13, 14, 15, 16],
[17, 18, 19, 20, 21, 22, 23, 24]])
I segmented this array into 2x2 blocks using:
>>img= view_as_blocks(arr, block_shape=(2,2))
>>img
array([[[[ 1, 2],
[ 1, 2]],
[[ 3, 4],
[ 3, 4]],
[[ 5, 6],
[ 5, 6]],
[[ 7, 8],
[ 7, 8]]],
[[[ 9, 10],
[17, 18]],
[[11, 12],
[19, 20]],
[[13, 14],
[21, 22]],
[[15, 16],
[23, 24]]]])
I have another array, "cor":
>>cor
(array([0, 1, 1], dtype=int64), array([2, 1, 3], dtype=int64))
In "cor", the 1st array ([0,1,1]) gives the row coordinates and the 2nd array ([2,1,3]) gives the corresponding column coordinates, in sequential order.
Now my task is to automatically access the segments of img whose positional coordinates are [0,2], [1,1] and [1,3] (taken from "cor": x from the 1st array and the corresponding y from the 2nd array) by reading "cor".
In the above example:
img[0,2] = [[ 5,  6],    img[1,1] = [[11, 12],    img[1,3] = [[15, 16],
            [ 5,  6]]                [19, 20]]                [23, 24]]
then find the mean value of each segment separately,
i.e. mean(img[0,2]) = 5.5, mean(img[1,1]) = 15.5, mean(img[1,3]) = 19.5.
Now, check whether these mean values are less than the mean value of the whole array "img".
Here, the mean value of img is 10.5, hence only the mean value of img[0,2] is less than 10.5.
Therefore, finally return the coordinate of segment img[0,2], i.e. [0,2], as the output (in sequential order if more such segments exist in a bigger array).
## expected output for the above example:
[0,2]
We simply need to index with cor and perform those mean computations (along last two axes) and check -
# Convert to array format
In [229]: cor = np.asarray(cor)
# Index into `img` with tuple version of `cor`, so that we get all the
# blocks in one go and then compute mean along last two axes i.e. 1,2.
# Then compare against global mean - `img.mean()` to give us a valid
# mask. Then index into columns of `cor` with it, to give us a slice of
# valid `cor`. Finally transpose, so that we get per row valid indices set.
In [254]: cor[:,img[tuple(cor)].mean((1,2))<img.mean()].T
Out[254]: array([[0, 2]])
Another way to set it up, would be to split up the indices -
In [235]: r,c = cor
In [236]: v = img[r,c].mean((1,2))<img.mean() # or img[cor].mean((1,2))<img.mean()
In [237]: r[v],c[v]
Out[237]: (array([0]), array([2]))
Same as first approach, with the only difference of using split indices to index into cor and getting the final indices.
Or a compact version -
In [274]: np.asarray(cor).T[img[cor].mean((1,2))<img.mean()]
Out[274]: array([[0, 2]])
In this solution, we directly feed in the original tuple version of cor; the rest is the same as approach #1.
I am trying to build a Python program that helps me arrange my timetable so I get the most days off (fewest school days) at university.
The user shall input a number of courses (Course A, Course B, Course C) and receive a list of suggested combinations that give them the fewest school days without a time clash (e.g. Course A (L1, T1), Course B (L3, T3B), Course C (L2, T2C)).
I scraped some information about courses from my university website and I am now stuck. Here is a sample of what I scraped:
{'Lectures':{'L1 (2196)': 'Mo04:30PM-05:50PM Fr12:00PM-01:20PM'}, 'Tutorial': {'T1 (2198)': 'Th06:00PM-06:50PM', 'T2 (2200)': 'Mo03:00PM-03:50PM'}, 'Lab': {}}
{'Lectures': {'L1 (2201)': 'Tu09:00AM-10:20AM Th09:00AM-10:20AM', 'L2 (2203)': 'Tu12:00PM-01:20PM Th12:00PM-01:20PM', 'L3 (2205)': 'Tu03:00PM-04:20PM Th03:00PM-04:20PM', 'L4 (2207)': 'Tu01:30PM-02:50PM Th01:30PM-02:50PM', 'L5 (2209)': 'Tu10:30AM-11:50AM Th10:30AM-11:50AM', 'L6 (2211)': 'Tu04:30PM-05:50PM Th04:30PM-05:50PM'}, 'Tutorial': {'T1A (2213)': 'Mo05:30PM-06:20PM', 'T1B (2215)': 'We06:00PM-06:50PM', 'T2A (2216)': 'Fr12:00PM-12:50PM', 'T2B (2217)': 'Fr01:30PM-02:20PM', 'T3A (2218)': 'We04:30PM-05:20PM', 'T3B (2219)': 'Th12:00PM-12:50PM', 'T4A (2220)': 'Fr03:30PM-04:20PM', 'T4B (2221)': 'Mo02:00PM-02:50PM', 'T5A (2222)': 'Fr12:00PM-12:50PM', 'T5B (2223)': 'Mo06:00PM-06:50PM', 'T6A (2224)': 'We06:00PM-06:50PM', 'T6B (2225)': 'Mo02:00PM-02:50PM'}, 'Lab': {}}
The outermost dictionaries are the courses; inside each are three dictionaries named "Lectures", "Tutorial" and "Lab". Not all courses have all three, but a session has to be placed in the timetable whenever one or more exists. I want to create combinations of these courses, then check whether a time clash occurs and, if so, discard those combinations. However, I am not sure how I could create such combinations.
EDIT
My ultimate goal is, for a list like this:
course A = {'L': {'L1': 'Time', 'L2': 'Time'}, 'T': {'T1': 'Time', 'T2': 'Time'}, 'LAB': {'LAB1': 'Time'}}
course B = {'L': {'L1': 'Time'}, 'T': {'T1': 'Time', 'T2': 'Time'}, 'LAB': {'LAB1': 'Time'}}
I would want a combination like
CourseA(L1,T1,LAB1)CourseB(L1,T1)
CourseA(L1,T2,LAB1)CourseB(L1,T1)
CourseA(L2,T1,LAB1)CourseB(L1,T1)
CourseA(L1,T1,LAB1)CourseB(L1,T2)
CourseA(L1,T2,LAB1)CourseB(L1,T2)
CourseA(L2,T1,LAB1)CourseB(L1,T2)
to be given; then maybe I will further filter out those that have a time clash by tracing back to the sessions' values (the times).
Generating combinations from nested lists like that is a bit tricky.
Just to get you started, here's how you could go about generating all the possible combinations (Cartesian products) of some nested lists. You then need to figure out how to apply this pattern to your problem.
import itertools

# a list with 4 categories
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# this function will help generate the products
prod = itertools.product

# combine the first two columns. The for loop below assumes
# that the previous result is a list.
result = list(prod(a[0], a[1]))
for x in a[2:]:
    result = list(prod(x, result))
    result = [[r[0]] + list(r[1]) for r in result]
print(result)
The output of this program is:
[[10, 7, 1, 4], [10, 7, 1, 5], [10, 7, 1, 6], [10, 7, 2, 4], [10, 7, 2, 5],
[10, 7, 2, 6], [10, 7, 3, 4], [10, 7, 3, 5], [10, 7, 3, 6], [10, 8, 1, 4], ...
[12, 9, 1, 5], [12, 9, 1, 6], [12, 9, 2, 4], [12, 9, 2, 5], [12, 9, 2, 6],
[12, 9, 3, 4], [12, 9, 3, 5], [12, 9, 3, 6]]
Now you would want to generate the list of session options representing each course, and then do the same thing with those lists.
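Applied to the course structure in the question, the same itertools.product pattern can enumerate timetables in two stages: first the product of one session per non-empty category within a course, then the product across courses. A sketch with hypothetical placeholder times (clash checking, i.e. parsing the time strings, would then filter the resulting list):

```python
import itertools

# toy version of the scraped structure (course names and times are placeholders)
courses = {
    'CourseA': {'L': {'L1': 'Mo09', 'L2': 'Tu09'},
                'T': {'T1': 'We10', 'T2': 'Th10'},
                'LAB': {}},
    'CourseB': {'L': {'L1': 'Mo11'},
                'T': {'T1': 'Fr11'},
                'LAB': {}},
}

def course_options(sections):
    """All ways to pick one session from each non-empty category of a course."""
    categories = [list(cat.items()) for cat in sections.values() if cat]
    return [dict(choice) for choice in itertools.product(*categories)]

# the product across courses gives every full timetable
per_course = {name: course_options(secs) for name, secs in courses.items()}
timetables = [dict(zip(per_course, choice))
              for choice in itertools.product(*per_course.values())]

print(len(timetables))  # 4: (L1|L2) x (T1|T2) for CourseA, one option for CourseB
```

Each element of timetables maps a course name to its chosen sessions (with their times), so a clash check only needs to walk one timetable's values.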