Distribute elements based on percentages - python

Let's say that I want to distribute a number of items (n) into an array of fixed size (x).
The difficult part is that I have to distribute the items using a flexibility array.
Assume that x = 4, n = 11 and flexibility = [20, 20, 30, 30], with len(flexibility) == x.
My question is:
How can I distribute the n elements in an array of length x using the percentages defined in flexibility?
What I want at the end is something like:
n = 11
x = 4
flexibility = [20, 20, 30, 30]
distributed = distribute_elements_in_slots(n, x, flexibility)
print(distributed)
# distributed = [2, 2, 3, 4]
In the case of equal flexibility values, the final result depends on the rule we choose to make sure all the items are used. In the previous case, both [2, 2, 3, 4] and [2, 2, 4, 3] are acceptable results.
Edit: An example of the method that I want to have is as follows:
def distribute_elements_in_slots(n, x, flexibility=[25,25,25,25]):
    element_in_slots = []
    # number of items that one percentage point is worth
    element_per_percentage = n / 100
    for i in range(x):
        element_in_slots.append(round(element_per_percentage * flexibility[i]))
    return element_in_slots
Edit 2: One of the solutions that I found is the following:
from math import floor
def distribute_elements_in_slots(n, x, flexibility=[25,25,25,25]):
    element_in_slots = [f * n / 100 for f in flexibility]
    carry = 0
    for i in range(len(element_in_slots)):
        element = element_in_slots[i] + carry
        element_in_slots[i] = floor(element)
        carry = element - floor(element)
    if sum(element_in_slots) < n:
        # Here the carry is almost 1
        max_index = flexibility.index(max(flexibility))
        element_in_slots[max_index] = element_in_slots[max_index] + 1
    return element_in_slots
This distributes the items almost evenly across the slots, based on the flexibility array.

What you need to do is split the number 11 according to the percentages given in the array, so each slot initially gets percentage * n / 100. Then take the remainder that is left after truncating and assign it somewhere, which in your case is the last element.
In [10]: [i*n/100 for i in f]
Out[10]: [2.2, 2.2, 3.3, 3.3]
In [11]: b=[i*n/100 for i in f]
In [12]: rem = sum(b) - sum(map(int,b))
In [13]: rem
Out[13]: 1.0
In [24]: b= list(map(int,b))
In [26]: b[-1] +=rem
In [27]: b
Out[27]: [2, 2, 3, 4.0]
Hope it helps. :)

As Albin Paul did, we need to allocate the whole-number amount for each slot's percentage. The leftovers need to be allocated, largest first.
def distribute_elements_in_slots(total, slots, pct):
    # Compute proportional distribution by given percentages.
    distr = [total * pct[i] / 100 for i in range(slots)]
    # Truncate each position and store the difference in a new list.
    solid = [int(elem) for elem in distr]
    short = [distr[i] - solid[i] for i in range(slots)]
    print(distr)
    print(solid)
    print(short)
    # allocate leftovers
    leftover = int(round(sum(short)))
    print(leftover)
    # For each unallocated item,
    # find the neediest slot, and put an extra there.
    for i in range(leftover):
        shortest = short.index(max(short))
        solid[shortest] += 1
        short[shortest] = 0
        print("Added 1 to slot", shortest)
    return solid
n = 11
x = 4
flexibility = [20, 20, 30, 30]
distributed = distribute_elements_in_slots(n, x, flexibility)
print(distributed)
# distributed = [2, 2, 4, 3]
Output:
[2.2, 2.2, 3.3, 3.3]
[2, 2, 3, 3]
[0.2, 0.2, 0.3, 0.3]
1
Added 1 to slot 2
[2, 2, 4, 3]
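The same "whole part first, leftovers to the largest remainders" idea can also be sketched with integer arithmetic only, which sidesteps floating-point noise in the remainders. This is a minimal sketch, not a definitive implementation: the name largest_remainder is hypothetical, and it assumes the percentages are integers that sum to 100.

```python
def largest_remainder(total, percentages):
    """Split `total` into integer shares proportional to `percentages`.

    Assumption: `percentages` are integers that sum to 100.
    """
    # divmod gives the guaranteed whole share and the remainder used for ranking.
    parts = [divmod(total * p, 100) for p in percentages]
    shares = [whole for whole, _ in parts]
    remainders = [rem for _, rem in parts]
    leftover = total - sum(shares)
    # Give one extra item to the slots with the largest remainders
    # (ties are broken in favour of the lowest index, because sorted() is stable).
    order = sorted(range(len(percentages)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares


print(largest_remainder(11, [20, 20, 30, 30]))  # [2, 2, 4, 3]
```

Breaking ties by lowest index is only one possible rule; any rule works as long as exactly `leftover` slots receive an extra item.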


Length of the intersections between a list and a list of lists

Note: this is almost a duplicate of Numpy vectorization: Find intersection between list and list of lists
Differences:
I am focused on efficiency when the lists are large
I'm searching for the largest intersections.
x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]
Here are some assumptions:
y is a list of ~500,000 sublists of ~500 elements each
each sublist in y is a range, so y is characterized by the last element of each sublist. In the example: 3, 7, 9, 12 ...
x is not sorted
y contains each number between 1 and ~500000*500 once and only once
y is sorted in the sense that, as in the example, the sublists are sorted and the first element of each sublist follows the last element of the previous one.
y is known long in advance, even before compile time
My goal is to find which sublists of y have at least 10 elements in common with x.
I can obviously write a loop:
def find_best(x, y):
    result = []
    x_set = set(x)  # build the set once instead of once per sublist
    for index, sublist in enumerate(y):
        intersection = x_set.intersection(sublist)
        if len(intersection) >= 2:  # in real life: >= 10
            result.append(index)
    return result
x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
res = find_best(x, y)
print(res) # [0, 2]
Here the result is [0, 2] because the first and third sublists of y each have at least 2 elements in common with x.
Another method is to go through y only once and count the intersections:
def find_intersec2(x, y):
    x = list(x)  # work on a copy so the caller's list is not modified
    n_sublists = len(y)
    res = {num: 0 for num in range(0, n_sublists + 1)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists + 1) if res[n] >= 2]
This second method makes more use of the assumptions.
Questions:
What optimizations are possible?
Is there a completely different approach? Indexing, a k-d tree? In my use case, the large list y is known days before the actual run, so I'm not afraid of building an index or whatever from y. The small list x is only known at runtime.
Since y contains disjoint ranges and their union is also a range, a very fast solution is to first perform a binary search on y for each item of x, then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny), with Nx and Ny the number of items in x and y, respectively. This algorithm is nearly optimal (since x needs to be read entirely).
Actual implementation
First of all, you need to transform your current y to a Numpy array containing the beginning value of all ranges (in an increasing order) with N as the last value (assuming N is excluded for the ranges of y, or N+1 otherwise). This part can be assumed as free since y can be computed at compile time in your case. Here is an example:
import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])
Then, you need to perform the binary search and check that the values fit in the range of y:
indices = np.searchsorted(y, x, 'right')
# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1
Then you need to count the indices and keep the ones that appear at least 10 times:
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
Here is an example based on yours:
x = np.array([1, 2, 3, 4, 5, 6])
# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])
# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]
# [0, 2]
print(result.tolist())
It runs in less than 0.1 ms on my machine on a random input based on your input constraints.
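To make the preprocessing step concrete, here is a minimal sketch of how the array of range starts could be derived from the original list of sublists, followed by the same searchsorted lookup. The helper names range_starts and sublists_with_min_hits are hypothetical, and np.bincount is used in place of np.unique for the counting step; it assumes every sublist is a non-empty, contiguous, increasing range as described in the question.

```python
import numpy as np


def range_starts(sublists):
    """First value of every range, plus one-past-the-end of the last range."""
    starts = [sub[0] for sub in sublists]
    starts.append(sublists[-1][-1] + 1)  # sentinel closing the last range
    return np.array(starts)


def sublists_with_min_hits(x, starts, min_hits):
    """Indices of the ranges that contain at least `min_hits` values of x."""
    n_ranges = len(starts) - 1
    # Index of the range each value falls into (-1 or n_ranges if outside).
    indices = np.searchsorted(starts, x, side='right') - 1
    indices = indices[(indices >= 0) & (indices < n_ranges)]
    counts = np.bincount(indices, minlength=n_ranges)
    return np.flatnonzero(counts >= min_hits).tolist()


y_lists = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
starts = range_starts(y_lists)  # can be computed once, ahead of time
print(sublists_with_min_hits([1, 2, 3, 4, 5, 6], starts, 2))  # [0, 2]
```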
Turn y into 2 dicts.
index = { # index to count map
0 : 0,
1 : 0,
2 : 0,
3 : 0,
4 : 0
}
y = { # elem to index map
1: 0,
2: 0,
3: 0,
4: 1,
5: 2,
6: 2,
7: 3,
8 : 4,
9 : 4,
10 : 4,
11 : 4
}
Since you know y in advance, I don't count the above operations into the time complexity. Then, to count the intersection:
x = [1, 2, 3, 4, 5, 6]
for e in x: index[y[e]] += 1
Since you mentioned x is small, I try to make the time complexity depend only on the size of x (in this case O(n)).
Finally, the answer is the list of keys in index dict where the value is >= 2 (or 10 in real case).
answer = [i for i in index if index[i] >= 2]
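The two dicts above are written out by hand; for a large y they would be generated. A minimal sketch of that, assuming y is available as a list of sublists (the helper names build_elem_to_index and count_hits are hypothetical, and collections.Counter stands in for the hand-written index dict):

```python
from collections import Counter


def build_elem_to_index(sublists):
    """Map every element to the index of the sublist that contains it."""
    return {elem: idx for idx, sub in enumerate(sublists) for elem in sub}


def count_hits(x, elem_to_index):
    """Count, per sublist index, how many values of x fall into that sublist."""
    return Counter(elem_to_index[e] for e in x if e in elem_to_index)


y_lists = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
elem_to_index = build_elem_to_index(y_lists)  # built once, ahead of time
hits = count_hits([1, 2, 3, 4, 5, 6], elem_to_index)
print(sorted(i for i, c in hits.items() if c >= 2))  # [0, 2]
```

The dict costs one entry per element of y, so for the sizes in the question the memory footprint is worth measuring before relying on it.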
This uses y to create a linear array, called x2range_counter, that maps every int to one plus the index of the range (subgroup) the int is in.
x2range_counter uses a compact array.array type (at least 32 bits per item) to save memory; it can be cached and reused for the calculations of every x on the same y.
Calculating the hits in each range for a particular x is then just an indirect array increment of a counter in the function count_ranges.
import array
from typing import List
y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
x = [5, 3, 1, 11, 8, 10]
range_counter_max = len(y)
extent = y[-1][-1] + 1  # min in y must be 1 not 0 remember.
x2range_counter = array.array('L', [0] * extent)  # compact typed storage ('L' is at least 32 bits per item)
# Map any int in any x to appropriate ranges counter.
for range_counter_index, rng in enumerate(y, start=1):
    for n in rng:
        x2range_counter[n] = range_counter_index
print(x2range_counter)  # array('L', [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4])
# x2range_counter can be saved for this y and any x on this y.
def count_ranges(x: List[int]) -> array.array:
    "Number of x-hits on each y subgroup in order"
    # Note: count[0] initially catches errors. count[1..] counts x's in y ranges [0..]
    count = array.array('L', [0] * (range_counter_max + 1))
    for xx in x:
        count[x2range_counter[xx]] += 1
    assert count[0] == 0, "x values must all exist in a y range and y must have all int in its range."
    return count[1:]
print(count_ranges(x))  # array('L', [1, 2, 1, 2])
I created a class for this, with extra functionality such as returning the ranges rather than the indices; all ranges hit >=M times; (range, hit-count) tuples sorted most hit first.
Range calculations for different x are proportional to x and are simple array lookups rather than any hashing of dicts.
What do you think?

How could I write a function to find fractional ranking of a list of numbers?

I'm trying to write a code in Python to create a fractional ranking list for a given one.
The fraction ranking is basically the following:
We have a list of numbers x = [4,4,10,4,10,2,4,1,1,2]
First, we need to sort the list in ascending order. I will use insertion sort for it, I already coded this part.
Now we have the sorted list x = [1, 1, 2, 2, 4, 4, 4, 4, 10, 10]. The list has 10 elements and we need to compare it with a list of the first 10 natural numbers n = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
For each element in x we assign a value. Notice the number 1 appears in positions 1 and 2. So, the number 1 receives the rank (1 + 2) / 2 = 1.5.
The number 2 appears in positions 3 and 4, so it receives the rank (3 + 4) / 2 = 3.5.
The number 4 appears in positions 5, 6, 7 and 8, so it receives the rank (5 + 6 + 7 + 8) / 4 = 6.5
The number 10 appears in positions 9 and 10, so it receives the rank (9 + 10) / 2 = 9.5
In the end of this process we need to have a new list of ranks r = [1.5, 1.5, 3.5, 3.5, 6.5, 6.5, 6.5, 6.5, 9.5, 9.5]
I don't want an entire solution, I want some tips to guide me while writing down the code.
I'm trying to use a for loop to build a new list from the elements of the original one, but my first attempt failed badly. I tried to get at least the first elements right, but it didn't work as expected:
# Suppose the list is already sorted.
def ranking(x):
    l = len(x)
    for ele in range(1, l):
        t = x[ele-1]
        m = x.count(t)
        i = 0
        sum = 0
        while i < m: # my intention was to get right at least the rank of the first item of the list
            sum = sum + 1
            i = i + 1
        x[ele] = sum/t
    return x
Any ideas about how I could solve this problem?
Ok, first, for your for loop there you can more easily loop through each element in the list by just saying for i in x:. At least for me, that would make it a little easier to read. Then, to get the rank, maybe loop through again with a nested for loop and check if it equals whatever element you're currently on. I don't know if that makes sense; I didn't want to provide too many details because you said you didn't want the full solution (definitely reply if you want me to explain better).
Here is an idea:
You can use x.count(1) to see how many number 1s you have in list, x.count(2) for number 2 etc.
Also, never use sum as a variable name, since it is a built-in function.
Maybe use 2 for loops. The first one goes through the elements in list x; the second one also goes through the elements in list x and, if it finds the same element, appends it to new_list.
You can then use something like sum(new_list) and clear the list after each iteration.
You don't even need to loop through list n if you use indexing while looping through x: with for i, y in enumerate(x) you can use n[i] to read the value.
If you want the code, I'll post it in a comment.
@VictorPaesPlinio, would you try this sample code for the problem? It's a partial solution: it does the data aggregation and leaves the last part, producing the output, as your own exercise.
from collections import defaultdict
x = [4, 4, 10, 4, 10, 2, 4, 1, 1, 2]
x.sort()
print(x)
lst = list(range(1, len(x)+1))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ranking = defaultdict(list)
for idx, num in enumerate(x, 1):
    print(idx, num)
    ranking[num].append(idx)
print(ranking)
'''defaultdict(<class 'list'>, {1: [1, 2], 2: [3, 4],
4: [5, 6, 7, 8], 10: [9, 10]})
'''
r = []
# r = [1.5, 1.5, 3.5, 3.5, 6.5, 6.5, 6.5, 6.5, 9.5, 9.5]
# 1 1 2 2 4 4 4 4 10 10
for key, values in ranking.items():
    # key is the number, values in the list()
    print(key, values, sum(values))
Outputs:
1 [1, 2] 3
2 [3, 4] 7
4 [5, 6, 7, 8] 26
10 [9, 10] 19 # then you can do the final outputs part...
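For readers who do want to see the final step, one way to finish the aggregation above is to replace each value with the mean of the positions it occupies in the sorted list. This is only a sketch; the function name fractional_ranks is hypothetical, and it uses the built-in sorted rather than the asker's insertion sort.

```python
from collections import defaultdict


def fractional_ranks(values):
    """Fractional rank of every element of the sorted input (ties share the mean position)."""
    ordered = sorted(values)
    positions = defaultdict(list)
    # Collect the 1-based positions each distinct value occupies.
    for pos, value in enumerate(ordered, start=1):
        positions[value].append(pos)
    # The rank of a value is the average of its positions.
    return [sum(positions[v]) / len(positions[v]) for v in ordered]


print(fractional_ranks([4, 4, 10, 4, 10, 2, 4, 1, 1, 2]))
# [1.5, 1.5, 3.5, 3.5, 6.5, 6.5, 6.5, 6.5, 9.5, 9.5]
```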

Python: Find outliers inside a list

I'm having a list with a random amount of integers and/or floats. What I'm trying to achieve is to find the exceptions inside my numbers (hoping to use the right words to explain this). For example:
list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
90 to 99% of my integer values are between 1 and 20
sometimes there are values that are much higher, let's say somewhere around 100 or 1,000 or even more
My problem is that these values can be different every time. Maybe the regular range is somewhere between 1,000 and 1,200 and the exceptions are in the range of half a million.
Is there a function to filter out these special numbers?
Assuming your list is l:
If you know you want to filter a certain percentile/quantile, you can use the following.
This removes the values at or below the 10th percentile and at or above the 90th percentile. Of course, you can change either cut-off as desired (for example, you can drop the bottom filter and only filter above the 90th percentile in your example):
import numpy as np
l = np.array(l)
l = l[(l>np.quantile(l,0.1)) & (l<np.quantile(l,0.9))].tolist()
output:
[3, 2, 14, 2, 8, 4, 3, 5]
If you are not sure of the percentile cut-off and are looking to remove outliers:
You can adjust the cut-off for outliers through the argument m in the function call. The larger it is, the fewer outliers are removed. This function seems to be more robust to various types of outliers than other outlier removal techniques.
import numpy as np
l = np.array(l)
def reject_outliers(data, m=6.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m].tolist()
print(reject_outliers(l))
output:
[1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
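Because the median-based score is relative to the typical spread of the data, the same cut-off also works when the "regular" range sits around 1,000 rather than around 1 to 20. The sketch below illustrates that with a variant that returns both the kept values and the rejected ones; the name split_outliers and the second sample list are assumptions made for illustration only.

```python
import numpy as np


def split_outliers(values, m=6.0):
    """Return (kept, outliers) using the same median-deviation score as reject_outliers."""
    data = np.asarray(values, dtype=float)
    d = np.abs(data - np.median(data))   # distance of each value from the median
    mdev = np.median(d)                  # typical distance (median absolute deviation)
    s = d / (mdev if mdev else 1.0)      # distance expressed in units of mdev
    return data[s < m].tolist(), data[s >= m].tolist()


# Sample list from the question: regular values between 1 and 20.
print(split_outliers([1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5])[1])  # [108.0, 97.0]

# Hypothetical list with a regular range around 1,000 to 1,200.
print(split_outliers([1000, 1100, 1200, 1050, 500000, 1150])[1])  # [500000.0]
```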
You can use the built-in filter() method:
lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
lst2 = list(filter(lambda x: x > 5,lst1))
print(lst2)
Output:
[14, 108, 8, 97]
Here is a method to filter out those deviating values:
import math
_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
def consts(_list):
    # mean of the values
    mu = 0
    for i in _list:
        mu += i
    mu = mu/len(_list)
    # (population) standard deviation of the values
    sigma = 0
    for i in _list:
        sigma += math.pow(i-mu,2)
    sigma = math.sqrt(sigma/len(_list))
    return sigma, mu
def frequence(x, sigma, mu):
    # density of the normal distribution N(mu, sigma) at x
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-(1/2)*math.pow(((x-mu)/sigma),2))
sigma, mu = consts(_list)
new_list = []
for i in range(len(_list)):
    if frequence(_list[i], sigma, mu) > 0.01:
        new_list.append(_list[i])  # keep the value itself, not its index
print(new_list)

constructing arithmetic progressions from loop

I am trying to write a program that calculates the diagonal coefficients of Pascal's triangle.
For those who are not familiar with it, the general terms of sequences are written below.
1st row = 1 1 1 1 1....
2nd row = N0(natural number) // 1 = 1 2 3 4 5 ....
3rd row = N0(N0+1) // 2 = 1 3 6 10 15 ...
4th row = N0(N0+1)(N0+2) // 6 = 1 4 10 20 35 ...
The sequence for each subsequent row follows a specific pattern, and my goal is to output those sequences in a for loop, with the number of units as input.
def figurate_numbers(units):
    row_1 = str(1) * units
    row_1_list = list(row_1)
    for i in range(1, units):
        # sequences are
        row_2 = n // i
        row_3 = (n * (n + 1)) // (i * (i + 1))
        row_4 = (n * (n + 1) * (n + 2)) // (i * (i + 1) * (i + 2))
>>> figurate_numbers(4) # coefficients for 4 rows and 4 columns
[1, 1, 1, 1]
[1, 2, 3, 4]
[1, 3, 6, 10]
[1, 4, 10, 20] # desired output
How can I iterate for both n and i in one loop such that each sequence of corresponding row would output coefficients?
You can use map or a list comprehension to hide a loop.
def f(i):
    return lambda x: ...
row = [ [1] * k ]
for i in range(k):
    row.append(list(map( f(i), row[i])))
where f is a function that describes the dependency on the previous element of the row.
Another possibility is to adapt a recursive Fibonacci to rows. The NumPy library supports array arithmetic, so you do not even need map. Python also has predefined library functions for the number of combinations etc., which could perhaps be used.
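One concrete dependency that produces the desired table is that each row is the running total of the row above it (1, 1+2, 1+2+3, ... gives 1, 3, 6, ...). Below is a minimal sketch of that idea using itertools.accumulate. It is a different technique from the map-based outline above, not a completion of it, and it reuses the function name from the question.

```python
from itertools import accumulate


def figurate_numbers(units):
    """Return `units` rows of `units` figurate numbers each."""
    rows = [[1] * units]  # first row: all ones
    for _ in range(units - 1):
        # Each new row is the cumulative sum of the previous row.
        rows.append(list(accumulate(rows[-1])))
    return rows


for row in figurate_numbers(4):
    print(row)
# [1, 1, 1, 1]
# [1, 2, 3, 4]
# [1, 3, 6, 10]
# [1, 4, 10, 20]
```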
To compute this efficiently, without nested loops, use the rational-number-based solution from
https://medium.com/#duhroach/fast-fun-with-pascals-triangle-6030e15dced0 .
from fractions import Fraction
def pascalIndexInRowFast(row,index):
    lastVal=1
    halfRow = (row>>1)
    #early out, is index > half? if so, use the mirrored index (rows are symmetric)
    if index > halfRow:
        index = row - index
    for i in range(0, index):
        #integer division is exact here and avoids float rounding
        lastVal = lastVal * (row - i) // (i + 1)
    return lastVal
def pascDiagFast(row,length):
    #compute the fractions of this diag
    fracs=[1]*(length)
    for i in range(length-1):
        num = i+1
        denom = row+1+i
        fracs[i] = Fraction(num,denom)
    #now let's compute the values
    vals=[0]*length
    #first figure out the leftmost tail of this diag
    lowRow = row + (length-1)
    lowRowCol = row
    tail = pascalIndexInRowFast(lowRow,lowRowCol)
    vals[-1] = tail
    #walk backwards!
    for i in reversed(range(length-1)):
        vals[i] = int(fracs[i]*vals[i+1])
    return vals
Don't reinvent the triangle:
>>> from scipy.linalg import pascal
>>> pascal(4)
array([[ 1, 1, 1, 1],
[ 1, 2, 3, 4],
[ 1, 3, 6, 10],
[ 1, 4, 10, 20]], dtype=uint64)
>>> pascal(4).tolist()
[[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]

Packing a small number of packages into a fixed number of bins

I have a list of package sizes. There will be a maximum of around 5 different sizes and they may occur a few times (<50).
packages = [5,5,5,5,5,5,10,11]
I need to pack them into a fixed number of bins, for example 3.
number_of_bins = 3
The bins' totals (the sum of the sizes of the packages packed into each bin) may differ from each other by between 0 and, say, 2; that is, the totals must be equal or nearly equal. So having bins with [1,2] (=3) and [2] (=2) (difference is 1) is fine, but having them with [10] (=10) and [5] (=5) (difference is 5) is not.
It is possible not to sort all packages into the bins, but I want the solution where a minimum number of packages remains unpacked.
So the best solution in this case (I think) would be
bins = [11,5],[10,5],[5,5,5]
remaining = [5]
There's probably a knapsack or bin-packing algorithm to do this, but I haven't found it. I'm fine with brute-forcing it, but I'm not sure what's an efficient way to do that.
Is there any efficient way of doing this easily? Did I just miss the relevant search term to find it?
Another example:
packages = [5,10,12]
number_of_bins = 2
leads to
bins = [12],[10]
remaining = [5]
because
bins = [12],[10,5]
has bin sizes of 12 and 15 which vary by more than 2.
Similarly:
packages = [2,10,12]
number_of_bins = 3
leads to
bins = [2],[],[]
remaining = [12,10]
Here is a solution using pulp:
from pulp import *
packages = [18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 65, 65, 65]
number_of_bins = 3
bins = range(1, number_of_bins + 1)
items = range(0, len(packages))
x = LpVariable.dicts('x',[(i,b) for i in items for b in bins],0,1,LpBinary)
y = LpVariable('y', 0, 2, LpInteger)
prob=LpProblem("bin_packing",LpMinimize)
#maximize items placed in bins
prob.setObjective(LpAffineExpression([(x[i,b], -3) for i in items for b in bins] + [(y, 1)]))
#every item is placed in at most 1 bin
for i in items:
    prob+= lpSum([x[i,b] for b in bins]) <= 1
for b in bins:
    if b != 1: # bin 1 is the one with lowest sum
        prob+= LpAffineExpression([(x[i,b], packages[i]) for i in items] + [(x[i,1], -packages[i]) for i in items]) >= 0
    if b != number_of_bins: # last bin is the one with highest
        prob+= LpAffineExpression([(x[i,number_of_bins], packages[i]) for i in items] + [(x[i,b], -packages[i]) for i in items]) >= 0
#highest sum - lowest sum <= 2 so every difference of bin sums must be under 2
prob += LpAffineExpression([(x[i,number_of_bins], packages[i]) for i in items] + [(x[i,1], -packages[i]) for i in items]) <= 2
prob += LpAffineExpression([(x[i,number_of_bins], packages[i]) for i in items] + [(x[i,1], -packages[i]) for i in items]) == y
prob.solve()
print(LpStatus[prob.status])
for b in bins:
    print(b,':',', '.join([str(packages[i]) for i in items if value(x[i,b]) !=0 ]))
print('left out: ', ', '.join([str(packages[i]) for i in items if sum(value(x[i,b]) for b in bins) ==0 ]))
Tricky one, really not sure about an optimal solution. Below is a solution that just iterates all possible groups and halts at the first solution. This should be a minimal-remainder solution since we first iterate through all solutions without any remainder.
It also iterates over solutions as everything in the first bin, which could be excluded for a faster result.
import numpy as np
def int_to_base_list(x, base, length):
    """ create a list of length length that expresses a base-10 integer
    e.g. binary: int_to_base_list(101, 2, 10) returns array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1])
    """
    placeholder = np.array([0] * length) # will contain the actual answer
    for i in reversed(range(length)):
        # standard base mathematics, see http://www.oxfordmathcenter.com/drupal7/node/18
        placeholder[i] = x % base
        x //= base
    return placeholder
def get_groups(packages, max_diff_sum, number_of_bins):
    """ Get number_of_bins packaging groups that differ no more than max_diff_sum
    e.g.
    [5, 5, 5, 5, 5, 5, 10, 11] with 2, 3 gives [5,5,5], [10,5], [11,5]
    [5, 10, 12] with 2, 2 gives [10], [12]
    [2, 6, 12] with 2, 3 gives [2], [], []
    We approach the problem by iterating over group indices, so the first
    example above has solution [0 0 0 1 2 3 1 2] with the highest number being
    the 'remainder' group.
    """
    length = len(packages)
    for i in range((number_of_bins + 1)**length - 1): # All possible arrangements in groups
        index = int_to_base_list(i, number_of_bins + 1, length) # Get the corresponding indices
        sums_of_bins = [np.sum(packages[index==ii]) for ii in range(number_of_bins)]
        if max(sums_of_bins) - min(sums_of_bins) <= max_diff_sum: # the actual requirement
            # print(index)
            break
    groups = [packages[index==ii] for ii in range(number_of_bins)]
    # remainder = packages[index==number_of_bins]
    return groups
On your examples:
packages = np.array([5, 5, 5, 5, 5, 5, 10, 11])
max_diff_sum = 2
number_of_bins = 3
get_groups(packages, max_diff_sum, number_of_bins)
>> [array([5, 5, 5]), array([ 5, 10]), array([ 5, 11])]
And
packages = np.array([5,10,12])
max_diff_sum = 2
number_of_bins = 2
get_groups(packages, max_diff_sum, number_of_bins)
>> [array([10]), array([12])]
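If the minimal remainder must be guaranteed rather than taken from the first feasible arrangement, an exhaustive variant can keep track of the best arrangement seen so far. The following is a brute-force sketch under the question's assumption of a small number of packages; the name pack is hypothetical, it uses itertools.product instead of the base conversion above, and its running time grows as (number_of_bins + 1) ** len(packages).

```python
from itertools import product


def pack(packages, number_of_bins, max_diff=2):
    """Return (bins, remaining) with as few unpacked packages as possible."""
    best = None  # (number of unpacked packages, assignment)
    # Slot values 0..number_of_bins-1 are bins; slot number_of_bins means "unpacked".
    for assignment in product(range(number_of_bins + 1), repeat=len(packages)):
        sums = [0] * number_of_bins
        for size, slot in zip(packages, assignment):
            if slot < number_of_bins:
                sums[slot] += size
        if max(sums) - min(sums) > max_diff:
            continue  # bin totals differ too much
        unpacked = assignment.count(number_of_bins)
        if best is None or unpacked < best[0]:
            best = (unpacked, assignment)
            if unpacked == 0:
                break  # cannot do better than packing everything
    # Leaving everything unpacked is always feasible, so best is never None.
    _, assignment = best
    bins = [[p for p, s in zip(packages, assignment) if s == b]
            for b in range(number_of_bins)]
    remaining = [p for p, s in zip(packages, assignment) if s == number_of_bins]
    return bins, remaining


bins, remaining = pack([5, 10, 12], 2)
print(bins, remaining)
```

For the second example in the question this leaves only the package of size 5 unpacked, with bin totals of 10 and 12.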
