Let's assume I have 100 different kinds of items; each item has a name and a physical weight.
I know the names of all 100 items, but the weight of only 80 of them.
When I ship items, I pack them in groups of 10 and sum the weights of those items.
Because some items are missing their weight, the sum is inaccurate when I'm about to ship.
I have different shipments with missing weights:
Shipment 1

Item Name | Item Weight
Item 2    | 10
Item 27   | 20
Item 42   | 20
Item 71   | -
Item 77   | -

Total weight: 75
Shipment 2

Item Name | Item Weight
Item 2    | 10
Item 27   | 20
Item 42   | 20
Item 71   | -
Item 92   | -

Total weight: 90
Shipment 3

Item Name | Item Weight
Item 2    | 10
Item 27   | 20
Item 42   | 20
Item 55   | 35
Item 77   | -

Total weight: 100
Since some of the shipments share the same items with missing weights, and I have each shipment's total weight, is there a way with machine learning to determine the weights of these items without unpacking the entire shipment?
Or would it just be, in this case, a 100x3 matrix with a lot of empty values?
At this point I'm not really sure if I should use some type of regression to solve this, or if it's just a matrix that would expand a lot if I had n more items to ship.
I also wondered if this is some type of knapsack problem, but I hope someone can guide me in the right direction.
Forget about machine learning: this is a simple system of linear equations. Subtracting the known weights from each shipment's total (e.g. shipment 1: 75 - (10 + 20 + 20) = 25) gives:

w_71 + w_77 = 25
w_71 + w_92 = 40
w_77 = 15
You can solve it with sympy.solvers.solveset.linsolve, scipy.optimize.linprog, scipy.linalg.lstsq, or numpy.linalg.lstsq.
sympy.linsolve is maybe the easiest to understand if you are not familiar with matrices; however, if the system is underdetermined, then instead of returning a particular solution to the system, sympy.linsolve will return the general solution in parametric form.
scipy.lstsq or numpy.lstsq expect the problem to be given in matrix form. If there is more than one possible solution, they will return the most "average" solution. However, they cannot take any positivity constraint into account: they might return a solution where one of the variables is negative. You can maybe fix this behaviour by adding a new equation to the system to manually force a variable to be positive, then solve again.
scipy.linprog expects the problem to be given in matrix form; it also expects you to specify a linear objective function, to choose which particular solution is "best" in case there is more than one possible solution. linprog also considers that all variables are nonnegative by default, or allows you to specify explicit bounds for the variables yourself. It also allows you to add inequality constraints, in addition to the equations, if you wish to.
Using sympy.solvers.solveset.linsolve
from sympy.solvers.solveset import linsolve
from sympy import symbols
w71, w77, w92 = symbols('w71 w77 w92')
eqs = [w71+w77-25, w71+w92-40, w77-15]
solution = linsolve(eqs, [w71, w77, w92])
# solution = {(10, 15, 30)}
In your example, there is only one possible solution, so linsolve returned that solution: w71 = 10, w77 = 15, w92 = 30.
However, in case there is more than one possible solution, linsolve will return a parametric form for the general solution:
x, y, z = symbols('x y z')
eqs = [x+y-10, y+z-20]
solution = linsolve(eqs, [x, y, z])
# solution = {(z - 10, 20 - z, z)}
Here there is an infinity of possible solutions. linsolve is telling us that we can pick any value for z, and then we'll get the corresponding x and y as x = z - 10 and y = 20 - z.
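If you need one concrete solution out of that parametric form, you can substitute a value for the parameter yourself (a minimal sketch; the choice z = 15 is arbitrary):

from sympy import symbols
from sympy.solvers.solveset import linsolve

x, y, z = symbols('x y z')
general = linsolve([x + y - 10, y + z - 20], [x, y, z])  # {(z - 10, 20 - z, z)}
# Substitute z = 15 into each component of the parametric tuple:
particular = [expr.subs(z, 15) for expr in next(iter(general))]
# particular = [5, 5, 15]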
Using numpy.linalg.lstsq
lstsq expects the system of equations to be given in matrix form. If there is more than one possible solution, then it will return the most "average" solution. For instance, if the system of equations is simply x + y = 10, then lstsq will return the particular solution x = 5, y = 5 and will ignore more "extreme" solutions such as x = 10, y = 0.
from numpy.linalg import lstsq
# w_71 + w_77 = 25
# w_71 + w_92 = 40
# w_77 = 15
A = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
b = [25, 40, 15]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([10., 15., 30.])
Here lstsq found the unique solution, w71 = 10, w77 = 15, w92 = 30.
# x + y = 10
# y + z = 20
A = [[1, 1, 0], [0, 1, 1]]
b = [10, 20]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([-3.55271368e-15, 1.00000000e+01, 1.00000000e+01])
Here lstsq had to choose a particular solution, and chose the one it considered most "average", x = 0, y = 10, z = 10. You might want to round the solution to integers.
One drawback of lstsq is that it doesn't take into account your non-negativity constraint. That is, it might return a solution where one of the variables is negative:
# x + y = 2
# y + z = 20
A = [[1, 1, 0], [0, 1, 1]]
b = [2, 20]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([-5.33333333, 7.33333333, 12.66666667])
See how lstsq ignored the possible positive solution x = 1, y = 1, z = 18 and instead returned the solution it considered most "average", x = -5.33, y = 7.33, z = 12.67.
One way to fix this is to add an equation yourself to force the offending variable to be positive. For instance, here we noticed that lstsq wanted x to be negative, so we can manually force x to be equal to 1 instead, and solve again:
# x + y = 2
# y + z = 20
# x = 1
A = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
b = [2, 20, 1]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([ 1., 1., 19.])
Now that we have manually forced x to be 1, lstsq finds the solution x = 1, y = 1, z = 19, which we're happier with.
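As an aside, scipy also ships a dedicated nonnegative least-squares solver, scipy.optimize.nnls, which enforces the nonnegativity constraint directly instead of requiring a manually added equation. A minimal sketch on the same system:

import numpy as np
from scipy.optimize import nnls

# x + y = 2
# y + z = 20
A = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([2.0, 20.0])
x, residual = nnls(A, b)
# x holds a least-squares solution with all entries >= 0;
# residual is ~0 here because an exact nonnegative solution exists.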
Using scipy.optimize.linprog
The particularity of linprog is that it expects you to specify the "objective" used to choose a particular solution, in case there is more than one possible solution.
Also, linprog allows you to specify bounds for the variables. The default is that all variables are nonnegative, which is what you want.
from scipy.optimize import linprog
# w_71 + w_77 = 25
# w_71 + w_92 = 40
# w_77 = 15
A = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
b = [25, 40, 15]
c = [1, 1, 1] # coefficients for objective: minimise w71 + w77 + w92.
solution = linprog(c, A_eq=A, b_eq=b)
solution.x
# array([10., 15., 30.])
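With many more shipments than this toy example, you would build A and b programmatically rather than by hand. A hedged sketch, assuming the shipments are given as (unknown items, remaining weight) pairs, where the remaining weight is the shipment total minus the sum of the known weights:

# Hypothetical input format: for each shipment, the set of unknown-weight
# items and the shipment total minus the sum of the known weights.
shipments = [
    ({'w71', 'w77'}, 25),
    ({'w71', 'w92'}, 40),
    ({'w77'}, 15),
]
unknowns = sorted(set().union(*(items for items, _ in shipments)))
A = [[1 if name in items else 0 for name in unknowns] for items, _ in shipments]
b = [remainder for _, remainder in shipments]
# A and b can now be passed to lstsq or linprog exactly as above.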
Related
(My previous posting has just been closed. However, I cannot see what was inappropriate about the question.)
I'm dealing with a linear equation-solving problem in which the value of each variable is either 0 or 1.
I would like to develop a solver that can tell whether the value of each variable is definitely 0 or 1. In the final output, a value is assigned to a variable if it is solved; otherwise it is assigned None.
For example, the inputs of
a + b + c = 1
b + c = 1
should generate the outputs of
{a=0, b=None, c=None}
And the inputs of
a + b + 2c + d = 2
a + d = 1
should give
{a=None, b=1, c=0, d=None}
As far as I know, there already exist some general linear solvers in Python (e.g. numpy.linalg.solve). Is it possible to use them, with modifications? If not, what approach is recommended instead?
Thank you~
Your idea is very close. np.linalg.solve(a, b) can only be used if a is square and of full rank, i.e., all rows (or, equivalently, columns) are linearly independent. Otherwise use, for instance, lstsq for the least-squares best "solution" of the system/equation.
import numpy as np
A = np.array([[1, 1, 1], [0, 1, 1]])
B = np.array([1, 1])
X = np.linalg.lstsq(A, B, rcond=None)[0]  # only interested in the best solution
###solution for [a, b, c]:
###[-1.11022302e-16 5.00000000e-01 5.00000000e-01]
A = np.array([[1, 1, 2, 1], [1, 0, 0, 1]])
B = np.array([2, 1])
X = np.linalg.lstsq(A, B, rcond=None)[0]
###solution for [a, b, c, d]:
###[0.5 0.2 0.4 0.5]
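Note that lstsq returns a real-valued least-squares solution, so by itself it does not tell you which variables are forced to 0 or 1. For small systems, one hedged way to get exactly the {a=0, b=None, c=None} style of output from the question is to brute-force all 0/1 assignments and report the variables that agree across every satisfying assignment. This is exponential in the number of variables, so it is only a sketch for few unknowns; solve_binary is an illustrative name:

from itertools import product

def solve_binary(equations, n_vars):
    """equations is a list of (coefficients, rhs) pairs.
    Returns a list with 0/1 where a variable is forced, None otherwise,
    or None if the system has no 0/1 solution at all."""
    solutions = [
        assignment
        for assignment in product((0, 1), repeat=n_vars)
        if all(sum(c * v for c, v in zip(coeffs, assignment)) == rhs
               for coeffs, rhs in equations)
    ]
    if not solutions:
        return None
    return [solutions[0][i] if all(s[i] == solutions[0][i] for s in solutions) else None
            for i in range(n_vars)]

# a + b + 2c + d = 2 and a + d = 1
print(solve_binary([([1, 1, 2, 1], 2), ([1, 0, 0, 1], 1)], 4))
# [None, 1, 0, None]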
In pyomo, I have a piece-wise linear constraint defined through pyomo.environ.Piecewise. I keep getting a warning along the lines of
Piecewise component '<component name>' has detected slopes of consecutive piecewise segments to be within <tolerance> of one another. Refer to the Piecewise help documentation for information on how to disable this warning.
I know I could increase the tolerance and get rid of the warning, but I'm wondering if there is a general approach (through Pyomo or numpy) to reduce the number of "segments" when the difference between two consecutive slopes is below a given tolerance.
I could obviously implement this myself, but I'd like to avoid reinventing the wheel.
OK, this is what I came up with. It is definitely not optimized for performance, but my case involves few points. It also lacks validation of the inputs (e.g. x being sorted and unique).
import numpy as np

def reduce_piecewise(x, y, abs_tol):
    """
    Remove unnecessary points from a piece-wise curve.

    Points are removed if the slopes of consecutive segments
    differ by less than `abs_tol`.

    x points must be sorted and unique.
    Consecutive y points can be the same though!

    Parameters
    ----------
    x : List[float]
        Points along x-axis.
    y : List[float]
        Points along y-axis.
    abs_tol : float
        Tolerance between consecutive segments.

    Returns
    -------
    (np.array, np.array)
        x and y points - reduced.
    """
    if not len(x) == len(y):
        raise ValueError("x and y must have same shape")

    x_reduced = [x[0]]
    y_reduced = [y[0]]
    for i in range(1, len(x) - 1):
        # compare the slope into point i (from the last kept point)
        # with the slope out of point i
        left_slope = (y[i] - y_reduced[-1]) / (x[i] - x_reduced[-1])
        right_slope = (y[i+1] - y[i]) / (x[i+1] - x[i])
        if abs(right_slope - left_slope) > abs_tol:
            x_reduced.append(x[i])
            y_reduced.append(y[i])
    x_reduced.append(x[-1])
    y_reduced.append(y[-1])

    return np.array(x_reduced), np.array(y_reduced)
And here are some examples:
>>> x = np.array([0, 1, 2, 3])
>>> y = np.array([0, 1, 2, 3])
>>> reduce_piecewise(x, y, 0.01)
(array([0, 3]), array([0, 3]))
>>> x = np.array([0, 1, 2, 3, 4, 5])
>>> y = np.array([0, 2, -1, 3, 4.001, 5]) # 4.001 should be removed
>>> reduce_piecewise(x, y, 0.01)
(array([0, 1, 2, 3, 5]), array([ 0., 2., -1., 3., 5.]))
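If performance ever matters, a mostly equivalent vectorized variant can be written with np.diff. The semantics differ slightly: it compares the slopes of adjacent original segments, whereas the loop above compares against the last point actually kept, so results can diverge when several consecutive points are removed. A hedged sketch; reduce_piecewise_pairwise is an illustrative name:

import numpy as np

def reduce_piecewise_pairwise(x, y, abs_tol):
    # Keep an interior point only if the slopes of its two adjacent
    # original segments differ by more than abs_tol.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slopes = np.diff(y) / np.diff(x)
    keep_interior = np.abs(np.diff(slopes)) > abs_tol
    mask = np.concatenate(([True], keep_interior, [True]))  # always keep endpoints
    return x[mask], y[mask]

On both examples above this returns the same reduced points as reduce_piecewise.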
I'm trying to build a fuzzy set from a series of example values with Python 3.
For instance, given [6, 7, 8, 9, 27] I'd like to obtain a function that:
returns 0.0 from 0 to about 5,
goes gradually up to 1.0 from about 5 to 6,
stays at 1.0 from 6 to 9,
goes gradually down to 0.0 from 9 to about 10,
stays at 0.0 from about 10 to about 26,
goes gradually up to 1.0 from about 26 to 27,
goes gradually down to 0.0 from 27 to about 28,
returns 0.0 from about 28 onwards.
Notice that the y values are always in the range [0.0, 1.0] and if a series is missing a value, the y of that value is 0.0.
Please consider that in the most general case the input values might be something like [9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22]. The values can always be sorted, but notice that in this series the value 4 is repeated 3 times; I'd therefore expect it to have a membership of 1 and all the other values a lower value (not necessarily 1/3, as in this case).
The top part of this picture shows the desired function plotted up to x=16 (hand drawn). I'd be more than happy to obtain anything like it.
The bottom part of the picture shows some extra features that would be nice to have but are not strictly mandatory:
better smoothing than shown in my drawing (A),
cumulative effect (B) provided that...
the function never goes above 1 (C) and...
the function never goes below 0 (D).
I've tried some approaches adapted from polyfit, Bezier, Gaussian fitting, and others, but the results weren't what I expected.
I've also tried the package fuzzpy, but I couldn't make it work because of its dependency on epydoc, which seems not to be compatible with Python 3. No luck with statsmodels either.
Can anyone suggest how to achieve the desired function? Thanks in advance.
If you wonder, I plan to use the resulting function to predict the likelihood of a given value; with respect to the fuzzy set described above, for instance, 4.0 returns 0.0, 6.5 returns 1.0 and 5.8 something like 0.85. Maybe there is another simpler way to do this?
This is how I usually process the input values (not sure if the part that adds the 0s is needed), what show I have instead ??? to compute the desired f?
import numpy as np
import matplotlib.pyplot as plt

def prepare(values, normalize=True):
    max = 0
    table = {}
    for value in values:
        table[value] = (table[value] if value in table else 0) + 1
        if normalize and table[value] > max:
            max = table[value]
    if normalize:
        for value in table:
            table[value] /= float(max)
    # add explicit 0s for the missing values
    for value in range(sorted(table)[-1] + 2):
        if value not in table:
            table[value] = 0
    x = sorted(table)
    y = [table[value] for value in x]
    return x, y

if __name__ == '__main__':
    # get x and y vectors
    x, y = prepare([9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22], normalize=True)
    # calculate fitting function
    f = ???
    # calculate new x's and y's
    x_new = np.linspace(x[0], x[-1], 50)
    y_new = f(x_new)
    # plot the results
    plt.plot(x, y, 'o', x_new, y_new)
    plt.xlim([x[0] - 1, x[-1] + 1])
    plt.show()
    print("Done.")
A practical example, just to clarify the motivation for this: the series of values might be the number of minutes after which people give up standing in line in front of a kiosk. With such a model, we could try to predict how likely somebody is to leave the queue given how long they have been waiting. The value read this way could then be defuzzified, for instance, into happily waiting [0.00, 0.33], just waiting (0.33, 0.66], and about to leave (0.66, 1.00]. In the about to leave case, that somebody could be engaged by something (an ad?) to convince them to stay.
This only works (due to np.bincount) with a set of integers.

import numpy as np

def fuzzy_interp(x, vals):
    vals = np.asarray(vals)
    vmn, vmx = np.amin(vals), np.amax(vals)
    v = vals - vmn + 1                            # value w lands at index w - vmn + 1
    b = np.bincount(v, minlength=vmx - vmn + 2)   # index 0 stays empty for the left ramp
    b = b / np.amax(b)                            # normalise the highest count to 1
    return np.interp(x - vmn + 1, np.arange(b.size), b, left=0, right=0)
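A usage sketch on the example series from the question (the variable names are illustrative):

vals = [9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22]
xs = np.linspace(0, 45, 500)
ys = fuzzy_interp(xs, vals)
# ys peaks at 1.0 around x = 4, which occurs three times in vals,
# and is lower at the values that occur only once.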
def pulse(x):
    return np.maximum(0, 1 - abs(x))

def fuzzy_in_unscaled(x, xs):
    return pulse(np.subtract.outer(x, xs)).sum(axis=-1)

def fuzzy_in(x, xs):
    largest = fuzzy_in_unscaled(xs, xs).max()
    return fuzzy_in_unscaled(x, xs) / largest
>>> fuzzy_in(1.5, [1, 3, 4, 5]) # single membership
0.5
>>> fuzzy_in([[1.5, 3], [3.5, 10]], [1, 3, 4, 5]) # vectorized in the first argument
array([[0.5, 1], [1, 0]])
This exploits the fact that the peak values must lie on the elements, which is not true for all pulse functions.
You'd do well to precompute largest, as computing it is O(N^2).
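For example, a closure that precomputes largest once (a hedged sketch; make_fuzzy_in is an illustrative name):

def make_fuzzy_in(xs):
    # Compute the normalisation constant once, then reuse it.
    xs = np.asarray(xs, dtype=float)
    largest = fuzzy_in_unscaled(xs, xs).max()
    return lambda x: fuzzy_in_unscaled(x, xs) / largest

membership = make_fuzzy_in([1, 3, 4, 5])
membership(1.5)  # 0.5, as in the example above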
I have some complex assignment logic in a simulation that I would like to optimize for performance. The current logic is implemented as a set of nested for loops over a variety of numpy arrays. I would like to vectorize this assignment logic, but I haven't been able to figure out whether that is possible.
import numpy as np

def reverse_enumerate(l):
    # iterate (index, element) pairs from the last element to the first
    return zip(range(len(l) - 1, -1, -1), reversed(l))

materials = np.array([[1, 0, 1, 1],
                      [1, 1, 0, 0],
                      [0, 1, 1, 1],
                      [1, 0, 0, 1]])

vectors = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]])

prices = np.array([10, 20, 30, 40])
demands = np.array([1, 1, 1, 1])

supply_by_vector = np.zeros(len(vectors)).astype(int)

# go through each material and assign it to the first vector that the material covers
for m_indx, material in enumerate(materials):
    # find the first vector where the material covers the SKU
    for v_indx, vector in enumerate(vectors):
        if (vector <= material).all():
            supply_by_vector[v_indx] = supply_by_vector[v_indx] + 1
            break

original_supply_by_vector = np.copy(supply_by_vector)
profit_by_vector = np.zeros(len(vectors))
remaining_ask_by_sku = np.copy(demands)

# calculate profit by assigning material from vectors to SKUs to satisfy demand
# go through vectors in reverse order (so lowest priority vectors are used up first)
profit = 0.0
for v_indx, vector in reverse_enumerate(vectors):
    for sku_indx, price in enumerate(prices):
        available = supply_by_vector[v_indx]
        if available == 0:
            continue
        ask = remaining_ask_by_sku[sku_indx]
        if ask <= 0:
            continue
        if vector[sku_indx]:
            assign = ask if available > ask else available
            remaining_ask_by_sku[sku_indx] = remaining_ask_by_sku[sku_indx] - assign
            supply_by_vector[v_indx] = supply_by_vector[v_indx] - assign
            profit_by_vector[v_indx] = profit_by_vector[v_indx] + assign * price
            profit = profit + assign * price

print('total profit:', profit)
print('unfulfilled demand:', remaining_ask_by_sku)
print('original supply:', original_supply_by_vector)
result:
total profit: 80.0
unfulfilled demand: [0 1 0 0]
original supply: [1 2]
There is a dependency between iterations of the innermost loop in the second group of nested loops, which to me seems difficult, if not impossible, to vectorize. So this post is a partial solution that instead vectorizes the first group of two nested loops, which were -
supply_by_vector = np.zeros(len(vectors)).astype(int)
for m_indx, material in enumerate(materials):
    # find the first vector where the material covers the SKU
    for v_indx, vector in enumerate(vectors):
        if (vector <= material).all():
            supply_by_vector[v_indx] = supply_by_vector[v_indx] + 1
            break
That entire section could be replaced by one line of vectorized code, like so -
supply_by_vector = ((vectors[:,None] <= materials).all(2)).sum(1)
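One caveat, added here as a hedged note: the one-liner counts a material once for every vector that covers it, while the original loop stops at the first match because of the break. The two agree on this sample data (every material is covered by at most one vector). A sketch that preserves the first-match semantics in the general case, reusing vectors and materials from above:

mask = (vectors[:, None] <= materials).all(2)   # shape: (n_vectors, n_materials)
covered = mask.any(0)                           # materials covered by at least one vector
first = mask.argmax(0)                          # index of the first covering vector
supply_by_vector = np.bincount(first[covered], minlength=len(vectors))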
It's pretty late, so I don't know how clear this will be.
I have a function f(x), and I want to get the value of x from a list for which f(x) reaches the smallest negative value, namely:

x = [0, 2, 4, 6]
f(x) = [200, 0, -3, -1000]

In this case, I would like something that returns the value 4 from x, which gave me -3. I don't want the absolute minimum (-1000), but the negative value with the smallest absolute value.
I hope that makes sense. Thanks a lot for your help.
UPDATE
I was trying to simplify the problem, maybe too much. Here's the thing: I have a list of 2D points that form a polygon, and I want to order them clockwise.
For that, starting from the previous point, I take the cross product with each of the other points and select as the next point the one whose cross product is negative (which tells me the sense of rotation) and has the smallest absolute value (which tells me it really is the next point).
so, say:
x = [(1,1), (-1,-1), (-1,1), (1,-1)]
and I would like to get
x = [(1,1), (1,-1), (-1,-1), (-1,1)]
I'm doing

for point in x:
    cp = [numpy.cross(point, p) for p in x]
    # and then some magic to select the right point...
Thanks for your help again.
a = [0, 2, 4, 6]
b = [200, 0, -3, -1000]
value = max([x for x in b if x < 0])
print(a[b.index(value)])
Try this:

inputs = [0, 2, 4, 6]
outputs = [200, 0, -3, -1000]

max = min(outputs)
for n in outputs:
    if n >= 0:
        continue
    if n > max:
        max = n
print(inputs[outputs.index(max)])
x = [0, 2, 4, 6]
fx = [200, 0, -3, -1000]
print(x[fx.index(max(n for n in fx if n < 0))])
The following solution should work well in most cases:

[z for z in x if f(z) == max(f(y) for y in x if f(y) < 0)]

One characteristic of this solution is that if there is a repetition, i.e. several x values producing the same largest negative value, all of them will be returned.
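An equivalent approach that avoids calling f repeatedly and pairs each x with its value directly (a minimal sketch using the lists from the question):

x = [0, 2, 4, 6]
fx = [200, 0, -3, -1000]
negatives = [(xi, fi) for xi, fi in zip(x, fx) if fi < 0]
best_x, best_f = max(negatives, key=lambda pair: pair[1])
# best_x = 4, best_f = -3

And for the underlying problem in the update, ordering polygon points clockwise, a common alternative to the pairwise cross-product search is sorting by angle around the centroid. A hedged sketch: it assumes the polygon is convex (or at least star-shaped from its centroid), and the resulting cycle may start at a different point than in the example:

import math

points = [(1, 1), (-1, -1), (-1, 1), (1, -1)]
cx = sum(p[0] for p in points) / len(points)
cy = sum(p[1] for p in points) / len(points)
# Descending angle around the centroid gives a clockwise cycle.
ordered = sorted(points, key=lambda p: -math.atan2(p[1] - cy, p[0] - cx))
# ordered = [(-1, 1), (1, 1), (1, -1), (-1, -1)]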