Expand numbers in a list - python

I have a list of numbers:
[10,20,30]
What I need is to expand it according to a predefined increment. If we call the increment x and set x=2, my result should be:
[10,12,14,16,18,20,22,24,.....,38]
Right now I am using a for loop, but it is very slow and I am wondering if there is a faster way.
EDIT:
newA = []
for n in array:
    newA = newA + generateNewNumbers(n, p, t)
The generateNewNumbers function simply generates the new numbers to add to the list.
EDIT2:
To better define the problem the first array contains some timestamps:
[10,20,30]
I have two parameters, the sampling rate and the sampling time. What I need is to expand the array by adding, between any two timestamps, the correct number of timestamps according to the sampling rate.
For example, if I have a sampling rate 3 and a sampling time 3 the result should be:
[10,13,16,19,20,23,26,29,30,33,36,39]

You can add the same set of increments to each time stamp using np.add.outer and then flatten the result using ravel.
import numpy as np
a = [10,20,35]
inc = 3
ninc = 4
np.add.outer(a, inc * np.arange(ninc)).ravel()
# array([10, 13, 16, 19, 20, 23, 26, 29, 35, 38, 41, 44])
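If, per my reading of EDIT2, the sampling time is the number of extra timestamps generated after each original one and the sampling rate is their spacing (that is an assumption about the question, not something it states outright), the same idea reproduces the expected output:
import numpy as np
a = [10, 20, 30]
sampling_rate = 3   # spacing between generated timestamps
sampling_time = 3   # extra timestamps after each original one (my reading of EDIT2)
np.add.outer(a, sampling_rate * np.arange(sampling_time + 1)).ravel()
# array([10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39])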

You can use list comprehensions, but I'm not sure I understand the stopping condition for including the last point.
a = [10, 20, 30, 40]
t = 3
sum([[x for x in range(y, z, t)] for y, z in zip(a[:-1], a[1:])], []) + [a[-1]]
will give
[10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39, 40]
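Since the question mentions speed: sum(list_of_lists, []) re-copies the accumulated list at every step, so it is quadratic in the number of sublists. A variant of the same comprehension flattened with itertools.chain.from_iterable avoids that (same output, just a different flattening step):
from itertools import chain

a = [10, 20, 30, 40]
t = 3
list(chain.from_iterable(range(y, z, t) for y, z in zip(a[:-1], a[1:]))) + [a[-1]]
# [10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39, 40]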

Using range and itertools.chain
l = [10,20,30]
x = 3
from itertools import chain
list(chain(*[range(i,i+10,x) for i in l]))
#Output:
#[10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39]

There are a bunch of good answers here already, but I would advise numpy and linear interpolation.
# Now, this will give you the desired result with your first specifications
# And in pure Python too
t = [10, 20, 30]
increment = 2
last = int(round(t[-1]+((t[-1]-t[-2])/float(increment))-1)) # Value of last number in array
# Note if you insist on mathematically "incorrect" endpoint, do:
#last = ((t[-1]+(t[-1]-t[-2])) -((t[-1]-t[-2])/float(increment)))+1
newt = range(t[0], last+1, increment)
# And, of course, this may skip entered values (e.g. with increment = 3).
# But what you should do instead, when you use the samplerate is
# to use linear interpolation
# If you resample the original signal,
# Then you resample the time too
# And don't expand over the existing time
# Because the time doesn't change if you resampled the original properly
# You only get more or less samples at different time points
# But it lasts the same length of time.
# If you do what you originally meant, you actually shift your datapoints in time
# Which is wrong.
import numpy
t = [10, 20, 30, 40, 50, 60]
oldfs = 4000 # 4 KHz samplerate
newfs = 8000 # 8 KHz sample rate (2 times bigger signal and its time axis)
ratio = max(oldfs*1.0, newfs*1.0)/min(newfs, oldfs)
newlen = round(len(t)*ratio)
numpy.interp(
    numpy.linspace(0.0, 1.0, newlen),
    numpy.linspace(0.0, 1.0, len(t)),
    t)
This code can also resample your original signal (if you have one). If you just want to cram some more timepoints in between, you can also use interpolation. Again, don't go over the existing time. (This snippet does go over it, only to stay compatible with the first one and to give you ideas of what you can do.)
t = [10, 20, 30]
increment = 2
last = t[-1] + ((t[-1]-t[-2]) / float(increment)) - 1  # Value of last number in array
t.append(last)
newlen = int((t[-1]-t[0]) / float(increment) + 1)  # How many samples we will get in the end
ratio = newlen / len(t)
numpy.interp(
    numpy.linspace(0.0, 1.0, newlen),
    numpy.linspace(0.0, 1.0, len(t)),
    t)
This, though, results in an increment of 2.5 instead of 2, but that can be corrected. The point is that this approach works on floating-point time points as well as on integers, and it is fast. It will slow down once there are a great many points, but until then it is pretty fast.
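For instance, here is a minimal sketch of one way to get an exact step of 2 (assuming, as in the question's first example, that the original timestamps are uniformly spaced and the expansion should stop just before one original interval past the last timestamp):
import numpy as np

t = [10, 20, 30]
increment = 2

# Sample uniformly in time: run from the first timestamp up to (but not
# including) one original interval past the last timestamp.
stop = t[-1] + (t[-1] - t[-2])
newt = np.arange(t[0], stop, increment)
# array([10, 12, 14, ..., 36, 38])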

Related

Python - subtraction inside rolling window

I need to make subtractions inside red frames as [20-10,60-40,100-70]
that results in [10,20,30]
Current code makes subtractions but I don't know how to define red frames
seq = [10, 20, 40, 60, 70, 100]
window_size = 2
for i in range(len(seq) - window_size + 1):
    x = seq[i: i + window_size]
    y = x[1] - x[0]
    print(y)
You can build a quick solution using the fact that seq[0::2] will give you every other element of seq starting at zero. So you can compute seq[1::2] - seq[0::2] to get this result.
Without using any packages you could do:
seq = [10, 20, 40, 60, 70, 100]
sub_seq = [0] * (len(seq) // 2)
for i in range(len(sub_seq)):
    sub_seq[i] = seq[1::2][i] - seq[0::2][i]
print(sub_seq)
Instead you could use Numpy. Using the numpy array object you can subtract the arrays rather than explicitly looping:
import numpy as np
seq = np.array([10, 20, 40, 60, 70, 100])
sub_seq = seq[1::2] - seq[0::2]
print(sub_seq)
Here's a solution using numpy which might be useful if you have to process large amounts of data in a short time. We select values based on whether their index is even (index % 2 == 0) or odd (index % 2 != 0).
import numpy as np
seq = [10, 20, 40, 60, 70, 100]
seq = np.array(seq)
index = np.arange(len(seq))
seq[index % 2 != 0] - seq[index % 2 == 0]

Multi-knapsack problem with aggregate objective function/objective with a soft limit

I am trying to solve a variant of the multi-knapsack example in Google OR-tools. The one feature I cannot seem to encode is a soft limit on the value.
In the original example, an item has a weight that is used to calculate a constraint and a value that is used to calculate the optimum solution. In my variation I have multiple weights/capacities that form quotas and compatibilities for items of certain types. In addition, each bin has a funding target and each item has a value. I would like to minimise the funding shortfall for each bin:
# pseudocode!
minimise: sum(max(0, funding_capacity[j] - sum(item[i, j] * item_value[i] for i in num_items)) for j in num_bins)
The key difference between this approach and the example is that if item_1 has a value of 110 and bin_A has a funding requirement of 100, item_1 can fit into bin_A and makes the funding shortfall go to zero. item_2, with a value of 50, could also fit into bin_A (as long as the other constraints are met), but the optimiser would see no improvement in the objective function. I have attempted to use the objective.SetCoefficient method on a calculation of the funding shortfall, but I keep getting errors that I think are due to this method not accepting aggregate expressions like sum.
How do I implement the funding shortfall objective above, either in the objective function or alternatively in the constraints? How can I form an objective function using a summary calculation? The ideal answer would be a code snippet for OR-tools in Python but clear illustrative answers from OR-tools in other programming languages would also be helpful.
Working code follows, but here's how you would proceed with the formulation.
Formulation changes to the Multiple Knapsack problem given here
You will need two sets of new variables for each bin. Let's call them shortfall[j] (continuous) and filled[j] (boolean).
shortfall[j] is simply funding_requirement[j] - sum_i(funding of item i in bin j).
filled[j] is a Boolean, which we want to be 1 if the total funding of the items in the bin is greater than its funding requirement, and 0 otherwise.
We have to resort to a standard trick in integer programming that involves using a Big M (a large number):
if total_item_funding >= requirement, filled = 1
if total_item_funding < requirement, filled = 0
This can be expressed in a linear constraint:
shortfall + BigM * filled >= 0
Note that if the shortfall goes negative, it forces the filled variable to become 1. If shortfall is positive, filled can stay 0. (We will enforce this using the objective function.)
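As a minimal standalone sketch of just this linking constraint in pywraplp (the variable names and bounds here are illustrative; the full working program is further below):
from ortools.linear_solver import pywraplp

# One bin only, to show the Big-M link between shortfall and filled.
solver = pywraplp.Solver.CreateSolver('SCIP')
BIG_M, MAX_SHORT = 1e4, 500

shortfall = solver.NumVar(-MAX_SHORT, MAX_SHORT, 'shortfall')
filled = solver.IntVar(0, 1, 'filled')

# If shortfall goes negative (bin over-funded), filled is forced to 1;
# if shortfall stays non-negative, filled may remain 0.
solver.Add(shortfall + BIG_M * filled >= 0)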
In the objective function for a Maximization problem, you penalize the filled variable.
Obj: Max sum(i,j) Xij * value_i + sum(j) filled_j * -100
So, this multiple knapsack formulation is incentivized to go close to each bin's funding requirement, but if it crosses the requirement, it pays a penalty.
You can play around with the objective function variables and penalties.
Formulation using Google-OR Tools
Working Python Code. For simplicity, I only made 3 bins.
from ortools.linear_solver import pywraplp
def create_data_model():
    """Create the data for the example."""
    data = {}
    weights = [48, 30, 42, 36, 36, 48, 42, 42, 36, 24, 30, 30, 42, 36, 36]
    values = [10, 30, 25, 50, 35, 30, 15, 40, 30, 35, 45, 10, 20, 30, 25]
    item_funding = [50, 17, 38, 45, 65, 60, 15, 30, 10, 25, 75, 30, 40, 40, 35]
    data['weights'] = weights
    data['values'] = values
    data['i_fund'] = item_funding
    data['items'] = list(range(len(weights)))
    data['num_items'] = len(weights)
    num_bins = 3
    data['bins'] = list(range(num_bins))
    data['bin_capacities'] = [100, 100, 80]
    data['bin_funding'] = [100, 100, 50]
    return data
def main():
    data = create_data_model()
    # Create the mip solver with the SCIP backend.
    solver = pywraplp.Solver.CreateSolver('SCIP')

    # Variables
    # x[i, j] = 1 if item i is packed in bin j.
    x, short, filled = {}, {}, {}
    for i in data['items']:
        for j in data['bins']:
            x[(i, j)] = solver.IntVar(0, 1, 'x_%i_%i' % (i, j))
    BIG_M, MAX_SHORT = 1e4, 500
    for j in data['bins']:
        short[j] = solver.NumVar(-MAX_SHORT, MAX_SHORT,
                                 'bin_shortfall_%i' % (j))
        # use j (not i) so each filled variable gets a distinct name
        filled[j] = solver.IntVar(0, 1, 'filled[%i]' % (j))
    # Constraints
    # Each item can be in at most one bin.
    for i in data['items']:
        solver.Add(sum(x[i, j] for j in data['bins']) <= 1)
    for j in data['bins']:
        # The amount packed in each bin cannot exceed its capacity.
        solver.Add(
            sum(x[(i, j)] * data['weights'][i]
                for i in data['items']) <= data['bin_capacities'][j])
        # Define bin shortfalls as equality constraints.
        solver.Add(
            data['bin_funding'][j] - sum(x[(i, j)] * data['i_fund'][i]
                                         for i in data['items']) == short[j])
        # If the shortfall is negative, filled is forced to be true.
        solver.Add(
            short[j] + BIG_M * filled[j] >= 0)

    # Objective
    objective = solver.Objective()
    for i in data['items']:
        for j in data['bins']:
            objective.SetCoefficient(x[(i, j)], data['values'][i])
    for j in data['bins']:
        # objective.SetCoefficient(short[j], 1)
        objective.SetCoefficient(filled[j], -100)
    objective.SetMaximization()
    print('Number of variables =', solver.NumVariables())
    status = solver.Solve()
    if status == pywraplp.Solver.OPTIMAL:
        print('OPTIMAL SOLUTION FOUND\n\n')
        total_weight = 0
        for j in data['bins']:
            bin_weight = 0
            bin_value = 0
            bin_fund = 0
            print('Bin ', j, '\n')
            print(f"Funding {data['bin_funding'][j]} Shortfall "
                  f"{short[j].solution_value()}")
            for i in data['items']:
                if x[i, j].solution_value() > 0:
                    print('Item', i, '- weight:', data['weights'][i], ' value:',
                          data['values'][i], data['i_fund'][i])
                    bin_weight += data['weights'][i]
                    bin_value += data['values'][i]
                    bin_fund += data['i_fund'][i]
            print('Packed bin weight:', bin_weight)
            print('Packed bin value:', bin_value)
            print('Packed bin Funding:', bin_fund)
            print()
            total_weight += bin_weight
        print('Total packed weight:', total_weight)
    else:
        print('The problem does not have an optimal solution.')


if __name__ == '__main__':
    main()
Hope that helps you move forward.

How do I include the upper boundary of the bins in Matplotlib hist

When creating a histogram using hist() from matplotlib, the data falls into bins as such:
lb ≤ x < ub. How do I force it to behave like this: lb < x ≤ ub?
Additionally, the frequency table is shifted one bin lower compared to Excel, which produces an inaccurate measurement for my purpose.
import numpy as np
data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
bins = np.array([20, 25, 30])
# Excel 1, 8
# Python 1, 5
Using the table as a reference, how do I force hist() to put values between 25 and 30 in bin 30 and not bin 25?
# in Python: 20 <-> 20 ≤ x < 25
# in Excel: 25 <-> 20 < x ≤ 25
Maybe numpy.digitize might be interesting for you (from the documentation):
Return the indices of the bins to which each value in input array belongs.
`right`    order of bins    returned index `i` satisfies
=========  ===============  ============================
``False``  increasing       ``bins[i-1] <= x < bins[i]``
``True``   increasing       ``bins[i-1] < x <= bins[i]``
``False``  decreasing       ``bins[i-1] > x >= bins[i]``
``True``   decreasing       ``bins[i-1] >= x > bins[i]``
Hopefully this also clears up a common misunderstanding when working with bins: the bins correspond to the vertices of a grid, and a data point falls between two vertices, i.e. into one bin. Therefore a data point does not correspond to one single entry of the bins array but to two.
Another thing one can see from this notation is that with bins=[20, 25, 30], bin 1 goes from 20 to 25 and bin 2 from 25 to 30; maybe the notation in Excel is different?
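To make the effect of right concrete on your example data, a quick check with the bins from the question:
import numpy as np

data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
bins = np.array([20, 25, 30])

print(np.digitize(data, bins, right=False))  # [1 2 2 2 2 2 3 3 3] -> the 30s fall past the last edge
print(np.digitize(data, bins, right=True))   # [1 2 2 2 2 2 2 2 2] -> the 30s land in the 25-30 bin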
Using the keyword right for a custom histogram function results in the following code and plot.
import numpy as np
import matplotlib.pyplot as plt
data = np.array([15,
                 17, 18, 20, 20, 20,
                 23.5, 24, 25, 25,
                 28, 29, 30, 30, 30])
bins = np.array([15, 20, 25, 30])

def custom_hist(x, bins, right=False):
    x_dig = np.digitize(x, bins=bins, right=right)
    u, c = np.unique(x_dig, return_counts=True)
    h = np.zeros(len(bins), dtype=int)
    h[u] = c
    return h

plt.hist(data, bins=bins, color='b', alpha=0.7, label='plt.hist')
# array([3., 5., 7.])
height = custom_hist(x=data, bins=bins, right=True)
width = np.diff(bins)
width = np.concatenate((width, width[-1:]))
plt.bar(bins-width, height=height, width=width,
        align='edge', color='r', alpha=0.7, label='np.digitize')
plt.legend()
# This function also allows different sized bins
Note that in the case of right=True, 15 belongs to the bin ? < x <= 15, which gives you a fourth bar in the histogram even though it is not explicitly included in the bins. If this is not wanted, you have to treat the edge cases separately and maybe add those values to the first valid bin, as sketched below.
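One minimal way to fold the out-of-range values into the first (or last) valid bin is to clip the digitize indices; this is just one assumption about the desired behaviour, and the function name is mine:
import numpy as np

def custom_hist_clipped(x, bins, right=True):
    # Same counting as custom_hist above, but indices below the first bin
    # edge are folded into the first real bin, and indices above the last
    # edge into the last bin. h keeps the same length as custom_hist's
    # result, so it can be plotted the same way (h[0] stays 0).
    x_dig = np.clip(np.digitize(x, bins=bins, right=right), 1, len(bins) - 1)
    u, c = np.unique(x_dig, return_counts=True)
    h = np.zeros(len(bins), dtype=int)
    h[u] = c
    return h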
I guess this is also the reason why we see unexpected behaviour with your example data. Matplotlib applies lb ≤ x < ub for the bins, but the 30s nevertheless get associated with the 25-30 bin. If we add an additional bin 30-35, we can see that the 30s are now put in that bin. I guess they apply the rule lb ≤ x < ub everywhere except at the last edge, where they use lb ≤ x ≤ ub, which is also reasonable, but one has to be aware of it.
data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
plt.hist(data, bins=np.array([20, 25, 30]), color='b', alpha=0.7, label='[20, 25, 30]')
plt.hist(data, bins=np.array([20, 25, 30, 35]), color='r', alpha=0.7, label='[20, 25, 30, 35]')
plt.legend()

pymc with observations on multiple variables

I'm using an example of linear regression from bayesian methods for hackers but having trouble expanding it to my usage.
I have observations on a random variable, an assumed distribution on that random variable, and finally another assumed distribution on that random variable for which I have observations. How I have tried to model it is with intermediate distributions on a and b, but it complains Wrong number of dimensions: expected 0, got 1 with shape (788,).
To describe the actual model, I am predicting the conversion rate for a certain amount (n) of cultivating emails. My prior is that the conversion rate (described by a Beta function on alpha and beta) will be updated by having alpha and beta scaled by some factors (0,inf] a and b, which start at 1 for n=0 and increase to their max value at some threshold.
import numpy as np
import pymc3 as pm

# Generate predictive data, X, and target data, Y
data = [
    {'n': 0,  'trials': 120, 'successes': 1},
    {'n': 5,  'trials': 111, 'successes': 2},
    {'n': 10, 'trials': 78,  'successes': 1},
    {'n': 15, 'trials': 144, 'successes': 3},
    {'n': 20, 'trials': 280, 'successes': 7},
    {'n': 25, 'trials': 55,  'successes': 1}]
X = np.empty(0)
Y = np.empty(0)
for dat in data:
    X = np.insert(X, 0, np.ones(dat['trials']) * dat['n'])
    target = np.zeros(dat['trials'])
    target[:dat['successes']] = 1
    Y = np.insert(Y, 0, target)
with pm.Model() as model:
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    a_slope = pm.Deterministic('a_slope', 1 + (X/n_sat)*(a_gamma-1))
    b_slope = pm.Deterministic('b_slope', 1 + (X/n_sat)*(b_gamma-1))
    a = pm.math.switch(X >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(X >= n_sat, b_gamma, b_slope)
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b)
    observed = pm.Bernoulli("observed", p, observed=Y)
Is there a way to get this to work?
Data
First, note that the total likelihood of repeated Bernoulli trials is exactly a binomial likelihood, so there is no need to expand to individual trials in your data. I'd also suggest using a Pandas DataFrame to manage your data - it helps to keep things tidy:
import pandas as pd

df = pd.DataFrame({
    'n': [0, 5, 10, 15, 20, 25],
    'trials': [120, 111, 78, 144, 280, 55],
    'successes': [1, 2, 1, 3, 7, 1]
})
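As an aside, here is a quick numeric check of the Bernoulli/Binomial equivalence mentioned above: the two log-likelihoods differ only by the constant log binomial coefficient, which does not affect inference. The scipy call is used purely for this illustration, it is not part of the model:
import numpy as np
from scipy import stats

# 2 successes in 5 Bernoulli trials with p = 0.3
p, n, k = 0.3, 5, 2
bernoulli_loglike = k * np.log(p) + (n - k) * np.log(1 - p)
binomial_loglike = stats.binom.logpmf(k, n, p)
print(binomial_loglike - bernoulli_loglike)  # log(C(5, 2)) = log(10), a constant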
Solution
This will help simplify the model, but the solution really is to add a shape argument to the p random variable so that PyMC3 knows how to interpret the one-dimensional parameters. The fact is that you do want a different p distribution for each n case you have, so there is nothing conceptually wrong here.
with pm.Model() as model:
    # conversion rate hyperparameters
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)

    # switchpoint prior
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)

    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)

    # NB: I removed pm.Deterministic b/c (a|b)_slope[0] is constant
    # and this causes issues when using ArviZ
    a_slope = 1 + (df.n.values/n_sat)*(a_gamma-1)
    b_slope = 1 + (df.n.values/n_sat)*(b_gamma-1)

    a = pm.math.switch(df.n.values >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(df.n.values >= n_sat, b_gamma, b_slope)

    # conversion rates
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b, shape=len(df.n))

    # observations
    pm.Binomial("observed", n=df.trials, p=p, observed=df.successes)

    trace = pm.sample(5000, tune=10000)
This samples nicely and yields reasonable intervals on the conversion rates, but the fact that the posteriors for alpha_n and beta_n go right up to your prior boundaries is a bit concerning.
I think the reason for this is that for each condition you only do 55-280 trials, and if the conditions were independent (worst case), conjugacy would tell us that your Beta hyperparameters should be in that range. Since you are doing a regression, the best-case scenario for information sharing across the trials would put your hyperparameters in the range of the sum of trials (788) - but that's an upper limit. Because you're outside this range, the concern is that you're forcing the model to be more precise in its estimates than you really have the evidence to support. However, one can justify this if the prior is based on strong independent evidence.
Otherwise, I'd suggest expanding the ranges on those priors that affect the final alpha*a and beta*b numbers (the sums of those should be close to your trial counts in the posterior).
Alternative Model
I'd probably do something along the following lines, which I think has a more transparent parameterization, though it's not completely identical to your model:
with pm.Model() as model_br_sp:
    # regression coefficients
    alpha = pm.Normal("alpha", mu=0, sd=1)
    beta = pm.Normal("beta", mu=0, sd=1)

    # saturation parameters
    saturation_point = pm.Gamma("saturation_point", alpha=20, beta=2)
    max_success_rate = pm.Beta("max_success_rate", 1, 9)

    # probability of conversion
    success_rate = pm.Deterministic(
        "success_rate",
        pm.math.switch(df.n.values > saturation_point,
                       max_success_rate,
                       max_success_rate*pm.math.sigmoid(alpha + beta*df.n)))

    # observations
    pm.Binomial("successes", n=df.trials, p=success_rate, observed=df.successes)

    trace_br_sp = pm.sample(draws=5000, tune=10000)
Here we map the predictor space to probability space through a sigmoid that maxes out at the maximum success rate. The prior on the saturation point is identical to yours, while that on the maximum success rate is weakly informative (Beta[1,9], though I will say it runs on a flat prior nearly as well). This also samples well and gives similar intervals (though the switchpoint seems to dominate more).
We can compare the two models and see that there isn't a significant difference in their explanatory power:
import arviz as az

model_compare = az.compare({'Binomial Regression w/ Switchpoint': trace_br_sp,
                            'Original Model': trace})
az.plot_compare(model_compare)

Summing until time-condition is reached in Python

I want to sum over a certain, but rolling, period within my dynamic model. The formal representation is as follows
A simple code snippet to run the equation is:
import numpy as np
import pandas as pd
import operator
year = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03
I tried subtracting list a from m_ using list(map(operator.sub, m_, a)) as found in another post.
My failed attempt looks something like this:
for t in year:
    for i in range(0, 3):
        while t < t+(list(map(operator.sub, m_, a))):
            L_[t] = sum(ARC_[i] / (1+r) ** t)
I'm not at all sure that I understood it right; I tried to base my answer on the equation. Even if it is still a bit off from the result you expect, it might help you solve your issue.
I create a result list to store each value of L[t], i.e. 50 values. Then, for every pair (t, i), I compute the start/stop of the sum and evaluate it.
import numpy as np
years = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03
result = []
for t in years:
    s = 0
    for i in range(3):
        t0 = t
        tf = t + m_[i] - a[i]
        for k in range(int(t0), int(tf+1)):
            s += ARC_[i] / (1+r) ** t
    result.append(s)
If what you wanted to do is to compute the difference element-wise between m and a, a simple solution is:
[m_[i] - a[i] for i in range(len(m_))]
Hope it helps.
