I am using Python, but since I am a noob I can't figure out how to compute the average of a vector over each chunk of, let's say, 100 elements inside a larger for-loop.
My attempt so far, which is not what I want, is:
import numpy as np

r = np.zeros(10000)  # declare my vector
for i in range(0, 2000):  # start the loop
    r[i] = i**2  # some function to compute and save
    if (i % 100 == 0):  # each time I save 100 elements I want the mean
        av_r = np.mean(r)
        print(av_r)
My code does not do what I want because I would like to take the average of 100 elements only, then move on to the next 100, compute their mean, and so on.
I tried to reduce the dimension of the vector and clear it inside the if:
import numpy as np

r = np.zeros(100)  # declare my vector
for i in range(0, 2000):  # start the loop
    r[i] = i**2  # some function to compute and save
    if (i % 100 == 0):  # each time I save 100 elements I want the mean
        av_r = np.mean(r)
        print(av_r)
        r = np.zeros(100)
Naively, I thought I could save 100 elements, compute the average, clear the vector, and continue the calculation saving the next elements from 100+1 to 200+1, but it gives me an error. In particular:
IndexError: index 100 is out of bounds for axis 0 with size 100
Many thanks for your help.
Is this what you're looking for? This code will iterate from 0 to 2000 in intervals of 100, mapping some function (x -> x**2) over each interval, calculating the mean and printing the result.
import numpy as np

r = np.zeros(10000)
for i in range(0, 2000, 100):
    interval = [x ** 2 for x in r[i:i + 100]]
    av_r = np.mean(interval)
    print(av_r)
The output from this is just a series of twenty 0.0 values, since r is initialized to zeros and squaring zeros still gives zeros.
The error you have probably encountered is an array index out of bounds (IndexError: index 100 is out of bounds for axis 0 with size 100), because your index ranges from 0 to 1999 and you're doing
r[i] = i**2 # some function to compute and save
on an array of size 100.
Fix:
r[i%100] = i**2 # some function to compute and save
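Putting the modulo fix together with the rest of the loop, a minimal sketch (not part of the original code) that prints one mean per block of 100 values could look like this:

import numpy as np

r = np.zeros(100)          # buffer for the current block of 100 values
for i in range(2000):
    r[i % 100] = i**2      # write into this block's slot
    if i % 100 == 99:      # the buffer is full: values for i-99 .. i are stored
        print(np.mean(r))  # mean of this block of 100 elements

This prints 20 means, one per block, and reuses the same 100-element buffer instead of growing r.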
Related
I am struggling with implementing a Linear Programming (LP) problem in scipy.optimize.linprog. I've gotten help formulating it here, so it is already in the standard form; the problem should be maximized.
I think the easiest would be to look at the answer in that thread (since it is possible to write in LaTeX on the math forum) and then compare it to my implementation here in Python. If I should write it out here, please let me know.
The problem is using the notation:
min. c^T*x s.t. Hx = d, l <= x <= u.
(I seek to maximize)
import numpy as np
import pandas as pd
import math

# First I create the price array to use in 'c' below.
date = pd.date_range(
    start='2020-01-01',
    freq='H',
    periods=120,
    tz='Europe/Berlin',
    inclusive='left')
forecast = pd.DataFrame({
    'date': date})
forecast['price'] = 50*(1 - np.sin(
    2*math.pi*forecast.date.dt.hour/24))
forecast.set_index('date', inplace=True)

# Then creating the vector 'c'
c1 = np.zeros(120)
c2 = np.array(forecast)
c2 = c2.reshape(120,)
c = np.dstack((c1, c2)).flatten()
c = np.concatenate((c2, c1))

# Creating the matrix 'H'
H = np.zeros((120, 240))
for i, p in zip(range(0, 238, 2), range(120)):
    for j in range(i, i+3):
        if j - i < 2:
            H[p][j] = -1
        else:
            H[p][j] = 1

# Create the vector 'd'
d = np.zeros(120)

# Create the bounds
bounds = [(None, None)] * 240
for i in range(240):
    if i < 1:
        bounds[i] = (0, 0)
    elif i < 120:
        bounds[i] = (0, 3)
    else:
        bounds[i] = (-1, 1)

# Run the solver
from scipy.optimize import linprog
c = c
A_eq = H
b_eq = d
bounds = bounds
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x = res.x
Things I am not sure of:
If the vector c should be of this shape, alternating between 0 and the forecast value.
If it is correct to let the values -1, -1, 1 jump two steps to the right for each row so they fill the full diagonal?
If the bounds can be defined like this, N*2 tuples where the first N tuples are L_0,...L_n, and the last N tuples are A_0,...A_n.
Further, I am not sure in which ways the order of the matrix and the vectors relates to each other. For example, how does the scipy solver know that a bound in bounds_i relates to the constraint in H_ij?
Currently, the result shows that x is alternating between -0 and 0 for the first 120 rows, and then it starts to fluctuate in a non-maximizing way. I have tried to structure the vectors and the matrix differently without success.
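(Not part of the original question, but since linprog always minimizes, a minimal toy sketch of how its arguments map onto the min c^T x, Hx = d, l <= x <= u notation may help; maximizing means passing -c and negating the reported objective. The numbers below are made up for illustration.)

import numpy as np
from scipy.optimize import linprog

# Toy problem: maximize c^T x  subject to  Hx = d,  l <= x <= u.
c = np.array([1.0, 2.0])
H = np.array([[1.0, 1.0]])   # single equality constraint: x0 + x1 = 1
d = np.array([1.0])
bounds = [(0, 1), (0, 1)]    # bounds[i] constrains x[i], independently of H

res = linprog(-c, A_eq=H, b_eq=d, bounds=bounds)  # negate c to maximize
print(res.x, -res.fun)       # optimum x = [0, 1], objective value 2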
Given an array x of length 1000, and y of length 500k, we can compute the index k for which x is the closest to "y-shifted by k indices":
mindistance = np.inf  # infinity
for k in range(len(y)-1000):
    t = np.sum(np.power(x - y[k:k+1000], 2))
    if t < mindistance:
        mindistance = t
        index = k
print(index)
# x is close to y[index:index+N]
According to my tests, this seems to be computationally costly. Is there a clever numpy way to compute it faster?
Note: it seems that if I reduce the length of x from 1000 to 100, it doesn't change the computation time much. The slowness seems to come mostly from the for k in range(...) loop. How can I speed it up?
This can be done with np.correlate which computes not the coefficient of correlation (as one might guess), but simply the sum of products like x[n]*y[m] (here m is n plus some shift). Since
(x[n] - y[m])**2 = x[n]**2 - 2*x[n]*y[m] + y[m]**2
we can get the sum of squares of differences from this, by adding the sums of squares of x and of a part of y. (Actually, the sum of x[n]**2 will not depend on the shift, since we'll always get just np.sum(x**2), but I'll include it all the same.) The sum of a part of y**2 can also be found in this way, by replacing x with an all-ones array of the same size, and y with y**2.
Here is an example.
import numpy as np
x = np.array([3.1, 1.2, 4.2])
y = np.array([8, 5, 3, -2, 3, 1, 4, 5, 7])
diff_sq = np.sum(x**2) - 2*np.correlate(y, x) + np.correlate(y**2, np.ones_like(x))
print(diff_sq)
This prints [39.89 45.29 11.69 39.49 0.09 12.89 23.09] which are indeed the required distances from x to various parts of y. Pick the smallest with argmin.
A little benchmark in addition to user6655984's wonderful answer:
import numpy as np
import time

x = np.random.rand(1000)      # random array of size 1k
y = np.random.rand(100*1000)  # random array of size 100k

print("Naive method")
start = time.time()
mindistance = np.inf
for k in range(len(y)-1000):
    t = np.sum(np.power(x - y[k:k+1000], 2))
    if t < mindistance:
        mindistance = t
        index = k
print(index, mindistance)
print("%.2f seconds\n" % (time.time() - start))

print("Correlation method")
start = time.time()
diff_sq = np.sum(x**2) - 2*np.correlate(y, x) + np.correlate(y**2, np.ones_like(x))
i = np.argmin(diff_sq)
print(i, diff_sq[i])
print("%.2f seconds\n" % (time.time() - start))
We get a ~145x speedup :)
Naive method
60911 143.6153965841267
8.75 seconds
Correlation method
60911 143.6153965841267
0.06 seconds
The minimum of the SSD ("sum of squared differences") distance corresponds to the maximum of the correlation.
Correlations can be computed efficiently (in O(N log N) time instead of O(NM)) thanks to the famous FFT.
With N=1000 and M=500000 you can expect a speedup.
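This answer gives no code; a possible sketch of the FFT route (using scipy.signal.fftconvolve together with the decomposition from the previous answer, not taken from the original post) is:

import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(500_000)

# FFT-based cross-correlation: same values as np.correlate(y, x, mode='valid')
cross = fftconvolve(y, x[::-1], mode='valid')
# sliding-window sum of y**2 over windows of len(x), also via FFT
win_sq = fftconvolve(y**2, np.ones(len(x)), mode='valid')

ssd = np.sum(x**2) - 2*cross + win_sq   # sum of squared differences for every shift k
k = int(np.argmin(ssd))
print(k, ssd[k])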
I have written this python code to get neighbours of a label (a set of pixels sharing some common properties). The neighbours for a label are defined as the other labels that lie on the other side of the boundary (the neighbouring labels share a boundary). So, the code I wrote works but is extremely slow:
# segments: It is a 2-dimensional numpy array (an image really)
# where segments[x, y] = label_index. So each entry defines the
# label associated with a pixel.
# i: The label whose neighbours we want.
def get_boundaries(segments, i):
    neighbors = []
    for y in range(1, segments.shape[1]):
        for x in range(1, segments.shape[0]):
            # Check if current index has the label we want
            if segments[x-1, y] == i:
                # Check if neighbour in the x direction has
                # a different label
                if segments[x-1, y] != segments[x, y]:
                    neighbors.append(segments[x, y])
            # Check if neighbour in the y direction has
            # a different label
            if segments[x, y-1] == i:
                if segments[x, y-1] != segments[x, y]:
                    neighbors.append(segments[x, y])
    return np.unique(np.asarray(neighbors))
As you can imagine, I have probably completely misused python here. I was wondering if there is a way to optimize this code to make it more pythonic.
Here you go:
def get_boundaries2(segments, i):
    x, y = np.where(segments == i)  # where i is
    right = x + 1
    rightMask = right < segments.shape[0]  # keep in bounds
    down = y + 1
    downMask = down < segments.shape[1]
    rightNeighbors = segments[right[rightMask], y[rightMask]]
    downNeighbors = segments[x[downMask], down[downMask]]
    neighbors = np.union1d(rightNeighbors, downNeighbors)
    return neighbors
As you can see, there are no Python loops at all; I also tried to minimize copies (the first attempt made a copy of segments with a NAN border, but then I devised the "keep in bounds" check).
Note that I did not filter out i itself from the "neighbors" here; you can add that easily at the end if you want. Some timings:
Input 2000x3000: original takes 13 seconds, mine takes 370 milliseconds (35x speedup).
Input 1000x300: original takes 643 ms, mine takes 17.5 ms (36x speedup).
You need to replace your for loops with numpy's implicit looping.
I don't know enough about your code to convert it directly, but I can give an example.
Suppose you have an array of 100000 random integers, and you need to get an array of each element divided by its neighbor.
import random, numpy as np
a = np.fromiter((random.randint(1, 100) for i in range(100000)), int)
One way to do this would be:
[a[i] / a[i+1] for i in range(len(a)-1)]
Or this, which is much faster (note that np.roll wraps around, so the last entry divides a[-1] by a[0]; slice with [:-1] if you need an exact match to the list comprehension):
a / np.roll(a, -1)
Timeit:
initcode = 'import random, numpy as np; a = np.fromiter((random.randint(1, 100) for i in range(100000)), int)'
timeit.timeit('[a[i] / a[i+1] for i in range(len(a)-1)]', initcode, number=100)
5.822079309000401
timeit.timeit('(a / np.roll(a, -1))', initcode, number=100)
0.1392055350006558
I have a large, symmetric, 2D distance array. I want to get closest N pairs of observations.
The array is stored as a numpy condensed array, and has on the order of 100 million observations.
Here's an example to get the 100 closest distances on a smaller array (~500k observations), but it's a lot slower than I would like.
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)
It seems to me like there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.
NB If lots of pairs are equidistant, that's OK and I don't care about their ordering in that case.
You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)
for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)
You can also vectorize the creation of r:
r = zip(ti[0][closest] + 1, ti[1][closest] + 1)
or
r = np.vstack(ti)[:, closest] + 1
You can speed up the location of the minimum values very notably if you are using numpy 1.8 or later, using np.partition:
def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000*999//2)  # a pdist array
In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True
In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop
In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop
And once you have the smallest indices, you don't need a loop to extract the indices, do it in a single shot:
closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
The best solution probably won't generate all of the distances.
Proposal:
Make a heap of max size 100 (if it grows bigger, reduce it).
Use the Closest Pair algorithm to find the closest pair.
Add the pair to the heap (priority queue).
Choose one of that pair. Add its 99 closest neighbors to the heap.
Remove the chosen point from the list.
Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you ran the Closest Pair algorithm.
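The proposal above is prose only; a minimal sketch of just the bounded-heap bookkeeping (with brute-force pair generation standing in for the closest-pair search, so it is illustrative rather than fast) might look like:

import heapq

def n_smallest_pairs(points, n_keep=100):
    # Keep the n_keep smallest pairwise (cityblock) distances in a max-heap
    # (distances are stored negated), evicting the largest kept pair first.
    heap = []  # entries are (-distance, i, j)
    m = len(points)
    for i in range(m):
        for j in range(i + 1, m):
            d = abs(points[i] - points[j])
            if len(heap) < n_keep:
                heapq.heappush(heap, (-d, i, j))
            elif d < -heap[0][0]:          # closer than the farthest kept pair
                heapq.heapreplace(heap, (-d, i, j))
    return sorted((-nd, i, j) for nd, i, j in heap)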
I want to generate a bunch of (x, y) coordinates from 0 to 2500 that excludes points within 200 of each other, without recursion.
Right now I have it check through a list of all previous values to see if any are far enough from all the others. This is really inefficient and if I need to generate a large number of points it takes forever.
So how would I go about doing this?
This is a variant on Hank Ditton's suggestion that should be more efficient time- and memory-wise, especially if you're selecting relatively few points out of all possible points. The idea is that, whenever a new point is generated, everything within 200 units of it is added to a set of points to exclude, against which all freshly-generated points are checked.
import random

radius = 200
rangeX = (0, 2500)
rangeY = (0, 2500)
qty = 100  # or however many points you want

# Generate a set of all points within 200 of the origin, to be used as offsets later
# There's probably a more efficient way to do this.
deltas = set()
for x in range(-radius, radius+1):
    for y in range(-radius, radius+1):
        if x*x + y*y <= radius*radius:
            deltas.add((x, y))

randPoints = []
excluded = set()
i = 0
while i < qty:
    x = random.randrange(*rangeX)
    y = random.randrange(*rangeY)
    if (x, y) in excluded: continue
    randPoints.append((x, y))
    i += 1
    excluded.update((x+dx, y+dy) for (dx, dy) in deltas)
print(randPoints)
I would overgenerate the points (target_N < input_N) and filter them using a KDTree. For example:
import numpy as np
from scipy.spatial import KDTree

N = 20
pts = 2500*np.random.random((N, 2))

tree = KDTree(pts)
print(tree.sparse_distance_matrix(tree, 200))
Would give me points that are "close" to each other. From here it should be simple to apply any filter:
(11, 0) 60.843426339
(0, 11) 60.843426339
(1, 3) 177.853472309
(3, 1) 177.853472309
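One simple (if greedy and somewhat wasteful) way to apply such a filter, sketched here rather than taken from the answer, is to drop the higher-indexed point of every close pair reported by query_pairs:

import numpy as np
from scipy.spatial import KDTree

N = 200                                        # overgenerate: more candidates than needed
pts = 2500*np.random.random((N, 2))
tree = KDTree(pts)
drop = {j for _, j in tree.query_pairs(200)}   # pairs (i, j), i < j, within 200 of each other
kept = pts[[i for i in range(N) if i not in drop]]
print(len(kept), "points kept, none within 200 of another")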
Some options:
Use your algorithm but implement it with a kd-tree that would speed up nearest neighbours look-up
Build a regular grid over the [0, 2500]^2 square and 'shake' all points randomly with a bi-dimensional normal distribution centered on each intersection in the grid (a sketch of this option appears after this list)
Draw a larger number of random points then apply a k-means algorithm and only keep the centroids. They will be far away from one another and the algorithm, though iterative, could converge more quickly than your algorithm.
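A sketch of the second option (regular grid plus jitter); the suggestion above uses a normal distribution, but a bounded uniform jitter is assumed here so the 200-unit minimum spacing holds by construction (grid pitch 250 minus at most 2*20 of jitter leaves at least 210 between neighbours):

import numpy as np

pitch, jitter = 250, 20
rng = np.random.default_rng()

gx, gy = np.meshgrid(np.arange(0, 2500, pitch), np.arange(0, 2500, pitch))
pts = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
pts += rng.uniform(-jitter, jitter, size=pts.shape)   # 'shake' every grid intersection
pts = np.clip(pts, 0, 2500)                           # keep points inside the square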
This has been answered, but it's very tangentially related to my work so I took a stab at it. I implemented the algorithm described in this note which I found linked from this blog post. Unfortunately it's not faster than the other proposed methods, but I'm sure there are optimizations to be made.
import numpy as np
import matplotlib.pyplot as plt

def lonely(p, X, r):
    m = X.shape[1]
    x0, y0 = p
    x = y = np.arange(-r, r)
    x = x + x0
    y = y + y0
    u, v = np.meshgrid(x, y)
    u[u < 0] = 0
    u[u >= m] = m-1
    v[v < 0] = 0
    v[v >= m] = m-1
    return not np.any(X[u[:], v[:]] > 0)

def generate_samples(m=2500, r=200, k=30):
    # m = extent of sample domain
    # r = minimum distance between points
    # k = samples before rejection
    active_list = []
    # step 0 - initialize n-d background grid
    X = np.ones((m, m))*-1
    # step 1 - select initial sample
    x0, y0 = np.random.randint(0, m), np.random.randint(0, m)
    active_list.append((x0, y0))
    X[active_list[0]] = 1
    # step 2 - iterate over active list
    while active_list:
        i = np.random.randint(0, len(active_list))
        rad = np.random.rand(k)*r + r
        theta = np.random.rand(k)*2*np.pi
        # get a list of random candidates within [r,2r] from the active point
        candidates = np.round((rad*np.cos(theta) + active_list[i][0],
                               rad*np.sin(theta) + active_list[i][1])).astype(np.int32).T
        # trim the list based on boundaries of the array
        candidates = [(x, y) for x, y in candidates if x >= 0 and y >= 0 and x < m and y < m]
        for p in candidates:
            if X[p] < 0 and lonely(p, X, r):
                X[p] = 1
                active_list.append(p)
                break
        else:
            del active_list[i]
    return X

X = generate_samples(2500, 200, 10)
s = np.where(X > 0)
plt.plot(s[0], s[1], '.')
And the results (a scatter plot of the accepted sample points; the image is not reproduced here):
Per the link, the method from aganders3 is known as Poisson disc sampling. You might be able to find more efficient implementations that use a local grid search to find 'overlaps' (for example, other Poisson disc sampling implementations). Because you are constraining the system, it cannot be completely random. The maximum packing for circles with uniform radii in a plane is ~90% and is achieved when the circles are arranged in a perfect hexagonal array. As the number of points you request approaches the theoretical limit, the generated arrangement will become more hexagonal. In my experience, it is difficult to get above ~60% packing with uniform circles using this approach.
The following method uses a list comprehension; I am generating integers here, but you can use different random generators for other data types:
import random
arr = [[random.randint(-4, 4), random.randint(-4, 4)] for i in range(40)]