I was hoping for a bit of help to make my code run faster.
Basically I have a square grid of lat,long points in a list insideoceanlist. Then there is a directory containing data files of lat,long coords which represent lightning strikes for a particular day. The idea is that for each day, we want to know how many lightning strikes there were around each point on the square grid. At the moment it is just two for loops: for every point on the square grid, you check how far away every lightning strike was for that day. If it was within 40km, I add one to that point's count to build a density map.
The starting grid has the overall shape of a rectangle, made up of squares with width of 0.11 and length 0.11. The entire rectangle is about 50x30. Lastly I have a shapefile which outlines the 'forecast zones' in Australia, and if any point in the grid is outside this zone then we omit it. So all the leftover points (insideoceanlist) are the ones in Australia.
There are around 100000 points on the square grid and even for a slow day there are around 1000 lightning strikes, so it takes a long time to process. Is there a way to do this more efficiently? I really appreciate any advice.
By the way, I changed list2 into list3 because I heard that iterating over lists is faster than arrays in Python.
for i in range(len(list1)): #list1 is a list of data files containing lat,long coords for lightning strikes for each day
    dict_density = {}
    for k in insideoceanlist: #insideoceanlist is a grid of ~100000 lat,long points
        dict_density[k] = 0
    list2 = np.loadtxt(list1[i], delimiter=",") #this opens one of the files containing lat,long coords and puts it into an array
    list3 = map(list, list2) #converts the array into a list
    # the following part is what I wanted to improve
    for j in insideoceanlist:
        for l in list3:
            if great_circle(l, j).meters < 40000: #great_circle measures the distance between two lat,long points
                dict_density[j] += 1
    #
    filename = 'example' + str(i) + '.txt'
    with open(filename, 'w') as f:
        for m in range(len(insideoceanlist)):
            f.write('%s\n' % (dict_density[insideoceanlist[m]])) #writes each point in the same order as insideoceanlist
To elaborate a bit on @DanGetz's answer, here is some code that uses the strike data as the driver, rather than iterating the entire grid for each strike point. I'm assuming you're centered on Australia's median point, with 0.11 degree grid squares, even though the size of a degree varies by latitude!
Some back-of-the-envelope computation with a quick reference to Wikipedia tells me that your 40km distance is a ±4 grid-square range from north to south, and a ±5 grid-square range from east to west. (It drops to 4 squares in the lower latitudes, but ... meh!)
The tricks here, as mentioned, are to convert from strike position (lat/lon) to grid square in a direct, formulaic manner. Figure out the position of one corner of the grid, subtract that position from the strike, divide by the grid spacing (0.11 degrees), truncate, and you have your row/col indexes. Now visit all the surrounding squares until the distance grows too great, which is at most 1 + (2 * 2 * 4 * 5) = 81 squares checking for distance. Increment the squares within range.
The result is that I'm doing at most 81 visits times 1000 strikes (or however many you have) as opposed to visiting 100,000 grid squares times 1000 strikes. This is a significant performance gain.
Note that you don't describe your incoming data format, so I just randomly generated numbers. You'll want to fix that. ;-)
#!python3
"""
Per WikiPedia (https://en.wikipedia.org/wiki/Centre_points_of_Australia)
Median point
============
The median point was calculated as the midpoint between the extremes of
latitude and longitude of the continent.
24 degrees 15 minutes south latitude, 133 degrees 25 minutes east
longitude (24°15′S 133°25′E); position on SG53-01 Henbury 1:250 000
and 5549 James 1:100 000 scale maps.
"""
from geopy.distance import great_circle  # assuming the question's great_circle is geopy's

MEDIAN_LAT = -(24.00 + 15.00/60.00)
MEDIAN_LON = (133 + 25.00/60.00)
"""
From the OP:
The starting grid has the overall shape of a rectangle, made up of
squares with width of 0.11 and length 0.11. The entire rectangle is about
50x30. Lastly I have a shapefile which outlines the 'forecast zones' in
Australia, and if any point in the grid is outside this zone then we
omit it. So all the leftover points (insideoceanlist) are the ones in
Australia.
"""
DELTA_LAT = 0.11
DELTA_LON = 0.11
GRID_WIDTH = 50.0 # degrees
GRID_HEIGHT = 30.0 # degrees
GRID_ROWS = int(GRID_HEIGHT / DELTA_LAT) + 1
GRID_COLS = int(GRID_WIDTH / DELTA_LON) + 1
LAT_SIGN = 1.0 if MEDIAN_LAT >= 0 else -1.0
LON_SIGN = 1.0 if MEDIAN_LON >= 0 else -1.0
GRID_LOW_LAT = MEDIAN_LAT - (LAT_SIGN * GRID_HEIGHT / 2.0)
GRID_HIGH_LAT = MEDIAN_LAT + (LAT_SIGN * GRID_HEIGHT / 2.0)
GRID_MIN_LAT = min(GRID_LOW_LAT, GRID_HIGH_LAT)
GRID_MAX_LAT = max(GRID_LOW_LAT, GRID_HIGH_LAT)
GRID_LOW_LON = MEDIAN_LON - (LON_SIGN * GRID_WIDTH / 2.0)
GRID_HIGH_LON = MEDIAN_LON + (LON_SIGN * GRID_WIDTH / 2.0)
GRID_MIN_LON = min(GRID_LOW_LON, GRID_HIGH_LON)
GRID_MAX_LON = max(GRID_LOW_LON, GRID_HIGH_LON)
GRID_PROXIMITY_KM = 40.0
"""https://en.wikipedia.org/wiki/Longitude#Length_of_a_degree_of_longitude"""
_Degree_sizes_km = (
(0, 110.574, 111.320),
(15, 110.649, 107.551),
(30, 110.852, 96.486),
(45, 111.132, 78.847),
(60, 111.412, 55.800),
(75, 111.618, 28.902),
(90, 111.694, 0.000),
)
# For the Australia situation, +/- 15 degrees means that our worst
# case scenario is about 40 degrees south. At that point, a single
# degree of longitude is smallest, with a size about 80 km. That
# in turn means a 40 km distance window will span half a degree or so.
# Since grid squares are 0.11 degrees across, we have to check +/- 5
# cols.
GRID_SEARCH_COLS = 5
# Latitude degrees are nice and constant-like at about 110km. That means
# a 0.11 degree grid square is 12km or so, making our search range +/- 4
# rows.
GRID_SEARCH_ROWS = 4
def make_grid(rows, cols):
    return [[0 for col in range(cols)] for row in range(rows)]

Grid = make_grid(GRID_ROWS, GRID_COLS)

def _col_to_lon(col):
    return GRID_LOW_LON + (LON_SIGN * DELTA_LON * col)

Col_to_lon = [_col_to_lon(c) for c in range(GRID_COLS)]

def _row_to_lat(row):
    return GRID_LOW_LAT + (LAT_SIGN * DELTA_LAT * row)

Row_to_lat = [_row_to_lat(r) for r in range(GRID_ROWS)]
def pos_to_grid(pos):
    lat, lon = pos
    if lat < GRID_MIN_LAT or lat >= GRID_MAX_LAT:
        print("Lat limits:", GRID_MIN_LAT, GRID_MAX_LAT)
        print("Position {} is outside grid.".format(pos))
        return None
    if lon < GRID_MIN_LON or lon >= GRID_MAX_LON:
        print("Lon limits:", GRID_MIN_LON, GRID_MAX_LON)
        print("Position {} is outside grid.".format(pos))
        return None
    # Walk from the GRID_LOW corner in the same direction the Row_to_lat /
    # Col_to_lon tables do, so the indexes stay non-negative.
    row = int((lat - GRID_LOW_LAT) / (LAT_SIGN * DELTA_LAT))
    col = int((lon - GRID_LOW_LON) / (LON_SIGN * DELTA_LON))
    return (row, col)
def visit_nearby_grid_points(pos, dist_km):
    cell = pos_to_grid(pos)
    if cell is None:
        return
    row, col = cell

    def visit(r, c):
        """Bump the count at (r, c) if it exists and lies within dist_km of the strike."""
        if 0 <= r < GRID_ROWS and 0 <= c < GRID_COLS:
            gridpos = (Row_to_lat[r], Col_to_lon[c])
            # dist_km is in kilometres, great_circle().meters is in metres
            if great_circle(pos, gridpos).meters <= dist_km * 1000.0:
                Grid[r][c] += 1
                return True
        return False

    # +0, +0 is not symmetric - don't increment it four times
    visit(row, col)
    for dr in range(1, GRID_SEARCH_ROWS + 1):
        for dc in range(1, GRID_SEARCH_COLS + 1):
            hits = sum((visit(row + dr, col + dc),
                        visit(row + dr, col - dc),
                        visit(row - dr, col + dc),
                        visit(row - dr, col - dc)))
            if hits == 0:
                break
def get_pos_from_line(line):
    """
    FIXME: Don't know the format of your data, just random numbers
    """
    import random
    return (random.uniform(GRID_LOW_LAT, GRID_HIGH_LAT),
            random.uniform(GRID_LOW_LON, GRID_HIGH_LON))

with open("strikes.data", "r") as strikes:
    for line in strikes:
        pos = get_pos_from_line(line)
        visit_nearby_grid_points(pos, GRID_PROXIMITY_KM)
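If you need the counts written out in the same order as insideoceanlist, as in your original output loop, something like this sketch should work once the grid constants above match your actual grid (the filename just follows the question's 'example' + str(i) + '.txt' pattern for day 0):

with open("example0.txt", "w") as f:
    for point in insideoceanlist:
        cell = pos_to_grid(point)
        # points that fall outside the constants' bounds are written as 0
        f.write("%s\n" % (Grid[cell[0]][cell[1]] if cell else 0))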
If you know the formula that generates the points on your grid, you can probably find the closest grid point to a given point quickly by reversing that formula.
Below is a motivating example that isn't quite right for your purposes, because the Earth is a sphere, not flat or cylindrical. If you can't easily reverse the grid point formula to find the closest grid point, then maybe you can do the following:
create a second grid (let's call it G2) that is a simple formula like below, with big enough boxes such that you can be confident that the closest grid point to any point in one box will either be in the same box, or in one of the 8 neighboring boxes.
create a dict which stores which original grid (G1) points are in which box of the G2 grid
take the point p you're trying to classify, and find the G2 box it would go into
compare p to all the G1 points in this G2 box, and all the immediate neighbors of that box
choose the G1 point of these that's closest to p
Motivating example with a perfect flat grid
If you had a perfect square grid on a flat surface, that isn't rotated, with sides of length d, then their points can be defined by a simple mathematical formula. Their latitude values will all be of the form
lat0 + d * i
for some integer value i, where lat0 is the lowest-numbered latitude, and their longitude values will be of the same form:
long0 + d * j
for some integer j. To find what the closest grid point is for a given (lat, long) pair, you can separately find its latitude and longitude. The closest latitude number on your grid will be where
i = round((lat - lat0) / d)
and likewise j = round((long - long0) / d) for the longitude.
So one way you can go forward is to plug that in to the formulas above, and get
grid_point = (lat0 + d * round((lat - lat0) / d),
              long0 + d * round((long - long0) / d))
and just increment the count in your dict at that grid point. This should make your code much, much faster than before, because instead of checking thousands of grid points for distance, you directly found the grid point with a couple calculations.
You can probably make this a little faster still by using the i and j numbers as indexes into a multidimensional array, instead of using grid_point as a key into a dict.
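For instance, here is a minimal sketch of that multidimensional-array idea; the grid origin, spacing and dimensions below are illustrative assumptions, not values taken from your data:

import numpy as np

lat0, long0, d = -39.25, 108.42, 0.11        # assumed south-west corner and spacing
n_rows, n_cols = 274, 455                    # assumed grid dimensions
counts = np.zeros((n_rows, n_cols), dtype=int)

def add_strike(lat, long):
    # Reverse the grid formula: nearest row/column index for this strike
    i = int(round((lat - lat0) / d))
    j = int(round((long - long0) / d))
    if 0 <= i < n_rows and 0 <= j < n_cols:  # ignore strikes that fall off the grid
        counts[i, j] += 1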
Have you tried using Numpy for the indexing? You can use multi-dimensional arrays, and the indexing should be faster because Numpy arrays are essentially a Python wrapper around C arrays.
If you need further speed increases, take a look at Cython, a Python to optimized C converter. It is especially good for multi-dimensional indexing, and should be able to speed this type of code by about an order of magnitude. It'll add a single additional dependency to your code, but it's a quick install, and not too difficult to implement.
(Benchmarks), (Tutorial using Numpy with Cython)
Also as a quick aside, use
for listI in list1:
    ...
    list2 = np.loadtxt(listI, delimiter=',')
# or if that doesn't work, at least use xrange() rather than range()
Essentially, you should only ever use range() when you explicitly need the list generated by the range() function. In your case it won't change much, because it is the outer-most loop.
Related
I have a 3d point cloud (x,y,z) in a txt file. I want to calculate the 3d distance between each point and all the other points in the point cloud, and save the number of points having distance less than a threshold. I have done it in Python with the code shown below, but it takes too much time; I'm looking for something faster than what I have.
from math import sqrt
import numpy as np

points_list = []
with open("D:/Point cloud data/projection_test_data3.txt") as chp:
    for line in chp:
        x, y, z = line.split()
        points_list.append((float(x), float(y), float(z)))

j = 0
Final_density = 0
while j < len(points_list)-1:
    i = 0
    Density = 0
    while i < len(points_list) - 1:
        if sqrt((points_list[i][0] - points_list[j][0])**2 + (points_list[i][1] - points_list[j][1])**2 + (points_list[i][2] - points_list[j][2])**2) < 0.15:
            Density += 1
        i += 1
    Final_density = Density
    with open("D:/Point cloud data/my_density.txt", 'a') as g:
        g.write("{}\n".format(str(Final_density)))
    j += 1
One (quick) option that might speed this up is to change the position of the file writing/opening so that you're not opening/closing the file as many times.
from math import sqrt
import numpy as np

points_list = []
with open("D:/Point cloud data/projection_test_data3.txt") as chp:
    for line in chp:
        x, y, z = line.split()
        points_list.append((float(x), float(y), float(z)))

j = 0
Final_density = 0
with open("D:/Point cloud data/my_density.txt", 'a') as g:
    while j < len(points_list)-1:
        i = 0
        Density = 0
        while i < len(points_list) - 1:
            if sqrt((points_list[i][0] - points_list[j][0])**2 + (points_list[i][1] - points_list[j][1])**2 + (points_list[i][2] - points_list[j][2])**2) < 0.15:
                Density += 1
            i += 1
        Final_density = Density
        g.write("{}\n".format(str(Final_density)))
        j += 1
Since it looks like you can use numpy, why not use it? You'll have to make sure the arrays are numpy arrays, but that should be simple.
Change if sqrt(...) < 0.15: to if np.linalg.norm(points_list[j] - points_list[i]) < 0.15:
This post (Finding 3d distances using an inbuilt function in python) has other ways to use a prebuilt function to get the 3d distance in python.
Edit thanks to @KellyBundy's comment:
You can also use np.linalg.norm(points_list - points_list[:, None], axis=-1) to generate a matrix representing the distance between all points in the array. The diagonal will be 0 (the distance between each point and itself), and the matrix will be symmetric about the diagonal, so you can use just the upper triangle to read off the distance between any given pair of points. Again, you'll have to make all the points into a numpy array in the proper format, i.e. np.array([[point1x, point1y, point1z], [point2x, point2y, point2z], ...]) (https://stackoverflow.com/a/46700369/2391458).
resulting matrix of the form (where d(i, j) is the distance between points i and j):

[[0,      d(0,1),  d(0,2), ...],
 [d(1,0), 0,       d(1,2), ...],
 ...]
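As a rough sketch of that all-pairs route (assuming the whole file fits comfortably in memory, since the distance matrix is N x N; the file paths are the ones from the question):

import numpy as np

points = np.loadtxt("D:/Point cloud data/projection_test_data3.txt")   # shape (N, 3)
diffs = points[:, None, :] - points[None, :, :]                        # shape (N, N, 3)
dists = np.linalg.norm(diffs, axis=-1)                                 # pairwise distances
density = (dists < 0.15).sum(axis=1) - 1                               # drop the self-distance
np.savetxt("D:/Point cloud data/my_density.txt", density, fmt="%d")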
Quick 2x speed-up: replace i = 0 with i = j + 1. That way you would check each pair only once, not twice.
More fundamental change: you can sort the points by coordinate and use a sliding window algorithm. The idea is that if points are sorted by x coordinate, the j-th point has x=1, and the i-th point has x=1.01, then they might be near each other and you should check them. But if the i-th point has x=2, then it cannot be near the j-th point, and since the points are sorted, all points after the i-th one can be skipped (i.e. not checked against the j-th point).
If points are sparse, then it should significantly speed up function, and complexity would be O(n*log(n)) because of sorting.
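A minimal sketch of that sliding-window idea (assuming points_list holds (x, y, z) tuples as in the question; note the counts come back in sorted-by-x order, not file order):

from math import sqrt

def count_neighbours(points_list, threshold=0.15):
    pts = sorted(points_list)                # sort by x coordinate
    counts = [0] * len(pts)
    for j in range(len(pts)):
        xj, yj, zj = pts[j]
        i = j + 1
        # only points whose x is within the threshold can possibly be close
        while i < len(pts) and pts[i][0] - xj < threshold:
            xi, yi, zi = pts[i]
            if (xi - xj)**2 + (yi - yj)**2 + (zi - zj)**2 < threshold**2:
                counts[j] += 1
                counts[i] += 1
            i += 1
    return counts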
In the if, instead of taking the sqrt and comparing it with 0.15, compare the sum of squares directly with the square of 0.15, which is 0.0225. The result will be the same, and since sqrt is an expensive operation, not using it will save you time.
if (points_list[i][0] - points_list[j][0])**2 + (points_list[i][1] - points_list[j][1])**2 + (points_list[i][2] - points_list[j][2])**2 < 0.0225:
The Task
For a class in molecular dynamics, I have to simulate 100 particles in a box with periodic boundaries. I have to take particle-particle collisions into account; since the walls are 'transparent', those interactions can happen across the boundaries. Since the simulation should cover 50000 steps, and I'm expecting more and more additional tasks in the future, I want my code to be as efficient as possible (I have to use Python, despite the long run time).
The Setting
The system consists of
100 Particles
Box with x = 10, y = 5
Mass = 2
Radius = 0.2
Velocity |v| = 0.5 per step
Simulation of 50000 steps
What I've done so far
I have found this example for particles in a box with particle-particle collisions. Since the author used an efficient implementation, I took his approach.
My relevant code parts are (in strong resemblance to the linked site):
import numpy as np
from scipy.spatial.distance import pdist, squareform

class particlesInPeriodicBox:
    # define the particle properties
    def __init__(self,
                 initialState = [[0, 0, 0, 0]],  # state with form [x, y, vx, vy] for each particle
                 boundaries = [0, 10, 0, 5],     # box boundaries with form [xmin, xmax, ymin, ymax] in nm
                 radius = 0.2,                   # particle radius in nm
                 mass = 2):                      # mass in g/mol. Here a parameter instead of a global variable
        self.initialState = np.asarray(initialState, dtype=float)
        self.state = self.initialState.copy()
        self.radius = radius
        self.time = 0          # keep count of time, if time, i want to implement a 'clock'
        self.mass = mass       # mass
        self.boundaries = boundaries

    def collision(self):
        """
        Now one has to check for collisions. I do this by a distance check and will solve the problems below.
        To minimize the simulation time I then only consider the particles that collided.
        """
        dist = squareform(pdist(self.state[:, :2]))  # direct distance
        colPart1, colPart2 = np.where(dist < 2 * self.radius)  # define collision partners 1 and 2 as those whose centers are closer than two radii
        # resolve self-self collisions
        unique = (colPart1 < colPart2)
        colPart1 = colPart1[unique]
        colPart2 = colPart2[unique]
        """
        The following loop resolves the collisions. I zip the lists of collision partners to one array,
        where one entry contains both collision partners.
        """
        for cp1, cp2 in zip(colPart1, colPart2):  # cp1/cp2 are the two particles colliding in one collision.
            # masses could be different in future (then index per-particle masses here)
            m1 = self.mass
            m2 = self.mass
            # take the position (x,y) tuples for the two particles
            r1 = self.state[cp1, :2]
            r2 = self.state[cp2, :2]
            # same with velocities
            v1 = self.state[cp1, 2:]
            v2 = self.state[cp2, 2:]
            # get relative parameters
            r = r1 - r2
            v = v1 - v2
            # center of mass velocity:
            vcm = (m1 * v1 + m2 * v2) / (m1 + m2)
            """
            This is the part with the elastic collision
            """
            dotrr = np.dot(r, r)  # the dot product of the relative position with itself
            dotvr = np.dot(v, r)  # the dot product of the relative velocity with the relative position
            v = 2 * r * dotvr / dotrr - v  # new relative velocity
            """
            In this center of mass frame, the velocities 'reflect' on the center of mass in opposite directions
            """
            self.state[cp1, 2:] = vcm + v * m2/(m1 + m2)  # new velocity of particle 1, still allowing different masses
            self.state[cp2, 2:] = vcm - v * m1/(m1 + m2)  # new velocity of particle 2
As I understand it, this technique of applying the operations to whole arrays is more efficient than manually looping through them every time. Moving the particles 'through' the wall is easy: I just subtract or add the dimension of the box, respectively. But:
The Problem
For now the algorithm only sees collisions inside the box, but not across the boundaries. I have thought about this problem for a while now and came up with the following ideas:
I could make a total of 9 copies of this system in a 3x3 grid, and, only considering the middle one, look into the neighboring cells for the nearest-neighbor search. BUT I can't think of an effective way to implement this, despite the fact that this approach seems to be the standard way.
Every other idea has some hand-waving use of modulo, and I'm almost sure that this is not the way to go.
If I had to boil it down, I guess my key questions are:
How do I take periodic boundaries into account when calculating
the distance between particles?
the actual particle-particle collision (elastic) and resulting directions?
For the first problem it might be possible to use techniques like in Calculation of Contact/Coordination number with Periodic Boundary Conditions, but I'm not sure if that is the most efficient way.
Thank you!
Modulus is likely as quick an operation as you're going to get. In any self-respecting run-time system, this will attach directly to the on-chip floating-divide operations, which may well be faster than a tedious set of "if-subtract" pairs.
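For what it's worth, here is a rough sketch of how that modulus idea is commonly applied to pairwise distances (the minimum-image convention; np.round plays the role of a symmetric modulo, and the box size is the 10 x 5 from the question):

import numpy as np

def periodic_displacements(xy, box=(10.0, 5.0)):
    """Pairwise displacement vectors, wrapped to the nearest periodic image."""
    box = np.asarray(box)
    d = xy[:, None, :] - xy[None, :, :]   # raw pairwise differences, shape (N, N, 2)
    d -= box * np.round(d / box)          # wrap each component into [-box/2, box/2)
    return d

# usable in place of squareform(pdist(...)) on self.state[:, :2]:
# dist = np.linalg.norm(periodic_displacements(self.state[:, :2]), axis=-1)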
I think your 9-cell solution is overkill. Use 4 cells in a 2x2 matrix and check two regions: the original cell, and the same dimensions centered on the "four corners" point (middle of the 2x2). For any pair of points, the proper distance is the lesser of these two. Note that this method also gives you a frame in which you can easily calculate the momentum changes.
A third possible approach is to double the dimensions (a la the 2x2 above), but give each particle four sets of coordinates, one in each box. Alter your algorithms to consider all four when computing distance. If you have good vectorization packages and parallelism, this might be the preferred solution.
I have a numpy 2D array of values. Each element in the array represents a grid point from a grid where each box is 13km on a side. I need to determine the average value of all points within 50 miles of a specific point on the grid.
My current solution determines a bounding box and then references items in the array within that box using their indices, which is slow with numpy. I'm trying to determine a faster solution.
Current solution:
num_x = 400 # horizontal dimension of the 2D array
num_y = 300 # vertical dimension of the 2D array
num_dx = 6  # maximum number of horizontal grid points that fit within 50 miles
num_dy = 6  # same as above but for vertical (square grid)
radius_m = 80467.2 # 50 miles expressed in meters

values = [] # stores the extracted values
for ix in range(-num_dx, num_dx+1):
    for jy in range(-num_dy, num_dy+1):
        # Determine distance to this point
        dist = ((ix*dx)**2 + (jy*dy)**2)**0.5
        if dist <= radius_m:
            # Ensure this grid point actually exists within the grid
            if (j+jy) < num_y and (i+ix) < num_x:
                value = myarray[i+ix, j+jy]
                if value is not masked and value >= 0:
                    values.append(float(value))
average = sum(values) / float(len(values))
This is slow (takes about 1.5 seconds) due to accessing myarray over 100 times to extract the value of a single element. Is there a vector method that would work better here? I can't seem to figure out a way to do this with a mask since the conditional is based on the location of the grid point relative to another, not the value of the element itself.
Your code isn't runnable and seems to contain a bug for when i < num_dx or j < num_dy (then it wraps around to the other side of the array). But making some assumptions about your variable names, this is how I would do it:
# First make sure we stay in the grid
i1, i2 = max(i-num_dx, 0), min(i+num_dx+1, num_x)
j1, j2 = max(j-num_dy, 0), min(j+num_dy+1, num_y)
# Get the radius in blocks, grid should be homogeneous
radius_i = radius_m / 13000.0
# Calc distances per element by broadcasting
# (DX as a column and DY as a row so the mask lines up with myarray[i1:i2, j1:j2])
DX = (np.arange(i1, i2) - i)[:, None]
DY = np.arange(j1, j2) - j
mask = DX*DX + DY*DY <= radius_i*radius_i
# Get block of interest and apply mask
values = myarray[i1:i2, j1:j2][mask]
For interior points (where the radius doesn't extend outside your image), you can just compute a single mask that is used for any interior point. Start with an array of zeros:
mask = np.zeros((2 * num_dx + 1, 2 * num_dy + 1), dtype=int)
Assuming your point of interest is at the center of that array, set each element that falls within the radius to 1 (not shown here). Then,
indices = np.argwhere(mask.ravel() == 1)
Then for any interior element (i, j) in myarray, you would get the values within the radius like:
values = myarray[i-num_dx: i+num_dx+1, j-num_dy: j+num_dy+1].ravel()[indices]
For points near the border, you would make a copy of mask and set rows/cols outside the image to zero before setting indices.
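A minimal sketch of building that interior mask and its indices (the "set elements within the radius to 1" step that was not shown above), reusing num_dx = num_dy = 6, the 13 km spacing and the 80467.2 m radius from the question:

import numpy as np

num_dx = num_dy = 6
radius_cells = 80467.2 / 13000.0                       # 50-mile radius in grid cells
dy, dx = np.mgrid[-num_dy:num_dy + 1, -num_dx:num_dx + 1]
mask = (dx * dx + dy * dy <= radius_cells ** 2).astype(int)
indices = np.argwhere(mask.ravel() == 1)

# then, for an interior point (i, j):
# values = myarray[i-num_dx: i+num_dx+1, j-num_dy: j+num_dy+1].ravel()[indices]
# average = values.mean()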
I have a conceptual question on building a histogram on the fly with Python. I am trying to figure out if there is a good algorithm or maybe an existing package.
I wrote a function which runs a Monte Carlo simulation, gets called 1,000,000,000 times, and returns a 64-bit floating point number at the end of each run. Below is said function:
def MonteCarlo(df, head, span):
    # Pick initial truck
    rnd_truck = np.random.randint(0, len(df))
    full_length = df['length'][rnd_truck]
    full_weight = df['gvw'][rnd_truck]
    # Loop using other random trucks until the bridge is full
    while True:
        rnd_truck = np.random.randint(0, len(df))
        full_length += head + df['length'][rnd_truck]
        if full_length > span:
            break
        else:
            full_weight += df['gvw'][rnd_truck]
    # Return average weight per foot on the bridge
    return(full_weight/span)
df is a Pandas dataframe object having columns labeled as 'length' and 'gvw', which are truck lengths and weights, respectively. head is the distance between two consecutive trucks, span is the bridge length. The function randomly places trucks on the bridge as long as the total length of the truck train is less than the bridge length. Finally, it calculates the average weight of the trucks on the bridge per foot (total weight on the bridge divided by the bridge length).
As a result I would like to build a tabular histogram showing the distribution of the returned values, which can be plotted later. I had some ideas in mind:
Keep collecting the returned values in a numpy vector, then use existing histogram functions once the MonteCarlo analysis is completed. This would not be feasible, since if my calculation is correct, I would need 7.5 GB of memory for that vector alone (1,000,000,000 64-bit floats ~ 7.5 GB)
Initialize a numpy array with a given range and number of bins. Increase the number of items in the matching bin by one at the end of each run. The problem is, I do not know the range of values I would get. Setting up a histogram with a range and an appropriate bin size is an unknown. I also have to figure out how to assign values to the correct bins, but I think it is doable.
Do it somehow on the fly. Modify ranges and bin sizes each time the function returns a number. This would be too tricky to write from scratch I think.
Well, I bet there may be a better way to handle this problem. Any ideas would be welcome!
On a second note, I tested running the above function 1,000,000,000 times just to get the largest value that is computed (the code snippet is below). This takes around an hour when span = 200. The computation time would increase if I ran it for longer spans (the while loop runs longer to fill the bridge with trucks). Do you think there is a way to optimize this?
max_w = 0
i = 1
while i < 1000000000:
    w = MonteCarlo(df_basic, 15., 200.)  # store the result so the function is only called once per iteration
    if max_w < w:
        max_w = w
    i += 1
print max_w
Thanks!
Here is a possible solution, with fixed bin size, and bins of the form [k * size, (k + 1) * size). The function finalizebins returns two lists: one with bin counts (a), and the other (b) with bin lower bounds (the upper bound is deduced by adding binsize).
import math, random

def updatebins(bins, binsize, x):
    i = math.floor(x / binsize)
    if i in bins:
        bins[i] += 1
    else:
        bins[i] = 1

def finalizebins(bins, binsize):
    imin = min(bins.keys())
    imax = max(bins.keys())
    a = [0] * (imax - imin + 1)
    b = [binsize * k for k in range(imin, imax + 1)]
    for i in range(imin, imax + 1):
        if i in bins:
            a[i - imin] = bins[i]
    return a, b

# A test with a mixture of gaussian distributions
def check(n):
    bins = {}
    binsize = 5.0
    for i in range(n):
        if random.random() > 0.5:
            x = random.gauss(100, 50)
        else:
            x = random.gauss(-200, 150)
        updatebins(bins, binsize, x)
    return finalizebins(bins, binsize)

a, b = check(10000)

# This must be 10000
sum(a)

# Plot the data
from matplotlib.pyplot import *
bar(b, a)
show()
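If you can bound the range of values up front (your second idea), a fixed numpy bin array works too. A minimal sketch, with an assumed value range and bin count rather than anything derived from your data:

import numpy as np

lo, hi, nbins = 0.0, 20.0, 400          # assumed value range and bin count
counts = np.zeros(nbins, dtype=np.int64)
edges = np.linspace(lo, hi, nbins + 1)

def update_hist(x):
    k = int((x - lo) / (hi - lo) * nbins)
    if 0 <= k < nbins:                  # values outside [lo, hi) are dropped here
        counts[k] += 1

# for each run: update_hist(MonteCarlo(df, head, span))
# plot later with: bar(edges[:-1], counts, width=edges[1] - edges[0])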
I want to generate random points on the surface of a cylinder such that the distance between the points falls in a range of 230 to 250. I used the following code to generate random points on the surface of a cylinder:
import random, math

H = 300
R = 20
s = random.random()
#theta = random.random()*2*math.pi
for i in range(0, 300):
    theta = random.random()*2*math.pi
    z = random.random()*H
    r = math.sqrt(s)*R
    x = r*math.cos(theta)
    y = r*math.sin(theta)
    print 'C', x, y, z
How can I generate random points such that they fall within that range (on the surface of the cylinder)?
This is not a complete solution, but an insight that should help. If you "unroll" the surface of the cylinder into a rectangle of width w=2*pi*r and height h, the task of finding the distance between points is simplified. You have not explained how to measure "distance along the surface" between points on the top of the cylinder and the side; this is a slightly tricky bit of geometry.
As for computing the distance along the surface when we created an artificial "seam": just use both (x1-x2) and (w - x1 + x2); whichever gives the shorter distance is the one you want.
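A minimal sketch of that unrolled distance, assuming each point on the cylinder wall is given as (theta, z) with R = 20 as in the question:

import math

R = 20.0
w = 2 * math.pi * R                      # circumference of the unrolled rectangle

def surface_distance(p1, p2):
    """Distance along the cylinder wall between two (theta, z) points."""
    x1, z1 = p1[0] * R, p1[1]            # arc-length coordinate around the wall
    x2, z2 = p2[0] * R, p2[1]
    dx = abs(x1 - x2)
    dx = min(dx, w - dx)                 # take the shorter way around the seam
    return math.hypot(dx, z1 - z2)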
I do think that @VincentNivoliers' suggestion to use Poisson disk sampling is very good, but with the constraints of h=300 and r=20 you will get terrible results no matter what.
The basic way of creating a set of random points with constraints on the positions between them is to have a function that modulates the probability of points being placed at a certain location. This function starts out as a constant, and whenever a point is placed, the forbidden areas surrounding it are set to zero. That is difficult to do with continuous variables, but reasonably easy if you discretize your problem.
The other thing to be careful about is the being-on-a-cylinder part. It may be easier to think of it as random points on a rectangular area that repeats periodically. This can be handled in two different ways:
The simplest is to take into consideration not only the rectangular tile where you are placing the points, but also its neighbouring ones. Whenever you place a point in your main tile, you also place one in the neighboring ones and compute their effect on the probability function inside your tile.
A more sophisticated approach considers the probability function as the convolution of a kernel that encodes the forbidden areas with a sum of delta functions corresponding to the points already placed. If this is computed using FFTs, the periodicity is a natural by-product.
The first approach can be coded as follows:
from __future__ import division
import numpy as np

r, h = 20, 300
w = 2*np.pi*r
int_w = int(np.rint(w))
mult = 10
pdf = np.ones((h*mult, int_w*mult), bool)
points = []
min_d, max_d = 230, 250
available_locs = pdf.sum()
while available_locs:
    new_idx = np.random.randint(available_locs)
    new_idx = np.nonzero(pdf.ravel())[0][new_idx]
    new_point = np.array(np.unravel_index(new_idx, pdf.shape))
    points += [new_point]
    min_mask = np.ones_like(pdf)
    if max_d is not None:
        max_mask = np.zeros_like(pdf)
    else:
        max_mask = True
    for p in [new_point - [0, int_w*mult], new_point + [0, int_w*mult],
              new_point]:
        rows = ((np.arange(pdf.shape[0]) - p[0]) / mult)**2
        cols = ((np.arange(pdf.shape[1]) - p[1]) * 2*np.pi*r/int_w/mult)**2
        dist2 = rows[:, None] + cols[None, :]
        min_mask &= dist2 > min_d*min_d
        if max_d is not None:
            max_mask |= dist2 < max_d*max_d
    pdf &= min_mask & max_mask
    available_locs = pdf.sum()
points = np.array(points) / [mult, mult*int_w/(2*np.pi*r)]
If you run it with your values, the output is usually just one or two points, as the large minimum distance forbids all others. But if you run it with more reasonable values, e.g.
min_d, max_d = 50, 200
Here's how the probability function looks after placing each of the first 5 points:
Note that the points are returned as pairs of coordinates, the first being the height, the second the distance along the cylinder's circumference.