Localized random points using numpy and pandas - python

My idea is to generate random 2D data points (x and y coordinates) that lie in close proximity to one another, mimicking the following scenario:
I choose e.g. 10 points on one object.
There are 200 such objects in a database.
I record the coordinates of 10 points on the same locations on all the objects. So the data I have consists of 200x10 rows, so that first 10 rows represent coordinates of 10 points sampled on the first object, the next 10 represent the same points on the second object, and so on.
Collections of points belonging to the same object should be close together in the scatterplot, but they should not be exactly the same, or too far apart. If I use plain random generators, most of the time I end up with a lot of evenly spaced random points...
This is the procedure I've tried using numpy, pandas and matplotlib, along with a neat use of the multivariate normal from this post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import brewer2mpl as bmpl
# the part of the code I use for generating correlated ranges for points
# (I also tried using it to generate the x, y coords directly, but it didn't work out)
corr = 0.95
means = [200, 180]
stds = [10, 10]
covs = [[stds[0]**2, stds[0]*stds[1]*corr],[stds[0]*stds[1]*corr, stds[1]**2]]
coordstest = np.random.multivariate_normal(means, covs, 20)
#now the part for generating x and y coords
coords1x = np.random.uniform(coordstest[0,0], coordstest[0,1], 200)
coords1y = np.random.uniform(coordstest[1,0], coordstest[1,1], 200)
coords2x = np.random.uniform(coordstest[2,0], coordstest[2,1], 200)
coords2y = np.random.uniform(coordstest[3,0], coordstest[3,1], 200)
... up to 10
# then make them into two-column arrays
coords1 = np.vstack((coords1x, coords1y)).T
coords2 = np.vstack((coords2x, coords2y)).T
... up to 10
#and generate individual levels
individuals = np.arange(0,200) #generate individual levels
individuals = np.tile(individuals, 10)
individuals = pd.Series(individuals)
#finally generate pandas data frame and plot the results
allCoords = np.concatenate((coords1, coords2, coords3, coords4, coords5, coords6, coords7, coords8, coords9, coords10))
allCoords = pd.DataFrame(allCoords)
allCoords.columns = ['x','y']
allCoords['individuals'] = individuals
allCoords['index'] = allCoords.index.tolist()
allCoords = allCoords.sort_values(by=['individuals', 'index'])  # sort_index(by=...) in older pandas versions
del allCoords['index']
allCoords = allCoords.set_index(np.arange(0,2000))
plt.scatter(allCoords['x'], allCoords['y'], c = allCoords['individuals'], s = 40, cmap = 'hot')
This is the resulting scatter plot (figure omitted); the same colored points should be grouped locally. Any ideas how this could be accomplished?

In fact you generate normally distributed interval bounds and then uniformly distributed points within them, so not surprisingly you end up with groups of points that are not co-located.
To get co-located groups of points, you should first choose the expected locations:
coordstest = np.vstack([np.random.uniform(150, 220, 20),
                        np.random.uniform(150, 220, 20)]).T
Then generate points according to them:
coords = np.vstack([np.random.multivariate_normal(coordstest[i, :], covs, 200)
                    for i in range(10)])
And plot
individuals = (np.arange(0,200).reshape(-1,1)*np.ones(10).reshape(1,-1)).flatten()
individuals = pd.Series(individuals)
allCoords = pd.DataFrame(coords, columns = ['x','y'])
plt.scatter(allCoords['x'], allCoords['y'], c = individuals,
            s = 40, cmap = 'hot')
Note that the points are generated with a linear dependency due to the nontrivial covariance parameter passed to multivariate_normal. If you don't need it, you can for example do
coords = np.vstack([np.random.multivariate_normal(coordstest[i, :],
                    [[10, 0], [0, 10]], 200) for i in range(10)])
resulting in round, uncorrelated clusters (figure omitted).
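For reference, here is a minimal self-contained sketch that stitches the pieces of this answer together (the ranges and counts simply mirror the question's setup of 200 objects with 10 points each; adjust as needed):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

n_objects, n_points = 200, 10   # 200 objects, 10 sampled points per object
# one expected location per sampled point
centers = np.vstack([np.random.uniform(150, 220, n_points),
                     np.random.uniform(150, 220, n_points)]).T
cov = [[10, 0], [0, 10]]        # isotropic spread around each center

# one cluster of 200 samples (one per object) around each of the 10 centers
coords = np.vstack([np.random.multivariate_normal(centers[i], cov, n_objects)
                    for i in range(n_points)])

df = pd.DataFrame(coords, columns=['x', 'y'])
df['individual'] = np.tile(np.arange(n_objects), n_points)  # object id for each row

plt.scatter(df['x'], df['y'], c=df['individual'], s=40, cmap='hot')
plt.show()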

Related

Sampling from two different normal distributions at specified probabilities

I'm trying to create a simulation that samples from two different normal distributions at specified probabilities. I want the simulation to draw a new value from the chosen distribution on each iteration. I created the code below, but it picks one random value from each distribution and then reuses those two values across all 50 simulations. How can I get new values from each distribution during each iteration of the simulation?
import numpy as np
from numpy.random import normal
number_simulations = 50
P1 = normal(loc=75, scale=5)
P2 = normal(loc=25, scale=5)
elements = [P1, P2]
probabilities = [.80, .20]
simulation = np.random.choice(elements, number_simulations, p=probabilities)
print(simulation)
[26.40889965 71.60833802 71.60833802 26.40889965 71.60833802, etc]
You could generate all 50 samples per distribution up front using size. Then use np.random.choice to pick either index 0 of elements (P1) or index 1 (P2) according to the probabilities, and call np.random.choice again on the selected array to draw a value. A list comprehension generates your 50 simulations.
import numpy as np
from numpy.random import normal
number_simulations = 50
P1 = normal(loc=75, scale=5, size=number_simulations)
P2 = normal(loc=25, scale=5, size=number_simulations)
elements = [P1, P2]
probabilities = [.80, .20]
[np.random.choice(elements[np.random.choice([0,1], p=probabilities)]) for x in range(number_simulations)]
Maybe a bit smoother would be to generate 50 samples with mean 50 and then either add or subtract 25 according to the probabilities:
import numpy as np
number_simulations = 50
probabilities = [.20, .80]
x = np.random.normal(loc = 50, scale = 5, size = number_simulations)
a = np.random.choice([-25,25], p = probabilities, size = number_simulations)
print(list(x+a))
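For what it's worth, the same mixture can also be sampled fully vectorized, with a fresh draw on every iteration, by choosing a mean per simulation and then sampling around it (a sketch, not from the original answers):
import numpy as np

number_simulations = 50
# pick the mean (75 or 25) independently for each simulation, then sample around it
means = np.random.choice([75, 25], size=number_simulations, p=[.80, .20])
simulation = np.random.normal(loc=means, scale=5)
print(simulation)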

What is the best way/method to digitize the data of a 3D surface into a grid of pixels with smaller resolution in Python?

I want to digitize (= average out over cells) photon count data into pixels given by a grid that tells how they are aligned. The photon count data is stored in a 2D array. I want to split that data into cells, each of which would correspond to a pixel. The idea is basically the same as changing an HD image to a smaller resolution. I'd like to achieve this in Python.
The digitizing function I've written:
import numpy as np
def digitize(function_data, grid_shape):
    """
    function_data: 2D array of function values of some 3D shape,
                   e.g. exp(-(x^2 + y^2)) -> want to digitize this
    grid_shape: an array of length 2 which contains the dimensions of the smaller-resolution grid
    """
    l = len(function_data)
    pixel_len_x = int(l / grid_shape[0])
    pixel_len_y = int(l / grid_shape[1])
    digitized_data = np.empty((grid_shape[0], grid_shape[1]))
    for i in range(grid_shape[0]):          # row-index of pixel in smaller-resolution grid
        for j in range(grid_shape[1]):      # column-index of pixel in smaller-resolution grid
            hd_pixel = []
            for k in range(pixel_len_y):
                hd_pixel.append(function_data[k][j:j*pixel_len_x])
            hd_pixel = np.ravel(hd_pixel)   # turns 2D array into 1D to be able to compute the average
            pixel_avg = np.average(hd_pixel)
            digitized_data[i][j] = pixel_avg
    return digitized_data
In theory, this function should do what I want to achieve, but when tested it doesn't yield the expected results. Either a completed version of my function or any other method that achieves my goal would be extremely helpful.
You could also use an interpolation function, if you can use SciPy. Here we use one of the gridded-data interpolating functions, RectBivariateSpline, to upsample your function, but you can find numerous examples on this and other sites.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import RectBivariateSpline as rbs
# Sampling coordinates
x = np.linspace(-2,2,20)
y = np.linspace(-2,2,30)
# Your function
f = np.exp(-(x[:,None]**2 + y**2))
# Interpolator
interp = rbs(x, y, f)
# Higher resolution coordinates
x_hd = np.linspace(x.min(), x.max(), x.size * 5)
y_hd = np.linspace(y.min(), y.max(), y.size * 5)
# New higher res function
f_hd = interp(x_hd, y_hd, grid = True)
# Some plots
fig, ax = plt.subplots(ncols = 2)
ax[0].imshow(f)
ax[1].imshow(f_hd)
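If you do want the average-over-cells behaviour described in the question (downsampling rather than upsampling), a reshape-based block average is a compact alternative. This is a sketch that trims any remainder so the input dimensions become multiples of the target grid shape:
import numpy as np

def block_average(function_data, grid_shape):
    """Average a 2D array over equal-sized blocks so the result has shape grid_shape."""
    rows, cols = function_data.shape
    r, c = grid_shape
    trimmed = function_data[:rows - rows % r, :cols - cols % c]  # drop any remainder
    return trimmed.reshape(r, rows // r, c, cols // c).mean(axis=(1, 3))

x = np.linspace(-2, 2, 20)
y = np.linspace(-2, 2, 30)
f = np.exp(-(x[:, None]**2 + y**2))          # same test function as above, shape (20, 30)
print(block_average(f, (10, 15)).shape)      # -> (10, 15)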

How do I plot a matrix as a distance vs. time plot using matplotlib?

I am trying to plot the values (0's and 1's) that are stored in a 501x120 matrix. The plot is displaying but my x and y ticks correspond to the matrix indexes. I want to set these ticks to the corresponding distances (x-axis) and time (y-axis). I.e., the 501 rows correspond to a time series from 0 to 2 seconds with samples every 0.004 seconds. The columns are distances that go from (-600m to 600m) with a distance between columns of 10 m.
This is what I have written down so far:
# Import libraries. The magic command '%matplotlib inline' shows figures as an output in the same jupyter notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
v1 = 1500 #first layer velocity in m/s
h1 = 200 #vertical distance to first reflector in m
dt = 0.004 #sample rate in s
channels = 120 #number of geophones
dx = 10 #distance between geophones in m
dhalf = channels/2*dx #divide total distance in two. This is because we will assume a source at the center of the array, i.e., a central-shot gather
offsets = np.arange(channels)*dx-dhalf #creates numpy array with offsets to all 120 geophones
td = [np.sqrt(x**2+ 4*h1**2) for x in offsets] #calculates reflection travel distances to each geophone
tt = np.array(td)/v1 #calculates travel times as derived above
num_samples = 501 # number of samples per trace
seismic_data = np.zeros((num_samples,channels)) #creates a zero-matrix with a row per sample and a column per trace/channel.
for channel in range(channels):
    sample = int(tt[channel] / dt)
    seismic_data[sample, channel] = 1
type(seismic_data)
# This loop looks in each channel for the sample number closest to the one that corresponds to the reflected wave arrival and turns the 0 into a 1 (spike).
fig,ax = plt.subplots(figsize=(10,20))
ax.imshow(seismic_data)
ax.set_aspect(.3)
Let's try contourf:
# random data
np.random.seed(1)
img = np.random.randint(0,2, (501,120))
y,x = np.linspace(0,2,501), np.linspace(-600,600,120)
xx,yy = np.meshgrid(x,y)
fig, ax = plt.subplots()
ax.contourf(xx,yy,img, cmap='seismic')
Output: (figure omitted)
I have found an easier way to do this without having to create the meshgrid: use the extent argument of imshow. E.g.
ax.imshow(seismic_data, aspect=1400, extent=[np.min(offsets), np.max(offsets), np.max(time), np.min(time)])
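Continuing with the arrays already built in the question (offsets, seismic_data, num_samples, dt), and defining the time axis that the snippet above assumes exists, a sketch might look like:
time = np.arange(num_samples) * dt   # 0 to 2 s in 0.004 s steps

fig, ax = plt.subplots(figsize=(10, 20))
ax.imshow(seismic_data, aspect=1400,
          extent=[np.min(offsets), np.max(offsets), np.max(time), np.min(time)])
ax.set_xlabel('Offset (m)')
ax.set_ylabel('Time (s)')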

How to do calculation on zoomed plot area

I have a time series plot along with a scatter plot on top to indicate some points of the series with certain characteristics. On jupyter notebook I am using %matplotlib notebook to get interaction plot and zoom.
Is it possible to run the calculation only on the points that are visible in the zoomed plot area?
EDIT:
The following code is a dummy example of plotting random data and marking with red dots those points whose value is above a certain threshold.
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
# generate random data [0, 10]
random_data = np.random.randint(10, size = 20)
# implement rule --> i.e. check which data point is > 3
index = np.where([random_data > 3])[1]
value = np.where([random_data > 3])[0]
# plot data and mark data point where rule applies
plt.plot(random_data)
plt.scatter(index, random_data[index], c = 'r')
This generates the plot below (figure omitted).
Is it possible to get a result that recalculates the red dots every time I zoom in on the plot?
So after a lot of searching I came up with the following solution.
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
# generate random data [0, 10]
random_data = np.random.randint(10, size = 20)
# implement rule --> i.e. check which data point is > 3
index = np.where([random_data > 3])[1]
value = np.where([random_data > 3])[0]
# plot data and mark data point where rule applies
fig, ax = plt.subplots(1, 1)
ax.plot(random_data)
ax.scatter(index, random_data[index], c = 'r')
global scatter_index
scatter_data = index
def on_xlims_change(axes):
    d1, d2 = axes.get_xlim()
    number_of_points = index[np.where((index > d1) & (index < d2))].shape[0]
    axes.legend([f'{number_of_points} points in the visible area'])
# use a matplotlib callback to do the calculation
ax.callbacks.connect('xlim_changed', on_xlims_change)
The idea is that you can use a callback to get the new axis limits and filter the data based on those limits. Hope this helps.
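If the count should also react to vertical zooming, the same pattern works for the y limits. Here is a sketch (not from the original answer) that filters on both axes, reusing random_data and index from above:
def on_lims_change(axes):
    x1, x2 = axes.get_xlim()
    y1, y2 = axes.get_ylim()
    values = random_data[index]
    in_view = (index > x1) & (index < x2) & (values > y1) & (values < y2)
    axes.legend([f'{in_view.sum()} points in the visible area'])

ax.callbacks.connect('xlim_changed', on_lims_change)
ax.callbacks.connect('ylim_changed', on_lims_change)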

Python/Matplotlib: Randomly select "sample" scatter points for different marker

Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N):  # N number of particles
    particle_position[i] = np.random.uniform(0, xmax)  # initialize in spatial domain
    particle_velocity[i] = np.random.normal(0, 5)      # initialize in velocity space
for i in range(maxtime):
    plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np
SIZE = 100
SAMPLE_SIZE = 10
def select_subset(seq, size):
    """selects a subset of the data using ...
    """
    return seq[:size]
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
            select_subset(points_y, SAMPLE_SIZE),
            marker="o", color="red")
plt.show()
It uses plt.scatter twice: once on the full data set, and once on the sample points.
You will have to decide how you want to select the sample of points; that logic is isolated in the select_subset function.
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np
import random
SIZE = 100
SAMPLE_SIZE = 10
def make_mask(data_size, sample_size):
    mask = np.array([True] * sample_size + [False] * (data_size - sample_size))
    np.random.shuffle(mask)
    return mask
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)
plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on a subset of the data points (the ones not selected in the sample) and a second time on the sampled subset, drawing each subset with its own marker. It is efficient and leaves the original data intact.
The code below does what you want. I select a random set v_sub_index of N_sub indices in the correct range (0 to N) and draw those points (with the _sub suffix) from the larger samples particle_position and particle_velocity. Note that you don't have to loop to generate random samples; numpy has great functionality for that without for loops.
import numpy as np
import matplotlib.pyplot as pl
N = 100
xmax = 1.
v_sigma = 2.5 / 2. # 95% of the samples contained within 0, 5
v_mean = 2.5 # mean at 2.5
N_sub = 10
v_sub_index = np.random.randint(0, N, N_sub)
particle_position = np.random.rand(N) * xmax
particle_velocity = v_mean + v_sigma * np.random.randn(N)  # use the mean and sigma defined above
particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])
particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)
pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub , particle_velocity_sub , color='r', marker='^')
pl.show()
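One small caveat with np.random.randint here: it samples with replacement, so the same particle can be picked twice and the subset may contain fewer than N_sub distinct particles. If that matters, drawing the indices without replacement is a drop-in change (sketch):
v_sub_index = np.random.choice(N, size=N_sub, replace=False)  # N_sub distinct indices in [0, N)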
