I have two arrays of data: one is radius values and the other is the corresponding intensity reading at that radius:
e.g. a small section of the data. The first column is the radius and the second is the intensity.
29.77036614 0.04464427
29.70281027 0.07771409
29.63523525 0.09424901
29.3639355 1.322793
29.29596385 2.321502
29.22783249 2.415751
29.15969437 1.511504
29.09139827 1.01704
29.02302068 0.9442765
28.95463729 0.3109002
28.88609766 0.162065
28.81754446 0.1356054
28.74883612 0.03637681
28.68004928 0.05952569
28.61125036 0.05291172
28.54229804 0.08432806
28.4732599 0.09950128
28.43877462 0.1091304
28.40421016 0.09629156
28.36961249 0.1193614
28.33500089 0.102711
28.30037503 0.07161685
How can I bin the radius data and find the average intensity corresponding to each binned radius?
The aim is then to use the average intensity to assign an intensity value to radius entries with a missing (NaN) data point.
I've never had to use the histogram functions before and have very little idea of how they work or whether it's possible to do this with them. The full data set is large, with 336622 data points, so I don't really want to be using loops or if statements to achieve this.
Many Thanks for any help.
If you only need to do this for a handful of points, you could do something like this.
If intensities and radius are numpy arrays of your data:
bin_width = 0.1  # depending on how narrow you want your bins

def get_avg(rad):
    # Average all intensities whose radius lies within half a bin width of rad
    average_intensity = intensities[(radius >= rad - bin_width/2.) & (radius < rad + bin_width/2.)].mean()
    return average_intensity

# This will return the average intensity in the bin: 27.95 <= rad < 28.05
average = get_avg(28.)
It's not really histogramming that you are after. A histogram is more a count of items that fall into a specific bin. What you want to do is more of a group-by operation, where you group your intensities by radius intervals and apply some aggregation method to each group of intensities, like average or median etc.
What you are describing, however, sounds a lot more like some sort of interpolation you want to perform. So I would suggest thinking about interpolation as an alternative to solve your problem. Anyway, here's a suggestion for how you can achieve what you asked for (assuming you can use numpy) - I'm using random inputs to illustrate:
import random
import numpy

radius = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=float)
intensities = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=float)
# group your radius input into 20 equally spaced bins
bins = numpy.linspace(radius.min(), radius.max(), 20)
groups = numpy.digitize(radius, bins)
# groups now holds the index of the bin into which radius[i] falls
# loop through all bin indexes and select the corresponding intensities
# perform your aggregation on the selected intensities
# i'm keeping the aggregation for the group in a dict
aggregated = {}
for i in range(len(bins) + 1):
    selected_intensities = intensities[groups == i]
    aggregated[i] = selected_intensities.mean()
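If you want to avoid the Python loop entirely (you mention 336622 points), the per-bin means can also be computed in one shot with numpy.bincount. This is a minimal sketch reusing the radius, intensities, bins and groups arrays from above:
sums = numpy.bincount(groups, weights=intensities, minlength=len(bins) + 1)
counts = numpy.bincount(groups, minlength=len(bins) + 1)
bin_means = sums / numpy.maximum(counts, 1)  # guard against empty bins

# To fill a missing intensity at some radius r, look up the mean of r's bin:
# filled_value = bin_means[numpy.digitize(r, bins)]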
Let's say I have a large array of values that represent terrain latitude locations, with shape x. I also have another array of values that represent terrain longitude values, with shape y. All of the values in x as well as y are equally spaced at 0.005 degrees. In other words:
lons[0:10] = [-130.0, -129.995, -129.99, -129.985, -129.98, -129.975, -129.97, -129.965, -129.96, -129.955]
lats[0:10] = [55.0, 54.995, 54.99, 54.985, 54.98, 54.975, 54.97, 54.965, 54.96, 54.955]
I have a second dataset that is projected on an irregularly-spaced lat/lon grid (but equally spaced, ~25 meters apart) that is [m,n] dimensions big and falls within the domain of x and y. Furthermore, we also have all of the lat/lon points within this second dataset. I would like to 'line up' the grids such that every value of [m,n] matches the nearest-neighbor terrain value within the larger grid. I am able to do this with the following code, where I basically loop through every lat/lon value in dataset two and try to find the argmin of the calculated lat/lon values from dataset one:
for a in range(0, lats.shape[0]):
    # Loop through the ranges
    for r in range(0, lons.shape[0]):
        # Access the elements
        tmp_lon = lons[r]
        tmp_lat = lats[a]
        # Now we need to find where the tmp_lon and tmp_lat match best with the index from new_lats and new_lons
        idx = (np.abs(new_lats - tmp_lat)).argmin()
        idy = (np.abs(new_lons - tmp_lon)).argmin()
        # Make our final array!
        second_dataset_trn[a,r] = first_dataset_trn[idy,idx]
Except it is exceptionally slow. Is there another method, either through a package, library, etc. that can speed this up?
Please take a look at the following previous question for iterating over two lists, which may improve the speed: Is there a better way to iterate over two lists, getting one element from each list for each iteration?
A possible correction to the sample code: assuming that the arrays are organized in the standard GIS fashion of Latitude, Longitude, I believe there is an error in the idx and idy variable assignments - the variables receiving the assignments should be swapped (idx should be idy, and the other way around). For example:
# Now we need to find where the tmp_lon and tmp_lat match best with the index from new_lats and new_lons
idy = (np.abs(new_lats - tmp_lat)).argmin()
idx = (np.abs(new_lons - tmp_lon)).argmin()
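Beyond that correction, the Python loop itself can be avoided: since idy only depends on the latitude and idx only on the longitude, both 1-D nearest-index searches can be done once with broadcasting. This is a rough sketch under the assumption that new_lats and new_lons are 1-D arrays; the sizes and random data below are made up purely for illustration:
import numpy as np

# Hypothetical stand-in data, just for illustration
lats = np.arange(55.0, 50.0, -0.005)
lons = np.arange(-130.0, -125.0, 0.005)
new_lats = np.sort(np.random.uniform(50.0, 55.0, 800))
new_lons = np.sort(np.random.uniform(-130.0, -125.0, 900))
first_dataset_trn = np.random.rand(new_lats.size, new_lons.size)

# Nearest index in new_lats for every target latitude, and likewise for longitude
idy = np.abs(new_lats[None, :] - lats[:, None]).argmin(axis=1)  # shape (lats.size,)
idx = np.abs(new_lons[None, :] - lons[:, None]).argmin(axis=1)  # shape (lons.size,)

# Fancy indexing then builds the whole matched grid in one step
second_dataset_trn = first_dataset_trn[idy[:, None], idx[None, :]]
For very large arrays, np.searchsorted on the sorted new_lats/new_lons would avoid the intermediate broadcast matrices while giving the same nearest indices.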
I have three 3D images with me, each representing one of the orthogonal views. I know the physical x,y,z locations on which each of the images are placed.
Let X1 = {(x1,y1,z1)} represent the set of physical coordinate tuples for one of the images and for which I know the corresponding intensity values I1. There are N tuples in X1 and hence, N intensity values. Similarly, I have access to X2, I2, and X3,I3 which are for the other two images. There are N tuples in X2 and X3 as well.
I want to estimate the volume that comes from interpolating information from all the views. I know the physical coordinates Xq for the final volume as well.
For example:
# Let image_matrix1, image_matrix2, and image_matrix3 represent the three volumes (matrices with intensity values)
#for image/view 1
xs1 = np.linspace(-5,5,100)
ys1 = np.linspace(-5,5,100)
zs1 = np.linspace(-2,2,20)
#for image/view 2
xs2 = np.linspace(-5,5,100)
ys2 = np.linspace(-5,5,100)
zs2 = np.linspace(-2,2,20)
#for image/view 3
xs3 = np.linspace(-5,5,100)
ys3 = np.linspace(-5,5,100)
zs3 = np.linspace(-2,2,20)
# the following will not work, but it is close to what I want to achieve
xs = [xs1,xs2,xs3]
ys = [ys1,ys2,ys3]
zs = [zs1,zs2,zs3]
points = (xs,ys,zs)
values = [image_matrix1,image_matrix2,image_matrix3]
query = (3.4,2.2,5.2) # the physical point at which i want to know the value
value_at_query = interpolating_function(points,values,query)
#the following will work, but this is for one image only
points = (xs1,ys1,zs1) #modified to take coords of one image only
values = [image_matrix1] #modified to take values of one image only
query = (3.4,2.2,5.2) # the physical point at which i want to know the value
value_at_query = interpolating_function(points,values,query)
Please help.
It doesn't make sense to me to interpolate between the three volumes (as a fourth dimension) as I understand the problem. The volumes are not like a fourth dimension in that they don't lie on a continuous axis that you can interpolate at a specified value.
You could interpolate the views separately and then calculate an aggregate value from the results by a suitable metric (average, quadratic average, min/max, etc.).
value_at_query = suitable_aggregate_metric(
    interpolating_function(points1, [image_matrix1], query),
    interpolating_function(points2, [image_matrix2], query),
    interpolating_function(points3, [image_matrix3], query)
)
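As a concrete illustration of that idea, here is a minimal sketch using scipy's RegularGridInterpolator as the interpolating function; the grid and volume are the made-up ones from the question, and a plain average stands in for the aggregate metric:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

xs1 = np.linspace(-5, 5, 100)
ys1 = np.linspace(-5, 5, 100)
zs1 = np.linspace(-2, 2, 20)
image_matrix1 = np.random.rand(100, 100, 20)  # stand-in volume for view 1

interp1 = RegularGridInterpolator((xs1, ys1, zs1), image_matrix1,
                                  bounds_error=False, fill_value=None)
# build interp2 and interp3 the same way for the other two views

query = np.array([3.4, 2.2, 1.2])  # a physical point to evaluate
value_at_query = np.mean([interp1(query)])  # extend the list with interp2(query), interp3(query)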
Considering the extrapolation, you could use a weight-matrix for each image. This weight-matrix would enclose the whole outer cube (128x128x128) with a weight-value of one in the region where it intersects with the image (128x128x10) and decaying to zero towards the outside (probably a strong decay like quadratic/cubic or higher order works better than linear). You then interpolate each image for an intensity-value and a weight-value and then calculate a weighted intensity-average.
The reason for my suggestion is that if you probe e.g. at location (4, 4, 2.5), you have to extrapolate on every image, but you would want to weight the third image highest, as it is much closer to known values of that image and thus more reliable. A higher-order decay emphasizes the weight of closer values even more.
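A rough sketch of that weighting scheme, assuming intensity_interps and weight_interps are lists of per-view interpolators built as above, with the weight volumes being 1 inside each image slab and decaying towards 0 outside it:
def weighted_value(intensity_interps, weight_interps, query):
    # Interpolate an intensity and a reliability weight from every view,
    # then return the weight-averaged intensity
    intensities = np.array([f(query) for f in intensity_interps], dtype=float).ravel()
    weights = np.array([w(query) for w in weight_interps], dtype=float).ravel()
    return (weights * intensities).sum() / weights.sum()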
I have working code that plots a bivariate gaussian distribution. The distribution is produced by adjusting the COV matrix to account for specific variables. Specifically, every XY coordinate is applied with a radius. The COV matrix is then adjusted by a scaling factor to expand the radius in x-direction and contract in y-direction. The direction of this is measured by theta. The output is expressed as a probability density function (PDF).
I have normalised the PDF values. However, I'm calling a separate PDF for each frame. As such, the maximum value changes and hence the probability will be transformed differently for each frame.
Question: Using @Prasanth's suggestion, is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Below is the function I'm currently using to normalise the PDF for a single frame.
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
Is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Indeed it is possible. In your case you probably need to rescale your arrays between two values, say -1 and 1, before plotting, so that the minimum becomes -1, the maximum 1 and the intermediate values are scaled accordingly.
You could also choose 0 and 1 or whatever as minimum and maximum, but let's go with -1 and 1 so that the middle value is 0.
To do this, in your code replace:
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
with:
renormPDF = PDFs[0]-PDFs[1]
renormPDF -= renormPDF.min()
normPDF = (renormPDF * 2 / renormPDF.max()) -1
These three lines ensure that normPDF.min() == -1 and normPDF.max() == 1.
Now when plotting the animation the axis on the right of your image does not change.
Your problem is to find the maximum of PDFs[0].max() and PDFs[1].max() over all frames.
Why don't you run plotmvs on all your planned frames in order to find the absolute maximum for PDFs[0] and PDFs[1] and then run your animation with these absolute maxima to normalize your plots? This way, the colorbar will be the same for all frames.
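A minimal sketch of that two-pass idea, under the assumption that plotmvs(frame) returns the pair of PDF arrays for a frame and that frames is the list of planned frames (neither is shown in the question):
# First pass: compute the absolute maximum over all planned frames
global_max = max(max(p0.max(), p1.max())
                 for p0, p1 in (plotmvs(frame) for frame in frames))

# Second pass, inside the animation loop: normalize every frame by the same value
normPDF = (PDFs[0] - PDFs[1]) / global_max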
I have a very basic question which in theory is easy to do (with fewer points and a lot of manual labour in ArcGIS), but I am not able to start at all with the coding to solve this problem (I am also new to complicated Python coding).
I have 2 variables, 'Root zone' aka RZS and 'Tree cover' aka TC, both arrays of 250x186 values (which are basically grids, with each grid cell having a specific value). The values in TC vary from 0 to 100. Each grid cell is 0.25 degrees (might be helpful in understanding the distances).
My problem is: "I want to calculate the distance of each TC value ranging between 50-100 (so each TC value greater than 50, at each lat and lon) from the nearest point where TC ranges between 0-30 (less than 30)."
Just take into consideration that we are not looking at the np.nan part of the TC. So the white part in TC is also white in RZS.
What I want to do is create a 2-dimensional scatter plot with X-axis denoting the 'distance of 50-100 TC from 0-30 values', Y-axis denoting 'RZS of those 50-100 TC points'. The above figure might make things more clear.
I wish I could have provided some code for this, but I am not even able to start on the distance part.
Please provide any suggestion on how should I proceed with this.
Let's consider an example:
If you look at x: 70 and y: 70, you can see a lot of points with tree cover values from 0-30 all across the dataset. But I only want the distance from my point to the nearest value which falls between 0-30.
The following code might work, with random example data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Create some completely random data, and include an area of NaNs as well
rzs = np.random.uniform(0, 100, size=(250, 186))
tc = np.random.lognormal(3.0, size=(250, 186))
tc = np.clip(tc, 0, 100)
rzs[60:80,:] = np.nan
tc[60:80,:] = np.nan
plt.subplot(2,2,1)
plt.imshow(rzs)
plt.colorbar()
plt.subplot(2,2,2)
plt.imshow(tc)
plt.colorbar()
Now do the real work:
# Select the indices of the low- and high-valued points
# This will result in warnings here because of NaNs;
# the NaNs should be filtered out in the indices, since they will
# compare to False in all the comparisons, and thus not be
# indexed by 'low' and 'high'
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)
# Get the coordinates for the low- and high-valued points,
# combine and transpose them to be in the correct format
y, x = np.where(low)
low_coords = np.array([x, y]).T
y, x = np.where(high)
high_coords = np.array([x, y]).T
# We now calculate the distances between *all* low-valued points and *all* high-valued points.
# This calculation scales as O(N*M) in time, as does the memory cost (of the output),
# so be wary when using it with large input sizes.
from scipy.spatial.distance import cdist
distances = cdist(low_coords, high_coords)
# Now find, for every high-valued point, the minimum distance to any low-valued point.
# The low-valued points lie along the first axis of `distances`, so we take the
# argmin along axis 0 and then look up the corresponding distances.
indices = distances.argmin(axis=0)
mindistances = distances[indices, np.arange(distances.shape[1])]
# The matching RZS values are simply the RZS values at the high-valued points
minrzs = rzs[high]
plt.scatter(mindistances, minrzs)
The resulting plot looks a bit weird, since there are rather discrete distances because of the grid (1, sqrt(1^2+1^2), 2, sqrt(1^2+2^2), sqrt(2^2+2^2), 3, sqrt(1^2+3^2), ...); this is because the TC values are randomly distributed, and thus low values may end up directly adjacent to high values (and because we're looking for minimum distances, most plotted points are for these cases). The vertical distribution is because the RZS values were uniformly distributed between 0 and 100.
This is simply a result of the input example data, which is not too representative of the real data.
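If the dense cdist matrix becomes too large in memory for the full grids, a k-d tree query gives the same nearest-neighbour distances far more cheaply. This is a sketch reusing low_coords, high_coords, rzs and high from above:
from scipy.spatial import cKDTree

tree = cKDTree(low_coords)
# Distance from every high-valued point to its nearest low-valued point
mindistances, _ = tree.query(high_coords)
minrzs = rzs[high]
plt.scatter(mindistances, minrzs)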
I am analyzing an image of two crossing lines (like a + sign) and I am extracting a line of pixels (an nx1 numpy array) perpendicular to one of the lines. This gives me an array of floating point values (representing colors) that I can then plot. I am plotting the data with matplotlib and I get a bunch of noisy data between 180 and 200 with a distinct peak in the middle that spikes down to around 100.
I need to find the FWHM of this data. I figured I needed to filter the noise first, so I used a gaussian filter, which smoothed out my data, but it's still not super flat at the top.
I was wondering if there is a better way to filter the data.
How can I find the FWHM of this data?
I would like to only use numpy, scipy, and matplotlib if possible.
Here is the original data:
Here is the filtered data:
I ended up not using any filter, but rather used the original data.
The procedure I used was:
1. Found the minimum and maximum points and calculated difference = max(arr_y) - min(arr_y)
2. Found the half max (in my case it is half min): HM = difference / 2
3. Found the nearest data point to HM: nearest = (np.abs(arr_y - HM)).argmin()
4. Calculated the distance between nearest and min (this gives me the HWHM)
5. Then simply multiplied by 2 to get the FWHM
I don't know if this is the best way, but it works and seems to be fairly accurate based on comparison.
Your script already does the correct calculation.
But the error from your distance between nearest and pos_extremum can be reduced by taking the distance between nearest_above and nearest_below - the positions at half the extremal value (maximum/minimum) on either side of it.
import numpy as np
from scipy.stats import norm

# Example data
arr_x = np.linspace(norm.ppf(0.00001), norm.ppf(0.99999), 10000)
arr_y = norm.pdf(arr_x)
# Effective code
difference = max(arr_y) - min(arr_y)
HM = difference / 2
pos_extremum = arr_y.argmax() # or in your case: arr_y.argmin()
nearest_above = (np.abs(arr_y[pos_extremum:-1] - HM)).argmin()
nearest_below = (np.abs(arr_y[0:pos_extremum] - HM)).argmin()
FWHM = (np.mean(arr_x[nearest_above + pos_extremum]) -
        np.mean(arr_x[nearest_below]))
For this example you should recover the known relation between FWHM and the standard deviation:
FWHM ≈ 2.355 times the standard deviation (here 1), as mentioned on Wikipedia.
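For reference, the exact factor is 2*sqrt(2*ln 2), which you can check directly against the value computed above:
# Compare the numerically determined FWHM with the analytic value for sigma = 1
print(FWHM, 2 * np.sqrt(2 * np.log(2)))  # both should be close to 2.355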