interpolate.griddata results inconsistent - python

I am converting some code from Matlab to Python and found that I was getting different results from scipy.interpolate.griddata than from Matlab's scatteredInterpolant. After much research and experimentation I found that the interpolation results from scipy.interpolate.griddata seem to depend on the size of the data set provided. There seem to be thresholds that cause the interpolated value to change. Is this a bug, or can someone explain the algorithm used that would cause this? Here is code that demonstrates the problem.
import numpy as np
from scipy import interpolate
# This code provides a simple example showing that the interpolated value
# for the same location changes depending on the size of the input data set.
# Results of this example show that the interpolated value changes
# at repeat 10 and 300.
def compute_missing_value(data):
    """Compute the missing value example function."""
    # Indices for valid x, y, and z data
    # In this example x and y are simply the column and row indices
    valid_rows, valid_cols = np.where(np.isnan(data) == False)
    valid_data = data[np.isnan(data) == False]
    interpolated_value = interpolate.griddata(
        np.array((valid_rows, valid_cols)).T, valid_data, (2, 2),
        method='linear')
    print('Size=', data.shape, ' Value:', interpolated_value)
# Sample data
data = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                 [-0.0398, 0.2238, -0.0576, 0.3841],
                 [0.2485, 0.2644, 0.2639, 0.1345],
                 [0.2161, 0.1913, 0.2036, 0.1462],
                 [0.0540, 0.3310, 0.3674, 0.2862]])
# Larger data sets are created by tiling the original data.
# The location of the invalid data to be interpolated is maintained at 2,2
repeat_list = [1, 9, 10, 11, 30, 100, 300]
for repeat in repeat_list:
    new_data = np.tile(data, (1, repeat))
    new_data[2, 2] = np.nan
    compute_missing_value(new_data)
The results are:
Size= (5, 4) Value: 0.07300000000000001
Size= (5, 36) Value: 0.07300000000000001
Size= (5, 40) Value: 0.19945000000000002
Size= (5, 44) Value: 0.07300000000000001
Size= (5, 120) Value: 0.07300000000000001
Size= (5, 400) Value: 0.07300000000000001
Size= (5, 1200) Value: 0.19945000000000002

Jaime's answer describes how scipy.interpolate.griddata interpolates values using Delaunay triangulation:
[When] you make a call to scipy.interpolate.griddata:
First, a call to sp.spatial.qhull.Delaunay is made to triangulate the irregular grid coordinates.
Then, for each point in the new grid, the triangulation is searched to find in which triangle ... it lies.
The barycentric coordinates of each new grid point with respect to the vertices of the enclosing simplex are computed.
An interpolated value is computed for that grid point, using the barycentric coordinates, and the values of the function at the vertices of the enclosing simplex.
pv. explains that the Delaunay triangulation generated by a square grid is not unique. Since the points that get linearly interpolated depend on the triangulation, you can get different results depending on the particular Delaunay triangulation generated.
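As a minimal sketch (my own illustration, not from the original answers), the following reproduces what griddata's linear method does by hand: triangulate the valid points with scipy.spatial.Delaunay, locate the simplex containing (2, 2), compute its barycentric coordinates from the triangulation's transform attribute, and take the weighted sum of the vertex values. The data layout is taken from the question; everything else is illustrative.

import numpy as np
from scipy import interpolate
from scipy.spatial import Delaunay

# Valid points and values from the question's 5x4 array with (2, 2) removed
data = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                 [-0.0398, 0.2238, -0.0576, 0.3841],
                 [0.2485, 0.2644, 0.2639, 0.1345],
                 [0.2161, 0.1913, 0.2036, 0.1462],
                 [0.0540, 0.3310, 0.3674, 0.2862]])
data[2, 2] = np.nan
rows, cols = np.where(~np.isnan(data))
points = np.column_stack((rows, cols)).astype(float)
values = data[~np.isnan(data)]

# Step 1: Delaunay triangulation of the scattered points
tess = Delaunay(points)

# Step 2: find the triangle (simplex) that contains the query point
p = np.array([2.0, 2.0])
simplex = tess.find_simplex(p)

# Step 3: barycentric coordinates of p with respect to that triangle
T = tess.transform[simplex]
b = T[:2].dot(p - T[2])
bary = np.append(b, 1.0 - b.sum())

# Step 4: weighted sum of the function values at the triangle's vertices
manual = bary.dot(values[tess.simplices[simplex]])
print(manual)
print(interpolate.griddata(points, values, p, method='linear'))

Which of the two candidate values (0.073 or 0.19945) this reproduces depends entirely on which triangulation Qhull builds for the point set, which is exactly the effect discussed below.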
Here is a modified version of your script which draws the Delaunay triangulation used:
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
import scipy.spatial as spatial
import matplotlib.collections as mcoll
def compute_missing_value(data):
    """Compute the missing value example function."""
    mask = ~np.isnan(data)
    valid_rows, valid_cols = np.where(mask)
    valid_data = data[mask]
    interpolated_value = interpolate.griddata(
        (valid_cols, valid_rows), valid_data, (2, 2), method='linear')
    print('Size: {:<12s} Value: {:<.4f}'.format(
        str(data.shape), interpolated_value))
    points = np.column_stack((valid_cols, valid_rows))
    tess = spatial.Delaunay(points)
    tri = tess.simplices
    verts = tess.points[tri]
    lc = mcoll.LineCollection(
        verts, colors='black', linewidth=2, zorder=5)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.add_collection(lc)
    ax.plot(valid_cols, valid_rows, 'ko')
    ax.set(xlim=(0, 3), ylim=(0, 3))
    plt.title('Size: {:<12s} Value: {:<.4f}'.format(
        str(data.shape), interpolated_value))
    for label, x, y in zip(valid_data, valid_cols, valid_rows):
        plt.annotate(
            label,
            xy=(x, y), xycoords='data',
            xytext=(-20, -40), textcoords='offset points',
            horizontalalignment='center',
            verticalalignment='bottom',
            bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    plt.show()
# Sample data
orig_data = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                      [-0.0398, 0.2238, -0.0576, 0.3841],
                      [0.2485, 0.2644, 0.2639, 0.1345],
                      [0.2161, 0.1913, 0.2036, 0.1462],
                      [0.0540, 0.3310, 0.3674, 0.2862]])
repeat_list = [1, 4]
for repeat in repeat_list:
    print('{}: '.format(repeat), end='')
    new_data = np.tile(orig_data, (1, repeat))
    new_data[2, 2] = np.nan
    compute_missing_value(new_data)
As you can see, the two interpolated values, 0.1995 and 0.073, are the average of (A,C) or (B,D) (using pv.'s notation):
In [159]: (0.2644+0.1345)/2
Out[159]: 0.19945000000000002
In [160]: (0.2036-0.0576)/2
Out[160]: 0.07300000000000001

I think the explanation may lie in the way that scipy.interpolate.griddata constructs a triangulation of your data before interpolating. From the documentation, it uses scipy.interpolate.LinearNDInterpolator, which constructs a Delaunay triangulation of your data; that triangulation isn't guaranteed to be the same when you add more nodes at the edge of your grid (as you've done with numpy.tile). Because of the way your 2D area is divided into triangles, the resulting linear interpolation may vary.
For a plain 4x5 grid, with the (2,2) element missing, the Delaunay triangulation produced by scipy.spatial.Delaunay looks like this:
If you then tile the grid data, by the time you have four copies of the grid, the Delaunay triangulation has changed around the (2,2) location, which now lies on a horizontal boundary rather than a vertical one:
This means that the resulting interpolation for the value at (2,2) will use a different set of neighbouring nodes, which will give a different interpolated value on this extended grid.
(From a few quick experiments, this effect may not be present for 2x or 3x tiling, but it showed up on the 4x tiling.)
This change in the layout of the triangles is due to the way the Delaunay triangulation is computed, which involves projecting the entire 2D grid into a 3D space and then computing the convex hull before projecting that back into 2D triangles. That means that as you add more nodes to the grid, there is no guarantee that the 3D convex hull will be identical even where it refers to the same nodes in the original 2D grid.
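As a quick check of this explanation (my own sketch, not part of the answer), the snippet below triangulates the valid points for the original grid and for a 10x tiling, the two cases where the question's output shows 0.073 and 0.19945 respectively, and prints the corners of the triangle that contains (2, 2):

import numpy as np
from scipy.spatial import Delaunay

orig = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                 [-0.0398, 0.2238, -0.0576, 0.3841],
                 [0.2485, 0.2644, 0.2639, 0.1345],
                 [0.2161, 0.1913, 0.2036, 0.1462],
                 [0.0540, 0.3310, 0.3674, 0.2862]])

for repeat in (1, 10):
    data = np.tile(orig, (1, repeat))
    data[2, 2] = np.nan
    rows, cols = np.where(~np.isnan(data))
    points = np.column_stack((rows, cols)).astype(float)

    tess = Delaunay(points)
    simplex = tess.find_simplex(np.array([2.0, 2.0]))
    corners = points[tess.simplices[simplex]]
    print('tiling x{}: triangle around (2, 2) has corners\n{}'.format(repeat, corners))

Because the question's printed results differ between these two sizes, the enclosing triangles must use different neighbouring nodes, which is exactly the change described above.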

Related

contourf() plots white space over finite data

I'm attempting to plot a 3D chart using matplotlib.pyplot.contourf() with the following program:
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack
# calculates Fast Fourier transforms for each value in the 1D array "Altitude"
# and stacks them vertically to form a 2D array of fft values called "Fourier"
Fourier = np.array([])
for i in range(len(Altitude)):
    Ne_fft = Ne_lowpass[i,:]/np.average(Ne_lowpass[i,:])
    Ne_fft = Ne_fft - Ne_fft.mean()
    W = scipy.fftpack.fftfreq(10*Ne_fft.size, d=(Time[-1]-Time[0])/len(Ne_fft))
    P = 1/abs(W)
    FFT = abs(scipy.fftpack.fft(Ne_fft, n=10*len(Ne_fft)))
    FFT = FFT**2
    if len(Fourier) == 0:
        Fourier = FFT
    else:
        Fourier = np.vstack((Fourier,FFT))
# plots the 2D contourf plot of "Fourier", with respect to "Altitude" and period "P"
plt.figure(5)
C = plt.contourf(P,Altitude,Fourier,100,cmap='jet')
plt.xscale('log')
plt.xlim([1,P[np.argmax(P)+1]])
plt.ylim([59,687])
plt.ylabel("Altitude")
plt.xlabel("Period")
plt.title("Power spectrum of Ne")
cbar = plt.colorbar(C)
cbar.set_label("Power", fontsize = 16)
For the most part it is working fine; however, in some places useless white space is plotted. The plot produced can be found here (sorry, I don't have enough reputation points to attach images directly).
The purpose of this program is to calculate a series of Fast Fourier Transforms across 1 axis of a 2 dimensional numpy array, and stack them up to display a contour plot depicting which periodicities are most prominent in the data.
I checked the parts of the plotted quantity that appear white, and finite values are still present there, although much smaller than the noticeable quantities elsewhere in the plot:
print(Fourier[100:,14000:])
[[ 2.41147887e-03 1.50783490e-02 4.82620482e-02 ..., 1.49769976e+03
5.88859945e+02 1.31930217e+02]
[ 2.12684922e-03 1.44076962e-02 4.65881565e-02 ..., 1.54719976e+03
6.14086374e+02 1.38727145e+02]
[ 1.84414615e-03 1.38162140e-02 4.51940720e-02 ..., 1.56478339e+03
6.23619105e+02 1.41367042e+02]
...,
[ 3.51539440e-03 3.20182148e-03 2.38117665e-03 ..., 2.43824864e+03
1.18676851e+03 3.13067945e+02]
[ 3.51256439e-03 3.19924000e-03 2.37923875e-03 ..., 2.43805298e+03
1.18667139e+03 3.13042038e+02]
[ 3.50985146e-03 3.19677302e-03 2.37741084e-03 ..., 2.43790243e+03
1.18659640e+03 3.13021994e+02]]
print(np.isfinite(Fourier.all()))
True
print(np.isnan(Fourier.any()))
False
Is the white space present because the values are so small compared to the rest of the plot? I'm not sure at all how to fix this.
You can fix this problem by adding the option extend='both'.
Example:
C = plt.contourf(P,Altitude, Fourier,100, cmap='jet', extend='both')
Ref: https://matplotlib.org/examples/pylab_examples/contourf_demo.html
In the line plt.contourf(P,Altitude,Fourier,100,cmap='jet') you are taking 100 automatically chosen levels for the contour plot. "Automatic" in this case does not guarantee that those levels include all of the data.
If you want to make sure all of the data is included, you may define your own levels to use:
plt.contourf(x, y, Z, np.linspace(Z.min(), Z.max(), 100))
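For a self-contained illustration (my own toy data, not the asker's Fourier array), the snippet below contrasts automatically chosen levels, which may leave unfilled white regions for strongly skewed data depending on the matplotlib version, with explicit levels spanning the full range plus extend='both':

import numpy as np
import matplotlib.pyplot as plt

# Toy data spanning several orders of magnitude, standing in for "Fourier"
x = np.linspace(0.1, 10, 200)
y = np.linspace(0, 100, 200)
X, Y = np.meshgrid(x, y)
Z = np.exp(Y / 10) / X

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# 100 automatically chosen levels, as in the question
ax1.contourf(X, Y, Z, 100, cmap='jet')
ax1.set_title('automatic levels')

# Explicit levels covering the whole data range, with extend='both' as a safety net
levels = np.linspace(Z.min(), Z.max(), 100)
ax2.contourf(X, Y, Z, levels=levels, cmap='jet', extend='both')
ax2.set_title('explicit levels + extend')

plt.show()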

tripcolor using RGB values for each vertex

I have a 2D triangle mesh with n vertices that is stored in a variable tri (a matplotlib.tri.Triangulation object); I can plot the mesh with matplotlib's tripcolor function easily enough and everything works fine. However, I also have (r,g,b) triples for each vertex (vcolors), and these values do not fall along a single dimension thus can't be easily converted to a color-map (for example, imagine if you overlaid a triangle mesh on a large photo of a park, then assigned each vertex the color of the pixel beneath it).
I thought I would be able to do something like this:
matplotlib.pyplot.tripcolor(tri, vcolors)
ValueError: Collections can only map rank 1 arrays
Is there a convenient way to convert a vcolors-like (n x 3) matrix into something usable by tripcolor? Is there an alternative to tripcolor that accepts vertex colors?
One thing I have tried is to make my own colormap:
z = numpy.asarray(range(len(vcolors)), dtype=np.float) / (len(vcolors) - 1)
cmap = matplotlib.colors.Colormap(vcolors, N=len(vcolors))
matplotlib.pyplot.tripcolor(tri, z, cmap=cmap)
matplotlib.pyplot.show()
This, however, did nothing: no figure appears and no error is raised; the function returns a figure handle but nothing ever gets rendered (I'm using an IPython notebook). Note that if I call the following, a plot appears just fine:
tripcolor(tri, np.zeros(len(vcolors)))
matplotlib.pyplot.show()
I'm using Python 2.7.
After rooting around in matplotlib's tripcolor and Colormap code, I came up with the following solution, which seems to work only as long as one uses 'gouraud' shading (otherwise, it does a very poor job of deducing the face colors; see below).
The trick is to create a colormap that, when given n evenly spaced numbers between 0 and 1 (inclusive), reproduces the original array of colors:
def colors_to_cmap(colors):
    '''
    colors_to_cmap(nx3_or_nx4_rgba_array) yields a matplotlib colormap object
    that will reproduce the colors in the given array when passed a list of
    n evenly spaced numbers between 0 and 1 (inclusive), where n is the length
    of the argument.
    Example:
      cmap = colors_to_cmap(colors)
      zs = np.asarray(range(len(colors)), dtype=np.float) / (len(colors)-1)
      # cmap(zs) should reproduce colors; cmap[zs[i]] == colors[i]
    '''
    colors = np.asarray(colors)
    if colors.shape[1] == 3:
        colors = np.hstack((colors, np.ones((len(colors), 1))))
    steps = (0.5 + np.asarray(range(len(colors)-1), dtype=np.float))/(len(colors) - 1)
    return matplotlib.colors.LinearSegmentedColormap(
        'auto_cmap',
        {clrname: ([(0, col[0], col[0])] +
                   [(step, c0, c1) for (step, c0, c1) in zip(steps, col[:-1], col[1:])] +
                   [(1, col[-1], col[-1])])
         for (clridx, clrname) in enumerate(['red', 'green', 'blue', 'alpha'])
         for col in [colors[:, clridx]]},
        N=len(colors))
Again, note that 'gouraud' shading is required for this to work. To demonstrate how it fails without it, the following code blocks show my particular use case (I am plotting part of a flattened cortical sheet with a partially transparent data overlay). In this code, there are 40,886 vertices (in the_map.coordinates) and 81,126 triangles (in the_map.indexed_faces); the colors array has shape (40886, 3).
The following code works fine with 'gouraud' shading:
tri = matplotlib.tri.Triangulation(the_map.coordinates[0],
                                   the_map.coordinates[1],
                                   triangles=the_map.indexed_faces.T)
cmap = colors_to_cmap(colors)
zs = np.asarray(range(the_map.vertex_count), dtype=np.float) / (the_map.vertex_count - 1)
plt.figure(figsize=(16,16))
plt.tripcolor(tri, zs, cmap=cmap, shading='gouraud')
But without 'gouraud' shading, the face colors appear to be assigned according to the average of their vertices (I have not verified this), which is clearly wrong:
plt.figure(figsize=(16,16))
plt.tripcolor(tri, zs, cmap=cmap)
A much simpler way of creating the color map is via from_list:
z = numpy.arange(n)
cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    'mymap', rgb, N=len(rgb))
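To show how the pieces fit together, here is a small self-contained sketch (a toy mesh of my own, not the original poster's cortical data) that builds a colormap from a per-vertex RGB array with from_list and applies it with tripcolor and 'gouraud' shading:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
from matplotlib.colors import LinearSegmentedColormap

# Toy mesh: random points in the unit square, one (r, g, b) triple per vertex
rng = np.random.default_rng(0)
pts = rng.random((50, 2))
rgb = rng.random((50, 3))

tri = mtri.Triangulation(pts[:, 0], pts[:, 1])

# Colormap whose i-th entry is (approximately) the i-th vertex color
cmap = LinearSegmentedColormap.from_list('mymap', rgb, N=len(rgb))

# Scalar per vertex that indexes evenly into the colormap
z = np.arange(len(rgb)) / (len(rgb) - 1)

plt.tripcolor(tri, z, cmap=cmap, shading='gouraud')
plt.show()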
While for the tripcolor function the use of a colormap is obligatory, the PolyCollection and TriMesh classes (from matplotlib.collections) that it calls internally can deal with RGB color arrays as well. I have used the following code, based on the tripcolor source, to draw a triangle mesh with given RGB face colors:
tri = Triangulation(...)
colors = nx3 RGB array
maskedTris = tri.get_masked_triangles()
verts = np.stack((tri.x[maskedTris], tri.y[maskedTris]), axis=-1)
collection = PolyCollection(verts)
collection.set_facecolor(colors)
plt.gca().add_collection(collection)
plt.gca().autoscale_view()
To set colors per vertex (Gouraud shading), use a TriMesh instead (with set_facecolor).
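As a runnable illustration of that face-color route (toy data of my own, not from the answer), the sketch below triangulates a few random points and paints each triangle with its own RGB color via PolyCollection:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.tri import Triangulation
from matplotlib.collections import PolyCollection

rng = np.random.default_rng(1)
pts = rng.random((30, 2))
tri = Triangulation(pts[:, 0], pts[:, 1])

maskedTris = tri.get_masked_triangles()            # (ntri, 3) vertex indices
verts = np.stack((tri.x[maskedTris], tri.y[maskedTris]), axis=-1)

# One RGB triple per face
face_colors = rng.random((len(maskedTris), 3))

collection = PolyCollection(verts, edgecolors='k')
collection.set_facecolor(face_colors)

ax = plt.gca()
ax.add_collection(collection)
ax.autoscale_view()
plt.show()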

Python/Matplotlib: Randomly select "sample" scatter points for different marker

Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N):  # N number of particles
    particle_position[i] = np.random.uniform(0, xmax)  # Initialize in spatial domain
    particle_velocity[i] = np.random.normal(0, 5)      # Initialize in velocity space
for i in range(maxtime):
    plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np

SIZE = 100
SAMPLE_SIZE = 10

def select_subset(seq, size):
    """selects a subset of the data using ...
    """
    return seq[:size]

points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)

plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
            select_subset(points_y, SAMPLE_SIZE),
            marker="o", color="red")
plt.show()
It uses plt.scatter twice: once on the full data set and once on the sample points.
You will have to decide how you want to select the sample of points; that logic is isolated in the select_subset function.
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np

SIZE = 100
SAMPLE_SIZE = 10

def make_mask(data_size, sample_size):
    mask = np.array([True] * sample_size + [False] * (data_size - sample_size))
    np.random.shuffle(mask)
    return mask

points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)

mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)

plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on the subset of points not selected in the sample, and a second time on the sampled subset, drawing each subset with its own marker. It is efficient and leaves the original data intact.
The code below does what you want. I have selected a random set v_sub_index of N_sub indices in the correct range (0 to N) and draw those (with _sub suffix) from the larger samples particle_position and particle_velocity. Please note that you don't have to loop to generate random samples. Numpy has great functionality for that without having to use for loops.
import numpy as np
import matplotlib.pyplot as pl
N = 100
xmax = 1.
v_sigma = 2.5 / 2. # 95% of the samples contained within 0, 5
v_mean = 2.5 # mean at 2.5
N_sub = 10
v_sub_index = np.random.randint(0, N, N_sub)
particle_position = np.random.rand(N) * xmax
particle_velocity = v_mean + v_sigma * np.random.randn(N)
particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])
particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)
pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub , particle_velocity_sub , color='r', marker='^')
pl.show()

Count frequencies of x, y coordinates, display in 2D and plot

I am trying to plot how often each combination of isolation-year difference and nucleotide difference occurs for a set of viral biological sequences, and I am having trouble finding an elegant way to do it.
So I have an alignment and I compare each sequence against every other to get an integer value of how different they are. I also check how different their years of isolation are. So for a pair of sequences that were isolated two years apart and have three differences, you get the coordinates (2,3). I want to count how many times (2,3) occurs, as well as all other combinations, and plot it (and get the plot data). I have been trying to convert a list of frequencies to a dataframe to no avail, and I am wondering if there is a better way to do it.
I can show some code but I am not sure this is the best way so I want to hear other ideas.
One problem is how to represent the frequencies in the beginning. I can create a list of all of the occurrences or create a dictionary of the occurrences and increment a counter.
Sample data:
(year difference, sequence residue differences):
(1,2), (2,5), (1,2), (5, 5), (4, 5)
Output is shown in the picture but it does NOT have to be in a table structure. CSV is preferred.
I'm heavily borrowing the table construction from this post.
The difference here is in constructing the array data: after initialising an array of zeros, for every coordinate (i, j) you increment that array element by one to record the frequency.
zip(*coords) groups all the i values in one tuple and all the j values in another. The maximum value in each tells us the size of the array. Note that the array must be bigger than x and y by 1 to account for 0, i.e. from 0 to x is x+1 rows.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table

def table_plot(data):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])
    nrows, ncols = data.shape
    width, height = 1.0 / ncols, 1.0 / nrows
    for (i, j), val in np.ndenumerate(data):
        tb.add_cell(i, j, width, height, text=str(val) if val else '', loc='center')
    for i in range(data.shape[0]):
        tb.add_cell(i, -1, width, height, text=str(i), loc='right',
                    edgecolor='none', facecolor='none')
    for i in range(data.shape[1]):
        tb.add_cell(-1, i, width, height/2, text=str(i), loc='center',
                    edgecolor='none', facecolor='none')
    tb.set_fontsize(16)
    ax.add_table(tb)
    return fig

coords = ((1, 2), (2, 5), (1, 2), (5, 5), (4, 5))

# get maximum value for both x and y to allocate the array
x, y = map(max, zip(*coords))
data = np.zeros((x+1, y+1), dtype=int)
for i, j in coords:
    data[i, j] += 1

table_plot(data)
plt.show()
Output:
Assuming your (year, discrepancy) tuples are in a list called samples as in the example below
import random
samples = [(random.randint(0,10), random.randint(0,10)) for i in range(100) ]
you can get the frequency of each pair as described in this other stackoverflow post How to count the frequency of the elements in a list?
import collections
counter=collections.Counter(samples)
To visualize this frequency table, you can convert it to a numpy matrix and use matshow from matplotlib
import numpy as np
import matplotlib.pyplot as plt
x_max = max([x[0] for x in samples])
y_max = max([x[1] for x in samples])
freq = np.zeros((x_max+1, y_max+1))
for coord, f in counter.items():  # use counter.iteritems() on Python 2
    freq[coord[0]][coord[1]] = f
plt.matshow(freq, cmap=plt.cm.gray)
plt.show()
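Since the question mentions that CSV output is preferred, here is a small addition (my own sketch, not part of either answer) that writes the same frequency matrix to a CSV file with numpy:

import numpy as np

# 'freq' is the 2D frequency array built above:
# rows are year differences, columns are residue differences
np.savetxt('frequencies.csv', freq, fmt='%d', delimiter=',')

Alternatively, pandas.DataFrame(freq).to_csv('frequencies.csv') keeps the row and column indices in the file.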

Set mask for matplotlib tricontourf

I have a numpy array containing data that I would like to visualize on a 2D grid. Some of the data is unphysical and I would like to mask it out. However, I could not figure out how to set the mask attribute of tricontourf correctly. I tried:
import matplotlib.pyplot as mp
import numpy as np

with open('some_data.dat', 'r') as infile:
    x, y, z = np.loadtxt(infile, usecols=(0, 1, 2), unpack=True)

isbad = np.less(z, 1.4) | np.greater(z, 2.1)
mp.tricontourf(x, y, z, mask=isbad)
But the resulting figure is simply not masked. I tried masking part of a contourf plot in matplotlib, i.e.
z2 = np.ma.array(z, mask= isbad)
mp.tricontourf(x, y, z2)
which did not work either. I want to use tricontourf instead of contourf, because I do not want to grid my data.
z[isbad] = np.nan
results in a Segmentation fault when calling tricontourf
Here's the figure, the red colours are the ones I would like to mark as unphysical.
Here comes the trick: I need to collect the indices of the triangles (which are indices into z!), evaluate whether they are good or not, and then accept only those triangles for which at least one corner is valid (reducing the dimension from (ntri, 3) to (ntri,)):
import matplotlib.tri as tr

triang = tr.Triangulation(x, y)
mask = np.all(np.where(isbad[triang.triangles], True, False), axis=1)
triang.set_mask(mask)
colplt = mp.tricontourf(triang, z)
mp.colorbar()
Inspired by this link: http://matplotlib.org/examples/pylab_examples/tripcolor_demo.html
wsj's answer didn't work for me since it didn't remove certain masked points (I think when not all of the nodes were bad).
This solution did:
z[isbad] = numpy.NaN
z = numpy.ma.masked_invalid(z)
vmin, vmax = z.min(), z.max()
z = z.filled(fill_value=-999)
levels = numpy.linspace(vmin, vmax, n_points)
plt.tricontourf(x, y, z, levels=levels)
