I am trying to plot how often each combination of isolation-year difference and nucleotide difference occurs among viral biological sequences. I am trying to find an elegant way to do it and am having trouble.
So I have an alignment and I compare each sequence against each other to get an integer value of how different they are. I also check to see how different their years of isolation are. So for a set of sequences that are isolated two years apart and have three differences you get the coordinates (2,3). I want to count how many times (2,3) occurs as well as all other combinations and plot it (and get the plot data). I have been trying to convert a list of frequencies to a dataframe to no avail and I am wondering if there is a better way to do it.
I can show some code but I am not sure this is the best way so I want to hear other ideas.
One problem is how to represent the frequencies in the beginning. I can create a list of all of the occurrences or create a dictionary of the occurrences and increment a counter.
Sample data:
(year difference, sequence residue differences):
(1,2), (2,5), (1,2), (5, 5), (4, 5)
Output is shown in the picture but it does NOT have to be in a table structure. CSV is preferred.
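Since CSV output is preferred, here is a minimal sketch of the counting step itself (my own illustration, not taken from the answers below): the pairs are counted with collections.Counter and the frequencies written to a CSV file; the file name is a placeholder.

import csv
import collections

coords = [(1, 2), (2, 5), (1, 2), (5, 5), (4, 5)]
counter = collections.Counter(coords)  # maps (year_diff, residue_diff) -> frequency

# "pair_frequencies.csv" is a made-up name for illustration.
with open("pair_frequencies.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["year_difference", "residue_differences", "frequency"])
    for (year_diff, residue_diff), count in sorted(counter.items()):
        writer.writerow([year_diff, residue_diff, count])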
I'm heavily borrowing the table construction from this post.
The difference here is in how the array data is constructed. Starting from an array of zeros, for every coordinate (i, j) you increment that array element by one to record the frequency.
zip(*coords) groups all the i values into one tuple and all the j values into another. The maximum value of each tells us the size of the array. Note that each dimension must be one larger than the maximum x and y to account for index 0, i.e. indices 0 through x require x+1 rows.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table

def table_plot(data):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = data.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    for (i, j), val in np.ndenumerate(data):
        tb.add_cell(i, j, width, height, text=str(val) if val else '', loc='center')

    for i in range(data.shape[0]):
        tb.add_cell(i, -1, width, height, text=str(i), loc='right',
                    edgecolor='none', facecolor='none')
    for i in range(data.shape[1]):
        tb.add_cell(-1, i, width, height/2, text=str(i), loc='center',
                    edgecolor='none', facecolor='none')

    tb.set_fontsize(16)
    ax.add_table(tb)
    return fig

coords = ((1, 2), (2, 5), (1, 2), (5, 5), (4, 5))

# get maximum value for both x and y to allocate the array
x, y = map(max, zip(*coords))
data = np.zeros((x + 1, y + 1), dtype=int)
for i, j in coords:
    data[i, j] += 1

table_plot(data)
plt.show()
Output:
Assuming your (year, discrepancy) tuples are in a list called samples, as in the example below:
import random
samples = [(random.randint(0, 10), random.randint(0, 10)) for i in range(100)]
you can get the frequency of each pair as described in this other Stack Overflow post, "How to count the frequency of the elements in a list?":
import collections

counter = collections.Counter(samples)
To visualize this frequency table, you can convert it to a numpy matrix and use matshow from matplotlib:
import numpy as np
import matplotlib.pyplot as plt
x_max = max([x[0] for x in samples])
y_max = max([x[1] for x in samples])
freq = np.zeros((x_max+1, y_max+1))
for coord, f in counter.items():
    freq[coord[0]][coord[1]] = f
plt.matshow(freq, cmap=plt.cm.gray)
plt.show()
Related
I want to get the mean of every interval with values above a threshold. Obviously, I could loop and check whether the next value is under the threshold, etc., but I was hoping there would be an easier way. Do you have ideas similar to masking that also handle the "interval" problem?
Below are 2 pictures with the original data and what I want to obtain.
Before:
After:
My original idea was looping through my array, but as I want to do this about 10,000 times or more, I guess it would get very time intensive.
Is there a way to get rid of the for loops?
transformed is a numpy array.
plt.figure()
plt.plot(transformed)

thresh = np.percentile(transformed, 30)
plt.hlines(thresh, 0, 700)

transformed_copy = transformed
transformed_mask = [True if x > thresh else False for x in transformed_copy]

mean_arr = []
for k in range(0, len(transformed)):
    if transformed_mask[k] == False:
        mean_all = np.mean(transformed_copy[mean_arr])
        for el in mean_arr:
            transformed_copy[el] = mean_all
        mean_arr = []
    if transformed_mask[k] == True:
        mean_arr.append(k)

plt.plot(transformed_copy)
Output after loop:
The trick I am using here is to find where there are sudden changes in the mask, which is where we switch from one contiguous section to another. Then we get the indices of where those sections start and end, and calculate the mean inside each of them.
# Imports.
import matplotlib.pyplot as plt
import numpy as np

# Create data.
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(np.sin(x)*4)
threshold = 0.30
mask = y > threshold

# Plot the raw data, threshold, and show where the data is above the threshold.
fig, ax = plt.subplots()
ax.plot(x, y, color="blue", label="y", marker="o", zorder=0)
ax.scatter(x[mask], y[mask], color="red", label="y > threshold", zorder=1)
ax.axhline(threshold, color="red", label="threshold")
ax.legend(loc="upper left", bbox_to_anchor=(1.01, 1))

# Detect the different segments.
diff = np.diff(mask)       # Where the mask starts and ends.
jumps = np.where(diff)[0]  # Indices of where the mask starts and ends.
for jump in jumps:
    ax.axvline(x[jump], linestyle="--", color="black")

# Calculate the mean inside each segment.
for n1, n2 in zip(jumps[:-1:2], jumps[1::2]):
    xn = x[n1:n2]
    yn = y[n1:n2]
    mean_in_section_n = np.mean(yn)
    ax.hlines(mean_in_section_n, xn[0], xn[-1], color="red", lw=10)

fig.show()
With a bit more time, we could imagine a function that encapsulates all this logic and has this signature: f(data, mask) -> data1, data2, ..., with an element returned for each contiguous section.
def data_where_mask_is_contiguous(data: np.array, mask: np.array) -> list:
    sections = []
    diff = np.diff(mask)       # Where the mask starts and ends.
    jumps = np.where(diff)[0]  # Indices of where the mask starts and ends.
    for n1, n2 in zip(jumps[:-1:2], jumps[1::2]):
        sections.append(data[n1:n2])
    return sections
With this, you can get the mean in each section very easily:
print([np.mean(yn) for yn in data_where_mask_is_contiguous(y, mask)])
>>> [0.745226, 0.747790, 0.599429]
I just noticed it doesn't work when the mask is all true, so I need to add a default case but you get the idea.
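As a follow-up to that caveat, one way to handle the edge cases (a mask that is all True, or a section touching either end of the array) is to pad the mask with False before taking the difference, so that starts and ends always come in pairs. This is my own sketch, not part of the answer above:

import numpy as np

def contiguous_sections(data, mask):
    """Split `data` into the sections where `mask` is True, including
    sections touching the array ends and an all-True mask."""
    padded = np.concatenate(([False], mask, [False]))  # force both edges to False
    diff = np.diff(padded.astype(int))
    starts = np.where(diff == 1)[0]   # index of the first True in each run
    ends = np.where(diff == -1)[0]    # one past the last True in each run
    return [data[a:b] for a, b in zip(starts, ends)]

# Example: the mask is True everywhere, so the single section is the whole array.
y = np.array([1.0, 2.0, 3.0])
print([section.mean() for section in contiguous_sections(y, y > 0)])  # [2.0]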
I want to digitize (i.e. average out over cells) photon count data into pixels given by a grid that tells how they are aligned. The photon count data is stored in a 2D array. I want to split that data into cells, each of which corresponds to a pixel. The idea is basically the same as reducing an HD image to a lower resolution. I'd like to achieve this in Python.
The digitizing function I've written:
import numpy as np

def digitize(function_data, grid_shape):
    """
    function_data: 2D array of function values of some 3D shape,
        e.g. exp(-(x^2 + y^2)) -> want to digitize this
    grid_shape: an array of length 2 which contains the dimensions of the smaller resolution
    """
    l = len(function_data)
    pixel_len_x = int(l / grid_shape[0])
    pixel_len_y = int(l / grid_shape[1])
    digitized_data = np.empty((grid_shape[0], grid_shape[1]))

    for i in range(grid_shape[0]):      # row-index of pixel in smaller-resolution grid
        for j in range(grid_shape[1]):  # column-index of pixel in smaller-resolution grid
            hd_pixel = []
            for k in range(pixel_len_y):
                hd_pixel.append(z_data[k][j:j*pixel_len_x])
            hd_pixel = np.ravel(hd_pixel)  # turn 2D array into 1D to be able to compute average
            pixel_avg = np.average(hd_pixel)
            digitized_data[i][j] = pixel_avg
    return digitized_data
In theory, this function should do what I want to achieve, but when tested it doesn't yield the expected results. Either a completed version of my function or any other method that achieves my goal would be extremely helpful.
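For reference, when each dimension of the data divides evenly by the target grid shape, the block averaging described in the question can be written with a single reshape. This is a minimal sketch of that approach (my own illustration under that divisibility assumption, not the asker's function):

import numpy as np

def block_average(function_data, grid_shape):
    """Downsample a 2D array by averaging over equal-sized blocks.
    Assumes each dimension of function_data is divisible by grid_shape."""
    rows, cols = function_data.shape
    r, c = grid_shape
    return function_data.reshape(r, rows // r, c, cols // c).mean(axis=(1, 3))

# Example: reduce a 100x100 Gaussian to a 10x10 grid.
x = np.linspace(-2, 2, 100)
z = np.exp(-(x[:, None]**2 + x[None, :]**2))
print(block_average(z, (10, 10)).shape)  # (10, 10)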
You could also use an interpolation function, if you can use SciPy. Here we use one of the gridded-data interpolating functions, RectBivariateSpline, to upsample your function, but you can find numerous examples on this and other sites.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import RectBivariateSpline as rbs
# Sampling coordinates
x = np.linspace(-2,2,20)
y = np.linspace(-2,2,30)
# Your function
f = np.exp(-(x[:,None]**2 + y**2))
# Interpolator
interp = rbs(x, y, f)
# Higher resolution coordinates
x_hd = np.linspace(x.min(), x.max(), x.size * 5)
y_hd = np.linspace(y.min(), y.max(), y.size * 5)
# New higher res function
f_hd = interp(x_hd, y_hd, grid = True)
# Some plots
fig, ax = plt.subplots(ncols = 2)
ax[0].imshow(f)
ax[1].imshow(f_hd)
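For completeness (my own addition, using the same setup as above): the interpolator can also be evaluated on a coarser grid, which is closer to the downsampling the question asks about. Note that evaluating a spline at coarse points samples the surface rather than averaging over blocks.

import numpy as np
from scipy.interpolate import RectBivariateSpline as rbs

# Same sampling coordinates and function as above.
x = np.linspace(-2, 2, 20)
y = np.linspace(-2, 2, 30)
f = np.exp(-(x[:, None]**2 + y**2))

interp = rbs(x, y, f)

# Coarser coordinates (lower resolution than the original sampling).
x_lo = np.linspace(x.min(), x.max(), x.size // 2)
y_lo = np.linspace(y.min(), y.max(), y.size // 2)
f_lo = interp(x_lo, y_lo, grid=True)
print(f_lo.shape)  # (10, 15)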
I am converting some code from Matlab to Python and found that I was getting different results from scipy.interpolate.griddata than from Matlab's scatteredInterpolant. After much research and experimentation I found that the interpolation results from scipy.interpolate.griddata seem to depend on the size of the data set provided: there seem to be thresholds that cause the interpolated value to change. Is this a bug, or can someone explain the algorithm that would cause this? Here is code that demonstrates the problem.
import numpy as np
from scipy import interpolate

# This code provides a simple example showing that the interpolated value
# for the same location changes depending on the size of the input data set.
# Results of this example show that the interpolated value changes
# at repeats 10 and 300.

def compute_missing_value(data):
    """Compute the missing value example function."""
    # Indices for valid x, y, and z data.
    # In this example x and y are simply the column and row indices.
    valid_rows, valid_cols = np.where(np.isnan(data) == False)
    valid_data = data[np.isnan(data) == False]
    interpolated_value = interpolate.griddata(
        np.array((valid_rows, valid_cols)).T, valid_data, (2, 2), method='linear')
    print('Size=', data.shape, ' Value:', interpolated_value)

# Sample data
data = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                 [-0.0398, 0.2238, -0.0576, 0.3841],
                 [0.2485, 0.2644, 0.2639, 0.1345],
                 [0.2161, 0.1913, 0.2036, 0.1462],
                 [0.0540, 0.3310, 0.3674, 0.2862]])

# Larger data sets are created by tiling the original data.
# The location of the invalid data to be interpolated is maintained at (2, 2).
repeat_list = [1, 9, 10, 11, 30, 100, 300]
for repeat in repeat_list:
    new_data = np.tile(data, (1, repeat))
    new_data[2, 2] = np.nan
    compute_missing_value(new_data)
The results are:
Size= (5, 4) Value: 0.07300000000000001
Size= (5, 36) Value: 0.07300000000000001
Size= (5, 40) Value: 0.19945000000000002
Size= (5, 44) Value: 0.07300000000000001
Size= (5, 120) Value: 0.07300000000000001
Size= (5, 400) Value: 0.07300000000000001
Size= (5, 1200) Value: 0.19945000000000002
Jaime's answer describes how scipy.interpolate.griddata interpolates values using Delaunay triangulation (a small illustrative sketch of these steps follows the quoted list):
[When] you make a call to scipy.interpolate.griddata:
First, a call to sp.spatial.qhull.Delaunay is made to triangulate the irregular grid coordinates.
Then, for each point in the new grid, the triangulation is searched to find in which triangle ... does it lay.
The barycentric coordinates of each new grid point with respect to the vertices of the enclosing simplex are computed.
An interpolated value is computed for that grid point, using the barycentric coordinates and the values of the function at the vertices of the enclosing simplex.
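To make those steps concrete, here is a small self-contained sketch (my own illustration, not from Jaime's answer) that reproduces linear interpolation for a single query point using scipy.spatial.Delaunay, find_simplex, and barycentric coordinates; the points and values are made up:

import numpy as np
from scipy.spatial import Delaunay

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.1, 1.2]])
values = np.array([0.0, 1.0, 2.0, 3.0])
query = np.array([0.3, 0.4])

tess = Delaunay(points)             # 1. triangulate the scattered points
simplex = tess.find_simplex(query)  # 2. find the triangle containing the query point

# 3. barycentric coordinates of the query point within that triangle
T = tess.transform[simplex]         # affine transform stored by scipy
b = T[:2].dot(query - T[2])
bary = np.append(b, 1 - b.sum())

# 4. interpolated value = barycentric-weighted sum of the vertex values
vertex_ids = tess.simplices[simplex]
print(values[vertex_ids].dot(bary))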
pv. explains that the Delaunay triangulation generated by a square grid is not unique. Since the points that get linearly interpolated depend on the triangulation, you can get different results depending on the particular Delaunay triangulation generated.
Here is a modified version of your script which draws the Delaunay triangulation used:
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
import scipy.spatial as spatial
import matplotlib.collections as mcoll

def compute_missing_value(data):
    """Compute the missing value example function."""
    mask = ~np.isnan(data)
    valid_rows, valid_cols = np.where(mask)
    valid_data = data[mask]
    interpolated_value = interpolate.griddata(
        (valid_cols, valid_rows), valid_data, (2, 2), method='linear')
    print('Size: {:<12s} Value: {:<.4f}'.format(
        str(data.shape), interpolated_value))

    points = np.column_stack((valid_cols, valid_rows))
    tess = spatial.Delaunay(points)
    tri = tess.simplices
    verts = tess.points[tri]
    lc = mcoll.LineCollection(
        verts, colors='black', linewidth=2, zorder=5)

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.add_collection(lc)
    ax.plot(valid_cols, valid_rows, 'ko')
    ax.set(xlim=(0, 3), ylim=(0, 3))
    plt.title('Size: {:<12s} Value: {:<.4f}'.format(
        str(data.shape), interpolated_value))

    for label, x, y in zip(valid_data, valid_cols, valid_rows):
        plt.annotate(
            label,
            xy=(x, y), xycoords='data',
            xytext=(-20, -40), textcoords='offset points',
            horizontalalignment='center',
            verticalalignment='bottom',
            bbox=dict(
                boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    plt.show()

# Sample data
orig_data = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                      [-0.0398, 0.2238, -0.0576, 0.3841],
                      [0.2485, 0.2644, 0.2639, 0.1345],
                      [0.2161, 0.1913, 0.2036, 0.1462],
                      [0.0540, 0.3310, 0.3674, 0.2862]])

repeat_list = [1, 4]
for repeat in repeat_list:
    print('{}: '.format(repeat), end='')
    new_data = np.tile(orig_data, (1, repeat))
    new_data[2, 2] = np.nan
    compute_missing_value(new_data)
As you can see, the two interpolated values, 0.1995 and 0.073, are the average of (A,C) or (B,D) (using pv.'s notation):
In [159]: (0.2644+0.1345)/2
Out[159]: 0.19945000000000002
In [160]: (0.2036-0.0576)/2
Out[160]: 0.07300000000000001
I think the explanation may lie in the way that scipy.interpolate.griddata constructs a triangulation of your data before interpolating. From the documentation, this uses scipy.interpolate.LinearNDInterpolator, which looks like it constructs a Delaunay triangulation of your data, and that triangulation isn't guaranteed to be the same when you add more nodes at the edge of your grid (as you've done with numpy.tile). Because of the way your 2D area is divided into triangles, the resulting linear interpolation may vary.
For a plain 4x5 grid with the (2,2) element missing, the Delaunay triangulation produced by scipy.spatial.Delaunay looks like this:
If you then tile the grid data, by the time you have four copies of the grid, the Delaunay triangulation has changed around the (2,2) location, which now lies on a horizontal boundary rather than a vertical one:
This means that the resulting interpolation for the value at (2,2) will use a different set of neighbouring nodes, which will give a different interpolated value on this extended grid.
(From a few quick experiments, this effect may not be present for 2x, or 3x tiling, but showed up on the 4x tiling.)
This change in the layout of the triangles is due to the way the Delaunay triangulation is computed, which involves projecting the entire 2D grid into a 3D space and then computing the convex hull before projecting that back into 2D triangles. That means that as you add more nodes to the grid, there's no guarantee that the 3D convex hull will be identical even where it refers to the same nodes in the original 2D grid.
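To check this programmatically rather than visually, you can ask scipy.spatial.Delaunay which triangle contains the query point in each case; the vertices of that triangle are the neighbours the linear interpolation at (2, 2) draws on. The following is my own sketch reusing the sample data from the answer above:

import numpy as np
import scipy.spatial as spatial

def enclosing_simplex_vertices(data, point=(2.0, 2.0)):
    """Return the (col, row) vertices of the Delaunay triangle containing `point`."""
    mask = ~np.isnan(data)
    rows, cols = np.where(mask)
    pts = np.column_stack((cols, rows))
    tess = spatial.Delaunay(pts)
    simplex = tess.find_simplex(np.asarray(point))
    return pts[tess.simplices[simplex]]

orig_data = np.array([[0.2154, 0.1456, 0.1058, 0.1918],
                      [-0.0398, 0.2238, -0.0576, 0.3841],
                      [0.2485, 0.2644, 0.2639, 0.1345],
                      [0.2161, 0.1913, 0.2036, 0.1462],
                      [0.0540, 0.3310, 0.3674, 0.2862]])

for repeat in (1, 4):
    tiled = np.tile(orig_data, (1, repeat))
    tiled[2, 2] = np.nan
    print(repeat, enclosing_simplex_vertices(tiled).tolist())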
I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there is an equal number of elements in each bin. I tried using matplotlib's hist function, but that only lets me decide the number of bins. How do I go about plotting this? (A normal plot and hist work, but they're not what is needed.)
I have 10000 entries. Only 200 have values greater than 0, and those lie between 0.0005 and 0.2. The distribution isn't even: only one element has the value 0.2, whereas roughly 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width with an equal number of elements.
The task does not make much sense to me, but the following code does what I understood to be the thing to do.
I also think the last lines of the code are what you really want: using different bin widths to improve the visualization (without targeting an equal number of samples within each bin). I used astroML's hist with bins='blocks' (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, what i think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3  # // so the size stays an integer in Python 3
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
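For reference, a shorter route to (approximately) equal-count bins is to place the bin edges at quantiles of the data. This is my own sketch with made-up data, not part of the answer above:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.exponential(0.01, size=10000)  # stand-in for the probability values
n_bins = 20

# Bin edges at evenly spaced quantiles -> roughly equal counts per bin.
edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
edges = np.unique(edges)  # repeated values (e.g. many zeros) collapse duplicate edges

plt.hist(values, bins=edges)
plt.show()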
Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N):  # N = number of particles
    particle_position[i] = np.random.uniform(0, xmax)  # Initialize in spatial domain
    particle_velocity[i] = np.random.normal(0, 5)      # Initialize in velocity space

for i in range(maxtime):
    plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np

SIZE = 100
SAMPLE_SIZE = 10

def select_subset(seq, size):
    """selects a subset of the data using ...
    """
    return seq[:size]

points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)

plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
            select_subset(points_y, SAMPLE_SIZE),
            marker="o", color="red")
plt.show()
It uses plt.scatter twice: once on the full data set, and once on the sample points.
You will have to decide how you want to select the sample of points - that logic is isolated in the select_subset function.
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np

SIZE = 100
SAMPLE_SIZE = 10

def make_mask(data_size, sample_size):
    mask = np.array([True] * sample_size + [False] * (data_size - sample_size))
    np.random.shuffle(mask)
    return mask

points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)

mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)

plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on a subset of the data points (the ones not selected in the sample), and a second time on the sampled subset, and draws each subset with its own marker. It is efficient & leaves the original data intact.
The code below does what you want. I have selected a random set v_sub_index of N_sub indices in the correct range (0 to N) and drawn those (with the _sub suffix) from the larger samples particle_position and particle_velocity. Please note that you don't have to loop to generate random samples; numpy has great functionality for that without having to use for loops.
import numpy as np
import matplotlib.pyplot as pl

N = 100
xmax = 1.
v_sigma = 2.5 / 2.  # 95% of the samples contained within 0, 5
v_mean = 2.5        # mean at 2.5

N_sub = 10
v_sub_index = np.random.randint(0, N, N_sub)

particle_position = np.random.rand(N) * xmax
particle_velocity = np.random.randn(N)

particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])

particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)

pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub, particle_velocity_sub, color='r', marker='^')
pl.show()