Matrix normalization over multiple runs, what does this code do?

Matrix normalization over multiple runs, what does this code do? - python

I have several numpy matrices collected over some time. I now want to visualize these matrices and explore visual similarities among them. The matrices contain small numbers from 0.0 to 1.0.
To compare them, I want to ensure that the same "areas" get colored with the same color, e.g. that 0.01 to 0.02 always is red, and 0.02 to 0.03 always is green. I have two question:
I found another question which has this code snippet:
a = np.random.normal(0.0,0.5,size=(5000,10))**2
a = a/np.sum(a,axis=1)[:,None] # Normalize
plt.pcolor(a)
What is the effect of the second line, precisely the [:,None] statement. I tried normalizing a matrix by:
max_a = a/10# Normalize
print(max_a.shape)
plt.pcolor(max_a)
but there is not much visual difference compared to the visualization for the unnormalized matrix. When I then add the [:,None] statement I get an error
ValueError: too many values to unpack (expected 2)
which is expected since the shape now is 10,1,10. I therefor want to know what the brackets do and how to read the statement.
Secondly, and related, I want to make sure that I can visual compare the matrices. I therefor want to fix the "colorization", e.g. the ranges when a color is green or red, so that I do not end up with 0 to 0.1 as green in plot A and with 0 to 0.1 as red in plot B. How can I fix the "translation" from floats to colors? Do I have to normalize each matrix with a same constant, e.g. 10? Or do I normalize them with an unique value -- do I even need normalization here?

[:,None] adds new axis so you'll be able to divide sum of all columns in each row - it is the same as using np.sum(a,axis=1)[:,np.newaxis] - when you sum all columns with np.sum(a,axis=1) you'll get 1d array with shape (5000), but to be able to normalize your matrix with summed columns you need 2d array with shape (5000,1), that's why new axis is needed.
You can have fixed colors by fixing scale of your colormap: plt.pcolor(max_a,vmin=0,vmax=1)
adding discrete colorbar might also help:
from pylab import cm
cmap = cm.get_cmap('jet', 10)
plt.pcolor(a,cmap=cmap,vmin=0,vmax=1)
plt.colorbar()

Related

Adding a python colorbar but certain values are fixed to a color

I have a 2D array of data (lets call this X) where the values can range from anywhere between -2000 and 2000. The array also includes "empty" data, which I have implemented by using NaNs. I would like to use a custom-listed colormap to represent the data, where the NaNs are "empty" or white. I would also like to be able to dynamically alter the colorbar limits v_min and v_max which I have successfully done (see colormap_plot).
However, I would like to specify that values in X between -25 and 25 should be colored grey, regardless of the range of the colorbar. I have managed to do this by creating a second array (lets call this Y) of the same size as X, whose values are RGB codes dependent on the corresponding value inside X. This gives the result that I require (see plot_with_grey).
However, I would like to have the same colorbar from the colormap_plot at the side of the plot_with_grey, so I would love a solution that would allow me to solve this.

How can I produce multiple plots on one graph where each plot has a different color? Can I set a colormap to an array of scalar variables?

I have a series of simple mass-radius relationships (so a 2d plot) that I'd like to include in one plot according to how well of a fit it is to my data. I have the radii (x), masses (y), and a separate 1d array that quantifies how well the M-R relationship fits to my data. This 1d array can be likened to error, but it isn't calculated using a standard Python function (I calculate it myself).
Ideally, my end result is a series of ~2000 mass-radius relationships on one plot, where each mass-radius relationship is color coded according to its agreement with my data. So something like this, but instead of two colors, it's on a grayscale:
Here's a snippet of what I'm trying to do but obviously isn't working, as I didn't even define a colormap:
for i in range(10):
plt.plot(x,y,c=error[i])
plt.colorbar()
plt.show()
And again, I'd like to have each element in error correspond to a color in greyscale.
I know this is simple so I'm definitely outing myself as an amateur here, but I really appreciate any help!
EDIT: Here is the code snippet where I made the plot:
for i in range(2396):
if eps[i]==0.:
plt.plot(f[i,:,1],f[i,:,0],c='g',linewidth=0.1)
else:
plt.plot(f[i,:,1],f[i,:,0],c='r',linewidth=0.1)
plt.xlabel('Radius')
plt.ylabel('Mass')
plt.title('Neutron Star Mass-Radius Relationships')

You have one fit value for each series of points:
Here is a script to plot multiple series on a single plot, where each series (i.e. each line) is colored based on a third fit variable:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
fit = np.random.rand(25)
cmap = mpl.cm.get_cmap('binary')
color_gradients = cmap(fit) # this line changed! it was incorrect before
fig, (ax1,ax2) = plt.subplots(1,2, gridspec_kw={'width_ratios': [30, 1]})
for i,_ in enumerate(fit):
x = sorted(np.random.randint(100, size=25))
y = sorted(np.random.randint(100, size=25))
ax1.plot(x, y, c=color_gradients[i])
cb = mpl.colorbar.ColorbarBase(ax2, cmap=cmap,
orientation='vertical',
ticks=[0,1])
Now responding to your questions from the comments:
How does fit play into the rest of the plot?
fit is an array of random decimals between 0 and 1, corresponding to the "error" values for each series:
>>>fit
array([0.76458568, 0.15017328, 0.70686393, 0.98885091, 0.18449953,
0.62506401, 0.49513702, 0.69138913, 0.96844495, 0.48937011,
0.09878352, 0.68965829, 0.13524182, 0.95419698, 0.39844843,
0.63095159, 0.95933663, 0.00693236, 0.98212815, 0.16262205,
0.26274884, 0.56880703, 0.68233984, 0.18304883, 0.66759496])
fit is used to generate the divisions of the color gradient in these lines:
cmap = mpl.cm.get_cmap('binary')
color_gradients = cmap(fit)
I'm not sure where the specific documentation for this is, but basically, passing an array of numbers to the cmap will return an array of RGBA color values spaced accordingly to the array passed:
>>>color_gradients
array([[0.23529412, 0.23529412, 0.23529412, 1. ],
[0.85098039, 0.85098039, 0.85098039, 1. ],
[0.29411765, 0.29411765, 0.29411765, 1. ],
[0.00784314, 0.00784314, 0.00784314, 1. ],
.
.
.
So this array can be used to assign specific colors to each line, based on their fit. And it assumes the higher numbers are better fits, and that you want better fits to be colored darker.
Note that before I had color_gradient_divisions = [(1/len(fit))*i for i in range(len(fit))], which was incorrect as it evenly divides the color map into 25 pieces, not actually returning values corresponding to the fit.
The cmap is also passed to the colorbar when constructing it. Often you can just call plt.colorbar to simply create one, but here matplotlib doesn't automatically know what to create a color bar for as the lines are separate and manually colored. So instead, we create 2 axes, one for the plot and one for the colorbar (spacing them accordingly with the gridspec_kw argument), and then using mpl.colorbar.ColorbarBase to make the colorbar (I also removed a norm argument b/c I don't think it is needed).
why have you used an underscore in the for loop?
This is a pattern in Python, typically meaning "I'm not using this thing". enumerate returns an iterator of tuples with the structure (value index, value). So enumerate(fit) returns (0, 0.76458568), (1, 0.15017328), etc (based on the data shown above). I am only using the index (i) to get the corresponding position (and color) in color_gradients (ax1.plot(x, y, c=color_gradients[i])). Even though the values from fit are being returned by enumerate, I am not using them, so I instead point them to _. If I was using them within the loop, I would use a typical variable name instead.
enumerate is the encouraged way to loop over an iterable if you need to access both the count of the values and the values themselves. People tend to use for i in range(len(fit)) also to do this (which works fine!) but the further I've gone with Python the more I've seen people avoiding that.
This was a little bit of a confusing example; I set my loop to iterate over fit b/c I was conceptualizing "creating one graph for each value in fit". But I could have just looped over color_gradients (for c in color_gradients) which might be more clear.
But in your real data, something like enumerate may be helpful if you are looping over multiple aligned arrays. In my example, I just create new random data within each loop. But you will likely want to have an array of fit values, an array of color values, an array (of series) of radii, and an array (of series) of masses, such that the ith element of each array corresponds to the same star. You may be iterating over one array and want to access the same position in another (zip is used for this also).
I'll leave this second answer here, even though it wasn't what OP was getting at:
You have one fit value for each point:
Here, each pair of x,y coordinates has its own fit value:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randint(100, size=25)
y = np.random.randint(100, size=25)
fit = np.random.rand(25)
plt.scatter(x, y, c=fit, cmap='binary')
plt.colorbar()
Note that with either approach, poorly fitting points or lines may be invisible

Python: Plot a sparse matrix

I have a sparse matrix X, shape (6000, 300). I'd like something like a scatterplot which has a dot where the X(i, j) != 0, and blank space otherwise. I don't know how many nonzero entries there are in each row of X. X[0] has 15 nonzero entries, X[1] has 3, etc. The maximum number of nonzero entries in a row is 16.
Attempts:
plt.imshow(X) results in a tall, skinny graph because of the shape of X. Using plt.imshow(X, aspect='auto) will stretch out the graph horizontally, but the dots get stretched out to become ellipses, and the plot becomes hard to read.
ax.spy suffers from the same problem.
bokeh seems promising, but really taxes my jupyter kernel.
Bonus:
The nonzero entries of X are positive real numbers. If there was some way to reflect their magnitude, that would be great as well (e.g. colour intensity, transparency, or across a colour bar).
Every 500 rows of X belong to the same class. That's 12 classes * 500 observations (rows) per class = 6000 rows. E.g. X[:500] are from class A, X[500:1000] are from class B, etc. Would be nice to colour-code the dots by class. For the moment I'll settle for manually including horizontal lines every 500 rows to delineate between classes.

You can use nonzero() to find the non zero elements and use scatter() plot the points:
import pylab as pl
import numpy as np
a = np.random.rand(6000, 300)
a[a < 0.9999] = 0
r, c = np.nonzero(a)
pl.scatter(r, c, c=a[r, c])

It seems to me heatmap is the best candidate for this type of plot. imshow() will return u a colored matrix with color scale legend.
I don't get ur stretched ellipses problem, shouldnt it be a colored squred for each data point?
u can try log color scale if it is sparse. also plot the 12 classes separately to analyze if theres any inter-class differences.

plt.matshow also turned out to be a feasible solution. I could also plot a heatmap with colorbars and all that.

Comparing two arrays which have very dispersed values

I have a very sparse array that looks like:
Array A: min = -68093253945.0 max=8.54631971208e+13
Array B: min=-1e+15 max = 1.87343e+14
And also each array will have concentration at certain levels e.g. near 2000, near 1m, near 0.05 and so on.
I am trying to compare these two arrays in terms of concentration, and want to do so in a way that is invariant to the number of entries in each. I also want to account for huge outliers if possible and maybe compress the bins to be between 0 and 1 or something of this sort.
The aim is to make a histogram via:
plt.hist(A,alpha=0.5,label='A') # plt.hist passes it's arguments to np.histogram
ion()
plt.hist(B,alpha=0.5,label='B')
plt.title("Histogram of Values")
plt.legend(loc='upper right')
plt.savefig('valuecomp.png')
How do I do this? I have experimented with:
A = stats.zscore(A)
B = stats.zscore(B)
A = preprocessing.scale(A)
B = preprocessing.scale(B)
A = preprocessing.scale(A, axis=0, with_mean=True, with_std=True, copy=True)
B = preprocessing.scale(B, axis=0, with_mean=True, with_std=True, copy=True)
And then for my histograms, adding normed=True, range(0,100). All the methods give me a histogram with a massive vertical chunk near to 0.0 instead of distributing the values smoothly. range(0,100) looks good but it ignores any values like 1m outside of 100.
Perhaps I need to remove outliers from my data first and then do a histogram?

#sascha's suggestion of using AstroML was a good one, but the knuth and freedman versions seem to take astronomically long (excuse the pun), and the blocks version simply thinned the blocks.
I took the sigmoid of each value via from scipy.special import expit and then plotted the histogram that way. Only way I could get this to work.

Scipy normalization-- localize values to set discrete points

I am currently displaying two separate 2D images (x,y plane and z,y plane) that are derived from 96x512 arrays of 0-255 values. I would like to be able to filter the data so that anything under a certain value is done away with (the highest values are indicative of targets). What I would like to be able to do is from these images, separate discrete points that may be then mapped three-dimensionally as points, rather than mapping two intersecting planes. I'm not entirely sure how to do this or where to start (I'm very new to python). I am producing the images using scipy and have done some normalization and noise reduction, but I'm not sure how to then separate out anything over the threshold as it's own individual point. Is this possible?

If I understand correctly what you want, filtering points can be done like this:
A=numpy.random.rand(5,5)
B=A>0.5
Now B is a binary mask, and you can use it in a number of ways:
A[B]
will return an array with all values of A that are true in B.
A[B]=0
will assign 0 to all values in A that are true in B.
numpy.nonzero(B)
will give you the x,y coordinates of each point that is true in B.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Matrix normalization over multiple runs, what does this code do? - python

Related

Adding a python colorbar but certain values are fixed to a color

How can I produce multiple plots on one graph where each plot has a different color? Can I set a colormap to an array of scalar variables?

Python: Plot a sparse matrix

Comparing two arrays which have very dispersed values

Scipy normalization-- localize values to set discrete points

Categories

Resources