How do I make a scatter plot with these data? - python

I am trying to make a 2D representation of 3D data in matplotlib.
I have some data files, for example:
a_1.dat
a_2.dat
a_3.dat
b_1.dat
b_2.dat
b_3.dat
From each data file I can extract the letter, the number, and a parameter associated with the letter-number pair.
I am trying to make a scatter plot where one axis is the range of letters, another axis is the range of numbers, and the scattered points represent the magnitude of the parameter associated with each letter-number pair. I would prefer if this were a 2D plot with a colorbar of some kind, as opposed to a 3D plot.
At this point, I can make a stack of 2d numpy arrays, where each 2d array looks something like
[a 1 val_a1
a 2 val_a2
a 3 val_a3]
[b 1 val_b1
b 2 val_b2
b 3 val_b3]
First question: Is this the best way to store the data for the plot I am trying to make?
Second question: How do I make the plot using python (I am most familiar with matplotlib pyplot)?

Whether your way of storing the data is the right one depends on how you use it. If you only want to use it for plotting as described here, then for the sake of simplicity you can just use three 1D arrays. If, however, you want a tighter structure, you might consider using a single array with a custom (structured) dtype.
With this in mind, you can easily create a 2D scatter plot with different colors, where the exact color is determined by the value associated with each (letter, number) pair.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import cm
# You might note that in this simple case using numpy for creating array
# was actually unnecessary as simple lists would suffice
letters = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
numbers = np.array([1, 2, 3, 1, 2, 3])
values = np.array([1, 2, 3, 1.5, 3.5, 4.5])
items = len(letters)
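# x and y must be numeric, so we first plot against integer positions
# Parameter c defines the color values and cmap defines the color mapping
plt.scatter(range(items), numbers, c=values, cmap=cm.jet)
# Now that the points are drawn, we can relabel the x ticks with the letters
plt.xticks(range(items), letters)
# Add the colorbar that maps colors back to values, then display the figure
plt.colorbar()
plt.show()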
Hopefully, this should be enough for a good start.
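The "custom dtype" idea mentioned above can be sketched as follows. This is only an illustration under assumptions: the field names letter, number and value are made up here, and np.unique is used to turn each letter into an integer x position so that all points with the same letter share one column (the code above instead gives every point its own x position).

```python
import numpy as np

# Hypothetical records, one (letter, number, value) triple per data file
records = np.array(
    [("a", 1, 1.0), ("a", 2, 2.0), ("a", 3, 3.0),
     ("b", 1, 1.5), ("b", 2, 3.5), ("b", 3, 4.5)],
    dtype=[("letter", "U1"), ("number", "i4"), ("value", "f8")],
)

# Map each letter to an integer position: 'a' -> 0, 'b' -> 1, ...
unique_letters, x_positions = np.unique(records["letter"], return_inverse=True)

print(unique_letters)  # ['a' 'b']
print(x_positions)     # [0 0 0 1 1 1]
```

With these arrays, plt.scatter(x_positions, records["number"], c=records["value"]) followed by plt.xticks(range(len(unique_letters)), unique_letters) should give one column of points per letter.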

Related

Histogram of 2D arrays and determine array which contains highest and lowest values

I have a 2D array of shape 5 and 10. So 5 different arrays with 10 values. I am hoping to get a histogram and see which array is on the lower end versus higher end of a histogram. Hope that makes sense. I am attaching an image of an example of what I mean (labeled example).
Looking for one histogram but the histogram is organized by the distribution of the highest and lowest of each array.
I'm having trouble doing this with Python. I tried a few ways of doing this:
# setting up 2d array
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
np.random.seed(1234)
array_2d = np.random.random((5,20))
I thought you could maybe just plot all the histograms of each array (5 of them) like this:
for i in range(5):
    plt.hist(signal.detrend(array_2d[i,:],type='constant'),bins=20)
plt.show()
And then looking to see which array's histogram is furthest to the right or left, but not sure if that makes too much sense...
Then also considered using .ravel to make the 2D array into a 1D array which makes a nice histogram. But all the values within each array are being shifted around so it's difficult to tell which array is on the lower or higher end of the histogram:
plt.hist(signal.detrend(array_2d.ravel(),type='constant'),bins=20)
plt.xticks(np.linspace(-1,1,10));
How might I get a histogram of the 5 arrays (shape 5, 10) and get the range of the arrays with the lowest values versus array with highest values?
Also please let me know if this is unclear or not possible at all too haha. Thanks!
Maybe you could use a kdeplot? This would replace each input value with a small Gaussian curve and sum them.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(1234)
array_2d = np.random.random((5, 20))
sns.kdeplot(data=pd.DataFrame(array_2d.T, columns=range(1, 6)), palette='Set1', multiple='layer')
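To complement the plot with numbers, here is a minimal sketch that ranks the rows. It assumes that "lower end" and "higher end" can be judged by each row's mean; that criterion is an assumption, and the minimum, maximum or median could be substituted.

```python
import numpy as np

np.random.seed(1234)
array_2d = np.random.random((5, 20))

# Per-row summary statistics (axis=1 reduces over the values inside each row)
row_min = array_2d.min(axis=1)
row_max = array_2d.max(axis=1)
row_means = array_2d.mean(axis=1)

# Row indices sorted from lowest mean to highest mean
order = np.argsort(row_means)

for i in order:
    print(f"row {i}: min={row_min[i]:.3f}, mean={row_means[i]:.3f}, max={row_max[i]:.3f}")
```

The first printed row is the one sitting furthest to the left of the distribution by this measure, and the last is furthest to the right.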

Using np.ravel to specify yerr in errorbar plot

My code generates values and corresponding standard deviations in sets of 3, i.e. 3x1 arrays. I want to plot them all together as a categorical errorbar plot. For specifying the yerr, since it only accepts scalar or (N,) or N x 2, I used np.ravel to convert all the 3x1 arrays to one single N x 1 array. But I still get the error ValueError: err must be [ scalar | N, Nx1 or 2xN array-like ]
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
names_p=['p1','p1','p1','p2','p2','p2','p3','p3','p3','p4','p4','p4','p5','p5','p5','p6','p6','p6'] #### The names are repeated three times because for each variable I have three values
y=(p1sdm2N_ratem,p2sdm2N_ratem,p3sdm2N_ratem,p4sdm2N_ratem,p5sdm2N_ratem,p6sdm2N_ratem) #### each of these 6 elements is 3 x 1 E.g. p1sdm2N_ratem=(0.04,0.02,0.03)
c=np.ravel((p1sdm2N_ratestd,p2sdm2N_ratestd,p3sdm2N_ratestd,p4sdm2N_ratestd,p5sdm2N_ratestd,p6sdm2N_ratestd)) ### each of these 6 elements is 3x1 e.g. p1sdm2N_ratestd=(0.001,0.003,0.001)
plt.errorbar(names_p,y,yerr=c)
This gives the error I mentioned before, even though c is an 18x1 array. (It's not an array of an array, I checked.)
Note, with the way I've set up my variables,
plt.scatter(names_p,y)
and
plt.errorbar(names_p,y,yerr=None)
work, but without the errorbars, of course.
I'd appreciate any help!
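One plausible cause, offered as a guess rather than a confirmed diagnosis: only the standard deviations are flattened with np.ravel, while y is still a tuple of six 3-element tuples, so y and yerr have different lengths. The sketch below uses made-up stand-in values (the real p1sdm2N_ratem etc. are not shown in the question) and only checks the shapes.

```python
import numpy as np

# Hypothetical stand-ins for the six 3-element mean and std tuples
means = [(0.04, 0.02, 0.03)] * 6
stds = [(0.001, 0.003, 0.001)] * 6

# One label per value: each name repeated three times
names_p = np.repeat(["p1", "p2", "p3", "p4", "p5", "p6"], 3)

# Flatten BOTH the values and the errors so they have matching length
y = np.ravel(means)
yerr = np.ravel(stds)

print(np.shape(means))      # (6, 3): what the unflattened y looks like
print(y.shape, yerr.shape)  # (18,) (18,)
```

With matching lengths, a call such as plt.errorbar(names_p, y, yerr=yerr, fmt='o') would be the thing to try next.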

plt.eventplot refuses lineoffsets

This should be quite easy to reproduce:
plt.eventplot(positions=[1, 2, 3], lineoffsets=[1, 2, 3])
raises
ValueError: lineoffsets and positions are unequal sized sequences
For reasons I can't figure out, because they clearly aren't.
If I understand correctly you want to plot 3 lines, at different starting heights (offsets). The way this works with plt.eventplot is as follows:
import numpy as np
import matplotlib.pyplot as plt
positions = np.array([1, 2, 3])[:,np.newaxis] # or np.array([[1], [2], [3]])
offsets = [1,2,3]
plt.eventplot(positions, lineoffsets=offsets)
plt.show()
You have to set the offset for each group of data you want to plot. In your case, you have to reshape the list into a 2D array (shape (m, n), with m the number of datasets and n the number of data points per set). This way plt.eventplot knows it has to use a different offset for each group of data. Also see this example.
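The shape difference described above can be checked with NumPy alone. This is a small sketch, with the variable names flat and grouped chosen here purely for illustration:

```python
import numpy as np

flat = np.array([1, 2, 3])       # shape (3,): ONE dataset holding three events
grouped = flat[:, np.newaxis]    # shape (3, 1): THREE datasets, one event each
offsets = [1, 2, 3]

print(flat.shape, grouped.shape)     # (3,) (3, 1)
print(len(grouped) == len(offsets))  # True
```

The flat version counts as a single dataset, which is why three offsets do not match it, whereas the grouped version has one row per offset.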

matplotlib.pyplot.hist returns a histogram where all bins have the same value when I have varying data

I am trying to create a histogram in python using matplotlib.pyplot.hist.
I have an array of data that varies; however, when I run my code the histogram is returned with the values in all bins equal to each other, or equal to zero, which is not correct.
The histogram should look like the line graph above it, with bins roughly the same height and in the same shape as the graph above.
The line graph above the histogram is there to illustrate what my data looks like and to show that my data does vary.
My data array is called spectrumnoise and is just a function I have created against an array x
x=np.arange(0.1,20.1,0.1)
The code I am using to create the histogram and the line graph above it is
import matplotlib.pyplot as mpl
mpl.plot(x,spectrumnoise)
mpl.hist(spectrumnoise,bins=50,histtype='step')
mpl.show()
I have also tried using
mpl.hist((x,spectrumnoise),bins=50,histtype='step')
I have also changed the number of bins countless times and tried normalising the histogram to see if that helps, but nothing works.
Image of the output of the code can be seen here
The problem is that spectrumnoise is a list of arrays, not a numpy.ndarray. When you hand hist a list of arrays as its first argument, it treats each element as a separate dataset to plot. All the bins have the same height because each 'dataset' in the list has only one value in it!
From the hist docstring:
Multiple data can be provided via x as a list of datasets
of potentially different length ([x0, x1, ...]), or as
a 2-D ndarray in which each column is a dataset.
Try stacking spectrumnoise into a single array, so that hist sees one dataset:
mpl.hist(np.vstack(spectrumnoise),50)
As an aside, looking at your code there's absolutely no reason to convert your data to lists in the first place. What you ought to do is operate directly on slices in your array, e.g.:
data[20:40] += y1
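To make the shapes concrete, here is a sketch with a made-up stand-in for spectrumnoise (a list of one-element arrays built from a sine curve, which is an assumption and not the asker's actual data). It uses np.histogram, which does the binning without drawing anything.

```python
import numpy as np

# Hypothetical reconstruction: 200 separate one-element arrays in a list
spectrumnoise = [np.array([np.sin(t)]) for t in np.linspace(0.1, 20.0, 200)]

stacked = np.vstack(spectrumnoise)  # shape (200, 1): one column, so one dataset
flat = np.hstack(spectrumnoise)     # shape (200,): a plain 1D array

# Binning the flattened data puts all 200 values into a single histogram
counts, edges = np.histogram(flat, bins=50)

print(len(spectrumnoise), stacked.shape, flat.shape)  # 200 (200, 1) (200,)
print(counts.sum())                                   # 200
```

Either stacked or flat can be handed to hist; the key point is that the data arrive as one array rather than as a list of 200 tiny datasets.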

Uniform Random Numbers

I am trying to understand what this code does. I am going through some examples about numpy and plotting and I can't figure out what u and v are. I know u is an array of two arrays each with size 10000. What does v=u.max(axis=0) do? Is the max function being invoked part of the standard python library? When I plot the histogram I get a pdf defined by 2x as opposed to a normal uniform distribution.
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt
np.random.seed(123)
u=rand.uniform(0,1,[2,10000])
v=u.max(axis=0)
plt.figure()
plt.hist(v,100,density=True,color='blue')  # density=True was spelled normed=1 in older Matplotlib
plt.ylim([0,2])
plt.show()
u.max(), or equivalently np.max(u), will give you the maximum value in the array - i.e. a single value. It's the Numpy function here, not part of the standard library. You often want to find the maximum value along a particular axis/dimension and that's what is happening here.
u has shape (2,10000), and u.max(axis=0) gives you the max along the 0 axis, returning an array with shape (10000,). If you did u.max(axis=1) you would get an array with shape (2,).
Simple illustration/example:
>>> a = np.array([[1,2],[3,4]])
>>> a
array([[1, 2],
       [3, 4]])
>>> a.max(axis=0)
array([3, 4])
>>> a.max(axis=1)
array([2, 4])
>>> a.max()
4
In the first three lines you load different modules (libraries that are relied upon in the rest of the code). You load numpy, which is a numerical library; numpy.random, which does a lot of the work of generating random numbers; and matplotlib, which provides the plotting functions.
the rest is described here:
np.random.seed(123)
A computer does not really generate a random number; rather, it picks numbers from a long, predetermined sequence (for a more correct explanation of how this is done, see http://en.wikipedia.org/wiki/Random_number_generation). In essence, if you want to reproduce the work with the same random numbers, the computer needs to know where in this sequence to start picking numbers. This is what this line of code does. If anybody else runs the same piece of code, they end up with the same 'random' numbers.
u=rand.uniform(0,1,[2,10000])
This generates 10000 random numbers twice that are distributed between 0 and 1. This is uniform distribution so it is equally likely to get any point between 0 and 1. (Again more information can be found here: http://en.wikipedia.org/wiki/Uniform_distribution_(continuous) ). You are creating two arrays within an array. This can be checked by doing: len(u) and len(u[0]).
v=u.max(axis=0)
The u.max? command in IPython refers you to the docs. It basically selects a maximum, and the axis determines along which dimension the maximum is taken. Try the following:
a = np.arange(4).reshape((2,2))
np.amax(a, axis=0) # gives array([2, 3])
np.amax(a, axis=1) # gives array([1, 3])
The rest of the code sets up the histogram plot. There are 100 bins in total in the histogram and the bars will be colored blue. The y-axis is limited to the range 0 to 2, and normed=1 (spelled density=True in current Matplotlib) normalizes the histogram so that its total area is 1, which makes it an estimate of the probability density.
I can't clearly make out what the true purpose or application of the code was, but this is in essence what it is doing.
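As for why the histogram follows 2x instead of a flat line: v is the larger of two independent uniform draws, and P(max <= x) = P(u1 <= x) * P(u2 <= x) = x**2 on [0, 1], whose derivative is the density 2x. The sketch below checks this numerically. It uses np.random.default_rng instead of the seed call in the question, so the individual numbers differ from the original run.

```python
import numpy as np

rng = np.random.default_rng(123)
u = rng.uniform(0, 1, size=(2, 10000))
v = u.max(axis=0)  # element-wise maximum of the two rows

# Normalized histogram of v, compared against the theoretical density 2x
density, edges = np.histogram(v, bins=20, range=(0, 1), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_deviation = np.abs(density - 2 * centers).max()

print(v.shape)        # (10000,)
print(v.mean())       # close to 2/3, the mean of a density 2x on [0, 1]
print(max_deviation)  # small compared with the density's peak of 2
```

So the plot is not a uniform distribution gone wrong; it is the distribution of the maximum of two uniform samples.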
