Using np.ravel to specify yerr in errorbar plot - python

My code generates values and corresponding standard deviations in sets of 3, i.e. 3x1 arrays. I want to plot them all together as a categorical errorbar plot. Since yerr only accepts a scalar, an (N,) array, or a 2xN array, I used np.ravel to convert all the 3x1 arrays into one flat array. But I still get the error ValueError: err must be [ scalar | N, Nx1 or 2xN array-like ]
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
names_p=['p1','p1','p1','p2','p2','p2','p3','p3','p3','p4','p4','p4','p5','p5','p5','p6','p6','p6'] #### The names are repeated three times because for each variable I have three values
y=(p1sdm2N_ratem,p2sdm2N_ratem,p3sdm2N_ratem,p4sdm2N_ratem,p5sdm2N_ratem,p6sdm2N_ratem) #### each of these 6 elements is 3 x 1 E.g. p1sdm2N_ratem=(0.04,0.02,0.03)
c=np.ravel((p1sdm2N_ratestd,p2sdm2N_ratestd,p3sdm2N_ratestd,p4sdm2N_ratestd,p5sdm2N_ratestd,p6sdm2N_ratestd)) ### each of these 6 elements is 3x1 e.g. p1sdm2N_ratestd=(0.001,0.003,0.001)
plt.errorbar(names_p, y, yerr=c)
This gives the error I mentioned before, even though c is a flat array of 18 values. (It's not an array of arrays, I checked.)
Note, with the way I've set up my variables,
plt.scatter(names_p,y)
and
plt.errorbar(names_p, y, yerr=None)
work, but without the errorbars, of course.
I'd appreciate any help!
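One likely explanation (a guess based only on the snippet above, since the p* variables aren't shown): y is still a tuple of six 3-element tuples while c has been flattened to 18 values, so their shapes no longer match. A minimal sketch of that fix, with placeholder data standing in for the p*_ratem and p*_ratestd variables, flattens y the same way:
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for the six p*_ratem / p*_ratestd variables
means = [(0.04, 0.02, 0.03)] * 6
stds = [(0.001, 0.003, 0.001)] * 6

names_p = [f'p{i}' for i in range(1, 7) for _ in range(3)]
y = np.ravel(means)  # flatten y the same way as the errors -> shape (18,)
c = np.ravel(stds)   # shape (18,), matching y

plt.errorbar(names_p, y, yerr=c, fmt='o')  # fmt='o' draws markers instead of a connecting line
plt.show()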

Related

Histogram of 2D arrays and determine array which contains highest and lowest values

I have a 2D array of shape (5, 10), so 5 different arrays with 10 values each. I am hoping to get a histogram and see which array is on the lower end versus the higher end of the histogram. Hope that makes sense. I am attaching an image of an example of what I mean (labeled example).
I'm looking for one histogram, organized so the distribution shows the highest and lowest of each array.
I'm having trouble doing this with Python. I tried a few ways of doing this:
# setting up 2d array
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
np.random.seed(1234)
array_2d = np.random.random((5,20))
I thought you could maybe just plot all the histograms of each array (5 of them) like this:
for i in range(5):
    plt.hist(signal.detrend(array_2d[i,:], type='constant'), bins=20)
plt.show()
And then I'd look to see which array's histogram is furthest to the right or left, but I'm not sure that makes much sense...
I also considered using .ravel to turn the 2D array into a 1D array, which makes a nice histogram. But the values from all the arrays get mixed together, so it's difficult to tell which array is on the lower or higher end of the histogram:
plt.hist(signal.detrend(array_2d.ravel(),type='constant'),bins=20)
plt.xticks(np.linspace(-1,1,10));
How might I get a histogram of the 5 arrays (shape (5, 10)) and see the range of the array with the lowest values versus the array with the highest values?
Also please let me know if this is unclear or not possible at all too haha. Thanks!
Maybe you could use a kdeplot? This would replace each input value with a small Gaussian curve and sum them.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(1234)
array_2d = np.random.random((5, 20))
sns.kdeplot(data=pd.DataFrame(array_2d.T, columns=range(1, 6)), palette='Set1', multiple='layer')
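If you'd rather stick with plain histograms, another option (just a sketch, not from the answer above) is to overlay one histogram per row with a label, so the legend tells you which array sits to the left or right:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1234)
array_2d = np.random.random((5, 20))

# One semi-transparent histogram per row, labelled so the legend identifies it
for i in range(5):
    plt.hist(array_2d[i, :], bins=20, alpha=0.5, label=f'array {i + 1}')
plt.legend()
plt.show()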

Intuition behind the correlation

I'm following this tutorial online from kaggle and I can't get my head round why .T is changing the shape of the matrix. Here is the part I am stuck at:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
I'm basically troubleshooting the code and tried this:
cm = np.corrcoef(df_train[cols].values)
cm.shape
returns a matrix with shape 1460x1460. But when I input:
cm = np.corrcoef(df_train[cols].values.T)
cm.shape
it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure it out.
The correlation gives you a normalized representation of the covariance matrix between all the "columns" of the dataframe. For instance, in the case of having only two variables, you'd end up with a matrix of the shape:
Rx = [[   1, r_xy],
      [r_yx,    1]]
This is quite an expensive computation, since it involves taking the dot product of each column with the rest, resulting in a correlation coefficient for each combination.
So in matrix notation, since you want to end up with a 10x10 matrix, you need the shapes to be correctly aligned. In this case you want (10,1460)x(1460,10), which gives you a (10,10) matrix. Hence you need to transpose the 2D array so that it has shape (10,1460) when you feed it to np.corrcoef.
Though you might find it a little easier by playing around with it yourself and seeing how the actual Pearson correlation is computed:
X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))
array([[1.        , 0.04400245],
       [0.04400245, 1.        ]])
Which is doing the same as:
mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*std_X**2))
array([[1.        , 0.04416552],
       [0.04383998, 1.        ]])
Note that, as mentioned, the result is a normalized dot product of X with itself, so for each (1,1460)x(1460,1) product you're getting a single number. So X here, just as in your example, has to be transposed so the dimensions are correctly aligned.
From numpy documentation of corrcoef:
x : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of x represents a variable, and
each column a single observation of all those variables. Also see rowvar below.
Note that each row represents a variable: in the first case you have 1460 rows and 10 columns, and in the second one you have 10 rows with 1460 columns.
So when you transpose your NumPy array you're basically changing from 1460 variables with 10 values each to 10 variables with 1460 values each.
If you are dealing with pandas you could just use the built-in .corr() method that computes the correlation between columns.
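For example (just a sketch reusing df_train, cols, sns and plt from the question's snippet, not code from the tutorial):
# .corr() correlates columns, so no transpose is needed,
# and the resulting DataFrame already carries the column labels
cm = df_train[cols].corr()
sns.heatmap(cm, annot=True, square=True, fmt='.2f')
plt.show()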

Plot 3rd axis of a 3D numpy array

I have a 3D numpy array that is a stack of 2D (m,n) images at certain timestamps, t. So my array is of shape (t, m, n). I want to plot the value of one of the pixels as a function of time.
e.g.:
import numpy as np
import matplotlib.pyplot as plt
data_cube = []
for i in xrange(10):
    a = np.random(100,100)
    data_cube.append(a)
So my (t, m, n) cube now has shape (10,100,100). Say I wanted a 1D plot of the value at index [12][12] at each of the 10 steps, I would do:
plt.plot(data_cube[:][12][12])
plt.show()
But I'm getting index out of range errors. I thought I might have my indices mixed up, but every plot I generate seems to be in the 'wrong' axis, i.e. across one of the 2D arrays, but instead I want it 'through' the vertical stack. Thanks in advance!
Here is the solution: since you are already using numpy, convert your final list to an array and just use slicing. The problem in your case was two-fold:
First: your final data_cube was not an array. For a list, you would have to iterate over the values.
Second: the slicing was incorrect.
import numpy as np
import matplotlib.pyplot as plt

data_cube = []
for i in range(10):
    a = np.random.rand(100,100)
    data_cube.append(a)
data_cube = np.array(data_cube)  # Added this step
plt.plot(data_cube[:,12,12])     # Modified the slicing
Output: a line plot of the pixel value at index [12, 12] across the 10 time steps.
A less verbose version that avoids iteration:
data_cube = np.random.rand(10, 100, 100)
plt.plot(data_cube[:,12,12])
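If the frames really do arrive one at a time, as in the original loop, a small sketch (with placeholder frames) using np.stack does the same list-to-array conversion explicitly:
import numpy as np
import matplotlib.pyplot as plt

frames = [np.random.rand(100, 100) for _ in range(10)]  # placeholder frames collected one by one
data_cube = np.stack(frames, axis=0)                    # shape (10, 100, 100)
plt.plot(data_cube[:, 12, 12])
plt.show()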

Why is numpy.ravel() required in this code that produces small multiples?

I found some code to generate a set of small multiples and it is working perfectly.
fig, axes = plt.subplots(6, 3, figsize=(21, 21))
fig.subplots_adjust(hspace=.3, wspace=.175)
for ax, data in zip(axes.ravel(), clean_sets):
    ax.plot(data.ETo, "o")
The line for ax, data in zip(axes.ravel(), clean_sets): contains .ravel(), but I do not understand what this is actually doing or why it is necessary.
If I take a look at the docs I find the following:
Return a contiguous flattened array.
A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.
I guess the return value that corresponds to axes from plt.subplots() is a multidimensional array that can't simply be iterated over, but really I'm not sure. A simple explanation would be greatly appreciated.
What is the purpose of using .ravel() in this case?
Your guess is correct. plt.subplots() returns either an Axes or a numpy array of several axes, depending on the input. In case a 2D grid is defined by the arguments nrows and ncols, the returned numpy array will be a 2D array as well.
This behaviour is explained in the pyplot.subplots documentation under the squeeze argument:
squeeze : bool, optional, default: True
If True, extra dimensions are squeezed out from the returned Axes object:
if only one subplot is constructed (nrows=ncols=1), the resulting single Axes object is returned as a scalar.
for Nx1 or 1xN subplots, the returned object is a 1D numpy object array of Axes objects.
for NxM subplots with N>1 and M>1, the result is returned as a 2D array.
If False, no squeezing at all is done: the returned Axes object is always a 2D array containing Axes instances, even if it ends up being 1x1.
Since here you have plt.subplots(6,3) and hence N>1, M>1, the resulting object is necessarily a 2D array, independent of what squeeze is set to.
This makes it necessary to flatten this array in order to be able to zip it. Options are
zip(axes.ravel())
zip(axes.flatten())
zip(axes.flat)
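A quick way to see this for yourself (a sketch with random placeholder data in place of clean_sets):
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(6, 3, figsize=(9, 9))
print(axes.shape)          # (6, 3): a 2D numpy array of Axes
print(axes.ravel().shape)  # (18,): flat, so zip pairs one Axes with one dataset

# axes.flat iterates in the same flattened order without creating a new array
for ax, y in zip(axes.flat, np.random.rand(18, 5)):
    ax.plot(y, "o")
plt.show()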

How do I make a scatter plot with these data?

I am trying to make a 2D representation of a 3D data in matplotlib.
I have some data files, for example:
a_1.dat
a_2.dat
a_3.dat
b_1.dat
b_2.dat
b_3.dat
From each data file I can extract the letter, the number, and a parameter associated with the letter-number pair.
I am trying to make a scatter plot where one axis is the range of letters, another axis is the range of numbers, and the scattered points represent the magnitude of the parameter associated with each letter-number pair. I would prefer it if this was a 2D plot with a colorbar of some kind, as opposed to a 3D plot.
At this point, I can make a stack of 2d numpy arrays, where each 2d array looks something like
[a 1 val_a1
a 2 val_a2
a 3 val_a3]
[b 1 val_b1
b 2 val_b2
b 3 val_b3]
First question: Is this the best way to store the data for the plot I am trying to make?
Second question: How do I make the plot using python (I am most familiar with matplotlib pyplot)?
To be able to fully determine whether your way of storing the data is correct, you should consider how you will use it. If you only want to use it for plotting as described here, then for the sake of simplicity you can just use three 1D arrays. If, however, you wish to have a tighter structure, you might consider using a 2D array with a custom dtype.
With this in mind, you can easily create a 2D scatter plot with different colors, where the exact color is determined by the value associated with each (letter, number) pair.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import cm
# You might note that in this simple case using numpy to create the arrays
# was actually unnecessary, as simple lists would suffice
letters = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
numbers = np.array([1, 2, 3, 1, 2, 3])
values = np.array([1, 2, 3, 1.5, 3.5, 4.5])
items = len(letters)
# x and y should be numbers, so we first feed it some integers
# Parameter c defines color values and cmap defines color mappings
plt.scatter(range(items), numbers, c=values, cmap=cm.jet)
# Now that data is created, we can re-set xticks
plt.xticks(range(items), letters)
Hopefully, this should be enough for a good start.
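Since the question asks for a colorbar, one possible follow-up (a sketch, not part of the original answer, reusing letters, numbers, values and items from the snippet above):
# A colorbar makes the value associated with each point readable directly
sc = plt.scatter(range(items), numbers, c=values, cmap=cm.jet)
plt.xticks(range(items), letters)
plt.colorbar(sc, label='parameter value')
plt.show()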
