plotting correlation matrix using python - python

I want to plot a correlation matrix using Python. I have tried the following script:
import numpy as np
from matplotlib.pyplot import imshow, colorbar, show

corr_matrix = np.corrcoef(vector)  # vector: 2-D array, one variable per row
imshow(corr_matrix, interpolation='bilinear')
colorbar()
show()
The matrix is 2500x2500. The code above produces an image that is full of dots, but I want a smooth surface. How do I get that?
Best
Sudipta

What do you mean by "smooth surface" and why do you want to visualize your correlation matrix that way?
Here are two useful examples for visualizing [correlation] matrices. Both contain an explanation as well as example code for matplotlib.
Square grid pseudocolor plot
http://glowingpython.blogspot.com/2012/10/visualizing-correlation-matrices.html
Hinton Diagram
http://www.scipy.org/Cookbook/Matplotlib/HintonDiagrams
Update:
To supplement my comment, here's a pseudocolor visualization of a 1000x1000 correlation matrix, which didn't encounter memory issues on my humble laptop:
Note that row 20 is correlated with other variables and row 40 is correlated with row 80, in the style of the GlowingPython example, yet this information is obscured by the sheer size of the matrix.
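A plot of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the figure above: the data are synthetic, the sizes `n_vars` and `n_obs` are made up, and the only structure planted here is the link between rows 40 and 80.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_obs = 1000, 200                 # hypothetical sizes
data = rng.normal(size=(n_vars, n_obs))   # one variable per row
data[40] = data[80] + 0.1 * rng.normal(size=n_obs)  # make row 40 track row 80

corr = np.corrcoef(data)                  # shape (1000, 1000)

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the color limits to the full correlation range so colors are comparable.
mesh = ax.pcolormesh(corr, cmap="RdBu_r", vmin=-1, vmax=1)
fig.colorbar(mesh, ax=ax, label="correlation")
ax.set_aspect("equal")
```

Call `fig.savefig(...)` or `plt.show()` to view the result. At this size a single correlated pair covers about one pixel, which is why it is so easy to miss.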

You can sort the columns based on the values obtained in the correlation matrix.
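One possible way to do such a sort, as a sketch: reorder rows and columns with hierarchical clustering so that strongly correlated variables end up next to each other. The choice of clustering (rather than some other sort key), the distance 1 - |r|, and the helper name `reorder_by_clustering` are all assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def reorder_by_clustering(corr):
    """Return (reordered matrix, order) with similar variables made adjacent."""
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)  # strong correlation -> small distance
    dist = (dist + dist.T) / 2.0                   # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)     # condensed form expected by linkage
    order = leaves_list(linkage(condensed, method="average"))
    return corr[np.ix_(order, order)], order

# Synthetic demo: even-numbered rows follow signal a, odd-numbered rows follow b.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
data = np.array([(a if i % 2 == 0 else b) + 0.1 * rng.normal(size=500)
                 for i in range(6)])
sorted_corr, order = reorder_by_clustering(np.corrcoef(data))
```

After reordering, the interleaved groups become two contiguous blocks along the diagonal, which is much easier to see in a pseudocolor plot.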

Related

I have a 3D dataset of coordinates x,y,z. How do I check if the dataset is normally distributed?

The dataset is large with over 15000 rows.
One row of x,y,z plots a point on a 3D plot.
I need to scale the data and so far I'm using RobustScaler(), but I want to find out whether or not the dataset is normally distributed.
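A Matplotlib histogram [plt.hist()] can be used to check the distribution of the data. If the histogram of each coordinate is roughly symmetric and bell-shaped, with its highest peak in the middle, the data are plausibly close to normally distributed. A histogram is only a visual check, though, and does not prove normality.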
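To back up the visual check with numbers, one option is a per-coordinate normality test. The sketch below uses `scipy.stats.normaltest` (D'Agostino-Pearson) together with skewness and excess kurtosis. The helper name `normality_report`, the 0.05 threshold, and the synthetic data are assumptions for illustration. Note that testing each coordinate separately does not establish joint (multivariate) normality.

```python
import numpy as np
from scipy import stats

def normality_report(points, alpha=0.05):
    """Test each column of an (n, 3) array of x, y, z values for normality."""
    report = {}
    for name, column in zip("xyz", points.T):
        _, p_value = stats.normaltest(column)
        report[name] = {
            "skew": float(stats.skew(column)),
            "excess_kurtosis": float(stats.kurtosis(column)),
            "p_value": float(p_value),
            "looks_normal": bool(p_value > alpha),  # fails to reject normality
        }
    return report

# Synthetic stand-in for the dataset: x and y are normal, z is exponential.
rng = np.random.default_rng(0)
points = np.column_stack([
    rng.normal(size=15000),
    rng.normal(size=15000),
    rng.exponential(size=15000),
])
report = normality_report(points)
```

With 15000 rows such a test will flag even small departures from normality, so the skewness and kurtosis values are often more informative than the p-value alone.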

Plotting a numpy array with 256 columns used for k-means

I have a numpy array with this shape: (109, 256) Every row is a frame and every column is a value of the frame's histogram (8 bits).
With k-means I cluster the histograms to get a summary of the frames. I want something like this:
Where every cluster should be a "scene" with similar histograms.
But how can I plot a representative graphic of the k-means process with 256 columns?
I'm trying with this typical example:
plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')
But it shows only 2 columns and doesn't represent the problem. Any help? I'm really new to Python and machine learning.
P.S.: my k-means code works well and it clusters the way I want, but I don't know how to represent it correctly.
You always represent k-means clustering results on two axes. Those axes can be picked arbitrarily from the columns. The only way to include more attributes is to map them to another visual property, such as the size of the points (for example, the higher the income, the bigger the point) or different color shades.
Otherwise, you seem to have done everything correctly. You have to stick to two variables for your axes and can't integrate more.
You can also create several plots with different axes and arrange them in a grid (often this doesn't add much information, though).
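The grid idea, plus point size as an extra attribute, can be sketched like this. The helper `plot_pair_grid`, the chosen column pairs, and the random `labels` array (standing in for `kmeans.labels_`) are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

def plot_pair_grid(X, labels, pairs, sizes=None):
    """Draw one scatter plot per (column_i, column_j) pair, colored by cluster."""
    fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4),
                             squeeze=False)
    for ax, (i, j) in zip(axes[0], pairs):
        ax.scatter(X[:, i], X[:, j], c=labels, s=sizes, cmap="rainbow")
        ax.set_xlabel(f"column {i}")
        ax.set_ylabel(f"column {j}")
    fig.tight_layout()
    return fig

# Synthetic stand-ins with the shapes from the question.
rng = np.random.default_rng(0)
X = rng.random((109, 256))
labels = rng.integers(0, 4, size=109)   # use kmeans.labels_ in real code
sizes = 20 + 80 * X[:, 2]               # a third column shown as point size
fig = plot_pair_grid(X, labels, [(0, 1), (10, 200), (50, 128)], sizes=sizes)
```

Each panel still shows only two of the 256 columns, so which pairs are worth plotting depends on the data.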

Python – visualise correlation in data

I have a dataframe and want to identify how the variables are correlated. I can get the correlation matrix easily using df.corr(). I know I can easily plot the correlation matrix using plt.matshow(df.corr()) or seaborn's heatmap, but I'm looking for something like this graph,
taken from here. In this image, the thicker the line connecting the variables, the higher the correlation.
I did a few searches on Stack Overflow and elsewhere, but all the results are correlation matrices where the values have been replaced by colors. Is there a way to achieve the linked plot?
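A plot of this kind could be sketched with plain matplotlib: place the variables on a circle and draw a line between each pair, with the line width proportional to the absolute correlation. The helper name `plot_correlation_graph`, the 0.3 cut-off, the circular layout, and the synthetic data are all assumptions. The linked image may well have been produced differently.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

def plot_correlation_graph(corr, names, threshold=0.3, max_width=6.0):
    """Draw variables on a circle; thicker lines mean stronger correlation."""
    n = len(names)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    xy = np.column_stack([np.cos(angles), np.sin(angles)])
    fig, ax = plt.subplots(figsize=(5, 5))
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            r = corr[i, j]
            if abs(r) >= threshold:  # skip weak links to keep the plot readable
                ax.plot(xy[[i, j], 0], xy[[i, j], 1],
                        color="tab:red" if r > 0 else "tab:blue",
                        linewidth=max_width * abs(r), zorder=1)
                edges.append((names[i], names[j]))
    ax.scatter(xy[:, 0], xy[:, 1], s=300, color="lightgray", zorder=2)
    for (x, y), name in zip(xy, names):
        ax.text(x, y, name, ha="center", va="center", zorder=3)
    ax.set_aspect("equal")
    ax.set_axis_off()
    return fig, edges

# Synthetic demo: b follows a, d is anti-correlated with c.
rng = np.random.default_rng(1)
a, c = rng.normal(size=(2, 500))
data = np.vstack([a, a + 0.5 * rng.normal(size=500),
                  c, -c + 0.5 * rng.normal(size=500)])
names = ["a", "b", "c", "d"]
fig, edges = plot_correlation_graph(np.corrcoef(data), names)
```

With a dataframe, `df.corr().to_numpy()` and `list(df.columns)` would supply the two arguments.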

Python distribution statistics on scatter plot style data

I'm trying to get statistics on a distribution but all the libraries I've seen require the input to be in histogram style. That is, with a huge long array of numbers like what plt.hist wants as an input.
I have the bar chart equivalent, i.e. 2 arrays; one with the x-axis centre points, and one with y-axis values for the corresponding value of each point. The plot looks like this:
My question is: how can I compute statistics such as mean, range, skewness and kurtosis for this dataset? The numbers are not always integers. It seems very inefficient to force Python to build a histogram-style array with, for example, 180x 0.125's, 570x 0.25's, etc., as in the figure above.
Doing mean on the current array I have will give me the average frequency of all sizes, i.e. plotting a horizontal line on the figure above. I'd like a vertical line to show the average, as if it were a distribution.
Feels like there should be an easy solution! Thanks in advance.
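One way to do this without expanding the array is to treat the y-values as weights on the x-values and compute weighted moments. The following is a minimal sketch; the helper name `binned_stats` is made up, and apart from the 180 and 570 counts mentioned above the demo numbers are invented.

```python
import numpy as np

def binned_stats(centres, counts):
    """Moments of a distribution given bin centres and per-bin counts."""
    centres = np.asarray(centres, dtype=float)
    counts = np.asarray(counts, dtype=float)
    mean = np.average(centres, weights=counts)
    dev = centres - mean
    var = np.average(dev**2, weights=counts)   # population variance
    std = np.sqrt(var)
    occupied = centres[counts > 0]             # ignore empty bins for the range
    return {
        "mean": mean,
        "std": std,
        "range": occupied.max() - occupied.min(),
        "skewness": np.average(dev**3, weights=counts) / std**3,
        "excess_kurtosis": np.average(dev**4, weights=counts) / var**2 - 3.0,
    }

centres = [0.125, 0.25, 0.5, 1.0]   # x-axis centre points (partly hypothetical)
counts = [180, 570, 300, 50]        # y-axis values (partly hypothetical)
result = binned_stats(centres, counts)

# Cross-check against the "huge long array" approach.
expanded = np.repeat(centres, counts)
```

The weighted mean is the position of the vertical line on the plot. These are population (biased) moments, and they treat every sample in a bin as sitting exactly at the bin centre.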

How to interpolate 3D coordinates

I have data points in x,y,z format. They form a point cloud of a closed manifold. How can I interpolate them using R-Project or Python? (Like polynomial splines)
It depends on what the points originally represented. Just having an array of points is generally not enough to derive the original manifold from. You need to know which points go together.
The most common low-level boundary representation ("brep") is a set of triangles. This is, for example, what OpenGL and DirectX take as input. I've written Python software that can convert triangular meshes in STL format to, e.g., a PDF image. Maybe you can adapt that for your purpose. Interpolating a triangle is usually not necessary, but it is trivial to do. Create three new points, each halfway between two of the original points. These three points form an inner triangle, and the rest of the surface forms three more triangles. With this you have transformed one triangle into four.
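The midpoint subdivision described above can be sketched in a few lines of NumPy. The function names are made up for this example and are not taken from the STL software mentioned.

```python
import numpy as np

def subdivide_triangle(a, b, c):
    """Split triangle (a, b, c) into four triangles using the edge midpoints."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    ab, bc, ca = (a + b) / 2, (b + c) / 2, (c + a) / 2
    return [
        (a, ab, ca),    # corner triangle at a
        (ab, b, bc),    # corner triangle at b
        (ca, bc, c),    # corner triangle at c
        (ab, bc, ca),   # inner triangle formed by the three midpoints
    ]

def triangle_area(p, q, r):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    return 0.5 * np.linalg.norm(np.cross(q - p, r - p))

original = (np.array([0.0, 0.0, 0.0]),
            np.array([2.0, 0.0, 0.0]),
            np.array([0.0, 2.0, 0.0]))
pieces = subdivide_triangle(*original)
```

The four pieces cover exactly the same surface as the original triangle, so this adds points without changing the shape.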
If the points are control points for spline surface patches (like NURBS or Bézier surfaces), you have to know which points together form a patch. Since these are parametric surfaces, once you know the control points, all the points on the surface can be determined. Below is the function for a Bézier surface. The parameters u and v are the parametric coordinates of the surface. They run from 0 to 1 along two adjacent edges of the patch. The control points are k_ij.
The B functions are weight functions for each control point.
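Written out with the symbols used above, the standard definition of a Bézier surface with (n+1) x (m+1) control points, and of the Bernstein polynomials that serve as the B weight functions, is:

```latex
p(u, v) = \sum_{i=0}^{n} \sum_{j=0}^{m} B_i^n(u) \, B_j^m(v) \, k_{ij},
\qquad
B_i^n(u) = \binom{n}{i} \, u^i \, (1 - u)^{n - i}
```

The degrees n and m are notation introduced here; they are not named in the text above.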
Suppose you want to approximate a Bézier surface by a grid of 10x10 points. To do that you have to evaluate the function p for u and v running from 0 to 1 in 10 steps (generating the steps is easily done with numpy.linspace).
For each (u,v) pair, p returns a 3D point.
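As a sketch, the 10x10 evaluation could look like this, assuming the standard Bernstein-polynomial form of a Bézier surface. The function names and the 4x4 control grid are invented for the example.

```python
from math import comb

import numpy as np

def bernstein(n, i, t):
    """Bernstein polynomial B_i^n evaluated at t (scalar or array)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_surface(control, u, v):
    """Evaluate a Bezier patch.

    control: array of shape (n+1, m+1, 3) holding the control points k_ij.
    u, v:    1-D arrays of parameter values in [0, 1].
    Returns an array of shape (len(u), len(v), 3) of surface points.
    """
    control = np.asarray(control, dtype=float)
    n, m = control.shape[0] - 1, control.shape[1] - 1
    Bu = np.array([bernstein(n, i, u) for i in range(n + 1)])  # (n+1, len(u))
    Bv = np.array([bernstein(m, j, v) for j in range(m + 1)])  # (m+1, len(v))
    # Sum over i and j of B_i(u) * B_j(v) * k_ij for every (u, v) pair.
    return np.einsum("ia,jb,ijk->abk", Bu, Bv, control)

# Hypothetical bicubic patch: a 4x4 grid of control points.
xs, ys = np.meshgrid(np.arange(4.0), np.arange(4.0), indexing="ij")
control = np.stack([xs, ys, np.sin(xs) * np.cos(ys)], axis=-1)

u = np.linspace(0, 1, 10)
v = np.linspace(0, 1, 10)
grid = bezier_surface(control, u, v)   # shape (10, 10, 3)
```

The patch passes through its four corner control points. The interior control points only pull the surface toward them.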
If you want to visualise these points, you could use mplot3d from matplotlib.
By "compact manifold" do you mean a lower dimensional function like a trajectory or a surface that is embedded in 3d? You have several alternatives for the surface-problem in R depending on how "parametric" or "non-parametric" you want to be. Regression splines of various sorts could be applied within the framework of estimating mean f(x,y) and if these values were "tightly" spaced you may get a relatively accurate and simple summary estimate. There are several non-parametric methods such as found in packages 'locfit', 'akima' and 'mgcv'. (I'm not really sure how I would go about statistically estimating a 1-d manifold in 3-space.)
Edit: But if I did want to see a 3D distribution and get an idea of whether it was a parametric curve or trajectory, I would reach for package:rgl and just plot it in a rotatable 3D frame.
If you are instead trying to form the convex hull (for which the word interpolate is probably the wrong choice), then I know there are 2-d solutions and suspect that searching would find 3-d solutions as well. Constructing the right search strategy will depend on specifics, whose absence is reflected in the 2 comments so far. I'm speculating that modeling lower and higher quantiles, like the 1st and 99th percentiles, as a function of (x,y) could be attempted if you wanted to use a regression effort to create boundaries. There is a well-supported quantile regression package, 'quantreg' by Roger Koenker, whose fitting function is rq.
