I have a dataframe and want to identify how the variables are correlated. I can get the correlation matrix easily using df.corr(). I know I can plot the correlation matrix with plt.matshow(df.corr()) or seaborn's heatmap, but I'm looking for something like this - graph
taken from here. In this image, the thicker the line connecting two variables, the higher the correlation.
I did a few searches on Stack Overflow and elsewhere, but all the results are correlation matrices where the values have been replaced by colors. Is there a way to achieve the linked plot?
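One way to get that kind of plot is to build a graph from the correlation matrix and let edge width encode correlation strength. Here's a minimal sketch using networkx; the data, layout, and width scaling factor are my own assumptions, not taken from the linked example:

```python
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical example data: four columns, two of them correlated
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("ABCD"))
df["B"] += df["A"]  # induce correlation between A and B

corr = df.corr()

# Build a graph whose edge weights are absolute correlations
G = nx.Graph()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        G.add_edge(a, b, weight=abs(corr.loc[a, b]))

# Thicker line = stronger correlation (the factor 5 is just for visibility)
pos = nx.circular_layout(G)
widths = [5 * G[u][v]["weight"] for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, width=widths, node_color="lightblue")
plt.show()
```

You could also drop edges below some correlation threshold before drawing, so only meaningful relationships appear.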
I'm trying to get statistics on a distribution, but all the libraries I've seen require the input to be in histogram style, i.e. a huge long array of raw numbers like what plt.hist wants as input.
I have the bar-chart equivalent, i.e. two arrays: one with the x-axis centre points and one with the y-axis values (frequencies) for each point. The plot looks like this:
My question is how I can compute statistics such as the mean, range, skewness, and kurtosis of this dataset. The numbers are not always integers. It seems very inefficient to force Python to build a histogram-style array with, for example, 180x 0.125's, 570x 0.25's etc., as in the figure above.
Taking the mean of the array I currently have would give me the average frequency across all sizes, i.e. a horizontal line on the figure above. I'd like a vertical line showing the average, as if it were a distribution.
Feels like there should be an easy solution! Thanks in advance.
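There's no need to expand the bins into a raw-data array: the moments can be computed directly as weighted averages of the bin centres, using the counts as weights. A minimal sketch (the bin values here are made up for illustration):

```python
import numpy as np

# Hypothetical binned data: bin centres and their frequencies
centres = np.array([0.125, 0.25, 0.375, 0.5, 0.625])
counts = np.array([180, 570, 300, 120, 30])

# Weighted moments -- mathematically identical to expanding the bins
mean = np.average(centres, weights=counts)
var = np.average((centres - mean) ** 2, weights=counts)
std = np.sqrt(var)
skew = np.average((centres - mean) ** 3, weights=counts) / std ** 3
kurt = np.average((centres - mean) ** 4, weights=counts) / std ** 4 - 3  # excess kurtosis

print(mean, std, skew, kurt)
```

The range is simply centres.max() - centres.min() over bins with nonzero counts. These are the population (biased) moment estimators; if you need the bias-corrected versions, the correction factors are straightforward to add.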
I have the following dataframe, which resulted from running a grid search over several regression models:
As can be seen, many values are grouped around 0.0009, but several are orders of magnitude larger in magnitude (-1.6, -2.3, etc.).
I would like to plot these results, but I can't seem to find a way to get a readable plot. I have tried a bar plot, but I get something like:
How can I make this bar plot more readable? Or what other kind of plot would be better suited to visualizing such data?
Edit: Here is the dataframe, exported as CSV:
,a,b,c,d
LinearRegression,0.000858399508896,-4.11609208874e+20,0.000952538859738,0.000952538859733
RandomForestRegressor,-1.62264355718,-2.30218457629,0.0008957696846039999,0.0008990722465239999
ElasticNet,0.000883257900658,0.0008525502791760002,0.000884706195921,0.000929498696126
Lasso,7.92193516085e-05,-1.84086765436e-05,7.92193516085e-05,-1.84086765436e-05
ExtraTreesRegressor,-6.320170496909999,-6.30420308033,,
Ridge,0.0008584791396339999,0.0008601028734780001,,
SGDRegressor,-4.62522968756,,,
You could give the graph a log scale, which is often used for plotting data with a very large range. This muddies the interpretation slightly, as equal distances on the axis now represent equal orders of magnitude rather than equal differences. You can read about log scales here:
https://en.wikipedia.org/wiki/Logarithmic_scale
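One wrinkle with the data as posted: several of the scores are negative, and a plain log scale cannot represent negative values. Matplotlib's symlog scale (logarithmic away from zero, linear in a small region around it) handles both signs. A sketch using the CSV from the question; the linthresh value is an assumption you'd tune:

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

csv = """,a,b,c,d
LinearRegression,0.000858399508896,-4.11609208874e+20,0.000952538859738,0.000952538859733
RandomForestRegressor,-1.62264355718,-2.30218457629,0.0008957696846039999,0.0008990722465239999
ElasticNet,0.000883257900658,0.0008525502791760002,0.000884706195921,0.000929498696126
Lasso,7.92193516085e-05,-1.84086765436e-05,7.92193516085e-05,-1.84086765436e-05
ExtraTreesRegressor,-6.320170496909999,-6.30420308033,,
Ridge,0.0008584791396339999,0.0008601028734780001,,
SGDRegressor,-4.62522968756,,,
"""
df = pd.read_csv(io.StringIO(csv), index_col=0)

ax = df.plot.bar()
# symlog is logarithmic for |y| > linthresh and linear inside,
# so it can show tiny positive scores and huge negative ones together
ax.set_yscale("symlog", linthresh=1e-5)
plt.tight_layout()
plt.show()
```

The -4.1e+20 outlier will still dominate visually; you may want to drop or cap it separately, since a score that extreme usually just means "this model failed".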
I'm working with some instrument data that records the temperature at specific latitude, longitude, and pressure (height) coordinates. I need to create a 3D grid from this instrument data that I can then use to take vertical cross-sections of the interpolated gridded data. I've looked at pretty much every interpolation function/library I can find and I'm still having trouble wrapping my head around how to do this.
I'd prefer not to use Mayavi, since it seems to bug out on my school's server and I'd rather not try to deal with fixing it right now.
The data is currently in 4 separate 1d arrays and I used those to mock up some scatter plots of what I'm trying to get.
Here is the structure of my instrument data points:
And here is what I'm trying to create:
Ultimately, I'd like to create some kind of 3d contour from these points that I can take slices of. Each of the plotted points has a corresponding temperature attached to it, which is really what I think is throwing me off in terms of dimensions and whatnot.
There are a few options to go from the unstructured data which you have to a structured dataset.
The simplest option might be to use scipy's interpolate.griddata method, which can interpolate unstructured points using nearest-neighbour, linear, or cubic interpolation.
Another option is to define your grid and then average all of the unstructured points which fall into each grid cell, giving you some gridded representation of the data. You could use a tool such as CIS to do this easily (full disclosure, I wrote this package to do exactly this kind of thing).
Or, there are more complicated methods of interpolating the data by trying to determine the most likely value of the grid points based on the unstructured data, for example using kriging with the pyKriging package, though I've never used this.
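For the griddata route, the idea is to pass the four 1-D arrays directly: the three coordinate arrays as the scattered point locations, the temperature as the values, and a regular meshgrid as the target. A sketch with synthetic data standing in for the instrument arrays (the names, ranges, and temperature formula are assumptions):

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical scattered instrument data: four parallel 1-D arrays
rng = np.random.default_rng(1)
lat = rng.uniform(30, 40, 500)
lon = rng.uniform(-100, -90, 500)
prs = rng.uniform(100, 1000, 500)
temp = 300 - 0.05 * prs + 0.5 * (lat - 35)  # synthetic temperature field

# Regular 3-D target grid
glat, glon, gprs = np.meshgrid(
    np.linspace(30, 40, 20),
    np.linspace(-100, -90, 20),
    np.linspace(100, 1000, 20),
    indexing="ij",
)

gridded = griddata(
    (lat, lon, prs), temp,
    (glat, glon, gprs),
    method="linear",  # points outside the convex hull come back as NaN
)

# A vertical cross-section is now just a slice of the gridded array
cross_section = gridded[:, 10, :]  # fixed longitude index
```

Note that linear interpolation returns NaN outside the convex hull of the observations; method="nearest" avoids that at the cost of blocky extrapolation.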
I'm using matplotlib to plot things that have a lot of data clustered around values 0, .5, and 1. So it is difficult to see the difference between values within each cluster with a sequential colormap (which is what I want to use).
I would like to "stretch" the colormap around the values where my data clusters, so that you can see the contrast within each cluster as well as between clusters.
I saw this similar question, but it doesn't quite get me to where I want:
Matlab, Python: Fixing colormap to specified values
Thanks!
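One way to do the stretching is a piecewise-linear subclass of matplotlib's Normalize that maps chosen data intervals onto larger portions of the colormap. This generalizes the midpoint-normalization pattern from the matplotlib docs to several stops; the stop positions below are assumptions you'd tune to your clusters:

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

class StretchNorm(mcolors.Normalize):
    """Piecewise-linear norm that devotes extra colormap range
    to the neighbourhoods of chosen data values."""
    def __init__(self, stops, vmin=0.0, vmax=1.0):
        super().__init__(vmin, vmax)
        # stops: (data value, colormap position) pairs, both in [0, 1]
        self.x, self.y = zip(*stops)

    def __call__(self, value, clip=None):
        return np.ma.masked_array(np.interp(value, self.x, self.y))

# Hypothetical data clustered near 0, 0.5, and 1
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(0.0, 0.02, 300),
    rng.normal(0.5, 0.02, 300),
    rng.normal(1.0, 0.02, 300),
]).clip(0, 1).reshape(30, 30)

# Give each narrow cluster roughly a third of the colormap
norm = StretchNorm([(0.0, 0.0), (0.05, 0.3), (0.45, 0.35),
                    (0.55, 0.65), (0.95, 0.7), (1.0, 1.0)])
plt.imshow(data, cmap="viridis", norm=norm)
plt.colorbar()
plt.show()
```

This keeps the colormap sequential overall while spending most of its dynamic range inside the three clusters.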
I want to plot the correlation matrix using python. I have tried with the following script
import numpy as np
import matplotlib.pyplot as plt

corr_matrix = np.corrcoef(vector)
plt.imshow(corr_matrix, interpolation='bilinear')
plt.colorbar()
plt.show()
The matrix is 2500x2500. The above code produces a plot full of dots, but I want a smooth surface. How do I get that?
Best
Sudipta
What do you mean by "smooth surface" and why do you want to visualize your correlation matrix that way?
Here are two useful examples for visualizing [correlation] matrices. Both contain an explanation as well as example code for matplotlib.
Square grid pseudocolor plot
http://glowingpython.blogspot.com/2012/10/visualizing-correlation-matrices.html
Hinton Diagram
http://www.scipy.org/Cookbook/Matplotlib/HintonDiagrams
Update:
To supplement my comment, here's a pseudocolor visualization of a 1000x1000 correlation matrix, which didn't encounter memory issues on my humble laptop:
Note that although row 20 is correlated with the other variables and row 40 is correlated with row 80 (in the style of the GlowingPython example), this information is obscured by the sheer size of the matrix.
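For reference, here's a sketch of how such a pseudocolor plot can be produced for a large matrix. Turning off image interpolation (interpolation='nearest') keeps each cell crisp; the 'bilinear' setting in the question may well be part of what produces the dotty look. The data here is synthetic, with row 40 deliberately correlated with row 80:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in: 1000 variables observed 200 times,
# with row 40 constructed to correlate with row 80
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 200))
data[40] = data[80] + 0.5 * rng.normal(size=200)

corr = np.corrcoef(data)  # 1000x1000 correlation matrix

# A diverging colormap centred on zero suits correlations in [-1, 1]
plt.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1, interpolation="nearest")
plt.colorbar()
plt.show()
```

Even so, at 2500x2500 each cell maps to less than a screen pixel, so individual correlations are hard to see no matter how the image is rendered; summarizing or reordering the matrix helps more than smoothing it.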
You can sort the columns based on the values obtained in the correlation matrix.
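A sketch of that idea: convert the correlation matrix to a dissimilarity, run hierarchical clustering on it, and reorder rows and columns by the resulting leaf order so strongly correlated variables end up adjacent. The 1 - |corr| distance and the average-linkage method are my choices here, not the only options:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical data: 50 observations of 8 variables
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))
corr = np.corrcoef(data, rowvar=False)

# Dissimilarity: highly correlated variables are "close"
dist = 1 - np.abs(corr)

# linkage expects the condensed upper-triangle distance vector
condensed = dist[np.triu_indices_from(dist, 1)]
order = leaves_list(linkage(condensed, method="average"))

# Reorder both rows and columns by the clustering leaf order
sorted_corr = corr[np.ix_(order, order)]
```

Plotting sorted_corr instead of corr tends to pull correlated blocks together along the diagonal, which makes structure visible even in large matrices.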