Data visualisation using ridge and scatter plots - Python

Background:
I am working in Python with a large number of data points (in .csv form). So far, the code I have:
reads the CSV and its "result" column;
if the value in the "result" column is positive, plots the corresponding A B C D E F G parameters, with the parameter value on the y-axis and the parameter name on the x-axis;
if there are more than 10 such positive "result" rows, plots only the first 10 corresponding sets of A B C D E F G parameters.
An example of the type of dataset is below. (Mine contains around 12000 rows)
The Dataset
A B C D E F G result
1.00 0.85 -0.999 0.27 0.98 0.39 0.80 -0.86
0.89 0.4 -0.6 0.47 0.28 0.29 0.26 0.65
0.65 -1.00 0.26 0.67 -0.88 0.29 0.10 0.50
0.98 -0.98 0.76 0.37 0.68 0.59 0.90 0
0 0.5 0.56 0.27 0.38 0.79 0.48 -0.65
The code:
import pandas as pd

df = pd.read_csv("result.csv")
df.loc[df.result > 0, df.columns[:-1]].T.plot(ls='', marker='o')
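(The snippet above does not show the first-10 cap described earlier; one way to add it, as a sketch, is a .head(10) slice:)
df.loc[df.result > 0, df.columns[:-1]].head(10).T.plot(ls='', marker='o')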
Issue:
Sometimes identical values land their dots at exactly the same spot, so it is hard to see the frequency distribution (for instance, columns B and C below look similar, but one value has more points behind it).
What I want to do is overlay something like a ridge plot on the current graph (as I drew below) so that the frequency distribution can be seen.
I am a novice in this type of data visualization. Kindly guide me on how it could be done.

The density plot type already does pretty much what you want; we just need to superimpose it on your data:
>>> data_to_plot = df.loc[df.result>0, df.columns[:-1]]
>>> data_to_plot.plot(kind='density')
This is trivial if you want horizontal subplots: simply pass subplots=True to either plot call (and then zip the returned axes with the columns to superimpose the other plot):
>>> import numpy as np
>>> axes = data_to_plot.plot(kind='density', subplots=True, legend=False)
>>> for ax, (colname, series) in zip(axes, data_to_plot.items()):
...     ax.plot(series.values, np.zeros_like(series), ls='', marker='o')
...     ax.set_ylabel(colname)
However, if you want them vertical, we will likely have to compute the Gaussian densities ourselves. The pandas documentation points to scipy.stats.gaussian_kde. For this we need to know the points at which to evaluate the kernel. In your example [-1, 1] looks like a good interval, but of course you can take it from the data min/max.
>>> from scipy.stats import gaussian_kde
>>> y = np.arange(-1, 1.01, .01)
>>> ridges = data_to_plot.apply(lambda s: gaussian_kde(s)(y))
>>> ridges
A B C D E F G
0 0.001119 0.271510 0.270048 2.029737e-24 0.163222 2.352981e-15 0.000018
1 0.001247 0.272310 0.272122 4.796826e-24 0.164507 3.959987e-15 0.000021
2 0.001389 0.273071 0.274155 1.125941e-23 0.165765 6.637610e-15 0.000025
3 0.001545 0.273794 0.276145 2.624972e-23 0.166995 1.108083e-14 0.000030
4 0.001717 0.274479 0.278093 6.078288e-23 0.168200 1.842365e-14 0.000036
.. ... ... ... ... ... ... ...
196 0.939109 0.307535 0.314227 3.791151e-02 0.436305 3.153771e-01 0.630121
197 0.932996 0.304793 0.310216 3.100156e-02 0.431472 2.913782e-01 0.615406
198 0.926089 0.302012 0.306172 2.518140e-02 0.426576 2.682819e-01 0.600298
199 0.918401 0.299193 0.302097 2.031681e-02 0.421619 2.461581e-01 0.584834
200 0.909948 0.296337 0.297994 1.628194e-02 0.416607 2.250649e-01 0.569049
[201 rows x 7 columns]
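(If you prefer to derive the evaluation interval from the data rather than hard-coding [-1, 1], a minimal sketch:)
>>> lo, hi = data_to_plot.values.min(), data_to_plot.values.max()
>>> y = np.linspace(lo, hi, 201)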
Then simply plot with zip, as before. There might be some adjustment needed, but this is how it looks with your sample data. Note the scaling of ridges so they are all on the same scale and fit inside a 0.5-wide space on the plot.
>>> ax = data_to_plot.T.plot(ls='', marker='o')
>>> for n, (colname, ridge) in enumerate(ridges.items()):
...     ax.plot(ridge / (-2 * ridges.max().max()) + n, y, color='black')

Related

Efficient Numpy Array Multiplication and Reshaping

Is there a simpler way to get this done?
Consider that I have an array of data points of length m - for instance, the amount of rain that accumulated at a weather station over the course of a single day, for each of m days. Now we want to add n small, semi-random perturbations to each day's data to create m * n perturbed observations. Furthermore, we can divide the day into q periods, and we have an estimate of the proportion of any day's rain that accumulates in each of those periods; we assume that the proportion of rain accumulating during any period does not depend on the day.
So I have an array of the daily observations of length m (EDIT: length n was fixed to length m), which when perturbed becomes an array of shape [m,n], and an array of period proportions of length q. What I want now is an array of shape [n,m*q], with one row for each perturbation, where each row is the concatenation of the "period-expanded" perturbed estimates of the daily rainfall observations.
As an example, we can define a toy set of data:
import numpy as np
m = 4
n = 3
q = 5
X = np.arange(m*n).reshape(m,n)
E = (np.arange(1,m*n + 1) /10).reshape(m,n)
X = X - E
Y = np.arange(1,q)
np.random.shuffle(Y)
Y = Y / np.sum(np.arange(1,q))
print(f'X : \n{X}')
print(f'Y : \n{Y}')
which gives us
X :
[[-0.1 0.8 1.7]
[ 2.6 3.5 4.4]
[ 5.3 6.2 7.1]
[ 8. 8.9 9.8]]
Y :
[0.2 0.3 0.1 0.4]
My solution is:
res = (X[:,:,np.newaxis] * Y[np.newaxis,np.newaxis,:]).transpose(1,0,2).reshape(X.shape[1],-1)
print(f'res : \n{res}')
which gives us the appropriate answer:
res :
[[-0.02 -0.03 -0.01 -0.04 0.52 0.78 0.26 1.04 1.06 1.59 0.53 2.12
1.6 2.4 0.8 3.2 ]
[ 0.16 0.24 0.08 0.32 0.7 1.05 0.35 1.4 1.24 1.86 0.62 2.48
1.78 2.67 0.89 3.56]
[ 0.34 0.51 0.17 0.68 0.88 1.32 0.44 1.76 1.42 2.13 0.71 2.84
1.96 2.94 0.98 3.92]]
Admittedly it would be easier to expand the observations first and then randomly perturb, but the order of operations is a hard requirement in this particular problem: the observations must first be perturbed, then the perturbed observations must be expanded and concatenated.
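For what it's worth, the whole operation is exactly a Kronecker product of X.T with Y, so a shorter equivalent (a sketch, checked against the toy data above) would be:
res_alt = np.kron(X.T, Y)  # each entry of X.T scaled by Y, blocks concatenated row-wise
assert np.allclose(res_alt, res)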

Plotting values from two datasets for comparison

I would like to plot two dataframes in order to compare the results. My first choice would be to plot line charts based only on one column from the two dataframes.
df
Name Surname P R F
0 B N 0.41 0.76 0.53
1 B L 0.62 0.67 0.61
2 B SV 0.63 0.53 0.52
3 B SG 0.43 0.61 0.53
4 B R 0.81 0.51 0.53
5 T N 0.32 0.82 0.53
6 T L 0.58 0.69 0.62
7 T SV 0.67 0.61 0.64
8 T SG 0.53 0.63 0.57
9 T R 0.74 0.48 0.58
and
import pandas as pd

data = [['B','N',0.41,0.72,0.51],
['B','L',0.66,0.67,0.62],
['B','SV',0.63,0.51,0.51],
['B','SG',0.44,0.63,0.51],
['B','R',0.81,0.51,0.62],
['T','N',0.33,0.80,0.47],
['T','L',0.58,0.61,0.63],
['T','SV',0.68,0.61,0.64],
['T','SG',0.53,0.63,0.57],
['T','R',0.74,0.48,0.58]]
df1 = pd.DataFrame(data, columns = ['Name','Surname','P','R','F'])
I would like to create a plot based on the F values, keeping the information (in the legend/labels) about B/T and N, L, SV, SG, R.
I have tried bar charts, but I could not get them to carry those labels/legend.
I am looking for something like this:
fig, ax = plt.subplots()
ax2 = ax.twinx()
df.plot(x="Name", y=["F"], ax=ax)
df1.plot(x="Name", y=["F"], ax=ax2, ls="--")
However, this misses the labels and legend.
I have also tried with:
ax = df.plot()
l = ax.get_lines()
df1.plot(ax=ax, linestyle='--', color=(i.get_color() for i in l))
But I cannot distinguish by Name, Surname and dataframe (on the x axis there should be Surname).
It would be also ok to plot separately the values (P, R and F) as follows:
ax = df[['P']].plot()
l = ax.get_lines()
df1[['P']].plot(ax=ax, linestyle='--', color=(i.get_color() for i in l))
I should compare F values of the two plots based on Name and Surname.
Any help would be greatly appreciated.
IIUC,
fig, ax = plt.subplots()
ax2 = ax.twinx()
df.plot(x="Name", y=["F"], ax=ax)
df1.plot(x="Name", y=["F"], ax=ax2, ls="--")
fig.legend(loc="upper right", bbox_to_anchor=(1,1), bbox_transform=ax.transAxes)
Output:
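If you would rather keep both series on a single axis, with tick labels that combine Surname and Name (since the x axis should show Surname), a minimal sketch, assuming the df and df1 defined above:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
labels = df["Surname"] + " (" + df["Name"] + ")"  # hypothetical combined labels
ax.plot(labels, df["F"], marker="o", label="df F")
ax.plot(labels, df1["F"], marker="o", ls="--", label="df1 F")
ax.set_ylabel("F")
ax.legend()
plt.show()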
The simplest way to add information about other parameters to a graph is to use functions like ax.text or ax.annotate in a loop. The code should look like this (note that index and bar_width are not defined in the original snippet; the definitions below are assumed):
import numpy as np

fig, ax = plt.subplots()
index = np.arange(len(df))  # assumed: one bar group per row
bar_width = 8               # assumed width, tune to taste
data1 = ax.bar(20*index, df["F"], bar_width)
data2 = ax.bar(20*index + bar_width, df1["F"], bar_width)
for i in index:
    ax.text(i*20 - 5, 0, df['Surname'][i])
    ax.text(i*20 - 5, 0.05, df['Name'][i])
    ax.text(i*20 + bar_width - 5, 0, df1['Surname'][i])
    ax.text(i*20 + bar_width - 5, 0.05, df1['Name'][i])
plt.show()
Useful link:
Official Documentation for Text in Matplotlib Plots
Edit:
Probably similar problem: Different text at each point
Edit 2:
Code without index:
fig, ax = plt.subplots()
data1 = ax.plot(df["F"])
data2 = ax.plot(df1["F"])
for i in range(len(df)):
    ax.text(i, df["F"][i], df['Name'][i] + " " + df['Surname'][i])
    ax.text(i, df1["F"][i], df1['Name'][i] + " " + df1['Surname'][i])
plt.show()

Swap and group column names in a pandas DataFrame

I have a data frame with some quantitative columns and one qualitative column. I would like to use describe to compute stats, grouped by column using the qualitative data. But I do not obtain the level order I want. Here is an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]
The DataFrame looks like:
A B C qual
0 0.298217 0.675818 0.076533 init
1 0.015442 0.264924 0.624483 init
2 0.096961 0.702419 0.027134 init
3 0.481312 0.910477 0.796395 init
4 0.166774 0.319054 0.645250 init
5 0.609148 0.697818 0.151092 final
6 0.715744 0.067429 0.761562 final
7 0.748201 0.803647 0.482738 final
8 0.098323 0.614257 0.232904 final
9 0.033003 0.590819 0.943126 final
Now I would like to group by the qual column and compute statistical descriptors using describe. I did the following:
ddf = df.groupby("qual").describe().transpose()
ddf.unstack(level=0)
And I got
qual final init
A B C A B C
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.440884 0.554794 0.514284 0.211741 0.574539 0.433959
std 0.347138 0.284931 0.338057 0.182946 0.274135 0.355515
min 0.033003 0.067429 0.151092 0.015442 0.264924 0.027134
25% 0.098323 0.590819 0.232904 0.096961 0.319054 0.076533
50% 0.609148 0.614257 0.482738 0.166774 0.675818 0.624483
75% 0.715744 0.697818 0.761562 0.298217 0.702419 0.645250
max 0.748201 0.803647 0.943126 0.481312 0.910477 0.796395
I am close to what I want, but I would like to swap and group the column index like this:
         A             B             C
qual   init final   init final   init final
Is there a way to do it?
Use columns.swaplevel and then sort_index by level=0 and axis='columns':
ddf = df.groupby('qual').describe().T.unstack(level=0)
ddf.columns = ddf.columns.swaplevel(0,1)
ddf = ddf.sort_index(level=0, axis='columns')
Or in one line using DataFrame.swaplevel instead of index.swaplevel:
ddf = ddf.swaplevel(0,1, axis=1).sort_index(level=0, axis='columns')
A B C
qual final init final init final init
count 5.00 5.00 5.00 5.00 5.00 5.00
mean 0.44 0.21 0.55 0.57 0.51 0.43
std 0.35 0.18 0.28 0.27 0.34 0.36
min 0.03 0.02 0.07 0.26 0.15 0.03
25% 0.10 0.10 0.59 0.32 0.23 0.08
50% 0.61 0.17 0.61 0.68 0.48 0.62
75% 0.72 0.30 0.70 0.70 0.76 0.65
max 0.75 0.48 0.80 0.91 0.94 0.80
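Note that sort_index puts final before init alphabetically; if you would rather have init first, you could reorder that level explicitly (a sketch using DataFrame.reindex with a level argument):
ddf = ddf.reindex(columns=['init', 'final'], level=1)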
Try ddf.stack().unstack(level=[0,2]) in place of ddf.unstack(level=0).

Incorrect Python Matplotlib Polar Plotting

I'm trying to plot a polar plot using the code below
ax = plt.subplot(111, polar=True)
plt.scatter(ts, rds, c=cs,
            s=[(y*100)+10 for y in ys],
            cmap='gist_rainbow')
ax.set_yticklabels([])
for i in range(len(names)):
    ax.text(ts[i], rds[i], "{}".format(i+1), size=12)
plt.show()
There are 42 points, the first 7 of which are
# Theta Radius
1 249.25 0.39
2 66.25 0.40
3 239.09 0.71
4 31.82 1.05
5 114.02 0.54
6 189.15 0.46
7 359.00 0.05
However, the resulting figure/plot at the link below is incorrect:
https://i.imgur.com/n5dMuND.png
Before taking a screenshot, I mouse hovered over point 7. You can see that point 7 is plotted at 41.186 degrees and not 359 as it should be. Any idea why this is?
Thanks
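A likely cause: Matplotlib's polar axes interpret the theta coordinate in radians, not degrees, so a theta of 359 wraps around the circle dozens of times and lands at an angle unrelated to 359 degrees. Converting the angles first should fix it; a minimal sketch, assuming ts holds angles in degrees:
import numpy as np

ax = plt.subplot(111, polar=True)
plt.scatter(np.radians(ts), rds, c=cs,  # degrees -> radians before plotting
            s=[(y*100)+10 for y in ys],
            cmap='gist_rainbow')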

How can I plot a correlation matrix as a set of ellipses, similar to the R open-air package?

The figure below is plotted using the open-air R package:
I know matplotlib has the plt.matshow function, but it can't clearly show the relationships between the variables at the same time.
Here is my early work:
df is a pandas dataframe with 7 variables, shown below:
I don't know how to attach a .csv file to StackOverflow.
Using plt.matshow(df.corr(), cmap=plt.cm.Greens), the figure looks like this:
The second figure can't represent the correlation relations of the variables as clearly as the first one.
Edit:
I uploaded the csv file to Google Docs here.
I'm not aware of any existing Python library that does these "ellipse plots", but it's not particularly hard to implement using a matplotlib.collections.EllipseCollection:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.collections import EllipseCollection
def plot_corr_ellipses(data, ax=None, **kwargs):
    M = np.array(data)
    if M.ndim != 2:
        raise ValueError('data must be a 2D array')
    if ax is None:
        fig, ax = plt.subplots(1, 1, subplot_kw={'aspect': 'equal'})
        ax.set_xlim(-0.5, M.shape[1] - 0.5)
        ax.set_ylim(-0.5, M.shape[0] - 0.5)
    # xy locations of each ellipse center
    xy = np.indices(M.shape)[::-1].reshape(2, -1).T
    # set the relative sizes of the major/minor axes according to the strength of
    # the positive/negative correlation
    w = np.ones_like(M).ravel()
    h = 1 - np.abs(M).ravel()
    a = 45 * np.sign(M).ravel()
    ec = EllipseCollection(widths=w, heights=h, angles=a, units='x', offsets=xy,
                           transOffset=ax.transData, array=M.ravel(), **kwargs)
    ax.add_collection(ec)
    # if data is a DataFrame, use the row/column names as tick labels
    if isinstance(data, pd.DataFrame):
        ax.set_xticks(np.arange(M.shape[1]))
        ax.set_xticklabels(data.columns, rotation=90)
        ax.set_yticks(np.arange(M.shape[0]))
        ax.set_yticklabels(data.index)
    return ec
For example, using your data:
data = df.corr()
fig, ax = plt.subplots(1, 1)
m = plot_corr_ellipses(data, ax=ax, cmap='Greens')
cb = fig.colorbar(m)
cb.set_label('Correlation coefficient')
ax.margins(0.1)
Negative correlations can be plotted as ellipses with the opposite orientation:
fig2, ax2 = plt.subplots(1, 1)
data2 = np.linspace(-1, 1, 9).reshape(3, 3)
m2 = plot_corr_ellipses(data2, ax=ax2, cmap='seismic', clim=[-1, 1])
cb2 = fig2.colorbar(m2)
ax2.margins(0.3)
Assuming you are interested in showing cluster relations, the seaborn package mentioned in the comments also has a clustermap. Using your correlation matrix (it looks like you want to show the correlation coefficients as integers in the [-100, 100] range), you could do the following:
corr = df.corr().mul(100).astype(int)
GX HG RM SJ XB XN ZG
GX 100 77 62 71 48 66 57
HG 77 100 69 74 61 61 58
RM 62 69 100 75 48 64 68
SJ 71 74 75 100 50 70 65
XB 48 61 48 50 100 46 51
XN 66 61 64 70 46 100 75
ZG 57 58 68 65 51 75 100
and then use seaborn.clustermap() as follows:
import seaborn as sns
sns.clustermap(data=corr, annot=True, fmt='d', cmap='Greens').savefig('cluster.png')
I just discovered this Python package biokit today. It provides a very handy function to create various kinds of correlation charts. For example:
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
...: from biokit.viz import corrplot
In [6]: corr
Out[6]:
GX HG RM SJ XB XN ZG
GX 1.00 -0.77 0.62 0.71 0.48 0.66 0.57
HG -0.77 1.00 0.69 0.74 0.61 0.61 0.58
RM 0.62 0.69 1.00 0.75 0.48 0.64 0.68
SJ 0.71 0.74 0.75 1.00 0.50 0.70 0.65
XB 0.48 0.61 0.48 0.50 1.00 -0.46 0.51
XN 0.66 0.61 0.64 0.70 -0.46 1.00 0.75
ZG 0.57 0.58 0.68 0.65 0.51 0.75 1.00
I took Stefan's data and modified it a little bit. Let's assume this is a correlation matrix. Now to create a correlation chart, you can simply do this:
In [7]: c = corrplot.Corrplot(corr)
...: c.plot()
Correlation chart with ellipses
You can read more examples here.
