Violin Plot with python - python

I want to create 10 violin plots within a single figure. I have looked at many examples, such as this one: Violin plot matplotlib, which shows what I would like to end up with.
But I do not know how to adapt it to a real data set. The examples all just generate random, normally distributed data.
My data has the shape D[10,730]. If I try to adapt the example from the link above:
example:
axes[0].violinplot(all_data,showmeans=False,showmedians=True)
my code:
axes[0].violinplot(D,showmeans=False,showmedians=True)
it does not work.
It should draw 10 violin plots side by side (one per entry along the first dimension of D).
So what shape does my data need to have to get the same type of violin plot?

You just need to transpose your data array D.
axes[0].violinplot(D.T,showmeans=False,showmedians=True)
This appears to be a small inconsistency in matplotlib. For a list of 1D arrays, each list element is treated as one dataset, whereas for a 2D array each column is treated as one dataset.
import numpy as np
import matplotlib.pyplot as plt
n_datasets = 10
n_samples = 730
data = np.random.randn(n_datasets,n_samples)
fig, axes = plt.subplots(1,3)
# http://matplotlib.org/examples/statistics/boxplot_vs_violin_demo.html
axes[0].violinplot([d for d in data])
# should be equivalent to:
axes[1].violinplot(data)
# is actually equivalent to
axes[2].violinplot(data.T)
You should file a bug report.
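To check the orientation on your own data, you can count the violins that matplotlib actually draws. This is only a sketch: the random array D stands in for the real (10, 730) data, and the variable names are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the real data: 10 datasets with 730 samples each (assumption).
rng = np.random.default_rng(0)
D = rng.normal(size=(10, 730))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# A list of 1D arrays: one violin per list element (here, per row of D).
from_list = axes[0].violinplot(list(D), showmeans=False, showmedians=True)

# A 2D array: one violin per column, so the rows of D must become columns.
from_array = axes[1].violinplot(D.T, showmeans=False, showmedians=True)

# violinplot returns a dict; "bodies" holds one filled shape per violin.
n_list = len(from_list["bodies"])
n_array = len(from_array["bodies"])
print(n_list, n_array)
```

Both counts should come out as 10. Call plt.show() afterwards if you want to look at the figure.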

Related

how to output a 3 by 3 array to Seaborn chart

This is more of a data visualisation issue. My calculations produce a 3 by 3 output, and I want to show that output on a Seaborn chart for better visual effect. How do I do that? I have attached a random picture below of how I would ideally like it to look. To be clear, the numbers inside the individual squares will come from my 3*3 array.
Hypothetical 3*3 array
opt = np.random.randint(1,10,9).reshape(3,3)
You could use seaborn.heatmap. Besides passing the data (a 2D array), set annot=True if you want the cells to be labeled. A simple example:
import numpy as np
import seaborn as sns
data = np.arange(9).reshape(3,3)
ax = sns.heatmap(data, annot=True)
This gives a 3 by 3 heatmap in which each cell is labeled with its value.
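If seaborn is not available, roughly the same picture can be built with plain matplotlib. Note that this swaps in imshow plus ax.text instead of seaborn.heatmap, so it is an alternative sketch rather than what the answer above uses; the array opt is just example data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Example 3x3 array standing in for the real output (assumption).
opt = np.arange(9).reshape(3, 3)

fig, ax = plt.subplots()
image = ax.imshow(opt, cmap="viridis")  # one colored square per array cell
fig.colorbar(image, ax=ax)

# Write each value into the middle of its square.
# imshow puts the column index on the x axis and the row index on the y axis.
for (row, col), value in np.ndenumerate(opt):
    ax.text(col, row, str(value), ha="center", va="center", color="white")

ax.set_xticks(range(opt.shape[1]))
ax.set_yticks(range(opt.shape[0]))
```

Call plt.show() to display the figure.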

Plots not visible when using a line plot

I am new to python and I am trying to plot x and y (both contain a large amount of data), but when I use plt.plot no plot is visible in the output.
The code I have been using is
for i in range(len(a)):
    plt.plot(a[i],b[i])
plt.figure()
plt.show()
when I tried a scatter plot
for i in range(len(a)):
    plt.scatter(a[i],b[i])
plt.figure()
plt.show()
I do not understand why the line plot is missing. Even when I try seaborn, it gives me the error ValueError: If using all scalar values, you must pass an index
import numpy as np
import matplotlib.pyplot as plt
a = np.linspace(0,5,100)
b = np.linspace(0,10,100)
plt.plot(a,b)
plt.show()
I think this answers your question. I have used sample values for a and b. Matplotlib line plots do not need to be drawn in a loop.
A line is drawn between two points. If you plot single values, no line can be constructed.
Well, you might say "but I am plotting many points," which already contains part of the answer (points). Actually, matplotlib.pyplot.plot() draws line objects, so every time you call plot() it creates a new one (no matter whether you call it on the same axes or on new ones). You don't get lines because each call plots only a single point. You don't even see these points because plot() does not draw markers at the points by default. If you add marker='o' to plot(), you will end up with the same figure as with scatter.
A scatter plot, on the other hand, is an unordered collection of points. Its characteristic is that there are no lines between the points, because they are usually not a sequence. And because there are no lines between them, you can plot them all at once. By default they all have the same color, but you can also pass a color vector to encode a third piece of information.
import matplotlib.pyplot as plt
import numpy as np
# create random data
a = np.random.rand(10)
b = np.random.rand(10)
# open figure + axes
fig,axs = plt.subplots(1,2)
# standard scatter-plot
axs[0].scatter(a,b)
axs[0].set_title("scatter plot")
# standard line-plot
axs[1].plot(a,b)
axs[1].set_title("line plot")
plt.show()
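To see the marker point from above in action, here is a small sketch that compares plotting point by point in a loop with plotting the whole arrays in one call. The data (a and b) is made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample data (assumption): 20 points on a parabola.
a = np.linspace(0, 5, 20)
b = a ** 2

fig, axs = plt.subplots(1, 3, figsize=(12, 3))

# One plot() call per point: 20 separate one-point lines, nothing visible.
for x, y in zip(a, b):
    axs[0].plot(x, y)
axs[0].set_title("loop, no marker")

# Same loop with a marker: each point now shows up, like in a scatter plot.
for x, y in zip(a, b):
    axs[1].plot(x, y, marker="o", color="C0")
axs[1].set_title("loop, marker='o'")

# One plot() call with the full arrays: a single connected line.
axs[2].plot(a, b)
axs[2].set_title("single call")

n_loop_lines = len(axs[0].lines)
n_single_lines = len(axs[2].lines)
```

The first panel holds 20 line objects but stays empty, while the last panel holds one line object that connects all the points. Call plt.show() to display it.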

t-SNE map into 2D or 3D plot

features = ["Ask1", "Bid1", "smooth_midprice", "BidSize1", "AskSize1"]
client = InfluxDBClient(host='127.0.0.1', port=8086, database='data',
username=username, password=password)
series = "DCIX_2016_11_15"
sql = "SELECT * FROM {} where time >= '{}' AND time <= '{}' ".format(series,FROMT,TOT)
df = pd.DataFrame(client.query(sql).get_points())
#Separating out the features
X = df.loc[:, features].values
# Standardizing the features
X = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=3, n_jobs=5).fit_transform(X)
I would like to map my 5 features into a 2D or 3D plot. I am a bit confused about how to do that. How can I build a plot from that information?
You already have most of the work done. t-SNE is a common visualization for understanding high-dimensional data, and right now the variable tsne is an array where each row represents a set of (x, y, z) coordinates from the obtained embedding. You could use other visualizations if you would like, but t-SNE is probably a good starting place.
As far as actually seeing the results, even though you have the coordinates available you still need to plot them somehow. The matplotlib library is a good option, and that's what we'll use here.
To plot in 2D you have a couple of options. You can either keep most of your code the same and simply perform a 2D t-SNE with
tsne = TSNE(n_components=2, n_jobs=5).fit_transform(X)
Or you can just use the components you have and only look at two of them at a time. The following snippet should handle either case:
import matplotlib.pyplot as plt
plt.scatter(*zip(*tsne[:,:2]))
plt.show()
The zip(*...) transposes your data so that you can pass the x coordinates and the y coordinates individually to scatter(), and the [:,:2] piece selects two coordinates to view. You could leave it out if your data is already 2D, or you could replace it with something like [:,[0,2]] to view, for example, the 0th and 2nd components of a higher-dimensional embedding rather than just the first 2.
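If the zip(*...) trick looks cryptic, plain column slicing should give the same coordinates. The sketch below uses a random array as a stand-in for the real tsne embedding, so the numbers themselves mean nothing.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in for the (n_samples, 3) t-SNE embedding (assumption).
rng = np.random.default_rng(0)
tsne = rng.normal(size=(200, 3))

# The zip-based transpose from the snippet above ...
xs_zip, ys_zip = zip(*tsne[:, :2])

# ... and the equivalent column slicing.
xs, ys = tsne[:, 0], tsne[:, 1]
same = bool(np.allclose(xs_zip, xs) and np.allclose(ys_zip, ys))

# Picking components 0 and 2 instead of the first two.
x02, z02 = tsne[:, [0, 2]].T

fig, ax = plt.subplots()
ax.scatter(xs, ys, s=10)
ax.set_xlabel("t-SNE component 0")
ax.set_ylabel("t-SNE component 1")
```

Call plt.show() to display the figure.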
To plot in 3D the code looks much the same, at least for a minimal version.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(*zip(*tsne))
plt.show()
The main differences are a use of 3D plotting libraries and making a 3D subplot.
Adding color: t-SNE visualizations are typically more helpful if they're color-coded somehow. One example might be the smooth midprice you currently have stored in X[:,2]. For exploratory visualizations, I find 2D plots more helpful, so I'll use that as the example:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2])
You still need the imports and whatnot, but by passing the keyword argument c you can color code the scatter plot. To adjust how that numeric data is displayed, you could use a different color map like so:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2], cmap='RdBu')
As the name might suggest, this colormap consists of a gradient between red and blue, and the lower values of X[:,2] will correspond to red.
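A colorbar makes the color coding easier to read. The following is a sketch with random stand-in data: tsne plays the role of a 2D embedding and midprice the role of X[:,2], both of which are assumptions for the sake of a runnable example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-ins (assumptions) for the embedding and the coloring feature.
rng = np.random.default_rng(1)
tsne = rng.normal(size=(200, 2))
midprice = rng.normal(size=200)  # plays the role of X[:, 2]

fig, ax = plt.subplots()
points = ax.scatter(tsne[:, 0], tsne[:, 1], c=midprice, cmap="RdBu", s=12)

# The colorbar shows which colors correspond to which feature values.
colorbar = fig.colorbar(points, ax=ax)
colorbar.set_label("smooth_midprice (standardized)")
```

Call plt.show() to display the figure.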

Plotting points extracted from a .txt file in python

I am trying to create a plot extracting points from a .txt file. The points are separated by 'tab' space only. Also, there are too many points to be accommodated in only one column, so they have been spread over 3 columns. However, when I plot in matplotlib, I am a little suspicious I am not seeing all the numbers plotted. It may be the case the data is plotted only over the first column and is ignoring the other two columns.
Here is the sample example of such data file: https://www.dropbox.com/s/th6uwrk2xdnmhyi/n1l2m2.txt?dl=0
I also attached the simple code I am using to plot:
import matplotlib.pyplot as plt
%matplotlib inline
import sys
import os
import numpy
from pylab import *
exp_sum = '/home/trina/Downloads/n1l2m2.txt'
a= numpy.loadtxt(exp_sum, unpack =True)
plt.plot(a)
show()
and here is the output image:
I am interested to know if this plot covers all the points in my data file. Your suggestion is very appreciated.
By calling plt.plot(a), you are passing a 2D array with three rows (one per column of your file, because of unpack=True) to be plotted on a single 2-dimensional graph.
From the matplotlib docs for plot
If x and/or y is 2-dimensional, then the corresponding columns will be plotted.
So, your graph output is:
column 0 values at x = 0
column 1 values at x = 1
column 2 values at x = 2
Adding the following to the code:
for i in range(0,len(a)):
    print('a'+str(i),max(a[i]),min(a[i]))
Outputs the following:
stats max min
a0 0.9999 0.0
a1 0.9856736 0.3736717
a2 -0.003469009 -0.08896232
Using the mouseover position readout with matplotlib, this looks correct.
As a general point on graphs, I'd recommend using histograms, boxplots or violin plots if you want to visualise the frequency (and other stats) of data sets. See the matplotlib examples for histograms, boxplots and violin plots.
Edit: from the shading on the graph you have, it also looks like it does contain all the points, as your data columns are long tails when plotted individually. The long tail graphs correlate to the shading on the graph you have.
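If you would rather see one line per file column, you can transpose the array before plotting. Here is a small sketch, with a few made-up tab-separated rows read from a string instead of the real n1l2m2.txt:

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# Made-up tab-separated rows standing in for the real file (assumption).
rows = [
    "0.0\t0.98\t-0.003",
    "0.5\t0.70\t-0.040",
    "0.9\t0.45\t-0.080",
    "1.0\t0.37\t-0.089",
]
a = np.loadtxt(io.StringIO("\n".join(rows)), unpack=True)  # shape (3, 4)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# plt.plot(a) behaviour: one line per file row, drawn at x = 0, 1, 2.
lines_default = axes[0].plot(a)
axes[0].set_title("plot(a)")

# Transposed: one line per file column, drawn against the row number.
lines_columns = axes[1].plot(a.T)
axes[1].legend(lines_columns, ["column 0", "column 1", "column 2"])
axes[1].set_title("plot(a.T)")
```

With four rows in the file, the left panel contains four lines of three points each, and the right panel contains three lines of four points each. Call plt.show() to display it.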

Pyplot colormap line by line

I'm just starting out with plotting in Python using the very nice pyplot. I aim to show the evolution of two data series over time. Instead of a plain plot of the data as a function of time, I'd like a scatter plot (data1,data2) where the time component is shown as a color gradient.
In my two-column file, the time would be given by the line number, either written as a 3rd column in the file or obtained through some built-in ability of pyplot to get the line number on its own.
Can anyone help me do that?
Thanks a lot.
Nicolas
When plotting using matplotlib.pyplot.scatter you can pass a third array via the keyword argument c. The values in this array determine the colors of your scatter points. You then also pick an appropriate colormap from matplotlib.cm and assign it with the cmap keyword argument.
This toy example creates two datasets, data1 and data2. It then also creates an array colors of evenly spaced values between 0 and 1, with the same length as data1 and data2. It doesn't need to know the "line number"; it just needs to know the total number of data points, and then spaces the colors equally.
I've also added a colorbar. You can remove this by removing the plt.colorbar() line.
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
N = 500
data1 = np.random.randn(N)
data2 = np.random.randn(N)
colors = np.linspace(0,1,N)
plt.scatter(data1, data2, c=colors, cmap=cm.Blues)
plt.colorbar()
plt.show()
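If you want the colorbar to show the actual line number, you can use the row index directly as the color array. This is a sketch that reads a few made-up rows from a string; with a real file you would pass the file name to np.loadtxt instead.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# Made-up two-column data standing in for the real file (assumption).
text = "0.0 1.0\n0.5 0.8\n1.0 0.3\n1.5 -0.2\n2.0 -0.9\n"
data1, data2 = np.loadtxt(io.StringIO(text), unpack=True)

# The line number of each row serves as the time axis.
line_numbers = np.arange(len(data1))

fig, ax = plt.subplots()
points = ax.scatter(data1, data2, c=line_numbers, cmap="Blues")
fig.colorbar(points, ax=ax, label="line number")
```

Call plt.show() to display the figure.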
