How to add a box plot to scatter data in matplotlib - Python

I have the following data and, after plotting the scatter points, I would like to add a box plot around the set of points at each position. Here is my code for the scatter plot:
%matplotlib inline
import matplotlib.pyplot as plt
X = [1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8,
9, 9, 9, 9, 9, 9, 9,
10, 10, 10, 10, 10, 10, 10,
11, 11, 11, 11, 11, 11, 11,
12, 12, 12, 12, 12, 12, 12,
13, 13, 13, 13, 13, 13, 13,
14, 14, 14, 14, 14, 14, 14,
15, 15, 15, 15, 15, 15, 15]
H = [15, 17, 16, 20, 15, 18, 15,
17, 16, 16, 20, 19, 18, 15,
20, 22, 20, 22, 19, 21, 21,
19, 21, 20, 23, 21, 20, 22,
21, 23, 22, 20, 24, 22, 20,
20, 19, 20, 18, 21, 17, 19,
18, 20, 16, 15, 17, 20, 19,
19, 19, 18, 21, 21, 16, 19,
21, 22, 22, 24, 24, 23, 25,
28, 26, 30, 27, 26, 29, 30,
27, 26, 29, 31, 27, 29, 30,
25, 26, 27, 28, 25, 27, 30,
31, 28, 25, 27, 30, 25, 31,
28, 26, 30, 28, 29, 27, 31,
24, 26, 25, 28, 26, 23, 25]
fig, axes = plt.subplots(figsize=(8,5))
axes.scatter(X, H, color='b')
axes.set_xlabel('Pos');
axes.set_ylabel('H, µm');
When I add plt.boxplot, it draws a single box over all the data instead of one box per position. I would appreciate answers using either matplotlib or seaborn.
Thanks.

A good way would be to use pandas:
import pandas as pd
df = pd.DataFrame({'X': X, 'H': H})
ax = df.plot(kind='scatter', x='X', y='H')
df.boxplot(by='X', ax=ax)
plt.show()

Here is a condensed way to group your H values by X and plot them with matplotlib:
groups = [[] for _ in range(max(X))]
for x, h in zip(X, H):
    groups[x - 1].append(h)
plt.boxplot(groups)
You can add a grid with plt.grid(True).
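To keep the scatter points and the boxes on the same axes, the grouping step can be combined with the `positions` argument of `boxplot`. The following is only a sketch under assumptions: the helper names `group_by_position` and `plot_scatter_with_boxes` and the small sample lists are made up for illustration, and the box width is an arbitrary choice.

```python
from collections import defaultdict


def group_by_position(xs, hs):
    """Return (positions, groups): sorted unique x values and the h values at each."""
    grouped = defaultdict(list)
    for x, h in zip(xs, hs):
        grouped[x].append(h)
    positions = sorted(grouped)
    return positions, [grouped[p] for p in positions]


def plot_scatter_with_boxes(xs, hs):
    """Draw the raw points and one box per position on the same axes."""
    # Imported here so the grouping helper above works without matplotlib.
    import matplotlib.pyplot as plt

    positions, groups = group_by_position(xs, hs)
    fig, ax = plt.subplots(figsize=(8, 5))
    # positions= places each box at its x value instead of 1..n.
    ax.boxplot(groups, positions=positions, widths=0.6)
    ax.scatter(xs, hs, color='b', zorder=3)
    ax.set_xlabel('Pos')
    ax.set_ylabel('H, µm')
    return fig, ax


# Small made-up sample, just to show the grouping step.
sample_x = [1, 1, 2, 2, 2, 4]
sample_h = [15, 17, 20, 22, 19, 30]
positions, groups = group_by_position(sample_x, sample_h)
```

With the lists from the question, calling `plot_scatter_with_boxes(X, H)` followed by `plt.show()` should give one box per position with the points drawn on top.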


Python Append items to dictionary in a loop

I am trying to append values to a dictionary inside a loop, but somehow it's only appending one of the values. I recreated the setup using the same numbers I am dynamically getting.
The output from "print(vertex_id_from_shell)" is "{0: [4], 1: [12], 2: [20]}". I need to keep the keys, but add the remaining numbers to the values.
Thanks.
shells = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 1: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 2: [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]}
uvsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 9, 8, 1, 10, 11, 3, 12, 0, 2, 13, 14, 15, 16, 17, 17, 16, 18, 19, 19, 18, 20, 21, 21, 20, 22, 23, 15, 24, 25, 16, 26, 14, 17, 27, 28, 29, 30, 31, 31, 30, 32, 33, 33, 32, 34, 35, 35, 34, 36, 37, 29, 38, 39, 30, 40, 28, 31, 41]
vertsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 1, 0, 1, 7, 5, 3, 6, 0, 2, 4, 8, 9, 11, 10, 10, 11, 13, 12, 12, 13, 15, 14, 14, 15, 9, 8, 9, 15, 13, 11, 14, 8, 10, 12, 16, 17, 19, 18, 18, 19, 21, 20, 20, 21, 23, 22, 22, 23, 17, 16, 17, 23, 21, 19, 22, 16, 18, 20]
vertex_id_from_shell = {}
for shell in shells:
    selection_shell = shells.get(shell)
    #print(selection_shell)
    for idx, item in enumerate(selection_shell):
        if item in uvsID:
            uv_index = uvsID.index(item)
            vertex_ids = vertsID[uv_index]
            vertex_id_from_shell[shell] = [ ( vertex_ids ) ]
print(vertex_id_from_shell)
#{0: [4], 1: [12], 2: [20]}
#desired result
{0: [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 7, 0, 1, 7, 5, 6, 4], 1: [8, 9, 11, 10, 13, 12, 15, 14, 9, 8, 15, 13, 14, 12], 2: [16, 17, 19, 18, 21, 20, 23, 22, 17, 16, 23, 21, 22, 20]}
You're overwriting vertex_id_from_shell[shell] each time through the loop, not appending to it.
Use collections.defaultdict() to automatically create the dictionary elements with an empty list if necessary, then you can append.
from collections import defaultdict
vertex_id_from_shell = defaultdict(list)
for shell, selection_shell in shells.items():
    for item in selection_shell:
        if item in uvsID:
            uv_index = uvsID.index(item)
            vertex_ids = vertsID[uv_index]
            vertex_id_from_shell[shell].append(vertex_ids)
You are setting vertex_id_from_shell[shell] to a new list containing only one item every time. Instead, you should append to it. But first, of course, that list needs to exist, so you should check for it and create it if it doesn't already exist.
shells = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 1: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 2: [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]}
uvsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 9, 8, 1, 10, 11, 3, 12, 0, 2, 13, 14, 15, 16, 17, 17, 16, 18, 19, 19, 18, 20, 21, 21, 20, 22, 23, 15, 24, 25, 16, 26, 14, 17, 27, 28, 29, 30, 31, 31, 30, 32, 33, 33, 32, 34, 35, 35, 34, 36, 37, 29, 38, 39, 30, 40, 28, 31, 41]
vertsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 1, 0, 1, 7, 5, 3, 6, 0, 2, 4, 8, 9, 11, 10, 10, 11, 13, 12, 12, 13, 15, 14, 14, 15, 9, 8, 9, 15, 13, 11, 14, 8, 10, 12, 16, 17, 19, 18, 18, 19, 21, 20, 20, 21, 23, 22, 22, 23, 17, 16, 17, 23, 21, 19, 22, 16, 18, 20]
vertex_id_from_shell = {}
for shell in shells:
    selection_shell = shells.get(shell)
    #print(selection_shell)
    for idx, item in enumerate(selection_shell):
        if item in uvsID:
            uv_index = uvsID.index(item)
            vertex_ids = vertsID[uv_index]
            # if the list does not exist, create it
            if shell not in vertex_id_from_shell:
                vertex_id_from_shell[shell] = []
            # append to list
            vertex_id_from_shell[shell].append(vertex_ids)
print(vertex_id_from_shell)
# {0: [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 7, 5, 6, 4],
# 1: [8, 9, 11, 10, 13, 12, 15, 14, 9, 8, 15, 13, 14, 12],
# 2: [16, 17, 19, 18, 21, 20, 23, 22, 17, 16, 23, 21, 22, 20]}
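A third variant of the same idea uses `dict.setdefault`, which creates the list on first use and returns it, so the existence check and the append fit on one line. This is a sketch, not the asker's code: the function name `collect_vertex_ids` and the small sample inputs are hypothetical. It also precomputes the first index of each UV id, which mirrors what `uvsID.index(item)` returns but avoids rescanning the list for every item.

```python
def collect_vertex_ids(shells, uvs_id, verts_id):
    """Map each shell to the vertex ids found via the first occurrence of its UV ids."""
    # First position of every UV id, equivalent to uvs_id.index(uv).
    first_index = {}
    for i, uv in enumerate(uvs_id):
        first_index.setdefault(uv, i)

    result = {}
    for shell, uvs in shells.items():
        for uv in uvs:
            if uv in first_index:
                # setdefault returns the existing list or inserts a new empty one.
                result.setdefault(shell, []).append(verts_id[first_index[uv]])
    return result


# Small made-up inputs with the same shape as the data in the question.
sample_shells = {0: [0, 1], 1: [2, 3]}
sample_uvs = [0, 1, 1, 2, 3]
sample_verts = [5, 6, 7, 8, 9]
sample_result = collect_vertex_ids(sample_shells, sample_uvs, sample_verts)
```

Unlike `defaultdict`, the result here is a plain `dict`, so looking up a missing shell later still raises `KeyError` instead of silently creating an empty list.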

Plotting lognormal distribution with my data, instead of randomly generated data

I am new to Python and statistics, and I have a problem.
I have a random variable X whose values fall under a three-parameter lognormal distribution. I would like to plot the PDF of my variable.
X contains 500 samples (N=500), which are the following:
X = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35, 35, 36, 36, 36, 36, 36, 37, 38, 38, 39, 39, 40, 42, 42, 43, 43, 44, 45, 47, 48, 49, 49, 50, 50, 51, 52, 52, 54, 54, 55, 58, 62, 67, 67, 73, 80]
and the mean and standard deviation are:
Mean = 15.088
Stddev = 12.445
I have been doing some research, and I think the following code could be adapted to get the lognormal curve I need for my data, but I do not know how to adapt it, because I do not really understand how this distribution works.
The code is this:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import math
def lognorm(mu,variance):
    size = 500
    sigma = math.sqrt(variance)
    np.random.seed(1)
    gaussianData = stats.norm.rvs(loc=mu, scale=sigma, size=size)
    logData = np.exp(gaussianData)
    shape, loc, scale = stats.lognorm.fit(logData, floc=0)
    logData.sort()
    return logData, stats.lognorm.pdf(logData, shape, loc, scale)
x, y = lognorm(37, 0.8)
plt.plot(x, y)
plt.grid()
plt.show()
Any help would be much appreciated.

How do I plot a convenient lognormal pdf with my data?

I have a random variable, words per sentence, which contains these values:
words_per_sentence = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35, 35, 36, 36, 36, 36, 36, 37, 38, 38, 39, 39, 40, 42, 42, 43, 43, 44, 45, 47, 48, 49, 49, 50, 50, 51, 52, 52, 54, 54, 55, 58, 62, 67, 67, 73, 80]
The observation space contains 500 values (N=500), and the mean and standard deviation are:
mu = 15.088
sigma = 12.445
I want to calculate a three-parameter lognormal PDF for my variable and plot it in a graph, but I do not get the result I want. This is my failing code so far:
import math
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
data = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35, 35, 36, 36, 36, 36, 36, 37, 38, 38, 39, 39, 40, 42, 42, 43, 43, 44, 45, 47, 48, 49, 49, 50, 50, 51, 52, 52, 54, 54, 55, 58, 62, 67, 67, 73, 80]
plt.hist(data, bins=50, color='c', alpha=0.75)
xmin = min(data)
xmax = max(data)
x = np.linspace(xmin, xmax, 200)
pdf = stats.lognorm.pdf(x, s=12.445, loc=1, scale=math.exp(15.088))
plt.plot(pdf, 'y')
The problem is that I do not get the PDF curve, only a straight horizontal line at the bottom of the graph. I cannot post a picture because I do not have enough reputation yet. Please help.
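One likely cause is the parametrization: in `scipy.stats.lognorm`, `s` is the standard deviation of log(x - loc) and `scale` is exp of its mean, so passing the raw mean and standard deviation of the data (15.088 and 12.445) gives a curve that is practically flat over this range. The sketch below estimates the log-space parameters in plain Python so the relationship is visible. It makes assumptions: the shift `loc` is fixed at 0 rather than fitted, the sample list is made up, and the helper names are hypothetical.

```python
import math


def fit_lognormal(data, loc=0.0):
    """Estimate mu and sigma of log(x - loc); loc is assumed, not fitted."""
    logs = [math.log(x - loc) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_pdf(x, mu, sigma, loc=0.0):
    """Three-parameter lognormal density; zero at or below the shift loc."""
    if x <= loc:
        return 0.0
    z = (math.log(x - loc) - mu) / sigma
    return math.exp(-0.5 * z * z) / ((x - loc) * sigma * math.sqrt(2.0 * math.pi))


# Tiny made-up sample; substitute the real words_per_sentence list.
sample = [1, 2, 4, 8, 3, 5]
mu, sigma = fit_lognormal(sample)
xs = [0.05 * i for i in range(1, 1601)]  # grid on (0, 80]
pdf = [lognormal_pdf(x, mu, sigma) for x in xs]
```

In scipy terms these estimates correspond to `s=sigma`, `loc=0` and `scale=math.exp(mu)`; `stats.lognorm.fit(data)` estimates all three parameters, including the shift. For the plot itself, the curve should be drawn against the x grid (`plt.plot(xs, pdf)`) and the histogram normalized with `density=True`, otherwise the curve and the bars are on different vertical scales.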
The character class you have there does not allow spaces. Add a space to it:
^[A-ZÑÁÉÍÓÚÜ ]+:
(Edited to add the Ñ that it needs)
Use
^[\p{Lu}\s]+:
See regex proof.
EXPLANATION
NODE                     EXPLANATION
--------------------------------------------------------------------------------
^                        the beginning of the string
--------------------------------------------------------------------------------
[\p{Lu}\s]+              any character of: Unicode uppercase letter (\p{Lu}),
                         whitespace (\n, \r, \t, \f, and " ")
                         (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
:                        ':'
If it does not work use
^[^:\r\n]+:
EXPLANATION
NODE                     EXPLANATION
--------------------------------------------------------------------------------
^                        the beginning of the string
--------------------------------------------------------------------------------
[^:\r\n]+                any character except: ':', '\r' (carriage
                         return), '\n' (newline) (1 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
:                        ':'
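For anyone trying these patterns from Python, note that the standard `re` module does not support `\p{Lu}` (the third-party `regex` package does), so the sketch below uses the explicit character class and the looser fallback. The original question is not shown here, so the input lines and the helper name `leading_label` are hypothetical.

```python
import re

# Explicit character class from the first answer; works with the standard `re` module.
SPEAKER = re.compile(r'^[A-ZÑÁÉÍÓÚÜ ]+:')
# Looser fallback from the second answer: anything up to the first colon on the line.
ANY_LABEL = re.compile(r'^[^:\r\n]+:')


def leading_label(pattern, line):
    """Return the matched prefix including the colon, or None if there is no match."""
    m = pattern.match(line)
    return m.group(0) if m else None


# Hypothetical input lines.
upper = leading_label(SPEAKER, 'JUAN PÉREZ: hola')
mixed = leading_label(SPEAKER, 'Juan Pérez: hola')
loose = leading_label(ANY_LABEL, 'Juan Pérez: hola')
```

The uppercase-only pattern accepts the first line and rejects the mixed-case one, while the fallback accepts any label that ends at the first colon.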

Python: change color of nodes based on their numbers

I have a graph and I need to change the color of some nodes.
For example, these are all the nodes in the graph, and they are all blue:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
How can I change the following nodes to red?
[0, 1, 2, 3, 4, 5, 8, 9, 11, 15, 19, 21, 22, 23, 24, 25, 29]
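The question does not say which graph library is in use, so the following only sketches the library-independent part: building one color per node, in node order. The function name `node_colors` and its default colors are assumptions for illustration.

```python
def node_colors(nodes, highlighted, base='blue', highlight='red'):
    """Return one color per node, in the same order as nodes."""
    highlighted = set(highlighted)  # set membership is O(1) per lookup
    return [highlight if n in highlighted else base for n in nodes]


all_nodes = list(range(30))
red_nodes = [0, 1, 2, 3, 4, 5, 8, 9, 11, 15, 19, 21, 22, 23, 24, 25, 29]
colors = node_colors(all_nodes, red_nodes)
```

If the graph happens to be a networkx graph `G` (an assumption), a list like this can be passed when drawing, for example `nx.draw(G, nodelist=all_nodes, node_color=colors)`.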

How to test if a dataset follows Zipf's law? (using a plot in R, TikZ or even Python)

I have a very simple problem which I can't solve by myself: I have a small dataset of counts and I want to compare their distribution with Zipf's law.
My data is a simple table with one row:
31,
29,
28,
27,
27,
27,
27,
26,
25,
24,
23,
23,
22,
22,
22,
21,
21,
20,
20,
20,
19,
19,
19,
18,
18,
18,
18,
17,
17,
17,
17,
16,
15,
15,
15,
15,
14,
14,
13,
13,
13,
13,
12,
12,
12,
12,
11,
11,
11,
10,
10,
9,
8,
8,
8,
7,
7,
7,
6,
6,
6,
6,
6,
5,
5,
5,
4,
3,
2,
2,
2,
2,
1,
1,
Could anyone help me with this?
I have tried the whole day…
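Zipf's law says the count is roughly proportional to 1 / rank^s, so on log-log axes the counts plotted against their rank should fall close to a straight line with slope -s (about -1 in the classic form). A rough diagnostic, not a formal statistical test, is to fit that slope by least squares. The sketch below does this in plain Python; the function name `zipf_slope` and the made-up counts are assumptions for illustration.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank), rank 1 = largest count."""
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]  # counts must be positive
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up counts that follow count = 120 / rank exactly.
ideal = [120 / r for r in range(1, 11)]
slope = zipf_slope(ideal)
```

For the ideal data the slope comes out at -1. Applying `zipf_slope` to the counts from the question and comparing the result with -1, ideally alongside a log-log plot of count against rank, gives a first impression of how Zipf-like the data are.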
