Having difficulties normalizing a histogram in Python

I'm working with a histogram plotted above and to the right of a scatter plot.
I've already tried the normalization option of plt.hist (normed=1 or density=True), and with it I get a histogram whose y-axis reaches approximately 2.5. I know that increasing the bin size would lower that y-axis value, but the work I'm trying to replicate doesn't use bins wider than 0.2.
Code:
# x, y = columns of a predetermined table
left, width = 0.1, 0.7
bottom, height = 0.1, 0.7
spacing = 0.05
rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]
plt.figure(figsize=(9, 8))
ax_scatter = plt.axes(rect_scatter)
ax_scatter.tick_params(direction='in', top=True, right=True)
ax_histx = plt.axes(rect_histx)
ax_histx.tick_params(direction='in', labelbottom=True)
ax_histy = plt.axes(rect_histy)
ax_histy.tick_params(direction='in', labelleft=False)
ax_scatter.scatter(x, y, s=30, marker='*')
binwidth = 0.1
ax_scatter.set_xlim((-1, 0.7))
ax_scatter.set_ylim((-0.9, 0.9))
bins = np.arange(-10, 10 + binwidth, binwidth)
# density=True replaces normed=1, which was removed in Matplotlib 3.x
ax_histx.hist(x, bins=bins, density=True, color='chartreuse')
ax_histy.hist(y, bins=bins, orientation='horizontal', density=True, color='darkmagenta')
ax_histx.set_xlim(ax_scatter.get_xlim())
ax_histy.set_ylim(ax_scatter.get_ylim())
P.S.: I've looked at other posts and tried for a long time to fix it, but I'm really lost. Also, I'm new to programming and statistics, so please use simple terms if you can.
I'm attaching a screenshot of the plot I produced with this code, and another with an example of what I'm trying to achieve (the values are not supposed to be the same).
If you need anything else to help solve my problem, please feel free to ask. Thank you.
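As background for the numbers described above: with density=True the bars are scaled so that their total area (height times bin width) equals 1, so with narrow bins individual heights can legitimately exceed 1. If the target plot shows the fraction of points per bin instead, one option is to pass weights. A minimal sketch, assuming made-up data (the sample x below is hypothetical, not the poster's table):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=0.2, size=1000)  # hypothetical stand-in data

binwidth = 0.1
bins = np.arange(-10, 10 + binwidth, binwidth)

# density=True: the bar areas sum to 1, so heights scale like 1 / binwidth
density, _ = np.histogram(x, bins=bins, density=True)

# weights of 1/N: the bar heights sum to 1 (fraction of points in each bin)
weights = np.ones_like(x) / len(x)
fraction, _ = np.histogram(x, bins=bins, weights=weights)

print("tallest bar with density=True:", density.max())
print("tallest bar with weights=1/N: ", fraction.max())
```

The same weights array can be passed to the plotting call, e.g. ax_histx.hist(x, bins=bins, weights=weights), in place of the density argument.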

Related

Scatter plot with varying Quantile/Percentile in python [duplicate]

This question already has an answer here:
Plotting stochastic processes in Python
(1 answer)
Closed 2 years ago.
Basically, I want to draw a scatter plot of two variables with shaded bands for varying percentiles. I've drawn the scatter plot with the following toy code, but I'm unable to plot the bands for different percentiles (quantiles).
quantiles = [1,10,25,50,50,75,90,99]
grays = ["#DCDCDC", "#A9A9A9", "#2F4F4F","#A9A9A9", "#DCDCDC"]
alpha = 0.3
data = df[['area_log','mr_ecdf']]
y = data['mr_ecdf']
x = data['area_log']
idx = np.argsort(x)
x = np.array(x)[idx]
y = np.array(y)[idx]
for i in range(len(quantiles)//2):
    plt.fill_between(x, y, y, color='black', alpha = alpha, label=f"{quantiles[i]}")
    lower_lim = np.percentile(y, quantiles[i])
    upper_lim = np.percentile(y, 100-quantiles[i])
    data = data[data['mr_ecdf'] >= lower_lim]
    data = data[data['mr_ecdf'] <= upper_lim]
    y = data['mr_ecdf']
    x = data['area_log']
    idx = np.argsort(x)
    x = np.array(x)[idx]
    y = np.array(y)[idx]
data = df[['area_log','mr_ecdf']]
y = data['mr_ecdf']
x = data['area_log']
plt.scatter(x, y,s=1, color = 'r', label = 'data')
plt.legend()
# axes.set_ylim([0,1])
data link : here
I want to plot something like this (the first panel, (1,1)):
As @Mr. T mentioned, one way to do that is to calculate the CIs yourself and then plot them using plt.fill_between. The data you show pose a problem, since there are not enough points and not enough variance, so you'll never get what is in your pictures (the separation in my figure is also not clear, so I have added another example below to show how it works). If you have the data for those plots, post it and I will update my answer. In any case, you should check the post I mentioned in the comment; one way of doing it follows:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
idx = np.argsort(x)
x = np.array(x)[idx]
y = np.array(y)[idx]
# Create a list of quantiles to calculate
quantiles = [0.05, 0.25, 0.75, 0.95]
grays = ["#DCDCDC", "#A9A9A9", "#2F4F4F","#A9A9A9", "#DCDCDC"]
alpha = 0.3
plt.fill_between(x, y-np.percentile(y, 0.5), y+np.percentile(y, 0.5), color=grays[2], alpha = alpha, label="0.50")
# if the percentiles are symmetrical and we want labels on both sides
for i in range(len(quantiles)//2):
    plt.fill_between(x, y, y+np.percentile(y, quantiles[i]), color=grays[i], alpha = alpha, label=f"{quantiles[i]}")
    plt.fill_between(x, y-np.percentile(y, quantiles[-(i+1)]),y, color=grays[-(i+1)], alpha = alpha, label=f"{quantiles[-(i+1)]}")
plt.scatter(x, y, color = 'r', label = 'data')
plt.legend()
EDIT:
Some explanation. I am not sure what is incorrect in my code, but I would be happy if you could tell me; there is always room for improvement (thanks again to @Mr. T for the catch). Nevertheless, the fill_between function does the following:
Fill the area between two horizontal curves.
The curves are defined by the points (x, y1) and (x, y2)
So y1 and y2 specify where you want the graph filled with a colour. Here is another example:
X = np.linspace(120, 50, 71)
Y = X + 20*np.random.randn(71)
plt.fill_between(X, Y-np.percentile(Y, 95),Y+np.percentile(Y, 95), color="k", alpha = alpha)
plt.fill_between(X, Y-np.percentile(Y, 80),Y+np.percentile(Y, 80), color="r", alpha = alpha)
plt.fill_between(X, Y-np.percentile(Y, 60),Y, color="b", alpha = alpha)
plt.scatter(X, Y, color = 'r', label = 'data')
I generated some random data to see what is happening. The line plt.fill_between(X, Y-np.percentile(Y, 60),Y, color="b", alpha = alpha) fills only the region from the 60th percentile below Y up to Y. The other two lines always cover the space on both sides of Y (hence the +/-). You can see that the percentile bands overlap; of course they do, they must: the 90th percentile band includes the 60th as well. So you see only the differences between them. You could plot the data in the opposite order (or change the zorder), but then everything would be covered by the highest percentile. I hope this clarifies the answer. Also, your question is perfectly fine; sorry if my answer does not come across as neutral. It is just that if you also had the data for the graphs and not only the picture, my and others' answers could be more tailored :).
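For comparison, here is a different reading of "percentile bands", sketched under the assumption that the bands should show how y is spread at each x. This is not the method used in the answer above: it groups the points into x-bins, computes percentiles of y inside each bin, and fills between a lower and an upper percentile curve. The data, the number of bins and the percentile pairs are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(50, 120, size=2000)        # hypothetical data
Y = X + 20 * rng.standard_normal(X.size)

# Split the x-range into 14 bins and find the bin index of every point
edges = np.linspace(50, 120, 15)
centers = 0.5 * (edges[:-1] + edges[1:])
which = np.digitize(X, edges[1:-1])        # values 0..13

def band(q):
    """q-th percentile of Y inside each x-bin."""
    return np.array([np.percentile(Y[which == k], q) for k in range(len(centers))])

fig, ax = plt.subplots()
for low, high, color in [(5, 95, "#DCDCDC"), (25, 75, "#A9A9A9")]:
    ax.fill_between(centers, band(low), band(high), color=color, label=f"{low}-{high}%")
ax.plot(centers, band(50), color="#2F4F4F", label="median")
ax.scatter(X, Y, s=1, color="r", label="data")
ax.legend()
```

Drawing the widest band first keeps the narrower one visible on top, so no transparency is needed to tell them apart.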

How to label the vertical lines independent of the scale of the plot?

My program takes n sets of data and plots their histograms.
I. How to label the vertical lines independent of the height of the plot?
A vertical line indicates the most frequent value in a dataset. I want to add a label indicating the value, say 20% from the top. When using matplotlib.pyplot.text() I had to assign the x and y values manually. Depending on the dataset, the text ends up far too high or far too low, which I don't want to happen.
matplot.axvline(most_common_number, linewidth=0.5, color='black')
matplot.text(most_common_number + 3, 10, str(most_common_number),
horizontalalignment='center', fontweight='bold', color='black')
I also tried setting the label parameter of matplotlib.pyplot.axvline() but it only adds to the legend of the plot.
matplot.axvline(most_common_number, linewidth=0.5, color='black', label=str(most_common_number))
I wonder if there is a way to use percentages so the text appears n% from the top or use a different method to label the vertical lines. Or am I doing this all wrong?
II. How to make the ticks on x-axis to be spaced out better on resulting image?
I want the x-axis ticks to be multiples of 16, so I had to override the defaults. This is where the trouble began. When I save the plot to a PNG file, the x-axis looks really messed up.
But when I use show() it works fine:
Program Snippet
kwargs = dict(alpha=0.5, bins=37, range=(0, 304), density=False, stacked=True)
fig, ax1 = matplot.subplots()
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan']
count = 0
'''
datasets = [('dataset name', ['data'])]
'''
for item in datasets:
    dataset = item[1]
    most_common_number = most_common(dataset)
    ax1.hist(dataset, **kwargs, label=item[0], color=colors[count])
    matplot.axvline(most_common_number, linewidth=0.5, color='black')
    matplot.text(most_common_number + 3, 10, str(most_common_number),
                 horizontalalignment='center', fontweight='bold', color='black')
    count += 1
#for x-axis
loc = matplotticker.MultipleLocator(base=16) # this locator puts ticks at regular intervals
ax1.xaxis.set_major_locator(loc)
#for y-axis
y_vals = ax1.get_yticks()
ax1.set_yticklabels(['{:3.1f}%'.format(x / len(datasets[0][1]) * 100) for x in y_vals])
#set title
matplot.gca().set(title='1 vs 2 vs 3')
#set subtitle
matplot.suptitle("This is a cool subtitle.", va="bottom", family="overpass")
matplot.legend()
fig = matplot.gcf()
fig.set_size_inches(16, 9)
matplot.savefig('out.png', format = 'png', dpi=120)
matplot.show()
I. How to label the vertical lines independent of the height of the plot?
It can be done in two ways:
Axes limits
matplotlib.pyplot.xlim and matplotlib.pyplot.ylim
ylim() will give the max and min values of the axis. eg: (0.0, 1707.3)
matplot.text(x + matplot.xlim()[1] * 0.02, matplot.ylim()[1] * 0.8,
             str(most_common_number),
             horizontalalignment='center', fontweight='bold', color='black')
x + matplot.xlim()[1] * 0.02 means at x but shifted to the right by 2% of the upper x-axis limit, because you don't want the text to coincide with the vertical line it labels.
matplot.ylim()[1] * 0.8 means at 80% of the height of the y-axis (assuming the axis starts at 0).
Or you can specify y as a fraction of the axis height (e.g. 0.8) while keeping x in data coordinates, using the transform parameter:
matplot.text(most_common_number, 0.8,
' ' + str(most_common_number), transform=ax1.get_xaxis_transform(),
horizontalalignment='center', fontweight='bold', color='black')
Here y = 0.8 means at 80% height of y-axis.
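To see the transform approach in isolation, here is a small self-contained sketch. The random dataset and the way the most common value is computed (np.unique) are assumptions made for illustration; only the text call mirrors the snippet above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 304, size=500)      # hypothetical dataset

fig, ax = plt.subplots()
ax.hist(data, bins=37, range=(0, 304), alpha=0.5)

# Most frequent value in the data (stand-in for most_common())
values, counts = np.unique(data, return_counts=True)
most_common_number = int(values[np.argmax(counts)])

ax.axvline(most_common_number, linewidth=0.5, color="black")
# x is in data coordinates, y is a fraction of the axes height (0 = bottom, 1 = top)
label = ax.text(most_common_number, 0.8, " " + str(most_common_number),
                transform=ax.get_xaxis_transform(),
                horizontalalignment="left", fontweight="bold", color="black")
```

Because y is expressed in axes fractions, the label stays at 80% of the height no matter how tall the bars of a given dataset are.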
II. How to make the ticks on x-axis to be spaced out better on resulting image?
Use matplotlib.pyplot.gcf() to change the dimensions and use a custom dpi (otherwise the text will not scale properly) when saving the figure.
gcf() means "get current figure".
fig = matplot.gcf()
fig.set_size_inches(16, 9)
matplot.savefig('out.png', format = 'png', dpi=120)
So the resulting image will be (16*120, 9*120) or (1920, 1080) px.
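The size arithmetic can be checked with a short sketch. It saves to an in-memory buffer instead of out.png and reads the pixel dimensions back from the PNG header; the histogram data is made up.

```python
import io
import struct

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots()
ax.hist(rng.integers(0, 304, size=500), bins=37, range=(0, 304), alpha=0.5)
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=16))  # ticks at multiples of 16

fig.set_size_inches(16, 9)
buffer = io.BytesIO()                      # in-memory target instead of 'out.png'
fig.savefig(buffer, format="png", dpi=120)

# A PNG file stores its width and height as big-endian integers in bytes 16..23
width, height = struct.unpack(">II", buffer.getvalue()[16:24])
print(width, height)
```

With a figure 16 inches wide there is room for the roughly 20 tick labels that a spacing of 16 produces over the 0 to 304 range.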

How to colour circular lines in polar chart (matplotlib)

I'm trying to colour the circular line that corresponds to the value of 0 in a polar chart. This is what I want to achieve:
On this related question (Shading a segment between two lines on polar axis (matplotlib)), ax.fill_between is used to fill the space between two values, but I'm looking for a way to colour just the circular line where the value for each variable is 0.
If anybody has any tips that would be most appreciated! I've inserted a minimal working example below if anybody fancies having a go.
import matplotlib.pyplot as plt
import pandas as pd
def make_spider(row, title, color):
    import math
    categories = list(df)
    N = len(categories)
    angles = [n / float(N) * 2 * math.pi for n in range(N)]
    angles += angles[:1]
    ax = plt.subplot(1, 5, row+1, polar=True)
    plt.xticks(angles[:-1], categories, color='grey', size=8)
    values = df.iloc[row].values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha = .4)
    plt.gca().set_rmax(.4)
my_dpi = 40
plt.figure(figsize=(1000/my_dpi, 1000/my_dpi), dpi=96)
my_palette = plt.cm.get_cmap('Set2', len(df.index)+1)
for row in range(0, len(df.index)):
    make_spider( row = row, title='Cluster: ' + str(row), color=my_palette(row) )
Example dataframe here:
df = pd.DataFrame.from_dict({"no_rooms":{"0":-0.3470532925,"1":-0.082144001,"2":-0.082144001,"3":-0.3470532925,"4":-0.3470532925},"total_area":{"0":-0.1858487321,"1":-0.1685491141,"2":-0.1632483955,"3":-0.1769700284,"4":-0.0389887094},"car_park_spaces":{"0":-0.073703681,"1":-0.073703681,"2":-0.073703681,"3":-0.073703681,"4":-0.073703681},"house_price":{"0":-0.2416123064,"1":-0.2841806825,"2":-0.259622004,"3":-0.3529449824,"4":-0.3414842657},"pop_density":{"0":-0.1271390651,"1":-0.3105853643,"2":-0.2316607937,"3":-0.3297832328,"4":-0.4599021194},"business_rate":{"0":-0.1662745006,"1":-0.1426329043,"2":-0.1577528867,"3":-0.163560133,"4":-0.1099718326},"noqual_pc":{"0":-0.0251535462,"1":-0.1540641646,"2":-0.0204666924,"3":-0.0515740013,"4":-0.0445135996},"level4qual_pc":{"0":-0.0826103951,"1":-0.1777759951,"2":-0.114263357,"3":-0.1787044751,"4":-0.2709496389},"badhealth_pc":{"0":-0.105481688,"1":-0.1760349683,"2":-0.128215043,"3":-0.1560577648,"4":-0.1760349683}})
Probably a cheap hack based on the link you shared. The trick here is to simply use 360 degrees for fill_between and then use a very thin region around the circular line for 0 using margins such as -0.005 to 0.005. This way, you make sure the curve is centered around the 0 line. To make the line thicker/thinner you can increase/decrease this number. This can be straightforwardly extended to color all circular lines by putting it in a for loop.
ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
ax.fill(angles, values, color=color, alpha = .4)
ax.fill_between(np.linspace(0, 2*np.pi, 100), -0.005, 0.005, color='red', zorder=10) # <-- Added here
Another alternative is to use a Circle patch, as follows:
circle = plt.Circle((0, 0), 0.36, transform=ax.transData._b, fill=False, edgecolor='red', linewidth=2, zorder=10)
plt.gca().add_artist(circle)
but here I had to manually put 0.36 as the radius of the circle by playing around so as to put it exactly at the circular line for 0. If you know exactly the distance from the origin (center of the polar plot), you can put that number for exact position. At least for this case, 0.36 seems to be a good guess.
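A third variant that avoids both the thin fill and the hand-tuned radius is to draw the circle as an ordinary line at r = 0, since a constant-radius line on polar axes is rendered as a circle. This is a sketch rather than part of the answer above; the values on the nine spokes are made up, chosen to be negative like the example dataframe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values on 9 spokes, all negative like the example dataframe
values = [-0.35, -0.19, -0.07, -0.24, -0.13, -0.17, -0.03, -0.08, -0.11]
angles = [n / len(values) * 2 * np.pi for n in range(len(values))]
angles += angles[:1]   # repeat the first point to close the polygon
values += values[:1]

ax = plt.subplot(1, 1, 1, polar=True)
ax.plot(angles, values, color="tab:green", linewidth=2, linestyle="solid")
ax.fill(angles, values, color="tab:green", alpha=0.4)
ax.set_rmax(0.4)

# Circle at r = 0: one full turn of theta at constant radius
theta = np.linspace(0, 2 * np.pi, 100)
(zero_line,) = ax.plot(theta, np.zeros_like(theta), color="red", linewidth=2, zorder=10)
```

The line width is then controlled by linewidth in points rather than by a margin in data units.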
There is an easier option (note that this one uses Plotly, not matplotlib):
fig_radar.add_trace(go.Scatterpolar(
    r = np.repeat(0, 360),
    dtheta = 360,
    mode = 'lines',
    name = 'circle',
    line_color = 'black',
    line_shape = "spline"
))
Adding line_shape = "spline" makes the trace appear as a circle.
dtheta divides the coordinates into that many parts (at least that is how I understood it, and it works).

Setting plot border frame for two subplot containing MatPlotLib.Basemap contents

As an illustration, I present a figure here to depict my question.
fig = plt.figure(figsize = (10,4))
ax1 = plt.subplot(121)
map =Basemap(llcrnrlon=x_map1,llcrnrlat=y_map1,urcrnrlon=x_map2,urcrnrlat=y_map2)
map.readshapefile('shapefile','shapefile',zorder =2,linewidth = 0)
for info, shape in zip(map.shapefile, map.shapefile):
    x, y = zip(*shape)
    map.plot(x, y, marker=None,color='k',linewidth = 0.5)
plt.title("a")
ax2 = plt.subplot(122)
y_pos = [0.5,1,]
performance = [484.0,1080.0]
bars = plt.bar(y_pos, performance, align='center')
plt.title("b")
Because the map's settings are not consistent with subplot (b), subplots (a) and (b) have different border frames. In my opinion, the unaligned borders are unpleasant for the reader.
Is there any way to adjust the border size of subplot (b) so that the figure looks harmonious as a whole?
This is my target:
Note that subplot (a) needs to contain matplotlib Basemap elements.
Currently, your subplot on the left has an 'equal' aspect ratio, while for the other one it is automatic. Therefore, you have to manually set the aspect ratio of the subplot on the right:
def get_aspect(ax):
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    aspect_ratio = abs((ylim[0]-ylim[1]) / (xlim[0]-xlim[1]))
    return aspect_ratio
ax2.set_aspect(get_aspect(ax1) / get_aspect(ax2))
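The effect can be tried without Basemap. In the sketch below the left panel is only a stand-in (a plain rectangle outline with an 'equal' aspect ratio, which is an assumption about how the map axes behave); the right panel is the bar chart from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def get_aspect(ax):
    """Ratio of the y-range to the x-range of the current axis limits."""
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    return abs((ylim[0] - ylim[1]) / (xlim[0] - xlim[1]))

fig = plt.figure(figsize=(10, 4))

# Stand-in for the Basemap panel: any axes with an 'equal' aspect ratio
ax1 = plt.subplot(121)
ax1.plot([0, 3, 3, 0, 0], [0, 0, 2, 2, 0], color="k", linewidth=0.5)
ax1.set_aspect("equal")
ax1.set_title("a")

ax2 = plt.subplot(122)
ax2.bar([0.5, 1], [484.0, 1080.0], align="center")
ax2.set_title("b")

# Give the bar chart the same frame shape as the left panel
ax2.set_aspect(get_aspect(ax1) / get_aspect(ax2))

fig.canvas.draw()  # the aspect is applied at draw time
box1, box2 = ax1.get_position(), ax2.get_position()
print(box1.height / box1.width, box2.height / box2.width)
```

The two printed ratios should agree, which is what makes the frames line up.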

alignment of stacked subplots

EDIT:
I found myself an answer (see below) how to align the images within their subplots:
for ax in axes:
    ax.set_anchor('W')
EDIT END
I have some data I plot with imshow. It's long in x direction, so I break it into multiple lines by plotting slices of the data in vertically stacked subplots. I am happy with the result but for the last subplot (not as wide as the others) which I want left aligned with the others.
The code below is tested with Python 2.7.1 and matplotlib 1.2.x.
#! /usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
x_slice = [0,3]
y_slices = [[0,10],[10,20],[20,30],[30,35]]
d = np.arange(35*3).reshape((35,3)).T
vmin = d.min()
vmax = d.max()
fig, axes = plt.subplots(len(y_slices), 1)
for i, s in enumerate(y_slices):
    axes[i].imshow(
        d[ x_slice[0]:x_slice[1], s[0]:s[1] ],
        vmin=vmin, vmax=vmax,
        aspect='equal',
        interpolation='none'
    )
plt.show()
results in
Given the tip by Zhenya, I played around with axis.get/set_position. I tried to halve the width, but I don't understand the effect it has:
for ax in axes:
    print(ax.get_position())
p3 = axes[3].get_position().get_points()
x0, y0 = p3[0]
x1, y1 = p3[1]
# [left, bottom, width, height]
axes[3].set_position([x0, y0, (x1-x0)/2, y1-y0])
get_position gives me the bbox of each subplot:
for ax in axes:
    print(ax.get_position())
Bbox(array([[ 0.125 , 0.72608696],
[ 0.9 , 0.9 ]]))
Bbox(array([[ 0.125 , 0.5173913 ],
[ 0.9 , 0.69130435]]))
Bbox(array([[ 0.125 , 0.30869565],
[ 0.9 , 0.4826087 ]]))
Bbox(array([[ 0.125 , 0.1 ],
[ 0.9 , 0.27391304]]))
so all the subplots have the exact same horizontal extent (0.125 to 0.9). Judging from the narrower 4th subplot the image inside the subplot is somehow centered.
Let's look at the AxesImage objects:
for ax in axes:
    print(ax.images[0])
AxesImage(80,348.522;496x83.4783)
AxesImage(80,248.348;496x83.4783)
AxesImage(80,148.174;496x83.4783)
AxesImage(80,48;496x83.4783)
again, the same horizontal extent for the 4th image too.
Next try AxesImage.get_extent():
for ax in axes:
    print(ax.images[0].get_extent())
# [left, right, bottom, top]
(-0.5, 9.5, 2.5, -0.5)
(-0.5, 9.5, 2.5, -0.5)
(-0.5, 9.5, 2.5, -0.5)
(-0.5, 4.5, 2.5, -0.5)
there is a difference (right) but the left value is the same for all so why is the 4th one centered then?
EDIT: They are all centered...
Axes.set_anchor works so far (I just hope I don't have to adjust too much manually now):
for ax in axes:
    ax.set_anchor('W')
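For readers on Python 3 and a recent matplotlib, the anchor fix can be sketched as a complete script. It uses the same data and slices as the question; the check of the left edges at the end is an addition for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

y_slices = [[0, 10], [10, 20], [20, 30], [30, 35]]
d = np.arange(35 * 3).reshape((35, 3)).T

fig, axes = plt.subplots(len(y_slices), 1)
for ax, (start, stop) in zip(axes, y_slices):
    ax.imshow(d[:, start:stop], vmin=d.min(), vmax=d.max(),
              aspect="equal", interpolation="none")
    ax.set_anchor("W")  # pin the image to the left (west) edge of its slot

fig.canvas.draw()  # positions are final only after the aspect has been applied
left_edges = [ax.get_position().x0 for ax in axes]
print(left_edges)
```

All four left edges should come out equal, while the last image stays narrower than the others.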
You can control the position of the subplot manually, like so:
for ax in axes:
    print(ax.get_position())
and
axes[3].set_position([0.1,0.2,0.3,0.4])
Alternatively, you may want to have a look at GridSpec
I often find it hard to be precise with ax.set_position.
I prefer to use plt.subplots_adjust(wspace=0.005), which adjusts the width between the subplots, to set the distance between two horizontal subplots.
You can adjust the vertical distance as well (with hspace).
