I would like to see both the density and frequency on my histogram. For example, display density on the left side and frequency on the right side.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
def plot_histogram():
    bins = range(-11, 12, 1)
    bins_str = []
    for i in bins:
        bins_str.append(str(i)+"%")
    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist(np.clip(df.Returns, bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
    xlabels = bins_str[:]
    xlabels[-1] = "Over"
    xlabels[0] = "Under"
    N_labels = len(xlabels)
    plt.xlim([bins[0], bins[-1]])
    plt.xticks(bins)
    ax.set_xticklabels(xlabels)
    plt.title("Returns distribution")
    plt.grid(axis="y", linewidth=0.5)
plot_histogram()
I tried adding density=True in plt.hist() but it removes the count from the histogram. Is it possible to display both the frequency and density on the same histogram?
A density plot sets the heights of the bars such that the total area of all the bars (taking rwidth=1 for that calculation) sums to 1. To get there, the bar heights of a counting histogram are divided by the number of values times the bin width.
With that conversion factor, you can recalculate the counts from the density (or vice versa). The recalculation can be used to label the bars and/or set a secondary y-axis. Note that the ticks of both y axes are not aligned by default, so the grid only works well for one of them. (A secondary y-axis is a bit different from ax.twinx(), as the former has a fixed conversion between both y axes).
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
bins = range(-11, 12, 1)
bins_str = [str(i) + "%" for i in bins]
fig, ax = plt.subplots(figsize=(9, 5))
values, bins, patches = ax.hist(np.clip(df["Returns"], bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
# conversion between counts and density: number of values times bin width
factor = len(df) * (bins[1] - bins[0])
ax.bar_label(patches, ['' if v == 0 else f'{v * factor:.0f}' for v in values])
xlabels = bins_str[:]
xlabels[-1] = "Over"
xlabels[0] = "Under"
ax.set_xlim([bins[0], bins[-1]])
ax.set_xticks(bins, xlabels)
ax.set_title("Returns distribution")
ax.grid(axis="y", linewidth=0.5)
secax = ax.secondary_yaxis('right', functions=(lambda y: y * factor, lambda y: y / factor))
secax.set_ylabel('counts')
ax.set_ylabel('density')
plt.show()
To have the same grid positions for both y-axes, you can copy the ticks of one and convert them to set them at the other. For the ticks to be calculated, the plot needs to be drawn once (at the end of the code). Note that the converted values are only shown with a limited number of digits.
fig.canvas.draw()
ax.set_yticks(secax.get_yticks() / factor)
plt.show()
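The conversion factor used above can be checked numerically without drawing anything. The following is only a minimal sketch with made-up sample data (the `data` array and the two bins are assumptions for illustration, not the returns from the question); it relies on all values falling inside the bin range, which is what np.clip guarantees in the plotting code.

```python
import numpy as np

# Hypothetical sample data, only for illustration
data = np.array([0.2, 0.7, 1.1, 1.5, 1.6, 1.9, 2.5, 2.9, 3.3, 3.8])
bins = np.arange(0, 6, 2)  # bin edges 0, 2, 4: two bins of width 2

counts, _ = np.histogram(data, bins=bins)
density, _ = np.histogram(data, bins=bins, density=True)

# Same factor as in the plot: number of values times bin width
factor = len(data) * (bins[1] - bins[0])

print("counts:          ", counts)
print("density * factor:", density * factor)
print("total area:      ", (density * np.diff(bins)).sum())
```

Multiplying the density by the factor gives the counts back, and the bar areas add up to 1, which is why a single fixed factor is enough for the secondary axis.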
I'm using this nice boxplot graph, from an answer by #Parfait.
1. I got an out-of-bounds error on j and had to use range(i*5,i*5+5). Why?
2. I'd like to set the median to a particular color, let's say red. medianprops=dict(color="red") won't work. How to do it?
3. How to set the y-axis tick labels to the same color as the boxes?
Disclaimer: I don't know what I'm doing.
Here's the code using random data :
# import the required library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import matplotlib.colors as mc
import colorsys
# data
df = pd.DataFrame(np.random.normal(np.random.randint(5,15),np.random.randint(1,5),size=(100, 16)), columns=list(string.ascii_uppercase)[:16])
# Boxplot
fig, ax = plt.subplots(figsize=(9, 10))
medianprops=dict(color="red")
ax = sns.boxplot(data=df, orient="h", showfliers=False, palette = "husl")
ax = sns.stripplot(data=df, orient="h", jitter=True, size=7, alpha=0.5, palette = "husl") # show data points
ax.set_title("Title")
plt.xlabel("X label")
def lighten_color(color, amount=0.5):
    # --------------------- SOURCE: #IanHincks ---------------------
    try:
        c = mc.cnames[color]
    except:
        c = color
    c = colorsys.rgb_to_hls(*mc.to_rgb(c))
    return colorsys.hls_to_rgb(c[0], 1 - amount * (1 - c[1]), c[2])
for i,artist in enumerate(ax.artists):
    # Set the linecolor on the artist to the facecolor, and set the facecolor to None
    col = lighten_color(artist.get_facecolor(), 1.2)
    artist.set_edgecolor(col)
    # Each box has 6 associated Line2D objects (to make the whiskers, fliers, etc.)
    # Loop over them here, and use the same colour as above
    for j in range(i*5,i*5+5):
        line = ax.lines[j]
        line.set_color(col)
        line.set_mfc(col)
        line.set_mec(col)
        #line.set_linewidth(0.5)
To change the color of the median, you can use the medianprops in sns.boxplot(..., medianprops=...). If you also set a unique label, that label can be tested again when iterating through the lines.
To know how many lines belong to each boxplot, you can divide the number of lines by the number of artists (just after the boxplot has been created, before other elements have been added to the plot). Note that a line potentially has 3 colors: the line color, the marker face color and the marker edge color. Matplotlib creates the fliers as an invisible line with markers. The code below thus also changes these colors to make it more robust to different options and possible future changes.
Looping simultaneously through the boxes and the y tick labels allows copying the color. Making them a bit larger and darker helps for readability.
import matplotlib.pyplot as plt
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb, to_rgb
import seaborn as sns
import pandas as pd
import numpy as np
def enlighten(color, factor=0.5):
    h, s, v = rgb_to_hsv(to_rgb(color))
    return hsv_to_rgb((h, s, 1 - factor * (1 - v)))
def endarken(color, factor=0.5):
    h, s, v = rgb_to_hsv(to_rgb(color))
    return hsv_to_rgb((h, s, factor * v))
df = pd.DataFrame(np.random.normal(1, 5, size=(100, 16)).cumsum(axis=0),
                  columns=['Hydrogen', 'Helium', 'Lithium', 'Beryllium', 'Boron', 'Carbon', 'Nitrogen', 'Oxygen',
                           'Fluorine', 'Neon', 'Sodium', 'Magnesium', 'Aluminum', 'Silicon', 'Phosphorus', 'Sulfur'])
sns.set_style('white')
fig, ax = plt.subplots(figsize=(9, 10))
colors = sns.color_palette("husl", len(df.columns))
sns.boxplot(data=df, orient="h", showfliers=False, palette='husl',
            medianprops=dict(color="yellow", label='median'), ax=ax)
lines_per_boxplot = len(ax.lines) // len(ax.artists)
for i, (box, ytick) in enumerate(zip(ax.artists, ax.get_yticklabels())):
    ytick.set_color(endarken(box.get_facecolor()))
    ytick.set_fontsize(20)
    color = enlighten(box.get_facecolor())
    box.set_color(color)
    for lin in ax.lines[i * lines_per_boxplot: (i + 1) * lines_per_boxplot]:
        if lin.get_label() != 'median':
            lin.set_color(color)
            lin.set_markerfacecolor(color)
            lin.set_markeredgecolor(color)
sns.stripplot(data=df, orient="h", jitter=True, size=7, alpha=0.5, palette='husl', ax=ax)
sns.despine(ax=ax)
ax.set_title("Title")
ax.set_xlabel("X label")
plt.tight_layout()
plt.show()
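To illustrate why the number of lines per box changed from 6 to 5 in the question, here is a small sketch using plain matplotlib instead of seaborn. It assumes filled boxes (patch_artist=True), so that the box itself is a patch and not a line; the random data and the helper name `lines_per_box` are made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = [rng.normal(size=50) for _ in range(4)]  # hypothetical data for 4 boxes

def lines_per_box(showfliers):
    """Count the Line2D objects that matplotlib creates per box."""
    fig, ax = plt.subplots()
    # patch_artist=True draws each box as a patch, so it is not part of ax.lines
    ax.boxplot(data, patch_artist=True, showfliers=showfliers)
    n = len(ax.lines) // len(data)
    plt.close(fig)
    return n

with_fliers = lines_per_box(True)      # 2 whiskers + 2 caps + median + fliers
without_fliers = lines_per_box(False)  # 2 whiskers + 2 caps + median

print(with_fliers, without_fliers)
```

With showfliers=False the flier line is never created, so stepping through ax.lines in blocks of 6 runs past the end of the list. Dividing len(ax.lines) by the number of boxes, as in the answer above, avoids hard-coding either value.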
I'm only answering point 2 of my question.
After tinkering, I found this to work:
    # Each box has 5 associated Line2D objects (the whiskers, caps and median)
    # Loop over them here, and use the same colour as above
    n = 5  # number of lines per box (this was for tinkering)
    for j in range(i*n, i*n+n):
        if j != i*n+4:  # skip the median, which is the last line of each box
            line = ax.lines[j]
            line.set_color(col)
Again, I don't know what I'm doing. So someone more knowledgeable may provide a more valuable answer.
I removed the stripplot for better clarity.
I am trying to plot a count plot using seaborn and matplotlib. For each year, I want to sort the counts of the "Drought types" within that year so that it looks better. Currently they are unsorted within each year and look very messy.
Thank you!
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
count=pd.read_csv(r"https://raw.githubusercontent.com/tuyenhavan/Course_Data/main/count.csv")
plt.figure(figsize=(15,8))
sns.countplot(x= 'Year', hue = 'Drought types', data = count, palette = 'YlOrRd')
plt.legend(loc = "best",frameon=True,bbox_to_anchor=(0.9,0.75))
plt.show()
The following approach draws the years one-by-one. order= is used to fix the order of the years. hue_order is recalculated for each individual year (.reindex() is needed to make sure all drought_types are present).
A dictionary palette is used to make sure each hue value gets the same color, independent of the order. The automatic legend repeats all hue values for each year, so the legend needs to be reduced.
By the way, loc='best' shouldn't be used together with bbox_to_anchor in the legend, as it might cause very unexpected changes with small changes in the data. loc='best' will be changed to one of the 9 possible locations depending on the available space.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
count = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Course_Data/main/count.csv")
fig, ax = plt.subplots(figsize=(15, 8))
drought_types = count['Drought types'].unique()
palette = {drought_type: color
           for drought_type, color in zip(drought_types, sns.color_palette('YlOrRd', len(drought_types)))}
all_years = range(count['Year'].min(), count['Year'].max() + 1)
sns.set_style('darkgrid')
for year in all_years:
    year_data = count[count['Year'] == year]
    if len(year_data) > 0:
        # reindex is needed to make sure all drought_types are present
        hue_order = year_data.groupby('Drought types').size().reindex(drought_types).sort_values(ascending=True).index
        sns.countplot(x='Year', order=all_years,
                      hue='Drought types', hue_order=hue_order,
                      data=year_data, palette=palette, ax=ax)
# handles, _ = ax.get_legend_handles_labels()
# handles = handles[:len(drought_types)]
handles = [plt.Rectangle((0, 0), 0, 0, color=palette[drought_type], label=drought_type)
           for drought_type in drought_types]
ax.legend(handles=handles, loc="upper right", frameon=True, bbox_to_anchor=(0.995, 0.99))
plt.show()
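The per-year hue_order calculation is the core of the approach above, and it can be tried on its own with pandas. This is just a sketch on a tiny made-up table (the rows below are assumptions for illustration, not the CSV from the question); the column names mirror the original ones.

```python
import pandas as pd

# Hypothetical miniature version of the data
count = pd.DataFrame({
    "Year": [2000] * 6 + [2001] * 3,
    "Drought types": ["mild", "mild", "mild", "severe", "moderate", "moderate",
                      "severe", "severe", "mild"],
})
drought_types = count["Drought types"].unique()

orders = {}
for year, year_data in count.groupby("Year"):
    # reindex adds the types that are missing in this year (as NaN),
    # and sort_values puts those NaN entries at the end
    sizes = year_data.groupby("Drought types").size().reindex(drought_types)
    orders[int(year)] = list(sizes.sort_values(ascending=True).index)

for year, order in orders.items():
    print(year, order)
```

In 2001 there is no "moderate" row at all, yet it still appears in the order for that year. That keeps the number of hue levels, and therefore the bar widths, the same for every year.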
I am creating bar graphs for data that comes from series. However the names (x-axis values) are extremely long. If they are rotated 90 degrees it is impossible to read the entire name and get a good image of the graph. 45 degrees is not much better. I am looking for a way to label the x-axis by numbers 1-15 and then have a legend listing the names that correspond to each number.
This is the completed function I have so far, including creating the series from a larger dataframe
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch
def graph_average_expressions(TAD_matches, CAGE):
    """graphs the top 15 expression levels of each lncRNA"""
    for i, row in TAD_matches.iterrows():
        mask = (
            CAGE['short_description'].isin(row['peak_ID'])
        )  # finds expression level for peaks in an lncRNA
        average = CAGE[mask].iloc[:,8:].mean(axis=0).astype('float32').sort_values().tail(n=15)
        # made a new df of the top 15 highest expression levels for all averaged groups
        # a group is peaks belonging to the same lncRNA
        cell_type = list(average.index)
        expression = list(average.values)
        average_df = pd.DataFrame(
            list(zip(cell_type, expression)),
            columns=['cell_type','expression']
        )
        colors = sns.color_palette(
            'husl',
            n_colors=len(cell_type)
        )
        p = sns.barplot(
            x=average_df.index,
            y='expression',
            data=average_df,
            palette=colors
        )
        cmap = dict(zip(average_df.cell_type, colors))
        patches = [Patch(color=v, label=k) for k, v in cmap.items()]
        plt.legend(
            handles=patches,
            bbox_to_anchor=(1.04, 0.5),
            loc='center left',
            borderaxespad=0
        )
        plt.title('expression_levels_of_lncRNA_' + row['lncRNA_name'])
        plt.xlabel('cell_type')
        plt.ylabel('expression')
        plt.show()
Here is an example of the data I am graphing
CD14_monocytes_treated_with_Group_A_streptococci_donor2.CNhs13532 1.583428
Neutrophils_donor3.CNhs11905 1.832527
CD14_monocytes_treated_with_Trehalose_dimycolate_TDM_donor2.CNhs13483 1.858384
CD14_monocytes_treated_with_Candida_donor1.CNhs13473 1.873013
CD14_Monocytes_donor2.CNhs11954 2.041607
CD14_monocytes_treated_with_Candida_donor2.CNhs13488 2.112112
CD14_Monocytes_donor3.CNhs11997 2.195365
CD14_monocytes_treated_with_Group_A_streptococci_donor1.CNhs13469 2.974203
Eosinophils_donor3.CNhs12549 3.566822
CD14_monocytes_treated_with_lipopolysaccharide_donor1.CNhs13470 3.685389
CD14_monocytes_treated_with_Salmonella_donor1.CNhs13471 4.409062
CD14_monocytes_treated_with_Candida_donor3.CNhs13494 5.546789
CD14_monocytes_-_treated_with_Group_A_streptococci_donor3.CNhs13492 5.673991
Neutrophils_donor1.CNhs10862 8.352045
Neutrophils_donor2.CNhs11959 11.595509
With the new code above this is the graph I get, but no legend or title.
A bit of a different route. Made a string mapping x values to the names and added it to the figure.
Made my own DataFrame for illustration.
from matplotlib import pyplot as plt
import pandas as pd
import string,random
df = pd.DataFrame({'name':[''.join(random.sample(string.ascii_letters,15))
                           for _ in range(10)],
                   'data':[random.randint(1,20) for _ in range(10)]})
Make the plot.
fig,ax = plt.subplots()
ax.bar(df.index,df.data)
Make the legend.
x_legend = '\n'.join(f'{n} - {name}' for n,name in zip(df.index,df['name']))
Add the legend as a Text artist and adjust the plot to accommodate it.
t = ax.text(.7,.2,x_legend,transform=ax.figure.transFigure)
fig.subplots_adjust(right=.65)
plt.show()
plt.close()
That can be made dynamic by getting and using the Text artist's size and the Figure's size.
# using imports and DataFrame from above
fig,ax = plt.subplots()
r = fig.canvas.get_renderer()
ax.bar(df.index,df.data)
x_legend = '\n'.join(f'{n} - {name}' for n,name in zip(df.index,df['name']))
t = ax.text(0,.1,x_legend,transform=ax.figure.transFigure)
# find the width of the Text and place it on the right side of the Figure
twidth = t.get_window_extent(renderer=r).width
*_,fwidth,fheight = fig.bbox.extents
tx,ty = t.get_position()
tx = .95 - (twidth/fwidth)
t.set_position((tx,ty))
# adjust the right edge of the plot/Axes
ax_right = tx - .05
fig.subplots_adjust(right=ax_right)
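If the numbers should start at 1, as asked in the question, the same "number - name" mapping can also be fed to a regular legend. The sketch below is only one possible way to do that; the names, values and the tab20 colormap are placeholders chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch

# Hypothetical long names and values, only for illustration
names = ["sample_with_a_very_long_name_A",
         "sample_with_a_very_long_name_B",
         "sample_with_a_very_long_name_C"]
values = [1.6, 1.8, 2.0]

numbers = list(range(1, len(names) + 1))           # 1-based x positions
colors = plt.get_cmap("tab20")(np.arange(len(names)))

fig, ax = plt.subplots()
ax.bar(numbers, values, color=colors)
ax.set_xticks(numbers)                             # tick labels are the numbers

# One legend entry per bar, labelled "number - name"
handles = [Patch(color=c, label=f"{n} - {name}")
           for n, c, name in zip(numbers, colors, names)]
legend = ax.legend(handles=handles, bbox_to_anchor=(1.04, 0.5),
                   loc="center left", borderaxespad=0)
legend_labels = [t.get_text() for t in legend.get_texts()]
print(legend_labels)
```

Call plt.show() or fig.savefig(..., bbox_inches="tight") afterwards to view the result; the latter helps when the legend sits outside the axes.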
Set up the dataframe
Verify that the index of the dataframe to be plotted has been reset, so it consists of integers beginning at 0, and use the index as the x-axis.
Plot the values on the y-axis.
Option 1A: Seaborn hue
The easiest way is probably to use seaborn.barplot and use the hue parameter with the 'names'
Seaborn: Choosing color palettes
This plot is using husl
Additional options for the husl palette can be found at seaborn.husl_palette
The bars will not be centered for this option, because they are placed according to the number of hue levels, and there are 15 levels in this case.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# plt styling parameters
# (in matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = (16.0, 10.0)
plt.rcParams["patch.force_edgecolor"] = True
# create a color palette the length of the dataframe
colors = sns.color_palette('husl', n_colors=len(df))
# plot
p = sns.barplot(x=df.index, y='values', data=df, hue='names')
# place the legend to the right of the plot
plt.legend(bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
Option 1B: Seaborn palette
Using the palette parameter instead of hue, places the bars directly over the ticks.
This option requires "manually" associating 'names' with the colors and creating the legend.
patches uses Patch to create each item in the legend. (e.g. the rectangle, associated with color and name).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch
# create a color palette the length of the dataframe
colors = sns.color_palette('husl', n_colors=len(df))
# plot
p = sns.barplot(x=df.index, y='values', data=df, palette=colors)
# create color map with colors and df.names
cmap = dict(zip(df.names, colors))
# create the rectangles for the legend
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# add the legend
plt.legend(handles=patches, bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
Option 2: pandas.DataFrame.plot
This option also requires "manually" associating 'names' with the palette and creating the legend using Patch.
Choosing Colormaps in Matplotlib
This plot is using tab20c
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.patches import Patch
# plt styling parameters
# (in matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = (16.0, 10.0)
plt.rcParams["patch.force_edgecolor"] = True
# chose a color map with enough colors for the number of bars
colors = [plt.cm.tab20c(np.arange(len(df)))]
# plot the dataframe
df.plot.bar(color=colors)
# create color map with colors and df.names
cmap = dict(zip(df.names, colors[0]))
# create the rectangles for the legend
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# add the legend
plt.legend(handles=patches, bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
Reproducible DataFrame
data = {'names': ['CD14_monocytes_treated_with_Group_A_streptococci_donor2.CNhs13532', 'Neutrophils_donor3.CNhs11905', 'CD14_monocytes_treated_with_Trehalose_dimycolate_TDM_donor2.CNhs13483', 'CD14_monocytes_treated_with_Candida_donor1.CNhs13473', 'CD14_Monocytes_donor2.CNhs11954', 'CD14_monocytes_treated_with_Candida_donor2.CNhs13488', 'CD14_Monocytes_donor3.CNhs11997', 'CD14_monocytes_treated_with_Group_A_streptococci_donor1.CNhs13469', 'Eosinophils_donor3.CNhs12549', 'CD14_monocytes_treated_with_lipopolysaccharide_donor1.CNhs13470', 'CD14_monocytes_treated_with_Salmonella_donor1.CNhs13471', 'CD14_monocytes_treated_with_Candida_donor3.CNhs13494', 'CD14_monocytes_-_treated_with_Group_A_streptococci_donor3.CNhs13492', 'Neutrophils_donor1.CNhs10862', 'Neutrophils_donor2.CNhs11959'],
        'values': [1.583428, 1.832527, 1.858384, 1.873013, 2.041607, 2.112112, 2.195365, 2.974203, 3.566822, 3.685389, 4.409062, 5.546789, 5.673991, 8.352045, 11.595509]}
df = pd.DataFrame(data)
I am plotting a histogram using matplotlib, but my y-axis range is in the millions. How can I scale the y-axis so that instead of printing 5000000 it will print 5?
Here is my code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
filename = './norstar10readlength.csv'
df=pd.read_csv(filename, sep=',',header=None)
n, bins, patches = plt.hist(x=df.values, bins=10, color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My Very Own Histogram')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()
And here is the plot I am generating now
An elegant solution is to apply a FuncFormatter to format y labels.
Instead of your source data, I used the following DataFrame:
Val
0 800000
1 2600000
2 6700000
3 1400000
4 1700000
5 1600000
and made a bar plot. "Ordinary" bar plot:
df.Val.plot.bar(rot=0, width=0.75);
yields a picture with original values on the y axis (1000000 to 7000000).
But if you run:
from matplotlib.ticker import FuncFormatter
def lblFormat(n, pos):
    return str(int(n / 1e6))
lblFormatter = FuncFormatter(lblFormat)
ax = df.Val.plot.bar(rot=0, width=0.75)
ax.yaxis.set_major_formatter(lblFormatter)
then y axis labels are integers (the number of millions):
So you can arrange your code something like this:
n, bins, patches = plt.hist(x=df.values, ...)
#
# Other drawing actions, up to "plt.ylim" (including)
#
ax = plt.gca()
ax.yaxis.set_major_formatter(lblFormatter)
plt.show()
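You can also rescale the values themselves; you just need to decide on a ratio.
So if you want to turn 50000 into 5, the ratio is 5/50000, which is 0.0001 (for the 5000000 in the question it would be 0.000001).
Once you have the ratio, multiply all the y-axis values by it before plotting. Note that in a histogram the y-axis values are bin counts rather than values stored in the DataFrame, so the ratio has to be applied to the counts.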
Hope this helps!!
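For a histogram, one way to apply such a ratio to the counts is the weights parameter, which both np.histogram and plt.hist accept. This is a sketch of that idea and not necessarily what the answer above had in mind; the random data and the 50000-to-5 ratio are just illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=50_000)   # hypothetical data, 50000 values

ratio = 5 / 50_000               # turns a count of 50000 into 5

# Each value contributes `ratio` instead of 1 to its bin
weights = np.full(data.shape, ratio)

counts, edges = np.histogram(data, bins=10)
scaled_counts, _ = np.histogram(data, bins=10, weights=weights)

print(counts.sum(), scaled_counts.sum())
```

The bin edges and the shape of the histogram stay the same, only the heights are multiplied by the ratio. With matplotlib, passing the same array as plt.hist(data, bins=10, weights=weights) should give the rescaled y-axis directly, and the y-axis label should then mention the unit.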