Background
I am plotting my data using sns.regplot (seaborn 0.11.0, Python 3.8.5). I use the argument 'x_estimator' to plot the mean of each category shown on the x-axis, and for each point on the x-axis I have an errorbar which is bootstrapped using the sns.regplot arguments 'ci' and 'n_boot'.
Since this plot needs to have a specific dots per inch (DPI) of 800, I needed to readjust the scaling of the original plot to make sure the desired DPI was obtained.
Problem
Due to the rescaling, my errorbars appear to be rather 'wide'. I would like to make them less wide, and if it is possible, I would also like to add caps on the errorbars. I have included my code below using a randomly generated dataset. Running this code, one can see that the plot that I obtain has the correct DPI, but the errorbars are too wide.
Edit for clarification
I am fine with the confidence intervals (CIs) themselves. My only worry is that the CI bars are drawn too thick, which is probably a formatting issue. I already checked line_kws and scatter_kws, but I can't find any formatting options for the CIs. My desired output looks like this: the same bars, but not as 'heavy' as the original ones.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#%%
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams
#%%
# seaborn params
sns.set_style("ticks")
sns.set_context("paper")
# plotting params
rcParams['font.family'] = 'Times New Roman'
rcParams['axes.titlesize'] = 6
rcParams['axes.labelsize'] = 5
rcParams['xtick.labelsize'] = 5
rcParams['ytick.labelsize'] = 5
#%%
# some toy data into to pandas dataframe
df = pd.DataFrame({'Y': np.random.normal(0, 1, (800,)),
'X': np.repeat(range(1, 9), 100),
'Condition': np.tile(["A", "B"], 400)},
index=range(800))
#%%
# make a subplot with 1 row and 2 columns
fig, ax_list = plt.subplots(1, 2,
sharex = True,
sharey = True,
squeeze = True)
# A condition
g = sns.regplot(x = "X",
y = "Y",
data = df.loc[df["Condition"] == "A"],
x_estimator = np.mean,
x_ci = "ci",
ci = 95,
n_boot = 5000,
scatter_kws = {"s":15},
line_kws = {'lw': .75},
color = "darkgrey",
ax = ax_list[0])
# B condition
g = sns.regplot(x = "X",
y = "Y",
data = df.loc[df["Condition"] == "B"],
x_estimator = np.mean,
x_ci = "ci",
ci = 95,
n_boot = 5000,
scatter_kws = {"s":15},
line_kws = {'lw': .75},
color = "black",
ax = ax_list[1])
# figure parameters (left figure)
ax_list[0].set_title("A condition")
ax_list[0].set_xticks(np.arange(1, 9))
ax_list[0].set_xlim(0.5, 8.5)
ax_list[0].set_xlabel("X")
ax_list[0].set_ylabel("Y")
# figure parameters (right figure)
ax_list[1].set_title("B condition")
ax_list[1].set_xlabel("X")
ax_list[1].set_ylabel("Y")
# general title
fig.suptitle("Y ~ X", fontsize = 8)
#%%
# set the size of the image
fig.set_size_inches(3, 2)
# play around until the figure is satisfactory (difficult due to high DPI)
plt.subplots_adjust(top=0.85, bottom=0.15, left=0.185, right=0.95, hspace=0.075,
wspace=0.2)
# save as tiff with defined DPI
plt.savefig(fname = "test.tiff", dpi = 800)
plt.close("all")
Try setting ci parameters in sns.regplot to a lower value
I just ran into this problem myself and found a hacky solution.
It looks like keyword arguments for the confidence intervals (CI) are not yet exposed to the user (see here). But we can see in the source that seaborn sets the CI line width to 1.75 * mpl.rcParams["lines.linewidth"]. So I think you can get what you want by wrapping the call in a matplotlib rc_context context manager that temporarily lowers lines.linewidth.
import matplotlib as mpl
import numpy as np
import seaborn as sns
# Insert other code from your question here
# to get your dataframe
df = ...
# Play around with this number until you get the desired line width
line_width_reduction = 0.5
linewidth = mpl.rcParams["lines.linewidth"]
with mpl.rc_context({"lines.linewidth": line_width_reduction * linewidth}):
    g = sns.regplot(
        x="X",
        y="Y",
        data=df.loc[df["Condition"] == "A"],
        x_estimator=np.mean,
        x_ci="ci",
        ci=95,
        n_boot=5000,
        scatter_kws={"s":15},
        line_kws={'lw': .75},
        color="darkgrey",
        ax=ax_list[0]
    )
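If the rcParams workaround feels too indirect, another option is to compute the per-category means and bootstrapped intervals yourself and draw them with Matplotlib's ax.errorbar, which exposes elinewidth, capsize and capthick directly (so caps become possible too). The following is only a sketch under assumptions: the helper bootstrap_mean_ci, the toy data and all sizes are made up for illustration, and this simple percentile bootstrap is not guaranteed to reproduce seaborn's intervals exactly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)


def bootstrap_mean_ci(values, n_boot=2000, ci=95):
    """Hypothetical helper: (mean, low, high) from a percentile bootstrap of the mean."""
    values = np.asarray(values, dtype=float)
    # resample with replacement: one row of indices per bootstrap replicate
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high


# toy data: 50 observations for each of the 8 x categories
x_levels = np.arange(1, 9)
samples = {x: rng.normal(0, 1, 50) for x in x_levels}
stats = np.array([bootstrap_mean_ci(samples[x]) for x in x_levels])
means, lows, highs = stats.T

fig, ax = plt.subplots(figsize=(3, 2))
container = ax.errorbar(
    x_levels,
    means,
    # errorbar expects non-negative distances below and above each point
    yerr=[np.maximum(means - lows, 0), np.maximum(highs - means, 0)],
    fmt="o",
    markersize=3,
    color="black",
    elinewidth=0.5,  # thickness of the error bars
    capsize=1.5,     # cap length in points
    capthick=0.5,    # thickness of the caps
)
```

The regression line itself is not drawn here; it could still be added with sns.regplot(..., scatter=False) on the same ax.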
I have the code below with randomly generated dataframes and I would like to extract the x and y values of both plotted lines. These line plots show the Price on the Y-axis and are Volume weighted.
For some reason, the line values for the second distribution plot are not stored in the variables "df_2_x" and "df_2_y". Instead, the values of "df_1_x" and "df_1_y" end up in those variables as well. Both print statements return True, so the arrays are completely equal.
If I put them in separate cells in a notebook, it does work.
I also looked at this solution: How to retrieve all data from seaborn distribution plot with mutliple distributions?
But this does not work for weighted distplots.
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2,12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100,3000)) for i in range(30)]
Price_2 = [round(random.uniform(0,10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100,1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1' : Price_1,
'Volume_1' : Volume_1})
df_2 = pd.DataFrame({'Price_2' : Price_2,
'Volume_2' :Volume_2})
df_1_x, df_1_y = sns.distplot(df_1.Price_1, hist_kws={"weights":list(df_1.Volume_1)}).get_lines()[0].get_data()
df_2_x, df_2_y = sns.distplot(df_2.Price_2, hist_kws={"weights":list(df_2.Volume_2)}).get_lines()[0].get_data()
print((df_1_x == df_2_x).all())
print((df_1_y == df_2_y).all())
Why does this happen, and how can I fix this?
Whether or not weight is used, doesn't make a difference here.
The principal problem is that you are extracting again the first curve in df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[0].get_data(). You'd want the second curve instead: df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[1].get_data().
Note that seaborn isn't really meant to concatenate commands. Sometimes it works, but it usually adds a lot of confusion. E.g. sns.distplot returns an ax (which represents a subplot). Graphical elements such as lines are added to that ax.
Also note that sns.distplot has been deprecated. It will be removed from Seaborn in one of the next versions. It is replaced by sns.histplot and sns.kdeplot.
Here is what the code could look like:
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2, 12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100, 3000)) for i in range(30)]
Price_2 = [round(random.uniform(0, 10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100, 1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1': Price_1,
'Volume_1': Volume_1})
df_2 = pd.DataFrame({'Price_2': Price_2,
'Volume_2': Volume_2})
ax = sns.histplot(x=df_1.Price_1, weights=list(df_1.Volume_1), bins=10, kde=True, kde_kws={'cut': 3})
sns.histplot(x=df_2.Price_2, weights=list(df_2.Volume_2), bins=10, kde=True, kde_kws={'cut': 3}, ax=ax)
df_1_x, df_1_y = ax.lines[0].get_data()
df_2_x, df_2_y = ax.lines[1].get_data()
# use fill_between to demonstrate where the extracted curves lie
ax.fill_between(df_1_x, 0, df_1_y, color='b', alpha=0.2)
ax.fill_between(df_2_x, 0, df_2_y, color='r', alpha=0.2)
plt.show()
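The point that consecutive plotting calls add their lines to the same Axes can be checked without seaborn. This is a minimal sketch using plain pyplot calls as a stand-in for the two distplot calls; the data are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

plt.figure()  # start from a fresh figure
x = np.linspace(0, 1, 5)
first_ax = plt.plot(x, x)[0].axes        # first call creates the current Axes
second_ax = plt.plot(x, 2 * x)[0].axes   # second call draws on that same Axes

# both curves now live on one Axes, in the order they were drawn
lines = second_ax.get_lines()
first_y = lines[0].get_ydata()
second_y = lines[1].get_ydata()
```

Index 0 always refers to the first curve drawn on the Axes, which is why reading get_lines()[0] after the second call returns the first distribution again.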
I want to do a violin plot of binned data but at the same time be able to plot a model prediction and visualize how well the model describes the main part of the individual data distributions. My problem here is, I guess, that the x-axis after the violin plot does not behave like a regular axis with numbers, but more like string-values that just accidentally happen to be numbers. Maybe not a good description, but in the example I would like to have a "normal" plot of a function, e.g. f(x) = 2*x**2, and at x=1, x=5.2, x=18.3 and x=27 I would like to have the violin in the background.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
np.random.seed(10)
collectn_1 = np.random.normal(1, 2, 200)
collectn_2 = np.random.normal(802, 30, 200)
collectn_3 = np.random.normal(90, 20, 200)
collectn_4 = np.random.normal(70, 25, 200)
ys = [collectn_1, collectn_2, collectn_3, collectn_4]
xs = [1, 5.2, 18.3, 27]
sns.violinplot(x=xs, y=ys)
xx = np.arange(0, 30, 10)
plt.plot(xx, 2*xx**2)
plt.show()
Somehow this code actually does not plot violins but only bars, this is only a problem in this example and not in the original code though. In my real code I want to have different "half-violins" on both sides, therefore I use sns.violinplot(x="..", y="..", hue="..", data=.., split=True).
I think that would be hard to do with seaborn because it does not provide an easy way to manipulate the artists that it creates, particularly if there are other things plotted on the same Axes. Matplotlib's violinplot allows setting the position of the violins, but does not provide an option for plotting only half violins. Therefore, I would suggest using statsmodels.graphics.boxplots.violinplot, which does both.
import matplotlib.pyplot as plt
import matplotlib.ticker
import numpy as np
import seaborn as sns
from statsmodels.graphics.boxplots import violinplot
df = sns.load_dataset('tips')
x_col = 'day'
y_col = 'total_bill'
hue_col = 'smoker'
xs = [1, 5.2, 18.3, 27]
xx = np.arange(0, 30, 1)
yy = 0.1*xx**2
cs = ['C0','C1']
fig, ax = plt.subplots()
ax.plot(xx,yy)
# one pass per hue level: the first is drawn as left halves, the second as right halves
for (_,gr0),side,c in zip(df.groupby(hue_col),['left','right'],cs):
    print(side)
    # one series of y-values per x category for the current hue level
    data = [gr1 for (_,gr1) in gr0.groupby(x_col)[y_col]]
    violinplot(ax=ax, data=data, positions=xs, side=side, show_boxplot=False, plot_opts=dict(violin_fc=c))
# violinplot above messes up which ticks are shown, the line below restores a sensible tick locator
ax.xaxis.set_major_locator(matplotlib.ticker.MaxNLocator())
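For the simpler case where full (not split) violins are acceptable, the remark about Matplotlib's own violinplot accepting positions can be sketched as follows. The data, the widths value and the model curve are assumptions chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)
xs = [1, 5.2, 18.3, 27]
# toy distributions scattered around the model value at each x
ys = [rng.normal(2 * x**2, 40, 200) for x in xs]

fig, ax = plt.subplots()
# positions places each violin at its numeric x value on a regular axis
parts = ax.violinplot(ys, positions=xs, widths=3, showextrema=False)

# the model curve shares the same numeric x-axis
xx = np.linspace(0, 30, 100)
(curve,) = ax.plot(xx, 2 * xx**2, color="C1")
```

Because the x-axis stays numeric, the curve and the violins line up without any category-to-position bookkeeping.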
I am using a github repository called ptitprince, which is derived from seaborn and matplotlib, to generate graphs.
For example, this is the code using the ptitprince repo:
# coding: utf8
import pandas as pd
import ptitprince as pt
import seaborn as sns
import os
import matplotlib.pyplot as plt
#sns.set(style="darkgrid")
#sns.set(style="whitegrid")
#sns.set_style("white")
sns.set(style="whitegrid",font_scale=2)
import matplotlib.collections as clt
df = pd.read_csv ("u118phag.csv", sep= ",")
df.head()
savefigs = True
figs_dir = 'figs'
if savefigs:
    # Make the figures folder if it doesn't yet exist
    if not os.path.isdir('figs'):
        os.makedirs('figs')
#automation
f, ax = plt.subplots(figsize=(4, 5))
#f.subplots_adjust(hspace=0,wspace=0)
dx = "Treatment"; dy = "score"; ort = "v"; pal = "Set2"; sigma = .2
ax=pt.RainCloud(x = dx, y = dy, data = df, palette = pal, bw = sigma,
width_viol = .6, ax = ax, move=.2, offset=.1, orient = ort, pointplot = True)
f.show()
if savefigs:
    f.savefig('figs/figure20.png', bbox_inches='tight', dpi=500)
which generates the following graph
The raw code not using ptitprince is as follows and produces the same graph as above:
# coding: utf8
import pandas as pd
import ptitprince as pt
import seaborn as sns
import os
import matplotlib.pyplot as plt
#sns.set(style="darkgrid")
#sns.set(style="whitegrid")
#sns.set_style("white")
sns.set(style="whitegrid",font_scale=2)
import matplotlib.collections as clt
df = pd.read_csv ("u118phag.csv", sep= ",")
df.head()
savefigs = True
figs_dir = 'figs'
if savefigs:
    # Make the figures folder if it doesn't yet exist
    if not os.path.isdir('figs'):
        os.makedirs('figs')
f, ax = plt.subplots(figsize=(7, 5))
dy="Treatment"; dx="score"; ort="h"; pal = sns.color_palette(n_colors=1)
#adding color
pal = "Set2"
f, ax = plt.subplots(figsize=(7, 5))
ax=pt.half_violinplot( x = dx, y = dy, data = df, palette = pal, bw = .2, cut = 0.,
scale = "area", width = .6, inner = None, orient = ort)
ax=sns.stripplot( x = dx, y = dy, data = df, palette = pal, edgecolor = "white",
size = 3, jitter = 1, zorder = 0, orient = ort)
ax=sns.boxplot( x = dx, y = dy, data = df, color = "black", width = .15, zorder = 10,\
showcaps = True, boxprops = {'facecolor':'none', "zorder":10},\
showfliers=True, whiskerprops = {'linewidth':2, "zorder":10},\
saturation = 1, orient = ort)
if savefigs:
    f.savefig('figs/figure21.png', bbox_inches='tight', dpi=500)
Now, what I'm trying to do is to figure out how to modify the graph so that I can (1) move the plots closer together, so there is not so much white space between them, and (2) shift the x-axis to the right, so that I can make the distribution (violin) plot wider without it getting cut in half by the y-axis.
I have tried to play around with subplots_adjust() as you can see in the first box of code, but I receive an error. I cannot figure out how to appropriately use this function, or even if that will actually bring the different graphs closer together.
I also know that I can increase the distribution size by increasing the value of width = .6, but if I increase it too much, the distribution plot begins to be cut off by the y-axis. I can't figure out whether I need to adjust the overall plot using plt.subplots, or whether I need to move each individual plot.
Any advice or recommendations on how to change the visuals of the graph? I've been staring at this for a while, and I can't figure out how to make seaborn/matplotlib play nicely with ptitprince.
You may try to change the interval of X-axis being shown using ax.set_xbound (put a lower value than you currently have for the beginning).
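As a rough illustration of that suggestion, the sketch below lowers only the left bound of the x-axis and leaves the right bound untouched. The plotted values and the 0.5 offset are placeholders, not taken from the question's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0.0, 1.0, 2.0], [1.0, 3.0, 2.0])

old_lower, old_upper = ax.get_xbound()
# extend the visible range to the left; the upper bound is kept as it was
ax.set_xbound(lower=old_lower - 0.5)
new_lower, new_upper = ax.get_xbound()
```

With the RainCloud plot, the same call on the returned ax should leave more room between the y-axis and the violin.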
I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using Matplotlib and Seaborn combination and with the help of a kind StackOverflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except, I don't know how to put the Header information in the plots. Here is my Python code that creates the Python Charts
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec":"k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component(pc1,pc2,rhob) for this test
                # indicated by 1, corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize = 3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)
if __name__ == "__main__":
    main()
The Pandas code works fine and it is doing what it is supposed to. It is my lack of knowledge and experience of using 'PdfPages' in Matplotlib that is the bottleneck. How can I show the header information in Python/Matplotlib/Seaborn that I can show in the corresponding R visualization? By the header information, I mean what the R visualization has at the top before the histograms, i.e., 'pc1', MRP, XRD,....
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the Header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to plot, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
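Putting the suptitle and the figure-level legend together with PdfPages might look like the sketch below, which writes one page per component. The component names, test labels, marker choices and random data are all placeholders invented for illustration, and the PDF is written to a temporary folder rather than to the question's plots.pdf.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.backends.backend_pdf import PdfPages

rng = np.random.default_rng(0)
components = ["pc1", "pc2"]              # hypothetical component names
test_markers = {"MRP": "s", "XRD": "o"}  # hypothetical test labels and markers

pdf_path = os.path.join(tempfile.mkdtemp(), "plots.pdf")
titles = []
with PdfPages(pdf_path) as pdf:
    for component in components:
        fig, axes = plt.subplots(2, 2, figsize=(6, 5))
        for ax in axes.flat:
            ax.hist(rng.normal(size=200), bins=20, color="lightgrey", ec="k")
        # labelled markers on one subplot are enough to build the legend
        first = axes.flat[0]
        for k, (label, marker) in enumerate(test_markers.items()):
            first.plot(rng.normal(size=5), np.full(5, 2.0 + 2 * k), marker,
                       color="black", markersize=3, label=label)
        fig.subplots_adjust(top=0.82)  # leave room for the header
        title = fig.suptitle(component, fontsize=12)
        # anchor the legend in figure coordinates, just below the title
        legend = first.legend(loc="lower center", bbox_to_anchor=(0.5, 0.86),
                              bbox_transform=fig.transFigure,
                              ncol=len(test_markers), fontsize="x-small")
        pdf.savefig(fig)  # one PDF page per component
        titles.append(title.get_text())
        plt.close(fig)
```

Using PdfPages as a context manager closes the file automatically, and passing fig to pdf.savefig avoids relying on whichever figure happens to be current.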
I'm trying to do a quite simple scatter plot with error bars and semilogy scale. What is a little bit different from tutorials I have found is that the color of the scatterplot should trace a different quantity. On one hand, I was able to do a scatterplot with the errorbars with my data, but just with one color. On the other hand, I realized a scatterplot with the right colors, but without the errorbars.
I'm not able to combine the two different things.
Here is an example using fake data:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
n=100
Lx_gas = 1e40*np.random.random(n) + 1e37
Tx_gas = np.random.random(n) + 0.5
Lx_plus_error = Lx_gas
Tx_plus_error = Tx_gas/2.
Tx_minus_error = Tx_gas/4.
#actually positive numbers, this is the quantity that should be traced by the
#color, in this example I use random numbers
Lambda = np.random.random(n)
#this is actually different from zero, but I want to be sure that this simple
#code works with the log axis
Lx_minus_error = np.zeros_like(Lx_gas)
#normalize the color, to be between 0 and 1
colors = np.asarray(Lambda)
colors -= colors.min()
colors *= (1./colors.max())
#build the error arrays
Lx_error = [Lx_minus_error, Lx_plus_error]
Tx_error = [Tx_minus_error, Tx_plus_error]
##--------------
##important part of the script
##this works, but all the dots are of the same color
#plt.errorbar(Tx_gas, Lx_gas, xerr = Tx_error,yerr = Lx_error,fmt='o')
##this is what it should be in terms of colors, but it is without the error bars
#plt.scatter(Tx_gas, Lx_gas, marker='s', c=colors)
##what I tried (and failed)
plt.errorbar(Tx_gas, Lx_gas, xerr = Tx_error,yerr = Lx_error,\
color=colors, fmt='o')
ax = plt.gca()
ax.set_yscale('log')
plt.show()
I even tried to plot the scatterplot after the errorbar, but for some reason everything plotted on the same window is put in background with respect to the errorplot.
Any ideas?
Thanks!
You can set the color to the LineCollection object returned by the errorbar as described here.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
n=100
Lx_gas = 1e40*np.random.random(n) + 1e37
Tx_gas = np.random.random(n) + 0.5
Lx_plus_error = Lx_gas
Tx_plus_error = Tx_gas/2.
Tx_minus_error = Tx_gas/4.
#actually positive numbers, this is the quantity that should be traced by the
#color, in this example I use random numbers
Lambda = np.random.random(n)
#this is actually different from zero, but I want to be sure that this simple
#code works with the log axis
Lx_minus_error = np.zeros_like(Lx_gas)
#normalize the color, to be between 0 and 1
colors = np.asarray(Lambda)
colors -= colors.min()
colors *= (1./colors.max())
#build the error arrays
Lx_error = [Lx_minus_error, Lx_plus_error]
Tx_error = [Tx_minus_error, Tx_plus_error]
sct = plt.scatter(Tx_gas, Lx_gas, marker='s', c=colors)
cb = plt.colorbar(sct)
_, __ , errorlinecollection = plt.errorbar(Tx_gas, Lx_gas, xerr = Tx_error,yerr = Lx_error, marker = '', ls = '', zorder = 0)
error_color = sct.to_rgba(colors)
errorlinecollection[0].set_color(error_color)
errorlinecollection[1].set_color(error_color)
ax = plt.gca()
ax.set_yscale('log')
plt.show()
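To make the role of errorbar's third return value more concrete, here is a smaller sketch of the same idea on linear axes. The data and the fixed error sizes are invented, and the colormap name is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = rng.random(n) + 0.5
y = rng.random(n) + 1.0
values = rng.random(n)  # quantity traced by the color

fig, ax = plt.subplots()
sct = ax.scatter(x, y, c=values, cmap="viridis", zorder=2)

# errorbar returns (data line, cap lines, bar line collections);
# with both xerr and yerr there is one LineCollection per direction
data_line, caplines, barlinecols = ax.errorbar(
    x, y, xerr=0.05, yerr=0.1, marker="", ls="", zorder=1
)

# reuse the scatter's colormap and normalization for the bars
rgba = sct.to_rgba(values)
for collection in barlinecols:
    collection.set_color(rgba)  # one RGBA row per error bar segment
```

Giving the error bars a lower zorder than the scatter keeps the colored markers on top of them.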