Plotting a histogram on a seaborn PairGrid with hue stacks the bars by default. Is there a way to avoid this? (stacked=False does not help.)
I tried seaborn.distplot with kde=False, but the bars are too wide in my case, and decreasing rwidth shifts the bars away from the corresponding variable values (which does not happen with plt.hist).
EDIT: code to illustrate the so-called 'shifting away from the corresponding variable values' (actually plt.hist does it too, but it is less obvious).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Build one frame per name and concatenate them
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat(
    [pd.DataFrame({'name': [n] * 100,
                   'prior': [1, 10] * 50,
                   'post': [1, 10] * 50})
     for n in ['a', 'b']],
    ignore_index=True)
g = sns.PairGrid(df, hue='name', diag_sharey=False)
g.map_offdiag(sns.regplot, fit_reg=False, x_jitter=.1)
g.map_diag(plt.hist, rwidth=0.2, stacked=False)
g = sns.PairGrid(df, hue='name', diag_sharey=False)
g.map_offdiag(sns.regplot, fit_reg=False, x_jitter=.1)
g.map_diag(sns.distplot, kde=False, hist_kws={'rwidth':0.2})
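The shift in the snippets above most likely comes from how the bins are laid out rather than from seaborn itself: by default the data range is split into 10 equal-width bins, and rwidth shrinks each bar around its bin center, not around the data value. A minimal sketch with the same values (1 and 10) that uses only NumPy; the name centered_edges is made up for illustration:

```python
import numpy as np

values = np.array([1, 10] * 50)

# Default behaviour of plt.hist: 10 equal-width bins spanning [min, max].
counts, edges = np.histogram(values, bins=10)
centers = (edges[:-1] + edges[1:]) / 2
print(centers[0], centers[-1])  # bars sit at the bin centers, not at 1 and 10

# Bin edges placed half a unit on either side of each integer value,
# so every bar is centered on the value it counts.
centered_edges = np.arange(values.min() - 0.5, values.max() + 1.5, 1.0)
centered_counts, _ = np.histogram(values, bins=centered_edges)
centered_centers = (centered_edges[:-1] + centered_edges[1:]) / 2
print(centered_centers[0], centered_centers[-1])
```

Passing edges like centered_edges as bins= to g.map_diag(plt.hist, ...) should keep narrow bars aligned with the values. With seaborn 0.11 or later, sns.histplot with multiple="dodge" may also be worth trying to place the hue levels side by side instead of stacking them.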
I am drawing boxplots with the Python seaborn package. I have a facet grid with both rows and columns. That much I've been able to do with the seaborn function catplot.
I also want to annotate the outliers. I have found some nice examples on SO for annotating outliers, but without the facet structure. That's where I'm struggling.
Here is what I've got (borrows heavily from this post: Boxplot : Outliers Labels Python):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats
sns.set_style('darkgrid')
Month = np.repeat(np.arange(1, 11), 10)
Id = np.arange(1, 101)
Value = np.random.randn(100)
Row = ["up", "down"]*50
df = pd.DataFrame({'Value': Value, 'Month': Month, 'Id': Id, 'Row': Row})
g = sns.catplot(data=df, x="Month", y="Value", row="Row", kind="box", height=3, aspect=3)
for name, group in df.groupby(["Month", "Row"]):
    fliers = [y for stat in boxplot_stats(group["Value"]) for y in stat["fliers"]]
    d = group[group["Value"].isin(fliers)]
    g.axes.flatten().annotate(d["Id"], xy=(d["Month"] - 1, d["Value"]))
The dataframe d collects all the outliers by patch. The last line aims to match d with the patches of the graph g. That doesn't work, and I haven't found a way to flatten the axes into a list where each element corresponds to a grouped dataframe element.
I'd be glad to hear alternative versions for achieving this too.
One way to do it:
for name, group in df.groupby(["Month", "Row"]):
    fliers = [y for stat in boxplot_stats(group["Value"]) for y in stat["fliers"]]
    d = group[group["Value"].isin(fliers)]
    for i in range(len(d)):
        ngrid = (0 if d.iloc[i,3]=='up' else 1)
        g.fig.axes[ngrid].annotate(d.iloc[i, 2], xy=(d.iloc[i, 1] - 1, d.iloc[i, 0]))
You can loop through g.axes_dict to visit each of the axes.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats
sns.set_style('darkgrid')
Month = np.repeat(np.arange(1, 11), 10)
Id = np.arange(1, 101)
Value = np.random.randn(100)
Row = ["up", "down"] * 50
df = pd.DataFrame({'Value': Value, 'Month': Month, 'Id': Id, 'Row': Row})
g = sns.catplot(data=df, x="Month", y="Value", row="Row", kind="box", height=3, aspect=3)
for row, ax in g.axes_dict.items():
    for month in np.unique(df["Month"]):
        group = df.loc[(df["Row"] == row) & (df["Month"] == month), :]
        fliers = boxplot_stats(group["Value"])[0]["fliers"]
        if len(fliers) > 0:
            for mon, val, id in zip(group["Month"], group["Value"], group["Id"]):
                if val in fliers:
                    ax.annotate(f' {id}', xy=(mon - 1, val))
plt.tight_layout()
plt.show()
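The outlier lookup that both answers repeat inside their loops can be pulled out into a small helper. This is only a sketch: find_outliers is a hypothetical name, and the demo frame is made-up data chosen so that exactly one value falls outside the default 1.5 * IQR whiskers.

```python
import pandas as pd
from matplotlib.cbook import boxplot_stats


def find_outliers(data, value_col, group_cols):
    """Return the rows of `data` whose `value_col` is a flier within its group."""
    parts = []
    for _, group in data.groupby(group_cols):
        # boxplot_stats applies the 1.5 * IQR whisker rule by default
        fliers = boxplot_stats(group[value_col].to_numpy())[0]["fliers"]
        parts.append(group[group[value_col].isin(fliers)])
    return pd.concat(parts)


# Made-up data: Id 6 (value 100) is the only flier
demo = pd.DataFrame({
    "Month": [1] * 6 + [2] * 6,
    "Value": [1, 2, 3, 4, 5, 100, 1, 2, 3, 4, 5, 6],
    "Id": range(1, 13),
})
outliers = find_outliers(demo, "Value", ["Month"])
print(outliers)
```

With the frame from the question, find_outliers(df, "Value", ["Month", "Row"]) would give one table of outliers, and each row could then be annotated on the axes that matches its Row value.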
I tried hard to look through all the documentation and examples but I am not able to figure it out. How do I change the number of categories = the number of size bubbles, and their boundaries in seaborn scatterplot? The sizes parameter doesn't help here.
It always gives me 6 of them regardless of what I try (here 8, 16, ..., 48):
import seaborn as sns
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", size="total_bill")
or
penguins = sns.load_dataset("penguins")
sns.scatterplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", size="body_mass_g")
And how do I change their boundaries? I.e., if I want to have 10, 20, 30, 40, 50 in the first case or 3000, 4000, 5000, 6000 in the second?
I know that going around and creating another column in the dataframe works but that is not wanted (adds unnecessary columns and even if I do it on the fly, it's just not what I am looking for).
Workaround:
def myfunc(mass):
    if mass < 3500:
        return 3000
    elif mass < 4500:
        return 4000
    elif mass < 5500:
        return 5000
    return 6000
penguins["mass"] = penguins.apply(lambda x: myfunc(x['body_mass_g']), axis=1)
sns.scatterplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", size="mass")
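The same binning can be written without a row-wise apply. The sketch below assumes the thresholds from myfunc above; bin_mass, edges and bin_labels are illustrative names, and the sample masses are made up.

```python
import numpy as np

# Upper bounds of the first three bins and the label assigned to each bin,
# mirroring the thresholds in myfunc above.
edges = np.array([3500, 4500, 5500])
bin_labels = np.array([3000, 4000, 5000, 6000])


def bin_mass(mass):
    """Map each mass to its bin label without a Python-level loop."""
    return bin_labels[np.digitize(mass, edges)]


sample = np.array([3400, 3500, 4499, 5500, 6300])
print(bin_mass(sample))  # [3000 4000 4000 6000 6000]
```

Since seaborn accepts array-like values for size, passing something like size=bin_mass(penguins["body_mass_g"]) may avoid the extra column, although the marker sizes would then follow the binned values just as in the workaround.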
I don't think seaborn offers fine-grained control here; it tries to come up with something that works reasonably intuitively in many situations, but not in all. The legend='full' parameter shows all values of the size column, but that can be overwhelming.
The suggestion to create a new column with binned sizes has the drawback that this will also change the sizes used in the scatterplot.
An approach could be to create your own custom legend. Note that when the legend also contains other elements, this approach needs to be adapted a bit.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
tips = sns.load_dataset("tips")
ax = sns.scatterplot(data=tips, x="total_bill", y="tip", size="total_bill", legend='full')
handles, labels = ax.get_legend_handles_labels()
labels = np.array([float(l) for l in labels])
desired_labels = [10, 20, 30, 40, 50]
desired_handles = [handles[np.argmin(np.abs(labels - d))] for d in desired_labels]
ax.legend(handles=desired_handles, labels=desired_labels, title=ax.legend_.get_title().get_text())
plt.show()
The code can be wrapped into a function, and e.g. applied to the penguins:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
def sizes_legend(desired_sizes, ax=None):
    ax = ax or plt.gca()
    handles, labels = ax.get_legend_handles_labels()
    labels = np.array([float(l) for l in labels])
    desired_handles = [handles[np.argmin(np.abs(labels - d))] for d in desired_sizes]
    ax.legend(handles=desired_handles, labels=desired_sizes, title=ax.legend_.get_title().get_text())
penguins = sns.load_dataset("penguins")
ax = sns.scatterplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", size="body_mass_g", legend='full')
sizes_legend([3000, 4000, 5000, 6000], ax)
plt.show()
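The handle selection in sizes_legend relies on picking, for each desired value, the nearest label that actually exists in the full legend. A minimal sketch of just that step, with made-up label values standing in for a real legend:

```python
import numpy as np

# Made-up stand-ins for the numeric labels of a legend='full' legend
labels = np.array([3.07, 9.6, 10.34, 19.81, 20.29, 48.27, 50.81])
desired = [10, 20, 50]

# Index of the closest available label for each desired value
nearest = [int(np.argmin(np.abs(labels - d))) for d in desired]
print(labels[nearest])
```

One consequence is that the legend shows the requested number next to the marker of the nearest real value, so the displayed size is only approximate when no data point lies close to the requested value.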
I have written code that looks like this:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
exp1= sns.lineplot(data=df1)
plt.savefig('exp1.png')
exp1_smooth= sns.lmplot(x='Size', y='Time', data=df, ci=None, order=4, truncate=False)
plt.savefig('exp1_smooth.png')
That gives me Graph_1:
Size, which should be the x-axis, shows up as a constant line, but as you can see in my code it varies (10, 100, 1000).
How does this produce a constant line? I want a multi-line graph with x-axis = Size (T) and y-axis = Encrypt_Time and Decrypt_Time (power1 and power2).
I also wanted to plot a smoothed version of the same graph, but that gives me an error. What needs to be done to get a smooth multi-line graph with x-axis = Size (T) and y-axis = Encrypt_Time and Decrypt_Time (power1 and power2)?
I don't think that is the problem: the line for Size looks constant, but it is not.
The Size values lie in the range 10-1000, while the smallest division of the y-axis is 20,000 (20 times bigger), which makes the line look horizontal on your graph.
You can try larger values to see the slope clearly.
If you want 'Size' as the x-axis, you can try the example below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
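fig, ax = plt.subplots()
# lineplot returns an Axes, so draw both series on the same one
sns.lineplot(data=df1, x='Size', y='Encrypt_Time', label='Encrypt_Time', ax=ax)
sns.lineplot(data=df1, x='Size', y='Decrypt_Time', label='Decrypt_Time', ax=ax)
ax.set_ylabel('Time')
plt.show()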
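As background to the question: when sns.lineplot receives only data=df1, it treats the frame as wide-form and draws every column, Size included, against the row index, which is why Size shows up as a line of its own. An alternative to calling lineplot once per column is to reshape the frame to long form first. A sketch using only pandas; the column names Operation and Time are chosen here for illustration:

```python
import numpy as np
import pandas as pd

T = np.array([10.03, 100.348, 1023.385])
power1 = np.array([100000, 86000, 73000])
power2 = np.array([1008000, 95000, 1009000])
df1 = pd.DataFrame({'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})

# Long form: one row per (Size, Operation) pair, with the measurement in 'Time'
long_df = df1.melt(id_vars='Size', var_name='Operation', value_name='Time')
print(long_df)
```

A call such as sns.lineplot(data=long_df, x='Size', y='Time', hue='Operation') should then draw both curves at once, and long_df has the 'Time' column that the lmplot call in the question refers to. With only three sizes, a polynomial fit with order=4 is underdetermined, so a smooth fit probably needs a lower order or more measurements.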
I have a pandas dataframe df with columns x (categorical), y, and z (both floats).
Here is my bar plot.
sns.barplot(data=df, x="x", y="y")
How can I set a color palette for the bars based on the values of the z column? I would like to set a Matplotlib style palette like magma or RdYlBu. Basically, like setting the hue argument, but with a scalar variable.
Thanks in advance!
I'm not sure if there is a way to do this in seaborn. But usually using matplotlib directly works as well.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"x" : list("ABCDEFGH"),
"y" : [3,4,5,2,1,6,3,4],
"z" : [4,5,7,1,4,5,3,4]})
norm = plt.Normalize(df.z.min(), df.z.max())
cmap = plt.get_cmap("magma")
plt.bar(x="x", height="y", data=df, color=cmap(norm(df.z.values)))
plt.show()
If your "categorical" column contains pandas categories, instead of simple strings, you would first need to convert it, df["x"] = df["x"].astype(str).
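To make the color mapping in this answer more concrete, the sketch below runs the same two steps on the z values alone: Normalize rescales z linearly to the range 0 to 1, and the colormap turns each rescaled value into an RGBA tuple. The Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

z = np.array([4, 5, 7, 1, 4, 5, 3, 4])

norm = plt.Normalize(z.min(), z.max())  # maps [1, 7] linearly onto [0, 1]
cmap = plt.get_cmap("magma")

scaled = norm(z)       # values between 0 and 1
colors = cmap(scaled)  # one RGBA row per bar
print(colors.shape)    # (8, 4)
```

The resulting array is what gets passed as color= to plt.bar. Reusing the same norm and cmap in a plt.cm.ScalarMappable is one way to add a matching colorbar, though the exact colorbar call may depend on the matplotlib version.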
Simply use the palette argument that corresponds to the hue variable:
sns.barplot(data=df, x="x", y="y", hue="z", palette='magma')
To demonstrate with random data:
import numpy as np
import pandas as pd
import time
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
np.random.seed(11212018)
rand_df = pd.DataFrame({'GROUP': np.random.choice(data_tools, 500),
'INT': np.random.randint(1, 10, 500),
'NUM': np.random.randn(500),
})
fig, ax = plt.subplots(figsize=(15,5))
sns.barplot(data=rand_df, x='GROUP', y='NUM', hue='INT', palette='magma', ax=ax, ci=None)
plt.legend(bbox_to_anchor=(1,0.5), loc="center right",)
plt.show()
I want to overlay the 95th percentile values on a seaborn boxplot. I could not figure out how to overlay text, or whether seaborn has a capability for that. How would I modify the following code to overlay the 95th percentile values on the plot?
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.DataFrame(np.random.randn(200, 4), columns=list('ABCD'))*100
alphabet = list('AB')
df['Gr'] = np.random.choice(alphabet, df.shape[0])  # plain strings; dtype="|S1" yields bytes on Python 3
df_long = pd.melt(df, id_vars=['Gr'], value_vars = ['A','B','C','D'])
sns.boxplot(x = "variable", y="value", hue = 'Gr', data=df_long, whis = [5,95])
Consider calling text on the Axes that sns.boxplot returns, borrowing from @bernie's answer (also a healthy +1 for including a sample dataset). The only challenge is adjusting the horizontal alignment, since the hue grouping means each label has to sit over its own boxplot series. The labels can even be color coded by series.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(61518)
# ... same as OP
# hplot is assumed to be the Axes returned by the OP's sns.boxplot(...) call
# 95TH PERCENTILE SERIES
pctl95 = df_long.groupby(['variable', 'Gr'])['value'].quantile(0.95)
pctl95_labels = [str(np.round(s, 2)) for s in pctl95]
# GROUP INDEX TUPLES: (x tick, position of first series, position of second series)
grps = [(i, 2*i, 2*i+1) for i in range(4)]
# [(0,0,1), (1,2,3), (2,4,5), (3,6,7)]
# ADJUST HORIZONTAL ALIGNMENT WITH MORE SERIES
for tick in grps:
    # .iloc makes the positional lookup on the MultiIndex series explicit
    hplot.text(tick[0]-0.1, pctl95.iloc[tick[1]] + 0.95, pctl95_labels[tick[1]],
               ha='center', size='x-small', color='b', weight='semibold')
    hplot.text(tick[0]+0.1, pctl95.iloc[tick[2]] + 0.95, pctl95_labels[tick[2]],
               ha='center', size='x-small', color='g', weight='semibold')
plt.show()  # sns.plt is not available in current seaborn releases
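The positional lookup in grps works because the grouped quantile series comes back sorted by (variable, Gr), so positions 2*i and 2*i + 1 belong to x tick i. A small sketch of that ordering, using freshly generated random data rather than the OP's exact frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((200, 4)), columns=list('ABCD')) * 100
df['Gr'] = rng.choice(['A', 'B'], size=len(df))
df_long = pd.melt(df, id_vars=['Gr'], value_vars=['A', 'B', 'C', 'D'])

pctl95 = df_long.groupby(['variable', 'Gr'])['value'].quantile(0.95)

# The result is sorted by (variable, Gr), so positions 2*i and 2*i + 1
# belong to x tick i: (A, A), (A, B), (B, A), (B, B), ...
for i, variable in enumerate(['A', 'B', 'C', 'D']):
    left, right = pctl95.iloc[2 * i], pctl95.iloc[2 * i + 1]
    print(variable, round(left, 2), round(right, 2))
```

This mapping assumes that every variable has both groups and that the boxes are drawn in the same sorted order. Seaborn may order string hue levels by first appearance instead, so passing hue_order=['A', 'B'] to sns.boxplot is probably the safer way to keep labels and boxes in step.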