Retrieving line data from multiple weighted seaborn distribution plots? - python

I have the code below with randomly generated dataframes and I would like to extract the x and y values of both plotted lines. These line plots show the Price on the Y-axis and are Volume weighted.
For some reason, the line values for the second distribution plot cannot be stored in the variables "df_2_x" and "df_2_y"; they end up holding the same values as "df_1_x" and "df_1_y". Both print statements return True, so the arrays are completely equal.
If I put them in separate cells in a notebook, it does work.
I also looked at this solution: How to retrieve all data from seaborn distribution plot with mutliple distributions?
But this does not work for weighted distplots.
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2,12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100,3000)) for i in range(30)]
Price_2 = [round(random.uniform(0,10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100,1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1' : Price_1,
'Volume_1' : Volume_1})
df_2 = pd.DataFrame({'Price_2' : Price_2,
'Volume_2' :Volume_2})
df_1_x, df_1_y = sns.distplot(df_1.Price_1, hist_kws={"weights":list(df_1.Volume_1)}).get_lines()[0].get_data()
df_2_x, df_2_y = sns.distplot(df_2.Price_2, hist_kws={"weights":list(df_2.Volume_2)}).get_lines()[0].get_data()
print((df_1_x == df_2_x).all())
print((df_1_y == df_2_y).all())
Why does this happen, and how can I fix this?

Whether or not weights are used makes no difference here.
The principal problem is that you are extracting the first curve again in df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[0].get_data(). Both calls draw onto the same ax, so after the second call the ax holds two lines and index 0 is still the curve of df_1. You'd want the second curve instead: df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[1].get_data().
Note that seaborn isn't really meant for chaining commands. Sometimes it works, but it usually adds a lot of confusion. E.g. sns.distplot returns an ax (which represents a subplot). Graphical elements such as lines are added to that ax.
Also note that sns.distplot has been deprecated. It will be removed from Seaborn in one of the next versions. It is replaced by sns.histplot and sns.kdeplot.
Here is what the code could look like:
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2, 12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100, 3000)) for i in range(30)]
Price_2 = [round(random.uniform(0, 10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100, 1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1': Price_1,
'Volume_1': Volume_1})
df_2 = pd.DataFrame({'Price_2': Price_2,
'Volume_2': Volume_2})
ax = sns.histplot(x=df_1.Price_1, weights=list(df_1.Volume_1), bins=10, kde=True, kde_kws={'cut': 3})
sns.histplot(x=df_2.Price_2, weights=list(df_2.Volume_2), bins=10, kde=True, kde_kws={'cut': 3}, ax=ax)
df_1_x, df_1_y = ax.lines[0].get_data()
df_2_x, df_2_y = ax.lines[1].get_data()
# use fill_between to demonstrate where the extracted curves lie
ax.fill_between(df_1_x, 0, df_1_y, color='b', alpha=0.2)
ax.fill_between(df_2_x, 0, df_2_y, color='r', alpha=0.2)
plt.show()

Related

Create a Seaborn style histogram / kernel density plot using the actual density function

I really like the look of Seaborn's KDE plot:
I was wondering how I can replicate this for a line plot.
In my case I actually have the function to generate the density instead of samples of the data.
So assuming I have the data in a data frame:
x - The value of x per sample.
y - The value of the density function at x.
μσ - Categorical variable to group data from the same density (In the code, I use the mean and standard deviation of a normal distribution).
I can use Seaborn's lineplot to get what I want without the area below the curve as in the image above.
I'm after achieving the look as above for the data I have.
Is there a way to replicate this theme, area under the curve included, for lineplot?
The code below shows what I got so far:
import numpy as np
import scipy as sp
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
num_grid_pts = 1000
val_μ = [0, -1, 1, 0]
val_σ = [1, 2, 3, 4]
num_var = len(val_μ) # variations
x = np.linspace(-10, 10, num_grid_pts)
P = np.zeros((num_grid_pts, num_var)) # PDF
μσ = [f'μ = {μ}, σ = {σ}' for μ, σ in zip(val_μ, val_σ)]
for ii, (μ, σ) in enumerate(zip(val_μ, val_σ)):
    randVar = norm(μ, σ)
    P[:, ii] = randVar.pdf(x)
df_P = pd.DataFrame(data = {'x': np.tile(x, num_var), 'PDF': P.flatten('F'), 'μσ': np.repeat(μσ, len(x))})
f, ax = plt.subplots(figsize=(15, 10))
sns.lineplot(data=df_P, x='x', y='PDF', hue='μσ', ax=ax)
plot_lines = ax.get_lines()
for ii in range(num_var):
    ax.fill_between(x=plot_lines[ii].get_xdata(), y1=plot_lines[ii].get_ydata(), alpha=0.25, color=plot_lines[ii].get_color())
ax.set_title(f'Normal Distribution')
ax.set_xlabel(f'Value')
ax.set_ylabel(f'Probability')
plt.show()
I used lineplot to create the lines and then added the fills. But this is a hack; I was wondering if I can do it more naturally within Seaborn.
I found a way to do this using the Area object of the seaborn.objects interface:
(
so.Plot(healthexp, "Year", "Spending_USD", color="Country")
.add(so.Area(alpha=.7), so.Stack())
)
The result is:
Yet for some reason the example code doesn't work for me.
What I did instead was use Seaborn's lineplot() and then manually add a fill_between() polygon:
ax = sns.lineplot(data=data_frame, x='data_x', y='data_y', hue='data_color')
plot_lines = ax.get_lines()
for i in range(num_unique_colors):
    ax.fill_between(x=plot_lines[i].get_xdata(), y1=plot_lines[i].get_ydata(), alpha=0.25, color=plot_lines[i].get_color())

How to set xlim in seaborn barplot?

I have created a barplot for given days of the year and the number of people born on each day (figure a). I want to set the x-axis of my seaborn barplot to xlim = (0,365) to show the whole year.
But once I use ax.set_xlim(0,365), the bar plot is simply moved to the left (figure b).
This is the code:
#data
df = pd.DataFrame()
df['day'] = np.arange(41,200)
df['born'] = np.random.randn(159)*100
#plot
f, axes = plt.subplots(4, 4, figsize = (12,12))
ax = sns.barplot(df.day, df.born, data = df, hue = df.time, ax = axes[0,0], color = 'skyblue')
ax.get_xaxis().set_label_text('')
ax.set_xticklabels('')
ax.set_yscale('log')
ax.set_ylim(0,10e3)
ax.set_xlim(0,366)
ax.set_title('SE Africa')
How can I set the x-axis limits to day 0 and 365 without the bars being shifted to the left?
IIUC, the expected output is difficult to obtain directly given the nature of the data because, as the documentation of seaborn.barplot puts it:
This function always treats one of the variables as categorical and draws data at ordinal positions (0, 1, … n) on the relevant axis, even when the data has a numeric or date type.
This means seaborn.barplot creates categories from the data in x (here, df.day) and maps them to integers starting from 0.
Therefore, even though the data starts at day 41, seaborn places the first category at x = 0, which makes it difficult to adjust the lower limit of the x-axis after the function call.
The following code and the corresponding plot illustrate this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# data
rng = np.random.default_rng(101)
day = np.arange(41,200)
born = rng.integers(low=0, high=10e4, size=200-41)
df = pd.DataFrame({"day":day, "born":born})
# plot
f, ax = plt.subplots(figsize=(4, 4))
sns.barplot(data=df, x='day', y='born', ax=ax, color='b')
ax.set_xlim(0,365)
ax.set_xticks(ticks=np.arange(0, 365, 30), labels=np.arange(0, 365, 30))
ax.set_yscale('log')
ax.set_title('SE Africa')
plt.tight_layout()
plt.show()
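The ordinal positions can also be checked directly by reading the bar centres back from the axes. This is a small sketch with made-up values, not part of the original answer; it assumes the bars end up as rectangles in ax.patches, which is how barplot draws them in the seaborn versions I am aware of:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# five made-up days starting at 41, as in the question
df_small = pd.DataFrame({"day": np.arange(41, 46),
                         "born": [120, 80, 150, 60, 90]})

fig, ax = plt.subplots()
sns.barplot(data=df_small, x="day", y="born", color="b", ax=ax)

# centre of each bar along the x-axis
centers = sorted(p.get_x() + p.get_width() / 2 for p in ax.patches)
print(centers)
```

The centres come out as 0 to 4 rather than 41 to 45, which is why set_xlim(0, 365) leaves the bars bunched at the left edge.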
I suggest using matplotlib.axes.Axes.bar to overcome this issue, although coloring the bars by a variable is less straightforward than with sns.barplot(..., hue=..., ...):
# plot
f, ax = plt.subplots(figsize=(4, 4))
ax.bar(x=df.day, height=df.born) # instead of sns.barplot
ax.get_xaxis().set_label_text('')
ax.set_xlim(0,365)
ax.set_yscale('log')
ax.set_title('SE Africa')
plt.tight_layout()
plt.show()

Pointplot and Scatterplot in one figure but X axis is shifting

Hi I'm trying to plot a pointplot and scatterplot on one graph with the same dataset so I can see the individual points that make up the pointplot.
Here is the code I am using:
xlPath = r'path to data here'
df = pd.concat(pd.read_excel(xlPath, sheet_name=None),ignore_index=True)
sns.pointplot(data=df, x='ID', y='HM (N/mm2)', palette='bright', capsize=0.15, alpha=0.5, ci=95, join=True, hue='Layer')
sns.scatterplot(data=df, x='ID', y='HM (N/mm2)')
plt.show()
When I plot, for some reason the points from the scatterplot are offsetting one ID spot right on the x-axis. When I plot the scatter or the point plot separately, they each are in the correct ID spot. Why would plotting them on the same plot cause the scatterplot to offset one right?
Edit: Tried to make the ID column categorical, but that didn't work either.
Seaborn's pointplot creates a categorical x-axis while here the scatterplot uses a numerical x-axis.
Explicitly making the x-values categorical: df['ID'] = pd.Categorical(df['ID']), isn't sufficient, as the scatterplot still sees numbers. Changing the values to strings does the trick. To get them in the correct order, sorting might be necessary.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# first create some test data
df = pd.DataFrame({'ID': np.random.choice(np.arange(1, 49), 500),
'HM (N/mm2)': np.random.uniform(1, 10, 500)})
df['Layer'] = ((df['ID'] - 1) // 6) % 4 + 1
df['HM (N/mm2)'] += df['Layer'] * 8
df['Layer'] = df['Layer'].map(lambda s: f'Layer {s}')
# sort the values and convert the 'ID's to strings
df = df.sort_values('ID')
df['ID'] = df['ID'].astype(str)
fig, ax = plt.subplots(figsize=(12, 4))
sns.pointplot(data=df, x='ID', y='HM (N/mm2)', palette='bright',
capsize=0.15, alpha=0.5, ci=95, join=True, hue='Layer', ax=ax)
sns.scatterplot(data=df, x='ID', y='HM (N/mm2)', color='purple', ax=ax)
ax.margins(x=0.02)
plt.tight_layout()
plt.show()
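An alternative that avoids the string conversion is to draw the individual points with sns.stripplot, which treats x as categorical in the same way as pointplot. This is only a sketch on made-up data and not part of the original answer; the column name 'HM' is shortened for brevity:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# numeric IDs that are far apart, 20 measurements each
df = pd.DataFrame({"ID": np.repeat([10, 20, 30], 20),
                   "HM": rng.uniform(1, 10, 60)})

fig, ax = plt.subplots(figsize=(6, 4))
sns.pointplot(data=df, x="ID", y="HM", ax=ax)
# stripplot places the points at the same ordinal positions (plus a small jitter)
sns.stripplot(data=df, x="ID", y="HM", color="purple", alpha=0.5, ax=ax)

# x positions of all scatter markers on the axes
marker_x = np.concatenate([np.asarray(c.get_offsets())[:, 0] for c in ax.collections])
print(marker_x.min(), marker_x.max())
```

All markers stay close to the ordinal positions 0, 1 and 2, so they line up with the pointplot even though the IDs are 10, 20 and 30.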

Merging legends of different lineplots using seaborn

I would like to plot two dataframes in 'long' format, with different y-axes, in one plot using sns.lineplot(). However, I have not managed to plot them with a single legend containing the elements of both lineplots.
The issue is similar to this: Secondary axis with twinx(): how to add to legend?, though I'd like to use seaborn.
A minimal working example up to the point I got stuck is given below.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import itertools
# mock dataset
lst = range(1,11)
steps1 = list(itertools.chain.from_iterable(itertools.repeat(x, 4) for x in lst))
labels1 = ['A','B']*20
values1 = list(np.random.uniform(0,1,40))
df1 = pd.DataFrame({'steps':steps1, 'lab':labels1, 'vals':values1})
lst = range(6,11)
steps2 = list(itertools.chain.from_iterable(itertools.repeat(x, 4) for x in lst))
labels2 = ['C','D']*10
values2 = list(np.random.uniform(10,20,20))
df2 = pd.DataFrame({'steps':steps2, 'lab2':labels2, 'others':values2})
# plotting
fig, ax = plt.subplots()
fig = sns.lineplot(x='steps',y='vals', data=df1, hue='lab',palette='bright', legend='brief')
ax2 = ax.twinx()
fig2 = sns.lineplot(x='steps',y='others', hue='lab2', data=df2 ,palette='dark', legend='brief')
# How do I merge the legends into one?
# the solution below gives me one merged and one separate legend
h1,l1 = fig.get_legend_handles_labels()
h2,l2 = fig2.get_legend_handles_labels()
ax.legend(loc=3, handles=h1+h2, labels = l1+l2)
I resolved it by removing the obsolete legend with ax2.get_legend().remove().

MatPlotlib Seaborn Multiple Plots formatting

I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using Matplotlib and Seaborn combination and with the help of a kind StackOverflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except, I don't know how to put the Header information in the plots. Here is my Python code that creates the Python Charts
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec":"k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component(pc1,pc2,rhob) for this test
                # indicated by 1, corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize = 3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

if __name__ == "__main__":
    main()
The Pandas code works fine and does what it is supposed to. The bottleneck is my lack of knowledge and experience with 'PdfPages' in Matplotlib. How can I show the header information in Python/Matplotlib/Seaborn that the corresponding R visualization shows? By header information, I mean what the R visualization has at the top before the histograms, i.e., 'pc1', MRP, XRD, ...
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the Header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to the plot call, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
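A compact sketch combining both pieces, using plain Matplotlib and placeholder data. The subplot grid, the marker positions and the anchor (0.5, 0.9) are arbitrary choices for illustration; "MRP" and "XRD" are taken from the header described in the question:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# test name -> marker, standing in for the entries of markers_list
tests = {"MRP": "s", "XRD": "o"}

fig, axes = plt.subplots(2, 2, figsize=(6, 6))
for axs in axes.flat:
    for offset, (name, marker) in enumerate(tests.items()):
        # placeholder points; in the real program these sit on the KDE curve
        axs.plot([0.2, 0.5, 0.8], [0.5 - 0.1 * offset] * 3, marker,
                 markersize=3, color="black", label=name)

# figure-level title in place of the 'pc1' header
title = fig.suptitle("pc1", fontsize=12)

# one legend, built from the first subplot but positioned in figure coordinates
legend = axes[0, 0].legend(loc="lower center", bbox_to_anchor=(0.5, 0.9),
                           bbox_transform=fig.transFigure,
                           ncol=len(tests), fontsize="x-small")

fig.savefig("header_demo.pdf")
```

Since all subplots use the same markers, a legend taken from a single subplot is enough to serve as the header for the whole figure.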
