How to scatter plot each group of a pandas DataFrame - python

I am making a scatter plot with the geyser dataset from seaborn. I am coloring the points based on the 'kind' column, but for some reason the legend only shows 'long' and leaves out 'short'. I don't know what I am missing. I was also wondering whether there is a simpler way to color-code the data, one that does not use a for-loop. Thanks!
x = geyser_df['waiting']
y = geyser_df['duration']
col = []
for i in range(len(geyser_df)):
    if (geyser_df['kind'][i] == 'short'):
        col.append('MediumVioletRed')
    elif(geyser_df['kind'][i] == 'long'):
        col.append('Navy')
plt.scatter(x, y, c=col)
plt.legend(('long','short'))
plt.xlabel('Waiting')
plt.ylabel("Duration")
plt.suptitle("Waiting vs Duration")
plt.show()

The correct way to do this with pandas is with pandas.DataFrame.groupby and pandas.DataFrame.plot.
Tested in python 3.8.12, pandas 1.3.4, matplotlib 3.4.3, seaborn 0.11.2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# load data
df = sns.load_dataset('geyser')
# plot
fig, ax = plt.subplots(figsize=(6, 4))
colors = {'short': 'MediumVioletRed', 'long': 'Navy'}
for kind, data in df.groupby('kind'):
    data.plot(kind='scatter', x='waiting', y='duration', label=kind, color=colors[kind], ax=ax)
ax.set(xlabel='Waiting', ylabel='Duration')
fig.suptitle('Waiting vs Duration')
plt.show()
The easiest way is with seaborn, a high-level API for matplotlib, where hue is used to separate groups by color.
sns.scatterplot: an axes-level plot
sns.relplot: a figure-level plot where kind='scatter' is the default plot style
fig, ax = plt.subplots(figsize=(6, 4))
colors = {'short': 'MediumVioletRed', 'long': 'Navy'}
sns.scatterplot(data=df, x='waiting', y='duration', hue='kind', palette=colors, ax=ax)
ax.set(xlabel='Waiting', ylabel='Duration')
fig.suptitle('Waiting vs Duration')
plt.show()
colors = {'short': 'MediumVioletRed', 'long': 'Navy'}
p = sns.relplot(data=df, x='waiting', y='duration', hue='kind', palette=colors, height=4, aspect=1.5)
ax = p.axes.flat[0] # extract the single subplot axes
ax.set(xlabel='Waiting', ylabel='Duration')
p.fig.suptitle('Waiting vs Duration', y=1.1)
plt.show()
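The question also asked for a way to build the colors without a for-loop. A minimal sketch, assuming a plain matplotlib scatter is still wanted: `Series.map` turns the 'kind' column into one color per row, and the legend entries are built by hand because a single scatter call produces only one artist. The small DataFrame below is made-up stand-in data, not the real geyser dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

# Stand-in for the geyser data (values are invented for illustration)
df = pd.DataFrame({
    "waiting": [54, 62, 79, 85, 51, 88],
    "duration": [1.8, 2.3, 3.6, 4.5, 1.9, 4.7],
    "kind": ["short", "short", "long", "long", "short", "long"],
})
colors = {"short": "MediumVioletRed", "long": "Navy"}

# Vectorized replacement for the for-loop: one color per row
point_colors = df["kind"].map(colors).tolist()

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["waiting"], df["duration"], c=point_colors)

# One scatter call yields one artist, so the legend handles are created manually
handles = [
    Line2D([], [], marker="o", linestyle="", color=color, label=kind)
    for kind, color in colors.items()
]
ax.legend(handles=handles, title="kind")
ax.set(xlabel="Waiting", ylabel="Duration")
fig.suptitle("Waiting vs Duration")
```

Call `plt.show()` at the end when running interactively.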

You're passing x = geyser_df['waiting'] and y = geyser_df['duration'] as a single dataset, so plt.scatter draws one artist and the legend only picks up the first label, "long". I don't have much experience with this type of library, but to reproduce the example you describe you need to write a program like this:
long = [[], []]
short = [[], []]
col=['MediumVioletRed', 'Navy']
for i in range(len(geyser_df["kind"])):
    # collect the x and y values of each group in separate lists
    if (geyser_df["kind"][i] == "long"):
        long[0].append(geyser_df['waiting'][i])
        long[1].append(geyser_df['duration'][i])
    else:
        short[0].append(geyser_df['waiting'][i])
        short[1].append(geyser_df['duration'][i])
plt.scatter(long[0], long[1], c=col[1], label="long")
plt.scatter(short[0], short[1], c=col[0], label="short")
plt.legend()
plt.xlabel('Waiting')
plt.ylabel("Duration")
plt.suptitle("Waiting vs Duration")
plt.show()

Related

Pointplot and Scatterplot in one figure but X axis is shifting

Hi I'm trying to plot a pointplot and scatterplot on one graph with the same dataset so I can see the individual points that make up the pointplot.
Here is the code I am using:
xlPath = r'path to data here'
df = pd.concat(pd.read_excel(xlPath, sheet_name=None),ignore_index=True)
sns.pointplot(data=df, x='ID', y='HM (N/mm2)', palette='bright', capsize=0.15, alpha=0.5, ci=95, join=True, hue='Layer')
sns.scatterplot(data=df, x='ID', y='HM (N/mm2)')
plt.show()
When I plot, for some reason the points from the scatterplot are offsetting one ID spot right on the x-axis. When I plot the scatter or the point plot separately, they each are in the correct ID spot. Why would plotting them on the same plot cause the scatterplot to offset one right?
Edit: Tried to make the ID column categorical, but that didn't work either.
Seaborn's pointplot creates a categorical x-axis while here the scatterplot uses a numerical x-axis.
Explicitly making the x-values categorical: df['ID'] = pd.Categorical(df['ID']), isn't sufficient, as the scatterplot still sees numbers. Changing the values to strings does the trick. To get them in the correct order, sorting might be necessary.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# first create some test data
df = pd.DataFrame({'ID': np.random.choice(np.arange(1, 49), 500),
                   'HM (N/mm2)': np.random.uniform(1, 10, 500)})
df['Layer'] = ((df['ID'] - 1) // 6) % 4 + 1
df['HM (N/mm2)'] += df['Layer'] * 8
df['Layer'] = df['Layer'].map(lambda s: f'Layer {s}')
# sort the values and convert the 'ID's to strings
df = df.sort_values('ID')
df['ID'] = df['ID'].astype(str)
fig, ax = plt.subplots(figsize=(12, 4))
sns.pointplot(data=df, x='ID', y='HM (N/mm2)', palette='bright',
              capsize=0.15, alpha=0.5, ci=95, join=True, hue='Layer', ax=ax)
sns.scatterplot(data=df, x='ID', y='HM (N/mm2)', color='purple', ax=ax)
ax.margins(x=0.02)
plt.tight_layout()
plt.show()
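The offset described in the question can be checked directly. A small sketch, using made-up data with a hypothetical `value` column: pointplot places its categories at positions 0, 1, 2, ... regardless of the numeric IDs, so a numeric scatter of ID 1 lands at x=1, which is the slot of the second category.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented data: three numeric IDs starting at 1
df = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3],
                   "value": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]})

fig, ax = plt.subplots()
sns.pointplot(data=df, x="ID", y="value", ax=ax)
fig.canvas.draw()

# Categorical positions used by pointplot, independent of the ID values
tick_positions = [int(t) for t in ax.get_xticks()]
print(tick_positions)
```

Converting the IDs to strings, as in the answer above, makes scatterplot use the same categorical positions.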

How to make a bubble graph using seaborn

import matplotlib.pyplot as plt
import numpy as np
# data
x=["IEEE", "Elsevier", "Others"]
y=[7, 6, 2]
import seaborn as sns
plt.legend()
plt.scatter(x, y, s=300, c="blue", alpha=0.4, linewidth=3)
plt.ylabel("No. of Papers")
plt.figure(figsize=(10, 4))
I want to make a graph as shown in the image. I am not sure how to provide data for both journal and conference categories. (Currently, I just include one). Also, I am not sure how to add different colors for each category.
You can try this code snippet for your problem.
- I modified your data format; I suggest using pandas for data visualization.
- I added one more field to visualize the data more efficiently.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
# data
x=["IEEE", "Elsevier", "Others", "IEEE", "Elsevier", "Others"]
y=[7, 6, 2, 5, 4, 3]
z=["conference", "journal", "conference", "journal", "conference", "journal"]
# create pandas dataframe
data_list = pd.DataFrame(
    {'x_axis': x,
     'y_axis': y,
     'category': z
    })
# change size of data points
minsize = min(data_list['y_axis'])
maxsize = max(data_list['y_axis'])
# scatter plot
sns.catplot(x="x_axis", y="y_axis", kind="swarm", hue="category",sizes=(minsize*100, maxsize*100), data=data_list)
plt.grid()
How to create the graph with correct bubble sizes and with no overlap
Seaborn stripplot and swarmplot (or sns.catplot(kind=strip or kind=swarm)) provide the handy dodge argument which prevents the bubbles from overlapping. The only downside is that the size argument applies a single size to all bubbles and the sizes argument (as used in the other answer) is of no use here. They do not work like the s and size arguments of scatterplot. Therefore, the size of each bubble must be edited after generating the plot:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import seaborn as sns # v 0.11.0
# Create sample data
x = ['IEEE', 'Elsevier', 'Others', 'IEEE', 'Elsevier', 'Others']
y = np.array([7, 6, 3, 7, 1, 3])
z = ['conference', 'conference', 'conference', 'journal', 'journal', 'journal']
df = pd.DataFrame(dict(organisation=x, count=y, category=z))
# Create seaborn stripplot (swarmplot can be used the same way)
ax = sns.stripplot(data=df, x='organisation', y='count', hue='category', dodge=True)
# Adjust the size of the bubbles
for coll in ax.collections[:-2]:
    y = coll.get_offsets()[0][1]
    coll.set_sizes([100*y])
# Format figure size, spines and grid
ax.figure.set_size_inches(7, 5)
ax.grid(axis='y', color='black', alpha=0.2)
ax.grid(axis='x', which='minor', color='black', alpha=0.2)
ax.spines['bottom'].set(position='zero', color='black', alpha=0.2)
sns.despine(left=True)
# Format ticks
ax.tick_params(axis='both', length=0, pad=10, labelsize=12)
ax.tick_params(axis='x', which='minor', length=25, width=0.8, color=[0, 0, 0, 0.2])
minor_xticks = [tick+0.5 for tick in ax.get_xticks() if tick != ax.get_xticks()[-1]]
ax.set_xticks(minor_xticks, minor=True)
ax.set_yticks(range(0, df['count'].max()+2))
# Edit labels and legend
ax.set_xlabel('Organisation', labelpad=15, size=12)
ax.set_ylabel('No. of Papers', labelpad=15, size=12)
ax.legend(bbox_to_anchor=(1.0, 0.5), loc='center left', frameon=False);
Alternatively, you can use scatterplot with the convenient s argument (or size) and then edit the space between the bubbles to reproduce the effect of the missing dodge argument (note that the x_jitter argument seems to have no effect). Here is an example using the same data as before and without all the extra formatting:
# Create seaborn scatterplot with size argument
ax = sns.scatterplot(data=df, x='organisation', y='count',
                     hue='category', s=100*df['count'])
ax.figure.set_size_inches(7, 5)
ax.margins(0.2)
# Dodge bubbles
bubbles = ax.collections[0].get_offsets()
signs = np.repeat([-1, 1], df['organisation'].nunique())
for bubble, sign in zip(bubbles, signs):
    bubble[0] += sign*0.15
As a side note, I recommend that you consider other types of plots for this data. A grouped bar chart:
df.pivot(index='organisation', columns='category').plot.bar()
Or a balloon plot (aka categorical bubble plot):
sns.scatterplot(data=df, x='organisation', y='category', s=100*df['count']).margins(0.4)
Why? In the bubble graph, the counts are displayed using 2 visual attributes, i) the y-coordinate location and ii) the bubble size. Only one of them is really necessary.

Python. Use two y axis for line and bar plots on Seaborn Facetgrid

Updated question and code!
The tips dataset is probably not the best example to use, but it reproduces my issue: both the point and bar plots share the same Y axis.
I need to combine line and bar plots on one chart. To do this I used seaborn and the following code:
tips = sns.load_dataset('tips')
g = sns.FacetGrid(tips, hue='sex', col='sex', size=4, aspect=2.1, sharey=False, sharex=False)
g = g.map(sns.pointplot, 'day', 'tip', ci=0)
g = g.map(sns.barplot, 'day', 'total_bill', ci=0)
g.set_xticklabels(rotation=45, fontsize=9)
plt.show()
Here is the result:
Everything is okay except the fact that one Y axis is used for both bars and lines on each facetgrid object. I am new to seaborn and currently cannot find a solution. Tried to add "sharey=False" to this line of code
> `g.map(sns.pointplot, 'date', 'worthusdcount')`
however it didn't help.
Any solutions on how to add second Y axis would be appreciated
Here's an example where you apply a custom mapping function to the dataframe of interest. Within the function, you can call plt.gca() to get the current axis at the facet being currently plotted in FacetGrid. Once you have the axis, twinx() can be called just like you would in plain old matplotlib plotting.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
def facetgrid_two_axes(*args, **kwargs):
    data = kwargs.pop('data')
    dual_axis = kwargs.pop('dual_axis')
    alpha = kwargs.pop('alpha', 0.2)
    kwargs.pop('color')  # discard the color that FacetGrid passes in
    ax = plt.gca()
    if dual_axis:
        ax2 = ax.twinx()
        ax2.set_ylabel('Second Axis!')
    ax.plot(data['x'], data['y1'], **kwargs, color='red', alpha=alpha)
    if dual_axis:
        # use the facet's own subset (data), not the global df
        ax2.bar(data['x'], data['y2'], **kwargs, color='blue', alpha=alpha)
df = pd.DataFrame()
df['x'] = np.arange(1,5,1)
df['y1'] = 1 / df['x']
df['y2'] = df['x'] * 100
df['facet'] = 'foo'
df2 = df.copy()
df2['facet'] = 'bar'
df3 = pd.concat([df,df2])
# 'size' was renamed to 'height' in seaborn 0.9
win_plot = sns.FacetGrid(df3, col='facet', height=6)
(win_plot.map_dataframe(facetgrid_two_axes, dual_axis=True)
         .set_axis_labels("X", "First Y-axis"))
plt.show()
This isn't the prettiest plot as you might want to adjust the presence of the second y-axis' label, the spacing between plots, etc. but the code suffices to show how to plot two series of differing magnitudes within FacetGrids.
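For reference, the same idea in plain matplotlib, outside of any FacetGrid. This is only a sketch using the x, y1 and y2 values from the example above: `twinx()` creates a second axes that shares the x-axis but has its own y scale.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)
y_small = 1 / x      # values between 0.25 and 1
y_large = x * 100    # values between 100 and 400

fig, ax = plt.subplots()
ax.plot(x, y_small, color="red")
ax.set_ylabel("First Y-axis")

# Second y-axis on the right, sharing the same x-axis
ax2 = ax.twinx()
ax2.bar(x, y_large, color="blue", alpha=0.2)
ax2.set_ylabel("Second Axis!")
```

Each axes keeps its own y limits, so the small line values stay readable next to the much larger bars.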

Seaborn (time series) boxplot using hue and different scale axes

I have a dataframe which has a number of values per date (datetime field). These values are classified as U (users) or S (session) by a column named Group. Seaborn is used to visualize two boxplots per date, where the hue is set to Group.
The problem comes when considering that the values corresponding to U (users) are much bigger than those corresponding to S (session), making the S data illegible. Thus, I need to come up with a solution that allows me to plot both series (U and S) in the same figure in an understandable manner.
I wonder if independent Y axes (with different scales) can be set to each hue, so that both Y axes are shown (as when using twinx but without losing hue visualization capabilities).
Any other alternative would be welcome =)
The boxplot time series for the S group:
The combined boxplot time series using hue. Obviously it's not possible to see any information about the S group because of the scale of the Y axis:
The columns of the dataframe:
| Day (datetime) | n_data (numeric) | Group (S or U)|
The code line generating the combined boxplot:
seaborn.boxplot(ax=ax,x='Day', y='n_data', hue='Group', data=df,
palette='PRGn', showfliers=False)
Managed to find a solution by using twinx:
fig, ax = plt.subplots(figsize=(50, 10))
# groups is the dataframe; keep only U values in one copy and only S values in the other
tmpU = groups.copy()
tmpU.loc[tmpU['Group'] != 'U', 'n_data'] = np.nan
tmpS = groups.copy()
tmpS.loc[tmpS['Group'] != 'S', 'n_data'] = np.nan
ax = seaborn.boxplot(ax=ax, x='Day', y='n_data', hue='Group', data=tmpU, palette='PRGn', showfliers=False)
ax2 = ax.twinx()
seaborn.boxplot(ax=ax2, x='Day', y='n_data', hue='Group', data=tmpS, palette='PRGn', showfliers=False)
handles, labels = ax.get_legend_handles_labels()
l = plt.legend(handles[0:2], labels[0:2], loc=1)
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
# hide every second tick label to reduce clutter
for label in ax.get_xticklabels()[::2]:
    label.set_visible(False)
plt.show()
plt.close('all')
The code above generates the following figure:
In this case the figure turns out to be too dense to be published. Therefore I would adopt a visualization based on subplots, as Parfait suggested in his/her answer.
It wasn't an obvious solution to me so I would like to thank Parfait for his/her answer.
Consider building separate plots on same figure with y-axes ranges tailored to subsetted data. Below demonstrates with random data seeded for reproducibility (for readers of this post).
Data (with U values higher than S values)
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
np.random.seed(2018)
u_df = pd.DataFrame({'Day': pd.date_range('2016-10-01', periods=10)\
                              .append(pd.date_range('2016-10-01', periods=10)),
                     'n_data': np.random.uniform(0,800,20),
                     'Group': 'U'})
s_df = pd.DataFrame({'Day': pd.date_range('2016-10-01', periods=10)\
                              .append(pd.date_range('2016-10-01', periods=10)),
                     'n_data': np.random.uniform(0,200,20),
                     'Group': 'S'})
df = pd.concat([u_df, s_df], ignore_index=True)
df['Day'] = df['Day'].astype('str')
Plot
fig = plt.figure(figsize=(10,5))
for i, g in enumerate(df.groupby('Group')):
    plt.subplot(2, 1, i+1)
    # set the title after creating the subplot so it lands on the right axes
    plt.title('N_data of {}'.format(g[0]))
    seaborn.boxplot(x="Day", y="n_data", data=g[1], palette="PRGn", showfliers=False)
plt.tight_layout()
plt.show()
plt.clf()
plt.close('all')
To retain original hue and grouping, render all non-group n_data to np.nan:
fig = plt.figure(figsize=(10,5))
for i, g in enumerate(df.Group.unique()):
    plt.subplot(2, 1, i+1)
    tmp = df.copy()
    tmp.loc[tmp['Group']!=g, 'n_data'] = np.nan
    seaborn.boxplot(x="Day", y="n_data", hue="Group", data=tmp,
                    palette="PRGn", showfliers=False)
plt.tight_layout()
plt.show()
plt.clf()
plt.close('all')
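A possibly shorter route to the same one-panel-per-group layout is seaborn's figure-level `catplot`, which accepts `row` and `sharey`. The sketch below is an alternative to the subplot loop, not the method used above, and it runs on freshly generated random data (an assumption for illustration, with 5 days instead of 10).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2018)
days = list(pd.date_range("2016-10-01", periods=5).strftime("%Y-%m-%d"))

# 4 observations per day and group; U values are much larger than S values
df = pd.DataFrame({
    "Day": np.tile(np.repeat(days, 4), 2),
    "n_data": np.concatenate([rng.uniform(0, 800, 20), rng.uniform(0, 200, 20)]),
    "Group": ["U"] * 20 + ["S"] * 20,
})

# One row of boxplots per group, each with an independent y-axis
g = sns.catplot(data=df, x="Day", y="n_data", row="Group", kind="box",
                sharey=False, height=2.5, aspect=4)
```

With `sharey=False` each facet picks y limits that fit its own group, so the S boxes are no longer flattened by the U scale.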
One option for a grouped box plot with two separate axes is to pass hue_order=['value', np.nan] to sns.boxplot, so that each call draws only one group and leaves the other slot empty:
import matplotlib.patches as mpatches
# m is the dataframe and customPalette the palette used by the answerer (neither is shown)
fig = plt.figure(figsize=(14,8))
ax = sns.boxplot(x="lon_bucketed", y="value", data=m, hue='name', hue_order=['co2',np.nan],
                 width=0.75,showmeans=True,meanprops={"marker":"s","markerfacecolor":"black", "markeredgecolor":"black"},linewidth=0.5 ,palette = customPalette)
ax2 = ax.twinx()
ax2 = sns.boxplot(ax=ax2,x="lon_bucketed", y="value", data=m, hue='name', hue_order=[np.nan,'g_xco2'],
                  width=0.75,showmeans=True,meanprops={"marker":"s","markerfacecolor":"black", "markeredgecolor":"black"},linewidth=0.5, palette = customPalette)
ax.grid(alpha=0.5, which = 'major')
plt.tight_layout()
ax.legend_.remove()
GW = mpatches.Patch(color='seagreen', label='$CO_2$')
WW = mpatches.Patch(color='mediumaquamarine', label='$XCO_2$')
ax2.legend(handles=[GW,WW], loc='upper right',prop={'size': 14})
ax.set_title("$XCO_2$ vs. $CO_2$",fontsize=18)
ax.set_xlabel('Longitude [\u00b0]',fontsize=14)
ax.set_ylabel('$CO_2$ [ppm]',fontsize=14)
ax2.set_ylabel('$XCO_2$ [ppm]',fontsize=14)
ax.tick_params(labelsize=14)

How to avoid displaying x-axis labels without any associated data in seaborn factorplot

To make the plot above, I do this:
import seaborn as sns
g = sns.factorplot('State', ' % Overall Match', hue='experiment', data=df_states, col='percentile', kind='box', col_wrap=2, sharex=False)
Is there a way to NOT display the x-axis labels that do not have any associated data?
-- EDIT: Data is available here: https://www.dropbox.com/s/pn7j95sjjb9n8t0/old_all.csv?dl=0
I think I would do this manually:
fig, ax_array = plt.subplots(2, 2)
axes = ax_array.flatten()
# one boxplot per percentile, each drawn from its own subset of the data
for ax, perc in zip(axes, df_states['percentile'].unique()):
    sns.boxplot(x='State', y=' % Overall Match', hue='experiment',
                data=df_states[df_states['percentile'] == perc], ax=ax)
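One caveat with subsetting, sketched below on made-up data with a hypothetical `match` column: if `State` is stored as a pandas categorical (an assumption, since the question does not show the dtype), a filtered subset still remembers every category, and those empty categories can still show up on the axis. `cat.remove_unused_categories()` drops them.

```python
import pandas as pd

# Invented stand-in for df_states with a categorical State column
df_states = pd.DataFrame({
    "State": pd.Categorical(["CA", "NY", "TX", "CA", "TX"]),
    "percentile": [25, 25, 50, 50, 50],
    "match": [0.4, 0.6, 0.5, 0.7, 0.3],
})

subset = df_states[df_states["percentile"] == 25].copy()
before = list(subset["State"].cat.categories)  # still includes TX, which has no rows here

# Keep only the states that actually occur in this subset
subset["State"] = subset["State"].cat.remove_unused_categories()
after = list(subset["State"].cat.categories)

print(before, after)
```

If `State` holds plain strings instead, the subset already contains only the states that are present and this step is unnecessary.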
