Altair: Customizing outliers in boxplots - python

Is there any way to customize the outlier points in an Altair boxplot? Suppose I had the following plot:
import altair as alt
import pandas as pd
penguins_data = "https://raw.githubusercontent.com/datavizpyr/data/master/palmer_penguin_species.tsv"
penguins_df = pd.read_csv(penguins_data, sep="\t")
chart = alt.Chart(penguins_df).mark_boxplot(size=50, extent=0.5).encode(
    x='species:O',
    y=alt.Y('culmen_length_mm:Q', scale=alt.Scale(zero=False)),
    color=alt.Color('species')
).properties(width=300)
I would like to jitter the outliers and also make the points smaller. Is that possible, or would we have to create two layered plots? Ideally the jittered points are all found within the width of the boxplot itself, but that isn't necessary.

I don't think you can jitter them, but you can make them smaller:
alt.Chart(penguins_df).mark_boxplot(size=50, extent=0.5, outliers={'size': 5}).encode(
    x='species:O',
    y=alt.Y('culmen_length_mm:Q', scale=alt.Scale(zero=False)),
    color=alt.Color('species')
).properties(width=300)
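If jitter is a must, layering looks like the way to go, since the boxplot mark itself does not appear to offer a jitter option. Here is a rough sketch, not checked against every Altair version. It flags the outliers in pandas, hides the built-in outlier marks with outliers=False, and draws the flagged points as a second layer with a random xOffset. The helper names flag_outliers and jittered_boxplot and the is_outlier column are my own inventions, not part of Altair. It assumes Altair 5 or later for the xOffset channel, and pandas quantiles may differ slightly from the ones Vega-Lite uses for the whiskers.

```python
import pandas as pd


def flag_outliers(df, value, group, extent=0.5):
    """Add a boolean is_outlier column using the boxplot whisker rule.

    A point counts as an outlier when it lies more than extent * IQR
    below Q1 or above Q3 of its own group.
    """
    grouped = df.groupby(group)[value]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    is_outlier = (df[value] < q1 - extent * iqr) | (df[value] > q3 + extent * iqr)
    return df.assign(is_outlier=is_outlier)


def jittered_boxplot(df, value, group, extent=0.5):
    """Layer small jittered outlier points over a boxplot (assumes Altair >= 5)."""
    import altair as alt  # imported here so flag_outliers works without Altair

    flagged = flag_outliers(df, value, group, extent)
    x = alt.X(f"{group}:O")
    y = alt.Y(f"{value}:Q", scale=alt.Scale(zero=False))
    color = alt.Color(f"{group}:N")

    # Boxes and whiskers only; the built-in outlier marks are switched off.
    box = alt.Chart(flagged).mark_boxplot(size=50, extent=extent, outliers=False).encode(
        x=x, y=y, color=color
    )
    # Outliers as small points, spread horizontally by a uniform random offset.
    points = (
        alt.Chart(flagged)
        .transform_filter("datum.is_outlier")
        .transform_calculate(jitter="random()")
        .mark_point(size=5)
        .encode(x=x, y=y, color=color, xOffset="jitter:Q")
    )
    return (box + points).properties(width=300)
```

With the data from the question this would be called as jittered_boxplot(penguins_df, 'culmen_length_mm', 'species'). The jitter spreads the points across the band for each species, which may be wider than the 50-pixel box.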

Related

Highlighting Outliers in scatter plot

I am using the "tips" dataset.
I am plotting a scatter plot with the code below:
sns.scatterplot(data=df['total_bill'])
I want to show the outliers, in this case the points above 40 on the y-axis, in a different color or a larger size. Alternatively, is it possible to draw a horizontal line at 40?
The code below achieves the desired result:
sns.scatterplot(data=df, y='total_bill', x=range(0,244), hue='is_outlier')
With seaborn.scatterplot you can use the "hue" parameter to plot groups in different colors. For your example, the following should work:
is_outlier = (df['total_bill'] >= 40)
sns.scatterplot(data=df['total_bill'], hue=is_outlier)
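For the horizontal line part of the question, Matplotlib's axhline can be called on the Axes that seaborn returns. A small sketch with randomly generated bills standing in for the tips dataset (the data, the row column, and the threshold variable are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the "tips" dataset (244 rows, like the original).
rng = np.random.default_rng(0)
df = pd.DataFrame({"total_bill": rng.gamma(shape=4.0, scale=5.0, size=244)})
df["row"] = np.arange(len(df))

threshold = 40
df["is_outlier"] = df["total_bill"] >= threshold

# Color the points by the boolean flag, then mark the cutoff with a dashed line.
ax = sns.scatterplot(data=df, x="row", y="total_bill", hue="is_outlier")
ax.axhline(threshold, color="red", linestyle="--")
```

Call plt.show() afterwards to display the figure in an interactive session.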

How to change scale of a plot with Matplotlib and add increments?

I am making a simple plot in Python with Matplotlib that shows populations of different regions over time. I have a CSV file that has columns of each region's population over the years, so the years are on the x-axis and population is on the y-axis. The plot looks okay except for the y-axis. As you can see in the image, every single population value is included on the y-axis, which is too many values and is unnecessary. I would like the y-axis to have some increments (such as 100 million). Is there a simple way to do that or would I have to manually add my own increments?
I tried linear and logarithmic scaling, but I would still prefer to have increments on the y-axis.
This is what the plot looks like right now.
(I took out unnecessary code such as legend and formatting):
import matplotlib.pyplot as plt
import pandas as pd
data2 = pd.read_csv('data02_world.csv')
for region in data2:
    if region != 'Year':
        plt.plot(data2.Year, data2[region], marker='.', label=region)
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
I think you can simply do this with pandas:
data2 = pd.read_csv('data02_world.csv')
data2.set_index('Year', inplace=True)
data2.plot()
If you would rather stay with Matplotlib, plt.yticks is what you need.
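To make the tick suggestion concrete, here is a sketch with made-up numbers in place of data02_world.csv. Instead of listing ticks by hand, it uses MultipleLocator to place one every 100 million. One guess, since the CSV is not shown: when every value appears as its own tick, the column has often been read as strings, so the sketch converts each column with pd.to_numeric before plotting.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter, MultipleLocator

# Invented figures standing in for data02_world.csv.
# "Region A" is stored as strings on purpose to mimic a badly parsed column.
data2 = pd.DataFrame({
    "Year": [1990, 2000, 2010, 2020],
    "Region A": ["520000000", "610000000", "700000000", "790000000"],
    "Region B": [310000000, 330000000, 345000000, 350000000],
})

fig, ax = plt.subplots()
for region in data2.columns.drop("Year"):
    population = pd.to_numeric(data2[region])  # make sure values are numeric
    ax.plot(data2["Year"], population, marker=".", label=region)

# One major tick every 100 million, labelled in millions.
ax.yaxis.set_major_locator(MultipleLocator(100_000_000))
ax.yaxis.set_major_formatter(FuncFormatter(lambda v, pos: f"{v / 1e6:.0f}M"))
ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.legend()
```

The equivalent with plt.yticks would be to pass an explicit list of tick positions, which works but has to be updated whenever the data range changes.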

Center nested boxplots in Python/Seaborn with unequal classes

I have grouped data from which I want to generate boxplots using seaborn. However, not every group has all classes. As a result, the boxplots are not centered if classes are missing within one group:
Figure
The graph is generated using the following code:
sns.boxplot(x="label2", y="value", hue="variable",palette="Blues")
Is there any way to force seaborn to center these boxes? I didn't find any appropriate way.
Thank you in advance.
Yes, there is, but you are not going to like it.
Centering these will mean that you will have the same y value for median values, so normalize your data so that the median is 0.5 for each y value for each value of x. That will give you the plot you want, but you should note that somewhere in the plot so people will not be confused.
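If normalizing the data is not an option, another route (a different technique from the one described above) is to skip seaborn's hue dodging and place the boxes by hand with Matplotlib, so that only the classes present in each group are centered on the group's tick. A rough sketch with invented data; centered_offsets is a helper name made up for this example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def centered_offsets(n_boxes, box_width=0.25):
    """Offsets that center n_boxes adjacent boxes around 0."""
    return [(i - (n_boxes - 1) / 2) * box_width for i in range(n_boxes)]


# Invented data: group "A" has classes x and y, group "B" only has x.
df = pd.DataFrame({
    "label2":   ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "variable": ["x", "x", "x", "y", "y", "y", "x", "x", "x"],
    "value":    [1.0, 2.0, 3.0, 2.0, 3.0, 5.0, 4.0, 5.0, 7.0],
})

fig, ax = plt.subplots()
groups = sorted(df["label2"].unique())
positions = {}
for g_idx, group in enumerate(groups):
    sub = df[df["label2"] == group]
    present = sorted(sub["variable"].unique())  # classes that exist in this group
    for offset, cls in zip(centered_offsets(len(present)), present):
        pos = g_idx + offset
        positions[(group, cls)] = pos
        values = sub.loc[sub["variable"] == cls, "value"].to_numpy()
        ax.boxplot(values, positions=[pos], widths=0.2, manage_ticks=False)

# One tick per group, placed at the center of its boxes.
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups)
```

The sketch leaves out per-class colors and a legend; those would have to be added by hand, for example with patch_artist=True and a fixed color per class.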

MatPlotLib - Showing legend

I'm making a scatter plot from a Pandas DataFrame with 3 columns. The first two would be the x and y axes, and the third would be classification data that I want to visualize by points having different colors. My question is, how can I add the legend to this plot:
df = df.groupby(['Month', 'Price'])['Quantity'].sum().reset_index()
df.plot(kind='scatter', x='Month', y='Quantity', c=df.Price, s=100, legend=True)
As you can see, I'd like to automatically color the dots based on their price, so adding labels manually is a bit of an inconvenience. Is there a way I could add something to this code, that would also show a legend to the Price values?
Also, this colors the scatter plot dots on a range from black to white. Can I add custom colors without giving up the easy usage of c=df.Price?
Thank you!
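One possible approach, sketched with invented data: call Matplotlib's scatter directly and build the legend from the returned collection with legend_elements() (available in Matplotlib 3.1 and later). The cmap argument replaces the black-to-white range with a colormap while still passing the Price column as c. The sample values and the choice of "viridis" are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data with the three columns from the question.
df = pd.DataFrame({
    "Month": [1, 1, 2, 2, 3, 3],
    "Price": [10, 20, 10, 30, 20, 30],
    "Quantity": [5, 3, 8, 2, 6, 4],
})
df = df.groupby(["Month", "Price"])["Quantity"].sum().reset_index()

fig, ax = plt.subplots()
# c maps Price onto the colormap; the returned collection remembers the mapping.
points = ax.scatter(df["Month"], df["Quantity"], c=df["Price"], s=100, cmap="viridis")

# Build one legend entry per color value from the collection itself.
handles, labels = points.legend_elements()
ax.legend(handles, labels, title="Price")
ax.set_xlabel("Month")
ax.set_ylabel("Quantity")
```

With many distinct prices a colorbar (fig.colorbar(points)) may read better than a legend.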

Sorted bar charts with pandas/matplotlib or seaborn

I have a dataset of 5000 products with 50 features. One of the columns is 'colors' and there are more than 100 colors in the column. I'm trying to plot a bar chart to show only the top 10 colors and how many products there are in each color.
top_colors = df.colors.value_counts()
top_colors[:10].plot(kind='barh')
plt.xlabel('No. of Products');
Using Seaborn:
sns.factorplot("colors", data=df , palette="PuBu_d");
1) Is there a better way to do this?
2) How can I replicate this with Seaborn?
3) How do I plot such that the highest count is at the top (i.e. black at the very top of the bar chart)?
An easy trick might be to invert the y axis of your plot, rather than futzing with the data:
import string
import numpy as np
import pandas as pd
s = pd.Series(np.random.choice(list(string.ascii_uppercase), 1000))
counts = s.value_counts()
ax = counts.iloc[:10].plot(kind="barh")
ax.invert_yaxis()  # the largest count ends up at the top
At the time of writing, Seaborn's barplot did not support horizontally oriented bars, but if you want to control the order the bars appear in you can pass a list of values to the order parameter (called x_order in older releases). But I think it's easier to use the pandas plotting methods here, anyway.
If you want to use pandas then you can first sort:
top_colors[:10].sort_values(ascending=True).plot(kind='barh')
Seaborn already styles your pandas plots, but you can also use:
sns.barplot(x=top_colors.index, y=top_colors.values)
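For a horizontal Seaborn version with the highest count at the top, one option in recent Seaborn releases is countplot with an explicit order. A sketch with a tiny invented dataset (the colors and counts are made up, and head(3) stands in for head(10)):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented stand-in for the products table.
df = pd.DataFrame({"colors": ["black"] * 5 + ["white"] * 3 + ["red"] * 2 + ["blue"]})

# value_counts sorts by count, largest first; keep only the top entries.
top_colors = df["colors"].value_counts().head(3)  # use head(10) for the top ten

# y= makes the bars horizontal; order= restricts and sorts the categories,
# and the first category in the order is drawn at the top.
ax = sns.countplot(data=df, y="colors", order=top_colors.index)
ax.set_xlabel("No. of Products")
```

Because countplot does the counting itself, the original DataFrame is passed in and top_colors is only used to pick and order the categories.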
