Can mark_rule be extended outside the chart with Altair? - python

Is there a way to make a rule mark longer without disrupting the axes of a chart? If I have this:
import random
import altair as alt
import pandas as pd
random.seed(0)
df = pd.DataFrame({'x':[i for i in range(1,21)],'y':random.sample(range(1,50), 20)})
chart = alt.Chart(df).mark_area().encode(x='x',y='y')
ruler = alt.Chart(pd.DataFrame({'x':[5]})).mark_rule().encode(x='x')
chart+ruler
But I want this: [image: the rule extending outside the chart]

You can set an explicit y-domain and then set clip=False inside mark_rule, but you also need to define the y-range of the rule since the default is to stretch over the entire plot:
import altair as alt
import pandas as pd
import random
random.seed(0)
df = pd.DataFrame({'x':[i for i in range(1,21)],'y':random.sample(range(1,50), 20)})
chart = alt.Chart(df).mark_area().encode(x='x', y=alt.Y('y', scale=alt.Scale(domain=(0, 50))))
ruler = alt.Chart(pd.DataFrame({'x':[5], 'y': [-10], 'y2': [50]})).mark_rule(clip=False, fill='black').encode(x='x', y='y', y2='y2')
chart+ruler

Have you tried to overlay an empty plot on top with wider margins? So the overlay plot just includes the line, but since it has larger margins on the bottom it will extend past the original plot.

Related

How to create a plot with stacked and labeled line segments

I want to create a sort of stacked bar chart (I don't know the proper name). I hand-drew the graph for the years 2016 and 2017 and attached it here.
The code to create the df is below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = [[2016.0, 0.4862, 0.4115, 0.3905, 0.3483, 0.1196],
[2017.0, 0.4471, 0.4096, 0.3725, 0.2866, 0.1387],
[2018.0, 0.4748, 0.4016, 0.3381, 0.2905, 0.2012],
[2019.0, 0.4705, 0.4247, 0.3857, 0.3333, 0.2457],
[2020.0, 0.4755, 0.4196, 0.3971, 0.3825, 0.2965]]
cols = ['attribute_time', '100-81 percentile', '80-61 percentile', '60-41 percentile', '40-21 percentile', '20-0 percentile']
df = pd.DataFrame(data, columns=cols)
#set seaborn plotting aesthetics
sns.set(style='white')
#create stacked bar chart
df.set_index('attribute_time').plot(kind='bar', stacked=True)
The data doesn't need to stack on top of each other. The code above creates a stacked bar chart, but that's not exactly what needs to be displayed. Each percentile should be drawn as a labeled horizontal line at its value, positioned on the x axis at that year. Does anyone have recommendations on how to achieve this? Is it some sort of modified stacked bar chart?
My approach to this is to represent the data as a categorical scatter plot (stripplot in Seaborn) using horizontal lines rather than points as markers. You'll have to make some choices about exactly how and where you want to plot things, but this should get you started!
I first modified the data a little bit:
df['attribute_time'] = df['attribute_time'].astype('int') # Just to get rid of the decimals.
df = df.melt(id_vars = ['attribute_time'],
value_name = 'pct_value',
var_name = 'pct_range')
Melting the DataFrame takes the wide data and makes it long instead, so the columns are now year, pct_value, and pct_range and there is a row for each data point.
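To make the wide-to-long step concrete, here is a minimal sketch on a two-year, two-column subset of the data above. The shortened column names p100_81 and p80_61 are made up for brevity and are not in the original DataFrame.

```python
import pandas as pd

# Two years and two percentile columns taken from the question's data,
# with shortened (hypothetical) column names.
wide = pd.DataFrame({
    'attribute_time': [2016, 2017],
    'p100_81': [0.4862, 0.4471],
    'p80_61': [0.4115, 0.4096],
})

# Wide to long: one row per (year, percentile range) pair.
long = wide.melt(id_vars=['attribute_time'],
                 value_name='pct_value',
                 var_name='pct_range')
print(long)
```

Each of the two value columns contributes one row per year, so the long frame has four rows.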
Next is the plotting:
fig, ax = plt.subplots()
sns.stripplot(data = df,
x = 'attribute_time',
y = 'pct_value',
hue = 'pct_range',
jitter = False,
marker = '_',
s = 40,
linewidth = 3,
ax = ax)
Instead of labeling each point with the range that it belongs to, I thought it would be a lot cleaner to separate the ranges by color.
Jitter spreads out the points within a category so that overlapping points stay visible. We don't need to worry about that here, so I turned the jitter off. The marker style '_' is matplotlib's hline marker.
The s parameter is the horizontal width of each line, and the linewidth is the thickness, so you can play around with those a bit to see what works best for you.
Text is added to the figure using the ax.text method as follows:
for year, value in zip(df['attribute_time'], df['pct_value']):
    ax.text(year - 2016,
            value,
            str(value),
            ha = 'center',
            va = 'bottom',
            fontsize = 'small')
The figure coordinates are indexed starting from 0 despite the horizontal markers displaying the years, so the x position of the text is shifted left by the minimum year (2016). The y position is equal to the value, and the text itself is a string representation of the value. The text is centered above the line and sits slightly above it due to the vertical anchor being on the bottom.
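Subtracting 2016 works here because the years are consecutive. If they were not, a lookup from year to axis slot might be safer. The sketch below assumes the categorical axis places the sorted distinct years at positions 0, 1, 2, and so on; the positions dictionary is my own helper, not part of Seaborn.

```python
# Stand-in for df['attribute_time'] after melting (years repeat per range).
years = [2016, 2017, 2018, 2019, 2020, 2016, 2017]

# Map each distinct year to its 0-based slot on the categorical axis.
positions = {year: i for i, year in enumerate(sorted(set(years)))}

# x coordinate to pass to ax.text for each row.
x_coords = [positions[year] for year in years]
print(x_coords)  # [0, 1, 2, 3, 4, 0, 1]
```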
There's obviously a lot you can tweak to make it look how you want with sizing and labeling and stuff, but hopefully this is at least a good start!

Size legend for plotly express scatterplot in Python

Here is a Plotly Express scatterplot with marker color, size and symbol representing different fields in the data frame. There is a legend for symbol and a colorbar for color, but there is nothing to indicate what marker size represents.
Is it possible to display a "size" legend? In the legend I'm hoping to show some example marker sizes and their respective values.
A similar question was asked for R and I'm hoping for a similar results in Python. I've tried adding markers using fig.add_trace(), and this would work, except I don't know how to make the sizes equal.
import pandas as pd
import plotly.express as px
import random
# create data frame
df = pd.DataFrame({
'X':list(range(1,11,1)),
'Y':list(range(1,11,1)),
'Symbol':['Yes']*5+['No']*5,
'Color':list(range(1,11,1)),
'Size':random.sample(range(10,150), 10)
})
# create scatterplot
fig = px.scatter(df, y='Y', x='X',color='Color',symbol='Symbol',size='Size')
# move legend
fig.update_layout(legend=dict(y=1, x=0.1))
fig.show()
Thank you
You cannot achieve this if the column holds numeric data, as in your range. Plotly always treats numeric data as continuous, even if it looks discrete in the output. Since you are showing groups, the data has to be categorical, like a factor in R. One possible solution is to convert everything to str with a list comprehension. I did it in two steps so you can follow:
import pandas as pd
import plotly.express as px
import random
check = sorted(random.sample(range(10,150), 10))
check = [str(num) for num in check]
# create data frame
df = pd.DataFrame({
'X':list(range(1,11,1)),
'Y':list(range(1,11,1)),
'Symbol':['Yes']*5+['No']*5,
'Color':check,
'Size':list(range(1,11,1))
})
# create scatterplot
fig = px.scatter(df, y='Y', x='X',color='Color',symbol='Symbol',size='Size')
# move legend
fig.update_layout(legend=dict(y=1, x=0.1))
fig.show()
That gives:
Keep in mind that you also get the symbol label, as you now have TWO groups!
You may want to sort the values in the list before converting them to strings (I added this to the code above).
UPDATE
Yes, but as far as I know only in matplotlib, and it is a little hacky, since you simulate scatter plots. I can only show you a modified matplotlib example, but maybe it helps you work it out yourself:
import matplotlib.pyplot as plt
from numpy.random import randn
z = randn(10)
# two line plots drawn with markers only, one per legend marker size
red_dot, = plt.plot(z, "ro", markersize=5)
red_dot_other, = plt.plot(z*2, "ro", markersize=20)
plt.legend([red_dot, red_dot_other], ["Yes", "No"], markerscale=0.5)
That gives:
As you can see, you are working with two different plots, one for each size shown in the legend. The legend merges these plots. The size of the legend markers is controlled by markerscale, which scales the markersize of each plot. Because the two plots have different markersize values, the legend shows two different marker sizes. markerscale is typically a value between 0 and 1, but larger values work too, e.g. 1.5 for 150%.
You can also achieve this by customizing the legend handler in matplotlib; see:
https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html
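As a less hacky variant of the same matplotlib idea, recent matplotlib versions let the scatter collection build size handles itself through legend_elements(prop="sizes"). This is only a sketch with made-up data, and it does not apply to Plotly figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to open a window
import matplotlib.pyplot as plt

x = list(range(1, 11))
y = list(range(1, 11))
sizes = [20, 35, 50, 65, 80, 95, 110, 125, 140, 150]  # made-up marker sizes

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=sizes, color="red")

# Ask the scatter collection for a few representative marker sizes.
handles, labels = points.legend_elements(prop="sizes", num=4)
ax.legend(handles, labels, title="Size")
```

The num argument is only a hint for how many entries to produce, so the legend may contain slightly more or fewer than four.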

Explicitly set colours of the boxplot in plotly

I am using plotly express to plot boxplot as shown below:
px.box(data_frame=df,
y="price",
x="products",
points="all")
However, the boxplots of the products all show up in the same colour. There are four products, and I would like to give each one a different colour. Using the additional parameter color_discrete_sequence does not work.
I am using plotly.express.data.tips() as an example dataset and creating a new column called mcolour to show how an additional column can be used for colouring. See below:
## packages
import plotly.express as px
import numpy as np
import pandas as pd
## example dataset:
df = px.data.tips()
## creating a new column with colors
df['mcolour'] = np.where(
df['day'] == "Sun" ,
'#636EFA',
np.where(
df['day'] == 'Sat', '#EF553B', '#00CC96'
)
)
## plot
fig = px.box(df, x="day", y="total_bill", color="mcolour")
fig = fig.update_layout(showlegend=False)
fig.show()
So, as you see, you can simply assign colors based on another column using color argument in plotly.express.box().
As part of a working solution, you also need to set the following before showing the figure, so that the newly coloured box plots are aligned correctly:
fig.update_layout(boxmode = "overlay")
The boxmode setting "overlay" brings the plot back to the normal layout, which seems to be overridden (set to "group") once the color is set.
In the plotly help it says about boxmode:
"Determines how boxes at the same location coordinate are displayed on
the graph. If 'group', the boxes are plotted next to one another
centered around the shared location. If 'overlay', the boxes are
plotted over one another [...]"
Hope this helps! R

Matplotlib bar chart - overlay bars similar to stacked

I want to create a matplotlib bar plot that has the look of a stacked plot without being additive from a multi-index pandas dataframe.
The below code gives the basic behaviour
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import io
data = io.StringIO('''Fruit,Color,Price
Apple,Red,1.5
Apple,Green,1.0
Pear,Red,2.5
Pear,Green,2.3
Lime,Green,0.5
Lime,Red,3.0
''')
df_unindexed = pd.read_csv(data)
df_unindexed
df = df_unindexed.set_index(['Fruit', 'Color'])
df.unstack().plot(kind='bar')
The plot command df.unstack().plot(kind='bar') shows all the apple prices grouped next to each other. If you choose the option df.unstack().plot(kind='bar',stacked=True) - it adds the prices for Red and Green together and stacks them.
I am wanting a plot that is halfway between the two - it shows each group as a single bar, but overlays the values so you can see them all. The below figure (done in powerpoint) shows what behaviour I am looking for -> I want the image on the right.
Short of calculating all the values and then using the stacked option, is this possible?
This seems (to me) like a bad idea, since this representation leads to several problems. Will a reader understand that those are not stacked bars? What happens when the front bar is taller than the ones behind?
In any case, to accomplish what you want, I would simply repeatedly call plot() on each subset of the data and using the same axes so that the bars are drawn on top of each other.
In your example, the "Red" prices are always higher, so I had to adjust the order to plot them in the back, or they would hide the "Green" bars.
fig, ax = plt.subplots()
my_groups = ['Red', 'Green']
df_group = df_unindexed.groupby("Color")
for color in my_groups:
    temp_df = df_group.get_group(color)
    temp_df.plot(kind='bar', ax=ax, x='Fruit', y='Price', color=color, label=color)
There are two problems with this kind of plot. (1) What if the background bar is smaller than the foreground bar? It would simply be hidden and not visible. (2) A chart like this is not distinguishable from a stacked bar chart. Readers will have severe problems interpreting it.
That being said, you can plot both columns individually.
import matplotlib.pyplot as plt
import pandas as pd
import io
data = io.StringIO('''Fruit,Color,Price
Apple,Red,1.5
Apple,Green,1.0
Pear,Red,2.5
Pear,Green,2.3
Lime,Green,0.5
Lime,Red,3.0''')
df_unindexed = pd.read_csv(data)
df = df_unindexed.set_index(['Fruit', 'Color']).unstack()
df.columns = df.columns.droplevel()
plt.bar(df.index, df["Red"].values, label="Red")
plt.bar(df.index, df["Green"].values, label="Green")
plt.legend()
plt.show()
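Both answers rely on the "Red" bars being taller in every group. One way around the hidden-bar problem, sketched here under the assumption that drawing order alone decides which bar ends up on top, is to draw all bars from tallest to shortest. The names x_pos and labelled are my own helpers.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to open a window
import matplotlib.pyplot as plt
import pandas as pd

data = io.StringIO('''Fruit,Color,Price
Apple,Red,1.5
Apple,Green,1.0
Pear,Red,2.5
Pear,Green,2.3
Lime,Green,0.5
Lime,Red,3.0''')
df_unindexed = pd.read_csv(data)

fruits = list(dict.fromkeys(df_unindexed['Fruit']))  # keep first-seen order
x_pos = {fruit: i for i, fruit in enumerate(fruits)}

fig, ax = plt.subplots()
labelled = set()
# Draw from tallest to shortest so a shorter bar is always painted last.
for row in df_unindexed.sort_values('Price', ascending=False).itertuples():
    # Only the first bar of each colour gets a legend entry.
    label = row.Color if row.Color not in labelled else '_nolegend_'
    labelled.add(row.Color)
    ax.bar(x_pos[row.Fruit], row.Price, color=row.Color.lower(), label=label)

ax.set_xticks(range(len(fruits)))
ax.set_xticklabels(fruits)
ax.legend()
```

This keeps every bar visible regardless of which colour is taller in a given group, though it does not address the concern that readers may mistake the chart for a stacked one.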

Black bar covering my x labels for matplotlib plot?

I am trying to plot a figure, and a black box pops up at the bottom of the plot where the x labels should be. I tried this command from a similar past question on here:
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
But the problem was still the same. Here is my current code:
import pylab
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
df['date'] = df['date'].astype('str')
pos = np.arange(len(df['date']))
plt.bar(pos,df['value'])
ticks = plt.xticks(pos, df['value'])
And my plot is attached here. Any help would be great!
pos = np.arange(len(df['date'])) and ticks = plt.xticks(pos, df['value']) are causing the problem you are having. You are putting an xtick at every value you have in the data frame.
I don't know what your data looks like or what the most sensible way to do this is, but ticks = plt.xticks(pos[::20], df['value'].values[::20], rotation=90) will put a tick on every 20th row, which would make the plot more readable.
It is not actually a black bar, but rather all of your x-axis labels being crammed into too small a space. You can try rotating the axis labels to create more space, or just remove them altogether.
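A self-contained sketch of the thinning-plus-rotation idea follows. The data is a made-up stand-in for the question's DataFrame, and it labels the ticks with the dates, which I assume was the intent.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for the question's DataFrame: 200 labelled values.
dates = [f"day {i}" for i in range(200)]
values = np.arange(200) % 17

pos = np.arange(len(dates))
fig, ax = plt.subplots()
ax.bar(pos, values)

step = 20  # label only every 20th bar
ax.set_xticks(pos[::step])
ax.set_xticklabels(dates[::step], rotation=90)
```

With 200 bars and a step of 20, only ten labels are drawn, which leaves enough room for each one to be read.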
