Plotly returning evenly spaced piecharts - python

I am trying to plot a pie chart using plotly, but it always returns evenly sized slices regardless of the values provided
import plotly.express as px
df_africa['CompFreq'].value_counts().tolist() # check for the particular order the labels should be in
sizes = df_africa['CompFreq'].value_counts().tolist()
labels = ['Monthly', 'Yearly', 'Weekly']
# Plot
fig = px.pie(sizes, names=labels, color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()
The sizes variable contains the list below
[923, 168, 40]

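For plotly.express, the first argument needs to be a DataFrame, dictionary, or array-like. This is explained in the documentation here. In your call, sizes is taken as the data and no values argument is given, so each label is simply counted once, which produces equal sectors.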
However, for your use case, I think it would be simplest to just directly pass your list of sector sizes to the values parameter:
fig = px.pie(names=labels, values=sizes, color_discrete_sequence=px.colors.sequential.RdBu)

Related

Is there a way to set the plotting global properties specifically for just one figure in matplotlib?

I do realize this has already been addressed here (e.g., how to set local rcParams or rcParams for one figure in matplotlib). Nevertheless, I hope this question is different.
I have a plotting function in python with matplotlib that sets global properties, so all subsequent plots are updated with those global properties.
def catscatter(data,colx,coly,cols,color=['grey','black'],ratio=10,font='Helvetica',save=False,save_name='Default'):
    '''
    This function creates a scatter plot for categorical variables. It's useful to compare two lists with elements in common.
    '''
    df = data.copy()
    # Create a dict to encode the categories into numbers (sorted)
    colx_codes=dict(zip(df[colx].sort_values().unique(),range(len(df[colx].unique()))))
    coly_codes=dict(zip(df[coly].sort_values(ascending=False).unique(),range(len(df[coly].unique()))))
    # Apply the encoding
    df[colx]=df[colx].apply(lambda x: colx_codes[x])
    df[coly]=df[coly].apply(lambda x: coly_codes[x])
    # Prepare the aspect of the plot
    plt.rcParams['xtick.bottom'] = plt.rcParams['xtick.labelbottom'] = False
    plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
    plt.rcParams['font.sans-serif']=font
    plt.rcParams['xtick.color']=color[-1]
    plt.rcParams['ytick.color']=color[-1]
    plt.box(False)
    # Plot all the lines for the background
    for num in range(len(coly_codes)):
        plt.hlines(num,-1,len(colx_codes),linestyle='dashed',linewidth=2,color=color[num%2],alpha=0.5)
    for num in range(len(colx_codes)):
        plt.vlines(num,-1,len(coly_codes),linestyle='dashed',linewidth=2,color=color[num%2],alpha=0.5)
    # Plot the scatter plot with the numbers
    plt.scatter(df[colx],
                df[coly],
                s=df[cols]*ratio,
                zorder=2,
                color=color[-1])
    # Change the ticks numbers to categories and limit them
    plt.xticks(ticks=list(colx_codes.values()),labels=colx_codes.keys(),rotation=90)
    plt.yticks(ticks=list(coly_codes.values()),labels=coly_codes.keys())
    # Save if wanted
    if save:
        plt.savefig(save_name+'.png')
Below are the properties that I'm using inside the function,
plt.rcParams['xtick.bottom'] = plt.rcParams['xtick.labelbottom'] = False
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
I want these properties to apply only when I call the catscatter function.
Is there a way to set the plotting global properties specifically for just one figure, without impacting other plots in jupyter notebook?
Or is there at least a good way to change the properties for one plotting function and then change them back to the values that were used before (not necessarily the rcdefaults)?
To only change the properties of one figure, you can just use the relevant method on the Figure or Axes instance rather than using the rcParams.
In this case, it looks like you want to set the x-axis label and ticks to appear on the top of the plot rather than the bottom. You can use the following to achieve exactly that.
ax.xaxis.set_label_position('top')
ax.xaxis.set_ticks_position('top')
Consider the following minimal example:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_xlabel('label')
ax.xaxis.set_label_position('top')
ax.xaxis.set_ticks_position('top')
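If you would rather keep the rcParams approach from the question, matplotlib also provides a context manager, plt.rc_context, that applies settings temporarily and restores the previous values on exit. The following is only a minimal sketch with made-up data; the Agg backend is an assumption so that it runs without a display, and that line can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in Jupyter
import matplotlib.pyplot as plt

# The tick settings from the question, applied only temporarily
overrides = {
    "xtick.bottom": False,
    "xtick.labelbottom": False,
    "xtick.top": True,
    "xtick.labeltop": True,
}

before = {key: plt.rcParams[key] for key in overrides}

with plt.rc_context(overrides):
    # Axes created inside the block are set up with the temporary settings
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])
    inside = {key: plt.rcParams[key] for key in overrides}

# On exit, the previous values are restored (not necessarily the defaults)
after = {key: plt.rcParams[key] for key in overrides}
```

Wrapping the body of catscatter in such a with block would leave later plots untouched.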

Size legend for plotly express scatterplot in Python

Here is a Plotly Express scatterplot with marker color, size and symbol representing different fields in the data frame. There is a legend for symbol and a colorbar for color, but there is nothing to indicate what marker size represents.
Is it possible to display a "size" legend? In the legend I'm hoping to show some example marker sizes and their respective values.
A similar question was asked for R and I'm hoping for a similar result in Python. I've tried adding markers using fig.add_trace(), and this would work, except I don't know how to make the sizes equal.
import pandas as pd
import plotly.express as px
import random
# create data frame
df = pd.DataFrame({
'X':list(range(1,11,1)),
'Y':list(range(1,11,1)),
'Symbol':['Yes']*5+['No']*5,
'Color':list(range(1,11,1)),
'Size':random.sample(range(10,150), 10)
})
# create scatterplot
fig = px.scatter(df, y='Y', x='X',color='Color',symbol='Symbol',size='Size')
# move legend
fig.update_layout(legend=dict(y=1, x=0.1))
fig.show()
Scatterplot Image:
Thank you
You cannot achieve this if the column holds numeric (continuous) data, as in your range. Plotly always interprets numeric values as continuous, even if they look discrete in the output. The data therefore has to be categorical, like a factor in R, since you are showing groups. One possible solution is to use a list comprehension to convert every value to a str. I did it in two steps so you can follow:
import pandas as pd
import plotly.express as px
import random
check = sorted(random.sample(range(10,150), 10))
check = [str(num) for num in check]
# create data frame
df = pd.DataFrame({
'X':list(range(1,11,1)),
'Y':list(range(1,11,1)),
'Symbol':['Yes']*5+['No']*5,
'Color':check,
'Size':list(range(1,11,1))
})
# create scatterplot
fig = px.scatter(df, y='Y', x='X',color='Color',symbol='Symbol',size='Size')
# move legend
fig.update_layout(legend=dict(y=1, x=0.1))
fig.show()
That gives:
Keep in mind that you also get the symbol label, as you now have TWO groups!
You may want to sort the values in the list before converting them to strings,
as in this picture (I added the sorting to the code above).
UPDATE
Hey There,
yes, but as far as I know only in matplotlib, and it is a little bit hacky, as you simulate scatter plots. I can only show you a modified example from matplotlib, but maybe it helps you figure it out by yourself:
import matplotlib.pyplot as plt
from numpy.random import randn
z = randn(10)
red_dot, = plt.plot(z, "ro", markersize=5)
red_dot_other, = plt.plot(z*2, "ro", markersize=20)
plt.legend([red_dot, red_dot_other], ["Yes", "No"], markerscale=0.5)
That gives:
As you can see, you are working with two different plots, to be exact one plot for each entry in the size legend. In the legend these plots are merged together. The size of the legend markers is further controlled through markerscale, which is linked to the markersize of each plot. Because we have two plots with TWO different markersizes, we can create a legend with different marker sizes. markerscale is the size of the legend markers relative to the plotted ones: values below 1 shrink them, and values above 1 enlarge them (e.g. 1.5 for 150%).
You can achieve this by fiddling around with the legend handler in matplotlib; see here:
https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html

Explicitly set colours of the boxplot in plotly

I am using plotly express to plot a boxplot as shown below:
px.box(data_frame=df,
y="price",
x="products",
points="all")
However, the boxplots of the products all show up in the same colour. There are four products, and I would like to colour each one differently. Using the additional parameter color_discrete_sequence does not work.
I am using plotly.express.data.tips() as an example dataset and am creating a new column called mcolour to show how we can use an additional column for coloring. See below:
## packages
import plotly.express as px
import numpy as np
import pandas as pd
## example dataset:
df = px.data.tips()
## creating a new column with colors
df['mcolour'] = np.where(
df['day'] == "Sun" ,
'#636EFA',
np.where(
df['day'] == 'Sat', '#EF553B', '#00CC96'
)
)
## plot
fig = px.box(df, x="day", y="total_bill", color="mcolour")
fig = fig.update_layout(showlegend=False)
fig.show()
So, as you see, you can simply assign colors based on another column using color argument in plotly.express.box().
Before showing the figure, you will also need to add the following layout setting so that the newly colored box plots are aligned correctly:
fig.update_layout(boxmode = "overlay")
Setting boxmode to "overlay" restores the normal layout, which appears to be overridden (switched to "group") once the color argument is set.
In the plotly help it says about boxmode:
"Determines how boxes at the same location coordinate are displayed on
the graph. If 'group', the boxes are plotted next to one another
centered around the shared location. If 'overlay', the boxes are
plotted over one another [...]"
Hope this helps!

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to do the boxplot so that it automatically generates the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing this is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
You can get the value of the medians by using the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
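import numpy as np
import matplotlib.pyplot as plt
bp=plt.boxplot(data)
X=[]
Y=[]
for m in bp['medians']:
    # each median is a horizontal Line2D; take its midpoint
    [[x0, x1],[y0,y1]] = m.get_data()
    X.append(np.mean((x0,x1)))
    Y.append(np.mean((y0,y1)))
plt.plot(X,Y,c='C1')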
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
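For the pandas syntax used in the question, a sketch that needs neither seaborn nor the artist objects is to compute the medians with groupby and draw them on the same axes. The data and column names below are made up, and the Agg backend is an assumption so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample data with three categories
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.random(150),
    "group": rng.choice(["A", "B", "C"], size=150),
})

fig, ax = plt.subplots()
df.boxplot(column="value", by="group", ax=ax)

# groupby sorts the group labels, matching the left-to-right box order
medians = df.groupby("group")["value"].median()

# Boxes sit at x = 1, 2, ..., n by default
positions = range(1, len(medians) + 1)
(line,) = ax.plot(positions, medians.values, marker="o", color="C1")
```

Swapping median() for mean() connects the category means instead.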

Inconsistency when setting figure size using pandas plot method

I'm trying to use the convenience of the plot method of a pandas dataframe while adjusting the size of the figure produced. (I'm saving the figures to file as well as displaying them inline in a Jupyter notebook). I found the method below successful most of the time, except when I plot two lines on the same chart - then the figure goes back to the default size.
I suspect this might be due to the differences between plot on a series and plot on a dataframe.
Setup example code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = {
'A': 90 + np.random.randn(366),
'B': 85 + np.random.randn(366)
}
date_range = pd.date_range('2016-01-01', '2016-12-31')
index = pd.Index(date_range, name='Date')
df = pd.DataFrame(data=data, index=index)
Control - this code produces the expected result (a wide plot):
fig = plt.figure(figsize=(10,4))
df['A'].plot()
plt.savefig("plot1.png")
plt.show()
Result:
Plotting two lines - figure size is not (10,4)
fig = plt.figure(figsize=(10,4))
df[['A', 'B']].plot()
plt.savefig("plot2.png")
plt.show()
Result:
What's the right way to do this so that the figure size is consistently set regardless of the number of series selected?
The reason for the difference between the two cases is a bit hidden inside the logic of pandas.DataFrame.plot(). As one can see in the documentation this method allows a lot of arguments to be passed such that it will handle all kinds of different cases.
Here in the first case, you create a matplotlib figure via fig = plt.figure(figsize=(10,4)) and then plot a single column. The internal logic of the pandas plot function is to check whether there is already a figure present in the matplotlib state machine and, if so, to use its current axes to plot the column's values. This works as expected.
However in the second case, the data consists of two columns. There are several options how to handle such a plot, including using different subplots with shared or non-shared axes etc. In order for pandas to be able to apply any of those possible requirements, it will by default create a new figure to which it can add the axes to plot to. The new figure will not know about the already existing figure and its size, but rather have the default size, unless you specify the figsize argument.
In the comments, you say that a possible solution is to use df[['A', 'B']].plot(figsize=(10,4)). This is correct, but you then need to omit the creation of your initial figure. Otherwise it will produce 2 figures, which is probably undesired. In a notebook this will not be visible, but if you run this as a usual python script with plt.show() at the end, there will be two figure windows opening.
So the solution which lets pandas take care of figure creation is
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
df[['A', 'B']].plot(figsize=(10,4))
plt.show()
A way to circumvent the creation of a new figure is to supply the ax argument to the pandas.DataFrame.plot(ax=ax) function, where ax is an externally created axes. This axes can be the standard axes you obtain via plt.gca().
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
plt.figure(figsize=(10,4))
df[['A', 'B']].plot(ax = plt.gca())
plt.show()
Alternatively use the more object oriented way seen in the answer from PaulH.
Always operate explicitly and directly on your Figure and Axes objects. Don't rely on the pyplot state machine. In your case that means:
fig1, ax1 = plt.subplots(figsize=(10,4))
df['A'].plot(ax=ax1)
fig1.savefig("plot1.png")
fig2, ax2 = plt.subplots(figsize=(10,4))
df[['A', 'B']].plot(ax=ax2)
fig2.savefig("plot2.png")
plt.show()
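To check that the explicit-axes approach really keeps the requested size, the following sketch rebuilds the question's data and inspects the figure after plotting two columns. The Agg backend is an assumption so that it runs without a display; the variable names returned_ax, same_axes and size are only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of data as in the question (2016 has 366 days)
index = pd.date_range("2016-01-01", "2016-12-31", name="Date")
df = pd.DataFrame(
    {"A": 90 + np.random.randn(366), "B": 85 + np.random.randn(366)},
    index=index,
)

fig, ax = plt.subplots(figsize=(10, 4))
returned_ax = df[["A", "B"]].plot(ax=ax)

# pandas draws on the axes it was given instead of opening a new figure
same_axes = returned_ax is ax
size = tuple(fig.get_size_inches())
```

Here same_axes is True and size is (10.0, 4.0), so fig.savefig writes the wide figure.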
