Boxplot with a boolean column and an int value - python

I would like to create a boxplot of the distribution of the variable duree according to whether or not the film belongs to the Dramas category (in_drama is True or False).
Unfortunately, neither of these two attempts takes into account whether the in_drama column is True or False.
Note that the two columns are in the same DataFrame.
movies.boxplot(column= 'in_drama', by='duree', figsize= (7,7));
# sns.catplot(x="in_drama", y="duree" , kind="box", data=movies);

For the pandas boxplot, you can set by='in_drama' and column='duree' to get x-values of in_drama == False and in_drama == True, and boxplots taking into account the duree column:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
movies = pd.DataFrame({'in_drama': [False, False, False, False, False, True, True, True, True, True],
'durée': [95, 118, 143, 89, 91, 145, 168, 193, 139, 141]})
movies.boxplot(by='in_drama', column='durée', figsize=(7, 7))
plt.show()
The seaborn plot should also work. As only one subplot is needed, sns.boxplot can be used directly.
sns.set()
sns.boxplot(x="in_drama", y="durée", data=movies)
At the left the pandas boxplot, at the right seaborn:
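To confirm that the boxes really are split by the boolean column, the same grouping can be inspected numerically. This is a minimal sketch that reuses the example data from the answer above; the names summary and medians are only illustrative.

```python
import pandas as pd

# Same example data as in the answer above.
movies = pd.DataFrame({
    'in_drama': [False, False, False, False, False, True, True, True, True, True],
    'durée': [95, 118, 143, 89, 91, 145, 168, 193, 139, 141],
})

# One row of summary statistics per value of in_drama; these are the
# quartiles that the two boxes are drawn from.
summary = movies.groupby('in_drama')['durée'].describe()
print(summary[['count', '25%', '50%', '75%']])

# The median line of each box, ordered False then True.
medians = movies.groupby('in_drama')['durée'].median()
print(medians)
```

If the two rows of the summary differ, the boxplot grouped by in_drama should show two visibly different boxes.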

Related

How would I add a legend entry for each column of pandas dataframe in graph generated by df.boxplot()

I'm trying to create some boxplots of pandas dataframes.
I have dataframes that typically look like this (not sure if there was a good way to show it so just took a screenshot).
I am creating a boxplot for each dataframe (after transposing) using the df.boxplot() method. It comes out almost exactly how I want it using the code below:
ax = crit_storm_df[tp_cols].T.boxplot()
ax.set_xlabel("Duration (m)")
ax.set_ylabel("Max Flow (cu.m/sec)")
ax.set_xlim(0, None)
ax.set_ylim(0, None)
ax.set_title(crit_storm_df.name)
plt.show()
Example pic of output graph. What's lacking though is I want to add a legend with one entry for each box that represents a column in my dataframe in the pic above. Since I transposed the df before plotting, I would like to have a legend entry for each row, i.e. "tp01", "tp02" etc.
Anyone know what I should be doing instead? Is there a way to do this through the df.boxplot() method or do I need to do something in matplotlib?
I tried ax.legend() but it doesn't do anything except give me a warning:
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Any help would be appreciated, thanks!
If you simply want your boxes to have different colors, you can use seaborn. That is its default behavior:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randn(10, 4),
columns=['Col1', 'Col2', 'Col3', 'Col4'])
ax = sns.boxplot(data=df)
plt.legend(ax.patches, df.columns)
plt.show()
Edit: adding legend
Output:
To get a graph similar to the one you need, this code should help:
import matplotlib.pyplot as plt
import pandas as pd
data = {
'Duration': [10, 15, 20, 25, 30, 45, 60, 90, 120, 180, 270, 360],
'tp01': [13.1738, 13.1662, 14.3903, 14.2772, 14.3223, 12.5686, 14.8710, 8.9785, 9.2224, 7.4957, 3.6493, 5.7982],
'tp02': [13.1029, 14.2570, 16.5373, 12.6589, 11.0455, 12.6777, 8.1715, 9.3830, 8.3498, 6.0930, 6.4310, 7.4538],
'tp03': [14.5263, 13.6724, 11.4800, 13.4982, 12.3987, 11.6688, 10.4089, 7.0736, 5.8004, 10.1354, 5.5874, 5.6749],
'tp04': [14.7589, 11.6993, 12.5825, 13.5627, 11.9481, 10.7803, 8.9388, 5.7076, 12.7690, 9.7546, 9.5004, 5.9912],
'tp05': [15.5543, 14.1007, 11.7304, 13.3218, 12.4318, 9.5237, 11.9014, 5.6778, 14.2627, 3.7422, 6.4555, 3.3458],
'tp06': [13.5196, 12.5939, 12.5679, 11.4414, 9.3590, 9.6083, 9.6704, 10.5239, 9.1028, 6.0336, 7.0258, 5.9800],
'tp07': [14.7476, 13.3925, 13.0324, 13.3649, 14.7832, 8.1078, 7.1307, 15.4406, 5.0187, 6.9497, 3.6492, 4.8642],
'tp08': [13.3995, 14.3639, 12.7579, 10.6844, 10.3281, 10.2541, 8.8257, 8.8773, 8.3498, 5.7315, 7.8469, 6.7316],
'tp09': [16.7954, 17.1788, 15.9850, 10.8780, 12.5249, 10.2174, 7.5735, 7.3753, 7.1157, 4.8536, 9.1581, 5.6369],
'tp10': [15.7671, 16.1570, 11.6122, 15.2340, 13.2356, 13.2270, 11.6810, 7.1157, 8.0048, 5.5782, 6.0876, 5.7982],
}
df = pd.DataFrame(data).set_index("Duration")
fig, ax = plt.subplots()
df.T.plot(kind='box', ax=ax)
labels = df.columns
lines = [plt.Line2D([0, 1], [0, 1], color=c, marker='o', markersize=10) for c in plt.rcParams['axes.prop_cycle'].by_key()['color'][:len(labels)]]
ax.legend(lines, labels, loc='best')
ax.set_xlabel("Duration (m)")
ax.set_ylabel("Max Flow (cu.m/sec)")
ax.set_xlim(0, None)
ax.set_ylim(0, None)
ax.set_xticklabels(df.index)
plt.show()
Result:
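Another possible route, sketched here under assumptions, is to draw the boxes with matplotlib's own ax.boxplot and patch_artist=True, so that each box is a filled patch that can be colored and passed to ax.legend as a handle. The data and the group labels below are made up for illustration; they are not the asker's dataframe.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: one sample of 12 values per labelled group.
rng = np.random.default_rng(0)
labels = ["tp01", "tp02", "tp03", "tp04"]
samples = [rng.normal(loc=10 + i, scale=2, size=12) for i in range(len(labels))]

fig, ax = plt.subplots()

# patch_artist=True makes every box a filled patch instead of a plain line.
bp = ax.boxplot(samples, patch_artist=True)

# Give each box its own color from the default color cycle.
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for patch, color in zip(bp["boxes"], colors):
    patch.set_facecolor(color)

# The box patches themselves serve as legend handles, one entry per box.
legend = ax.legend(bp["boxes"], labels, loc="best")

# Call plt.show() here to display the figure interactively.
```

Because the legend handles are the boxes that were actually drawn, the legend colors cannot drift out of sync with the plot.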

How to access/change properties of individual points on matplotlib scatter plot

Is there a way I could modify properties of individual points on a matplotlib scatter plot, for example make certain points invisible or change their size/shape?
Let's consider example data set using pandas.DataFrame():
import pandas as pd
import matplotlib.pyplot as plt
import random
df = pd.DataFrame()
df['name'] = ['cat', 'dog', 'bird', 'fish', 'frog']
df['id'] = [1, 1, 1, 2, 2]
df['x'] = [random.randint(-10, 10) for n in range(5)]
df['y'] = [random.randint(-10, 10) for n in range(5)]
Let's plot it on scatter plot:
sc = plt.scatter(df['x'].tolist(), df['y'].tolist())
plt.show()
#easy-peasy
Plot was generated.
Let's say I want all datapoints that have id=1 in df removed from the existing plot (for example with a button click). By removed I don't necessarily mean deleted; setting them invisible or something similar would be fine. In general I'm interested in a way to iterate over each point existing on the plot and do something with it.
EDIT #1
Using the inspect module I noticed that the sc plot object holds a property named sc._offsets.
It seems to be a 2D numpy array holding the coordinates of the datapoints on the scatter plot (for a 2D plot).
This _offsets property consists of three components: "data" (a 2D array of coordinates), "mask" (a 2D array of bool values, all False in this case) and "fill value", which seems to be of no concern to me.
I've managed to remove points of choice from the scatter plot by deleting _offsets elements at certain indexes like this:
sc._offsets = numpy.delete(sc._offsets, [0, 1, 3], axis=0)
and then re-drawing the plot:
sc.figure.canvas.draw()
Since values in 'id' column of the dataframe and coordinates in sc._offsets are aligned, I can remove coordinates by index where 'id' value was (for example) = 1.
This does what I wanted because the original dataframe with the dataset remains intact, so I can re-create points on the scatter plot on demand.
I think I could use the "mask" to somehow hide/show points of choice on scatter plot but I don't yet know how. I'm investigating it.
SOLVED
The answer is to set the mask of the numpy.ma.MaskedArray stored in the sc._offsets property of the matplotlib scatter plot (sc._offsets.mask).
This can be done in the following way both during plot generation and after plot has been generated, in interactive mode:
#before change:
#sc._offsets.mask = [[False, False], [False, False], [False, False], [False, False], [False, False]]
sc._offsets.mask = [[1, 1], [1, 1], [1, 1], [0, 0], [0, 0]]
#after change:
#sc._offsets.mask = [[True, True], [True, True], [True, True], [False, False], [False, False]]
#then re-draw plot
sc.figure.canvas.draw() #docs say that it's better to use draw_idle() but I don't see difference
Setting the mask to True at the index of a point excludes that point from the plot. It does not delete it: points can be restored by setting the values back to False. Note that the mask is a 2D array, so passing a simple [1, 1, 1, 0, 0] will not do; you need to mask both the x and the y coordinate of each point.
Consult numpy docs for details:
https://numpy.org/doc/stable/reference/maskedarray.generic.html#accessing-the-mask
I'll edit if something comes up.
Thank you all for help.
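Since _offsets is a private attribute, a variant that stays on the public API may be more robust across matplotlib versions. The sketch below is an assumption-laden alternative, not the method described above: it hides points by giving them a marker size of zero through set_sizes. The fixed x/y values and the name base_size are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Fixed coordinates instead of random ones, so the sketch is reproducible.
df = pd.DataFrame({
    'name': ['cat', 'dog', 'bird', 'fish', 'frog'],
    'id': [1, 1, 1, 2, 2],
    'x': [-3, 0, 4, 7, -8],
    'y': [2, -5, 9, 1, 6],
})

fig, ax = plt.subplots()
base_size = 36.0  # marker area in points**2
sc = ax.scatter(df['x'], df['y'], s=base_size)

# Hide every point whose id is 1 by shrinking its marker to size zero.
hidden = (df['id'] == 1).to_numpy()
sc.set_sizes(np.where(hidden, 0.0, base_size))
fig.canvas.draw_idle()

# To bring the points back later:
# sc.set_sizes(np.full(len(df), base_size))
```

The coordinates stay untouched in sc.get_offsets(), so the row order of the dataframe and the order of the plotted points remain aligned.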
A basic solution: if your dataset is not large and you know the conditions that distinguish the points you want to plot differently, you can create one column per condition and plot each column with its own marker and color.
Suppose you want the y values greater than 3 to be plotted differently:
import pandas as pd
import matplotlib.pyplot as plt
import random
df = pd.DataFrame()
df['name'] = ['cat', 'dog', 'bird', 'fish', 'frog']
df['id'] = [1, 1, 1, 2, 2]
df['x'] = [random.randint(-10, 10) for n in range(5)]
df['y'] = [random.randint(-10, 10) for n in range(5)]
_mask = df.y > 3
df.loc[_mask, 'y_case_2'] = df.y
df.loc[~_mask, 'y_case_1'] = df.y
sc = plt.scatter(df.x, df.y_case_1, marker='*', color='r')
sc = plt.scatter(df.x, df.y_case_2, marker='.', color='b')
plt.show()
df
Note: the random data may not contain any y value greater than 3. If that happens, run the code again.

Seaborn confidence intervals for fraction of total

I have a dataframe with a categorical column and a column of float values and would like to visualise the sum of the floats by categorical group as a fraction of total (e.g. 20% vs 80%). In addition, I would like to visualise the uncertainty, i.e. plot the confidence interval around the point estimates.
Here is a stylised example:
import pandas as pd
df = pd.DataFrame(data={
'flag':[True, False, True, True, True, False, True, False, True, True, True, True],
'revenue': [1,2,3,4,5,6,7,8,9,10,11,12]
})
print(df.groupby('flag').revenue.sum()/df.revenue.sum())
flag
False 0.205128
True 0.794872
Name: revenue, dtype: float64
I tried to specify the estimator in the seaborn barplot function and got this Seaborn barplot output:
import numpy as np
import seaborn as sns
sns.barplot(data=df, x='flag', y='revenue', estimator=lambda x: np.sum(x)/df.revenue.sum())
The problem is that seaborn allows the confidence interval of the bars to extend beyond 100%, i.e. to more than the total (while it should be capped at 100%).
Here is a quick-and-dirty code sample of bootstrapping the confidence intervals properly:
def bootstrap_ratio(df):
    df_bs = df.sample(n=len(df), replace=True)
    return df_bs.groupby('flag').revenue.sum()/df_bs.revenue.sum()
N = 1000 # bootstrap samples
pd.concat([bootstrap_ratio(df) for i in range(N)], axis=1).quantile([0.05, 0.95], axis=1)
         False      True
0.05  0.052912  0.575636
0.95  0.427810  0.973333
In this case, the 90% confidence interval of flag==True is [58%, 97%], with the upper edge not more than 100%.
How would I get seaborn to do this? Or at least how could I specify confidence interval values for seaborn to plot (instead of compute)?
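One possible workaround, sketched here rather than offered as a seaborn feature, is to compute the interval yourself and draw it with matplotlib's ax.bar, whose yerr argument accepts a 2 x N array of lower and upper error lengths. The helper revenue_share, the fixed seeds and the 5%/95% quantiles are assumptions chosen to mirror the bootstrap code in the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'flag': [True, False, True, True, True, False, True, False, True, True, True, True],
    'revenue': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})
groups = [False, True]

def revenue_share(frame):
    """Share of total revenue per flag value (0.0 if a group is absent)."""
    share = frame.groupby('flag')['revenue'].sum() / frame['revenue'].sum()
    return share.reindex(groups, fill_value=0.0)

# Point estimate on the full data.
point = revenue_share(df)

# Bootstrap: resample rows with replacement and recompute the shares.
n_boot = 1000
boot = pd.concat(
    [revenue_share(df.sample(n=len(df), replace=True, random_state=seed))
     for seed in range(n_boot)],
    axis=1,
)
lower = boot.quantile(0.05, axis=1)
upper = boot.quantile(0.95, axis=1)

# yerr holds distances from the bar height, lower row first, upper row second.
yerr = np.vstack([
    (point - lower).clip(lower=0).to_numpy(),
    (upper - point).clip(lower=0).to_numpy(),
])

fig, ax = plt.subplots()
ax.bar([str(g) for g in groups], point.to_numpy(), yerr=yerr, capsize=6)
ax.set_ylim(0, 1)
ax.set_xlabel('flag')
ax.set_ylabel('share of total revenue')
# Call plt.show() here to display the figure interactively.
```

Every bootstrap share lies between 0 and 1 by construction, so the plotted interval cannot extend beyond 100%.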

Multiple Bar Plot using Seaborn

I'm making a bar plot from three data series in seaborn, but each series is drawn on top of the previous one, regardless of whether it hides the previous bars. For example:
sns.barplot(x="Portfolio", y="Factor", data=d2,
label="Portfolio", color="g")
sns.barplot(x="Benchmark", y="Factor", data=d2,
label="Benchmark", color="b")
sns.barplot(x="Active Exposure", y="Factor", data=d2,
label="Active", color="r")
ax.legend(frameon=True)
ax.set(xlim=(-.1, .5), ylabel="", xlabel="Sector Decomposition")
sns.despine(left=True, bottom=True)
However, I want the green bar to stay visible even if the blue bar overlaid on it is longer. Any ideas?
Without being able to see your data, I can only guess that your DataFrame is not in long form. The seaborn tutorial has a section on the DataFrame shape that seaborn expects; take a look there for more information, specifically the part on messy data.
Because I can't see your DataFrame, I have made some assumptions about its shape:
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
"Factor": list("ABC"),
"Portfolio": np.random.random(3),
"Benchmark": np.random.random(3),
"Active Exposure": np.random.random(3),
})
# Active Exposure Benchmark Factor Portfolio
# 0 0.140177 0.112653 A 0.669687
# 1 0.823740 0.078819 B 0.072474
# 2 0.450814 0.702114 C 0.039068
We can melt this DataFrame to get the long-form data seaborn wants:
d2 = df.melt(id_vars="Factor", var_name="exposure")
# Factor exposure value
# 0 A Active Exposure 0.140177
# 1 B Active Exposure 0.823740
# 2 C Active Exposure 0.450814
# 3 A Benchmark 0.112653
# 4 B Benchmark 0.078819
# 5 C Benchmark 0.702114
# 6 A Portfolio 0.669687
# 7 B Portfolio 0.072474
# 8 C Portfolio 0.039068
Then, finally, we can plot our bar plot using seaborn's built-in aggregations:
ax = sns.barplot(x="value", y="Factor", hue="exposure", data=d2)
ax.set(ylabel="", xlabel="Sector Decomposition")
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Which produces:
Here's the plot params I used to make this chart:
import matplotlib as mpl
# Plot configuration
# On matplotlib >= 3.6 this style is named "seaborn-v0_8-pastel"
mpl.style.use("seaborn-pastel")
mpl.rcParams.update(
{
"font.size": 14,
"figure.facecolor": "w",
"axes.facecolor": "w",
"axes.spines.right": False,
"axes.spines.top": False,
"axes.spines.bottom": False,
"xtick.top": False,
"xtick.bottom": False,
"ytick.right": False,
"ytick.left": False,
}
)
If you are fine without using seaborn you can use pandas plotting to create a stacked horizontal bar chart (barh):
import pandas as pd
import matplotlib as mpl
# Plot configuration
# On matplotlib >= 3.6 this style is named "seaborn-v0_8-pastel"
mpl.style.use("seaborn-pastel")
mpl.rcParams.update(
{
"font.size": 14,
"figure.facecolor": "w",
"axes.facecolor": "w",
"axes.spines.right": False,
"axes.spines.top": False,
"axes.spines.bottom": False,
"xtick.top": False,
"xtick.bottom": False,
"ytick.right": False,
"ytick.left": False,
}
)
df = pd.DataFrame({
"Factor": list("ABC"),
"Portfolio": [0.669687, 0.072474, 0.039068],
"Benchmark": [0.112653, 0.078819, 0.702114],
"Active Exposure": [0.140177, 0.823740, 0.450814],
}).set_index("Factor")
ax = df.plot.barh(stacked=True)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_ylabel("")
ax.set_xlabel("Sector Decomposition")
Notice in the code above the index is set to Factor which then becomes the y axis.
If you don't set stacked=True you get almost the same chart as seaborn produced:
ax = df.plot.barh(stacked=False)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_ylabel("")
ax.set_xlabel("Sector Decomposition")

heatmap and dendrogram (clustermap) error using Plotly

The last example in Plotly's documentation for Dendrograms has an error. When executing this code, I get this error in two locations due to 'extend':
AttributeError: 'tuple' object has no attribute 'extend'
They are produced by the two lines that call extend on figure['data'], for example figure['data'].extend(dendro_side['data']).
If anyone has run into this problem, please see my solution below! Happy coding!
I have a quick and accurate solution to run the last example code in Plotly's documentation for Dendrograms. Note that I am using Plotly offline in a Jupyter Notebook.
Figure has methods to add_traces, and these should replace extend.
The three key lines are:
figure.add_traces(dendro_side['data'])
figure.add_traces(heatmap)
plotly.offline.iplot(figure, filename='dendrogram_with_heatmap')
Here is the full example code with my corrections and necessary imports, below:
# Import Useful Things
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode
# Enable offline plotting inside the Jupyter Notebook
init_notebook_mode(connected=True)
import numpy as np
from scipy.spatial.distance import pdist, squareform
# Get Data
data = np.genfromtxt("http://files.figshare.com/2133304/ExpRawData_E_TABM_84_A_AFFY_44.tab",names=True,usecols=tuple(range(1,30)),dtype=float, delimiter="\t")
data_array = data.view((float, len(data.dtype.names)))
data_array = data_array.transpose()
labels = data.dtype.names
# Initialize figure by creating upper dendrogram
figure = ff.create_dendrogram(data_array, orientation='bottom', labels=labels)
for i in range(len(figure['data'])):
    figure['data'][i]['yaxis'] = 'y2'
# Create Side Dendrogram
dendro_side = ff.create_dendrogram(data_array, orientation='right')
for i in range(len(dendro_side['data'])):
    dendro_side['data'][i]['xaxis'] = 'x2'
# Add Side Dendrogram Data to Figure
figure.add_traces(dendro_side['data'])
# Create Heatmap
dendro_leaves = dendro_side['layout']['yaxis']['ticktext']
dendro_leaves = list(map(int, dendro_leaves))
data_dist = pdist(data_array)
heat_data = squareform(data_dist)
heat_data = heat_data[dendro_leaves,:]
heat_data = heat_data[:,dendro_leaves]
heatmap = [
go.Heatmap(
x = dendro_leaves,
y = dendro_leaves,
z = heat_data,
colorscale = 'Blues'
)
]
heatmap[0]['x'] = figure['layout']['xaxis']['tickvals']
heatmap[0]['y'] = dendro_side['layout']['yaxis']['tickvals']
figure.add_traces(heatmap)
# Edit Layout
figure['layout'].update({'width':800, 'height':800,
'showlegend':False, 'hovermode': 'closest',
})
# Edit xaxis
figure['layout']['xaxis'].update({'domain': [.15, 1],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'ticks':""})
# Edit xaxis2
figure['layout'].update({'xaxis2': {'domain': [0, .15],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'showticklabels': False,
'ticks':""}})
# Edit yaxis
figure['layout']['yaxis'].update({'domain': [0, .85],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'showticklabels': False,
'ticks': ""})
# Edit yaxis2
figure['layout'].update({'yaxis2':{'domain':[.825, .975],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'showticklabels': False,
'ticks':""}})
# Plot using Plotly Offline
plotly.offline.iplot(figure, filename='dendrogram_with_heatmap')
This outputs:
