Is there any way to show mean in box plot using Python? - python

I'm just starting to use Matplotlib, and I'm trying to learn how to draw a box plot in Python using Colab.
My problem is: I'm not able to put the median on the graph. The graph just showed the quartiles, mean, and outliers. Can someone help me?
My code is the following.
from google.colab import auth
auth.authenticate_user()
import gspread
import numpy as np
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as pl
sns.set_theme(style="ticks", color_codes=True)
wb = gc.open_by_url('URL_JUST_FOR_EXAMPLE')
boxplot = wb.worksheet('control-Scale10to100')
boxplotData = boxplot.get_all_values()
df = pd.DataFrame(boxplotData[1:], columns=boxplotData[0])
df.drop(columns=df.columns[0], inplace=True)  # keyword form; the positional axis argument was removed in pandas 2.0
df = df.apply(pd.to_numeric, errors='ignore')
df.dtypes
df.describe()
dfBoxPlotData = df.iloc[:,4:15]
dfBoxPlotData.apply(pd.to_numeric)
dfBoxPlotData.head()
props = dict(whiskers="Black", medians="Black", caps="Black")
ax = df.plot.box(rot=90, fontsize=14, figsize=(15, 8), color=props, patch_artist=True, grid=False, meanline=True, showmeans=True, meanprops=dict(color='red'))

I tried running your code with a sample data set where the mean and median are distinct, and as @tdy showed, as long as the parameters showmeans=True and meanline=True are passed to the df.plot.box method, the mean and median should both show up. Is it possible that in your data set the mean and median are close enough together that they're hard to distinguish?
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as pl
mu, sigma = 50., 10. # mean and standard deviation
np.random.seed(42)
s = np.random.normal(mu, sigma, 30)
df = pd.DataFrame({'values':s})
props = dict(whiskers="Black", medians="Black", caps="Black")
ax = df.plot.box(rot=90, fontsize=14, figsize=(15, 8), color=props, patch_artist=True, grid=False, meanline=True, showmeans=True, meanprops=dict(color='red'))
pl.show()
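To check whether the mean and median lines are merely overlapping, it can help to plot deliberately skewed data, where the two clearly differ. The following is a minimal sketch that calls Matplotlib's Axes.boxplot directly; the exponential sample and all variable names are made up for illustration and are not taken from the question's spreadsheet.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; omit in Colab
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical right-skewed sample: the mean sits well above the median.
rng = np.random.default_rng(42)
data = rng.exponential(scale=10.0, size=200)

fig, ax = plt.subplots(figsize=(4, 6))
bp = ax.boxplot(
    data,
    showmeans=True,   # draw the mean
    meanline=True,    # ...as a line across the box instead of a marker
    meanprops=dict(color="red", linestyle="--"),
    medianprops=dict(color="black"),
)

# The returned dict exposes the drawn artists, so the positions can be inspected.
mean_y = bp["means"][0].get_ydata()[0]
median_y = bp["medians"][0].get_ydata()[0]
print(f"mean line at {mean_y:.2f}, median line at {median_y:.2f}")
```

If the two printed values are nearly equal for your own data, the red mean line is probably drawn on top of the median line rather than missing.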

Related

How to plot Multiline Graphs Via Seaborn library in Python?

I have written a code that looks like this:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
exp1= sns.lineplot(data=df1)
plt.savefig('exp1.png')
exp1_smooth= sns.lmplot(x='Size', y='Time', data=df, ci=None, order=4, truncate=False)
plt.savefig('exp1_smooth.png')
That gives me Graph_1:
Size appears as a constant line, but as you can see in my code it varies (10, 100, 1000).
How does this produce a constant line? I want to produce a multi-line graph with x-axis = Size (T) and y-axis = Encrypt_Time and Decrypt_Time (power1 and power2).
I also wanted to plot a smoothed version of the graph I am getting right now, but it gives me an error. What needs to be done to get a smooth multi-line graph with x-axis = Size (T) and y-axis = Encrypt_Time and Decrypt_Time (power1 and power2)?
I don't think that is the issue: the line for Size looks constant, but it is not.
The Size values range from about 10 to 1000, while the smallest division of the y-axis is 20,000 (20 times bigger), which makes Size look like a horizontal line on your graph.
You can try larger values to see the slope clearly.
If you want `Size` as the x-axis, you can try the example below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
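fig, ax = plt.subplots()
# draw both series on the same Axes, with Size on the x-axis
sns.lineplot(data=df1, x='Size', y='Encrypt_Time', label='Encrypt_Time', ax=ax)
sns.lineplot(data=df1, x='Size', y='Decrypt_Time', label='Decrypt_Time', ax=ax)
plt.show()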

How to print the heatmap in a square shape using seaborn?

When I run the code below, the heatmap does not come out square even though I passed square=True. Any idea how I can draw the heatmap in a square format? Thank you!
The code:
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
temp_hourly_A5_A7_AX_ASHRAE=pd.read_csv('C:\\Users\\cvaa4\\Desktop\\projects\\s\\temp_hourly_A5_A7_AX_ASHRAE.csv',index_col=0, parse_dates=True, dayfirst=True, skiprows=2)
sns.heatmap(temp_hourly_A5_A7_AX_ASHRAE,cmap="YlGnBu", vmin=18, vmax=27, square=True, cbar=False, linewidth=0.0001);
The result:
square=True should work to give square cells. Below is a working example:
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.DataFrame(np.tile([0,1], 15*15).reshape(-1,15))
sns.heatmap(df, square=True)
If you want a square shape of the plot however, you can use set_aspect and the shape of the data:
ax = sns.heatmap(df)
ax.set_aspect(df.shape[1]/df.shape[0]) # here 0.5 Y/X ratio
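The difference between square cells and a square plot can be seen side by side. The following sketch reuses the 30-row by 15-column frame from above; the subplot layout and the names ax_cells and ax_plot are illustrative choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# 450 values reshaped into 30 rows x 15 columns
df = pd.DataFrame(np.tile([0, 1], 15 * 15).reshape(-1, 15))

fig, (ax_cells, ax_plot) = plt.subplots(1, 2, figsize=(10, 5))

# Left: every cell is square, so the 30 x 15 heatmap is twice as tall as wide.
sns.heatmap(df, square=True, cbar=False, ax=ax_cells)
ax_cells.set_title("square cells")

# Right: cells are squashed vertically so the whole heatmap is square.
sns.heatmap(df, cbar=False, ax=ax_plot)
ax_plot.set_aspect(df.shape[1] / df.shape[0])  # 15 / 30 = 0.5
ax_plot.set_title("square plot")
```

If the data has many more rows than columns, square=True necessarily produces a tall, narrow heatmap, which may be what the question is observing.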
You can use matplotlib and set a figsize before plotting heatmap.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
rnd = np.random.default_rng(12345)
data = rnd.uniform(-100, 100, [100, 50])
plt.figure(figsize=(6, 5))
sns.heatmap(data, cmap='viridis');
Note that I used figsize=(6, 5) rather than a square figsize=(5, 5). This is because on a given figsize, seaborn also puts the colorbar, which might cause the heatmap to be squished a bit. You might want to change those figsizes too depending on what you need.

Can you plot interquartile range as the error band on a seaborn lineplot?

I'm plotting time series data using seaborn lineplot (https://seaborn.pydata.org/generated/seaborn.lineplot.html), and plotting the median instead of mean. Example code:
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median, data=fmri)
I want the error bands to show the interquartile range as opposed to the confidence interval. I know I can use ci = "sd" for standard deviation, but is there a simple way to add the IQR instead? I cannot figure it out.
Thank you!
I don't know if this can be done with seaborn alone, but here's one way to do it with matplotlib, keeping the seaborn style. The describe() method conveniently provides summary statistics for a DataFrame, among them the quartiles, which we can use to plot the medians with inter-quartile-ranges.
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
fmri_stats = fmri.groupby(['timepoint']).describe()
x = fmri_stats.index
medians = fmri_stats[('signal', '50%')]
medians.name = 'signal'
quartiles1 = fmri_stats[('signal', '25%')]
quartiles3 = fmri_stats[('signal', '75%')]
ax = sns.lineplot(x=x, y=medians)  # keyword arguments; recent seaborn versions reject positional x and y
ax.fill_between(x, quartiles1, quartiles3, alpha=0.3);
You can calculate the median within lineplot as you have done, set ci=None, and fill in the band using ax.fill_between():
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median,
data=fmri,ci=None)
bounds = fmri.groupby('timepoint')['signal'].quantile((0.25,0.75)).unstack()
ax.fill_between(x=bounds.index,y1=bounds.iloc[:,0],y2=bounds.iloc[:,1],alpha=0.1)
This has been possible since seaborn 0.12; see the lineplot documentation for the errorbar parameter.
pip install --upgrade seaborn
The estimator parameter sets the point estimate, given either as the name of a pandas method (such as 'median' or 'mean') or as a callable.
The errorbar parameter sets how the spread is drawn, given as a string, a (string, number) tuple, or a callable. To mark the median and fill the area between the quartiles, you would need these parameters:
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
# the errorbar callable returns the (lower, upper) bounds of the band
ax = sns.lineplot(data=fmri, x="timepoint", y="signal", estimator=np.median,
                  errorbar=lambda x: (np.quantile(x, 0.25), np.quantile(x, 0.75)))
You can now!
estimator="median", errorbar=("pi", 50)
https://seaborn.pydata.org/tutorial/error_bars

How to fill color by groups in histogram using Matplotlib?

I know how to do this in R and have provided the code for it below. I want to know how I can do something similar in Python with Matplotlib or any other library.
library(ggplot2)
ggplot(dia[1:768,], aes(x = Glucose, fill = Outcome)) +
geom_bar() +
ggtitle("Glucose") +
xlab("Glucose") +
ylab("Total Count") +
labs(fill = "Outcome")
Using pandas you can pivot the dataframe and directly plot it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# dataframe with two columns in "long form"
g = np.array([np.random.normal(5, 10, 500),
np.random.rayleigh(10, size=500)]).flatten()
df = pd.DataFrame({'Glucose': g, 'Outcome': np.repeat([0,1],500)})
# pivot and plot
df.pivot(columns="Outcome", values="Glucose").plot.hist(bins=100)
plt.show()
Please consider the following example, which uses seaborn 0.11.1.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate random data
data = {'Glucose': np.random.normal(5, 10, 100),
'Outcome': np.random.randint(2, size=100)}
df = pd.DataFrame(data)
# plot
fig, ax = plt.subplots(figsize=(10, 10))
sns.histplot(data=df, x='Glucose', hue='Outcome', stat='count', edgecolor=None)
ax.set_title('Glucose')
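With Matplotlib alone, something close to the stacked bars in the ggplot2 code can be sketched by passing one array per group to Axes.hist with stacked=True. The random data, the bin count, and the variable names below are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Glucose / Outcome columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({'Glucose': rng.normal(120, 30, 300),
                   'Outcome': rng.integers(0, 2, size=300)})

# One array of Glucose values per Outcome group, in sorted label order.
labels = sorted(df['Outcome'].unique())
groups = [df.loc[df['Outcome'] == label, 'Glucose'].to_numpy() for label in labels]

fig, ax = plt.subplots(figsize=(8, 5))
counts, bin_edges, patches = ax.hist(groups, bins=20, stacked=True,
                                     label=[f'Outcome {label}' for label in labels])
ax.set_title('Glucose')
ax.set_xlabel('Glucose')
ax.set_ylabel('Total Count')
ax.legend(title='Outcome')
```

With stacked=True the returned counts are cumulative across groups, so the last row gives the total height of each stacked bar.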

Plotting large datasets with pandas

I convert an oscilloscope dataset with millions of values into a pandas DataFrame. The next step is to plot it, but Matplotlib needs ~50 seconds to plot the DataFrame on my fairly powerful machine.
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
df = pd.concat([srx, sry], axis = 1)
df.set_index(0, inplace = True)
df.plot(grid = 1)
plt.show()
Now I found out that there is a way to make matplotlib faster with large datasets by using 'Agg'.
import matplotlib
matplotlib.use('Agg')
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
df = pd.concat([srx, sry], axis = 1)
df.set_index(0, inplace = True)
df.plot(grid = 1)
plt.show()
Unfortunately no plot is shown. The process of processing the plot takes ~5 seconds (a big improvement) but no plot is shown. Is this method not compatible with pandas?
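A note on the behaviour described above: Agg is a non-interactive backend that renders to an image buffer, so plt.show() does not open a window with it; the figure has to be saved instead. The sketch below shows that, combined with simple decimation to cut the number of plotted points. The synthetic sine trace stands in for the oscilloscope file, and the step of 1000 is an arbitrary assumption; plain decimation can hide short spikes, so treat it as a starting point only.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive: figures can be saved but not shown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the oscilloscope trace (hypothetical data).
x = np.linspace(0.0, 1.0, 2_000_000)
y = np.sin(2 * np.pi * 50 * x)
df = pd.DataFrame({"signal": y}, index=x)

# Keep every 1000th sample so only about 2000 points are drawn.
step = 1000
thin = df.iloc[::step]

ax = thin.plot(grid=True)

# Write the figure to an in-memory PNG; a file name works the same way.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

To view the result interactively, the same plotting code can be run without the matplotlib.use("Agg") line so that the default interactive backend is used.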
You can use Plotly and Lenspy (which was built to solve this exact problem). Here is an example of how you can plot 10 million points on a scatter plot. This plot runs super fast on my 2016 MacBook.
import numpy as np
import plotly.graph_objects as go
from lenspy import DynamicPlot
# First, let's create a very large figure
x = np.arange(1, 11, 1e-6)
y = 1e-2*np.sin(1e3*x) + np.sin(x) + 1e-3*np.sin(1e10*x)
fig = go.Figure(data=[go.Scattergl(x=x, y=y)])
fig.update_layout(title=f"{len(x):,} Data Points.")
# Use DynamicPlot.show to view the plot
plot = DynamicPlot(fig)
plot.show()
# Plot will be available in the browser at http://127.0.0.1:8050/
For your use case (again, I cannot test this since I don’t have access to your dataset):
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
from lenspy import DynamicPlot
import plotly.graph_objects as go
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
fig = go.Figure(data=[go.Scattergl(x=srx, y=sry)])
fig.update_layout(title=f"{len(srx):,} Data Points.")
# Use DynamicPlot.show to view the plot
plot = DynamicPlot(fig)
plot.show()
Disclaimer: I am the creator of Lenspy
