Dot-boxplots from DataFrames - python

Dataframes in Pandas have a boxplot method, but is there any way to create dot-boxplots in Pandas, or otherwise with seaborn?
By a dot-boxplot, I mean a boxplot that shows the actual data points (or a relevant sample of them) inside the plot, e.g. like the example below (obtained in R).

For a more precise answer related to OP's question (with Pandas):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"A": np.random.normal(0.8, 0.2, 20),
                     "B": np.random.normal(0.8, 0.1, 20),
                     "C": np.random.normal(0.9, 0.1, 20)})
data.boxplot()
for i, d in enumerate(data):
    y = data[d]
    # Jitter the x positions around the box centre (the boxes sit at x = 1, 2, 3)
    x = np.random.normal(i + 1, 0.04, len(y))
    plt.plot(x, y, mfc=["orange", "blue", "yellow"][i], mec='k', ms=7, marker="o", linestyle="None")
plt.hlines(1,0,4,linestyle="--")
Old version (more generic) :
With matplotlib :
import numpy as np
import matplotlib.pyplot as plt
a = np.random.normal(0,2,1000)
b = np.random.normal(-2,7,100)
data = [a,b]
plt.boxplot(data) # Or you can use the boxplot from Pandas
for i in [1, 2]:
    y = data[i - 1]
    x = np.random.normal(i, 0.02, len(y))
    plt.plot(x, y, 'r.', alpha=0.2)
Which gives this:
Inspired by this tutorial
Hope this helps !
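If you need this more than once, the jitter loop above can be wrapped in a small helper. This is only a sketch: `dot_boxplot`, `jitter` and `seed` are names I made up (they are not part of pandas or matplotlib), and it assumes every column of the DataFrame is numeric. Fliers are switched off because all points are drawn anyway.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def dot_boxplot(df, ax=None, jitter=0.04, seed=0):
    """Draw one box per column and overlay the raw points with horizontal jitter.

    Hypothetical helper for illustration, not a pandas/matplotlib API.
    """
    if ax is None:
        _, ax = plt.subplots()
    rng = np.random.default_rng(seed)
    columns = list(df.columns)
    # Matplotlib centres the i-th box (1-based) at x = i.
    ax.boxplot([df[c].dropna().to_numpy() for c in columns], showfliers=False)
    ax.set_xticks(np.arange(1, len(columns) + 1))
    ax.set_xticklabels(columns)
    for i, c in enumerate(columns, start=1):
        y = df[c].dropna().to_numpy()
        x = rng.normal(i, jitter, size=len(y))  # spread the points around the box centre
        ax.plot(x, y, "o", ms=5, alpha=0.6, mec="k")
    return ax


rng = np.random.default_rng(1)
data = pd.DataFrame({"A": rng.normal(0.8, 0.2, 20),
                     "B": rng.normal(0.8, 0.1, 20),
                     "C": rng.normal(0.9, 0.1, 20)})
ax = dot_boxplot(data)
```

A larger `jitter` spreads the dots wider; keep it well below 0.5 so neighbouring boxes do not overlap.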

This will be possible with seaborn version 0.6 (currently in the master branch on github) using the stripplot function. Here's an example:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.boxplot(x="day", y="total_bill", data=tips)
sns.stripplot(x="day", y="total_bill", data=tips,
              size=4, jitter=True, edgecolor="gray")

Related

How to plot colors for two variables in scatterplot in python?

I have a dataset with two different variables, and I want to give each of them a different color. Can anyone help, please? Link to my dataset: "https://github.com/mayuripandey/Data-Analysis/blob/main/word.csv"
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x = df['Friends Network-metrics'], y = df['Number of Followers'],cmap = "magma")
plt.xlabel("Friends Network-metrics")
plt.ylabel("Number of Followers")
plt.show()
Not very clear what you want to do here. But I'll provide a solution that may help you a bit.
Could use seaborn to implement the colors on the variables. Otherwise, you'd need to iterate through the points to set the color. Or create a new column that conditionally inputs a color for a value.
I don't know what your variable is, but you just want to put that in for the hue parameter:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.read_csv('https://raw.githubusercontent.com/mayuripandey/Data-Analysis/main/word.csv')
# Use the 'hue' argument to provide a factor variable
sns.lmplot(x='Friends Network-metrics',
           y='Number of Followers',
           height=8,
           aspect=.8,
           data=df,
           fit_reg=False,
           hue='Sentiment',
           legend=True)
plt.xlabel("Friends Network-metrics")
plt.ylabel("Number of Followers")
plt.show()
This can give you a view like this:
If you were looking for color scale for one of the variables though, you would do the below. However, the max value is so big that the range also doesn't make it really an effective visual:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mayuripandey/Data-Analysis/main/word.csv')
fig, ax = plt.subplots(figsize=(10, 6))
g = ax.scatter(x = df['Friends Network-metrics'],
               y = df['Number of Followers'],
               c = df['Friends Network-metrics'],
               cmap = "magma")
fig.colorbar(g)
plt.xlabel("Friends Network-metrics")
plt.ylabel("Number of Followers")
plt.show()
So you could adjust the scale (I'd also add edgecolors = 'black', as it's hard to see the light points):
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mayuripandey/Data-Analysis/main/word.csv')
fig, ax = plt.subplots(figsize=(10, 6))
g = ax.scatter(x = df['Friends Network-metrics'],
               y = df['Number of Followers'],
               c = df['Friends Network-metrics'],
               cmap = "magma",
               vmin=0, vmax=10000,
               edgecolors = 'black')
fig.colorbar(g)
plt.xlabel("Friends Network-metrics")
plt.ylabel("Number of Followers")
plt.show()
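The "new column that conditionally inputs a color" option mentioned earlier could look roughly like this with plain matplotlib. I don't have the CSV at hand here, so the frame below is a synthetic stand-in: the column names follow the question, but the values, the sentiment labels and the `palette` mapping are all made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the linked CSV (values and labels are assumptions).
df = pd.DataFrame({
    "Friends Network-metrics": rng.integers(0, 5000, 60),
    "Number of Followers": rng.integers(0, 20000, 60),
    "Sentiment": rng.choice(["positive", "neutral", "negative"], 60),
})

# Map each category to a color and store the result in a new column.
palette = {"positive": "tab:green", "neutral": "tab:gray", "negative": "tab:red"}
df["color"] = df["Sentiment"].map(palette)

fig, ax = plt.subplots(figsize=(10, 6))
# One scatter call per category so that each gets its own legend entry.
for label, group in df.groupby("Sentiment"):
    ax.scatter(group["Friends Network-metrics"], group["Number of Followers"],
               c=group["color"].tolist(), edgecolors="black", label=label)
ax.set_xlabel("Friends Network-metrics")
ax.set_ylabel("Number of Followers")
ax.legend(title="Sentiment")
```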

How to create specific plots using Pandas and then store them as PNG files?

So I am trying to create histograms for each specific variable in my dataset and then save it as a PNG file.
My code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
x=combined_databook.groupby('x_1').hist()
x.figure.savefig("x.png")
I keep getting "AttributeError: 'Series' object has no attribute 'figure'"
Use matplotlib to create a figure and axis objects, then tell pandas which axes to plot on using the ax argument. Finally, use matplotlib (or the fig) to save the figure.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample Data (3 groups, normally distributed)
df = pd.DataFrame({'gp': np.random.choice(list('abc'), 1000),
                   'data': np.random.normal(0, 1, 1000)})
fig, ax = plt.subplots()
df.groupby('gp').hist(ax=ax, ec='k', grid=False, bins=20, alpha=0.5)
fig.savefig('your_fig.png', dpi=200)
Instead of using *.hist() I would use matplotlib.pyplot.hist().
Example :
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
y = [10, 20, 30, 40, 100, 200, 300, 400, 1000, 2000]
x = np.arange(10)
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(x, y, label='y = Values')
plt.title('my plot')
ax.legend()
# Save before plt.show() to avoid backend-dependent problems once the window is closed
fig.savefig('tada.png')
plt.show()
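If the goal is one PNG per group, the two suggestions above can be combined into a loop. A minimal sketch, assuming a grouping column `gp` and a value column `data` as in the sample data earlier; the temporary output directory and the `hist_<name>.png` file names are placeholders I picked.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"gp": rng.choice(list("abc"), 300),
                   "data": rng.normal(0, 1, 300)})

out_dir = tempfile.mkdtemp()  # placeholder location; use your own folder
saved = []
for name, group in df.groupby("gp"):
    fig, ax = plt.subplots()          # a fresh figure for every group
    ax.hist(group["data"], bins=20, ec="k")
    ax.set_title(f"gp = {name}")
    path = os.path.join(out_dir, f"hist_{name}.png")
    fig.savefig(path, dpi=200)
    plt.close(fig)                    # release the figure once it is on disk
    saved.append(path)
```

Because every figure is created explicitly, there is always a `fig` to call `savefig` on, which avoids the AttributeError from the question.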

How do I get the diagonal of sns.pairplot?

OK I am probably being thick, but how do I get just the graphs in the diagonal (top left to bottom right) in a nice row or 2x2 grid of:
import seaborn as sns; sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species", palette="husl")
TO CLARIFY: I just want these graphs I do not care whether pairplot or something else is used.
Doing this the seaborn-way would make use of a FacetGrid. For this we would need to convert the wide-form input to a long-form dataframe, such that every observation is a single row. This is done via pandas.melt.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
df = pd.melt(iris, iris.columns[-1], iris.columns[:-1])
g = sns.FacetGrid(df, col="variable", hue="species", col_wrap=2)
g.map(sns.kdeplot, "value", shade=True)
plt.show()
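In case the wide-to-long step is unclear, here is what `pandas.melt` does on a tiny hand-made frame. The two rows are toy values in the style of the iris data, not the real dataset.

```python
import pandas as pd

# Wide form: one row per flower, one column per measurement.
wide = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "petal_length": [1.4, 4.7],
    "species": ["setosa", "versicolor"],
})

# Long form: one row per (flower, measurement) pair.
long = pd.melt(wide, id_vars="species",
               value_vars=["sepal_length", "petal_length"])
print(long)
```

The result has the columns `species`, `variable` and `value`, which is why the FacetGrid above can facet on `"variable"` and plot `"value"`.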
Why do you even want to do that? The diagonal of the pairplot gives you the distplot of each feature. It would be more effective to plot the individual distplots as subplots or to overlay them, e.g.:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import seaborn as sns
iris = load_iris()
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
# Split the dataframe by target class
target_0 = iris.loc[iris['target'] == 0]
target_1 = iris.loc[iris['target'] == 1]
target_2 = iris.loc[iris['target'] == 2]
sns.distplot(target_0[['sepal length (cm)']], hist=False, rug=True)
sns.distplot(target_1[['sepal length (cm)']], hist=False, rug=True)
sns.distplot(target_2[['sepal length (cm)']], hist=False, rug=True)
plt.show()  # sns.plt is not available in current seaborn; call pyplot directly
The output will be somewhat like this:
Read more here : python: distplot with multiple distributions
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
def hide_current_axis(*args, **kwds):
    plt.gca().set_visible(False)
g = sns.pairplot(iris, hue="species", palette="husl")
g.map_upper(hide_current_axis)
g.map_lower(hide_current_axis)
Output:
plt.subplots(2, 2)
for i, col in enumerate(iris.columns[:4]):
    plt.subplot(2, 2, i+1)
    sns.kdeplot(iris.loc[iris['species'] == 'setosa', col], shade=True, label='setosa')
    sns.kdeplot(iris.loc[iris['species'] == 'versicolor', col], shade=True, label='versicolor')
    sns.kdeplot(iris.loc[iris['species'] == 'virginica', col], shade=True, label='virginica')
    plt.xlabel('cm')
    plt.title(col)
    if i == 1:
        plt.legend(loc='upper right')
    else:
        plt.legend().remove()
plt.subplot_tool() # Opens a widget which allows adjusting plot aesthetics
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species", corner=True)

How to format seaborn/matplotlib axis tick labels from number to thousands or Millions? (125,436 to 125.4K)

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pandas as pd
sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(8, 5))
palette = sns.color_palette("bright", 6)
g = sns.scatterplot(ax=ax, x="Area", y="Rent/Sqft", hue="Region", marker='o', data=df, s=100, palette= palette)
g.legend(bbox_to_anchor=(1, 1), ncol=1)
g.set(xlim = (50000,250000))
How can I change the axis format from a plain number to a custom format? For example, 125000 to 125.00K
IIUC you can format the xticks and set these:
In[60]:
# generate some pseudo data
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': [50000, 75000, 100000, 125000], 'Rent/Sqft': np.random.randn(4), 'Region': list('abcd')})
df
Out[60]:
      num  Rent/Sqft Region
0   50000   0.109196      a
1   75000   0.566553      b
2  100000  -0.274064      c
3  125000  -0.636492      d
In[61]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pandas as pd
sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(8, 5))
palette = sns.color_palette("bright", 4)
g = sns.scatterplot(ax=ax, x="num", y="Rent/Sqft", hue="Region", marker='o', data=df, s=100, palette= palette)
g.legend(bbox_to_anchor=(1, 1), ncol=1)
g.set(xlim = (50000,250000))
xlabels = ['{:,.2f}'.format(x) + 'K' for x in g.get_xticks()/1000]
g.set_xticklabels(xlabels)
Out[61]:
The key bit here is this line:
xlabels = ['{:,.2f}'.format(x) + 'K' for x in g.get_xticks()/1000]
g.set_xticklabels(xlabels)
So this divides all the ticks by 1000 and then formats them and sets the xtick labels
UPDATE
Thanks to @ScottBoston who has suggested a better method:
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.2f}'.format(x/1000) + 'K'))
see the docs
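Since the title asks for thousands or millions, the lambda can be swapped for a small function that picks the suffix itself. A sketch only: `human_format` is a name I made up, and the one-decimal precision and the 1,000 / 1,000,000 cut-offs are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def human_format(x, pos=None):
    """Format a tick value with a K or M suffix (hypothetical helper)."""
    magnitude = abs(x)
    if magnitude >= 1_000_000:
        return f"{x / 1_000_000:.1f}M"
    if magnitude >= 1_000:
        return f"{x / 1_000:.1f}K"
    return f"{x:.0f}"


fig, ax = plt.subplots()
ax.plot([50_000, 125_436, 2_500_000], [1, 2, 3])
# FuncFormatter calls human_format(value, position) for every major tick.
ax.xaxis.set_major_formatter(FuncFormatter(human_format))

print(human_format(125_436))    # 125.4K
print(human_format(2_500_000))  # 2.5M
```

Because the formatter is attached to the axis, the labels stay correct when the limits change, unlike a fixed list of label strings.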
The canonical way of formatting the tick labels in the standard units is to use an EngFormatter. There is also an example in the matplotlib docs.
Also see Tick locating and formatting
Here it might look as follows.
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pandas as pd
df = pd.DataFrame({"xaxs" : np.random.randint(50000,250000, size=20),
                   "yaxs" : np.random.randint(7,15, size=20),
                   "col" : np.random.choice(list("ABC"), size=20)})
fig, ax = plt.subplots(figsize=(8, 5))
palette = sns.color_palette("bright", 6)
sns.scatterplot(ax=ax, x="xaxs", y="yaxs", hue="col", data=df,
                marker='o', s=100, palette="magma")
ax.legend(bbox_to_anchor=(1, 1), ncol=1)
ax.set(xlim = (50000,250000))
ax.xaxis.set_major_formatter(ticker.EngFormatter())
plt.show()
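To see what EngFormatter produces without drawing anything, its `format_eng` method can be called directly. A small sketch; `places=1` and `sep=""` are just the options I would try first to get close to the "125.4K" style. Note that the SI prefix for thousands is a lowercase "k", so use a FuncFormatter instead if you need a capital "K".

```python
from matplotlib.ticker import EngFormatter

default_fmt = EngFormatter()                 # separator is a space, digits as needed
compact_fmt = EngFormatter(places=1, sep="")  # one decimal, no space before the prefix

for value in (50_000, 125_436, 2_500_000):
    print(value, "->", default_fmt.format_eng(value), "|", compact_fmt.format_eng(value))
```

Either instance can be passed to `ax.xaxis.set_major_formatter(...)` exactly as in the example above.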
Using Seaborn without importing matplotlib:
import seaborn as sns
sns.set()
chart = sns.relplot(x="x_val", y="y_val", kind="line", data=my_data)
ticks = chart.axes[0][0].get_xticks()
xlabels = ['$' + '{:,.0f}'.format(x) for x in ticks]
chart.set_xticklabels(xlabels)
chart.fig
Thank you to EdChum's answer above for getting me 90% there.
Here's how I'm solving this: (similar to ScottBoston)
from matplotlib.ticker import FuncFormatter
f = lambda x, pos: f'{x/10**3:,.0f}K'
ax.xaxis.set_major_formatter(FuncFormatter(f))
We could also use the ax.get_xticklabels(), get_text() and ax.set_xticklabels() APIs to do it,
e.g.:
# Matplotlib writes negative ticks with a Unicode minus ('−'), which float() cannot parse
xlabels = ['{:.2f}k'.format(float(x.get_text().replace('−', '-')) / 1000) for x in g.get_xticklabels()]
g.set_xticklabels(xlabels)
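One caveat with reading the label text: depending on the matplotlib version, the text can still be empty until the figure has been drawn. A sketch that sidesteps this by building the labels from the numeric tick positions instead (plain matplotlib with made-up points; on a seaborn plot the returned Axes can be used the same way):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when plotting interactively
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([50_000, 125_000, 250_000], [1.0, 2.0, 1.5])

xlim = ax.get_xlim()                          # remember the current limits
ticks = ax.get_xticks()                       # numeric tick positions
labels = [f"{t / 1000:.2f}k" for t in ticks]  # e.g. 125000 -> '125.00k'
ax.set_xticks(ticks)                          # pin the positions before relabelling
ax.set_xticklabels(labels)
ax.set_xlim(xlim)                             # restore limits in case pinning widened them
```

Pinning the positions first keeps labels and ticks in sync; the labels are static, though, so a formatter is still the better choice if the limits change later.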

How To Plot Multiple Histograms On Same Plot With Seaborn

With matplotlib, I can make a histogram with two datasets on one plot (one next to the other, not overlay).
import matplotlib.pyplot as plt
import random
x = [random.randrange(100) for i in range(100)]
y = [random.randrange(100) for i in range(100)]
plt.hist([x, y])
plt.show()
This yields the following plot.
However, when I try to do this with seaborn:
import seaborn as sns
sns.distplot([x, y])
I get the following error:
ValueError: color kwarg must have one color per dataset
So then I try to add some color values:
sns.distplot([x, y], color=['r', 'b'])
And I get the same error. I saw this post on how to overlay graphs, but I would like these histograms to be side by side, not overlay.
And looking at the docs it doesn't specify how to include a list of lists as the first argument 'a'.
How can I achieve this style of histogram using seaborn?
If I understand you correctly, you may want to try something like this:
fig, ax = plt.subplots()
for a in [x, y]:
    sns.distplot(a, bins=range(1, 110, 10), ax=ax, kde=False)
ax.set_xlim([0, 100])
Which should yield a plot like this:
UPDATE:
Looks like you want 'seaborn look' rather than seaborn plotting functionality.
For this you only need to:
import seaborn as sns
plt.hist([x, y], color=['r','b'], alpha=0.5)
Which will produce:
UPDATE for seaborn v0.12+:
After seaborn v0.12 to get seaborn-styled plots you need to:
import seaborn as sns
sns.set_theme() # <-- This actually changes the look of plots.
plt.hist([x, y], color=['r','b'], alpha=0.5)
See seaborn docs for more information.
Merge x and y to DataFrame, then use histplot with multiple='dodge' and hue option:
import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
x = [random.randrange(100) for _ in range(100)]
y = [random.randrange(100) for _ in range(100)]
df = pd.concat(axis=0, ignore_index=True, objs=[
    pd.DataFrame.from_dict({'value': x, 'name': 'x'}),
    pd.DataFrame.from_dict({'value': y, 'name': 'y'})
])
fig, ax = plt.subplots()
sns.histplot(
    data=df, x='value', hue='name', multiple='dodge',
    bins=range(1, 110, 10), ax=ax
)
ax.set_xlim([0, 100])
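For anyone curious what the "dodge" layout amounts to, it can be reproduced by hand: count both datasets on the same bin edges and draw the bars next to each other inside each bin. This is only an illustration with numpy and matplotlib (the bin edges 0, 10, ..., 100 and the half-bin bar width are my choices), not how seaborn implements it internally.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 100, 100)
y = rng.integers(0, 100, 100)

edges = np.arange(0, 110, 10)               # shared bin edges: 0, 10, ..., 100
counts_x, _ = np.histogram(x, bins=edges)
counts_y, _ = np.histogram(y, bins=edges)

width = np.diff(edges) / 2                  # each dataset gets half of every bin
fig, ax = plt.subplots()
ax.bar(edges[:-1], counts_x, width=width, align="edge", label="x")
ax.bar(edges[:-1] + width, counts_y, width=width, align="edge", label="y")
ax.set_xlim(0, 100)
ax.legend()
```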
