Random empty spaces/bars in seaborn distribution plot - python

GOAL: I want to make a distribution function for registered dogs' ages in 2017 in Zurich from the 'Dogs of Zurich' dataset (Kaggle) (with Python). The variable I'm working with - 'GEBURTSJAHR_HUND' - gives the birth year for every registered dog as an int.
I have converted it to a 'dog_age' variable (= 2017 - birth_date) and want to plot the distribution function. See image below for sorted list of group size per age.
Size of dog age groups
PROBLEM: The problem I'm running into is that my distribution plot's x axis has empty spaces/bars in it. Every age is shown on the graph, but in between some of these ages are empty bars.
Example: 1 and 2 are full bars, but between them is an empty space. Between 2 and 3 there is no empty space, but between 3 and 4 there is. It seems random which values have white space between them.
What my problematic distribution plot looks like at the moment
TRIED: I have previously tried three things to fix this.
plt.xticks(...)
Unfortunately this only changed the aesthetics of the x axis.
Tried ax = sns.distplot followed by ax.xaxis ticker lines, but this did not have the expected result.
ax.xaxis.set_major_locator(ticker.MultipleLocator())
ax.xaxis.set_major_formatter(ticker.ScalarFormatter(0))
Maybe problem is with 'dog_age' variable?
Used the original birth_date variable, but this had the same problem.
CODE:
dfnew = pd.read_csv(dog17_filepath,index_col='HALTER_ID')
dfnew.dropna(subset = ["ALTER"], inplace=True)
dfnew['dog_age'] = 2017 - dfnew['GEBURTSJAHR_HUND']
b = dfnew['dog_age']
sns.set_style("darkgrid")
plt.figure(figsize=(15,5))
sns.distplot(a=b,hist=True)
plt.xticks(np.arange(min(b), max(b)+1, 1))
plt.xlabel('Age Dog', fontsize=12)
plt.title('Distribution of age of dogs', fontsize=20)
plt.show()
Thanks in advance,
Arthur

The problem is that the age column is discrete: it only contains a short range of integers. By default, the histogram divides the range of values (treated as floats) into a fixed number of equal-width bins, which usually don't align with those integers, so some bins contain no integer value and stay empty. To get an appropriate histogram, the bins need to be set explicitly, for example with a bin boundary at every half-integer.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
dfnew = pd.read_csv('hundehalter.csv')
dfnew.dropna(subset=["ALTER"], inplace=True)
dfnew['dog_age'] = 2017 - dfnew['GEBURTSJAHR_HUND']
b = dfnew['dog_age'][(dfnew['dog_age'] >= 0) & (dfnew['dog_age'] <= 25)]
sns.set_style("darkgrid")
plt.figure(figsize=(15, 5))
sns.distplot(a=b, hist=True, bins=np.arange(min(b)-0.5, max(b)+1, 1))
plt.xticks(np.arange(min(b), max(b) + 1, 1))
plt.xlabel('Age Dog', fontsize=12)
plt.title('Distribution of age of dogs', fontsize=20)
plt.xlim(min(b), max(b) + 1)
plt.show()
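To see why misaligned bins leave gaps with integer data, here is a minimal sketch that compares equal-width bins with half-integer bin edges using plain NumPy. The synthetic `ages` array and the choice of 50 bins are assumptions made for illustration; they are not taken from the Zurich dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(0, 20, size=5000)  # synthetic integer ages 0..19 (assumption)

# 50 equal-width bins over the data range: more bins than distinct integer
# values, so some bins cannot contain any value and stay empty.
auto_counts, auto_edges = np.histogram(ages, bins=50)

# One bin per integer: edges at -0.5, 0.5, ..., max + 0.5.
edges = np.arange(ages.min() - 0.5, ages.max() + 1.5, 1)
aligned_counts, _ = np.histogram(ages, bins=edges)

print("empty bins with 50 equal-width bins:", int((auto_counts == 0).sum()))
print("empty bins with half-integer edges: ", int((aligned_counts == 0).sum()))
```

With half-integer edges, each bin holds exactly one integer age, so the bar heights match a plain count per age.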

Related

Line plot doesn't show when used with a datetime variable on the x-axis

I'm trying to plot a bar graph that is accompanied by two line graphs. The barplot shows fine but I can't seem to get the lines plotted above the barplot. Here's the code:
fig, ax = plt.subplots(figsize=(18,9))
sns.set_style("darkgrid")
g=sns.barplot(date_new, df["Net xG (xG - Opponent's xG)"].astype("float"), palette="coolwarm_r", hue=df["Net xG (xG - Opponent's xG)"].replace({"-":0}).astype("float"), dodge=False, data=df)
plt.plot(date_new, -df["Opponent's xG"].astype("float"), color="gold", marker="o")
plt.plot(date_new, df["xG (Expected goals)"].astype("float"), color="indianred", marker="o")
g.set_xticklabels(stuff[::-1], rotation=90)
g.get_legend().remove()
g.set(xlim=([-0.8, 46]))
plt.show()
date_new variable used for the x-axis is in datetime64[ns] format. A weird thing I noticed is that if I reformat date_new as a string like date_new.astype("str"), the line plots show but the order is reversed.
I tried to "re-reverse" the order of which dates are sorted by by changing the x-axis variable to date_new[::-1], but that doesn't seem to change the line plots' order.
Here's a screenshot of how the x (Date) and y (xG) axis variables look on the dataframe:
You are trying to combine a bar graph with two line plots, and the x values of the two don't match. As @Henry Ecker said above, the x-axis labels on a bar plot are cosmetic and do not represent an actual datetime axis. Consequently, the x positions of your bars are simply the integers 0, 1, 2, ... (one per bar).
To fix your problem, plot the lines against those same integer positions, for example np.arange(46) for 46 bars.
I simulated your data and demonstrate the solution in the example below.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# create data
# there are 46 rows each representing a game against some other club
# columns include: date of game, opposing club, club goals, opposing club goals
# goal range is 0-4 (np.random.randint excludes the upper bound)
df = pd.DataFrame({
'date':pd.date_range(start='1/2021', end='7/1/2021', periods=46),
'club':['Team: ' + str(n) for n in range(1,47)],
'goals': np.random.randint(0, 5, 46),
'opposing_goals':np.random.randint(0, 5, 46)
})
df['net_goals'] = df.goals - df.opposing_goals
fig, ax = plt.subplots(figsize=(18,9))
sns.set_style("darkgrid")
g=sns.barplot(
x=df.date, y=df.net_goals,
palette="coolwarm_r", hue=df.net_goals, dodge=False, data=df
)
plt.plot(np.arange(0,46), -df.opposing_goals, color="gold", marker="o")
plt.plot(np.arange(0,46), df.goals, color="indianred", marker="o")
g.set_xticklabels(df.club, rotation=45)
g.get_legend().remove()
g.set(xlim=([-0.8, 46]))
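As a quick check of the claim that the bars sit at integer positions, the sketch below draws a small seaborn bar plot with datetime x values and inspects the tick positions. The five dates and values are made up for illustration, and the `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: five consecutive days with one value each.
dates = pd.date_range("2021-01-01", periods=5, freq="D")
values = np.array([3.0, -1.0, 2.0, 0.5, -2.0])

fig, ax = plt.subplots()
sns.barplot(x=dates, y=values, color="steelblue", ax=ax)

# The bars are placed at 0, 1, 2, ...; the dates are only tick labels.
tick_positions = ax.get_xticks()
print(tick_positions)

# Plot the line against the same integer positions so it lines up with the bars.
positions = np.arange(len(values))
ax.plot(positions, values, color="gold", marker="o")
```

Plotting the line against `dates` instead would place it at matplotlib's numeric date values, far away from the bars.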

Seaborn scatterplot size and jitter

I have the following code for a scatter graph
dimens = (12, 10)
fig, ax = plt.subplots(figsize=dimens)
sns.scatterplot(data = information, x = 'latitude', y = 'longitude', hue="genre", s=200,
x_jitter=4, y_jitter=4, ax=ax)
No matter what I change the jitter to, the plots still remain very close. Whats wrong with it?
Example dataframe:
store longitude latitude genre
mcdonalds 140.232323 40.434343 all
kfc 140.232323 40.434343 chicken
burgerking 138.434343 35.545433 burger
fiveguys 137.323984 36.543322 burger
In the help page, it writes:
{x,y}_jitter : booleans or floats. Currently non-functional.
You can either add a new column or do it on the fly:
import seaborn as sns
import pandas as pd
import numpy as np
information = pd.DataFrame({'store':['mcdonalds','kfc','burgerking','fiveguys'],
'longitude':[140.232323,140.232323,138.434343,137.323984],
'latitude':[40.434343,40.434343,35.545433,36.543322],
'genre':['all','chicken','burger','burger']})
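def jitter(values, j):
    # add zero-mean Gaussian noise with standard deviation j, so points are spread out but not shifted
    return values + np.random.normal(0, j, values.shape)
sns.scatterplot(x = jitter(information.latitude, 0.1),
                y = jitter(information.longitude, 0.1),
                hue=information.genre, s=200, alpha=0.5)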
The parameter s=200 sets the individual scatter points to a very large size.
Adding 4 points of jitter is very little compared to that.

Annotated heatmap with multiple color schemes

I have the following dataframe and would like to differentiate the minor decimal differences in each "step" with a different color scheme in a heatmap.
Sample data:
Sample Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
A 64.847 54.821 20.897 39.733 23.257 74.942 75.945
B 64.885 54.767 20.828 39.613 23.093 74.963 75.928
C 65.036 54.772 20.939 39.835 23.283 74.944 75.871
D 64.869 54.740 21.039 39.889 23.322 74.925 75.894
E 64.911 54.730 20.858 39.608 23.101 74.956 75.930
F 64.838 54.749 20.707 39.394 22.984 74.929 75.941
G 64.887 54.781 20.948 39.748 23.238 74.957 75.909
H 64.903 54.720 20.783 39.540 23.028 74.898 75.911
I 64.875 54.761 20.911 39.695 23.082 74.897 75.866
J 64.839 54.717 20.692 39.377 22.853 74.849 75.939
K 64.857 54.736 20.934 39.699 23.130 74.880 75.903
L 64.754 54.746 20.777 39.536 22.991 74.877 75.902
M 64.798 54.811 20.963 39.824 23.187 74.886 75.895
An example of what I am looking for:
My first approach would be based on a figure with multiple subplots. The number of plots would equal the number of columns in your dataframe; the gap between the plots could be shrunk down to zero:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
    if i>0:
        a.yaxis.set_ticks([])
Result:
Not sure if this will lead to a helpful or even self describing visualization of data, but that's your choice - perhaps this helps to start...
Supplemental:
Regarding adding the colorbars: of course you can. But - besides not knowing the background of your data and the purpose of the visualization - I'd like to add some thoughts on all that:
First: adding all those colorbars as a separate bunch of bars on one side or below the heatmap is probably possible, but I find it already quite hard to read the data, plus: you already have all those annotations - it would mess all up I think.
Additionally: in the meantime @ImportanceOfBeingErnest has provided such a beautiful solution on that topic that repeating it here would not be very meaningful, in my opinion.
Second: if you really want to stick to the heatmap thing, perhaps splitting up and giving every column its colorbar would suit better:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, figsize=(10, 3))
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c)
    if i>0:
        a.yaxis.set_ticks([])
f.tight_layout()
However, all that said - I dare to doubt that this is the best visualization for your data. Of course, I don't know what you want to say, see or find with these plots, but that's the point: if the visualization type would fit to the needs, I guess I'd know (or at least could imagine).
Just for example:
A simple df.plot() results in
and I feel that this tells more about different characteristics of your columns within some tenths of a second than the heatmap.
Or are you explicitly after the differences from each column's mean?
(df - df.mean()).plot()
... or the distribution of each column around them?
(df - df.mean()).boxplot()
What I want to say: data visualization becomes powerful when a plot begins to tell something about the underlying data before you begin (or have) to explain anything...
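A minimal sketch of the column-wise centering used in `(df - df.mean())` above, built from the first three rows and first two columns of the sample data in the question (the truncation to a small frame is only for brevity):

```python
import pandas as pd

# First three samples and first two steps from the question's sample data.
df = pd.DataFrame(
    {"Step 2": [64.847, 64.885, 65.036], "Step 3": [54.821, 54.767, 54.772]},
    index=["A", "B", "C"],
)

# df.mean() returns one mean per column; the subtraction aligns on column
# labels, so every column is centered around its own mean.
centered = df - df.mean()
print(centered.round(3))
```

After centering, all columns share a scale around zero, which is why the small per-column differences become visible in a single plot.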
I suppose the problem can be divided into several parts.
Getting several heatmaps with different colormaps into the same picture. This can be done by masking the complete array column-wise, plotting each masked array separately via imshow and applying a different colormap to each. To visualize the concept:
Obtaining a variable number of distinct colormaps. Matplotlib provides a large number of colormaps; however, they are in general very different concerning luminosity and saturation. Here it seems desirable to have colormaps of differing hue, but otherwise the same saturation and luminosity.
An option is to create the colormaps on the fly, choosing n different (and equally spaced) hues, and create a colormap using the same saturation and luminosity.
Obtaining a distinct colorbar for each column. Since the values within columns might be on totally different scales, a colorbar for each column would be needed to know the values shown, e.g. in the first column the brightest color may correspond to a value of 1, while in the second column it may correspond to a value of 100. Several colorbars can be created inside the axes of a GridSpec which is placed next to the actual heatmap axes. The number of columns and rows of that gridspec would depend on the number of columns in the dataframe.
In total this may then look as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
def get_hsvcmap(i, N, rot=0.):
    nsc = 24
    chsv = mcolors.rgb_to_hsv(plt.cm.hsv(((np.arange(N)/N)+rot) % 1.)[i,:3])
    rhsv = mcolors.rgb_to_hsv(plt.cm.Reds(np.linspace(.2,1,nsc))[:,:3])
    arhsv = np.tile(chsv,nsc).reshape(nsc,3)
    arhsv[:,1:] = rhsv[:,1:]
    rgb = mcolors.hsv_to_rgb(arhsv)
    return mcolors.LinearSegmentedColormap.from_list("",rgb)
def columnwise_heatmap(array, ax=None, **kw):
    ax = ax or plt.gca()
    premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)
    images = []
    for i in range(array.shape[1]):
        col = np.ma.array(array, mask = premask != i)
        im = ax.imshow(col, cmap=get_hsvcmap(i, array.shape[1], rot=0.5), **kw)
        images.append(im)
    return images
### Create some dataset
ind = list("ABCDEFGHIJKLM")
m = len(ind)
n = 8
df = pd.DataFrame(np.random.randn(m,n) + np.random.randint(20,70,n),
index=ind, columns=[f"Step {i}" for i in range(2,2+n)])
### Plot data
fig, ax = plt.subplots(figsize=(8,4.5))
ims = columnwise_heatmap(df.values, ax=ax, aspect="auto")
ax.set(xticks=np.arange(len(df.columns)), yticks=np.arange(len(df)),
xticklabels=df.columns, yticklabels=df.index)
ax.tick_params(bottom=False, top=False,
labelbottom=False, labeltop=True, left=False)
### Optionally add colorbars.
fig.subplots_adjust(left=0.06, right=0.65)
rows = 3
cols = len(df.columns) // rows + int(len(df.columns)%rows > 0)
gs = GridSpec(rows, cols)
gs.update(left=0.7, right=0.95, wspace=1, hspace=0.3)
for i, im in enumerate(ims):
    cax = fig.add_subplot(gs[i//cols, i % cols])
    fig.colorbar(im, cax = cax)
    cax.set_title(df.columns[i], fontsize=10)
plt.show()
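The column-wise masking step used in `columnwise_heatmap` can be looked at in isolation. The sketch below, using a made-up 3x4 array, builds one masked array per column so that each layer exposes only its own column; this is what allows each `imshow` call to normalize and color a single column independently.

```python
import numpy as np

# Made-up 3x4 array standing in for df.values.
data = np.arange(12, dtype=float).reshape(3, 4)

# col_index[r, c] == c for every row: the column number of each cell.
col_index = np.tile(np.arange(data.shape[1]), (data.shape[0], 1))

# One masked array per column: everything outside column i is masked out.
layers = [np.ma.array(data, mask=col_index != i) for i in range(data.shape[1])]

print(layers[1])  # only the second column is unmasked
```

Because masked cells are ignored when the color limits are computed, each layer's colormap spans only the value range of its own column.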

Get sample size for boxplots in seaborn factorplot

I'm looking to get the sample number to appear on each boxplot as I see here:
https://python-graph-gallery.com/38-show-number-of-observation-on-boxplot/
I'm able to get the median and counts in lists as the link above presents.
However, I have a factorplot with hue, such that the positions of the x-ticks don't seem to be captured on the x-axis.
Using the seaborn tips data set, I have the following:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_style("whitegrid")
tips = sns.load_dataset("tips")
g = sns.factorplot(x="sex", y="total_bill",hue="smoker",
col="time",data=tips, kind="box",size=4, aspect=.7)
# Calculate number of obs per group & median to position labels
medians = tips.groupby(['time','sex','smoker'])['total_bill'].median().values
nobs = tips.groupby(['time','sex','smoker']).size()
nobs = [str(x) for x in nobs.tolist()]
nobs = ["n: " + i for i in nobs]
plt.show()
Here is the plot
I'd like to get the "n: [# of observations]" right above the median, and I'm wondering if there's a way to get that x-tick. Also, assume some groups don't always have both male and female so it can't just be hard coded.
There are several tricky things going on here:
You have two subaxes, one for each main plot. You need to iterate through these.
You have multiple x-offset boxplots on each axis. You need to account for this.
Once you know where you're drawing, you need to know which plot is being visualized there, since ordering ('Yes' first or 'No' first? 'Male' first or 'Female'?) isn't guaranteed.
Fortunately, if you keep your dataframe indexed (or, in this case, multi-indexed), you just need the text for the time, sex, and smoking to get to the correct value. These are all available with a little digging. The resulting code looks something like the following (note the changes to medians and nobs):
medians = tips.groupby(['time','sex','smoker'])['total_bill'].median()
nobs = tips.groupby(['time','sex','smoker']).apply(lambda x: 'n: {}'.format(len(x)))
for ax in plt.gcf().axes:
    ax_time = ax.get_title().partition(' = ')[-1]
    for tick, label in enumerate(ax.get_xticklabels()):
        ax_sex = label.get_text()
        for j, ax_smoker in enumerate(ax.get_legend_handles_labels()[1]):
            x_offset = (j - 0.5) * 2/5
            med_val = medians[ax_time, ax_sex, ax_smoker]
            num = nobs[ax_time, ax_sex, ax_smoker]
            ax.text(tick + x_offset, med_val + 0.1, num,
                    horizontalalignment='center', size='x-small', color='w', weight='semibold')
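The hard-coded `x_offset = (j - 0.5) * 2/5` above covers exactly two hue levels. A possible generalization is sketched below; `hue_offsets` is a hypothetical helper, and it assumes the boxes of one x category share a total width of 0.8 (seaborn's default `width`) and are dodged evenly.

```python
def hue_offsets(n_hue, total_width=0.8):
    """Return the x offset of each hue level's box relative to its tick.

    Hypothetical helper: assumes n_hue evenly dodged boxes that together
    span total_width, centered on the tick position.
    """
    box_width = total_width / n_hue
    return [(j - (n_hue - 1) / 2) * box_width for j in range(n_hue)]

print(hue_offsets(2))  # the two-level case used in the loop above
print(hue_offsets(3))
```

For two hue levels this reduces to the same values as `(j - 0.5) * 2/5`.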
To verify, here is the nobs series:
time    sex     smoker
Lunch   Male    Yes       n: 13
                No        n: 20
        Female  Yes       n: 10
                No        n: 25
Dinner  Male    Yes       n: 47
                No        n: 77
        Female  Yes       n: 23
                No        n: 29
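The multi-index lookup that the loop relies on can be tried on a small hand-made frame, which avoids downloading the tips dataset. The five rows below are invented for illustration. The sketch also shows how to check whether a group exists before labelling it, which matters when some combinations (for example a missing sex in one panel) have no observations.

```python
import pandas as pd

# Invented rows with the same column names as the tips dataset.
df = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner"],
    "sex": ["Male", "Male", "Female", "Male", "Male"],
    "smoker": ["Yes", "Yes", "No", "No", "No"],
    "total_bill": [10.0, 14.0, 20.0, 30.0, 50.0],
})

grouped = df.groupby(["time", "sex", "smoker"])["total_bill"]
medians = grouped.median()                 # Series indexed by (time, sex, smoker)
labels = grouped.size().map("n: {}".format)

key = ("Lunch", "Male", "Yes")
print(medians.loc[key], labels.loc[key])

# Groups without observations are simply absent from the index,
# so test membership before looking a label up.
missing = ("Dinner", "Female", "Yes")
print(missing in medians.index)
```

Inside the plotting loop, a membership test like this lets you skip combinations that have no box instead of raising a `KeyError`.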

How to create bar chart with secondary_y from dataframe

I want to create a bar chart of two series (say 'A' and 'B') contained in a Pandas dataframe. If I wanted to just plot them using a different y-axis, I can use secondary_y:
df = pd.DataFrame(np.random.uniform(size=10).reshape(5,2),columns=['A','B'])
df['A'] = df['A'] * 100
df.plot(secondary_y=['A'])
but if I want to create bar graphs, the equivalent command is ignored (it doesn't put different scales on the y-axis), so the bars from 'A' are so big that the bars from 'B' cannot be distinguished:
df.plot(kind='bar',secondary_y=['A'])
How can I do this in pandas directly? or how would you create such graph?
I'm using pandas 0.10.1 and matplotlib version 1.2.1.
Don't think pandas graphing supports this. Did some manual matplotlib code.. you can tweak it further
import pylab as pl
fig = pl.figure()
ax1 = pl.subplot(111,ylabel='A')
#ax2 = gcf().add_axes(ax1.get_position(), sharex=ax1, frameon=False, ylabel='axes2')
ax2 =ax1.twinx()
ax2.set_ylabel('B')
ax1.bar(df.index,df.A.values, width =0.4, color ='g', align = 'center')
ax2.bar(df.index,df.B.values, width = 0.4, color='r', align = 'edge')
ax1.legend(['A'], loc = 'upper left')
ax2.legend(['B'], loc = 'upper right')
fig.show()
I am sure there are ways to tweak it further: move the bars further apart, make one slightly transparent, etc.
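One way to move the bars apart, sketched here with plain matplotlib: shift each series by half a bar width around the integer x positions so the two sets sit side by side on their own y-axes. The data values are made up, and the `Agg` backend is used only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 'A' is roughly 100 times larger than 'B'.
df = pd.DataFrame({"A": [10, 40, 25, 60, 35], "B": [0.1, 0.4, 0.2, 0.9, 0.5]})

x = np.arange(len(df))
width = 0.4

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

# Shift each series by half a bar width so the bars do not overlap.
bars_a = ax1.bar(x - width / 2, df["A"], width=width, color="g")
bars_b = ax2.bar(x + width / 2, df["B"], width=width, color="r")

ax1.set_ylabel("A")
ax2.set_ylabel("B")
ax1.set_xticks(x)
ax1.legend([bars_a, bars_b], ["A", "B"], loc="upper left")
fig.savefig("twin_bars.png")
```

Each pair of bars is centered on its tick, with 'A' read against the left axis and 'B' against the right one.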
OK, I had the same problem recently, and even though it's an old question, I think I can give an answer in case someone else is losing their mind over this. Joop gave the basis of what to do, and it's easy when you only have (for example) two columns in your dataframe, but it becomes really nasty when you have a different number of columns for the two axes, because you need to play with the position argument of the pandas plot() function. In my example I use seaborn, but it's optional:
import pandas as pd
import seaborn as sns
import pylab as plt
import numpy as np
df1 = pd.DataFrame(np.array([[i*99 for i in range(11)]]).transpose(), columns = ["100"], index = [i for i in range(11)])
df2 = pd.DataFrame(np.array([[i for i in range(11)], [i*2 for i in range(11)]]).transpose(), columns = ["1", "2"], index = [i for i in range(11)])
fig, ax = plt.subplots()
ax2 = ax.twinx()
# we must define the length of each column.
df1_len = len(df1.columns.values)
df2_len = len(df2.columns.values)
column_width = 0.8 / (df1_len + df2_len)
# we calculate the position of each column in the plot. This value is based on the position definition :
# Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.plot.html
df1_posi = 0.5 + (df2_len/float(df1_len)) * 0.5
df2_posi = 0.5 - (df1_len/float(df2_len)) * 0.5
# In order to have nice color, I use the default color palette of seaborn
df1.plot(kind='bar', ax=ax, width=column_width*df1_len, color=sns.color_palette()[:df1_len], position=df1_posi)
df2.plot(kind='bar', ax=ax2, width=column_width*df2_len, color=sns.color_palette()[df1_len:df1_len+df2_len], position=df2_posi)
ax.legend(loc="upper left")
# Pandas add line at x = 0 for each dataframe.
ax.lines[0].set_visible(False)
ax2.lines[0].set_visible(False)
# Specific to seaborn: remove the background grid lines of the second axis
ax2.grid(False, axis='both')
# We need to add some space, the xlim don't manage the new positions
column_length = (ax2.get_xlim()[1] - abs(ax2.get_xlim()[0])) / float(len(df1.index))
ax2.set_xlim([ax2.get_xlim()[0] - column_length, ax2.get_xlim()[1] + column_length])
fig.patch.set_facecolor('white')
plt.show()
And the result : http://i.stack.imgur.com/LZjK8.png
I didn't test every possibility, but it seems to work fine whatever the number of columns in each dataframe.
