How to improve this seaborn countplot? - python

I used the following code to generate the countplot in python using seaborn:
sns.countplot( x='Genres', data=gn_s)
But I got the following output:
I can't see the items on x-axis clearly as they are overlapping. How can I correct that?
Also I would like all the items to be arranged in a decreasing order of count. How can I achieve that?

You can rotate the x-axis tick labels so that they are vertical, for example:
g = sns.countplot( x='Genres', data=gn_s)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
Or, you can also do:
plt.xticks(rotation=90)

Bring in matplotlib to set up an axis ahead of time, so that you can modify the axis tick labels by rotating them 90 degrees and/or changing font size. To arrange your samples in order, you need to modify the source. I assume you're starting with a pandas dataframe, so something like:
data = data.sort_values(by='Genres', ascending=False)
labels = # list of labels in the correct order, probably your data.index
fig, ax1 = plt.subplots(1,1)
sns.countplot( x='Genres', data=gn_s, ax=ax1)
ax1.set_xticklabels(labels, rotation=90)
would probably help.
edit Taking andrewnagyeb's suggestion from the comments to order the plot:
sns.countplot( x='Genres', data=gn_s, order = gn_s['Genres'].value_counts().index)
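To see why that order argument gives a decreasing arrangement, here is a minimal sketch using only pandas. The small gn_s frame is a made-up stand-in for the asker's data, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the question's gn_s dataframe
gn_s = pd.DataFrame({"Genres": ["Drama", "Comedy", "Drama", "Action",
                                "Drama", "Comedy"]})

# value_counts() sorts by count in descending order by default,
# so its index lists the categories from most to least frequent
order = gn_s["Genres"].value_counts().index.tolist()
print(order)
```

The resulting list is what gets handed to the order parameter of sns.countplot, so the bars follow the same sequence.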

Related

Seaborn jointplot axis on log scale with kind="hex"

I'd like to show the chart below, but with the x-axis on a log scale.
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('XY'))
sns.jointplot(data=df,x='X',y='Y',height=3,kind='hex')
To be clear, I don't want to log X first, rather I want the numbers to stay the same but the distance between the axis ticks to change. In altair, it would look like the following (I can't do hex in altair, although please correct me if I'm wrong on that):
EDIT: Matt suggested adding xscale="log". That gets me very nearly there. I just need a way to go from powers of ten to normal integers on the tick labels.
You can use the xscale="log" keyword argument, which gets passed to the Matplotlib hexbin function that is used under-the-hood by seaborn. E.g.,
sns.jointplot(data=df, x='X', y='Y', height=3, kind='hex', xscale="log")
As stated in the comments, there are various ways to set the axis tick labels to not be in scientific format. The simplest is to do:
import matplotlib.ticker as mticker
grid = sns.jointplot(data=df, x='X', y='Y', height=3, kind='hex', xscale="log")
grid.ax_joint.xaxis.set_major_formatter(mticker.ScalarFormatter())
If instead you want, e.g., 1000 to be formatted with a comma so that it is 1,000, you could instead do:
grid.ax_joint.xaxis.set_major_formatter(mticker.StrMethodFormatter("{x:,.0f}"))
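The formatter step can also be tried without seaborn. The sketch below is an assumption-laden stand-in: it calls matplotlib's own hexbin directly (the function the answer says seaborn uses under the hood) on made-up random data, and then applies the same StrMethodFormatter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np

rng = np.random.default_rng(0)
# Made-up data; values start at 1 because a log axis cannot show zero
x = rng.integers(1, 10_000, size=500)
y = rng.integers(1, 100, size=500)

fig, ax = plt.subplots()
ax.hexbin(x, y, xscale="log", gridsize=20)

# Same format string as in the answer: thousands separator, no decimals
with_commas = mticker.StrMethodFormatter("{x:,.0f}")
ax.xaxis.set_major_formatter(with_commas)

# A formatter is callable, so a single label can be checked directly
label_1000 = with_commas(1000)
print(label_1000)
```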

Matplotlib - Skipping xticks while maintaining correct x value

I'm trying to plot two separate things from two pandas dataframes, but the x-axis is giving me some issues. When using matplotlib.ticker to skip x-ticks, the dates don't get skipped. The result is that the x-axis values don't match up with what is plotted.
For example, when the x-ticks are set to a base of 2, you'll see that the dates are going up by 1.
But the graph has the same spacing when the base is set to 4, which you can see here:
For the second image, the goal is for the days to increase by 4 each tick, so it should read 22, 26, 30, etc.
Here is the code that I'm working with:
ax = plot2[['Date','change value']].plot(x='Date',color='red',alpha=1,linewidth=1.5)
plt.ylabel('Total Change')
plot_df[['Date','share change daily']].plot(x='Date',secondary_y=True,kind='bar',ax=ax,alpha=0.4,color='black',figsize=(6,2),label='Daily Change')
plt.ylabel('Daily Change')
ax.legend(['Total Change (L)','Daily Change'])
plt.xticks(plot_df.index,plot_df['Date'].values)
myLocator = mticker.MultipleLocator(base=4)
ax.xaxis.set_major_locator(myLocator)
Any help is appreciated! Thanks :)
First off, I suggest you set the date as the index of your dataframe. This lets pandas automatically format the date labels nicely when you create line plots and it lets you conveniently create a custom format with the strftime method.
This second point is relevant to this example, seeing as plotting a bar plot over a line plot prevents you from getting the pandas line plot date labels because the x-axis units switch to integer units starting at 0 (note that this is also the case when you use the dates as strings instead of datetime objects, aka timestamp objects in pandas). You can check this for yourself by running ax.get_xticks() after creating the line plot (with a DatetimeIndex) and again after creating the bar plot.
There are too many peculiarities regarding the tick locators and formatters, the pandas plotting defaults, and the various ways in which you could define your custom ticks and tick labels for me to go into more detail here. So let me suggest you refer to the documentation for more information (though for your case you don't really need any of this): Major and minor ticks, Date tick labels, Custom tick formatter for time series, more examples using ticks, and the ticker module which contains the list of tick locators and formatters and their parameters.
Furthermore, you can identify the default tick locators and formatters used by the plotting functions with ax.get_xaxis().get_major_locator() or ax.get_xaxis().get_major_formatter() (you can do the same for the y-axis, and for minor ticks) to get an idea of what is happening under the hood.
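As a rough illustration of that check, the sketch below builds a small made-up dataframe (not the asker's data) and compares the x ticks and the major locator of a pandas line plot with those of a pandas bar plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with a business-day DatetimeIndex
dti = pd.bdate_range(start="2020-07-22", periods=10)
df = pd.DataFrame({"value": np.arange(10.0)}, index=dti)

# Line plot with a DatetimeIndex: pandas picks date-aware ticks
ax_line = df.plot(y="value")
line_ticks = ax_line.get_xticks()
line_locator = type(ax_line.xaxis.get_major_locator()).__name__

# Bar plot: one tick per bar, at integer positions starting at 0
ax_bar = df.plot(kind="bar", y="value")
bar_ticks = ax_bar.get_xticks()
bar_locator = type(ax_bar.xaxis.get_major_locator()).__name__

print(line_locator, line_ticks)
print(bar_locator, bar_ticks)
```

The exact locator classes and line-plot tick values depend on the pandas and matplotlib versions, which is why they are printed rather than stated here.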
On to solving your problem. Seeing as you want a fixed frequency of ticks for a predefined range of dates, I suggest that you avoid explicitly selecting a ticker locator and formatter and that instead you simply create the list of ticks and tick labels you want. First, here is some sample data similar to yours:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
rng = np.random.default_rng(seed=1) # random number generator
dti = pd.bdate_range(start='2020-07-22', end='2020-09-03')
daily = rng.normal(loc=0, scale=250, size=dti.size)
total = -1900 + np.cumsum(daily)
df = pd.DataFrame({'Daily Change': daily,
                   'Total Change': total},
                  index=dti)
df.head()
            Daily Change  Total Change
2020-07-22     86.396048  -1813.603952
2020-07-23    205.404536  -1608.199416
2020-07-24     82.609269  -1525.590147
2020-07-27   -325.789308  -1851.379455
2020-07-28    226.338967  -1625.040488
The date is set as the index, which will simplify the code for creating the plots (no need to specify x). I use the same formatting arguments as in the example you gave, except for the figure size. Note that for setting the ticks and tick labels I do not use plt.xticks because this refers to the secondary Axes containing the bar plot and for some reason, the rotation and ha arguments get ignored.
label_daily, label_total = df.columns
# Create pandas line plot: note the 'use_index' parameter
ax = df.plot(y=label_total, color='red', alpha=1, linewidth=1.5,
             use_index=False, ylabel=label_total)
# Create pandas bar plot: note that the second ylabel must be created
# after, else it overwrites the previous label on the left
df.plot(kind='bar', y=label_daily, color='black', alpha=0.4,
        ax=ax, secondary_y=True, mark_right=False, figsize=(9, 4))
plt.ylabel(label_daily, labelpad=10)
# Place legend in a better location: note that because there are two
# Axes, the combined legend can only be edited with the fig.legend
# method, and the ax legend must be removed
ax.legend().remove()
plt.gcf().legend(loc=(0.11, 0.15))
# Create custom x ticks and tick labels
freq = 4 # business days
xticks = ax.get_xticks()
xticklabels = df.index[::freq].strftime('%b-%d')
ax.set_xticks(xticks[::freq])
ax.set_xticks(xticks, minor=True)
ax.set_xticklabels(xticklabels, rotation=0, ha='center')
plt.show()
The codes for formatting the dates can be found here.
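The label slicing can be checked on its own, without drawing anything. This small sketch reuses the date range from the sample data and assumes an English locale for the month abbreviations.

```python
import pandas as pd

dti = pd.bdate_range(start="2020-07-22", end="2020-09-03")
freq = 4  # keep every 4th business day

# %b is the abbreviated month name, %d the zero-padded day of the month
labels = dti[::freq].strftime("%b-%d")
print(list(labels[:3]))
```

Because weekends are skipped in a business-day range, consecutive labels are 4 business days apart rather than 4 calendar days.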
For the sake of completeness, here are two alternative ways of creating exactly the same ticks but this time by making explicit use of matplotlib tick locators and formatters.
This first alternative uses lists of ticks and tick labels like before, but this time passing them to FixedLocator and FixedFormatter:
import matplotlib.ticker as mticker
# Create custom x ticks and tick labels
freq = 4 # business days
maj_locator = mticker.FixedLocator(ax.get_xticks()[::freq])
min_locator = mticker.FixedLocator(ax.get_xticks())
ax.xaxis.set_major_locator(maj_locator)
ax.xaxis.set_minor_locator(min_locator)
maj_formatter = mticker.FixedFormatter(df.index[maj_locator.locs].strftime('%b-%d'))
ax.xaxis.set_major_formatter(maj_formatter)
plt.setp(ax.get_xticklabels(), rotation=0, ha='center')
This second alternative makes use of the option to create a tick at every nth position of the index when using IndexLocator, combining it with FuncFormatter (instead of IndexFormatter which is deprecated):
import matplotlib.ticker as mticker
# Create custom x ticks and tick labels
maj_freq = 4 # business days
min_freq = 1 # business days
maj_locator = mticker.IndexLocator(maj_freq, 0)
min_locator = mticker.IndexLocator(min_freq, 0)
ax.xaxis.set_major_locator(maj_locator)
ax.xaxis.set_minor_locator(min_locator)
maj_formatter = mticker.FuncFormatter(lambda x, pos=None:
                                      df.index[int(x)].strftime('%b-%d'))
ax.xaxis.set_major_formatter(maj_formatter)
plt.setp(ax.get_xticklabels(), rotation=0, ha='center')
As you can see, both of these alternatives are more verbose than the initial example.

How to rescale the y-axis of a boxplot in python

I have a boxplot below (using seaborn) where the "box" part is too squashed. How do I change the scale along the y-axis so that the boxplot is more presentable while still keeping all the outliers in the plot?
Many thanks.
You can do two things here.
Make the plot bigger
Change the range of the y-axis
Since you want to keep the outliers, rescaling the y-axis may not be that effective. You haven't given any data or code examples. So I'll just add a way to make your figure bigger.
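# this script makes the figure bigger and rescales the y-axis
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset('tips')  # assumption: seaborn's bundled example dataset
fig = plt.figure(figsize=(20,15))
ax = sns.boxplot(x="day", y="total_bill", data=tips)
ax.set_ylim(0,100)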
You could set the axis after the plot:
import seaborn as sns
df = sns.load_dataset('iris')
a = sns.boxplot(y=df["sepal_length"])
a.set(ylim=(0,10))
Additionally, you could try dropping outliers from the plot passing showfliers = False in boxplot.
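To compare the options side by side, here is a hedged sketch that swaps in matplotlib's own boxplot (not seaborn) and made-up data with a few extreme outliers. The showfliers argument is also accepted by matplotlib's boxplot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: a tight bulk plus a few extreme outliers
values = np.concatenate([rng.normal(5, 1, 200), [40, 55, 70]])

fig, (ax_all, ax_zoom, ax_clean) = plt.subplots(1, 3, figsize=(12, 4))

ax_all.boxplot(values)        # default view: the box looks squashed

ax_zoom.boxplot(values)
ax_zoom.set_ylim(0, 10)       # zoom in: points above 10 fall outside the view

ax_clean.boxplot(values, showfliers=False)  # outliers are not drawn at all
```

Note the trade-off: both the zoomed and the showfliers=False panels stop showing the extreme points, so only a larger figure keeps every outlier visible.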

Using a Pandas dataframe index as values for x-axis in matplotlib plot

I have a time series in a pandas dataframe with a number of columns which I'd like to plot. Is there a way to set the x-axis to always use the index from a dataframe?
When I use the .plot() method from pandas the x-axis is formatted correctly; however, when I pass my dates and the column(s) I'd like to plot directly to matplotlib, the graph doesn't plot correctly. Thanks in advance.
plt.plot(site2.index.values, site2['Cl'])
plt.show()
FYI: site2.index.values produces this (I've cut out the middle part for brevity):
array([
'1987-07-25T12:30:00.000000000+0200',
'1987-07-25T16:30:00.000000000+0200',
'2010-08-13T02:00:00.000000000+0200',
'2010-08-31T02:00:00.000000000+0200',
'2010-09-15T02:00:00.000000000+0200'
],
dtype='datetime64[ns]')
It seems the issue was that I had .values. Without it (i.e. site2.index) the graph displays correctly.
You can use plt.xticks to set the x-axis
try:
plt.xticks( site2['Cl'], site2.index.values ) # location, labels
plt.plot( site2['Cl'] )
plt.show()
see the documentation for more details: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.xticks
That's built right into the plot() method.
You can use yourDataFrame.plot(use_index=True) to put the DataFrame index on the x-axis.
The use_index=True argument is what tells pandas to use the index for the x-axis values.
Read More Here: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.html
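A short sketch of both routes mentioned so far, using a made-up site2 frame whose timestamps are borrowed from the question (time zone dropped for simplicity, values invented):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for site2
idx = pd.to_datetime(["1987-07-25 12:30", "1987-07-25 16:30",
                      "2010-08-13 02:00", "2010-08-31 02:00",
                      "2010-09-15 02:00"])
site2 = pd.DataFrame({"Cl": [1.0, 2.0, 1.5, 3.0, 2.5]}, index=idx)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))

# pandas route: the index goes on the x-axis
site2["Cl"].plot(ax=ax1, use_index=True)

# matplotlib route: pass the index itself rather than index.values
ax2.plot(site2.index, site2["Cl"])
```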
If, like me, you want matplotlib to select a 'sensible' set of tick positions, here is one way to use a pandas dataframe index as the x-axis labels in a matplotlib plot. Code:
fig, ax = plt.subplots()
ax.plot(site2['Cl'].values) # plot against integer positions 0..n-1
x_ticks = [int(x) for x in ax.get_xticks() if 0 <= x < len(site2)] # keep matplotlib's default ticks that fall inside the data
ax.set_xticks(x_ticks)
ax.set_xticklabels(site2.index[x_ticks].astype(str)) # label each tick with the index value at that position

're-sort' / adapt ticks of matshow matrix plot

I tried hard, but I'm stuck with matplotlib here. Please bear with me, as I find the mpl docs a bit confusing. My question concerns the following:
I draw a symmetrical n*n matrix D with matshow function. That works.
I want to do the same thing, just with different order of (the n) items in D
D = D[:,neworder]
D = D[neworder,:]
Now, how do I make the ticks reproduce this neworder, preferably using additionally MaxNLocator?
As far as I understand...
set_xticklabels assigns labels to the ticks by order, independently of where the ticks are set?!
set_xticks (mpl docs: 'Set the x ticks with list of ticks') here I'm really not sure what it does. Can somebody explain it precisely? I don't know, whether these functions are helpful in my case at all. Maybe even things are different between using a common xy plot and matshow.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.gca()
D = np.arange(100).reshape(10,10)
neworder = np.arange(10)
np.random.shuffle(neworder)
D = D[:,neworder]
D = D[neworder, :]
# modify ticks somehow...
ax.matshow(D)
plt.show()
Referring to Paul's answer, I think I tried something like this. Using neworder to define both the positions and the labels, I added plt.xticks(neworder, neworder) as the tick modifier. For example, with neworder = [9 8 4 7 2 6 3 0 1 5], what I get is this:
The order of the labels is correct, but the ticks are not. The labels should show the correct element independently of where the ticks are set. So where is the mistake?
I think what you want to do is set the labels on the new plot to show the rearranged order of the values. Is that right? If so, you want to keep the tick locations the same, but change the labels:
plt.xticks(np.arange(0,10), neworder)
plt.yticks(np.arange(0,10), neworder)
Edit: Note that these commands must be issued after matshow. This seems to be a quirk of matshow (plot does not show this behaviour, for example). Perhaps it's related to this line from the plt.matshow documentation:
Because of how :func:matshow tries to set the figure aspect ratio to be the
one of the array, if you provide the number of an already
existing figure, strange things may happen.
Perhaps the safest way to go is to issue plt.matshow(D) without first creating a figure, then use plt.xticks and plt.yticks to make adjustments.
Your question also asks about the set_ticks and related axis methods. The same thing can be accomplished using those tools, again after issuing matshow:
ax = plt.gca()
ax.xaxis.set_ticks(np.arange(0,10)) # turn on all tick locations
ax.xaxis.set_ticklabels(neworder) # use neworder for labels
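Put together as one runnable sketch, assuming a seeded random permutation in place of the question's shuffle, the approach looks roughly like this. The key point is that the tick calls come after matshow.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
neworder = rng.permutation(10)       # assumed stand-in for the shuffled order

D = np.arange(100).reshape(10, 10)
D = D[:, neworder][neworder, :]      # reorder columns, then rows

fig, ax = plt.subplots()
ax.matshow(D)                        # draw first, then adjust the ticks

positions = np.arange(len(neworder)) # tick locations stay 0..9
labels = [str(v) for v in neworder]  # labels follow the new order
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_yticks(positions)
ax.set_yticklabels(labels)

fig.canvas.draw()
xlabels = [t.get_text() for t in ax.get_xticklabels()]
print(xlabels)
```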
Edit2: The next part of your question is related to setting a max number of ticks. 20 would require a new example. For our example I'll set the max no. of ticks at 2:
ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(nbins=3)) # one less tick than 'bin'
tl = ax.xaxis.get_ticklocs() # get current tick locations
tl[1:-1] = [neworder[int(idx)] for idx in tl[1:-1]] # find what the labels should be at those locs (tick locations are floats, so cast to int before indexing)
ax.xaxis.set_ticklabels(tl) # set the labels
plt.draw()
You are on the right track. The plt.xticks command is what you need.
You can specify the xtick locations and the label at each position with the following command.
labelPositions = np.arange(len(D))
newLabels = ['z','y','x','w','v','u','t','s','q','r']
plt.xticks(labelPositions,newLabels)
You could also specify an arbitrary order for labelPositions, as they will be assigned based on the values in the vector.
labelPositions = [0,9,1,8,2,7,3,6,4,5]
newLabels = ['z','y','x','w','v','u','t','s','q','r']
plt.xticks(labelPositions,newLabels)
