How do I show the proper count value in seaborn? - python

CH Gayle 17
YK Pathan 16
AB de Villiers 15
DA Warner 14
SK Raina 13
RG Sharma 13
MEK Hussey 12
AM Rahane 12
MS Dhoni 12
G Gambhir 12
I have a series like this. I want to plot the player on the x axis and their respective value on the y axis. I tried this code:
man_of_match=(matches['player_of_match'].value_counts())
sns.countplot(x=(man_of_match),data=matches,color='B')
sns.plt.show()
But this code plots the frequency of each numeric value: 12 appears on the x axis with a count of 4 on the y axis, and 13 appears with a count of 2.
How do I make the x axis show the player's name and the y axis show that player's value?

sns.countplot is meant to do the counting for you. You are counting yourself with value_counts and then plotting the counts of those counts. Pass the player_of_match column of matches directly to sns.countplot:
ax = sns.countplot(x=matches['player_of_match'], color='b')
plt.sca(ax)
plt.xticks(rotation=90);
If you want to limit it to the top 10 players, use value_counts as you did, but plot the result directly with the pandas bar plot (which draws with matplotlib):
ax = matches['player_of_match'].value_counts().head(10).plot.bar(width=.8, color='r')
ax.set_xlabel('player_of_match')
ax.set_ylabel('count')
You can get it to look a lot like the seaborn plot
kws = dict(width=.8, color=sns.color_palette('pastel'))
ax = matches['player_of_match'].value_counts().head(10).plot.bar(**kws)
ax.set_xlabel('player_of_match')
ax.set_ylabel('count')
ax.grid(False, axis='x')
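The "counts of counts" effect described at the start of this answer can be reproduced with pandas alone. The sketch below rebuilds the series from the question by hand (the dictionary is typed in from the listing, not loaded from the matches dataframe) and counts it a second time, which is effectively what countplot did in the question's code:

```python
import pandas as pd

# Values copied from the series shown in the question (player -> number of awards).
man_of_match = pd.Series(
    {"CH Gayle": 17, "YK Pathan": 16, "AB de Villiers": 15, "DA Warner": 14,
     "SK Raina": 13, "RG Sharma": 13, "MEK Hussey": 12, "AM Rahane": 12,
     "MS Dhoni": 12, "G Gambhir": 12}
)

# Counting the already-counted values: how many players share each total.
counts_of_counts = man_of_match.value_counts()
print(int(counts_of_counts.loc[12]), int(counts_of_counts.loc[13]))  # 4 2
```

The 4 and 2 are the bar heights the question reported for 12 and 13, which is why the raw column (or a bar plot of the counted series) is needed instead.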

Related

Python plotting numerical columns of dataframe in loop while dynamically changing xtick frequency

Context
I'm trying to produce plots across a dataframe for value_counts.
I'm unable to share the dataset I've used as it's work-related, so I have used another dataset below.
Blocker
There are 3 main issues:
This line "plt.xticks(np.arange(min(df_num[c]),max(df_num[c])+1, aaa));" causes a "ValueError: arange: cannot compute length".
The xticks overlap.
The xticks at times aren't at the frequency specified below.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# load dataset
df = sns.load_dataset('mpg')
# subset dataset
df_num = df.select_dtypes(['int64', 'float64'])
# Loop over columns - plots
for c in df_num.columns:
    fig = plt.figure(figsize= [10,5]);
    bins1 = df_num[c].nunique()+1
    # plot
    ax = df[c].plot(kind='hist', color='orange', bins=bins1, edgecolor='w');
    # dynamic xtick frequency
    if df_num[c].nunique() <=30:
        aaa = 1
    elif 30< df_num[c].nunique() <=50:
        aaa = 3
    elif 50< df_num[c].nunique() <=60:
        aaa = 6
    elif 60< df_num[c].nunique() <=70:
        aaa = 7
    elif 70< df_num[c].nunique() <=80:
        aaa = 8
    elif 80< df_num[c].nunique() <=90:
        aaa = 9
    elif 90< df_num[c].nunique() <=100:
        aaa = 10
    elif 90< df_num[c].nunique() <=100:
        aaa = 20
    else:
        aaa = 40
    # format plot
    plt.xticks(np.arange(min(df_num[c]),max(df_num[c])+1, aaa));
    ax.set_title(c)
@Cimbali
The ticks are at times at the bin edge and at other times partly inside a bin.
Is it possible to have one or the other?
TL;DR: define histogram bins and ticks based on the range of values and not the number of unique values.
Your histogram plots make some assumptions that might not hold, in particular that the unique values are evenly distributed. If that’s not the case (and in general it isn’t), then the range from min to max has little to do with the number of unique values (especially with floating point values, where unique values mean very little).
In particular, when you plot histograms, your bins (on the x-axis) correspond to the values (left). If you plot bars (right), you get one bar per unique value, but the bars are not positioned on the x-axis according to their values.
Here’s a simple example:
>>> s = pd.DataFrame([1, 1, 2, 5])
>>> s.plot(kind='hist')
>>> s.value_counts().plot(kind='bar')
You see there are only 3 unique values, but on the histogram (left) the x-axis and the bars span the whole range from min to max. If you only defined 3 bins, then 1 and 2 would be in the same bar.
The bar plot (right) has one bar per unique value, but then your x-axis is not proportional to the values anymore.
So instead, let’s define the number of bars and indexes from the range of values:
>>> df_range = df_num.max() - df_num.min()
>>> df_range
mpg 37.6
cylinders 5.0
displacement 387.0
horsepower 184.0
weight 3527.0
acceleration 16.8
model_year 12.0
dtype: float64
>>> df_bins = (df_range.transform(np.ceil) + 1).clip(upper=50).astype(int)
>>> df_bins
mpg 39
cylinders 6
displacement 50
horsepower 50
weight 50
acceleration 18
model_year 13
dtype: int64
Here’s an example of plotting using these numbers of bins:
>>> for col, n in df_bins.items():
...     fig = plt.figure(figsize=(10,5))
...     df[col].plot.hist(bins=n, title=col)
You can also define xticks in addition to bin sizes. Again, for histograms you have to take the range into account, not the number of unique values (so you could compute the ticks from the bins too). Note that your rules make for some pretty weird results, especially on very wide ranges:
>>> ticks = pd.Series(index=df_range.index, dtype=int)
>>> ticks[df_range <= 30] = 1
>>> ticks[(30 < df_range) & (df_range <= 50)] = 3
>>> ticks[(50 < df_range) & (df_range <= 100)] = np.floor(df_range.div(10)) + 1
>>> ticks[100 < df_range] = 40
>>> for col, n in df_bins.items():
...     fig = plt.figure(figsize=(10,5))
...     df[col].plot.hist(bins=n, title=col, xticks=np.arange(df[col].min(), df[col].max() + 1, ticks[col]))
Note that you could also use np.linspace to define the ticks from the min, max, and number (instead of min, max, and interval).
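As a rough illustration of the np.linspace variant, the sketch below builds a fixed number of ticks from the minimum and maximum of a column. The helper name linspace_ticks and the default of 10 ticks are made up for this example, and skipping NaN with np.nanmin/np.nanmax is an assumption about how missing values should be treated:

```python
import numpy as np

def linspace_ticks(values, n_ticks=10):
    """Return n_ticks evenly spaced tick positions from min to max of values."""
    values = np.asarray(values, dtype=float)
    # Assumption: missing values are ignored rather than propagated.
    lo, hi = np.nanmin(values), np.nanmax(values)
    return np.linspace(lo, hi, n_ticks)

# Hypothetical sample values, including one missing entry.
ticks = linspace_ticks([9.0, 18.5, np.nan, 46.6], n_ticks=5)
print(len(ticks), float(ticks[0]), float(ticks[-1]))  # 5 9.0 46.6
```

The result could be passed as xticks=linspace_ticks(df[col]) in the plotting loop. The number of ticks is then fixed, so labels do not pile up on wide ranges, at the cost of tick values that are not round numbers.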

Python- compress lower end of y-axis in contourf plot

The issue
I have a contourf plot I made with a pandas dataframe that plots some 2-dimensional value with time on the x-axis and vertical pressure level on the y-axis. The field, time, and pressure data I'm pulling are all from a netCDF file. I can plot it fine, but I'd like to scale the y-axis to better represent the real atmosphere. (The default scaling is linear, but the pressure levels in the file imply a different kind of scaling.) Basically, it should look something like the plot below on the y-axis. It's like a log scale, but compressing the bottom part of the axis instead of the top. (I don't know the term for this... like a log scale but inverted?) It doesn't need to be exact.
Working example (written in Jupyter notebook)
#modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker, colors
#data
time = np.arange(0,10)
lev = np.array([900,800,650,400,100])
df = pd.DataFrame(np.arange(50).reshape(5,10),index=lev,columns=time)
df.index.name = 'Level'
print(df)
        0   1   2   3   4   5   6   7   8   9
Level
900     0   1   2   3   4   5   6   7   8   9
800    10  11  12  13  14  15  16  17  18  19
650    20  21  22  23  24  25  26  27  28  29
400    30  31  32  33  34  35  36  37  38  39
100    40  41  42  43  44  45  46  47  48  49
#lists for plotting
levtick = np.arange(len(lev))
clevels = np.arange(0,55,5)
#Main plot
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.contourf(df,levels=clevels,cmap='RdBu_r')
#x-axis customization
plt.xticks(time)
ax.set_xticklabels(time)
ax.set_xlabel('Time')
#y-axis customization
plt.yticks(levtick)
ax.set_yticklabels(lev)
ax.set_ylabel('Pressure')
#title and colorbar
ax.set_title('Some mean time series')
cbar = plt.colorbar(im,values=clevels,pad=0.01)
tick_locator = ticker.MaxNLocator(nbins=11)
cbar.locator = tick_locator
cbar.update_ticks()
The Question
How can I scale the y-axis such that values near the bottom (900, 800) are compressed while values near the top (200) are expanded and given more plot space, like in the sample above my code? I tried using ax.set_yscale('function', functions=(forward, inverse)) but didn't understand how it works. I also tried simply ax.set_yscale('log'), but log isn't what I need.
You can use a custom scale transformation with ax.set_yscale('function', functions=(forward, inverse)) as you suggested. From the documentation:
forward and inverse are callables that return the scale transform
and its inverse.
In this case, define forward() as the function you want, such as the inverse of the log function (an exponential), or a more custom one for your needs. Call ax.set_yscale before your y-axis customization.
def forward(x):
    return 2**x
def inverse(x):
    return np.log2(x)
ax.set_yscale('function', functions=(forward,inverse))
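To see what this pair of functions does before wiring it into a plot, it can help to apply it to the y positions by hand. The sketch below assumes, as in the question's code, that contourf(df) is called without explicit coordinates, so the five pressure levels sit at row positions 0 to 4 (900 at position 0, 100 at position 4). The variable names positions, scaled and gaps are only for illustration:

```python
import numpy as np

def forward(x):
    return 2.0 ** x

def inverse(x):
    return np.log2(x)

# Row positions of the five levels (0 -> 900, ..., 4 -> 100).
positions = np.arange(5, dtype=float)

# Where each level ends up in axis space, and the space between neighbours.
scaled = forward(positions)
gaps = np.diff(scaled)
print(scaled.tolist())  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(gaps.tolist())    # [1.0, 2.0, 4.0, 8.0]

# A valid scale needs inverse(forward(x)) == x.
round_trip_ok = np.allclose(inverse(scaled), positions)
print(round_trip_ok)    # True
```

The gaps grow from bottom to top, so the lowest levels (900, 800) are squeezed together and the upper levels get more room. A smaller base than 2 would give a gentler stretch.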

how to visualize columns of a dataframe python as a plot?

I have a dataframe that looks like below:
DateTime ID Temperature
2019-03-01 18:36:01 3 21
2019-04-01 18:36:01 3 21
2019-18-01 08:30:01 2 18
2019-12-01 18:36:01 2 12
I would like to visualize this as a plot, with the datetime on the x-axis and Temperature on the y-axis, using the IDs as hue. I tried the code below, but I need to see the Temperature distribution for every point more clearly. Is there any other visualization technique?
x= df['DateTime'].values
y= df['Temperature'].values
hue=df['ID'].values
plt.scatter(x, y,hue,color = "red")
You can try:
df.set_index('DateTime').plot()
Or, to mark each point and enlarge the figure:
df.set_index('DateTime').plot(style="x-", figsize=(15, 10))
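If the goal is one line per ID (the "hue" the question asks for), one possible approach is to pivot the frame so that each ID becomes its own column before plotting. This is a sketch rather than part of the answer above: the frame is retyped from the question, and the year-day-month date format is an assumption inferred from the value 2019-18-01.

```python
import pandas as pd

# Sample rows retyped from the question.
df = pd.DataFrame({
    "DateTime": ["2019-03-01 18:36:01", "2019-04-01 18:36:01",
                 "2019-18-01 08:30:01", "2019-12-01 18:36:01"],
    "ID": [3, 3, 2, 2],
    "Temperature": [21, 21, 18, 12],
})

# Assumption: the strings are year-day-month, since "18" cannot be a month.
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%Y-%d-%m %H:%M:%S")

# One column per ID, indexed by time; missing combinations become NaN.
wide = df.pivot(index="DateTime", columns="ID", values="Temperature")
print(wide.columns.tolist())  # [2, 3]
```

Calling wide.plot(style="x-") would then draw a separately coloured line for each ID, with the markers showing the individual readings.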

Seaborn distplot only whole numbers

How can I make a distplot with seaborn to only have whole numbers?
My data is an array of numbers between 0 and ~18. I would like to plot the distribution of the numbers.
Impressions
0 210
1 1084
2 2559
3 4378
4 5500
5 5436
6 4525
7 3329
8 2078
9 1166
10 586
11 244
12 105
13 51
14 18
15 5
16 3
dtype: int64
Code I'm using:
sns.distplot(Impressions,
# bins=np.arange(Impressions.min(), Impressions.max() + 1),
# kde=False,
axlabel=False,
hist_kws={'edgecolor':'black', 'rwidth': 1})
plt.xticks = range(current.Impressions.min(), current.Impressions.max() + 1, 1)
Plot looks like this:
What I'm expecting:
The xlabels should be whole numbers
Bars should touch each other
The kde line should simply connect the top of the bars. By the looks of it, the current one assumes to have 0s between (x, x + 1), hence why the downward spike (This isn't required, I can turn off kde)
Am I using the correct tool for the job, or should distplot not be used for whole numbers?
Your problem can be solved with the code below:
import seaborn as sns # for data visualization
import numpy as np # for numeric computing
import matplotlib.pyplot as plt # for data visualization
arr = np.array([1,2,3,4,5,6,7,8,9])
sns.distplot(arr, bins = arr, kde = False)
plt.xticks(arr)
plt.show()
In this way, you can plot a histogram using the seaborn sns.distplot() function.
Note: Whatever data you pass to bins and plt.xticks() should be in ascending order.
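One detail to watch when passing whole numbers as bin edges: the last bin is closed on both sides, so with edges 1..9 the values 8 and 9 land in the same bar. A common workaround, sketched here with numpy only and a made-up sample (not the asker's data), is to put the edges halfway between the integers so that every integer gets its own bin:

```python
import numpy as np

# Hypothetical sample of whole numbers.
data = np.array([0, 1, 1, 2, 2, 2, 3, 3, 5])

# Edges at -0.5, 0.5, ..., max + 0.5: each integer sits in the middle of its own bin.
edges = np.arange(data.min(), data.max() + 2) - 0.5
counts, _ = np.histogram(data, bins=edges)
print(counts.tolist())  # [1, 2, 3, 2, 0, 1]
```

The same edges array can be passed as bins=edges to the plotting call, with plt.xticks(np.arange(data.min(), data.max() + 1)) to label only the whole numbers.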

Matplotlib showing x-tick labels overlapping

Have a look at the graph below:
It's a subplot of this larger figure:
I see two problems with it. First, the x-axis labels overlap with one another (this is my major issue). Second, the location of the x-axis minor gridlines seems a bit wonky. On the left of the graph, they look properly spaced. But on the right, they seem to be crowding the major gridlines, as if the major gridline locations aren't proper multiples of the minor tick locations.
My setup is that I have a DataFrame called df which has a DatetimeIndex on the rows and a column called value which contains floats. I can provide an example of the df contents in a gist if necessary. A dozen or so lines of df are at the bottom of this post for reference.
Here's the code that produces the figure:
now = dt.datetime.now()
fig, axes = plt.subplots(2, 2, figsize=(15, 8), dpi=200)
for i, d in enumerate([360, 30, 7, 1]):
    ax = axes.flatten()[i]
    earlycut = now - relativedelta(days=d)
    data = df.loc[df.index>=earlycut, :]
    ax.plot(data.index, data['value'])
    ax.xaxis_date()
    ax.get_xaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.get_yaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.grid(b=True, which='major', color='w', linewidth=1.5)
    ax.grid(b=True, which='minor', color='w', linewidth=0.75)
What is my best option here to get the x-axis labels to stop overlapping each other (in each of the four subplots)? Also, separately (but less urgently), what's up with the minor tick issue in the top-left subplot?
I am on Pandas 0.13.1, numpy 1.8.0, and matplotlib 1.4.x.
Here's a small snippet of df for reference:
id scale tempseries_id value
timestamp
2014-11-02 14:45:10.302204+00:00 7564 F 1 68.0000
2014-11-02 14:25:13.532391+00:00 7563 F 1 68.5616
2014-11-02 14:15:12.102229+00:00 7562 F 1 68.9000
2014-11-02 14:05:13.252371+00:00 7561 F 1 69.0116
2014-11-02 13:55:11.792191+00:00 7560 F 1 68.7866
2014-11-02 13:45:10.782227+00:00 7559 F 1 68.6750
2014-11-02 13:35:10.972248+00:00 7558 F 1 68.4500
2014-11-02 13:25:10.362213+00:00 7557 F 1 68.1116
2014-11-02 13:15:10.822247+00:00 7556 F 1 68.2250
2014-11-02 13:05:10.102200+00:00 7555 F 1 68.5616
2014-11-02 12:55:10.292217+00:00 7554 F 1 69.0116
2014-11-02 12:45:10.382226+00:00 7553 F 1 69.3500
2014-11-02 12:35:10.642245+00:00 7552 F 1 69.2366
2014-11-02 12:25:12.642255+00:00 7551 F 1 69.1250
2014-11-02 12:15:11.122382+00:00 7550 F 1 68.7866
2014-11-02 12:05:11.332224+00:00 7549 F 1 68.5616
2014-11-02 11:55:11.662311+00:00 7548 F 1 68.2250
2014-11-02 11:45:11.122193+00:00 7547 F 1 68.4500
2014-11-02 11:35:11.162271+00:00 7546 F 1 68.7866
2014-11-02 11:25:12.102211+00:00 7545 F 1 69.2366
2014-11-02 11:15:10.422226+00:00 7544 F 1 69.4616
2014-11-02 11:05:11.412216+00:00 7543 F 1 69.3500
2014-11-02 10:55:10.772212+00:00 7542 F 1 69.1250
2014-11-02 10:45:11.332220+00:00 7541 F 1 68.7866
2014-11-02 10:35:11.332232+00:00 7540 F 1 68.5616
2014-11-02 10:25:11.202411+00:00 7539 F 1 68.2250
2014-11-02 10:15:11.932326+00:00 7538 F 1 68.5616
2014-11-02 10:05:10.922229+00:00 7537 F 1 68.9000
2014-11-02 09:55:11.602357+00:00 7536 F 1 69.3500
Edit: Trying fig.autofmt_xdate():
I don't think this is going to do the trick. It seems to use the same x-tick labels for both graphs on the left and also for both graphs on the right, which is not correct given my data. Please see the problematic output below:
Ok, finally got it working. The trick was to use plt.setp to manually rotate the tick labels. Using fig.autofmt_xdate() did not work as it does some unexpected things when you have multiple subplots in your figure. Here's the working code with its output:
for i, d in enumerate([360, 30, 7, 1]):
    ax = axes.flatten()[i]
    earlycut = now - relativedelta(days=d)
    data = df.loc[df.index>=earlycut, :]
    ax.plot(data.index, data['value'])
    ax.get_xaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.get_yaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.grid(b=True, which='major', color='w', linewidth=1.5)
    ax.grid(b=True, which='minor', color='w', linewidth=0.75)
    plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
fig.tight_layout()
By the way, the comment earlier about some matplotlib things taking forever is very interesting here. I'm using a raspberry pi to act as a weather station at a remote location. It's collecting the data and serving the results via the web. And boy oh boy, it's really wheezing trying to put out these graphics.
Due to the way text rendering is handled in matplotlib, auto-detecting overlapping text really slows things down. (The space that text takes up can't be accurately calculated until after it's been drawn.) For that reason, matplotlib doesn't try to do this automatically.
Therefore, it's best to rotate long tick labels. Because dates most commonly have this problem, there's a figure method fig.autofmt_xdate() that will (among other things) rotate the tick labels to make them a bit more readable. (Note: If you're using a pandas plot method, it returns an axes object, so you'll need to use ax.figure.autofmt_xdate().)
As a quick example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
time = pd.date_range('01/01/2014', '4/01/2014', freq='H')
values = np.random.normal(0, 1, time.size).cumsum()
fig, ax = plt.subplots()
ax.plot_date(time, values, marker='', linestyle='-')
fig.autofmt_xdate()
plt.show()
If we were to leave fig.autofmt_xdate() out:
And if we use fig.autofmt_xdate():
For problems where the x axis holds strings rather than dates, you can insert a \n character into the x-axis values so they don't overlap. Here is an example:
The data frame is
somecol                  value
category 1 of column     16
category 2 of column     13
category 3 of column     21
category 4 of column     20
category 5 of column     11
category 6 of column     22
category 7 of column     19
category 8 of column     14
category 9 of column     18
category 10 of column    23
category 11 of column    10
category 12 of column    24
category 13 of column    17
category 14 of column    15
category 15 of column    12
I need to plot value on the y axis and somecol on the x axis, which would normally be plotted like this:
As you can see, there is a lot of overlap. Now introduce a \n character into the somecol column.
somecol = df['somecol'].values.tolist()
for i in range(len(somecol)):
    x = somecol[i].split(' ')
    # insert \n before 'of'
    x.insert(x.index('of'),'\n')
    somecol[i] = ' '.join(x)
Now if you plot, it will look like this -
plt.plot(somecol, df['value'])
This method works well if you don't want to rotate your labels.
The only drawback I have found so far is that you may need to tweak your labels 3-4 times, i.e., try several formats, before the plot looks right.
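When the labels do not share a convenient word to split on, the standard library's textwrap.fill can insert the line breaks based on a maximum width instead. This is an alternative sketch, not the method used above; the width of 12 characters is an arbitrary choice that happens to suit these labels:

```python
import textwrap

# Labels in the same shape as the somecol column above.
labels = [f"category {i} of column" for i in range(1, 16)]

# Break each label into lines of at most 12 characters, splitting on spaces.
wrapped = [textwrap.fill(label, width=12) for label in labels]
print(repr(wrapped[0]))  # 'category 1\nof column'
print(repr(wrapped[9]))  # 'category 10\nof column'
```

The wrapped list can be used in place of somecol in the plotting call. Changing width is then the only thing to adjust when the labels still collide.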
