How to annotate points in a scatterplot based on a pandas column - python

I want 'Age' on the x-axis, 'Pos' on the y-axis, and the 'Player' names as point labels, but for some reason I am not able to label the points.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import adjustText as at
data = pd.read_excel("path to the file")
fig, ax = plt.subplots()
fig.set_size_inches(7,3)
df = pd.DataFrame(data, columns = ['Player', 'Pos', 'Age'])
df.plot.scatter(x='Age',
y='Pos',
c='DarkBlue', xticks=([15,20,25,30,35,40]))
y = df.Player
texts = []
for i, txt in enumerate(y):
    plt.text()
at.adjust_text(texts, arrowprops=dict(arrowstyle="simple, head_width=0.25, tail_width=0.05", color='black', lw=0.5, alpha=0.5))
plt.show()
Summary of the data :
df.head()
Player Pos Age
0 Thibaut Courtois GK 28
1 Karim Benzema FW 32
2 Sergio Ramos DF 34
3 Raphael Varane DF 27
4 Luka Modric MF 35
Error :
ConversionError: Failed to convert value(s) to axis units: 'GK'
This is the plot so far; not able to label these points:
EDIT:
This is what I wanted but of all points:
Also, could anyone help me re-order the labels on the y-axis?
I wanted FW, MF, DF, GK as my order, but the plot shows MF, DF, FW, GK.
Thanks.

A similar solution was described here. Essentially, you want to annotate the points in your scatter plot.
I have stripped your code down. Note that you need to plot the data with matplotlib directly (ax.scatter) and not with pandas (df.plot.scatter). That way, you can use the ax.annotate() method.
import matplotlib.pyplot as plt
import pandas as pd
# build data
data = [
['Thibaut Courtois', 'GK', 28],
['Karim Benzema', 'FW', 32],
['Sergio Ramos','DF', 34],
['Raphael Varane', 'DF', 27],
['Luka Modric', 'MF', 35],
]
# create pandas DataFrame
df = pd.DataFrame(data, columns = ['Player', 'Pos', 'Age'])
# open figure + axis
fig, ax = plt.subplots()
# plot
ax.scatter(x=df['Age'],y=df['Pos'],c='DarkBlue')
# set labels
ax.set_xlabel('Age')
ax.set_ylabel('Pos')
# annotate points in axis
for idx, row in df.iterrows():
    ax.annotate(row['Player'], (row['Age'], row['Pos']))
# force matplotlib to draw the graph
plt.show()
This is what you'll get as output:
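The question also asked how to control the order of the positions on the y-axis. One possible approach, sketched below, is to map each position to an integer code with an ordered pandas Categorical and then label the ticks by hand. The `order` list and the `pos_codes` name are assumptions made for this illustration, and the sketch assumes that "FW, MF, DF, GK" was meant from top to bottom.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = [
    ['Thibaut Courtois', 'GK', 28],
    ['Karim Benzema', 'FW', 32],
    ['Sergio Ramos', 'DF', 34],
    ['Raphael Varane', 'DF', 27],
    ['Luka Modric', 'MF', 35],
]
df = pd.DataFrame(data, columns=['Player', 'Pos', 'Age'])

# Assumed order from the bottom of the y-axis to the top (FW ends up on top).
order = ['GK', 'DF', 'MF', 'FW']
# Integer code of each position within `order` (GK -> 0, ..., FW -> 3).
pos_codes = pd.Categorical(df['Pos'], categories=order, ordered=True).codes

fig, ax = plt.subplots()
ax.scatter(df['Age'], pos_codes, c='DarkBlue')

# Put one tick per position and label it with the position name.
ax.set_yticks(range(len(order)))
ax.set_yticklabels(order)
ax.set_xlabel('Age')
ax.set_ylabel('Pos')

# Annotate each point, nudging the text a few points away from the marker.
for player, age, code in zip(df['Player'], df['Age'], pos_codes):
    ax.annotate(player, (age, code), xytext=(3, 3), textcoords='offset points')

# plt.show()  # uncomment when running interactively
```

Because the y-values are now plain integers, the order of the categories no longer depends on the order in which they first appear in the data.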

Related

Lines to separate x-axis labels in matplotlib

I would like to separate my labels on the x-axis according to their category (female/male).
I have the following code:
import matplotlib.pyplot as plt
import pandas as pd
if __name__=='__main__':
    data = {
        'Ava': 30,
        'Charlotte': 32,
        'Sophia': 41,
        'Mark': 33,
        'William': 50
    }
    data_df = pd.DataFrame(data, index=[0]).T
    data_df.plot(kind='bar', rot=0, edgecolor='black', legend=False)
    plt.title('Age')
    plt.savefig('ages.png')
It produces the following plot
I want to add lines to group x-axis labels according to gender.
FEMALE = ['Ava', 'Charlotte', 'Sophia']
MALE = ['Mark', 'William']
And obtaining something like this
How can I do it?

Pandas hist subplots - adding colour bar for the colours of each histogram

I have the columns of a dataframe plotted as separate histogram subplots. For each subplot, I want the bars coloured according to the value in a separate list. I have managed this by making a cmap of it and manually cycling those colours, however, is there a way to add a colorbar to the side to show what values these colours belong to? This is what I have right now:
import pandas as pd
import matplotlib as mpl
from matplotlib.colors import rgb2hex
#reading in the data
df = pd.read_csv( "shortlist_temp.dat", sep='\t',header=(0), usecols=(range(1,13)))
#separate list of values
orig_star_teff = [4308.0, 5112.0, 4240.0, 4042.0, 4411.0, 4100.0, 4511.0, 4738.0, 4630.0, 4870.0, 4442.0, 4845.0]
#Colormapping the values. I did not like the result from the original values so I reduced by 4000.
orig_star_teff_norm = [i - 4000 for i in orig_star_teff]
orig_star_teff_norm = [float(i)/max(orig_star_teff_norm) for i in orig_star_teff_norm]
cmap = mpl.cm.plasma
color_list = cmap(orig_star_teff_norm)
color_list2 = [ rgb2hex(color_list[i,:]) for i in range(color_list.shape[0]) ]
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color = color_list2)
ax = df.plot.hist(subplots=True, bins = 12, legend=False, layout=(3, 4), figsize = (15,10), sharey = True)
ax[0,0].set_title('ABOO')
ax[0,1].set_title('EpsVIR')
ax[0,2].set_title('HIP 96014')
ax[0,3].set_title('2M16113361')
ax[1,0].set_title('KIC 3955590')
ax[1,1].set_title('KIC 5113061')
ax[1,2].set_title('KIC 5859492')
ax[1,3].set_title('KIC 6547007')
ax[2,0].set_title('KIC 11444313')
ax[2,1].set_title('KIC 11657684')
ax[2,2].set_title('HD102328-K3III')
ax[2,3].set_title('HD142091-K0III')
Resulting plot
Instead of doing all the normalization steps manually, it probably is easier to create a norm. In this case a norm that maps the values from 4000 till max to the range 0,1 needed for the colormap. Note that converting to hex isn't necessary.
With the norm and the colormap, a ScalarMappable can be created with all the necessary information for a colorbar:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# reading in the data
# df = pd.read_csv("shortlist_temp.dat", sep='\t', header=(0), usecols=(range(1, 13)))
# generating some dummy data
df = pd.DataFrame(np.random.randn(100, 12))
# separate list of values
orig_star_teff = [4308.0, 5112.0, 4240.0, 4042.0, 4411.0, 4100.0, 4511.0, 4738.0, 4630.0, 4870.0, 4442.0, 4845.0]
norm = plt.Normalize(4000, max(orig_star_teff))
cmap = mpl.cm.plasma
color_list = cmap(norm(orig_star_teff))
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=color_list)
axs = df.plot.hist(subplots=True, bins=12, legend=False, layout=(3, 4), figsize=(15, 10), sharey=True)
titles = ['ABOO', 'EpsVIR', 'HIP 96014', '2M16113361',
'KIC 3955590', 'KIC 5113061', 'KIC 5859492', 'KIC 6547007',
'KIC 11444313', 'KIC 11657684', 'HD102328-K3III', 'HD142091-K0III']
for ax, title in zip(axs.flat, titles):
    ax.set_title(title)
plt.colorbar(ScalarMappable(cmap=cmap, norm=norm), ax=axs[:, -1])
plt.show()
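To see what the norm does in isolation, here is a minimal sketch using three of the temperature values from the list above. The variable names `teff`, `scaled` and `rgba` are chosen for this illustration only.

```python
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

teff = [4308.0, 5112.0, 4042.0]  # a few values from orig_star_teff

# Map the interval [4000, max] linearly onto [0, 1].
norm = Normalize(vmin=4000, vmax=max(teff))
scaled = norm(teff)

# The colormap turns each normalized value into an RGBA colour.
cmap = plt.get_cmap('plasma')
rgba = cmap(scaled)

# The same norm and colormap are all a colorbar needs.
fig, ax = plt.subplots()
cbar = fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax)
```

Since the bar colours and the colorbar share one norm, the colours on the colorbar correspond to the original temperatures rather than to the rescaled values.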

How to combine two heatmaps in Seaborn in Python so both are shown in the same heatmap?

This is link to the data I'm using:
https://github.com/fivethirtyeight/data/tree/master/drug-use-by-age
I'm using Jupyter Lab, and here's the code:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv'
df = pd.read_csv(url, index_col = 0)
df.dtypes
df.replace('-', np.nan, inplace=True)
df = df.iloc[:,:].astype(float)
df = df.loc[:, df.columns != 'n']
#df.columns = df.columns.str.rstrip('-use')
df
fig, axes = plt.subplots(1,2, figsize=(20, 8))
fig.subplots_adjust(wspace=0.1)
fig.colorbar(ax.collections[0], ax=ax,location="right", use_gridspec=False, pad=0.2)
#plt.figure(figsize=(16, 16))
df_percentage = df.iloc[:,range(0,26,2)]
plot_precentage = sb.heatmap(df_percentage, cmap='Reds', ax=axes[0], cbar_kws={'format': '%.0f%%', 'label': '% used in past 12 months'})
df_frequency = df.iloc[:,range(1,27,2)]
plot_frequency = sb.heatmap(df_frequency, cmap='Blues', ax=axes[1], cbar_kws= dict(label = 'median frequency a user used'))
I can just show two of them in a subplot in separate diagrams.
I want to make it look like this (this is made in paint):
Also show the data side by side. Is there a simple way to achieve that?
A pretty simple solution using the mask option (with seaborn imported as sns):
mask = np.vstack([np.arange(df.shape[1])]* df.shape[0]) % 2
fig, axes = plt.subplots()
plot_precentage = sns.heatmap(df,mask=mask, cmap='Reds', ax=axes,
cbar_kws={'format': '%.0f%%',
'label': '% used in past 12 months'}
)
plot_frequency = sns.heatmap(df, mask=1-mask, cmap='Blues', ax=axes,
cbar_kws= dict(label = 'median frequency a user used')
)
Output:
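The mask is what makes the two heatmaps interleave: seaborn does not draw cells where the mask is true, so the first call shows only the even-numbered columns and the second call, with 1-mask, shows only the odd-numbered ones. A small sketch of the mask on a hypothetical 3 x 6 shape (the shape and the `mask_tiled` alternative are assumptions for this illustration):

```python
import numpy as np

n_rows, n_cols = 3, 6  # hypothetical small shape standing in for df.shape

# Same construction as in the answer: 0 on even columns, 1 on odd columns.
mask = np.vstack([np.arange(n_cols)] * n_rows) % 2

# An equivalent way to build the same array with np.tile.
mask_tiled = np.tile(np.arange(n_cols) % 2, (n_rows, 1))

print(mask)      # hides the odd columns
print(1 - mask)  # hides the even columns
```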

Matplotlib: custom ticker for pandas MultiIndex DataFrame

I have a large pandas MultiIndex DataFrame that I would like to plot. A minimal example would look like:
import pandas as pd
years = range(2015, 2018)
fields = range(4)
days = range(4)
bands = ['R', 'G', 'B']
index = pd.MultiIndex.from_product(
[years, fields], names=['year', 'field'])
columns = pd.MultiIndex.from_product(
[days, bands], names=['day', 'band'])
df = pd.DataFrame(0, index=index, columns=columns)
df.loc[(2015,), (0,)] = 1
df.loc[(2016,), (1,)] = 1
df.loc[(2017,), (2,)] = 1
If I plot this using plt.spy, I get:
However, the tick locations and labels are less than desirable. I would like the ticks to completely ignore the second level of the MultiIndex. Using IndexLocator and IndexFormatter, I'm able to do the following:
from matplotlib.ticker import IndexFormatter, IndexLocator
import matplotlib.pyplot as plt
ax = plt.gca()
plt.spy(df)
xbase = len(bands)
xoffset = xbase / 2
xlabels = df.columns.get_level_values('day')
ax.xaxis.set_major_locator(IndexLocator(base=xbase, offset=xoffset))
ax.xaxis.set_major_formatter(IndexFormatter(xlabels))
plt.xlabel('Day')
ax.xaxis.tick_bottom()
ybase = len(fields)
yoffset = ybase / 2
ylabels = df.index.get_level_values('year')
ax.yaxis.set_major_locator(IndexLocator(base=ybase, offset=yoffset))
ax.yaxis.set_major_formatter(IndexFormatter(ylabels))
plt.ylabel('Year')
plt.show()
This gives me exactly what I want:
But here's the problem. My actual DataFrame has 15 years, 4,000 fields, 365 days, and 7 bands. If I actually label every single day, the labels would be illegible. I could place a tick every 50 days, but I would like the ticks to be dynamic so that when I zoom in, the ticks become more fine-grained. Basically what I'm looking for is a custom MultiIndexLocator that combines the placement of IndexLocator with the dynamism of MaxNLocator.
Bonus: My data is really nice in the sense that there are always the same number of fields for every year and the same number of bands for every day. But what if this was not the case? I would love to contribute a generic MultiIndexLocator and MultiIndexFormatter to matplotlib that works for any MultiIndex DataFrame.
Matplotlib does not know about dataframes or MultiIndex. It simply plots the data you supply. I.e. you get the same as if you were plotting the numpy array of data, spy(df.values).
So I would suggest first setting the extent of the image correctly so that you can use numeric tickers. Then a MaxNLocator should work fine, as long as you do not zoom in too much.
import numpy as np
import pandas as pd
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
plt.rcParams['axes.formatter.useoffset'] = False
years = range(2000, 2018)
fields = range(9) #17
days = range(120) #365
bands = ['R', 'G', 'B', 'A']
index = pd.MultiIndex.from_product(
[years, fields], names=['year', 'field'])
columns = pd.MultiIndex.from_product(
[days, bands], names=['day', 'band'])
data = np.random.rand(len(years)*len(fields),len(days)*len(bands))
x,y = np.meshgrid(np.arange(data.shape[1]),np.arange(data.shape[0]))
data += 2*((y//len(fields)+x//len(bands)) % 2)
df = pd.DataFrame(data, index=index, columns=columns)
############
# Plotting
############
xbase = len(bands)
xlabels = df.columns.get_level_values('day')
ybase = len(fields)
ylabels = df.index.get_level_values('year')
extent = [xlabels.min()-np.diff(np.unique(xlabels))[0]/2.,
xlabels.max()+np.diff(np.unique(xlabels))[0]/2.,
ylabels.min()-np.diff(np.unique(ylabels))[0]/2.,
ylabels.max()+np.diff(np.unique(ylabels))[0]/2.,]
fig, ax = plt.subplots()
ax.imshow(df.values, extent=extent, aspect="auto")
ax.set_ylabel('Year')
ax.set_xlabel('Day')
ax.xaxis.set_major_locator(MaxNLocator(integer=True,min_n_ticks=1))
ax.yaxis.set_major_locator(MaxNLocator(integer=True,min_n_ticks=1))
plt.show()
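The dynamic behaviour can be checked without drawing anything by asking the locator directly which ticks it would choose for a given view range. This is only a sketch; the two ranges below are arbitrary examples of a full view and a zoomed-in view.

```python
import numpy as np
from matplotlib.ticker import MaxNLocator

locator = MaxNLocator(integer=True, min_n_ticks=1)

# Ticks for the whole day range of the example above (days 0..119).
full_view = locator.tick_values(0, 119)

# Ticks for a zoomed-in view covering only a few days.
zoomed_view = locator.tick_values(10, 14)

print(full_view)
print(zoomed_view)
```

In both cases the ticks fall on whole numbers, and the spacing shrinks as the view range shrinks, which is the behaviour the question asked for.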

How to plot multiple dataframes in subplots

I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. What I really want is to have them all in the same plot as subplots, but unfortunately I have not been able to work out how, and I would highly appreciate some help.
You can manually create the subplots with matplotlib, and then plot the dataframes on a specific subplot using the ax keyword. For example for 4 subplots (2x2):
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2)
df1.plot(ax=axes[0,0])
df2.plot(ax=axes[0,1])
...
Here axes is an array which holds the different subplot axes, and you can access one just by indexing axes.
If you want a shared x-axis, then you can provide sharex=True to plt.subplots.
You can see examples in the documentation that demonstrate joris's answer. Also from the documentation, you could set subplots=True and layout=(rows, cols) within the pandas plot function:
df.plot(subplots=True, layout=(1,2))
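A minimal runnable sketch of this call, using made-up random data with two columns (the column names 'A' and 'B' are assumptions for the illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 2)) * 100, columns=['A', 'B'])

# One subplot per column, arranged in 1 row and 2 columns.
axs = df.plot(subplots=True, layout=(1, 2), figsize=(8, 3), sharex=True)

# pandas returns a 2D array of axes matching the layout.
for ax, col in zip(axs.flat, df.columns):
    ax.set_title(f'column {col}')

# plt.show()  # uncomment when running interactively
```

Note that this draws the columns of a single DataFrame in separate subplots; for several different DataFrames, the ax keyword is still needed.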
You could also use fig.add_subplot() which takes subplot grid parameters such as 221, 222, 223, 224, etc. as described in the post here. Nice examples of plot on pandas data frame, including subplots, can be seen in this ipython notebook.
You can plot multiple pandas data frames as subplots using matplotlib with a simple trick: make a list of all the data frames, then use a for loop to plot each one on its own subplot.
Working code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# dataframe sample data
df1 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df2 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df3 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df4 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df5 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df6 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
#define number of rows and columns for subplots
nrow=3
ncol=2
# make a list of all dataframes
df_list = [df1 ,df2, df3, df4, df5, df6]
fig, axes = plt.subplots(nrow, ncol)
# plot counter
count=0
for r in range(nrow):
    for c in range(ncol):
        df_list[count].plot(ax=axes[r,c])
        count+=1
Using this code you can plot subplots in any configuration. You need to define the number of rows nrow and the number of columns ncol, and you need to make a list df_list of the data frames you want to plot.
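As a variation on the same idea, the nested loops and the counter can be replaced by iterating over the flattened axes array together with the list. This is only a sketch with made-up random data; the titles 'df1' to 'df6' are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Six sample dataframes with the same columns.
df_list = [pd.DataFrame(rng.random((10, 2)) * 100, columns=['A', 'B'])
           for _ in range(6)]

nrow, ncol = 3, 2
fig, axes = plt.subplots(nrow, ncol, sharex=True, figsize=(8, 8))

# axes.flat walks the grid row by row, so no manual counter is needed.
for i, (ax, frame) in enumerate(zip(axes.flat, df_list), start=1):
    frame.plot(ax=ax, title=f'df{i}')

fig.tight_layout()
# plt.show()  # uncomment when running interactively
```

If the list is shorter than the grid, zip simply stops early and the remaining axes stay empty.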
You can use the familiar Matplotlib style calling a figure and subplot, but you simply need to specify the current axis using plt.gca(). An example:
plt.figure(1)
plt.subplot(2,2,1)
df.A.plot() #no need to specify for first axis
plt.subplot(2,2,2)
df.B.plot(ax=plt.gca())
plt.subplot(2,2,3)
df.C.plot(ax=plt.gca())
etc...
You can use this:
fig = plt.figure()
ax = fig.add_subplot(221)
plt.plot(x,y)
ax = fig.add_subplot(222)
plt.plot(x,z)
...
plt.show()
You may not need to use Pandas at all. Here's a matplotlib plot of cat frequencies:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
f, axes = plt.subplots(2, 1)
for c, i in enumerate(axes):
    axes[c].plot(x, y)
    axes[c].set_title('cats')
plt.tight_layout()
Option 1: Create subplots from a dictionary of dataframes with long (tidy) data
Assumptions:
There is a dictionary of multiple dataframes of tidy data that are either:
Created by reading in from files
Created by separating a single dataframe into multiple dataframes
The categories, cat, may be overlapping, but all dataframes don't necessarily contain all values of cat
hue='cat'
This example uses a dict of dataframes, but a list of dataframes would be similar.
If the dataframes are wide, use pandas.DataFrame.melt to convert them to long form.
Because dataframes are being iterated through, there's no guarantee that colors will be mapped the same for each plot
A custom color map needs to be created from the unique 'cat' values for all the dataframes
Since the colors will be the same, place one legend to the side of the plots, instead of a legend in every plot
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
Imports and Test Data
import pandas as pd
import numpy as np # used for random data
import matplotlib.pyplot as plt
from matplotlib.patches import Patch # for custom legend - square patches
from matplotlib.lines import Line2D # for custom legend - round markers
import seaborn as sns
import math # math.ceil determines the correct number of subplot rows
# synthetic data
df_dict = dict()
for i in range(1, 7):
    np.random.seed(i) # for repeatable sample data
    data_length = 100
    data = {'cat': np.random.choice(['A', 'B', 'C'], size=data_length),
            'x': np.random.rand(data_length), 'y': np.random.rand(data_length)}
    df_dict[i] = pd.DataFrame(data)
# display(df_dict[1].head())
cat x y
0 B 0.944595 0.606329
1 A 0.586555 0.568851
2 A 0.903402 0.317362
3 B 0.137475 0.988616
4 B 0.139276 0.579745
# display(df_dict[6].tail())
cat x y
95 B 0.881222 0.263168
96 A 0.193668 0.636758
97 A 0.824001 0.638832
98 C 0.323998 0.505060
99 C 0.693124 0.737582
Create color mappings and plot
# create color mapping based on all unique values of cat
unique_cat = {cat for v in df_dict.values() for cat in v.cat.unique()} # get unique cats
colors = sns.color_palette('tab10', n_colors=len(unique_cat)) # get a number of colors
cmap = dict(zip(unique_cat, colors)) # zip values to colors
col_nums = 3 # how many plots per row
row_nums = math.ceil(len(df_dict) / col_nums) # how many rows of plots
# create the figure and axes
fig, axes = plt.subplots(row_nums, col_nums, figsize=(9, 6), sharex=True, sharey=True)
# convert to 1D array for easy iteration
axes = axes.flat
# iterate through dictionary and plot
for ax, (k, v) in zip(axes, df_dict.items()):
    sns.scatterplot(data=v, x='x', y='y', hue='cat', palette=cmap, ax=ax)
    sns.despine(top=True, right=True)
    ax.legend_.remove() # remove the individual plot legends
    ax.set_title(f'dataset = {k}', fontsize=11)
fig.tight_layout()
# create legend from cmap
# patches = [Patch(color=v, label=k) for k, v in cmap.items()] # square patches
patches = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in cmap.items()] # round markers
# place legend outside of plot; change the right bbox value to move the legend up or down
plt.legend(title='cat', handles=patches, bbox_to_anchor=(1.06, 1.2), loc='center left', borderaxespad=0, frameon=False)
plt.show()
Option 2: Create subplots from a single dataframe with multiple separate datasets
The dataframes must be in a long form with the same column names.
This option uses pd.concat to combine multiple dataframes into a single dataframe, and .assign to add a new column.
See Import multiple csv files into pandas and concatenate into one DataFrame for creating a single dataframe from a list of files.
This option is easier because it doesn't require manually mapping colors to 'cat'
Combine DataFrames
# using df_dict, with dataframes as values, from the top
# combine all the dataframes in df_dict to a single dataframe with an identifier column
df = pd.concat((v.assign(dataset=k) for k, v in df_dict.items()), ignore_index=True)
# display(df.head())
cat x y dataset
0 B 0.944595 0.606329 1
1 A 0.586555 0.568851 1
2 A 0.903402 0.317362 1
3 B 0.137475 0.988616 1
4 B 0.139276 0.579745 1
# display(df.tail())
cat x y dataset
595 B 0.881222 0.263168 6
596 A 0.193668 0.636758 6
597 A 0.824001 0.638832 6
598 C 0.323998 0.505060 6
599 C 0.693124 0.737582 6
Plot a FacetGrid with seaborn.relplot
sns.relplot(kind='scatter', data=df, x='x', y='y', hue='cat', col='dataset', col_wrap=3, height=3)
Both options create the same result; however, it is less complicated to combine all the dataframes and plot a figure-level plot with sns.relplot.
Building on @joris's response above, if you have already established a reference to the subplot, you can use the reference as well. For example,
ax1 = plt.subplot2grid((50,100), (0, 0), colspan=20, rowspan=10)
...
df.plot.barh(ax=ax1, stacked=True)
Here is a working pandas subplot example, where modes is the column names of the dataframe.
dpi=200
figure_size=(20, 10)
fig, ax = plt.subplots(len(modes), 1, sharex="all", sharey="all", dpi=dpi)
for i in range(len(modes)):
    ax[i] = pivot_df.loc[:, modes[i]].plot.bar(figsize=(figure_size[0], figure_size[1]*len(modes)),
                                               ax=ax[i], title=modes[i], color=my_colors[i])
    ax[i].legend()
fig.suptitle(name)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,2)
df = pd.DataFrame({'A':np.random.randint(1,100,10),
'B': np.random.randint(100,1000,10),
'C':np.random.randint(100,200,10)})
# use a separate loop variable so the axes array is not overwritten
for ax_i in ax.flatten():
    df.plot(ax=ax_i)
