How to plot a histogram by different groups in matplotlib? - python

I have a table like:
value type
10 0
12 1
13 1
14 2
Generate a dummy data:
import numpy as np
value = np.random.randint(1, 20, 10)
type = np.random.choice([0, 1, 2], 10)
I want to accomplish a task in Python 3 with matplotlib (v1.4):
plot a histogram of value
group by type, i.e. use different colors to differentiate types
the position of the "bars" should be "dodge", i.e. side by side
since the range of value is small, I would use identity for bins, i.e. the width of a bin is 1
The questions are:
how to assign colors to bars based on the values of type and draw colors from colormap (e.g. Accent or other cmap in matplotlib)? I don't want to use named color (i.e. 'b', 'k', 'r')
the bars in my histogram overlap each other, how to "dodge" the bars?
Note
I have tried on Seaborn, matplotlib and pandas.plot for two hours and failed to get the desired histogram.
I read the examples and Users' Guide of matplotlib. Surprisingly, I found no tutorial about how to assign colors from colormap.
I have searched on Google but failed to find a succinct example.
I guess one could accomplish the task with matplotlib.pyplot alone, without importing a bunch of modules such as matplotlib.cm and matplotlib.colors.

To get side-by-side bars, we can create a dummy column equal to 1 and then generate counts by summing this column, grouped by value and type.
To assign the colors, you can pass the colormap directly into plot using the colormap parameter:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn
seaborn.set() #make the plots look pretty
df = pd.DataFrame({'value': value, 'type': type})
df['dummy'] = 1
ag = df.groupby(['value','type']).sum().unstack()
ag.columns = ag.columns.droplevel()
ag.plot(kind = 'bar', colormap = cm.Accent, width = 1)
plt.show()
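Since the question asks for a matplotlib.pyplot-only route, here is a minimal sketch of one possibility, not necessarily what the answer above had in mind. It assumes that passing a list of arrays (one per type) to hist is acceptable; matplotlib then draws the groups side by side by default. The names kind, groups and bins, and the seeded sample data, are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative sample data (seeded so the result is repeatable)
rng = np.random.default_rng(0)
value = rng.integers(1, 20, size=50)
kind = rng.choice([0, 1, 2], size=50)  # plays the role of the 'type' column

groups = np.unique(kind)
data = [value[kind == g] for g in groups]  # one array of values per type

# Sample one color per group from the Accent colormap
colors = [plt.cm.Accent(x) for x in np.linspace(0, 1, len(groups))]

# Bin edges at half-integers so each integer value gets its own bin of width 1
bins = np.arange(value.min(), value.max() + 2) - 0.5

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(data, bins=bins, color=colors,
                           label=[f"type {g}" for g in groups])
ax.legend()
```

Call plt.show() afterwards to display the figure. The colormap is reached through plt.cm, so no extra import is needed.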

Whenever you need to plot a variable grouped by another (using color), seaborn usually provides a more convenient way to do that than matplotlib or pandas. So here is a solution using the seaborn histplot function:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import seaborn as sns # v 0.11.0
# Set parameters for random data
rng = np.random.default_rng(seed=1) # random number generator
size = 50
xmin = 1
xmax = 20
# Create random dataframe
df = pd.DataFrame(dict(value = rng.integers(xmin, xmax, size=size),
val_type = rng.choice([0, 1, 2], size=size)))
# Create histogram with discrete bins (bin width is 1), colored by type
fig, ax = plt.subplots(figsize=(10,4))
sns.histplot(data=df, x='value', hue='val_type', multiple='dodge', discrete=True,
edgecolor='white', palette=plt.cm.Accent, alpha=1)
# Create x ticks covering the range of all integer values of df['value']
ax.set_xticks(np.arange(df['value'].min(), df['value'].max()+1))
# Additional formatting
sns.despine()
ax.get_legend().set_frame_on(False)
plt.show()
As you can notice, this being a histogram and not a bar plot, there is no space between the bars except where values of the x axis are not present in the dataset, like for values 12 and 14.
Seeing as the accepted answer provided a bar plot in pandas and that a bar plot may be a relevant choice for displaying a histogram in certain situations, here is how to create one with seaborn using the countplot function:
# For some reason the palette argument in countplot is not processed the
# same way as in histplot so here I fetch the colors from the previous
# example to make it easier to compare them
colors = [c for c in set([patch.get_facecolor() for patch in ax.patches])]
# Create bar chart of counts of each value grouped by type
fig, ax = plt.subplots(figsize=(10,4))
sns.countplot(data=df, x='value', hue='val_type', palette=colors,
saturation=1, edgecolor='white')
# Additional formatting
sns.despine()
ax.get_legend().set_frame_on(False)
plt.show()
As this is a bar plot, the values 12 and 14 are not included, which produces a somewhat misleading plot because no empty space is shown for those values. On the other hand, there is some space between each group of bars, which makes it easier to see which value each bar belongs to.

Related

Colormap has wrong range when used with seaborn FacetGrid and pyplot scatter

For a Pandas dataframe, I am trying to use Seaborn to create a FacetGrid based on one column ('year') and then create a scatter plot for each subplot where the x and y axes correspond to the columns 'x' and 'y' and the color corresponds to yet another value 'p'. I'm also looking to add a meaningful colorbar as an annotation to the plot. Using my code below, where data is a Pandas dataframe,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# 'data' imported from a csv file
data.head()
x y q t p year
0 0.864721 0.080396 0.970694 0.848008 2.164305 1
1 0.615352 0.768370 0.950874 2.902603 2.383057 3
2 0.775110 0.562287 1.163768 0.151289 4.026925 1
3 0.355569 0.161638 0.840118 3.075967 0.431773 4
4 0.405850 0.807168 1.261452 0.242770 1.111010 1
g = sns.FacetGrid(data, col='year', hue='p', palette='seismic')
g = g.map(plt.scatter, 'x', 'y', s=100, alpha=0.5)
plt.colorbar()
I am able to create this as pictured below.
So I can get a FacetGrid with scatter plots. The problem is that my colorbar on the right-hand side is not aligned (in terms of values and color palette) with the circles in the scatter plots. The true values for 'p' range from 0.15 to 8.2; the values for the colormap range from 0 to 1. In addition, the color palette seems to differ between the plots and the colormap. So a correct colormap would range from dark blue to dark red for the range 0.15 to 8.2.
I suspect that the colormap I'm creating is not "linked" to the scatter plots. Instead, pyplot is just creating a generic colormap with a default palette and range in the absence of any data to use.
How can I get this to work?
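One way to make the "linking" concrete is to share a single normalization between the points and the colorbar. The sketch below is only an illustration under stated assumptions: it swaps FacetGrid for plain matplotlib subplots (one per year) and uses made-up data with the same column meanings (x, y, p, year).

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Made-up stand-in for the question's data
rng = np.random.default_rng(0)
n = 60
x, y = rng.random(n), rng.random(n)
p = rng.uniform(0.15, 8.2, n)
year = rng.integers(1, 5, n)

# One normalization shared by every panel and by the colorbar
norm = Normalize(vmin=p.min(), vmax=p.max())
years = np.unique(year)

fig, axes = plt.subplots(1, len(years), sharex=True, sharey=True,
                         figsize=(12, 3))
axes = np.atleast_1d(axes)
for ax, yr in zip(axes, years):
    mask = year == yr
    ax.scatter(x[mask], y[mask], c=p[mask], cmap="seismic", norm=norm,
               s=100, alpha=0.5)
    ax.set_title(f"year = {yr}")

# Build the colorbar from the same norm and cmap so it matches the points
sm = ScalarMappable(norm=norm, cmap="seismic")
sm.set_array([])  # harmless on recent matplotlib; older versions require it
cbar = fig.colorbar(sm, ax=axes.tolist(), label="p")
```

Because the colorbar is built from an explicit mappable rather than from whatever pyplot considers current, its range follows the data in p instead of defaulting to 0 to 1.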

matplotlib assign color to categorical variable [duplicate]

I have a data frame, diamonds, which is composed of variables like (carat, price, color), and I want to draw a scatter plot of price against carat for each color, so that each diamond color is drawn in a different plot color.
This is easy in R with ggplot:
ggplot(aes(x=carat, y=price, color=color), #by setting color=color, ggplot automatically draw in different colors
data=diamonds) + geom_point(stat='summary', fun.y=median)
I wonder how could this be done in Python using matplotlib ?
PS:
I know about auxiliary plotting packages, such as seaborn and ggplot for python, and I don't prefer them, just want to find out if it is possible to do the job using matplotlib alone, ;P
Imports and Sample DataFrame
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # for sample data
from matplotlib.lines import Line2D # for legend handle
# DataFrame used for all options
df = sns.load_dataset('diamonds')
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
With matplotlib
You can pass plt.scatter a c argument, which allows you to select the colors. The following code defines a colors dictionary to map the diamond colors to the plotting colors.
fig, ax = plt.subplots(figsize=(6, 6))
colors = {'D':'tab:blue', 'E':'tab:orange', 'F':'tab:green', 'G':'tab:red', 'H':'tab:purple', 'I':'tab:brown', 'J':'tab:pink'}
ax.scatter(df['carat'], df['price'], c=df['color'].map(colors))
# add a legend
handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in colors.items()]
ax.legend(title='color', handles=handles, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
df['color'].map(colors) effectively maps the colors from "diamond" to "plotting".
(Forgive me for not putting another example image up, I think 2 is enough :P)
With seaborn
You can use seaborn which is a wrapper around matplotlib that makes it look prettier by default (rather opinion-based, I know :P) but also adds some plotting functions.
For this you could use seaborn.lmplot with fit_reg=False (which prevents it from automatically doing some regression).
sns.scatterplot(x='carat', y='price', data=df, hue='color', ec=None) also does the same thing.
Selecting hue='color' tells seaborn to split and plot the data based on the unique values in the 'color' column.
sns.lmplot(x='carat', y='price', data=df, hue='color', fit_reg=False)
With pandas.DataFrame.groupby & pandas.DataFrame.plot
If you don't want to use seaborn, use pandas.DataFrame.groupby to split the data by color, and then plot each group using just matplotlib. You'll have to manually assign colors as you go; I've added an example below:
fig, ax = plt.subplots(figsize=(6, 6))
grouped = df.groupby('color')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='carat', y='price', label=key, color=colors[key])
plt.show()
This code assumes the same DataFrame as above, and then groups it based on color. It then iterates over these groups, plotting for each one. To select a color, I've created a colors dictionary, which can map the diamond color (for instance D) to a real color (for instance tab:blue).
Here's a succinct and generic solution to use a seaborn color palette.
First find a color palette you like and optionally visualize it:
sns.palplot(sns.color_palette("Set2", 8))
Then you can use it with matplotlib doing this:
# Unique category labels: 'D', 'F', 'G', ...
color_labels = df['color'].unique()
# List of RGB triplets
rgb_values = sns.color_palette("Set2", 8)
# Map label to RGB
color_map = dict(zip(color_labels, rgb_values))
# Finally use the mapped values
plt.scatter(df['carat'], df['price'], c=df['color'].map(color_map))
I had the same question, and have spent all day trying out different packages.
I had originally used matplotlib and was not happy with either mapping categories to predefined colors or grouping/aggregating and then iterating through the groups (and still having to map colors). I just felt it was poor package implementation.
Seaborn wouldn't work on my case, and Altair ONLY works inside of a Jupyter Notebook.
The best solution for me was PlotNine, which "is an implementation of a grammar of graphics in Python, and based on ggplot2".
Below is the plotnine code to replicate your R example in Python:
from plotnine import *
from plotnine.data import diamonds
g = ggplot(diamonds, aes(x='carat', y='price', color='color')) + geom_point(stat='summary')
print(g)
So clean and simple :)
Here is a combination of markers and colors from a qualitative colormap in matplotlib:
import itertools
import numpy as np
from matplotlib import markers
import matplotlib.pyplot as plt
m_styles = markers.MarkerStyle.markers
N = 60
colormap = plt.cm.Dark2.colors # Qualitative colormap
for i, (marker, color) in zip(range(N), itertools.product(m_styles, colormap)):
    plt.scatter(*np.random.random(2), color=color, marker=marker, label=i)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., ncol=4);
Using Altair.
from altair import *
import pandas as pd
df = datasets.load_dataset('iris')
Chart(df).mark_point().encode(x='petalLength',y='sepalLength', color='species')
With df.plot()
Normally when quickly plotting a DataFrame, I use pd.DataFrame.plot(). This takes the index as the x value, the value as the y value and plots each column separately with a different color.
A DataFrame in this form can be achieved by using set_index and unstack.
import matplotlib.pyplot as plt
import pandas as pd
carat = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
price = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
color =['D', 'D', 'D', 'E', 'E', 'E', 'F', 'F', 'F', 'G', 'G', 'G',]
df = pd.DataFrame(dict(carat=carat, price=price, color=color))
df.set_index(['color', 'carat']).unstack('color')['price'].plot(style='o')
plt.ylabel('price')
With this method you do not have to manually specify the colors.
This procedure may make more sense for other data series. In my case I have time series data, so the MultiIndex consists of datetime and categories. It is also possible to use this approach with more than one column to color by, but the legend gets messy.
You can convert the categorical column into a numerical one by using the commands:
# convert the column to categorical data
cat_col = df['column_name'].astype('category')
# get the integer code of each category
cat_col = cat_col.cat.codes
# use the c parameter to set the colors
plt.scatter(df['column1'], df['column2'], c=cat_col)
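Integer codes obtained this way follow the sorted category order, whereas pd.factorize numbers the categories in order of first appearance. The small sketch below shows the difference on a made-up Series (the name s and its values are illustrative):

```python
import pandas as pd

# Made-up categorical labels
s = pd.Series(['E', 'D', 'E', 'F'])

# Category codes: categories are sorted, so D=0, E=1, F=2
codes = s.astype('category').cat.codes

# factorize: codes follow order of first appearance, so E=0, D=1, F=2
fact_codes, uniques = pd.factorize(s)

print(list(codes))
print(list(fact_codes), list(uniques))
```

Either array can be passed as c to plt.scatter; the two only differ in which color each category ends up with.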
The easiest way is to simply pass an array of integer category levels to the plt.scatter() color parameter.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv')
plt.scatter(df['carat'], df['price'], c=pd.factorize(df['color'])[0],)
plt.gca().set(xlabel='Carat', ylabel='Price', title='Carat vs. Price')
This creates a plot without a legend, using the default "viridis" colormap. In this case "viridis" is not a good default choice because the colors appear to imply a sequential order rather than purely nominal categories.
To choose your own colormap and add a legend, the simplest approach is this:
import matplotlib.patches
levels, categories = pd.factorize(df['color'])
colors = [plt.cm.tab10(i) for i in levels] # using the "tab10" colormap
handles = [matplotlib.patches.Patch(color=plt.cm.tab10(i), label=c) for i, c in enumerate(categories)]
plt.scatter(df['carat'], df['price'], c=colors)
plt.gca().set(xlabel='Carat', ylabel='Price', title='Carat vs. Price')
plt.legend(handles=handles, title='Color')
I chose the "tab10" discrete (aka qualitative) colormap here, which does a better job at signaling the color factor is a nominal categorical variable.
Extra credit:
In the first plot, the default colors are chosen by passing min-max scaled values from the array of category level ints pd.factorize(df['color'])[0] to the __call__ method of the plt.cm.viridis colormap object.
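That scaling step can be reproduced by hand. The following is a rough sketch, assuming a small made-up array of category levels; Normalize performs the min-max scaling and calling the colormap returns one RGBA row per value.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Made-up integer category levels, like the first output of pd.factorize
levels = np.array([0, 1, 2, 3, 1, 0])

# Min-max scale the levels to the interval [0, 1]
norm = Normalize(vmin=levels.min(), vmax=levels.max())
scaled = norm(levels)

# Calling the colormap maps each scaled value to an RGBA color
rgba = plt.cm.viridis(scaled)
print(rgba.shape)
```

The lowest level lands on the first color of viridis and the highest on the last, which is why the default coloring suggests an ordering.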

Matplotlib scatter legend with colors using categorical variable

I have made a simple scatterplot using matplotlib showing data from 2 numerical variables (varA and varB) with colors that I defined with a 3rd categorical string variable (col) containing 10 unique colors (corresponding to another string variable with 10 unique names), all in the same Pandas DataFrame with 100+ rows.
Is there an easy way to create a legend for this scatterplot that shows the unique colored dots and their corresponding category names? Or should I somehow group the data and plot each category in a subplot to do this? This is what I have so far:
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
varA = df['A']
varB = df['B']
col = df['Color']
plt.scatter(varA,varB, c=col, alpha=0.8)
plt.legend()
plt.show()
I had to chime in, because I could not accept that I needed a for-loop to accomplish this. It just seems really annoying and unpythonic, especially when I'm not using Pandas. However, after some searching, I found the answer. The object returned by scatter is a PathCollection (defined in matplotlib.collections), and its legend_elements() method builds the legend handles for you. See the implementation below:
# imports
import matplotlib.collections
import matplotlib.pyplot as plt
import numpy as np
# create random data and numerical labels
x = np.random.rand(10,2)
y = np.random.randint(4, size=10)
# create list of categories
labels = ['type1', 'type2', 'type3', 'type4']
# plot
fig, ax = plt.subplots()
scatter = ax.scatter(x[:,0], x[:,1], c=y)
handles, _ = scatter.legend_elements(prop="colors", alpha=0.6) # use my own labels
legend1 = ax.legend(handles, labels, loc="upper right")
ax.add_artist(legend1)
plt.show()
Source:
https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html
https://matplotlib.org/stable/api/collections_api.html#matplotlib.collections.PathCollection.legend_elements
Assuming Color is the column that holds both the colors and the labels, you can simply do the following.
colors = list(df['Color'].unique())
for color in colors:
    data = df.loc[df['Color'] == color]
    plt.scatter('A', 'B', data=data, color='Color', label=color)
plt.legend()
plt.show()
A simple way is to group your data by color, then plot all of the data on one plot. Pandas has a built in groupby function. For example:
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
# group by a single column name so that each key is a plain color string
for color, group in df.groupby('Color'):
    plt.scatter(group['A'], group['B'], c=color, alpha=0.8, label=color)
plt.legend()
plt.show()
Notice that we call plt.scatter once for each grouping of data. Then we only need to call plt.legend and plt.show once all of the data is in our plot.

Pandas Plot With Positive Values One Color And Negative Values Another

I have a pandas dataframe from which I am plotting two of its 12 columns, one as the x-axis and one as the y-axis. The x-axis is simply a time series and the y-axis values are random integers between roughly -5000 and 5000.
Is there any way to make a scatter plot using only these 2 columns where the positive values of y are one color and the negative values are another?
I have tried so many variations but can't get anything to work. I tried diverging colormaps, colormeshes, seaborn colormaps and boolean masks for negative/positive numbers. I am at my wits' end.
The idea to use a colormap to colorize the points of a scatter is of course justified. If you're using the plt.scatter plot, you can supply the values according to which the colormap chooses the color in the c argument.
Here you only want two values, so c= np.sign(df.y) would be an appropriate choice.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.arange(25), 'y': np.random.normal(0,2500,25)})
fig, ax = plt.subplots()
ax.scatter(df.x, df.y, c=np.sign(df.y), cmap="bwr")
plt.show()
Split dataframe and plot them separately:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.arange(20), 'y': np.random.randn(20)})
# split dataframes
df_plus = df[df.y >= 0]
df_minus = df[df.y < 0]
print(df_plus)
print(df_minus)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# plot scatter
ax.scatter(df_plus.x, df_plus.y, color='r')
ax.scatter(df_minus.x, df_minus.y, color='b')
ax.autoscale()
plt.show()
If you want to plot the negative dataframe as positive, write df_minus.y = -df_minus.y.
Make 2 separate dataframes by using boolean masking and the where keyword. The condition would be whether the value is > 0 or not. Then plot both dataframes one by one, one on top of the other, with different parameters for the color.
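A single-call variant of the same idea, as a rough sketch: instead of building two dataframes, the boolean condition is turned into a per-point color list with np.where. The sample data and the color names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: y takes both positive and negative values
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': np.arange(25), 'y': rng.normal(0, 2500, 25)})

# One color per point: red for y >= 0, blue for y < 0
colors = list(np.where(df['y'] >= 0, 'tab:red', 'tab:blue'))

fig, ax = plt.subplots()
points = ax.scatter(df['x'], df['y'], c=colors)
```

This keeps the data in one dataframe, at the cost of not getting a separate legend entry per sign.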

Pandas + Matplotlib, Make one color in barplot stand out

I have a barplot with different colors. I would like to make one bar stand out with brighter colors and the others faded. My guess is to use the keyword alpha on the bars to fade them, but I can not figure out how to make one keep the original color (= not faded with alpha keyword).
I need help on this
Here is my code:
from matplotlib import pyplot as plt
from itertools import cycle, islice
import pandas as pd, numpy as np
ds2=ds[['Factors', 'contribution']]
ds3=ds2.set_index('Factors')
it = cycle(['b', 'green', 'y', 'pink','orange','cyan','darkgrey'])
my_colors=[next(it) for i in range(len(ds))]
plt.figure(1, figsize=(10,8))
# Specify this list of colors as the `color` option to `plot`.
ds3.plot(kind='barh', stacked=True, color=my_colors, alpha=0.95)
plt.title('xxxxxxxxxxxxxx', fontsize = 10)
Here is my simple dataframe ds3
contribution
Factors
A 0.188137
B 0.160208
C 0.160208
D 0.151654
E 0.149489
F 0.135975
G 0.063206
I think mgilson's approach is the best, where you add the data from Pandas in a Matplotlib command. You could, however, also capture the axes object which Pandas returns and then iterate over the artists to modify them.
This gets really tricky because the bars don't have a label (it's "_nolegend_") to identify them, so the only way to target a specific bar is to look up its position in the index of the original DataFrame. Any change in order between plotting and looking it up, such as sorting, will give a wrong result!
import pandas as pd
df = pd.DataFrame({'contribution': [0.188137,0.160208,0.160208,0.151654,0.149489,0.135975,0.063206]}
,index=['A','B','C','D','E','F','G'])
colors = ['b', 'green', 'y', 'pink','orange','cyan','darkgrey']
ax = df.plot(kind='barh', color=colors, legend=False)
for bar in ax.patches:
    bar.set_facecolor('#888888')
highlight = 'D'
pos = df.index.get_loc(highlight)
ax.patches[pos].set_facecolor('#aa3333')
ax.legend()
So this example gives only a little bit of insight into how Pandas and Matplotlib work together. I don't recommend actually using it and suggest just going with mgilson's approach.
Why not plot the special bar in a separate plot command?
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
colors = list('rgbkm')
data_y = [1,2,4,5,6]
data_x = [1,1,1,1,1]
ax.barh(data_y, data_x, color=colors, alpha=0.25)
# Plot the special bar separately ...
ax.barh([3], [1], color='b')
plt.show()
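Another possibility, closer to the alpha idea in the question, is to bake the transparency into each bar's color so that a single barh call is enough. This is a hedged sketch rather than anyone's posted solution: it reuses the question's values, and the names labels, base_colors and highlight are illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
values = [0.188137, 0.160208, 0.160208, 0.151654, 0.149489, 0.135975, 0.063206]
base_colors = ['b', 'green', 'y', 'pink', 'orange', 'cyan', 'darkgrey']
highlight = 'D'  # the bar that should keep its full color

# Fully opaque for the highlighted bar, faded for all the others
bar_colors = [to_rgba(color, alpha=1.0 if label == highlight else 0.25)
              for label, color in zip(labels, base_colors)]

fig, ax = plt.subplots(figsize=(10, 8))
bars = ax.barh(labels, values, color=bar_colors)
```

Each bar keeps its original hue, and the highlighted one is chosen by label rather than by position in ax.patches.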
