When plotting a bar chart with Seaborn and using the hue parameter to color the bars according to their column value, bars with identical column values are nested, or aggregated, and only a single bar is shown. The image below illustrates the problem. Patient number 1 has two samples of sample_type 1, with values 10 and 20. The two values have been nested, and both values are represented as a single bar (as the average of the two).
I'd like to avoid this nesting, and rather have something like in the image below.
Is this possible to achieve? MVE below. Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
"patient_number": [1, 1, 1, 2, 2, 2],
"sample_type": [1, 1, 2, 1, 2, 3],
"value": [10, 20, 15, 10, 11, 12]
})
sns.barplot(x="patient_number", y="value", hue="sample_type", data=df)
plt.show()
The following approach obtains the desired plot:
Seaborn's hue= parameter both defines the color and the position of the bars.
Per patient, an extra field ('idx') contains a unique number for each of the desired bars. This field 'idx' restarts from 0 for every next patient and is added to the dataframe.
'idx' can then be used as hue='idx' to get the desired columns, although the bars will simply be colored in sequence.
In order to get one color per sample type, an extra column now contains a factorized version of the sample types (so, 0 for the first type, 1 for the next, etc.).
Seaborn generates the bars per hue, one for each patient. These bars can be accessed as a list via ax.patches. If some patient doesn't have a value for a given 'idx', a dummy bar will be added to the list.
By iterating through the patients and then through the 'idx', all bars can be visited and colored via 'sample_type'. As the ordering of the bars is a bit tricky, an adequate renumbering is needed.
The legend needs to be changed to reflect the sample types.
The given data is extended a bit to be able to test different numbers of samples per patient, and sample types that aren't simple subsequent numbers.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'patient_number': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
'sample_type': ['st1', 'st1', 'st2', 'st1', 'st2', 'st3', 'st4', 'st4', 'st4', 'st4', 'st4'],
'value': [10, 20, 15, 10, 11, 12, 1, 2, 3, 4, 5]
})
df['idx'] = df.groupby('patient_number').cumcount()
df['sample_factors'], sample_labels = pd.factorize(df['sample_type'])
ax = sns.barplot(x='patient_number', y='value', hue='idx', data=df)
colors = plt.get_cmap('Set2').colors  # plt.cm.get_cmap is deprecated in newer Matplotlib; colormaps: https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
handles = [None for _ in sample_labels]
num_patients = len(ax.patches) // (df['idx'].max() + 1)
for i, (patient_id, group) in enumerate(df.groupby('patient_number')):
    for j, factor in enumerate(group['sample_factors']):
        patch = ax.patches[i + j * num_patients]
        patch.set_color(colors[factor])
        handles[factor] = patch
ax.legend(handles=handles, labels=list(sample_labels), title='Sample type')
plt.show()
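To see what the two helper columns contain, here is a small sketch on the question's original six-row data (no plotting involved). It only illustrates the cumcount and factorize steps described above; the variable names idx_list and factor_list are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_number": [1, 1, 1, 2, 2, 2],
    "sample_type": [1, 1, 2, 1, 2, 3],
    "value": [10, 20, 15, 10, 11, 12],
})

# Running counter that restarts at 0 for every patient: one slot per bar
df["idx"] = df.groupby("patient_number").cumcount()

# Integer code per distinct sample type, in order of first appearance
df["sample_factors"], sample_labels = pd.factorize(df["sample_type"])

idx_list = df["idx"].tolist()
factor_list = df["sample_factors"].tolist()
print(idx_list)             # bar position within each patient
print(factor_list)          # color index for each bar
print(list(sample_labels))  # legend labels, indexed by color index
```

Because 'idx' differs for the two sample_type 1 rows of patient 1, using it as hue keeps them as separate bars instead of averaging them.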
For context: I'd like to make a plot in plotly showing the evolution of an investment portfolio where the value of each asset is plotted on top of each other. Since assets are bought and sold, not every asset should be shown for the entire range of the curve.
The below example can clarify this. Leading or trailing zeros indicate that the asset was not in the portfolio at that moment.
import pandas as pd
import plotly.express as px
import numpy as np
data = {"Asset 1": [0, 1, 2, 3, 4, 5], "Asset 2": [0, 0, 2, 3, 2, 2], "Asset 3": [1, 1, 3, 0, 0, 0]}
df = pd.DataFrame(data)
fig = px.area(df)
fig.show()
This results in the following figure:
The problem is now that at the indicated time (index=4), Asset 3 is not in the portfolio anymore, hence its value 0. However it is still shown, and the bigger problem is that it makes it impossible to see the value of Asset 2 which is in the portfolio.
I tried changing the zeros to NaN values to indicate that they don't exist but that gives the exact same figure.
data2 = {"a": [np.nan, 1, 2, 3, 4, 5], "b": [np.nan, np.nan, 2, 3, 2, 2], "c": [1, 1, 3, np.nan, np.nan, np.nan]}
df2 = pd.DataFrame(data2)
fig2 = px.area(df2)
fig2.show()
I am afraid I cannot construct an elegant solution. However this will work for most requirements you stated. How it works:
Instead of using the automatic stacking, draw the lines one by one yourself.
That means you will have to pre-process the dataframe a little bit, by calculating the values of column A+B and column A+B+C.
plotly.express offers limited custom control, so use plotly.graph_objects instead. The two have similar syntax.
The order in which the traces (i.e. lines) are added is important: the last trace rendered is placed on top. In your problem statement, the traces are drawn from the left-most to the right-most column, which is why overlaps favor the columns further to the right.
The NaN values have to be zero-filled manually before plotting. Otherwise the filled areas create weird shapes, since your sample data contains several NaNs.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
data = {"a": [np.nan, 1, 2, 3, 4, 5], "b": [np.nan, np.nan, 2, 3, 2, 2], "c": [1, 1, 3, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
# fill NAs with zeros before doing anything
df = df.fillna(0)
fig = go.Figure()
# add lines one by one. The order matters - last one lays on top along with its hoverinfo
fig.add_trace(go.Scatter(
x=df.index,
y=df['a'],
mode='lines',
fill='tonexty', # fill the area under line to next y
))
fig.add_trace(go.Scatter(
x=df.index,
y=df['a']+df['b'], # sum of 'a' and 'b'
mode='lines',
fill='tonexty', # fill the area under line to next y
))
fig.add_trace(go.Scatter(
x=df.index,
y=df['a']+df['b']+df['c'], # sum of 'a' and 'b' and 'c'
mode='lines',
fill='tonexty', # fill the area under line to next y
))
# minor bug where an area below zero is shown
fig.update_layout(yaxis=dict(range=[0, max(df.sum(axis=1) * 1.05)]))
fig.show()
The resulting plot would look like:
The green line, representing values of df['a']+df['b']+df['c'] still sits on the top. However, the hover label is now showing the value of df['a']+df['b']+df['c'] instead of either of the assets.
In fact, I find these asset-allocation-style plots prettier without the edge lines:
and this can be done by setting mode='none' for each of the 3 plot objects.
Remarks:
Another approach I tried, for anyone who is reading: treat each filled area and its line as two separate traces. This requires defining custom pairs of colors (a solid color and its half-transparent counterpart), and it gave some buggy results. Also, traces with stackgroup set cannot contain NaN values: the NaNs are either zero-filled or interpolated, which produces bad plots in the context of this problem.
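As a side note on the pre-processing step (column A+B, column A+B+C): the running totals can probably be computed in one go with a row-wise cumulative sum, rather than adding the columns by hand. This is only a sketch on the same sample data; the name stacked is made up here.

```python
import numpy as np
import pandas as pd

data = {"a": [np.nan, 1, 2, 3, 4, 5],
        "b": [np.nan, np.nan, 2, 3, 2, 2],
        "c": [1, 1, 3, np.nan, np.nan, np.nan]}

# Zero-fill first, as in the answer, then accumulate across the columns
df = pd.DataFrame(data).fillna(0)
stacked = df.cumsum(axis=1)

# stacked['a'] is a, stacked['b'] is a+b, stacked['c'] is a+b+c
print(stacked)
```

Each column of stacked could then be passed as y to one go.Scatter trace, in column order, which should scale better than writing out the sums when there are many assets.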
I'd like to use Matplotlib to plot a histogram over data that's been pre-counted. For example, say I have the raw data
data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]
Given this data, I can use
pylab.hist(data, bins=[...])
to plot a histogram.
In my case, the data has been pre-counted and is represented as a dictionary:
counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
Ideally, I'd like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc, as if I had passed it the raw data. As a workaround, I'm expanding my counts into the raw data:
from itertools import chain, repeat
data = list(chain.from_iterable(repeat(value, count)
                                for value, count in counted_data.items()))
This is inefficient when counted_data contains counts for millions of data points.
Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?
Alternatively, if it's easiest to just bar-plot data that's been pre-binned, is there a convenience method to "roll-up" my per-item counts into binned counts?
You can use the weights keyword argument to np.histogram (which plt.hist calls underneath):
val, weight = zip(*[(k, v) for k,v in counted_data.items()])
plt.hist(val, weights=weight)
Assuming you only have integers as the keys, you can also use bar directly:
min_bin = min(counted_data)  # min/max over a dict iterate over its keys
max_bin = max(counted_data)
bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)
for k, v in counted_data.items():
    vals[k - min_bin] = v
plt.bar(bins, vals, ...)
where ... is whatever arguments you want to pass to bar (doc)
If you want to re-bin your data see Histogram with separate list denoting frequency
I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:
pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))
This allows me to rely on hist to re-bin my data.
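To check that weighting really is equivalent to expanding the counts, here is a small sketch comparing both routes with np.histogram. The bin edges are an arbitrary choice made for this illustration, not something from the question.

```python
from itertools import chain, repeat

import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
values = list(counted_data.keys())
weights = list(counted_data.values())

edges = [0, 2, 4, 6, 8, 10, 12]  # hypothetical bin edges, width 2

# Route 1: one entry per distinct value, weighted by its count
weighted_counts, _ = np.histogram(values, bins=edges, weights=weights)

# Route 2: expand the counts back into raw data (the costly workaround)
raw = list(chain.from_iterable(repeat(v, c) for v, c in counted_data.items()))
raw_counts, _ = np.histogram(raw, bins=edges)

print(weighted_counts)
print(raw_counts)
```

Both routes give the same per-bin totals, so this also works as a way to roll per-item counts up into binned counts without ever building the expanded list.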
You can also use seaborn to plot the histogram:
import seaborn as sns
sns.distplot(
list(
counted_data.keys()
),
hist_kws={
"weights": list(counted_data.values())
}
)
The "bins" array holds the bin edges, so it should be one element longer than "counts". Here's the way to fully reconstruct the histogram:
import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
Adding to tacaswell's comment, plt.bar can be much more efficient than plt.hist here for large numbers of bins (>1e4). This is especially true for a crowded random plot where you only need to plot the highest bars, because the width required to see them will cover most of their neighbors anyway. You can pick out the highest bars and plot them with
i, = np.where(vals > min_height)
plt.bar(i,vals[i],width=len(bins)//50)
Other statistical trends may prefer to instead plot every 100th bar or something similar.
The trick here is that plt.hist wants to plot all of your bins whereas plt.bar will let you just plot the sparser set of visible bins.
hist uses bar under the hood, so this will produce something similar to what hist creates (assuming bins of equal size):
bins = [1,2,3]
heights = [10,20,30]
ax = plt.gca()
ax.bar(bins, heights, align='center', width=bins[-1] - bins[-2])
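If the bins are not all the same width, one option is to keep the bin edges and derive each bar's left edge and width from them. A minimal sketch, assuming made-up edges and heights (edges has one more element than heights):

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])  # hypothetical bin edges
heights = np.array([10, 20, 30, 5])          # hypothetical pre-binned counts

lefts = edges[:-1]       # left edge of every bin
widths = np.diff(edges)  # per-bin width, not necessarily equal

print(lefts)
print(widths)
```

These arrays can then be passed to bar as ax.bar(lefts, heights, width=widths, align='edge'), so that each bar spans exactly its own bin.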
I have a dataset consisting of financial stock IDs [0, 1400] and timestamps [0, 1800]. For a given ID, it either does or does not have data for a given timestamp.
I have created a dictionary where each key is an ID, and each value is a list of all the timestamps for which that ID has data.
I would now like to plot a chart with each row corresponding to an ID, and each column corresponding to a timestamp. Each cell [i, j] of the chart will be coloured green if ID i has data for timestamp j (if j in dict[i]), and red if not.
Here is a sample I produced manually in Excel:
Can this be done through matplotlib or some other library?
Since the chart would be of size 1400x1800, the cells can be very small. I am attempting to reorder the data so that the number of green cells intersecting between adjacent IDs is maximised and so this chart will allow me to provide visualisations of how well I have achieved these overlaps/intersections across the dataset.
To provide some data, I simply iterated through the first 20 IDs in my dictionary and printed out the ID and the list of its timestamps. Each line is in the form of ID [list of the ID's timestamps].
EDIT:
Here is my first attempt on an example of data on a small scale. Although this does achieve what I set out to do, it is a very brute-force solution, so any recommendations on improvements would be appreciated.
import matplotlib.pyplot as plt
import pandas as pd
TSs = [0, 1, 2, 3, 4, 5]
ID_TS = {0: [1, 2, 3], 1: [2, 3, 4, 5]}
df = pd.DataFrame(index=ID_TS.keys(), columns=TSs)
for ID, TS in ID_TS.items():
    bools = []
    for i in TSs:
        if i in TS:
            bools.append(True)
        else:
            bools.append(False)
    df.loc[ID] = bools
plt.imshow(df, cmap='hot', interpolation='nearest')
plt.show()
Your code to generate your dataframe doesn't work. So I took some liberties with that...
import numpy
import pandas
from matplotlib import pyplot
from matplotlib import ticker
TSs = [0, 1, 2, 3, 4, 5]
ID_TS = {0: [1, 2, 3, numpy.nan], 1: [2, 3, 4, 5]}
fig, ax = pyplot.subplots()
img = (
pandas.DataFrame(data=ID_TS, columns=TSs)
.isnull()
.pipe(numpy.bitwise_not)
.pipe(ax.pcolor, cmap='RdYlGn', edgecolors='k')
)
unit_ints = ticker.MultipleLocator(1)
ax.set_xlabel('Time')
ax.set_ylabel('ID')
ax.yaxis.set_major_locator(unit_ints)
ax.xaxis.set_major_locator(unit_ints)
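For the full 1400x1800 case, another option might be to build the presence grid directly as a NumPy array and draw it with imshow and a two-colour colormap. This is only a sketch on the small sample from the question; the names ids, col_of and present are invented for the illustration, and it assumes every timestamp in the dictionary appears in TSs.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

TSs = [0, 1, 2, 3, 4, 5]
ID_TS = {0: [1, 2, 3], 1: [2, 3, 4, 5]}

ids = sorted(ID_TS)
col_of = {ts: j for j, ts in enumerate(TSs)}  # timestamp -> column index

# present[i, j] is True when ids[i] has data for timestamp TSs[j]
present = np.zeros((len(ids), len(TSs)), dtype=bool)
for i, id_ in enumerate(ids):
    cols = [col_of[ts] for ts in ID_TS[id_]]
    present[i, cols] = True

fig, ax = plt.subplots()
# 0 (missing) maps to red, 1 (present) maps to green
ax.imshow(present.astype(int), cmap=ListedColormap(["red", "green"]),
          vmin=0, vmax=1, aspect="auto", interpolation="nearest")
ax.set_xlabel("Timestamp")
ax.set_ylabel("ID")
```

Call plt.show() (or fig.savefig) afterwards to view it. With cells this small, imshow without cell edges is likely to stay readable longer than pcolor with edgecolors.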
I'd like to create a list of boxplots with the color of the box dependent on the name of the pandas.DataFrame column I use as input.
The column names contain strings that indicate an experimental condition based on which I want the box of the boxplot colored.
I do this to make the boxplots:
sns.boxplot(data = data.dropna(), orient="h")
plt.show()
This creates a beautiful list of boxplots with correct names. Now I want to give every boxplot that has 'prog +, DMSO+' in its name a red color, leaving the rest as blue.
I tried creating a dictionary with column names as keys and colors as values:
color = {}
for column in data.columns:
    if 'prog+, DMSO+' in column:
        color[column] = 'red'
    else:
        color[column] = 'blue'
And then using the dictionary as color:
sns.boxplot(data = data.dropna(), orient="h", color=color[column])
plt.show()
This does not work, understandably (there is no loop to go through the dictionary). So I make a loop:
for column in data.columns:
    sns.boxplot(data = data[column], orient='h', color=color[column])
plt.show()
This does make boxplots of different colors but all on top of each other and without the correct labels. If I could somehow put these boxplot nicely in one plot below each other I'd be almost at what I want. Or is there a better way?
You should use the palette parameter, which handles multiple colors, rather than color, which handles a specific one. You can give palette a name, an ordered list, or a dictionary. The latter seems best suited to your question:
import seaborn as sns
sns.set_color_codes()
tips = sns.load_dataset("tips")
pal = {day: "r" if day == "Sat" else "b" for day in tips.day.unique()}
sns.boxplot(x="day", y="total_bill", data=tips, palette=pal)
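Applied to the question's setup, the same kind of dictionary could be built from the column names. A minimal sketch, assuming column names like the ones below (they are placeholders, not the asker's real data):

```python
# Hypothetical column names; two of them contain the marker substring
columns = ['bar', 'prog +, DMSO+ 1', 'foo', 'something', 'prog +, DMSO+ 2']

marker = 'prog +, DMSO+'
palette = {col: 'red' if marker in col else 'blue' for col in columns}

for col, colour in palette.items():
    print(col, '->', colour)
```

The resulting dictionary would then be passed as palette=palette to sns.boxplot(data=..., orient='h'). Whether a dictionary keyed by column names is accepted for wide-form data may depend on the seaborn version, so treat that part as an assumption to verify.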
You can set the facecolor of individual boxes after plotting them all in one go, using ax.artists[i].set_facecolor('r')
For example:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame(
[[2, 4, 5, 6, 1],
[4, 5, 6, 7, 2],
[5, 4, 5, 5, 1],
[10, 4, 7, 8, 2],
[9, 3, 4, 6, 2],
[3, 3, 4, 4, 1]
],columns=['bar', 'prog +, DMSO+ 1', 'foo', 'something', 'prog +, DMSO+ 2'])
ax = sns.boxplot(data=df,orient='h')
# Note: with newer Matplotlib versions (3.5+) the boxes may be stored in ax.patches instead
boxes = ax.artists
for i, box in enumerate(boxes):
    if 'prog +, DMSO+' in df.columns[i]:
        box.set_facecolor('r')
    else:
        box.set_facecolor('b')
plt.tight_layout()
plt.show()