how to scale the histogram plot via matplotlib - python

I have a histogram that is made like this:
pl.hist(data1,bins=20,color='green',histtype="step",cumulative=-1)
How can I scale the histogram?
For example, make its height one third of what it is now.
Also, is there a way to remove the vertical line at the left?

The matplotlib hist function is actually just making calls to some other functions. It is often easier to use these directly, which lets you inspect the data and modify it before plotting:
import numpy as np
import matplotlib.pyplot as plt
# Generate some data
data = np.random.normal(size=1000)
# Generate the histogram data directly
hist, bin_edges = np.histogram(data, bins=10)
# Get the reversed cumulative sum
hist_neg_cumulative = [np.sum(hist[i:]) for i in range(len(hist))]
# Get the bin centres rather than the edges
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.
# Plot
plt.step(bin_centers, hist_neg_cumulative)
plt.show()
hist_neg_cumulative is the array of data being plotted, so you can rescale it as you wish before passing it to the plotting function. This approach also doesn't plot the vertical line.
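As a minimal sketch of the rescaling step, assuming "one third of the height" means dividing every cumulative count by 3: the block below builds the reversed cumulative counts with a reversed cumsum (equivalent to the list comprehension above), scales them, and draws them with step. The seed, the bin count, the name scaled and the Agg backend are choices made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
data = rng.normal(size=1000)

hist, bin_edges = np.histogram(data, bins=20)

# Reversed cumulative sum: entry i counts the samples in bin i and all bins above it
hist_neg_cumulative = hist[::-1].cumsum()[::-1]

# Rescale before plotting: one third of the original height
scaled = hist_neg_cumulative / 3.0

fig, ax = plt.subplots()
# Each value is drawn starting at its left bin edge; step adds no closing vertical line
ax.step(bin_edges[:-1], scaled, where="post")
ax.set_ylabel("count / 3")
```

Any other factor works the same way, since the scaling happens on the array and not inside the plotting call.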

Related

customization of plotly create_scattermatrix plots

A simple call to plotly's figure_factory routine to create a scatter matrix:
import pandas as pd
import numpy as np
from plotly import figure_factory
df = pd.DataFrame(np.random.randn(40,3))
fig = figure_factory.create_scatterplotmatrix(df, diag='histogram')
fig.show()
yields a 3x3 scatter matrix with histograms on the diagonal.
My questions are:
1. How can I specify a single color for all the plots?
2. How can I set the axes ranges for each of the three variables on the scatter plot?
3. Is there a way to create a density (normalized) version of the histogram?
4. Is there a way to include the correlation coefficient (say, computed from df.corr()) in the upper right corner of the non-diagonal plots?
For the first question, update the marker color attribute of every trace in the generated figure data. For the second, update the axis ranges in the generated layout in the same way; the example only modifies the x-axes, so use the same technique for the y-axes if necessary. For the third, switch the diagonal histogram traces to their normalized version by setting histnorm, as in the example in the reference. If that does not give the normalization you want, I believe it is possible to replace the trace data with values obtained from np.histogram() or similar. The fourth is more of a note: I added the values obtained with df.corr() as annotations, addressing each subplot by its axis names.
import pandas as pd
import numpy as np
from plotly import figure_factory
import plotly.express as px
np.random.seed(20220529)
df = pd.DataFrame(np.random.randn(40,3))
density = px.histogram(df, x=[0,1,2], histnorm='probability density')
df_corr = df.corr()
fig = figure_factory.create_scatterplotmatrix(df, diag='histogram', height=600, width=600)
# 1. How can I specify a single color for all the plots?
for i in range(9):
    fig.data[i]['marker']['color'] = 'blue'
# 2. How can I set the axes ranges for each of the three variables on the scatter plot?
for axes in ['xaxis2','xaxis3','xaxis4','xaxis6','xaxis7']:
    fig.layout[axes]['range'] = (-4,4)
# 3.Is there a way to create a density (normalized) version of the histogram?
fig['data'][0]['histnorm'] = 'probability density'
fig['data'][4]['histnorm'] = 'probability density'
fig['data'][8]['histnorm'] = 'probability density'
# 4. Is there a way to include the correlation coefficient (say, computed from df.corr())
# in the upper right corner of the non-diagonal plots?
for r,x,y in zip(df_corr.values.flatten(),
                 ['x1','x2','x3','x4','x5','x6','x7','x8','x9'],
                 ['y1','y2','y3','y4','y5','y6','y7','y8','y9']):
    if r == 1.0:
        pass
    else:
        fig.add_annotation(x=3.3, y=2, xref=x, yref=y, showarrow=False, text='R:'+str(round(r,2)))
fig.show()
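The np.histogram() fallback mentioned for the third question can be sketched without plotly. This is only an illustration under assumptions: the bin count of 10 and the variable names are made up here, and the resulting heights would still have to be put into the figure yourself (for example as a bar trace), which is not shown.

```python
import numpy as np

np.random.seed(20220529)
values = np.random.randn(40)  # stand-in for one column of df

# density=True rescales the counts so that the bar areas sum to 1
heights, bin_edges = np.histogram(values, bins=10, density=True)
bin_widths = np.diff(bin_edges)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0

# Check the normalization: sum of height * width over all bins
total_area = float(np.sum(heights * bin_widths))
print(round(total_area, 6))
```

bin_centers and heights are then the x and y values of a probability-density histogram for that column.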

Altair: Creating a layered violin + stripplot

I'm trying to create a plot that contains both a violin plot and a stripplot with jitter. How do I go about doing this? I provided my attempt below. The problem that I have been encountering is that the violin plot seems to be invisible in the plots.
# 1. Create violin plot
violin = alt.Chart(df).transform_density(
"n_genes_by_counts",
as_=["n_genes_by_counts", "density"],
).mark_area(orient="horizontal").encode(
y="n_genes_by_counts:Q",
x=alt.X("Density:Q", stack="center", title=None),
)
# 2. Create stripplot
stripplot = alt.Chart(df).mark_circle(size=8, color="black").encode(
y="n_gene_by_counts",
x=alt.X("jitter:Q", title=None),
).transform_calculate(
jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
)
# 3. Combine both
combined = stripplot + violin
I have a feeling that it could be a problem with the scaling of the x-axis: density is much, much smaller than jitter. If that's the case, how do I scale jitter so that it's on the same order of magnitude as density? Would it be possible for someone to show me how to create a violin + stripplot given a column name n_gene_by_counts that belongs to some pandas dataframe df?
As you suspected, the different scales will make the violin very small in the stripplot unless you adjust for it. In your case, you have also accidentally capitalized Density:Q in the channel encoding, which means that your violinplot is actually empty since this channel doesn't exist. This example works:
import altair as alt
from vega_datasets import data
df = data.cars()
# 1. Create violin plot
violin = alt.Chart(df).transform_density(
"Horsepower",
as_=["Horsepower", "density"],
).mark_area().encode(
x="Horsepower:Q",
y=alt.Y("density:Q", stack="center", title=None),
)
# 2. Create stripplot
stripplot = alt.Chart(df).mark_circle(size=8, color="black").encode(
x="Horsepower",
y=alt.Y("jitter:Q", title=None),
).transform_calculate(
jitter="(random() / 400) + 0.0052" # Narrowing and centering the points
)
# 3. Combine both
violin + stripplot
By using scipy, you could also lay out the points themselves in the shape of the violin, which I am personally quite fond of (discussion in this issue):
import altair as alt
import numpy as np
import pandas as pd
from scipy import stats
from vega_datasets import data
# NAs are not supported in SciPy's density calculation
df = data.cars().dropna()
y = 'Horsepower'
# Compute the density function of the data
dens = stats.gaussian_kde(df[y])
# Compute the density value for each data point
pdf = dens(df[y].sort_values())
# Randomly jitter points between 0 and the upper bound of the probability density function
density_cloud = np.empty(pdf.shape[0])
for i in range(pdf.shape[0]):
    density_cloud[i] = np.random.uniform(0, pdf[i])
# To create a symmetric density/violin, we make every second point negative
# Distributing every other point like this is also more likely to preserve the shape of the violin
violin_cloud = density_cloud.copy()
violin_cloud[::2] = violin_cloud[::2] * -1
# Append the density cloud to the original data in the correctly sorted order
df_with_density = pd.concat([
df,
pd.DataFrame({
'density_cloud': density_cloud,
'violin_cloud': violin_cloud
},
index=df['Horsepower'].sort_values().index)],
axis=1
)
# Visualize the violin-shaped point cloud
alt.Chart(df_with_density).mark_circle().encode(
x='Horsepower',
y='violin_cloud'
)
Both of these approaches will work with multiple categoricals without faceting in the next version of Altair, when support for x/y offset channels is added.
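The per-point loop that fills density_cloud can also be written as a single vectorized call, since NumPy's uniform sampling accepts an array of upper bounds. The sketch below assumes a stand-in pdf array (a standard normal density evaluated analytically at sorted sample values) instead of the gaussian_kde fit, so that it runs with NumPy alone; the seed and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Stand-in for dens(df[y].sort_values()): density values at sorted data points
samples = np.sort(rng.normal(size=200))
pdf = np.exp(-samples**2 / 2.0) / np.sqrt(2.0 * np.pi)

# One draw per point, each between 0 and that point's density value
density_cloud = rng.uniform(0.0, pdf)

# Mirror every second point to get a symmetric, violin-shaped cloud
violin_cloud = density_cloud.copy()
violin_cloud[::2] *= -1
```

With the real data, pdf would come from the KDE as in the answer, and the two arrays would be attached to the dataframe in the same sorted order.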

How can I plot a matplotlib.mlab spectrogram while keeping the frequency and time values?

I'm trying to plot a colour map and I need to use mlab.specgram() to perform a cross-correlation of two functions in the frequency domain, so I can't use pyplot.specgram(). I used pyplot.imshow in order to plot a colour map of the cross-correlation, but as a result the axes are just index numbers rather than the actual time values corresponding to the power shown in the colour map.
I've tried to change the labels using xticks()/yticks() and the extent argument, but all it does is show me a small portion of my colour map instead of changing the labels.
Is there a way for me to change the scale of my axes to match the actual frequency and time?
For reference:
# My spectrograms:
spec_H1, freqs, t = mlab.specgram(H_filt, NFFT=NFFT, Fs=fs, noverlap=NOVL, mode=mode)
spec_L1, freqs, t = mlab.specgram(L_filt, NFFT=NFFT, Fs=fs, noverlap=NOVL, mode=mode)
# The cross-correlation:
X = np.real(spec_H1 * np.conj(spec_L1))
# The figure:
plt.figure(figsize=(10,10))
plt.imshow(abs(X), cmap = 'jet')
plt.colorbar()
plt.ylim(0,200)
plt.xlim(0,500)
For example, the time axis (x) should run from 0 to 4 seconds, but it is labelled by sample index, so it runs from 0 to 500. How do I change this?
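One possible approach, sketched here under assumptions rather than taken from the thread: pass the t and freqs arrays returned by mlab.specgram() to imshow through extent, together with origin='lower' and aspect='auto', and then set the axis limits in seconds and hertz instead of index units. The synthetic signals, the sampling rate and the NFFT/noverlap values are placeholders for H_filt, L_filt and the asker's own settings.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import mlab

fs = 1024                                    # placeholder sampling rate in Hz
time = np.arange(0, 4, 1 / fs)               # 4 seconds of samples
sig_h = np.sin(2 * np.pi * 50 * time)        # placeholder for H_filt
sig_l = np.sin(2 * np.pi * 50 * time + 0.3)  # placeholder for L_filt

spec_h, freqs, t = mlab.specgram(sig_h, NFFT=256, Fs=fs, noverlap=128, mode="complex")
spec_l, freqs, t = mlab.specgram(sig_l, NFFT=256, Fs=fs, noverlap=128, mode="complex")

# Cross-correlation in the frequency domain, as in the question
X = np.real(spec_h * np.conj(spec_l))

fig, ax = plt.subplots(figsize=(10, 10))
im = ax.imshow(
    np.abs(X),
    cmap="jet",
    origin="lower",  # row 0 (lowest frequency) at the bottom
    aspect="auto",   # let the image fill the axes despite unequal units
    extent=[t[0], t[-1], freqs[0], freqs[-1]],  # x in seconds, y in Hz
)
fig.colorbar(im, ax=ax)
ax.set_xlabel("time [s]")
ax.set_ylabel("frequency [Hz]")
ax.set_ylim(0, 200)  # now in Hz, not in row indices
```

Note that extent marks the outer edges of the image while t and freqs are bin centres, so labels can be off by up to half a bin. If that matters, ax.pcolormesh(t, freqs, np.abs(X)) is an alternative that takes the coordinate arrays directly.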

How do I plot more than one set of bars per axis on a bar plot in python?

I currently use the align='edge' parameter and positive/negative widths in pyplot.bar() to plot the bar data of one metric to each axis. However, if I try to plot a second set of data to one axis, it covers the first set. Is there a way for pyplot to automatically space this data correctly?
lns3 = ax[1].bar(bucket_df.index,bucket_df.original_revenue,color='c',width=-0.4,align='edge')
lns4 = ax[1].bar(bucket_df.index,bucket_df.revenue_lift,color='m',bottom=bucket_df.original_revenue,width=-0.4,align='edge')
lns5 = ax3.bar(bucket_df.index,bucket_df.perc_first_priced,color='grey',width=0.4,align='edge')
lns6 = ax3.bar(bucket_df.index,bucket_df.perc_revenue_lift,color='y',width=0.4,align='edge')
When I show the plot, the data shown in yellow completely covers the data in grey. I'd like it to be shown next to the grey data.
Is there any easy way to do this? Thanks!
The first argument to the bar() plotting method is an array of the x-coordinates for your bars. Since you pass the same x-coordinates they will all overlap. You can get what you want by staggering the bars by doing something like this:
x = np.arange(10) # define your x-coordinates
width = 0.1 # set a width for your bars
offset = 0.15 # define an offset to separate each set of bars
fig, ax = plt.subplots() # define your figure and axes objects
ax.bar(x, y1, width=width) # plot the first set of bars
ax.bar(x + offset, y2, width=width) # plot the second set of bars
Since you have a few sets of data to plot, it makes more sense to make the code a bit more concise (assume y_vals is a list containing the y-coordinates you'd like to plot, bucket_df.original_revenue, bucket_df.revenue_lift, etc.). Then your plotting code could look like this:
for i, y in enumerate(y_vals):
    ax.bar(x + i * offset, y, width=width)
If you want to plot more sets of bars you can decrease the width and offset accordingly.
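To make the staggering concrete, here is a small self-contained sketch with made-up numbers. It assumes three series per x position and centres each group on its tick by shifting series i by (i - (n - 1) / 2) * width; the data, the labels and that centring formula are illustrative choices, not the asker's bucket_df.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)  # one group of bars per x position
y_vals = [        # made-up data: one array per set of bars
    np.array([3, 5, 2, 6, 4]),
    np.array([4, 4, 3, 5, 6]),
    np.array([1, 2, 5, 3, 2]),
]
labels = ["series A", "series B", "series C"]

n = len(y_vals)
width = 0.8 / n  # the bars of one group together span 0.8 of the tick spacing

fig, ax = plt.subplots()
for i, (y, label) in enumerate(zip(y_vals, labels)):
    # Shift each series so that the whole group is centred on its tick
    shift = (i - (n - 1) / 2) * width
    ax.bar(x + shift, y, width=width, label=label)

ax.set_xticks(x)
ax.legend()
```

Deriving width from the number of series keeps the groups from running into each other when more sets of bars are added.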

Aspect ratio in semi-log plot with Matplotlib

When I plot a function in matplotlib, the plot is framed by a rectangle. I want the ratio of the length and height of this rectangle to be given by the golden mean, i.e., dx/dy=1.618033...
If the x and y scales are linear, I found this solution using Google:
import numpy as np
import matplotlib.pyplot as pl
golden_mean = (np.sqrt(5)-1.0)/2.0
dy=pl.gca().get_ylim()[1]-pl.gca().get_ylim()[0]
dx=pl.gca().get_xlim()[1]-pl.gca().get_xlim()[0]
pl.gca().set_aspect((dx/dy)*golden_mean,adjustable='box')
If it is a log-log plot I came up with this solution
dy=np.abs(np.log10(pl.gca().get_ylim()[1])-np.log10(pl.gca().get_ylim()[0]))
dx=np.abs(np.log10(pl.gca().get_xlim()[1])-np.log10(pl.gca().get_xlim()[0]))
pl.gca().set_aspect((dx/dy)*golden_mean,adjustable='box')
However, for a semi-log plot, when I call set_aspect, I get
UserWarning: aspect is not supported for Axes with xscale=log, yscale=linear
Can anyone think of a work-around for this?
The simplest solution would be to take the log of your data and then use the method for lin-lin.
You can then label the axes to make it look like a normal log plot.
ticks = np.arange(min_logx, max_logx, 1)
ticklabels = [r"$10^{}$".format(tick) for tick in ticks]
pl.yticks(ticks, ticklabels)
If you have values higher than 10e9 (exponents with more than one digit), you will need three pairs of braces: two pairs for the literal LaTeX braces and one for the .format() placeholder.
ticklabels = [r"$10^{{{}}}$".format(tick) for tick in ticks]
Edit:
If you also want the ticks between the powers of ten, use the minor ticks as well.
They need to be located at log10(2), log10(3), ..., log10(9), then log10(20), log10(30), ... (log10(1), log10(10), ... already are the major ticks).
You can create and set them with:
minor_ticks = []
for i in range(min_exponent, max_exponent):
    for j in range(2,10):
        minor_ticks.append(i+np.log10(j))
pl.gca().set_yticks(minor_ticks, minor=True)
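Putting the pieces together, a minimal end-to-end sketch could look like the following. It assumes the y-axis is the logarithmic one and uses made-up exponential data; the names exponents, labels and aspect and the Agg backend are introduced only for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

golden_mean = (np.sqrt(5) - 1.0) / 2.0

x = np.linspace(0, 10, 200)
y = np.exp(x)        # made-up data spanning several decades
log_y = np.log10(y)  # plot the logarithm on a linear axis

fig, ax = plt.subplots()
ax.plot(x, log_y)

# Major ticks at integer exponents, labelled as powers of ten
exponents = np.arange(int(np.floor(log_y.min())), int(np.ceil(log_y.max())) + 1)
labels = [r"$10^{{{}}}$".format(e) for e in exponents]
ax.set_yticks(exponents)
ax.set_yticklabels(labels)

# Minor ticks at log10(2) ... log10(9) within each decade
minor_ticks = [e + np.log10(j) for e in exponents[:-1] for j in range(2, 10)]
ax.set_yticks(minor_ticks, minor=True)

# Both scales are linear now, so set_aspect works as in the lin-lin case
x0, x1 = ax.get_xlim()
y0, y1 = ax.get_ylim()
aspect = ((x1 - x0) / (y1 - y0)) * golden_mean
ax.set_aspect(aspect, adjustable="box")
```

The limits are read after the ticks are set, because setting ticks can widen the view limits and the aspect should be computed from the final ranges.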
