Using healpy.anafast on gamma-ray maps

I have gamma-ray maps (images of surface brightness) in FITS format, and also .hpx files as output by the Aladin converter.
I wish to compute the angular power spectrum. How do I create a file readable by
healpy.anafast?
I seem to be getting the data format wrong (I get TypeErrors).
One of the gamma-ray images I tried was the Fermi Galactic diffuse model. The file
is a public LAT Galactic diffuse map named gll_iem_v02_P6_V11_DIFFUSE.fit, available at:
http://fermi.gsfc.nasa.gov/ssc/data/access/lat/BackgroundModels.html
I have pasted the code below as I use it, but it is essentially the script called plot_wmap_power_spectra from astroML:
"""
WMAP power spectrum analysis with HealPy
----------------------------------------
This demonstrates how to plot and take a power spectrum of the WMAP data
using healpy, the python wrapper for healpix. Healpy is available for
download at the `github site <https://github.com/healpy/healpy>`_
"""
# Author: Jake VanderPlas <vanderplas#astro.washington.edu>
# License: BSD
# The figure is an example from astroML: see http://astroML.github.com
import numpy as np
from matplotlib import pyplot as plt
# warning: due to a bug in healpy, importing it before pylab can cause
# a segmentation fault in some circumstances.
import pylab
import healpy as hp
###
from astroML.datasets import fetch_wmap_temperatures
###
#------------------------------------------------------------
# Fetch the data
###
wmap_unmasked = fetch_wmap_temperatures(masked=False)
#PredictedSurfaceFluxFromModelMap = np.arange(hp.read_map('PredictedSurfaceFluxFromModelMap.hpx[1]'))
PredictedSurfaceFluxFromModelMap = hp.read_map('gll_iem_v02_p6_V11_DIFFUSE.fit', dtype=np.float64, verbose=True)
#PredictedSurfaceFluxFromModelMap = hp.read_map('all.fits',dtype=np.float,verbose=True)
#cl_out = hp.read_cl('PredictedSurfaceFluxFromModelMap.hpx',dtype=np.float)#,verbose=True)
wmap_masked = fetch_wmap_temperatures(masked=True)
###
white_noise = np.ma.asarray(np.random.normal(0, 0.062, wmap_masked.shape))
#len(cl_out)  # cl_out is undefined here; the read_cl call above is commented out
#------------------------------------------------------------
# plot the unmasked map
fig = plt.figure(1)
#hp.mollview(wmap_unmasked, min=-1, max=1, title='Unmasked map',
# fig=1, unit=r'$\Delta$T (mK)')
########----------------
##hp.mollview(PredictedSurfaceFluxFromModelMap, min=-1, max=1, title='Unmasked map',
## fig=1, unit=r'$\Delta$T (mK)')
########----------------
#------------------------------------------------------------
# plot the masked map
# filled() fills the masked regions with a null value.
########----------------
#fig = plt.figure(2)
#hp.mollview(wmap_masked.filled(), title='Masked map',
# fig=2, unit=r'$\Delta$T (mK)')
########----------------
#------------------------------------------------------------
# compute and plot the power spectrum
########----------------
#cl = hp.anafast(wmap_masked.filled(), lmax=1024)
cl = hp.anafast(PredictedSurfaceFluxFromModelMap, lmax=1024)
#cl = cl_out
########----------------
ell = np.arange(len(cl))
cl_white = hp.anafast(white_noise, lmax=1024)
fig = plt.figure(3)
ax = fig.add_subplot(111)
ax.scatter(ell, ell * (ell + 1) * cl,
           s=4, c='black', lw=0,
           label='data')
ax.scatter(ell, ell * (ell + 1) * cl_white,
           s=4, c='gray', lw=0,
           label='white noise')
ax.set_xlabel(r'$\ell$')
ax.set_ylabel(r'$\ell(\ell+1)C_\ell$')
ax.set_title('Angular Power (not mask corrected)')
ax.legend(loc='upper right')
ax.grid()
ax.set_xlim(0, 1100)
plt.show()

I have also uploaded your map to Figshare, where it is likely to remain available in the future.
Once you have the map in HEALPix format, it is easy to read it with healpy:
import numpy as np
import healpy as hp
from matplotlib import pyplot as plt
m = hp.ma(hp.read_map("gll_iem_v02_p6_V11_DIFFUSE.hpx"))
Mask NaN pixels:
m.mask = np.isnan(m)
Plot it:
hp.mollview(m, min=-1e-5, max=1e-5, xsize=2000)
plt.title("gll_iem_v02_p6_V11_DIFFUSE")
Compute and plot the angular power spectrum:
plt.loglog(hp.anafast(m))
See also an IPython notebook: http://nbviewer.ipython.org/7553252
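Putting the pieces together, a minimal end-to-end sketch (this assumes the .hpx HEALPix file named above; the lmax value is an arbitrary choice):
import numpy as np
import healpy as hp
from matplotlib import pyplot as plt

# Read the HEALPix map and mask invalid pixels
m = hp.ma(hp.read_map("gll_iem_v02_p6_V11_DIFFUSE.hpx"))
m.mask = np.isnan(m)

# Angular power spectrum (not mask-corrected)
cl = hp.anafast(m, lmax=512)
ell = np.arange(len(cl))

plt.loglog(ell[1:], cl[1:])
plt.xlabel(r'$\ell$')
plt.ylabel(r'$C_\ell$')
plt.show()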

Related

Plot the timeframe of each unique sound loop in a song, with rows sorted by sound similarity using python Librosa

Background
Here's a video of a clip from an electronic song. At the beginning of the video, the song plays at full speed. When you slow the song down you can hear all the unique sounds that the song uses. Some of these sounds repeat.
Mp3, Wav and MIDI of the audio in the video
Problem Description
What I am trying to do is create a visual like the one below, where a horizontal track/row is created for each unique sound, with a colored block on that track corresponding to each timeframe in the song where that sound is playing. The tracks/rows should be sorted by how similar the sounds are to each other, with more similar sounds closer together. If sounds are so similar that a human couldn't tell them apart, they should be considered the same sound.
I'll accept an imperfect solution if it can generally do what I'm asking.
Watch the video linked above for a visual description of what I am saying. It includes a visual grid that I created manually, which almost matches the grid I am trying to produce.
If for example, each of the 5 waves below represents the soundwave that a sound makes, each of these sounds would be considered similar, and would be placed close to each other vertically on the grid.
Attempts
I've been looking at an example for Laplacian segmentation in librosa. The graph labelled "structure components" looks like it might be what I need. From reading the paper, it looks like they are trying to break the song into segments like chorus, verse, bridge..., but I am essentially trying to break the song into 1- or 2-beat fragments.
Here is the code for the Laplacian Segmentation (there's a Jupyter Notebook as well if you would prefer that).
# -*- coding: utf-8 -*-
"""
======================
Laplacian segmentation
======================
This notebook implements the laplacian segmentation method of
`McFee and Ellis, 2014 <http://bmcfee.github.io/papers/ismir2014_spectral.pdf>`_,
with a couple of minor stability improvements.
Throughout the example, we will refer to equations in the paper by number, so it will be
helpful to read along.
"""
# Code source: Brian McFee
# License: ISC
###################################
# Imports
# - numpy for basic functionality
# - scipy for graph Laplacian
# - matplotlib for visualization
# - sklearn.cluster for K-Means
#
import numpy as np
import scipy
import matplotlib.pyplot as plt
import sklearn.cluster
import librosa
import librosa.display
import matplotlib.patches as patches
#############################
# First, we'll load in a song
def laplacianSegmentation(fileName):
    # NOTE: fileName is currently unused; the librosa example track is loaded instead
    y, sr = librosa.load(librosa.ex('fishin'))

    ##############################################
    # Next, we'll compute and plot a log-power CQT
    BINS_PER_OCTAVE = 12 * 3
    N_OCTAVES = 7
    C = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr,
                                                   bins_per_octave=BINS_PER_OCTAVE,
                                                   n_bins=N_OCTAVES * BINS_PER_OCTAVE)),
                                ref=np.max)
    fig, ax = plt.subplots()
    librosa.display.specshow(C, y_axis='cqt_hz', sr=sr,
                             bins_per_octave=BINS_PER_OCTAVE,
                             x_axis='time', ax=ax)

    ##########################################################
    # To reduce dimensionality, we'll beat-synchronize the CQT
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr, trim=False)
    Csync = librosa.util.sync(C, beats, aggregate=np.median)

    # For plotting purposes, we'll need the timing of the beats
    # we fix_frames to include non-beat frames 0 and C.shape[1] (final frame)
    beat_times = librosa.frames_to_time(librosa.util.fix_frames(beats,
                                                                x_min=0,
                                                                x_max=C.shape[1]),
                                        sr=sr)
    fig, ax = plt.subplots()
    librosa.display.specshow(Csync, bins_per_octave=12*3,
                             y_axis='cqt_hz', x_axis='time',
                             x_coords=beat_times, ax=ax)

    #####################################################################
    # Let's build a weighted recurrence matrix using beat-synchronous CQT
    # (Equation 1)
    # width=3 prevents links within the same bar
    # mode='affinity' here implements S_rep (after Eq. 8)
    R = librosa.segment.recurrence_matrix(Csync, width=3, mode='affinity',
                                          sym=True)
    # Enhance diagonals with a median filter (Equation 2)
    df = librosa.segment.timelag_filter(scipy.ndimage.median_filter)
    Rf = df(R, size=(1, 7))

    ###################################################################
    # Now let's build the sequence matrix (S_loc) using mfcc-similarity
    #
    # :math:`R_\text{path}[i, i\pm 1] = \exp(-\|C_i - C_{i\pm 1}\|^2 / \sigma^2)`
    #
    # Here, we take :math:`\sigma` to be the median distance between successive beats.
    #
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    Msync = librosa.util.sync(mfcc, beats)
    path_distance = np.sum(np.diff(Msync, axis=1)**2, axis=0)
    sigma = np.median(path_distance)
    path_sim = np.exp(-path_distance / sigma)
    R_path = np.diag(path_sim, k=1) + np.diag(path_sim, k=-1)

    ##########################################################
    # And compute the balanced combination (Equations 6, 7, 9)
    deg_path = np.sum(R_path, axis=1)
    deg_rec = np.sum(Rf, axis=1)
    mu = deg_path.dot(deg_path + deg_rec) / np.sum((deg_path + deg_rec)**2)
    A = mu * Rf + (1 - mu) * R_path

    ###########################################################
    # Plot the resulting graphs (Figure 1, left and center)
    fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True, figsize=(10, 4))
    librosa.display.specshow(Rf, cmap='inferno_r', y_axis='time', x_axis='s',
                             y_coords=beat_times, x_coords=beat_times, ax=ax[0])
    ax[0].set(title='Recurrence similarity')
    ax[0].label_outer()
    librosa.display.specshow(R_path, cmap='inferno_r', y_axis='time', x_axis='s',
                             y_coords=beat_times, x_coords=beat_times, ax=ax[1])
    ax[1].set(title='Path similarity')
    ax[1].label_outer()
    librosa.display.specshow(A, cmap='inferno_r', y_axis='time', x_axis='s',
                             y_coords=beat_times, x_coords=beat_times, ax=ax[2])
    ax[2].set(title='Combined graph')
    ax[2].label_outer()

    #####################################################
    # Now let's compute the normalized Laplacian (Eq. 10)
    L = scipy.sparse.csgraph.laplacian(A, normed=True)
    # and its spectral decomposition
    evals, evecs = scipy.linalg.eigh(L)
    # We can clean this up further with a median filter.
    # This can help smooth over small discontinuities
    evecs = scipy.ndimage.median_filter(evecs, size=(9, 1))
    # cumulative normalization is needed for symmetric normalized Laplacian eigenvectors
    Cnorm = np.cumsum(evecs**2, axis=1)**0.5

    # If we want k clusters, use the first k normalized eigenvectors.
    # Fun exercise: see how the segmentation changes as you vary k
    k = 5
    X = evecs[:, :k] / Cnorm[:, k-1:k]

    # Plot the resulting representation (Figure 1, center and right)
    fig, ax = plt.subplots(ncols=2, sharey=True, figsize=(10, 5))
    librosa.display.specshow(Rf, cmap='inferno_r', y_axis='time', x_axis='time',
                             y_coords=beat_times, x_coords=beat_times, ax=ax[1])
    ax[1].set(title='Recurrence similarity')
    ax[1].label_outer()
    librosa.display.specshow(X,
                             y_axis='time',
                             y_coords=beat_times, ax=ax[0])
    ax[0].set(title='Structure components')

    #############################################################
    # Let's use these k components to cluster beats into segments
    # (Algorithm 1)
    KM = sklearn.cluster.KMeans(n_clusters=k)
    seg_ids = KM.fit_predict(X)

    # and plot the results
    fig, ax = plt.subplots(ncols=3, sharey=True, figsize=(10, 4))
    colors = plt.get_cmap('Paired', k)
    librosa.display.specshow(Rf, cmap='inferno_r', y_axis='time',
                             y_coords=beat_times, ax=ax[1])
    ax[1].set(title='Recurrence matrix')
    ax[1].label_outer()
    librosa.display.specshow(X,
                             y_axis='time',
                             y_coords=beat_times, ax=ax[0])
    ax[0].set(title='Structure components')
    img = librosa.display.specshow(np.atleast_2d(seg_ids).T, cmap=colors,
                                   y_axis='time', y_coords=beat_times, ax=ax[2])
    ax[2].set(title='Estimated segments')
    ax[2].label_outer()
    fig.colorbar(img, ax=[ax[2]], ticks=range(k))

    ###############################################################
    # Locate segment boundaries from the label sequence
    bound_beats = 1 + np.flatnonzero(seg_ids[:-1] != seg_ids[1:])
    # Count beat 0 as a boundary
    bound_beats = librosa.util.fix_frames(bound_beats, x_min=0)
    # Compute the segment label for each boundary
    bound_segs = list(seg_ids[bound_beats])
    # Convert beat indices to frames
    bound_frames = beats[bound_beats]
    # Make sure we cover to the end of the track
    bound_frames = librosa.util.fix_frames(bound_frames,
                                           x_min=None,
                                           x_max=C.shape[1]-1)

    ###################################################
    # And plot the final segmentation over original CQT
    # sphinx_gallery_thumbnail_number = 5
    bound_times = librosa.frames_to_time(bound_frames)
    freqs = librosa.cqt_frequencies(n_bins=C.shape[0],
                                    fmin=librosa.note_to_hz('C1'),
                                    bins_per_octave=BINS_PER_OCTAVE)
    fig, ax = plt.subplots()
    librosa.display.specshow(C, y_axis='cqt_hz', sr=sr,
                             bins_per_octave=BINS_PER_OCTAVE,
                             x_axis='time', ax=ax)
    for interval, label in zip(zip(bound_times, bound_times[1:]), bound_segs):
        ax.add_patch(patches.Rectangle((interval[0], freqs[0]),
                                       interval[1] - interval[0],
                                       freqs[-1],
                                       facecolor=colors(label),
                                       alpha=0.50))
One major thing that I believe would have to change is the number of clusters. They use 5 in the example, but I don't know what I would want it to be, because I don't know how many sounds there are. I set it to 400, producing the following result, which didn't really feel like something I could work with. Ideally I would want all the blocks to be solid colors: not colors in between the max red and blue values.
(I turned it sideways to look more like my examples above and more like the output I'm trying to produce.)
Additional Info
There may also be a drum track in the background and sometimes multiple sounds are played at the same time. If these multiple sound groups get interpreted as one unique sound that's ok, but I'd obviously prefer if they could be distinguished as separate sounds.
If it makes it easier, you can remove a drum loop using:
y, sr = librosa.load('exampleSong.mp3')  # load your own file here
y_harmonic, y_percussive = librosa.effects.hpss(y)
Update
I was able to separate the sounds by transients. Currently this kind of works, but it separates the audio into too many sounds; from what I could tell, it mostly seemed to be splitting some sounds into two. I can also create a MIDI file from the software I'm using and use that to determine the transient times, but I would like to solve this problem without the MIDI file if I can. The MIDI file was pretty accurate: it split the sound file into 33 sections, whereas the transient code split the sound file into 40 sections. Here's a visualization of the MIDI.
So the parts that still need to be solved are:
Better transient separation (see the onset-detection sketch below)
Sorting the sounds
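For reference, a minimal onset-detection sketch of the kind of transient separation I mean ('exampleSong.mp3' is a placeholder file name):
import librosa

# Detect onsets (transients) and slice the signal between consecutive onsets
y, sr = librosa.load('exampleSong.mp3')  # placeholder file name
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units='time', backtrack=True)
onset_samples = librosa.time_to_samples(onset_times, sr=sr)
fragments = [y[s:e] for s, e in zip(onset_samples[:-1], onset_samples[1:])]
print(len(fragments), 'fragments')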
Below is a script that uses Non-negative Matrix Factorization (NMF) on mel-spectrograms to decompose the input audio. I took the first few seconds of complete audio from your uploaded WAV and ran the code to get the following output.
Both the code and the audio clip can be found in the GitHub repository.
This approach seems to do reasonably well on short audio clips when the BPM is known (it seems to be around 130 with the given example) and the input audio is roughly aligned to the beat. There is no guarantee it will work as well on a whole song, or on other songs.
There are many ways it could be improved:
Using a more compact and perceptual feature vector than the mel-spectrogram as the NMF input. Possibly a transformation learned from music, e.g. an embedding or an autoencoder.
De-duplicate NMF components into "primary" components.
Adding constraints to the NMF, such as temporal continuity. There are lots of research papers on this.
Automatically detecting the BPM and doing the alignment (see the beat-tracking sketch after this list).
Better perceptual sorting. One might want groups, such as chords, single tones, and percussive sounds.
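On the automatic BPM detection point, a hedged starting point using librosa's beat tracker, applied to the 'silence-end.wav' clip used in the script below:
import librosa

# Estimate the tempo (BPM) directly from the audio. This is only a
# starting point: octave errors (e.g. 65 vs 130 BPM) are common and
# may need manual correction.
y, sr = librosa.load('silence-end.wav')
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
print('estimated BPM:', float(tempo))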
import os.path
import sys
import librosa
import pandas
import numpy
import sklearn.decomposition
import skimage.color
from matplotlib import pyplot as plt
import librosa.display
import seaborn
def decompose_audio(y, sr, bpm, per_beat=8,
                    n_components=16, n_mels=128, fmin=100, fmax=6000):
    """
    Decompose audio using NMF spectrogram decomposition,
    using a fixed number of frames per beat (#per_beat) for a given #bpm
    NOTE: assumes audio to be aligned to the beat
    """
    interval = (60/bpm)/per_beat
    T = sklearn.decomposition.NMF(n_components)
    S = numpy.abs(librosa.feature.melspectrogram(y=y, hop_length=int(sr*interval),
                                                 n_mels=n_mels, fmin=fmin, fmax=fmax))
    comps, acts = librosa.decompose.decompose(S, transformer=T, sort=False)
    # compute feature to sort components by
    ind = numpy.apply_along_axis(numpy.argmax, 0, comps)
    #ind = librosa.feature.spectral_rolloff(S=comps)[0]
    #ind = librosa.feature.spectral_centroid(S=comps)[0]
    # apply sorting
    order_idx = numpy.argsort(ind)
    ordered_comps = comps[:, order_idx]
    ordered_acts = acts[order_idx, :]
    # plot components
    librosa.display.specshow(librosa.amplitude_to_db(ordered_comps,
                                                     ref=numpy.max),
                             y_axis='mel', sr=sr)
    return S, ordered_comps, ordered_acts

def plot_colorized_activations(acts, ax, hop_length=None, sr=None, value_mod=1.0):
    hsv = numpy.stack([
        numpy.ones(shape=acts.shape),
        numpy.ones(shape=acts.shape),
        acts,
    ], axis=-1)
    # Set hue based on a palette
    colors = seaborn.color_palette("husl", hsv.shape[0])
    for row_no in range(hsv.shape[0]):
        c = colors[row_no]
        c = skimage.color.rgb2hsv(numpy.stack([c]))[0]
        hsv[row_no, :, 0] = c[0]
        hsv[row_no, :, 1] = c[1]
        hsv[row_no, :, 2] *= value_mod
    colored = skimage.color.hsv2rgb(hsv)
    # use same kind of order as librosa.specshow
    flipped = colored[::-1, :, :]
    ax.imshow(flipped)
    ax.set(aspect='auto')
    ax.tick_params(axis='x',
                   which='both',
                   bottom=False,
                   top=False,
                   labelbottom=False)
    ax.tick_params(axis='both',
                   which='both',
                   bottom=False,
                   left=False,
                   top=False,
                   labelbottom=False)

def plot_activations(S, acts):
    fig, ax = plt.subplots(nrows=4, ncols=1, figsize=(25, 15), sharex=False)
    # spectrogram
    db = librosa.amplitude_to_db(S, ref=numpy.max)
    librosa.display.specshow(db, ax=ax[0], y_axis='mel')
    # original activations
    librosa.display.specshow(acts, x_axis='time', ax=ax[1])
    # colorized
    plot_colorized_activations(acts, ax=ax[2], value_mod=3.0)
    # thresholded
    q = numpy.quantile(acts, 0.90, axis=0, keepdims=True) + 1e-9
    norm = acts / q
    threshold = numpy.quantile(norm, 0.93)
    plot_colorized_activations((norm > threshold).astype(float), ax=ax[3], value_mod=1.0)
    return fig

def main():
    audio_file = 'silence-end.wav'
    audio_bpm = 130
    sr = 22050
    audio, sr = librosa.load(audio_file, sr=sr)
    S, comps, acts = decompose_audio(y=audio, sr=sr, bpm=audio_bpm)
    fig = plot_activations(S, acts)
    fig.savefig('plot.png', transparent=False)

main()

Matplotlib's Basemap seems to not store map's center for later overplotting of data

I want to plot the average daily temperature from the NOAA Earth System Research Laboratory's Physical Sciences Division onto a map created with matplotlib's Basemap.
The dataset can be downloaded as a netCDF file from here.
My problem is, however, that Basemap seems not to store the center (or bounding box) coordinates of the map, as the subsequent overplot only fills part of the map; see the following figure:
The code to generate the figure is as follows:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import netCDF4
# to check whether a file exists (before downloading it)
import os.path
import sys
fig1, ax1 = plt.subplots(1,1, figsize=(8,6) )
temperature_fname = 'air.sig995.2016.nc'
url = 'https://www.esrl.noaa.gov/psd/thredds/fileServer/Datasets/ncep.reanalysis.dailyavgs/surface/{0}'.format( temperature_fname)
if not os.path.isfile( temperature_fname ):
    print( "ERROR: you need to download the file {0}".format(url) )
    sys.exit(1)
# read netCDF4 dataset
tmprt_dSet = netCDF4.Dataset( temperature_fname )
# extract (copy) the relevant data
tmprt_vals = tmprt_dSet.variables['air'][:] - 273.15
tmprt_lat = tmprt_dSet.variables['lat'][:]
tmprt_lon = tmprt_dSet.variables['lon'][:]
# close dataset
tmprt_dSet.close()
# use the Miller projection
map1 = Basemap( projection='mill', resolution='l',
lon_0=0., lat_0=0.
)
# draw coastline, map-boundary
map1.drawcoastlines()
map1.drawmapboundary( fill_color='white' )
# draw grid
map1.drawparallels( np.arange(-90.,90.,30.), labels=[1,0,0,0] )
map1.drawmeridians( np.arange(-180.,180.,60.),labels=[0,0,0,1] )
# overplot temperature
## make the longitude and latitude grid projected onto map
tmprt_x, tmprt_y = map1(*np.meshgrid(tmprt_lon,tmprt_lat))
## make the contour plot
CS1 = map1.contourf( tmprt_x, tmprt_y, tmprt_vals[0,:,:],
cmap=plt.cm.jet
)
cbar1 = map1.colorbar( CS1, location='right' )
cbar1.set_label( r'$T$ in $^\circ$C')
plt.show()
Note: if I set lon_0=180 everything looks fine (it is just not the center position I would like to have)
I have the feeling that the solution is pretty obvious and I would appreciate any hint pointing me into that direction.
As commented, the data is arranged from 0 to 360 degrees instead of -180 to 180. So you need to:
map the range between 180 and 360 degrees to -180 to 0;
move the second half of the data in front of the first half, so that it is sorted ascendingly.
Adding the following piece of code in between your data extraction and the plotting function would do that.
# map lon values to -180..180 range
f = lambda x: ((x+180) % 360) - 180
tmprt_lon = f(tmprt_lon)
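# (for example, f(190) == -170 and f(350) == -10)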
# rearrange data
ind = np.argsort(tmprt_lon)
tmprt_lon = tmprt_lon[ind]
tmprt_vals = tmprt_vals[:, :, ind]
Complete code:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import netCDF4
# read netCDF4 dataset
tmprt_dSet = netCDF4.Dataset('data/air.sig995.2018.nc')
# extract (copy) the relevant data
tmprt_vals = tmprt_dSet.variables['air'][:] - 273.15
tmprt_lat = tmprt_dSet.variables['lat'][:]
tmprt_lon = tmprt_dSet.variables['lon'][:]
# close dataset
tmprt_dSet.close()
### Section added ################
# map lon values to -180..180 range
f = lambda x: ((x+180) % 360) - 180
tmprt_lon = f(tmprt_lon)
# rearrange data
ind = np.argsort(tmprt_lon)
tmprt_lon = tmprt_lon[ind]
tmprt_vals = tmprt_vals[:, :, ind]
##################################
fig1, ax1 = plt.subplots(1,1, figsize=(8,6) )
# use the Miller projection
map1 = Basemap( projection='mill', resolution='l',
lon_0=0., lat_0=0. )
# draw coastline, map-boundary
map1.drawcoastlines()
map1.drawmapboundary( fill_color='white' )
# draw grid
map1.drawparallels( np.arange(-90.,90.,30.), labels=[1,0,0,0] )
map1.drawmeridians( np.arange(-180.,180.,60.),labels=[0,0,0,1] )
# overplot temperature
## make the longitude and latitude grid projected onto map
tmprt_x, tmprt_y = map1(*np.meshgrid(tmprt_lon,tmprt_lat))
## make the contour plot
CS1 = map1.contourf( tmprt_x, tmprt_y, tmprt_vals[0,:,:],
cmap=plt.cm.jet
)
cbar1 = map1.colorbar( CS1, location='right' )
cbar1.set_label( r'$T$ in $^\circ$C')
plt.show()
This is challenging. I split the data array into 2 parts. The first part spans from 0° to 180°E longitude. The second part, lying west of 0°, needs a longitude shift of -360°. The colormap must be normalized so that both parts share common reference colors. Here is the working code and the resulting plot:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import netCDF4
import matplotlib as mpl
#import os.path
#import sys
fig1, ax1 = plt.subplots(1,1, figsize=(10,6) )
temperature_fname = r'.\air.sig995.2018.nc'
# read netCDF4 dataset
tmprt_dSet = netCDF4.Dataset( temperature_fname )
# extract (copy) the relevant data
shift_val = - 273.15
tmprt_vals = tmprt_dSet.variables['air'][:] + shift_val
tmprt_lat = tmprt_dSet.variables['lat'][:]
tmprt_lon = tmprt_dSet.variables['lon'][:]
# prep norm of the color map
color_shf = 40 # to get better lower range of colormap
normalize = mpl.colors.Normalize(tmprt_vals.data.min()+color_shf, \
tmprt_vals.data.max())
# close dataset
#tmprt_dSet.close()
# use the Miller projection
map1 = Basemap( projection='mill', resolution='i', \
lon_0=0., lat_0=0.)
# draw coastline, map-boundary
map1.drawcoastlines()
map1.drawmapboundary( fill_color='white' )
# draw grid
map1.drawparallels( np.arange(-90.,90.,30.), labels=[1,0,0,0] )
map1.drawmeridians( np.arange(-180.,180.,60.), labels=[0,0,0,1] )
# overplot temperature
# split data into 2 parts at column 73 (longitude: +180)
# part1 (take location as is)
beg_col = 0
end_col = 73
grdx, grdy = np.meshgrid(tmprt_lon[beg_col:end_col], tmprt_lat[:])
tmprt_x, tmprt_y = map1(grdx, grdy)
CS1 = map1.contourf( tmprt_x, tmprt_y, tmprt_vals[0,:, beg_col:end_col],
cmap=plt.cm.jet, norm=normalize)
# part2 (longitude is shifted -360 degrees, but -359.5 looks better)
beg_col4 = 73
end_col4 = 144
grdx, grdy = np.meshgrid(tmprt_lon[beg_col4:end_col4]-359.5, tmprt_lat[:])
tmprt_x, tmprt_y = map1(grdx, grdy)
CS4 = map1.contourf( tmprt_x, tmprt_y, tmprt_vals[0,:, beg_col4:end_col4],
cmap=plt.cm.jet, norm=normalize)
# color bars CS1, CS4 are the same (normalized), plot one only
cbar1 = map1.colorbar( CS1, location='right' )
cbar1.set_label( r'$T$ in $^\circ$C')
plt.show()
The resulting plot:
Both answers posted so far are a solution to my question (thank you, ImportanceOfBeingErnest and swatchai).
I thought, however, that there must be a simpler way to do this (and by simple I mean some Basemap utility). So I looked into the documentation [1] again and found something I had overlooked so far: mpl_toolkits.basemap.shiftgrid. The following two lines need to be added to the code:
from mpl_toolkits.basemap import shiftgrid
tmprt_vals, tmprt_lon = shiftgrid(180., tmprt_vals, tmprt_lon, start=False)
Note that the second line has to be added before the meshgrid call.
[1] https://matplotlib.org/basemap/api/basemap_api.html
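In context, a minimal sketch of where the two lines go, assuming the variables from the code in the question:
from mpl_toolkits.basemap import shiftgrid

# shift the 0..360 degree grid so that it runs -180..180, matching lon_0=0
tmprt_vals, tmprt_lon = shiftgrid(180., tmprt_vals, tmprt_lon, start=False)
# ... then build the mesh and plot as before
tmprt_x, tmprt_y = map1(*np.meshgrid(tmprt_lon, tmprt_lat))
CS1 = map1.contourf(tmprt_x, tmprt_y, tmprt_vals[0, :, :], cmap=plt.cm.jet)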

Matplotlib Affine2D object has no attribute 'skew_deg'

I am trying to create a plot using matplotlib but I get an error, and after hours of searching, I do not see an alternative or something that works. Here's the code that's giving me trouble:
import matplotlib.transforms as transforms
self.transDataToAxes = self.transScale + (self.transLimits +
transforms.Affine2D().skew_deg(rot, 0))
Which gives me the error: AttributeError: 'Affine2D' object has no attribute 'skew_deg'. This error happens with both Python 2.7 and Python 3.
If anyone has any suggestions on what I can try, it would be greatly appreciated.
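As far as I know, Affine2D.skew_deg was added around matplotlib 1.4, so the error is consistent with an older matplotlib install. A quick hedged check of what your environment provides:
import matplotlib
import matplotlib.transforms as transforms

print(matplotlib.__version__)
# True on matplotlib versions that provide skew_deg (1.4+, if I recall correctly)
print(hasattr(transforms.Affine2D(), 'skew_deg'))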
Edit: Here is the entire script which I am trying to run, it should also be noted that I've tried this on Windows, Linux, and Mac with no success:
import matplotlib
spc_file = open('1OUN.txt', 'r').read()
import sharppy
import sharppy.sharptab.profile as profile
import sharppy.sharptab.interp as interp
import sharppy.sharptab.winds as winds
import sharppy.sharptab.utils as utils
import sharppy.sharptab.params as params
import sharppy.sharptab.thermo as thermo
import numpy as np
from StringIO import StringIO
def parseSPC(spc_file):
    ## read in the file
    data = np.array([l.strip() for l in spc_file.split('\n')])
    ## necessary index points
    title_idx = np.where( data == '%TITLE%')[0][0]
    start_idx = np.where( data == '%RAW%' )[0] + 1
    finish_idx = np.where( data == '%END%')[0]
    ## create the plot title
    data_header = data[title_idx + 1].split()
    location = data_header[0]
    time = data_header[1][:11]
    ## put it all together for StringIO
    full_data = '\n'.join(data[start_idx : finish_idx][:])
    sound_data = StringIO( full_data )
    ## read the data into arrays
    p, h, T, Td, wdir, wspd = np.genfromtxt( sound_data, delimiter=',', comments="%", unpack=True )
    return p, h, T, Td, wdir, wspd
pres, hght, tmpc, dwpc, wdir, wspd = parseSPC(spc_file)
prof = profile.create_profile(profile='default', pres=pres, hght=hght, tmpc=tmpc, \
dwpc=dwpc, wspd=wspd, wdir=wdir, missing=-9999, strictQC=True)
import matplotlib.pyplot as plt
plt.plot(prof.tmpc, prof.hght, 'r-')
plt.plot(prof.dwpc, prof.hght, 'g-')
#plt.barbs(40*np.ones(len(prof.hght)), prof.hght, prof.u, prof.v)
plt.xlabel("Temperature [C]")
plt.ylabel("Height [m above MSL]")
plt.grid()
plt.show()
msl_hght = prof.hght[prof.sfc] # Grab the surface height value
agl_hght = interp.to_agl(prof, msl_hght) # Converts to AGL
msl_hght = interp.to_msl(prof, agl_hght) # Converts to MSL
plt.plot(thermo.ktoc(prof.thetae), prof.hght, 'r-', label='Theta-E')
plt.plot(prof.wetbulb, prof.hght, 'c-', label='Wetbulb')
plt.xlabel("Temperature [C]")
plt.ylabel("Height [m above MSL]")
plt.legend()
plt.grid()
plt.show()
sfcpcl = params.parcelx( prof, flag=1 ) # Surface Parcel
#fcstpcl = params.parcelx( prof, flag=2 ) # Forecast Parcel
#mupcl = params.parcelx( prof, flag=3 ) # Most-Unstable Parcel
#mlpcl = params.parcelx( prof, flag=4 ) # 100 mb Mean Layer Parcel
# This serves as an intensive exercise of matplotlib's transforms
# and custom projection API. This example produces a so-called
# SkewT-logP diagram, which is a common plot in meteorology for
# displaying vertical profiles of temperature. As far as matplotlib is
# concerned, the complexity comes from having X and Y axes that are
# not orthogonal. This is handled by including a skew component to the
# basic Axes transforms. Additional complexity comes in handling the
# fact that the upper and lower X-axes have different data ranges, which
# necessitates a bunch of custom classes for ticks,spines, and the axis
# to handle this.
from matplotlib.axes import Axes
import matplotlib.transforms as transforms
import matplotlib.axis as maxis
import matplotlib.spines as mspines
import matplotlib.path as mpath
from matplotlib.projections import register_projection
# The sole purpose of this class is to look at the upper, lower, or total
# interval as appropriate and see what parts of the tick to draw, if any.
class SkewXTick(maxis.XTick):
    def draw(self, renderer):
        if not self.get_visible(): return
        renderer.open_group(self.__name__)
        lower_interval = self.axes.xaxis.lower_interval
        upper_interval = self.axes.xaxis.upper_interval
        if self.gridOn and transforms.interval_contains(
                self.axes.xaxis.get_view_interval(), self.get_loc()):
            self.gridline.draw(renderer)
        if transforms.interval_contains(lower_interval, self.get_loc()):
            if self.tick1On:
                self.tick1line.draw(renderer)
            if self.label1On:
                self.label1.draw(renderer)
        if transforms.interval_contains(upper_interval, self.get_loc()):
            if self.tick2On:
                self.tick2line.draw(renderer)
            if self.label2On:
                self.label2.draw(renderer)
        renderer.close_group(self.__name__)
# This class exists to provide two separate sets of intervals to the tick,
# as well as create instances of the custom tick
class SkewXAxis(maxis.XAxis):
    def __init__(self, *args, **kwargs):
        maxis.XAxis.__init__(self, *args, **kwargs)
        self.upper_interval = 0.0, 1.0

    def _get_tick(self, major):
        return SkewXTick(self.axes, 0, '', major=major)

    @property
    def lower_interval(self):
        return self.axes.viewLim.intervalx

    def get_view_interval(self):
        return self.upper_interval[0], self.axes.viewLim.intervalx[1]
# This class exists to calculate the separate data range of the
# upper X-axis and draw the spine there. It also provides this range
# to the X-axis artist for ticking and gridlines
class SkewSpine(mspines.Spine):
    def _adjust_location(self):
        trans = self.axes.transDataToAxes.inverted()
        if self.spine_type == 'top':
            yloc = 1.0
        else:
            yloc = 0.0
        left = trans.transform_point((0.0, yloc))[0]
        right = trans.transform_point((1.0, yloc))[0]
        pts = self._path.vertices
        pts[0, 0] = left
        pts[1, 0] = right
        self.axis.upper_interval = (left, right)
# This class handles registration of the skew-xaxes as a projection as well
# as setting up the appropriate transformations. It also overrides standard
# spines and axes instances as appropriate.
class SkewXAxes(Axes):
    # The projection must specify a name. This will be used by the
    # user to select the projection, i.e. ``subplot(111,
    # projection='skewx')``.
    name = 'skewx'

    def _init_axis(self):
        # Taken from Axes and modified to use our modified X-axis
        self.xaxis = SkewXAxis(self)
        self.spines['top'].register_axis(self.xaxis)
        self.spines['bottom'].register_axis(self.xaxis)
        self.yaxis = maxis.YAxis(self)
        self.spines['left'].register_axis(self.yaxis)
        self.spines['right'].register_axis(self.yaxis)

    def _gen_axes_spines(self):
        spines = {'top': SkewSpine.linear_spine(self, 'top'),
                  'bottom': mspines.Spine.linear_spine(self, 'bottom'),
                  'left': mspines.Spine.linear_spine(self, 'left'),
                  'right': mspines.Spine.linear_spine(self, 'right')}
        return spines

    def _set_lim_and_transforms(self):
        """
        This is called once when the plot is created to set up all the
        transforms for the data, text and grids.
        """
        rot = 30

        # Get the standard transform setup from the Axes base class
        Axes._set_lim_and_transforms(self)

        # Need to put the skew in the middle, after the scale and limits,
        # but before the transAxes. This way, the skew is done in Axes
        # coordinates thus performing the transform around the proper origin
        # We keep the pre-transAxes transform around for other users, like the
        # spines for finding bounds
        self.transDataToAxes = self.transScale + (self.transLimits +
                                                  transforms.Affine2D().skew_deg(rot, 0))

        # Create the full transform from Data to Pixels
        self.transData = self.transDataToAxes + self.transAxes

        # Blended transforms like this need to have the skewing applied using
        # both axes, in axes coords like before.
        self._xaxis_transform = (transforms.blended_transform_factory(
            self.transScale + self.transLimits,
            transforms.IdentityTransform()) +
            transforms.Affine2D().skew_deg(rot, 0)) + self.transAxes
# Now register the projection with matplotlib so the user can select
# it.
register_projection(SkewXAxes)
pcl = sfcpcl
# Create a new figure. The dimensions here give a good aspect ratio
fig = plt.figure(figsize=(6.5875, 6.2125))
ax = fig.add_subplot(111, projection='skewx')
ax.grid(True)
pmax = 1000
pmin = 10
dp = -10
presvals = np.arange(int(pmax), int(pmin)+dp, dp)
# plot the moist-adiabats
for t in np.arange(-10, 45, 5):
    tw = []
    for p in presvals:
        tw.append(thermo.wetlift(1000., t, p))
    ax.semilogy(tw, presvals, 'k-', alpha=.2)
def thetas(theta, presvals):
    return ((theta + thermo.ZEROCNK) / (np.power((1000. / presvals), thermo.ROCP))) - thermo.ZEROCNK

# plot the dry adiabats
for t in np.arange(-50, 110, 10):
    ax.semilogy(thetas(t, presvals), presvals, 'r-', alpha=.2)
plt.title(' OAX 140616/1900 (Observed)', fontsize=14, loc='left')
# Plot the data using normal plotting functions, in this case using
# log scaling in Y, as dictated by the typical meteorological plot
ax.semilogy(prof.tmpc, prof.pres, 'r', lw=2)
ax.semilogy(prof.dwpc, prof.pres, 'g', lw=2)
ax.semilogy(pcl.ttrace, pcl.ptrace, 'k-.', lw=2)
# An example of a slanted line at constant X
l = ax.axvline(0, color='b', linestyle='--')
l = ax.axvline(-20, color='b', linestyle='--')
# Disables the log-formatting that comes with semilogy
ax.yaxis.set_major_formatter(plt.ScalarFormatter())
ax.set_yticks(np.linspace(100,1000,10))
ax.set_ylim(1050,100)
ax.xaxis.set_major_locator(plt.MultipleLocator(10))
ax.set_xlim(-50,50)
plt.show()
And the text file can be found here, before renaming it: OUN_Sounding

How to plot GeoTIFF data in a specific area (lat/lon) with Python

I have a GeoTIFF raster dataset with elevation data in it, and I want to plot it in a specific area, such as 60°E-70°E, 70°S-80°S.
I have a bit of code from here, but pcolormesh doesn't seem to plot my GeoTIFF correctly: it's all red (see picture). With imshow, the picture displays properly.
When I try to make a plot with this code below:
path = "F:\\Mosaic_h1112v28_ps.tif"
dataset = gdal.Open(path)
data = dataset.ReadAsArray()
x0, dx, dxdy, y0, dydx, dy = dataset.GetGeoTransform()
nrows, ncols = data.shape
londata = np.linspace(x0, x0+dx*ncols)
latdata = np.linspace(y0, y0+dy*nrows)
lons, lats = np.meshgrid(lonarray, latarray)
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', lon_0=67.5, lat_0=-68.5, height=950000,
width=580000, resolution='h')
m.drawcoastlines()
x, y = m(lons, lats)
Then I don't know how to continue. I just want to use imshow, but imshow doesn't let me specify the area (lat/lon).
I will really appreciate your help.
It's a good question, here is my solution.
Required packages: georaster with its dependencies (gdal, etc).
Data for demo purposes downloadable from http://dwtkns.com/srtm/
import georaster
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8,8))
# full path to the geotiff file
fpath = r"C:\\path_to_your\geotiff_file\srtm_57_10.tif" # Thailand east
# read extent of image without loading
# good for values in degrees lat/long
# geotiff may use other coordinates and projection
my_image = georaster.SingleBandRaster(fpath, load_data=False)
# grab limits of image's extent
minx, maxx, miny, maxy = my_image.extent
# set Basemap with slightly larger extents
# set resolution at intermediate level "i"
m = Basemap( projection='cyl', \
llcrnrlon=minx-2, \
llcrnrlat=miny-2, \
urcrnrlon=maxx+2, \
urcrnrlat=maxy+2, \
resolution='i')
m.drawcoastlines(color="gray")
m.fillcontinents(color='beige')
# load the geotiff image, assign it a variable
image = georaster.SingleBandRaster( fpath, \
load_data=(minx, maxx, miny, maxy), \
latlon=True)
# plot the image on matplotlib active axes
# set zorder to put the image on top of coastlines and continent areas
# set alpha to let the hidden graphics show through
plt.imshow(image.r, extent=(minx, maxx, miny, maxy), zorder=10, alpha=0.6)
plt.show()
The resulting plot:
Edit1
My original answer focused on how to plot a simple GeoTIFF image on the most basic projection with Basemap. A better answer was not possible without access to all the required resources (i.e. the GeoTIFF file).
Here I try to improve my answer.
I have clipped a small portion from the whole-world GeoTIFF file, then reprojected (warped) it to the LCC projection specification defined by the Basemap() call to be used. All of this was done with GDAL tools. The resulting file is named "lcc_2.tiff". With this GeoTIFF file, the plotting of the image is done with the code below.
The most important part is that the GeoTIFF file must have the same coordinate system (same projection) as the projection used by Basemap.
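For reference, a hedged sketch of that warp step using GDAL's Python bindings; the proj4 string mirrors the Basemap LCC parameters above, and 'world_clip.tif' is a placeholder name for the clipped input:
from osgeo import gdal

# Reproject the clipped GeoTIFF to the LCC projection used by Basemap.
# lat_1 is set equal to lat_0 here as an assumption about the standard parallel.
gdal.Warp('lcc_2.tiff', 'world_clip.tif',
          dstSRS='+proj=lcc +lat_0=-68.5 +lon_0=67.5 +lat_1=-68.5 +datum=WGS84')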
import georaster
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8,8))
m = Basemap(projection='lcc', lon_0=67.5, lat_0=-68.5, \
height=950000, width=580000, resolution='h')
m.drawcoastlines()
m.fillcontinents(color='beige')
image = georaster.SingleBandRaster( "lcc_2.tiff", latlon=False)
plt.imshow(image.r, extent=image.extent, zorder=10, alpha=0.6)
plt.show()
The output map:
Here is my solution.
1. Import the GeoTIFF file and transform it into a 2-D data array
from osgeo import gdal
pathToRaster = r'./xxxx.tif'
raster = gdal.Open(pathToRaster, gdal.GA_ReadOnly)
data = raster.GetRasterBand(1).ReadAsArray()
data = data[::-1]
2. Plot it using pcolormesh
kk = plt.pcolormesh(data, cmap=plt.cm.Reds, alpha=0.45, zorder=2)
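If you want the pcolormesh georeferenced, here is a hedged sketch that builds lon/lat bin edges from the GeoTransform, assuming a north-up GeoTIFF with no rotation terms and the raster/data variables from step 1:
import numpy as np
import matplotlib.pyplot as plt

gt = raster.GetGeoTransform()  # (x0, dx, 0, y0, 0, dy) for north-up rasters
ncols, nrows = raster.RasterXSize, raster.RasterYSize
lon_edges = gt[0] + gt[1] * np.arange(ncols + 1)
lat_edges = gt[3] + gt[5] * np.arange(nrows + 1)
# data was flipped with data[::-1] above, so flip the latitude edges too
plt.pcolormesh(lon_edges, lat_edges[::-1], data, cmap=plt.cm.Reds)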
You can use rioxarray
import rioxarray as rio
ds = rio.open_rasterio(path)
# Example lat/lon range for the subset
geometries = [
    {
        'type': 'Polygon',
        'coordinates': [[
            [33.97301017665958, -118.45830810580743],
            [33.96660083660732, -118.37455741054782],
            [33.92304171545437, -118.37151348516299],
            [33.915042933806724, -118.42909440702563]
        ]]
    }
]
clipped = ds.rio.clip(geometries)
clipped.plot()

Random Number from Histogram

Suppose I create a histogram using scipy/numpy, so I have two arrays: one for the bin counts, and one for the bin edges. If I use the histogram to represent a probability distribution function, how can I efficiently generate random numbers from that distribution?
It's probably what np.random.choice does in @Ophion's answer, but you can construct a normalized cumulative distribution function, then choose based on a uniform random number:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)
bin_midpoints = bins[:-1] + np.diff(bins)/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]
plt.subplot(121)
plt.hist(data, 50)
plt.subplot(122)
plt.hist(random_from_cdf, 50)
plt.show()
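For comparison, the same 1D sampling can also be written with np.random.choice, which is equivalent up to the bin-midpoint approximation:
probs = hist / hist.sum()
random_from_choice = np.random.choice(bin_midpoints, size=10000, p=probs)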
A 2D case can be done as follows:
data = np.column_stack((np.random.normal(scale=10, size=1000),
np.random.normal(scale=20, size=1000)))
x, y = data.T
hist, x_bins, y_bins = np.histogram2d(x, y, bins=(50, 50))
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
cdf = np.cumsum(hist.ravel())
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = random_from_cdf.T
plt.subplot(121, aspect='equal')
plt.hist2d(x, y, bins=(50, 50))
plt.subplot(122, aspect='equal')
plt.hist2d(new_x, new_y, bins=(50, 50))
plt.show()
@Jaime's solution is great, but you should consider using the KDE (kernel density estimate) of the histogram. A great explanation of why it's problematic to do statistics over a histogram, and why you should use KDE instead, can be found here.
I edited @Jaime's code to show how to use the KDE from SciPy. It looks almost the same, but captures the histogram generator better.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
def run():
    data = np.random.normal(size=1000)
    hist, bins = np.histogram(data, bins=50)
    x_grid = np.linspace(min(data), max(data), 1000)
    kdepdf = kde(data, x_grid, bandwidth=0.1)
    random_from_kde = generate_rand_from_pdf(kdepdf, x_grid)
    bin_midpoints = bins[:-1] + np.diff(bins) / 2
    random_from_cdf = generate_rand_from_pdf(hist, bin_midpoints)

    plt.subplot(121)
    plt.hist(data, 50, density=True, alpha=0.5, label='hist')  # density= replaces the removed normed= argument
    plt.plot(x_grid, kdepdf, color='r', alpha=0.5, lw=3, label='kde')
    plt.legend()
    plt.subplot(122)
    plt.hist(random_from_cdf, 50, alpha=0.5, label='from hist')
    plt.hist(random_from_kde, 50, alpha=0.5, label='from kde')
    plt.legend()
    plt.show()

def kde(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scipy"""
    kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1), **kwargs)
    return kde.evaluate(x_grid)

def generate_rand_from_pdf(pdf, x_grid):
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    values = np.random.rand(1000)
    value_bins = np.searchsorted(cdf, values)
    random_from_cdf = x_grid[value_bins]
    return random_from_cdf

run()
Perhaps something like this. It uses the counts of the histogram as weights and chooses bin values based on these weights.
import numpy as np
initial = np.random.rand(1000)
values, indices = np.histogram(initial, bins=20)
values = values.astype(np.float32)
weights = values/np.sum(values)
# Below, 5 is the length of the returned array.
new_random = np.random.choice(indices[1:], 5, p=weights)
print(new_random)
#[ 0.55141614  0.30226256  0.25243184  0.90023117  0.55141614]
I had the same problem as the OP, and I would like to share my approach to this problem.
Following Jaime's answer and Noam Peled's answer, I've built a solution for a 2D problem using Kernel Density Estimation (KDE).
First, let's generate some random data and then calculate its Probability Density Function (PDF) from the KDE. I will use the example available in SciPy for that.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots()
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(m1, m2, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
And the plot is:
Now, we obtain random data from the PDF obtained from the KDE, which is the variable Z.
# Generate the bins for each axis
x_bins = np.linspace(xmin, xmax, Z.shape[0]+1)
y_bins = np.linspace(ymin, ymax, Z.shape[1]+1)
# Find the middle point for each bin
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
# Calculate the Cumulative Distribution Function (CDF) from the PDF
cdf = np.cumsum(Z.ravel())
cdf = cdf / cdf[-1]  # Normalization
# Create random data
values = np.random.rand(10000)
# Find the data position
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
# Create the new data
new_data = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = new_data.T
And we can calculate the KDE from this new data and then plot it.
kernel = stats.gaussian_kde(new_data.T)
new_Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots()
ax.imshow(np.rot90(new_Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(new_x, new_y, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
Here is a solution that returns datapoints uniformly distributed within each bin instead of at the bin center:
import numpy as np

def draw_from_hist(hist, bins, nsamples=100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
A few things do not work well in the solutions suggested by @daniel, @arco-bast, et al.
Taking the last example
def draw_from_hist(hist, bins, nsamples=100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
This assumes that at least the first bin has zero content, which may or may not be true. Secondly, this assumes that the value of the PDF sits at the upper bound of each bin, which it doesn't; it's mostly at the centre of the bin.
Here's another solution done in two parts
def init_cdf(hist, bins):
    """Initialize CDF from histogram

    Parameters
    ----------
    hist : array-like, float of size N
        Histogram height
    bins : array-like, float of size N+1
        Histogram bin boundaries

    Returns
    -------
    cdf : array-like, float of size N+1
    """
    from numpy import concatenate, diff, cumsum

    # Calculate half bin sizes
    steps = diff(bins) / 2  # Half bin size

    # Calculate slope between bin centres
    slopes = diff(hist) / (steps[:-1]+steps[1:])

    # Find height of end points by linear interpolation
    # - First part is linear interpolation from second over first
    #   point to lowest bin edge
    # - Second part is linear interpolation left neighbor to
    #   right neighbor up to but not including last point
    # - Third part is linear interpolation from second to last point
    #   over last point to highest bin edge
    # Can probably be done more elegantly
    ends = concatenate(([hist[0] - steps[0] * slopes[0]],
                        hist[:-1] + steps[:-1] * slopes,
                        [hist[-1] + steps[-1] * slopes[-1]]))

    # Calculate cumulative sum
    sum = cumsum(ends)
    # Subtract off lower bound and scale by upper bound
    sum -= sum[0]
    sum /= sum[-1]
    # Return the CDF
    return sum
def sample_cdf(cdf, bins, size):
    """Sample a CDF defined at specific points.

    Linear interpolation between defined points

    Parameters
    ----------
    cdf : array-like, float, size N
        CDF evaluated at all points of bins. First and
        last point of bins are assumed to define the domain
        over which the CDF is normalized.
    bins : array-like, float, size N
        Points where the CDF is evaluated. First and last points
        are assumed to define the end-points of the CDF's domain
    size : integer, non-zero
        Number of samples to draw

    Returns
    -------
    sample : array-like, float, of size ``size``
        Random sample
    """
    from numpy import interp
    from numpy.random import random

    return interp(random(size), cdf, bins)
# Begin example code
import numpy as np
import matplotlib.pyplot as plt
# initial histogram, coarse binning
hist,bins = np.histogram(np.random.normal(size=1000),np.linspace(-2,2,21))
# Calculate CDF, make sample, and new histogram w/finer binning
cdf = init_cdf(hist,bins)
sample = sample_cdf(cdf,bins,1000)
hist2,bins2 = np.histogram(sample,np.linspace(-3,3,61))
# Calculate bin centres and widths
mx = (bins[1:]+bins[:-1])/2
dx = np.diff(bins)
mx2 = (bins2[1:]+bins2[:-1])/2
dx2 = np.diff(bins2)
# Plot, taking care to show uncertainties and so on
plt.errorbar(mx,hist/dx,np.sqrt(hist)/dx,dx/2,'.',label='original')
plt.errorbar(mx2,hist2/dx2,np.sqrt(hist2)/dx2,dx2/2,'.',label='new')
plt.legend()
Sorry, I don't know how to get this to show up in StackOverflow, so copy'n'paste and run to see the point.
I stumbled upon this question when I was looking for a way to generate a random array based on the distribution of another array. If this were in numpy, I would call it a random_like() function.
Then I realized I have written a package, Redistributor, which might do this for me, even though the package was created with a somewhat different motivation (an Sklearn transformer capable of transforming data from an arbitrary distribution to an arbitrary known distribution, for machine learning purposes). Of course I understand unnecessary dependencies are not desired, but at least knowing this package might be useful to you someday. The thing the OP asked about is basically done under the hood here.
WARNING: under the hood, everything is done in 1D. The package also implements a multidimensional wrapper, but I have not written this example using it, as I find it to be too niche.
Installation:
pip install git+https://gitlab.com/paloha/redistributor
Implementation:
import numpy as np
import matplotlib.pyplot as plt
def random_like(source, bins=0, seed=None):
    from redistributor import Redistributor
    np.random.seed(seed)
    noise = np.random.uniform(source.min(), source.max(), size=source.shape)
    s = Redistributor(bins=bins, bbox=[source.min(), source.max()]).fit(source.ravel())
    s.cdf, s.ppf = s.source_cdf, s.source_ppf
    r = Redistributor(target=s, bbox=[noise.min(), noise.max()]).fit(noise.ravel())
    return r.transform(noise.ravel()).reshape(noise.shape)
source = np.random.normal(loc=0, scale=1, size=(100,100))
t = random_like(source, bins=80) # More bins more precision (0 = automatic)
# Plotting
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title(f'Distribution of source data, shape: {source.shape}')
plt.hist(source.ravel(), bins=100)
plt.subplot(122); plt.title(f'Distribution of generated data, shape: {t.shape}')
plt.hist(t.ravel(), bins=100); plt.show()
Explanation:
import numpy as np
import matplotlib.pyplot as plt
from redistributor import Redistributor
from sklearn.metrics import mean_squared_error
# We have some source array with "some unknown" distribution (e.g. an image)
# For the sake of example we just generate a random gaussian matrix
source = np.random.normal(loc=0, scale=1, size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Source data'); plt.imshow(source, origin='lower')
plt.subplot(122); plt.title('Source data hist'); plt.hist(source.ravel(), bins=100); plt.show()
# We want to generate a random matrix from the distribution of the source
# So we create a random uniformly distributed array called noise
noise = np.random.uniform(source.min(), source.max(), size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Uniform noise'); plt.imshow(noise, origin='lower')
plt.subplot(122); plt.title('Uniform noise hist'); plt.hist(noise.ravel(), bins=100); plt.show()
# Then we fit (approximate) the source distribution using Redistributor
# This step internally approximates the cdf and ppf functions.
s = Redistributor(bins=200, bbox=[source.min(), source.max()]).fit(source.ravel())
# A little naming workaround to make obj s work as a target distribution
s.cdf = s.source_cdf
s.ppf = s.source_ppf
# Here we create another Redistributor but now we use the fitted Redistributor s as a target
r = Redistributor(target=s, bbox=[noise.min(), noise.max()])
# Here we fit the Redistributor r to the noise array's distribution
r.fit(noise.ravel())
# And finally, we transform the noise into the source's distribution
t = r.transform(noise.ravel()).reshape(noise.shape)
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Transformed noise'); plt.imshow(t, origin='lower')
plt.subplot(122); plt.title('Transformed noise hist'); plt.hist(t.ravel(), bins=100); plt.show()
# Computing the difference between the two arrays
print('Mean Squared Error between source and transformed: ', mean_squared_error(source, t))
Mean Squared Error between source and transformed: 2.0574123162302143
