To illustrate my problem I prepared an example:
First, I have two arrays 'a' and 'b' and I'm interested in their distributions:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
plt.show()
This code gives me a histogram with two 'curves'. Now I want to subtract one 'curve' from the other, and by this I mean that I do this for each bin separately:
n3 = n2-n1
I don't need negative counts so:
for i in range(0, len(n2)):
    if n3[i] < 0:
        n3[i] = 0
    else:
        continue
The new histogram curve should be plotted over the same range as the previous ones and have the same number of bins. So I have the number of bins, their positions (the same as for the other curves, see the block above), and the counts (n3) that every bin should have. Any ideas on how I can plot this from the data I have?
You can use a step function to plot n3 = n2 - n1. The only catch is that you need to append one extra value, otherwise the last bin is not drawn. You also need to use the where="post" option of the step function.
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
n3 = n2 - n1
n3[n3 < 0] = 0  # clip negative differences to zero
# repeat the last value so the final bin is drawn; where='post' holds each value until the next edge
plt.step(np.arange(1, 10, 2), np.append(n3, [n3[-1]]), where='post', lw=3)
plt.show()
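As a side note (my addition, not part of the original answer): if your matplotlib is 3.4 or newer, plt.stairs is made for exactly this case of precomputed counts plus bin edges, so no extra value has to be appended. A minimal sketch using the n3 and bin1 from above:
# stairs takes one count per bin and the matching bin edges
plt.stairs(n3, bin1, lw=3)
plt.show()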
I am plotting a histogram of a fairly simple simulation. On the histogram, the last two columns are merged and it looks odd. Please find the code and the resulting plot below.
Thanks in advance!
import numpy as np
import random
import matplotlib.pyplot as plt
die = [1, 2, 3, 4, 5, 6]
N = 100000
results = []
# first round
for i in range(N):
    X1 = random.choice(die)
    if X1 > 4:
        results.append(X1)
    else:
        X2 = random.choice(die)
        if X2 > 3:
            results.append(X2)
        else:
            X3 = random.choice(die)
            results.append(X3)
plt.hist(results)
plt.ylabel('Count')
plt.xlabel('Result');
plt.title("Mean results: " + str(np.mean(results)))
plt.show()
The output looks like this. I don't understand why the last two columns are stuck together.
Any help appreciated!
By default, matplotlib divides the range of the input into 10 equally sized bins. Each bin spans a half-open interval [x1, x2), except that the rightmost bin also includes the end of the range. Your range is [1, 6], so your bins are [1, 1.5), [1.5, 2), ..., [5.5, 6]. The integers 1-5 therefore fall into the odd-numbered bins (first, third, and so on), while the sixes end up in the tenth (even) bin, directly next to the fives.
To fix the layout, specify the bins:
# This will give you a slightly different layout with only 6 bars
plt.hist(results, bins=die + [7])
# This will simulate your original plot more closely, with empty bins in between
plt.hist(results, bins=np.arange(2, 14)/2)
The last line generates the sequence 2, 3, ..., 13 and divides each number by 2, giving edges 1, 1.5, ..., 6.5, so the last bin spans [6, 6.5] and the sixes get a bar of their own.
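If you want to double-check which edges the default of 10 bins produces, numpy can compute them for you (a quick sanity check, assuming results is the list built in the question):
import numpy as np
# ten equal-width bins over [min(results), max(results)] -> edges 1.0, 1.5, ..., 6.0
print(np.histogram_bin_edges(results, bins=10))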
You need to tell matplotlib that you want the histogram to match the bins.
Otherwise matplotlib chooses the default value of 10 bins for you - in this case, that doesn't divide the range of integer results nicely.
# ... your code ...
plt.hist(results, bins=die) # or bins = 6
plt.ylabel('Count')
plt.xlabel('Result');
plt.title("Mean results: " + str(np.mean(results)))
plt.show()
Full documentation is here: https://matplotlib.org/3.2.2/api/_as_gen/matplotlib.pyplot.hist.html
It doesn't solve the binning by itself, but you can try seaborn:
import seaborn as sns
sns.distplot(results , kde = False)
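Note that distplot is deprecated in newer seaborn releases; assuming seaborn 0.11 or later, a rough equivalent that also gives one bar per integer outcome is:
import seaborn as sns
# discrete=True centers one bin on each integer value
sns.histplot(results, discrete=True)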
I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there is an equal number of elements in each bin. I tried using matplotlib's hist function, but that only lets me decide the number of bins. How do I go about plotting this? (A normal plot and hist work, but that's not what is needed.)
I have 10000 entries. Only 200 have values greater than 0, lying between 0.0005 and 0.2. The distribution isn't even: only one element has the value 0.2, whereas roughly 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width with an equal number of elements.
The task does not make complete sense to me, but the following code does what I understood to be the goal.
I also think the last part of the code is what you really want: using different bin widths to improve the visualization (rather than targeting an equal number of samples in each bin). I used astroML's hist with the 'blocks' binning (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, what i think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3  # integer size needed in Python 3
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
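As a side note (my addition): the manual bin-border computation above can also be written with quantiles, which is a bit shorter and handles the endpoints for you. A minimal sketch reusing prob_array and N_BINS from the code above:
import numpy as np
import matplotlib.pyplot as plt
# edges at the 0%, 1%, ..., 100% quantiles -> roughly N_VALUES / N_BINS samples per bin
equal_count_edges = np.quantile(prob_array, np.linspace(0, 1, N_BINS + 1))
counts, _, _ = plt.hist(prob_array, bins=equal_count_edges)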
I'm trying to learn how to use dendrograms in Python using SciPy. I want to get clusters and be able to visualize them; I heard hierarchical clustering and dendrograms are the best way.
How can I "cut" the tree at a specific distance?
In this example, I just want to cut it at distance 1.6
I looked up a tutorial at https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Inconsistency-Method, but the author uses a rather confusing wrapper function with **kwargs (he calls his threshold max_d).
Here is my code and plot below; I tried annotating it as best as I could for reproducibility:
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from scipy.spatial import distance
np.random.seed(424173239) #43984
#Dims
n,m = 20,7
#DataFrame: rows = Samples, cols = Attributes
attributes = ["a" + str(j) for j in range(m)]
DF_data = pd.DataFrame(np.random.random((n, m)), columns = attributes)
A_dist = distance.cdist(DF_data.to_numpy().T, DF_data.to_numpy().T)  # .as_matrix() was removed in newer pandas
#(i) Do the labels stay in place from DF_data for me to do this?
DF_dist = pd.DataFrame(A_dist, index = attributes, columns = attributes)
#Create dendrogram
fig, ax = plt.subplots()
Z = linkage(distance.squareform(DF_dist.to_numpy()), method="average")
D_dendro = dendrogram(Z, labels = attributes, ax=ax) #create dendrogram dictionary
threshold = 1.6 #for hline
ax.axhline(y=threshold, c='k')
plt.show()
#(ii) How can I "cut" the tree by giving it a distance threshold?
#i.e. If I cut at 1.6 it would make (a5 : cluster_1 or not in a cluster), (a2,a3 : cluster_2), (a0,a1 : cluster_3), and (a4,a6 : cluster_4)
#link_1 says use fcluster
#This -> fcluster(Z, t=1.5, criterion='inconsistent', depth=2, R=None, monocrit=None)
#gives me -> array([1, 1, 1, 1, 1, 1, 1], dtype=int32)
print(
    len(set(D_dendro["color_list"])), "^ # of colors from dendrogram",
    len(D_dendro["ivl"]), "^ # of labels", sep="\n")
#3
#^ # of colors from dendrogram; it should be 4 since clearly (a6, a4) and a5 are in different clusters
#7
#^ # of labels
link_1 : How to compute cluster assignments from linkage/distance matrices in scipy in Python?
color_threshold is what I was looking for. It doesn't really help when the color palette is too small for the number of clusters being generated, though. I migrated the next step to "Bigger color-palette in matplotlib for SciPy's dendrogram (Python)" if anyone can help.
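For reference, the actual cut into flat clusters at a fixed distance can be done with fcluster using criterion='distance', and the dendrogram colors can be made to match by passing the same value as color_threshold. A minimal sketch reusing Z, attributes, and threshold from the question:
from scipy.cluster.hierarchy import fcluster, dendrogram
# one flat cluster label per attribute, cutting the tree at distance 1.6
labels = fcluster(Z, t=threshold, criterion='distance')
print(dict(zip(attributes, labels)))
# color the branches consistently with that cut
dendrogram(Z, labels=attributes, color_threshold=threshold)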
For a bigger color palette this should work:
from scipy.cluster import hierarchy as hc
import matplotlib.cm as cm
import matplotlib.colors as col
import numpy as np
# get a color spectrum ("gist_ncar") from matplotlib's cm;
# a spectrum is sampled between 0 and 1, so use smaller steps
# if you need more than 10 colors
colors = cm.gist_ncar(np.arange(0, 1, 0.1))
colorlst = []  # empty list where you will put your colors
for i in range(len(colors)):  # convert each color from RGBA to hex
    colorlst.append(col.to_hex(colors[i]))
hc.set_link_color_palette(colorlst)  # set the palette the dendrogram will use
Put all of that in front of your code and it should work.
Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N):  # N number of particles
    particle_position[i] = np.random.uniform(0, xmax)  # Initialize in spatial domain
    particle_velocity[i] = np.random.normal(0, 5)       # Initialize in velocity space

for i in range(maxtime):
    plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np
SIZE = 100
SAMPLE_SIZE = 10
def select_subset(seq, size):
    """selects a subset of the data using ...
    """
    return seq[:size]
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
            select_subset(points_y, SAMPLE_SIZE),
            marker="o", color="red")
plt.show()
It uses plt.scatter twice: once on the full data set, then again on the sample points.
You will have to decide how you want to select the sample of points - that logic is isolated in the select_subset function.
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np
import random
SIZE = 100
SAMPLE_SIZE = 10
def make_mask(data_size, sample_size):
    mask = np.array([True] * sample_size + [False] * (data_size - sample_size))
    np.random.shuffle(mask)
    return mask
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)
plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on a subset of the data points (the ones not selected in the sample), and a second time on the sampled subset, and draws each subset with its own marker. It is efficient & leaves the original data intact.
The code below does what you want. I select a random set v_sub_index of N_sub indices in the correct range (0 to N) and draw those (suffix _sub) from the larger samples particle_position and particle_velocity. Note that you don't have to loop to generate random samples; numpy has great vectorized functionality for that.
import numpy as np
import matplotlib.pyplot as pl
N = 100
xmax = 1.
v_sigma = 2.5 / 2. # 95% of the samples contained within 0, 5
v_mean = 2.5 # mean at 2.5
N_sub = 10
v_sub_index = np.random.randint(0, N, N_sub)
particle_position = np.random.rand(N) * xmax
particle_velocity = np.random.randn(N) * v_sigma + v_mean  # normal with mean v_mean, std v_sigma
particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])
particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)
pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub , particle_velocity_sub , color='r', marker='^')
pl.show()
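One caveat (my addition, not part of the original answer): np.random.randint can return the same index more than once, so the highlighted sample may contain duplicates. If that matters, sampling without replacement is a drop-in alternative:
# N_sub distinct indices out of 0..N-1, no repeats
v_sub_index = np.random.choice(N, N_sub, replace=False)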
I am trying to write a simple python code for a plot of intensity vs wavelength for a given temperature, T=200K.
So far I have this...
import scipy as sp
import math
import matplotlib.pyplot as plt
import numpy as np
pi = np.pi
h = 6.626e-34
c = 3.0e+8
k = 1.38e-23
def planck(wav, T):
    a = 2.0*h*pi*c**2
    b = h*c/(wav*k*T)
    intensity = a / ((wav**5) * (math.e**b - 1.0))
    return intensity
I don't know how to define wavelength(wav) and thus produce the plot of Plancks Formula. Any help would be appreciated.
Here's a basic plot. To plot using plt.plot(x, y, fmt) you need two arrays x and y of the same size, where x is the x coordinate of each point to plot and y is the y coordinate, and fmt is a string describing how to plot the numbers.
So all you need to do is create an evenly spaced array of wavelengths (an np.array which I named wavelengths). This can be done with arange(start, end, spacing) which will create an array from start to end (not inclusive) spaced at spacing apart.
Then compute the intensity using your function at each of those points (the result will be another np.array), and call plt.plot to plot them. Note that numpy lets you do mathematical operations on whole arrays in a vectorized form, which is computationally efficient.
import matplotlib.pyplot as plt
import numpy as np
h = 6.626e-34
c = 3.0e+8
k = 1.38e-23
def planck(wav, T):
    a = 2.0*h*c**2
    b = h*c/(wav*k*T)
    intensity = a / ((wav**5) * (np.exp(b) - 1.0))
    return intensity
# generate x-axis from 1 nm to 3 micrometers in 1 nm increments,
# starting at 1 nm to avoid wav = 0, which would result in division by zero.
wavelengths = np.arange(1e-9, 3e-6, 1e-9)
# intensity at 4000K, 5000K, 6000K, 7000K
intensity4000 = planck(wavelengths, 4000.)
intensity5000 = planck(wavelengths, 5000.)
intensity6000 = planck(wavelengths, 6000.)
intensity7000 = planck(wavelengths, 7000.)
plt.plot(wavelengths*1e9, intensity4000, 'r-')
# plot intensity4000 versus wavelength in nm as a red line
plt.plot(wavelengths*1e9, intensity5000, 'g-') # 5000K green line
plt.plot(wavelengths*1e9, intensity6000, 'b-') # 6000K blue line
plt.plot(wavelengths*1e9, intensity7000, 'k-') # 7000K black line
# show the plot
plt.show()
And you see:
You probably will want to clean up the axes labels, add a legend, plot the intensity at multiple temperatures on the same plot, among other things. Consult the relevant matplotlib documentation.
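For instance, a minimal sketch of that cleanup (the label strings and axis names are just examples):
plt.plot(wavelengths*1e9, intensity4000, 'r-', label='4000 K')
plt.plot(wavelengths*1e9, intensity5000, 'g-', label='5000 K')
plt.plot(wavelengths*1e9, intensity6000, 'b-', label='6000 K')
plt.plot(wavelengths*1e9, intensity7000, 'k-', label='7000 K')
plt.xlabel('Wavelength (nm)')
plt.ylabel('Intensity')
plt.legend()
plt.show()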
You may also want to use the RADIS library, which lets you plot the Planck function against wavelength, or against frequency / wavenumber if needed!
from radis import sPlanck
sPlanck(wavelength_min=135, wavelength_max=3000, T=4000).plot()
sPlanck(wavelength_min=135, wavelength_max=3000, T=5000).plot(nfig='same')
sPlanck(wavelength_min=135, wavelength_max=3000, T=6000).plot(nfig='same')
sPlanck(wavelength_min=135, wavelength_max=3000, T=7000).plot(nfig='same')
Just want to point out that there seems to be an equivalent of what OP wants to do in astropy:
https://docs.astropy.org/en/stable/api/astropy.modeling.physical_models.BlackBody.html
Unfortunately, it is not yet very clear to me how to switch between the wavelength-based and frequency-based expressions.
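If I read the astropy docs correctly, the output form is controlled through the scale parameter: with a dimensionless scale the model evaluates the per-frequency expression, and passing scale with per-wavelength surface-brightness units switches it to the per-wavelength expression. A rough sketch (assuming a reasonably recent astropy, 4.3+; treat the exact unit choices as illustrative):
import numpy as np
import matplotlib.pyplot as plt
from astropy import units as u
from astropy.modeling.models import BlackBody
wav = np.linspace(100, 3000, 500) * u.nm
# per-frequency spectral radiance, erg / (cm2 Hz s sr), the default
bb_nu = BlackBody(temperature=5000 * u.K)
print(bb_nu(500 * u.nm))  # Quantity in erg / (cm2 Hz s sr)
# per-wavelength spectral radiance, selected via the scale units
bb_lam = BlackBody(temperature=5000 * u.K,
                   scale=1.0 * u.erg / (u.cm**2 * u.AA * u.s * u.sr))
plt.plot(wav.value, bb_lam(wav).value)
plt.xlabel('Wavelength (nm)')
plt.show()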