Unequal width binned histogram in Python

I have an array of probability values, some of which are 0. I need to plot a histogram such that each bin contains an equal number of elements. I tried matplotlib's hist function, but it only lets me choose the number of bins. How do I go about plotting this? (A normal plot and hist both work, but they are not what is needed.)
I have 10000 entries. Only 200 have values greater than 0, and these lie between 0.0005 and 0.2. The distribution is far from even: only one element has the value 0.2, whereas roughly 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width while holding an equal number of elements.

The task does not make much sense to me, but the following code does what I understood to be the goal.
I also think the last lines of the code are what you really want: using different bin widths to improve the visualization (without targeting an equal number of samples in each bin). I used astroML's hist with bins='blocks' (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, edges, patches = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, I think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3  # integer size needed for randn
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output (plots omitted)
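As a side note, if you really do want bin edges that put an (approximately) equal number of samples in each bin, np.quantile computes the same borders as the manual sort above in one call. A minimal sketch, assuming NumPy 1.15+ for np.quantile, with the same kind of fake data:
import numpy as np
import matplotlib.pyplot as plt
prob_array = np.random.randn(1000)
n_bins = 100
# quantile-based edges: each of the 100 bins holds roughly 1% of the samples
edges = np.quantile(prob_array, np.linspace(0, 1, n_bins + 1))
plt.hist(prob_array, bins=edges)
plt.show()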

Related

Python Matplotlib pyplot histogram

I am plotting a histogram of a fairly simple simulation. In the histogram, the last two columns are merged, and it looks odd. Please find the code and the resulting plot below.
Thanks in advance!
import numpy as np
import random
import matplotlib.pyplot as plt
die = [1, 2, 3, 4, 5, 6]
N = 100000
results = []
# first round
for i in range(N):
    X1 = random.choice(die)
    if X1 > 4:
        results.append(X1)
    else:
        X2 = random.choice(die)
        if X2 > 3:
            results.append(X2)
        else:
            X3 = random.choice(die)
            results.append(X3)
plt.hist(results)
plt.ylabel('Count')
plt.xlabel('Result');
plt.title("Mean results: " + str(np.mean(results)))
plt.show()
The output looks like this. I don't understand why the last two columns are stuck together.
Any help is appreciated!
By default, matplotlib divides the range of the input into 10 equally sized bins. Each bin spans a half-open interval [x1, x2), except that the rightmost bin also includes the end of the range. Your range is [1, 6], so the bins are [1, 1.5), [1.5, 2), ..., [5.5, 6]. The integers 1 through 5 all land in the odd-numbered bins (first, third, and so on), while the sixes land in the tenth bin, which is directly adjacent to the ninth bin holding the fives; that is why the last two bars are stuck together.
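You can check the edges matplotlib would use with NumPy; a quick sanity check using the results list from your code (assumes NumPy 1.15+ for np.histogram_bin_edges):
import numpy as np
# the default: 10 equal-width bins spanning [min(results), max(results)]
print(np.histogram_bin_edges(results, bins=10))
# -> [1.  1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5 6. ]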
To fix the layout, specify the bins:
# This will give you a slightly different layout with only 6 bars
plt.hist(results, bins=die + [7])
# This will simulate your original plot more closely, with empty bins in between
plt.hist(results, bins=np.arange(2, 14)/2)
The last bit generates the number sequence 2,3,...,13 and then divides each number by 2, which gives 1, 1.5, ..., 6.5 so that the last bin spans [6,6.5].
You need to tell matplotlib which bins you want the histogram to use.
Otherwise matplotlib chooses the default value of 10 bins for you; in this case, that doesn't divide the range into nice boundaries.
# ... your code ...
plt.hist(results, bins=die) # or bins = 6
plt.ylabel('Count')
plt.xlabel('Result');
plt.title("Mean results: " + str(np.mean(results)))
plt.show()
Full documentation is here: https://matplotlib.org/3.2.2/api/_as_gen/matplotlib.pyplot.hist.html
It doesn't, but you can try it:
import seaborn as sns
sns.distplot(results, kde=False)

python get cdf values from array

I have an array of numbers, let's say with 100 members.
I know how to draw the CDF, but my problem is that I want the CDF value of each member of the array.
How can I iterate through the array and get back the corresponding CDF value for each member?
cumsum() and hist() could solve my problem, but I didn't find any library that gives me back the value. norm.cdf() is not working for me (for some reason).
For example
import matplotlib.pyplot as plt
import numpy as np
# create some randomly distributed data:
data = np.random.randn(10000)
# sort the data:
data_sorted = np.sort(data)
# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)
# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')
This draws a line (or two lines) representing the CDF. How can I get the values out of that? I mean, there are values behind the graph; how can I access the CDF value corresponding to each of my x values?
But in my opinion this isn't completely right: it just divides the range by the number of rows and doesn't pay attention to repeated values. :/
Thanks in advance
EDIT
What would you say about that:
cur.execute("Select AGE From **** ")
output = []
for row in cur:
    output.append(float(row[0]))
data_sorted = np.sort(output)
length = len(data_sorted)
yvals = np.arange(len(data_sorted)) / float(len(data_sorted))
print(yvals)
plt.plot(data_sorted, yvals)
plt.show()
The result is that the array is 5 members long, so each member gets 1/5 = 0.2.
That leads to:
[ 1 2 2 9 58]
[ 0. 0.2 0.4 0.6 0.8]
But it should be: 1 maps to 0.2 and 2 maps to 0.6 (because 2 appears twice, so 3 out of 5 values are 2 or less).
How do I get the 0.6?
I mean, I could write that in a view and sum it up after grouping by AGE, but I would prefer to do it in Python.
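Not from the original post, but here is a minimal sketch of one way to get a CDF value for every member that does account for repeated values, using np.searchsorted on the sorted array:
import numpy as np
ages = np.array([1, 2, 2, 9, 58])  # the five ages from the example above
ages_sorted = np.sort(ages)
# for each value, count how many samples are <= it, then divide by the total
cdf_vals = np.searchsorted(ages_sorted, ages, side='right') / len(ages)
print(cdf_vals)  # [0.2 0.6 0.6 0.8 1. ]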

Matplotlib graphing distribution with two colors

The goal here is to color values above a certain threshold in one color and values below the threshold in another color. The code below just tries to separate them into two histograms, but it only looks balanced if the threshold is at 50%. I'm assuming I must play around with the discreetlevel variable.
finalutilityrange is some vector with a bunch of values (you must generate it to test the code), which I am trying to graph. The value deter determines whether they will be blue or red. discreetlevel is just the number of bins I want.
import random
import numpy as np
import matplotlib.pyplot as plt
discreetlevel = 10
deter = 2
piraterange = []
nonpiraterange = []
for x in range(0, len(finalutilityrange)):
    if finalutilityrange[x] >= deter:
        piraterange.append(finalutilityrange[x])
    else:
        nonpiraterange.append(finalutilityrange[x])
plt.hist(piraterange, bins=discreetlevel, normed=False, cumulative=False, color='b')
plt.hist(nonpiraterange, bins=discreetlevel, normed=False, cumulative=False, color='r')
plt.title("Histogram")
plt.xlabel("Utility")
plt.ylabel("Probability")
plt.show()
This solution is a bit more complex than @user2699's. I am just presenting it for completeness. You have full control over the patch objects that hist returns, so if you can ensure that the threshold you are using is exactly on a bin edge, it is easy to change the color of selected patches. You can do this because hist can accept a sequence of bin edges as the bins parameter.
import numpy as np
from matplotlib import pyplot as plt
# Make sample data
finalutilityrange = np.random.randn(100)
discreetlevel = 10
deter = 0.2
# Manually create `discreetlevel` bins anchored to `deter`
binsAbove = round(discreetlevel * np.count_nonzero(finalutilityrange > deter) / finalutilityrange.size)
binsBelow = discreetlevel - binsAbove
binwidth = max((finalutilityrange.max() - deter) / binsAbove,
               (deter - finalutilityrange.min()) / binsBelow)
bins = np.concatenate([
    np.arange(deter - binsBelow * binwidth, deter, binwidth),
    np.arange(deter, deter + (binsAbove + 0.5) * binwidth, binwidth)
])
# Use the bins to make a single histogram
h, bins, patches = plt.hist(finalutilityrange, bins, color='b')
# Change the appropriate patches to red
plt.setp([p for p, b in zip(patches, bins) if b >= deter], color='r')
The result is a homogeneous histogram with bins of different colors:
The bins may be a tad wider than if you did not anchor to deter. Either the first or last bin will generally go a little past the edge of the data.
This answer doesn't address your code, since it isn't self-contained, but for what you're trying to do the default histogram should work (assuming numpy and pyplot are imported as np and plt):
x = np.random.randn(100)
idx = x < 0.2  # threshold to separate values
plt.hist([x[idx], x[~idx]], color=['b', 'r'])
Explanation:
The first line just generates some random data to test.
The second line creates a boolean index for where the data is below the threshold; it can be negated with ~ to find where the data is above the threshold.
The last line plots the histogram. The command takes a list of separate groups to plot, which doesn't make a big difference here, but it does if normed=True.
There's more the hist plot can do, so look over the documentation before you accidentally implement it yourself.
Just as above, do:
x = np.random.randn(100)
threshold_x = 0.2  # threshold to separate values
x_lower, x_upper = (
    [_ for _ in x if _ < threshold_x],
    [_ for _ in x if _ >= threshold_x],
)
plt.hist([x_lower, x_upper], color=['b', 'r'])

Python/Matplotlib: Randomly select "sample" scatter points for different marker

Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N):  # N number of particles
    particle_position[i] = np.random.uniform(0, xmax)  # initialize in spatial domain
    particle_velocity[i] = np.random.normal(0, 5)      # initialize in velocity space
for i in range(maxtime):
    plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np
SIZE = 100
SAMPLE_SIZE = 10
def select_subset(seq, size):
    """selects a subset of the data using ...
    """
    return seq[:size]
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
            select_subset(points_y, SAMPLE_SIZE),
            marker="o", color="red")
plt.show()
It uses plt.scatter twice: once on the full data set and once on the sample points.
You will have to decide how you want to select the sample of points; that logic is isolated in the select_subset function.
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np
import random
SIZE = 100
SAMPLE_SIZE = 10
def make_mask(data_size, sample_size):
    mask = np.array([True] * sample_size + [False] * (data_size - sample_size))
    np.random.shuffle(mask)
    return mask
points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)
mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)
plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on a subset of the data points (the ones not selected in the sample), and a second time on the sampled subset, and draws each subset with its own marker. It is efficient & leaves the original data intact.
The code below does what you want. I have selected a random set v_sub_index of N_sub indices in the correct range (0 to N) and drawn those points (with the _sub suffix) from the larger arrays particle_position and particle_velocity. Please note that you don't have to loop to generate random samples: NumPy has great functionality for that without using for loops.
import numpy as np
import matplotlib.pyplot as pl
N = 100
xmax = 1.
v_sigma = 2.5 / 2.  # ~95% of the samples contained within [0, 5]
v_mean = 2.5        # mean at 2.5
N_sub = 10
v_sub_index = np.random.randint(0, N, N_sub)
particle_position = np.random.rand(N) * xmax
particle_velocity = np.random.randn(N) * v_sigma + v_mean
particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])
particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)
pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub , particle_velocity_sub , color='r', marker='^')
pl.show()

Bin size in Matplotlib (Histogram)

I'm using matplotlib to make a histogram.
Is there any way to manually set the size of the bins as opposed to the number of bins?
Actually, it's quite easy: instead of the number of bins you can give a list with the bin boundaries. They can be unequally distributed, too:
plt.hist(data, bins=[0, 10, 20, 30, 40, 50, 100])
If you just want them equally distributed, you can simply use range:
plt.hist(data, bins=range(min(data), max(data) + binwidth, binwidth))
Added to original answer
The above line works for data filled with integers only. As macrocosme points out, for floats you can use:
import numpy as np
plt.hist(data, bins=np.arange(min(data), max(data) + binwidth, binwidth))
For N bins, the bin edges are specified by a list of N+1 values, where the first N give the lower bin edges and the last one gives the upper edge of the last bin.
Code:
import numpy as np
bin_size = 0.1; min_edge = 0; max_edge = 2.5
N = int((max_edge - min_edge) / bin_size); Nplus1 = N + 1
bin_list = np.linspace(min_edge, max_edge, Nplus1)
Note that linspace produces an array from min_edge to max_edge broken into N+1 values, i.e. N bins.
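To actually draw the histogram with those edges you just pass bin_list to hist; a minimal sketch with made-up uniform data (the data array is only for illustration):
import numpy as np
import matplotlib.pyplot as plt
bin_size = 0.1; min_edge = 0; max_edge = 2.5
bin_list = np.linspace(min_edge, max_edge, int((max_edge - min_edge) / bin_size) + 1)
data = np.random.uniform(min_edge, max_edge, 500)  # made-up sample in [0, 2.5]
plt.hist(data, bins=bin_list)
plt.show()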
I use quantiles to make the bins uniform in count and fitted to the sample:
bins=df['Generosity'].quantile([0,.05,0.1,0.15,0.20,0.25,0.3,0.35,0.40,0.45,0.5,0.55,0.6,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1]).to_list()
plt.hist(df['Generosity'], bins=bins, density=True, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none')
I guess the easy way would be to calculate the minimum and maximum of the data you have, then calculate L = max - min. Then you divide L by the desired bin width (I'm assuming this is what you mean by bin size) and use the ceiling of this value as the number of bins.
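A rough sketch of that idea (the data and bin_width values here are only assumed examples):
import math
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)   # example data
bin_width = 0.25               # assumed desired bin size
n_bins = math.ceil((data.max() - data.min()) / bin_width)  # ceiling of range / width
plt.hist(data, bins=n_bins)
plt.show()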
I like things to happen automatically and for bins to fall on "nice" values. The following seems to work quite well.
import numpy as np
import numpy.random as random
import matplotlib.pyplot as plt
def compute_histogram_bins(data, desired_bin_size):
    min_val = np.min(data)
    max_val = np.max(data)
    min_boundary = -1.0 * (min_val % desired_bin_size - min_val)
    max_boundary = max_val - max_val % desired_bin_size + desired_bin_size
    n_bins = int((max_boundary - min_boundary) / desired_bin_size) + 1
    bins = np.linspace(min_boundary, max_boundary, n_bins)
    return bins

if __name__ == '__main__':
    data = np.random.random_sample(100) * 123.34 - 67.23
    bins = compute_histogram_bins(data, 10.0)
    print(bins)
    plt.hist(data, bins=bins)
    plt.xlabel('Value')
    plt.ylabel('Counts')
    plt.title('Compute Bins Example')
    plt.grid(True)
    plt.show()
The result has bin edges on nice multiples of the bin size:
[-70. -60. -50. -40. -30. -20. -10. 0. 10. 20. 30. 40. 50. 60.]
I had the same issue as OP (I think!), but I couldn't get it to work in the way that Lastalda specified. I don't know if I have interpreted the question properly, but I have found another solution (it probably is a really bad way of doing it though).
This was the way that I did it:
plt.hist([1,11,21,31,41], bins=[0,10,20,30,40,50], weights=[10,1,40,33,6]);
Which creates this:
So the first parameter basically 'initialises' the bin: I'm specifically creating a number that lies inside the range I set in the bins parameter.
To demonstrate this, look at the array in the first parameter ([1,11,21,31,41]) and the 'bins' array in the second parameter ([0,10,20,30,40,50]):
The number 1 (from the first array) falls between 0 and 10 (in the 'bins' array)
The number 11 (from the first array) falls between 10 and 20 (in the 'bins' array)
The number 21 (from the first array) falls between 20 and 30 (in the 'bins' array), etc.
Then I'm using the 'weights' parameter to define the height of each bar (the count assigned to each bin). This is the array used for the weights parameter: [10,1,40,33,6].
So the 0 to 10 bin is given the value 10, the 10 to 20 bin is given the value 1, the 20 to 30 bin is given the value 40, etc.
This answer supports @macrocosme's suggestion.
I am using a heat map as a hist2d plot. Additionally, I use cmin=0.5 to hide cells with no counts and cmap for the color; appending _r to a colormap name gives the reverse of that colormap.
# np.arange(data.min(), data.max()+binwidth, binwidth)
bin_x = np.arange(0.6, 7 + 0.3, 0.3)
bin_y = np.arange(12, 58 + 3, 3)
plt.hist2d(data=fuel_econ, x='displ', y='comb', cmin=0.5, cmap='viridis_r', bins=[bin_x, bin_y]);
plt.xlabel('Displacement (l)');
plt.ylabel('Combine fuel efficiency (mpg)');
plt.colorbar();
If you also care about the visual aspect, you can add edgecolor='white', linewidth=2 and the bins will be visually separated:
date_binned = new_df[(new_df['k']>0)&(new_df['k']<360)]['k']
plt.hist(date_binned, bins=range(min(date_binned), max(date_binned) + binwidth, binwidth), edgecolor='white', linewidth=2)
For a histogram with integer x-values I ended up using
plt.hist(data, np.arange(min(data) - 0.5, max(data) + 1.5))
plt.xticks(range(min(data), max(data) + 1))
The offset of 0.5 centers the bins on the integer x-axis values (the upper limit of max(data) + 1.5 ensures the last bin, centered on max(data), is included). The plt.xticks call adds a tick for every integer.
