I am plotting a histogram of a fairly simple simulation. On the histogram the last two columns are merged and it looks odd. Please find the code and the resulting plot below.
Thanks in advance!
import numpy as np
import random
import matplotlib.pyplot as plt
die = [1, 2, 3, 4, 5, 6]
N = 100000
results = []
# first round
for i in range(N):
    X1 = random.choice(die)
    if X1 > 4:
        results.append(X1)
    else:
        X2 = random.choice(die)
        if X2 > 3:
            results.append(X2)
        else:
            X3 = random.choice(die)
            results.append(X3)
plt.hist(results)
plt.ylabel('Count')
plt.xlabel('Result');
plt.title("Mean results: " + str(np.mean(results)))
plt.show()
The output looks like this. I don't understand why the last two columns are stuck together.
Any help appreciated!
By default, matplotlib divides the range of the input into 10 equally sized bins. All bins span a half-open interval [x1,x2), but the rightmost bin includes the end of the range. Your range is [1,6], so your bins are [1,1.5), [1.5,2), ..., [5.5,6], so all the integers end up in the first, third, etc. odd-numbered bins, but the sixes end up in the tenth (even) bin.
To fix the layout, specify the bins:
# This will give you a slightly different layout with only 6 bars
plt.hist(results, bins=die + [7])
# This will simulate your original plot more closely, with empty bins in between
plt.hist(results, bins=np.arange(2, 14)/2)
The last bit generates the number sequence 2,3,...,13 and then divides each number by 2, which gives 1, 1.5, ..., 6.5 so that the last bin spans [6,6.5].
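The binning can also be checked without plotting. The following is a small sketch using np.histogram, which plt.hist calls internally; the rolls array is illustrative data (one roll of each face), not the simulation output:

```python
import numpy as np

# Illustrative data: one roll of each face of the die.
rolls = np.array([1, 2, 3, 4, 5, 6])

# Default-style binning: 10 equal-width bins over [1, 6].
default_counts, default_edges = np.histogram(rolls, bins=10)

# Explicit half-integer edges 1, 1.5, ..., 6.5 (11 bins).
fixed_counts, fixed_edges = np.histogram(rolls, bins=np.arange(2, 14) / 2)

print(default_counts)  # 5 and 6 end up in the last two, adjacent bins
print(fixed_counts)    # occupied and empty bins alternate all the way
```

With the default edges the sixes fall into the closed last bin [5.5, 6], right next to the fives; with the explicit edges they get a bin of their own.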
You need to tell matplotlib how many bins you want, so that each die face gets its own bin.
Otherwise matplotlib chooses the default of 10 bins for you, which does not line up with six integer values.
# ... your code ...
plt.hist(results, bins=6)  # six equal-width bins over [1, 6], one per face; bins=die would give only five bins and merge 5 and 6
plt.ylabel('Count')
plt.xlabel('Result');
plt.title("Mean results: " + str(np.mean(results)))
plt.show()
Full documentation is here: https://matplotlib.org/3.2.2/api/_as_gen/matplotlib.pyplot.hist.html
it doesn't, but you can try it.
import seaborn as sns
sns.distplot(results , kde = False)
Related
Currently I have a plot that looks like this.
How do I increase the size of each point by the count? In other words, if a certain point has 9 counts, how do I increase it so that it is bigger than another point with only 2 counts?
If you look closely, I think there are overlaps (one point has both grey and orange circles). How do I make it so there's a clear difference?
In case you have no idea what I mean by "Plotting a 3-dimensional graph by increasing the size of the points", this below is what I mean, where the z-axis is the count
This answer doesn't really answer the question straight up, but please consider this multivariate solution seaborn has:
The syntax is way easier to write than using matplotlib.
seaborn.jointplot(
    data = data,
    x = x_name,
    y = y_name,
    hue = label_name
)
And voila! You should get something that looks like this
See: https://seaborn.pydata.org/generated/seaborn.jointplot.html
Using matplotlib library you could iterate on your data and count it (some example below).
import numpy as np
import matplotlib.pyplot as plt
# Generate Data
N = 100
l_bound = -10
u_bound = 10
s_0 = 40
array = np.random.randint(l_bound, u_bound, (N, 2))
# Plot it
points, counts = np.unique(array, axis = 0, return_counts = True)
for point_, count_ in zip(points, counts):
    plt.scatter(point_[0], point_[1], c = np.random.randint(0, 3), s = s_0 * count_**2, vmin = 0, vmax = 2)
plt.colorbar()
plt.show()
Result
You can probably do the same with Plotly to have something fancier and closer to your second picture.
Cheers
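As a rough sketch of the same idea (not the answerer's code), the counts from np.unique can be turned into a size array in one step, so that a single scatter call is enough. The seed and the name sizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, only for reproducibility
array = rng.integers(-10, 10, (100, 2))   # same kind of data as in the answer

# Unique (x, y) pairs and how often each one occurs.
points, counts = np.unique(array, axis=0, return_counts=True)

s_0 = 40
sizes = s_0 * counts  # marker area grows linearly with the count

# With matplotlib imported as plt, one call would then draw everything:
# plt.scatter(points[:, 0], points[:, 1], s=sizes)
```

The s argument is the marker area, so linear scaling makes the area proportional to the count, whereas count_**2 exaggerates the differences. Which one reads better depends on the data.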
Try this:
x_name = 'x_name'
y_name = 'y_name'
z_name = 'z_name'
scatter_data = pd.DataFrame(data[[x_name, y_name, z_name]].value_counts())
scatter_data.reset_index(inplace=True)
plt.scatter(
    scatter_data.loc[:, x_name],
    scatter_data.loc[:, y_name],
    s=scatter_data.loc[:, 0],
    c=scatter_data.loc[:, z_name]
)
Your scatter plot looks like this because every point at (1, 1) or (0, 1) overlaps with all the other points at the same position.
With the s= argument of plt.scatter you can specify the size of the points. If you print scatter_data, you will see that it is effectively a "group by" with the count of each index combination.
It should look something like this, with column 0 being the count.
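One caveat, offered as a hedged sketch: the name of the count column depends on the pandas version (older versions label it 0, newer ones 'count'), so naming it explicitly avoids relying on either. The DataFrame below is made-up example data:

```python
import pandas as pd

# Made-up example data with repeated (x, y, z) combinations.
data = pd.DataFrame({
    "x_name": [0, 0, 1, 1, 1],
    "y_name": [1, 1, 1, 1, 0],
    "z_name": [0, 0, 1, 1, 1],
})

# Count each combination and give the count column an explicit name.
scatter_data = (
    data[["x_name", "y_name", "z_name"]]
    .value_counts()
    .reset_index(name="count")
)

print(scatter_data)
# scatter_data["count"] can then be passed as s= to plt.scatter,
# for example s=40 * scatter_data["count"].
```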
I found this code :
import numpy as np
import matplotlib.pyplot as plt
# We create 1000 realizations with 500 steps each
n_stories = 1000
t_max = 500
t = np.arange(t_max)
# Steps can be -1 or 1 (note that randint excludes the upper limit)
steps = 2 * np.random.randint(0, 1 + 1, (n_stories, t_max)) - 1
# The time evolution of the position is obtained by successively
# summing up individual steps. This is done for each of the
# realizations, i.e. along axis 1.
positions = np.cumsum(steps, axis=1)
# Determine the time evolution of the mean square distance.
sq_distance = positions**2
mean_sq_distance = np.mean(sq_distance, axis=0)
# Plot the distance d from the origin as a function of time and
# compare with the theoretically expected result where d(t)
# grows as a square root of time t.
plt.figure(figsize=(10, 7))
plt.plot(t, np.sqrt(mean_sq_distance), 'g.', t, np.sqrt(t), 'y-')
plt.xlabel(r"$t$")
plt.tight_layout()
plt.show()
Instead of taking steps of just -1 or 1, I would like the steps to follow a standard normal distribution. But when I insert np.random.normal(0,1,1000) in place of np.random.randint(...), it does not work.
I am really new to Python btw.
Many thanks in advance and Kind regards
You are entering a single number as third parameter of np.random.normal, therefore you get a 1d array, instead of 2d, see the documentation. Try this:
steps = np.random.normal(0, 1, (n_stories, t_max))
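To make the shape difference concrete, here is a minimal sketch. It uses a seeded np.random.default_rng generator instead of np.random.normal purely for reproducibility; that choice is mine, not part of the original code:

```python
import numpy as np

n_stories, t_max = 1000, 500
rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# One standard-normal step per realization and per time step.
steps = rng.normal(0.0, 1.0, size=(n_stories, t_max))

# A scalar size gives a flat 1-D array, which is what went wrong.
flat = rng.normal(0.0, 1.0, size=1000)

# Same analysis as in the question: cumulative sum along the time axis.
positions = np.cumsum(steps, axis=1)
mean_sq_distance = np.mean(positions**2, axis=0)

print(steps.shape, flat.shape)  # (1000, 500) (1000,)
```

Since the steps have unit variance, the mean square distance still grows roughly linearly with the number of steps, so the comparison with the square root of t remains meaningful.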
To illustrate my problem I prepared an example:
First, I have two arrays 'a' and 'b' and I'm interested in their distribution:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
plt.show()
This code gives me a histogram with two 'curves'. Now I want to subtract one 'curve' from the other, and by this I mean that I do this for each bin separately:
n3 = n2-n1
I don't need negative counts so:
for i in range(0,len(n2)):
    if n3[i]<0:
        n3[i]=0
    else:
        continue
The new histogram curve should be plotted in the same range as the previous ones and it should have the same number of bins. So I have the number of bins and their position (which will be the same as the ones for the other curves, please refer to the block above) and the frequency or counts (n3) that every bins should have. Do you have any ideas of how I can do this with the data that I have?
You can use a step function to plot n3 = n2 - n1. The only issue is that you need to provide one more value, otherwise the last value is not shown nicely. Also you need to use the where="post" option of the step function.
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
n3=n2-n1
n3[n3<0] = 0
plt.step(np.arange(1,10,2),np.append(n3,[n3[-1]]), where='post', lw=3 )
plt.show()
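If the two original curves do not need to be drawn, the counts can also be computed directly. This is a sketch using np.histogram and np.clip with the same data and bin edges as above:

```python
import numpy as np

a = np.array([1, 2, 2, 2, 2, 4, 8, 1, 9, 5, 3, 1, 2, 9])
b = np.array([5, 9, 9, 2, 3, 9, 3, 6, 8, 4, 2, 7, 8, 8])
edges = np.arange(1, 10, 2)  # bin edges 1, 3, 5, 7, 9

# Per-bin counts without drawing anything.
n1, _ = np.histogram(a, bins=edges)
n2, _ = np.histogram(b, bins=edges)

# Bin-wise difference, with negative values clipped to zero.
n3 = np.clip(n2 - n1, 0, None)
print(n3)  # [0 1 1 4]
```

The array n3 can be passed to plt.step as shown above. Recent matplotlib versions (3.4 or later, as far as I know) also provide plt.stairs(n3, edges), which takes the counts and edges directly without the extra appended value.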
I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there is an equal number of elements in each bin. I tried matplotlib's hist function, but that only lets me choose the number of bins. How do I go about plotting this? (A normal plot and hist both work, but they are not what I need.)
I have 10000 entries. Only 200 have values greater than 0, and these lie between 0.0005 and 0.2. The distribution isn't even: only one element has the value 0.2, whereas approximately 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width with an equal number of elements.
The task does not make much sense to me, but the following code does what I understood the task to be.
I also think the last lines of the code are what you really wanted: using different bin widths to improve the visualization (without aiming for an equal number of samples in each bin). I used astroML's hist with bins='blocks' (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is what I think you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3  # integer division: randn needs an int
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
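An alternative way to get equal-count bins, sketched here with np.quantile instead of the index arithmetic used above. The random data and the choice of 10 bins are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded only for reproducibility
values = rng.normal(size=1000)    # stand-in for the probability array
n_bins = 10

# Edges at evenly spaced quantiles, so each bin holds about the same number of samples.
edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))

# Heavily tied data (e.g. many zeros) yields repeated edges; drop the duplicates.
edges = np.unique(edges)

counts, _ = np.histogram(values, bins=edges)
print(counts)
```

The resulting edges can be passed as bins=edges to plt.hist. With many identical values, as in the question, the bins can no longer all hold the same number of elements, because the tied values have to go into a single bin.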
I'm using matplotlib to make a histogram.
Is there any way to manually set the size of the bins as opposed to the number of bins?
Actually, it's quite easy: instead of the number of bins you can give a list with the bin boundaries. They can be unequally distributed, too:
plt.hist(data, bins=[0, 10, 20, 30, 40, 50, 100])
If you just want them equally distributed, you can simply use range:
plt.hist(data, bins=range(min(data), max(data) + binwidth, binwidth))
Added to original answer
The above line works for data filled with integers only. As macrocosme points out, for floats you can use:
import numpy as np
plt.hist(data, bins=np.arange(min(data), max(data) + binwidth, binwidth))
For N bins, the bin edges are specified by a list of N+1 values, where the first N give the lower bin edges and the +1 gives the upper edge of the last bin.
Code:
import numpy as np; from pylab import *
bin_size = 0.1; min_edge = 0; max_edge = 2.5
N = int(round((max_edge-min_edge)/bin_size)); Nplus1 = N + 1  # linspace needs an integer count
bin_list = np.linspace(min_edge, max_edge, Nplus1)
Note that linspace produces an array from min_edge to max_edge broken into N+1 values, i.e. N bins.
I use quantiles to get bins that hold equal numbers of samples and are fitted to the sample:
bins=df['Generosity'].quantile([0,.05,0.1,0.15,0.20,0.25,0.3,0.35,0.40,0.45,0.5,0.55,0.6,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1]).to_list()
plt.hist(df['Generosity'], bins=bins, density=True, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none')  # density replaces the removed normed argument
I guess the easy way would be to calculate the minimum and maximum of the data you have, then calculate L = max - min. Then you divide L by the desired bin width (I'm assuming this is what you mean by bin size) and use the ceiling of this value as the number of bins.
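The calculation described above can be sketched as follows. The helper name bins_for_width and the sample data are hypothetical, chosen only for illustration:

```python
import math
import numpy as np

def bins_for_width(data, bin_width):
    """Number of equal-width bins so that each bin is at most bin_width wide."""
    span = np.max(data) - np.min(data)          # L = max - min
    return max(1, math.ceil(span / bin_width))  # at least one bin

data = np.array([0.0, 2.3, 4.1, 9.5])
print(bins_for_width(data, 2.0))  # 5, since 9.5 / 2.0 = 4.75
```

The result can be passed as bins= to plt.hist. Note that the actual bin width then becomes span / n_bins, which is slightly smaller than the requested width unless the span is an exact multiple of it.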
I like things to happen automatically and for bins to fall on "nice" values. The following seems to work quite well.
import numpy as np
import numpy.random as random
import matplotlib.pyplot as plt
def compute_histogram_bins(data, desired_bin_size):
    min_val = np.min(data)
    max_val = np.max(data)
    min_boundary = -1.0 * (min_val % desired_bin_size - min_val)
    max_boundary = max_val - max_val % desired_bin_size + desired_bin_size
    n_bins = int((max_boundary - min_boundary) / desired_bin_size) + 1
    bins = np.linspace(min_boundary, max_boundary, n_bins)
    return bins
if __name__ == '__main__':
    data = np.random.random_sample(100) * 123.34 - 67.23
    bins = compute_histogram_bins(data, 10.0)
    print(bins)
    plt.hist(data, bins=bins)
    plt.xlabel('Value')
    plt.ylabel('Counts')
    plt.title('Compute Bins Example')
    plt.grid(True)
    plt.show()
The result has bins on nice intervals of bin size.
[-70. -60. -50. -40. -30. -20. -10. 0. 10. 20. 30. 40. 50. 60.]
I had the same issue as OP (I think!), but I couldn't get it to work in the way that Lastalda specified. I don't know if I have interpreted the question properly, but I have found another solution (it probably is a really bad way of doing it though).
This was the way that I did it:
plt.hist([1,11,21,31,41], bins=[0,10,20,30,40,50], weights=[10,1,40,33,6]);
Which creates this:
So the first parameter basically 'initialises' the bin - I'm specifically creating a number that is in between the range I set in the bins parameter.
To demonstrate this, look at the array in the first parameter ([1,11,21,31,41]) and the 'bins' array in the second parameter ([0,10,20,30,40,50]):
The number 1 (from the first array) falls between 0 and 10 (in the 'bins' array)
The number 11 (from the first array) falls between 10 and 20 (in the 'bins' array)
The number 21 (from the first array) falls between 20 and 30 (in the 'bins' array), etc.
Then I'm using the 'weights' parameter to define the height of each bin. This is the array used for the weights parameter: [10,1,40,33,6].
So the 0 to 10 bin is given the value 10, the 10 to 20 bin is given the value of 1, the 20 to 30 bin is given the value of 40, etc.
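The effect of the weights argument can be checked numerically with np.histogram, which accepts the same bins and weights arguments. This is a small sketch using the values from this answer:

```python
import numpy as np

values = [1, 11, 21, 31, 41]      # one representative value per bin
edges = [0, 10, 20, 30, 40, 50]   # bin edges
weights = [10, 1, 40, 33, 6]      # desired bar heights

# Each value contributes its weight (instead of 1) to the bin it falls into.
heights, _ = np.histogram(values, bins=edges, weights=weights)
print(heights)
```

When the heights are already known like this, drawing them with plt.bar(edges[:-1], heights, width=10, align='edge') may be a more direct option than going through hist.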
This answer supports @macrocosme's suggestion.
I am plotting a heat map with hist2d. Additionally, I use cmin=0.5 so that cells with no counts are left blank, and cmap for the color map; the _r suffix selects the reversed version of the given color map.
Describe statistics.
# np.arange(data.min(), data.max()+binwidth, binwidth)
bin_x = np.arange(0.6, 7 + 0.3, 0.3)
bin_y = np.arange(12, 58 + 3, 3)
plt.hist2d(data=fuel_econ, x='displ', y='comb', cmin=0.5, cmap='viridis_r', bins=[bin_x, bin_y]);
plt.xlabel('Displacement (l)');
plt.ylabel('Combine fuel efficiency (mpg)');
plt.colorbar();
If you also care about the visual appearance, you can add edgecolor='white', linewidth=2 so that the bins are visually separated:
date_binned = new_df[(new_df['k']>0)&(new_df['k']<360)]['k']
plt.hist(date_binned, bins=range(min(date_binned), max(date_binned) + binwidth, binwidth), edgecolor='white', linewidth=2)
For a histogram with integer x-values I ended up using
plt.hist(data, np.arange(min(data)-0.5, max(data)+1.5))
plt.xticks(range(min(data), max(data)+1))
The offset of 0.5 centers the bins on the x-axis values; the upper limits are extended by one because arange and range exclude their end value. The plt.xticks call adds a tick for every integer.
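A quick numerical check of integer-centered bin edges, as a sketch on made-up data. Because np.arange excludes its stop value, the stop has to be max + 1.5 for the largest integer to get a bin of its own:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 6])  # made-up integer data

# Edges at 0.5, 1.5, ..., 6.5: one bin centered on each integer from 1 to 6.
edges = np.arange(data.min() - 0.5, data.max() + 1.5)
counts, _ = np.histogram(data, bins=edges)

# One tick per integer, including the largest one.
ticks = np.arange(data.min(), data.max() + 1)

print(edges)
print(counts)
```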