I'm using IPython Notebook to plot a histogram of lottery results. I want to show how many times each number appeared. The drawing results are stored in a CSV file laid out like a matrix. I tried loading the numbers into a NumPy matrix, converting it to an int array, and then plotting it with matplotlib.pyplot.hist(), but I get the wrong result (it looks like the bins are wrong: only 5 rectangles are shown, and I can't see the range). What is the easiest way to do this?
If you don't specify the range and number of bins, matplotlib.pyplot.hist() will take the range from the minimum and maximum of the data and default to 10 bins. This is usually not what you want.
The following works as expected:
import matplotlib.pyplot as plt
from numpy import random
N = random.randint( 0, 11, 20 )  # 20 integers from 0 to 10; random_integers is deprecated, and randint excludes the upper bound
plt.hist( N, range=[-.5,10.5], bins=11 )
plt.show()
I shifted the range by 0.5 so that each bar is centered on its integer tick mark.
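For the lottery case specifically, one way to get per-number counts is sketched below. It assumes the draws are already loaded as an integer array; the `draws` values and `max_number` are made up for illustration and are not taken from the question. It uses one bin per integer, with edges on the half-integers as in the answer above.

```python
import numpy as np

# Hypothetical draw results: one row per draw, one column per ball.
draws = np.array([
    [3, 11, 25, 34, 40],
    [3, 7, 11, 29, 40],
    [1, 11, 18, 25, 40],
])
max_number = 40  # assumed largest number that can be drawn

# Flatten the matrix into a 1-D array of all drawn numbers.
numbers = draws.ravel().astype(int)

# counts[k] is how many times the number k was drawn.
counts = np.bincount(numbers, minlength=max_number + 1)

# Equivalent histogram with one bin per integer: edges at 0.5, 1.5, ..., 40.5.
edges = np.arange(0.5, max_number + 1.5, 1.0)
hist_counts, _ = np.histogram(numbers, bins=edges)

print(counts[11], counts[40])
```

Passing `numbers` and `bins=edges` to plt.hist() should then draw one bar per lottery number, centered on the integer ticks.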
EDIT: I've found a general example where it doesn't work either!
I am trying to extract the data from a histogram, but the counts seem wrong. As an example:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.rand(1000000)
bins = np.arange(0,1,0.0001)
a,b,c = plt.hist(data,bins)
This gives me a rather messy histogram, and I've saved the counts as a and the bin edges as b. Now, plotting a against b, I should expect the same histogram, right? But that's not what I get:
plt.scatter(b[0:len(b)-1],a,s=2)
which gives me a plot that doesn't match at all! Furthermore, when I look up the maximum value of a, I get 144, which fits the scatter plot but not the histogram drawn by hist.
If I count the numbers myself with the following code:
len(np.intersect1d(np.where(data>=b[np.argmax(a)]),np.where(data<b[np.argmax(a)+1])))
then it also gives me 144, in accordance with the values. So is the displayed histogram just wrong for some reason, and I should ignore it and just take the extracted data?
Old, unedited post:
For a physics course I am trying to bin my results in the following way:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit
plt.rc("font", family=["Helvetica", "Arial"])
plt.rc("axes", labelsize=18)
plt.rc("xtick", labelsize=16, top=True, direction="in")
plt.rc("ytick", labelsize=16, right=True, direction="in")
plt.rc("axes", titlesize=22)
plt.rc("legend", fontsize=16)
data_Ra = np.loadtxt('Ra226_cal2_ch001.txt',skiprows=5)
t_Ra = data_Ra[:,0]*10**-8 # time in seconds
channels_Ra = data_Ra[:,1]
channels_Ra = channels_Ra[np.where(channels_Ra>0)] # removing all the measurements at channel = 0
intervalspace = 2 #The intervals in which we count
bins=np.arange(0,4000,intervalspace)
counts, intervals , stuff = plt.hist(channels_Ra,bins)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.show()
Here, the histogram plot looks totally fine, with a maximum near 13000 counts. But np.max(counts) returns about 24000, and when I plot the returned values directly with:
plt.scatter(intervals[0:len(intervals)-1]+intervalspace/2,counts,s=1)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.title('Ra226')
plt.show()
the result looks totally different, and I can't figure out why. I expect the scatter plot to resemble the histogram, and while the peaks are located at the same x-values, the heights do not match.
The same problem shows up in other large datasets as well.
I don't think I'm allowed to attach the txt file here, so I'm not sure how much more I can show, but any help will be appreciated!
I don't know why you interpret the results in that way.
If you look at the histogram plot, you will be able to see the maximum value of the y-axis is 25,000. That means that there are some values close to 25,000. This fact can be verified in the scatter plot.
Your scatter plot shows the actual values. It would be clearer if you described what you expect the plot to look like.
If you want to discard some outlier points, you should filter the data before plotting it.
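The counts returned by hist can be checked independently of how the figure is drawn. The sketch below uses np.histogram, which plt.hist relies on for the binning, on synthetic data rather than the poster's file; the seed, sample size, and bin width are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
data = rng.random(100_000)            # synthetic stand-in for the measurements
bins = np.arange(0, 1, 0.001)         # bin edges; the last edge is about 0.999

# Same binning rule as plt.hist: bins are half-open [left, right),
# except the last one, which also includes its right edge.
counts, edges = np.histogram(data, bins=bins)

# Count the fullest bin by hand and compare with the returned value.
i = int(np.argmax(counts))
manual = int(np.count_nonzero((data >= edges[i]) & (data < edges[i + 1])))

# Bin centers are a natural x-coordinate when re-plotting the counts.
centers = 0.5 * (edges[:-1] + edges[1:])

print(manual == counts[i], len(centers) == len(counts))
```

If the manual count agrees with the returned array, as it did in the question, the extracted data can be trusted and any mismatch lies in how thousands of very narrow bars are rendered on screen.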
To check the values returned by hist, I want to plot them with matplotlib's plot. hist returns the following:
import matplotlib.pyplot as plt
counts, bins, bars = plt.hist(x)
where x is the vector of data whose histogram you want to plot.
I have tried the following syntax
plt.plot(bins,counts)
I get the following error
Error: x and y must have the same first dimension, but have shapes (501,) and (500,)
Thank you for your answers.
From the matplotlib documentation of plt.hist():
bins : array
The edges of the bins. Length nbins + 1 (nbins left edges
and right edge of last bin). Always a single array even when multiple
data sets are passed in.
So the returned array bins has one more element than the number of bins, because it contains every left bin edge plus the right edge of the last bin.
You might not want to include the right edge of the last bin, therefore you can slice the array:
plt.plot(bins[:-1], counts)
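The length mismatch and two common ways around it can be seen without drawing anything. This is a small sketch on made-up data, using np.histogram in place of plt.hist since both return counts and edges of the same shapes:

```python
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])   # made-up sample data

counts, bins = np.histogram(x, bins=5)

# One more edge than there are bins.
print(len(counts), len(bins))

# Option 1: drop the last edge and plot counts against the left edges.
left_edges = bins[:-1]

# Option 2: plot counts against the midpoint of each bin.
centers = (bins[:-1] + bins[1:]) / 2
```

Either `left_edges` or `centers` has the same length as `counts`, so plt.plot accepts it as the x argument.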
Try this:
import matplotlib.pyplot as plt
plt.hist(x)
plt.show()
This is the simplest one I guess.
I have a 2D array of temperature-over-time data. There are about 7500 x-values and as many corresponding y-values (so one y for every x).
It looks like this:
The blue line in the middle is the result of my unsuccessful attempt to draw a plot line, which would represent the average of my data. Code:
import numpy as np
import matplotlib.pyplot as plt
data=np.genfromtxt("data.csv")
temp_av=[np.mean(data[1])]*len(data[0])
plt.figure()
plt.subplot(111)
plt.scatter(data[0],data[1])
plt.plot(data[0],temp_av)
plt.show()
However what I need is a curve, which will follow the rise in the temperature. Basically a line which will be somewhere in the middle of data points.
I googled for solutions, but all I found were suggestions on how to compute an average when you have multiple y-values for one x. I understand how to do that, but it doesn't help in this case.
My next idea would be to use a loop to compute an average for every two neighboring points. But I am not sure how best to do that, or whether there are better solutions.
Also, I understand that what I need is to compute another array. Plotting is only for representation.
If I understand correctly, what you are trying to plot is a trend line. You can do that with the numpy function 'polyfit'. If that's what you are looking for, try this small modification to your code:
import numpy as np
import matplotlib.pyplot as plt
data=np.genfromtxt("data.csv")
plt.figure()
plt.subplot(111)
plt.scatter(data[0],data[1])
pfit = np.polyfit(data[0], data[1], 1)
trend_line_model = np.poly1d(pfit)
plt.plot(data[0], trend_line_model(data[0]), "m--")
plt.show()
This will plot the trend line as a dashed magenta line. Note that a degree-1 fit is a straight line.
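If the rise in temperature is not well described by a polynomial, a moving average is another option, and it is close to the asker's idea of averaging neighboring points. This is a minimal sketch, not the answerer's method; the sample arrays and the window size are made up for illustration.

```python
import numpy as np

def moving_average(y, window):
    """Mean of each run of `window` consecutive samples (no padding)."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="valid")

# Hypothetical stand-ins for data[0] (time) and data[1] (temperature).
t = np.arange(8, dtype=float)
temp = np.array([1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0])

window = 3
smooth = moving_average(temp, window)

# x-values at the center of each window, so the curve lines up with the data.
t_smooth = t[(window - 1) // 2 : len(t) - window // 2]

print(smooth)
```

The smoothed curve can then be drawn over the scatter plot with plt.plot(t_smooth, smooth). A larger window gives a smoother line at the cost of losing more points at both ends.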
Currently I have a program that spits out data points like:
52.14535518
6.22793227
6.08643652
18.57737925
12.4697867
31.05047514
31.31070843
56.5758045
6.45830507
6.31006974
6.33210673
12.35320293
18.99089132
31.57124629
6.41475245
I want to create 200 evenly spaced bins so that, whenever the program outputs a data point, the bin covering that point's range is incremented by 1. That will tell me how many points fall in each range, and I can then plot the result as a histogram.
My question is how do I make these 200 bins, and have my program store the data values in each bin, and know how many points are in each bin.
Thanks!
Matplotlib has the ability to make histograms very easily. See this histogram demo.
An even shorter example would be:
import matplotlib.pyplot as plt
data = [52.14535518, 6.22793227, 6.08643652, ...] # <- your data
num_bins = 200 # <- number of bins for the histogram
plt.hist(data, num_bins)
plt.show()
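If the counts themselves are needed, or the points arrive one at a time, the bins can also be kept explicitly. The sketch below assumes a data range of 0 to 60 (chosen only because it covers the sample values in the question); `add_point` is a hypothetical helper name.

```python
import numpy as np

num_bins = 200
lo, hi = 0.0, 60.0                          # assumed data range; adjust as needed
edges = np.linspace(lo, hi, num_bins + 1)   # 201 edges define 200 bins
counts = np.zeros(num_bins, dtype=int)

def add_point(value):
    """Increment the bin that contains `value`; ignore out-of-range values."""
    if value < lo or value > hi:
        return
    idx = np.searchsorted(edges, value, side="right") - 1
    idx = min(idx, num_bins - 1)            # value == hi goes in the last bin
    counts[idx] += 1

# The sample values from the question, fed in one at a time.
stream = [52.14535518, 6.22793227, 6.08643652, 18.57737925, 12.4697867,
          31.05047514, 31.31070843, 56.5758045, 6.45830507, 6.31006974,
          6.33210673, 12.35320293, 18.99089132, 31.57124629, 6.41475245]
for v in stream:
    add_point(v)

print(counts.sum())
```

When all the data is available at once, np.histogram(stream, bins=edges) should produce the same counts in a single call.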
I have count data (100 values), each corresponding to a bin (0 to 99). I need to plot these data as a histogram. However, hist counts the values again and does not plot them correctly, because my data is already binned.
import random
import matplotlib.pyplot as plt
x = random.sample(range(1000), 100)
xbins = [0, len(x)]
#plt.hist(x, bins=xbins, color = 'blue')
#Does not make the histogram correct. It counts the occurrences of the individual counts.
plt.plot(x)
#plot works but I need this in histogram format
plt.show()
If I'm understanding what you want to achieve correctly then the following should give you what you want:
import matplotlib.pyplot as plt
plt.bar(range(0,100), x)
plt.show()
It doesn't use hist(), but it looks like you've already put your data into bins so there's no need.
The problem is with your xbins. You currently have
xbins = [0, len(x)]
which will give you the list [0, 100]. This means you will only see 1 bin (not 2) bounded below by 0 and above by 100. I am not totally sure what you want from your histogram. If you want to have 2 unevenly spaced bins, you can use
xbins = [0, 100, 1000]
to show everything below 100 in one bin, and everything else in the other bin. Another option would be to use an integer value to get a certain number of evenly spaced bins. In other words do
plt.hist(x, bins=50, color='blue')
where bins is the number of desired bins.
On a side note, whenever I can't remember how to do something with matplotlib, I will usually just go to the thumbnail gallery and find an example that looks more or less what I am trying to accomplish. These examples all have accompanying source code so they are quite helpful. The documentation for matplotlib can also be very handy.
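The effect of the different `bins` arguments described in this answer can be checked on a small array. This is a sketch with made-up values, using np.histogram, which interprets `bins` the same way plt.hist does:

```python
import numpy as np

x = np.array([5, 50, 99, 150, 400, 999])   # made-up values

# Two edges define a single bin; values above 100 are simply ignored.
one_bin, _ = np.histogram(x, bins=[0, 100])

# Three edges define two unevenly spaced bins.
two_bins, _ = np.histogram(x, bins=[0, 100, 1000])

# An integer asks for that many evenly spaced bins over the data range.
even, even_edges = np.histogram(x, bins=50)

print(one_bin, two_bins, len(even))
```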
Cool, thanks! Here's what I think the OP wanted to do:
import random
import matplotlib.pyplot as plt
x=[x/1000 for x in random.sample(range(100000),100)]
xbins=range(0,len(x))
plt.hist(x, bins=xbins, color='blue')
plt.show()
I am fairly sure that your problem is the bins. It is not a list of limits but rather a list of bin edges.
xbins = [0,len(x)]
returns, in your case, the list [0, 100], indicating that you want one bin edge at 0 and one at 100. So you get a single bin from 0 to 100.
What you want is:
xbins = [x for x in range(len(x))]
Which returns:
[0,1,2,3, ... 99]
Which indicates the bin edges you want.
You can achieve this using matplotlib's hist as well, no need for numpy. You have essentially already created the bins as xbins. In this case x will be your weights.
plt.hist(xbins, bins=len(xbins), weights=x)  # one bin per entry; the default would be only 10 bins
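The weights approach can be verified numerically before plotting. In this sketch the `counts` array is made up, and the bin positions and edges are built explicitly so that each pre-binned count lands in its own bin:

```python
import numpy as np

counts = np.array([5, 0, 3, 8, 2])        # made-up, already-binned counts
positions = np.arange(len(counts))        # one "sample" per bin: 0, 1, 2, ...
edges = np.arange(len(counts) + 1)        # edges 0..5 give 5 unit-width bins

# Each position is counted once, weighted by its pre-binned count.
heights, _ = np.histogram(positions, bins=edges, weights=counts)

print(heights)
```

plt.hist takes the same `bins` and `weights` arguments, so plt.hist(positions, bins=edges, weights=counts) should draw bars with these heights.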
Have a look at the histogram examples in the matplotlib documentation. You should use the hist function. If it by default does not yield the result you expect, then play around with the arguments to hist and prepare/transform/modify your data before providing it to hist. It is not really clear to me what you want to achieve, so I cannot help at this point.