Bin/Histogram making - python

Currently I have a program that spits out data points like:
52.14535518
6.22793227
6.08643652
18.57737925
12.4697867
31.05047514
31.31070843
56.5758045
6.45830507
6.31006974
6.33210673
12.35320293
18.99089132
31.57124629
6.41475245
I want to create 200 evenly spaced bins so that, as the program spits out each data point, the bin whose range contains that point is incremented by 1. That will tell me how many points fall in each range, which I can then plot as a histogram.
My question is: how do I make these 200 bins, have my program assign each data value to a bin, and find out how many points are in each bin?
Thanks!

Matplotlib has the ability to make histograms very easily. See this histogram demo.
An even shorter example would be:
import matplotlib.pyplot as plt
data = [52.14535518, 6.22793227, 6.08643652, ...] # <- your data
num_bins = 200 # <- number of bins for the histogram
plt.hist(data, num_bins)
plt.show()
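If you also need the numbers themselves (how many points landed in each bin, and which bin each point went to), a minimal sketch with NumPy could look like this. The variable names counts, edges and bin_index are just illustrative, and the data is the short sample from the question:

```python
import numpy as np

# Sample values copied from the question; a real run would have many more.
data = [52.14535518, 6.22793227, 6.08643652, 18.57737925, 12.4697867,
        31.05047514, 31.31070843, 56.5758045, 6.45830507, 6.31006974,
        6.33210673, 12.35320293, 18.99089132, 31.57124629, 6.41475245]

num_bins = 200

# counts[i] is the number of points in bin i; edges holds the
# num_bins + 1 evenly spaced bin boundaries from min(data) to max(data).
counts, edges = np.histogram(data, bins=num_bins)

# bin_index[j] is the 0-based bin that data[j] fell into. Digitizing
# against the interior edges keeps the maximum value in the last bin.
bin_index = np.digitize(data, edges[1:-1])

for value, idx in zip(data[:3], bin_index[:3]):
    print(f"{value} -> bin {idx} [{edges[idx]:.3f}, {edges[idx + 1]:.3f}]")
```

plt.hist returns the same counts and edges as its first two return values, so you can also capture them from the plotting call instead.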

Related

Cannot understand matplotlib pyplot histogram

I am just learning some basics of Data Analysis.
I have a simple csv data file like the one below.
START,FIRST,SECOND,ITEM
1,100,200,A
2,100,200,B
2,100,300,C
2,200,300,D
3,200,100,E
3,200,100,F
3,200,100,G
3,200,100,H
3,200,100,I
3,200,100,J
I wrote this small program to read this csv file and then print a histogram using matplotlib for the three columns START, FIRST, and SECOND. I also print a scatter plot for FIRST vs SECOND columns.
#!/exp/anaconda3/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
file_name = 'junk.csv'
data = pd.read_csv(file_name)
print(data.describe())
plt.rcParams['axes.grid'] = True
fig, axs = plt.subplots(2,2, figsize=(15,10))
axs[0, 0].hist(data['START'], 100, density=True, facecolor='g', alpha=0.8)
axs[1, 0].scatter(data['FIRST'], data['SECOND'], facecolor='violet')
axs[0, 1].hist(data['FIRST'], 100, density=True, facecolor='r', alpha=0.8)
axs[1, 1].hist(data['SECOND'], 100, density=True, facecolor='b', alpha=0.8)
plt.show()
What I do not understand is the histogram plots. For example, in the bottom right-hand image with blue bars in the attached picture, why does it not simply plot how many times the number 200 occurs, instead of showing that 200 occurs 0.10 times? How is that possible? The same goes for 300.
Can someone help me understand how matplotlib comes up with the Y-axis values? These values do not make sense to me.
Thank you.
Ruby Drew
Try density = False. The density parameter tells matplotlib whether you want it to normalise the heights or not such that it represents a probability density.
First note that a histogram is primarily meant to count continuous samples in small bins. For discrete data, the bins should be chosen carefully so that their boundaries fall in between the values. When you pass bins=N, matplotlib assumes a continuous distribution and subdivides the range from the smallest to the largest sample into N equally sized bins. For discrete data this can have unexpected side effects, such as a value that lies exactly on a bin boundary landing in either of the two neighbouring bins.
With density=True, the heights of the bars are rescaled so that the total area of all bars is 1. For a continuous distribution with many samples, this resembles the probability density function, so a kde plot can be drawn on the same y-axis.
So, what's happening in the blue histogram:
100 bins are created between 100 and 300. Each bin will be 2 wide.
3 bins get values: the bin 100-102 gets a count of 6, either the bin 198-200 or the bin 200-202 gets a count of 2, and the bin 298-300 also gets a count of 2.
The total count over all bins is now 10. As each bin is 2 wide, the counts need to be divided by (total_count * bin_width) = 20 to obtain a total area of 1, which gives heights of 0.3, 0.1 and 0.1.
The sum of height times width over the bars is indeed 1: 0.3*2 + 0.1*2 + 0.1*2 = 1.
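To check this arithmetic yourself, here is a small sketch that reproduces the density heights with np.histogram, using the SECOND column from the CSV in the question. The names second, manual and nonzero are only for illustration:

```python
import numpy as np

# SECOND column of the CSV in the question.
second = np.array([200, 200, 300, 300, 100, 100, 100, 100, 100, 100])

# Raw counts and density-normalised heights over the same 100 bins.
counts, edges = np.histogram(second, bins=100)
heights, _ = np.histogram(second, bins=100, density=True)

widths = np.diff(edges)  # every bin is 2 wide here

# density=True divides each count by (total count * bin width).
manual = counts / (counts.sum() * widths)

nonzero = counts > 0
print(counts[nonzero])             # raw counts of the occupied bins
print(heights[nonzero])            # the y-axis values matplotlib shows
print((heights * widths).sum())    # total area of the bars
```

The occupied bins have heights 0.3, 0.1 and 0.1, which is where the 0.10 on the y-axis comes from.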
Seaborn's histplot (version 0.11 and later) has a parameter discrete= to indicate that a distribution is discrete, and a parameter stat= where you can choose between 'count' for bin heights showing the usual counts and 'probability' for heights equal to each value's share of the samples, mimicking a probability mass function. The blue histogram could be drawn as:
import seaborn as sns
sns.histplot(data, x='SECOND', discrete=True, stat='probability', facecolor='b', alpha=0.8, ax=axs[1, 1])

python bar chart not centered

I am trying to build a simple histogram. For some reason, my bars are behaving abnormally. As you can see in this picture, my bar over "3" is moved to the right side. I am not sure what caused it. I did align='mid' but it did not fix it.
This is the code that I used to create it:
def createBarChart(colName):
    df[colName].hist(align='mid')
    plt.title(str(colName))
    RUNS = [1,2,3,4,5]
    plt.xticks(RUNS)
    plt.show()
for column in colName:
    createBarChart(column)
And this is what I got:
[Image: bar is not centered over 3]
To recreate my data:
df = pd.DataFrame(np.random.randint(1,6,size=(100, 4)), columns=list('ABCD'))
Thank you for your help!
P/s: idk if this info is relevant, but I am using seaborn-whitegrid style. I tried to recreate a plot with sample data and it's still showing up. Is it a bug?
[Image: hist created using random data]
The hist function is behaving exactly as it is supposed to. By default it splits the data you pass into 10 bins, with the left edge of the first bin at the data's minimum value and the right edge of the last bin at its maximum. The chart below shows the randomly generated data binned this way, with red dashed lines to mark the edges of the bins.
The way around this is to define the bin edges yourself, with a slight adjustment to the minimum and maximum values to centre the bars over the x axis ticks. This can be done quite easily with numpy's linspace function (using column A in the randomly generated data frame as an example):
bins = np.linspace(df["A"].min() - .5, df["A"].max() + .5, 6)
df["A"].hist(bins=bins)
We ask for 6 values because we are defining the bin edges, and 6 edges give 5 bins, as shown in this chart:
If you wanted to keep the gaps between the bars you can increase the number of bins to 9 and adjust the offset slightly, but this wouldn't work in all cases (it works here because every value is either 1, 2, 3, 4 or 5).
bins = np.linspace(df["A"].min() - .25, df["A"].max() + .25, 10)
df["A"].hist(bins=bins)
Finally, as this data contains discrete values and really you are plotting the counts, you could use the value_counts function to create a series that can then be plotted as a bar chart:
df["A"].value_counts().sort_index().plot(kind="bar")
# Provide a 'color' argument if you need all of the bars to look the same.
df["A"].value_counts().sort_index().plot(kind="bar", color="steelblue")
Try something like this in your code to centre every bar on its value:
plt.hist(data, bins=range(1,7), align='left', rwidth=1, density=True)  # density=True replaces the older normed=True
Replace data with your own values, for example df["A"].
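Both answers come down to placing the bin edges halfway between the integer values. A minimal sketch of that idea, assuming the data are integers (the random data and the names values, edges and centers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(1, 6, size=100)  # integers 1..5, like the question's data

# Edges at min - 0.5, min + 0.5, ..., max + 0.5 give one unit-wide bin
# centred on each integer between the minimum and the maximum.
edges = np.arange(values.min() - 0.5, values.max() + 1.5)

counts, _ = np.histogram(values, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2

for c, n in zip(centers, counts):
    print(int(c), n)
```

Passing the same edges as bins=edges to df["A"].hist or plt.hist should put each bar directly over its tick.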

how to plot histogram of lottery numbers?

I'm using IPython notebook to plot a histogram of lottery number results. I want to show how many times each number appeared. I have the drawing results in a CSV file, laid out like a matrix. I've tried loading the numbers into a numpy matrix, converting it to an int array and then using matplotlib.pyplot.hist() to plot it, but I get the wrong result (it looks like the wrong bins: only 5 rectangles are shown and I can't see the range). What is the easiest way to get this?
If you don't specify the range and number of bins, matplotlib.pyplot.hist() will guess the range and default to 10 bins. This is usually not what you want.
The following works as expected
import matplotlib.pyplot as plt
from numpy import random
N = random.randint( 0, 11, 20 )  # 20 integers from 0 to 10; randint's upper bound is exclusive (random_integers is deprecated)
plt.hist( N, range=[-.5,10.5], bins=11 )
plt.show()
I shifted the range by 0.5 so that the bars line up nicely with the tick marks.
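Since the goal is to count how often each number appeared, you can also count the occurrences directly and skip the binning. A small sketch, where the draws array is made-up stand-in data for whatever you load from the CSV:

```python
import numpy as np

# Hypothetical results: each row is one drawing, each column one drawn number.
draws = np.array([[3, 11, 24, 30, 41],
                  [5, 11, 19, 30, 44],
                  [3, 8, 24, 33, 41]])

# np.unique flattens the matrix and returns every distinct number
# together with how many times it occurs.
numbers, counts = np.unique(draws, return_counts=True)

for number, count in zip(numbers, counts):
    print(number, count)
```

plt.bar(numbers, counts) would then draw one bar per lottery number, with no bin edges to tune.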

Plotting histogram in python

I wrote the following program in Python to obtain equi-width histograms. But when I plot it, I get a single line in the figure instead of a histogram. Can someone please help me figure out where I am going wrong?
import numpy as np
import matplotlib.pyplot as plt
for num in range(0,5):
    hist, bin_edges = np.histogram([1000, 98,99992,8474,95757,958574,97363,97463,1,4,5], bins = 5)
    plt.bar(bin_edges[:-1], hist, width = 1000)
    plt.xlim(min(bin_edges), max(bin_edges))
    plt.show()
Additionally, I want to label each plot with its "num" value, which ranges from 0 to 4. In the example above I have kept my data constant, but I intend to use different data for different "num" values.
Look at your bin edges:
>>> bin_edges
array([ 1.00000000e+00, 1.91715600e+05, 3.83430200e+05,
5.75144800e+05, 7.66859400e+05, 9.58574000e+05])
Your bin positions range from 1 to approximately 1 million, but you only gave the bars a width of 1000. Your bars, where they exist at all, are too skinny to be seen. Also, most of the bars have zero height, because most of the bins are empty:
>>> hist
array([10, 0, 0, 0, 1])
The "line" you see is the last bin, with one element. This bin covers a span of approximately 200000, but the bar width is only 1000, so it is very thin relative to the amount of space it is supposed to cover. The bar of height 10 is also there, but it's also very skinny, and jammed up against the left edge of the plot, so it's basically invisible.
It doesn't make sense to try to use constant-width bars while also placing them at x-coordinates that correspond to their size. By putting the bars at those x-coordinates, you are already spacing them out proportional to the bin widths; making the bars skinnier doesn't bring them closer together, it just makes them invisible.
If you want to use constant-width bars, you should put them at sequential X positions and use labels on the axis to show the values the bins represent. Here's a simple example with your data:
plt.bar(np.arange(len(bin_edges)-1), hist, width=1)
plt.xticks((np.arange(len(bin_edges))-0.5)[1:], bin_edges[:-1])
You'll have to decide how you want to format those labels.
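As one possible way to format those labels, the sketch below builds a "low to high" string for each bin. The label format and the names positions and labels are just one choice, not the only reasonable one:

```python
import numpy as np

data = [1000, 98, 99992, 8474, 95757, 958574, 97363, 97463, 1, 4, 5]
hist, bin_edges = np.histogram(data, bins=5)

# Sequential x positions 0, 1, 2, ... so every bar gets the same width.
positions = np.arange(len(hist))

# One label per bin, built from its lower and upper edge.
labels = [f"{lo:,.0f} to {hi:,.0f}"
          for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

for label, count in zip(labels, hist):
    print(label, count)
```

You could then call plt.bar(positions, hist, width=1) followed by plt.xticks(positions, labels, rotation=45), and plt.title(f"num = {num}") inside your loop to label each plot with its num value.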

python plot simple histogram given binned data

I have count data (100 values), each corresponding to a bin (0 to 99). I need to plot these data as a histogram. However, the histogram function counts the values again and does not plot them correctly, because my data is already binned.
import random
import matplotlib.pyplot as plt
x = random.sample(range(1000), 100)
xbins = [0, len(x)]
#plt.hist(x, bins=xbins, color = 'blue')
#Does not make the histogram correct. It counts the occurrences of the individual counts.
plt.plot(x)
#plot works but I need this in histogram format
plt.show()
If I'm understanding what you want to achieve correctly then the following should give you what you want:
import matplotlib.pyplot as plt
plt.bar(range(0,100), x)
plt.show()
It doesn't use hist(), but it looks like you've already put your data into bins so there's no need.
The problem is with your xbins. You currently have
xbins = [0, len(x)]
which will give you the list [0, 100]. This means you will only see 1 bin (not 2) bounded below by 0 and above by 100. I am not totally sure what you want from your histogram. If you want to have 2 unevenly spaced bins, you can use
xbins = [0, 100, 1000]
to show everything below 100 in one bin, and everything else in the other bin. Another option would be to use an integer value to get a certain number of evenly spaced bins. In other words do
plt.hist(x, bins=50, color='blue')
where bins is the number of desired bins.
On a side note, whenever I can't remember how to do something with matplotlib, I will usually just go to the thumbnail gallery and find an example that looks more or less what I am trying to accomplish. These examples all have accompanying source code so they are quite helpful. The documentation for matplotlib can also be very handy.
Cool, thanks! Here's what I think the OP wanted to do:
import random
import matplotlib.pyplot as plt
x=[x/1000 for x in random.sample(range(100000),100)]
xbins=range(0,len(x))
plt.hist(x, bins=xbins, color='blue')
plt.show()
I am fairly sure that your problem is the bins. It is not a list of limits but rather a list of bin edges.
xbins = [0,len(x)]
in your case gives the list [0, 100], indicating that you want one bin edge at 0 and one at 100. So you get a single bin from 0 to 100.
What you want is:
xbins = [x for x in range(len(x))]
Which returns:
[0,1,2,3, ... 99]
Which indicates the bin edges you want.
You can achieve this using matplotlib's hist as well, no need for numpy. You have essentially already created the bins as xbins. In this case x will be your weights.
plt.hist(xbins,weights=x)
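One caveat with the weights approach: without an explicit bins argument, hist falls back to its default number of bins and merges neighbouring counts. A minimal sketch that keeps one bar per original bin, assuming the bins are the integers 0 to 99 (positions and edges are illustrative names):

```python
import random

import numpy as np

random.seed(0)
x = random.sample(range(1000), 100)  # already-binned counts, as in the question

positions = np.arange(len(x))   # one "sample" sitting in each bin: 0..99
edges = np.arange(len(x) + 1)   # unit-wide bins [0, 1), [1, 2), ..., [99, 100]

# Weighting each position by its count reproduces the counts as bar heights.
heights, _ = np.histogram(positions, bins=edges, weights=x)

print(heights[:5], x[:5])
```

plt.hist(positions, bins=edges, weights=x) takes the same arguments and should draw exactly these heights.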
Have a look at the histogram examples in the matplotlib documentation. You should use the hist function. If it by default does not yield the result you expect, then play around with the arguments to hist and prepare/transform/modify your data before providing it to hist. It is not really clear to me what you want to achieve, so I cannot help at this point.
