I am trying to plot a histogram of datetime.time values, where the values are discretized into five-minute slices. The data looks like this, in a list:
['17:15:00', '18:20:00', '17:15:00', '13:10:00', '17:45:00', '18:20:00']
I would like to plot a histogram, or some form of distribution graph so that the number of occurrences of each time can be examined easily.
NB: Since each time is discretised, the maximum number of bins in a histogram would be 288 = (60 / 5 * 24).
I have looked at matplotlib.pyplot.hist, but it requires some sort of continuous scalar.
I did what David Zwicker said and used seconds, and then changed the x axis. I will look at what Dave said about 'bins'. This works roughly and gives a bar per hour plot to start with.
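import matplotlib.pyplot as plt

def chart(occurrence_list):
    # Keep only the hour component of each datetime.time value
    hour_list = [t.hour for t in occurrence_list]
    print(hour_list)
    numbers = list(range(24))
    labels = [str(x) for x in numbers]
    plt.xticks(numbers, labels)
    plt.xlim(0, 24)
    # Default of 10 bins; pass bins=range(25) for exactly one bar per hour
    plt.hist(hour_list)
    plt.show()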
You have to convert the data into two variables, and then you can use plotlab to plot them as a histogram.
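As a rough sketch of the "convert to a number, then relabel the axis" idea for the five-minute slices: the code below assumes the times arrive as 'HH:MM:SS' strings like the sample list in the question (for real datetime.time objects the strptime step can be skipped). The names slot_index, slot_counts and plot_slot_counts are made up for this illustration.

```python
from collections import Counter
from datetime import datetime

SLOT_MINUTES = 5                          # slice width taken from the question
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES   # 288 possible bins

def slot_index(time_string):
    """Map an 'HH:MM:SS' string to its five-minute slot number (0..287)."""
    t = datetime.strptime(time_string, "%H:%M:%S").time()
    return (t.hour * 60 + t.minute) // SLOT_MINUTES

def slot_counts(time_strings):
    """Count occurrences per slot; slots without data are simply absent."""
    return Counter(slot_index(s) for s in time_strings)

def plot_slot_counts(counts):
    """Bar chart of counts per slot, with the x axis labelled in hours."""
    import matplotlib.pyplot as plt  # imported here so counting works without it
    slots = sorted(counts)
    plt.bar(slots, [counts[s] for s in slots], width=1.0, align="edge")
    # 12 five-minute slots per hour, so put one tick every 12 slots
    plt.xticks(range(0, SLOTS_PER_DAY + 1, 12), [str(h) for h in range(25)])
    plt.xlim(0, SLOTS_PER_DAY)
    plt.xlabel("hour of day")
    plt.ylabel("occurrences")
    plt.show()

times = ['17:15:00', '18:20:00', '17:15:00', '13:10:00', '17:45:00', '18:20:00']
counts = slot_counts(times)
```

Calling plot_slot_counts(counts) draws the chart. Counting first and drawing with bar avoids having to hand hist a continuous scalar at all.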
I am plotting a histogram, with another set of data, but the frequencies are all 1, no matter how I change the number of bins. I did this with data generated from a normal distribution in the following fashion
x=npr.normal(0,2,(1,100))
plt.hist(x,bins=10)
and I get the following histogram:
This happens even if I increase the number of simulations to 1000 or 10000.
How do I plot a histogram that displays the bell shape of the normal distribution?
Thanks in advance.
You are plotting one histogram for each column of your input array: that is, one histogram containing a single value for each of your 100 columns.
x=npr.normal(0,2,(1,100))
plt.hist(x[0],bins=10)
will do (note that I am selecting the first (and only) row of x).
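To see the shape issue without drawing anything, here is a small sketch. It assumes npr in the question is numpy.random; the seeded generator is only there to make the run repeatable, and np.histogram stands in for the counting that plt.hist does on a 1-D input.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only so the sketch is repeatable
x = rng.normal(0, 2, (1, 100))      # same shape as in the question

print(x.shape)      # (1, 100): hist treats this as 100 datasets of one value each
print(x[0].shape)   # (100,):   hist treats this as one dataset of 100 values

flat = x.ravel()    # same values as x[0] here, and works for any input shape
counts, edges = np.histogram(flat, bins=10)
```

Passing flat (or x[0]) to plt.hist gives a single histogram whose ten counts add up to 100.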
So the dataset that I'm using is tips from seaborn.
I wanted to plot a histogram against the total_bill column, and I did that using both seaborn and matplotlib.
This is my matplotlib histogram:
plt.hist(tips_df.total_bill);
And this is my seaborn histogram:
sns.histplot(tips_df.total_bill)
As you can see, around a total_bill of 13, the frequency seems to be maximum.
However, in matplotlib it's around 68, while it's around 48 in seaborn.
Both are wrong, because on typing
tips_df["total_bill"].value_counts().sort_values(ascending=False).head(5)
we get the output
13.42 3
15.69 2
10.34 2
10.07 2
20.69 2
Name: total_bill, dtype: int64
As we can see, the most frequent bill is around 13, but why are the count values on the y-axis wrong?
In a histogram, a "rectangle"'s height represents how many values are in the given range which is in turn described by the width of the rectangle. You can get the width of each rectangle by (max - min) / number_of_rectangles.
For example, in matplotlib's output, there are 10 rectangles (bins). Since your data has a minimum around 3 and a maximum around 50, each bin is around 4.7 units wide. Now, to get the 3rd rectangle's range, for example, we start from the minimum and add this width until we get there, i.e., 3 + 4.7*2 = 12.4. It then ends at 12.4 + 4.7 = 17.1. So, the count corresponding to the 3rd bin is the number of values in tips_df.total_bill that fall in this range. Let's find it manually:
>>> tips_df.total_bill.between(12.4, 17.1).sum()
70
(since I used crude approximations in calculating ranges and omitted precision, it is not exact; but I hope you get the feeling.)
This so far was to explain why a direct value_counts doesn't match the histogram output directly, because it gives value-by-value counts whereas histogram is about ranges.
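The same bookkeeping can be checked numerically. The sketch below uses synthetic uniform data as a stand-in (it is NOT the tips dataset, so the counts differ from the ones quoted above) and np.histogram, which with an integer bins argument splits the data range into equal-width bins.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(3, 50, size=244)    # synthetic stand-in, not the tips data

n_bins = 10
width = (data.max() - data.min()) / n_bins       # (max - min) / number_of_rectangles
counts, edges = np.histogram(data, bins=n_bins)  # edges has n_bins + 1 entries

# Third bin covers [edges[2], edges[3]); only the last bin includes its right edge
lo, hi = edges[2], edges[3]
manual = int(((data >= lo) & (data < hi)).sum())
```

Here manual equals counts[2], and counts.sum() equals the number of observations: each value lands in exactly one range, which is why the bar heights are much larger than any single value_counts entry.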
Now, why the different graphs between seaborn & matplotlib? It's because they use a different number of bins! If you count, matplotlib has 10 and seaborn has 14. Since you didn't specify the bins argument to either of them, they use default values: matplotlib defaults to plt.rcParams["hist.bins"] (10 unless changed) and seaborn chooses "automatically" (see Notes section here).
So, we might as well give bins arguments to enforce the same output:
>>> plt.hist(tips_df.total_bill, bins=10)
>>> sns.histplot(tips_df.total_bill, bins=10)
I have a data frame which is indexed by DateTime in pandas.
I have data about a car with the Inside temperature, Lowest inside temperature, Highest temperature and the same three features for the Outside temperature.
Thus I plot all 6 features as a time series, and have tried to use plt.fill_between like so:
car_df[['insideTemp','insideTempLow','insideTempHigh','outsideTemp','outsideTempLow','outsideTempHigh']].plot()
plt.fill_between(car_df['insideTemp'], car_df['insideTempLow'],car_df['insideTempHigh'], data=car_df)
plt.fill_between(car_df['outsideTemp'], car_df['outsideTempLow'],car_df['outsideTempHigh'], data=car_df)
plt.show()
I get 6 lines as desired, however nothing seems to get filled (thus not separating the two categories of indoor and outdoor).
Any ideas? Thanks in advance.
You passed the wrong arguments to fill_between.
The proper parameters are as follows:
x - x coordinates, in your case index values,
y1 - y coordinates of the first curve,
y2 - y coordinates of the second curve.
For readability, you usually also want to pass the color parameter.
I performed such a test to draw just 2 lines (shortening column names)
and fill the space between them:
car_df[['inside', 'outside']].plot()
plt.fill_between(car_df.index, car_df.inside, car_df.outside,
color=(0.8, 0.9, 0.5));
and got the following picture:
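Applied to the six columns from the question, the same call pattern could look like the sketch below. The data frame is synthetic (made-up values, daily index) and plot_bands is a hypothetical helper name; only the column names come from the question. The lines are drawn with ax.plot rather than DataFrame.plot so that lines and bands share the same x values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for car_df: made-up temperatures on a daily DatetimeIndex
idx = pd.date_range("2021-01-01", periods=30)
inside = 20 + np.sin(np.linspace(0, 6, 30))
outside = 10 + 3 * np.cos(np.linspace(0, 6, 30))
car_df = pd.DataFrame({
    "insideTemp": inside, "insideTempLow": inside - 1, "insideTempHigh": inside + 1,
    "outsideTemp": outside, "outsideTempLow": outside - 2, "outsideTempHigh": outside + 2,
}, index=idx)

def plot_bands(df):
    """Draw each temperature line with a shaded low/high band around it."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    for name in ("inside", "outside"):
        ax.plot(df.index, df[name + "Temp"], label=name)
        # x is the index; y1 and y2 are the lower and upper curves
        ax.fill_between(df.index, df[name + "TempLow"], df[name + "TempHigh"], alpha=0.3)
    ax.legend()
    return ax

# ax = plot_bands(car_df); then call matplotlib.pyplot.show() to display it
```

A low alpha keeps the central line visible through the shaded band.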
Assume I have the following observations:
1,2,3,4,5,6,7,100
Now I want to plot how the observations are distributed percentage-wise:
First 12.5% of the observations is <=1 (1 out of 8)
First 50% of the observations is <=4 (4 out of 8)
First 87.5% of the observations is <=7 (7 out of 8)
First 100% of the observations is <=100 (8 out of 8)
My questions:
What is this kind of plot called (max observation per percentile on the y axis, percentile on the x axis)? Is it a kind of histogram?
How can I create this kind of plot in Matplotlib/Numpy?
Thanks
I'm not sure what such a plot would be called (edit: it appears it's called a cumulative frequency plot, or something similar). However, it's easy to do.
Essentially, if you have sorted data, then the percentage of observations <= a value at index i is just (i+1)/len(data). It's easy to create an x array using arange that satisfies this. So, for example:
from matplotlib import pylab
import numpy as np
a = np.array([1,2,3,4,5,6,7,100])
# float() keeps the division from truncating to 0 under Python 2
pylab.plot( np.arange(1,len(a)+1)/float(len(a)), a, # This part is required
            '-', drawstyle='steps' ) # This part is stylistic
pylab.show()
Gives:
If you'd prefer your x axis go from 0 to 100 rather than 0 to 1, multiply the x values by 100.
Note too that this works for your example data because it is already sorted. If you are using unsorted data, then sort it first, with np.sort for example:
c = np.random.randn(100)
c.sort()
pylab.plot( np.arange(1,len(c)+1)/float(len(c)), c, '-', drawstyle='steps' )
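The sort-then-rank recipe can be wrapped up so the numbers can be inspected before plotting. This is only a sketch: cumulative_fractions and plot_cumulative are names invented here, and the input is the question's eight observations in shuffled order to show that sorting is handled.

```python
import numpy as np

def cumulative_fractions(values):
    """Return (fractions, ordered): fraction of observations <= each sorted value."""
    ordered = np.sort(np.asarray(values, dtype=float))
    fractions = np.arange(1, ordered.size + 1) / ordered.size
    return fractions, ordered

def plot_cumulative(values):
    """Step plot with percent of observations on x and the value on y."""
    import matplotlib.pyplot as plt
    fractions, ordered = cumulative_fractions(values)
    plt.plot(100 * fractions, ordered, "-", drawstyle="steps")
    plt.xlabel("percent of observations")
    plt.ylabel("value")
    plt.show()

# The observations from the question, deliberately out of order
fractions, ordered = cumulative_fractions([100, 3, 1, 7, 5, 2, 6, 4])
percent = 100 * fractions
```

With this input, percent[3] is 50.0 and ordered[3] is 4.0, matching the "first 50% is <= 4" row in the question.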
I have tried to research this problem, but failed. I'm quite a beginner at python, so bear with me.
I have a textfile containing numbers on each line (they are angles in degrees).
I want to first cluster the angles into cluster sizes of 20. Then I want to plot this on a histogram. I have the following code:
angle = open(output_dir+'/chi_angle.txt', 'r').read().splitlines()
array = numpy.array(map(float, angle))
hello = list(array)
from cluster import *
cl = HierarchicalClustering(hello, lambda x,y: abs(x-y))
clusters = cl.getlevel(20)
frequency = [len(x) for x in clusters]
average = [1.0*sum(x)/len(x) for x in clusters]
Now, my question is: how do I plot the histogram?
Doing the following:
pylab.hist(average, bins=50)
pylab.xlabel('Chi 1 Angle [degrees]')
pylab.ylabel('#')
pylab.show()
will show a histogram with bars correctly placed (i.e. at the average of each cluster), but it won't show how many "angles" each cluster contains.
Just for clarification. The clustered data looks like this:
clusters = [[-60.26, -30.26, -45.24], [163.24, 173.24], [133.2, 123.23, 121.23]]
I want the mean of each cluster, and the number of angles in each cluster. On the histogram the first bar will thus be located at around -50 and will be a height of 3. How do I plot this?
Thanks a lot!
Not sure I understood your question. Anyhow try saving your histogram in this array
H=hist(average, bins=50)
If you want to plot it then do
plot(H[1][1:],H[0])
H[1] is an array that stores the bin edges (one more entry than there are bins, which is why the first one is dropped with [1:]) and H[0] the counts in each bin. I hope this helped.
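If the goal is one bar per cluster, positioned at the cluster mean and as tall as the cluster is large, a bar chart of (mean, size) pairs is one way to get it. The sketch below uses the example clusters from the question; plot_cluster_sizes and the bar width of 10 degrees are arbitrary choices made for this illustration.

```python
# Example clusters copied from the question
clusters = [[-60.26, -30.26, -45.24], [163.24, 173.24], [133.2, 123.23, 121.23]]

means = [sum(c) / len(c) for c in clusters]   # x position of each bar
sizes = [len(c) for c in clusters]            # height of each bar

def plot_cluster_sizes(means, sizes, bar_width=10.0):
    """One bar per cluster: centred on the cluster mean, height = member count."""
    import matplotlib.pyplot as plt
    plt.bar(means, sizes, width=bar_width)
    plt.xlabel("Chi 1 Angle [degrees]")
    plt.ylabel("#")
    plt.show()
```

Calling plot_cluster_sizes(means, sizes) draws three bars of heights 3, 2 and 3. Unlike hist, bar does no counting of its own, so the heights are exactly the values passed in.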
Why don't you just use a histogram right away?
A histogram of cluster centers is not a very sensible representation of your data.
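A sketch of the "histogram right away" alternative: bin the raw angles directly into 20-degree bins. It assumes the angles lie in [-180, 180] degrees, which the question does not state, and reuses the example values from the question as stand-in data.

```python
import numpy as np

# Raw angles (the example values from the question, flattened into one list)
angles = [-60.26, -30.26, -45.24, 163.24, 173.24, 133.2, 123.23, 121.23]

# Assumed range of -180..180 degrees, split into bins 20 degrees wide
edges = np.arange(-180, 181, 20)
counts, _ = np.histogram(angles, bins=edges)

# Bin centres are handy as bar positions or axis labels
centres = (edges[:-1] + edges[1:]) / 2
```

Passing bins=edges to pylab.hist(angles, bins=edges) draws the same counts. Fixed bins will not line up exactly with the hierarchical clusters, but every bar height is a true count of angles.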