How to stop pyplot from overlapping histogram bins? - python

I'm sure there's an easy answer to this and I'm just looking at things wrong, but what's going on with my pyplot histogram? Here's the output; the data contains participants between the ages of 18 and 24, with no fractional ages (nobody's 18.5):
Why are the bins staggered like this? The current width is set to 1, so each bar should be the width of a bin, right? The problem gets even worse when the width is less than 0.5, when the bars look like they're in completely different bins.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
csv = pd.read_csv(r'F:\Python\Delete\Delete.csv')  # raw string so the backslashes are not treated as escapes
age = csv.age
gender = csv.gender
new_age = age[~np.isnan(age)]
new_age_f = new_age[gender==2]
new_age_m = new_age[gender==1]
plt.hist(new_age_f, alpha=.80, label='Female', width=1, align='left')
plt.hist(new_age_m, alpha=.80, label='Male', width=1, align='left')
plt.legend()
plt.show()
Thank you!

plt.hist has no width argument of its own. If width is specified, it is passed through to the underlying patches, meaning that each rectangle is drawn 1 unit wide. This has nothing to do with the bin width of the histogram, and I would guess there is little to no reason to ever use width in a histogram call at all.
Instead what you want is to specify the bins. You probably also want to use the same bins for both histogram plots.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(5)
import pandas as pd
csv = pd.DataFrame({"age" : np.random.randint(18,27, 20),
"gender" : np.random.randint(1,3,20)})
age = csv.age
gender = csv.gender
new_age = age[~np.isnan(age)]
new_age_f = new_age[gender==2]
new_age_m = new_age[gender==1]
bins = np.arange(new_age.values.min(),new_age.values.max()+2)
plt.hist(new_age_f, alpha=.40, label='Female', bins=bins, ec="k")
plt.hist(new_age_m, alpha=.40, label='Male', bins=bins, ec="k")
plt.legend()
plt.show()
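If each integer age should sit in the middle of its bar, one option is to put the bin edges on the half-integers. The sketch below is only an illustration with made-up ages (the `ages` array is an assumption, not the asker's data). It uses np.histogram so the counts can be checked without drawing anything; the same `edges` array could be passed as bins= to both plt.hist calls.

```python
import numpy as np

# Hypothetical integer ages, standing in for the real CSV column
ages = np.array([18, 18, 19, 21, 21, 21, 24])

# Edges at 17.5, 18.5, ..., 24.5 so that every integer age is centred in its bin
edges = np.arange(ages.min() - 0.5, ages.max() + 1.5, 1.0)

# One count per integer age from 18 to 24
counts, _ = np.histogram(ages, bins=edges)
print(dict(zip(range(18, 25), counts.tolist())))
```

Because both groups would share one `edges` array, their bars line up exactly instead of being staggered.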

Related

Seaborn: How to change the color of individual bars in histogram?

I searched the internet but couldn't find a solution.
I have this graph and I want to change the color of the first bar. If I use the parameter 'color', it changes all the bars.
Is it possible to do this?
Thank you so much!
You could access the list of generated rectangles via ax.patches, and then recolor the first one:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'Sales': 100000 * (np.random.rand(80) ** 1.5) + 18000})
ax = sns.histplot(x='Sales', data=df, bins=4, color='skyblue', alpha=1)
ax.patches[0].set_facecolor('salmon')
plt.show()
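The same idea also works with plain matplotlib, where ax.hist hands back the bar patches directly. This is a minimal sketch under assumptions: the data is random placeholder data, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
import numpy as np

# Placeholder data shaped like the example above
rng = np.random.default_rng(0)
sales = 100000 * rng.random(80) ** 1.5 + 18000

fig, ax = plt.subplots()
# hist returns the counts, the bin edges and the container of bar patches
counts, edges, bars = ax.hist(sales, bins=4, color="skyblue")

# Recolor only the first bar
bars[0].set_facecolor("salmon")
first_color = bars[0].get_facecolor()
```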
To get a separation exactly at 40,000, you could create two histograms on the same subplot. Exact limits can be set with binrange=:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'Sales': 100000 * (np.random.rand(80) ** 1.5) + 18000})
# either choose a fixed limit, or set it exactly at one fourth
limit = 40000
# limit = df['Sales'].min() + 0.25 * (df['Sales'].max() - df['Sales'].min())
ax = sns.histplot(x='Sales', data=df[df['Sales'] <= limit],
bins=1, binrange=(df['Sales'].min(), limit), color='salmon')
sns.histplot(x='Sales', data=df[df['Sales'] > limit],
bins=3, binrange=(limit, df['Sales'].max()), color='skyblue', ax=ax)
plt.show()
Use:
import pandas as pd
import seaborn as sns
s = [1,1,2,2,1,3,4]
s = pd.DataFrame({'val': s, 'col':['1' if x==1 else '0' for x in s]})
sns.histplot(data=s, x="val", hue="col")
The output:
Well, the exact way will depend on which mapping software you are using. Your best bet is to break your data into two sets, one for the first bar and one for the rest. You should be able to output each of the sets in its own colour.

changing major/minor axis interval and color scheme for heatmap

You can find my data set here.
I am using seaborn to plot the heatmap. But open to other choices.
I have trouble getting the color scheme right. I would like a black-and-white scheme, as the current color scheme doesn't clearly show the result.
I also wish to display only these x and y tick values: (0, 25, 50, 100, 127).
How can I do this?
Below is my try:
import pandas as pd
import numpy
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
data_sorted = pd.read_csv("tors_sorted.txt", sep="\t")
ax = plt.axes()
ax.set_xlim(right=128)
minor_ticks = numpy.arange(0, 128, 50) # doesn't seem to work
data_sorted = data_sorted.pivot(index="dst", columns="src", values="pdf_num_bytes")
#sns.heatmap(data_sorted,ax=ax)
sns.heatmap(data_sorted,linecolor='black',xticklabels=True,yticklabels=True)
ax.set_title('Sample plot')
ax.set_xticks(minor_ticks, minor=True)
fig = ax.get_figure()
fig.savefig('heatmap.jpg')
This is the image that I get.
thanks.
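One possible direction, sketched with plain matplotlib rather than seaborn: draw the grid with a grayscale colormap and set the tick positions explicitly. The 128x128 random `grid` is an assumption standing in for the pivoted table, and the Agg backend is used only so the snippet runs without a display. seaborn's heatmap also accepts a cmap argument, so cmap="gray" should carry over there.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Placeholder for the pivoted dst x src table
rng = np.random.default_rng(0)
grid = rng.random((128, 128))

# The only tick positions that should be shown on either axis
ticks = [0, 25, 50, 100, 127]

fig, ax = plt.subplots()
# cmap="gray" maps low values to black and high values to white
image = ax.imshow(grid, cmap="gray", origin="lower")
ax.set_xticks(ticks)
ax.set_yticks(ticks)
fig.colorbar(image, ax=ax)
ax.set_title("Sample plot")
```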

matplotlib: histogram is not displaying

I am trying to draw histogram but nothing appears in the Figure Window.
My code is below:
import numpy as np
import matplotlib.pyplot as plt
values = [1000000, 1525097, 2050194, 1095638, 1620736, 2145833, 1191277, 1716375, 1286916, 1382555]
plt.hist(values, 10, histtype = 'bar', facecolor = 'blue')
plt.ylabel("Values")
plt.xlabel("Bin Number")
plt.title("Histogram")
plt.axis([0,11,0,220000])
plt.show()
This is the output:
I am trying to achieve this plot
Any help would be much appreciated...
You are misunderstanding what a histogram is. The histogram that can be produced from the given data is shown below.
A histogram basically counts how many given values fall within a given range.
You have passed incorrect arguments to the axis() function. The ending value should be 2200000; you missed a single zero. You have also swapped the arguments: the limits of the x axis come first, then the limits of the y axis. This is the modified code:
import numpy as np
import matplotlib.pyplot as plt
values = [1000000, 1525097, 2050194, 1095638, 1620736, 2145833, 1191277, 1716375, 1286916, 1382555]
plt.hist(values, 10, histtype = 'bar', facecolor = 'blue')
plt.ylabel("Values")
plt.xlabel("Bin Number")
plt.title("Histogram")
plt.axis([0,2200000,0,11])
plt.show()
This is the histogram generated:
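To see what the histogram is actually counting, np.histogram can be run on the same values without plotting. This is just an illustrative check, not part of the original answer: it shows that the bar heights are small counts (which is why a y limit of 11 is enough), while the x axis spans the values themselves.

```python
import numpy as np

values = [1000000, 1525097, 2050194, 1095638, 1620736,
          2145833, 1191277, 1716375, 1286916, 1382555]

# Ten equal-width bins between the smallest and largest value
counts, edges = np.histogram(values, bins=10)

print(counts.tolist())      # how many values fall in each bin
print(edges[0], edges[-1])  # the x range covered by the bins
```

Each count is the height of one bar, and the counts always add up to the number of values.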
I finally achieved it...
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
values = [1000000, 1525097, 2050194, 1095638, 1620736, 2145833, 1191277, 1716375, 1286916, 1382555]
strategy = [1,2,3,4,5,6,7,8,9,10]
value = np.array(values)
strategies = np.array(strategy)
plt.bar(strategy, values, .8)
plt.ylabel("Values")
plt.xlabel("Bin Number")
plt.title("Histogram")
plt.axis([1,11,0,2200000])
plt.show()
Output:

Labelling a matplotlib histogram bin with an arrow

I have a histogram plot which could be replicated with the MWE below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
pd.Series(np.random.normal(0, 100, 1000)).plot(kind='hist', bins=50)
Which creates a plot like this:
How would I then go about labelling the bin with an arrow for a given integer?
For example see below, where an arrow labels the bin containing the integer 300.
EDIT: I should add ideally the y coordinates of the arrow should be set automatically by the height of the bar it is labelling - if possible!
You can use annotate to add an arrow:
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import numpy as np
fig, ax = plt.subplots()
series = pd.Series(np.random.normal(0, 100, 1000))
series.plot(kind='hist', bins=50, ax=ax)
ax.annotate("",
xy=(300, 5), xycoords='data',
xytext=(300, 20), textcoords='data',
arrowprops=dict(arrowstyle="->",
connectionstyle="arc3"),
)
In this example, I added an arrow that goes from coordinates (300, 20) to (300, 5).
In order to automatically scale your arrow to the value in the bin, you can use matplotlib hist to plot the histogram and get the values back and then use numpy where to find which bin corresponds to the desired position.
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import numpy as np
nbins = 50
labeled_bin = 200
fig, ax = plt.subplots()
series = pd.Series(np.random.normal(0, 100, 1000))
## plot the histogram and get back the bin counts and the bin edges
ybins, xbins, _ = ax.hist(series, bins=nbins)
## find the bin that contains the position you want to label
ind_bin = np.where(xbins >= labeled_bin)[0]
if len(ind_bin) > 0 and ind_bin[0] > 0:
    ## get position and value of the bin
    x_bin = xbins[ind_bin[0]-1]/2. + xbins[ind_bin[0]]/2.
    y_bin = ybins[ind_bin[0]-1]
    ## add the arrow
    ax.annotate("",
                xy=(x_bin, y_bin + 5), xycoords='data',
                xytext=(x_bin, y_bin + 20), textcoords='data',
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3"),
                )
else:
    print("Labeled bin is outside range")
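The bin lookup can also be written with np.searchsorted on the bin edges. The sketch below is an alternative under assumptions, not the answer's method: `find_bin` is a hypothetical helper, and np.histogram is used in place of ax.hist so it runs without a figure (ax.hist returns the same counts and edges).

```python
import numpy as np

def find_bin(edges, x):
    """Return the index of the bin containing x, or None if x is out of range."""
    if x < edges[0] or x > edges[-1]:
        return None
    # Index of the last edge that is <= x
    idx = int(np.searchsorted(edges, x, side="right")) - 1
    # The last bin is closed on the right, so clamp x == edges[-1] into it
    return min(idx, len(edges) - 2)

rng = np.random.default_rng(0)
data = rng.normal(0, 100, 1000)
counts, edges = np.histogram(data, bins=50)

labeled_value = 200
idx = find_bin(edges, labeled_value)
if idx is not None:
    x_bin = (edges[idx] + edges[idx + 1]) / 2  # centre of the bar
    y_bin = counts[idx]                        # height of the bar
    print(x_bin, y_bin)
else:
    print("Labeled value is outside the histogram range")
```

The resulting `x_bin` and `y_bin` can be fed to ax.annotate in the same way as above.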
@Julien Spronck showed the best way, I think. Alternatively, you can also use arrow; the example code can be found below. The y-coordinate is determined automatically by counting how many elements fall near the chosen position (within a tolerance which you can define yourself). You can play with the parameters (length of the arrow head, length of the arrow). Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
mySer = pd.Series(np.random.normal(0, 100, 1000))
mySer.plot(kind='hist', bins=50)
# that is where you want to add the arrow
ind = 200
# determine how many elements you have in the bin (with a certain tolerance)
n = len(mySer[(mySer > ind*0.95) & (mySer < ind*1.05)])
# define length of the arrow
lenArrow = 10
lenHead = 2
wiArrow = 5
plt.arrow(ind, n+lenArrow+lenHead, 0, -lenArrow, head_width=wiArrow+3, head_length=lenHead, width=wiArrow, fc='k', ec='k')
plt.show()
This gives you the following output (for 200 instead of 300 as in your example):

Python-Matplotlib boxplot. How to show percentiles 0,10,25,50,75,90 and 100?

I would like to plot an EPSgram (see below) using Python and Matplotlib.
The boxplot function only plots quartiles (0, 25, 50, 75, 100). So, how can I add two more boxes?
I put together a sample, if you're still curious. It uses scipy.stats.scoreatpercentile, but you may be getting those numbers from elsewhere:
from random import random
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import scoreatpercentile
x = np.array([random() for _ in range(100)])
# percentiles of interest
perc = [min(x), scoreatpercentile(x,10), scoreatpercentile(x,25),
scoreatpercentile(x,50), scoreatpercentile(x,75),
scoreatpercentile(x,90), max(x)]
midpoint = 0 # time-series time
fig = plt.figure()
ax = fig.add_subplot(111)
# min/max
ax.broken_barh([(midpoint-.01,.02)], (perc[0], perc[1]-perc[0]))
ax.broken_barh([(midpoint-.01,.02)], (perc[5], perc[6]-perc[5]))
# 10/90
ax.broken_barh([(midpoint-.1,.2)], (perc[1], perc[2]-perc[1]))
ax.broken_barh([(midpoint-.1,.2)], (perc[4], perc[5]-perc[4]))
# 25/75
ax.broken_barh([(midpoint-.4,.8)], (perc[2], perc[3]-perc[2]))
ax.broken_barh([(midpoint-.4,.8)], (perc[3], perc[4]-perc[3]))
ax.set_ylim(-0.5,1.5)
ax.set_xlim(-10,10)
ax.set_yticks([0,0.5,1])
ax.grid(True)
plt.show()
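If SciPy is not available, the same seven values can be computed in one call with np.percentile. This is a small sketch with random placeholder data; the `segments` list is a hypothetical convenience that holds the (bottom, height) pairs in the form broken_barh expects for its y range.

```python
import numpy as np

# Placeholder sample, standing in for one time step of the series
rng = np.random.default_rng(0)
x = rng.random(100)

# Percentile levels of interest: min, 10, 25, 50, 75, 90, max
levels = [0, 10, 25, 50, 75, 90, 100]
perc = np.percentile(x, levels)

# (bottom, height) for each of the six stacked boxes
segments = [(low, high - low) for low, high in zip(perc[:-1], perc[1:])]
for level, value in zip(levels, perc):
    print(level, value)
```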
