Histogram manipulation to remove unwanted data

How do I remove data from a histogram in python under a certain frequency count?
Say I have 10 bins, the first bin has a count of 4, the second has 2, the third has 1, fourth has 5, etc...
Now I want to get rid of the data that has a count of 2 or less. So the second bin would go to zero, as would the third.
Example:
import numpy as np
import matplotlib.pyplot as plt
gaussian_numbers = np.random.randn(1000)
plt.hist(gaussian_numbers, bins=12)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
fig = plt.gcf()
This gives a histogram of the data (image not shown). I want to get rid of the bins with a frequency below some threshold X (for example, X = 100).
Thank you.

Use np.histogram to create the histogram.
Then use np.where. Given a condition, it returns the indices where the condition holds, and you can use those to index your histogram.
import numpy as np
import matplotlib.pyplot as plt
gaussian_numbers = np.random.randn(1000)
# Get histogram
hist, bins = np.histogram(gaussian_numbers, bins=12)
# Threshold frequency
freq = 100
# Zero out low values
hist[np.where(hist <= freq)] = 0
# Plot
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.bar(center, hist, align='center', width=width)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
(Plot part inspired from here.)
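The answer above zeroes the bar heights but leaves the underlying samples untouched. If the goal is to drop the samples themselves, one possible approach is to look up each sample's bin with np.digitize and keep only samples from well-populated bins. This is a minimal sketch; the seed, the variable names, and the threshold of 100 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only to make the sketch repeatable
samples = rng.standard_normal(1000)

counts, edges = np.histogram(samples, bins=12)
threshold = 100                     # hypothetical cutoff, as in the question

# Bin index (0..11) of every sample; the 11 inner edges separate the 12 bins
bin_of_sample = np.digitize(samples, edges[1:-1])

# Keep only the samples whose bin holds more than `threshold` points
keep = counts[bin_of_sample] > threshold
filtered = samples[keep]

# Re-histogram on the same edges: the low-count bins are now empty
filtered_counts, _ = np.histogram(filtered, bins=edges)
print(counts)
print(filtered_counts)
```

Passing `filtered` to plt.hist with `bins=edges` should then draw the same picture as the bar plot above, while the filtered array stays available for further analysis.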

Related

How to build a histogram of numpy 2 dimensional array

I have a 256x256 matrix of values and I would like to plot a histogram of these values.
If I am not mistaken, the histogram must be calculated over a vector of values, correct? Here is what I have tried:
from skimage.measure import compare_ssim
import numpy as np
import matplotlib.pyplot as plt
d = np.load("BB_Digital.npy")
n, bins, patches = plt.hist(x=d.ravel(), color='#0504aa', bins='auto', alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Blue channel Co-occurency matrix')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()
But then, I have a very strange result:
When I don't use the ravel function (use the 2D matrix) the following result is shown:
However, both histograms seem to be wrong, as I verified later:
>>> np.count_nonzero(d==0)
51227
>>> np.count_nonzero(d==1)
2529
>>> np.count_nonzero(d==2)
1275
>>> np.count_nonzero(d==3)
885
>>> np.count_nonzero(d==4)
619
>>> np.count_nonzero(d==5)
490
>>> np.count_nonzero(d==6)
403
>>> np.max(d)
12518
>>> np.min(d)
0
How can I build a correct histogram?
P.s: Here is the file if you could help me.

The data seems to be discrete. Setting explicit bin boundaries at the halves could show the frequency of each value. As there are very high but infrequent values, the following example cuts off at 50:
import numpy as np
from matplotlib import pyplot as plt
d = np.load("BB_Digital.npy")
plt.hist(d.ravel(), bins=np.arange(-0.5, 51), color='#0504aa', alpha=0.7, rwidth=0.85)
plt.yscale('log')
plt.margins(x=0.02)
plt.show()
Another visualization could show a pcolormesh where the colors use a logarithmic scale. As the values start at 0, adding 1 avoids minus infinity:
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np
d = np.load("BB_Digital.npy")
plt.pcolormesh(d + 1, norm=LogNorm(), cmap='inferno')
plt.colorbar()
plt.show()
Yet another visualization concentrates on the diagonal values:
plt.plot(np.diagonal(d), color='navy')
ind_max = np.argmax(np.diagonal(d))
plt.vlines(ind_max, 0, d[ind_max, ind_max], colors='crimson', ls=':')
plt.yscale('log')
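To check numerically that bin edges at the halves give one bin per integer value, the histogram counts can be compared with a direct tally. The sketch below uses a synthetic Poisson matrix as a stand-in, because the original BB_Digital.npy file is not reproduced here; the seed and the distribution are assumptions.

```python
import numpy as np

# Synthetic stand-in for the 256x256 matrix of non-negative integers
rng = np.random.default_rng(0)
d = rng.poisson(0.5, size=(256, 256))

# Direct tally of each distinct value
values, counts = np.unique(d, return_counts=True)

# Histogram with edges at -0.5, 0.5, 1.5, ...: one bin per integer value
hist, edges = np.histogram(d.ravel(), bins=np.arange(-0.5, d.max() + 1))

for v, c in zip(values, counts):
    print(v, c, hist[v])   # the two counts agree for every value
```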

How to change what the axis of a plot is based on? (Python, Matplotlib)

I want to create a graph of 2 * height (which is the meter values in the index) versus the time squared (which are the decimal values in the columns). How can I go about doing this? (In matplotlib)
For clarity, I want the y-axis to be 2 * index values, and the x-axis to be the times squared from within the columns. I would like this to be a series of line graphs
It should end up looking something like this:

In your comment you say you use df1.plot() to draw the lines. df.plot() uses the dataframe index as the x values by default. You want the y-axis to be 2 * index values and the x-axis to be the squared times from the columns. That requires transforming the dataframe values, so I suggest using ax.plot() for more control.
Here is a program that uses numpy.linalg.lstsq, which solves a least-squares problem, to fit a line through the given points.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
TESTDATA = StringIO("""Height Trial:1 Trial:2 Trial:3 Trial:4 Trial:5 Trial:6 Trial:7
1.029 0.4667 0.4616 0.4569 0.4579 0.4653 0.4578 0.4484
1.095 0.4752 0.4773 0.4721 0.4738 0.4713 0.4745 0.4663
1.168 0.4836 0.4834 0.4873 0.4890 0.4890 0.4904 0.4902
1.315 0.5139 0.5117 0.5161 0.5108 0.5224 0.5129 0.5187
1.540 0.5644 0.5677 0.5804 0.5535 0.5636 0.5605 0.5609
1.807 0.6051 0.6124 0.6014 0.6035 0.5977 0.6012 0.6209
""")
df = pd.read_csv(TESTDATA, sep=r'\s+')  # whitespace-separated columns
df.set_index(['Height'], inplace=True)
fig, ax = plt.subplots()
for column in df:
    x = df[column]**2   # fall time squared
    y = df.index*2      # twice the height
    # Least-squares fit of y = k*x + b
    A = np.vstack([x, np.ones(len(x))]).T
    k, b = np.linalg.lstsq(A, y, rcond=None)[0]
    line = ax.plot(x, y, 'o')
    ax.plot(x, k*x+b, label=f'y={k:.5f}x+{b:.5f}', color=line[0].get_color(), linestyle='dashed')
plt.legend()
plt.xlabel('Fall time, squared (s²)')
plt.ylabel('Twice the height (m)')
plt.title('Measurement of Acceleration due to Gravity on Earth')
plt.show()
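For an object falling from rest, h = ½·g·t², so 2h = g·t² and the slope of each fitted line is an estimate of g. As a cross-check on the lstsq fit, the same line can be obtained with np.polyfit. This is a small sketch using only the Height and Trial:1 columns copied from the table above; the variable names are mine.

```python
import numpy as np

# Heights (m) and Trial:1 fall times (s) copied from the table above
height = np.array([1.029, 1.095, 1.168, 1.315, 1.540, 1.807])
time = np.array([0.4667, 0.4752, 0.4836, 0.5139, 0.5644, 0.6051])

x = time ** 2     # fall time squared
y = 2 * height    # twice the height

# Degree-1 least-squares fit: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Same fit through lstsq, as in the answer, for comparison
A = np.vstack([x, np.ones_like(x)]).T
k, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"polyfit: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"lstsq:   slope={k:.3f}, intercept={b:.3f}")
```

Both calls minimize the same squared error, so the two slopes should agree up to rounding.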

import matplotlib.pyplot as plt
# x_values and y_values: equal-length sequences (lists or arrays) of numbers
plt.plot(x_values, y_values)
plt.show()

import matplotlib.pyplot as plt
plt.plot(times_squared_variable, twice_height_variable, '--', color='navy')  # pick any valid color name
# Label axis and the plot
plt.xlabel('Name_x_axis')
plt.ylabel('Name_y_axis')
plt.title('Plot_name')
# Show the plot
plt.show()

How to get the full width at half maximum (FWHM) from kdeplot

I have used seaborn's kdeplot on some data.
import seaborn as sns
import numpy as np
sns.kdeplot(np.random.rand(100))
Is it possible to return the fwhm from the curve created?
And if not, is there another way to calculate it?

You can extract the generated kde curve from the ax. Then get the maximum y value and search the x positions nearest to the half max:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
ax = sns.kdeplot(np.random.rand(100))
kde_curve = ax.lines[0]
x = kde_curve.get_xdata()
y = kde_curve.get_ydata()
halfmax = y.max() / 2
maxpos = y.argmax()
leftpos = (np.abs(y[:maxpos] - halfmax)).argmin()
rightpos = (np.abs(y[maxpos:] - halfmax)).argmin() + maxpos
fullwidthathalfmax = x[rightpos] - x[leftpos]
ax.hlines(halfmax, x[leftpos], x[rightpos], color='crimson', ls=':')
ax.text(x[maxpos], halfmax, f'{fullwidthathalfmax:.3f}\n', color='crimson', ha='center', va='center')
ax.set_ylim(ymin=0)
plt.show()
Note that you can also calculate a kde curve from scipy.stats.gaussian_kde if you don't need the plotted version. In that case, the code could look like:
import numpy as np
from scipy.stats import gaussian_kde
data = np.random.rand(100)
kde = gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 1000)
y = kde(x)
halfmax = y.max() / 2
maxpos = y.argmax()
leftpos = (np.abs(y[:maxpos] - halfmax)).argmin()
rightpos = (np.abs(y[maxpos:] - halfmax)).argmin() + maxpos
fullwidthathalfmax = x[rightpos] - x[leftpos]
print(fullwidthathalfmax)

I don't believe there is a way to get the FWHM directly from the plot without writing the code to calculate it.
Take some example data:
import numpy as np
from scipy.stats import norm
arr_x = np.linspace(norm.ppf(0.00001), norm.ppf(0.99999), 10000)
arr_y = norm.pdf(arr_x)
Find the minimum and maximum of the curve and calculate the difference:
difference = max(arr_y) - min(arr_y)
Find the half maximum. The minimum of the curve is close to zero here, so half the difference is the half-maximum level:
HM = difference / 2
Find the data point nearest to HM:
nearest = (np.abs(arr_y - HM)).argmin()
Calculate the distance between arr_x[nearest] and the x position of the maximum to get the HWHM, then multiply by 2 to get the FWHM.
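The steps described above can be carried through to a number and compared with the known result for a Gaussian, FWHM = 2·sqrt(2·ln 2)·σ. The sketch below is one way to do that; it writes the standard normal density out with NumPy instead of calling scipy, and the grid limits of ±4.26 only approximate the ppf range used above.

```python
import numpy as np

# Standard normal density on a fine grid (sigma = 1)
x = np.linspace(-4.26, 4.26, 10000)
y = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

half_max = y.max() / 2
peak = y.argmax()

# Nearest grid point to the half maximum on each side of the peak
left = np.abs(y[:peak] - half_max).argmin()
right = np.abs(y[peak:] - half_max).argmin() + peak
fwhm = x[right] - x[left]

expected = 2 * np.sqrt(2 * np.log(2))   # analytic FWHM for sigma = 1
print(fwhm, expected)
```

The estimate can only be as accurate as the grid spacing, so a coarser grid gives a coarser FWHM.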

Labelling a matplotlib histogram bin with an arrow

I have a histogram plot which could be replicated with the MWE below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
pd.Series(np.random.normal(0, 100, 1000)).plot(kind='hist', bins=50)
Which creates a plot like this:
How would I then go about labelling the bin with an arrow for a given integer?
For example see below, where an arrow labels the bin containing the integer 300.
EDIT: I should add ideally the y coordinates of the arrow should be set automatically by the height of the bar it is labelling - if possible!

You can use annotate to add an arrow:
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import numpy as np
fig, ax = plt.subplots()
series = pd.Series(np.random.normal(0, 100, 1000))
series.plot(kind='hist', bins=50, ax=ax)
ax.annotate("",
            xy=(300, 5), xycoords='data',
            xytext=(300, 20), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )
In this example, I added an arrow that goes from coordinates (300, 20) to (300, 5).
To scale the arrow automatically to the height of the bin, plot the histogram with matplotlib's hist, which returns the bin counts and edges, and then use numpy's where to find the bin that contains the desired position.
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import numpy as np
nbins = 50
labeled_bin = 200
fig, ax = plt.subplots()
series = pd.Series(np.random.normal(0, 100, 1000))
## plot the histogram and return the bin position and values
ybins, xbins, _ = ax.hist(series, bins=nbins)
## find the bin that contains the position you want to label
ind_bin = np.where(xbins >= labeled_bin)[0]
if len(ind_bin) > 0 and ind_bin[0] > 0:
    ## get position (bin center) and value (count) of the bin
    x_bin = xbins[ind_bin[0]-1]/2. + xbins[ind_bin[0]]/2.
    y_bin = ybins[ind_bin[0]-1]
    ## add the arrow
    ax.annotate("",
                xy=(x_bin, y_bin + 5), xycoords='data',
                xytext=(x_bin, y_bin + 20), textcoords='data',
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3"),
                )
else:
    print("Labeled bin is outside range")
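The bin lookup can also be done with np.searchsorted on the bin edges, which avoids scanning the whole edge array with np.where. This is a sketch of that alternative, not the method used in the answer; the helper name bin_index is hypothetical, and it assumes the np.histogram convention that every bin is half-open except the last, which includes its right edge.

```python
import numpy as np

def bin_index(edges, value):
    """Index of the histogram bin containing value, or None if out of range.

    Hypothetical helper; `edges` is the sorted edge array from np.histogram.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    idx = np.searchsorted(edges, value, side="right") - 1
    # A value equal to the last edge belongs to the last bin
    return int(min(idx, len(edges) - 2))

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
data = rng.normal(0, 100, 1000)
counts, edges = np.histogram(data, bins=50)

idx = bin_index(edges, 200)
if idx is None:
    print("Labeled bin is outside range")
else:
    x_bin = (edges[idx] + edges[idx + 1]) / 2   # bin center, for the arrow x
    y_bin = counts[idx]                         # bar height, for the arrow y
    print(idx, x_bin, y_bin)
```

The resulting x_bin and y_bin can be passed to ax.annotate in the same way as in the code above.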

@Julien Spronck showed the best way, I think. Alternatively, you can also use arrow; the example code can be found below. The y-coordinate is determined automatically by counting how many elements fall near the chosen position (within a tolerance which you can define yourself). You can play with the parameters (length of arrow head, length of arrow). Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
mySer = pd.Series(np.random.normal(0, 100, 1000))
mySer.plot(kind='hist', bins=50)
# that is where you want to add the arrow
ind = 200
# determine how many elements you have in the bin (with a certain tolerance)
n = len(mySer[(mySer > ind*0.95) & (mySer < ind*1.05)])
# define length of the arrow
lenArrow = 10
lenHead = 2
wiArrow = 5
plt.arrow(ind, n+lenArrow+lenHead, 0, -lenArrow, head_width=wiArrow+3, head_length=lenHead, width=wiArrow, fc='k', ec='k')
plt.show()
This gives you the following output (for 200 instead of 300 as in your example):

Python-Matplotlib boxplot. How to show percentiles 0,10,25,50,75,90 and 100?

I would like to plot an EPSgram (see below) using Python and Matplotlib.
The boxplot function only plots quartiles (0, 25, 50, 75, 100). So, how can I add two more boxes?

I put together a sample, if you're still curious. It uses scipy.stats.scoreatpercentile, but you may be getting those numbers from elsewhere:
from random import random
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import scoreatpercentile
x = np.array([random() for _ in range(100)])
# percentiles of interest
perc = [min(x), scoreatpercentile(x,10), scoreatpercentile(x,25),
        scoreatpercentile(x,50), scoreatpercentile(x,75),
        scoreatpercentile(x,90), max(x)]
midpoint = 0 # time-series time
fig = plt.figure()
ax = fig.add_subplot(111)
# min/max
ax.broken_barh([(midpoint-.01,.02)], (perc[0], perc[1]-perc[0]))
ax.broken_barh([(midpoint-.01,.02)], (perc[5], perc[6]-perc[5]))
# 10/90
ax.broken_barh([(midpoint-.1,.2)], (perc[1], perc[2]-perc[1]))
ax.broken_barh([(midpoint-.1,.2)], (perc[4], perc[5]-perc[4]))
# 25/75
ax.broken_barh([(midpoint-.4,.8)], (perc[2], perc[3]-perc[2]))
ax.broken_barh([(midpoint-.4,.8)], (perc[3], perc[4]-perc[3]))
ax.set_ylim(-0.5,1.5)
ax.set_xlim(-10,10)
ax.set_yticks([0,0.5,1])
ax.grid(True)
plt.show()
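If the percentiles are not coming from elsewhere, np.percentile can compute all seven levels in a single call, with 0 and 100 giving the minimum and maximum. This is a minimal sketch of that substitution for the scoreatpercentile calls; the seed and the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
x = rng.random(100)

# Percentile levels of interest: min, 10, 25, median, 75, 90, max
levels = [0, 10, 25, 50, 75, 90, 100]
perc = np.percentile(x, levels)

for level, value in zip(levels, perc):
    print(f"{level:>3}th percentile: {value:.3f}")
```

The resulting perc array has the same order as the perc list above, so the broken_barh calls can index into it unchanged.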
