How to scale histogram bar heights in matplotlib / seaborn? - python

My dataset is a numpy array of size (m, 1) and I need 2 plots: a) one of the whole data normalized (probability) b) one of a subset of the dataset keeping the same normalization as before.
The problem is that in case b) the normalization options provided by matplotlib and seaborn only "see" the subset so they cannot normalize based on the whole data.
Essentially what I want to do is:
bar_height = bar_count / m
Sample data:
array([[-0.00996642],
[ 0.00407526],
[ 0.00547561],
...,
[ 0.05205999],
[ 0.00224144],
[ 0.01201942]])

You can use np.histogram() to calculate the histogram and then draw the bars with plt.bar() :
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
m = 200
samples = np.random.rand(1000)
hist_values, bin_edges = np.histogram(samples)
plt.bar(x=bin_edges[:-1], height=hist_values / m, width=np.diff(bin_edges), align='edge')
plt.show()

Related

How to plot heatmap onto mplsoccer pitch?

Wondering how I can plot a seaborn plot onto a different matplotlib plot. Currently I have two plots (one a heatmap, the other a soccer pitch), but when I plot the heatmap onto the pitch, I get the results below. (Plotting the pitch onto the heatmap isn't pretty either.) Any ideas how to fix it?
Note: Plots don't need a colorbar and the grid structure isn't required either. Just care about the heatmap covering the entire space of the pitch. Thanks!
import pandas as pd
import numpy as np
from mplsoccer import Pitch
import seaborn as sns
nmf_shot_W = pd.read_csv('https://raw.githubusercontent.com/lucas-nelson-uiuc/datasets/main/nmf_show_W.csv').iloc[:, 1:]
nmf_shot_ThierryHenry = pd.read_csv('https://raw.githubusercontent.com/lucas-nelson-uiuc/datasets/main/nmf_show_Hth.csv')['Thierry Henry']
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
pitch_color='#22312b', line_color='#efefef')
dfdfdf = np.array(np.matmul(nmf_shot_W, nmf_shot_ThierryHenry)).reshape((24,25))
g_ax = sns.heatmap(dfdfdf)
pitch.draw(ax=g_ax)
Current output:
Desired output:
Use the built-in pitch.heatmap:
pitch.heatmap expects a stats dictionary of binned data, bin mesh, and bin centers:
stats (dict) – The keys are statistic (the calculated statistic), x_grid and y_grid (the bin's edges), and cx and cy (the bin centers).
In the mplsoccer heatmap demos, they construct this stats object using pitch.bin_statistic because they have raw data. However, you already have binned data ("calculated statistic"), so reconstruct the stats object manually by building the mesh and centers:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mplsoccer import Pitch
nmf_shot_W = pd.read_csv('71878281/nmf_show_W.csv', index_col=0)
nmf_shot_ThierryHenry = pd.read_csv('71878281/nmf_show_Hth.csv')['Thierry Henry']
statistic = np.dot(nmf_shot_W, nmf_shot_ThierryHenry.to_numpy()).reshape((24, 25))
# construct stats object from binned data, bin mesh, and bin centers
y, x = statistic.shape
x_grid = np.linspace(0, 120, x + 1)
y_grid = np.linspace(0, 80, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
# use pitch.draw and pitch.heatmap as per mplsoccer demo
pitch = Pitch(pitch_type='statsbomb', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(6.6, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='plasma')
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')

How to draw the Probability Density Function (PDF) plot in Python?

I'd like to ask how to draw the Probability Density Function (PDF) plot in Python.
This is my codes.
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
.
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df
I generated a data frame. Then, I tried to draw a PDF graph.
df["AGW"].sort_values()
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
I obtained above graph. What I did wrong? Could you let me how to draw the Probability Density Function (PDF) Plot which is also known as normal distribution graph.
Could you let me know which codes (or library) I need to use to draw the PDF graph?
Always many thanks!!
You just need to sort the values (not really check what's after edit)
pdf = stats.norm.pdf(df["AGW"].sort_values(), df_mean, df_std)
plt.plot(df["AGW"].sort_values(), pdf)
And it will work.
The line df["AGW"].sort_values() doesn't change df. Maybe you meant df.sort_values(by=['AGW'], inplace=True).
In that case the full code will be :
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df.sort_values(by=['AGW'], inplace=True)
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
Which gives :
Edit :
I think here we already have the distribution (x is normally distributed) so we dont need to generate the pdf of x. As the use of the pdf is for something like this :
mu = 50
variance = 3
sigma = math.sqrt(variance)
x = np.linspace(mu - 5*sigma, mu + 5*sigma, 1000)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
Here we dont need to generate the distribution from x points, we only need to plot the density of the distribution we already have .
So you might use this :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.normal(50, 3, 1000) #Generating Data
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source) #Converting to pandas DataFrame
df.plot(kind = 'density'); # or df["AGW"].plot(kind = 'density');
Which gives :
You might use other packages if you want, like seaborn :
import seaborn as sns
plt.figure(figsize = (5,5))
sns.kdeplot(df["AGW"] , bw = 0.5 , fill = True)
plt.show()
Or this :
import seaborn as sns
sns.set_style("whitegrid") # Setting style(Optional)
plt.figure(figsize = (10,5)) #Specify the size of figure
sns.distplot(x = df["AGW"] , bins = 10 , kde = True , color = 'teal'
, kde_kws=dict(linewidth = 4 , color = 'black')) #kde for normal distribution
plt.show()
Check this article for more.

How to build a histogram of numpy 2 dimensional array

I have a 256x256 matrix of values and I would like to plot a histogram of these values
If I am not mistaken, the histogram must be calculated in a vector of values, correct? so here is what I have tried:
from skimage.measure import compare_ssim
import numpy as np
import matplotlib.pyplot as plt
d = np.load("BB_Digital.npy")
n, bins, patches = plt.hist(x=d.ravel(), color='#0504aa', bins='auto', alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Blue channel Co-occurency matrix')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()
But then, I have a very strange result:
When I don't use the ravel function (use the 2D matrix) the following result is shown:
However, both histograms seem to be wrong, as I verified later:
>>> np.count_nonzero(d==0)
51227
>>> np.count_nonzero(d==1)
2529
>>> np.count_nonzero(d==2)
1275
>>> np.count_nonzero(d==3)
885
>>> np.count_nonzero(d==4)
619
>>> np.count_nonzero(d==5)
490
>>> np.count_nonzero(d==6)
403
>>> np.max(d)
12518
>>> np.min(d)
0
How can I build a correct histogram?
P.s: Here is the file if you could help me.
The data seems to be discrete. Setting explicit bin boundaries at the halves could show the frequency of each value. As there are very high but infrequent values, the following example cuts off at 50:
import numpy as np
from matplotlib import pyplot as plt
d = np.load("BB_Digital.npy")
plt.hist(d.ravel(), bins=np.arange(-0.5, 51), color='#0504aa', alpha=0.7, rwidth=0.85)
plt.yscale('log')
plt.margins(x=0.02)
plt.show()
Another visualization could show a pcolormesh where the colors use a logarithmic scale. As the values start at 0, adding 1 avoids minus infinity:
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np
d = np.load("BB_Digital.npy")
plt.pcolormesh(d + 1, norm=LogNorm(), cmap='inferno')
plt.colorbar()
plt.show()
Yet another visualization concentrates on the diagonal values:
plt.plot(np.diagonal(d), color='navy')
ind_max = np.argmax(np.diagonal(d))
plt.vlines(ind_max, 0, d[ind_max, ind_max], colors='crimson', ls=':')
plt.yscale('log')

Calculate the distance between all other points in a TSV file?

I have a TSV file filled with n data points and I want to calculate the distances between all of the points. I have something like this:
What I thought about doing was the .iloc feature
import pandas as pd
x = pd.read_csv('data.tsv', sep='\t')
print (x)
while True: xcord= (int)
I was thinking you could do where you add 1 to each point iteratively, but I don't know how to do that.
Solution using distance_matrix
You can proceed using scipy.spatial.distance_matrix.
Suppose your DataFrame is my_dataframe.
import pandas as pd
import scipy as sp
points = pd.DataFrame(my_dataframe, columns=["X", "Y", "Z"]).astype(float)
distance_matrix = sp.spatial.distance_matrix(points, points)
Visualising the result
We can use seabord.heatmap to visualise the obtained results:
from matplotlib import pyplot as plt
import seaborn as sns
labels = my_dataframe["points"]
plt.rcParams['figure.figsize'] = [10, 10]
plt.axis('scaled')
sns.heatmap(distance_matrix,
annot=True,
cbar = False,
fmt="0.2f",
cmap="YlGnBu",
xticklabels=labels,
yticklabels=labels)
plt.title("Distance matrix")
The result is:
A small textual example
We can create a small textual example with which we may help understand step by step inputs and outputs. Let's consider a DataFrame with just two points:
Generating an example dataframe
import pandas as pd
import numpy as np
a = np.random.uniform(100, size=(2, 3))
my_dataframe = pd.DataFrame(np.hstack([[["A"], ["B"]], a]), columns=["points", "X", "Y", "Z"])
The DataFrame we have generated looks like:
Splitting points and labels
We split the labels and the points:
points = pd.DataFrame(my_dataframe, columns=["X", "Y", "Z"]).astype(float)
labels = my_dataframe["points"]
So points looks like:
And labels looks like:
Calculating the distance matrix
Now we can proceed calculating the distance matrix executing scipy.spatial.distance_matrix:
distance_matrix = sp.spatial.distance_matrix(points, points)
The resulting matrix is:
array([[ 0. , 93.43955419],
[93.43955419, 0. ]])
Visualising the obtained matrix
Using the same code as above, we obtain:

Major Difference in 2D kernel Density Plots: Seaborn and R

I am trying to plot data using the 2D kernel density plot of Seaborn's jointplot function (using statsmodels' KDEMultivariate function to calculate a data-driven bandwidth). I've plotted a 2D kernel density in R using the same data and the result looks very good (using the 'ks' package), while the Seaborn plot looks very very different.
I am using the same exact data and the same exact bandwidth for each (taking the bandwidth given by KDEMultivariant and passing that to the R method).
Here is the input.csv data used: https://app.box.com/s/ot7d36t44wrr85pusp5657pc1w2kf5hj
Below are the code used in each and output images from each.
Python / Seaborn:
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde", stat_func=None, bw=[bw_ml_x.bw, bw_ml_y.bw])
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
sns.plt.show()
Img for Seaborn plot:
The bandwidth given by bw_ml_x.bw and bw_ml_y.bw is placed in a 2 x 2 R matrix H, where H[1,1] = bw_ml_x.bw, H[2,2] = bw_ml.y.bw, and other values set to zero.
R:
library(ks)
fhat <- kde(x=as.data.frame(data[1], data[2]), H=H)
plot(fhat, display="filled.contour2", cont=seq(10,90,by=10))
Img for R plot:
Looking at your Seaborn/Python plot, many of the points cluster along the (0,n) region and the (1,1) region of your space, just as the KDE of the R plot shows. This indicates that Seaborn and R are looking at the same data; we simply need to reformulate the call to the kde in Seaborn in order to visualize the KDE gradients.
If you modify your Python call to match the documentation for Kernel Density Estimation in Seaborn you'll get a proper 2d-kdf out of Python:
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde")
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
sns.plt.show()
This accords with the R plot (though the kernel estimators seem to be slightly different, which would account for the variation in gradients between the plots):

Categories