I'm applying a rotation matrix to a group of points with the aim of aligning them along the horizontal axis. In the code below, the xy points I want to adjust are recorded in x and y.
I'm hoping to transform the points using the angle between (X_Ref, Y_Ref) and (X_Fixed, Y_Fixed). I'm also hoping to translate the points so (X_Ref, Y_Ref) sits at (0, 0) once the rotation is completed.
The rotated points currently don't account for this. I'm not sure whether I should apply the reference-point translation before rotating or afterwards.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import pandas as pd
df = pd.DataFrame({
'Period' : ['1','1','1','1','2','2','2','2'],
'Label' : ['A','B','C','D','A','B','C','D'],
'x' : [2.0,3.0,3.0,2.0,2.0,3.0,3.0,1.0],
'y' : [2.0,3.0,-1.0,0.0,2.0,3.0,-1.0,1.0],
'X_Ref' : [1,1,1,1,2,2,2,2],
'Y_Ref' : [1,1,1,1,0,0,0,0],
'X_Fixed' : [0,0,0,0,0,0,0,0],
'Y_Fixed' : [0,0,0,0,2,2,2,2],
})
np.random.seed(1)
xy = df[['x','y']].values
Ref = df[['X_Ref','Y_Ref']].values
Fix = df[['X_Fixed','Y_Fixed']].values
fig, ax = plt.subplots()
plot_kws = {'alpha': 0.75,
            'edgecolor': 'white',
            'linewidths': 0.75}
ax.scatter(xy[:, 0], xy[:, 1], **plot_kws)
ax.scatter(Ref[:, 0], Ref[:, 1], marker = 'x')
ax.scatter(Fix[:, 0], Fix[:, 1], marker = '+')
pca = PCA(2)
# Fit the PCA object, but do not transform the data
pca.fit(xy)
# pca.components_ : array, shape (n_components, n_features)
# cos theta
ct = pca.components_[0, 0]
# sin theta
st = pca.components_[0, 1]
# One possible value of theta that lies in [0, pi]
t = np.arccos(ct)
# If t is in quadrant 1, rotate CLOCKwise by t
if ct > 0 and st > 0:
    t *= -1
# If t is in Q2, rotate COUNTERclockwise by the complement of theta
elif ct < 0 and st > 0:
    t = np.pi - t
# If t is in Q3, rotate CLOCKwise by the complement of theta
elif ct < 0 and st < 0:
    t = -(np.pi - t)
# If t is in Q4, rotate COUNTERclockwise by theta, i.e., do nothing
elif ct > 0 and st < 0:
    pass
# Manually build the ccw rotation matrix
rotmat = np.array([[np.cos(t), -np.sin(t)],
                   [np.sin(t),  np.cos(t)]])
# Apply rotation to each row of 'xy'. The output (m2)
# will be the rotated input coordinates.
m2 = (rotmat @ xy.T).T
# Center the rotated point cloud at (0, 0)
m2 -= m2.mean(axis=0)
[Figures omitted: initial and intended point distributions for periods 1 and 2.]
Your question is unclear, since the "intended rotation" mentioned in the question is already achieved if you plot m2, which has already been calculated:
fig, ax = plt.subplots()
ax.scatter(m2[:, 0], m2[:, 1], **plot_kws)
Output:
But you have also mentioned the following in the question:
The rotation angle is determined by the angle between X_Ref,Y_Ref and X_Fixed,Y_Fixed.
This is a totally different scenario. You can calculate the angle between two points by calculating the arctan between them, without having to use PCA at all. This can be done using numpy.arctan as follows:
t = np.arctan((Y_Fixed - Y_Ref) / (X_Fixed - X_Ref))
Here (X_Fixed, Y_Fixed) and (X_Ref, Y_Ref) are being assumed as two points.
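As a side note (my addition, not part of the original answer), np.arctan2 takes the numerator and denominator separately, so it handles the quadrant and a zero x-difference automatically; a minimal sketch:
import numpy as np

# Hypothetical reference and fixed points, purely for illustration
X_Ref, Y_Ref = 1.0, 1.0
X_Fixed, Y_Fixed = 0.0, 0.0

# arctan2(dy, dx) avoids manual parenthesising and division-by-zero issues
t = np.arctan2(Y_Fixed - Y_Ref, X_Fixed - X_Ref)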
For each row in your dataframe, you can then calculate the x and y values after rotation with respect to the angle between (X_Fixed, Y_Fixed) and (X_Ref, Y_Ref) in that particular row. This can be done using the following code snippet;
def rotate_points(row):
    t = np.arctan((row['Y_Fixed'] - row['Y_Ref']) / (row['X_Fixed'] - row['X_Ref']))
    rotmat = np.array([[np.cos(t), -np.sin(t)],
                       [np.sin(t),  np.cos(t)]])
    xy = row[['x','y']].values
    rotated = rotmat @ xy
    return rotated
df['rotated_x'] = df.apply(lambda row: rotate_points(row)[0], axis = 1)
df['rotated_y'] = df.apply(lambda row: rotate_points(row)[1], axis = 1)
Your dataframe would now look like this with the two new columns added to the right:
+----+----------+---------+-----+-----+---------+---------+-----------+-----------+-------------+-------------+-------------+
| | Period | Label | x | y | X_Ref | Y_Ref | X_Fixed | Y_Fixed | Direction | rotated_x | rotated_y |
|----+----------+---------+-----+-----+---------+---------+-----------+-----------+-------------+-------------+-------------|
| 0 | 1 | A | -1 | 1 | 1 | 3 | -2 | 0 | Left | -1.34164 | 0.447214 |
| 1 | 1 | B | 0 | 4 | 1 | 3 | -2 | 0 | Left | -1.78885 | 3.57771 |
| 2 | 1 | C | 2 | 2 | 1 | 3 | -2 | 0 | Left | 0.894427 | 2.68328 |
| 3 | 1 | D | 2 | 3 | 1 | 3 | -2 | 0 | Left | 0.447214 | 3.57771 |
| 4 | 2 | E | 2 | 4 | 1 | 3 | -2 | 0 | Right | 0 | 4.47214 |
| 5 | 2 | F | 1 | 4 | 1 | 3 | -2 | 0 | Right | -0.894427 | 4.02492 |
| 6 | 2 | G | 3 | 5 | 1 | 3 | -2 | 0 | Right | 0.447214 | 5.81378 |
| 7 | 2 | H | 0 | 2 | 1 | 3 | -2 | 0 | Right | -0.894427 | 1.78885 |
+----+----------+---------+-----+-----+---------+---------+-----------+-----------+-------------+-------------+-------------+
Now you have your rotated x and y points as desired.
UPDATE:
As per the amended question, you can add the reference point at (0,0) in your plot as follows:
fig, ax = plt.subplots()
ax.scatter(m2[:, 0], m2[:, 1], **plot_kws)
ax.scatter(list(np.repeat(0, len(Ref))), list(np.repeat(0, len(Ref))) , **plot_kws)
plt.show()
Output:
There is no need for any PCA if I understood what you are trying to achieve. I'd use complex numbers, which seems more straightforward:
EDIT
There was a small mistake in the order of the translation steps previously. This edit corrects it and uses your new dataset, including ref/fixed points that change between periods.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
'Period' : ['1','1','1','1','2','2','2','2'],
'Label' : ['A','B','C','D','A','B','C','D'],
'x' : [2.0,3.0,3.0,2.0,2.0,3.0,3.0,1.0],
'y' : [2.0,3.0,-1.0,0.0,2.0,3.0,-1.0,1.0],
'X_Ref' : [1,1,1,1,2,2,2,2],
'Y_Ref' : [1,1,1,1,0,0,0,0],
'X_Fixed' : [0,0,0,0,0,0,0,0],
'Y_Fixed' : [0,0,0,0,2,2,2,2],
})
First, transform fixed/ref points to complex numbers:
for f in ['Ref', 'Fixed']:
    df[f] = df['X_'+f] + 1j*df['Y_'+f]
    df.drop(['X_'+f, 'Y_'+f], axis=1, inplace=True)
Compute the rotation (note that it is the opposite angle of what you stated in your question, to match your expected results):
df['angle'] = - np.angle(df['Ref'] - df['Fixed'])
Compute the rotation for every point (ref/fixed included):
df['rotated'] = (df['x'] + 1j*df["y"]) * np.exp(1j*df['angle'])
for f in ['Ref', 'Fixed']:
    df[f+'_Rotated'] = df[f] * np.exp(1j*df['angle'])
Center your dataset around the "ref" point:
df['translation'] = - df['Ref_Rotated']
df['NewPoint'] = df['rotated'] + df['translation']
for f in ['Ref', 'Fixed']:
    df[f+'_Transformed'] = df[f+'_Rotated'] + df['translation']
Revert to Cartesian coordinates:
df['x2'] = np.real(df['NewPoint'])
df['y2'] = np.imag(df['NewPoint'])
for f in ['Ref', 'Fixed']:
    df['NewX_'+f] = np.real(df[f+'_Transformed'])
    df['NewY_'+f] = np.imag(df[f+'_Transformed'])
And then plot the output for any period you like:
output = df[['Period', 'Label', 'x2', 'y2', 'NewX_Ref', 'NewY_Ref', 'NewX_Fixed', 'NewY_Fixed']]
output.set_index('Period', inplace=True)
fig, ax = plt.subplots()
plot_kws = {'alpha': 0.75,
            'edgecolor': 'white',
            'linewidths': 0.75}
plt.xlim(-5,5)
plt.ylim(-5,5)
period = '1'
ax.scatter(output.loc[period, 'NewX_Ref'], output.loc[period, 'NewY_Ref'])
ax.scatter(output.loc[period, 'NewX_Fixed'], output.loc[period, 'NewY_Fixed'])
ax.scatter(output.loc[period, 'x2'], output.loc[period, 'y2'], **plot_kws, marker = '+')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Result for period 1:
Result for period 2:
I am trying to come up with a way to determine the "best fit" between the following distributions:
Gaussian, Multinomial, Bernoulli.
I have a large pandas df, where each column can be thought of as a distribution of numbers. What I am trying to do, is for each column, determine the distribution of the above list as the best fit.
I noticed this question, which asks something similar, but those all look like discrete distribution tests, not continuous. I know scipy has metrics for a lot of these, but I can't determine how to properly set up the inputs. My thought would be:
For each column, save the data in a temporary np array
Generate Gaussian, Multinomial, and Bernoulli distributions, perform an SSE test to determine the distribution that gives the "best fit", and move on to the next column.
An example dataset (arbitrary, my dataset is 29888 x 73231) could be:
| could | couldnt | coupl | cours | death | develop | dialogu | differ | direct | director | done |
|:-----:|:-------:|:-----:|:-----:|:-----:|:-------:|:-------:|:------:|:------:|:--------:|:----:|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 2 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 2 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 0 | 3 |
| 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 2 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
I have some basic code now, which was edited from this question, which attempts this:
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0
    # Distributions to check
    DISTRIBUTIONS = [
        st.norm, st.multinomial, st.bernoulli
    ]
    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf
    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:
        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                # fit dist to data
                params = distribution.fit(data)
                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                # if axis passed in, add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass
                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse
        except Exception:
            print("Error on: {}".format(distribution))
            pass
        #print("Distribution: {} | SSE: {}".format(distribution, sse))
    return best_distribution.name, best_sse

for col in df.columns:
    nm, pm = best_fit_distribution(df[col])
    print(nm)
    print(pm)
However, I get:
Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
norm
(4.4, 7.002856560004639)
My expected output would be something like, for each column:
Gaussian SSE: <val> | Multinomial SSE: <val> | Bernoulli SSE: <val>
UPDATE
Catching the error yields:
Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
'multinomial_gen' object has no attribute 'fit'
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
'bernoulli_gen' object has no attribute 'fit'
Why am I getting errors? I think it is because multinomial and bernoulli do not have fit methods. How can I make a fit method and integrate that to get the SSE? The target output of this function or program would be, for the Gaussian, Multinomial, and Bernoulli distributions, the average SSE per column in the df for each distribution type (to try and determine best fit by column).
UPDATE 06/15:
I have added a bounty.
UPDATE 06/16:
The larger intention, as this is a piece of a larger application, is to discern, over the course of a very large dataframe, what the most common distribution of tfidf values is. Then, based on that, apply a Naive Bayes classifier from sklearn that matches that most-common distribution. scikit-learn.org/stable/modules/naive_bayes.html contains details on the different classifiers. Therefore, what I need to know, is which distribution is the best fit across my entire dataframe, which I assumed to mean, which was the most common amongst the distribution of tfidf values in my words. From there, I will know which type of classifier to apply to my dataframe. In the example above, there is a column not shown called class which is a positive or negative classification. I am not looking for input to this, I am simply following the instructions I have been given by my lead.
I summarize the question as: given a list of nonnegative integers, can we fit a probability distribution, in particular a Gaussian, multinomial, and Bernoulli, and compare the quality of the fit?
For discrete quantities, the correct term is probability mass function: P(k) is the probability that a number picked is exactly equal to the integer value k. A Bernoulli distribution can be parametrized by a p parameter: Be(k, p) where 0 <= p <= 1 and k can only take the values 0 or 1. It is a special case of the binomial distribution B(k, p, n), which has parameters 0 <= p <= 1 and integer n >= 1. (See the linked Wikipedia article for an explanation of the meaning of p and n.) It is related to the Bernoulli distribution as Be(k, p) = B(k, p, n=1).
The trinomial distribution T(k1, k2, p1, p2, n) is parametrized by p1, p2, n and describes the probability of pairs (k1, k2). For example, the set {(0,0), (0,1), (1,0), (0,1), (0,0)} could be pulled from a trinomial distribution. Binomial and trinomial distributions are special cases of multinomial distributions; if you have data occurring as quintuples such as (1, 5, 5, 2, 7), they could be pulled from a multinomial (hexanomial?) distribution M6(k1, ..., k5, p1, ..., p5, n).
The question specifically asks for the probability distribution of the numbers of a single column, so the only multinomial distribution that fits here is the binomial one, unless you specify that the sequence [0, 1, 5, 2, 3, 1] should be interpreted as [(0, 1), (5, 2), (3, 1)] or as [(0, 1, 5), (2, 3, 1)]. But the question does not specify that numbers can be accumulated in pairs or triplets.
Therefore, as far as discrete distributions go, the PMF for one list of integers is of the form P(k) and can only be fitted to the binomial distribution, with suitable n and p values. If the best fit is obtained for n=1, then it is a Bernoulli distribution.
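To make the relation Be(k, p) = B(k, p, n=1) concrete, here is a minimal check with scipy (my addition, not part of the original answer):
import numpy as np
from scipy.stats import bernoulli, binom

p = 0.3
ks = np.array([0, 1])

# A binomial with n=1 has exactly the Bernoulli PMF
print(bernoulli.pmf(ks, p))     # [0.7 0.3]
print(binom.pmf(ks, 1, p))      # [0.7 0.3]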
The Gaussian distribution is a continuous distribution G(x, mu, sigma), where mu (mean) and sigma (standard deviation) are parameters. It tells you that the probability of finding x0-a/2 < x < x0+a/2 is equal to G(x0, mu, sigma)*a, for a << sigma. Strictly speaking, the Gaussian distribution does not apply to discrete variables, since the Gaussian distribution has nonzero probabilities for non-integer x values, whereas the probability of pulling a non-integer out of a distribution of integers is zero. Typically, you would use a Gaussian distribution as an approximation for a binomial distribution, where you set a=1 and set P(k) = G(x=k, mu, sigma)*a.
For sufficiently large n, a binomial distribution and a Gaussian will appear similar according to
B(k, p, n) = G(x=k, mu=p*n, sigma=sqrt(p*(1-p)*n)).
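As a quick numerical illustration of that approximation (my own sketch, not from the original answer), compare the binomial PMF with the Gaussian density evaluated at integer k for a moderately large n:
import numpy as np
from scipy.stats import binom, norm

n, p = 50, 0.4
mu, sigma = n * p, np.sqrt(p * (1 - p) * n)

# The two columns should agree to within a few percent for n this large
for k in range(15, 26):
    print(f"k={k:2d}  binomial={binom.pmf(k, n, p):.4f}  gaussian={norm.pdf(k, mu, sigma):.4f}")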
If you wish to fit a Gaussian distribution, you can use the standard scipy function scipy.stats.norm.fit. Such fit functions are not offered for the discrete distributions such as the binomial. You can use the function scipy.optimize.curve_fit to fit non-integer parameters such as the p parameter of the binomial distribution. In order to find the optimal integer n value, you need to vary n, fit p for each n, and pick the n, p combination with the best fit.
In the implementation below, I estimate n and p from the relation with the mean and sigma value above and search around that value. The search could be made smarter, but for the small test datasets that I used, it's fast enough. Moreover, it helps illustrate a point; more on that later. I have provided a function fit_binom, which takes a histogram with actual counts, and a function fit_binom_samples, which can take a column of numbers from your dataframe.
"""Binomial fit routines.
Author: Han-Kwang Nienhuys (2020)
Copying: CC-BY-SA, CC-BY, BSD, GPL, LGPL.
https://stackoverflow.com/a/62365555/6228891
"""
import numpy as np
from scipy.stats import binom, poisson
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
class BinomPMF:
"""Wrapper so that integer parameters don't occur as function arguments."""
def __init__(self, n):
self.n = n
def __call__(self, ks, p):
return binom(self.n, p).pmf(ks)
def fit_binom(hist, plot=True, weighted=True, f=1.5, verbose=False):
"""Fit histogram to binomial distribution.
Parameters:
- hist: histogram as int array with counts, array index as bin.
- plot: whether to plot
- weighted: whether to fit assuming Poisson statistics in each bin.
(Recommended: True).
- f: try to fit n in range n0/f to n0*f where n0 is the initial estimate.
Must be >= 1.
- verbose: whether to print messages.
Return:
- histf: fitted histogram as int array, same length as hist.
- n: binomial n value (int)
- p: binomial p value (float)
- rchi2: reduced chi-squared. This number should be around 1.
Large values indicate a bad fit; small values indicate
"too good to be true" data.
"""
hist = np.array(hist, dtype=int).ravel() # force 1D int array
pmf = hist/hist.sum() # probability mass function
nk = len(hist)
if weighted:
sigmas = np.sqrt(hist+0.25)/hist.sum()
else:
sigmas = np.full(nk, 1/np.sqrt(nk*hist.sum()))
ks = np.arange(nk)
mean = (pmf*ks).sum()
variance = ((ks-mean)**2 * pmf).sum()
# initial estimate for p and search range for n
nest = max(1, int(mean**2 /(mean-variance) + 0.5))
nmin = max(1, int(np.floor(nest/f)))
nmax = max(nmin, int(np.ceil(nest*f)))
nvals = np.arange(nmin, nmax+1)
num_n = nmax-nmin+1
verbose and print(f'Initial estimate: n={nest}, p={mean/nest:.3g}')
# store fit results for each n
pvals, sses = np.zeros(num_n), np.zeros(num_n)
for n in nvals:
# fit and plot
p_guess = max(0, min(1, mean/n))
fitparams, _ = curve_fit(
BinomPMF(n), ks, pmf, p0=p_guess, bounds=[0., 1.],
sigma=sigmas, absolute_sigma=True)
p = fitparams[0]
sse = (((pmf - BinomPMF(n)(ks, p))/sigmas)**2).sum()
verbose and print(f' Trying n={n} -> p={p:.3g} (initial: {p_guess:.3g}),'
f' sse={sse:.3g}')
pvals[n-nmin] = p
sses[n-nmin] = sse
n_fit = np.argmin(sses) + nmin
p_fit = pvals[n_fit-nmin]
sse = sses[n_fit-nmin]
chi2r = sse/(nk-2) if nk > 2 else np.nan
if verbose:
print(f' Found n={n_fit}, p={p_fit:.6g} sse={sse:.3g},'
f' reduced chi^2={chi2r:.3g}')
histf = BinomPMF(n_fit)(ks, p_fit) * hist.sum()
if plot:
fig, ax = plt.subplots(2, 1, figsize=(4,4))
ax[0].plot(ks, hist, 'ro', label='input data')
ax[0].step(ks, histf, 'b', where='mid', label=f'fit: n={n_fit}, p={p_fit:.3f}')
ax[0].set_xlabel('k')
ax[0].axhline(0, color='k')
ax[0].set_ylabel('Counts')
ax[0].legend()
ax[1].set_xlabel('n')
ax[1].set_ylabel('sse')
plotfunc = ax[1].semilogy if sses.max()>20*sses.min()>0 else ax[1].plot
plotfunc(nvals, sses, 'k-', label='SSE over n scan')
ax[1].legend()
fig.show()
return histf, n_fit, p_fit, chi2r
def fit_binom_samples(samples, f=1.5, weighted=True, verbose=False):
"""Convert array of samples (nonnegative ints) to histogram and fit.
See fit_binom() for more explanation.
"""
samples = np.array(samples, dtype=int)
kmax = samples.max()
hist, _ = np.histogram(samples, np.arange(kmax+2)-0.5)
return fit_binom(hist, f=f, weighted=weighted, verbose=verbose)
def test_case(n, p, nsamp, weighted=True, f=1.5):
"""Run test with n, p values; nsamp=number of samples."""
print(f'TEST CASE: n={n}, p={p}, nsamp={nsamp}')
ks = np.arange(n+1) # bins
pmf = BinomPMF(n)(ks, p)
hist = poisson.rvs(pmf*nsamp)
fit_binom(hist, weighted=weighted, f=f, verbose=True)
if __name__ == '__main__':
plt.close('all')
np.random.seed(1)
weighted = True
test_case(10, 0.2, 500, f=2.5, weighted=weighted)
test_case(10, 0.3, 500, weighted=weighted)
test_case(10, 0.8, 10000, weighted)
test_case(1, 0.3, 100, weighted) # equivalent to Bernoulli distribution
fit_binom_samples(binom(15, 0.5).rvs(100), weighted=weighted)
In principle, the best fit will be obtained if you set weighted=True. However, the question asks for the minimum sum of squared errors (SSE) as the metric; in that case, set weighted=False.
It turns out that it is difficult to fit a binomial distribution unless you have a lot of data. Here are tests with realistic (random-generated) data for n, p combinations (10, 0.2), (10, 0.3), (10, 0.8), and (1, 0.3), for various numbers of samples. The plots also show how the weighted SSE changes with n.
Typically, with 500 samples, you get a fit that looks OK by eye, but which does not recover the actual n and p values correctly, although the product n*p is quite accurate. In those cases, the SSE curve has a broad minimum, which is a giveaway that there are several reasonable fits.
The code above can be adapted for different discrete distributions. In that case, you need to figure out reasonable initial estimates for the fit parameters. For example, for a Poisson distribution the mean is the only parameter (use the reduced chi2 or SSE to judge whether it's a good fit); a minimal sketch of such a Poisson fit follows below.
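A minimal Poisson analogue of the approach above (my own sketch, not part of the original answer): estimate the rate from the sample mean and compute the SSE between the empirical and fitted PMFs.
import numpy as np
from scipy.stats import poisson

def fit_poisson_samples(samples):
    """Fit a Poisson distribution to nonnegative integer samples;
    return (mu, sse) comparing empirical and fitted PMFs."""
    samples = np.asarray(samples, dtype=int)
    ks = np.arange(samples.max() + 1)
    pmf_emp = np.bincount(samples, minlength=len(ks)) / len(samples)  # empirical PMF
    mu = samples.mean()                 # maximum-likelihood estimate of the rate
    sse = ((pmf_emp - poisson.pmf(ks, mu)) ** 2).sum()
    return mu, sse

# Example on synthetic data
print(fit_poisson_samples(poisson.rvs(2.5, size=500)))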
If you want to fit a combination of m input columns to an (m+1)-dimensional multinomial, you can do a binomial fit on each input column and store the fit results in arrays nn and pp (each an array with shape (m,)). Transform these into an initial estimate for a multinomial:
n_est = int(nn.mean()+0.5)
pp_est = pp*nn/n_est
pp_est = np.append(pp_est, 1-pp_est.sum())
If the individual values in the nn array vary a lot, or if the last element of pp_est is negative, then it's probably not a multinomial.
You want to compare the residuals of multiple models; be aware that a model that has more fit parameters will tend to produce lower residuals, but this does not necessarily mean that the model is better.
Note: this answer underwent a large revision.
The distfit library can help you to determine the best fitting distribution. If you set method to discrete, a similar approach is followed as described by Han-Kwang Nienhuys.
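A minimal usage sketch (assuming a recent distfit version; check the distfit documentation for the exact API and defaults):
import numpy as np
from distfit import distfit

X = np.random.binomial(8, 0.5, 10000)   # example discrete data

dfit = distfit(method='discrete')       # discrete mode fits a binomial
results = dfit.fit_transform(X)
print(dfit.model)                       # fitted n, p and goodness-of-fit info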
Let's say I have a list of x,y coordinates like this:
coordinate_list = [(4,6),(2,5),(0,4),(-2,-2),(0,2),(0,0),(8,8),(8,11),(8,14)]
I want to find the average y-value associated with each x-value. So for instance, there's only one "2" x-value in the dataset, so the average y-value would be "5". However, there are three 8's and the average y-value would be 11 [ (8+11+14) / 3 ].
What would be the most efficient way to do this?
y_values_by_x = {}
for x, y in coordinate_list:
    y_values_by_x.setdefault(x, []).append(y)
average_y_by_x = {k: sum(v)/len(v) for k, v in y_values_by_x.items()}
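With the coordinate_list from the question, average_y_by_x comes out as {4: 6.0, 2: 5.0, 0: 2.0, -2: -2.0, 8: 11.0}, so the average y for x=8 is 11, as expected.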
You can use pandas
coordinate_list = [(4,6),(2,5),(0,4),(-2,-2),(0,2),(0,0),(8,8),(8,11),(8,14)]
import pandas as pd
df = pd.DataFrame(coordinate_list)
df
df.groupby([0]).mean()
| 0 | 1 |
| --- | --- |
| -2 | -2 |
| 0 | 2 |
| 2 | 5 |
| 4 | 6 |
| 8 | 11 |
Try the mean() function from the statistics module with a list comprehension:
from statistics import mean
x0_filter_value = 0 # can be any value of your choice for finding average
result = mean([x[1] for x in coordinate_list if x[0] == x0_filter_value])
print(result)
And to print the mean for every x[0] value:
for i in set([x[0] for x in coordinate_list]):
    print(i, mean([x[1] for x in coordinate_list if x[0] == i]))
I would like to produce a specific type of visualization, consisting of a rather simple dot plot but with a twist: both of the axes are categorical variables (i.e. ordinal or non-numerical values). And this complicates matters instead of making it easier.
To illustrate this question, I will be using a small example dataset that is a modification from seaborn.load_dataset("tips") and defined as such:
import pandas
from six import StringIO
df = """total_bill | tip | sex | smoker | day | time | size
16.99 | 1.01 | Male | No | Mon | Dinner | 2
10.34 | 1.66 | Male | No | Sun | Dinner | 3
21.01 | 3.50 | Male | No | Sun | Dinner | 3
23.68 | 3.31 | Male | No | Sun | Dinner | 2
24.59 | 3.61 | Female | No | Sun | Dinner | 4
25.29 | 4.71 | Female | No | Mon | Lunch | 4
8.77 | 2.00 | Female | No | Tue | Lunch | 2
26.88 | 3.12 | Male | No | Wed | Lunch | 4
15.04 | 3.96 | Male | No | Sat | Lunch | 2
14.78 | 3.23 | Male | No | Sun | Lunch | 2"""
df = pandas.read_csv(StringIO(df.replace(' ','')), sep="|", header=0)
My first approach to produce my graph was to try a call to seaborn as such:
import seaborn
axes = seaborn.pointplot(x="time", y="sex", data=df)
This fails with:
ValueError: Neither the `x` nor `y` variable appears to be numeric.
So does the equivalent seaborn.stripplot and seaborn.swarmplot calls. It does work however if one of the variables is categorical and the other one is numerical. Indeed seaborn.pointplot(x="total_bill", y="sex", data=df) works, but is not what I want.
I also attempted a scatterplot like such:
axes = seaborn.scatterplot(x="time", y="sex", size="day", data=df,
x_jitter=True, y_jitter=True)
This produces the following graph which does not contain any jitter and has all the dots overlapping, making it useless:
Do you know of any elegant approach or library that could solve my problem ?
I started writing something myself, which I will include below, but this implementation is suboptimal and limited by the number of points that can overlap at the same spot (currently it fails if more than 4 points overlap).
# Modules #
import seaborn, pandas, matplotlib
from six import StringIO

################################################################################
def amount_to_offets(amount):
    """A function that takes an amount of overlapping points (e.g. 3)
    and returns a list of offsets (jittered) coordinates for each of the
    points.

    It follows the logic that two points are displayed side by side:

    2 ->  * *

    Three points are organized in a triangle

    3 ->   *
          * *

    Four points are sorted into a square, and so on.

    4 ->  * *
          * *
    """
    assert isinstance(amount, int)
    solutions = {
        1: [( 0.0,  0.0)],
        2: [(-0.5,  0.0), ( 0.5,  0.0)],
        3: [(-0.5, -0.5), ( 0.0,  0.5), ( 0.5, -0.5)],
        4: [(-0.5, -0.5), ( 0.5,  0.5), ( 0.5, -0.5), (-0.5,  0.5)],
    }
    return solutions[amount]

################################################################################
class JitterDotplot(object):

    def __init__(self, data, x_col='time', y_col='sex', z_col='tip'):
        self.data = data
        self.x_col = x_col
        self.y_col = y_col
        self.z_col = z_col

    def plot(self, **kwargs):
        # Load data #
        self.df = self.data.copy()
        # Assign numerical values to the categorical data #
        # So that ['Dinner', 'Lunch'] becomes [0, 1] etc. #
        self.x_values = self.df[self.x_col].unique()
        self.y_values = self.df[self.y_col].unique()
        self.x_mapping = dict(zip(self.x_values, range(len(self.x_values))))
        self.y_mapping = dict(zip(self.y_values, range(len(self.y_values))))
        self.df = self.df.replace({self.x_col: self.x_mapping, self.y_col: self.y_mapping})
        # Offset points that are overlapping in the same location #
        # So that (2.0, 3.0) becomes (2.05, 2.95) for instance #
        cols = [self.x_col, self.y_col]
        scaling_factor = 0.05
        for values, df_view in self.df.groupby(cols):
            offsets = amount_to_offets(len(df_view))
            offsets = pandas.DataFrame(offsets, index=df_view.index, columns=cols)
            offsets *= scaling_factor
            self.df.loc[offsets.index, cols] += offsets
        # Plot a standard scatter plot #
        g = seaborn.scatterplot(x=self.x_col, y=self.y_col, size=self.z_col, data=self.df, **kwargs)
        # Force integer ticks on the x and y axes #
        locator = matplotlib.ticker.MaxNLocator(integer=True)
        g.xaxis.set_major_locator(locator)
        g.yaxis.set_major_locator(locator)
        g.grid(False)
        # Expand the axis limits for x and y #
        margin = 0.4
        xmin, xmax, ymin, ymax = g.get_xlim() + g.get_ylim()
        g.set_xlim(xmin-margin, xmax+margin)
        g.set_ylim(ymin-margin, ymax+margin)
        # Replace ticks with the original categorical names #
        g.set_xticklabels([''] + list(self.x_mapping.keys()))
        g.set_yticklabels([''] + list(self.y_mapping.keys()))
        # Return for display in notebooks for instance #
        return g

################################################################################
# Graph #
graph = JitterDotplot(data=df)
axes = graph.plot()
axes.figure.savefig('jitter_dotplot.png')
You could first convert time and sex to categorical type and tweak it a little bit:
df.sex = pd.Categorical(df.sex)
df.time = pd.Categorical(df.time)
axes = sns.scatterplot(x=df.time.cat.codes+np.random.uniform(-0.1,0.1, len(df)),
                       y=df.sex.cat.codes+np.random.uniform(-0.1,0.1, len(df)),
                       size=df.tip)
Output:
With that idea, you can modify the offsets (np.random) in the above code to the respective distance. For example:
# grouping
groups = df.groupby(['time', 'sex'])
# compute the number of samples per group
num_samples = groups.tip.transform('size')
# enumerate the samples within a group
sample_ranks = df.groupby(['time']).cumcount() * (2*np.pi) / num_samples
# compute the offset
x_offsets = np.where(num_samples.eq(1), 0, np.cos(sample_ranks) * 0.03)
y_offsets = np.where(num_samples.eq(1), 0, np.sin(sample_ranks) * 0.03)
# plot
axes = sns.scatterplot(x=df.time.cat.codes + x_offsets,
                       y=df.sex.cat.codes + y_offsets,
                       size=df.tip)
Output:
I have the following model:
from gurobipy import *
n_units = 1
n_periods = 3
n_ageclasses = 4
units = range(1,n_units+1)
periods = range(1,n_periods+1)
periods_plus1 = periods[:]
periods_plus1.append(max(periods_plus1)+1)
ageclasses = range(1,n_ageclasses+1)
nothickets = ageclasses[1:]
model = Model('MPPM')
HARVEST = model.addVars(units, periods, nothickets, vtype=GRB.INTEGER, name="HARVEST")
FOREST = model.addVars(units, periods_plus1, ageclasses, vtype=GRB.INTEGER, name="FOREST")
model.addConstrs((quicksum(HARVEST[(k+1), (t+1), nothicket] for k in range(n_units) for t in range(n_periods) for nothicket in nothickets) == FOREST[unit, period+1, 1] for unit in units for period in periods if period < max(periods_plus1)), name="A_Thicket")
I have a problem with formulating the constraint. I want, for every unit and every period, to sum the nothickets part of the variable HARVEST. Concretely, I want x_{k=1,t=1,2} + x_{k=1,t=1,3} + x_{k=1,t=1,4}, and so on. This should result in only three ones per row of the constraint matrix, but with the formulation above I get 9 ones.
I tried to use a for loop outside of the sum, but this results in another problem:
for k in range(n_units):
    for t in range(n_periods):
        model.addConstrs((quicksum(HARVEST[(k+1), (t+1), nothicket] for nothicket in nothickets) == FOREST[unit, period+1, 1] for unit in units for period in periods if period < max(periods_plus1)), name="A_Thicket")
With this formulation I get this matrix:
constraint matrix
But what I want is:
row_idx | col_idx | coeff
0 | 0 | 1
0 | 1 | 1
0 | 2 | 1
0 | 13 | -1
1 | 3 | 1
1 | 4 | 1
1 | 5 | 1
1 | 17 | -1
2 | 6 | 1
2 | 7 | 1
2 | 8 | 1
2 | 21 | -1
Can anybody please help me to reformulate this constraint?
This worked for me:
model.addConstrs((HARVEST.sum(unit, period, '*') == ...
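The snippet above is truncated; based on the variables in the question, the complete constraint presumably looks something like the sketch below (the exact right-hand side and index ranges are my assumption, not confirmed by the answer):
# tupledict.sum() with a '*' wildcard sums HARVEST over all age classes
# for a given unit and period
model.addConstrs(
    (HARVEST.sum(unit, period, '*') == FOREST[unit, period + 1, 1]
     for unit in units for period in periods),
    name="A_Thicket")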
I'm looking to calculate intraclass correlation (ICC) in Python. I haven't been able to find an existing module that has this feature. Is there an alternate name, or should I do it myself? I'm aware this question was asked a year ago on Cross Validated by another user, but there were no replies. I am looking to compare the continuous scores between two raters.
There are several implementations of the ICC in R. These can be used from Python via the rpy2 package. Example:
from rpy2.robjects import DataFrame, FloatVector, IntVector
from rpy2.robjects.packages import importr
from math import isclose
groups = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8]
values = [1, 2, 0, 1, 1, 3, 3, 2, 3, 8, 1, 4, 6, 4, 3,
3, 6, 5, 5, 6, 7, 5, 6, 2, 8, 7, 7, 9, 9, 9, 9, 8]
r_icc = importr("ICC")
df = DataFrame({"groups": IntVector(groups),
"values": FloatVector(values)})
icc_res = r_icc.ICCbare("groups", "values", data=df)
icc_val = icc_res[0] # icc_val now holds the icc value
# check whether icc value equals reference value
print(isclose(icc_val, 0.728, abs_tol=0.001))
You can find an implementation at ICC or Brain_Data.icc
The pingouin library computes ICC in 6 different ways, along with associated confidence intervals and p-values.
You can install it with pip install pingouin or conda install -c conda-forge pingouin
import pingouin as pg
data = pg.read_dataset('icc')
icc = pg.intraclass_corr(data=data, targets='Wine', raters='Judge',
ratings='Scores')
data.head()
| | Wine | Judge | Scores |
|---:|-------:|:--------|---------:|
| 0 | 1 | A | 1 |
| 1 | 2 | A | 1 |
| 2 | 3 | A | 3 |
| 3 | 4 | A | 6 |
| 4 | 5 | A | 6 |
| 5 | 6 | A | 7 |
| 6 | 7 | A | 8 |
| 7 | 8 | A | 9 |
| 8 | 1 | B | 2 |
| 9 | 2 | B | 3 |
icc
| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---:|:-------|:------------------------|------:|-------:|------:|------:|------------:|:-------------|
| 0 | ICC1 | Single raters absolute | 0.773 | 11.199 | 5 | 12 | 0.000346492 | [0.39, 0.96] |
| 1 | ICC2 | Single random raters | 0.783 | 27.966 | 5 | 10 | 1.42573e-05 | [0.25, 0.96] |
| 2 | ICC3 | Single fixed raters | 0.9 | 27.966 | 5 | 10 | 1.42573e-05 | [0.65, 0.98] |
| 3 | ICC1k | Average raters absolute | 0.911 | 11.199 | 5 | 12 | 0.000346492 | [0.65, 0.99] |
| 4 | ICC2k | Average random raters | 0.915 | 27.966 | 5 | 10 | 1.42573e-05 | [0.5, 0.99] |
| 5 | ICC3k | Average fixed raters | 0.964 | 27.966 | 5 | 10 | 1.42573e-05 | [0.85, 0.99] |
The R package psych has an implementation of the Intraclass Correlations (ICC) that calculates many types of variants including ICC(1,1), ICC(1,k), ICC(2,1), ICC(2,k), ICC(3,1) and ICC(3,k) plus other metrics.
This page has a good comparison between the different variants,
You can use the R ICC function via rpy2 package.
Example:
First install psych and lme4 in R:
install.packages("psych")
install.packages("lme4")
Calculate ICC coefficients in Python using rpy2:
import rpy2
from rpy2.robjects import IntVector, pandas2ri
from rpy2.robjects.packages import importr
psych = importr("psych")
values = rpy2.robjects.r.matrix(
IntVector(
[9, 2, 5, 8,
6, 1, 3, 2,
8, 4, 6, 8,
7, 1, 2, 6,
10, 5, 6, 9,
6, 2, 4, 7]),
ncol=4, byrow=True
)
icc = psych.ICC(values)
# Convert to Pandas DataFrame
icc_df = pandas2ri.rpy2py(icc[0])
Results:
|  | type | ICC | F | df1 | df2 | p | lower bound | upper bound |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|
| Single_raters_absolute | ICC1 | 0.165783 | 1.794916 | 5.0 | 18.0 | 0.164720 | -0.132910 | 0.722589 |
| Single_random_raters | ICC2 | 0.289790 | 11.026650 | 5.0 | 15.0 | 0.000135 | 0.018791 | 0.761107 |
| Single_fixed_raters | ICC3 | 0.714829 | 11.026650 | 5.0 | 15.0 | 0.000135 | 0.342447 | 0.945855 |
| Average_raters_absolute | ICC1k | 0.442871 | 1.794916 | 5.0 | 18.0 | 0.164720 | -0.884193 | 0.912427 |
| Average_random_raters | ICC2k | 0.620080 | 11.026650 | 5.0 | 15.0 | 0.000135 | 0.071153 | 0.927240 |
| Average_fixed_raters | ICC3k | 0.909311 | 11.026650 | 5.0 | 15.0 | 0.000135 | 0.675657 | 0.985891 |
Based on Brain_Data, I modified the code in order to calculate the correlation coefficients ICC(2,1), ICC(2,k), ICC(3,1) or ICC(3,k) for data input as a table Y (subjects in rows and repeated measurements in columns).
import os
import numpy as np
from numpy import ones, kron, mean, eye, hstack, dot, tile
from numpy.linalg import pinv

def icc(Y, icc_type='ICC(2,1)'):
    ''' Calculate intraclass correlation coefficient

    ICC Formulas are based on:
    Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: uses in
    assessing rater reliability. Psychological bulletin, 86(2), 420.
    icc1: x_ij = mu + beta_j + w_ij
    icc2/3: x_ij = mu + alpha_i + beta_j + (ab)_ij + epsilon_ij
    Code modified from nipype algorithms.icc
    https://github.com/nipy/nipype/blob/master/nipype/algorithms/icc.py

    Args:
        Y: The data Y are entered as a 'table' ie. subjects are in rows and repeated
            measures in columns
        icc_type: type of ICC to calculate. (ICC(2,1), ICC(2,k), ICC(3,1), ICC(3,k))

    Returns:
        ICC: (np.array) intraclass correlation coefficient
    '''
    [n, k] = Y.shape

    # Degrees of Freedom
    dfc = k - 1
    dfe = (n - 1) * (k-1)
    dfr = n - 1

    # Sum Square Total
    mean_Y = np.mean(Y)
    SST = ((Y - mean_Y) ** 2).sum()

    # create the design matrix for the different levels
    x = np.kron(np.eye(k), np.ones((n, 1)))  # sessions
    x0 = np.tile(np.eye(n), (k, 1))  # subjects
    X = np.hstack([x, x0])

    # Sum Square Error
    predicted_Y = np.dot(np.dot(np.dot(X, np.linalg.pinv(np.dot(X.T, X))),
                                X.T), Y.flatten('F'))
    residuals = Y.flatten('F') - predicted_Y
    SSE = (residuals ** 2).sum()
    MSE = SSE / dfe

    # Sum square column effect - between columns
    SSC = ((np.mean(Y, 0) - mean_Y) ** 2).sum() * n
    MSC = SSC / dfc  # / n (without n in SPSS results)

    # Sum Square subject effect - between rows/subjects
    SSR = SST - SSC - SSE
    MSR = SSR / dfr

    if icc_type == 'icc1':
        # ICC(2,1) = (mean square subject - mean square error) /
        #            (mean square subject + (k-1)*mean square error +
        #            k*(mean square columns - mean square error)/n)
        # ICC = (MSR - MSRW) / (MSR + (k-1) * MSRW)
        raise NotImplementedError("This method isn't implemented yet.")

    elif icc_type == 'ICC(2,1)' or icc_type == 'ICC(2,k)':
        # ICC(2,1) = (mean square subject - mean square error) /
        #            (mean square subject + (k-1)*mean square error +
        #            k*(mean square columns - mean square error)/n)
        if icc_type == 'ICC(2,k)':
            k = 1
        ICC = (MSR - MSE) / (MSR + (k-1) * MSE + k * (MSC - MSE) / n)

    elif icc_type == 'ICC(3,1)' or icc_type == 'ICC(3,k)':
        # ICC(3,1) = (mean square subject - mean square error) /
        #            (mean square subject + (k-1)*mean square error)
        if icc_type == 'ICC(3,k)':
            k = 1
        ICC = (MSR - MSE) / (MSR + (k-1) * MSE)

    return ICC
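A quick usage sketch (my own example, not part of the original code): pass a subjects-by-raters NumPy array. The matrix below is the same one used in the psych example above, so the ICC(2,1) and ICC(3,1) results should be close to the ICC2 and ICC3 rows of that table.
import numpy as np

# 6 subjects (rows) rated by 4 judges (columns)
Y = np.array([[ 9, 2, 5, 8],
              [ 6, 1, 3, 2],
              [ 8, 4, 6, 8],
              [ 7, 1, 2, 6],
              [10, 5, 6, 9],
              [ 6, 2, 4, 7]])

print(icc(Y, icc_type='ICC(2,1)'))   # single random raters
print(icc(Y, icc_type='ICC(3,1)'))   # single fixed raters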