generating correlated numbers in numpy / pandas - python

I’m trying to generate simulated student grades in 4 subjects, where a student record is a single row of data. The code shown here will generate normally distributed random numbers with a mean of 60 and a standard deviation of 15.
df = pd.DataFrame(15 * np.random.randn(5, 4) + 60, columns=['Math', 'Science', 'History', 'Art'])
What I can’t figure out is how to make it so that a student’s Science mark is highly correlated to their Math mark, and that their History and Art marks are less so, but still somewhat correlated to the Math mark.
I’m neither a statistician nor an expert programmer, so a less sophisticated but more easily understood solution is what I’m hoping for.

Let's put what has been suggested by @Daniel into code.
Step 1
Let's import multivariate_normal:
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn
Step 2
Let's construct the covariance matrix and generate the data:
cov = np.array([[1. , 0.8, 0.7, 0.6],
                [0.8, 1. , 0.5, 0.5],
                [0.7, 0.5, 1. , 0.5],
                [0.6, 0.5, 0.5, 1. ]])
cov
array([[ 1. , 0.8, 0.7, 0.6],
       [ 0.8, 1. , 0.5, 0.5],
       [ 0.7, 0.5, 1. , 0.5],
       [ 0.6, 0.5, 0.5, 1. ]])
This is the key step. Note that the covariance matrix has 1's on the diagonal, and that in the first row the covariances with Math decrease from left to right (0.8 for Science, 0.7 for History, 0.6 for Art).
Now we are ready to generate data, let's say 1,000 points:
scores = mvn.rvs(mean = [60.,60.,60.,60.], cov=cov, size = 1000)
Sanity check (from covariance matrix to simple correlations):
np.corrcoef(scores.T)
array([[ 1.        , 0.78886583, 0.70198586, 0.56810058],
       [ 0.78886583, 1.        , 0.49187904, 0.45994833],
       [ 0.70198586, 0.49187904, 1.        , 0.4755558 ],
       [ 0.56810058, 0.45994833, 0.4755558 , 1.        ]])
Note that np.corrcoef expects the variables in rows, hence the transpose (scores.T).
Finally, let's put the data into a Pandas DataFrame:
df = pd.DataFrame(data = scores, columns = ["Math", "Science","History", "Art"])
df.head()
Math Science History Art
0 60.629673 61.238697 61.805788 61.848049
1 59.728172 60.095608 61.139197 61.610891
2 61.205913 60.812307 60.822623 59.497453
3 60.581532 62.163044 59.277956 60.992206
4 61.408262 59.894078 61.154003 61.730079
Step 3
Let's visualize some data that we've just generated:
ax = df.plot(x = "Math",y="Art", kind="scatter", color = "r", alpha = .5, label = "Art, $corr_{Math}$ = .6")
df.plot(x = "Math",y="Science", kind="scatter", ax = ax, color = "b", alpha = .2, label = "Science, $corr_{Math}$ = .8")
ax.set_ylabel("Art and Science");

The statistical tool for this is the covariance matrix: https://en.wikipedia.org/wiki/Covariance.
Each cell (i, j) represents the dependency between variable i and variable j, so in your case it could be between Math and Science. If there is no dependency, the value is 0.
What you did amounts to assuming that the covariance matrix is diagonal, with the same value all along the diagonal. What you have to do instead is define your covariance matrix and then draw the samples from a Gaussian with numpy.random.multivariate_normal (https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html) or any other multivariate distribution function.
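For instance, here is a minimal sketch along those lines. The mean of 60 and standard deviation of 15 come from the question, the correlation values are borrowed from the matrix above, and scaling the correlation matrix by the variance to get a covariance matrix is my own (standard) choice:
import numpy as np
import pandas as pd

# correlation structure (same values as the covariance matrix shown above)
corr = np.array([[1. , 0.8, 0.7, 0.6],
                 [0.8, 1. , 0.5, 0.5],
                 [0.7, 0.5, 1. , 0.5],
                 [0.6, 0.5, 0.5, 1. ]])
std = 15.0
cov = corr * std**2                       # every subject shares the same spread here
means = [60., 60., 60., 60.]

scores = np.random.multivariate_normal(means, cov, size=1000)
df = pd.DataFrame(scores, columns=['Math', 'Science', 'History', 'Art'])
print(df.corr().round(2))                 # should roughly recover corr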

Thank you guys for the responses; they were extremely useful. I adapted the code provided by Sergey to produce the result I was looking for, which was records with Math and Science marks that are relatively close most of the time, and History and Art marks that are more independent.
The following produced data that looks reasonable:
cov = np.array([[1. , 0.5, 0.2, 0.1],
                [0.5, 1. , 0.1, 0.1],
                [0.2, 0.1, 1. , 0.3],
                [0.1, 0.1, 0.3, 1. ]])
scores = mvn.rvs(mean = [0.,0.,0.,0.], cov=cov, size = 100)
df = pd.DataFrame(data = 15 * scores + 60, columns = ["Math","Science","History", "Art"])
df.head(10)
The next step would be to make it so that each subject has a different mean, but I have an idea of how to do that. Thanks again.
[screenshot of the example dataframe]
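For the per-subject means mentioned as the next step, one possible sketch (the means and standard deviations below are made-up numbers, not anything from the question) is to keep drawing standardized scores and then give each column its own location and spread; column-wise scaling leaves the correlations untouched:
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn

cov = np.array([[1. , 0.5, 0.2, 0.1],
                [0.5, 1. , 0.1, 0.1],
                [0.2, 0.1, 1. , 0.3],
                [0.1, 0.1, 0.3, 1. ]])
scores = mvn.rvs(mean=[0., 0., 0., 0.], cov=cov, size=100)

# hypothetical per-subject means and standard deviations
means = np.array([65., 60., 70., 75.])
stds = np.array([15., 15., 10., 8.])

df = pd.DataFrame(scores * stds + means,
                  columns=["Math", "Science", "History", "Art"])
print(df.mean().round(1))   # roughly the per-subject means above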

Related

Generating weighted intervals in Python

I know how to produce a weighted integer with random.choice.
Now I need 5000 integers from 0 to 1000. I want, say, 75% to land in the interval 0-500, 20% in 501-750 and 5% in 751-1000. What I tried (and what failed) is
x = random.choice([np.arange(501), np.arange(501,751), np.arange(751, 1001)], size=5000, p=[0.75, 0.2, 0.05])
But then I only get randomly chosen arange intervals, not individual integers. Any help would be appreciated.
How about something like this:
import numpy as np
x = np.random.choice(list(range(1001)), size=5000,
                     p=[.75/501]*501 + [.2/250]*250 + [.05/250]*250)
another version would be:
import numpy as np
from scipy import stats
N = 5000
probs = [0.75, 0.2, 0.05]
breaks = [0, 501, 751, 1001]
# figure out how big each group should be
sizes = stats.multinomial.rvs(N, probs)
# get values for each group
x = np.concatenate([
    stats.randint.rvs(l, h, size=n)
    for n, l, h in zip(sizes, breaks, breaks[1:])])
# mix everything up
np.random.shuffle(x)
some differences to rocket's solution:
- fewer/smaller temporary variables and more opportunity for vectorisation
- it allows probabilities to take irrational values
- the runtime doesn't depend on the number of possible values generated
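As a quick sanity check (assuming x and breaks from the second snippet are still in scope), the empirical proportions should land close to the requested probabilities:
import numpy as np

counts, _ = np.histogram(x, bins=breaks)   # bins: [0, 501), [501, 751), [751, 1001]
print(counts / len(x))                     # roughly [0.75, 0.2, 0.05]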

Labels obtained from clustering seem visually incorrect

I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix into an affinity_matrix using the following:
delta = 0.1
affinity_matrix = np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
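As a spot check of the transform, plugging the largest distance (between points #1 and #8) into the same formula reproduces the corresponding affinity entry:
import numpy as np

delta = 0.1
d = 0.14019663                            # distance_matrix[0, 7]
print(np.exp(-d**2 / (2. * delta**2)))    # ~0.3743, matches affinity_matrix[0, 7]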
I transform the distance_matrix into a heatmap to get a better visual of the data
import pandas as pd
import seaborn as sns
distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix into 3 clusters. Before running the actual clustering, I inspect the heatmap to anticipate the clusters. Clearly #8 is an outlier and should end up in a cluster of its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3,
assign_labels='kmeans',
affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output yields
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So #8 is part of cluster 2, which consists of three other data points. Initially I assumed it would be a cluster of its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10? Please advise.
When we move away from relatively simple clustering algorithms like k-means, whatever intuition we carry along about an algorithm's results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description makes it clear enough that here we should not expect results similar to more intuitive, distance-based clustering algorithms like k-means.
Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (#3).
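If it helps to see the memberships explicitly, a small loop over the clusters array from the k-means run above lists which points share a cluster:
for label in sorted(set(clusters)):
    members = [i + 1 for i, c in enumerate(clusters) if c == label]
    print(f"cluster {label}: points {members}")
# cluster 1: points [3, 5, 10]
# cluster 2: points [1, 2, 4, 6, 7, 9]
# cluster 3: points [8]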
Nevertheless, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value lies exactly in their ability to uncover regularities of different kinds in the data; they would not be that useful if they just replicated the results of existing algorithms like k-means, would they?
The scikit-learn vignette Comparing different clustering algorithms on toy datasets might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets; here is the summary finding:
[summary comparison figure from the scikit-learn documentation]

How to compute average within given percentiles in Python?

I am doing some scientific computing and I couldn't find an elegant way of performing the following operation. Suppose I have a 2-dimensional numpy array D which stores measurements of a given quantity at several times along the day. Each row corresponds to a different measuring instrument and each column corresponds to a different moment in the day at which the measurement was done.
Consider a list of desired percentiles. For example:
quantiles = [0.25, 0.5, 0.75]
My goal is to compute the average measurement by quantile group, at each moment in the day. In other words, given a column of measurements, I would like to split the measurements from that column into groups according to the quantiles above and then take averages within groups. Using the example, I would have 4 groups at each moment of the day: the measurements in the lower quartile, those between the 25th and 50th percentiles, those between the 50th and 75th, and finally those in the top quartile. Therefore, if m is the number of moments in the day when measurements were taken and q is the number of elements in the quantiles variable, my desired output would be a (q+1) x m numpy array.
Currently, I am doing this in the most inefficient and hard-coded way possible. Here we go:
quantiles = [0.25, 0.5, 0.75]
window = "30min"
moments = pd.date_range(start = "9:30", end = "16:00", freq = window).time
quantile_curves = np.zeros((len(quantiles)+1, len(moments)-1))
EmpQuantiles = np.quantile(D, quantiles, axis = 0)
for moment in range(len(moments)-1):
    quantile_curves[0, moment] = np.mean(D[:, moment][D[:, moment] < EmpQuantiles[0, moment]])
    quantile_curves[1, moment] = np.mean(D[:, moment][np.logical_and(D[:, moment] > EmpQuantiles[0, moment], D[:, moment] < EmpQuantiles[1, moment])])
    quantile_curves[2, moment] = np.mean(D[:, moment][np.logical_and(D[:, moment] > EmpQuantiles[1, moment], D[:, moment] < EmpQuantiles[2, moment])])
    quantile_curves[3, moment] = np.mean(D[:, moment][D[:, moment] > EmpQuantiles[2, moment]])
What's an elegant and simpler way of doing this? I couldn't find the answer here; however, there is a related (but not the same) question in R: ddply multiple quantiles by group.
I intend to plot the evolution of the in-group averages along the day. I am satisfied with the plot I get and with the result; however, I am looking for a better way of computing the quantile_curves variable.
[plot of the in-group averages along the day]
Thanks a lot in advance!
You can do it efficiently using masked arrays:
import numpy as np
quantiles = [0.25, 0.5, 0.75]
print('quantiles:\n', quantiles)
moments = [f'moment {i}' for i in range(5)]
print('nb of moments:\n', len(moments))
nb_measurements = 10000
D = np.random.rand(nb_measurements,len(moments))
quantile_values = np.quantile(D,quantiles,axis=0)
print('quantile_values (for each moment):\n', quantile_values)
quantile_curves = np.zeros((len(quantiles)+1,len(moments)))
quantile_curves[0, :] = np.mean(np.ma.masked_array(D, mask=D>quantile_values[[0],:]), axis=0)
for q in range(len(quantiles)-1):
    quantile_curves[q+1, :] = np.mean(np.ma.masked_array(D, mask=np.logical_or(D<quantile_values[[q],:], D>quantile_values[[q+1],:])), axis=0)
quantile_curves[len(quantiles), :] = np.mean(np.ma.masked_array(D, mask=D<quantile_values[[len(quantiles)-1],:]), axis=0)
print('mean for each group and at each moment:')
print(quantile_curves)
Output:
% python3 script.py
quantiles:
[0.25, 0.5, 0.75]
nb of moments:
5
quantile_values (for each moment):
[[0.25271343 0.25434056 0.24658732 0.24612319 0.25221014]
[0.51114344 0.50103699 0.49671249 0.49113293 0.49819521]
[0.75629377 0.75427293 0.74676209 0.74211813 0.7490436 ]]
mean for each group and at each moment
[[0.12650993 0.12823392 0.12492136 0.12200609 0.12655318]
[0.3826476 0.373516 0.37050513 0.36974876 0.37722219]
[0.63454102 0.63023986 0.62280545 0.61696283 0.6238492 ]
[0.87866019 0.87614489 0.87492553 0.87253142 0.87403426]]
Note that I'm using random values between 0 and 1, which is why the quantile values (the extremities of the group intervals) are almost equal to the quantiles themselves. Also note that this code works for an arbitrary number of quantiles or moments.
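For what it is worth, a roughly equivalent pandas sketch bins each column by its own empirical quantiles with pd.qcut and averages within the bins via groupby; note that the boundary handling differs slightly from the strict inequalities in the question, and the toy D below is an assumption:
import numpy as np
import pandas as pd

quantiles = [0.25, 0.5, 0.75]
edges = [0.0] + quantiles + [1.0]

# hypothetical data: 100 instruments (rows) x 5 moments (columns)
D = np.random.rand(100, 5)
df = pd.DataFrame(D)

# bin each column by its own quantiles, then average within each bin
quantile_curves = df.apply(
    lambda col: col.groupby(pd.qcut(col, edges, labels=False)).mean()
).to_numpy()
print(quantile_curves.shape)   # (len(quantiles) + 1, number of moments)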

Rescale price list from a longer length to a smaller length

Given the following pandas data frame with 60 elements.
import pandas as pd
data = [60,62.75,73.28,75.77,70.28
,67.85,74.58,72.91,68.33,78.59
,75.58,78.93,74.61,85.3,84.63
,84.61,87.76,95.02,98.83,92.44
,84.8,89.51,90.25,93.82,86.64
,77.84,76.06,77.75,72.13,80.2
,79.05,76.11,80.28,76.38,73.3
,72.28,77,69.28,71.31,79.25
,75.11,73.16,78.91,84.78,85.17
,91.53,94.85,87.79,97.92,92.88
,91.92,88.32,81.49,88.67,91.46
,91.71,82.17,93.05,103.98,105]
data_pd = pd.DataFrame(data, columns=["price"])
Is there a formula to rescale this in such a way that, for each window bigger than 20 elements running from index 0 to index i+1, the data is rescaled down to 20 elements?
Here is a loop that creates the windows with the data for rescaling; I just do not know how to do the rescaling itself for this problem. Any suggestions on how this might be done?
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd[0:i]
        scaledDataToMinLenght = dataForScaling  # do the scaling here so that the length of the rescaled data is always equal to miniLenght
        rescaledData.append(scaledDataToMinLenght)
Basically after the rescaling the rescaledData should have 40 arrays, each with a length of 20 prices.
From reading the paper, it looks like you are resizing the list back to 20 indices, then interpolating the data at your 20 indices.
We'll make the indices like they do (range(0, len(large), step = len(large)/miniLenght)), then use numpy's interp; there are a million ways of interpolating data. np.interp uses linear interpolation, so if you ask for e.g. index 1.5, you get the mean of points 1 and 2, and so on.
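A tiny illustration of that behaviour, with made-up values:
import numpy as np

# asking for "index" 1.5 returns the midpoint of the values at indices 1 and 2
print(np.interp(1.5, np.arange(4), [10, 20, 30, 40]))   # 25.0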
So, here's a quick modification of your code to do it (nb, we could probably fully vectorize this using 'rolling'):
import numpy as np
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd['price'][0:i]
        # figure out how many 'steps' we have
        steps = len(dataForScaling)
        # make indices where the data needs to be sliced to get 20 points
        indices = np.arange(0, steps, step = steps/miniLenght)
        # use np.interp at those points, with the original values as given
        rescaledData.append(np.interp(indices, np.arange(steps), dataForScaling))
And the output is as expected:
[array([ 60. , 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91,
68.33, 78.59, 75.58, 78.93, 74.61, 85.3 , 84.63, 84.61,
87.76, 95.02, 98.83, 92.44]),
array([ 60. , 63.2765, 73.529 , 74.9465, 69.794 , 69.5325,
74.079 , 71.307 , 72.434 , 77.2355, 77.255 , 76.554 ,
81.024 , 84.8645, 84.616 , 86.9725, 93.568 , 98.2585,
93.079 , 85.182 ]),.....
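As a quick check on the shapes (reusing the rescaledData list built above), the result should indeed be 40 windows of 20 points each:
print(len(rescaledData))                              # 40 windows
print(len(rescaledData[0]), len(rescaledData[-1]))    # each rescaled to 20 points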

NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've
imported the csv showing the weights as an array, masking values of 0, and
created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)
#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)
#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]
massArr = []
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)
nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation this number falls on.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements plus itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now looking at the cumsum result, we can easily see that this must be the observation between the second and third elements (cumulative sums 3 and 6).
So all we need to do is figure out the index where the median (5) would fit.
np.searchsorted does exactly what we need. It will find the index to insert an elements into an array, so that it stays sorted.
The code to do it looks like this:
import numpy as np
#my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])
c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices] #Correct if the masses are indeed 0, 10, 20,...
#This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
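Applied to the example from the question (masses [0, 10, 20, 30] with weights [30, 191, 9, 0]), this returns the expected weighted median of 10:
import numpy as np

print(weighted_median(np.array([0, 10, 20, 30]), np.array([30, 191, 9, 0])))   # 10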
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type is equivalent to what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values and weights, the results track the requested quantiles closely:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424,0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is equally (or quantile/1-quantile) divided between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets and (b) real-valued weights that make the odds of ending up exactly at a quantile edge very long, and probably a matter of rounding errors when it does happen. Including it for completeness nonetheless.
I ended up writing this function based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for uneven arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
Sharing some code that I got a hand with; it allows you to run stats on each column of an Excel spreadsheet.
import xlrd
import sys
import csv
import numpy as np
import itertools
from itertools import chain

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# collect the weight column for each age range
age = []
count = 1
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    # create a tuple with the mass vector
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # separate into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    # use the mask-aware median so the masked zeroes are ignored
    median = np.ma.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, stats, fmt="%d")
