NumPy or SciPy to calculate weighted median - python

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've:
1. imported the csv showing the weights as an array, masking values of 0, and
2. created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'

origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray, 0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

massArr = []
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)

nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
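As an aside, the same 113x32 mass array can be built without the loop (a sketch, assuming the masses really do run 0, 10, 20, ... in steps of 10 down each column):

nmassArr = np.tile(np.arange(0, fieldLength * 10, 10)[:, np.newaxis], (1, rowLength))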

What we can do, if I understood your problem correctly, is sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation that number falls on.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now, looking at the cumsum result, we can easily see that the median must fall between the second and third elements (cumulative sums 3 and 6).
So all we need to do is figure out the index at which the median (5) would fit.
np.searchsorted does exactly what we need. It will find the index at which to insert an element into an array so that the array stays sorted.
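For example, finding where the half-way point 5 fits into the cumulative sums from above:
np.searchsorted([1, 3, 6, 10], 5) -> 2
Index 2 is the third element (counting from zero), which is exactly the observation we identified as the median.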
The code to do it looks like this:
import numpy as np
#my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])
c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices] #Correct if the masses are indeed 0, 10, 20,...
#This is just for explanation.
print "median masses is:", masses
print freq_count
print np.hstack((c, c[:, -1, np.newaxis]/2.0))
Output will be:
median masses is: [10, 20, 20, 0, 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]

wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
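To illustrate the idea (a rough sketch of the cumsum + interp recipe, not the package's actual code):

import numpy as np

def weighted_median_interp(values, weights):
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cw = np.cumsum(w)
    # interpolate the value at the point where half of the total weight is reached
    pn = (cw - 0.5 * w) / cw[-1]
    return np.interp(0.5, pn, v)

print(weighted_median_interp([0, 10, 20, 30], [30, 191, 9, 0]))  # ~9.05, an interpolated take on the example above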

Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type mirrors what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values and weights, the estimated quantiles come out close to the requested ones:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424, 0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be added fairly easily by taking the midpoint between two values in the case where the weight mass is split equally (or quantile/1-quantile) between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail on arrays with fewer than 2 elements (the original would handle any non-empty array).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets, and (b) real-valued weights that make the odds of landing exactly on a quantile edge very long, and when it does happen it is probably due to rounding errors. It is included for completeness nonetheless.

I ended up writing this function based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate=True and interpolate=False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for odd-length arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
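A rough harness for reproducing that kind of comparison (a sketch, assuming the weighted_quantiles function above is in scope; absolute numbers will of course vary by machine and array size):

import timeit
import numpy as np

values = np.random.rand(100_000)
weights = np.random.rand(100_000)

# time 100 calls of the interpolated path; swap interpolate=False to compare
print(timeit.timeit(lambda: weighted_quantiles(values, weights, 0.5, interpolate=True),
                    number=100))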

Sharing some code that I got a hand with; it runs stats on each column of an Excel spreadsheet.
import xlrd
import sys
import csv
import numpy as np
import itertools
from itertools import chain

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# read one column of counts per age range
count = 1
age = []
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    expanded = []
    # create a tuple with the mass vector
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # separate into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, (stats), fmt="%d")

Related

Generating weighted intervals in Python

I know how to produce a weighted integer with random.choice.
Now I have 5000 integers from 0 to 1000. I want to have, say, 75% land in the interval 0-500, 20% in 501-750 and 5% in 751-1000. What I tried, and failed with, is
x = random.choice([np.arange(501), np.arange(501,751), np.arange(751, 1001)], size=5000, p=[0.75, 0.2, 0.05])
But then I only get randomly arranged intervals. Any help would be appreciated.
How about something like this:
import numpy as np
x = np.random.choice(list(range(1001)), size=5000,
                     p=[.75/501]*501 + [.2/250]*250 + [.05/250]*250)
another version would be:
import numpy as np
from scipy import stats
N = 5000
probs = [0.75, 0.2, 0.05]
breaks = [0, 501, 751, 1001]
# figure out how big each group should be
sizes = stats.multinomial.rvs(N, probs)
# get values for each group
x = np.concatenate([
    stats.randint.rvs(l, h, size=n)
    for n, l, h in zip(sizes, breaks, breaks[1:])])
# mix everything up
np.random.shuffle(x)
Some differences to rocket's solution:
- fewer/smaller temporary variables and more opportunity for vectorisation
- it allows probabilities to take irrational values
- the runtime doesn't depend on the number of possible values generated
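A quick sanity check for either version (my addition, assuming x has been generated as above and contains 5000 integers):

# empirical share of values falling into each interval; should be close to [0.75, 0.2, 0.05]
props = np.histogram(x, bins=[0, 501, 751, 1001])[0] / 5000
print(props)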

Is there a way to make the numbers in a numpy array randomly positive or negative?

I'm making a neural network, and when assigning random weight values using np.random.rand(797, 600) for example, they all turn out positive (from 0 to 1). This is fine normally, but I have up to 800 nodes which means that by the end of initial forward propagation if all the weights are positive the sigmoided output is always 1, just because the sum of all values adds up so fast with so many synapses and nodes.
To solve this problem, I wanted to make a function that would randomly multiply each weight by 1 or -1. Then, with a random number of positive and negative numbers, the outputs would be closer to 0 and the sigmoid function would return an actual prediction that wasn't 1 all the time. Here are the two methods I have tried to do this, and neither of them worked.
# method 1
import random as rand
import numpy as np
def random_positive_or_negative(value):
    return rand.choice([1, -1]) * value
example_weights = np.random.rand(4, 4)
print(random_positive_or_negative(example_weights))
prints either something like this:
[[0.89098337 0.82291754 0.7730489 0.371631 ]
[0.22790221 0.19964653 0.94609767 0.57070762]
[0.35840034 0.06689964 0.71565062 0.43360395]
[0.57860037 0.11338668 0.338402 0.30737682]]
or like this:
[[-0.79750561 -0.94206793 -0.389792 -0.18541991]
[-0.36132547 -0.66040689 -0.06270979 -0.90775857]
[-0.22350726 -0.21148559 -0.78874412 -0.9702534 ]
[-0.74124928 -0.31675956 -0.97471565 -0.18389436]]
expected output something like this:
[[0.2158195 0.16492544 0.25672823 -0.5392236 ]
[-0.54530676 0.98215902 -0.14348151 0.02629328]
[-0.8642513 -0.71726141 -0.15890395 -0.08488439]
[0.54413198 -0.69790104 0.05317512 -0.06144755]]
# method 2
import random as rand
import numpy as np
def random_positive_or_negative(value):
    return (i * rand.choice([-1, 1]) for i in value)
example_weights = np.random.rand(4, 4)
print(random_positive_or_negative(example_weights))
prints this:
<generator object random_positive_or_negative2.<locals>.<genexpr> at 0x114c474a0>
expected output something like this:
[[0.2158195 0.16492544 0.25672823 -0.5392236 ]
[-0.54530676 0.98215902 -0.14348151 0.02629328]
[-0.8642513 -0.71726141 -0.15890395 -0.08488439]
[0.54413198 -0.69790104 0.05317512 -0.06144755]]
You can create a matrix filled with random numbers sampled from {-1, 1} and multiply it with the random weights. See the code below.
import random as rand
import numpy as np
def random_positive_or_negative(value):
    return np.matmul(value, np.random.choice(np.array([-1, 1]), value.shape))
example_weights = np.random.rand(4, 4)
print(random_positive_or_negative(example_weights))
[[-0.7193314 -0.1604493 -0.47038437 -0.34173619]
[ 0.44388733 -0.55476039 -1.24586476 -0.77014132]
[-0.05796445 -1.72406933 -1.5756221 -0.18125272]
[ 0.15338058 -0.56916866 -1.5706919 -0.01815559]]
Your first method chooses one number, 1 or -1, and multiplies the whole argument array by that number. Your second method uses a generator expression, so it will return a generator. If you don't understand this, you should read about generators first.
There is no need to multiply any values by 1, since that does nothing. Instead, pick random indices and multiply them by -1. Something like:
n = example_weights.size
# flip the sign of a randomly chosen half of the entries
inds = np.random.choice(n, n // 2, replace=False)
example_weights.flat[inds] *= -1
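Another way to get a random mix of signs (my sketch, not part of the answer above) is to draw an independent random sign for every element in one go:

signs = np.random.choice([-1, 1], size=example_weights.shape)
example_weights = example_weights * signs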
There are a few ways to do it, but in my opinion the sleekest was proposed by @xdurch0: use np.random.uniform(-1., 1., size).
Here is how the code would work:
import numpy as np
example_weights = np.random.uniform(-1., 1., [4, 4])
print(example_weights)
prints something like:
[[-0.91852112 -0.77686616 -0.41495832 0.78950649]
[-0.7493404 -0.73794508 0.54622202 -0.89033855]
[ 0.31196172 0.06584705 -0.88698673 -0.24149299]
[-0.89654412 0.45450007 -0.40640681 0.81490564]]

Histogram one array based on another array

I have two numpy arrays:
rates = [1.1, 0.8...]
zenith_angles = [45, 20, ...]
both rates and zen_angles have the same length.
I also have some pre-defined zenith_angle bins,
zen_bins = [0, 10, 20,...]
What I need to do is bin the rates according to its corresponding zenith angle bins.
An ugly way to do it is
nbin = len(zen_bins)-1
norm_binned_zen = [[0]]*nbin
for i in range(nbin):
    norm_binned_zen[i] = [0]
for i in range(len(rates)):
    ind = np.searchsorted(zen_bins, zen_angles[i]) #The corresponding bin number
    norm_binned_zen[ind-1].append(rates[i])
This is not very pythonic and is time consuming for large arrays. I believe there must be some more elegant way to do it?
The starting data (here randomly generated):
import numpy as np
rates = np.random.random(100)
zenith_angles = np.random.random(100)*90.0
zen_bins = np.linspace(0, 90, 10)
Since you are using numpy, you can use a one line solution:
norm_binned_zen = [rates[np.where((zenith_angles > low) & (zenith_angles <= high))] for low, high in zip(zen_bins[:-1], zen_bins[1:])]
Breaking this line into steps:
The list comprehension loops over pairs, the low and high edges of each bin.
numpy.where is used to find the indexes of the angles inside the given bin in the zenith_angles array.
numpy indexing is used to select the rates values at the indexes obtained in the previous step.
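An alternative sketch (my addition), closer in spirit to the original searchsorted loop but fully vectorized, uses np.digitize to get every bin index at once:

import numpy as np

rates = np.random.random(100)
zenith_angles = np.random.random(100) * 90.0
zen_bins = np.linspace(0, 90, 10)

# np.digitize returns, for each angle, which bin it falls in (1-based w.r.t. the edges)
bin_idx = np.digitize(zenith_angles, zen_bins) - 1
norm_binned_zen = [rates[bin_idx == b] for b in range(len(zen_bins) - 1)]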

Rescale price list from a longer length to a smaller length

Given the following pandas data frame with 60 elements.
import pandas as pd
data = [60,62.75,73.28,75.77,70.28
,67.85,74.58,72.91,68.33,78.59
,75.58,78.93,74.61,85.3,84.63
,84.61,87.76,95.02,98.83,92.44
,84.8,89.51,90.25,93.82,86.64
,77.84,76.06,77.75,72.13,80.2
,79.05,76.11,80.28,76.38,73.3
,72.28,77,69.28,71.31,79.25
,75.11,73.16,78.91,84.78,85.17
,91.53,94.85,87.79,97.92,92.88
,91.92,88.32,81.49,88.67,91.46
,91.71,82.17,93.05,103.98,105]
data_pd = pd.DataFrame(data, columns=["price"])
Is there a formula to rescale this in such a way that, for each window bigger than 20 elements (running from index 0 to index i+1), the data is rescaled down to 20 elements?
Here is a loop that creates the windows with the data for rescaling; I just do not know of any way to do the rescaling itself for this problem. Any suggestions on how this might be done?
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd[0:i]
        scaledDataToMinLenght = dataForScaling # do the scaling here so that the length of the rescaled data is always equal to miniLenght
        rescaledData.append(scaledDataToMinLenght)
Basically, after the rescaling, rescaledData should contain 40 arrays, each with a length of 20 prices.
From reading the paper, it looks like you are resizing the list back to 20 indices, then interpolating the data at those 20 indices.
We'll make the indices the way they do (range(0, len(large), step = len(large)/miniLenght)), then use numpy's interp - there are a million ways of interpolating data. np.interp does linear interpolation, so if you ask for e.g. index 1.5, you get the mean of points 1 and 2, and so on.
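For instance, interpolating at index 1.5 over four evenly spaced points:
np.interp(1.5, np.arange(4), [10, 20, 30, 40]) -> 25.0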
So, here's a quick modification of your code to do it (nb, we could probably fully vectorize this using 'rolling'):
import numpy as np

miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd['price'][0:i]
        # figure out how many 'steps' we have
        steps = len(dataForScaling)
        # make indices where the data needs to be sliced to get 20 points
        indices = np.arange(0, steps, step=steps/miniLenght)
        # use np.interp at those points, with the original values as given
        rescaledData.append(np.interp(indices, np.arange(steps), dataForScaling))
And the output is as expected:
[array([ 60. , 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91,
68.33, 78.59, 75.58, 78.93, 74.61, 85.3 , 84.63, 84.61,
87.76, 95.02, 98.83, 92.44]),
array([ 60. , 63.2765, 73.529 , 74.9465, 69.794 , 69.5325,
74.079 , 71.307 , 72.434 , 77.2355, 77.255 , 76.554 ,
81.024 , 84.8645, 84.616 , 86.9725, 93.568 , 98.2585,
93.079 , 85.182 ]),.....

Finding index of nearest point in numpy arrays of x and y coordinates

I have two 2d numpy arrays: x_array contains positional information in the x-direction, y_array contains positions in the y-direction.
I then have a long list of x,y points.
For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.
I have naively produced some code which works, based on this question:
Find nearest value in numpy array
i.e.
import time
import numpy

def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array-y_point)**2 + (x_array-x_point)**2
    idy, idx = numpy.where(distance == distance.min())
    return idy[0], idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in range(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array, x_array, points[0,i], points[1,i]))
    return store

# Create some dummy data
y_array = numpy.random.random(10000).reshape(100,100)
x_array = numpy.random.random(10000).reshape(100,100)
points = numpy.random.random(10000).reshape(2,5000)

# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print('Completed in:', end-start)
I'm doing this over a large dataset and would really like to speed it up a bit.
Can anyone optimize this?
Thanks.
UPDATE: SOLUTION following suggestions by @silvado and @justin (below)
# Shoe-horn existing data for entry into KDTree routines
import scipy.spatial

combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
points_list = list(points.transpose())

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

start = time.time()
results2 = do_kdtree(combined_x_y_arrays, points_list)
end = time.time()
print('Completed in:', end-start)
The code above sped up my code (searching for 5000 points in 100x100 matrices) by a factor of 100. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave timings comparable to my naive solution, so it is definitely worth using the cKDTree version...
Here is a scipy.spatial.KDTree example
In [1]: from scipy import spatial
In [2]: import numpy as np
In [3]: A = np.random.random((10,2))*100
In [4]: A
Out[4]:
array([[ 68.83402637, 38.07632221],
[ 76.84704074, 24.9395109 ],
[ 16.26715795, 98.52763827],
[ 70.99411985, 67.31740151],
[ 71.72452181, 24.13516764],
[ 17.22707611, 20.65425362],
[ 43.85122458, 21.50624882],
[ 76.71987125, 44.95031274],
[ 63.77341073, 78.87417774],
[ 8.45828909, 30.18426696]])
In [5]: pt = [6, 30] # <-- the point to find
In [6]: A[spatial.KDTree(A).query(pt)[1]] # <-- the nearest point
Out[6]: array([ 8.45828909, 30.18426696])
#how it works!
In [7]: distance,index = spatial.KDTree(A).query(pt)
In [8]: distance # <-- The distances to the nearest neighbors
Out[8]: 2.4651855048258393
In [9]: index # <-- The locations of the neighbors
Out[9]: 9
#then
In [10]: A[index]
Out[10]: array([ 8.45828909, 30.18426696])
scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.
The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).
Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.
If you can massage your data into the right format, a fast way to go is to use the methods in scipy.spatial.distance:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
In particular pdist and cdist provide fast ways to calculate pairwise distances.
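For example, a brute-force nearest-point lookup with cdist could be sketched like this (my addition; note the full distance matrix needs memory proportional to the product of the two point counts):

import numpy as np
from scipy.spatial.distance import cdist

grid = np.random.random((2500, 2))       # stand-in for the flattened x/y grid points
queries = np.random.random((1000, 2))    # stand-in for the query points

# distance matrix is (n_queries x n_grid); argmin along each row picks the nearest grid index
nearest_idx = cdist(queries, grid).argmin(axis=1)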
Search methods have two phases:
- build a search structure, e.g. a KDTree, from the npt data points (your x y)
- look up nq query points.
Different methods have different build times and different query times. Your choice will depend a lot on npt and nq:
- scipy cdist has build time 0, but query time ~ npt * nq.
- KDTree build times are complicated; lookups are very fast, ~ ln npt * nq.
On a regular (Manhattan) grid you can do much better: see (ahem) find-nearest-value-in-numpy-array.
A little testbench: building a KDTree of 5000 × 5000 2d points takes about 30 seconds, then queries take microseconds; scipy cdist on 25 million × 20 points (all pairs, 4G) takes about 5 seconds, on my old iMac.
I have been trying to follow along with this, but I'm new to Jupyter Notebooks, Python and the various tools being discussed here. Still, I have managed to get some way down the road I'm travelling.
BURoute = pd.read_csv('C:/Users/andre/BUKP_1m.csv', header=None)
NGEPRoute = pd.read_csv('c:/Users/andre/N1-06.csv', header=None)
I create a combined XY array from my BURoute dataframe
combined_x_y_arrays = BURoute.iloc[:,[0,1]]
And I create the points with the following command
points = NGEPRoute.iloc[:,[0,1]]
I then do the KDTree magic
import scipy.spatial

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes
results2 = do_kdtree(combined_x_y_arrays, points)
This gives me an array of the indexes. I'm now trying to figure out how to calculate the distance between the points and the indexed points in the results array.
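One note worth adding here (my addition, not part of the snippet above): cKDTree.query already returns the distances alongside the indexes, so they can be captured directly, assuming combined_x_y_arrays and points are defined as above:

import scipy.spatial

mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
distances, results2 = mytree.query(points)   # distances[i] pairs with results2[i]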
def find_nearest_vector(arrList, value):
    y, x = value
    offset = 10
    x_Array = []
    y_Array = []
    for p in arrList:
        x_Array.append(p[1])
        y_Array.append(p[0])
    x_Array = np.array(x_Array)
    y_Array = np.array(y_Array)
    difference_array_x = np.absolute(x_Array - x)
    difference_array_y = np.absolute(y_Array - y)
    index_x = np.where(difference_array_x < offset)[0]
    index_y = np.where(difference_array_y < offset)[0]
    index = np.intersect1d(index_x, index_y, assume_unique=True)
    nearestCoordinate = (arrList[index][0][0], arrList[index][0][1])
    return nearestCoordinate
