I am trying to generate a list of 12 random weights for a stock portfolio in order to determine how the portfolio would have performed in the past given different weights assigned to each stock. The sum of the weights must of course be 1 and there is an additional restriction: each stock must have a weight between 1/24 and 1/4.
Although I am able to generate random numbers such that they all fall within the interval by using random.uniform(), as well as guarantee their sum is 1 by dividing each weighting by the sum of the weightings, I'm finding that
a) each subsequent array of weightings is very similar. I am rarely getting values for weightings that are near the upper boundary of 1/4
b) random.seed() does not seem to be working properly, whether I put it in the randweight() function or at the beginning of the for loop. I'm confused as to why because I thought that generating a random seed value would make my array of weights unique for each iteration. Currently, it's cyclical, with a period of 3.
The following is my code:
import random
import numpy as np

# boundaries on weightings
n = 12
min_weight = (1/(2*n))
max_weight = 25 / 100

def rand_weight(e):
    random.seed()
    return e + np.random.uniform(min_weight, max_weight)

for i in range(100):
    weights = np.empty(12)
    while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
        weights = np.array(list(map(rand_weight, weights)))
        weights /= np.sum(weights)
I have already tried scattering the weights by changing the min_weight and max_weight inside the for loop so that rand_weight generates newer values, but this makes the runtime really slow because the "not" condition in the while loop takes longer to evaluate to false (since the probability of all the numbers being in the range decreases).
Let's start with simple facts first. If you want 12 i.i.d. numbers in the range [0.042, 0.25] that sum to one, then for the mean value:
Sum(Xi) = 1
E[Sum(Xi)] = Sum(E[Xi]) = N*E[Xi] = 1
E[Xi] = 1/N = 1/12 ≈ 0.083
One corollary is that it will be hard to get numbers close to the upper boundary of the range.
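A quick numeric check of this corollary (my own sketch, sampling uniformly on the simplex as discussed below):
import numpy as np

# Sample many sets of 12 non-negative weights summing to 1 and look at how
# the individual weights are distributed.
rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(12), size=100_000)   # uniform on the simplex

print(w.sum(axis=1)[:3])     # each row sums to 1
print(w.mean())              # ~0.0833 == 1/12
print(np.quantile(w, 0.95))  # ~0.24: 95% of individual weights fall below this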
And instead of sampling arbitrary values and then normalizing them so the sum is 1, it is better to use a known distribution whose values sum to 1 to begin with.
So let's use the Dirichlet distribution and sample points uniformly in the simplex, which means the alpha (concentration) vector is all ones.
import numpy as np
N = 12
s = np.random.dirichlet(N*[1.0], 1)
print(np.sum(s))
Some values will be too large (or too small), and you can reject those samples:
def sampleWeights(alpha, lo, hi):
    while True:
        s = np.random.dirichlet(alpha, 1)[0]
        if np.any(s > hi):
            continue  # reject
        if np.any(s < lo):
            continue  # reject
        return s  # accept
and call it like this
N=12
alpha = N*[1.0]
q = sampleWeights(alpha, 1./24., 1./4.)
If you check, you will find that most rejections happen at the lower bound rather than the upper bound.
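You can confirm this with a quick tally (my own sketch, not part of the answer) of which bound raw Dirichlet draws violate:
import numpy as np

# For raw Dirichlet(1,...,1) draws, how often is each bound violated?
lo, hi = 1./24., 1./4.
rng = np.random.default_rng(0)
s = rng.dirichlet(np.ones(12), size=10_000)

print("below lo:", np.any(s < lo, axis=1).mean())   # fraction of draws with some weight < 1/24
print("above hi:", np.any(s > hi, axis=1).mean())   # fraction of draws with some weight > 1/4
print("accepted:", (np.all(s >= lo, axis=1) & np.all(s <= hi, axis=1)).mean())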
The beauty of using a known Dirichlet distribution is that you can "concentrate" the sampled values around the mean, e.g.
alpha = N*[10.0]
q = sampleWeights(alpha, 1./24., 1./4.)
will produce i.i.d. values with the same mean of 1/12 but a much smaller standard deviation; the RVs are much more concentrated around the mean.
And if you want non-identically distributed RVs, use different alphas
alpha = [1.,2.,3.,4.,5.,6.,6.,5.,4.,3.,2.,1.]
q = sampleWeights(alpha, 1./24., 1./4.)
then some of the RVs will be close to the upper boundary and some close to the lower boundary. There are many advantages to using a known distribution.
The following works. What was particularly confusing to me is that np.empty(12) seemed to always return the same array, so once it had been initialized it stayed the same.
This seems to produce numbers above 0.22 reasonably often.
import numpy as np
from random import random, seed

# boundaries on weightings
n = 12
min_weight = (1/(2*n))
max_weight = 25 / 100

seed(666)
for i in range(100):
    weights = np.zeros(n)
    # resample until every weight lies strictly inside the bounds
    while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
        weights = np.array([random() for _ in range(n)])
        weights /= np.sum(weights) - min_weight * n
        weights += min_weight
    print(weights)
I have two vectors rev_count and stars. The elements of those form pairs (let's say rev_count is the x coordinate and stars is the y coordinate).
I would like to bin the data by rev_count and then average the stars in a single rev_count bin (I want to bin along the x axis and compute the average y coordinate in that bin).
This is the code that I tried to use (inspired by my matlab background):
import matplotlib.pyplot as plt
import numpy
binwidth = numpy.max(rev_count)/10
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)
for i in range(0, len(revbin)-1):
    revbinnedstars[i] = numpy.mean(stars[numpy.argwhere((revbin[i]-binwidth/2) < rev_count < (revbin[i]+binwidth/2))])
print('Plotting binned stars with count')
plt.figure(3)
plt.plot(revbin, revbinnedstars, '.')
plt.show()
However, this seems to be incredibly slow/inefficient. Is there a more natural way to do this in python?
Scipy has a function for this:
from scipy.stats import binned_statistic
revbinnedstars, edges, _ = binned_statistic(rev_count, stars, 'mean', bins=10)
revbin = edges[:-1]
If you don't want to use scipy there's also a histogram function in numpy:
sums, edges = numpy.histogram(rev_count, bins=10, weights=stars)
counts, _ = numpy.histogram(rev_count, bins=10)
revbinnedstars = sums / counts
I suppose you are using Python 2, but if not you should change the division when calculating the step to // (floor division); otherwise range will complain that it cannot interpret floats as a step.
binwidth = numpy.max(rev_count)//10 # Changed this to floor division
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)

for i in range(0, len(revbin)-1):
    # I actually don't know what you wanted to do but I guess you wanted the
    # "logical and" combination in that bin (you don't need to use np.where here).
    # You can put that all in one statement but it gets crowded so I'll split it:
    index1 = revbin[i] - binwidth/2 < rev_count
    index2 = rev_count < revbin[i] + binwidth/2
    revbinnedstars[i] = numpy.mean(stars[numpy.logical_and(index1, index2)])
That at least should work and gives the right results. It will be very inefficient if you have huge datasets and want more than 10 bins.
One very important takeaway:
Don't use np.argwhere if you want to index an array. That result is just supposed to be human readable. If you really want the coordinates use np.where. That can be used as index but isn't that pretty to read if you have multidimensional inputs.
The numpy documentation supports me on that point:
The output of argwhere is not suitable for indexing arrays. For this purpose use where(a) instead.
That's also the reason why your code was so slow: it tried to do something you don't want it to do, which can be very expensive in memory and CPU usage, without giving you the right result.
What I have used here are called boolean masks. They are shorter to write than np.where(condition) and involve one less calculation.
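For illustration, here is a tiny standalone example (my own, not from the code above) showing that a boolean mask and np.where select the same elements; the mask just skips the intermediate index arrays:
import numpy as np

a = np.array([3, 7, 1, 9, 4])
mask = a > 3                 # boolean mask
print(a[mask])               # [7 9 4]
print(a[np.where(mask)])     # same result, but np.where builds index arrays first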
A completely vectorized approach could be used by defining a grid that knows which stars are in which bin:
bins = 10
binwidth = numpy.max(rev_count)//bins
revbin = np.arange(0, np.max(rev_count)+binwidth+1, binwidth)
An even better approach for defining the bins would be the following. Beware that you have to add one to the maximum, since you want to include it, and one to the number of bins, because you are interested in the bin start and end points, not the centers of the bins:
number_of_bins = 10
revbin = np.linspace(np.min(rev_count), np.max(rev_count)+1, number_of_bins+1)
and then you can setup the grid:
grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None], rev_count[None, :] < revbin[1:, None])
The grid is bins x rev_count big (because of the broadcasting: I increased the dimensions of each of those arrays by one, but along different axes). This essentially checks if a point is bigger than the lower bin edge and smaller than the upper bin edge (hence the [:-1] and [1:] indices). This is done multidimensionally, with the counts in the second dimension (numpy axis=1) and the bins in the first dimension (numpy axis=0).
So we can get the Y coordinates of the stars in the appropriate bin by just multiplying these with this grid:
stars * grid
To calculate the mean we need the sum of the coordinates in this bin and divide it by the number of stars in that bin (bins are along the axis=1, stars that are not in this bin only have a value of zero along this axis):
revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)
I actually don't know if that's more efficient. It'll be a lot more expensive in memory but maybe a bit less expensive in CPU.
The function I use for binning (x,y) data and determining summary statistics such as mean values in those bins is based upon the scipy.stats.binned_statistic() function. I have written a wrapper for it, because I use it a lot. You may find this useful...
from scipy import stats

def binXY(x, y, statistic='mean', xbins=10, xrange=None):
    """
    Finds statistical value of x and y values in each x bin.
    Returns the same type of statistic for both x and y.
    See scipy.stats.binned_statistic() for options.

    Parameters
    ----------
    x : array
        x values.
    y : array
        y values.
    statistic : string or callable, optional
        See documentation for scipy.stats.binned_statistic(). Default is mean.
    xbins : int or sequence of scalars, optional
        If xbins is an integer, it is the number of equal bins within xrange.
        If xbins is an array, then it is the location of xbin edges, similar
        to definitions used by np.histogram. Default is 10 bins.
        All but the last (righthand-most) bin is half-open. In other words, if
        bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but
        excluding 2) and the second [2, 3). The last bin, however, is [3, 4],
        which includes 4.
    xrange : (float, float) or [(float, float)], optional
        The lower and upper range of the bins. If not provided, range is
        simply (x.min(), x.max()). Values outside the range are ignored.

    Returns
    -------
    x_stat : array
        The x statistic (e.g. mean) in each bin.
    y_stat : array
        The y statistic (e.g. mean) in each bin.
    n : array of dtype int
        The count of y values in each bin.
    """
    x_stat, xbin_edges, binnumber = stats.binned_statistic(
        x, x, statistic=statistic, bins=xbins, range=xrange)
    y_stat, xbin_edges, binnumber = stats.binned_statistic(
        x, y, statistic=statistic, bins=xbins, range=xrange)
    n, xbin_edges, binnumber = stats.binned_statistic(
        x, y, statistic='count', bins=xbins, range=xrange)
    return x_stat, y_stat, n
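A hypothetical usage sketch of the binXY wrapper above (the rev_count and stars arrays here are made up, standing in for the data in the question):
import numpy as np

# Made-up data standing in for rev_count (x) and stars (y).
rng = np.random.default_rng(0)
rev_count = rng.integers(1, 500, size=1000)
stars = rng.uniform(1, 5, size=1000)

x_mean, y_mean, n = binXY(rev_count, stars, statistic='mean', xbins=10)
print(y_mean)  # average star rating in each review-count bin
print(n)       # number of points in each bin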
This is my code:
import numpy as np
from scipy.stats.kde import gaussian_kde
from scipy.stats import norm
from numpy import linspace,hstack
from pylab import plot,show,hist
import re
import json
attribute_file="path"
attribute_values = [line.rstrip('\n') for line in open(attribute_file)]
obs=[]
#Assume the list obs as loaded
obs=np.asarray(osservazioni)
obs=np.sort(obs,kind='mergesort')
x_min=osservazioni[0]
x_max=osservazioni[len(obs)-1]
# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(obs)
# plotting the result
x = linspace(0,x_max,1000)
plot(x,my_pdf(x),'r') # distribution function
hist(obs,normed=1,alpha=.3) # histogram
show()
new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
for e in new_values:
    print(str(e)+" - "+str(my_pdf(e)*100*2))
Problem:
The obs array contains the list of all observations.
I need to calculate a score (between 0 and 1) for the new values
[-1, 0, 2, 3, 4, 500, 768]
So the value -1 should still get a decent score: even though it doesn't appear in the observations, it is next to the value 1, which is very common there.
The reason for that is that you have many more 1's in your observations than 768's. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.
Up to a multiplicative constant, the formula for the prediction at a point x is:
f(x) ∝ Σ_{x_i in D} K((x - x_i) / h)
where K is your kernel, D your observations and h your bandwidth. Looking at the doc for gaussian_kde, we see that if no value is provided for bw_method, it is estimated in some way, which here doesn't suit you.
So you can try some different values: the larger the bandwidth, the more points far from your new data are taken into account, the limit case being an almost constant predicted function.
On the other hand, a very small bandwidth only takes really close points into account, which is what I think you want.
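To make the formula above concrete, here is a minimal standalone sketch (my own, with a Gaussian kernel and made-up observations rather than your data) of the un-normalized prediction and how the bandwidth changes it:
import numpy as np

# Un-normalized KDE prediction: sum over observations of K((x - x_i) / h),
# with a Gaussian kernel. obs is made-up data standing in for D.
obs = np.array([1.0] * 5 + [2.0] * 2 + [768.0])

def kde_unnormalized(x, obs, h):
    u = (x - obs) / h
    return np.exp(-0.5 * u ** 2).sum()

for h in (0.1, 1.0, 5.0):
    # larger bandwidth -> points far from x contribute more to the sum
    print(h, [round(kde_unnormalized(x, obs, h), 3) for x in (-1.0, 1.0, 768.0)])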
Some graphs to illustrate the influence of the bandwidth:
Code used:
import matplotlib.pyplot as plt

f, axarr = plt.subplots(2, 2, figsize=(10, 10))
for i, h in enumerate([0.01, 0.1, 1, 5]):
    my_pdf = gaussian_kde(osservazioni, h)
    axarr[i//2, i%2].plot(x, my_pdf(x), 'r')  # distribution function
    axarr[i//2, i%2].set_title("Bandwidth: {0}".format(h))
    axarr[i//2, i%2].hist(osservazioni, normed=1, alpha=.3)  # histogram
With your current code, for x = -1, the value of K((x-x_i)/h) for all x_i's that are equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 1s in your observations, and also 357 2s).
On the other hand for x = 768, the value of the kernel is 1 for all x_i's which are 768, but there are not many such points (39 to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.
If you don't want this behavior, you can decrease the size of your Gaussian kernel: this way the penalty (K(-2/h)) paid because of the distance between -1 and 1 will be higher. But I think that this would be overfitting your observations.
A formula to determine whether a new sample is acceptable (compared to your empirical distribution) or not is more of a statistics question; you can have a look at stats.stackexchange.com.
You can always try to use a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function, dividing it by its maximal value.
After that, all predicted values will be between 0 and 1:
maxDensityValue = np.max(my_pdf(x))
for e in new_values:
    print("{0} {1}".format(e, my_pdf(e)/maxDensityValue))
-1 and 0 are both very close to 1, which occurs very frequently, so they will be predicted to have a higher value. (This is why 0 has a higher value than -1: neither shows up, but 0 is closer to 1.) What you need is a smaller bandwidth. Look at the line in your graph to see this: right now, numbers that don't show up at all, even as far away as 80, are getting a lot of value because of their proximity to 1 and 2. Just pass a scalar as bw_method to achieve this:
my_pdf = gaussian_kde(osservazioni, 0.1)
This may not be the exact scalar you want but try changing 0.1 to 0.05 or even less and see what fits what you are looking for.
Also if you want a value between 0 and 1 you need to make sure that my_pdf() can never return a value over .005 because you are multiplying it by 200.
Here is what I mean:
for e in new_values:
    print(str(e)+" - "+str(my_pdf(e)*100*2))
The value you are outputting is:
my_pdf(e)*100*2 == my_pdf(e)*200
# You want the max value to be 1, so
1 >= my_pdf(e)*200
# Divide both sides by 200
0.005 >= my_pdf(e)
So my_pdf() needs to have a max value of 0.005, or you can just scale the data.
For the max value to be 1 and stay proportionate to the input, no matter the input, you would need to first collect the output and then scale it based on the largest value. Example:
orig_val = []  # Create intermediate list
for e in new_values:
    orig_val += [my_pdf(e)*100*2]  # Fill with the data

for i in range(len(new_values)):
    print(str(new_values[i])+" - "+str(orig_val[i]/max(orig_val)))  # Scale based on largest value
Learn more about the gaussian_kde here: scipy.stats.gaussian_kde
What function can I use in Python if I want to sample a truncated integer power law?
That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.
I've been searching around numpy.random, but I haven't found this distribution.
AFAIK, neither NumPy nor SciPy defines this distribution for you. However, using SciPy it is easy to define your own discrete distribution function using scipy.stats.rv_discrete:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))
a, m = 2, 10
d = truncated_power_law(a=a, m=m)
N = 10**4
sample = d.rvs(size=N)
plt.hist(sample, bins=np.arange(m)+0.5)
plt.show()
I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.
Setup:
Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.
Usage:
Generate your random values by generating a Uniform(0,1) and
searching through cdf[] until you find an entry greater than your
uniform. Return the index + 1 as your x-value.
Repeat for as many x-values as you want.
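A minimal Python translation of that description (my own sketch, since the answer is language-agnostic; it uses searchsorted instead of a linear scan, which finds the same entry):
import numpy as np

def sample_truncated_power_law(a, m, size=1, rng=None):
    # Brute-force discrete inversion, as described above (a sketch).
    rng = np.random.default_rng() if rng is None else rng
    # Setup: cumulative sums of 1/x**a for x = 1..m, scaled so the last entry is 1.
    cdf = np.cumsum(1.0 / np.arange(1, m + 1, dtype=float) ** a)
    cdf /= cdf[-1]
    # Usage: for each Uniform(0,1) draw, take the first CDF entry that exceeds it.
    u = rng.uniform(size=size)
    return np.searchsorted(cdf, u) + 1   # index + 1 is the x-value

print(sample_truncated_power_law(2, 10, size=20))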
For instance, with a,m = 2,10, I calculate the probabilities directly as:
[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]
and the CDF is:
[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]
When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.
If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.
Use numpy.random.zipf and just reject any samples greater than or equal to m
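For example, a minimal rejection sketch along those lines (my own helper, not from the answer):
import numpy as np

def zipf_truncated(a, m, size):
    # Keep drawing from the (untruncated) Zipf distribution and discard
    # samples that are >= m, so the result stays in the range [1, m).
    out = []
    while len(out) < size:
        draws = np.random.zipf(a, size)
        out.extend(draws[draws < m])
    return np.array(out[:size])

print(zipf_truncated(2.0, 10, 20))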
Note:
This is for a homework assignment in my data mining class.
I'm going to put relevant code snippets on this SO post, but you can find my entire program at http://pastebin.com/CzNFbLJ2
The dataset I'm using for this program can be found at http://archive.ics.uci.edu/ml/datasets/Iris
So I'm getting: RuntimeWarning: invalid value encountered in sqrt
return np.sqrt(m)
I am attempting to find the average Mahalanobis distance of the given iris dataset (for both raw and normalized datasets). The error is only happening on the normalized version of the dataset which is making me wonder if I have incorrectly understood what normalization means (both in code and mathematically).
I thought that normalization means that each component of a vector is divided by its vector length (causing the vector to add up to 1). I found this SO question How to normalize a 2-dimensional numpy array in python less verbose? and thought it matched up to my concept of normalization. But now my code is reporting that the Mahalanobis distance over the normalized dataset is NaN.
def mahalanobis(data):
    import numpy as np
    import scipy.spatial.distance

    avg = 0
    count = 0
    covar = np.cov(data, rowvar=0)
    invcovar = np.linalg.inv(covar)

    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if(j == len(data)):
                break
            avg += scipy.spatial.distance.mahalanobis(data[i], data[j], invcovar)
            count += 1

    return avg / count

def normalize(data):
    import numpy as np

    row_sums = data.sum(axis=1)
    norm_data = np.zeros((50, 4))
    for i, (row, row_sum) in enumerate(zip(data, row_sums)):
        norm_data[i,:] = row / row_sum
    return norm_data
Probably too late, but check out pages 64-65 in our textbook "Introduction to Data Mining". There's a section called "Normalization or Standardization", which explains the concept of normalized data that Hearne is looking for.
Basically, the standardized data set is x' = (x - mean(x)) / standardDeviation(x)
Since I see you're using Python, here's how to do it with NumPy:
normalizedData = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Source: http://mail.scipy.org/pipermail/numpy-discussion/2011-April/056023.html
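A quick sanity check of that standardization (a sketch with made-up data, not the iris set):
import numpy as np

# After standardization each column has mean ~0 and sample standard deviation ~1.
data = np.random.rand(50, 4)
normalizedData = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(normalizedData.mean(axis=0))         # ~[0, 0, 0, 0]
print(normalizedData.std(axis=0, ddof=1))  # [1, 1, 1, 1]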
You can use pdist() to do the distance calculation without a for loop:
from sklearn import datasets
iris = datasets.load_iris()
from scipy.spatial.distance import pdist, squareform
print(squareform(pdist(iris.data, 'mahalanobis')))
Normalization in this context probably does mean subtracting the mean and scaling so the data has a unit covariance matrix.
However, to scale every vector in your dataset to unit norm use: norm_data=data/np.sqrt(np.sum(data*data,1))[:,None].
You need to divide by the L2 norm of each vector, which means squaring the value of each element, then taking the square root of the sum. Broadcasting allows you to avoid explicitly coding the loop (see the answer to the question you cited: https://stackoverflow.com/a/8904762/1149913).
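For illustration, a short sketch of that L2 normalization with made-up rows, verifying that each row ends up with unit norm:
import numpy as np

data = np.array([[3.0, 4.0, 0.0, 0.0],
                 [1.0, 1.0, 1.0, 1.0]])   # made-up rows

# Divide each row by its L2 norm (square each element, sum, take the square root).
norm_data = data / np.sqrt(np.sum(data * data, 1))[:, None]

print(np.linalg.norm(norm_data, axis=1))  # [1. 1.] -- every row now has unit length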