I am trying to implement Hampel tanh estimators to normalize highly asymmetric data. To do this, I need to perform the following calculation:
Given x, a sorted list of numbers, and m, the median of x, I need to find a value a such that approximately 70% of the values in x fall into the range (m-a, m+a). We know nothing about the distribution of values in x. I am writing in Python using numpy, and the best idea I have had is some sort of stochastic iterative search (for example, as described by Solis and Wets), but I suspect there is a better approach, either a better algorithm or a ready-made function. I searched the numpy and scipy documentation but couldn't find any useful hint.
EDIT
Seth suggested using scipy.stats.mstats.trimboth; however, in my test on a skewed distribution this suggestion didn't work:
from scipy.stats.mstats import trimboth
import numpy as np
theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)
trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5
#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))
print(np.sum(sel) / float(len(theList)))
The output is 0.79 (~80% instead of the required 70%).
You need to first symmetrize your distribution by folding all values less than the mean over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution:
from scipy.stats import scoreatpercentile
import numpy as np
theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)
oneSidedList = theList.copy() # copy the original list (slicing a numpy array would only give a view)
# fold over to the right all values left of the median
oneSidedList[theList < theMedian] = 2*theMedian - theList[theList < theMedian]
# find the 70th centile of the one-sided distribution
a = scoreatpercentile(oneSidedList, 70) - theMedian
#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))
print(np.sum(sel) / float(len(theList)))
This gives the result of 0.7 as required.
Restate the problem slightly. You know the length of the list, and what fraction of the numbers in the list to consider. Given that, you can determine the difference between the first and last indices in the list that give you the desired range. The goal then is to find the indices that will minimize a cost function corresponding to the desired symmetric values about the median.
Let the smaller index be n1 and the larger index be n2; these are not independent. The values from the list at the indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and thus n2) so that b and c are as close as possible. This occurs when (b - c)**2 is minimal. That's pretty easy using numpy.argmin. Paralleling the example in the question, here's an interactive session illustrating the approach:
$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> theList = np.log10(1+np.arange(.1, 100))
>>> theMedian = np.median(theList)
>>> listHead = theList[0:30]
>>> listTail = theList[-30:]
>>> b = np.abs(listHead - theMedian)
>>> c = np.abs(listTail - theMedian)
>>> squaredDiff = (b - c) ** 2
>>> np.argmin(squaredDiff)
25
>>> listHead[25] - theMedian, listTail[25] - theMedian
(-0.2874888056626983, 0.27859407466756614)
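Following on from that session, here is a minimal sketch of my own (not part of the original answer) that turns the argmin result into the half-width a and checks the coverage; the 30-element head and tail windows assume the 100-element example list with 70% kept:
import numpy as np

theList = np.log10(1 + np.arange(.1, 100))
theMedian = np.median(theList)
nOut = len(theList) - int(round(0.7 * len(theList)))   # samples to leave outside the window
b = np.abs(theList[:nOut] - theMedian)
c = np.abs(theList[-nOut:] - theMedian)
i = np.argmin((b - c) ** 2)
a = max(b[i], c[i])   # half-width of the symmetric interval around the median
sel = (theList > theMedian - a) & (theList < theMedian + a)
print(np.sum(sel) / float(len(theList)))   # should land close to 0.7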
What you want is scipy.stats.mstats.trimboth. Set proportiontocut=0.15. After trimming, take (max-min)/2.
Related
So I feel like I might have coded myself into a corner -- but here I am.
I have created a dictionary of arrays (well, specifically ascii Columns) because I needed to create five arrays performing the same calculation on an array with five different parameters (the calculation involved multiplying arrays and one of five arbitrary constants).
I now want to create an array where each element corresponds to the sum of the equivalent element from all five arrays. I'd rather not use the ugly for loop that I've created (it's also hard to check whether I'm getting the right answer with the loop).
Here is a modified snippet for testing!
import numpy as np
from astropy.table import Column
from pylab import *
# The five parameters for the Columns
n1 = [14.18,19.09,33.01,59.73,107.19,172.72] #uJy/beam
n2 = [14.99,19.04,32.90,59.99,106.61,184.06] #uJy/beam
n1 = np.array([x*1e-32 for x in n1]) #W/Hz
n2 = np.array([x*1e-32 for x in n2]) #W/Hz
# an example of the arrays being mathed upon
luminosity=np.array([2.393e+24,1.685e+24,2.264e+23,5.466e+22,3.857e+23,4.721e+23,1.818e+23,3.237e+23])
redshift = np.array([1.58,1.825,0.624,0.369,1.247,0.906,0.422,0.66])
field = np.array([True,True,False,True,False,True,False,False])
DMs = {}
for i in range(len(n1)):
    DMs['C{0}'.format(i)] = 0

for SC, SE, level in zip(n1, n2, DMs):
    DMmax = Column([1 for x in redshift], name='DMmax')
    DMmax[field] = (((1+redshift[field])**(-0.25))*(luminosity[field]/(4*pi*5*SE))**0.5)*3.24078e-23
    DMmax[~field] = (((1+redshift[~field])**(-0.25))*(luminosity[~field]/(4*pi*5*SC))**0.5)*3.24078e-23
    DMs[level] = DMmax
Thanks all!
Numpy was built for this! (provided all arrays are of the same shape)
Just add them, and numpy will move element-wise through the arrays. This also has the benefit of being orders of magnitude faster than using a for-loop in the Python layer.
Example:
>>> n1 = np.array([1,2,3])
>>> n2 = np.array([1,2,3])
>>> total = n1 + n2
>>> total
array([2,4,6])
>>> mask = np.array([True, False, True])
>>> n1[mask] ** n2[mask]
array([ 1, 27])
Edit: additional input
You might be able to do something like this:
SE_array = (((1+redshift[field]) ** (-0.25)) * (luminosity[field]/(4*pi*5*n1[field])) ** 0.5) * 3.24078e-23
SC_array = (((1+redshift[field]) ** (-0.25)) * (luminosity[field]/(4*pi*5*n2[field])) ** 0.5) * 3.24078e-23
and make the associations by stacking the new arrays:
DM = np.dstack((SE_array, SC_array))
reshaper = DM.shape[1:] # take from shape (1, 6, 2) to (6,2), where 6 is the length of the arrays
DM = DM.reshape(reshaper)
This will give you a 2d array like:
array([[SE_1, SC_1],
       [SE_2, SC_2]])
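As a toy check of that stacking step (using made-up arrays, not the question's data), the dstack/reshape combination pairs the elements row by row:
import numpy as np

SE_array = np.array([1.0, 2.0, 3.0])
SC_array = np.array([10.0, 20.0, 30.0])
DM = np.dstack((SE_array, SC_array))   # shape (1, 3, 2)
DM = DM.reshape(DM.shape[1:])          # shape (3, 2): rows of [SE_i, SC_i]
print(DM)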
Hope this is helpful
If you can't just add the numpy arrays, you can extract the creation of the composite array into a function.
def get_element(i):
    global n1, n2, luminosity, redshift, field
    return n1[i] + n2[i] + luminosity[i] + redshift[i] + field[i]

L = len(n1)
composite = [get_element(i) for i in range(L)]
The answer was staring me in the face, but thanks to #willnx, #cricket_007, and #andrew-lavq. Your suggestions made me realise how simple the solution is.
Just add them, and numpy will move element-wise through the arrays. -- willnx
You need a loop to sum all values of a collection -- cricket_007
so it really is as simple as
sum(x for x in DMs.values())
I'm not sure if this is the fastest solution, but I think it's the simplest.
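To make that concrete, here is a small self-contained sketch with made-up values (not the ones from the question) showing that summing the dict's arrays works element-wise; np.sum over the stacked values is an equivalent alternative:
import numpy as np

DMs = {'C0': np.array([1.0, 2.0, 3.0]),
       'C1': np.array([10.0, 20.0, 30.0]),
       'C2': np.array([100.0, 200.0, 300.0])}

total = sum(DMs.values())                      # element-wise: array([111., 222., 333.])
total_np = np.sum(list(DMs.values()), axis=0)  # equivalent, stays inside numpy
print(total, total_np)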
I'm studying Case-Based Reasoning algorithms, and I need to get the similarity of two numbers (integer or float).
For strings I'm using the Levenshtein lib and it handles them well, but I don't know of any Python lib to calculate the similarity of two numbers. Is there one out there?
Does anyone know?
The result should be between 0 (different) and 1 (perfect match), like Levenshtein.ratio().
#update 1:
Using Levenshtein.ratio we get the ratio of similarity of two strings: 0 means totally different, 1 an exact match, and anything between 0 and 1 is the coefficient of similarity.
Example:
>>> import Levenshtein
>>> Levenshtein.ratio("This is a test","This is a test with similarity")
0.6363636363636364
>>> Levenshtein.ratio("This is a test","This is another test")
0.8235294117647058
>>> Levenshtein.ratio("This is a test","This is a test")
1.0
>>>
I need something like that, but with numbers.
For example, 5 has n% similarity with 6, and 5.4 has n% similarity with 5.8.
I don't know if my example is clear.
#update 2:
Let me give a real-world example. Let's say I'm looking for similar versions of CentOS Linux distributions on a range of 100 servers. The CentOS Linux version numbers are something like 5.6, 5.7, 6.5. So, how close is the number 5.7 to 6.5? It's not very close; there are many versions (numbers) between them. But there is a coefficient of similarity, let's say 40% (or 0.4), using some similarity algorithm like Levenshtein.
#update 3:
I got the answer to this question. I'm posting it here to help more people:
>>> import math
>>> sum = 2.4 * 2.4
>>> sum2 = 7.5 * 7.5
>>> sum /math.sqrt(sum*sum2)
0.32
>>> sum = 7.4 * 7.4
>>> sum /math.sqrt(sum*sum2)
0.9866666666666666
>>> sum = 7.5 * 7.5
>>> sum /math.sqrt(sum*sum2)
1.0
To calculate the similarity of 2 numbers (float or integer) I wrote a simple function:
def num_sim(n1, n2):
    """Calculate a similarity score between 2 numbers."""
    return 1 - abs(n1 - n2) / (n1 + n2)
It simply returns 1 if the numbers are exactly equal and tends toward 0 as they differ more.
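A quick usage sketch of my own (assuming Python 3 division and non-negative inputs, since the formula can misbehave for negatives or when n1 + n2 is zero):
print(num_sim(5, 6))      # approximately 0.909
print(num_sim(5.4, 5.8))  # approximately 0.964
print(num_sim(7.5, 7.5))  # 1.0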
From the link, I see that Ian Watson's slides show three options for assessing "similarity" of numbers. Of these, the "step function" option is readily available from numpy:
In [1]: from numpy import allclose
In [2]: a = 0.3 + 1e-9
In [3]: a == 0.3
Out[3]: False
In [4]: allclose(a, 0.3)
Out[4]: True
To get numeric output, as required for similarity, we make one change:
In [5]: int(a == 0.3)
Out[5]: 0
In [6]: int(allclose(a, 0.3))
Out[6]: 1
If preferred, float can be used in place of int:
In [8]: float(a == 0.3)
Out[8]: 0.0
In [9]: float(allclose(a, 0.3))
Out[9]: 1.0
allclose takes optional arguments rtol and atol so that you can specify, respectively, the relative and absolute tolerances to be used. See the numpy documentation on allclose for full details.
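For instance, here is a small sketch of my own (not from the original answer) showing how tightening the tolerances changes the verdict:
import numpy as np

a = 0.3 + 1e-9
print(np.allclose(a, 0.3))                        # True with the defaults (rtol=1e-05, atol=1e-08)
print(np.allclose(a, 0.3, rtol=0.0, atol=1e-12))  # False: the 1e-9 gap exceeds the tighter atol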
Does anyone know why the below doesn't equal 0?
import numpy as np
np.sin(np.radians(180))
or:
np.sin(np.pi)
When I enter it into Python it gives me 1.22e-16.
The number π cannot be represented exactly as a floating-point number. So, np.radians(180) doesn't give you π, it gives you 3.1415926535897931.
And sin(3.1415926535897931) is in fact something like 1.22e-16.
So, how do you deal with this?
You have to work out, or at least guess at, appropriate absolute and/or relative error bounds, and then instead of x == y, you write:
abs(y - x) < abs_bounds and abs(y - x) < rel_bounds * y
(This also means that you have to organize your computation so that the relative error is larger relative to y than to x. In your case, because y is the constant 0, that's trivial—just do it backward.)
Numpy provides a function that does this for you across a whole array, allclose:
np.allclose(x, y, rel_bounds, abs_bounds)
(This actually checks abs(y - x) < abs_bounds + rel_bounds * y, but that's almost always sufficient, and you can easily reorganize your code when it's not.)
In your case:
np.allclose(0, np.sin(np.radians(180)), rel_bounds, abs_bounds)
So, how do you know what the right bounds are? There's no way to teach you enough error analysis in an SO answer. Propagation of uncertainty at Wikipedia gives a high-level overview. If you really have no clue, you can use the defaults, which are 1e-5 relative and 1e-8 absolute.
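Applied to the original example, a quick sketch of my own using the default tolerances:
import numpy as np

x = np.sin(np.radians(180))   # about 1.22e-16, not exactly 0
print(x == 0)                 # False
print(np.allclose(0, x))      # True with the default rtol=1e-05, atol=1e-08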
One solution is to switch to sympy when calculating sines and cosines, then switch back to numeric values using the sp.N(...) function:
>>> # numpy: not exactly zero
>>> import numpy as np
>>> np.cos(np.pi/2)
6.123233995736766e-17
>>> # sympy workaround
>>> import sympy as sp
>>> def scos(x): return sp.N(sp.cos(x))
>>> def ssin(x): return sp.N(sp.sin(x))
>>> scos(sp.pi/2)
0
Just remember to use sp.pi instead of np.pi when using the scos and ssin functions.
I faced the same problem:
import math
import numpy as np
print(np.cos(math.radians(90)))
>> 6.123233995736766e-17
and tried this,
print(np.around(np.cos(math.radians(90)), decimals=5))
>> 0
This worked in my case. I set decimals=5 so as not to lose too much information; the rounding simply discards everything after the fifth decimal place.
Try this... it zeros anything below a given tiny-ness value...
import numpy as np
def zero_tiny(x, threshold):
    if (x.dtype == complex):
        x_real = x.real
        x_imag = x.imag
        if (np.abs(x_real) < threshold): x_real = 0
        if (np.abs(x_imag) < threshold): x_imag = 0
        return x_real + 1j*x_imag
    else:
        return x if (np.abs(x) > threshold) else 0
value = np.cos(np.pi/2)
print(value)
value = zero_tiny(value, 10e-10)
print(value)
value = np.exp(-1j*np.pi/2)
print(value)
value = zero_tiny(value, 10e-10)
print(value)
Python evaluates its trig functions using a series approximation such as the Taylor expansion, and since that expansion has infinitely many terms, the result is only ever an approximation.
For example:
sin(x) = x - x³/3! + x⁵/5! - ...
so sin(π) only approaches 0; it is never exactly 0.
That is my own reasoning.
Simple.
np.sin(np.pi).astype(int)
np.sin(np.pi/2).astype(int)
np.sin(3 * np.pi / 2).astype(int)
np.sin(2 * np.pi).astype(int)
returns
0
1
-1
0
I am using scipy and want to create an array of length n with a particular average.
Suppose I want a random array of length 3 with an average of 2.5; the possible options could be:
[1.5, 3.5, 2.5]
[.25, 7.2, .05]
and so on and so forth...
I need to create many such arrays with varying lengths and a different (specified) average for each, so a generalized solution would be welcome.
Just generate numbers over the range you want (0...10 in this case)
>>> import random
>>> nums = [10*random.random() for x in range(5)]
Work out the average
>>> sum(nums)/len(nums)
4.2315222659844824
Shift the average to where you want it
>>> nums = [x - 4.2315222659844824 + 2.5 for x in nums]
>>> nums
[-0.628013346633133, 4.628537956666447, -1.7219257458163257, 7.617565127420011, 2.6038360083629986]
>>> sum(nums)/len(nums)
2.4999999999999996
You can use whichever distribution/range you like. Shifting the average this way will always get you an average of 2.5 (or very close to it).
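Since the question asks for a generalized solution, here is a minimal sketch of the same shift trick wrapped in a function; random_with_mean is a hypothetical helper name, and the uniform 0-10 range is just an assumption:
import numpy as np

def random_with_mean(n, target_mean, low=0.0, high=10.0):
    # draw n uniform samples and shift them so their mean equals target_mean
    nums = np.random.uniform(low, high, n)
    return nums - nums.mean() + target_mean

arr = random_with_mean(3, 2.5)
print(arr, arr.mean())   # mean is 2.5 up to floating-point rounding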
You haven't specified what distribution you want.
It's also not clear whether you want the average of the actual array to be 2.5, or the amortized average over all arrays to be 2.5.
The simplest solution, three random numbers drawn uniformly from 0 to 2*avg, is this:
return 2*avg * np.random.rand(3)
If you want to guarantee that the average of the array is 2.5, that's a pretty simple constraint, but there are many different ways you could satisfy it, and you need to describe which way you want. For example:
n0 = random.random() * 2*avg
n1 = random.random() * (2*avg - n0)
n2 = random.random() * (2*avg - n0 - n1)
return np.array((n0, n1, n2))
I found a solution to the problem.
numpy.random.triangular(left, mode, right, size=None)
Visit: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.triangular.html#numpy.random.triangular
However, the minor problem is that it forces a triangular distribution on the samples.
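For reference, a short usage sketch of my own: the mean of a triangular distribution is (left + mode + right) / 3, so left=0, mode=2.5, right=5 gives an expected (though not exact) sample mean of 2.5.
import numpy as np

samples = np.random.triangular(0, 2.5, 5, size=3)
print(samples, samples.mean())   # the sample mean is only approximately 2.5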
My problem, explicitly, is
Z = sum_{i=1}^{12} x_i
where the x_i are random numbers.
I need explicit Python code to produce 12 random numbers and sum them all.
I tried writing code using if and while loops, but I could not get it to work.
I need your help...
In order to be able to use an arbitrary number of variables, just structure it as a function.
You can structure it similarly to l82Munch's answer, but this may be more readable for you since you're just starting. Note that range(i, j) produces the numbers from i up to, but not including, j, so range(1, 3) returns [1, 2].
import random

def rand_sum(i, j):
    sum_list = []
    for rand_num in range(i, j+1):
        # check the random docs for a function that returns a different
        # set of randoms if random.random() isn't appropriate
        sum_list.append(random.random())
    return sum(sum_list)
import random
rand_sum = sum( random.random() for x in range(12) )
See the random documentation for more info.
In probabilistic modelling, you can define distributions and then sum them.
Personally, I use OpenTURNS platform for that.
import openturns as ot
x1 = ot.Normal(0, 2) # Normal distribution mean = 0, std = 2
x2 = ot.Uniform(3, 5) # Uniform distribution between 3 and 5
the_sum = x1 + x2  # avoid naming this "sum", which would shadow the built-in used below
That's it.
If x1, ..., x12 are 12 identically distributed distributions, you can write:
sum_12 = sum([x1] * 12)