Get pairs of positive values efficiently with numpy - python

I have a python function which takes in two lists, looks for pairs in the two inputs where both have positive values at the same index, and creates two output lists by appending to each one of those two positive values. I have a working function:
def get_pairs_in_first_quadrant(x_in, y_in):
    """If both x_in[i] and y_in[i] are > 0 then both are appended to the output lists. If either is negative
    then the pair is absent from the output lists.

    :param x_in: A list of positive or negative floats
    :param y_in: A list of positive or negative floats
    :return: Two lists of positive floats, each no longer than the inputs.
    """
    x_filtered, y_filtered = [], []
    for x, y in zip(x_in, y_in):
        if x > 0 and y > 0:
            x_filtered.append(x)
            y_filtered.append(y)
    return x_filtered, y_filtered
How can I make this faster using numpy?

You can do this by simply finding the indices where they are both positive:
import numpy as np

a = np.random.random(10) - .5
b = np.random.random(10) - .5

def get_pairs_in_first_quadrant(x_in, y_in):
    i = np.nonzero((x_in > 0) & (y_in > 0))  # main line of interest
    return x_in[i], y_in[i]

print(a)  # [-0.18012451 -0.40924713 -0.3788772 0.3186816 0.14811581 -0.04021951 -0.21278312 -0.36762629 -0.45369899 -0.46374929]
print(b)  # [ 0.33005969 -0.03167875 0.11387641 0.22101336 0.38412264 -0.3880842 0.08679424 0.3126209 -0.08760505 -0.40921421]
print(get_pairs_in_first_quadrant(a, b))  # (array([ 0.3186816 , 0.14811581]), array([ 0.22101336, 0.38412264]))
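If the inputs can also arrive as plain Python lists, as in the original function, a small variant that converts them first should work. This is a sketch I'm adding, not part of the original answer:
import numpy as np

def get_pairs_in_first_quadrant(x_in, y_in):
    x = np.asarray(x_in)        # accept plain lists as well as arrays
    y = np.asarray(y_in)
    keep = (x > 0) & (y > 0)    # boolean mask of pairs in the first quadrant
    return x[keep], y[keep]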
I was interested in Jaime's suggestion of just using boolean indexing without calling nonzero, so I ran some timing tests. The results are somewhat interesting since the advantage ratio is non-monotonic in the number of positive matches, but basically, at least for speed, it doesn't really matter which is used (though nonzero is usually a bit faster, and can be about twice as fast):
from timeit import timeit
import numpy as np

threshold = .6
a = np.random.random(10000) - threshold
b = np.random.random(10000) - threshold

def f1(x_in, y_in):
    i = np.nonzero((x_in > 0) & (y_in > 0))  # main line of interest
    return x_in[i], y_in[i]

def f2(x_in, y_in):
    i = (x_in > 0) & (y_in > 0)  # main line of interest
    return x_in[i], y_in[i]

print(threshold, len(f1(a, b)[0]), len(f2(a, b)[0]))
print(timeit("f1(a, b)", "from __main__ import a, b, f1, f2", number=1000))
print(timeit("f2(a, b)", "from __main__ import a, b, f1, f2", number=1000))
Which gives, for different threshold values:
0.05 9086 9086
0.0815141201019
0.104746818542
0.5 2535 2535
0.0715141296387
0.153401851654
0.95 21 21
0.027126789093
0.0324990749359
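The table above can be reproduced with a small module-level driver loop; this is a sketch I'm adding (it assumes the f1/f2 definitions and imports above), not part of the original answer:
for threshold in (.05, .5, .95):
    a = np.random.random(10000) - threshold
    b = np.random.random(10000) - threshold
    print(threshold, len(f1(a, b)[0]), len(f2(a, b)[0]))
    print(timeit("f1(a, b)", "from __main__ import a, b, f1, f2", number=1000))
    print(timeit("f2(a, b)", "from __main__ import a, b, f1, f2", number=1000))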

Related

Getting the minimum value by using lambda through numpy Array Python

The code below calculates compounded values starting from $100 and an array of percentage gains. It starts with the entire gains array [20, 3, 4, 55, 6.5, -10, 20, -60, 5], which compounds to 96.25 at the end, then drops the first element and recalculates with [3, 4, 55, 6.5, -10, 20, -60, 5], giving 80.20, and so on until only the last element [5] remains. I want to calculate the maximum drawdown while f is being calculated. The compounded values for the first iteration of f are [120., 123.6, 128.544, 199.243, 212.194008, 190.9746072, 229.16952864, 91.66781146, 96.25120203]. I want to record a value whenever it drops below the initial capital Amount. The lowest such value is 91.67 on the first iteration, so that would be the output; on the second iteration it would be 76.37. In the last iteration, [5] compounds to 105, so no value goes below 100 and the output is None. How would I implement this in the code below and get the expected output?
import numpy as np

Amount = 100

def moneyrisk(array):
    f = lambda array: Amount * np.cumprod(array / 100 + 1, 1)
    rep = array[None].repeat(len(array), 0)
    rep_t = np.triu(rep, k=0)
    final = f(rep_t)[:, -1]

gains = np.array([20, 3, 4, 55, 6.5, -10, 20, -60, 5])
Expected output:
[91.67, 76.37, 74.164, 71.312, 46.008, 43.2, 48., 40., None]
I think I've understood the requirement. Computing the compound factors after np.triu turns the zeros into ones, which means the min method returns a valid value.
import numpy as np

gains = np.array([20, 3, 4, 55, 6.5, -10, 20, -60, 5])  # Gains in %
amount = 100

def moneyrisk(arr):
    rep = arr[None].repeat(len(arr), 0)
    rep_t = np.triu(rep, k=0)
    rep_t = 1 + rep_t * .01                              # Create factors to compound in rep_t
    result = amount * rep_t.cumprod(axis=1).min(axis=1)  # Compound and find the min value
    return [x if x < amount else None for x in result]   # Set >= amount to None in a list, as numpy floats can't hold None

moneyrisk(gains)
# [91.667811456, 76.38984288, 74.164896, 71.3124, 46.008, 43.2, 48.0, 40.0, None]
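As a sanity check on the vectorised version, a plain-Python loop that computes the same per-start minimum might look like this; it is an illustrative sketch I'm adding, not part of the original answer:
def moneyrisk_loop(gains, amount=100):
    out = []
    for start in range(len(gains)):
        value, lowest = amount, amount
        for g in gains[start:]:
            value *= 1 + g / 100          # compound this gain
            lowest = min(lowest, value)   # track the running minimum
        out.append(lowest if lowest < amount else None)
    return out

moneyrisk_loop([20, 3, 4, 55, 6.5, -10, 20, -60, 5])
# should agree with moneyrisk(gains) above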

Unique ordered ratio of integers

I have two ordered lists of consecutive integers m = 0, 1, ..., M and n = 0, 1, 2, ..., N. Each value of m has a probability pm, and each value of n has a probability pn. I am trying to find the ordered list of unique values r = m/n and their probabilities pr. I am aware that r is infinite if n = 0 and can even be undefined if m = n = 0.
In practice, I would like to run with M and N each of the order of 2E4, meaning up to 4E8 values of r, which would be about 3 GB of floats (assuming 8 bytes per float).
For this calculation, I have written the python code below.
The idea is to iterate over m and n, and for each new m/n, insert it in the right place with its probability if it isn't there yet, otherwise add its probability to the existing number. My assumption is that it is easier to sort things on the way instead of waiting until the end.
The cases related to 0 are added at the end of the loop.
I am using the Fraction class since we are dealing with fractions.
The code also tracks the multiplicity of each unique value of m/n.
I have tested up to M=N=100, and things are quite slow. Are there better approaches to the question, or more efficient ways to tackle the code?
Timing:
M=N=30: 1 s
M=N=50: 6 s
M=N=80: 30 s
M=N=100: 82 s
import numpy as np
from fractions import Fraction
import time  # For timing

start_time = time.time()  # Timing

M, N = 6, 4
mList, nList = np.arange(1, M+1), np.arange(1, N+1)  # From 1 to M inclusive, deal with 0 later
mProbList, nProbList = [1/(M+1)]*(M), [1/(N+1)]*(N)  # Probabilities, here assumed equal (not general case)

# Deal with m, n = 0 later
pmZero, pnZero = 1/(M+1), 1/(N+1)   # P(m=0) and P(n=0)
pNaN = pmZero * pnZero              # P(0/0) = P(m=0)P(n=0)
pZero = pmZero * (1 - pnZero)       # P(0) = P(m=0)P(n!=0)
pInf = pnZero * (1 - pmZero)        # P(inf) = P(m!=0)P(n=0)

# Main list of r=m/n, P(r) and mult(r)
# Start with first line, m=1
rList = [Fraction(mList[0], n) for n in nList[::-1]]       # Smallest first
rProbList = [mProbList[0] * nP for nP in nProbList[::-1]]  # Start with first line
rMultList = [1] * len(rList)                               # Multiplicity of each element

# Main loop
for m, mP in zip(mList[1:], mProbList[1:]):
    for n, nP in zip(nList[::-1], nProbList[::-1]):  # Pick an n value
        r, rP, rMult = Fraction(m, n), mP*nP, 1
        for i in range(len(rList)-1):  # See where it fits in the existing list
            if r < rList[i]:
                rList.insert(i, r)
                rProbList.insert(i, rP)
                rMultList.insert(i, 1)
                break
            elif r == rList[i]:
                rProbList[i] += rP
                rMultList[i] += 1
                break
            elif r < rList[i+1]:
                rList.insert(i+1, r)
                rProbList.insert(i+1, rP)
                rMultList.insert(i+1, 1)
                break
            elif r == rList[i+1]:
                rProbList[i+1] += rP
                rMultList[i+1] += 1
                break
        if r > rList[-1]:
            rList.append(r)
            rProbList.append(rP)
            rMultList.append(1)
            break

# Deal with 0
rList.insert(0, Fraction(0, 1))
rProbList.insert(0, pZero)
rMultList.insert(0, N)
# Deal with infinity
rList.append(np.inf)
rProbList.append(pInf)
rMultList.append(M)
# Deal with the undefined case
rList.append(np.nan)
rProbList.append(pNaN)
rMultList.append(1)

print(".... done in %s seconds." % round(time.time() - start_time, 2))
print("************** Final list\nr", 'Prob', 'Mult')
for r, rP, rM in zip(rList, rProbList, rMultList):
    print(r, rP, rM)
print("************** Checks")
print("mList", mList, 'nList', nList)
print("Sum of proba = ", np.sum(rProbList))
print("Sum of multi = ", np.sum(rMultList), "\t(M+1)*(N+1) = ", (M+1)*(N+1))
Based on the suggestion of @Prune, and on this thread about merging lists of tuples, I have modified the code as below. It's a lot easier to read, and runs about an order of magnitude faster for N=M=80 (I have omitted dealing with 0 - that would be done the same way as in the original post). I assume there may be ways to tweak the merge and the conversion back to lists further still.
# Do calculations
data = [(Fraction(m, n), mProb(m) * nProb(n)) for n in range(1, N+1) for m in range(1, M+1)]
data.sort()

# Merge duplicates using a dictionary
d = {}
for r, p in data:
    if r not in d:
        d[r] = [0, 0]
    d[r][0] += p
    d[r][1] += 1

# Convert back to lists
rList, rProbList, rMultList = [], [], []
for k in d:
    rList.append(k)
    rProbList.append(d[k][0])
    rMultList.append(d[k][1])
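Since data is already sorted, the merge pass could equally be written with itertools.groupby; this is a sketch of that alternative which I'm adding, not code from the original post:
from itertools import groupby

rList, rProbList, rMultList = [], [], []
for r, group in groupby(data, key=lambda t: t[0]):  # consecutive equal ratios
    probs = [p for _, p in group]
    rList.append(r)
    rProbList.append(sum(probs))
    rMultList.append(len(probs))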
I expect that "things are quite slow" because you've chosen a known inefficient sort. A single list insertion is O(K) (later list elements have to be bumped over, and storage has to be reallocated regularly), so a full insertion sort over the list is O(K^2). In your notation, that is O((M*N)^2).
If you want any sort of reasonable performance, research and use the best-known methods. The most straightforward way is to build your non-exception results with a simple list comprehension and use the built-in sort on the resulting list. Then simply append your n=0 cases, and you're done in O(K log K) time.
In the expression below, I've assumed functions for the m and n probabilities.
This is a notational convenience; you know how to compute them directly, and can substitute those expressions if you wish.
data = [(mProb(m) * nProb(n), Fraction(m, n))
        for n in range(1, N+1)
        for m in range(0, M+1)]
data.sort()
data.extend([
    # generate your "zero" cases here
])

How to speed up an N dimensional interval tree in python?

Consider the following problem: Given a set of n intervals and a set of m floating-point numbers, determine, for each floating-point number, the subset of intervals that contain the floating-point number.
This problem has been addressed by constructing an interval tree (also called a range tree or segment tree). Implementations exist for the one-dimensional case, e.g. Python's intervaltree package. Usually, these implementations consider only one or a few floating-point numbers, i.e. a small m above.
In my problem setting, both n and m are extremely large (this comes from an image processing problem). Further, I need to consider N-dimensional intervals (cuboids when N=3, since I am modeling human brains with the Finite Element Method). I have implemented a simple N-dimensional interval tree in Python, but it runs in a loop and can only take one floating-point number at a time. Can anyone help improve the implementation's efficiency? You can change the data structure freely.
import sys
import time
import numpy as np

# find the indices of a satisfying x > a in one dimension
def find_index_smaller(a, x):
    idx = np.argsort(a)
    ss = np.searchsorted(a, x, sorter=idx)
    res = idx[0:ss]
    return res

# find the indices of a satisfying x < a in one dimension
def find_index_larger(a, x):
    return find_index_smaller(-a, -x)

# find the indices of a satisfying amin < x < amax in one dimension
def find_intv_at(amin, amax, x):
    idx = find_index_smaller(amin, x)
    idx2 = find_index_larger(amax[idx], x)
    res = idx[idx2]
    return res

# find the indices of a satisfying amin < x < amax in N dimensions
def find_intv_at_nd(amin, amax, x):
    dim = amin.shape[0]
    res = np.arange(amin.shape[-1])
    for i in range(dim):
        idx = find_intv_at(amin[i, res], amax[i, res], x[i])
        res = res[idx]
    return res
I also have two test examples for sanity check and performance testing:
def demo1():
    print("By default, we do a correctness test")
    n_intv = 2
    n_point = 2
    # generate the test data
    point = np.random.rand(3, n_point)
    intv_min = np.random.rand(3, n_intv)
    intv_max = intv_min + np.random.rand(3, n_intv) * 8
    print("point")
    print(point)
    print("intv_min")
    print(intv_min)
    print("intv_max")
    print(intv_max)
    print("===Indexes of intervals that contain the point===")
    for i in range(n_point):
        print(find_intv_at_nd(intv_min, intv_max, point[:, i]))

def demo2():
    print("Performance:")
    n_points = 100
    n_intv = 1000000
    # generate the test data
    points = np.random.rand(n_points, 3) * 512
    intv_min = np.random.rand(3, n_intv) * 512
    intv_max = intv_min + np.random.rand(3, n_intv) * 8
    print("point.shape = " + str(points.shape))
    print("intv_min.shape = " + str(intv_min.shape))
    print("intv_max.shape = " + str(intv_max.shape))
    starttime = time.time()
    for point in points:
        tmp = find_intv_at_nd(intv_min, intv_max, point)
    print("it took this long to run {} points, with {} intervals: {}".format(n_points, n_intv, time.time() - starttime))
My idea would be:
Remove np.argsort() from the algo, because the interval tree does not change, so sorting could have been done in pre-processing.
Vectorize x. The algo runs a loop for each x. It would be nice if we can get rid of the loop over x.
Any contribution would be appreciated.
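To illustrate the second idea (vectorising over x), a brute-force broadcasting version is sketched below. This is my addition, not code from the question or any answer, and it trades memory (a dim x n_intv x n_points boolean array) for the Python loop:
import numpy as np

def find_intv_at_nd_batch(amin, amax, xs):
    # amin, amax: (dim, n_intv); xs: (dim, n_points)
    inside = (amin[:, :, None] < xs[:, None, :]) & (xs[:, None, :] < amax[:, :, None])
    hit = inside.all(axis=0)  # (n_intv, n_points): interval i contains point j
    return [np.nonzero(hit[:, j])[0] for j in range(xs.shape[1])]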

python: divide list into equal parts and add samples in each part together

The following is my script. Each equal part has self.number samples, and in0 is the input sample list. It raises the following error:
pn[i] = pn[i] + d
IndexError: list index out of range
Is this a problem with the size of pn? How can I define a list with a certain size but without assigning exact values to it yet?
for i in range(0, len(in0)/self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
    if pn[i] >= self.alpha:
        out[i] = 1
    elif pn[i] <= self.beta:
        out[i] = 0
    else:
        if pn[i] >= self.noise:
            out[i] = 1
        else:
            out[i] = 0
    if pn[i] >= self.noise:
        out[i] = 1
    else:
        out[i] = 0
There are a number of problems in the code as posted; however, the gist seems to be something you'd want to do with numpy arrays rather than by iterating over lists.
For example, the set of if/else cases that check whether pn[i] >= some_value and then set a corresponding entry in another list with the result (true/false) can be done as a one-liner with an array operation, much faster than iterating over lists.
import numpy as np
# for example, assuming you have 9 numbers in your list
# and you want them divided into 3 sublists of 3 values each
# in0 is your original list, which for example might be:
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]
# convert into array
in2 = np.array(in0)
# reshape to 3 rows, the -1 means that numpy will figure out
# what the second dimension must be.
in2 = in2.reshape((3,-1))
print(in2)
output:
[[ 1.05 -0.45 -0.63]
 [ 0.07 -0.71  0.72]
 [-0.12 -1.56 -1.92]]
With this 2-d array structure, element-wise summing is super easy. So is element-wise threshold checking. Plus 'vectorizing' these operations has big speed advantages if you are working with large data.
# add corresponding entries, we want to add the columns together,
# as each row should correspond to your sub-lists.
pn = in2.sum(axis=0) # you can sum row-wise or column-wise, or all elements
print(pn)
output: [ 1. -2.72 -1.83]
# it is also trivial to check the threshold conditions
# here I check each entry in pn against a scalar
alpha = 0.0
out1 = ( pn >= alpha )
print(out1)
output: [ True False False]
# you can easily convert booleans to 1/0
x = out1.astype('int') # or simply out1 * 1
print(x)
output: [1 0 0]
# if you have a list of element-wise thresholds
beta = np.array([0.0, 0.5, -2.0])
out2 = (pn >= beta)
print(out2)
output: [True False True]
I hope this helps. Using the correct data structures for your task can make the analysis much easier and faster. There is a wealth of documentation on numpy, which is the standard numeric library for python.
You initialize pn to an empty list just inside the for loop, never assign anything into it, and then attempt to access an index i. There is nothing at index i because there is nothing at any index in pn yet.
for i in range(0, len(in0) / self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
If you are trying to add the value d to the pn list, you should do this instead:
pn.append(d)
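Putting that fix together with the thresholding the question describes, a plain-Python version might look like the sketch below. The function name and the alpha/beta/noise parameters are stand-ins for the attributes of the original class, not code from either answer:
def chunk_flags(in0, number, alpha, beta, noise):
    out = []
    for i in range(len(in0) // number):                       # one pass per equal part
        chunk_sum = sum(in0[i * number:(i + 1) * number])     # running total instead of pn[i]
        if chunk_sum >= alpha:
            out.append(1)
        elif chunk_sum <= beta:
            out.append(0)
        else:
            out.append(1 if chunk_sum >= noise else 0)
    return out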

How to use python to find matching data points in two arrays and return a third?

I am new to Python, and am trying to create a third array from two others. I have two variables (X and Y), both related to depth, but not on exactly the same depth points. I want to go through the depth values associated with X and find the ones in array Y whose depth is within 50 cm of the depth for X, then return the depth and the Y value in a third array.
I thought 'for' loops might do this, but I don't know how.
Code:
A = np.genfromtxt('file.txt', names=True)
B = np.genfromtxt('file2.txt', names=True)
Depth1 = A['Depth']
X = A['variable1']
Depth2 = B['Depth']
Y = B['number']
A contains 806 lines, B contains 456.
I want to filter through A and extract the values (both depth and X) that correspond to within 50cm of each depth point in B, preferably into another array.
How can I do this? I have found things searching online that cover lists with the for loop, but not arrays.
Sample Data:
A = [(0.6, 1.463), (0.95, 1.468), (1.7, 1.465), (2.5, 1.502), (265.38, 1.715), ... (Depth1, X)]
B = [(0.58, 0.726), (0.93, 0.688), (1.69, 0.713), (2.48, 0.606), ... (Depth2, Y)]
Sample Output:
C = [(0.58, 1.463), (0.93, 1.468), (1.69, 1.465), ... (Depth2, X)]
depths = [a[(i-50. <= a) & (a <= i+50.)] for i in b]
Edit: in response to a comment, that's not what's happening. a and b are numpy arrays; i-50. <= a evaluates to a flag array, with True in each position where the value is >= i-50., and a[flagarray] returns just the entries for which the flag array is True. The & combines the two flag arrays in order to pull only the values of interest. Hope that helps.
Edit2: something like
result = []
for i, n in zip(depth2, y):
    mask = (i-50. <= a) & (a <= i+50.)
    result.append((n, depth1[mask], x[mask]))
Edit3: it looks like, for each B depth, you want a single value - the label for the nearest corresponding A depth?
import numpy as np
a = np.array([[0.6, 1.463], [0.95, 1.468], [1.7, 1.465], [2.5, 1.502], [265.38, 1.715]])
b = np.array([[0.58, 0.726], [0.93, 0.688], [1.69, 0.713], [2.48, 0.606]])
d1 = a[:,0]
x = a[:,1]
d2 = b[:,0]
y = b[:,1]
def find_index_of_nearest_value(array, value):
    return np.abs(array - value).argmin()

c = [(d, x[find_index_of_nearest_value(d1, d)]) for d, y in b]
results in
[(0.58, 1.463), (0.93, 1.468), (1.69, 1.465), (2.48, 1.502)]
This could be sped up by sorting the depth-arrays and walking through them in ascending order - but for less than 1000 values, this should be plenty fast enough.
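For larger arrays, the sorted-walk idea can be approximated with np.searchsorted. The following is a sketch I'm adding (it reuses the d1, d2, x names from above), not part of the original answer:
order = np.argsort(d1)                       # sort the A depths once
d1_sorted = d1[order]
pos = np.searchsorted(d1_sorted, d2)         # insertion points for each B depth
pos = np.clip(pos, 1, len(d1_sorted) - 1)
left, right = d1_sorted[pos - 1], d1_sorted[pos]
nearest = np.where(d2 - left <= right - d2, pos - 1, pos)  # pick the closer neighbour
c = list(zip(d2, x[order][nearest]))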
You could use the piecewise function. For instance, to keep all items in an array that are at least 4 and at most 6, you could do something like:
import numpy

n = numpy.array(range(10))
n = numpy.piecewise(n, [n < 4, n > 6], [0, 0, lambda x: x])  # values outside [4, 6] become 0
n = numpy.sort(n)[::-1]                                      # sort descending
This approach doesn't filter the results; instead it sets the unwanted entries to zero. It has the advantage of staying within numpy, which results in better performance.
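The same zero-out-and-sort effect can be had with a plain boolean mask and np.where; a short sketch I'm adding for comparison, not part of the original answer:
import numpy as np

n = np.arange(10)
kept = np.where((n >= 4) & (n <= 6), n, 0)   # zero out entries outside [4, 6]
result = np.sort(kept)[::-1]                 # sort descending, as above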
Code:
A = [(0.6, 1.463), (0.95, 1.468), (1.7, 1.465), (2.5, 1.502), (265.38, 1.715)]
B = [(0.58, 0.726), (0.93, 0.688), (1.69, 0.713), (2.48, 0.606)]

tolerance = 0.5
print("\nTolerance: {0}\n".format(tolerance))
for b in B:
    print("B value: {0}".format(b))
    a_vals = [a for a in A if (b[0] + tolerance) > a[0] > (b[0] - tolerance)]
    print("  A values: {0}".format(a_vals))
Output:
Tolerance: 0.5
B value: (0.58, 0.726)
A values: [(0.6, 1.463), (0.95, 1.468)]
B value: (0.93, 0.688)
A values: [(0.6, 1.463), (0.95, 1.468)]
B value: (1.69, 0.713)
A values: [(1.7, 1.465)]
B value: (2.48, 0.606)
A values: [(2.5, 1.502)]
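The same tolerance filtering can also be vectorised with numpy broadcasting; this is a sketch I'm adding for comparison (it reuses the A, B, and tolerance defined above), not part of the original answer:
import numpy as np

A_arr = np.array(A)   # columns: depth, X
B_arr = np.array(B)   # columns: depth, Y
# (len(B), len(A)) mask: True where an A depth is within tolerance of a B depth
close = np.abs(B_arr[:, 0][:, None] - A_arr[:, 0][None, :]) < tolerance
matches = [A_arr[row] for row in close]   # one array of (depth, X) rows per B entry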
