I'm trying to use numpy/pandas to construct a sliding-window-style comparator. I've got a list of lists, each of which is a different length. I want to compare each list to another list as depicted below:
lists = [[10,15,5],[5,10],[5]]
window_diff(lists[1], lists[0]) = 25
The window diff for lists[0] and lists[1] would give 25 using the following window-sliding technique. Because lists[1] is the shorter list, we shift it once to the right, resulting in 2 windows of comparison. Summing the absolute differences across both windows gives the total difference between the two lists; in this case a total of 25.
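Since the original image isn't reproduced here, a minimal plain-Python sketch of window_diff (a hypothetical helper matching the description above, taking the shorter list first):

def window_diff(short, long):
    # slide the shorter list across the longer one and sum absolute differences
    w = len(short)
    return sum(abs(short[k] - long[s + k])
               for s in range(len(long) - w + 1)
               for k in range(w))

# [5,10] vs [10,15,5]: window 1 gives |5-10| + |10-15| = 10,
# window 2 gives |5-15| + |10-5| = 15, for a total of 25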
The function should aggregate the total window_diff between each list and the other lists, so in this case
tot = total_diffs(lists)
tot>>[40, 30, 20]
# where tot[0] represents the sum of lists[0] window_diff with all other lists.
I wanted to know if there is a quick route to doing this in pandas or numpy. Currently I am using a very long-winded process of for-looping through each of the lists and then comparing element-wise, shifting the shorter list along the longer one.
My approach works fine for short lists, but my dataset is 10,000 lists long and some of these lists contain 60 or so datapoints, so speed is a criterion here. I was wondering if numpy or pandas has something that could help? Thanks
Sample problem data
import random
lists = [[random.randint(0,1000) for r in range(random.randint(0,60))] for x in range(100000)]
Steps:
1. Among each pair of lists from the input list of lists, create sliding windows over the bigger array and take the absolute difference against the smaller one in that pair. We can use NumPy strides to get those sliding windows.
2. Sum those differences and store the total as the pair-wise difference.
3. Finally, sum along each row and column of the 2D array from the previous step; their sum is the final output.
Thus, the implementation would look something like this -
import itertools
import numpy as np

def strided_app(a, L, S=1):  # Window len = L, Stride len/stepsize = S
    a = np.asarray(a)
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))
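As a quick sanity check of the helper, the windows for the first sample list look like this:

>>> strided_app([10, 15, 5], L=2)
array([[10, 15],
       [15,  5]])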
N = len(lists)
pair_diff_sums = np.zeros((N, N), dtype=type(lists[0][0]))
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A) > len(B):
        pair_diff_sums[i, j] = np.abs(strided_app(A, L=len(B)) - B).sum()
    else:
        pair_diff_sums[i, j] = np.abs(strided_app(B, L=len(A)) - A).sum()
out = pair_diff_sums.sum(1) + pair_diff_sums.sum(0)
For really heavy datasets, here's one method using one more level of looping -
N = len(lists)
out = np.zeros(N, dtype=type(lists[0][0]))
for k, i in enumerate(lists):
    for j in lists:
        if len(i) > len(j):
            out[k] += np.abs(strided_app(i, L=len(j)) - j).sum()
        else:
            out[k] += np.abs(strided_app(j, L=len(i)) - i).sum()
strided_app is inspired from here.
Sample input, output -
In [77]: lists
Out[77]: [[10, 15, 5], [5, 10], [5]]
In [78]: pair_diff_sums
Out[78]:
array([[ 0, 25, 15],
[25, 0, 5],
[15, 5, 0]])
In [79]: out
Out[79]: array([40, 30, 20])
Just for completeness of @Divakar's great answer and for its application to very large datasets:
import itertools

N = len(lists)
out = np.zeros(N, dtype=type(lists[0][0]))
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A) > len(B):
        diff = np.abs(strided_app(A, L=len(B)) - B).sum()
    else:
        diff = np.abs(strided_app(B, L=len(A)) - A).sum()
    out[i] += diff
    out[j] += diff
It does not create unnecessarily large intermediate arrays and updates a single output vector while iterating only over the upper triangle of the pair matrix.
It will still take a while to compute, as there is a tradeoff between computational complexity and larger-than-RAM datasets. Solutions for larger-than-RAM datasets often rely on iteration, and Python is not great at that: iterating in Python over a large dataset is slow, very slow.
Translating the code above to Cython could speed things up a bit.
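As an alternative sketch (not from the original answers), a JIT compiler such as Numba could compile the per-pair kernel, assuming each list is first converted to a NumPy array:

import numba

@numba.njit
def pair_diff_sum(longer, shorter):
    # sum of absolute differences over all windows of the longer array
    total = 0.0
    for s in range(len(longer) - len(shorter) + 1):
        for k in range(len(shorter)):
            total += abs(longer[s + k] - shorter[k])
    return total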
Related
I want to add two numpy arrays of different sizes starting at a specific index. As I need to do this a couple of thousand times with large arrays, this needs to be efficient, and I am not sure how to do it efficiently without iterating through each cell.
a = [5,10,15]
b = [0,0,10,10,10,0,0]
res = add_arrays(b,a,2)
print(res) => [0,0,15,20,25,0,0]
naive approach:
# b is the bigger array
def add_arrays(b, a, i):
    for j in range(len(a)):
        b[i+j] += a[j]
You might assign the smaller one into a zeros array and then add; I would do it the following way:
import numpy as np
a = np.array([5,10,15])
b = np.array([0,0,10,10,10,0,0])
z = np.zeros(b.shape,dtype=int)
z[2:2+len(a)] = a # 2 is offset
res = z+b
print(res)
output
[ 0 0 15 20 25 0 0]
Disclaimer: I assume that offset + len(a) is always less than or equal to len(b).
Nothing wrong with your approach. You cannot get better asymptotic time or space complexity. If you want to reduce lines of code (which is not an end in itself), you could use slice assignment and some other utils:
def add_arrays(b, a, i):
    b[i:i+len(a)] = map(sum, zip(b[i:i+len(a)], a))
But the functional overhead should make this less performant, if anything.
Some docs:
map
sum
zip
It should be faster than Daweo's answer, 1.5-5x (depending on the size ratio between a and b).
result = b.copy()
result[offset: offset+len(a)] += a
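For the sample arrays from the question this gives:

>>> import numpy as np
>>> a = np.array([5, 10, 15])
>>> b = np.array([0, 0, 10, 10, 10, 0, 0])
>>> offset = 2
>>> result = b.copy()
>>> result[offset: offset+len(a)] += a
>>> result
array([ 0,  0, 15, 20, 25,  0,  0])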
I have two lists of different lengths, holding the lower and upper bounds that I use to filter a nested list lst. To save space, I just copy part of the data 10 times. I want to iterate over both the upper and lower limits twice, and here is my attempt:
import numpy as np
import itertools
lst = [[4.256, 3.8518], [2.2121, 1.6064], [3.9662, 3.2433], [5.1571, 5.8898], [4.4909, 3.7328], [9.38, 10.2276], [4.8912, 5.846], [4.5729, 3.5768], [6.25, 5.2267], [3.1019, 4.1603], [7.7822, 14.9629], [4.7673, 12.1189]]
lst_long = lst * 10
lower_limit = np.arange(1, 3, 0.1).tolist()
upper_limit = np.arange(9, 12, 0.1).tolist()
def create_combo(a, b):
    for sublist1 in itertools.product(a, b):
        for sublist2 in itertools.product(a, b):
            yield sublist1[0], sublist1[1], sublist2[0], sublist2[1]

for lower1, upper1, lower2, upper2 in create_combo(lower_limit, upper_limit):
    filtered_list = [sublist for sublist in lst_long
                     if lower1 <= sublist[0] <= upper1 and lower2 <= sublist[1] <= upper2]
    x = [lst[0] for lst in filtered_list]
    y = [lst[1] for lst in filtered_list]
This code currently takes over 9 sec to run on my PC. As the ranges expand, the run time grows very quickly: create_combo yields (len(lower_limit) * len(upper_limit))**2 combinations, and the full list is filtered for each one. Therefore, I am looking for suggestions on how to conduct the iteration more efficiently. Is there any special feature from any package that could speed up the process?
Thank you.
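One possible direction (a sketch, not from the original thread): convert lst_long to a NumPy array once and replace the list comprehension with boolean masks, so the per-combination filtering becomes vectorized operations:

import numpy as np

arr = np.asarray(lst_long)  # shape (n, 2)
for lower1, upper1, lower2, upper2 in create_combo(lower_limit, upper_limit):
    mask = ((arr[:, 0] >= lower1) & (arr[:, 0] <= upper1) &
            (arr[:, 1] >= lower2) & (arr[:, 1] <= upper2))
    x = arr[mask, 0]
    y = arr[mask, 1]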
I'm having a homework assignment about airport flights, where at first I have to create the representation of a sparse matrix (i, j and values) for a 1000x1000 array from 10000 random numbers, with the following criteria:
i and j must be between 0-999 since are the rows and columns of array
values must be between 1.0-5.0
i must not be equal to j
each (i, j) pair must be present only once
The i is the departure airport, the j is the arrival airport and the values are the hours for the trip from i to j.
Then I have to find the round trips for an airport A with 2 to 8 maximum stops, based on the criteria above. For example:
A, D, F, G, A is a legal roundtrip with 4 stops
A, D, F, D, A is not a legal roundtrip since the D is visited twice
NOTE: the problem must be solved purely with python built-in libraries. No external libraries are accepted like scipy and numpy.
I have tried to run a loop for 10000 numbers and assign a random number to row, column and value based on the above criteria, but I guess this is not what the assignment asks me to do, since the loop doesn't stop.
I guess the i and j are not the actual row and column positions of the sparse matrix but rather the values at those positions? I don't know.
I currently don't have working code other than the example for the round-trip implementation, although it will raise an error if the list is empty:
dNext = {
    0: [],
    1: [4, 2, 0],
    2: [1, 4],
    3: [0],
    4: [3, 1]
}
def findRoundTrips(trip, n, trips):
    if (trip[0] == trip[-1]) and (1 < len(trip) <= n + 1):
        trips.append(trip.copy())
        return
    for x in dNext[trip[-1]]:
        if ((x not in trip[1:]) and (len(trip) < n)) or (x == trip[0]):
            trip.append(x)
            findRoundTrips(trip, n, trips)
            trip.pop()
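With the sample dNext above, the function is called with a one-element starting trip and an accumulator list, e.g.:

trips = []
findRoundTrips([1], 4, trips)
print(trips)  # [[1, 4, 1], [1, 2, 1], [1, 2, 4, 1]]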
Here's how I would build a sparse matrix:
from collections import defaultdict
import random

max_location = 1000
min_value = 1.0
max_value = 5.0

sparse_matrix = defaultdict(list)

num_entries = 10000
for _ in range(num_entries):
    source = random.randint(0, max_location - 1)  # randint is inclusive on both ends
    dest = random.randint(0, max_location - 1)
    value = random.uniform(min_value, max_value)
    sparse_matrix[source].append((dest, value))
What this does is define a sparse matrix as a dictionary where the key is the starting point of a trip. The values of a key define everywhere you can fly to from there, and how long it takes, as a list of (destination, hours) tuples.
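For example, to list everywhere you can fly to from airport 0 and the flight times:

for dest, hours in sparse_matrix[0]:
    print(dest, hours)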
Note, I didn't check that I'm using randint and uniform perfectly correctly; if you use this, you should look at the documentation of those functions to check for any off-by-one errors (randint is inclusive on both ends, hence the max_location - 1 above).
I have two lists, and I want to compare the values in the two lists to see if the difference is within a certain range, and count the number of such pairs. Here is my code, 1st version:
m = [1,3,5,7]
n = [1,4,7,9,5,6,34,52]
k = 0
for i in xrange(0, len(m)):
    for j in xrange(0, len(n)):
        if abs(m[i] - n[j]) <= 0.5:
            k += 1
        else:
            continue
the output is 3. I also tried a 2nd version:
t = 0
for i, j in zip(m, n):
    if abs(i - j) <= 0.5:
        t += 1
    else:
        continue
the output is 1, so the answer is wrong. I am wondering if there is simpler and more efficient code than the 1st version, as I have a big amount of data to deal with. Thank you!
The first thing you could do is remove the else: continue, since that doesn't add anything. Also, you can directly use for a in m to avoid iterating over a range and indexing.
If you wanted to write it more succinctly, you could use itertools.
import itertools
m = [1,3,5,7]
n = [1,4,7,9,5,6,34,52]
k = sum(abs(a - b) <= 0.5 for a, b in itertools.product(m, n))
The runtime of this (and your solution) is O(m * n), where m and n are the lengths of the lists.
If you need a more efficient algorithm, you can use a sorted data structure like a binary tree or a sorted list to achieve better lookup.
import bisect
m = [1,3,5,7]
n = [1,4,7,9,5,6,34,52]
n.sort()
k = 0
for a in m:
    i = bisect.bisect_left(n, a - 0.5)
    j = bisect.bisect_right(n, a + 0.5)
    k += j - i
The runtime is O((m + n) * log n). That's n * log n for sorting and m * log n for lookups. So you'd want to make n the shorter list.
More pythonic version of your first version:
ms = [1, 3, 5, 7]
ns = [1, 4, 7, 9, 5, 6, 34, 52]
k = 0
for m in ms:
    for n in ns:
        if abs(m - n) <= 0.5:
            k += 1
I don't think it will run faster, but it's simpler (more readable).
It's simpler, and probably slightly faster, to simply iterate over the lists directly rather than to iterate over range objects to get index values. You already do this in your second version, but you're not constructing all possible pairs with that zip() call. Here's a modification of your first version:
m = [1,3,5,7]
n = [1,4,7,9,5,6,34,52]
k = 0
for x in m:
    for y in n:
        if abs(x - y) <= 0.5:
            k += 1
You don't need the else: continue part, which does nothing at the end of a loop, so I left it out.
If you want to explore generator expressions to do this, you can use:
k = sum(sum(abs(x - y) <= 0.5 for y in n) for x in m)
That should run reasonably fast using just the core language and no imports.
Your two code snippets are doing two different things. The first one is comparing each element of n with each element of m, but the second one is only doing a pairwise comparison of corresponding elements of m and n, stopping when the shorter list runs out of elements. We can see exactly which elements are being compared in the second case by printing the zip:
>>> m = [1,3,5,7]
>>> n = [1,4,7,9,5,6,34,52]
>>> zip(m,n)
[(1, 1), (3, 4), (5, 7), (7, 9)]
pawelswiecki has posted a more Pythonic version of your first snippet. Generally, it's better to directly iterate over containers rather than using an indexed loop unless you actually need the index. And even then, it's more Pythonic to use enumerate() to generate the index than to use xrange(len(m)). E.g.
>>> for i, v in enumerate(m):
... print i, v
...
0 1
1 3
2 5
3 7
A rule of thumb is that if you find yourself writing for i in xrange(len(m)), there's probably a better way to do it. :)
William Gaul has made a good suggestion: if your lists are sorted you can break out of the inner loop once the absolute difference gets bigger than your threshold of 0.5. However, Paul Draper's answer using bisect is my favourite. :)
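For reference, a sketch of that early-exit idea (assuming both lists are sorted ascending):

m.sort()
n.sort()
k = 0
for x in m:
    for y in n:
        if y - x > 0.5:
            break  # every remaining y is even larger, so stop early
        if abs(x - y) <= 0.5:
            k += 1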
I need to fill a numpy array of three elements with random integers such that the sum total of the array is three (e.g. [0,1,2]).
By my reckoning there are 10 possible arrays:
111,
012,
021,
102,
120,
201,
210,
300,
030,
003
My idea is to randomly generate an integer between 1 and 10 using randint, and then use a look-up table to fill the array from the above list of combinations.
Does anyone know of a better approach?
Here is how I did it:
>>> import numpy as np
>>> a=np.array([[1,1,1],[0,1,2],[0,2,1],[1,0,2],[1,2,0],[2,0,1],[2,1,0],[3,0,0],[0,3,0],[0,0,3]])
>>> a[np.random.randint(0,10)]
array([1, 2, 0])
>>> a[np.random.randint(0,10)]
array([0, 1, 2])
>>> a[np.random.randint(0,10)]
array([1, 0, 2])
>>> a[np.random.randint(0,10)]
array([3, 0, 0])
Here’s a naive programmatic way to do this for arbitrary array sizes/sums:
import numpy as np

def n_ints_summing_to_v(n, v):
    # add v one-hot vectors of length n; the result sums to v
    elements = [np.arange(n) == np.random.randint(0, n) for i in range(v)]
    return np.sum(elements, axis=0)
This will, of course, slow down proportionally to the desired sum, but would be ok for small values.
Alternatively, we can phrase this in terms of drawing samples from the Multinomial distribution, for which there is a function available in NumPy (see here), as follows:
def n_ints_summing_to_v(n, v):
    return np.random.multinomial(v, np.ones(n) / float(n))
This is a lot quicker!
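A single draw looks like this (the entries always sum to v):

>>> np.random.multinomial(3, np.ones(3) / 3.0)
array([1, 0, 2])  # one possible draw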
This problem can be solved in the generic case, where the number of elements and their sum are both configurable. One advantage of the solution below is that it does not require generating a list of all the possibilities. The idea is to pick random numbers sequentially, each of which is less than the required sum. The required sum is reduced every time you pick a number:
import numpy

def gen(numel=3, sum=3):
    arr = numpy.zeros((numel,), dtype=int)
    for i in range(len(arr) - 1):  # last element must be free to fill in the sum
        arr[i] = numpy.random.randint(0, sum + 1)
        sum -= arr[i]
        if sum == 0: break  # nothing left to do
    arr[-1] = sum  # ensure that everything adds up
    return arr
print(gen())
This solution does not guarantee that the possibilities will all occur with the same frequency. Among the ten possibilities you list, four start with 0, three with 1, two with 2 and one with 3. This is clearly not the uniform distribution that numpy.random.randint() provides for the first digit.
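If uniformity over the outcomes is wanted (an assumption; the question doesn't strictly require it), one sketch is to enumerate all valid combinations once and sample from them uniformly:

import itertools
import random
import numpy as np

def gen_uniform(numel=3, total=3):
    # enumerate every tuple of `numel` non-negative ints that sums to `total`
    # (the ten possibilities listed in the question for numel=3, total=3)
    combos = [c for c in itertools.product(range(total + 1), repeat=numel)
              if sum(c) == total]
    return np.array(random.choice(combos))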