Numpy: fill conditional subarray with increasing numbers - python

I often come across an idiom like the following: say I have data like
N = 20 # or some other number
a = np.random.randint(0, 10, N) # or any other 1D np.array
predicate = lambda x: x%2 == 0 # or any other predicate
The idiom I encounter goes along these lines:
b = np.full_like(a, -1)
i1 = 0
for i, x in enumerate(a):
    if predicate(x):
        b[i] = i1
        i1 += 1
How do I translate this to numpy? The following:
b = np.full_like(a, -1)
m = some_predicate(a)
b[m] = np.arange(np.count_nonzero(m))
looks a bit odd to me: this is three lines for such a simple task. In particular, it disturbs me that I need to store m, which I do since I need to reference it twice (because I have no way to say "arange with as many values as necessary").

Walrus operator to the rescue (starting with Python 3.8):
i = -1
b = np.array([ -1 if not predicate(val) else (i := i+1) for val in a ])
or (presumably significantly faster for large arrays)
b = np.full_like(a, -1)
b[sel] = np.arange(np.count_nonzero(sel := predicate(a)))
(This works because the right-hand side of an assignment is evaluated before the subscript of the target, so sel is already bound by the time b[sel] is resolved.)
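For completeness, the mask can also drive a running counter directly via np.cumsum, which avoids referencing it in two separate statements (a sketch, using the even-number predicate from the question):

```python
import numpy as np

N = 20
a = np.random.randint(0, 10, N)
m = a % 2 == 0  # vectorized predicate
# cumsum over the boolean mask yields 1, 2, 3, ... at the True
# positions; subtracting 1 turns that into a 0-based counter
b = np.where(m, np.cumsum(m) - 1, -1)
```

The mask m is still named, but only once; np.where fills the non-matching positions with -1 just like np.full_like did.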

Replace every second and third item in list with 0

I have a randomly generated list of items, and I want to replace every second and third item in that list with the number 0. For the replacement of every second item I have used the code below.
import random
x = [random.randint(0,11) for x in range(1000)]
y = [random.randint(0,11) for x in range(1000)]
a = (x + y)
a[::2] = [0]*(2000//2)
print(a)
It works fine, but I can't use the same method for replacing every third item, since it gives me an error:
attempt to assign sequence of size 666 to extended slice of size 667
I thought of using list comprehension, but I'm unsure of how to execute it, and I could not find a definite answer in my research.
In the case of every second element, the size of [0]*(2000//2) is equal to the size of a[::2], which is 1000; that is why you are not getting an error. But in the case of a[::3] there are 667 elements, while [0]*(2000//3) returns a list of size 666, which cannot be assigned to it. You can use math.ceil to solve this issue:
import random
from math import ceil
x = [random.randint(0, 11) for x in range(1000)]
y = [random.randint(0, 11) for x in range(1000)]
a = (x + y)
index = 2
a[::index] = [0] * ceil(2000/index)
print(a)
You can simply replace 2000//2 with len(a[::2]) like this
import random
x = [random.randint(0,11) for x in range(1000)]
y = [random.randint(0,11) for x in range(1000)]
a = (x + y)
a[::2] = [0]*len(a[::2])
print(a)
b = (x + y)
b[::3] = [0]*len(b[::3])
print(b)
Something like this, where every second or every third element becomes zero:
[0 if (i+1)%2==0 or (i+1)%3==0 else x for i, x in enumerate(a)]
Not quite as neat as subscripting the list outright, but this can be done with the % operator and indices:
import random
x = [random.randint(0,11) for x in range(1000)]
y = [random.randint(0,11) for x in range(1000)]
a = (x + y)
for i, v in enumerate(a):
    if (i+1) % 2 == 0 or (i+1) % 3 == 0:
        a[i] = 0
# if you prefer a comprehension:
a = [0 if (i+1) % 3 == 0 or (i+1) % 2 == 0 else v for i, v in enumerate(a)]
print(a)
[3, 6, 0, 5, 1, 0, 8, 5, 0, 1, 5, 0, 2, 3, 0, 7, 9, 0, ... ]
Others have explained how replacement is done, but the same result could be achieved by not generating these random throwaway numbers in the first place.
from random import randint
N = 2_000
[randint(0, 11) if (x % 2) * (x % 3) != 0 else 0
for x in range(1, N + 1)]
To get this done efficiently, I would recommend using the numpy module. Here is the code:
import numpy as np
a = np.random.randint(0,12,(2000,))
a[::2] = 0
a[::3] = 0
a = a.tolist()
print(a)
np.random.randint takes three arguments: the first is the inclusive lower bound, the second is the exclusive upper bound, and the third is a tuple giving the array dimensions, in this case a 1-D array of 2000 elements. Using numpy you can just use slicing to set the parts you want to zero, then use tolist() to convert back to a list if you need one, or keep it as a numpy array.
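Note that a[::2] zeroes 0-based positions (including the very first element), whereas the comprehension answers above count 1-based with (i+1). If the 1-based reading is intended, the same slicing trick works by shifting the start index (a sketch):

```python
import numpy as np

a = np.random.randint(0, 12, 2000)
a[1::2] = 0  # every second item, counting the first element as item 1
a[2::3] = 0  # every third item
```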

Efficiency of alternative merge sort?

I'm learning merge sort, and in many of the tutorials I've seen, merging is done by replacing values of the original array, like the way here. I was wondering if my alternative implementation is correct; I have only seen one tutorial do the same. My implementation returns the sorted array and goes like this:
def mergesort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left_arr = arr[:mid]
    right_arr = arr[mid:]
    return merge(mergesort(left_arr), mergesort(right_arr))

def merge(left_arr, right_arr):
    merged_arr = []  # put merge of left_arr & right_arr here
    i, j = 0, 0  # indices for left_arr & right_arr
    while i < len(left_arr) and j < len(right_arr):
        if left_arr[i] < right_arr[j]:
            merged_arr.append(left_arr[i])
            i += 1
        else:
            merged_arr.append(right_arr[j])
            j += 1
    # add remaining elements to the resulting array
    merged_arr.extend(left_arr[i:])
    merged_arr.extend(right_arr[j:])
    return merged_arr

arr = [12, 11, 13, 5, 6, 7]
sorted_arr = mergesort(arr)
print(sorted_arr)
# Output: [5, 6, 7, 11, 12, 13]
To me, this is a more intuitive way of doing merge sort. Did this implementation break what merge sort should be? Is it less efficient speed-wise or space-wise (Aside from creating the results array)?
If we are considering a merge sort with O(n) extra memory, then your implementation seems to be correct but inefficient. Let's take a look at these lines:
def mergesort(arr):
    ...
    mid = len(arr) // 2
    left_arr = arr[:mid]
    right_arr = arr[mid:]
You are actually creating two new arrays on each call to mergesort() and then copying elements from the original arr. That's two extra memory allocations on the heap and O(n) copies. Usually, heap memory allocations are quite slow due to complicated allocator algorithms.
Going further, let's consider this line:
merged_arr.append(left_arr[i])  # or similarly merged_arr.append(right_arr[j])
Here again a bunch of memory allocations happens, because you use a dynamically allocated array (aka list).
So, the most efficient way to mergesort would be to allocate one extra array of size of the original array once at the very beginning and then use its parts for temporary results.
def mergesort(arr):
    mergesort_helper(arr[:], arr, 0, len(arr))

def mergesort_helper(arr, aux, l, r):
    """ sorts from arr to aux """
    if l >= r - 1:
        return
    m = l + (r - l) // 2
    mergesort_helper(aux, arr, l, m)
    mergesort_helper(aux, arr, m, r)
    merge(arr, aux, l, m, r)

def merge(arr, aux, l, m, r):
    i = l
    j = m
    k = l
    while i < m and j < r:
        if arr[i] < arr[j]:
            aux[k] = arr[i]
            i += 1
        else:
            aux[k] = arr[j]
            j += 1
        k += 1
    while i < m:
        aux[k] = arr[i]
        i += 1
        k += 1
    while j < r:
        aux[k] = arr[j]
        j += 1
        k += 1

import random

def testit():
    for _ in range(1000):
        n = random.randint(1, 1000)
        arr = [0]*n
        for i in range(n):
            arr[i] = random.randint(0, 100)
        sarr = sorted(arr)
        mergesort(arr)
        assert sarr == arr

testit()
Do Python folks even worry about efficiency with their lists? :)
To achieve the best speed in a classical merge sort implementation, compiled languages allocate the auxiliary memory only once, to minimize allocation operations (memory throughput is frequently the limiting factor when the arithmetic is simple).
Perhaps this approach (preallocating the working space as a list with size equal to the source size) might be useful in a Python implementation too.
Your implementation of merge sort is right.
As you pointed out, you are using an extra array to merge your results; this alternative array adds a space complexity of O(n).
However, the first link you mentioned: https://www.geeksforgeeks.org/merge-sort/
also adds the same space complexity:
/* create temp arrays */
int L[n1], R[n2];
Note: In case you are interested, take a look to "in place" merge sort
I think this is a good implementation of merge sort, because its complexity matches that of standard merge sort: given n, the number of elements to be sorted,
T(n) = 2T (n / 2) + n
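Unrolling that recurrence (for n a power of two) confirms the standard bound:

T(n) = 2T(n/2) + n
     = 4T(n/4) + 2n
     = 2^k T(n/2^k) + k*n
     = n*T(1) + n*log2(n)    (taking k = log2 n)
     = O(n log n)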

Getting mean values out of a for loop

Fairly new to python and I have a for loop that resembles this (I won't include the contents since they seem irrelevant):
for i, (A, B) in enumerate(X):
    ...
    # arbitrary calculations
    ...
    print s1
    print s2
This cycles through ten times (although it does vary occasionally), giving me 10 values for s1 and 10 for s2. Is there an efficient way of finding the means of these values?
You would need to either append each number to a list, or add them up on the fly before finding the mean.
Using a list:
s1_list = []
s2_list = []
for i, (A, B) in enumerate(X):
    ...
    # arbitrary calculations
    ...
    s1_list.append(s1)
    s2_list.append(s2)
s1_mean = sum(s1_list)/float(len(s1_list))
s2_mean = sum(s2_list)/float(len(s2_list))
Adding them up on the fly:
s1_total = 0
s2_total = 0
for i, (A, B) in enumerate(X):
    ...
    # arbitrary calculations
    ...
    s1_total += s1
    s2_total += s2
s1_mean = s1_total/float(len(X))
s2_mean = s2_total/float(len(X))
Use float, otherwise the mean will be rounded down by integer division if it is a decimal number.
I would not allocate lists like in the other answer; just sum inside the loop and divide afterwards by the total number of elements:
sum1 = 0
sum2 = 0
for i, (A, B) in enumerate(X):
    ...
    # arbitrary calculations
    ...
    sum1 += s1
    sum2 += s2
n = i + 1
print(sum1/n)
print(sum2/n)
Allocation is costly if the lists grow too big.
Sure, you can save the values in order to do so.
lst_s1, lst_s2 = [], []
for i, (A, B) in enumerate(X):
    ...
    lst_s1.append(s1)
    lst_s2.append(s2)
    print s1
    print s2
avg_s1 = sum(lst_s1) / len(lst_s1)
avg_s2 = sum(lst_s2) / len(lst_s2)
Try the following snippet to calculate the mean of an array. The point of scaling each value down by a large factor before summing is to keep the intermediate sum small, so it will not overflow even for very large float values.
X = [9, 9, 9, 9, 9, 9, 9, 9]
factor = 1000000
xlen = len(X)
mean = (sum([float(x) / factor for x in X]) * factor) / xlen
print(mean)
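Since Python 3.4, the standard library also provides statistics.mean, which handles the float conversion for you (a sketch with stand-in values for the loop's s1/s2 results):

```python
from statistics import mean

# stand-in values for the (s1, s2) pairs produced by the loop
results = [(1.0, 2.5), (3.0, 4.5), (5.0, 6.5)]
s1_values = [s1 for s1, _ in results]
s2_values = [s2 for _, s2 in results]
print(mean(s1_values))  # 3.0
print(mean(s2_values))  # 4.5
```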

what is a pythonic way to get the number of times list1[i] < list2[i] and vise versa

I have two lists of values, and the expected result is a tuple (a, b) where a is the number of indices i for which list1[i] < list2[i], and b is the number of indices i for which list1[i] > list2[i] (equalities are not counted at all).
I have this solution, and it works perfectly:
x = (0, 0)
for i in range(len(J48)):
    if J48[i] < useAllAttributes7NN[i]:
        x = (x[0]+1, x[1])
    elif J48[i] > useAllAttributes7NN[i]:
        x = (x[0], x[1]+1)
However, I am trying to improve my python skills, and it seems very non-pythonic way to achieve it.
What is a pythonic way to achieve the same result?
FYI, this is done to achieve the required input for binom_test() that tries to prove two algorithms are not statistically identical.
I don't believe this information has any additional value to the specific question though.
One way is to build a set of scores and then add them up.
scores = [(a < b, a > b) for (a, b) in zip(J48, useAllAttributes7NN)]
x = (sum(a for (a, _) in scores), sum(b for (_, b) in scores))
# Or, as per #agf's suggestion (though I prefer comprehensions to map)...
x = [sum(s) for s in zip(*scores)]
Another is to zip them once then count scores separately:
zipped = zip(J48, useAllAttributes7NN)
x = (sum(a < b for (a, b) in zipped), sum(a > b for (a, b) in zipped))
Note that this doesn't work in Python 3 (thanks #Darthfett).
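A Python 3-friendly variant of the same idea materializes the pairs once, so they survive both passes (a sketch with small made-up lists standing in for J48 and useAllAttributes7NN):

```python
J48 = [1, 3, 7, 15, 22, 27]                   # made-up sample data
useAllAttributes7NN = [2, 5, 10, 12, 20, 30]  # made-up sample data
pairs = list(zip(J48, useAllAttributes7NN))   # a list can be scanned twice
x = (sum(a < b for a, b in pairs), sum(a > b for a, b in pairs))
print(x)  # (4, 2)
```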
Just for the sake of fun, here is a solution using complex numbers. Not Pythonic, but quite mathematical :-)
Think of the problem as plotting the result in a two-dimensional complex space:
result = sum((x < y) + (x > y)*1j for x, y in zip(list1, list2))
(result.real, result.imag)
import itertools
x = [0, 0, 0]
for a, b in itertools.izip(J48, useAllAttributes7NN):
    x[cmp(a, b)] += 1
and then take just x[2] and x[1]: since cmp(a, b) is -1, 0, or 1, x[-1] (i.e. x[2]) counts a < b, x[1] counts a > b, and x[0] counts the equalities.
Another way (has to parse the lists twice):
first = sum(1 for a, b in itertools.izip(J48, useAllAttributes7NN) if a > b)
second = sum(1 for a, b in itertools.izip(J48, useAllAttributes7NN) if a < b)
A simple solution is:
list1 = range(10)
list2 = reversed(range(10))
x = [0, 0]
for a, b in zip(list1, list2):
    x[0] += 1 if a < b else 0
    x[1] += 1 if a > b else 0
x = tuple(x)
Giving us:
(5, 5)
zip() is the best way to iterate over two lists at once. If you are using Python 2.x, you might want to use itertools.izip for performance reasons (it's lazy like Python 3.x's zip()).
It's also easier to work on a list until you stop changing it, as a list is mutable.
Edit:
A Python 3-compatible version of the versions that use cmp:
def add_tuples(*args):
    return tuple(sum(z) for z in zip(*args))

add_tuples(*[(1, 0) if a < b else ((0, 1) if a > b else (0, 0)) for a, b in zip(list1, list2)])
Well, what about:
import itertools
def func(list1, list2):
    x, y = 0, 0
    for (a, b) in itertools.izip(list1, list2):
        if a < b:
            x += 1
        elif a > b:
            y += 1
    print x, y
list1 = [1, 3, 7, 15, 22, 27]
list2 = [2, 5, 10, 12, 20, 30]
x = 0
y = 0
for index, value in enumerate(list1):
    if value < list2[index]:
        x += 1
    elif value > list2[index]:
        y += 1
print (x, y)

Comparing two numpy arrays to each other

I have two equally sized numpy arrays (they happen to be 48x365) where every element is either -1, 0, or 1. I want to compare the two and see how many times they are both the same and how many times they are different while discounting all the times where at least one of the arrays has a zero as no data. For instance:
for x in range(48):
    for y in range(365):
        if array1[x][y] != 0:
            if array2[x][y] != 0:
                if array1[x][y] == array2[x][y]:
                    score = score + 1
                else:
                    score = score - 1
return score
This takes a very long time. I was thinking of taking advantage of the fact that multiplying the elements together and summing all the answers may give the same outcome, and I'm looking for a special numpy function to help with that. I'm not really sure what unusual numpy functions are out there.
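The multiply-and-sum idea from the question does in fact work here, precisely because every element is -1, 0, or 1 (a sketch with random data standing in for the real 48x365 arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
array1 = rng.integers(-1, 2, size=(48, 365))
array2 = rng.integers(-1, 2, size=(48, 365))
# the elementwise product is +1 where both entries are nonzero and equal,
# -1 where both are nonzero and differ, and 0 wherever either entry is 0,
# so its sum reproduces the loop's score in one vectorized step
score = int((array1 * array2).sum())
```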
Simply do not iterate. Iterating over a numpy array defeats the purpose of using the tool.
ans = np.logical_and(
    np.logical_and(array1 != 0, array2 != 0),
    array1 == array2)
should give a boolean mask of the positions where both arrays are nonzero and equal; summing it counts the matches.
For me the easiest way is to do this:
A = numpy.array(...)  # first array
B = numpy.array(...)  # second array
T = A - B
max_diff = numpy.max(numpy.abs(T))
epsilon = 1e-6
if max_diff > epsilon:
    raise Exception("Not matching arrays")
It lets you quickly check whether the arrays match, and it works for comparing float values too!
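numpy already ships a helper for exactly this kind of tolerance comparison, so you don't have to roll the epsilon check by hand:

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([1.0, 2.0, 3.0 + 1e-9])
# elementwise |A - B| <= atol + rtol * |B|, reduced over the whole array
print(np.allclose(A, B, atol=1e-6))  # True
```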
Simple calculations along the following lines will help you to select the most suitable way to handle your case:
In []: A, B = randint(-1, 2, size=(48, 365)), randint(-1, 2, size=(48, 365))
In []: ignore = (0 == A) | (0 == B)
In []: valid = ~ignore
In []: (A[valid] == B[valid]).sum()
Out[]: 3841
In []: (A[valid] != B[valid]).sum()
Out[]: 3849
In []: ignore.sum()
Out[]: 9830
Ensuring that the calculations are valid:
In []: 3841 + 3849 + 9830 == 48 * 365
Out[]: True
Therefore your score (with these random values) would be:
In []: a, b = A[valid], B[valid]
In []: score = (a == b).sum() - (a != b).sum()
In []: score
Out[]: -8
import numpy as np
A = np.array(...)
B = np.array(...)
...
Z = np.array(...)
to_test = np.array([A, B, .., Z])
# compare linewise: check that every row equals the first one
# (a list comprehension, since np.all over a bare map object
# does not behave as intended in Python 3)
np.all([np.all(x == to_test[0, :]) for x in to_test[1:, :]])
