Python: Resizing array by removing nth element

I have some dynamically created arrays of varying lengths and I would like to resize them all to the same 5000-element length by popping every nth element.
Here is what I got so far:
import numpy as np
random_array = np.random.rand(26975,3)
n_to_pop = int(len(random_array) / 5000)
print(n_to_pop)
If I do the downsampling with n_to_pop (5) I get 5395 elements.
I can do 5395 / 5000 = 1.07899, but I don't know how to calculate how often I should pop an element to remove the remaining 0.07899 excess.
If I can get within 5000-5050 elements that would also be acceptable; the remainder can then be sacrificed with a simple .resize.
This is probably just a simple math question, but I couldn't seem to find an answer anywhere.
Any help is much appreciated.
Best regards
Martin
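A minimal sketch of the setup above, assuming the every-nth downsampling is done with a slice step (which keeps every n_to_pop-th row rather than popping):
import numpy as np

random_array = np.random.rand(26975, 3)
n_to_pop = int(len(random_array) / 5000)  # 5

# keeping every 5th row leaves 26975 / 5 = 5395 rows, as described above
downsampled = random_array[::n_to_pop]
print(downsampled.shape)  # (5395, 3)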

You can use something like np.linspace to make your solution as uniform as possible:
subset = random_array[np.round(np.linspace(0, len(random_array), 5000, endpoint=False)).astype(int)]
A fixed stride won't always work: consider reducing a 5003-element array to 5000 elements versus reducing a 50003-element array. The trick is to choose a set of elements to keep (or drop) that is as close to linear in the index as possible, which is exactly what np.linspace gives you.
You could also do something like
np.delete(random_array, np.round(np.linspace(0, len(random_array), len(random_array) - 5000, endpoint=False)).astype(int), axis=0)
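A quick shape check of both approaches on the sizes from the question (a sketch; the two keep slightly different index sets):
import numpy as np

random_array = np.random.rand(26975, 3)

# keep 5000 near-uniformly spaced rows
keep = np.round(np.linspace(0, len(random_array), 5000, endpoint=False)).astype(int)
print(random_array[keep].shape)  # (5000, 3)

# equivalently, drop len(random_array) - 5000 near-uniformly spaced rows
drop = np.round(np.linspace(0, len(random_array), len(random_array) - 5000, endpoint=False)).astype(int)
print(np.delete(random_array, drop, axis=0).shape)  # (5000, 3)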

You can use a random-sampling solution with np.random.choice or np.random.permutation:
random_array[np.random.permutation(random_array.shape[0])[:5000]]
If you instead want to remove the rows near-uniformly, one way is:
indices = np.linspace(0, random_array.shape[0], endpoint=False, num=5000, dtype=int)
# [ 0 5 10 16 ... 26958 26964 26969] --> shape = (5000,)
result = random_array[indices]

Related

How can I solve for x with Ax=B, when A and x are 1-d arrays and I know A and B?

In my original code I have the following function:
B = np.inner(A,x)
where A.shape = [307_200] and has values -1 or 1
where x.shape = [307_200] and has values 0 to 256
where B is an integer with a large value.
Assuming I know A and B, but don't know x, how can I solve for x?
To simplify the problem...
import numpy as np
A = np.random.choice(a=[-1,1], size=10)
x = np.random.choice(a=range(0,256), size=10)
B = np.inner(A, x)
I want to solve for x now. So something like one of the following...
x_solved = np.linalg.solve(A, B)
x_solved = np.linalg.lstsq(A, B)
Is it possible?
Extra info...
I could change A to be an n x m matrix, but since I am dealing with large matrices, when I try to use lstsq I quickly run out of memory. This is bad because 1. I can't run it on my local machine and 2. the end-use application needs to limit RAM.
However, for the problem above, I can accept RAM-intensive solutions, since I might be able to moderate the compute resources with some clever tricks.
Also, we could switch A to boolean values if that would help.
Apologies if the solution is obvious or simple.
Thanks for the help.
Here is your problem re-stated:
I have an array A containing many 1s and -1s. I want to make another array x containing integers 0-255 so that when I multiply each entry by the corresponding first array, then add up all the entries, I get some target number B.
Notice that the problem is just as difficult if you shuffle the array elements. So let's shuffle them so all the 1s are at the start and all the -1s are at the end. After solving this simplified version of the problem, we can shuffle them back.
Now the simplified problem is this:
I have A_1 1s and A_-1 -1s. I want to make two arrays, x_1 and x_-1, containing numbers from 0-255, so that when I add all the numbers in x_1 and subtract all the numbers in x_-1 I get some target number B.
Can you work out how to solve this?
I'd start by filling x_1 with 255s until the next 255 would make the sum too high, then fill the next entry with the number that makes the sum equal the target, then fill the rest with 0s. Fill x_-1 entirely with 0s. If the target number is negative, do the opposite. Then un-shuffle: match the x_1 and x_-1 entries back up with the positions of the 1s and -1s in your array A. And you're done.
You can actually write that algorithm so it puts the numbers directly into x without needing the temporary arrays x_1 and x_-1, as in the sketch below.
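A minimal sketch of that greedy fill (solve_x is a hypothetical helper; it assumes B is achievable, which it always is when B came from some valid x):
import numpy as np

def solve_x(A, B):
    # Greedy fill described above: visit the entries whose sign matches
    # the remaining target first, putting in 255 at a time.
    x = np.zeros(len(A), dtype=int)
    remaining = B
    order = np.argsort(A) if B < 0 else np.argsort(-A)  # matching signs first
    for idx in order:
        if A[idx] * remaining <= 0:
            break  # target reached, or only wrong-sign entries left
        x[idx] = min(255, abs(remaining))
        remaining -= A[idx] * x[idx]
    return x

A = np.random.choice(a=[-1, 1], size=10)
x_true = np.random.choice(a=range(0, 256), size=10)
B = np.inner(A, x_true)
x_solved = solve_x(A, B)
print(np.inner(A, x_solved) == B)  # True (many different x give the same B)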

Increment through list and find max value for a range of slices

I have a problem in which I need to find the max values of a range of slices.
Ex: I have a list of 5,000 ints, and want to find the maximum mean for each slice length from 1 to 3600 elements.
Currently my code is as follows:
import statistics

power_vals = ...  # some list / array of ints
max_vals = []
for i in range(1, 3600):
    max_vals += [max([statistics.mean(power_vals[ix:ix+i]) for ix in range(len(power_vals)) if ix+i < len(power_vals)])]
This works fine but it's really slow (for obvious reasons). I tried to use cython to speed up the process. It's obviously better but still not ideal.
Is there a more time efficient way to do this?
Your first step is to prepend a 0 to the array and then create a cumulative sum. At that point, calculating the mean between any two points is two subtractions followed by a division:
mean(x[i:j]) = (cumsum[j] - cumsum[i])/(j - i)
If you're trying to find the largest mean of, say, length 10, then you can make it even faster by just looking for the largest value of (cumsum[i + 10] - cumsum[i]). Once you've found that largest value, you can then divide it by 10 to get the mean.
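Here's a sketch of that approach (power_vals stands in for the list from the question):
import numpy as np

power_vals = np.random.randint(0, 100, 5000)

# prepend a 0 so that cumsum[j] - cumsum[i] is the sum of power_vals[i:j]
cumsum = np.concatenate(([0], np.cumsum(power_vals)))

max_vals = []
for length in range(1, 3600):
    window_sums = cumsum[length:] - cumsum[:-length]  # every sum of that length
    max_vals.append(window_sums.max() / length)       # divide once, at the end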

Equation calculations with 4D arrays

Basically I have over 1000 3D arrays with the shape (100,100,1000). So some pretty large arrays, which I need to use in some calculations. The great thing about Python and NumPy is that, instead of iterating, calculations on every element can be done very quickly. For example, I can take the element-wise sum across all the 3D arrays almost instantly; the result is one large array holding, at each index, the sum over all arrays. In principle, that is ALMOST what I want to do, however, there is a bit of a problem.
What I need to do is use an equation that looks like this:
So as stated, I have around 1000 3D arrays; in total, the shape of the stacked array is (1000, 100, 100, 1000). I also have a list of length 1000 that corresponds to the 1000 3D arrays, and each index of that list contains either a 1 or a 0. If it contains a 1, the entire 3D array at that index should go into the first term of the equation; if 0, it goes into the other.
I am however very much in doubt about how I am going to do this without turning to some kind of looping that might destroy the speed of the calculations by a great deal.
You could split it by locating the 1s and 0s.
Something like:
list_ones = np.where(indicator == 1)[0]   # indicator is your 0/1 list as a NumPy array
list_zeros = np.where(indicator == 0)[0]
Then Array[list_ones,:,:,:] will contain all elements corresponding to a one and Array[list_zeros,:,:,:] will contain all elements corresponding to a zero.
Then you can just put
first_term = Array[list_ones,:,:,:]
second_term = Array[list_zeros,:,:,:]
And sum as appropriate.
Would this work for your purpose?
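A small self-contained sketch of that split-and-sum, with the shapes scaled down from (1000, 100, 100, 1000) and indicator standing in for the 0/1 list:
import numpy as np

Array = np.random.rand(10, 4, 4, 5)           # stand-in for the (1000, 100, 100, 1000) stack
indicator = np.random.randint(0, 2, size=10)  # the 0/1 list, one flag per 3D array

first_term = Array[np.where(indicator == 1)[0], :, :, :]   # all 3D arrays flagged 1
second_term = Array[np.where(indicator == 0)[0], :, :, :]  # all 3D arrays flagged 0

# sum each group over its first axis, still with no Python loop
first_sum = first_term.sum(axis=0)    # shape (4, 4, 5)
second_sum = second_term.sum(axis=0)  # shape (4, 4, 5)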

Fast way to construct a matrix in Python

I have been browsing through the questions and found some help, but I prefer to have confirmation by asking directly. So here is my problem.
I have a (numpy) array u of dimension N, from which I want to build a square matrix k of dimension N^2. Basically, each matrix element k(i,j) is defined as k(i,j) = exp(-|u_i - u_j|^2).
My first naive way to do it was like this, which is, I believe, Fortran-like:
for i in range(N):
    for j in range(N):
        k[i][j] = np.exp(np.sum(-(u[i] - u[j])**2))
However, this is extremely slow. For N=1000, for example, it is taking around 15 seconds.
My other way to proceed is the following (inspired by other questions/answers):
i, j = np.ogrid[:N,:N]
k = np.exp(np.sum(-(u[i]-u[j])**2,axis=2))
This is way faster, as for N=1000, the result is almost instantaneous.
So I have two questions.
1) Why is the first method so slow, and why is the second one so fast?
2) Is there a faster way to do it? For N=10000, it already starts to take quite some time, so I really don't know if this was the "right" way to do it.
Thank you in advance!
P.S: the matrix is symmetric, so there must also be a way to make the process faster by calculating only the upper half of the matrix, but my question was more related to the way to manipulate arrays, etc.
First, a small remark: there is no need for np.sum if u is one-dimensional (e.g. u = np.arange(N)), which seems to be the case since you wrote that it is of dimension N.
1) First question:
Accessing individual indices in Python is slow, so it's best to avoid [] where you can. You also call np.exp and np.sum many times, whereas they can be applied to whole vectors and matrices at once. So your second proposal is better, since you compute k all at once instead of element by element.
2) Second question:
Yes there is. You should consider using only numpy functions and not using indices (around 3 times faster):
k = np.exp(-np.power(np.subtract.outer(u,u),2))
(NB: You can keep **2 instead of np.power, which is a bit faster but has smaller precision)
Edit (taking into account that u is an array of tuples)
With tuple data, it's a bit more complicated:
ma = np.subtract.outer(u[:,0],u[:,0])**2
mb = np.subtract.outer(u[:,1],u[:,1])**2
k = np.exp(-np.add(ma, mb))
You'll have to call np.subtract.outer twice, since calling it once on the full array would return a 4-dimensional array (and compute lots of useless data), whereas u[i]-u[j] returns a 3-dimensional array.
I used np.add instead of np.sum since it keeps the array dimensions.
NB: I checked with
N = 10000
u = np.random.random_sample((N,2))
It returns the same as your proposals (but 1.7 times faster).
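For completeness, a sketch that checks the two versions against each other (sizes as above; it should print True):
import numpy as np

N = 1000
u = np.random.random_sample((N, 2))

# broadcasting version from the question
i, j = np.ogrid[:N, :N]
k_ref = np.exp(np.sum(-(u[i] - u[j])**2, axis=2))

# outer-subtract version from this answer
ma = np.subtract.outer(u[:, 0], u[:, 0])**2
mb = np.subtract.outer(u[:, 1], u[:, 1])**2
k_outer = np.exp(-np.add(ma, mb))

print(np.allclose(k_ref, k_outer))  # True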

Finding the number of values that satisfy something in a matrix, without using loops

I am to write a function that takes 3 arguments: Matrix 1, Matrix 2, and a number p. The function outputs the number of entries in which the difference between Matrix 1 and Matrix 2 is bigger than p. I was instructed not to use loops.
I was advised to use the X.sum() method, where X is an ndarray.
I don't know what to do here.
The first thing I want to do is subtract M2 from M1. Now I have a matrix of entries, each of which either is or is not bigger than p.
I tried to find a way to use the sum function, but I am afraid I can't see how it helps me.
The only thing I can think of is going through the entries, which I am not allowed to do. No recursion is allowed either. I would appreciate your help with this.
import pandas as pd
# Pick value of P
p = 20
# Instantiate fake frames
a = pd.DataFrame({'foo':[4, 10], 'bar':[34, -12]})
b = pd.DataFrame({'foo':[64, 0], 'bar':[21, 354]})
# Get absolute value of difference
c = (b - a).applymap(abs)
# Boolean slice, then sum along each axis to get total number of "True"s
c.applymap(lambda x: x > p).sum().sum()
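Since the question mentions ndarrays and X.sum(), here is the same idea as a plain-NumPy sketch (assuming, as the pandas answer above does, that the absolute difference is what's compared to p):
import numpy as np

p = 20
m1 = np.array([[4, 34], [10, -12]])
m2 = np.array([[64, 21], [0, 354]])

# a boolean array sums as 1s and 0s, so .sum() counts the True entries
count = (np.abs(m1 - m2) > p).sum()
print(count)  # 2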
