I'm trying to write a function that randomly samples a numpy.ndarray of floating point numbers while preserving the distribution of the numbers in the array. Here is my function so far:
import random
from collections import Counter

import numpy as np

def sample(A, N):
    population = np.zeros(sum(A))
    counter = 0
    for i, x in enumerate(A):
        for j in range(x):
            population[counter] = i
            counter += 1
    sampling = population[np.random.choice(0, len(population), N)]
    return np.histogram(sampling, bins=np.arange(len(A) + 1))[0]
So I would like the function to work something like this (this example doesn't account for the distribution):
a = np.array([1.94, 5.68, 2.77, 7.39, 2.51])
new_a = sample(a,3)
new_a
array([1.94, 2.77, 7.39])
However, when I apply the function to an array like this I'm getting:
TypeError Traceback (most recent call last)
<ipython-input-74-07e3aa976da4> in <module>
----> 1 sample(a, 3)
<ipython-input-63-2d69398e2a22> in sample(A, N)
3
4 def sample(A, N):
----> 5 population = np.zeros(sum(A))
6 counter = 0
7 for i, x in enumerate(A):
TypeError: 'numpy.float64' object cannot be interpreted as an integer
Any help on modifying this function, or creating one that would work for this, would be really appreciated!
In [67]: a = np.array([1.94, 5.68, 2.77, 7.39, 2.51])
In [68]: np.zeros(sum(a))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-68-263779bc977b> in <module>
----> 1 np.zeros(sum(a))
TypeError: 'numpy.float64' object cannot be interpreted as an integer
sum on the shape does not produce this error:
In [69]: np.zeros(sum(a.shape))
Out[69]: array([0., 0., 0., 0., 0.])
But you shouldn't need to use sum:
In [70]: a.shape
Out[70]: (5,)
In [71]: np.zeros(a.shape)
Out[71]: array([0., 0., 0., 0., 0.])
In fact if a is 2d, and you want a 1d array with the same number of items, you want the product of the shape, not the sum.
But do you want to return an array exactly the same size as A? I thought you were trying to downsize.
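For what it's worth, here is a minimal sketch of a distribution-weighted sampler, not taken from the question's code: it assumes the values in the array are meant to act as weights and draws N distinct elements with np.random.choice and its p argument (the function name is illustrative):

```python
import numpy as np

def weighted_sample(A, N):
    # Treat each value as a weight: larger entries are
    # proportionally more likely to be drawn.
    probs = A / A.sum()
    idx = np.random.choice(len(A), size=N, replace=False, p=probs)
    return A[idx]

a = np.array([1.94, 5.68, 2.77, 7.39, 2.51])
new_a = weighted_sample(a, 3)
```

With replace=False the result contains three distinct elements of a, so its shape matches the example output in the question.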
I'm new to machine learning and want to build a k-means algorithm with k = 2, and I'm struggling to calculate the new centroids. Here is my code for k-means:
def euclidean_distance(x: np.ndarray, y: np.ndarray):
    # x shape: (N1, D)
    # y shape: (N2, D)
    # output shape: (N1, N2)
    dist = []
    for i in x:
        for j in y:
            new_list = np.sqrt(sum((i - j) ** 2))
            dist.append(new_list)
    distance = np.reshape(dist, (len(x), len(y)))
    return distance

def kmeans(x, centroids, iterations=30):
    assignment = None
    for i in iterations:
        dist = euclidean_distance(x, centroids)
        assignment = np.argmin(dist, axis=1)
        for c in range(len(y)):
            centroids[c] = np.mean(x[assignment == c], 0)  # error here
    return centroids, assignment
I have inputs x = [[1., 0.], [0., 1.], [0.5, 0.5]] and y = [[1., 0.], [0., 1.]], and distance is an array that looks like this:
[[0. 1.41421356]
[1.41421356 0. ]
[0.70710678 0.70710678]]
and when I run kmeans(x, y) it returns this error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_40086/2170434798.py in <module>
      5
      6 for c in range(len(y)):
----> 7 centroids[c] = (x[classes == c], 0)
      8     print(centroids)

TypeError: only integer scalar arrays can be converted to a scalar index
Does anyone know how to fix it or improve my code? Thank you in advance!
Changing the inputs to NumPy arrays should get rid of the errors:
x = np.array([[1., 0.], [0., 1.], [0.5, 0.5]])
y = np.array([[1., 0.], [0., 1.]])
Also, it seems you must change for i in iterations to for i in range(iterations) in the kmeans function.
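Putting those fixes together, here is one hedged sketch of a working version (not the answerer's exact code): it replaces the undefined y inside kmeans with len(centroids), fixes the range call, and computes the pairwise distances with broadcasting instead of nested loops:

```python
import numpy as np

def euclidean_distance(x, y):
    # Pairwise distances via broadcasting:
    # (N1, 1, D) - (1, N2, D) -> (N1, N2, D) -> (N1, N2)
    return np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2))

def kmeans(x, centroids, iterations=30):
    centroids = centroids.copy()  # don't mutate the caller's array
    assignment = None
    for _ in range(iterations):   # range(), not the bare integer
        dist = euclidean_distance(x, centroids)
        assignment = np.argmin(dist, axis=1)
        for c in range(len(centroids)):  # len(centroids), not len(y)
            centroids[c] = np.mean(x[assignment == c], axis=0)
    return centroids, assignment

x = np.array([[1., 0.], [0., 1.], [0.5, 0.5]])
y = np.array([[1., 0.], [0., 1.]])
centroids, assignment = kmeans(x, y)
```

On the question's inputs the tied point [0.5, 0.5] goes to the first cluster (argmin breaks ties toward the lower index), so the clusters stabilize after one update.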
I have array of floats, and I want to floor them to nearest integer, so I can use them as indices.
For example:
In [2]: import numpy as np
In [3]: arr = np.random.rand(1, 10) * 10
In [4]: arr
Out[4]:
array([[4.97896461, 0.21473121, 0.13323678, 3.40534157, 5.08995577,
6.7924586 , 1.82584208, 6.73890807, 2.45590354, 9.85600841]])
In [5]: arr = np.floor(arr)
In [6]: arr
Out[6]: array([[4., 0., 0., 3., 5., 6., 1., 6., 2., 9.]])
In [7]: arr.dtype
Out[7]: dtype('float64')
They are still floats after flooring, is there a way to automatically cast them to integers?
I have edited the answer with @DanielF's explanation:
"floor doesn't convert to integer, it just gives integer-valued floats, so you still need an astype to change to int"
Check this code to understand the solution:
import numpy as np
arr = np.random.rand(1, 10) * 10
print(arr)
arr = np.floor(arr).astype(int)
print(arr)
OUTPUT:
[[2.76753828 8.84095843 2.5537759 5.65017407 7.77493733 6.47403036
7.72582766 5.03525625 9.75819442 9.10578944]]
[[2 8 2 5 7 6 7 5 9 9]]
Why not just use:
np.random.randint(1,10)
As an alternative to changing the type after flooring, you can provide an output array of the desired data type to np.floor (and to any other numpy ufunc). For example, if you want the output as np.int32, do the following:
import numpy as np
arr = np.random.rand(1, 10) * 10
out = np.empty_like(arr, dtype=np.int32)
np.floor(arr, out=out, casting='unsafe')
As the casting argument already indicates, you should know what you are doing when casting outputs into different types. However, in your case it is not really unsafe.
That said, I would not call np.floor in your case at all, because all your values are greater than zero. The simplest and probably fastest solution to your problem is a direct cast to integer.
import numpy as np
arr = (np.random.rand(1, 10) * 10).astype(int)
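One caveat worth spelling out, since it is only implicit in the answer above: for negative values the two approaches disagree, because astype(int) truncates toward zero while np.floor rounds toward negative infinity:

```python
import numpy as np

arr = np.array([-1.5, -0.5, 0.5, 1.5])

# astype truncates toward zero; floor rounds toward -infinity.
truncated = arr.astype(int)
floored = np.floor(arr).astype(int)
```

So the shortcut of casting directly is only safe when every value is non-negative, which is why the answer checks that condition first.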
I have an array
array = [np.array([[0.76103773], [0.12167502]]),
np.array([[ 0.72017135, 0.1633635 , 0.39956811, 0.91484082, 0.76242736, -0.39897202],
[0.38787197, -0.06179132, -0.04213892, 0.16762614, 0.05880554, 0.59370467]])]
And I want to convert it into a numpy object array that contains numpy ndarrays. So I tried np.array(array), np.array(array, dtype=object), and np.array(array, dtype=np.object).
But all of them give the same error: ValueError: could not broadcast input array from shape (2,1) into shape (2). Basically, the end result should be the same, except that the type of the end result is a numpy object array, not a Python list. Can anyone help?
Your list contains (2,1) and (2,6) shaped arrays.
np.array tries to create a multidimensional array from the inputs. That works fine with inputs that have matching shapes (or lengths and nesting). Failing that, it falls back on creating object-dtype arrays.
But in cases where the first dimensions of the input arrays match, it produces this kind of error. Evidently it has initialized a 'blank' array and is trying to copy the list's arrays into it. I haven't looked at the details, but I've seen this error message before.
In effect, giving np.array a list of diverse-size arrays forces it to use backup methods. Some inputs produce an object array; others produce this kind of error. If your list contained arrays all of the same shape, the result would be a 3d array, not an object array.
The surest way to make an object array with a given shape is to initialize it and then copy from the list.
In [66]: alist =[np.array([[0.76103773], [0.12167502]]),
...: np.array([[ 0.72017135, 0.1633635 , 0.39956811, 0.91484082, 0.76242736, -0.39897202],
...: [0.38787197, -0.06179132, -0.04213892, 0.16762614, 0.05880554, 0.59370467]])]
In [67]: alist[0].shape
Out[67]: (2, 1)
In [68]: alist[1].shape
Out[68]: (2, 6)
In [69]: np.array(alist, object)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-69-261e1ad7e5cc> in <module>
----> 1 np.array(alist, object)
ValueError: could not broadcast input array from shape (2,1) into shape (2)
In [70]: arr = np.zeros(2, object)
In [71]: arr[:] = alist
In [72]: arr
Out[72]:
array([array([[0.76103773],
[0.12167502]]),
array([[ 0.72017135, 0.1633635 , 0.39956811, 0.91484082, 0.76242736,
-0.39897202],
[ 0.38787197, -0.06179132, -0.04213892, 0.16762614, 0.05880554,
0.59370467]])], dtype=object)
Don't expect too much from object dtype arrays. Math is hit-or-miss. Some things work - if they can delegate the action to the elements. Others don't work:
In [73]: arr - arr
Out[73]:
array([array([[0.],
[0.]]),
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])], dtype=object)
In [74]: np.log(arr)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-74-a67b4ae04e95> in <module>
----> 1 np.log(arr)
AttributeError: 'numpy.ndarray' object has no attribute 'log'
Even when the math works, it isn't faster than a list comprehension. In fact, iteration on an object array is slower than iteration on a list.
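For example, a plain list comprehension over the original list applies a ufunc to each member array individually, sidestepping the object-dtype machinery entirely (a sketch; the second array is shortened for brevity):

```python
import numpy as np

alist = [np.array([[0.76103773], [0.12167502]]),
         np.array([[0.72017135, 0.1633635],
                   [0.38787197, -0.06179132]])]

# np.log works on each plain float array, even though it fails
# on the object-dtype array holding them.
logs = [np.log(np.abs(a)) for a in alist]
```

Each result keeps its own shape, which is usually what you want when the shapes differ anyway.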
Is this what you're trying to accomplish?
array1 = np.array([[0.76103773], [0.12167502]])
array2 = np.array([[ 0.72017135, 0.1633635 , 0.39956811, 0.91484082, 0.76242736, -0.39897202],[0.38787197, -0.06179132, -0.04213892, 0.16762614, 0.05880554, 0.59370467]])
result = np.hstack([array1,array2])
EDIT:
Maybe this?
array1 = [[0.76103773], [0.12167502]]
array2 = [[ 0.72017135, 0.1633635 , 0.39956811, 0.91484082, 0.76242736, -0.39897202],[0.38787197, -0.06179132, -0.04213892, 0.16762614, 0.05880554, 0.59370467]]
result = np.array([array1,array2])
EDIT 2:
OK, let's try one more time. I think this is it.
array1 = np.array([[0.76103773], [0.12167502]])
array2 = np.array([[ 0.72017135, 0.1633635 , 0.39956811, 0.91484082, 0.76242736, -0.39897202],[0.38787197, -0.06179132, -0.04213892, 0.16762614, 0.05880554, 0.59370467]])
#solution is either
result = np.array([array1,array2.transpose()])
#or this
result2 = np.array([array1.transpose(),array2])
I get an error such as:

Traceback (most recent call last):
  File "C:\Users\SONY\Desktop\deneme.py", line 42, in <module>
    G[alpha][n]=compute_G(x,n)
NameError: name 'G' is not defined

Here is my code:
N = 20
N_cor = 25
N_cf = 25
a = 0.5
eps = 1.4

def update(x):
    for j in range(0,N):
        old_x = x[j]
        old_Sj = S(j,x)
        x[j] = x[j] + random.uniform(-eps,eps)
        dS = S(j,x) - old_Sj
        if dS>0 and exp(-dS)<random.uniform(0,1):
            x[j] = old_x

def S(j,x):
    jp = (j+1)%N
    jm = (j-1)%N
    return a*x[j]**2/2 + x[j]*(x[j]-x[jp]-x[jm])/a

def compute_G(x,n):
    g = 0
    for j in range(0,N):
        g = g + x[j]*x[(j+n)%N]
    return g/N

#def MCaverage(x,G):
import random
from math import exp
x=[]
for j in range(0,N):
    x.append(0.0)
    print "x(%d)=%f"%(j,x[j])
for j in range(0,5*N_cor):
    update(x)
for alpha in range(0,N_cf):
    for j in range(0,N_cor):
        update(x)
    for i in range(0,N):
        print "x(%d)=%f"%(i,x[i])
    for n in range(0,N):
        G[alpha][n]=compute_G(x,n)
for n in range(0,N):
    avg_G = 0
    for alpha in range(0,N_cf):
        avg_G = avg_G + G[alpha][n]
    avg_G = avg_G / N_cf
    print "G(%d) = %f"%(n,avg_G)
When I define G, I get another error:

Traceback (most recent call last):
  File "C:\Users\SONY\Desktop\deneme.py", line 43, in <module>
    G[alpha][n]=compute_G(x,n)
IndexError: list index out of range

Here is how i define G:
...
for alpha in range(0,N_cf):
    for j in range(0,N_cor):
        update(x)
    for n in range(0,N):
        G=[][]
        G[alpha][n]=compute_G(x,n)
...
What should I do to define an array with two indices, i.e. a two-dimensional matrix?
In Python, a=[] defines a list, not an array. It certainly can be used to store a lot of elements, all of the same numeric type, and one can define a mapping from two integers indexing a rectangular array to one list index. It's rather going against the grain, though: hard to program, and inefficiently stored, because lists are intended as ordered collections of objects which may be of arbitrary type.
What you probably need most is a direction on where to start reading. Here it is: learn about Numpy (http://www.numpy.org/), a Python module for typical scientific calculations with arrays of (mostly) numeric data in which all the elements are of the same type. Here is a brief taster, after you have installed numpy.
>>> import numpy as np # importing as np is conventional
>>> p = np.zeros( (6,4) ) # two dimensional, 24 elements in total
>>> for i in range(4): p[i,i]=1
>>> p
array([[ 1., 0., 0., 0.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 0., 1.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
numpy arrays are efficient ways of manipulating as much data as you can fit into your computer's RAM.
Python's standard library does include an array.array datatype, but it is rarely used on its own; numpy supplies the support code you would otherwise have to write yourself. Not least because, when your arrays have millions or billions of elements, you can't afford the inefficiency of inner loops over their indices in an interpreted language like Python. Numpy offers row-, column- and array-level operations whose underlying code is compiled and optimized, so it runs considerably faster.
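Applied to the original question, G can be preallocated as a two-dimensional numpy array before the loops, instead of the invalid G=[][] inside them. A sketch, with compute_G stood in by a placeholder so it runs on its own:

```python
import numpy as np

N = 20      # sites per configuration (from the question)
N_cf = 25   # number of configurations

# Preallocate an N_cf x N array of zeros; G[alpha][n] then
# works exactly as the question's loops expect.
G = np.zeros((N_cf, N))

for alpha in range(N_cf):
    for n in range(N):
        G[alpha][n] = alpha + n  # placeholder for compute_G(x, n)

# The final averaging loop collapses to one vectorized call:
avg_G = G.mean(axis=0)  # one average per n, over all alpha
```

Note that G.mean(axis=0) replaces the whole hand-written averaging loop at the end of the question's code.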
I'm planning on plotting y^n vs x for different values of n. Here is my sample code:
import numpy as np
x = np.arange(1,5)
y = np.arange(2,9,2)
exponent = np.linspace(1,8,50)
z = y**exponent
With this, I got the following error:
ValueError: operands could not be broadcast together with shapes (4,) (50,)
My idea is that for each value of n, I will get an array where that array contains the new values of y that is now raised to n. For instance:
y1 = []  # the array y**1
y2 = []  # the array y**1.5
y3 = []  # the array y**2
etc. I don't know how I can get those 50 arrays for y**n; is there an easier way to do it? Thank you.
You can use "broadcasting" (explained here in the docs) and create a new axis:
z = y**exponent[:,np.newaxis]
In other words, instead of
>>> y = np.arange(2,9,2)
>>> exponent = np.linspace(1, 8, 50)
>>> z = y**exponent
Traceback (most recent call last):
File "<ipython-input-40-2fe7ff9626ed>", line 1, in <module>
z = y**exponent
ValueError: operands could not be broadcast together with shapes (4,) (50,)
You can use array[:,np.newaxis] (or array[:,None], the same thing, but newaxis is more explicit about your intent) to give the array an extra dimension of size 1:
>>> exponent.shape
(50,)
>>> exponent[:,np.newaxis].shape
(50, 1)
and so
>>> z = y**exponent[:,np.newaxis]
>>> z.shape
(50, 4)
>>> z[0]
array([ 2., 4., 6., 8.])
>>> z[1]
array([ 2.20817903, 4.87605462, 7.75025005, 10.76720154])
>>> z[0]**exponent[1]
array([ 2.20817903, 4.87605462, 7.75025005, 10.76720154])