Loop over finite probability weights with SciPy/NumPy - python

Suppose I have a single event probability prob, a scalar between 0 and 1. If I want to iterate over every possible probability in 0.1 increments, I can use:
prob = np.arange(0, 1.1, 0.1)
Now assume I have 5 events whose probabilities sum to 1, each with probability p_i. I would like to enumerate all such probability vectors, for example:
1.0 - 0.0 - 0.0 - 0.0 - 0.0
0.9 - 0.1 - 0.0 - 0.0 - 0.0
0.9 - 0.0 - 0.1 - 0.0 - 0.0
0.9 - 0.0 - 0.0 - 0.1 - 0.0
0.9 - 0.0 - 0.0 - 0.0 - 0.1
0.8 - 0.1 - 0.1 - 0.0 - 0.0
0.8 - 0.1 - 0.0 - 0.1 - 0.0
. . . . .
. . . . .
. . . . .
0.2 - 0.2 - 0.2 - 0.2 - 0.2
Is there a cleverer way than generating all combinations of 0, 0.1, ..., 1 and deleting the rows that do not sum to 1? If so, what is the easiest way?

You can use itertools.product and filter to create all integer combinations that sum to 10, then convert the result to an array:
import itertools
import numpy as np

f = filter(lambda x: sum(x) == 10, itertools.product(*[range(11)] * 5))
x = np.array(list(f)).astype(float) / 10
x
>> array([[0. , 0. , 0. , 0. , 1. ],
          [0. , 0. , 0. , 0.1, 0.9],
          [0. , 0. , 0. , 0.2, 0.8],
          ...,
          [0.9, 0. , 0.1, 0. , 0. ],
          [0.9, 0.1, 0. , 0. , 0. ],
          [1. , 0. , 0. , 0. , 0. ]])
EDIT
For the record, here is a more efficient way that avoids the filtering step. Essentially you take k units of probability (10 in your example) and distribute them among n samples (the code below uses n = 3) in all possible ways, using combinations_with_replacement.
Then you count how many units each sample receives: that count divided by k is its probability. This method is harder to follow, but because it never generates combinations that have to be discarded, it is much more efficient. You can try it with subdivisions of 0.01 (k = 100):
n = 3    # number of samples
k = 100  # number of subdivisions
f = itertools.combinations_with_replacement(range(n), k)  # your iterator
r = np.array(list(f))                                     # your array of combinations
x = np.vstack([(r == i).sum(1) for i in range(n)]).T / k  # your probability matrix
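As a quick sanity check (my addition, continuing the snippet above): every row of x is a valid probability vector, and no rows had to be discarded:
assert np.allclose(x.sum(axis=1), 1.0)  # each row sums to 1
print(x.shape)  # (5151, 3) for n = 3, k = 100, i.e. C(k + n - 1, n - 1) rows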

There is likely a more elegant solution using itertools, but this is probably fine and uses no dependencies:
for i in prob:
    for j in prob:
        for k in prob:
            for l in prob:
                m = 1 - i - j - k - l
                if m >= 0:
                    print(i, j, k, l, m)
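One caveat worth adding (my note, not part of the original answer): accumulating 0.1 steps in floating point can make the m >= 0 test misfire near the boundary. A minimal variant that loops over integer tenths avoids this:
steps = range(11)  # 0..10 tenths of probability
for i in steps:
    for j in steps:
        for k in steps:
            for l in steps:
                m = 10 - i - j - k - l  # tenths left over for the fifth event
                if m >= 0:
                    print(i / 10, j / 10, k / 10, l / 10, m / 10)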

Creating a Kernel matrix without for-loops in Python

I know there are other posts asking similar questions, but I didn't manage to find something that answers my specific question.
I have the code below:
def kernel_function(self, x1, x2):
    h = 0.5
    return np.exp(-(np.linalg.norm(x2 - x1) / h) ** 2)

for i, x1 in enumerate(train_x):
    for j, x2 in enumerate(train_x):
        K[i, j] = self.kernel_function(x1, x2)
where x1 and x2 are arrays of shape (2,). I need to vectorize it for performance. I looked at np.fromfunction and np.outer, but they don't seem to be what I am looking for...
Thank you in advance. Sorry if there is already an answer somewhere!
Assuming train_x has the following format:
>>> train_x = np.array(((-.2, -.1), (0, .1), (.2, 0), (.1, -.1)))
Executing your code you get:
>>> np.set_printoptions(precision=2)
>>> K
[[1.   0.73 0.51 0.7 ]
 [0.73 1.   0.82 0.82]
 [0.51 0.82 1.   0.92]
 [0.7  0.82 0.92 1.  ]]
You can reshape train_x:
>>> train_x_cols = train_x.T.reshape(2, -1, 1)
>>> train_x_rows = train_x.T.reshape(2, 1, -1)
So, thanks to broadcasting, you get all the combinations when you subtract them:
>>> train_x_rows - train_x_cols
[[[ 0.   0.2  0.4  0.3]
  [-0.2  0.   0.2  0.1]
  [-0.4 -0.2  0.  -0.1]
  [-0.3 -0.1  0.1  0. ]]

 [[ 0.   0.2  0.1  0. ]
  [-0.2  0.  -0.1 -0.2]
  [-0.1  0.1  0.  -0.1]
  [ 0.   0.2  0.1  0. ]]]
And you can rewrite kernel_function() to calculate the norm on the first axis only:
def kernel_function(x1, x2):
    h = 0.5
    return np.exp(-(np.linalg.norm(x2 - x1, axis=0) / h) ** 2)
Then you get:
>>> kernel_function(train_x_cols, train_x_rows)
[[1.   0.73 0.51 0.7 ]
 [0.73 1.   0.82 0.82]
 [0.51 0.82 1.   0.92]
 [0.7  0.82 0.92 1.  ]]
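Putting the pieces together, here is a minimal end-to-end sketch (kernel_matrix is just a name chosen here, and the self parameter from the question's method is dropped); it assumes train_x has shape (n_samples, 2) as above:
import numpy as np

def kernel_matrix(train_x, h=0.5):
    cols = train_x.T.reshape(2, -1, 1)  # shape (2, n, 1)
    rows = train_x.T.reshape(2, 1, -1)  # shape (2, 1, n)
    diff = rows - cols                  # broadcasts to (2, n, n)
    return np.exp(-(np.linalg.norm(diff, axis=0) / h) ** 2)

train_x = np.array(((-.2, -.1), (0, .1), (.2, 0), (.1, -.1)))
print(kernel_matrix(train_x))  # matches K from the double loop above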

Render a Circle or Ellipse with Anti-Aliasing

Assume I have a square raster of given size, and I want to "draw" (render) a circle (or ellipse) of given radius (or major / minor axes) and center.
One way of doing this in Python with NumPy is:
import numpy as np

def ellipse(box_size, semisizes, position=0.5, n_dim=2):
    shape = (box_size,) * n_dim
    if isinstance(semisizes, (int, float)):
        semisizes = (semisizes,) * n_dim
    position = ((box_size - 1) * position,) * n_dim
    grid = [slice(-x0, dim - x0) for x0, dim in zip(position, shape)]
    position = np.ogrid[grid]
    arr = np.zeros(shape, dtype=float)
    for x_i, semisize in zip(position, semisizes):
        arr += (np.abs(x_i / semisize) ** 2)
    return arr <= 1.0

print(ellipse(5, 2).astype(float))
# [[0. 0. 1. 0. 0.]
#  [0. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]
#  [0. 1. 1. 1. 0.]
#  [0. 0. 1. 0. 0.]]
which produces a rasterization without anti-aliasing.
In particular, the pixels that are only partially included in the circle get a value of 0 (the same as pixels excluded from the circle), while pixels entirely included in the circle get a value of 1.
With anti-aliasing, the pixels partially included in the circle would get a value between 0 and 1 depending on how much of their area is included in the circle.
How could I modify the code from above to (possibly cheaply) include anti-aliasing?
I am struggling to see how (if?) I could use the values of arr.
Super-sampling-based methods are out of question here.
Eventually, the result should look something like:
# [[0.0 0.2 1.0 0.2 0.0]
# [0.2 1.0 1.0 1.0 0.2]
# [1.0 1.0 1.0 1.0 1.0]
# [0.2 1.0 1.0 1.0 0.2]
# [0.0 0.2 1.0 0.2 0.0]]
(where 0.2 should be a value between 0.0 and 1.0 representing how much area of that specific pixel is covered by the circle).
EDIT
I see no obvious way to adapt the code from Creating anti-aliased circular mask efficiently, although, obviously, np.clip() must be part of the solution.
One fast but not necessarily mathematically correct way of doing this (loosely based on the code from Creating anti-aliased circular mask efficiently) is:
import numpy as np

def prod(items, start=1):
    for item in items:
        start *= item
    return start

def ellipse(box_size, semisizes, position=0.5, n_dim=2, smoothing=1.0):
    shape = (box_size,) * n_dim
    if isinstance(semisizes, (int, float)):
        semisizes = (semisizes,) * n_dim
    position = ((box_size - 1) * position,) * n_dim
    grid = [slice(-x0, dim - x0) for x0, dim in zip(position, shape)]
    position = np.ogrid[grid]
    arr = np.zeros(shape, dtype=float)
    for x_i, semisize in zip(position, semisizes):
        arr += (np.abs(x_i / semisize) ** 2)
    if smoothing:
        k = prod(semisizes) ** (0.5 / n_dim / smoothing)
        return 1.0 - np.clip(arr - 1.0, 0.0, 1.0 / k) * k
    elif isinstance(smoothing, float):
        return (arr <= 1.0).astype(float)
    else:
        return arr <= 1.0

n = 1
print(np.round(ellipse(5 * n, 2 * n, smoothing=0.0), 2))
# [[0. 0. 1. 0. 0.]
#  [0. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]
#  [0. 1. 1. 1. 0.]
#  [0. 0. 1. 0. 0.]]

n = 1
print(np.round(ellipse(5 * n, 2 * n, smoothing=1.0), 2))
# [[0.   0.65 1.   0.65 0.  ]
#  [0.65 1.   1.   1.   0.65]
#  [1.   1.   1.   1.   1.  ]
#  [0.65 1.   1.   1.   0.65]
#  [0.   0.65 1.   0.65 0.  ]]
A slightly more general version of this approach has been included in the raster_geometry Python package (Disclaimer: I am the main author of it).

Use Numpy to Dynamically Create Arrays

I am trying to use numpy to dynamically create a set of zeros based on the size of a separate numpy array.
This is a small portion of a much larger project; I have posted everything relevant in this question. I have a function k_means which takes in a dataset (posted below) and a k value (which is 3 for this example).
I create a variable centroids which is supposed to look something like
[[4.9 3.1 1.5 0.1]
[7.2 3. 5.8 1.6]
[7.2 3.6 6.1 2.5]]
From there, I need to create a numpy array of "labels", one corresponding to every row in the dataset, of all zeroes with the same shape as the centroids array. Meaning, for a dataset with 5 rows, it would look like:
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
This is what I am trying to achieve, albeit on a dynamic scale (i.e. where the number of rows and columns in the dataset is unknown).
The following (hard-coded, non-NumPy) code satisfies that (assuming there are 150 lines in the dataset):
def k_means(dataset, k):
    centroids = [[5, 3, 2, 4.5], [5, 3, 2, 5], [2, 2, 2, 2]]
    cluster_labels = []
    for i in range(0, 150):
        cluster_labels.append([0, 0, 0, 0])
    print(cluster_labels)
I am trying to do this dynamically with the following:
def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    print(centroids)
    cluster_labels = []
    cluster_labels = numpy.asarray(cluster_labels)
    for index in range(len(dataset)):
        # temp_array = numpy.zeros_like(centroids)
        # print(temp_array)
        cluster_labels = cluster_labels.append(cluster_labels, numpy.zeros_like(centroids))
The current result is: AttributeError: 'numpy.ndarray' object has no attribute 'append'
Or, if I comment out the cluster_labels line and uncomment the temp, I get:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
I will ultimately get 150 sets of that.
Sample of Iris Dataset:
5.1 3.5 1.4 0.2
4.9 3 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
4.8 3 1.4 0.1
4.3 3 1.1 0.1
5.8 4 1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
5.7 3.8 1.7 0.3
5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2
5.1 3.7 1.5 0.4
4.6 3.6 1 0.2
5.1 3.3 1.7 0.5
4.8 3.4 1.9 0.2
5 3 1.6 0.2
5 3.4 1.6 0.4
5.2 3.5 1.5 0.2
5.2 3.4 1.4 0.2
4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2
5.4 3.4 1.5 0.4
5.2 4.1 1.5 0.1
5.5 4.2 1.4 0.2
Can anybody help me dynamically use numpy to achieve what I am aiming for?
Thanks.
The shape of a NumPy array is the size of the array. In a 2D array, shape represents (number of rows, number of columns), so shape[0] is the number of rows and shape[1] is the number of columns. You can use numpy.zeros((dataset.shape[0], centroids.shape[1])) to create an array with your desired dimensions. Here is an example with a modified version of your k_means function.
import numpy

def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    print(centroids)
    cluster_labels = numpy.zeros((dataset.shape[0], centroids.shape[1]))
    print(cluster_labels)

dataset = numpy.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
                       [3, 4, 5, 6, 4, 3, 2, 2, 6, 7],
                       [4, 4, 5, 6, 7, 7, 8, 9, 9, 0],
                       [5, 6, 7, 8, 5, 3, 3, 2, 2, 1],
                       [6, 3, 3, 2, 2, 4, 5, 6, 6, 8]])
k_means(dataset, 2)
Output:
[[1 2 3 4 5 6 7 8 9 0]
[5 6 7 8 5 3 3 2 2 1]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
I used numpy.zeros((dataset.shape[0], centroids.shape[1])) to make it more similar to your code. Actually, numpy.zeros(dataset.shape) would do the same thing, because centroids.shape[1] and dataset.shape[1] are the same: the number of columns of centroids matches the number of columns of dataset, since you choose your centroids from the dataset. So the final version could look like:
def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    cluster_labels = numpy.zeros(dataset.shape)
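As a quick check (a sketch with made-up data, not from the question), both constructions give the same shape:
import numpy
dataset = numpy.arange(20).reshape(5, 4)  # hypothetical 5 x 4 dataset
centroids = dataset[numpy.random.choice(dataset.shape[0], 3, replace=False), :]
print(numpy.zeros((dataset.shape[0], centroids.shape[1])).shape)  # (5, 4)
print(numpy.zeros(dataset.shape).shape)                           # (5, 4)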

Python: effective way to find the cumulative sum of repeated index (numpy method) [duplicate]

This question already has answers here:
Pandas Groupby and Sum Only One Column
(3 answers)
Pandas sum by groupby, but exclude certain columns
(4 answers)
Closed 4 years ago.
I have a 2D numpy array with repeated values in the first column.
The repeated values can have any corresponding value in the second column.
It's easy to find the cumsum using numpy, but here I have to find the sum over each group of repeated values.
How can we do this efficiently using numpy or pandas?
Here, I have solved the problem using an inefficient for-loop.
I was wondering if there is a more elegant solution.
Question
How can we get the same result in a more efficient fashion?
Help will be appreciated.
#!python
# -*- coding: utf-8 -*-#
#
# Imports
import pandas as pd
import numpy as np
np.random.seed(42) # make results reproducible
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10)*0.1
unq = np.unique(aa)
ans = np.zeros(len(unq))
print(aa)
print(bb)
print(unq)
for i, u in enumerate(unq):
    for j, a in enumerate(aa):
        if a == u:
            print(a, u)
            ans[i] += bb[j]
print(ans)
"""
# given data
idx col0 col1
0 7. 0.0
1 15. 0.1
2 11. 0.2
3 8. 0.3
4 7. 0.4
5 19. 0.5
6 11. 0.6
7 11. 0.7
8 4. 0.8
9 8. 0.9
# sorted data
4. 0.8
7. 0.0
7. 0.4
8. 0.9
8. 0.3
11. 0.6
11. 0.7
11. 0.2
15. 0.1
19. 0.5
# cumulative sum for repeated serial
4. 0.8
7. 0.0 + 0.4
8. 0.9 + 0.3
11. 0.6 + 0.7 + 0.2
15. 0.1
19. 0.5
# Required answer
4. 0.8
7. 0.4
8. 1.2
11. 1.5
15. 0.1
19. 0.5
"""
You can groupby col0 and find the .sum() for col1.
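(The snippet below assumes the arrays from the question are first wrapped in a DataFrame, which the original post does not show; the column names col0 and col1 are chosen to match the expected answer.)
df = pd.DataFrame({'col0': aa, 'col1': bb})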
df.groupby('col0')['col1'].sum()
Output:
col0
4.0 0.8
7.0 0.4
8.0 1.2
11.0 1.5
15.0 0.1
19.0 0.5
Name: col1, dtype: float64
I think a pandas method such as the one offered by @HarvIpan is best for readability and functionality, but since you asked for a numpy method as well, here is a way to do it in numpy using a list comprehension, which is more succinct than your original loop:
np.array([[i,np.sum(bb[np.where(aa==i)])] for i in np.unique(aa)])
which returns:
array([[ 4. ,  0.8],
       [ 7. ,  0.4],
       [ 8. ,  1.2],
       [ 11. ,  1.5],
       [ 15. ,  0.1],
       [ 19. ,  0.5]])
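For larger arrays, a fully vectorized alternative (my addition, reusing aa and bb from the question's code) combines np.unique with return_inverse and np.bincount, avoiding the Python-level loop entirely:
unq, inv = np.unique(aa, return_inverse=True)  # unique keys and, for each aa value, its key index
sums = np.bincount(inv, weights=bb)            # per-key totals of bb
result = np.column_stack((unq, sums))          # same layout as the array above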

Printing items in two separate lists in proper alignment using python

I am trying to print items in two separate lists in a way that items in list-1 will align with items in list-2.
Here is my attempt:
import numpy as np

list_1 = [1, 2, 3, 4]
list_2 = np.arange(0.1, 0.4, 0.1)
for x in list_1:
    j = x / 2.0
    for y in list_2:
        print j, ',', y
My Output:
0.5 , 0.1
0.5 , 0.2
0.5 , 0.3
0.5 , 0.4
1.0 , 0.1
1.0 , 0.2
1.0 , 0.3
1.0 , 0.4
1.5 , 0.1
1.5 , 0.2
1.5 , 0.3
1.5 , 0.4
2.0 , 0.1
2.0 , 0.2
2.0 , 0.3
2.0 , 0.4
Desired Output:
0.5 , 0.1
1.0 , 0.2
1.5 , 0.3
2.0 , 0.4
What you want is zip().
Example:
>>> l1 = range(10)
>>> l2 = range(20,30)
>>> for x, y in zip(l1, l2):
...     print x, y
0 20
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
Explanation:
zip receives iterables and iterates over all of them at once, starting from the 0th element of each, then the 1st, then the 2nd, and so on. As soon as any of the iterables reaches its end, zip stops. You can use izip_longest from itertools (zip_longest in Python 3) to fill the missing items with None, or do fancier things, but that is for a different question.
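Applied to the lists from the question (a sketch using Python 3 print syntax; note that zip stops at the shorter iterable, so list_2 needs four items to produce four rows):
import numpy as np

list_1 = [1, 2, 3, 4]
list_2 = np.arange(0.1, 0.5, 0.1)  # 0.1, 0.2, 0.3, 0.4
for x, y in zip(list_1, list_2):
    print(x / 2.0, ',', y)         # pairs items positionally: 0.5 with 0.1, 1.0 with 0.2, ...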
