I have the following numpy arrays:
theta_array =
array([[ 1, 10],
[ 1, 11],
[ 1, 12],
[ 1, 13],
[ 1, 14],
[ 2, 10],
[ 2, 11],
[ 2, 12],
[ 2, 13],
[ 2, 14],
[ 3, 10],
[ 3, 11],
[ 3, 12],
[ 3, 13],
[ 3, 14],
[ 4, 10],
[ 4, 11],
[ 4, 12],
[ 4, 13],
[ 4, 14]])
XY_array =
array([[ 44.0394952 , 505.81099922],
[ 61.03882938, 515.97253226],
[ 26.69851841, 525.18083012],
[ 46.78487831, 533.42309602],
[ 45.77188401, 545.42988355],
[ 81.12969132, 554.78767379],
[ 54.178463 , 565.8716283 ],
[ 41.58952084, 574.76827133],
[ 85.24956815, 585.1355127 ],
[ 80.73726733, 595.49446033],
[ 22.70625059, 605.59017175],
[ 40.66810604, 615.26308629],
[ 47.16694695, 624.39222332],
[ 48.72499541, 633.19846364],
[ 50.68589921, 643.72334885],
[ 38.42731134, 654.68595883],
[ 47.39519707, 666.28232866],
[ 58.07767155, 673.9572227 ],
[ 72.11393347, 683.68307373],
[ 53.70872932, 694.65509894],
[ 82.08237952, 704.5868817 ],
[ 46.64069738, 715.18427515],
[ 40.46032478, 723.91308011],
[ 75.69090892, 733.69595658],
[120.61447884, 745.31322786],
[ 60.17764744, 754.89747186],
[ 87.15961973, 766.24040447],
[ 82.93872713, 773.01518252],
[ 93.56688906, 785.60640153],
[ 70.0474047 , 793.81792947],
[104.3613818 , 805.40234676],
[108.39253837, 814.75002114],
[ 78.97643673, 824.95386427],
[ 85.69096895, 834.44797862],
[ 53.07112931, 844.39555058],
[111.49875807, 855.660508 ],
[ 70.88824958, 865.53417489],
[ 79.55499469, 875.31303945],
[ 60.86941464, 885.85235946],
[101.06017712, 896.69986636],
[ 74.55823544, 905.87417231],
[113.24705653, 915.19350121],
[ 94.21920882, 925.87933273],
[ 63.26478103, 933.70804578],
[ 95.97827181, 945.76196917],
[ 80.48623318, 955.60422694],
[ 80.03451808, 964.39856485],
[ 73.86032436, 973.91032818],
[103.96923524, 984.24366761],
[ 93.20663129, 995.44618851]])
I am trying to combine both, so that each row of theta_array is paired with every row of XY_array.
I am aware of this post, so I have done this:
np.array(np.meshgrid(theta_array, XY_array)).T.reshape(-1,4)
But this generates:
array([[ 1. , 44.0394952 , 1. , 505.81099922],
[ 1. , 61.03882938, 1. , 515.97253226],
[ 1. , 26.69851841, 1. , 525.18083012],
...,
[ 14. , 73.86032436, 14. , 973.91032818],
[ 14. , 103.96923524, 14. , 984.24366761],
[ 14. , 93.20663129, 14. , 995.44618851]])
and the problem requires:
array([[ 1. , 1. , 44.0394952 , 505.81099922],
[ 1. , 1. , 61.03882938, 515.97253226],
[ 1. , 1. , 26.69851841, 525.18083012],
...,
[ 14. , 14. , 73.86032436, 973.91032818],
[ 14. , 14. , 103.96923524, 984.24366761],
[ 14. , 14. , 93.20663129, 995.44618851]])
What would be the way of doing this combination/aggregation in numpy?
EDIT:
There is a mistake in the above process, as combining the two arrays that way does not generate the matrix shown. With separate vectors for each column, the actual solution to merge these is:
dataset = np.array(np.meshgrid(theta0_range, theta1_range, X)).T.reshape(-1,3)
And later the Y vector can be added as an additional column.
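For illustration, here is a hedged sketch of that approach, assuming theta0_range, theta1_range, X and Y are 1-D vectors (the names follow the snippet above; the sample values are hypothetical). With this particular meshgrid expression, each X value ends up occupying len(theta0_range) * len(theta1_range) consecutive rows, so Y has to be repeated accordingly before being appended as a column:
import numpy as np

theta0_range = np.arange(1, 5)       # hypothetical: 1..4
theta1_range = np.arange(10, 15)     # hypothetical: 10..14
X = np.random.rand(50) * 100         # hypothetical X vector
Y = np.random.rand(50) * 1000        # hypothetical Y vector, paired with X

# All combinations of the three 1-D vectors, one per column
dataset = np.array(np.meshgrid(theta0_range, theta1_range, X)).T.reshape(-1, 3)

# Repeat Y so each value lines up with its block of theta combinations,
# then append it as the fourth column
n_rep = len(theta0_range) * len(theta1_range)
dataset = np.column_stack((dataset, np.repeat(Y, n_rep)))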
You can reorder the "columns" after using meshgrid with [:,[0,2,1,3]]; if you need to build the index list dynamically because of a large number of columns, see the end of my answer:
np.array(np.meshgrid(theta_array, XY_array)).T.reshape(-1,4)[:,[0,2,1,3]]
Output:
array([[  1.        ,   1.        ,  44.0394952 , 505.81099922],
       [  1.        ,   1.        ,  61.03882938, 515.97253226],
       [  1.        ,   1.        ,  26.69851841, 525.18083012],
       ...,
       [ 14.        ,  14.        ,  73.86032436, 973.91032818],
       [ 14.        ,  14.        , 103.96923524, 984.24366761],
       [ 14.        ,  14.        ,  93.20663129, 995.44618851]])
If you have many columns, you can create this list dynamically with a list comprehension. For example:
n = new_arr.shape[1]*2
lst = [x for x in range(n) if x % 2 == 0] + [x for x in range(n) if x % 2 == 1]
lst
[0, 2, 4, 6, 1, 3, 5, 7]
Then, you could rewrite to:
np.array(np.meshgrid(theta_array, XY_array)).T.reshape(-1,4)[:,lst]
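As a more concise alternative, np.r_ should build the same interleaved index list in one line (a sketch, using the n = 8 case from above):
n = 8
lst = np.r_[0:n:2, 1:n:2]   # array([0, 2, 4, 6, 1, 3, 5, 7])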
You can use itertools.product:
from itertools import product

out = np.array([*product(theta_array, XY_array)])
out = out.reshape(out.shape[0], -1)
Output:
array([[ 1. , 10. , 44.0394952 , 505.81099922],
[ 1. , 10. , 61.03882938, 515.97253226],
[ 1. , 10. , 26.69851841, 525.18083012],
...,
[ 4. , 14. , 73.86032436, 973.91032818],
[ 4. , 14. , 103.96923524, 984.24366761],
[ 4. , 14. , 93.20663129, 995.44618851]])
That said, this looks very much like an XY-problem. What are you trying to do with this array?
Just as a side/complementary reference, here is a comparison of execution time for both solutions. For this specific operation, itertools takes about ten times longer to complete than its numpy equivalent.
%%time
for i in range(1000):
    z = np.array(np.meshgrid(theta_array, XY_array)).T.reshape(-1,4)[:,[0,2,1,3]]
CPU times: user 299 ms, sys: 0 ns, total: 299 ms
Wall time: 328 ms
%%time
for i in range(1000):
    z = np.array([*product(theta_array, XY_array)])
    z = z.reshape(z.shape[0],-1)
CPU times: user 2.79 s, sys: 474 µs, total: 2.79 s
Wall time: 2.84 s
[[ 208.47 26. ]
[ 202.84 17. ]
[ 143.37 10. ]
...,
[ 45.99 3. ]
[ 159.31 10. ]
[ 34.12 4. ]]
[[ 58.64 1. ]
[ 44.31 19. ]
[ 37.89 14. ]
...,
[ 46.86 4. ]
[ 60.73 5. ]
[ 41.91 6. ]]
[[ 36.6 4. ]
[ 219.29 17. ]
[ 64.77 5. ]
...,
[ 51.85 37. ]
[ 161.26 10. ]
[ 53.63 20. ]]
[[ 52.97 32. ]
[ 51.32 3. ]
[ 196.23 4. ]
...,
[ 41.39 8. ]
[ 47.49 5. ]
[ 34.34 3. ]]
I have numpy arrays like the ones shown above entering my function:
def initialize_centroids(points, k):
    """returns k centroids from the initial points"""
    centroids = points.copy()
    np.random.shuffle(centroids)
    print(centroids)
    return centroids[:k]
The function currently shuffles the rows and returns the first k of them. What I basically want is to randomize the values of the first column between 0 and 300 and the second between 0 and 100. How would I do this?
This is part of my work on building a K-Means algorithm using Python.
As @kazemakase has commented, the answer is simply using:
np.random.rand(k, 2) * [300, 100]
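Plugged into the function from the question, a minimal sketch could look like this (the 300/100 column ranges come from the question itself):
import numpy as np

def initialize_centroids(points, k):
    """Returns k centroids with the first column uniform in [0, 300)
    and the second column uniform in [0, 100)."""
    # points is kept for signature compatibility but is no longer used
    return np.random.rand(k, 2) * [300, 100]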
I want to add a vector as the first column of my 2D array, which looks like:
[[ 1. 0. 0. nan]
[ 4. 4. 9.97 1. ]
[ 4. 4. 27.94 1. ]
[ 2. 1. 4.17 1. ]
[ 3. 2. 38.22 1. ]
[ 4. 4. 31.83 1. ]
[ 3. 4. 41.87 1. ]
[ 2. 1. 18.33 1. ]
[ 4. 4. 33.96 1. ]
[ 2. 1. 5.65 1. ]
[ 3. 3. 40.74 1. ]
[ 2. 1. 10.04 1. ]
[ 2. 2. 53.15 1. ]]
I want to add an array of 13 elements as the first column of the matrix. I tried np.column_stack and np.append, but they either only work for 1D vectors or don't work because I can't choose axis=1 and can only do np.append(peak_values, results).
I have a very simple option for you using numpy -
x = np.array( [[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.942777 , -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767 ,-4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427772 ,-4.297677 ]])
b = np.arange(10).reshape(1,-1)
np.concatenate((b.T, x), axis=1)
Output-
array([[ 0. , 3.9427767, -4.297677 ],
[ 1. , 3.9427767, -4.297677 ],
[ 2. , 3.9427767, -4.297677 ],
[ 3. , 3.9427767, -4.297677 ],
[ 4. , 3.942777 , -4.297677 ],
[ 5. , 3.9427767, -4.297677 ],
[ 6. , 3.9427767, -4.297677 ],
[ 7. , 3.9427767, -4.297677 ],
[ 8. , 3.9427767, -4.297677 ],
[ 9. , 3.9427772, -4.297677 ]])
Improving on this answer by removing the unnecessary transposition, you can indeed use reshape(-1, 1) to transform the 1d array you'd like to prepend along axis 1 to the 2d array to a 2d array with a single column. At this point, the arrays only differ in shape along the second axis and np.concatenate accepts the arguments:
>>> import numpy as np
>>> a = np.arange(12).reshape(3, 4)
>>> b = np.arange(3)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> b
array([0, 1, 2])
>>> b.reshape(-1, 1) # preview the reshaping...
array([[0],
[1],
[2]])
>>> np.concatenate((b.reshape(-1, 1), a), axis=1)
array([[ 0, 0, 1, 2, 3],
[ 1, 4, 5, 6, 7],
[ 2, 8, 9, 10, 11]])
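For what it's worth, np.column_stack or np.insert should give the same result and handle the 1-d-to-column conversion internally, so the explicit reshape can be dropped:
>>> np.column_stack((b, a))
array([[ 0,  0,  1,  2,  3],
       [ 1,  4,  5,  6,  7],
       [ 2,  8,  9, 10, 11]])
>>> np.insert(a, 0, b, axis=1)  # insert b before column 0
array([[ 0,  0,  1,  2,  3],
       [ 1,  4,  5,  6,  7],
       [ 2,  8,  9, 10, 11]])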
For the simplest answer, you probably don't even need numpy.
Try the following:
new_array = []
new_array.append(your_array)
That's it.
I would suggest using Numpy. It will allow you to easily do what you want.
Here is an example of squaring the entire set; to access a single element you can use something like nums[0].
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print(even_squares)  # Prints "[0, 4, 16]"
I have a numpy array which holds 4-dimensional vectors in the format (x, y, z, w).
The size of the array is N x 4. The (x, y, z) components are spatial locations (measured as floats), and w holds some particular measurement at that location. There can be multiple measurements associated with a single (x, y, z) position.
What I would like to do is filter the array, so that I get a new array where I get the maximum measurement corresponding with each (x, y, z) position.
So if my data is like:
x, y, z, w1
x, y, z, w2
x, y, z, w3
where w1 is greater than w2 and w3, the filtered data would be:
x, y, z, w1
So more concretely, say I have data like:
[[ 0.7732126 0.48649481 0.29771819 0.91622924]
[ 0.7732126 0.48649481 0.29771819 1.91622924]
[ 0.58294263 0.32025559 0.6925856 0.0524125 ]
[ 0.58294263 0.32025559 0.6925856 0.05 ]
[ 0.58294263 0.32025559 0.6925856 1.7 ]
[ 0.3239913 0.7786444 0.41692853 0.10467392]
[ 0.12080023 0.74853649 0.15356663 0.4505753 ]
[ 0.13536096 0.60319054 0.82018125 0.10445047]
[ 0.1877724 0.96060999 0.39697999 0.59078612]]
This should return
[[ 0.7732126 0.48649481 0.29771819 1.91622924]
[ 0.58294263 0.32025559 0.6925856 1.7 ]
[ 0.3239913 0.7786444 0.41692853 0.10467392]
[ 0.12080023 0.74853649 0.15356663 0.4505753 ]
[ 0.13536096 0.60319054 0.82018125 0.10445047]
[ 0.1877724 0.96060999 0.39697999 0.59078612]]
This is convoluted, but it is probably as good as you are going to get using numpy only...
First, we use lexsort to put all entries with the same coordinates together. With a being your sample array:
>>> perm = np.lexsort(a[:, 3::-1].T)
>>> a[perm]
array([[ 0.12080023, 0.74853649, 0.15356663, 0.4505753 ],
[ 0.7732126 , 0.48649481, 0.29771819, 0.91622924],
[ 0.7732126 , 0.48649481, 0.29771819, 1.91622924],
[ 0.1877724 , 0.96060999, 0.39697999, 0.59078612],
[ 0.3239913 , 0.7786444 , 0.41692853, 0.10467392],
[ 0.58294263, 0.32025559, 0.6925856 , 0.0524125 ],
[ 0.58294263, 0.32025559, 0.6925856 , 0.05 ],
[ 0.58294263, 0.32025559, 0.6925856 , 1.7 ],
[ 0.13536096, 0.60319054, 0.82018125, 0.10445047]])
Note that by reversing the axis, we are sorting by x, breaking ties with y, then z, then w.
Because it is the maximum we are looking for, we just need to take the last entry in every group, which is a pretty straightforward thing to do:
>>> a_sorted = a[perm]
>>> last = np.concatenate((np.all(a_sorted[:-1, :3] != a_sorted[1:, :3], axis=1),
...                        [True]))
>>> a_unique_max = a_sorted[last]
>>> a_unique_max
array([[ 0.12080023, 0.74853649, 0.15356663, 0.4505753 ],
[ 0.13536096, 0.60319054, 0.82018125, 0.10445047],
[ 0.1877724 , 0.96060999, 0.39697999, 0.59078612],
[ 0.3239913 , 0.7786444 , 0.41692853, 0.10467392],
[ 0.58294263, 0.32025559, 0.6925856 , 1.7 ],
[ 0.7732126 , 0.48649481, 0.29771819, 1.91622924]])
If you would rather not have the output sorted, but keep them in the original order they came up in the original array, you can also get that with the aid of perm:
>>> a_unique_max[np.argsort(perm[last])]
array([[ 0.7732126 , 0.48649481, 0.29771819, 1.91622924],
[ 0.58294263, 0.32025559, 0.6925856 , 1.7 ],
[ 0.3239913 , 0.7786444 , 0.41692853, 0.10467392],
[ 0.12080023, 0.74853649, 0.15356663, 0.4505753 ],
[ 0.13536096, 0.60319054, 0.82018125, 0.10445047],
[ 0.1877724 , 0.96060999, 0.39697999, 0.59078612]])
This will only work for the maximum, and it comes as a by-product of the sorting. If you are after a different function, say the product of all same-coordinates entries, you could do something like:
>>> first = np.concatenate(([True],
...                         np.all(a_sorted[:-1, :3] != a_sorted[1:, :3], axis=1)))
>>> a_unique_prods = np.multiply.reduceat(a_sorted, np.nonzero(first)[0])
And you will have to play around a little with these results to assemble your return array.
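For example, to assemble the product case into the same layout as the max output, something along these lines should work (a sketch reusing first and a_sorted from above; only column 3 of the reduceat result, the product of the w values, is meaningful here):
>>> np.column_stack((a_sorted[first, :3], a_unique_prods[:, 3]))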
I see that you already got the pointer towards pandas in the comments. FWIW, here's how you can get the desired behavior, assuming you don't care about the final sort order since groupby changes it up.
In [14]: arr
Out[14]:
array([[ 0.7732126 , 0.48649481, 0.29771819, 0.91622924],
[ 0.7732126 , 0.48649481, 0.29771819, 1.91622924],
[ 0.58294263, 0.32025559, 0.6925856 , 0.0524125 ],
[ 0.58294263, 0.32025559, 0.6925856 , 0.05 ],
[ 0.58294263, 0.32025559, 0.6925856 , 1.7 ],
[ 0.3239913 , 0.7786444 , 0.41692853, 0.10467392],
[ 0.12080023, 0.74853649, 0.15356663, 0.4505753 ],
[ 0.13536096, 0.60319054, 0.82018125, 0.10445047],
[ 0.1877724 , 0.96060999, 0.39697999, 0.59078612]])
In [15]: import pandas as pd
In [16]: pd.DataFrame(arr)
Out[16]:
0 1 2 3
0 0.773213 0.486495 0.297718 0.916229
1 0.773213 0.486495 0.297718 1.916229
2 0.582943 0.320256 0.692586 0.052413
3 0.582943 0.320256 0.692586 0.050000
4 0.582943 0.320256 0.692586 1.700000
5 0.323991 0.778644 0.416929 0.104674
6 0.120800 0.748536 0.153567 0.450575
7 0.135361 0.603191 0.820181 0.104450
8 0.187772 0.960610 0.396980 0.590786
In [17]: pd.DataFrame(arr).groupby([0,1,2]).max().reset_index()
Out[17]:
0 1 2 3
0 0.120800 0.748536 0.153567 0.450575
1 0.135361 0.603191 0.820181 0.104450
2 0.187772 0.960610 0.396980 0.590786
3 0.323991 0.778644 0.416929 0.104674
4 0.582943 0.320256 0.692586 1.700000
5 0.773213 0.486495 0.297718 1.916229
You can start off by lex-sorting the input array to bring entries with identical first three elements into succession. Then create another 2D array to store the last-column entries, such that the elements corresponding to each duplicate triplet go into the same row. Next, find the max along axis=1 of this 2D array to get the final max output for each unique triplet. Here's the implementation, assuming A is the input array -
# Lex sort A
sortedA = A[np.lexsort(A[:,:-1].T)]
# Mask of start of unique first three columns from A
start_unqA = np.append(True,~np.all(np.diff(sortedA[:,:-1],axis=0)==0,axis=1))
# Counts of unique first three columns from A
counts = np.bincount(start_unqA.cumsum()-1)
mask = np.arange(counts.max()) < counts[:,None]
# Group A's last column into rows based on uniqueness from first three columns
grpA = np.empty(mask.shape)
grpA.fill(np.nan)
grpA[mask] = sortedA[:,-1]
# Concatenate unique first three columns from A and
# corresponding max values for each such unique triplet
out = np.column_stack((sortedA[start_unqA,:-1],np.nanmax(grpA,axis=1)))
Sample run -
In [75]: A
Out[75]:
array([[ 1, 1, 1, 96],
[ 1, 2, 2, 48],
[ 2, 1, 2, 33],
[ 1, 1, 1, 24],
[ 1, 1, 1, 94],
[ 2, 2, 2, 5],
[ 2, 1, 1, 17],
[ 2, 2, 2, 62]])
In [76]: sortedA
Out[76]:
array([[ 1, 1, 1, 96],
[ 1, 1, 1, 24],
[ 1, 1, 1, 94],
[ 2, 1, 1, 17],
[ 2, 1, 2, 33],
[ 1, 2, 2, 48],
[ 2, 2, 2, 5],
[ 2, 2, 2, 62]])
In [77]: out
Out[77]:
array([[ 1., 1., 1., 96.],
[ 2., 1., 1., 17.],
[ 2., 1., 2., 33.],
[ 1., 2., 2., 48.],
[ 2., 2., 2., 62.]])
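As a possible simplification (an assumption on my part, not part of the answer above), the masking/NaN-filling step could be replaced by np.maximum.reduceat, which computes the per-group maximum of the last column directly from the group start indices:
# Sketch reusing sortedA and start_unqA from above
idx = np.flatnonzero(start_unqA)
out = np.column_stack((sortedA[start_unqA, :-1],
                       np.maximum.reduceat(sortedA[:, -1], idx)))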
You can use logical indexing.
I will use random data for an example:
>>> myarr = np.random.random((6, 4))
>>> print(myarr)
[[ 0.7732126 0.48649481 0.29771819 0.91622924]
[ 0.58294263 0.32025559 0.6925856 0.0524125 ]
[ 0.3239913 0.7786444 0.41692853 0.10467392]
[ 0.12080023 0.74853649 0.15356663 0.4505753 ]
[ 0.13536096 0.60319054 0.82018125 0.10445047]
[ 0.1877724 0.96060999 0.39697999 0.59078612]]
To get the row or rows where the last column is the greatest, do this:
>>> greatest = myarr[myarr[:, 3]==myarr[:, 3].max()]
>>> print(greatest)
[[ 0.7732126 0.48649481 0.29771819 0.91622924]]
This takes the last column of myarr, finds the maximum of that column, finds all elements of that column equal to that maximum, and then selects the corresponding rows.
You can use np.argmax
x[np.argmax(x[:,3]),:]
>>> x = np.random.random((5,4))
>>> x
array([[ 0.25461146, 0.35671081, 0.54856798, 0.2027313 ],
[ 0.17079029, 0.66970362, 0.06533572, 0.31704254],
[ 0.4577928 , 0.69022073, 0.57128696, 0.93995176],
[ 0.29708841, 0.96324181, 0.78859008, 0.25433235],
[ 0.58739451, 0.17961551, 0.67993786, 0.73725493]])
>>> x[np.argmax(x[:,3]),:]
array([ 0.4577928 , 0.69022073, 0.57128696, 0.93995176])
I have a matrix A with 1.6M rows and 400 columns.
One of the columns in A (call it the output column) has binary values (0,1) with a predominance of 0's.
I want to create a new matrix B (same shape as A) by sampling rows of A with replacement, such that the distribution of 0's and 1's in the output column of B becomes 50/50.
What is the efficient way to do this using python/numpy?
You could do this by:
Creating a list of all rows with 0 in the "output column" (called outputZeros), and a list of all rows with 1 in the output column (called outputOnes); then,
Sampling with replacement from outputZeros and outputOnes 1.6M times.
Here's a small example. It's not clear to me whether you want the rows in B in any particular order, so here the 0-rows come first, followed by the 1-rows.
In [1]: import numpy as np, random
In [2]: A = np.random.rand(10, 2)
In [4]: A[:7, 1] = 0
In [5]: A[7:, 1] = 1
In [6]: A
Out[6]:
array([[ 0.70126052, 0. ],
[ 0.51161067, 0. ],
[ 0.76511966, 0. ],
[ 0.91257144, 0. ],
[ 0.97024895, 0. ],
[ 0.55817776, 0. ],
[ 0.55963466, 0. ],
[ 0.6318139 , 1. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ]])
In [7]: outputZeros = np.where(A[:, 1] == 0)[0]
In [8]: outputZeros
Out[8]: array([0, 1, 2, 3, 4, 5, 6])
In [9]: outputOnes = np.where(A[:, 1] == 1)[0]
In [10]: outputOnes
Out[10]: array([7, 8, 9])
In [11]: B = np.zeros((10, 2))
In [12]: for i in range(10):
   ....:     if i < 5:
   ....:         B[i, :] = A[random.choice(outputZeros), :]
   ....:     else:
   ....:         B[i, :] = A[random.choice(outputOnes), :]
   ....:
In [13]: B
Out[13]:
array([[ 0.97024895, 0. ],
[ 0.97024895, 0. ],
[ 0.76511966, 0. ],
[ 0.76511966, 0. ],
[ 0.51161067, 0. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ],
[ 0.6318139 , 1. ],
[ 0.6318139 , 1. ],
[ 0.76033151, 1. ]])
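For 1.6M rows, the Python-level loop will be slow; a vectorized sketch of the same idea (reusing A, outputZeros and outputOnes from above, and assuming you want half of each class) could be:
n_samples = A.shape[0]          # 1.6M in your case
half = n_samples // 2
idx = np.concatenate((np.random.choice(outputZeros, half, replace=True),
                      np.random.choice(outputOnes, n_samples - half, replace=True)))
B = A[idx]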
I would create a new 1D numpy array filled with values from numpy.random.random_integers(low, high=None, size=None) (now deprecated in favor of numpy.random.randint) and swap that new array with the old one.
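Note that random_integers is inclusive of high, while its replacement numpy.random.randint excludes it, so an equivalent modern call would be (low, high and size are placeholders for your values):
np.random.randint(low, high + 1, size)   # same range as random_integers(low, high, size)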