I'm looking for an efficient way to do the following with Numpy:
Given an array counts of non-negative integers, containing for instance:
[3, 1, 0, 6, 3, 2]
I would like to generate another array containing the indices of the first one, where the index i is repeated counts[i] times:
[0 0 0 1 3 3 3 3 3 3 4 4 4 5 5]
My problem is that this array is potentially very large, so I'm looking for a vectorized (or otherwise fast) way to do this.
You can do it with numpy.repeat:
import numpy as np
arr = np.array([3, 1, 0, 6, 3, 2])
repix = np.repeat(np.arange(arr.size), arr)
print(repix)
Output:
[0 0 0 1 3 3 3 3 3 3 4 4 4 5 5]
Take the following code:
import numpy as np
one_dim = np.array([2, 3, 1, 5, 4])
partitioned = np.argpartition(one_dim, 0)
print(f'Unpartitioned array: {one_dim}')
print(f'Partitioned array index: {partitioned}')
print(f'Partitioned array: {one_dim[partitioned]}')
The following output results:
Unpartitioned array: [2 3 1 5 4]
Partitioned array index: [2 1 0 3 4]
Partitioned array: [1 3 2 5 4]
The output for the partitioned array should be [1 2 3 5 4]. How is three on the left side of two? It seems to me the function is making an error or am I missing something?
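The second argument (kth) is the index that will hold its sorted value after partitioning. With kth=0, only index 0 of the partitioned array (the value 1) is guaranteed to be in its sorted position; everything to its right is greater than or equal to it, but in no guaranteed order. So [1 3 2 5 4] is a correct result.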
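A minimal sketch of that guarantee, using a different kth than the question (k = 2 is chosen here purely for illustration). It checks the properties that hold rather than a specific ordering, since the order within each side is not defined:

```python
import numpy as np

one_dim = np.array([2, 3, 1, 5, 4])
k = 2  # index that must end up holding its sorted value

idx = np.argpartition(one_dim, k)
part = one_dim[idx]

# Only position k is guaranteed to match the fully sorted array.
kth_is_sorted = bool(part[k] == np.sort(one_dim)[k])
# Everything before k is <= part[k]; everything after is >= part[k].
left_ok = bool(np.all(part[:k] <= part[k]))
right_ok = bool(np.all(part[k + 1:] >= part[k]))

print(part, kth_is_sorted, left_ok, right_ok)
```

If a fully sorted result is needed, np.argsort is the function to use; argpartition only does the cheaper partial job.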
I am currently working on the following:
data - with the correct index
wcss = []  # within-cluster sum of squares for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data_values)
    wcss.append(kmeans.inertia_)
kmeans = KMeans(n_clusters=2).fit(data_values)
y = kmeans.fit_predict(data_values) # prediction of k
df= pd.DataFrame(y,index = data.index)
....
#got here multiple dicts
Example of y:
[1 2 3 4 5 2 2 5 1 0 0 1 0 0 1 0 1 4 4 4 3 1 0 0 1 0 0 ...]
f = pd.DataFrame(y, columns = [buster] )
f.to_csv('busters.csv', mode='a')
y = clusters after determination
I don't know how I got stuck on this. I am iterating over 20 dataframes, each of which consists of one column with values from 1-9. The index is irrelevant. I am trying to append all the frames together side by side, but instead they are just written one after the other. If I add ".T" to transpose, I still get rows with irrelevant values as the index, which I can't remove because they are actually headers.
Needed result
If the dicts produced in each iteration look like {'Buster1': [0, 2, 2, 4, 5]}, {'Buster2': [1, 2, 3, 4, 5]} ..., using 5 elements here for illustration purposes, and all the lists, i.e., values in the dicts, have the same number of elements (as it is the case in your example), you could create a single dict and use pd.DataFrame directly. (You may also want to take a look at pandas.DataFrame.from_dict.)
You may have lists with more than 5 elements, more than 3 dicts (and thus columns), and you will be generating the dicts with a loop, but the code below should be sufficient for getting the idea.
>>> import pandas as pd
>>>
>>> d = {}
>>> # update d in every iteration
>>> d.update({'Buster 1': [0, 2, 2, 4, 5]})
>>> d.update({'Buster 2': [1, 2, 3, 4, 5]})
>>> # ...
>>> d.update({'Buster n': [0, 9, 3, 0, 0]})
>>>
>>> pd.DataFrame(d, columns=d.keys())
   Buster 1  Buster 2  Buster n
0         0         1         0
1         2         2         9
2         2         3         3
3         4         4         0
4         5         5         0
If you have the keys, e.g., 'Buster 1', and values, e.g., [0, 2, 2, 4, 5], separated, as I believe is the case, you can simplify the above (and make it more efficient) by replacing d.update({'Buster 1': [0, 2, 2, 4, 5]}) with d['Buster 1']=[0, 2, 2, 4, 5].
I included columns=d.keys() because depending on your Python and pandas version the ordering of the columns may not be as you expect it to be. You can specify the ordering of the columns through specifying the order in which you provide the keys. For example:
>>> pd.DataFrame(d, columns=sorted(d.keys(),reverse=True))
   Buster n  Buster 2  Buster 1
0         0         1         0
1         9         2         2
2         3         3         2
3         0         4         4
4         0         5         5
Although it may not apply to your use case, if you do not want to print the index, you can take a look at How to print pandas DataFrame without index.
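The dict-building idea above could be written as a loop along these lines. This is only a sketch: label_lists is a hypothetical stand-in for the cluster labels produced in each of your iterations, and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the labels produced in each iteration
label_lists = [
    [0, 2, 2, 4, 5],
    [1, 2, 3, 4, 5],
    [0, 9, 3, 0, 0],
]

columns = {}
for i, labels in enumerate(label_lists, start=1):
    # One dict entry per iteration; each entry becomes one column
    columns[f'Buster {i}'] = labels

result = pd.DataFrame(columns)

# With no path, to_csv returns the CSV text; index=False omits the row index
csv_text = result.to_csv(index=False)
print(csv_text)
```

Building the frame once, after the loop, avoids appending to the CSV file on every iteration, which is what stacks the columns underneath each other.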
Suppose I have a 2 dimensional array with a very large number of rows, and a list of pairs of indexes of that array. I want to create a new 2 dim array, whose rows are concatenations of the rows of the original array, made according to the list of pairs of indexes. For example:
a =
1 2 3
4 5 6
7 8 9
0 0 0
indexes = [[0,0], [0,1], [2,3]]
the returned array should be:
1 2 3 1 2 3
1 2 3 4 5 6
7 8 9 0 0 0
Obviously I can iterate the list of indexes, but my question is whether there is a more efficient way of doing this. I should say that the list of indexes is also very large.
First convert indexes to a Numpy array:
ind = np.array(indexes)
Then generate your result as:
result = np.concatenate([a[ind[:,0]], a[ind[:,1]]], axis=1)
The result is:
array([[1, 2, 3, 1, 2, 3],
[1, 2, 3, 4, 5, 6],
[7, 8, 9, 0, 0, 0]])
Another possible formula (with the same result):
result = np.concatenate([ a[ind[:,i]] for i in range(ind.shape[1]) ], axis=1)
You can do this in one line using NumPy as:
a = np.arange(12).reshape(4, 3)
print(a)
b = [[0, 0], [1, 1], [2, 3]]
b = np.array(b)
print(b)
c = a[b.reshape(-1)].reshape(-1, a.shape[1]*b.shape[1])
print(c)
'''
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
[[0 0]
[1 1]
[2 3]]
[[ 0 1 2 0 1 2]
[ 3 4 5 3 4 5]
[ 6 7 8 9 10 11]]
'''
You can use horizontal stacking np.hstack:
c = np.array(indexes)
np.hstack((a[c[:,0]],a[c[:,1]]))
output:
[[1 2 3 1 2 3]
[1 2 3 4 5 6]
[7 8 9 0 0 0]]
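As a quick check, the approaches from the answers above can be run side by side on the question's example. The a[ind].reshape(...) variant is an extra shorthand not taken from the answers; it relies on fancy indexing with a 2D index array returning one block of rows per pair.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [0, 0, 0]])
ind = np.array([[0, 0], [0, 1], [2, 3]])

# Index each column of pairs separately, then join side by side
via_concat = np.concatenate([a[ind[:, 0]], a[ind[:, 1]]], axis=1)
via_hstack = np.hstack((a[ind[:, 0]], a[ind[:, 1]]))

# Flatten the pairs, gather all rows, then fold each pair into one row
via_reshape = a[ind.reshape(-1)].reshape(-1, a.shape[1] * ind.shape[1])

# a[ind] has shape (n_pairs, 2, n_cols); merge the last two axes
via_fancy = a[ind].reshape(len(ind), -1)

print(via_fancy)
```

All four avoid a Python-level loop over the pairs, so they should scale to a large list of indexes.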
I am trying to stack arrays horizontally, using numpy hstack, but can't get it to work. Instead, it all comes out in one list, instead of a 'matrix-looking' 2D array.
import numpy as np
y = np.array([0,2,-6,4,1])
y_bool = y > 0
y_bool = [1 if l else 0 for l in y_bool]  # convert booleans to 0/1 for classification
y_range = list(range(0, len(y)))
print(y)
print(y_bool)
print(y_range)
print(np.hstack((y, y_bool, y_range)))
Prints this:
[ 0 2 -6 4 1]
[0, 1, 0, 1, 1]
[0, 1, 2, 3, 4]
[ 0 2 -6 4 1 0 1 0 1 1 0 1 2 3 4]
How do I instead get the last line to look like this:
[0 0 0
2 1 1
-6 0 2
4 1 3]
If you want to create a 2D array, do:
print(np.transpose(np.array((y, y_bool, y_range))))
# [[ 0 0 0]
# [ 2 1 1]
# [-6 0 2]
# [ 4 1 3]
# [ 1 1 4]]
Well, close enough: the h is for horizontal (column-wise). If you check its help, you will see under See Also:
vstack : Stack arrays in sequence vertically (row wise).
dstack : Stack arrays in sequence depth wise (along third axis).
concatenate : Join a sequence of arrays together.
Edit: I first thought vstack would do it, but you would need np.vstack(...).T or np.dstack(...).squeeze(). Other than that, the "problem" is that the arrays are 1D and you want them to act like 2D, so you could do:
print(np.hstack([np.asarray(a)[:, np.newaxis] for a in (y, y_bool, y_range)]))
The np.asarray is there just in case one of the variables is a list. The np.newaxis makes each array 2D (a single column), which makes it clearer what happens when concatenating.
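Another option, not mentioned in the answers above, is np.column_stack, which treats each 1D input as a column. A minimal sketch using the question's data:

```python
import numpy as np

y = np.array([0, 2, -6, 4, 1])
y_bool = [1 if v else 0 for v in (y > 0)]  # plain list of 0/1 flags
y_range = list(range(len(y)))

# Each 1D sequence becomes one column of the 2D result
stacked = np.column_stack((y, y_bool, y_range))

# Same result as transposing the row-wise array
transposed = np.transpose(np.array((y, y_bool, y_range)))

print(stacked)
```

It accepts lists as well as arrays, so no explicit np.asarray or np.newaxis is needed.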
I'm interested in the performance of NumPy, when it comes to algorithms that check whether a condition is True for an element and its affiliations (e.g. neighbouring elements) and assign a value according to the condition.
An example may be: (I make this up now)
I generate a 2d array of 1's and 0's, randomly.
Then I check whether the first element of the array is the same as its neighbors.
If the similar ones are the majority, I switch (0 -> 1 or 1 -> 0) that particular element.
And I proceed to the next element.
I guess that this kind of element-wise condition and element-wise operation is pretty slow with NumPy. Is there a way to improve the performance?
For example, would creating the array with dtype bool and adjusting the code accordingly help?
Thanks in advance.
Maybe http://www.scipy.org/Cookbook/GameOfLifeStrides helps you.
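It looks like you are doing some kind of image processing; you can try scipy.ndimage.
from scipy.ndimage import convolve
import numpy as np
np.random.seed(0)
x = np.random.randint(0, 2, (5, 5))  # random 5x5 array of 0s and 1s
print(x)
# 3x3 kernel of ones with a zero centre: sums the 8 surrounding cells
w = np.ones((3, 3), dtype=np.int8)
w[1, 1] = 0
# mode="constant" pads with zeros, so cells outside the array count as 0
y = convolve(x, w, mode="constant")
print(y)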
the outputs are:
[[0 1 1 0 1]
[1 1 1 1 1]
[1 0 0 1 0]
[0 0 0 0 1]
[0 1 1 0 0]]
[[3 4 4 5 2]
[3 5 5 5 3]
[2 4 4 4 4]
[2 3 3 3 1]
[1 1 1 2 1]]
y is the sum of the neighbors of every element. Do the same convolve with an array of all ones and you get the number of neighbors of every element:
>>> n = convolve(np.ones((5,5),np.int8), w, mode="constant")
>>> n
[[3 5 5 5 3]
[5 8 8 8 5]
[5 8 8 8 5]
[5 8 8 8 5]
[3 5 5 5 3]]
then you can do element-wise operations with x, y, n, and get your result.
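One possible way to finish the example from the question, as a sketch only. It assumes all elements are updated simultaneously from the original array, which differs from the question's sequential "proceed to the next element" description, and it reads "majority" as strictly more than half of the existing neighbors. The array x is the one printed above, written out explicitly.

```python
import numpy as np
from scipy.ndimage import convolve

x = np.array([[0, 1, 1, 0, 1],
              [1, 1, 1, 1, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 1, 0, 0]])

w = np.ones((3, 3), dtype=np.int8)
w[1, 1] = 0

y = convolve(x, w, mode="constant")                # neighbors equal to 1
n = convolve(np.ones_like(x), w, mode="constant")  # neighbors that exist

# Neighbors with the same value as the element itself:
# for a 1 that is y, for a 0 it is the remaining n - y
same = np.where(x == 1, y, n - y)

# Assumed rule: flip where strictly more than half the neighbors match
flip = 2 * same > n
result = np.where(flip, 1 - x, x)

print(result)
```

Everything here is whole-array arithmetic, so there is no Python loop over elements. A truly sequential update, where each flip affects later elements, cannot be expressed this way and would need a loop.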