Move duplicate index pairs out of one array into new arrays - python

I have a 2d numpy array that I use for indexing (by making it a tuple of two numpy arrays, see below). From that array I now want to move all duplicate index pairs out of the original array and into a new array (it can be multiple new arrays, each without duplicates, or one that may still contain duplicates if a pair occurred more than twice):
>>> import numpy as np
>>> L = 2
>>> indices = tuple(np.random.randint(0, L, (2, L**2)))
>>> indices
(array([0, 1, 0, 1]), array([1, 0, 0, 0]))
What I want to get is:
indices = (array([0, 1, 0]), array([1, 0, 0]))
indices_2 = (array([1]), array([0]))

You can use:
L = 2
indices = tuple(np.random.randint(0, L, (2, L**2)))
# stack the two arrays to handle simultaneously
# one can also use the original array without converting to tuple
a = np.vstack(indices)
# array([[0, 1, 0, 1],
#        [1, 0, 0, 0]])
# get unique values and indices of the first occurrences
# expand the indices to a tuple of arrays
(*indices,), idx = np.unique(a, axis=1, return_index=True)
# ((array([0, 0, 1]), array([0, 1, 0])), array([2, 0, 1]))
# remove the first occurrences to keep only the duplicates
# again, convert the 2D indices into a tuple of arrays
(*indices_2,) = np.delete(a, idx, axis=1)
# (array([1]), array([0]))
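If a pair can occur more than twice and you want each extra round of duplicates separated out as well, a minimal sketch (reusing the stacked-array idea above; the example input and the rounds list are only illustrative) is to peel off the first occurrences repeatedly:
import numpy as np

indices = (np.array([0, 1, 0, 1, 0, 0]), np.array([1, 0, 0, 0, 0, 0]))  # (0, 0) occurs three times here
a = np.vstack(indices)

rounds = []
while a.shape[1]:
    # first occurrence of each remaining unique column
    _, idx = np.unique(a, axis=1, return_index=True)
    rounds.append(tuple(a[:, np.sort(idx)]))  # keep the original pair order within each round
    a = np.delete(a, idx, axis=1)             # what remains are the duplicates

# rounds[0]: every pair once, rounds[1]: pairs seen at least twice, rounds[2]: at least three times, ...
print(rounds)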

Related

Repeat Array while Maintaining Order within group

I have the below array and would like to repeat each array n times.
x_array
[array([14.91488012, 1.2986064 , 4.98965322]),
array([2.39389187e+02, 1.04442059e-01, 3.06391338e-01]),
array([ 48.19437348, 201.09951372, 0.35223001]),
array([ 19.96978171, 367.52578786, 0.68676553]),
array([0.55120466, 0.27133609, 0.75646697]),
array([8.21287360e+02, 1.76495077e+02, 4.87263691e-01]),
array([184.03439377, 1.24823107, 5.33109884]),
array([575.59800297, 186.4650814 , 2.21028258]),
array([0.50308552, 3.09976082, 0.10537899]),
array([1.02259912e+00, 1.52282513e+02, 1.15085308e-01])]
I've tried np.repeat(x_array, 2), but this doesn't preserve the order of the matrix/array. I've also tried x_array*2, but this seems to just put the new array at the bottom. I was hoping to repeat x_array[0] n times and do the same for the next arrays, so that I have n copies of each, in order.
Thanks in advance.
Building off of the last example from https://numpy.org/doc/stable/reference/generated/numpy.repeat.html,
x_array = np.array(x_array)  # or a similar operation to convert x_array to an ndarray instead of a list of arrays
expanded_x_array = np.repeat(x_array, n, axis=0)
print(expanded_x_array)
should produce what you are looking for.
You just need to specify the axis:
>>> np.repeat(x_array, 2, axis=0)
array([[1.49149e+01, 1.29861e+00, 4.98965e+00],
[1.49149e+01, 1.29861e+00, 4.98965e+00],
[2.39389e+02, 1.04442e-01, 3.06391e-01],
[2.39389e+02, 1.04442e-01, 3.06391e-01],
...,
[5.03086e-01, 3.09976e+00, 1.05379e-01],
[5.03086e-01, 3.09976e+00, 1.05379e-01],
[1.02260e+00, 1.52283e+02, 1.15085e-01],
[1.02260e+00, 1.52283e+02, 1.15085e-01]])
From the docs:
numpy.repeat(a, repeats, axis=None)
...
axis int, optional
The axis along which to repeat values. By default, use the flattened input array, and return a flat output array.
You could use a list comprehension:
n = 2
repeated_list = [row for row in x_array for _ in range(n)]
print(repeated_list)
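If you need the result back as a single ndarray rather than a list of arrays, one option (a small sketch, assuming the rows all have the same length, as they do here) is to build an array from the repeated list afterwards:
import numpy as np

x_array = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]  # a tiny stand-in for the real x_array
n = 2
repeated_list = [row for row in x_array for _ in range(n)]
repeated_array = np.array(repeated_list)   # shape (n * len(x_array), 3)
print(repeated_array)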
Your terminology is confusing. You say it's an "array", but the display looks more like a list, and the fact that x_array*2 puts a "new array" at the bottom confirms that - that's the list use of *.
np.repeat(x_array, ...) first makes an array (a real one!):
np.array(x_array)
is an (n,3) float dtype array. Without axis, np.repeat flattens - as documented!
Specifying axis=0 works because it repeats along that first n dimension. The result is a (2*n,3) float dtype array (not a list).
It is possible to make a 1d object dtype array containing those arrays. With that repeat will work without the axis parameter.
Knowing what you have, and describing it accurately, can make this kind of task much easier - and the questions clearer.
Illustration:
Make a list of arrays:
In [21]: alist = [np.ones(3,int),np.zeros(3,int),np.arange(3)]
In [22]: alist
Out[22]: [array([1, 1, 1]), array([0, 0, 0]), array([0, 1, 2])]
List repeat:
In [23]: alist*2
Out[23]:
[array([1, 1, 1]),
array([0, 0, 0]),
array([0, 1, 2]),
array([1, 1, 1]),
array([0, 0, 0]),
array([0, 1, 2])]
Make a 2d array from the list:
In [24]: np.array(alist)
Out[24]:
array([[1, 1, 1],
[0, 0, 0],
[0, 1, 2]])
repeat without axis repeats elements in a flattened way:
In [25]: np.repeat(alist,2)
Out[25]: array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
repeat this 2d array on 0 axis:
In [26]: np.repeat(alist,2,axis=0)
Out[26]:
array([[1, 1, 1],
[1, 1, 1],
[0, 0, 0],
[0, 0, 0],
[0, 1, 2],
[0, 1, 2]])
Object dtype array from list:
In [27]: arr = np.empty(3,object); arr[:]=alist
In [28]: arr
Out[28]: array([array([1, 1, 1]), array([0, 0, 0]), array([0, 1, 2])], dtype=object)
Since the arrays all have the same size, we have to use this special construct; otherwise we would just get the 2d array from [24].
This array has a repeat method, and with only one dimension we don't need to specify the axis. It repeats the object elements - the arrays - not the numbers as in the 2d array of [24].
In [29]: arr.repeat(2)
Out[29]:
array([array([1, 1, 1]), array([1, 1, 1]), array([0, 0, 0]),
array([0, 0, 0]), array([0, 1, 2]), array([0, 1, 2])], dtype=object)

numpy.insert() function insert array into wrong index

My code fetches values from text files and builds matrices as a multidimensional array, but the problem is that it creates more than a two-dimensional array, which I can't manipulate. I need a two-dimensional array; how do I do that?
The algorithm of my code:
Goal of the code: it fetches values from a specific folder. Each folder contains 7 txt files generated by one user, so multiple folders hold the data of multiple users.
Step 1: Start the first for loop, controlled by how many folders there are, and store the path of the current folder in the variable 'path'.
Step 2: Open the path and fetch the data of the 7 txt files using a second for loop. After fetching, the second loop closes and the rest of the code executes.
Step 3: Concatenate the data of the 7 txt files into one 1d array.
Step 4: Create a 2d array from the data of 2 folders.
Step 5 (here the problem arises): insert each new 1d array as a row of the 2d array.
import numpy as np
import array as arr
import os

f_path = 'Result'
array_control_var = 0
# fetch each directory path
for (path, dirs, file) in os.walk(f_path):
    if path == f_path:
        continue
    f_path_1 = path + '\page_1.txt'
    # get the data from page_1 individually because it contains string type data
    pgno_1 = np.array(np.loadtxt(f_path_1, dtype='U', delimiter=','))
    # only for page_2.txt
    f_path_2 = path + '\page_2.txt'
    with open(f_path_2) as f:
        str_arr = ','.join([l.strip() for l in f])
    pgno_2 = np.asarray(str_arr.split(','), dtype=int)
    # fetch the data from the remaining text files in a loop; data type = int
    for j in range(3, 8):
        # store the file path in a variable
        txt_file_path = path + '\page_' + str(j) + '.txt'
        if os.path.exists(txt_file_path):
            # generate a variable name that auto-increments with the loop
            foo = 'pgno_' + str(j)
        else:
            break
        # pass the variable name as a string and store the value
        exec(foo + " = np.array(np.loadtxt(txt_file_path, dtype='i', delimiter=','))")
    # merge all arrays from page 2 onward into a single one-dimensional array
    f_array = np.concatenate((pgno_2, pgno_3, pgno_4, pgno_5, pgno_6, pgno_7), axis=0)
    # on the first pass of the loop just assign the value
    if array_control_var == 0:
        main_f_array = f_array
    if array_control_var == 1:
        # here use np.array()
        main_f_array = np.array([main_f_array, f_array])
    else:
        main_f_array = np.insert(main_f_array, array_control_var, f_array, 0)
    array_control_var += 1
print(main_f_array)
I want output like this
Initial
[[0,0,0],[0,0,0]]
after insert
[[0,0,0],[0,0,0],[0,0,0]]
but the out put is
[array([0, 0, 0])
array([0, 0, 0])
0 0 0]
When I recommend replacing the insert with a list build, here's what I have in mind.
import numpy as np
alist = []
for i in range(4):
    f_array = np.array([i, i+2, i+4])
    alist.append(f_array)
print(alist)
main_f_array = np.array(alist)
print(main_f_array)
test run:
1246:~/mypy$ python3 stack54715610.py
[array([0, 2, 4]), array([1, 3, 5]), array([2, 4, 6]), array([3, 5, 7])]
[[0 2 4]
[1 3 5]
[2 4 6]
[3 5 7]]
If your file loading produces arrays that differ in size, you'll get different results. For example, with
f_array = np.arange(i, i+1+i)
1246:~/mypy$ python3 stack54715610.py
[array([0]), array([1, 2]), array([2, 3, 4]), array([3, 4, 5, 6])]
[array([0]) array([1, 2]) array([2, 3, 4]) array([3, 4, 5, 6])]
This is a 1d object dtype array, as opposed to the 2d.
As I commented, collecting arrays with insert (or variations on concatenate) is hard to get right, and slow even when it works: it builds a whole new array each time. Collecting the arrays in a list and doing one array build at the end is easier and faster. List append is efficient and easy to use.
That said, your reported result looks suspicious. I can reproduce it with:
In [281]: arr = np.zeros(2, object)
In [282]: arr
Out[282]: array([0, 0], dtype=object)
In [283]: arr[0] = np.array([0,0,0])
In [284]: arr[1] = np.array([0,0,0])
In [285]: arr
Out[285]: array([array([0, 0, 0]), array([0, 0, 0])], dtype=object)
In [286]: np.insert(arr, 2, np.array([0,0,0]), 0)
Out[286]: array([array([0, 0, 0]), array([0, 0, 0]), 0, 0, 0], dtype=object)
At an earlier iteration, main_f_array must have been created as an object dtype array.
If it had been a 'normal' 2d array, the insert would be different:
In [287]: arr1 = np.zeros((2,3),int)
In [288]: np.insert(arr1, 2, np.array([0,0,0]), 0)
Out[288]:
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
Or more iteratively as I think you wanted:
In [289]: f_array = np.array([0,0,0])
In [290]: main = f_array
In [291]: main = np.array([main, f_array])
In [292]: main
Out[292]:
array([[0, 0, 0],
[0, 0, 0]])
In [293]: main = np.insert(main, 2, f_array, 0)
In [294]: main
Out[294]:
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])

Example in np.argsort document

For some reason I cannot resolve this.
According to the example here for 1-dim array,
https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
x = np.array([3, 1, 2])
np.argsort(x)
array([1, 2, 0])
And I have tried this myself. But by default the sort should be ascending, meaning
x[result]
returns
array([1, 2, 3])
Thus shouldn't the result be [2, 0, 1]?
What am I missing here?
From the docs, the first line states "Returns the indices that would sort an array." Hence, if you want the positions of the sorted values, we have:
x = np.array([3, 1, 2])
np.argsort(x)
>>>array([1, 2, 0])
Here we want the index positions of 1, 2 and 3 in x. The position of 1 is 1, the position of 2 is 2, and the position of 3 is 0, hence array([1, 2, 0]) gives the positions of the sorted values (1, 2, 3).
Again from the notes, " argsort returns an array of indices of the same shape as a that index data along the given axis in sorted order."
A more intuitive way of looking at what that means is to use a for loop, where we loop over the returned argsort values and index the initial array with them:
x = np.array([3, 1, 2])
srt_positions = np.argsort(x)
for k in srt_positions:
    print(x[k])
>>> 1, 2, 3
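As a side note (not part of the answer above): if what you actually expected, [2, 0, 1], is the rank of each element, i.e. where each value would land after sorting, that is the inverse of the argsort permutation, and you can get it by applying argsort twice:
import numpy as np

x = np.array([3, 1, 2])
order = np.argsort(x)      # array([1, 2, 0]) - indices that sort x, so x[order] is sorted
ranks = np.argsort(order)  # array([2, 0, 1]) - where each element of x lands after sorting
print(x[order], ranks)     # [1 2 3] [2 0 1]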

How to convert a series of index/category, into a classification array

How do I convert a series of indexes into a 2-D array that expresses the category/class defined by each index value in the list?
e.g.:
import numpy as np
aList = [0,1,0,2]
anArray = np.array(aList)
resultArray = convertToCategories(anArray)
and the return value of convertToCategories() would be like:
[[1,0,0], # the 0th element of aList is index category 0
[0,1,0], # the 1st element of aList is index category 1
[1,0,0], # the 2nd element of aList is index category 0
[0,0,1]] # the 3rd element of aList is index category 2
As a last resort, I could of course:
parse the list,
count the number of categories (they are contiguous, so it's simply a matter of finding the maximum),
create a zeroed array of the right size,
then reparse the list and fill the array with 1 (or True) at the indices it gives (a quick sketch of this fallback follows below).
But I am wondering if there is a more pythonic way, or a dedicated numpy or pandas function, to achieve this kind of transformation.
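For reference, a minimal sketch of that plain-loop fallback (the variable names just mirror the example above):
import numpy as np

aList = [0, 1, 0, 2]
n_categories = max(aList) + 1                                   # categories are contiguous, so max + 1
resultArray = np.zeros((len(aList), n_categories), dtype=int)   # zeroed array of the right size
for i, cat in enumerate(aList):                                 # reparse the list and set a 1 per row
    resultArray[i, cat] = 1
print(resultArray)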
You can do something like this -
import numpy as np
# Size parameters
N = anArray.size
M = anArray.max()+1
# Setup output array
resultArray = np.zeros((N,M),int)
# Find out the linear indices where 1s would be put
idx = (np.arange(N)*M) + anArray
# Finally, put 1s at those places for the final output
resultArray.ravel()[idx] = 1
Sample run -
In [188]: anArray
Out[188]: array([0, 1, 0, 2, 4, 1, 3])
In [189]: resultArray
Out[189]:
array([[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 1],
[0, 1, 0, 0, 0],
[0, 0, 0, 1, 0]])
Or, better just directly index into the output array with the row and column indices -
# Setup output array and put 1s at places indexed by row and column indices.
# Here, anArray would be the column indices and [0,1,....N-1] would be the row indices
resultArray = np.zeros((N,M),int)
resultArray[np.arange(N),anArray] = 1
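As an aside (not from the answers above), another common idiom for this kind of one-hot encoding is to index into an identity matrix; a small sketch, assuming the categories start at 0:
import numpy as np

anArray = np.array([0, 1, 0, 2])
M = anArray.max() + 1                        # number of categories
resultArray = np.eye(M, dtype=int)[anArray]  # each entry picks the matching row of the identity matrix
print(resultArray)
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]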

Ignoring duplicate entries in sparse matrix

I've tried to initialize csc_matrix and csr_matrix from a list of (data, (rows, cols)) values as the documentation suggests.
sparse = csc_matrix((data, (rows, cols)), shape=(n, n))
The problem is that, the method that I actually have for generating the data, rows and cols vectors introduces duplicates for some points. By default, scipy adds the values of the duplicate entries. However, in my case, those duplicates have exactly the same value in data for a given (row, col).
What I'm trying to achieve is to make scipy ignore the second entry if already exists one, instead of adding them.
Ignoring the fact that I could improve the generation algorithm to avoid generating duplicates, is there a parameter or another way of creating a sparse matrix that ignores duplicates?
Currently, two entries with data = [4, 4]; cols = [1, 1]; rows = [1, 1]; generate a sparse matrix whose value at (1,1) is 8, while the desired value is 4.
>>> c = csc_matrix(([4, 4], ([1,1],[1,1])), shape=(3,3))
>>> c.todense()
matrix([[0, 0, 0],
[0, 8, 0],
[0, 0, 0]])
I'm also aware that I could filter them by using a 2-dimensional numpy unique function, but the lists are quite large, so this is not really a viable option.
Other possible answer to the question: Is there any way of specifying what to do with duplicates? i.e. keeping the min or max instead of the default sum?
Creating an intermediary dok matrix works in your example:
In [410]: c=sparse.coo_matrix((data, (cols, rows)),shape=(3,3)).todok().tocsc()
In [411]: c.A
Out[411]:
array([[0, 0, 0],
[0, 4, 0],
[0, 0, 0]], dtype=int32)
A coo matrix puts your input arrays into its data,col,row attributes without change. The summing doesn't occur until it is converted to a csc.
todok loads the dictionary directly from the coo attributes. It creates the blank dok matrix, and fills it with:
dok.update(izip(izip(self.row,self.col),self.data))
So if there are duplicate (row,col) values, it's the last one that remains. This uses the standard Python dictionary hashing to find the unique keys.
Here's a way of using np.unique. I had to construct a special object dtype array, because unique operates on 1d arrays and we have 2d indices.
In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1],[0,1,1,1,2],[0,1,2,1,1]]]
In [480]: x=np.zeros(cols.shape,dtype=object)
In [481]: x[:]=list(zip(rows,cols))
In [482]: x
Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)
In [483]: i=np.unique(x,return_index=True)[1]
In [484]: i
Out[484]: array([0, 1, 4, 2], dtype=int32)
In [485]: c1=sparse.csc_matrix((data[i],(cols[i],rows[i])),shape=(3,3))
In [486]: c1.A
Out[486]:
array([[1, 0, 0],
[0, 4, 2],
[0, 1, 0]], dtype=int32)
I have no idea which approach is faster.
An alternative way of getting the unique index, as per liuengo's link:
rc = np.vstack([rows,cols]).T.copy()
dt = rc.dtype.descr * 2
i = np.unique(rc.view(dt), return_index=True)[1]
rc has to own its own data in order to change the dtype with view, hence the .T.copy().
In [554]: rc.view(dt)
Out[554]:
array([[(0, 0)],
[(1, 1)],
[(2, 1)],
[(1, 1)],
[(1, 2)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
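As a side note, newer NumPy (1.13+) lets np.unique work along an axis directly, so the structured view trick above isn't strictly necessary; a small sketch on the same rows/cols:
import numpy as np

rows = np.array([0, 1, 2, 1, 1])
cols = np.array([0, 1, 1, 1, 2])
rc = np.stack([rows, cols], axis=1)              # one (row, col) pair per line, shape (5, 2)
i = np.unique(rc, axis=0, return_index=True)[1]  # indices of the first occurrence of each pair
print(i)                                         # [0 1 4 2], matching the result above
# data[i], rows[i], cols[i] can then be passed to csc_matrix as before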
Since the values in your data at repeating (row, col) are the same, you can get the unique rows, columns and values as follows:
rows, cols, data = zip(*set(zip(rows, cols, data)))
Example:
data = [4, 3, 4]
cols = [1, 2, 1]
rows = [1, 3, 1]
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 8, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
rows, cols, data = zip(*set(zip(rows, cols, data)))
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 4, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
Just to update hpaulj's answer for the most recent version of SciPy: given a COO matrix c, the simplest solution to this problem is now:
dok=sparse.dok_matrix((c.shape),dtype=c.dtype)
dok._update(zip(zip(c.row,c.col),c.data))
new_c = dok.tocsc()
This is due to a new wrapper around the dok update() function that prevents direct changes to the array; the underscore version bypasses that wrapper.
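A small end-to-end sketch of that recipe on the data from the question (it relies on the private _update bypass described above, so it may not work on every SciPy version):
import numpy as np
from scipy import sparse

data, rows, cols = np.array([4, 4]), np.array([1, 1]), np.array([1, 1])
c = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))

dok = sparse.dok_matrix(c.shape, dtype=c.dtype)
dok._update(zip(zip(c.row, c.col), c.data))
new_c = dok.tocsc()
print(new_c.toarray())   # the duplicate (1, 1) entries collapse to 4 instead of summing to 8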
