Map index of numpy matrix - python

How should I map indices of a numpy matrix?
For example:
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6]])
The row/column indices are 0, 1, 2.
So:
>>> mx[0,0]
5
Let's say I need to map these indices, converting 0, 1, 2 into, e.g., 10, 'A', 'B', so that:
mx[10,10] #returns 5
mx[10,'A'] #returns 6 and so on..
I can just set a dict and use it to access the elements, but I would like to know if it is possible to do something like what I just described.

I would suggest using a pandas DataFrame, with the new mapping as its index (rows) and columns. It lets us select a single element, an entire row, or an entire column by label, with the familiar colon operator for slicing.
Consider a generic, non-square matrix of shape 4x3 -
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6],[4,5,2]])
Consider the mappings for rows and columns -
row_idx = [10, 'A', 'B','C']
col_idx = [10, 'A', 'B']
Let's take a look at the workflow with the given sample -
# Get data into dataframe with given mappings
In [57]: import pandas as pd
In [58]: df = pd.DataFrame(mx,index=row_idx, columns=col_idx)
# Here's what the dataframe looks like
In [60]: df
Out[60]:
    10  A  B
10   5  6  2
A    3  3  7
B    0  1  6
C    4  5  2
# Get one scalar element
In [61]: df.loc['C',10]
Out[61]: 4
# Get one entire col
In [63]: df.loc[:,10].values
Out[63]: array([5, 3, 0, 4])
# Get one entire row
In [65]: df.loc['A'].values
Out[65]: array([3, 3, 7])
Best of all, no extra copies are made here: the dataframe and its slices still index into the original matrix/array memory, at least in the pandas version used for this session (newer versions or copy settings may behave differently) -
In [98]: np.shares_memory(mx,df.loc[:,10].values)
Out[98]: True
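For single-cell lookups, pandas also offers the label-based `.at` accessor, while `.iloc` still gives the original positional access. The following is a minimal sketch, assuming a plain `np.array` in place of `np.matrix`; the names `by_label`, `by_position` and `row_a` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data and labels as in the answer above, built from a plain ndarray (assumption)
mx = np.array([[5, 6, 2], [3, 3, 7], [0, 1, 6], [4, 5, 2]])
df = pd.DataFrame(mx, index=[10, 'A', 'B', 'C'], columns=[10, 'A', 'B'])

by_label = df.at['C', 10]         # label-based scalar lookup
by_position = df.iloc[0, 0]       # positional lookup, like mx[0, 0]
row_a = df.loc['A'].to_numpy()    # one full row as a NumPy array

print(by_label, by_position, row_a)
```

`.loc` would work for the scalar case too; `.at` is simply the accessor intended for single values.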

Try this:
import numpy as np
A = np.array(((1,2),(3,4),(50,100)))
dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
B = np.array(list(map(tuple, A)), dtype=dt)
print(B['ID'])

You can use the __getitem__ and __setitem__ special methods and create a new class as shown.
Store the index map as a dictionary in an instance variable self.index_map.
import numpy as np
class Matrix(np.matrix):
    def __init__(self, lis):
        self.matrix = np.matrix(lis)
        self.index_map = {}
    def setIndexMap(self, index_map):
        self.index_map = index_map
    def getIndex(self, key):
        if type(key) is slice:
            return key
        elif key not in self.index_map.keys():
            return key
        else:
            return self.index_map[key]
    def __getitem__(self, idx):
        return self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])]
    def __setitem__(self, idx, value):
        self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])] = value
Usage:
Creating a matrix.
>>> mx = Matrix([[5,6,2],[3,3,7],[0,1,6]])
>>> mx
Matrix([[5, 6, 2],
        [3, 3, 7],
        [0, 1, 6]])
Defining the Index Map.
>>> mx.setIndexMap({10:0, 'A':1, 'B':2})
Different ways to index the matrix.
>>> mx[0,0]
5
>>> mx[10,10]
5
>>> mx[10,'A']
6
It also handles slicing as shown.
>>> mx[1:3, 1:3]
matrix([[3, 7],
        [1, 6]])
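The same idea can also be sketched without subclassing `np.matrix`, by wrapping a plain array instead. This is only an illustration under stated assumptions: `LabeledArray` and `_translate` are hypothetical names, not part of NumPy, and labels in the map take priority over positions with the same value.

```python
import numpy as np

class LabeledArray:
    """Hypothetical wrapper that translates labels to positions before indexing."""

    def __init__(self, data, index_map=None):
        self.data = np.asarray(data)
        self.index_map = dict(index_map or {})

    def _translate(self, key):
        # Slices pass through untouched; unknown keys are treated as positions
        if isinstance(key, slice):
            return key
        try:
            return self.index_map.get(key, key)
        except TypeError:  # unhashable keys such as lists
            return key

    def __getitem__(self, idx):
        row, col = idx
        return self.data[self._translate(row), self._translate(col)]

    def __setitem__(self, idx, value):
        row, col = idx
        self.data[self._translate(row), self._translate(col)] = value

mx = LabeledArray([[5, 6, 2], [3, 3, 7], [0, 1, 6]], {10: 0, 'A': 1, 'B': 2})
print(mx[10, 10], mx[10, 'A'], mx[0, 0])
mx['B', 'B'] = 9
```

Composition keeps the wrapper independent of how NumPy constructs its own array subclasses, at the cost of not being an array itself.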

Related

Why can't pandas replace NaN with an array of 0s using masks/replace?

I have a series like this
s = pd.Series([[1,2,3],[1,2,3],np.nan,[1,2,3],[1,2,3],np.nan])
and I simply want the NaN to be replaced by [0,0,0].
I have tried
s.fillna([0,0,0]) # TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
s[s.isna()] = [[0,0,0],[0,0,0]] # just replaces the NaN with a single "0". WHY?!
s.fillna("NAN").replace({"NAN":[0,0,0]}) # ValueError: NumPy boolean array indexing assignment cannot
#assign 3 input values to the 2 output values where the mask is true
s.fillna("NAN").replace({"NAN":[[0,0,0],[0,0,0]]}) # TypeError: NumPy boolean array indexing assignment
# requires a 0 or 1-dimensional input, input has 2 dimensions
I really can't understand why the first two approaches won't work (maybe I get the first, but I can't wrap my head around the second).
Thanks to this SO-question and answer, we can do it by
is_na = s.isna()
s.loc[is_na] = s.loc[is_na].apply(lambda x: [0,0,0])
but since apply is often rather slow, I cannot understand why we can't use replace or the slicing as above.
Working with lists in pandas is painful; here is a hacky solution:
s = s.fillna(pd.Series([[0,0,0]] * len(s), index=s.index))
print (s)
0 [1, 2, 3]
1 [1, 2, 3]
2 [0, 0, 0]
3 [1, 2, 3]
4 [1, 2, 3]
5 [0, 0, 0]
dtype: object
Series.reindex
s.dropna().reindex(s.index, fill_value=[0, 0, 0])
0 [1, 2, 3]
1 [1, 2, 3]
2 [0, 0, 0]
3 [1, 2, 3]
4 [1, 2, 3]
5 [0, 0, 0]
dtype: object
The documentation indicates that this value cannot be a list.
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for each
index (for a Series) or column (for a DataFrame). Values not in the
dict/Series/DataFrame will not be filled. This value cannot be a list.
This is probably a limitation of the current implementation, and short of patching the source code you must resort to workarounds (as provided below).
However, if you are not planning to work with jagged arrays, what you really want to do is probably replace pd.Series() with pd.DataFrame(), e.g.:
import numpy as np
import pandas as pd
s = pd.DataFrame(
[[1, 2, 3],
[1, 2, 3],
[np.nan],
[1, 2, 3],
[1, 2, 3],
[np.nan]],
dtype=pd.Int64Dtype()) # to mix integers with NaNs
s.fillna(0)
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  0  0  0
# 3  1  2  3
# 4  1  2  3
# 5  0  0  0
If you do need to use jagged arrays, you could use any of the workarounds proposed in other answers, or you could make one of your attempts work, e.g.:
ii = s.isna()
nn = ii.sum()
s[ii] = pd.Series([[0, 0, 0]] * nn).to_numpy()
# 0 [1, 2, 3]
# 1 [1, 2, 3]
# 2 [0, 0, 0]
# 3 [1, 2, 3]
# 4 [1, 2, 3]
# 5 [0, 0, 0]
# dtype: object
which basically uses NumPy masking to fill in the Series. The trick is to generate a compatible object for the assignment that works at the NumPy level.
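To see why the shape of the assigned object matters, the NumPy side can be looked at in isolation. This is a sketch with illustrative names (`as_2d`, `as_obj`, `values`, `mask`), not the code pandas runs internally: a nested list converts to a 2-D integer array, whereas a 1-D object array holding lists lines up one-to-one with the masked slots.

```python
import numpy as np

nested = [[0, 0, 0], [0, 0, 0]]

# A nested list is converted to a 2-D numeric array
as_2d = np.array(nested)

# A 1-D object array keeps each inner list as a single element
as_obj = np.empty(len(nested), dtype=object)
for i, item in enumerate(nested):
    as_obj[i] = item

# Stand-in for the Series values: an object array with two missing slots
values = np.empty(4, dtype=object)
for i, item in enumerate([[1, 2, 3], None, [1, 2, 3], None]):
    values[i] = item

mask = np.array([False, True, False, True])
values[mask] = as_obj  # two masked slots, two list elements

print(as_2d.shape, as_obj.shape)
print(values.tolist())
```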
If there are too many NaNs in the input, it is probably more efficient / faster to work in a similar way but with s.notna() instead, e.g.:
import pandas as pd
result = pd.Series([[0, 0, 0]] * len(s))
result[s.notna()] = s[s.notna()]
Let's try to do some benchmarking, where:
replace_nan_isna() is from above
import pandas as pd
def replace_nan_isna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    ii = s.isna()
    nn = ii.sum()
    s[ii] = pd.Series([value] * nn).to_numpy()
    return s
replace_nan_notna() is also from above
import pandas as pd
def replace_nan_notna(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    result = pd.Series([value] * len(s))
    result[s.notna()] = s[s.notna()]
    return result
replace_nan_reindex() is from @ShubhamSharma's answer
def replace_nan_reindex(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    # reindex() returns a new Series, so keep its result
    s = s.dropna().reindex(s.index, fill_value=value)
    return s
replace_nan_fillna() is from @jezrael's answer
import pandas as pd
def replace_nan_fillna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    # fillna() returns a new Series, so keep its result
    s = s.fillna(pd.Series([value] * len(s), index=s.index))
    return s
with the following code:
import numpy as np
import pandas as pd
def gen_data(n=5, k=2, p=0.7, obj=(1, 2, 3)):
    return pd.Series(([obj] * int(p * n) + [np.nan] * (n - int(p * n))) * k)
funcs = replace_nan_isna, replace_nan_notna, replace_nan_reindex, replace_nan_fillna
value = (0, 0, 0)  # must be defined before its first use below
# : inspect results
s = gen_data(5, 1)
for func in funcs:
    print(f'{func.__name__:>20s} {func(s, value)}')
print()
# : generate benchmarks
s = gen_data(100, 1000)
base = funcs[0](s, value)
for func in funcs:
    print(f'{func.__name__:>20s} {(func(s, value) == base).all()!s:>5}', end=' ')
    %timeit func(s, value)
# replace_nan_isna True 100 loops, best of 5: 16.5 ms per loop
# replace_nan_notna True 10 loops, best of 5: 46.5 ms per loop
# replace_nan_reindex True 100 loops, best of 5: 9.74 ms per loop
# replace_nan_fillna True 10 loops, best of 5: 36.4 ms per loop
indicating that reindex() may be the fastest approach.
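Note that %timeit is an IPython magic. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The function `replace_nan_comprehension` below is a hypothetical extra candidate used only to keep the example self-contained; it is not one of the answers benchmarked above, and no timing result is claimed for it.

```python
import timeit

import numpy as np
import pandas as pd

def gen_data(n=5, k=2, p=0.7, obj=(1, 2, 3)):
    return pd.Series(([obj] * int(p * n) + [np.nan] * (n - int(p * n))) * k)

def replace_nan_comprehension(s, value):
    # Hypothetical candidate: rebuild the Series with a list comprehension
    filled = [value if isinstance(x, float) and np.isnan(x) else x for x in s]
    return pd.Series(filled, index=s.index)

s = gen_data(100, 10)
value = (0, 0, 0)
result = replace_nan_comprehension(s, value)

# Run 10 calls per measurement, repeat 3 times, and keep the best run
timings = timeit.repeat(lambda: replace_nan_comprehension(s, value), number=10, repeat=3)
best_ms = min(timings) / 10 * 1000
print(f'replace_nan_comprehension: {best_ms:.2f} ms per loop')
```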

Get values of pandas series from a array of index locations

I have a 2-d array of index locations into a pandas Series. I would like to create a 2-d array of the Series values that correspond to those locations.
For example:
import pandas as pd
import numpy as np
A = pd.Series(data=[1,2,3,4,5])
idx = np.array([[0,2,3],[2,3,1]])
Would like to return:
B = np.array([[1,3,4],[3,4,2]])
I know I could do this as a loop:
B = np.zeros((2,3))
for i in [0,1]:
    B[i,:] = A[idx[i]]
However, in practice I need to do this repeatedly, so I would like to broadcast the index locations directly. Pandas is not necessary; I am happy to do it all in numpy if that is easier.
Something like this might work:
A[idx.flatten()].values.reshape(idx.shape)
A[idx] gives a Cannot index with multidimensional key error.
In [190]: A = pd.Series(data=[1,2,3,4,5])
...: idx = np.array([[0,2,3],[2,3,1]])
But the 1d array derived from the Series, can be indexed this way:
In [191]: A.values
Out[191]: array([1, 2, 3, 4, 5])
In [192]: A.values[idx]
Out[192]:
array([[1, 3, 4],
       [3, 4, 2]])
numpy has no problems returning an array with a dimension that matches idx.
Indexing the Series like this returns a Series - which by definition is 1d:
In [194]: A[idx.ravel()]
Out[194]:
0 1
2 3
3 4
2 3
3 4
1 2
dtype: int64
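Putting the pieces together, a minimal sketch of the NumPy route looks as follows. The second half is an assumption beyond the question: if the Series had non-default labels, `Index.get_indexer` could first translate labels to positions. The names `labelled`, `labels`, `positions` and `C` are illustrative only.

```python
import numpy as np
import pandas as pd

A = pd.Series(data=[1, 2, 3, 4, 5])
idx = np.array([[0, 2, 3], [2, 3, 1]])

# Positional lookup: the result takes the shape of idx
B = A.to_numpy()[idx]

# Assumed variation: a Series with string labels and a 2-d array of labels
labelled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
labels = np.array([['a', 'c'], ['b', 'a']])
positions = labelled.index.get_indexer(labels.ravel()).reshape(labels.shape)
C = labelled.to_numpy()[positions]

print(B)
print(C)
```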

How to create basic table in Python?

I want to make a table of 10 columns. I also want to find the row with the minimum value in column 0.
Example:
[[1,2,3],
 [4,5,6],
 [7,8,9],
 [10,11,21]]
How do I get the row that has the minimum value in column 0? I just need a function that can use column 0. For the example above, the result would be:
[1,2,3]
With numpy arange we can easily create a range of numbers, and then reshape them into a 2d array:
In [70]: arr = np.arange(1,13).reshape(4,3)
In [71]: arr
Out[71]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
argmin gives the index of the minimum value, for the whole array (flattened) or by row or column:
In [72]: np.argmin(arr, axis=1)
Out[72]: array([0, 0, 0, 0])
The 0 column:
In [73]: arr[:,0]
Out[73]: array([ 1, 4, 7, 10])
In [74]: np.argmin(arr[:,0])
Out[74]: 0
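The position returned by argmin can then be used to pull out the whole row, which is what the question asks for. A minimal sketch with the same array; `row_pos` and `min_row` are illustrative names.

```python
import numpy as np

arr = np.arange(1, 13).reshape(4, 3)

row_pos = np.argmin(arr[:, 0])   # position of the smallest value in column 0
min_row = arr[row_pos]           # the full row at that position

print(row_pos, min_row)
```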
pandas makes a nice table.
In [76]: import pandas as pd
In [77]: df = pd.DataFrame(arr)
In [78]: df
Out[78]:
    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
There is a built-in function for that: range.
range does not create a list but a lazy sequence object that behaves much like a list and should be more than enough for you (its items are computed only when requested).
So:
a = range(10)
print(a) #-> range(0, 10)
for i in a:
    print(i) #-> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
print(a[2]) #-> 2
print(a[0]) #-> 0
If you do not want to start from 0, use range(start_value, end_value).
If you want a custom increment, use range(start_value, end_value, increment) (the default increment is 1; to go backward, use a negative increment such as -1).
Edit:
To create a table like your example you can use this small function :
def ct(nStart, nEnd, nPerSubTable):
    r = [] # Setup initial variable
    subTable = []
    for i in range(nStart, nEnd): # The main ranging
        subTable.append(i)
        if len(subTable) == nPerSubTable: # When the sub table reaches the requested length, append it to r and reset it
            r.append(subTable)
            subTable = []
    if len(subTable) > 0: # If something is left over because the last sub table is smaller than expected, add it anyway
        r.append(subTable)
    return r
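For a table stored as a list of lists like this, the row with the smallest value in column 0 can be found with the built-in min and a key function. A minimal sketch using the table from the question; `min_row` and `min_pos` are illustrative names.

```python
table = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 21]]

# The row whose first element is smallest
min_row = min(table, key=lambda row: row[0])

# The position of that row, if the index is needed instead
min_pos = min(range(len(table)), key=lambda i: table[i][0])

print(min_row, min_pos)
```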

obtaining indices of n max absolute values in dataframe row

suppose i create a Pandas DataFrame as below
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
as an example, this can generate the below:
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute value terms. For example, for the first row, I would expect [0,3,4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use np.argsort(axis=1)
Given dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
           0          1          2          3          4
0  17.640523   4.001572   9.787380  22.408932  18.675580
1  -9.772779   9.500884  -1.513572  -1.032189   4.105985
2   1.440436  14.542735   7.610377   1.216750   4.438632
3   3.336743  14.940791  -2.051583   3.130677  -8.540957
4 -25.529898   6.536186   8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
       [0, 1, 4],
       [1, 2, 4],
       [1, 4, 0],
       [0, 4, 2]])
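Since the question says the results need not be ordered, a partial sort may be enough. The sketch below swaps in np.argpartition for the full argsort used above; this is an alternative technique, not the answer's own method, and `abs_vals` and `top_n` are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(10 * np.random.randn(5, 5))

n = 3
abs_vals = df.abs().to_numpy()

# Partition on the negated values so the n largest absolute values
# end up in the first n positions of each row (in no particular order)
top_n = np.argpartition(-abs_vals, n - 1, axis=1)[:, :n]

print(top_n)
```

The order of the indices within each row is unspecified, so compare the results as sets rather than as sequences.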
Try this (it is not the most efficient code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # use abs() so that large negative values are counted as well
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end you will have a dictionary with:
the row index as key
the column indices of the n largest absolute values of that row as value

Python - best way to set a column in a 2d array to a specific value

I have a 2d array and would like to set a column to a particular value; my code is below. Is this the best way in Python?
rows = 5
cols = 10
data = (rows * cols) *[0]
val = 10
set_col = 5
for row in range(rows):
    data[row * cols + set_col - 1] = val
If I want to set a number of columns to a particular value, how could I extend this?
I would like to use the Python standard library only.
Thanks
The NumPy package provides a powerful N-dimensional array object. If data is a numpy array, then to set column set_col to the value val:
data[:, set_col] = val
Complete Example:
>>> import numpy as np
>>> a = np.arange(10)
>>> a.shape = (5,2)
>>> a
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> a[:,1] = -1
>>> a
array([[ 0, -1],
       [ 2, -1],
       [ 4, -1],
       [ 6, -1],
       [ 8, -1]])
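The question also asks about setting several columns at once. With NumPy this can be sketched by passing a list of column positions or a slice; the shape and the chosen columns below are arbitrary assumptions for illustration.

```python
import numpy as np

rows, cols = 5, 10
data = np.zeros((rows, cols), dtype=int)

data[:, [2, 5]] = 10   # a list of column positions
data[:, 7:9] = -1      # a contiguous range of columns

print(data)
```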
A better solution would be:
data = [[0] * cols for i in range(rows)]
For the values of cols = 2, rows = 3 we'd get:
data = [[0, 0],
[0, 0],
[0, 0]]
Then you can access it as:
v = data[row][col]
Which leads to:
val = 10
set_col = 5
for row in range(rows):
    data[row][set_col] = val
Or the more Pythonic (thanks J.F. Sebastian):
for row in data:
    row[set_col] = val
There's nothing inherently wrong with the way you're doing it, except that it would be clearer to name the variable set_col rather than set_row, since you're setting a column.
To set a number of columns, just wrap it with another loop:
for set_col in [...columns that have to be set...]
One concern, though: your 2D array is unusual in that it's packed in a 1D array (Python can support 2D arrays via lists of lists as well), so I would wrap it all with methods or functions.
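One way to wrap the packed 1D layout in functions, using only the standard library, is sketched below. It relies on extended slice assignment with a stride of `cols`. The helper names `set_column` and `set_columns` are hypothetical, and columns are assumed to be zero-based, unlike the `set_col - 1` offset in the question.

```python
def set_column(data, cols, col, val):
    """Set one zero-based column of a row-major flat list to val."""
    rows = len(data) // cols
    # Every cols-th element starting at col belongs to the same column
    data[col::cols] = [val] * rows

def set_columns(data, cols, col_values):
    """Set several columns at once from a {column: value} mapping."""
    for col, val in col_values.items():
        set_column(data, cols, col, val)

rows, cols = 5, 10
data = rows * cols * [0]
set_column(data, cols, 4, 10)
set_columns(data, cols, {0: 1, 9: 7})

print(data[:cols])
```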
In your case rows and columns are probably interchangeable, i.e. it is a matter of semantics which is which. If so, you could make each column occupy a contiguous run of cells in the data list, and then zero a column using just:
data[column_start:column_start+rows] = rows * [0]
An earlier answer left out a range, so you could try the following:
cols = 7
rows = 8
data = [[0] * cols for i in range(rows)]
val = 10
set_col = 5
for row in data:
    row[set_col] = val
To extend this to a number of columns, you could store each column number and its value in a dict. So to set column 5 to 10 and column 2 to 7:
cols = 7
rows = 8
data = [[0] * cols for i in range(rows)]
valdict = {5:10, 2:7}
for col, val in valdict.items():
    for row in data:
        row[col] = val
Swapping the rows and columns, as suggested in another answer, makes this slightly simpler:
cols = 7
rows = 8
data = [[0] * rows for i in range(cols)]
valdict = {5:10, 2:7}
for col, val in valdict.items():
    data[col] = [val] * rows
