I need to update values in an Xarray DataArray. Only certain values should be updated; they are stored in a dataframe together with their coordinates.
One can also reformulate the problem as either (i) updating a non-contiguous (non-rectangular) area of values in a DataArray, or (ii) left-joining a dataset with new values on matching coords and overwriting the old values.
Initial DataArray (Dataset):
import numpy as np
import pandas as pd
import xarray as xr
ds = xr.Dataset(
    data_vars={"x": (("a", "b"), np.zeros((3, 4)))},
    coords={"a": np.arange(3), "b": np.arange(4)})
ds.x.values
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
New values to assign with their coordinates:
new_val = pd.DataFrame([
    (0, 1, 10),
    (1, 0, 11),
    (2, 2, 12)],
    columns=['a', 'b', 'x'])
Result I want to get:
array([[ 0., 10.,  0.,  0.],
       [11.,  0.,  0.,  0.],
       [ 0.,  0., 12.,  0.]])
I tried the methods in both the Combining data and the Assigning values with indexing tutorials, but no luck so far.
The result can be achieved using combine_first.
# set index and convert to xr Dataset
new_val_idx = new_val.set_index(['a', 'b'])
new_val_ds = xr.Dataset.from_dataframe(new_val_idx)
combined = new_val_ds.combine_first(ds)
combined.x.values
array([[ 0., 10.,  0.,  0.],
       [11.,  0.,  0.,  0.],
       [ 0.,  0., 12.,  0.]])
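If only a handful of values change, a simpler (if slower) alternative to combine_first is label-based assignment with .loc. This is my own sketch, not part of the answer above; it loops over the dataframe rows:
for _, row in new_val.iterrows():
    ds["x"].loc[{"a": row["a"], "b": row["b"]}] = row["x"]
combine_first remains the better choice at scale, since it avoids the Python-level loop.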
Related
I want to create a 3D array with a dimension that depends on the length of a 2D list.
Method 1:
The length of list_1 is 2 and each element has the same length.
list_1 = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
array_1 = np.zeros((2, len(list_1), 114))
The shape of array_1 is (2, 2, 114).
Method 2:
The length of list_2 is also 2, but the elements do not have the same length.
list_2 = [[1, 2, 3, 4, 5], [1, 2, 3, 4]]
array_2 = [np.zeros((len(list_2[i]), 114)) for i in range(len(list_2))]
array_2 = np.array(array_2, dtype=object)
In this case the shape of array_2 is (2,).
Does someone know the reason? I do not understand why I do not get the same shape.
Is there a way to get the same shape?
You made a list of 2 arrays that differ in the number of rows:
In [130]: [np.zeros((len(list_2[i]), 3)) for i in range(len(list_2))]
Out[130]:
[array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])]
np.array cannot combine those into one 3d array, so it makes a shape (2,) array containing those same 2 arrays:
In [131]: np.array(_, object)
Out[131]:
array([array([[0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.]]),
       array([[0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.],
              [0., 0., 0.]])], dtype=object)
If you don't include the object dtype it warns about making a "ragged array".
Using len(list_1) in the first case and len(list_2[i]) in the second are two very different things. In one you are using the length of the list itself, in the other the lengths of the sublists.
If you truncated the sublists to the same length:
In [137]: np.array([np.zeros((len(list_2[i][:4]),114)) for i in range(len(list_2))],object).shape
Out[137]: (2, 4, 114)
The subarrays now have the same shape (4,114), and can be combined into one 3d array.
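If truncating would lose data, the other common fix is padding: allocate the full-size 3d array up front and let the short sublists leave trailing zeros. A minimal sketch of my own, reusing the question's 114-column assumption:
import numpy as np

list_2 = [[1, 2, 3, 4, 5], [1, 2, 3, 4]]
max_len = max(len(sub) for sub in list_2)        # 5
array_2 = np.zeros((len(list_2), max_len, 114))  # shape (2, 5, 114)
# rows beyond each sublist's length simply stay zero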
I want to create a matrix with NumPy, but I need to add each element at its row and column indices.
for example:
my_matrix = np.matrix(np.zeros((5, 5)))
my_matrix.insert(row_index=2, column_index=1, value=10)
output:
matrix([[ 0.,  0., 0., 0., 0.],
        [ 0.,  0., 0., 0., 0.],
        [ 0., 10., 0., 0., 0.],
        [ 0.,  0., 0., 0., 0.],
        [ 0.,  0., 0., 0., 0.]])
How can I do that?
Do you want to add or insert values?
The add function you mentioned performs element-wise addition.
Example:
np.add([1, 2], [2, 3])
Out[41]: array([3, 5])
If you really want to create a matrix by inserting values with column and row indices, create the matrix first and insert your values afterwards.
number_rows = 10
number_cols = 20
arr = np.zeros((number_rows, number_cols))  # zeros, so untouched entries stay 0
arr[2, 1] = 10
The use of np.matrix is discouraged, if not actually deprecated. It is rarely needed, except for some backward compatibility cases.
In [1]: arr = np.zeros((5,5))
In [2]: arr
Out[2]:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
In [3]: mat = np.matrix(arr)
In [4]: mat
Out[4]:
matrix([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
Indexing one row of arr produces a 1d array
In [5]: arr[2]
Out[5]: array([0., 0., 0., 0., 0.])
Indexing one row of mat produces a 2d matrix, with shape (1,5)
In [6]: mat[2]
Out[6]: matrix([[0., 0., 0., 0., 0.]])
We can access an element in the 1d array:
In [7]: arr[2][1]
Out[7]: 0.0
but the same chained indexing on mat fails: mat[2] is a (1,5) matrix, so [1] tries to select its second row, which does not exist:
In [8]: mat[2][1]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-8-212ad5378f8e> in <module>
----> 1 mat[2][1]
...
IndexError: index 1 is out of bounds for axis 0 with size 1
In both cases it is better to access an element with the tuple syntax, rather than the chained one:
In [9]: arr[2,1]
Out[9]: 0.0
In [10]: mat[2,1]
Out[10]: 0.0
This indexing also works for setting values. But try to avoid iterating to set individual values; look for ways of creating the whole array with the desired values directly, with whole-array methods rather than iteration, as in the sketch below.
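For instance, several elements can be set in one vectorized assignment by passing index arrays (the indices and values here are made up for illustration):
import numpy as np

arr = np.zeros((5, 5))
rows = np.array([2, 0, 4])            # row index of each target element
cols = np.array([1, 3, 2])            # column index of each target element
arr[rows, cols] = [10., 20., 30.]     # one fancy-indexed assignment, no loop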
I'd like to create an MxN matrix based on two input arrays (XI and X), where each row has a 1 in the column given by the searchsorted result for that row's X value (within XI).
Code:
import numpy as np
XI = np.array([1., 2., 4., 5., 7.])
X = np.array([6.5, 2.2, 1.4, 4., 3.7, 3.9, 0.1, 5.3, 10.2])
def bmap(xi, x):
    i = np.searchsorted(xi, x, side="right") - 1
    result_shape = (x.shape[0], xi.shape[0])
    result = np.zeros(result_shape)
    for row, column in enumerate(i):
        if -1 < column < xi.shape[0] - 1:
            result[row, column] = 1.
    return result
bmap(XI, X)
Expected Output:
array([[0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0.]])
How do I do this using only vectorized operations (e.g. excluding the enumeration of i and the bounds checking)? Bonus points for something that can be used in TensorFlow as well (since, ultimately, I'm trying to port this over to TensorFlow so I can take advantage of algorithmic differentiation).
If I understood correctly, you just want to convert the indices i to positions in an array (one per row), unless an index is negative or falls in the last column.
i = np.searchsorted(xi, x, side="right") - 1
# array([ 3, 1, 0, 2, 1, 1, -1, 3, 4], dtype=int64)
Create your output array and a mask of valid values:
out = np.zeros([x.size, xi.size])
valid = (i > -1) & (i < xi.shape[0] - 1)
# valid: array([ True,  True,  True,  True,  True,  True, False,  True, False])
Use the valid mask to index the rows, while i[valid] provides the column indices:
out[valid, i[valid]] = 1
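For the TensorFlow bonus points, the same recipe appears to port directly, assuming TF 2.x: tf.searchsorted computes the indices and tf.one_hot builds the rows, exploiting the fact that out-of-range indices (such as -1) yield all-zero rows. This is a sketch of my own, not a tested drop-in:
import tensorflow as tf

xi = tf.constant([1., 2., 4., 5., 7.])
x = tf.constant([6.5, 2.2, 1.4, 4., 3.7, 3.9, 0.1, 5.3, 10.2])
i = tf.searchsorted(xi, x, side="right") - 1
valid = (i > -1) & (i < tf.size(xi) - 1)
# tf.one_hot maps out-of-range indices (here -1) to all-zero rows
out = tf.one_hot(tf.where(valid, i, -1), depth=xi.shape[0])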
I have the following list of indices, [2, 4, 3, 4], which correspond to my target indices. I'm creating a matrix of zeros with targets = np.zeros((features.shape[0], 5)). I'm wondering if it's possible to slice in such a way that I could update the specific indices all at once and set those values to 1 without a for loop. Ideally the matrix would look like
([0,0,1,0,0], [0,0,0,0,1], [0,0,0,1,0], [0,0,0,0,1])
I believe you can do something like this:
targets = np.zeros((4, 5))
ind = [2, 4, 3, 4]
targets[np.arange(0, 4), ind] = 1
Here is the result:
array([[ 0., 0., 1., 0., 0.],
       [ 0., 0., 0., 0., 1.],
       [ 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 1.]])
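As a side note, this is exactly one-hot encoding, and an equivalent one-liner (a stylistic alternative of mine, not from the answer above) indexes rows of an identity matrix:
targets = np.eye(5)[[2, 4, 3, 4]]  # row k of eye(5) is the one-hot vector for k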
I know this can easily be done with pandas, but the data is too sparse and large (170,000 x 5000), and in the end I need to feed it to sklearn again, so I'm wondering if there is a way to do it with sklearn. I tried the one hot encoder, but got stuck associating the dummies with the 'id'.
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})
   id item
0   1    a
1   1    a
2   2    c
3   2    b
4   3    a
5   3    b
dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()
   id  item_a  item_b  item_c
0   1       2       0       0
1   2       0       1       1
2   3       1       1       0
Update:
Now I'm here, but the 'id' is lost. How do I do the aggregation then?
import sklearn.preprocessing
lab = sklearn.preprocessing.LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = sklearn.preprocessing.OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1, 1))
dummy.todense()
matrix([[ 1., 0., 0.],
        [ 1., 0., 0.],
        [ 0., 0., 1.],
        [ 0., 1., 0.],
        [ 1., 0., 0.],
        [ 0., 1., 0.]])
In case anyone needs a reference in the future, I put my solution here.
I used scipy sparse matrix.
First, do a grouping and count the number of records.
df = df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'})
This takes some time but not days.
Then build a sparse pivot table, using a solution I found here.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id, item, count):
    id_u = list(df[id].unique())
    item_u = list(np.sort(df[item].unique()))
    data = df[count].tolist()
    # pd.Categorical replaces the old astype('category', categories=...) API
    row = pd.Categorical(df[id], categories=id_u).codes
    col = pd.Categorical(df[item], categories=item_u).codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))
Then call the function
result = to_sparse_pivot(df, 'id', 'item', 'count')
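As a quick sanity check (my addition, using the small example dataframe from the question), densifying the result should reproduce the counts from the pandas approach:
result.toarray()
# array([[2, 0, 0],
#        [0, 1, 1],
#        [1, 1, 0]])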
OneHotEncoder requires integers, so here is one way to map your items to a unique integer. Because the mapping is one-to-one, we can also reverse this dictionary.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'Item': ['a', 'a', 'c', 'b', 'a', 'b']})
mapping = {letter: integer for integer, letter in enumerate(df.Item.unique())}
reverse_mapping = {integer: letter for letter, integer in mapping.items()}
>>> mapping
{'a': 0, 'b': 2, 'c': 1}
>>> reverse_mapping
{0: 'a', 1: 'c', 2: 'b'}
Now create a OneHotEncoder and map your values.
hot = OneHotEncoder()
h = hot.fit_transform(df.Item.map(mapping).values.reshape(len(df), 1))
>>> h
<6x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>
>>> h.toarray()
array([[ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 0., 1., 0.],
       [ 0., 0., 1.],
       [ 1., 0., 0.],
       [ 0., 0., 1.]])
And for reference, these would be the appropriate columns:
>>> [reverse_mapping[n] for n in reverse_mapping.keys()]
['a', 'c', 'b']
From your data, you can see that the value c appeared in the third row of the dataframe (index 2). c was mapped to 1, which the reverse mapping shows is the middle column. It is also the only row of the matrix with a one in the middle column, confirming the result.
Beyond this, I'm not sure where you'd be stuck. If you still have issues, please clarify.
To concatenate the ID values:
>>> np.concatenate((df.ID.values.reshape(len(df), 1), h.toarray()), axis=1)
array([[ 1., 1., 0., 0.],
       [ 1., 1., 0., 0.],
       [ 2., 0., 1., 0.],
       [ 2., 0., 0., 1.],
       [ 3., 1., 0., 0.],
       [ 3., 0., 0., 1.]])
To keep the array sparse:
from scipy.sparse import hstack, lil_matrix
id_vals = lil_matrix(df.ID.values.reshape(len(df), 1))
h_sparse = hstack([id_vals, h.tolil()])  # despite the lil inputs, hstack returns COO
>>> type(h_sparse)
scipy.sparse.coo.coo_matrix
>>> h_sparse.toarray()
array([[ 1., 1., 0., 0.],
       [ 1., 1., 0., 0.],
       [ 2., 0., 1., 0.],
       [ 2., 0., 0., 1.],
       [ 3., 1., 0., 0.],
       [ 3., 0., 0., 1.]])
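One last note (my addition): most sklearn estimators handle CSR input most efficiently, so it is usually worth converting before fitting:
h_csr = h_sparse.tocsr()  # COO -> CSR for efficient row access in sklearn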