How to create an xarray from a sparse, denormalized table?

Say I have the following structured array:
import numpy as np
l, h, w = 6, 5, 5
dtype = [('a', int), ('b', '<U3'), ('data', (float, (h, w)))]
table = np.empty(l, dtype)
table['a'] = [1, 2, 3, 1, 2, 3]
table['b'] = ['foo', 'bar'] * 3
table['data'] = np.random.rand(l, h, w)
My data has shape (6, 5, 5), but logically its shape is (3, 2, 5, 5): the columns a and b are just denormalized.
Is it possible to create an xarray DataArray directly from this shape (6, 5, 5) by providing columns a and b of length 6 and have xarray figure out the (3, 2, 5, 5) shape? What would coords and dims be?
In reality, table is sparse and has many dimensions, and I'm trying to see if there's any xarray creation machinery I can lean on instead of reshaping table myself.
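One possible approach, sketched here in plain NumPy rather than with xarray's own machinery, is to factorize each key column with np.unique and scatter the rows into a NaN-filled dense array. The names a_vals, a_idx, b_vals, b_idx and full are illustrative only and do not come from the question.

```python
import numpy as np

# Rebuild the structured array from the question.
l, h, w = 6, 5, 5
dtype = [('a', int), ('b', '<U3'), ('data', (float, (h, w)))]
table = np.empty(l, dtype)
table['a'] = [1, 2, 3, 1, 2, 3]
table['b'] = ['foo', 'bar'] * 3
table['data'] = np.random.rand(l, h, w)

# Sorted unique labels per key column, plus each row's position on that axis.
a_vals, a_idx = np.unique(table['a'], return_inverse=True)
b_vals, b_idx = np.unique(table['b'], return_inverse=True)

# Dense target filled with NaN, so (a, b) combinations that are absent
# from a sparse table remain visible as missing values.
full = np.full((len(a_vals), len(b_vals), h, w), np.nan)
full[a_idx, b_idx] = table['data']

print(full.shape)  # (3, 2, 5, 5)
```

Under these assumptions, a_vals and b_vals would serve as the coordinate values for the first two dimensions, and full could be passed to a DataArray constructor together with four dimension names of your choosing.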

Related

Subtracting one dimensional array (list of scalars) from 3 dimensional arrays using broadcasting

I have a one-dimensional array of scalar values:
Y = np.array([1, 2])
I also have a 3-dimensional array:
X = np.random.randint(0, 255, size=(2, 2, 3))
I am attempting to subtract each value of Y from X, so I should get back Z, which should be of shape (2, 2, 2, 3) or maybe (2, 2, 3, 2).
I can't seem to figure out how to do this via broadcasting.
I tried changing the shape of Y:
Y = np.array([[[1, 2]]])
but not sure what the correct shape should be.

Broadcasting lines up dimensions on the right. So you're looking to operate on a (2, 1, 1, 1) array and a (2, 2, 3) array.
The simplest way I can think of is using reshape:
Y = Y.reshape(-1, 1, 1, 1)
More generally:
Y = Y.reshape(-1, *([1] * X.ndim))
At most one of the arguments to reshape can be -1, indicating all the remaining size not accounted for by other dimensions.
To get Z of shape (2, 2, 2, 3):
Z = X - Y.reshape(-1, *([1] * X.ndim))
If you were OK with having Z of shape (2, 2, 3, 2), the operation would be much simpler:
Z = X[..., None] - Y
None or np.newaxis will insert a unit axis into the end of X's shape, making it broadcast properly with the 1D Y.
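The two variants above can be compared side by side in a small sketch. A deterministic X is used here instead of random values so the results are easy to check; Z_front and Z_back are illustrative names.

```python
import numpy as np

X = np.arange(12).reshape(2, 2, 3)
Y = np.array([1, 2])

# Y gets one leading axis plus X.ndim unit axes: shape (2, 1, 1, 1).
Z_front = X - Y.reshape(-1, *([1] * X.ndim))

# X gets a trailing unit axis: shape (2, 2, 3, 1), broadcast against (2,).
Z_back = X[..., None] - Y

print(Z_front.shape)  # (2, 2, 2, 3)
print(Z_back.shape)   # (2, 2, 3, 2)
```

Both results hold the same numbers; only the position of the axis that indexes Y differs, so np.moveaxis(Z_back, -1, 0) gives Z_front.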

I am not entirely sure on which dimension you want your subtraction to take place, but X - Y will not return an error if you define Y such as Y = numpy.array([1,2]).reshape(2, 1, 1) or Y = numpy.array([1,2]).reshape(1, 2, 1).

Scipy sparse matrix from edge list

How to convert an edge list (data) to a python scipy sparse matrix to get this result:
Dataset (where 'agn' is node category one and 'fct' is node category two):
data['agn'].tolist()
['p1', 'p1', 'p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2', 'p3', 'p3', 'p3', 'p4', 'p4', 'p5']
data['fct'].tolist()
['f1', 'f2', 'f3', 'f4', 'f5', 'f3', 'f4', 'f5', 'f6', 'f5', 'f6', 'f7', 'f7', 'f8', 'f9']
(not working) python code:
from scipy.sparse import csr_matrix, coo_matrix
csr_matrix((data_sub['agn'].values, data['fct'].values),
shape=(len(set(data['agn'].values)), len(set(data_sub['fct'].values))))
-> Error: "TypeError: invalid input format"
Do I really need three arrays to construct the matrix, as the examples in the scipy csr documentation suggest?
(working) R code used to construct the matrix with only two vectors:
library(Matrix)
grph_tim <- sparseMatrix(i = as.numeric(data$agn),
j = as.numeric(data$fct),
dims = c(length(levels(data$agn)),
length(levels(data$fct))),
dimnames = list(levels(data$agn),
levels(data$fct)))
EDIT:
It finally worked after I modified the code from here and added the needed array:
import numpy as np
import pandas as pd
import scipy.sparse as ss
def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    # if the nodes are integers, use 'dtype = np.uint32'
    data = pd.read_csv(filename, sep = '\t', encoding = 'utf-8')
    # where 'rows' is node category one and 'cols' node category 2
    rows = data['agn'] # Not a copy, just a reference.
    cols = data['fct']
    # crucial third array in python, which can be left out in r
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix
Additionally, I converted the string names of the nodes to integers. Thus data['agn'] becomes [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4] and data['fct'] becomes [0, 1, 2, 3, 4, 2, 3, 4, 5, 4, 5, 6, 6, 7, 8].
I get this sparse matrix:
(0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
(0, 4) 1
(1, 2) 1
(1, 3) 1
(1, 4) 1
(1, 5) 1
(2, 4) 1
(2, 5) 1
(2, 6) 1
(3, 6) 1
(3, 7) 1
(4, 8) 1
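The conversion from string labels to integer codes is not shown above. A minimal sketch of one way to do it, assuming np.unique with return_inverse is acceptable for the encoding (the names row_labels, col_labels and the explicit shape argument are illustrative additions):

```python
import numpy as np
from scipy.sparse import coo_matrix

agn = ['p1', 'p1', 'p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2',
       'p3', 'p3', 'p3', 'p4', 'p4', 'p5']
fct = ['f1', 'f2', 'f3', 'f4', 'f5', 'f3', 'f4', 'f5', 'f6',
       'f5', 'f6', 'f7', 'f7', 'f8', 'f9']

# Sorted unique labels, plus an integer code for every edge endpoint.
row_labels, rows = np.unique(agn, return_inverse=True)
col_labels, cols = np.unique(fct, return_inverse=True)

# The third array: one value per edge.
ones = np.ones(len(rows), dtype=int)
matrix = coo_matrix((ones, (rows, cols)),
                    shape=(len(row_labels), len(col_labels)))

print(matrix.shape)  # (5, 9)
```

Keeping row_labels and col_labels around makes it possible to map matrix indices back to the original node names, which is what dimnames does in the R version.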

Reduce array over ranges

Say I have an array of numbers
np.array(([1, 4, 2, 1, 2, 5]))
And I want to compute the sum over a list of slices
((0, 3), (2, 4), (2, 6))
Giving
[(1 + 4 + 2), (2 + 1), (2 + 1 + 2 + 5)]
Is there a nice way to do this in numpy?
Looking for something equivalent to
def reduce(a, ranges):
    return np.array([np.sum(a[low:high]) for (low, high) in ranges])
Seems like there is probably some fancy numpy way to do this though. Anyone know?

One way is to use np.add.reduceat. If a is the array of values [1, 4, 2, 1, 2, 5]:
>>> np.add.reduceat(a, [0,3, 2,4, 2])[::2]
array([ 7, 3, 10], dtype=int32)
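Here the slice indexes are passed in a flat list, and reduceat returns [ 7, 1, 3, 2, 10]. Where an index is smaller than the next one it sums the slice between them (a[0:3], a[2:4]); where it is not, it returns the single element at that index (a[3], a[4]); the last index sums to the end of the array (a[2:]). We only want every other element from this array.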
Longer alternative approach...
The fact that the slices are of different lengths makes this slightly trickier to vectorise in NumPy, but here is one way you could approach the problem.
Given an array of values and an array of slices to make...
a = np.array(([1, 4, 2, 1, 2, 5]))
slices = np.array([(0, 3), (2, 4), (2, 6)])
...create a mask-like array z that, for each slice, will be used to "zero-out" the values from a we don't want to sum:
z = np.zeros((3, 6))
s1 = np.arange(6) >= slices[:, 0][:,None]
s2 = np.arange(6) < slices[:, 1][:,None]
z[s1 & s2] = 1
Then you can do:
>>> (z * a).sum(axis=1)
array([ 7., 3., 10.])
A quick %timeit shows this is slightly faster than the list comprehension, even though we had to construct z and z * a. If slices is made to be of length 3000, this method is around 40 times quicker.
However, note that the array z will be of shape (len(slices), len(a)), which may not be practical if a and slices are both very long. An iterative approach might be preferred to avoid large temporary arrays in memory.
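If the temporary array is a concern, a different technique, prefix sums, avoids it. This is not one of the approaches described above; it is a sketch that assumes every range is a plain (low, high) pair with low <= high, and the names prefix and sums are illustrative.

```python
import numpy as np

a = np.array([1, 4, 2, 1, 2, 5])
ranges = np.array([(0, 3), (2, 4), (2, 6)])

# prefix[k] holds a[:k].sum(), so prefix has len(a) + 1 entries.
prefix = np.concatenate(([0], np.cumsum(a)))

# The sum over a[low:high] is the difference of two prefix sums.
sums = prefix[ranges[:, 1]] - prefix[ranges[:, 0]]

print(sums)  # [ 7  3 10]
```

This needs only O(len(a) + len(ranges)) memory. With floating-point data the results can differ from a direct sum by rounding error, since each value is obtained by subtracting two accumulated totals.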

Python Multiply tuples of equal length

I was hoping for an elegant or effective way to multiply sequences of integers (or floats).
My first thought was that (1, 2, 3) * (1, 2, 2) would give (1, 4, 6), the products of the individual multiplications.
Python isn't set up to do that for sequences, though, which is fine; I wouldn't really expect it to. So what's the pythonic way to multiply (or apply other arithmetic operations to) the items of two sequences at matching indices?
A second example (0.6, 3.5) * (4, 4) = (2.4, 14)

The simplest way is to use the zip function with a generator expression, like this:
tuple(l * r for l, r in zip(left, right))
For example,
>>> tuple(l * r for l, r in zip((1, 2, 3), (1, 2, 3)))
(1, 4, 9)
>>> tuple(l * r for l, r in zip((0.6, 3.5), (4, 4)))
(2.4, 14.0)
In Python 2.x, zip returns a list of tuples. If you want to avoid creating the temporary list, you can use itertools.izip (in Python 3, zip is already lazy and itertools.izip no longer exists), like this:
>>> from itertools import izip
>>> tuple(l * r for l, r in izip((1, 2, 3), (1, 2, 3)))
(1, 4, 9)
>>> tuple(l * r for l, r in izip((0.6, 3.5), (4, 4)))
(2.4, 14.0)
You can read more about the differences between zip and itertools.izip in this question.

A simpler way would be:
from operator import mul
In [19]: tuple(map(mul, [0, 1, 2, 3], [10, 20, 30, 40]))
Out[19]: (0, 20, 60, 120)
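Since the question also asks about other arithmetic operations, the map approach can be wrapped in a small helper. This is a Python 3 sketch; the function name elementwise and the explicit length check are illustrative additions, not part of the answers above.

```python
from operator import add, mul

def elementwise(op, left, right):
    """Apply a binary operator to items at matching indices."""
    # map() stops silently at the shorter input, so check lengths first.
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return tuple(map(op, left, right))

print(elementwise(mul, (1, 2, 3), (1, 2, 2)))  # (1, 4, 6)
print(elementwise(add, (1, 2, 3), (1, 2, 2)))  # (2, 4, 5)
```

Any two-argument callable works in place of mul or add, including the other functions in the operator module.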

If you are interested in element-wise multiplication, you'll probably find that many other element-wise mathematical operations are also useful. If that is the case, consider using the numpy library.
For example:
>>> import numpy as np
>>> x = np.array([1, 2, 3])
>>> y = np.array([1, 2, 2])
>>> x * y
array([1, 4, 6])
>>> x + y
array([2, 4, 5])

With list comprehensions the operation could be completed like
def seqMul(left, right):
    return tuple([value*right[idx] for idx, value in enumerate(left)])
seqMul((0.6, 3.5), (4, 4))

A = (1, 2, 3)
B = (4, 5, 6)
AB = [a * b for a, b in zip(A, B)]
In Python 2, use itertools.izip instead of zip for larger inputs; in Python 3, zip is already lazy.

Change a 1D NumPy array from (implicit) row major to column major order

I have a 1D array in NumPy that implicitly represents some 2D data in row-major order. Here's a trivial example:
import numpy as np
# My data looks like [[1,2,3,4], [5,6,7,8]]
a = np.array([1,2,3,4,5,6,7,8])
I want to get a 1D array in column-major order (ie. b = [1,5,2,6,3,7,4,8] in the example above).
Normally, I would just do the following:
mat = np.reshape(a, (-1,4))
b = mat.flatten('F')
Unfortunately, the length of my input array is not an exact multiple of the row length I want (ie. a = [1,2,3,4,5,6,7]), so I can't call reshape. I want to keep that extra data, though, which might be quite a lot since my rows are pretty long. Is there any straightforward way to do this in NumPy?

The simplest way I can think of is not to try and use reshape with methods such as ravel('F'), but just to concatenate sliced views of your array.
For example:
>>> cols = 4
>>> a = np.array([1,2,3,4,5,6,7])
>>> np.concatenate([a[i::cols] for i in range(cols)])
array([1, 5, 2, 6, 3, 7, 4])
This works for any length of array and any number of columns:
>>> cols = 5
>>> b = np.arange(17)
>>> np.concatenate([b[i::cols] for i in range(cols)])
array([ 0, 5, 10, 15, 1, 6, 11, 16, 2, 7, 12, 3, 8, 13, 4, 9, 14])
Alternatively, use as_strided to reshape. The fact that the array a is too small to fit the (2, 4) shape doesn't matter: you'll just get junk (i.e. whatever's in memory) in the last place:
>>> np.lib.stride_tricks.as_strided(a, shape=(2, 4))
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 168430121]])
>>> _.flatten('F')[:7]
array([1, 5, 2, 6, 3, 7, 4])
In the general case, given an array b and a desired number of columns cols you can do this:
>>> x = np.lib.stride_tricks.as_strided(b, shape=(len(b)//cols + 1, cols)) # reshape to min 2d array needed to hold array b
>>> np.concatenate((x[:,:len(b)%cols].ravel('F'), x[:-1, len(b)%cols:].ravel('F')))
This unravels the "good" part of the array (those columns not containing junk values) and the bad part (except for the junk values which lie in the bottom row) and concatenates the two unraveled arrays. For example:
>>> cols = 5
>>> b = np.arange(17)
>>> x = np.lib.stride_tricks.as_strided(b, shape=(len(b)//cols + 1, cols))
>>> np.concatenate((x[:,:len(b)%cols].ravel('F'), x[:-1, len(b)%cols:].ravel('F')))
array([ 0, 5, 10, 15, 1, 6, 11, 16, 2, 7, 12, 3, 8, 13, 4, 9, 14])
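A further option that never touches memory outside the array is to reorder through an index array. This is a sketch of a different technique from the ones above; the function name to_column_major is illustrative.

```python
import numpy as np

def to_column_major(a, cols):
    """Reorder a 1D row-major array into column-major order.

    The last row may be incomplete; no padding values are introduced.
    """
    rows = -(-len(a) // cols)  # ceiling division
    # Positions of a full rows x cols grid, read column by column.
    idx = np.arange(rows * cols).reshape(rows, cols).ravel(order='F')
    # Drop positions that fall past the end of the real data.
    return a[idx[idx < len(a)]]

print(to_column_major(np.array([1, 2, 3, 4, 5, 6, 7]), 4))
# [1 5 2 6 3 7 4]
```

Because it only indexes into a, this works for any dtype, and the index array can be reused when many arrays share the same length and column count.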

Use some value to represent null to make the array be a multiple of how you want to split it. If casting to float is acceptable, you could use nan's to represent the added elements that represent nulls. Then reshape to 2D, call transpose, and reshape to 1D. Then eliminate the nulls.
import numpy as np
a = np.array([1,2,3,4,5,6,7]) # input
b = np.concatenate( (a, [np.nan]) ) # add a NaN to make it 8 = 4x2 (casts to float)
c = b.reshape(2,4).transpose().reshape(8,) # reshape to 2x4, transpose, flatten back to length 8
d = c[~np.isnan(c)] # remove NaN (~ is the boolean negation)
print(d)
[ 1. 5. 2. 6. 3. 7. 4.]
