Dask element-wise string concatenation - python

I need to create a multi-index for dask by concatenating two arrays (preferably dask arrays). I found the following solution for numpy, but I am looking for a dask equivalent:
cols = 100000
index = np.array([x1 + x2 + x3 for x1, x2, x3 in zip(np.repeat(1, cols).astype('str'), np.repeat('-', cols), np.repeat(1, cols).astype('str'))])
If I pass the arrays through da.from_array(), it balks at adding the two arrays with +.
I have also tried np.core.defchararray.add(); this works, but it converts the dask arrays to numpy arrays (as far as I can tell).

You might want to try da.map_blocks. You can write a numpy function that does whatever you want, and da.map_blocks will apply that numpy function blockwise to each of the numpy arrays that make up your dask array.
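A minimal sketch of that approach, using np.char.add as the block-wise numpy function; the chunk size and the output string width are assumptions, not part of the original answer:

import numpy as np
import dask.array as da

cols = 100000

# Two dask arrays of strings to be joined element-wise (chunk size is arbitrary)
left = da.from_array(np.repeat(1, cols).astype('str'), chunks=10000)
right = da.from_array(np.repeat(1, cols).astype('str'), chunks=10000)

def concat_blocks(a, b):
    # Plain numpy element-wise string concatenation, applied to one block at a time
    return np.char.add(np.char.add(a, '-'), b)

# map_blocks applies the numpy function to each pair of corresponding blocks
# dtype is a generous guess at the concatenated string width; adjust as needed
index = da.map_blocks(concat_blocks, left, right, dtype='<U43')
result = index.compute()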

Related

H2OFrame column to array: quickest way?

Suppose I have an H2OFrame called df. What is the quickest way to get the values of column x from said frame as a numpy array?
One could do
x_array = df['x'].as_data_frame()['x'].values
But that seems unnecessarily verbose. Especially going via a pandas DataFrame with as_data_frame seems superfluous. I was hoping for something more elegant, e.g. df['x'].to_array(), but I can't find it.
Here is another way. However, I'm not sure it's faster. I'm using the h2o.as_list() function to convert a column to a list and then the np.array() function to convert the list to an array.
import h2o
import numpy as np
h2o.init()
# Using sample dataset from H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
## Creating np array from h2o frame column
np.array(h2o.as_list(train['x1']))
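One follow-up note, as an assumption about h2o's behaviour rather than something from the original answer: depending on the h2o version, as_list may hand back a pandas DataFrame, so the result above is typically 2-D with shape (n, 1). Flattening it is a one-liner:

# Flatten to a 1-D vector (column 'x1' comes from the sample dataset above)
x_array = np.array(h2o.as_list(train['x1'])).ravel()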

Efficient way of array transformation using numpy

How can I change the array U(Nz, Ny, Nx) to U(Nx, Ny, Nz) using numpy? Thanks.
Just numpy.transpose(U) or U.T.
In general, if you want to change the order of data in numpy array, see http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.array-manipulation.html#rearranging-elements.
The np.fliplr() and np.flipud() functions can be particularly useful when the transpose is not actually what you want.
Additionally, more general element reordering can be done by creating an index mask, partially explained here
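A minimal sketch of the transpose answer on a small 3-D array; the dimension sizes here are made up for illustration:

import numpy as np

Nz, Ny, Nx = 2, 3, 4
U = np.arange(Nz * Ny * Nx).reshape(Nz, Ny, Nx)

print(U.shape)    # (2, 3, 4) -> (Nz, Ny, Nx)
print(U.T.shape)  # (4, 3, 2) -> (Nx, Ny, Nz); .T reverses the axis order

# The same thing with the axis order spelled out explicitly
V = np.transpose(U, axes=(2, 1, 0))
print(V.shape)    # (4, 3, 2)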

Save a csr_matrix and a numpy array in one file

I need to save a large sparse csr_matrix and a numpy array so that I can read them back later. Let X be the sparse csr_matrix and Y be the numpy array.
Currently I take the following slightly insane route.
from scipy.sparse import csr_matrix
import numpy as np

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

save_sparse_csr("file1", X)
np.save("file2", Y)
Then when I want to read them in it is:
X = load_sparse_csr("file1.npz")
Y = np.load("file2.npy")
Two questions:
Is there a better way to save a csr_matrix than this?
Can I save both X and Y to the same file somehow? It seems crazy to have to make two files for this.
So you are saving the 3 array attributes of the csr along with its shape. And that is sufficient to recreate the array, right?
What's wrong with that? Even if you find a function that saves the csr for you, I bet it is doing the same thing - saving those same arrays.
The normal way in Python to save a class is to pickle it. But the class has to create the appropriate pickle method. numpy does that (essentially its save function). But as far as I know scipy.sparse has not provided that.
Since scipy.sparse has its roots in the MATLAB sparse code (and C/Fortran code developed for linear algebra problems), it can load/save using the loadmat/savemat functions. I'd have to double check, but I think they work with csc, the default MATLAB sparse format.
There are one or two other sparse I/O modules that handle sparse matrices, but I have not worked with those. They are formats for sharing sparse arrays among different packages working on the same problems (for example PDEs or finite elements). More than likely those formats use a coo-compatible layout (data, rows, cols), either as 3 arrays, a csv with 3 columns, or a 2d array.
Mentioning the coo format raises another possibility: make a structured array with data, row, col fields, and use np.save or even np.savetxt. I don't think it's any faster or cleaner than the direct csr approach, but it does put all the data in one array (though shape might still need a separate entry).
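A rough sketch of that coo structured-array idea, which also packs Y into the same .npz file to address the second question; the function names here are illustrative, not an established API:

import numpy as np
from scipy.sparse import coo_matrix

def save_coo_struct(filename, X, Y):
    C = X.tocoo()
    # One structured array holds data, row and col side by side
    rec = np.zeros(C.nnz, dtype=[('data', C.data.dtype),
                                 ('row', C.row.dtype),
                                 ('col', C.col.dtype)])
    rec['data'], rec['row'], rec['col'] = C.data, C.row, C.col
    # shape still needs its own entry; Y goes into the same archive
    np.savez(filename, coo=rec, shape=C.shape, Y=Y)

def load_coo_struct(filename):
    loader = np.load(filename)
    rec = loader['coo']
    X = coo_matrix((rec['data'], (rec['row'], rec['col'])),
                   shape=tuple(loader['shape'])).tocsr()
    return X, loader['Y']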
You might also be able to pickle the dok format, since it is a dict subclass.

numpy style: which is preferred, array.T[x] or array[:, x]?

What is the preferred way to extract a column of data in numpy?
array[:,x]
or
array.T[x]
I find that having an array of data with the fields along the rows and data in columns is cleaner to manipulate in numpy:
array[x]
to get a whole series along one variable as opposed to the above options.
But having variables ordered by column is the standard file format.
Any preferences as to what is the easiest way to work with the data?
Should I transpose all my data when I read it in and then transpose again when I output?
You should prefer slicing [:, x], for several reasons:
It is faster, probably because you are not transposing the entire array to extract a piece from it. Tested in Python 3.5.1, NumPy 1.11.0:
>>> timeit.timeit('A[:,568]', setup = 'import numpy as np\nA = np.random.uniform(size=(1000,1000))')
0.21135332298581488
>>> timeit.timeit('A.T[568]', setup = 'import numpy as np\nA = np.random.uniform(size=(1000,1000))')
0.3025632489880081
It generalizes in a straightforward way to higher dimensional arrays, like A[3, :, 4]
It reflects the NumPy way of thinking of arrays as multidimensional objects, rather than lists of lists (of lists).
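A small sketch contrasting the two spellings; the array sizes here are arbitrary:

import numpy as np

A = np.random.uniform(size=(1000, 1000))

col = A[:, 568]   # preferred: slice column 568 directly
same = A.T[568]   # same values, but goes through the transposed view
assert np.array_equal(col, same)

# The slice notation carries over directly to higher dimensions
B = np.random.uniform(size=(5, 6, 7))
plane = B[3, :, 4]   # shape (6,)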

Is there a Pandas equivalent to each_slice to operate on dataframes

I am wondering if there is a Python or Pandas function that approximates the Ruby #each_slice method. In this example, the Ruby #each_slice method will take the array or hash and break it into groups of 100.
var.each_slice(100) do |batch|
  # do some work on each batch
end
I am trying to do this same operation on a Pandas dataframe. Is there a Pythonic way to accomplish the same thing?
I have checked out this answer: Python equivalent of Ruby's each_slice(count)
However, it is old and is not Pandas specific. I am checking it out but am wondering if there is a more direct method.
There isn't a built-in method as such, but you can use numpy's array_split: pass it the dataframe and the number of slices.
In order to get slices of roughly 100 rows you'll have to calculate that number, which is simply the number of rows divided by 100:
import numpy as np
# df.shape returns the dimensions in a tuple, the first dimension is the number of rows
np.array_split(df, df.shape[0] // 100)
This returns a list of dataframes sliced as evenly as possible.
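A rough usage sketch of the array_split approach, mimicking Ruby's each_slice(100); the toy DataFrame is just for illustration, and depending on the numpy/pandas versions each batch may come back as a DataFrame or be converted to an ndarray:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1050)})

# Number of chunks needed for batches of roughly 100 rows
n_chunks = max(1, df.shape[0] // 100)

for batch in np.array_split(df, n_chunks):
    # do some work on each batch (~100 rows per batch)
    print(len(batch))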
