How to add column to numpy array - python

I am trying to add one column to the array created from recfromcsv. In this case it's an array: [210,8] (rows, cols).
I want to add a ninth column. Empty or with zeroes doesn't matter.
from numpy import genfromtxt
from numpy import recfromcsv
import numpy as np
import time
if __name__ == '__main__':
    print("testing")
    my_data = recfromcsv('LIAB.ST.csv', delimiter='\t')
    array_size = my_data.size
    #my_data = np.append(my_data[:array_size],my_data[9:],0)
    new_col = np.sum(x,1).reshape((x.shape[0],1))
    np.append(x,new_col,1)

I think your problem is that you expect np.append to add the column in place. Because of how numpy data is stored, it instead creates a copy of the joined arrays. From the docstring:
Returns
-------
append : ndarray
A copy of `arr` with `values` appended to `axis`. Note that `append`
does not occur in-place: a new array is allocated and filled. If
`axis` is None, `out` is a flattened array.
So you need to save the output, as in all_data = np.append(...):
my_data = np.random.random((210,8)) #recfromcsv('LIAB.ST.csv', delimiter='\t')
new_col = my_data.sum(1)[...,None] # None keeps (n, 1) shape
new_col.shape
#(210,1)
all_data = np.append(my_data, new_col, 1)
all_data.shape
#(210,9)
Alternative ways:
all_data = np.hstack((my_data, new_col))
#or
all_data = np.concatenate((my_data, new_col), 1)
I believe that the only difference between these three functions (as well as np.vstack) is their default behavior when axis is unspecified:
concatenate assumes axis = 0
hstack assumes axis = 1 unless inputs are 1d, then axis = 0
vstack assumes axis = 0 after adding an axis if inputs are 1d
append flattens the arrays
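A minimal sketch of those defaults, using small made-up arrays (the names a, b, m and col are illustrative only):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# 1d inputs: concatenate and hstack both join end to end along axis 0
joined_concat = np.concatenate((a, b))
joined_hstack = np.hstack((a, b))

# vstack first promotes each 1d input to a row, then stacks along axis 0
stacked = np.vstack((a, b))

m = np.zeros((2, 3))
col = np.ones((2, 1))

# append without an axis flattens both inputs before joining
flattened = np.append(m, col)
# append with axis=1 behaves like hstack for 2d inputs
widened = np.append(m, col, axis=1)

print(joined_concat.shape, joined_hstack.shape, stacked.shape)
print(flattened.shape, widened.shape)
```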
Based on your comment, and looking more closely at your example code, I now believe that what you are probably looking to do is add a field to a record array. You imported both genfromtxt, which returns a structured array, and recfromcsv, which returns the subtly different record array (recarray). You used recfromcsv, so right now my_data is actually a recarray, which means that most likely my_data.shape = (210,), since recarrays are 1d arrays of records, where each record is a tuple with the given dtype.
So you could try this:
import numpy as np
from numpy.lib.recfunctions import append_fields
x = np.random.random(10)
y = np.random.random(10)
z = np.random.random(10)
data = np.array( list(zip(x,y,z)), dtype=[('x',float),('y',float),('z',float)])
data = np.recarray(data.shape, data.dtype, buf=data)
data.shape
#(10,)
tot = data['x'] + data['y'] + data['z'] # sum(axis=1) won't work on recarray
tot.shape
#(10,)
all_data = append_fields(data, 'total', tot, usemask=False)
all_data
#array([(0.4374783740738456 , 0.04307289878861764, 0.021176067323686598, 0.5017273401861498),
# (0.07622262416466963, 0.3962146058689695 , 0.27912715826653534 , 0.7515643883001745),
# (0.30878532523061153, 0.8553768789387086 , 0.9577415585116588 , 2.121903762680979 ),
# (0.5288343561208022 , 0.17048864443625933, 0.07915689716226904 , 0.7784798977193306),
# (0.8804269791375121 , 0.45517504750917714, 0.1601389248542675 , 1.4957409515009568),
# (0.9556552723429782 , 0.8884504475901043 , 0.6412854758843308 , 2.4853911958174133),
# (0.0227638618687922 , 0.9295332854783015 , 0.3234597575660103 , 1.275756904913104 ),
# (0.684075052174589 , 0.6654774682866273 , 0.5246593820025259 , 1.8742119024637423),
# (0.9841793718333871 , 0.5813955915551511 , 0.39577520705133684 , 1.961350170439875 ),
# (0.9889343795296571 , 0.22830104497714432, 0.20011292764078448 , 1.4173483521475858)],
# dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8'), ('total', '<f8')])
all_data.shape
#(10,)
all_data.dtype.names
#('x', 'y', 'z', 'total')
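To see the same thing with data read from a file, here is a rough sketch that uses genfromtxt on an in-memory string. The text content and the field names open, close and extra are made up for illustration and only stand in for the real 'LIAB.ST.csv':

```python
import io
import numpy as np
from numpy.lib.recfunctions import append_fields

# Hypothetical tab-separated content standing in for 'LIAB.ST.csv'
text = "open\tclose\n1.0\t2.0\n3.0\t4.0\n5.0\t6.0\n"

# names=True reads the header row into field names, giving a structured array
my_data = np.genfromtxt(io.StringIO(text), delimiter="\t", names=True)
print(my_data.shape)  # one dimension: one record per row

# Add a zero-filled field, which plays the role of the extra "column"
zeros = np.zeros(my_data.shape[0])
all_data = append_fields(my_data, "extra", zeros, usemask=False)
print(all_data.dtype.names)
```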

If you have an array a of, say, 210 rows by 8 columns:
a = numpy.empty([210,8])
and want to add a ninth column of zeros you can do this:
b = numpy.append(a,numpy.zeros([len(a),1]),1)

The easiest solution is to use numpy.insert().
The advantage of np.insert() over np.append() is that you can insert the new columns at arbitrary indices.
import numpy as np
X = np.arange(20).reshape(10,2)
X = np.insert(X, [0,2], np.random.rand(X.shape[0]*2).reshape(-1,2)*10, axis=1)

np.append or np.hstack expects the appended column to be the proper shape, that is N x 1. We can use np.zeros to create this zeros column (or np.ones to create a ones column) and append it to our original matrix (2D array).
def append_zeros(x):
    zeros = np.zeros((len(x), 1)) # zeros column as 2D array
    return np.hstack((x, zeros)) # append column

I add a new column of ones to a matrix in this way (the ones become the first column):
Z = np.append([[1 for _ in range(0,len(Z))]], Z.T, 0).T
Maybe it is not that efficient?

It can be done like this:
import numpy as np
# create a random matrix:
A = np.random.normal(size=(5,2))
# add a column of zeros to it:
print(np.hstack((A,np.zeros((A.shape[0],1)))))
In general, if A is an m*n matrix and you need to add a column, you have to create an m*1 matrix of zeros, then use "hstack" to add the matrix of zeros to the right of the matrix A.

Similar to some of the other answers suggesting using numpy.hstack, but more readable:
import numpy as np
# declare 10 rows x 3 cols integer array of all 1s
arr = np.ones((10, 3), dtype=np.int64)
# get the number of rows in the original array (as if we didn't know it was 10 or it could be different in other cases)
numRows = arr.shape[0]
# declare the new array which will be the new column, integer array of all 0s so it's visually distinct from the original array
additionalColumn = np.zeros((numRows, 1), dtype=np.int64)
# use hstack to tack on the additional column
result = np.hstack((arr, additionalColumn))
print(result)
result:
$ python3 scratchpad.py
[[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]]

Here's a shorter one-liner:
import numpy as np
data = np.random.rand(210, 8)
data = np.c_[data, np.zeros(len(data))]
I often use the same pattern with np.ones instead, to convert points to homogeneous coordinates.
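As a small illustration of the homogeneous-coordinates case, with made-up 2D points (the names points and homogeneous are only for this sketch):

```python
import numpy as np

# Two hypothetical 2D points, one per row
points = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# np.c_ treats the 1d array of ones as a column and joins it on the right
homogeneous = np.c_[points, np.ones(len(points))]
print(homogeneous)
```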

Related

Faster way to extract one numpy array using another

I have two large one-dimensional numpy arrays (~1e8+ elements) of the same size. Array 1 (a1) is an integer indicator value, it tells me the meaning of the values in a2. Array 2 (a2) has measurement values. I want to separate the values in a2, based on the values in a1. I want to iterate over a list of possible values in a1 and select corresponding values (by index) from a2. Because of the data size and the number of different times I have to do this, it needs to be as fast as possible.
Here's a simple example to hopefully explain my problem more clearly:
a1 = [0, 2, 1, 1, 0, 2] # actual values range from 0 to 4095
a2 = [0.5, 2.4, 1.0, 1.2, 0.4, 2.6] # dummy values for example
I would like to separate the values in a2 based on the values in a1. I am saving each separated array into an HDF5 dataset.
So, in the end, I need an array for each value in a1
out = [0.5,0.4] # a1 == 0
out = [1.0, 1.2] # a1 == 1
out = [2.4, 2.6] # a1 == 2
Currently, I've tried the following:
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
for i in range(0, 4096):
    ind = a1 == i
    out = a2[ind]
    # more code here to save out to h5 file
Based on some timeit testing, np.extract works just slightly faster than this boolean masking approach for large arrays:
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
for i in range(0,4096):
    ind = a1 == i
    out = np.extract(ind, a2)
    # more code here to save out to h5 file
I also tried putting this into a pandas series with a1 as the index, but this was much slower.
import pandas as pd
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
s = pd.Series(a2, index=a1)
for i in range(0,4096):
    out = s.loc[i] #this is WAY slower than numpy
    # more code here to save out to h5 file
I'm wondering if there's a faster way of doing this? Maybe with sorting and searchsorted? Is it possible to generate a "database-like index" for a numpy array?
As I see it, both of your source arrays are plain Python lists.
So the first step is to convert them to Numpy arrays:
arr1 = np.array(a1)
arr2 = np.array(a2)
Then, to get your expected result as a list of Numpy arrays, run:
rng = arr1.max() + 1
result = [ arr2[np.nonzero(arr1 == i)] for i in range(rng) ]
Details:
np.nonzero(arr1 == i) - generates the indices of arr1 where the element equals i,
arr2[...] - retrieves the elements of arr2 at those indices,
[ ... for i in range(rng) ] - a list comprehension that generates an output array for each value from 0 to arr1.max().
For your data sample you will get:
[array([0.5, 0.4]), array([1. , 1.2]), array([2.4, 2.6])]
Or, if you want your result as a list of plain lists, change the above code to:
result = [ arr2[np.nonzero(arr1 == i)].tolist() for i in range(rng) ]
This time you will get:
[[0.5, 0.4], [1.0, 1.2], [2.4, 2.6]]
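The sorting and searchsorted idea raised in the question can be sketched as follows. This is only one possible approach, not a benchmarked one; the function name group_by_indicator and the parameter n_values are made up for the example. It sorts once and then takes slices, instead of building a full boolean mask for every value:

```python
import numpy as np

def group_by_indicator(a1, a2, n_values):
    """Split a2 into one array per indicator value 0..n_values-1 found in a1."""
    # A stable sort keeps the original order of a2 inside each group
    order = np.argsort(a1, kind="stable")
    sorted_keys = a1[order]
    sorted_vals = a2[order]
    # bounds[i] is the first position in sorted_keys holding a value >= i
    bounds = np.searchsorted(sorted_keys, np.arange(n_values + 1))
    return [sorted_vals[bounds[i]:bounds[i + 1]] for i in range(n_values)]

a1 = np.array([0, 2, 1, 1, 0, 2])
a2 = np.array([0.5, 2.4, 1.0, 1.2, 0.4, 2.6])
groups = group_by_indicator(a1, a2, 3)
print(groups)
```

Each element of groups is a slice of the sorted copy of a2, so writing it to an HDF5 dataset would not need a further per-value mask.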

Taking specific 2d array from 3d in numpy

Is there a way to avoid using the for loop and get the result just by calling arr with some indexing? Potentially dim1 will be equal to 50 000, dim2 up to 1000, dim3 fixed to 3.
import numpy as np
dim1 = 10
dim2 = 2
dim3 = 3
arr = np.arange(60).reshape(dim1,dim2,dim3)
arr2 = np.arange(dim1*dim2).reshape(dim1,dim2)
np.mod(arr2,dim3,out=arr2)
res = []
rng = np.arange(dim1)
for x in range(dim2):
    sl = arr2[:,x]
    temp = arr[rng,x,sl]
    res.append(temp)
res = np.asarray(res).T
Basically, I would like to extract the values from arr which is a 3D array, however the matrix arr2 indicates which columns to select.
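One loop-free way to express this selection, offered as a sketch rather than a tuned solution, is np.take_along_axis, or equivalently broadcast integer index arrays. Both pick arr[i, j, arr2[i, j]] for every pair (i, j); the names rows, cols and res_fancy are introduced here for illustration:

```python
import numpy as np

dim1, dim2, dim3 = 10, 2, 3
arr = np.arange(dim1 * dim2 * dim3).reshape(dim1, dim2, dim3)
arr2 = np.arange(dim1 * dim2).reshape(dim1, dim2) % dim3

# Select one entry along the last axis for every (i, j), then drop that axis
res = np.take_along_axis(arr, arr2[:, :, None], axis=2)[:, :, 0]

# Equivalent form: the three index arrays broadcast to shape (dim1, dim2)
rows = np.arange(dim1)[:, None]
cols = np.arange(dim2)[None, :]
res_fancy = arr[rows, cols, arr2]

print(res.shape)
```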

(Python) Mapping between two arrays with a precedence array

Given a source array
src = np.random.rand(320,240)
and an index array
idx = np.indices(src.shape).reshape(2, -1)
np.random.shuffle(idx.T)
we can map the linear index i in src to the 2-dimensional index idx[:,i] in a destination array dst via
dst = np.empty_like(src)
dst[tuple(idx)] = src.ravel()
This is discussed in Python: Mapping between two arrays with an index array
However, if this mapping is not 1-to-1, i.e., multiple entries in src map to the same entry in dst, according to the docs it is unspecified which of the source entries will be written to dst:
For advanced assignments, there is in general no guarantee for the iteration order. This means that if an element is set more than once, it is not possible to predict the final result.
If we are additionally given a precedence array
p = np.random.rand(*src.shape)
how can we use p to disambiguate this situation, i.e., write the entry with highest precedence according to p?
Here is a method using a sparse matrix for sorting (it has large overhead but scales better than argsort, presumably because it uses some radix sort like method (?)). Duplicate indices without precedence are explicitly set to -1. We make the destination array one cell too big, the surplus cell serving as trash can.
import numpy as np
from scipy import sparse
N = 2
idx = np.random.randint(0, N, (2, N, N))
prec = np.random.random((N, N))
src = np.arange(N*N).reshape(N, N)
def f_sparse(idx, prec, src):
    idx = np.ravel_multi_index(idx, src.shape).ravel()
    sp = sparse.csr_matrix((prec.ravel(), idx, np.arange(idx.size+1)),
                           (idx.size, idx.size)).tocsc()
    top = sp.indptr.argmax()
    mx = np.repeat(np.maximum.reduceat(sp.data, sp.indptr[:top]),
                   np.diff(sp.indptr[:top+1]))
    res = idx.copy()
    res[sp.indices[sp.data != mx]] = -1
    dst = np.full((idx.size + 1,), np.nan)
    dst[res] = src.ravel()
    return dst[:-1].reshape(src.shape)
print(idx)
print(prec)
print(src)
print(f_sparse(idx, prec, src))
Sample run:
[[[1 0]
  [1 0]]
 [[0 1]
  [0 0]]]
[[0.90995366 0.92095225]
 [0.60997092 0.84092015]]
[[0 1]
 [2 3]]
[[ 3.  1.]
 [ 0. nan]]
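For comparison, here is a plain-NumPy sketch of the same idea that uses np.lexsort instead of a sparse matrix. It is a different technique from f_sparse and has not been timed against it; the function name map_with_precedence is made up, and it assumes idx is non-empty. It sorts the source entries by destination and then by precedence, and keeps only the last entry of each run of equal destinations, so every destination cell is written exactly once:

```python
import numpy as np

def map_with_precedence(idx, prec, src):
    # Flatten the 2D destination indices into linear indices
    flat = np.ravel_multi_index(tuple(idx), src.shape).ravel()
    # Primary key: destination; secondary key: precedence (ascending)
    order = np.lexsort((prec.ravel(), flat))
    flat_sorted = flat[order]
    # The last entry of each run of equal destinations has the highest precedence
    is_last = np.ones(flat.size, dtype=bool)
    is_last[:-1] = flat_sorted[1:] != flat_sorted[:-1]
    winners = order[is_last]
    # Destinations that nothing maps to stay NaN
    dst = np.full(src.size, np.nan)
    dst[flat[winners]] = src.ravel()[winners]
    return dst.reshape(src.shape)

# Same inputs as the sample run above
idx = np.array([[[1, 0], [1, 0]], [[0, 1], [0, 0]]])
prec = np.array([[0.90995366, 0.92095225], [0.60997092, 0.84092015]])
src = np.arange(4).reshape(2, 2)
result = map_with_precedence(idx, prec, src)
print(result)
```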

How can I always have numpy.ndarray.shape return a two valued tuple?

I'm trying to get the values of (nRows, nCols) from a 2D Matrix but when it's a single row (i.e. x = np.array([1, 2, 3, 4])), x.shape will return (4,) and so my statement of (nRows, nCols) = x.shape returns "ValueError: need more than 1 value to unpack"
Any suggestions on how I can make this statement more adaptable? It's for a function that is used in many programs and should work with both single-row and multi-row matrices. Thanks!
You could create a function that returns a tuple of rows and columns like this:
def rowsCols(a):
    if len(a.shape) > 1:
        rows = a.shape[0]
        cols = a.shape[1]
    else:
        rows = a.shape[0]
        cols = 0
    return (rows, cols)
where a is the array you input to the function. Here's an example of using the function:
import numpy as np
x = np.array([1,2,3])
y = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
def rowsCols(a):
    if len(a.shape) > 1:
        rows = a.shape[0]
        cols = a.shape[1]
    else:
        rows = a.shape[0]
        cols = 0
    return (rows, cols)
(nRows, nCols) = rowsCols(x)
print('rows {} and columns {}'.format(nRows, nCols))
(nRows, nCols) = rowsCols(y)
print('rows {} and columns {}'.format(nRows, nCols))
This prints rows 3 and columns 0 then rows 4 and columns 3. Alternatively, you can use the atleast_2d function for a more concise approach:
(r, c) = np.atleast_2d(x).shape
print('rows {}, cols {}'.format(r, c))
(r, c) = np.atleast_2d(y).shape
print('rows {}, cols {}'.format(r, c))
Which prints rows 1, cols 3 and rows 4, cols 3.
If your function uses
(nRows, nCols) = x.shape
it probably also indexes or iterates on x with the assumption that it has nRows rows, e.g.
x[0,:]
for row in x:
    # do something with the row
Common practice is to reshape x (as needed) so it has at least 1 row. In other words, change the shape from (n,) to (1,n).
x = np.atleast_2d(x)
does this nicely. Inside a function, such a change to x won't affect x outside it. This way you can treat x as 2d throughout your function, rather than constantly checking whether it is 1d or 2d.
Python: How can I force 1-element NumPy arrays to be two-dimensional?
is one of many previous SO questions that asks about treating 1d arrays as 2d.
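A short sketch of that pattern inside a function; row_sums is a hypothetical example function, not something from the question:

```python
import numpy as np

def row_sums(x):
    # Promote a 1d input of shape (n,) to shape (1, n); 2d inputs pass through
    x = np.atleast_2d(x)
    n_rows, n_cols = x.shape  # unpacking is now always safe
    return np.array([row.sum() for row in x])

v = np.array([1, 2, 3, 4])
m = np.array([[1, 2, 3], [4, 5, 6]])
print(row_sums(v), row_sums(m))
print(v.shape)  # the caller's array keeps its original 1d shape
```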

Compare two numpy arrays by first Column and create a third numpy array by concatenating two arrays

I have two 2d numpy arrays which are used to plot simulation results.
The first column of both arrays a and b contains the time intervals and the second column contains the data to be plotted. The two arrays have different shapes a(500,2) b(600,2). I want to compare these two numpy arrays by first column and create a third array with matches found on the first column of a. If no match is found add 0 to third column.
Is there any numpy trick to do this?
For instance:
a=[[0.002,0.998],
[0.004,0.997],
[0.006,0.996],
[0.008,0.995],
[0.010,0.993]]
b= [[0.002,0.666],
[0.004,0.665],
[0.0041,0.664],
[0.0042,0.664],
[0.0043,0.664],
[0.0044,0.663],
[0.0045,0.663],
[0.0005,0.663],
[0.006,0.663],
[0.0061,0.662],
[0.008,0.661]]
expected output
c= [[0.002,0.998,0.666],
[0.004,0.997,0.665],
[0.006,0.996,0.663],
[0.008,0.995,0.661],
[0.010,0.993, 0 ]]
I can quickly think of the solution as
import numpy as np
a = np.array([[0.002, 0.998],
[0.004, 0.997],
[0.006, 0.996],
[0.008, 0.995],
[0.010, 0.993]])
b = np.array([[0.002, 0.666],
[0.004, 0.665],
[0.0041, 0.664],
[0.0042, 0.664],
[0.0043, 0.664],
[0.0044, 0.663],
[0.0045, 0.663],
[0.0005, 0.663],
[0.0006, 0.663],
[0.00061, 0.662],
[0.0008, 0.661]])
c = []
for row in a:
    index = np.where(b[:,0] == row[0])[0]
    if np.size(index) != 0:
        c.append([row[0], row[1], b[index[0], 1]])
    else:
        c.append([row[0], row[1], 0])
print(c)
As pointed out in the comments above, there seems to be a data entry error
import numpy as np
i = np.intersect1d(a[:,0], b[:,0])
overlap = np.vstack([i, a[np.in1d(a[:,0], i), 1], b[np.in1d(b[:,0], i), 1]]).T
underlap = np.setdiff1d(a[:,0], b[:,0])
underlap = np.vstack([underlap, a[np.in1d(a[:,0], underlap), 1], underlap*0]).T
fast_c = np.vstack([overlap, underlap])
This works by taking the intersection of the first column of a and b using intersect1d, and then using in1d to cross-reference that intersection with the second columns.
vstack stacks the elements of the input vertically, and the transpose is needed to get the right dimensions (very fast operation).
Then find times in a that are not in b using setdiff1d, and complete the result by putting 0s in the third column.
This prints out
array([[ 0.002, 0.998, 0.666],
[ 0.004, 0.997, 0.665],
[ 0.006, 0.996, 0. ],
[ 0.008, 0.995, 0. ],
[ 0.01 , 0.993, 0. ]])
The following works both for numpy arrays and simple python lists.
c = [[*x, y[1]] for x in a for y in b if x[0] == y[0]]
d = [[*x, 0] for x in a if x[0] not in [y[0] for y in b]]
c.extend(d)
Someone braver than I am could try to make this one line.
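Another vectorized variant, sketched under a few assumptions: it keeps the rows in the order of a, takes the first match when b has duplicate times, and requires b to be non-empty. The function name merge_on_first_column and the fill parameter are made up for this example. Like the other answers, it relies on exact floating-point equality of the time values:

```python
import numpy as np

def merge_on_first_column(a, b, fill=0.0):
    # Sort b by its time column so that searchsorted can be used
    order = np.argsort(b[:, 0], kind="stable")
    b_keys = b[order, 0]
    # Candidate position in sorted b for every time in a
    pos = np.searchsorted(b_keys, a[:, 0])
    pos = np.minimum(pos, len(b_keys) - 1)  # keep indices in range
    found = b_keys[pos] == a[:, 0]
    # Take the matching value from b, or the fill value when there is no match
    third = np.where(found, b[order[pos], 1], fill)
    return np.column_stack([a, third])

# Data as given in the question
a = np.array([[0.002, 0.998], [0.004, 0.997], [0.006, 0.996],
              [0.008, 0.995], [0.010, 0.993]])
b = np.array([[0.002, 0.666], [0.004, 0.665], [0.0041, 0.664],
              [0.0042, 0.664], [0.0043, 0.664], [0.0044, 0.663],
              [0.0045, 0.663], [0.0005, 0.663], [0.006, 0.663],
              [0.0061, 0.662], [0.008, 0.661]])
c = merge_on_first_column(a, b)
print(c)
```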
