Python - How to split an array based on the first column?

I have the fake data below. After reading it into an array it has shape (8, 3). Now I want to split the data based on the first column (ID) and return a list of arrays whose shapes will be [(3, 3), (2, 3), (3, 3)]. I think np.split could do the job by passing a 1-D array as the "indices_or_sections" argument, but is there a more convenient way to do this?
1 700 35
1 700 35
1 700 35
2 680 25
2 680 25
3 750 40
3 750 40
3 750 40

You can achieve this with a combination of np.split, argsort, np.unique and np.cumsum.
>>> a = [[1, 700, 35],
...      [1, 700, 35],
...      [1, 700, 35],
...      [2, 680, 25],
...      [2, 680, 25],
...      [3, 750, 40],
...      [3, 750, 40],
...      [3, 750, 40]]
>>> a = np.array(a)
>>> # sort the array by the first column
>>> a = a[a[:, 0].argsort()]
>>> np.split(a, np.cumsum(np.unique(a[:, 0], return_counts=True)[1])[:-1])
[array([[  1, 700,  35],
        [  1, 700,  35],
        [  1, 700,  35]]), array([[  2, 680,  25],
        [  2, 680,  25]]), array([[  3, 750,  40],
        [  3, 750,  40],
        [  3, 750,  40]])]
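If the array is already sorted by ID (as in the sample data), a slightly shorter variant is to ask np.unique for the first occurrence of each ID and split there; a minimal sketch of that idea:

```python
import numpy as np

a = np.array([[1, 700, 35], [1, 700, 35], [1, 700, 35],
              [2, 680, 25], [2, 680, 25],
              [3, 750, 40], [3, 750, 40], [3, 750, 40]])

# return_index gives the row where each ID first appears;
# splitting at those positions (dropping the leading 0) yields the groups.
_, first = np.unique(a[:, 0], return_index=True)
groups = np.split(a, first[1:])
print([g.shape for g in groups])  # [(3, 3), (2, 3), (3, 3)]
```

This avoids the cumsum step, at the cost of assuming sorted input.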

Related

np.dot 3x3 with N 1x3 arrays

I have an ndarray of N 1x3 arrays that I'd like to dot-multiply with a 3x3 matrix. I can't seem to figure out an efficient way to do this, as the multi_dot, tensordot, etc. methods all seem to recursively sum or multiply the results of each operation. I simply want to apply a dot product along the first axis, the same way you can apply a scalar. I can do this with a for loop or list comprehension, but it is much too slow for my application.
N = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9], ...])
m = np.asarray([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
I'd like to perform something such as this but without any python loops:
np.asarray([np.dot(m, a) for a in N])
so that it simply returns [m.dot(N[0]), m.dot(N[1]), m.dot(N[2]), ...]
What's the most efficient way to do this? And is there a way to do this so that if N is just a single 1x3 matrix, it will output the same as np.dot(m, N)?
Try This:
import numpy as np
N = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [4, 5, 6]])
m = np.asarray([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
re0 = np.asarray([np.dot(m, a) for a in N]) # original
re1 = np.dot(m, N.T).T # efficient
print("result0:\n{}".format(re0))
print("result1:\n{}".format(re1))
print("Is result0 == result1? {}".format(np.array_equal(re0, re1)))
Output:
result0:
[[ 140 320 500]
[ 320 770 1220]
[ 500 1220 1940]
[ 140 320 500]
[ 320 770 1220]]
result1:
[[ 140 320 500]
[ 320 770 1220]
[ 500 1220 1940]
[ 140 320 500]
[ 320 770 1220]]
Is result0 == result1? True
Time cost:
import timeit
setup = '''
import numpy as np
N = np.random.random((1, 3))
m = np.asarray([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
'''
>>> timeit.timeit("np.asarray([np.dot(m, a) for a in N])", setup=setup, number=100000)
0.295798063278
>>> timeit.timeit("np.dot(m, N.T).T", setup=setup, number=100000)
0.10135102272
# N = np.random.random((10, 3))
>>> timeit.timeit("np.asarray([np.dot(m, a) for a in N])", setup=setup, number=100000)
1.7417007659969386
>>> timeit.timeit("np.dot(m, N.T).T", setup=setup, number=100000)
0.1587108800013084
# N = np.random.random((100, 3))
>>> timeit.timeit("np.asarray([np.dot(m, a) for a in N])", setup=setup, number=100000)
11.6454949379
>>> timeit.timeit("np.dot(m, N.T).T", setup=setup, number=100000)
0.180465936661
First, regarding your last question. There's a difference between a (3,) N and (1,3):
In [171]: np.dot(m,[1,2,3])
Out[171]: array([140, 320, 500]) # (3,) result
In [172]: np.dot(m,[[1,2,3]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-172-e8006b318a32> in <module>()
----> 1 np.dot(m,[[1,2,3]])
ValueError: shapes (3,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
Your iterative version produces a (1,3) result:
In [174]: np.array([np.dot(m,a) for a in [[1,2,3]]])
Out[174]: array([[140, 320, 500]])
Make N a (4,3) array (this helps keep the first dim of N distinct):
In [176]: N = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10,11,12]])
In [177]: N.shape
Out[177]: (4, 3)
In [178]: np.array([np.dot(m,a) for a in N])
Out[178]:
array([[ 140,  320,  500],
       [ 320,  770, 1220],
       [ 500, 1220, 1940],
       [ 680, 1670, 2660]])
Result is (4,3).
A simple dot doesn't work (same as in the (1,3) case):
In [179]: np.dot(m,N)
...
ValueError: shapes (3,3) and (4,3) not aligned: 3 (dim 1) != 4 (dim 0)
In [180]: np.dot(m,N.T) # (3,3) dot with (3,4) -> (3,4)
Out[180]:
array([[ 140,  320,  500,  680],
       [ 320,  770, 1220, 1670],
       [ 500, 1220, 1940, 2660]])
So this needs another transpose to match your iterative result.
The explicit indices of einsum can also take care of these transposes:
In [181]: np.einsum('ij,kj->ki',m,N)
Out[181]:
array([[ 140,  320,  500],
       [ 320,  770, 1220],
       [ 500, 1220, 1940],
       [ 680, 1670, 2660]])
Also works with the (1,3) case (but not with the (3,) case):
In [182]: np.einsum('ij,kj->ki',m,[[1,2,3]])
Out[182]: array([[140, 320, 500]])
matmul, the @ operator, is also designed to calculate repeated dot products, if the inputs are 3d (or broadcastable to that):
In [184]: (m@N[:,:,None]).shape
Out[184]: (4, 3, 1)
In [185]: (m@N[:,:,None])[:,:,0] # to squeeze out that last dimension
Out[185]:
array([[ 140,  320,  500],
       [ 320,  770, 1220],
       [ 500, 1220, 1940],
       [ 680, 1670, 2660]])
The dot and matmul documentation describes what happens with 1, 2 and 3d inputs. It can take some time, and experimentation, to get a feel for what is happening. The basic rule is: the last axis of A pairs with the second-to-last axis of B.
Your N is actually (n,3): n (3,) arrays. Here's what 4 (1,3) arrays look like:
In [186]: N1 = N[:,None,:]
In [187]: N1.shape
Out[187]: (4, 1, 3)
In [188]: N1
Out[188]:
array([[[ 1,  2,  3]],

       [[ 4,  5,  6]],

       [[ 7,  8,  9]],

       [[10, 11, 12]]])
and the dot as before: (4,1,3) dot (3,3).T -> (4,1,3), squeezed to (4,3):
In [190]: N1.dot(m.T).squeeze()
Out[190]:
array([[ 140,  320,  500],
       [ 320,  770, 1220],
       [ 500, 1220, 1940],
       [ 680, 1670, 2660]])
and n of those:
In [191]: np.array([np.dot(a,m.T).squeeze() for a in N1])
Out[191]:
array([[ 140,  320,  500],
       [ 320,  770, 1220],
       [ 500, 1220, 1940],
       [ 680, 1670, 2660]])
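To pull the answers above together: the row-by-row loop, the double transpose, einsum, and the @ operator all produce the same (n, 3) result. A quick self-check:

```python
import numpy as np

N = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
m = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

loop = np.array([np.dot(m, a) for a in N])  # the original loop version

# All of these compute the same thing without a Python-level loop.
assert np.array_equal(loop, np.dot(m, N.T).T)
assert np.array_equal(loop, np.einsum('ij,kj->ki', m, N))
assert np.array_equal(loop, N @ m.T)
print(loop[0])  # [140 320 500]
```

The `N @ m.T` form is often the most readable: each row of N is multiplied by m transposed, which is exactly what the loop does.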

Counting uneven bins in Panda

df = pd.DataFrame({'email': ["a@gmail.com", "b@gmail.com", "c@gmail.com", "d@gmail.com", "e@gmail.com"],
                   'one': [88, 99, 11, 44, 33],
                   'two': [80, 80, 85, 80, 70],
                   'three': [50, 60, 70, 80, 20]})
Given this DataFrame, I would like to compute, for each of the columns one, two and three, how many values fall in certain ranges.
The ranges are for example: 0-70, 71-80, 81-90, 91-100
So the result would be:
out = pd.DataFrame({'colname': ["one", "two", "three"],
                    'b0to70': [3, 1, 4],
                    'b71to80': [0, 3, 1],
                    'b81to90': [1, 1, 0],
                    'b91to100': [1, 0, 0]})
What would be a nice idiomatic way to do this?
This would do it:
out = pd.DataFrame()
for name in ['one', 'two', 'three']:
    out[name] = pd.cut(df[name], bins=[0, 70, 80, 90, 100]).value_counts()
out.sort_index(inplace=True)
Returns:
           one  two  three
(0, 70]      3    1      4
(70, 80]     0    3      1
(80, 90]     1    1      0
(90, 100]    1    0      0
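The loop can also be collapsed into a single apply, since pd.cut returns a categorical whose value_counts includes empty bins. A sketch with the question's data (email column omitted):

```python
import pandas as pd

df = pd.DataFrame({'one': [88, 99, 11, 44, 33],
                   'two': [80, 80, 85, 80, 70],
                   'three': [50, 60, 70, 80, 20]})

bins = [0, 70, 80, 90, 100]
# Each column is cut into bins and counted; sort_index orders the intervals.
out = df.apply(lambda col: pd.cut(col, bins=bins).value_counts()).sort_index()
print(out['one'].tolist())  # [3, 0, 1, 1]
```

Because pd.cut produces the same categories for every column, the per-column counts align into one DataFrame automatically.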

How to break numpy array into smaller chunks/batches, then iterate through them

Suppose I have this numpy array:
[[ 1,  2,  3],
 [ 4,  5,  6],
 [ 7,  8,  9],
 [10, 11, 12]]
and I want to split it into 2 batches and then iterate over them:
[[1, 2, 3],    # Batch 1
 [4, 5, 6]]

[[7, 8, 9],    # Batch 2
 [10, 11, 12]]
What is the simplest way to do this?
EDIT: I'm sorry I left this out: the original array must not be destroyed by the splitting and batch iteration. Once the batch iteration has finished, I need to restart again from the first batch, so the original array has to be preserved. The whole idea is to be consistent with stochastic gradient descent algorithms, which require repeated iterations over batches. In a typical example, I could have a 100000-iteration for loop over just 1000 batches, which should be replayed again and again.
You can use numpy.split to split along the first axis into n parts, where n is the number of desired batches. Thus, the implementation would look like this -
np.split(arr, n, axis=0)  # n is the number of batches
Since the default value for axis is 0 already, we can skip setting it. So, we would simply have -
np.split(arr, n)
Sample runs -
In [132]: arr  # Input array of shape (10,3)
Out[132]:
array([[170,  52, 204],
       [114, 235, 191],
       [ 63, 145, 171],
       [ 16,  97, 173],
       [197,  36, 246],
       [218,  75,  68],
       [223, 198,  84],
       [206, 211, 151],
       [187, 132,  18],
       [121, 212, 140]])
In [133]: np.split(arr,2) # Split into 2 batches
Out[133]:
[array([[170,  52, 204],
        [114, 235, 191],
        [ 63, 145, 171],
        [ 16,  97, 173],
        [197,  36, 246]]), array([[218,  75,  68],
        [223, 198,  84],
        [206, 211, 151],
        [187, 132,  18],
        [121, 212, 140]])]
In [134]: np.split(arr,5) # Split into 5 batches
Out[134]:
[array([[170,  52, 204],
        [114, 235, 191]]), array([[ 63, 145, 171],
        [ 16,  97, 173]]), array([[197,  36, 246],
        [218,  75,  68]]), array([[223, 198,  84],
        [206, 211, 151]]), array([[187, 132,  18],
        [121, 212, 140]])]
Consider the array a:
a = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 11, 12]])
Option 1
Use reshape and //:
a.reshape(a.shape[0] // 2, -1, a.shape[1])
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
Option 2
If you wanted groups of two rather than two groups:
a.reshape(-1, 2, a.shape[1])
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
Option 3
Use a generator:
def get_every_n(a, n=2):
    for i in range(a.shape[0] // n):
        yield a[n * i:n * (i + 1)]

for sa in get_every_n(a, n=2):
    print(sa)
[[1 2 3]
 [4 5 6]]
[[ 7  8  9]
 [10 11 12]]
To avoid the error "array split does not result in an equal division",
np.array_split(arr, n, axis=0)
is better than np.split(arr, n, axis=0).
For example,
a = np.array([[170,  52, 204],
              [114, 235, 191],
              [ 63, 145, 171],
              [ 16,  97, 173]])
then
print(np.array_split(a, 2))
[array([[170,  52, 204],
        [114, 235, 191]]), array([[ 63, 145, 171],
        [ 16,  97, 173]])]
print(np.array_split(a, 3))
[array([[170,  52, 204],
        [114, 235, 191]]), array([[ 63, 145, 171]]), array([[ 16,  97, 173]])]
However, print(np.split(a, 3)) will raise an error, since 4 rows cannot be divided evenly into 3 parts.
This is what I have used to iterate through batches. I use the b.next() method to generate the indices, then use the output to slice a numpy array, for example a[b.next()], where a is a numpy array.
class Batch():
    def __init__(self, total, batch_size):
        self.total = total
        self.batch_size = batch_size
        self.current = 0

    def next(self):
        max_index = self.current + self.batch_size
        indices = [i if i < self.total else i - self.total
                   for i in range(self.current, max_index)]
        self.current = max_index % self.total
        return indices

b = Batch(10, 3)
print(b.next())  # [0, 1, 2]
print(b.next())  # [3, 4, 5]
print(b.next())  # [6, 7, 8]
print(b.next())  # [9, 0, 1]
print(b.next())  # [2, 3, 4]
print(b.next())  # [5, 6, 7]
Improving on the previous answer, to split based on batch size you can use:
def split_by_batchsize(arr, batch_size):
    return np.array_split(arr, (arr.shape[0] // batch_size) + 1)
or, with extra safety so that an exact multiple of batch_size doesn't get an extra split:
def split_by_batch_size(arr, batch_size):
    nbatches = arr.shape[0] // batch_size
    if nbatches != arr.shape[0] / batch_size:
        nbatches += 1
    return np.array_split(arr, nbatches)
example:
import numpy as np
nrows = 17
batch_size = 2
split_by_batchsize(np.random.random((nrows, 2)), batch_size)
# [array([[0.60482079, 0.81391257],
# [0.00175093, 0.25126441]]),
# array([[0.48591974, 0.77793401],
# [0.72128946, 0.3606879 ]]),
# array([[0.95649328, 0.24765806],
# [0.78844782, 0.56304567]]),
# array([[0.07310456, 0.76940976],
# [0.92163079, 0.90803845]]),
# array([[0.77838703, 0.98460593],
# [0.88397437, 0.39227769]]),
# array([[0.87599421, 0.7038426 ],
# [0.19780976, 0.12763436]]),
# array([[0.14263759, 0.9182901 ],
# [0.40523958, 0.0716843 ]]),
# array([[0.9802908 , 0.01067808],
# [0.53095143, 0.74797636]]),
# array([[0.7596607 , 0.97923229]])]
Sadly, simple iteration is faster than this fancier method, so I would not suggest using this approach.
batch_size = 3
nrows = 1000
arr = np.random.random((nrows, 2))

%%timeit
for i in range((arr.shape[0] // batch_size) + 1):
    idx = i * batch_size
    foo = arr[idx:idx + batch_size, :]
# 345 µs ± 119 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
for foo in split_by_batch_size(arr, batch_size):
    pass
# 1.84 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The speed difference seems to come from np.array_split creating the list of arrays up front.
Or simply slice it:
a = np.array([[1, 2, 3], [4, 5, 6],
              [7, 8, 9], [10, 11, 12]])
b = a[0:2]
c = a[2:4]
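For the SGD use case in the question's edit, note that none of the approaches above destroys the original array: numpy slicing returns views. A minimal epoch-replay sketch (function and variable names are illustrative):

```python
import numpy as np

def iterate_batches(arr, batch_size, epochs):
    """Yield batches in order, restarting from the first batch each epoch.

    Slicing a numpy array returns views, so arr itself is never modified.
    """
    for _ in range(epochs):
        for start in range(0, arr.shape[0], batch_size):
            yield arr[start:start + batch_size]

a = np.arange(12).reshape(4, 3)
batches = list(iterate_batches(a, batch_size=2, epochs=2))
print(len(batches))         # 4 -- 2 batches per epoch, replayed twice
print(batches[0].tolist())  # [[0, 1, 2], [3, 4, 5]]
```

A 100000-iteration training loop would simply wrap this generator (or recreate it per epoch); the source array stays intact throughout.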

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to compute, for each of the two data rows beneath the header, the average of every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0, something along the lines of:
selection = np.arange(0, 13)
for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)
    ........
Then I get lost. I just don't know how to link this range to the dataframe in order to compute the mean of every 12 data points in each column.
To summarize, I want the average of every 12 runs in each row, so I should end up with a separate dataframe of 2 x 8 average values (96/12 = 8 groups).
Any ideas?
Thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and force it to float (the presence of the ------ row means you've probably got an object dtype, which will make the mean unhappy).
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to combine each group of twelve cells, so you might be better off simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative; you will obviously see your own column sums.) You can then turn this into an 8x12 matrix with
ds.values.reshape(8, 12)
whose value is
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
       [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71],
       [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83],
       [84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]])
so that each row holds one group of 12 consecutive runs (reshaping to (12, 8) instead would mix runs from different groups into each column),
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(8, 12))
and then sum along each row:
rs.sum(axis=1)
giving
0      66
1     210
2     354
3     498
4     642
5     786
6     930
7    1074
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.
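Once the data is numeric, the whole pipeline can also be reduced to a single reshape; a sketch assuming 2 data rows and 96 columns of made-up values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((2, 96)))  # stand-in for the two data rows

# Group every 12 consecutive columns together, then average each group:
means = df.to_numpy().reshape(2, 8, 12).mean(axis=2)
print(means.shape)  # (2, 8) -- one mean per block of 12 runs, per row
```

Reshaping to (2, 8, 12) keeps each block of 12 consecutive runs contiguous along the last axis, so the axis-2 mean is exactly the per-block average the question asks for.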

reshaping ndarrays versus regular arrays in numpy?

I have an object of type numpy.ndarray, called "myarray", that when printed to the screen using Python's print looks like this:
[[[ 84 0 213 232] [153 0 304 363]]
[[ 33 0 56 104] [ 83 0 77 238]]
[[ 0 0 9 61] [ 0 0 2 74]]]
"myarray" is made by another library. The value of myarray.shape is (3, 2). I expected this to be a 3-dimensional array with three indices. When I try to make this structure myself, using:
second_array = array([[[84, 0, 213, 232], [153, 0, 304, 363]],
                      [[33, 0, 56, 104], [83, 0, 77, 238]],
                      [[0, 0, 9, 61], [0, 0, 2, 74]]])
I get that second_array.shape is equal to (3, 2, 4), as expected. Why is there this difference? Also, given this, how can I reshape "myarray" so that the two columns are merged, i.e. so that the result is:
[[[ 84 0 213 232 153 0 304 363]]
[[ 33 0 56 104 83 0 77 238]]
[[ 0 0 9 61 0 0 2 74]]]
Edit: to clarify, I know that in the case of second_array, I can do second_array.reshape((3,8)). But how does this work for the ndarray which has the format of myarray but does not have a 3d index?
myarray.dtype is "object" but can be changed to be ndarray too.
Edit 2: Getting closer, but I still cannot quite get the ravel/flatten followed by reshape to work. I have:
a = array([[1, 2, 3],
           [4, 5, 6]])
b = array([[ 7,  8,  9],
           [10, 11, 12]])
arr = array([a, b])
I try:
arr.ravel().reshape((2,6))
But this gives [[1, 2, 3, 4, 5, 6], ...] and I wanted [[1, 2, 3, 7, 8, 9], ...]. How can this be done?
thanks.
Indeed, ravel and hstack can be useful tools for reshaping arrays:
import numpy as np
myarray = np.empty((3, 2), dtype=object)
myarray[:] = [[np.array([ 84,   0, 213, 232]), np.array([153,   0, 304, 363])],
              [np.array([ 33,   0,  56, 104]), np.array([ 83,   0,  77, 238])],
              [np.array([  0,   0,   9,  61]), np.array([  0,   0,   2,  74])]]
myarray = np.hstack(myarray.ravel()).reshape(3,2,4)
print(myarray)
# [[[ 84 0 213 232]
# [153 0 304 363]]
# [[ 33 0 56 104]
# [ 83 0 77 238]]
# [[ 0 0 9 61]
# [ 0 0 2 74]]]
myarray = myarray.ravel().reshape(3,8)
print(myarray)
# [[ 84 0 213 232 153 0 304 363]
# [ 33 0 56 104 83 0 77 238]
# [ 0 0 9 61 0 0 2 74]]
Regarding Edit 2:
import numpy as np
a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[ 7,  8,  9],
              [10, 11, 12]])
arr = np.array([a, b])
print(arr)
# [[[ 1 2 3]
# [ 4 5 6]]
# [[ 7 8 9]
# [10 11 12]]]
Notice that
In [45]: arr[:,0,:]
Out[45]:
array([[1, 2, 3],
[7, 8, 9]])
Since you want the first row to be [1,2,3,7,8,9], the above shows that you want the second axis to be the first axis. This can be accomplished with the swapaxes method:
print(arr.swapaxes(0,1).reshape(2,6))
# [[ 1 2 3 7 8 9]
# [ 4 5 6 10 11 12]]
Or, given a and b (equivalently, arr[0] and arr[1]), you could form the result directly with the hstack function:
arr = np.hstack([a, b])
# [[ 1 2 3 7 8 9]
# [ 4 5 6 10 11 12]]
