Extend lists within a pandas Series - python

I have a pandas series that looks like this:
group
A [1,0,5,4,6,...]
B [2,2,0,1,9,...]
C [3,5,2,0,6,...]
I have a similar series that I would like to add to the existing series by extending each of the lists. How can I do this?
I tried
for x in series:
    x.extend(series[series.index[x]])
but this isn't working.

Consider the series s
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s
A [1, 0]
B [2, 2]
C [4, 1]
Name: group, dtype: object
You can extend each list with a similar series simply by adding the two series. pandas will use the underlying objects' __add__ method to combine the pairwise elements. In the case of lists, the __add__ method concatenates them.
s + s
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
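The same works when the other series is a genuinely different one sharing the index; each pair of lists is concatenated. A quick illustration with a made-up series t:
t = pd.Series([[5], [9, 9], [7]], list('ABC'), name='group')
s + t
A [1, 0, 5]
B [2, 2, 9, 9]
C [4, 1, 7]
Name: group, dtype: object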
However, this would not work if the elements were numpy.array, because array addition is element-wise rather than concatenation:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s = s.apply(np.array)
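A quick check of that behavior:
(s + s)['A']  # array([2, 0]) -- element-wise sum, not [1, 0, 1, 0]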
In this case, I'd make sure they are lists
s.apply(list) + s.apply(list)
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object

Solution with the add function (data sample borrowed from piRSquared):
s1 = s.add(s)
print(s1)
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
EDIT:
If some index values are different, it is more complicated: you need to reindex both series on the union of all index values and replace the resulting NaNs with empty lists via combine_first:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s1 = pd.Series([[3, 9], [6, 4]], list('AD'), name='group')
idx = s.index.union(s1.index)
s = s.reindex(idx).combine_first(pd.Series([[] for _ in idx], index=idx))
s1 = s1.reindex(idx).combine_first(pd.Series([[] for _ in idx], index=idx))
s2 = s.add(s1)
print(s2)
A [1, 0, 3, 9]
B [2, 2]
C [4, 1]
D [6, 4]
Name: group, dtype: object
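As a more compact variant of the same idea (my own sketch, not part of the original answer): reindex both series on the union and map the NaNs that reindex introduces to empty lists before adding:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s1 = pd.Series([[3, 9], [6, 4]], list('AD'), name='group')
idx = s.index.union(s1.index)
fill = lambda ser: ser.reindex(idx).apply(lambda v: v if isinstance(v, list) else [])
s2 = fill(s).add(fill(s1))  # same result as above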

Related

How to apply Max function between rows on 2D list in pandas grouped dataframe

I have a dataframe similar to the following where "data" is a 2D array:
id grouping_val data
1 a [[0, 1], [1, 0]]
2 a [[1, 0], [0, 1]]
3 b [[2, 0], [3, 0]]
4 b [[0, 4], [4, 5]]
How can I group them by "grouping_val" and take the max value at each index in the "data" column across all the rows, resulting in the following dataframe:
id grouping_val data
1 a [[1, 1], [1, 1]]
2 b [[2, 4], [4, 5]]
You can np.stack() the grouped arrays and take their max() along axis=0:
df = (df.groupby('grouping_val').data
        .apply(lambda x: np.stack(x).max(axis=0))
        .reset_index())
# grouping_val data
# 0 a [[1, 1], [1, 1]]
# 1 b [[2, 4], [4, 5]]
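For reference, a self-contained run of this approach on the sample data (note that np.stack requires every array within a group to have the same shape):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "grouping_val": ["a", "a", "b", "b"],
    "data": [[[0, 1], [1, 0]], [[1, 0], [0, 1]],
             [[2, 0], [3, 0]], [[0, 4], [4, 5]]],
})
out = (df.groupby("grouping_val")["data"]
         .apply(lambda x: np.stack(x).max(axis=0))
         .reset_index())
print(out)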
df = (
    df.groupby("grouping_val")["data"]
    .apply(lambda x: [[*map(max, zip(*subl))] for subl in zip(*x)])
    .reset_index()
)
print(df)
Prints:
grouping_val data
0 a [[1, 1], [1, 1]]
1 b [[2, 4], [4, 5]]
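To see how the pure-Python version works, trace it on the two rows of group "a": the outer zip pairs corresponding sub-lists across the rows, and the inner zip/map takes the element-wise max of each pair:
rows = [[[0, 1], [1, 0]], [[1, 0], [0, 1]]]           # the two "data" values in group a
paired = list(zip(*rows))                             # [([0, 1], [1, 0]), ([1, 0], [0, 1])]
print([[*map(max, zip(*subl))] for subl in paired])   # [[1, 1], [1, 1]]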

Python 3.x get odd columns on matrix

I am working with python 3.7 and I would like to get all the odd columns of a matrix.
To give an example, I have a 4x4 matrix of this style right now.
[[0, 9, 1, 6], [0, 3, 1, 5], [0, 2, 1, 7], [0, 6, 1, 2]]
That is...
0 9 1 6
0 3 1 5
0 2 1 7
0 6 1 2
And I would like to get:
9 6
3 5
2 7
6 2
The numbers and the size of the matrix will change but the structure will always be
[[0, (int), 1, (int), 2...], [0, (int), 1, (int), 2 ...], [0, (int), 1, (int), 2...], [0, (int), 1, (int), 2...], ...]
To get the rows I can do [::2], but that wonderful solution does not work for me right now. I tried to access the matrix with:
for i in matrix:
    for j in matrix:
But neither of these works either.
How can I solve it?
Thank you.
Without using numpy, you can use something similar to your indexing scheme ([1::2]) in a list comprehension:
>>> [i[1::2] for i in mat]
[[9, 6], [3, 5], [2, 7], [6, 2]]
Using numpy, you can do something similar:
>>> import numpy as np
>>> np.array(mat)[:,1::2]
array([[9, 6],
[3, 5],
[2, 7],
[6, 2]])
If you can't use NumPy for whatever reason, write a custom implementation:
def getColumns(matrix, columns):
    return {c: [matrix[r][c] for r in range(len(matrix))] for c in columns}
It takes a 2D array and a list of columns, and it returns a dictionary where the column indexes are keys and the actual columns are values. Note that if you passed all indices you would get a transposed matrix.
In your case,
M = [[0, 9, 1, 6],
[0, 3, 1, 5],
[0, 2, 1, 7],
[0, 6, 1, 2]]
The odd-numbered columns (first, third, ...) sit at even indices, because the first column has index 0. Therefore:
L = list(range(0, len(M[0]), 2))
And then you would do:
myColumns = getColumns(M, L)
print(list(myColumns.values()))
#result: [[0, 0, 0, 0], [1, 1, 1, 1]]
But since you showed the values as if they were in rows:
def f(matrix, columns):
    return [[matrix[row][i] for i in columns] for row in range(len(matrix))]
print(f(M, L))
#result: [[0, 1], [0, 1], [0, 1], [0, 1]]
And I believe that the latter is what you wanted.

numpy sort 2d: rearrange rows without changing values in row

How can the rows of an array be sorted without changing the values within each row?
Furthermore: how do I get the indices produced by this sort?
input:
a = np.array([[4,3],[0,3],[3,0],[1,3],[1,2],[2,0]])
required sorting array:
b = np.array([1,4,3,5,2,0])
a = a[b]
output:
a = np.array([[0,3],[1,2],[1,3],[2,0],[3,0],[4,3]])
How do I get the array b?
You need lexsort here:
b = np.lexsort((a[:, 1], a[:, 0]))
# array([1, 4, 3, 5, 2, 0], dtype=int64)
And applied to your initial array:
>>> a[b]
array([[0, 3],
[1, 2],
[1, 3],
[2, 0],
[3, 0],
[4, 3]])
As @miradulo pointed out, you may also use:
b = np.lexsort(np.fliplr(a).T)
Which is less verbose than explicitly stating the columns to sort on.
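Note that np.lexsort sorts by its last key first, so in both spellings column 0 is the primary key and column 1 breaks ties. A quick check that the two give the same permutation:
import numpy as np

a = np.array([[4, 3], [0, 3], [3, 0], [1, 3], [1, 2], [2, 0]])
b1 = np.lexsort((a[:, 1], a[:, 0]))   # last key (column 0) is the primary key
b2 = np.lexsort(np.fliplr(a).T)       # fliplr puts column 0 last as well
assert (b1 == b2).all()               # both yield [1, 4, 3, 5, 2, 0]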

In Python 3, convert np.array object type to float type, with variable number of object element

I have a np.array with dtype as object. Each element here is a np.array with dtype as float and shape as (2,2) --- in maths, it is a 2-by-2 matrix. My aim is to obtain one 2-dimensional matrix by converting all the object-type elements into float-type elements. This can be better presented by the following example.
dA = 2 # dA is the dimension of the following A, here use 2 as example only
A = np.empty((dA,dA), dtype=object) # A is a np.array with dtype as object
A[0,0] = np.array([[1,1],[1,1]]) # each element in A is a 2-by-2 matrix
A[0,1] = A[0,0]*2
A[1,0] = A[0,0]*3
A[1,1] = A[0,0]*4
My aim is to have one matrix B (the dimension of B is 2*dA-by-2*dA). The form of B in maths should be
B =
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
If dA is fixed at 2, then things are easier, because I can hard-code
a00 = A[0,0]
a01 = A[0,1]
a10 = A[1,0]
a11 = A[1,1]
B0 = np.hstack((a00,a01))
B1 = np.hstack((a10,a11))
B = np.vstack((B0,B1))
But in reality, dA is a variable; it can be 2 or any other integer, and then I don't know how to do it. I think nested for loops can help, but maybe you have better ideas. It would be great if there were something like the cell2mat function in MATLAB, because here you can see each A[i,j] as a cell in MATLAB.
Thanks in advance.
Here's a quick way.
Your A:
In [137]: A
Out[137]:
array([[array([[1, 1],
[1, 1]]), array([[2, 2],
[2, 2]])],
[array([[3, 3],
[3, 3]]), array([[4, 4],
[4, 4]])]], dtype=object)
Use numpy.bmat, but convert A to a python list first, so bmat does what we want:
In [138]: B = np.bmat(A.tolist())
In [139]: B
Out[139]:
matrix([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]])
The result is actually a numpy.matrix. If you need a regular numpy array, use the .A attribute of the matrix object:
In [140]: B = np.bmat(A.tolist()).A
In [141]: B
Out[141]:
array([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]])
Here's an alternative. (It still uses A.tolist().)
In [164]: np.swapaxes(A.tolist(), 1, 2).reshape(4, 4)
Out[164]:
array([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]])
In the general case, you would need something like:
In [165]: np.swapaxes(A.tolist(), 1, 2).reshape(A.shape[0]*dA, A.shape[1]*dA)
Out[165]:
array([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]])
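To see why the axis swap is needed: np.array(A.tolist()) has shape (block_row, block_col, row, col), but a plain reshape must read the axes in the order (block_row, row, block_col, col) to lay each block row out contiguously. The same steps made explicit:
t = np.array(A.tolist())   # shape (2, 2, 2, 2): (block_row, block_col, row, col)
t = t.swapaxes(1, 2)       # (block_row, row, block_col, col)
B = t.reshape(t.shape[0]*t.shape[1], t.shape[2]*t.shape[3])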
Your vstack/hstack could be written more compactly, and more generally, as
In [132]: np.vstack([np.hstack(a) for a in A])
Out[132]:
array([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]])
since for a in A iterates over the rows of A.
Warren suggests np.bmat, which is fine. But if you look at the bmat code, you'll see that it's just doing this kind of nested concatenation (expressed as a row loop with arr_rows.append(np.concatenate...)).
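For completeness, here is a minimal cell2mat-style sketch using the nested loops the question mentions; the function name is illustrative, and it assumes every cell A[i, j] has the same shape:
import numpy as np

def cell2mat(A):
    r, c = A[0, 0].shape                 # block shape, assumed uniform
    B = np.empty((A.shape[0] * r, A.shape[1] * c), dtype=A[0, 0].dtype)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            B[i*r:(i+1)*r, j*c:(j+1)*c] = A[i, j]  # copy each block into its slot
    return B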

Updating a NumPy array with another

Seemingly simple question: I have an array with two columns, the first represents an ID and the second a count. I'd like to update it with another, similar array such that
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
a.update(b) # ????
>>> np.array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3]])
Is there a way to do this with indexing/slicing such that I don't simply have to iterate over each row?
Generic case
Approach #1: You can use np.add.at to do such an ID-based adding operation like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Find indices where the first columns of a,b are placed in out_id
_,a_idx = np.where(a[:,None,0]==out_id)
_,b_idx = np.where(b[:,None,0]==out_id)
# Place second column of a into out_id & add in second column of b
out_count[a_idx] = a[:,1]
np.add.at(out_count, b_idx,b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
As a probably faster alternative for finding a_idx and b_idx, np.searchsorted could be used like so -
a_idx = np.searchsorted(out_id, a[:,0], side='left')
b_idx = np.searchsorted(out_id, b[:,0], side='left')
Sample input-output:
In [538]: a
Out[538]:
array([[1, 2],
[4, 2],
[3, 1],
[5, 5]])
In [539]: b
Out[539]:
array([[3, 7],
[1, 1],
[4, 0],
[2, 3],
[6, 2]])
In [540]: out
Out[540]:
array([[1, 3],
[2, 3],
[3, 8],
[4, 2],
[5, 5],
[6, 2]])
Approach #2: You can use np.bincount to do the same ID based adding -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Get all IDs and counts in a single arrays
id_arr = np.concatenate((a[:,0],b[:,0]))
count_arr = np.concatenate((a[:,1],b[:,1]))
# Get binned summations
summed_vals = np.bincount(id_arr,count_arr)
# Get mask of valid bins
mask = np.in1d(np.arange(np.max(out_id)+1),out_id)
# Mask valid summed bins for final counts array output
out_count = summed_vals[mask]
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
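One caveat worth noting: np.bincount returns a float array when it is given weights, so out_count comes back as float64 here; casting restores integer counts before stacking:
out_count = summed_vals[mask].astype(int)  # bincount with weights returns float64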
Specific case
If the ID columns in a and b are sorted, it becomes easier, as we can just use masks with np.in1d to index into the output ID array created with np.union1d, like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Masks of first columns of a and b matches in the output ID array
mask1 = np.in1d(out_id,a[:,0])
mask2 = np.in1d(out_id,b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Place second column of a into out_id & add in second column of b
out_count[mask1] = a[:,1]
np.add.at(out_count, np.where(mask2)[0],b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Sample run -
In [552]: a
Out[552]:
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[8, 5]])
In [553]: b
Out[553]:
array([[2, 2],
[3, 1],
[4, 0],
[5, 3],
[6, 2],
[8, 2]])
In [554]: out
Out[554]:
array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3],
[6, 2],
[8, 7]])
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[5, 3]])
Note that if you want the result to be sorted, you can use np.lexsort:
result[np.lexsort((result[:,1],result[:,0]))]
Explanation:
First, find the unique IDs with the following command:
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> col
array([1, 2, 3, 4, 5])
Then find the difference between the IDs of a and all of the IDs:
>>> dif=np.setdiff1d(col,a[:,0])
>>> dif
array([5])
Then find the items within b whose IDs are in dif:
>>> val=b[np.in1d(b[:,0],dif)]
>>> val
array([[5, 3]])
Finally, concatenate a with val:
>>> np.concatenate((a,val))
Consider another example, with sorting:
>>> a = np.array([[1, 2],
... [2, 2],
... [3, 1],
... [7, 5]])
>>>
>>> b = np.array([[2, 2],
... [3, 1],
... [4, 0],
... [5, 3]])
>>>
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
>>> result[np.lexsort((result[:,1],result[:,0]))]
array([[1, 2],
[2, 2],
[3, 1],
[4, 0],
[5, 3],
[7, 5]])
That's an old question, but here is a solution with pandas (which could be generalized to aggregation functions other than sum; see the sketch after the output). Sorting also happens automatically:
import pandas as pd
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
print(pd.DataFrame(a[:, 1], index=a[:, 0])
      .add(pd.DataFrame(b[:, 1], index=b[:, 0]), fill_value=0)
      .astype(int)
      .reset_index()
      .to_numpy())
Output:
[[1 2]
[2 4]
[3 2]
[4 5]
[5 3]]
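And a sketch of the claimed generalization (my own illustration, not part of the original answer): concatenating the two frames and grouping by the index lets you swap in any aggregation, e.g. max instead of sum:
print(pd.concat([pd.DataFrame(a[:, 1], index=a[:, 0]),
                 pd.DataFrame(b[:, 1], index=b[:, 0])])
      .groupby(level=0)[0].max()
      .reset_index()
      .to_numpy())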
