What I need:
I have a dataframe where the elements of a column are lists. There are no duplications of elements in a list. For example, a dataframe like the following:
import pandas as pd
>>d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least one number contained in the list at row i is also contained in the list at row j, the two lists are merged (without duplication). Values could also be shared by more than two lists; in that case I want all lists that share at least one value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
Neither the order of the rows of the output dataframe nor the order of the values inside a list is important.
What I tried:
I have found this answer, that shows how to tell if at least one item in list is contained in another list, e.g.
>>not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
Returns True, since 2 is contained in both lists.
I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.
df.apply(lambda row: func(row['col1']), axis=1)
However, I'm not sure how to put these two things together, i.e. how to write the func method. I also don't know whether this approach is even feasible, since the result will probably have fewer rows than the original dataframe.
Thanks!
You can use networkx and graphs for that:
import networkx as nx
G = nx.Graph([edge for nodes in df['col1'] for edge in zip(nodes, nodes[1:])])
result = pd.Series(nx.connected_components(G))
This basically treats every number as a node; consecutive numbers in the same list are connected by an edge, so each list forms a path. Finally, you find the connected components.
Output:
0 {1, 2, 4, 8, 10, 19}
1 {16, 17, 15}
2 {18, 3}
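The same connected-components idea can be sketched without networkx, using a small union-find (disjoint-set) structure. This is only an illustrative sketch: the function name merge_overlapping is hypothetical, and it assumes the values are hashable and sortable. Unlike the edge-based graph above, it also registers lists that contain a single value.

```python
def merge_overlapping(lists):
    """Merge lists that share at least one value, directly or transitively."""
    parent = {}

    def find(x):
        # Follow parent links up to the representative, halving the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for values in lists:
        if not values:
            continue  # assumption: empty lists are simply skipped
        first = find(values[0])  # registers the value even if the list has one element
        for v in values[1:]:
            parent[find(v)] = first  # union the two components

    # Collect every value under its representative.
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return [sorted(g) for g in groups.values()]


merged = merge_overlapping([[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]])
print(merged)
```

The result can be put back into a dataframe with pd.DataFrame({'col1': merged}).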
This is not straightforward. Merging lists has many pitfalls.
One solid approach is to use a specialized library, for example networkx, and take a graph approach: generate successive edges and find the connected components.
Here is your graph:
You can thus:
1. generate successive edges with add_edges_from
2. find the connected_components
3. craft a dictionary and map the first item of each list
4. groupby and merge the lists (you could use the connected components directly but I'm giving a pandas solution in case you have more columns to handle)
import networkx as nx
G = nx.Graph()
for l in df['col1']:
    G.add_edges_from(zip(l, l[1:]))
groups = {k:v for v,l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
.agg(lambda x: sorted(set().union(*x)))
)
output:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
This seems more like a Python problem than a pandas one, so here's one attempt that checks every subsequent list and merges (and removes) it if it intersects:
vals = d["col1"]
# while there are lists left to process...
i = 0
while i < len(vals):
    current = set(vals[i])
    # for the next lists...
    j = i + 1
    while j < len(vals):
        # any intersection?
        # then update the current and delete the other
        other = vals[j]
        if current.intersection(other):
            current.update(other)
            del vals[j]
            # current grew, so re-check the lists that were skipped earlier
            j = i + 1
        else:
            # no intersection, so keep going for next lists
            j += 1
    # put the updated current back, and move on
    vals[i] = current
    i += 1
at the end, vals is
In [108]: vals
Out[108]: [{1, 2, 4, 8, 10, 19}, {15, 16, 17}, {3, 18}]
In [109]: pd.Series(map(list, vals))
Out[109]:
0 [1, 2, 19, 4, 8, 10]
1 [16, 17, 15]
2 [18, 3]
dtype: object
If you don't want the original list in d modified, start from a copy instead: vals = d["col1"].copy().
To add to mozway's answer: it wasn't clear from the question, but I also had rows with single-valued lists. These values are not added to the graph when calling add_edges_from(zip(l, l[1:])), since l[1:] is empty. I solved it by adding a single node to the graph whenever l[1:] is empty. I leave the solution here in case anyone needs it.
import networkx as nx
import pandas as pd
d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4], [9]]}
df= pd.DataFrame(data=d)
G = nx.Graph()
for l in df['col1']:
    if len(l[1:]) == 0:
        G.add_node(l[0])
    else:
        G.add_edges_from(zip(l, l[1:]))
groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
out= (df.groupby(df['col1'].str[0].map(groups), as_index=False)
.agg(lambda x: sorted(set().union(*x))))
Result:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
3 [9]
I have a numpy array and a mask specifying which entries from that array to shuffle while keeping their relative order. Let's have an example:
In [2]: arr = np.array([5, 3, 9, 0, 4, 1])
In [4]: mask = np.array([True, False, False, False, True, True])
In [5]: arr[mask]
Out[5]: array([5, 4, 1]) # These entries shall be shuffled inside arr, while keeping their order.
In [6]: np.where(mask==True)
Out[6]: (array([0, 4, 5]),)
In [7]: shuffle_array(arr, mask) # I'm looking for an efficient realization of this function!
Out[7]: array([3, 5, 4, 9, 0, 1]) # See how the entries 5, 4 and 1 haven't changed their order.
I've written some code that can do this, but it's really slow.
import numpy as np
def shuffle_array(arr, mask):
    perm = np.arange(len(arr)) # permutation array
    n = mask.sum()
    if n > 0:
        old_true_pos = np.where(mask == True)[0] # old positions for which mask is True
        old_false_pos = np.where(mask == False)[0] # old positions for which mask is False
        new_true_pos = np.random.choice(perm, n, replace=False) # draw new positions
        new_true_pos.sort()
        new_false_pos = np.setdiff1d(perm, new_true_pos)
        new_pos = np.hstack((new_true_pos, new_false_pos))
        old_pos = np.hstack((old_true_pos, old_false_pos))
        perm[new_pos] = perm[old_pos]
    return arr[perm]
To make things worse, I actually have two large matrices A and B with shape (M,N). Matrix A holds arbitrary values, while each row of matrix B is the mask to use for shuffling the corresponding row of matrix A according to the procedure that I outlined above. So what I want is shuffled_matrix = row_wise_shuffle(A, B).
The only way I have so far found to do it is via my shuffle_array() function and a for loop.
Can you think of any numpy'onic way to accomplish this task avoiding loops? Thank you so much in advance!
For 1d case:
import numpy as np
a = np.arange(8)
b = np.array([1,1,1,1,0,0,0,0])
# Get ordered values
ordered_values = a[np.where(b==1)]
# We'll shuffle both arrays
shuffled_ix = np.random.permutation(a.shape[0])
a_shuffled = a[shuffled_ix]
b_shuffled = b[shuffled_ix]
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = ordered_values
a_shuffled # Notice that 0, 1, 2, 3 preserves order.
>>>
array([0, 1, 2, 6, 3, 4, 7, 5])
for 2d case, columnwise shuffle (along axis=1):
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# Get ordered values
i,j = np.where(b==1)
values = a[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*a.shape).argsort(axis=1)
a_shuffled = np.take_along_axis(a,idx,axis=1)
b_shuffled = np.take_along_axis(b,idx,axis=1)
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = values
# Get the result
a_shuffled # see that 4,5 | 6,7,8 | 12,13,14,15 | 20, 21 preserves order
>>>
array([[ 4, 1, 0, 3, 2, 5],
[ 9, 6, 7, 11, 8, 10],
[12, 13, 16, 17, 14, 15],
[23, 20, 19, 22, 21, 18]])
for 2d case, rowwise shuffle (along axis=0), we can use the same code, first transpose arrays and after shuffle transpose back:
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# As you said rowwise, we first transpose
at = a.T
bt = b.T
# Get ordered values
i,j = np.where(bt==1)
values = at[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*at.shape).argsort(axis=1)
at_shuffled = np.take_along_axis(at,idx,axis=1)
bt_shuffled = np.take_along_axis(bt,idx,axis=1)
# Replace the values with correct order
at_shuffled[np.where(bt_shuffled==1)] = values
# Get the result
a_shuffled = at_shuffled.T
a_shuffled # see that 6,12 | 7, 13 | 8,14,20 | 15, 21 preserves order
>>>
array([[ 6, 7, 2, 3, 10, 17],
[18, 19, 8, 15, 16, 23],
[12, 13, 14, 21, 4, 5],
[ 0, 1, 20, 9, 22, 11]])
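If the unmasked entries should keep their relative order as well (which is what the shuffle_array function in the question does), one possible loop-free sketch is to shuffle only the mask within each row and then refill both groups in their original order. The name row_wise_shuffle comes from the question; the rng parameter and the rest of the body are assumptions made for illustration. It relies on boolean indexing returning elements in row-major order, and on each row of the shuffled mask having the same number of True entries as the original row.

```python
import numpy as np

def row_wise_shuffle(A, B, rng=None):
    """Move the masked entries of each row to random positions,
    keeping the order of masked and of unmasked entries."""
    rng = np.random.default_rng() if rng is None else rng
    A = np.asarray(A)
    B = np.asarray(B, dtype=bool)
    # An independent random permutation of the column indices for every row.
    idx = rng.random(B.shape).argsort(axis=1)
    new_B = np.take_along_axis(B, idx, axis=1)  # where the masked entries will go
    out = np.empty_like(A)
    out[new_B] = A[B]      # masked values, original order within each row
    out[~new_B] = A[~B]    # unmasked values, original order within each row
    return out


A = np.arange(24).reshape(4, 6)
B = np.array([[0, 0, 0, 0, 1, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 0, 0]], dtype=bool)
out = row_wise_shuffle(A, B, rng=np.random.default_rng(0))
print(out)
```

For a single 1d array and mask, row_wise_shuffle(arr[None, :], mask[None, :])[0] gives the 1d result.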
Say, I have an N dimensional array my_array[D1][D2]...[DN]
For a certain application, like sensitivity analysis, I need to fix a point p=(d1, d2, ..., dN) and iterate along each dimension at a time.
The resulting behavior is
for x1 in range(0, D1):
    do_something(my_array[x1][d2][d3]...[dN])
for x2 in range(0, D2):
    do_something(my_array[d1][x2][d3]...[dN])
.
.
.
for xN in range(0, DN):
    do_something(my_array[d1][d2][d3]...[xN])
As you can see, there is a lot of duplicated code here. How can I reduce the work and write some elegant code instead?
For example, is there any approach to the generation of code similar to the below?
for d in range(0, N):
    iterate along the (d+1)th dimension of my_array, denoting the element as x:
        do_something(x)
You can use numpy.take and do something like the following. Go through the documentation for reference.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html
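import numpy as np
# assumes my_array is a NumPy array and p is the fixed point (d1, ..., dN)
N = my_array.ndim
for i in range(N):
    line = my_array
    # fix every axis except i at its coordinate from p, last axis first
    # so that the remaining axis numbers stay valid
    for axis in reversed(range(N)):
        if axis != i:
            line = np.take(line, p[axis], axis=axis)
    # line is now the 1-D slice through p along axis i
    for x in line:
        do_something(x)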
I don't understand what d1, d2, d3 are, but I guess you can do something like this:
def get_list_item_by_indexes_list(in_list, indexes_list):
    if len(indexes_list) <= 1:
        return in_list[indexes_list[0]]
    else:
        return get_list_item_by_indexes_list(in_list[indexes_list[0]], indexes_list[1:])
def do_to_each_dimension(multi_list, func, dimensions_lens):
    # the fixed point; as a placeholder, use the last index of every dimension
    d0_to_dN_list = [l - 1 for l in dimensions_lens]
    for dimension_index in range(0, len(dimensions_lens)):
        dimension_len = dimensions_lens[dimension_index]
        for x in range(0, dimension_len):
            curr_d0_to_dN_list = d0_to_dN_list.copy()
            curr_d0_to_dN_list[dimension_index] = x
            func(get_list_item_by_indexes_list(multi_list, curr_d0_to_dN_list))
def do_something(n):
    print(n)
dimensions_lens = [3, 5]
my_array = [
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15]
]
do_to_each_dimension(my_array, do_something, dimensions_lens)
Output:
5 10 15 11 12 13 14 15
This code iterates through the last column and the last row of a 2d array.
Now, to iterate through the last line of each dimension of 3d array:
dimensions_lens = [2, 4, 3]
my_array = [
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
],
[
[13, 14, 15],
[16, 17, 18],
[19, 20, 21],
[22, 23, 24]
],
]
do_to_each_dimension(my_array, do_something, dimensions_lens)
Output:
12 24 15 18 21 24 22 23 24
(Note: don't use zero-length dimensions with this code)
You could mess with the string representation of your array access (my_arr[d1][d2]...[dN]) and eval it afterwards to get the values you want. This is fairly "hacky", but it works on arrays with arbitrary dimensions and lets you supply the indices as a list while handling the nested array access under the hood, which allows for a clean double for loop.
def access_at(arr, point):
    # build 'arr[p1][p2]...[pN]'
    access_str = 'arr' + ''.join([f'[{p}]' for p in point])
    return eval(access_str)
Using this access method is pretty straightforward:
p = [p1, ..., pN]
D = [D1, ..., DN]
for i in range(N):
    # copy p (a shallow copy is enough for a flat list of indices)
    pt = p[:]
    for x in range(D[i]):
        pt[i] = x
        do_something(access_at(my_array, pt))
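The same nested lookup can also be done without eval, for example by folding the index list over the nested lists with functools.reduce and operator.getitem. The sketch below keeps the access_at name from above; iterate_axes and its parameters are hypothetical names introduced only for this illustration.

```python
from functools import reduce
from operator import getitem

def access_at(arr, point):
    # Equivalent to arr[p1][p2]...[pN], applied one index at a time.
    return reduce(getitem, point, arr)

def iterate_axes(arr, p, dims, func):
    """Call func on every element reached by varying one coordinate of p at a time."""
    for i in range(len(dims)):
        pt = list(p)  # fresh copy of the fixed point for this dimension
        for x in range(dims[i]):
            pt[i] = x
            func(access_at(arr, pt))


my_array = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
]
seen = []
iterate_axes(my_array, p=[2, 4], dims=[3, 5], func=seen.append)
print(seen)  # [5, 10, 15, 11, 12, 13, 14, 15]
```

This works on nested lists and on NumPy arrays alike, since both support repeated single-index access.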
Consider two sorted numpy arrays:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
How do I:
1. Find the elements that appear in both lists, and
2. Remove only one instance of that occurrence from each list.
That is the output should be:
a = [1,2,4,8,10,21]
b = [3,3,18,22]
So even if there are duplicates, only one instance is removed. However if the lists are
c = np.array([1,2,4,4,6,8,10,10,10,21])
d = np.array([3,3,4,6,10,10,18,22])
I expect to obtain the new outputs:
c = [1,2,4,8,10,21]
d = [3,3,18,22]
which is the same as above. The difference is the number of 10's in the list. Each of the two 10's in list d takes away one 10 each from c leaving the same result.
This post was the closest match to my question, but it removed all instances of repeats from both lists.
You can use collections.Counter:
from collections import Counter
import numpy as np
a = np.array([1, 2, 4, 4, 6, 8, 10, 10, 21])
b = np.array([3, 3, 4, 6, 10, 18, 22])
ca = Counter(a)
cb = Counter(b)
result_a = sorted((ca - cb).elements())
result_b = sorted((cb - ca).elements())
print(result_a)
print(result_b)
Output
[1, 2, 4, 8, 10, 21]
[3, 3, 18, 22]
It returns the same result for (as expected):
a = np.array([1, 2, 4, 4, 6, 8, 10, 10, 10, 21])
b = np.array([3, 3, 4, 6, 10, 10, 18, 22])
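This works because Counter subtraction is a multiset difference: counts are subtracted per value, and any value whose count drops to zero or below is discarded. A small self-contained sketch on the second example from the question follows; the helper name remove_common_once is hypothetical, and plain lists are used so that the results are plain Python ints (for arrays, pass a.tolist()).

```python
from collections import Counter

def remove_common_once(a, b):
    """Cancel matching occurrences pairwise and return what is left of a and of b."""
    ca, cb = Counter(a), Counter(b)
    # ca - cb keeps, for each value, max(0, count in a - count in b).
    return sorted((ca - cb).elements()), sorted((cb - ca).elements())


rest_c, rest_d = remove_common_once(
    [1, 2, 4, 4, 6, 8, 10, 10, 10, 21],
    [3, 3, 4, 6, 10, 10, 18, 22],
)
print(rest_c)  # [1, 2, 4, 8, 10, 21]
print(rest_d)  # [3, 3, 18, 22]
```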
You can find the indices of the first occurrences of the intersecting items using np.searchsorted as follows, and then remove them using the np.delete() function:
In [58]: intersect = a[np.in1d(a, b)]
In [59]: mask1 = np.searchsorted(a, intersect)
In [60]: mask2 = np.searchsorted(b, intersect)
In [61]: np.delete(a, mask1)
Out[61]: array([ 1, 2, 4, 8, 10, 21])
In [62]: np.delete(b, mask2)
Out[62]: array([ 3, 3, 18, 22])
I'm not 100% sure what you're looking to do based on the question, but I have been able to duplicate the output using the methods described.
import numpy as np
# List of b that are not in a
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
newb = [x for x in b if x not in a]
print(newb)
# REMOVE ONE DUPLICATED ELEMENT FROM LIST
import collections
counter=collections.Counter(a)
print(counter)
newa = list(a)
for k,v in counter.items():
    if v > 1:
        newa.remove(k)
print(newa)
If you don't mind the verbosity:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
common_values = set(a) & set(b)
a = a.tolist()
b = b.tolist()
for value in common_values:
    a.remove(value)
    b.remove(value)
a = np.array(a)
b = np.array(b)
Using for loops:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
for i, val in enumerate(a):
    if val in b:
        a = np.delete(a, np.where(a == val)[0][0])
        b = np.delete(b, np.where(b == val)[0][0])
for i, val in enumerate(b):
    if val in a:
        a = np.delete(a, np.where(a == val)[0][0])
        b = np.delete(b, np.where(b == val)[0][0])
print(a)
print(b)
Outputs:
[1,2,4,8,10,21]
[3,3,18,22]
Here is a numpy approach:
import numpy as np
a = np.array([1,2,4,4,6,8,10,10,21])
b = np.array([3,3,4,6,10,18,22])
# join and sort (with Tim sort this should be O(n))
ab = np.concatenate([a,b])
i = ab.argsort(kind="stable")
abo = ab[i]
# mark 1st of each group of equal values
d = np.flatnonzero(np.diff(abo,prepend=abo[0]-1,append=abo[-1]+1))
# mark sorted total by origin (a -> False, b -> True)
ig = i>=len(a)
# compare origins of first and last of each group of equal values
# if they are different mark for deletion
dupl = ig[d[:-1]] ^ ig[d[1:]-1]
# finally, delete
ar = np.delete(a,i[d[:-1][dupl]])
br = np.delete(b,i[d[1:][dupl]-1]-len(a))
# inspect
ar
array([ 1, 2, 4, 8, 10, 21])
br
array([ 3, 3, 18, 22])
x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
I want to grab first 2 rows of array x from every block of 5, result should be:
x[fancy_indexing] = [1,2, 6,7, 11,12]
It's easy enough to build up an index like that using a for loop.
Is there a one-liner slicing trick that will pull it off? Points for simplicity here.
Approach #1 Here's a vectorized one-liner using boolean-indexing -
x[np.mod(np.arange(x.size),M)<N]
Approach #2 If you are going for performance, here's another vectorized approach using NumPy strides -
n = x.strides[0]
shp = (x.size//M,N)
out = np.lib.stride_tricks.as_strided(x, shape=shp, strides=(M*n,n)).ravel()
Sample run -
In [61]: # Inputs
...: x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
...: N = 2
...: M = 5
...:
In [62]: # Approach 1
...: x[np.mod(np.arange(x.size),M)<N]
Out[62]: array([ 1, 2, 6, 7, 11, 12])
In [63]: # Approach 2
...: n = x.strides[0]
...: shp = (x.size//M,N)
...: out=np.lib.stride_tricks.as_strided(x,shape=shp,strides=(M*n,n)).ravel()
...:
In [64]: out
Out[64]: array([ 1, 2, 6, 7, 11, 12])
I first thought you need this to work for 2d arrays due to your phrasing of "first N rows of every block of M rows", so I'll leave my solution as this.
You could work some magic by reshaping your array into 3d:
M = 5 # size of blocks
N = 2 # number of columns to keep from each block
x = np.arange(3*4*M).reshape(4,-1) # (4,3*M)-shaped dummy input
x = x.reshape(x.shape[0],-1,M)[:,:,:N].reshape(x.shape[0],-1) # (4,3*N)-shaped output
This will extract every column according to your preference. In order to use it for your 1d case you'd need to make your 1d array into a 2d one using x = x[None,:].
Reshape the array to multiple rows of five columns then take (slice) the first two columns of each row.
>>> x
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> x.reshape(x.shape[0] // 5, 5)[:,:2]
array([[ 1, 2],
[ 6, 7],
[11, 12]])
Or
>>> x.reshape(x.shape[0] // 5, 5)[:,:2].flatten()
array([ 1, 2, 6, 7, 11, 12])
>>>
It only works with 1-d arrays that have a length that is a multiple of five.
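One way to lift that restriction, sketched here under the assumption that a trailing partial block should contribute its first elements too, is to reshape only the part covered by complete blocks and handle the remainder separately. The function name first_n_of_every_m is hypothetical, and n is assumed not to exceed m.

```python
import numpy as np

def first_n_of_every_m(x, n, m):
    """First n elements of every block of m elements of a 1-d array."""
    x = np.asarray(x)
    full = (x.size // m) * m                              # length covered by complete blocks
    head = x[:full].reshape(full // m, m)[:, :n].ravel()  # first n of each complete block
    tail = x[full:][:n]                                   # first n of the trailing partial block
    return np.concatenate([head, tail])


x = np.arange(1, 18)  # 17 elements: three complete blocks of 5 plus a remainder of 2
print(first_n_of_every_m(x, 2, 5))
```

The result matches the boolean mask x[np.arange(x.size) % m < n] for any length.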
import numpy as np
x = np.array(range(1, 16))
y = np.vstack([x[0::5], x[1::5]]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12])
Taking the first N rows of every block of M rows in the array [1, 2, ..., K]:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
y = np.vstack([x[i::M] for i in range(N)]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])
Notice that .T is a cheap operation: it doesn't copy any data, but just changes the shape and strides of the array. .ravel() returns a view when it can, but on this transposed array it has to copy the data.
If you insist on getting your slice using fancy indexing:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
fancy_indexing = [i*M+n for i in range(len(x)//M) for n in range(N)]
x[fancy_indexing]
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])