python union of 2 nested lists with index - python

I want to get the union of 2 nested lists plus an index to the common values.
I have two lists like A = [[1,2,3],[4,5,6],[7,8,9]] and B = [[1,2,3,4],[3,3,5,7]] but the length of each list is about 100 000. To A belongs an index vector with len(A): I = [2,3,4]
What I want is to find all sublists in B where the first 3 elements are equal to a sublist in A. In this example I want to get B[0] returned ([1,2,3,4]) because its first three elements are equal to A[0]. In addition, I also want the index to A[0] in this example, that is I[0].
I tried different things, but nothing worked so far :(
First I tried this:
Common = []
for i in range(len(B)):
    if B[i][:3] in A:
        # look up the index value that belongs to the matching sublist of A
        id = [I[x] for x, y in enumerate(A) if y == B[i][:3]][0]
        Common.append([int(id)] + B[i])
But that takes ages, or never finishes.
Then I transformed A and B into sets and took the union of both, which was very quick, but then I don't know how to get the corresponding indices.
Does anyone have an idea?

Create an auxiliary dict (work is O(len(A))), assuming the first three items of a sublist in A uniquely identify it (otherwise you need a dict of lists):
aud = dict((tuple(a[:3]), i) for i, a in enumerate(A))
Use said dict to loop once on B (work is O(len(B))) to get B sublists and A indices:
result = [(b, aud[tuple(b[:3])]) for b in B if tuple(b[:3]) in aud]
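If the first three items do not uniquely identify a sublist of A, the "dict of lists" variant mentioned above could look roughly like the sketch below. It also returns the values from I rather than positions in A, which is what the question asks for. The function name match_prefixes and the parameter k are made up for illustration.

```python
from collections import defaultdict

def match_prefixes(A, I, B, k=3):
    """Return [label] + b for every b in B whose first k items equal a sublist of A.

    A and I are assumed to have the same length; I[n] is the label of A[n].
    """
    # Map each k-item prefix of A to all labels that share it (dict of lists).
    labels_by_prefix = defaultdict(list)
    for a, label in zip(A, I):
        labels_by_prefix[tuple(a[:k])].append(label)

    common = []
    for b in B:
        # .get() avoids creating empty entries for prefixes that are not in A.
        for label in labels_by_prefix.get(tuple(b[:k]), ()):
            common.append([label] + list(b))
    return common

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
I = [2, 3, 4]
B = [[1, 2, 3, 4], [3, 3, 5, 7]]
print(match_prefixes(A, I, B))  # [[2, 1, 2, 3, 4]]
```

Building the dict is one pass over A and the lookup is one pass over B, so the total work stays roughly O(len(A) + len(B)).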

Related

Python: Get item(s) at index list

I am trying to write something that takes a list, and gets the item(s) at an index, either one or multiple.
The example below, which I found in another post here, works great when I have more than one index. It doesn't work if b is a single index.
a = [-2,1,5,3,8,5,6]
b = [1,2,5]
c = [ a[i] for i in b]
How do I get this to work with both a single index and multiple indices?
Example:
a = [-2,1,5,3,8,5,6]
b = 2
c = [ a[i] for i in b]  # doesn't work in this case: an int is not iterable
You can check whether the object you're using to fetch the items is a list (or a tuple, etc.). Here it is, wrapped into a function:
def find_values(in_list, ind):
    # ind is a list of numbers
    if isinstance(ind, list):
        return [in_list[i] for i in ind]
    else:
        # ind is a single number
        return [in_list[ind]]
in_list = [-2,1,5,3,8,5,6]
list_of_indices = [1,2,5]
one_index = 3
print(find_values(in_list, list_of_indices))
print(find_values(in_list, one_index))
The function takes the input list and the indices (renamed for clarity - it's best to avoid single letter names). The indices can either be a list or a single number. If isinstance determines your input is a list, it proceeds with a list comprehension. If it's a number - it just treats it as an index. If it is anything else, the program crashes.
This post gives you more details on isinstance and recognizing other iterables, like tuples, or lists and tuples together.
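As a rough sketch of the "other iterables" idea, the check can be widened so that tuples, ranges and similar iterables of indices are accepted as well as a single integer. The name get_items is hypothetical; Iterable and Integral come from the standard library.

```python
from collections.abc import Iterable
from numbers import Integral

def get_items(values, index):
    """Return a list of items from values for one index or an iterable of indices."""
    if isinstance(index, Integral):
        # A single integer index: wrap the one result in a list.
        return [values[index]]
    if isinstance(index, Iterable) and not isinstance(index, (str, bytes)):
        # Any non-string iterable (list, tuple, range, ...) of indices.
        return [values[i] for i in index]
    raise TypeError("index must be an integer or an iterable of integers")

a = [-2, 1, 5, 3, 8, 5, 6]
print(get_items(a, [1, 2, 5]))  # [1, 5, 5]
print(get_items(a, 2))          # [5]
print(get_items(a, (0, 6)))     # [-2, 6]
```

Raising TypeError for anything else makes the failure explicit instead of letting the program crash somewhere inside the indexing.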
a = [-2, 1, 5, 3, 8, 5, 6]
a2 = [-2]
b = [1, 2, 5]
b2 = [1]
c = [a[i] for i in b]
c2 = [a2[i-1] for i in b2]
The first index of a list is 0, and a list with one item is perfectly valid.
Instead of building a list that looks up the values of list b in list a, you could write three separate lines that print the values common to lists a and b, like this:
a = [-2,1,5,3,8,5,6]
b = [3,4,6]
for i in range(0,len(b)):
    if b[i] in a:
        print(b[i])
This prints the overlapping values even if list b holds only one value, or none at all.

Grouping a grouped list of str without duplicates

I have a grouped list of strings that sort of looks like this, the lists inside of these groups will always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
             ['fff','ggg','hhh','iii','jjj'],
             ['xxx','mmm','ccc','bbb','aaa'],
             ['fff','xxx','aaa','bbb','ddd'],
             ['aaa','bbb','ccc','ddd','eee'],
             ['fff','xxx','aaa','ddd','eee'],
             ['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group together all lists that are similar, where the first 3 elements of one list are compared against all of the elements of every other list.
So from the above example the output might look like this (output is the index of the list):
[[0,2,4],[3,5]]
Notice that if another list contains the same elements in a different order, it is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate group afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, using only the first 3 elements of each list for the comparison. For example, using the first 3 elements from list A, check whether lists B, C, D... contain all 3 of those elements. Repeat for the entire list, then remove any group that contains duplicate elements.
You can build a set of frozensets to keep track of indices of groups with the first 3 items being a subset of the rest of the members:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists and use that set to filter all combinations of 3 items from each sub-list. The time complexity is then essentially linear in the length of the input list rather than quadratic, despite the overhead of generating combinations:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])

difference between two lists greater than the difference in the number of elements in the two list

I have two lists (A, B) of the files that have been processed. List A contains all the initial files, list B contains all the files that have been processed successfully (so the second list (B) is a subset of the first one).
A contains 231453 items.
B contains 124769 items.
I want to subtract them to see which of those files didn't get processed. (C should contain 106684 items.)
To do so I am using set :
newlist=[]
newlist2=[]
newlist3=[]
newlist=( set(A) - ( set(A) & set(B) ) )
newlist2=(set(A)^set(B))
newlist3=(set(A) - set(B))
print(len(newlist))
print(len(newlist2))
print(len(newlist3))
The results are:
134173
161662
134173
Why are there more items than expected?
You have specified that A and B are lists. There is a possibility that there are duplicates in the list which are lost on converting to set.
"A set is a collection which is unordered and unindexed. In Python sets are written with curly brackets."
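For your case you could do
not_processed = list(filter(lambda x: x not in B, A))
OR
not_processed = [x for x in A if x not in B]
The above code gives all the x values in A that are not in B, keeping any duplicates in A. With lists this large, convert B to a set first (for example B = set(B)) so that each membership test is fast.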
Your B contains some items that are not in A. If B were a subset of A, all three lengths would have been the same. The fact that the symmetric difference is larger means that B contains items that are not in A.
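The numbers in the question support this: the symmetric difference is the disjoint union of A - B and B - A, so 161662 - 134173 = 27489 distinct items of B are not in A. The counts also fit the case where neither list has duplicates, since 134173 + (124769 - 27489) = 231453. A small diagnostic along the following lines could confirm both points on the real data. The function name diagnose and the sample file names are made up for illustration, and the items are assumed to be hashable and sortable (for example, file name strings).

```python
def diagnose(A, B):
    """Report why len(set(A) - set(B)) may differ from len(A) - len(B)."""
    set_a, set_b = set(A), set(B)
    return {
        # Duplicates vanish when a list is converted to a set.
        "duplicates_in_A": len(A) - len(set_a),
        "duplicates_in_B": len(B) - len(set_b),
        # Items that break the "B is a subset of A" assumption.
        "in_B_not_in_A": sorted(set_b - set_a),
        # The files that were never processed.
        "in_A_not_in_B": sorted(set_a - set_b),
    }

A = ["a.txt", "b.txt", "c.txt", "d.txt"]
B = ["a.txt", "c.txt", "x.txt"]
report = diagnose(A, B)
print(report["in_B_not_in_A"])  # ['x.txt']
print(report["in_A_not_in_B"])  # ['b.txt', 'd.txt']
```

Items listed under in_B_not_in_A are often the result of small differences such as a trailing newline, a different path prefix or different letter case, so they are worth inspecting by eye.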

comparing two unequal length and unequal dimension list

I have two lists, a and b:
a=[['apple'],['pear'],['grapes'],['cherry'],['mangoes'],['banana']]
b=[['apple',15,14],['orange',30,43],['pear',6,67],['grapes',90,709],['cherry',23,9]]
The result I want is:
b=[['apple',15,14],['orange',30,43],['pear',6,67],['grapes',90,709],['cherry',23,9],['mangoes',0.0,0.0],['banana',0.0,0.0]]
I am trying to compare two unequal length lists and append unique values from one list into another list of unequal dimension.
First, I would make a flat list of strings. If you are getting a from somewhere else, you can convert it with
a = [k[0] for k in a]
Next, I would make b a dict, again converting if necessary with
b = dict((k[0], k[1:]) for k in b)
Finally, you just want to augment b with new values for keys that are in a but not yet in b.
b.update(dict((k, [0.0,0.0]) for k in a if k not in b))
To convert back to a list of lists, use
b = [[key] + list(values) for key, values in b.items()]
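If the order of the rows in b matters, as it does in the expected result shown in the question, one possible alternative is to keep b as a list and only append the missing names. This is a sketch under that assumption; the function name pad_missing and the fill parameter are hypothetical.

```python
def pad_missing(a, b, fill=(0.0, 0.0)):
    """Append a row [name, *fill] to b for every name in a that b lacks."""
    known = {row[0] for row in b}       # names already present in b
    result = [list(row) for row in b]   # copy so the input is not modified
    for (name,) in a:                   # each entry of a is a one-item list
        if name not in known:
            result.append([name, *fill])
            known.add(name)
    return result

a = [['apple'], ['pear'], ['grapes'], ['cherry'], ['mangoes'], ['banana']]
b = [['apple', 15, 14], ['orange', 30, 43], ['pear', 6, 67],
     ['grapes', 90, 709], ['cherry', 23, 9]]
print(pad_missing(a, b)[-2:])  # [['mangoes', 0.0, 0.0], ['banana', 0.0, 0.0]]
```

The set lookup keeps the work linear in len(a) + len(b), and the existing rows of b stay in their original order.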

large array searching with numpy

I have two arrays of integers
a = numpy.array([1109830922873, 2838383, 839839393, ..., 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, ..., 29839933982])
where len(a) ~ 15,000 and len(b) ~ 2 million.
What I want is to find the indices of array b elements which match those in array a. Now, I'm using list comprehension and numpy.argwhere() to achieve this:
bInds = [ numpy.argwhere(b == c)[0] for c in a ]
however, obviously, it is taking a long time to complete this. And array a will become larger too, so this is not a sensible route to take.
Is there a better way to achieve this result, considering the large arrays I'm dealing with here? It currently takes around ~5 minutes to do this. Any speed up is needed!
More info: I want the indices to match the order of array a too. (Thanks Charles)
Unless I'm mistaken, your approach searches the entire array b for each element of a again and again.
Alternatively, you could create a dictionary mapping the individual elements from b to their indices.
indices = {}
for i, e in enumerate(b):
    indices[e] = i  # if elements in b are unique
    # indices.setdefault(e, []).append(i)  # otherwise, use this line instead to collect lists
Then you can use this mapping for quickly finding the indices where elements from a can be found in b.
bInds = [ indices[c] for c in a ]
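Put together, the dictionary approach might look like the sketch below. It keeps the first position of each value in b, mirroring the [0] in the question's argwhere call, and returns -1 for values of a that do not occur in b. The function name first_indices and the -1 sentinel are choices made for this illustration only.

```python
def first_indices(a, b):
    """For each element of a (in order), return the index of its first match in b."""
    first_seen = {}
    for i, value in enumerate(b):
        # setdefault keeps the earliest index when b contains duplicates.
        first_seen.setdefault(value, i)
    # One dict lookup per element of a; -1 marks values missing from b.
    return [first_seen.get(value, -1) for value in a]

a = [29839933982, 2838383, 12345]
b = [2838383, 555555555, 2839474582, 2838383, 29839933982]
print(first_indices(a, b))  # [4, 0, -1]
```

The result follows the order of a, and b is scanned only once instead of once per element of a.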
This takes about a second to run.
import numpy
# make some fake data...
a = (numpy.random.random(15000) * 2**16).astype(int)
b = (numpy.random.random(2000000) * 2**16).astype(int)
# find indices of b whose values are contained in a.
set_a = set(a)
result = set()
for i, val in enumerate(b):
    if val in set_a:
        result.add(i)
result = numpy.array(list(result))
result.sort()
print(result)
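The loop above returns sorted positions in b, not positions in the order of a, which the question also asks for. A vectorised alternative that keeps the order of a could be sketched with numpy.argsort and numpy.searchsorted as follows. The function name indices_in_order_of_a and the -1 sentinel for missing values are assumptions of this sketch, and b is assumed to be non-empty.

```python
import numpy as np

def indices_in_order_of_a(a, b):
    """For each element of a, return the index of its first match in b, or -1."""
    order = np.argsort(b, kind="stable")   # positions that sort b, ties in original order
    sorted_b = b[order]
    # Leftmost slot in sorted_b where each value of a would be inserted.
    pos = np.searchsorted(sorted_b, a, side="left")
    pos = np.minimum(pos, len(b) - 1)      # keep positions inside the array
    found = sorted_b[pos] == a             # True where the value really occurs in b
    return np.where(found, order[pos], -1)

a = np.array([5, 7, 3, 9])
b = np.array([3, 5, 5, 8, 7])
print(indices_in_order_of_a(a, b))  # [ 1  4  0 -1]
```

Sorting b costs O(len(b) log len(b)) once, and each lookup is then a binary search, so no Python-level loop over either array is needed.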
