comparing two lists of unequal length and unequal dimension - python

I have two lists, a and b:
a=[['apple'],['pear'],['grapes'],['cherry'],['mangoes'],['banana']]
b=[['apple',15,14],['orange',30,43],['pear',6,67],['grapes',90,709],['cherry',23,9]]
The result I want is:
b=[['apple',15,14],['orange',30,43],['pear',6,67],['grapes',90,709],['cherry',23,9],['mangoes',0.0,0.0],['banana',0.0,0.0]]
I am trying to compare two unequal length lists and append unique values from one list into another list of unequal dimension.

First, I would make a flat list of strings. If you are getting a from somewhere else, you can convert it with
a = [k[0] for k in a]
Next, I would make b a dict, again converting if necessary with
b = dict((k[0], k[1:]) for k in b)
Finally, you just want to augment b with new values for keys that are in a but not yet in b.
b.update(dict((k, [0.0,0.0]) for k in a if k not in b))
To convert back to a list of lists, use
b = [[k] + list(values) for k, values in b.iteritems()]  # use b.items() on Python 3
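Putting the pieces together, here is a minimal end-to-end sketch of the same approach (written for Python 3, so it uses .items() and dict/list comprehensions in place of the Python 2 idioms above; variable names other than a and b are made up for illustration):

a = [['apple'], ['pear'], ['grapes'], ['cherry'], ['mangoes'], ['banana']]
b = [['apple', 15, 14], ['orange', 30, 43], ['pear', 6, 67],
     ['grapes', 90, 709], ['cherry', 23, 9]]

names = [k[0] for k in a]                  # flat list of fruit names taken from a
lookup = {row[0]: row[1:] for row in b}    # b keyed by fruit name
lookup.update((name, [0.0, 0.0]) for name in names if name not in lookup)
b = [[k] + list(v) for k, v in lookup.items()]
# b now ends with ['mangoes', 0.0, 0.0] and ['banana', 0.0, 0.0];
# dicts preserve insertion order in Python 3.7+, so the original rows stay first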


Grouping a grouped list of str without duplicates

I have a grouped list of strings that looks roughly like this; the lists inside these groups will always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
             ['fff','ggg','hhh','iii','jjj'],
             ['xxx','mmm','ccc','bbb','aaa'],
             ['fff','xxx','aaa','bbb','ddd'],
             ['aaa','bbb','ccc','ddd','eee'],
             ['fff','xxx','aaa','ddd','eee'],
             ['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group all of the lists that are similar, where a list's first 3 elements are compared against all of the elements of the other lists.
So from the above example the output might look like this (output is the index of the list):
[[0,2,4],[3,5]]
Notice how, if there is another group that contains the same entries but in a different order, it is removed as a duplicate.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0, len(text_list)):
    int_temp = []
    for m in range(0, len(text_list)):
        if i == m:
            continue
        bool_check = all(x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate groups afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, using only the first 3 elements of each list for the comparison. For example, using the first 3 elements from list A, check whether lists B, C, D... contain all 3 of those elements. Repeat for the entire list, then remove any group that duplicates another.
You can build a set of frozensets to keep track of indices of groups with the first 3 items being a subset of the rest of the members:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists, and use that set to filter all combinations of 3 items from each sub-list, so that the time complexity is essentially linear in the length of the input list rather than quadratic, despite the overhead of generating combinations:
from itertools import combinations

sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])
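For reference, on the sample text_list above both snippets print [[0, 2, 4], [3, 5]]; for the first version the order in which the two groups appear is not guaranteed, since it depends on set iteration order.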

difference between two lists greater than the difference in the number of elements in the two lists

I have two lists (A, B) of the files that have been processed. List A contains all the initial files, list B contains all the files that have been processed successfully (so the second list (B) is a subset of the first one).
A contains 231453 items.
B contains 124769 items.
I want to subtract them to see which of those files didn't get processed. (C should contain 106684 items.)
To do so I am using set:
newlist=[]
newlist2=[]
newlist3=[]
newlist=( set(A) - ( set(A) & set(B) ) )
newlist2=(set(A)^set(B))
newlist3=(set(A) - set(B))
print len(newlist)
print len(newlist2)
print len(newlist3)
The results are:
134173
161662
134173
Why are there more items than expected?
You have specified that A and B are lists. There is a possibility that there are duplicates in the lists, which are lost when converting to a set.
A set is a collection which is unordered and unindexed. In Python sets
are written with curly brackets.
For your case you could do
not_processed = filter(lambda x: x not in B, A)
OR
not_processed = [x for x in A if x not in B]
The above code will give all the x values that are present in A but not in B. Note that, unlike the set difference, this keeps duplicates from A, and that membership tests on a plain list are slow; with lists this large, build bset = set(B) once and test x not in bset instead.
Your B also contains some items that are not in A. If B were a subset of A, all three lengths would have been the same. The fact that your symmetric difference is larger than the plain difference shows that B contains certain items that are not in A.
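A quick way to check both explanations is to measure the duplicates and the stray items directly. A minimal diagnostic sketch (Python 3 syntax here; A and B are the two lists from the question):

from collections import Counter

set_a, set_b = set(A), set(B)

print(len(A) - len(set_a))        # duplicate entries inside A
print(len(B) - len(set_b))        # duplicate entries inside B
print(len(set_b - set_a))         # "processed" files that never appeared in A
print(len(set_a - set_b))         # the actual not-processed count
print(Counter(B).most_common(3))  # the most repeated entries in B, if any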

All items in list are identical? [duplicate]

This question already has answers here: How do I clone a list so that it doesn't change unexpectedly after assignment? (closed as a duplicate)
I have some lines of code in order to prep the data to be sorted. In order to do this, I needed each item in the list to be a list. For example, if my list was [a, b, c, d], the list should become [[a], [b], [c], [d]], where a, b, c, and d are integers. My code however returns [[d], [d], [d], [d]]:
import random

len_of_list = random.randint(8500, 10000)
val_list = range(len_of_list)
list_one = [[-1]]
for index in range(0, len_of_list):
    list_one[0][0] = val_list[index]
    val_list[index] = list_one[0]
print val_list
The weirdest thing about the code is that when the second-to-last line is replaced with val_list[index] = list_one[0][0], it returns the correct values, just without the []. Thus I assumed that removing the last [0] would return the values with the [] around them. But what is returned is the last integer of the original list, surrounded by [], at every position. This shouldn't happen, because list_one[0][0] is reset every iteration with list_one[0][0] = val_list[index].
So why is the list being returned as [[d], [d], [d], [d]] instead of [[a], [b], [c], [d]]? Of course, the list has at least 8500 unique integers, but d represents the last value in the list, or put simply d = val_list[len(val_list) - 1].
You put list_one[0] into val_list[index]. Since list_one[0] is a list, val_list[index] receives a reference to that same list object, not a copy of its value. Therefore, when the contents of list_one[0] change, i.e. at every iteration, the value seen through val_list[index] changes as well, for every index.
Thus, the final value contained at every index of the list is the last one, [d].
This happens because list_one[0] is a list, which is mutable.
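A minimal sketch of the aliasing effect described above (the names inner and outer are made up for illustration, Python 3 syntax):

inner = [-1]
outer = [inner, inner, inner]   # three references to the same list object
inner[0] = 42
print(outer)                    # [[42], [42], [42]] -- every slot reflects the change

outer = [[x] for x in (10, 20, 30)]   # a fresh inner list is created for each slot
outer[0][0] = 99
print(outer)                    # [[99], [20], [30]] -- the other slots are untouched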
Beyond this fact, your code is not really pythonic. First, you build a list of length len_of_list by writing val_list = range(len_of_list). While this works in Python 2, it does not in Python 3, because a range object is semantically different from a list. Besides, a list should be built by appending or inserting elements, not by making a long enough placeholder list and then filling in the indices.
Then, you are using complicated code to build your list of lists, while the following would have been enough:
for index in range(len_of_list):
    val_list[index] = [index]
A more basic pythonic way to do the same, with append, would be:
len_of_list = random.randint(8500, 10000)
val_list = list()
for index in range(len_of_list):
    val_list.append([index])
However, the most pythonic way would be to use a list comprehension. If you have l = [a, b, c, d] and you want [[a], [b], [c], [d]], the fastest is the following:
[[elt] for elt in l]
Try this:
[[x] for x in myList]

Matching values in variable length lists containing sublists in python

I am trying to iterate through a dictionary where each key contains a list which in turn contains from 0 up to 20+ sub-lists. The goal is to iterate through the values of dictionary 1, check if they are in any of the sublists of dictionary 2 for the same key, and if so, add +1 to a counter and not consider that sublist again.
The code looks somewhat like this:
dict1={"key1":[[1,2],[6,7]],"key2":[[1,2,3,4,5,6,7,8,9]]}
dict2={"key1":[[0,1,2,3],[5,6,7,8],[11,13,15]],"key2":[[7,8,9,10,11],[16,17,18]]}
for (k,v), (k2,v2) in zip(dict1.iteritems(),dict2.iteritems()):
    temp_hold = []
    span_overlap = 0
    for x in v:
        if x in v2 and v2 not in temp_hold:
            span_overlap += 1
            temp_hold.append(v2)
        else:
            continue
    print temp_hold, span_overlap
This obviously does not work, mainly because the code cannot check hierarchically through the list and its sublists, and partly because of likely incorrect iteration syntax. I do not have the greatest grasp of nested loops and iterations, which makes this a pain. Another option would be to first join the sublists into a single list using:
v=[y for x in v for y in x]
That would make it easy to check whether a value is in the other dictionary, but then I lose the ability to work specifically with the sublist that contained the matching values, and I can no longer count that sublist only once.
The desired output is a count of 2 for key1, and 1 for key2, as well as being able to handle the matching sublists for further analysis.
Here is one solution. I am first converting the list of lists into a list of sets. If you have any control over the lists, make them sets.
def matching_sublists(dict1, dict2):
    result = dict()
    for k in dict1:
        assert(k in dict2)
        result[k] = 0
        A = [set(l) for l in dict1[k]]
        B = [set(l) for l in dict2[k]]
        for sublistA in A:
            result[k] += sum([1 for sublistB in B if not sublistA.isdisjoint(sublistB)])
    return result

if __name__ == '__main__':
    dict1 = {"key1": [[1,2],[6,7]], "key2": [[1,2,3,4,5,6,7,8,9]]}
    dict2 = {"key1": [[0,1,2,3],[5,6,7,8],[11,13,15]], "key2": [[7,8,9,10,11],[16,17,18]]}
    print(matching_sublists(dict1, dict2))
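On the sample dictionaries this prints {'key1': 2, 'key2': 1}, which matches the counts asked for in the question. Note that it counts every overlapping pair of sublists; if a sublist from dict2 should be counted at most once per key, you would additionally need to remove it (or remember it) after its first match.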

python union of 2 nested lists with index

I want to get the union of 2 nested lists plus an index to the common values.
I have two lists like A = [[1,2,3],[4,5,6],[7,8,9]] and B = [[1,2,3,4],[3,3,5,7]], but the length of each list is about 100,000. A has an associated index vector of length len(A): I = [2,3,4]
What I want is to find all sublists in B where the first 3 elements are equal to a sublist in A. In this example I want to get B[0] returned ([1,2,3,4]) because its first three elements are equal to A[0]. In addition, I also want the index to A[0] in this example, that is I[0].
I tried different things, but nothing worked so far :(
First I tried this:
Common = []
for i in range(len(B)):
    if B[i][:3] in A:
        id = [I[x] for x, y in enumerate(A) if y == B[i][:3]][0]
        Common.append([int(id)] + B[i])
But that takes ages, or never finishes
Then I transformed A and B into sets and took the union from both, which was very quick, but then I don't know how to get the corresponding indices
Does anyone have an idea?
Create an auxiliary dict (work is O(len(A))), assuming the first three items of a sublist in A uniquely identify it (otherwise you need a dict of lists):
aud = dict((tuple(a[:3]), i) for i, a in enumerate(A))
Use said dict to loop once on B (work is O(len(B))) to get B sublists and A indices:
result = [(b, aud[tuple(b[:3])]) for b in B if tuple(b[:3]) in aud]
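To also pull in the values of the index vector I, as the question asks, a small Python 3 variant of the same idea could look like this (the helper name match_with_index is made up for illustration):

def match_with_index(A, B, I):
    # map the first three items of each A sublist to its value in I
    aud = {tuple(a[:3]): I[i] for i, a in enumerate(A)}
    # keep each B sublist whose first three items match, prefixed by that I value
    return [[aud[tuple(b[:3])]] + b for b in B if tuple(b[:3]) in aud]

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 2, 3, 4], [3, 3, 5, 7]]
I = [2, 3, 4]
print(match_with_index(A, B, I))   # [[2, 1, 2, 3, 4]]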
