I am trying to delete the overlapping values in a nested list.
The data looks like this:
[[22, 37, 'foobar'], [301, 306, 'foobar'],[369, 374, 'foobar'], [650, 672, 'foobar'], [1166, 1174, 'foobar'],[1469, 1477, 'foobar'],[2237, 2245, 'foobar'],[2702, 2724, 'foobar'],[3426, 3446, 'foobar'],[3505, 3513, 'foobar'],[3756, 3764, 'foobar'],[69524, 69535, 'foobar'],[3812, 3820, 'foobar'],[4034, 4057, 'foobar'],[4318, 4347, 'foobar'],[58531, 58548, 'foobar'],[4552, 4574, 'foobar'],[4854, 4861, 'foobar'],[5769, 5831, 'foobar'], [5976, 5986, 'foobar'],[6541, 6558, 'foobar'],[6541, 6608, 'foobar'],[7351, 7364, 'foobar'],[7351, 7364, 'foobar'], [7764, 7770, 'foobar'],[58540, 58548, 'foobar'],[69524, 69556, 'foobar']]
There are some overlapping values at index 0 and 1 across the list, such as:
[6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar']
[7351, 7364, 'foobar'] overlaps with [7351, 7364, 'foobar']
[58531, 58548, 'foobar'] overlaps with [58540, 58548, 'foobar']
[69524, 69535, 'foobar'] overlaps with [69524, 69556, 'foobar']
I am trying to go through the list and remove the shorter of the two overlapping entries. If
[6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar'], I want to keep [6541, 6608, 'foobar'] and remove [6541, 6558, 'foobar'] from the list.
So far I tried:
def clean_span(adata):
    data = adata.copy()
    rem_idx = []
    for i in range(len(data) - 1):
        if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
            print(" {} overlaps with {}".format(data[i], data[i+1]))
            rem_idx.append(i)
    for i in rem_idx:
        del data[i]
    return data
But this code always leaves some overlapping values behind.
The same happens with this approach as well:
def clean_span(adata):
    data = adata.copy()
    new_data = []
    for i in range(len(data) - 1):
        if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
            print(" {} overlaps with {}".format(data[i], data[i+1]))
            new_data.append(data[i+1])
        else:
            new_data.append(data[i])
    return new_data
I would appreciate your help to solve this problem.
def clean_span(adata):
    # to perform O(1) search for index 0 and 1
    # you can just have one dictionary if indexes don't matter
    d0 = dict()
    d1 = dict()
    r = []
    for a in adata:
        if a[0] in d0:
            print(str(a) + " overlaps with " + str(d0[a[0]]))
        elif a[1] in d1:
            print(str(a) + " overlaps with " + str(d1[a[1]]))
        else:
            r.append(a)
            d0[a[0]] = a
            d1[a[1]] = a
    return r
Keep two dictionaries: one keyed on the first element and one keyed on the second. Then, while iterating over the data, check whether either key already exists in the respective dictionary -- if the key is found, it is an overlap; otherwise it is not.
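As a quick check, here is a minimal usage sketch (my own, assuming the clean_span above) on two of the overlapping pairs from the question; note that this approach keeps the first entry it sees for a given start or end value:

sample = [[6541, 6558, 'foobar'], [6541, 6608, 'foobar'],
          [58531, 58548, 'foobar'], [58540, 58548, 'foobar']]
print(clean_span(sample))
# prints the two overlap messages, then:
# [[6541, 6558, 'foobar'], [58531, 58548, 'foobar']]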
Problem in your code:
if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
    print(" {} overlaps with {}".format(data[i], data[i+1]))
    new_data.append(data[i+1])
else:
    new_data.append(data[i])
In the if branch, you append element i + 1 to new_data. So when the loop advances to i + 1, that iteration falls into the else branch and appends the overlapping element back to the list.
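For illustration, here is a hedged sketch (my own variant, not part of this answer) of one way to restructure that pairwise loop so the longer entry wins; it assumes, as in the question's data, that overlapping entries share a start or an end value and sit next to each other once sorted by start:

def clean_span_keep_longer(adata):
    # hypothetical helper: sort by start, then keep only the longer of any
    # two adjacent entries that share a start or end value
    data = sorted(adata, key=lambda x: x[0])
    result = []
    for entry in data:
        if result and (entry[0] == result[-1][0] or entry[1] == result[-1][1]):
            # overlap with the previously kept entry: keep whichever span is longer
            if entry[1] - entry[0] > result[-1][1] - result[-1][0]:
                result[-1] = entry
        else:
            result.append(entry)
    return result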
Side note:
if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
You are doing a linear membership search through the neighbouring sublist here for every element, making your time complexity O(nk).
Overlaps can be found with set.intersection.
import itertools as it

l = # list

# merge the first two entries of each sublist into a from-to set of values
m = ((set(range(p[0], p[1]+1)), p[2]) for p in l)

# combine each element of the list to check overlapping
new_l = []
for p1, p2 in it.combinations(m, 2):
    s1, l1 = p1
    s2, l2 = p2
    if set.intersection(s1, s2):
        m1, M1 = min(s1), max(s1)
        m2, M2 = min(s2), max(s2)
        # choose the biggest one
        if M2-m2 > M1-m1:
            new_l.append((m2, M2, l2))
        else:
            new_l.append((m1, M1, l1))

print(sorted(new_l, key=lambda p: p[0]))
I convert your list to a dict. The keys are foobar1, foobar2, ..., and the first item of each nested list is the value in the new dict. The second loop reverses that mapping into rev_dict, and the result variable collects the values that are duplicated in rev_dict. The third loop finds the matching items and then removes them from your list.
l_ist = [[22, 37, 'foobar'], ....]

ini_dict = {}
rev_dict = {}

for i in range(len(l_ist)):
    ini_dict[l_ist[i][2] + str(i)] = l_ist[i][0]

for key, value in ini_dict.items():
    rev_dict.setdefault(value, set()).add(key)

result = [key for key, values in rev_dict.items() if len(values) > 1]  # duplicated values in rev_dict

for i in result:
    for j in range(len(l_ist)):
        if i == l_ist[j][0]:
            l_ist[j].remove(i)

print(l_ist)
Related
I have a list of strings that contains 'literal duplicates' and 'pseudo-duplicates' which differ only in lower- and uppercase writing. I am looking for a function that treats all literal duplicates as one group, returns their indices, and finds all pseudo-duplicates for these elements, again returning their indices.
Here's an example list:
a = ['bar','bar','foo','Bar','Foo','Foo']
And this is the output I am looking for (a list of lists of lists):
dupe_list = [[[0,1],[3]],[[2],[4,5]]]
Explanation: 'bar' appears twice at the indexes 0 and 1 and there is one pseudo-duplicate 'Bar' at index 3. 'foo' appears once at index 2 and there are two pseudo-duplicates 'Foo' at indexes 4 and 5.
Here is one solution (you didn't clarify how the output should be ordered, so I assumed you want the lowercase items in the order they are first met from left to right in the list; let me know if it must be different):
d = {i: [[], []] for i in set(k.lower() for k in a)}
for i in range(len(a)):
    if a[i] in d.keys():
        d[a[i]][0].append(i)
    else:
        d[a[i].lower()][1].append(i)
result = list(d.values())
Output:
>>> print(result)
[[[0, 1], [3]], [[2], [4, 5]]]
Here's how I would achieve it. But you should consider using a dictionary and not a list of lists of lists. Dictionaries are excellent data structures for problems like this.
# default argument vars
a = ['bar','bar','foo','Bar','Foo','Foo']

# initialize a dictionary to count occurrences
a_dict = {}
for i in a:
    a_dict[i] = None

# loop through keys in dictionary, which are the values from list a
# loop through the items from list a
# if the item is an exact match to the key, add its index to the list of exacts
# if the item is a similar match to the key, add its index to the list of similars
# update the dictionary key's value
for k, v in a_dict.items():
    index_exact = []
    index_similar = []
    for i in range(len(a)):
        print(a[i])
        print(a[i] == k)
        if a[i] == str(k):
            index_exact.append(i)
        elif a[i].lower() == str(k):
            index_similar.append(i)
    a_dict[k] = [index_exact, index_similar]

# print out dictionary values to check the answer
print(a_dict.items())

# segregate values from the dictionary into its own list
dup_list = []
for v in a_dict.values():
    dup_list.append(v)
print(dup_list)
Here is a solution. I have handled the situations where only pseudo-duplicates or only literal duplicates are present.
a = ['bar', 'bar', 'foo', 'Bar', 'Foo', 'Foo', 'ka']

# Dictionaries to store the positions of words
literal_duplicates = dict()
pseudo_duplicates = dict()

for index, item in enumerate(a):
    # Treats a word as a literal duplicate if it is lowercase
    if item.islower():
        if item in literal_duplicates:
            literal_duplicates[item].append(index)
        else:
            literal_duplicates[item] = [index]
        # Handle if only literal_duplicates present
        if item not in pseudo_duplicates:
            pseudo_duplicates[item] = []
    # Treats a word as a pseudo duplicate if it is not lowercase
    else:
        item_lower = item.lower()
        if item_lower in pseudo_duplicates:
            pseudo_duplicates[item_lower].append(index)
        else:
            pseudo_duplicates[item_lower] = [index]
        # Handle if only pseudo_duplicates present
        if item_lower not in literal_duplicates:
            literal_duplicates[item_lower] = []

# Form the final list from the dictionaries
dupe_list = [[v, pseudo_duplicates[k]] for k, v in literal_duplicates.items()]
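For reference, printing dupe_list for the sample list above (my own trace, assuming the insertion-ordered dicts of Python 3.7+) should give:

print(dupe_list)
# [[[0, 1], [3]], [[2], [4, 5]], [[6], []]]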
Here is a simple and easy-to-understand answer for you:
a = ['bar','bar','foo','Bar','Foo','Foo']
dupe_list = []
ilist = []
ilist2 = []
samecase = -1
dupecase = -1
for i in range(len(a)):
    if a[i] != 'Null':
        ilist = []
        ilist2 = []
        for j in range(i+1, len(a)):
            samecase = -1
            dupecase = -1
            # print(a)
            if i not in ilist:
                ilist.append(i)
            if a[i] == a[j]:
                # print(a[i],a[j])
                samecase = j
                a[j] = 'Null'
            elif a[i] == a[j].casefold():
                # print(a[i],a[j])
                dupecase = j
                a[j] = 'Null'
            # print(samecase)
            # print(ilist,ilist2)
            if samecase != -1:
                ilist.append(samecase)
            if dupecase != -1:
                ilist2.append(dupecase)
        dupe_list.append([ilist, ilist2])
        a[i] = 'Null'
print(dupe_list)
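For reference, the print(dupe_list) call above should output the requested structure (my own check):

[[[0, 1], [3]], [[2], [4, 5]]]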
I have 2 lists:
mainlist=[['RD-12',12,'a'],['RD-13',45,'c'],['RD-15',50,'e']] and
sublist=[['RD-12',67],['RD-15',65]]
If I join both lists based on the first element using the code below:
def combinelist(mainlist, sublist):
    dict1 = { e[0]: e[1:] for e in mainlist }
    for e in sublist:
        try:
            dict1[e[0]].extend(e[1:])
        except:
            pass
    result = [ [k] + v for k, v in dict1.items() ]
    return result
It results in the list below:
[['RD-12',12,'a',67],['RD-13',45,'c'],['RD-15',50,'e',65]]
As there is no element for 'RD-13' in sublist, I want an empty string in its place.
The final output should be
[['RD-12',12,'a',67],['RD-13',45,'c'," "],['RD-15',50,'e',65]]
Please help me.
Your problem can be solved using a while loop to adjust the length of your sublists until it matches the length of the longest sublist by appending the wanted string.
for list in result:
    while len(list) < max(len(l) for l in result):
        list.append(" ")
You could just go through the result list and check where the total number of your elements is 2 instead of 3.
for list in lists:
    if len(list) == 2:
        list.append(" ")
UPDATE:
If there are more items in the sublist, just subtract the lists containing the 'keys' of your lists, and then add the desired string.
def combinelist(mainlist, sublist):
    dict1 = { e[0]: e[1:] for e in mainlist }
    list2 = [e[0] for e in sublist]
    for e in sublist:
        try:
            dict1[e[0]].extend(e[1:])
        except:
            pass
    for e in dict1.keys() - list2:
        dict1[e].append(" ")
    result = [[k] + v for k, v in dict1.items()]
    return result
You can try something like this:
mainlist = [['RD-12',12], ['RD-13',45], ['RD-15',50]]
sublist = [['RD-12',67], ['RD-15',65]]

empty_val = ''

# Lists to dictionaries
maindict = dict(mainlist)
subdict = dict(sublist)

result = []
# go through all keys
for k in list(set(list(maindict.keys()) + list(subdict.keys()))):
    # pick the value from each key or a default alternative
    result.append([k, maindict.pop(k, empty_val), subdict.pop(k, empty_val)])

# sort by the key
result = sorted(result, key=lambda x: x[0])
You can set up your empty value to whatever you need.
UPDATE
Following the new conditions, it would look like this:
mainlist = [['RD-12',12,'a'], ['RD-13',45,'c'], ['RD-15',50,'e']]
sublist = [['RD-12',67], ['RD-15',65]]

maindict = {a: [b, c] for a, b, c in mainlist}
subdict = dict(sublist)

result = []
for k in list(set(list(maindict.keys()) + list(subdict.keys()))):
    result.append([k, ])
    result[-1].extend(maindict.pop(k, ' '))
    result[-1].append(subdict.pop(k, ' '))

result = sorted(result, key=lambda x: x[0])
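For reference, with the inputs above, result should come out as (my own check; note the single-space placeholder used as the default):

print(result)
# [['RD-12', 12, 'a', 67], ['RD-13', 45, 'c', ' '], ['RD-15', 50, 'e', 65]]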
Another option is to convert the sublist to a dict, so items are easily and rapidly accessible.
sublist_dict = dict(sublist)
So you can do (it modifies the mainlist):
for i, e in enumerate(mainlist):
    mainlist[i].append(sublist_dict.get(e[0], ""))
#=> [['RD-12', 12, 'a', 67], ['RD-13', 45, 'c', ''], ['RD-15', 50, 'e', 65]]
Or a one liner list comprehension (it produces a new list):
[ e + [sublist_dict.get(e[0], "")] for e in mainlist ]
If you want to skip the missing element:
for i, e in enumerate(mainlist):
    data = sublist_dict.get(e[0])
    if data: mainlist[i].append(data)

print(mainlist)
#=> [['RD-12', 12, 'a', 67], ['RD-13', 45, 'c'], ['RD-15', 50, 'e', 65]]
Given pairs of items of form [(a,b),...] where (a,b) means a > b, for example:
[('best','better'),('best','good'),('better','good')]
I would like to output a list of form:
['best','better','good']
This is very hard for some reason. Any thoughts?
======================== code =============================
I know why it doesn't work.
def to_rank(raw):
    rank = []
    for u, v in raw:
        if u in rank and v in rank:
            pass
        elif u not in rank and v not in rank:
            rank = insert_front(u, v, rank)
            rank = insert_behind(v, u, rank)
        elif u in rank and v not in rank:
            rank = insert_behind(v, u, rank)
        elif u not in rank and v in rank:
            rank = insert_front(u, v, rank)
    return [[r] for r in rank]

# Use: insert word u in front of word v in a list of words
def insert_front(u, v, words):
    if words == []:
        return [u]
    else:
        head = words[0]
        tail = words[1:]
        if head == v:
            return [u] + words
        else:
            return [head] + insert_front(u, v, tail)

# Use: insert word u behind word v in a list of words
def insert_behind(u, v, words):
    words.reverse()
    words = insert_front(u, v, words)
    words.reverse()
    return words
=================== Update ===================
Per the suggestion of many, this is a straightforward topological sort setting. I ultimately decided to use the code from this source: algocoding.wordpress.com/2015/04/05/topological-sorting-python/
which solved my problem.
from collections import deque

def go_topsort(graph):
    in_degree = {u: 0 for u in graph}   # determine in-degree
    for u in graph:                     # of each node
        for v in graph[u]:
            in_degree[v] += 1

    Q = deque()                         # collect nodes with zero in-degree
    for u in in_degree:
        if in_degree[u] == 0:
            Q.appendleft(u)

    L = []                              # list for order of nodes

    while Q:
        u = Q.pop()                     # choose node of zero in-degree
        L.append(u)                     # and 'remove' it from graph
        for v in graph[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                Q.appendleft(v)

    if len(L) == len(graph):
        return L
    else:                               # if there is a cycle,
        return []
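To connect this to the original pair list, here is a minimal sketch (my own, assuming the go_topsort above) that first builds the adjacency dict from the pairs:

ranks = [('best', 'better'), ('best', 'good'), ('better', 'good')]

graph = {}
for winner, loser in ranks:
    graph.setdefault(winner, []).append(loser)
    graph.setdefault(loser, [])   # make sure "sink" items appear as keys too

print(go_topsort(graph))
# ['best', 'better', 'good']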
RockBilly's solution also works in my case because, in my setting, for every v < u we are guaranteed to have a pair (u,v) in our list. So his answer is not very "computer-sciency", but it gets the job done in this case.
If you have a complete grammar specified then you can simply count up the items:
>>> import itertools as it
>>> from collections import Counter
>>> ranks = [('best','better'),('best','good'),('better','good')]
>>> c = Counter(x for x, y in ranks)
>>> sorted(set(it.chain(*ranks)), key=c.__getitem__, reverse=True)
['best', 'better', 'good']
If you have an incomplete grammar then you can build a graph and DFS all paths to find the longest. This probably isn't very efficient; I haven't thought hard about that yet :):
def dfs(graph, start, end):
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            yield path
            continue
        for next_state in graph.get(path[-1], []):
            if next_state in path:
                continue
            stack.append(path + [next_state])

def paths(ranks):
    graph = {}
    for n, m in ranks:
        graph.setdefault(n, []).append(m)
    for start, end in it.product(set(it.chain(*ranks)), repeat=2):
        yield from dfs(graph, start, end)
>>> ranks = [('black', 'dark'), ('black', 'dim'), ('black', 'gloomy'), ('dark', 'gloomy'), ('dim', 'dark'), ('dim', 'gloomy')]
>>> max(paths(ranks), key=len)
['black', 'dim', 'dark', 'gloomy']
>>> ranks = [('a','c'), ('b','a'),('b','c'), ('d','a'), ('d','b'), ('d','c')]
>>> max(paths(ranks), key=len)
['d', 'b', 'a', 'c']
What you're looking for is topological sort. You can do this in linear time using depth-first search (pseudocode included in the wiki I linked)
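For what it's worth, a minimal sketch of what that could look like for the question's pair list (my own recursive DFS, assuming the input contains no cycles):

def topo_sort(pairs):
    # build adjacency lists: an edge a -> b means a ranks above b
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    order = []
    visited = set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for neighbour in graph[node]:
            visit(neighbour)
        order.append(node)   # post-order: a node is appended after everything it outranks

    for node in graph:
        visit(node)
    return order[::-1]       # reverse post-order is a topological order

print(topo_sort([('best', 'better'), ('best', 'good'), ('better', 'good')]))
# ['best', 'better', 'good']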
Here is one way. It is based on using the complete pairwise rankings to make an old-style (Python 2) cmp function and then using functools.cmp_to_key to convert it to a key suitable for the Python 3 approach to sorting:
import functools

def sortByRankings(rankings):
    def cmp(x, y):
        if x == y:
            return 0
        elif (x, y) in rankings:
            return -1
        else:
            return 1
    items = list({x for y in rankings for x in y})
    items.sort(key=functools.cmp_to_key(cmp))
    return items
Tested like:
ranks = [('a','c'), ('b','a'),('b','c'), ('d','a'), ('d','b'), ('d','c')]
print(sortByRankings(ranks)) #prints ['d', 'b', 'a', 'c']
Note that to work correctly, the parameter rankings must contain an entry for each pair of distinct items. If it doesn't, you would first need to compute the transitive closure of the pairs that you do have before you feed it to this function.
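For example, a small hedged sketch (my own helper, not part of this answer) of computing that transitive closure by repeatedly adding implied pairs until nothing new appears, assuming the rankings are consistent (acyclic):

def transitive_closure(pairs):
    closure = set(pairs)
    while True:
        # if (a, b) and (b, d) are known, then (a, d) is implied
        implied = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if implied <= closure:
            return closure
        closure |= implied

sparse = [('best', 'better'), ('better', 'good')]
print(sortByRankings(transitive_closure(sparse)))   # prints ['best', 'better', 'good']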
You can take advantage of the fact that the lowest ranked item in the list will never appear at the start of any tuple. You can extract this lowest item, then remove all elements which contain this lowest item from your list, and repeat to get the next lowest.
This should work even if you have redundant elements, or have a sparser list than some of the examples here. I've broken it up into finding the lowest ranked item, and then the grunt work of using this to create a final ranking.
from copy import copy

def find_lowest_item(s):
    # Iterate over set of all items
    for item in set([item for sublist in s for item in sublist]):
        # If an item does not appear at the start of any tuple, return it
        if item not in [x[0] for x in s]:
            return item

def sort_by_comparison(s):
    final_list = []
    # Make a copy so we don't mutate original list
    new_s = copy(s)
    # Get the set of all items
    item_set = set([item for sublist in s for item in sublist])
    for i in range(len(item_set)):
        lowest = find_lowest_item(new_s)
        if lowest is not None:
            final_list.insert(0, lowest)
        # For the highest ranked item, we just compare our current
        # ranked list with the full set of items
        else:
            final_list.insert(0, set(item_set).difference(set(final_list)).pop())
        # Update list of ranking tuples to remove processed items
        new_s = [x for x in new_s if lowest not in x]
    return final_list
list_to_compare = [('black', 'dark'), ('black', 'dim'), ('black', 'gloomy'), ('dark', 'gloomy'), ('dim', 'dark'), ('dim', 'gloomy')]
sort_by_comparison(list_to_compare)
['black', 'dim', 'dark', 'gloomy']
list2 = [('best','better'),('best','good'),('better','good')]
sort_by_comparison(list2)
['best', 'better', 'good']
list3 = [('best','better'),('better','good')]
sort_by_comparison(list3)
['best', 'better', 'good']
If you do sorting or create a dictionary from the list items, you are going to lose the order, as Rockybilly mentioned in his answer. I suggest you create a list from the tuples of the original list and then remove duplicates.
def remove_duplicates(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

i = [(5,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
i = remove_duplicates(list(x for s in i for x in s))
print(i)  # prints [5, 2, 1, 3, 4]

j = [('excellent','good'),('excellent','great'),('great','good')]
j = remove_duplicates(list(x for s in j for x in s))
print(j)  # prints ['excellent', 'good', 'great']
See reference: How do you remove duplicates from a list whilst preserving order?
For an explanation of the remove_duplicates() function, see that Stack Overflow post.
If the list is complete, meaning it has enough information to do the ranking (and also has no duplicate or redundant inputs), this will work.
from collections import defaultdict

lst = [('best','better'),('best','good'),('better','good')]

d = defaultdict(int)
for tup in lst:
    d[tup[0]] += 1
    d[tup[1]] += 0  # To create it in the defaultdict

print sorted(d, key=lambda x: d[x], reverse=True)
# ['best', 'better', 'good']
Just give them points, increment the left one each time you encounter it in the list.
Edit: I do think the OP has a well-determined type of input: the tuple count is always the number of combinations C(n, 2) (for example, three items give C(3, 2) = 3 pairs, as in the sample input), which makes this a correct solution. There is no need to complain about the edge cases, which I already knew about when posting the answer (and mentioned).
I have a list of about 500 items. I'd like to replace all fuzzy-matched items in that list with the shortest item among the matches.
Is there a way to speed up my implementation of fuzzy match?
Note: I posted a similar question before, but I'm reframing it due to lack of response.
My implementation:
def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    """
    #list1 = list(ds1.Title)
    #list2 = list(ds1.Title)
    """
    matchdict = defaultdict(list)
    for i, u in enumerate(list1):
        for i1, u1 in enumerate(list2):
            # Since list orders are the same, this makes sure this isn't the same item.
            if i != i1:
                if fuzz.partial_token_sort_ratio(u, u1) >= cutoff:
                    pair = (u, u1)
                    # Because there are potential duplicates, I have to make the key constant.
                    # Otherwise, putting list1 as the key will result in both duplicate items
                    # serving as the key.
                    """
                    Potential problem:
                    • what if there are different shortstr?
                    """
                    shortstr = min(pair, key=len)
                    longstr = max(pair, key=len)
                    matchdict[shortstr].append(longstr)
    return matchdict
I will assume you have installed python-Levenshtein; that will give you a 4x speed-up.
Optimising the loop and the dictionary access:
import itertools

def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]
        if fuzz.partial_token_sort_ratio(u1, u2) >= cutoff:
            shortstr = min(u1, u2, key=len)
            longstr = max(u1, u2, key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict
This is as fast as it gets besides the fuzz call. If you read the source, you see that some preprocessing is done for each string, in every iteration. We can do it all at once:
def _asciionly(s):
    if PY3:
        return s.translate(translation_table)
    else:
        return s.translate(None, bad_chars)

def full_pre_process(s, force_ascii=False):
    s = _asciionly(s)
    # Keep only letters and numbers (see Unicode docs).
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
    # Force into lowercase.
    string_out = StringProcessor.to_lower_case(string_out)
    # Remove leading and trailing whitespaces.
    string_out = StringProcessor.strip(string_out)
    out = ''.join(sorted(string_out))
    out = out.strip()
    return out
def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    if list1 is not list2:
        list1 = [full_pre_process(each) for each in list1]
        list2 = [full_pre_process(each) for each in list2]
    else:
        # If you are comparing a list to itself, we don't want to overwrite content.
        list1 = [full_pre_process(each) for each in list1]
        list2 = list1
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]
        if fuzz.partial_ratio(u1, u2) >= cutoff:
            pair = (u1, u2)
            shortstr = min(pair, key=len)
            longstr = max(pair, key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict
Although poorly written, this code:
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]

marker_array_DS = []
for i in range(len(marker_array)):
    if marker_array[i-1][1] != marker_array[i][1]:
        marker_array_DS.append(marker_array[i])
print marker_array_DS
Returns:
[['hard', '2', 'soft'], ['fast', '3'], ['turtle', '4', 'wet']]
It accomplishes part of the task which is to create a new list containing all nested lists except those that have duplicate values in index [1]. But what I really need is to concatenate the matching index values from the removed lists creating a list like this:
[['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]
The values in index [1] must not be concatenated. I kind of managed to do the concatenation part using a tip from another post:
newlist = [i + n for i, n in zip(list_a, list_b)]
But I am struggling with figuring out the way to produce the desired result. The "marker_array" list will be already sorted in ascending order before being passed to this code. All like-values in index [1] position will be contiguous. Some nested lists may not have any values beyond [0] and [1] as illustrated above.
Quick stab at it... use itertools.groupby to do the grouping for you, but do it over a generator that converts any 2-element list into a 3-element one.
from itertools import groupby
from operator import itemgetter

marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]

def my_group(iterable):
    temp = ((el + [''])[:3] for el in iterable)
    for k, g in groupby(temp, key=itemgetter(1)):
        fst, snd = map(' '.join, zip(*map(itemgetter(0, 2), g)))
        yield filter(None, [fst, k, snd])

print list(my_group(marker_array))
from collections import defaultdict

d1 = defaultdict(list)
d2 = defaultdict(list)

for pxa in marker_array:
    d1[pxa[1]].extend(pxa[:1])
    d2[pxa[1]].extend(pxa[2:])

res = [[' '.join(d1[x]), x, ' '.join(d2[x])] for x in sorted(d1)]
If you really need 2-tuples (which I think is unlikely):
for p in res:
    if not p[-1]:
        p.pop()
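For reference, with the marker_array from the question, res should then come out as (my own check):

print res
# [['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]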
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]

marker_array_DS = []
marker_array_hit = []

for i in range(len(marker_array)):
    if marker_array[i][1] not in marker_array_hit:
        marker_array_hit.append(marker_array[i][1])

for i in marker_array_hit:
    lists = [item for item in marker_array if item[1] == i]
    temp = []
    first_part = ' '.join([str(item[0]) for item in lists])
    temp.append(first_part)
    temp.append(i)
    second_part = ' '.join([str(item[2]) for item in lists if len(item) > 2])
    if second_part != '':
        temp.append(second_part)
    marker_array_DS.append(temp)

print marker_array_DS
I learned python for this because I'm a shameless rep whore
marker_array = [
    ['hard','2','soft'],
    ['heavy','2','light'],
    ['rock','2','feather'],
    ['fast','3'],
    ['turtle','4','wet'],
]

data = {}
for arr in marker_array:
    if len(arr) == 2:
        arr.append('')
    (first, index, last) = arr
    firsts, lasts = data.setdefault(index, [[], []])
    firsts.append(first)
    lasts.append(last)

results = []
for key in sorted(data.keys()):
    current = [
        " ".join(data[key][0]),
        key,
        " ".join(data[key][1])
    ]
    if current[-1] == '':
        current = current[:-1]
    results.append(current)

print results

--output:--
[['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]
A different solution based on itertools.groupby:
from itertools import groupby

# normalizes the list of markers so all markers have 3 elements
def normalized(markers):
    for marker in markers:
        yield marker + [""] * (3 - len(marker))

def concatenated(markers):
    # use groupby to iterate over lists of markers sharing the same key
    for key, markers_in_category in groupby(normalized(markers), lambda m: m[1]):
        # get separate lists of left and right words
        lefts, rights = zip(*[(m[0], m[2]) for m in markers_in_category])
        # remove empty strings from both lists
        lefts, rights = filter(bool, lefts), filter(bool, rights)
        # yield the concatenated entry for this key (also removing the empty string at the end, if necessary)
        yield filter(bool, [" ".join(lefts), key, " ".join(rights)])
The generator concatenated(markers) will yield the results. This code correctly handles the ['fast', '3'] case and doesn't return an additional third element in such cases.
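For instance, under Python 2 (where these filter calls return sequences rather than iterators), calling it on the marker_array from the question should give:

print list(concatenated(marker_array))
# [['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]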