Reduce list based off of element substrings - python

I'm looking for the most efficient way to reduce a given list based off of substrings already in the list.
For example
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
would be reduced to:
mylist = ['abcd','qrs']
because 'abcd' and 'qrs' are the shortest elements that are substrings (here, prefixes) of other elements in that list. I was able to do this with about 30 lines of code, but I suspect there is a crafty one-liner out there.

This seems to work (though it is probably not very efficient):
def reduce_prefixes(strings):
    sorted_strings = sorted(strings)
    return [element
            for index, element in enumerate(sorted_strings)
            if all(not previous.startswith(element) and
                   not element.startswith(previous)
                   for previous in sorted_strings[:index])]
Tests:
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu'])
['abcd', 'qrs']
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu',
...                  'gabcd', 'gab', 'ab'])
['ab', 'gab', 'qrs']
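A possibly faster variant, as a minimal sketch rather than a drop-in replacement: once the list is sorted, every string that has a prefix in the list sorts after that prefix, with only other extensions of the same prefix in between. So it should be enough to compare each element with the last one that was kept. The name reduce_prefixes_sorted is made up for this illustration.

```python
def reduce_prefixes_sorted(strings):
    """Keep only the strings that have no shorter prefix elsewhere in the list."""
    kept = []
    for s in sorted(strings):
        # Hypothetical shortcut: compare only against the most recently kept string.
        if not kept or not s.startswith(kept[-1]):
            kept.append(s)
    return kept


print(reduce_prefixes_sorted(['abcd', 'abcde', 'abcdef', 'qrs', 'qrst', 'qrstu']))
```

This does one startswith check per element after the sort, instead of comparing every pair of strings.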

Probably not the most efficient, but at least short:
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
outlist = []
for l in mylist:
    if any(o.startswith(l) for o in outlist):
        # l is a prefix of some elements in outlist, so it replaces them
        outlist = [o for o in outlist if not o.startswith(l)] + [l]
    if not any(l.startswith(o) for o in outlist):
        # l has no prefix in outlist yet, so it becomes a prefix candidate
        outlist.append(l)
print(outlist)

One solution is to iterate over all the strings, group them by their next character, and recursively apply the same function to each group.
def reduce_substrings(strings):
    return list(_reduce_substrings(map(iter, strings)))
def _reduce_substrings(strings):
    # Maps each character to the list of string iterators that begin with that character
    nexts = {}
    for string in strings:
        try:
            nexts.setdefault(next(string), []).append(string)
        except StopIteration:
            # Reached the end of this string, so it is the shortest prefix in its group.
            yield ''
            return
    for next_char, next_strings in nexts.items():
        for next_substrings in _reduce_substrings(next_strings):
            yield next_char + next_substrings
This groups the strings into a dictionary keyed by their next character, then looks for the shortest prefix within each group.
Because the function is recursive, an equally efficient one-liner isn't really possible.

Try this one:
import re
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
new_list = []
for i in mylist:
    if re.match("^abcd$", i):
        new_list.append(i)
    elif re.match("^qrs$", i):
        new_list.append(i)
print(new_list)
# ['abcd', 'qrs']

Related

Check if element of list is sub-element of other list elements in same list

I am looking for a way to check if an element of a list is sub-element of any other elements of that same list?
For example, let's use the below list as an example.
['Lebron James', 'Lebron', 'James']
The 2nd and 3rd elements of this list are a sub-element of the 1st element of the list.
I am looking for a way to remove these elements from the list so only the 1st element remains. I have been spinning my wheels and unable to come up with a solution.
Can someone help?
Thanks
Here's a slow solution, might be acceptable depending on your data size:
lst = ['Lebron James', 'Lebron', 'James']
[s for s in lst if not any(s in s2.split() for s2 in lst if s != s2)]
This is definitely an easier problem to tackle with the starting and ending points for the match instead of the strings themselves.
One approach is to take the ranges from largest to smallest and build the result as you go, adding a range only if it is not fully contained in one already kept.
lst = [(0, 10),(0, 4),(5, 10)]
result = []
def membership(big_range, small_range):
    '''Return True if big_range fully contains small_range,
    where both are tuples with a start and end value.
    '''
    if small_range[0] >= big_range[0] and small_range[1] <= big_range[1]:
        return True
    return False
for range_ in sorted(lst, key=lambda x: x[1] - x[0], reverse=True):
    if not any(membership(x, range_) for x in result):
        result.append(range_)
print(result)
# [(0, 10)]
Edit: this answer was in response to the OP's edited question, which seems to have since been rolled back. Oh well. Hope it helps someone anyway.
You can try building a dictionary of all permutations (the choice between permutations, sublists, or something else depends on the desired behavior), grouped by each element's word count:
import re
import itertools
from collections import defaultdict
lst = [
    'Lebron Raymone James', 'Lebron Raymone',
    'James', "Le", "Lebron James",
    'Lebron James 1 2 3', 'Lebron James 1 2'
]
d = defaultdict(dict)
g = r"\b\w+\b"  # raw string, so the backslashes reach the regex engine unchanged
for x in lst:
    words = re.findall(g, x)  # could simply use x.split() if there are just spaces
    combos = [
        c for i in range(1, len(words) + 1)
        for c in itertools.permutations(words, i)
    ]
    for c in combos:
        d[len(words)][tuple(c)] = True
Then take just the elements whose words are not present in any of the groups with a greater word count:
M = max(d)
res = []
for x in lst:
    words = tuple(re.findall(g, x))
    if not any(d[i].get(words) for i in range(len(words) + 1, M + 1)):
        res.append(x)
set(res)
# {'Le', 'Lebron James 1 2 3', 'Lebron Raymone James'}
Create a set containing all the words in the strings that are multiple words. Then go through the list, testing the strings to see if they're in the set.
wordset = set()
lst = ['Lebron James', 'Lebron', 'James']
for s in lst:
    if " " in s:
        wordset.update(s.split())
result = [x for x in lst if x not in wordset]
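If "sub-element" is meant as any substring rather than a whole word, a plain containment check may be closer to what is wanted. The following is only a sketch under that assumption; drop_contained is a hypothetical name, and exact duplicates are deliberately kept rather than treated as containing each other.

```python
def drop_contained(items):
    """Return the items that are not a proper substring of any other item."""
    return [
        s for s in items
        # Assumption: equal strings do not count as containing each other.
        if not any(s != other and s in other for other in items)
    ]


print(drop_contained(['Lebron James', 'Lebron', 'James']))
```

Like the list-comprehension answer above, this compares every pair, so it is quadratic in the length of the list.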

Matching two string lists that partially match into another list

I am trying to match a List containing strings (50 strings) with a list containing strings that are part of some of the strings of the previous list (5 strings). I will post the complete code in order to give context below but I also want to give a short example:
List1 = ['abcd12', 'efgh34', 'ijkl56', 'mnop78']
List2 = ['abc', 'ijk']
I want to return a list of the strings from List1 that have matches in List2. I have tried to do something with set.intersection, but it seems you can't do partial matches with it (or at least I can't with my limited abilities). I also tried any(), but I had no success making it work with my lists. My book says I should use a nested loop, but I don't know which function I should use or how to apply it to lists.
Here is the complete code as reference:
#!/usr/bin/env python3.4
# -*- coding: utf-8 -*-
import random
def generateSequences (n):
    L = []
    dna = ["A","G","C","T"]
    for i in range(int(n)):
        random_sequence=''
        for i in range(50):
            random_sequence+=random.choice(dna)
        L.append(random_sequence)
    print(L)
    return L
def generatePrefixes (p, L):
    S = [x[:20] for x in L]
    D = []
    for i in range(p):
        randomPrefix = random.choice(S)
        D.append(randomPrefix)
    return S, D
if __name__ == "__main__":
    L = generateSequences(15)
    print (L)
    S, D = generatePrefixes(5, L)
    print (S)
    print (D)
Edit: As this was flagged as a possible duplicate, I want to point out that this post is about Python and the other one is about R. I don't know R or whether there are any similarities, but at first glance it doesn't look like it to me. Sorry for the inconvenience.
Using a nested for loop:
def intersect(List1, List2):
    # empty list for values that match
    ret = []
    for i in List2:
        for j in List1:
            if i in j:
                ret.append(j)
    return ret
List1 = ['abcd12', 'efgh34', 'ijkl56', 'mnop78']
List2 = ['abc', 'ijk']
print(intersect(List1, List2))
This may not be the most efficient way, but it works:
matches = []
for seq_1 in List1:
    for seq_2 in List2:
        if seq_1 in seq_2 or seq_2 in seq_1:
            matches.append(seq_1)
            # stop at the first match so seq_1 is not appended twice
            break
You can just compare the strings. This collects the items of List1 that contain an item of List2, then removes any duplicates from the result (note that converting to a set does not preserve order). This basically does what you want:
f = []
for i in List1:
    for j in List2:
        if j in i:
            f.append(i)
result = list(set(f))
Try
[l1 for l1 in List1 if any([l2 in l1 for l2 in List2])]
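Since the strings in List2 come from the start of the sequences in the full code, a prefix check may be enough. As a sketch under that assumption: str.startswith accepts a tuple of candidate prefixes, which avoids the explicit inner loop. The function name match_prefixes is made up for this example.

```python
def match_prefixes(sequences, prefixes):
    """Return the sequences that start with at least one of the prefixes."""
    prefixes = tuple(prefixes)  # str.startswith needs a tuple, not a list
    return [s for s in sequences if s.startswith(prefixes)]


List1 = ['abcd12', 'efgh34', 'ijkl56', 'mnop78']
List2 = ['abc', 'ijk']
print(match_prefixes(List1, List2))
```

Unlike the `in` checks above, this only matches at the beginning of each string, so it is not a replacement when the partial string can appear anywhere.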

Extract longest strings from sublist within list . Python

I have a list of sublists, and the sublists contain strings.
The strings are usually of different lengths, but some can be the same length.
Below is an example of the list:
sequences = [['aaa'],['aaaa','bb'],[],['aaaaaa','bb','cccccc']]
I want to extract the LONGEST string from each sublist and, if several are equally long, take all of them:
example_output = [['aaa'],['aaaa'],[],['aaaaaa','cccccc']]
Usually I would set a threshold in a for-loop, append every string longer than that threshold to a list, and then append that list to the result after each iteration, but I don't have a threshold value in this case.
If possible, I would like to avoid using lambda and functions, since this will live inside another function.
You can use the length of the longest string seen so far as the threshold (maxlen in the code below):
def get_longest(seq):
    maxlen = -1
    ret = []
    for el in seq:
        if len(el) > maxlen:
            ret = [el]
            maxlen = len(el)
        elif len(el) == maxlen:
            ret.append(el)
    return ret
sequences = [['aaa'],['aaaa','bb'],[],['aaaaaa','bb','cccccc']]
example_output = list(map(get_longest, sequences))
print(example_output)
This produces:
[['aaa'], ['aaaa'], [], ['aaaaaa', 'cccccc']]
This answer is not the most efficient, but easy to understand.
You can first extract the max lengths (here I'm using a generator expression for that), then extract the strings with those lengths.
lengths = ( max(len(s) for s in sublist) if sublist else 0 for sublist in sequences )
[ [ s for s in sublist if len(s) == l ] for l, sublist in zip(lengths, sequences) ]
-> [['aaa'], ['aaaa'], [], ['aaaaaa', 'cccccc']]
On Python 2, itertools.izip is preferable to zip in this case; in Python 3, zip is already lazy.
I'll give my shot with the following (cryptic :)) one liner:
example_output = [list(filter(lambda x: len(x)==len(max(sub_lst, key=len)), sub_lst)) for sub_lst in sequences]
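For the request to avoid lambda and helper functions, here is a sketch using a plain loop. It assumes Python 3.4 or later, where max accepts a default value for empty input, and it computes the maximum length once per sublist instead of once per string.

```python
sequences = [['aaa'], ['aaaa', 'bb'], [], ['aaaaaa', 'bb', 'cccccc']]

example_output = []
for sublist in sequences:
    # default=0 keeps max() from raising ValueError on an empty sublist
    longest = max((len(s) for s in sublist), default=0)
    example_output.append([s for s in sublist if len(s) == longest])

print(example_output)
```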

Find all index position in list based on partial string inside item in list

mylist = ["aa123", "bb2322", "aa354", "cc332", "ab334", "333aa"]
I need the index position of all items that contain 'aa'. I'm having trouble combining enumerate() with partial string matching. I'm not even sure if I should be using enumerate.
I just need to return the index positions: 0,2,5
You can use enumerate inside a list-comprehension:
indices = [i for i, s in enumerate(mylist) if 'aa' in s]
Your idea to use enumerate() was correct.
indices = []
for i, elem in enumerate(mylist):
    if 'aa' in elem:
        indices.append(i)
Alternatively, as a list comprehension:
indices = [i for i, elem in enumerate(mylist) if 'aa' in elem]
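Without enumerate() (note that list.index returns the first matching position, so this is only correct when the items are unique):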
>>> mylist = ["aa123", "bb2322", "aa354", "cc332", "ab334", "333aa"]
>>> l = [mylist.index(i) for i in mylist if 'aa' in i]
>>> l
[0, 2, 5]
Based on this answer, I'd like to show how to "early exit" the iteration once the first item containing the substring aa is encountered. This only returns the first position.
import itertools
first_idx = len(tuple(itertools.takewhile(lambda x: "aa" not in x, mylist)))
This should be much more performant than looping over the whole list when the list is large, since takewhile will stop once the condition is False for the first time.
I know that the question asked for all positions, but since there will be many users stumbling upon this question when searching for the first substring, I'll add this answer anyways.
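Another way to get the same early exit, sketched here as an alternative rather than a correction: next() over a generator expression stops at the first hit and can return a default when nothing matches, whereas the takewhile version yields len(mylist) in that case. The name first_index is hypothetical.

```python
def first_index(items, needle):
    """Return the index of the first item containing needle, or None."""
    # The generator is consumed lazily, so iteration stops at the first match.
    return next((i for i, s in enumerate(items) if needle in s), None)


mylist = ["aa123", "bb2322", "aa354", "cc332", "ab334", "333aa"]
print(first_index(mylist, "aa"))
print(first_index(mylist, "zz"))
```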
spell_list = ["Tuesday", "Wednesday", "February", "November", "Annual", "Calendar", "Solstice"]
index=spell_list.index("Annual")
print(index)

Cross-matching two lists

I have two lists where I am trying to see if there is any matches between substrings in elements in both lists.
["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
["2311","233412","2231"]
If any substrings in an element matches the second list such as "Po2311tato" will match with "2311". Then I would want to put "Po2311tato" in a new list in which all elements of the first that match would be placed in the new list. So the new list would be ["Po2311tato","Pin2231eap","Orange2231edg"]
You can use the syntax 'substring' in string to do this:
a = ["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
b = ["2311","233412","2231"]
def has_substring(word):
    for substring in b:
        if substring in word:
            return True
    return False
# In Python 3, filter returns an iterator, so wrap it in list() before printing
print(list(filter(has_substring, a)))
Hope this helps!
This can be made a little more concise than jobby's answer by using a list comprehension:
>>> list1 = ["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
>>> list2 = ["2311","233412","2231"]
>>> list3 = [string for string in list1 if any(substring in string for substring in list2)]
>>> list3
['Po2311tato', 'Pin2231eap', 'Orange2231edg']
Whether or not this is clearer / more elegant than jobby's version is a matter of taste!
import re
list1 = ["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
list2 = ["2311","233412","2231"]
matchlist = []
for str1 in list1:
    for str2 in list2:
        if re.search(str2, str1):
            matchlist.append(str1)
            break
print(matchlist)
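One caveat with the regex version: re.search treats each entry of list2 as a pattern, so characters such as "." or "+" would be interpreted rather than matched literally. A hedged sketch that escapes the substrings and combines them into a single pattern follows; filter_by_substrings is a name invented for this example.

```python
import re


def filter_by_substrings(strings, substrings):
    """Return the strings that contain at least one of the literal substrings."""
    if not substrings:
        return []
    # re.escape makes every substring match literally; "|" joins the alternatives
    pattern = re.compile("|".join(re.escape(s) for s in substrings))
    return [s for s in strings if pattern.search(s)]


list1 = ["Po2311tato", "Pin2231eap", "Orange2231edg", "add22131dfes"]
list2 = ["2311", "233412", "2231"]
print(filter_by_substrings(list1, list2))
```

For purely literal substrings like these, the `in` operator used in the other answers is simpler; the regex form mainly pays off when there are many substrings to test at once.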
