Extract longest strings from sublists within a list (Python)

I have a list of sublists, and each sublist contains strings.
The strings usually have different lengths, but some can be the same length.
Below is an example of the list:
sequences = [['aaa'],['aaaa','bb'],[],['aaaaaa','bb','cccccc']]
I want to extract the longest string from each sublist; if two or more strings tie for longest, I want to keep all of them:
example_output = [['aaa'],['aaaa'],[],['aaaaaa','cccccc']]
Usually I would set a threshold in a for-loop, append every string longer than that threshold to a list, and then append that list to the result after each iteration. But I don't have a threshold value in this case.
If possible, I would like to avoid using lambda and helper functions, since this will live inside another function.

You can use the length of the longest string seen so far as the threshold (maxlen in the code below):
def get_longest(seq):
    maxlen = -1
    ret = []
    for el in seq:
        if len(el) > maxlen:
            ret = [el]
            maxlen = len(el)
        elif len(el) == maxlen:
            ret.append(el)
    return ret
sequences = [['aaa'],['aaaa','bb'],[],['aaaaaa','bb','cccccc']]
example_output = list(map(get_longest, sequences))
print(example_output)
This produces:
[['aaa'], ['aaaa'], [], ['aaaaaa', 'cccccc']]
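Since the question asks to avoid helper functions, the same running-maximum idea can also be written inline as a nested loop. This is only a minimal sketch; the names result, longest, and maxlen are illustrative and not part of the original answer.

```python
sequences = [['aaa'], ['aaaa', 'bb'], [], ['aaaaaa', 'bb', 'cccccc']]

result = []
for sublist in sequences:
    longest = []   # strings tied for the longest length seen so far
    maxlen = -1    # running threshold; -1 so that even '' can qualify
    for s in sublist:
        if len(s) > maxlen:
            # strictly longer: start a new list of winners
            longest = [s]
            maxlen = len(s)
        elif len(s) == maxlen:
            # tie: keep this string as well
            longest.append(s)
    result.append(longest)

print(result)
```

Each sublist is traversed once, and an empty sublist simply yields an empty list.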

This answer is not the most efficient, but it is easy to understand.
You can first compute the maximum length for each sublist (here with a generator expression), then keep the strings that have that length.
lengths = ( max(len(s) for s in sublist) if sublist else 0 for sublist in sequences )
[ [ s for s in sublist if len(s) == l ] for l, sublist in zip(lengths, sequences) ]
-> [['aaa'], ['aaaa'], [], ['aaaaaa', 'cccccc']]
In Python 2, itertools.izip is preferable to zip here because it is lazy; in Python 3, zip is already lazy and itertools.izip no longer exists.
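On Python 3.4 or later, the two-pass idea can be sketched without the conditional expression by using the default argument of the built-in max. The variable name two_pass_result is an assumption made for illustration.

```python
sequences = [['aaa'], ['aaaa', 'bb'], [], ['aaaaaa', 'bb', 'cccccc']]

two_pass_result = []
for sublist in sequences:
    # First pass: longest length in this sublist.
    # default=0 avoids a ValueError when the sublist is empty.
    longest = max(map(len, sublist), default=0)
    # Second pass: keep every string that has that length.
    two_pass_result.append([s for s in sublist if len(s) == longest])

print(two_pass_result)
```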

Here is my attempt, a (cryptic :)) one-liner:
example_output = [list(filter(lambda x: len(x)==len(max(sub_lst, key=len)), sub_lst)) for sub_lst in sequences]

Related

Removing a sublist if a string in the sublist contains a substring (all values within all sublists are strings)

Given nested list: mistake_list = [['as','as*s','sd','*ssa'],['a','ds','dfg','mal']]
Required output: corrected_list = [['a','ds','dfg','mal']]
The given list can contain hundreds or thousands of sublists whose strings may or may not contain the special character *; if any string in a sublist does, that whole sublist has to be removed.
I have shown an example above where the mistake_list is the input nested list, and corrected_list is the output nested list.
NOTE: all sublists have an equal number of elements (I don't think it is necessary to know this though)
The filter function can help you:
mistake_list = [['as','as*s','sd','*ssa'],['a','ds','dfg','mal']]
corrected_list = list(filter(lambda l: not any("*" in x for x in l), mistake_list))
print(corrected_list)
[['a', 'ds', 'dfg', 'mal']]
You can use list comprehension:
mistake_list = [['as','as*s','sd','*ssa'],['a','ds','dfg','mal']]
corrected_list = [sublst for sublst in mistake_list if not any('*' in s for s in sublst)]
print(corrected_list) # [['a', 'ds', 'dfg', 'mal']]
The filtering condition checks whether any item of sublst contains a '*' character; sublists for which it does are dropped.
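To make the condition concrete, here is a small sketch that evaluates the check for each sublist before filtering. The name has_star is hypothetical and only used for illustration.

```python
mistake_list = [['as', 'as*s', 'sd', '*ssa'], ['a', 'ds', 'dfg', 'mal']]

# For each sublist, record whether any of its strings contains '*'.
# any() stops at the first string that matches.
has_star = [any('*' in s for s in sublst) for sublst in mistake_list]
print(has_star)

# Keep only the sublists for which the check is False.
corrected_list = [sublst for sublst, bad in zip(mistake_list, has_star) if not bad]
print(corrected_list)
```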

Grouping a grouped list of str without duplicates

I have a grouped list of strings that looks roughly like this. The lists inside these groups will always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
             ['fff','ggg','hhh','iii','jjj'],
             ['xxx','mmm','ccc','bbb','aaa'],
             ['fff','xxx','aaa','bbb','ddd'],
             ['aaa','bbb','ccc','ddd','eee'],
             ['fff','xxx','aaa','ddd','eee'],
             ['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group the lists that are similar, where similarity is decided by taking the first 3 elements of one list and checking whether all of them appear among the elements of another list.
So from the above example the output might look like this (the output contains the indices of the lists):
[[0,2,4],[3,5]]
Notice that a group containing the same indices as another group, just in a different order, is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate groups afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, but using only the first 3 elements of each list. For example, using the first 3 elements of list A, check whether lists B, C, D, ... contain all 3 of those elements. Repeat for every list, then remove any resulting group that duplicates another.
You can build a set of frozensets to keep track of indices of groups with the first 3 items being a subset of the rest of the members:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it is faster to build a set of frozensets from the first 3 items of every sub-list and then test each 3-item combination of every sub-list against that set. Despite the overhead of generating the combinations, the time complexity is then essentially linear in the length of the input list rather than quadratic:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])
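The combinations-based approach can be wrapped in a function and checked against the sample data from the question. This is a sketch only: the name group_by_first_three is hypothetical, it assumes the items within each sub-list are distinct, and the final sort is added just to make the output order deterministic.

```python
from itertools import combinations

def group_by_first_three(text_list):
    """Group indices of sub-lists that contain the first 3 items of some sub-list."""
    # Every "first 3 items" set that can act as a group key.
    keys = {frozenset(lst[:3]) for lst in text_list}
    groups = {}
    for i, lst in enumerate(text_list):
        # A 5-item sub-list has only 10 combinations of 3 items,
        # so the work per sub-list is constant.
        for c in map(frozenset, combinations(lst, 3)):
            if c in keys:
                groups.setdefault(c, []).append(i)
    # Drop groups with a single member and sort for a stable result.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

text_list = [['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'ggg', 'hhh', 'iii', 'jjj'],
             ['xxx', 'mmm', 'ccc', 'bbb', 'aaa'],
             ['fff', 'xxx', 'aaa', 'bbb', 'ddd'],
             ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'xxx', 'aaa', 'ddd', 'eee'],
             ['iii', 'xxx', 'ggg', 'jjj', 'aaa']]

print(group_by_first_three(text_list))
```

Because the dictionary is keyed by the set of 3 items, lists 0 and 4 (which share the same first 3 items) land in the same group, so no duplicate groups need to be removed afterwards.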

Sorting a list of strings based on numeric order of numeric part

I have a list of strings that may contain digits. I would like to sort this list alphabetically, but whenever a string contains a number, I want that number to be sorted by its numeric value.
For example, if the list is
['a1a','b1a','a10a','a5b','a2a'],
the sorted list should be
['a1a','a2a','a5b','a10a','b1a']
In general, I want to treat each number (a sequence of digits) in the string as a special character that is smaller than any letter and can be compared numerically to other numbers.
Is there any Python function which does this compactly?
You could use the re module to split each string into a list of tokens, grouping each run of digits into a single element, with a pattern like r'(\d+)|(.)'. The good news is that with this regex, re.findall returns the numeric and non-numeric groups separately.
As a simple key, we could use:
import re

def key(x):
    # the tuple comparison will ensure that numbers come before letters
    return [(j, int(i)) if i != '' else (j, i)
            for i, j in re.findall(r'(\d+)|(.)', x)]
Demo:
lst = ['a1a', 'a2a', 'a5b', 'a10a', 'b1a', 'abc']
print(sorted(lst, key=key))
gives:
['a1a', 'a2a', 'a5b', 'a10a', 'abc', 'b1a']
If you want more efficient processing, you can compile the regex only once in a closure:
def build_key():
    rx = re.compile(r'(\d+)|(.)')
    def key(x):
        return [(j, int(i)) if i != '' else (j, i)
                for i, j in rx.findall(x)]
    return key
and use it this way:
sorted(lst, key=build_key())
which of course gives the same output.
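For reference, here is a self-contained sketch of the same technique applied to the list from the question. The names _TOKEN and natural_key are illustrative, and the pattern is compiled once at module level instead of in a closure.

```python
import re

# One alternative per token: a run of digits, or any single character.
_TOKEN = re.compile(r'(\d+)|(.)')

def natural_key(s):
    # Digit runs become ('', number); other characters become (char, '').
    # The empty string sorts before any letter, so numbers come first,
    # and an int is only ever compared with another int.
    return [(char, int(digits)) if digits else (char, digits)
            for digits, char in _TOKEN.findall(s)]

words = ['a1a', 'b1a', 'a10a', 'a5b', 'a2a']
print(natural_key('a10a'))   # [('a', ''), ('', 10), ('a', '')]
print(sorted(words, key=natural_key))
```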

Python - how to add string depending on length to new list

I am fairly new to Python and I am attempting to write a function that takes a list of strings (e.g. ['my', 'name', 'is', 'John']) and returns a new list with these same strings in order of length. I have broken it down into four steps. So far I have managed to compute the maximum length of all words and create the empty lists (buckets).
Where I am struggling is Step 3: I can't work out how to write something that looks at the length of each word and places it in the corresponding bucket, for example when the word is 8 characters long. I can "hardcode" it so it is limited to words of x characters, but beyond that I am stumped.
def empty_buckets(n):
    """Return a list with n empty lists. Assume n is a positive integer. """
    buckets = []
    for bucket in range(n):
        buckets.append([])
    return buckets

def bucket_sorted(words):
    """Return a new list with the same words, but by increasing length.
    Assume words is a non-empty list of non-empty strings.
    """
    # Step 1. Compute the maximum length L of all words.
    for i in words:
        if len(i) > 0:
            L = len(i)
    print(L)
    # Step 2. Create a list of L empty lists (buckets).
    buckets = empty_buckets(L)
    # Step 3. Put each word in the bucket corresponding to its length
    # for example words like 'a' go in buckets[0], words like 'as' go in buckets[1] etc.
    # Step 4. Put all buckets together into a single list of words.
    newList = []
    for k in buckets:
        if len(k) > 0:
            newList = newList + k
    return newList
(This answer assumes that you are using a bucket sort approach as an exercise.)
Now's a good time to start practicing functional programming approaches.
Step 1: use the built-in max function, and map with function len to calculate the lengths.
L = max(map(len, words))
Step 2: use a list comprehension here.
buckets = [[] for i in range(0, L)]
Step 4: (optional – your current approach is fine) instead of concatenating buckets in a loop, use itertools.chain to chain them together.
from itertools import chain
...
newList = list(chain(*buckets))
Step 3: For each string s, use len(s) - 1 as the bucket index (since Python list indices start at 0, not 1):
for word in words:
    buckets[len(word)-1].append(word)
Putting the above all together:
from itertools import chain

def bucket_sort(words):
    # step 1
    L = max(map(len, words))
    # step 2
    buckets = [[] for i in range(0, L)]
    # step 3
    for word in words:
        buckets[len(word)-1].append(word)
    # step 4
    return list(chain(*buckets))
Test:
>>> bucket_sort(["my", "name", "is", "Sherlock", "Holmes", "."])
['.', 'my', 'is', 'name', 'Holmes', 'Sherlock']
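Outside of an exercise, the same ordering can be obtained with the built-in sorted and key=len. Python's sort is stable, so words of equal length keep their original relative order, just as they do within a bucket. A quick sketch (the name by_length is illustrative):

```python
words = ["my", "name", "is", "Sherlock", "Holmes", "."]

# Stable sort: "my" stays ahead of "is" because it came first in the input.
by_length = sorted(words, key=len)
print(by_length)
```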
Here's a hint: create a loop that looks at each word; inside the loop, assign the length of the word to a variable and then use that variable as the bucket index.
I think you will also have less trouble if you change the buckets to a dictionary instead of a list, for easy referencing.
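A minimal sketch of the dictionary idea from this hint, assuming each bucket is keyed by word length; the function name bucket_sorted_dict is hypothetical.

```python
def bucket_sorted_dict(words):
    """Return the words ordered by increasing length, using dict buckets."""
    buckets = {}
    for word in words:
        length = len(word)                        # the variable from the hint
        buckets.setdefault(length, []).append(word)
    result = []
    # Visit the buckets from the shortest length to the longest.
    for length in sorted(buckets):
        result.extend(buckets[length])
    return result

print(bucket_sorted_dict(["my", "name", "is", "Sherlock", "Holmes", "."]))
```

With a dictionary there is no need to compute the maximum length up front, and no empty buckets are created for lengths that never occur.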

Reduce list based off of element substrings

I'm looking for the most efficient way to reduce a given list based off of substrings already in the list.
For example
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
would be reduced to:
mylist = ['abcd','qrs']
because 'abcd' and 'qrs' are the shortest strings that occur as substrings of other elements in that list. I was able to do this with about 30 lines of code, but I suspect there is a crafty one-liner out there.
This seems to work (though I suppose it is not very efficient):
def reduce_prefixes(strings):
    sorted_strings = sorted(strings)
    return [element
            for index, element in enumerate(sorted_strings)
            if all(not previous.startswith(element) and
                   not element.startswith(previous)
                   for previous in sorted_strings[:index])]
Tests:
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu'])
['abcd', 'qrs']
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu',
...                  'gabcd', 'gab', 'ab'])
['ab', 'gab', 'qrs']
Probably not the most efficient, but at least short:
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
outlist = []
for l in mylist:
    if any(o.startswith(l) for o in outlist):
        # l is a prefix of some elements in outlist, so it replaces them
        outlist = [ o for o in outlist if not o.startswith(l) ] + [ l ]
    if not any(l.startswith(o) for o in outlist):
        # l has no prefix in outlist yet, so it becomes a prefix candidate
        outlist.append(l)
print(outlist)
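If "substring" is taken to mean "prefix", as the answers above do, sorting first allows a single pass: in sorted order every string follows its prefixes, so it is enough to compare each string with the last one kept. This is a sketch under that assumption; the name reduce_prefixes_sorted is hypothetical.

```python
def reduce_prefixes_sorted(strings):
    """Keep only the strings that have no other string in the list as a prefix."""
    kept = []
    for s in sorted(strings):
        # All strings sharing a prefix p sort directly after p, so the
        # last kept string is the only possible prefix of s still in play.
        if not kept or not s.startswith(kept[-1]):
            kept.append(s)
    return kept

print(reduce_prefixes_sorted(['abcd', 'abcde', 'abcdef', 'qrs', 'qrst', 'qrstu']))
print(reduce_prefixes_sorted(['abcd', 'abcde', 'abcdef',
                              'qrs', 'qrst', 'qrstu',
                              'gabcd', 'gab', 'ab']))
```

The cost is dominated by the sort, and the result comes back in sorted order rather than input order.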
One solution is to iterate over all the strings, split them into groups according to their next character, and recursively apply the same function to each group.
def reduce_substrings(strings):
    return list(_reduce_substrings(map(iter, strings)))

def _reduce_substrings(strings):
    # A dictionary of characters to a list of strings that begin with that character
    nexts = {}
    for string in strings:
        try:
            nexts.setdefault(next(string), []).append(string)
        except StopIteration:
            # Reached the end of this string. It is the only shortest substring.
            yield ''
            return
    for next_char, next_strings in nexts.items():
        for next_substrings in _reduce_substrings(next_strings):
            yield next_char + next_substrings
This groups the strings in a dictionary keyed by their next character, then recursively finds the shortest prefix within each group.
Of course, because the function is recursive, an equally efficient one-liner is not possible.
Try this one:
import re
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
new_list=[]
for i in mylist:
    if re.match("^abcd$",i):
        new_list.append(i)
    elif re.match("^qrs$",i):
        new_list.append(i)
print(new_list)
#['abcd', 'qrs']
