Find the position of the longest repeated letter - python

I have a file that contains letters. I need to find the position of the longest run of repeated letters. For example, if the file contains aaassdddffccsdddfgssfrsfspppppppppppddsfs, I need a program that finds the position of ppppppppppp. I know that I need to use the .index function to find the location; however, I am stuck on the loop.

Using itertools.groupby:
import itertools
mystr = 'aaassdddffccsdddfgssfrsfspppppppppppddsfs'
idx = 0
maxidx, maxlen = 0, 0
for _, group in itertools.groupby(mystr):
    grouplen = sum(1 for _ in group)
    if grouplen > maxlen:
        maxidx, maxlen = idx, grouplen
    idx += grouplen
This gives the start index and the length of the longest run of identical characters:
>>> print(maxidx, maxlen)
25 11
>>> mystr[25:25+11]
'ppppppppppp'
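Since the question mentions a file, the loop above can be wrapped in a small function. This is only a sketch: the name longest_run and the file name in the usage note are assumptions, not part of the answer above.

```python
import itertools

def longest_run(text):
    """Return (start_index, length) of the first longest run of identical characters."""
    idx = 0
    best_idx, best_len = 0, 0
    for _, group in itertools.groupby(text):
        run_len = sum(1 for _ in group)
        if run_len > best_len:  # strict '>' keeps the earliest run on ties
            best_idx, best_len = idx, run_len
        idx += run_len
    return best_idx, best_len

print(longest_run('aaassdddffccsdddfgssfrsfspppppppppppddsfs'))  # (25, 11)
```

To apply it to a file, something like longest_run(open('letters.txt').read().strip()) should work, assuming a hypothetical letters.txt that holds a single line of letters.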

You're going to need to loop through the entire string. Keep track of each new letter you come across, as well as its index and how long each sequence is. Only store the longest sequence:
s = 'aaassdddffccsdddfgssfrsfspppppppppppddsfs'
max_c = max_i = max_len = None
cur_c = cur_i = cur_len = None
for i, c in enumerate(s):
    if c != cur_c:
        if max_len is None or cur_len > max_len:
            max_c, max_i, max_len = cur_c, cur_i, cur_len
        cur_c = c
        cur_i = i
        cur_len = 1
    else:
        cur_len += 1
else:
    # One last check when the loop completes
    if max_len is None or cur_len > max_len:
        max_c, max_i, max_len = cur_c, cur_i, cur_len
print(max_c, max_i, max_len)

Here is a one-liner:
from itertools import groupby
from operator import itemgetter
[(k, next(g)[0], sum(1 for _ in g)+1) for k, g in groupby(enumerate(
    'aaassdddffccsdddfgssfrsfspppppppppppddsfs'), key=itemgetter(1))]
The above generates (key, position, length) tuples. You can get the entry with the maximum length by applying reduce:
from itertools import groupby
from functools import reduce
from operator import itemgetter
reduce(lambda x,y:x if x[2] >= y[2] else y,
       ((k, next(g)[0], sum(1 for _ in g)+1) for k, g in groupby(enumerate(
           'aaassdddffccsdddfgssfrsfspppppppppppddsfs'), key=itemgetter(1))))
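The same selection can probably be written with the built-in max and a key function instead of reduce. A minimal sketch, where the variable names s and runs are introduced here for illustration:

```python
from itertools import groupby
from operator import itemgetter

s = 'aaassdddffccsdddfgssfrsfspppppppppppddsfs'

# Each group yields (index, char) pairs; the first pair gives the start position.
runs = [(k, next(g)[0], sum(1 for _ in g) + 1)
        for k, g in groupby(enumerate(s), key=itemgetter(1))]

# max() returns the first maximal element, matching the '>=' in the reduce version.
longest = max(runs, key=itemgetter(2))
print(longest)  # ('p', 25, 11)
```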

A quick way of achieving this is to use a regex to match runs of repeating characters with (.)(\1+). Note that this pattern only matches runs of two or more characters. We then loop over all the results using a generator expression and find the max according to length (key=len). Finally, having found the longest run, we call txt.index() to find where it occurs:
import re
txt = "aaassdddffccsdddfgssfrsfspppppppppppddsfs"
idx = txt.index(max((''.join(f) for f in re.findall(r"(.)(\1+)", txt)), key=len))
print(idx)
Here is the same code broken out into stages:
>>> import re
>>> txt = "aaassdddffccsdddfgssfrsfspppppppppppddsfs"
>>> matches = list(''.join(f) for f in re.findall(r"(.)(\1+)", txt))
>>> print(matches)
['aaa', 'ss', 'ddd', 'ff', 'cc', 'ddd', 'ss', 'ppppppppppp', 'dd']
>>> longest = max(matches, key=len)
>>> print(longest)
ppppppppppp
>>> print(txt.index(longest))
25
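A variation that avoids the second search with index() is re.finditer, whose match objects carry their own start position. This sketch swaps in the pattern (.)\1* (not the pattern used above) so that single-letter runs are matched too:

```python
import re

txt = "aaassdddffccsdddfgssfrsfspppppppppppddsfs"

# Each match is one maximal run of a single repeated character.
longest = max(re.finditer(r"(.)\1*", txt), key=lambda m: len(m.group(0)))
print(longest.start(), longest.group(0))  # 25 ppppppppppp
```

Note that max() raises ValueError on an empty string, since there are no matches to compare.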

Related

How can I join a list of strings and remove duplicated letters (keep them chained)

My list:
l = ["volcano", "noway", "lease", "sequence", "erupt"]
Desired output:
'volcanowayleasequencerupt'
I have tried:
Using itertools.groupby, but it doesn't work well when two letters are repeated in a row (e.g. leasesequence -> sese stays):
>>> from itertools import groupby
>>> "".join([i[0] for i in groupby("".join(l))])
'volcanonowayleasesequencerupt'
As you can see, it only got rid of the last 'e'. This is also not ideal because any word containing a double letter gets shrunk to a single one, e.g. 'suddenly' becomes 'sudenly'.
I'm looking for the most Pythonic approach for this.
Thank you in advance.
EDIT
My list does not have any duplicated items in it.
Using a helper function that crops a word t by removing its longest prefix that's also a suffix of s:
def crop(s, t):
    for k in range(len(t), -1, -1):
        if s.endswith(t[:k]):
            return t[k:]
And then crop each word with its preceding word:
>>> l = ["volcano", "noway", "lease", "sequence", "erupt"]
>>> ''.join(crop(s, t) for s, t in zip([''] + l, l))
'volcanowayleasequencerupt'
>>> l = ['split', 'it', 'lit']
>>> ''.join(crop(s, t) for s, t in zip([''] + l, l))
'splitlit'
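To see what crop does at each step, it can help to look at the cropped pieces before joining them. A small illustration using the helper defined above (repeated here so the snippet runs on its own; the variable name pieces is introduced for illustration):

```python
def crop(s, t):
    # Try the longest candidate prefix of t first; k == 0 always matches.
    for k in range(len(t), -1, -1):
        if s.endswith(t[:k]):
            return t[k:]

l = ["volcano", "noway", "lease", "sequence", "erupt"]
pieces = [crop(s, t) for s, t in zip([''] + l, l)]
print(pieces)           # ['volcano', 'way', 'lease', 'quence', 'rupt']
print(''.join(pieces))  # volcanowayleasequencerupt
```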
A more readable version, in my opinion:
from functools import reduce
def max_overlap(s1, s2):
    return next(
        i
        for i in reversed(range(len(s2) + 1))
        if s1.endswith(s2[:i])
    )
def overlap(strs):
    return reduce(
        lambda s1, s2:
            s1 + s2[max_overlap(s1, s2):],
        strs, '',
    )
overlap(l)
#> 'volcanowayleasequencerupt'
However, it also considers "accumulated" characters from previous words that overlapped:
overlap(['split', 'it', 'lit'])
#> 'split'
Here's a brute-force deduplicator:
def dedup(a, b):
    for i in range(len(b), 0, -1):
        if a[-i:] == b[:i]:
            return a[:-i]
    return a
Then, simply zip through:
>>> from itertools import chain, islice
>>> xs = ["volcano", "noway", "lease", "sequence", "erupt"]
>>> xs = [dedup(*x) for x in zip(xs, chain(islice(xs, 1, None), [""]))]
>>> "".join(xs)
'volcanowayleasequencerupt'
Naturally, this works for any length of list xs.

Count occurrences of char in a single string

string = input(" ")
count = string.count()
print(string + str(count))
Need to use a for loop to get the output: ll2a1m1a1
Use groupby from itertools
>>> from itertools import groupby
>>> s = 'llama'
>>> [[k, len(list(g))] for k, g in groupby(s)]
[['l', 2], ['a', 1], ['m', 1], ['a', 1]]
If you want exactly the output you asked for, try the following. As suggested by @DanielMesejo, use sum(1 for _ in g) instead of len(list(g)):
>>> from itertools import groupby
>>> s = 'llama'
>>> groups = [[k, sum(1 for _ in g)] for k, g in groupby(s)]
>>> ''.join(f'{a * b}{b}' for a, b in groups)
'll2a1m1a1'
This works for any word. For example, with 'happen':
>>> from itertools import groupby
>>> s = 'happen'
>>> groups = [[k, sum(1 for _ in g)] for k, g in groupby(s)]
>>> ''.join(f'{a * b}{b}' for a, b in groups)
'h1a1pp2e1n1'
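The two steps can be folded into one small function. This is a sketch; the name encode_runs is made up for illustration:

```python
from itertools import groupby

def encode_runs(s):
    """Emit each run of identical characters followed by the run's length."""
    parts = []
    for char, group in groupby(s):
        count = sum(1 for _ in group)
        parts.append(f'{char * count}{count}')
    return ''.join(parts)

print(encode_runs('llama'))   # ll2a1m1a1
print(encode_runs('happen'))  # h1a1pp2e1n1
```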
A more basic approach:
string = 'llama'
def get_count_str(s):
    previous = s[0]
    for c in s[1:]:
        # compare against the run's letter, not the whole accumulated run,
        # so that runs of three or more are not split
        if c != previous[0]:
            yield f'{previous}{len(previous)}'
            previous = c
        else:
            previous += c
    # yield last
    yield f'{previous}{len(previous)}'
print(*get_count_str(string), sep='')
output:
ll2a1m1a1
Look bud, you gotta explain more. This loops through and counts how many times each letter occurs in the whole string, then prints it out:
greeting = 'llama'
for i in range(0, len(greeting)):
    # start count at 1 for original instance.
    count = 1
    for k in range(0, len(greeting)):
        # check letters are not the same position letter.
        if not k == i:
            # check if letters match
            if greeting[i] == greeting[k]:
                count += 1
    print(greeting[i] + str(count))
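Note that the loop above counts occurrences across the whole string rather than consecutive runs, so for 'llama' both 'a's are reported with a count of 2. If that is what you want, collections.Counter should give the same totals without the nested loop. A sketch (the names totals and labels are illustrative):

```python
from collections import Counter

greeting = 'llama'
totals = Counter(greeting)  # total occurrences of each letter in the string

labels = [letter + str(totals[letter]) for letter in greeting]
for label in labels:
    print(label)  # prints l2, l2, a2, m1, a2 on separate lines
```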

Filter a list of sets with specific criteria

I have a list of sets:
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
Expected outcome:
I want to look at each individual string, do a count and return everything x >= 2 in the original format:
a = [{'foo','cpu'}, {'foo'}, {'cpu'}]
Here's what I have so far but I'm stuck on the last part where I need to append the new list:
from collections import Counter
counter = Counter()
for a_set in a:
    # Create a counter to count the occurrences of each word
    counter.update(a_set)
result = []
for a_set in a:
    for word in a_set:
        if counter[word] >= 2:
            # Not sure how I should append my new set below.
            result.append(a_set)
            break
print(result)
You are just appending the original set. Instead, create a new set containing only the words that occur at least twice:
result = []
for a_set in a:
    new_set = {
        word for word in a_set
        if counter[word] >= 2
    }
    if new_set: # check if new set is not empty
        result.append(new_set)
Alternatively, use the following short approach based on set intersection:
from collections import Counter
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
c = Counter([i for s in a for i in s])
valid_keys = {k for k,v in c.items() if v >= 2}
res = [s & valid_keys for s in a if s & valid_keys]
print(res) # [{'cpu', 'foo'}, {'foo'}, {'cpu'}]
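If the threshold needs to vary, the same idea can be wrapped in a function. The name filter_common and the min_count parameter are invented here for illustration only:

```python
from collections import Counter
from itertools import chain

def filter_common(sets, min_count=2):
    """Keep only items that appear in at least min_count sets; drop emptied sets."""
    counts = Counter(chain.from_iterable(sets))
    common = {item for item, n in counts.items() if n >= min_count}
    return [s & common for s in sets if s & common]

a = [{'foo', 'cpu', 'phone'}, {'foo', 'mouse'}, {'dog', 'cat'}, {'cpu'}]
print(filter_common(a))
```

Because each item appears at most once per set, counting over the flattened sets is the same as counting how many sets contain the item.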
Here's what I ended up doing:
Build a counter, then iterate over the original list of sets and filter out items with a count below 2, then filter out any empty sets:
from itertools import chain
from collections import Counter
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
c = Counter(chain.from_iterable(map(list, a)))
res = list(filter(None, ({item for item in s if c[item] >= 2} for s in a)))
print(res)
Out: [{'foo', 'cpu'}, {'foo'}, {'cpu'}]

Given a linear order completely represented by a list of tuples of strings, output the order as a list of strings

Given pairs of items of form [(a,b),...] where (a,b) means a > b, for example:
[('best','better'),('best','good'),('better','good')]
I would like to output a list of form:
['best','better','good']
This is very hard for some reason. Any thoughts?
======================== code =============================
I know why it doesn't work.
def to_rank(raw):
    rank = []
    for u,v in raw:
        if u in rank and v in rank:
            pass
        elif u not in rank and v not in rank:
            rank = insert_front(u,v,rank)
            rank = insert_behind(v,u,rank)
        elif u in rank and v not in rank:
            rank = insert_behind(v,u,rank)
        elif u not in rank and v in rank:
            rank = insert_front(u,v,rank)
    return [[r] for r in rank]
# Use: insert word u in front of word v in list of words
def insert_front(u,v,words):
    if words == []: return [u]
    else:
        head = words[0]
        tail = words[1:]
        if head == v: return [u] + words
        else        : return ([head] + insert_front(u,v,tail))
# Use: insert word u behind word v in list of words
def insert_behind(u,v,words):
    words.reverse()
    words = insert_front(u,v,words)
    words.reverse()
    return words
=================== Update ===================
As many suggested, this is a straightforward topological sort setting. I ultimately decided to use the code from this source: algocoding.wordpress.com/2015/04/05/topological-sorting-python/, which solved my problem.
from collections import deque
def go_topsort(graph):
    in_degree = { u : 0 for u in graph }     # determine in-degree
    for u in graph:                          # of each node
        for v in graph[u]:
            in_degree[v] += 1
    Q = deque()                 # collect nodes with zero in-degree
    for u in in_degree:
        if in_degree[u] == 0:
            Q.appendleft(u)
    L = []     # list for order of nodes
    while Q:
        u = Q.pop()          # choose node of zero in-degree
        L.append(u)          # and 'remove' it from graph
        for v in graph[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                Q.appendleft(v)
    if len(L) == len(graph):
        return L
    else:                    # if there is a cycle,
        return []
RockBilly's solution also works in my case because, in my setting, for every v < u we are guaranteed to have a pair (u,v) in our list. So his answer is not very "computer-sciency", but it gets the job done in this case.
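For reference, on Python 3.9 or later the standard library's graphlib module can produce the same ordering directly from the pairs. A minimal sketch, with rank_from_pairs as a made-up name:

```python
from graphlib import TopologicalSorter

def rank_from_pairs(pairs):
    """Order items so that u comes before v for every pair (u, v)."""
    sorter = TopologicalSorter()
    for u, v in pairs:
        sorter.add(v, u)  # u must be emitted before v, so u is a predecessor of v
    # static_order() raises graphlib.CycleError if the pairs contain a cycle
    return list(sorter.static_order())

print(rank_from_pairs([('best', 'better'), ('best', 'good'), ('better', 'good')]))
# ['best', 'better', 'good']
```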
If you have a complete grammar specified then you can simply count up the items:
>>> import itertools as it
>>> from collections import Counter
>>> ranks = [('best','better'),('best','good'),('better','good')]
>>> c = Counter(x for x, y in ranks)
>>> sorted(set(it.chain(*ranks)), key=c.__getitem__, reverse=True)
['best', 'better', 'good']
If you have an incomplete grammar then you can build a graph and dfs all paths to find the longest. This isn't very efficient, as I haven't thought about that yet :):
def dfs(graph, start, end):
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            yield path
            continue
        for next_state in graph.get(path[-1], []):
            if next_state in path:
                continue
            stack.append(path+[next_state])
def paths(ranks):
    graph = {}
    for n, m in ranks:
        graph.setdefault(n,[]).append(m)
    for start, end in it.product(set(it.chain(*ranks)), repeat=2):
        yield from dfs(graph, start, end)
>>> ranks = [('black', 'dark'), ('black', 'dim'), ('black', 'gloomy'), ('dark', 'gloomy'), ('dim', 'dark'), ('dim', 'gloomy')]
>>> max(paths(ranks), key=len)
['black', 'dim', 'dark', 'gloomy']
>>> ranks = [('a','c'), ('b','a'),('b','c'), ('d','a'), ('d','b'), ('d','c')]
>>> max(paths(ranks), key=len)
['d', 'b', 'a', 'c']
What you're looking for is topological sort. You can do this in linear time using depth-first search (pseudocode included in the wiki I linked)
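A minimal sketch of that depth-first approach, assuming the pairs contain no cycles (the function name topological_order is illustrative, and the recursion assumes the input is small enough not to hit Python's recursion limit):

```python
def topological_order(pairs):
    """Return items ordered so that a precedes b for every pair (a, b)."""
    graph = {}
    for higher, lower in pairs:
        graph.setdefault(higher, []).append(lower)
        graph.setdefault(lower, [])

    visited = set()
    postorder = []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for successor in graph[node]:
            visit(successor)
        # A node is appended only after everything ranked below it.
        postorder.append(node)

    for node in graph:
        visit(node)
    return postorder[::-1]  # reversed post-order is a topological order

print(topological_order([('best', 'better'), ('best', 'good'), ('better', 'good')]))
# ['best', 'better', 'good']
```

Each node and each pair is handled once, which is where the linear running time comes from.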
Here is one way. It is based on using the complete pairwise rankings to make an old-style (early Python 2) cmp function and then using functools.cmp_to_key to convert it to a key suitable for the Python 3 approach to sorting:
import functools
def sortByRankings(rankings):
    def cmp(x,y):
        if x == y:
            return 0
        elif (x,y) in rankings:
            return -1
        else:
            return 1
    items = list({x for y in rankings for x in y})
    items.sort(key = functools.cmp_to_key(cmp))
    return items
Tested like:
ranks = [('a','c'), ('b','a'),('b','c'), ('d','a'), ('d','b'), ('d','c')]
print(sortByRankings(ranks)) #prints ['d', 'b', 'a', 'c']
Note that to work correctly, the parameter rankings must contain an entry for each pair of distinct items. If it doesn't, you would first need to compute the transitive closure of the pairs that you do have before you feed it to this function.
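One simple way to compute that closure is sketched below, under the assumption that the number of items is small: it repeatedly joins pairs until nothing new appears, which is not efficient for large inputs. The function name transitive_closure is illustrative.

```python
def transitive_closure(pairs):
    """Add (a, d) whenever (a, b) and (b, d) are present, until no new pairs appear."""
    closure = set(pairs)
    while True:
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

sparse = [('d', 'b'), ('b', 'a'), ('a', 'c')]
print(sorted(transitive_closure(sparse)))
```

The resulting set can be passed to a comparison-based sort like the one above, since the `in` test works on sets as well as lists.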
You can take advantage of the fact that the lowest ranked item in the list will never appear at the start of any tuple. You can extract this lowest item, then remove all elements which contain this lowest item from your list, and repeat to get the next lowest.
This should work even if you have redundant elements, or have a sparser list than some of the examples here. I've broken it up into finding the lowest ranked item, and then the grunt work of using this to create a final ranking.
from copy import copy
def find_lowest_item(s):
    #Iterate over set of all items
    for item in set([item for sublist in s for item in sublist]):
        #If an item does not appear at the start of any tuple, return it
        if item not in [x[0] for x in s]:
            return item
def sort_by_comparison(s):
    final_list = []
    #Make a copy so we don't mutate original list
    new_s = copy(s)
    #Get the set of all items
    item_set = set([item for sublist in s for item in sublist])
    for i in range(len(item_set)):
        lowest = find_lowest_item(new_s)
        if lowest is not None:
            final_list.insert(0, lowest)
        #For the highest ranked item, we just compare our current
        #ranked list with the full set of items
        else:
            final_list.insert(0,set(item_set).difference(set(final_list)).pop())
        #Update list of ranking tuples to remove processed items
        new_s = [x for x in new_s if lowest not in x]
    return final_list
list_to_compare = [('black', 'dark'), ('black', 'dim'), ('black', 'gloomy'), ('dark', 'gloomy'), ('dim', 'dark'), ('dim', 'gloomy')]
sort_by_comparison(list_to_compare)
['black', 'dim', 'dark', 'gloomy']
list2 = [('best','better'),('best','good'),('better','good')]
sort_by_comparison(list2)
['best', 'better', 'good']
list3 = [('best','better'),('better','good')]
sort_by_comparison(list3)
['best', 'better', 'good']
If you sort or create a dictionary from the list items, you are going to lose the order, as @Rockybilly mentioned in his answer. I suggest you create a flat list from the tuples of the original list and then remove duplicates.
def remove_duplicates(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
i = [(5,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
i = remove_duplicates(list(x for s in i for x in s))
print(i) # prints [5, 2, 1, 3, 4]
j = [('excellent','good'),('excellent','great'),('great','good')]
j = remove_duplicates(list(x for s in j for x in s))
print(j) # prints ['excellent', 'good', 'great']
See reference: How do you remove duplicates from a list whilst preserving order?
For an explanation of the remove_duplicates() function, see this stackoverflow post.
If the list is complete, meaning it has enough information to do the ranking (and no duplicate or redundant inputs), this will work:
from collections import defaultdict
lst = [('best','better'),('best','good'),('better','good')]
d = defaultdict(int)
for tup in lst:
    d[tup[0]] += 1
    d[tup[1]] += 0 # To create it in defaultdict
print(sorted(d, key = lambda x: d[x], reverse=True))
# ['best', 'better', 'good']
Just give them points: increment the left item's score each time you encounter it in the list.
Edit: I do think the OP has a fixed type of input, one that always contains nCr(n, 2) tuples, one for every pair of items. That makes this a correct solution. There is no need to complain about the edge cases, which I already knew about when posting the answer (and mentioned).

How to remove a string from a list of strings if its length is lower than the length of the string with max length in Python 2.7?

How to remove a string from a list of strings if its length is lower than the length of the string with max length in Python 2.7?
Basically, if I have a list such as:
test = ['cat', 'dog', 'house', 'a', 'range', 'abc']
max_only(test)
The output should be:
['house', 'range']
The length of 'cat' is 3, 'dog' is 3, 'house' is 5, 'a' is 1, 'range' is 5, and 'abc' is 3. The strings with the greatest length are 'house' and 'range', so they're returned.
I tried with something like this but, of course, it doesn't work :)
def max_only(lst):
    ans_lst = []
    for i in lst:
        ans_lst.append(len(i))
    for k in range(len(lst)):
        if len(i) < max(ans_lst):
            lst.remove(lst[ans_lst.index(max(ans_lst))])
    return lst
Could you help me?
Thank you.
EDIT: What about the same thing for the min length element?
Use a list comprehension and max:
>>> test = ['cat', 'dog', 'house', 'a', 'range', 'abc']
>>> max_ = max(len(x) for x in test) #Find the length of longest string.
>>> [x for x in test if len(x) == max_] #Filter out all strings that are not equal to max_
['house', 'range']
A solution that loops just once:
def max_only(lst):
    result, maxlen = [], -1
    for item in lst:
        itemlen = len(item)
        if itemlen == maxlen:
            result.append(item)
        elif itemlen > maxlen:
            result[:], maxlen = [item], itemlen
    return result
max(iterable) has to loop through the whole list once, and a list comprehension picking out items of matching length has to loop through the list again. The above version loops through the input list just once.
If your input list is not a sequence but an iterator, this algorithm will still work while anything that has to use max() won't; it'd have exhausted the iterator just to find the maximum length.
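The difference is easy to demonstrate with an iterator. In this sketch the single-pass function is compared with a max()-plus-list-comprehension version (named max_listcomp here), which silently returns an empty list because max() has already consumed the iterator:

```python
def max_only(lst):
    result, maxlen = [], -1
    for item in lst:
        itemlen = len(item)
        if itemlen == maxlen:
            result.append(item)
        elif itemlen > maxlen:
            result[:], maxlen = [item], itemlen
    return result

def max_listcomp(lst):
    max_ = max(len(x) for x in lst)            # consumes the whole iterator
    return [x for x in lst if len(x) == max_]  # nothing left to iterate over

test = ['cat', 'dog', 'house', 'a', 'range', 'abc']
print(max_only(iter(test)))      # ['house', 'range']
print(max_listcomp(iter(test)))  # []
```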
Timing comparison on 100 random words of length 0 to 9, repeated 1 million times:
>>> import timeit
>>> import random
>>> import string
>>> words = [''.join([random.choice(string.ascii_lowercase) for _ in range(1, random.randrange(11))]) for _ in range(100)]
>>> def max_only(lst):
...     result, maxlen = [], -1
...     for item in lst:
...         itemlen = len(item)
...         if itemlen == maxlen:
...             result.append(item)
...         elif itemlen > maxlen:
...             result[:], maxlen = [item], itemlen
...     return result
...
>>> timeit.timeit('f(words)', 'from __main__ import max_only as f, words')
23.173006057739258
>>> def max_listcomp(lst):
...     max_ = max(len(x) for x in lst)
...     return [x for x in lst if len(x) == max_]
...
>>> timeit.timeit('f(words)', 'from __main__ import max_listcomp as f, words')
36.34060215950012
Replacing result.append() with a cached r_append = result.append outside the for loop shaves off another 2 seconds:
>>> def max_only(lst):
...     result, maxlen = [], -1
...     r_append = result.append
...     for item in lst:
...         itemlen = len(item)
...         if itemlen == maxlen:
...             r_append(item)
...         elif itemlen > maxlen:
...             result[:], maxlen = [item], itemlen
...     return result
...
>>> timeit.timeit('f(words)', 'from __main__ import max_only as f, words')
21.21125817298889
And by popular request, a min_only() version:
def min_only(lst):
    result, minlen = [], float('inf')
    r_append = result.append
    for item in lst:
        itemlen = len(item)
        if itemlen == minlen:
            r_append(item)
        elif itemlen < minlen:
            result[:], minlen = [item], itemlen
    return result
More fun still, a completely different tack: sorting on length:
from itertools import groupby
def max_only(lst):
    return list(next(groupby(sorted(lst, key=len, reverse=True), key=len))[1])[::-1]
def min_only(lst):
    return list(next(groupby(sorted(lst, key=len), key=len))[1])
These work by sorting by length, then picking out the first group of words with equal length. For max_only() we need to sort in reverse, then re-reverse the result. Sorting has an O(N log N) cost, making this less efficient than the O(2N) solutions in other answers here or my O(N) solution above:
>>> timeit.timeit('f(words)', 'from __main__ import max_only_sorted as f, words')
52.725801944732666
Still, the sorting approach gives you a fun one-liner.
You can use max() which returns the largest item in the list.
>>> len_max = len(max(test, key=len))
>>> [x for x in test if len(x) == len_max]
['house', 'range']
If you then take all the strings that have the same length as the element you get the desired result.
>>> test = ['cat', 'dog', 'house', 'a', 'range', 'abc']
>>> filter(lambda x,m=max(map(len, test)):len(x)==m, test)
['house', 'range']
For Python3.x you would need to use list(filter(...))
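A sketch of the Python 3 form, with the maximum length computed once up front rather than as a lambda default (the variable names longest and result are illustrative):

```python
test = ['cat', 'dog', 'house', 'a', 'range', 'abc']

longest = max(map(len, test))
# filter() returns a lazy iterator in Python 3, so wrap it in list()
result = list(filter(lambda x: len(x) == longest, test))
print(result)  # ['house', 'range']
```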
This works:
max_len = len(max(test, key=len))
result = [word for word in test if len(word) == max_len]
