Determining length and position of character repeats in string - python

Assume a string s that may contain several adjacent occurrences of dashes. For the sake of simplicity, let's call each of these occurrences a "repeat motive". For example, the following string s contains five repeat motives of dashes, namely of length 3,2,6,5 and 1.
s = "abcde---fghij--klmnopq------rstuvw-----xy-z"
I am trying to come up with Python code that returns the respective length and the respective position within the string of each of the repeat motives. Preferentially, the code returns a list of tuples, with each tuple being of format (length, position).
sought_function(s)
# [(3,5), (2,13), (6,22), (5,34), (1,41)]
Would you have any suggestions as to how to start this code?

You can use groupby:
s = "abcde---fghij--klmnopq------rstuvw-----xy-z"
from itertools import groupby
[(next(g)[0], sum(1 for _ in g) + 1) for k, g in groupby(enumerate(s), lambda x: x[1]) if k == "-"]
# [(5, 3), (13, 2), (22, 6), (34, 5), (41, 1)]
Or, as @Willem commented, replace the sum with len:
[(next(g)[0], len(list(g)) + 1) for k, g in groupby(enumerate(s), lambda x: x[1]) if k == "-"]
# [(5, 3), (13, 2), (22, 6), (34, 5), (41, 1)]

If you want to write your own function, simply iterate over the characters and keep track of the current run length; when the run is cut off, record it:
def find_sequences(s, to_find):
    result = []
    lng = 0
    for i, c in enumerate(s):
        if c == to_find:
            lng += 1
        else:
            if lng:
                result.append((lng, i - lng))
            lng = 0
    # a run that reaches the end of the string starts at len(s) - lng
    if lng:
        result.append((lng, len(s) - lng))
    return result
so s is the string and to_find is the character you are interested in (here '-').
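Since the description above talks about yielding each run, here is a minimal generator sketch of the same idea. The name iter_runs is only illustrative; it produces (length, position) tuples and also handles a run that reaches the end of the string.

```python
def iter_runs(s, to_find):
    """Yield (length, position) for each run of to_find in s."""
    run_length = 0
    for i, c in enumerate(s):
        if c == to_find:
            run_length += 1
        elif run_length:
            # the run ended just before index i
            yield (run_length, i - run_length)
            run_length = 0
    if run_length:
        # the string ended while a run was still open
        yield (run_length, len(s) - run_length)


s = "abcde---fghij--klmnopq------rstuvw-----xy-z"
print(list(iter_runs(s, "-")))
# [(3, 5), (2, 13), (6, 22), (5, 34), (1, 41)]
```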

If using numpy is fine:
import numpy as np
a = "abcde---fghij--klmnopq------rstuvw-----xy-z"
bool_vec = np.array([letter == "-" for letter in a])
dots = np.where(np.diff(bool_vec)!=0)[0] + 1
number = np.diff(dots.reshape((-1,2)),1).ravel()
idx = dots[::2]
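Here number holds the run lengths and idx the start positions. Note that the reshape assumes the string neither starts nor ends with a dash, so that the change points pair up.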

You could do re.split("(-+)", s) which will return a list of ["abcde", "---", ...], and then iterate over that.
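To make the regex idea concrete, here is a minimal sketch that uses re.finditer rather than re.split, because each match object already carries its own start and end. The function name dash_runs and its char parameter are illustrative and not part of the answers above.

```python
import re


def dash_runs(s, char="-"):
    """Return (length, position) for every run of char in s."""
    pattern = re.escape(char) + "+"  # one or more of the literal character
    return [(m.end() - m.start(), m.start()) for m in re.finditer(pattern, s)]


s = "abcde---fghij--klmnopq------rstuvw-----xy-z"
print(dash_runs(s))
# [(3, 5), (2, 13), (6, 22), (5, 34), (1, 41)]
```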

Here would be my suggestion for this:
import re
s = "abcde---fghij--klmnopq------rstuvw-----xy-z"
list1 = []
for x in re.findall("[a-z]*-", s):
    temp = x.strip("-")
    if len(temp) > 0:
        list1.append(temp)
print(list1)

Related

How to find if a word occurs twice, and a number occurs once in a tuple

data = ((45, 'foot'), (21, 'basket'), (10, 'hand'), (24, 'foot'), (21, 'hand'))
def unique_data_items(data):
    # input data is made of ((int, string), (int, string), ...)
    unique_nums = ()  # initialising the tuple
    unique_words = ()
    # Add code to fill the tuples unique_nums and unique_words with
    # numbers and words that are unique
    # returns the pair (tuple) of the numbers of unique numbers and
    # words
How do I complete the code so that it returns the word which occurs once and the number which occurs twice? I have attempted it but cannot figure out how to do this. Thank you.
d = {}
for x, y in data:
    if y not in d:
        d[y] = x
    else:
        d[y] += x
unique = tuple(d.items())
I think this is what you want. The output of this code is:
(('foot', 69), ('basket', 21), ('hand', 31))
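If the goal is instead the numbers and words that occur exactly once (one possible reading of the question), counting occurrences with collections.Counter is a natural fit. This is only a sketch under that assumption; the function body is mine, not the asker's.

```python
from collections import Counter

data = ((45, 'foot'), (21, 'basket'), (10, 'hand'), (24, 'foot'), (21, 'hand'))


def unique_data_items(data):
    """Return (unique_nums, unique_words): items that occur exactly once."""
    num_counts = Counter(num for num, _ in data)
    word_counts = Counter(word for _, word in data)
    unique_nums = tuple(n for n, count in num_counts.items() if count == 1)
    unique_words = tuple(w for w, count in word_counts.items() if count == 1)
    return unique_nums, unique_words


print(unique_data_items(data))
# ((45, 10, 24), ('basket',))
```

Changing the test from count == 1 to count == 2 would give the items that occur twice.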

find all occurrences of a string and its indices in nested list python

I have a nested list in the following format:
[['john'],['jack','john','mary'],['howard','john'],['jude']...]
I want to find the first 3 or 5 indices of john that occurs in the nested list(since the list is really long) and return the indices like:
(0,0),(1,1),(2,1) or in any format which is advisable.
I'm fairly new to nested list. Any help would be much appreciated.
Question 1: Here is one way using a nested comprehension list. I will however look if there is a dupe.
nested_list = [['john'],['jack','john','mary'],['howard','john'],['jude']]
out = [(ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if y == 'john']
print(out)
Returns: [(0, 0), (1, 1), (2, 1)]
Update: Something similar found here Finding the index of an element in nested lists in python. The answer however only takes the first value which could be translated into:
out = next(((ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if y == 'john'),None)
print(out) # (0,0)
Question 2: (from comment)
Yes this is quite easy by editing y == 'john' to: 'john' in y.
nested_list = [['john xyz'],['jack','john dow','mary'],['howard','john'],['jude']]
out = [(ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if 'john' in y]
print(out)
Returns: [(0, 0), (1, 1), (2, 1)]
Question 3: (from comment)
A convenient way to get only the first N elements, without building the full list, is itertools.islice from Python's standard library:
import itertools
nested_list = [['john xyz'],['jack','john dow','mary'],['howard','john'],['jude']]
gen = ((ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if 'john' in y)
out = list(itertools.islice(gen, 2)) # <-- Next 2
print(out)
Returns: [(0, 0), (1, 1)]
This is also answered here: How to take the first N items from a generator or list in Python?
Question 3 extended:
And say now that you want to take them in chunks of N, then you can do this:
import itertools
nested_list = [['john xyz'],['jack','john dow','mary'],['howard','john'],['jude']]
gen = ((ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if 'john' in y)
f = lambda x: list(itertools.islice(x, 2)) # Take two elements from generator
print(f(gen)) # calls the lambda function asking for 2 elements from gen
print(f(gen)) # calls the lambda function asking for 2 elements from gen
print(f(gen)) # calls the lambda function asking for 2 elements from gen
Returns:
[(0, 0), (1, 1)]
[(2, 1)]
[]

Python: how to replace substrings in a string given list of indices

I have a string:
"A XYZ B XYZ C"
and a list of index-tuples:
((2, 5), (8, 11))
I would like to apply a replacement of each substring defined by indices by the sum of them:
A 7 B 19 C
I can't do it using string replace as it will match both instances of XYZ. Replacing using index information will break on the second and subsequent iterations, as the indices shift throughout the process.
Is there a nice solution for the problem?
UPDATE. String is given for example. I don't know its contents a priori nor can I use them in the solution.
My dirty solution is:
text = "A XYZ B XYZ C"
replace_list = ((2, 5), (8, 11))
offset = 0
for rpl in replace_list:
    l = rpl[0] + offset
    r = rpl[1] + offset
    replacement = str(r + l)
    text = text[0:l] + replacement + text[r:]
    offset += len(replacement) - (r - l)
This relies on the index tuples being in ascending order. Could it be done more nicely?
Imperative and stateful:
s = 'A XYZ B XYZ C'
indices = ((2, 5), (8, 11))
res = []
i = 0
for start, end in indices:
    res.append(s[i:start] + str(start + end))
    i = end
res.append(s[i:])  # the tail after the last replaced span
print(''.join(res))
Result:
A 7 B 19 C
You can use re.sub():
In [17]: s = "A XYZ B XYZ C"
In [18]: ind = ((2, 5), (8, 11))
In [19]: inds = map(sum, ind)
In [20]: re.sub(r'XYZ', lambda _: str(next(inds)), s)
Out[20]: 'A 7 B 19 C'
But note that if the number of matches is larger than your index pairs it will raise a StopIteration error. In that case you can pass a default argument to the next() to replace the sub-string with.
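As a small sketch of that default-argument idea: passing the matched text itself as the default to next() leaves any surplus matches unchanged. The helper name replace_with_sums and the longer sample string are illustrative assumptions.

```python
import re


def replace_with_sums(s, pattern, index_pairs):
    """Replace successive matches of pattern with the sum of each index pair."""
    sums = map(sum, index_pairs)  # lazy iterator over the replacement values
    # When the sums run out, next() falls back to the matched text itself.
    return re.sub(pattern, lambda m: str(next(sums, m.group(0))), s)


print(replace_with_sums("A XYZ B XYZ C XYZ", r"XYZ", ((2, 5), (8, 11))))
# A 7 B 19 C XYZ
```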
If you want to use the tuples of indices for finding the sub strings, here is another solution:
In [81]: flat_ind = tuple(i for sub in ind for i in sub)
# Create all the pairs with respect to your intended indices.
In [82]: inds = [(0, ind[0][0]), *zip(flat_ind, flat_ind[1:]), (ind[-1][-1], len(s))]
# Replace the respective slice of the string with the sum of its indices if the pair is one of the intended pairs; otherwise keep the sub-string itself.
In [85]: ''.join([str(i+j) if (i, j) in ind else s[i:j] for i, j in inds])
Out[85]: 'A 7 B 19 C'
One way to do this using itertools.groupby.
from itertools import groupby
indices = ((2, 5), (8, 11))
data = list("A XYZ B XYZ C")
We start with replacing the range of matched items with equal number of None.
for a, b in indices:
    data[a:b] = [None] * (b - a)
print(data)
# ['A', ' ', None, None, None, ' ', 'B', ' ', None, None, None, ' ', 'C']
Then we loop over the grouped data and replace the None groups with the sum from the indices list.
it = iter(indices)
output = []
for k, g in groupby(data, lambda x: x is not None):
    if k:
        output.extend(g)
    else:
        output.append(str(sum(next(it))))
print(''.join(output))
# A 7 B 19 C
Here's a quick and slightly dirty solution using string formatting and tuple unpacking:
s = 'A XYZ B XYZ C'
reps = ((2, 5), (8, 11))
totals = (sum(r) for r in reps)
print(s.replace('XYZ','{}').format(*totals))
This prints:
A 7 B 19 C
First, we use a generator expression to find the totals for each of our replacements. Then, by replacing 'XYZ' with '{}' we can use string formatting - *totals will ensure we get the totals in the correct order.
Edit
I didn't realise the indices were actually string indices - my bad. To do this, we could use re.sub as follows:
import re
s = 'A XYZ B XYZ C'
reps = ((2, 5), (8, 11))
for a, b in reps:
    s = s[:a] + '~'*(b-a) + s[b:]
totals = (sum(r) for r in reps)
print(re.sub(r'(~+)', r'{}', s).format(*totals))
Assuming there are no tildes (~) used in your string - if there are, replace with a different character. This also assumes none of the "replacement" groups are consecutive.
Assuming there are no overlaps then you could do it in reverse order
text = "A XYZ B XYZ C"
replace_list = ((2, 5), (8, 11))
for start, end in reversed(replace_list):
    text = f'{text[:start]}{start + end}{text[end:]}'
# A 7 B 19 C
Here's a reversed-order list-slice assignment solution:
text = "A XYZ B XYZ C"
indices = ((2, 5), (8, 11))
chars = list(text)
for start, end in reversed(indices):
    chars[start:end] = str(start + end)
text = ''.join(chars) # A 7 B 19 C
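The reverse-order approaches work because replacing a later span never shifts the indices of an earlier one. A minimal sketch that also sorts the spans and checks the no-overlap assumption explicitly might look like this; replace_spans is a hypothetical helper name.

```python
def replace_spans(text, spans):
    """Replace each (start, end) slice of text with str(start + end)."""
    ordered = sorted(spans)
    # Verify the assumption that no two spans overlap.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("spans overlap")
    # Work from the right so earlier indices stay valid.
    for start, end in reversed(ordered):
        text = text[:start] + str(start + end) + text[end:]
    return text


print(replace_spans("A XYZ B XYZ C", ((8, 11), (2, 5))))
# A 7 B 19 C
```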
There is also a solution which does exactly what you want.
I have not worked it out completely, but you may want to use:
re.sub() from the re library.
Look here, and look for the functions re.sub() or re.subn():
https://docs.python.org/2/library/re.html
If I have time, I will work out your example later today.
Yet another itertools solution
from itertools import *
s = "A XYZ B XYZ C"
inds = ((2, 5), (8, 11))
res = 'A 7 B 19 C'
inds = list(chain([0], *inds, [len(s)]))
res_ = ''.join(s[i:j] if k % 2 == 0 else str(i + j)
for k, (i,j) in enumerate(zip(inds, inds[1:])))
assert res == res_
Anticipating that if these pairs-of-integer selections are useful here they will also be useful in other places, I would probably do something like this:
def make_selections(data, selections):
    start = 0
    # sorted(selections) if you don't want to require the caller to provide them in order
    for selection in selections:
        yield None, data[start:selection[0]]
        yield selection, data[selection[0]:selection[1]]
        start = selection[1]
    yield None, data[start:]
def replace_selections_with_total(data, selections):
    return ''.join(
        str(selection[0] + selection[1]) if selection else value
        for selection, value in make_selections(data, selections)
    )
This still relies on the selections not overlapping, but I'm not sure what it would even mean for them to overlap.
You could then make the replacement itself more flexible too:
def replace_selections(data, selections, replacement):
    return ''.join(
        replacement(selection, value) if selection else value
        for selection, value in make_selections(data, selections)
    )
def replace_selections_with_total(data, selections):
    return replace_selections(data, selections, lambda s,_: str(s[0]+s[1]))

split a list of strings at positions they match a different list of strings

I wrote a small program to do the following. I'm wondering if there is an obviously more optimal solution:
1) Take 2 lists of strings. In general, the strings in the second list will be longer than in the first list, but this is not guaranteed
2) Return a list of strings derived from the second list that has removed any matching strings from the first list. The list will therefore contain strings that are <= the length of the strings in the second list.
Below I've displayed a picture example of what I'm talking about:
So far, this is what I have. It seems to be working fine, but I'm just curious if there is a more elegant solution that I'm missing. By the way, I'm keeping track of the "positions" of each start and end of the string, which is important for a later part of this program.
def split_sequence(sequence = "", split_seq = "", length = 8):
    if len(sequence) < len(split_seq):
        return [],[]
    split_positions = [0]
    for pos in range(len(sequence)-len(split_seq)):
        if sequence[pos:pos+len(split_seq)] == split_seq and pos > split_positions[-1]:
            split_positions += [pos, pos+len(split_seq)]
    if split_positions[-1] == 0:
        return [sequence], [(0,len(sequence)-1)]
    split_positions.append(len(sequence))
    assert len(split_positions) % 2 == 0
    split_sequences = [sequence[split_positions[_]:split_positions[_+1]] for _ in range(0, len(split_positions),2)]
    split_seq_positions = [(split_positions[_],split_positions[_+1]) for _ in range(0, len(split_positions),2)]
    return_sequences = []
    return_positions = []
    for pos,seq in enumerate(split_sequences):
        if len(seq) >= length:
            return_sequences.append(split_sequences[pos])
            return_positions.append(split_seq_positions[pos])
    return return_sequences, return_positions
def create_sequences_from_targets(sequence_list = [] , positions_list = [],length=8, avoid = []):
    if avoid:
        for avoided_seq in avoid:
            new_sequence_list = []
            new_positions_list = []
            for pos,sequence in enumerate(sequence_list):
                start = positions_list[pos][0]
                seqs, positions = split_sequence(sequence = sequence, split_seq = avoided_seq, length = length)
                new_sequence_list += seqs
                new_positions_list += [(positions[_][0]+start,positions[_][1]+start) for _ in range(len(positions))]
    return new_sequence_list, new_positions_list
A Sample output:
In [60]: create_sequences_from_targets(sequence_list=['MPHSSLHPSIPCPRGHGAQKA', 'AEELRHIHSRYRGSYWRTVRA', 'KGLAPAEISAVCEKGNFNVA'],positions_list=[(0, 20), (66, 86), (136, 155)],avoid=['SRYRGSYW'],length=3)
Out[60]:
(['MPHSSLHPSIPCPRGHGAQKA', 'AEELRHIH', 'RTVRA', 'KGLAPAEISAVCEKGNFNVA'],
[(0, 20), (66, 74), (82, 87), (136, 155)])
Let's define this string, s, and this list list1 of strings to remove:
>>> s = 'NowIsTheTimeForAllGoodMenToComeToTheAidOfTheParty'
>>> list1 = 'The', 'Good'
Now, let's remove those strings:
>>> import re
>>> re.split('|'.join(list1), s)
['NowIs', 'TimeForAll', 'MenToComeTo', 'AidOf', 'Party']
One of the powerful features of the above is that the strings in list1 can contain regex-active characters. That may also be undesirable. As John La Rooy points out in the comments, the strings in list1 can be made inactive with:
>>> re.split('|'.join(re.escape(x) for x in list1), s)
['NowIs', 'TimeForAll', 'MenToComeTo', 'AidOf', 'Party']
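To illustrate why escaping matters, here is a small sketch with a separator that happens to be a regex metacharacter. The helper name split_on_literals is hypothetical, and sorting the literals longest-first is an extra choice of mine so that a longer separator is tried before any shorter one that is its prefix.

```python
import re


def split_on_literals(text, literals):
    """Split text on any of the given strings, treated as plain text."""
    ordered = sorted(literals, key=len, reverse=True)  # longest first
    pattern = "|".join(re.escape(x) for x in ordered)
    return re.split(pattern, text)


# Unescaped, "." matches every character, so nothing useful is left:
print(re.split(".", "v1.2"))              # ['', '', '', '', '']
# Escaped, only the literal dot is used as a separator:
print(split_on_literals("v1.2", ["."]))   # ['v1', '2']
```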
Using regular expressions simplifies the code, but it may or may not be more efficient.
>>> import re
>>> sequence_list = ['MPHSSLHPSIPCPRGHGAQKA', 'AEELRHIHSRYRGSYWRTVRA', 'KGLAPAEISAVCEKGNFNVA']
>>> positions_list = [(0, 20), (66, 86), (136, 155)]
>>> avoid = ['SRYRGSYW']
>>> rex = re.compile("|".join(map(re.escape, avoid)))
get the positions like this (you'll need to add your offsets to these)
>>> [[j.span() for j in rex.finditer(i)] for i in sequence_list]
[[], [(8, 16)], []]
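Adding the offsets could look like the following sketch, which assumes the first element of each tuple in positions_list is the start offset of the corresponding sequence (as in the question). The name absolute_spans is illustrative.

```python
import re

sequence_list = ['MPHSSLHPSIPCPRGHGAQKA', 'AEELRHIHSRYRGSYWRTVRA', 'KGLAPAEISAVCEKGNFNVA']
positions_list = [(0, 20), (66, 86), (136, 155)]
avoid = ['SRYRGSYW']
rex = re.compile("|".join(map(re.escape, avoid)))

# Shift each match span by the start offset of the sequence it was found in.
absolute_spans = [
    [(m.start() + offset, m.end() + offset) for m in rex.finditer(seq)]
    for seq, (offset, _) in zip(sequence_list, positions_list)
]
print(absolute_spans)
# [[], [(74, 82)], []]
```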
get the new strings like this
>>> [rex.split(i) for i in sequence_list]
[['MPHSSLHPSIPCPRGHGAQKA'], ['AEELRHIH', 'RTVRA'], ['KGLAPAEISAVCEKGNFNVA']]
or the flattened list
>>> [j for i in sequence_list for j in rex.split(i)]
['MPHSSLHPSIPCPRGHGAQKA', 'AEELRHIH', 'RTVRA', 'KGLAPAEISAVCEKGNFNVA']

Merging a list of time-range tuples that have overlapping time-ranges

I have a list of tuples where each tuple is a (start-time, end-time). I am trying to merge all overlapping time ranges and return a list of distinct time ranges.
For example
[(1, 5), (2, 4), (3, 6)] ---> [(1,6)]
[(1, 3), (2, 4), (5, 8)] ---> [(1, 4), (5,8)]
Here is how I implemented it.
# Algorithm
# initialranges: [(a,b), (c,d), (e,f), ...]
# First we sort each tuple then whole list.
# This will ensure that a<b, c<d, e<f ... and a < c < e ...
# BUT the order of b, d, f ... is still random
# Now we have only 3 possibilities
#================================================
# b<c<d:  a-------b            Ans: [(a,b),(c,d)]
#                   c---d
# c<=b<d: a-------b            Ans: [(a,d)]
#               c---d
# c<d<b:  a-------b            Ans: [(a,b)]
#           c---d
#================================================
def mergeoverlapping(initialranges):
    i = sorted(set([tuple(sorted(x)) for x in initialranges]))
    # initialize final ranges to [(a,b)]
    f = [i[0]]
    for c, d in i[1:]:
        a, b = f[-1]
        if c<=b<d:
            f[-1] = a, d
        elif b<c<d:
            f.append((c,d))
        else:
            # else case included for clarity. Since
            # we already sorted the tuples and the list
            # only remaining possibility is c<d<b
            # in which case we can silently pass
            pass
    return f
I am trying to figure out:
Is there a built-in function in some Python module that can do this more efficiently? or
Is there a more Pythonic way of accomplishing the same goal?
Your help is appreciated. Thanks!
A few ways to make it more efficient and Pythonic:
Eliminate the set() construction, since the algorithm should prune out duplicates in the main loop.
If you just need to iterate over the results, use yield to generate the values.
Reduce construction of intermediate objects, for example: move the tuple() call to the point where the final values are produced, saving you from having to construct and throw away extra tuples, and reuse a list saved for storing the current time range for comparison.
Code:
def merge(times):
    # sort first so that the saved range starts from the earliest tuple
    ordered = sorted(sorted(t) for t in times)
    saved = list(ordered[0])
    for st, en in ordered:
        if st <= saved[1]:
            saved[1] = max(saved[1], en)
        else:
            yield tuple(saved)
            saved[0] = st
            saved[1] = en
    yield tuple(saved)
data = [
    [(1, 5), (2, 4), (3, 6)],
    [(1, 3), (2, 4), (5, 8)]
]
for times in data:
    print(list(merge(times)))
Sort tuples then list, if t1.right>=t2.left => merge
and restart with the new list, ...
-->
def f(l, sort = True):
    if sort:
        sl = sorted(tuple(sorted(i)) for i in l)
    else:
        sl = l
    if len(sl) > 1:
        if sl[0][1] >= sl[1][0]:
            sl[0] = (sl[0][0], sl[1][1])
            del sl[1]
            if len(sl) < len(l):
                return f(sl, False)
    return sl
The sort part: use standard sorting, it compares tuples the right way already.
sorted_tuples = sorted(initial_ranges)
The merge part. It eliminates duplicate ranges, too, so no need for a set. Suppose you have current_tuple and next_tuple.
c_start, c_end = current_tuple
n_start, n_end = next_tuple
if n_start <= c_end:
    merged_tuple = min(c_start, n_start), max(c_end, n_end)
I hope the logic is clear enough.
To peek next tuple, you can use indexed access to sorted tuples; it's a wholly known sequence anyway.
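Putting the sort and merge parts together, a complete sketch could look like this. Instead of peeking ahead by index it compares each tuple with the last merged one, which is a variation on the description above; the function name merge_ranges is illustrative.

```python
def merge_ranges(initial_ranges):
    """Merge overlapping (start, end) tuples into a sorted list of ranges."""
    merged = []
    for n_start, n_end in sorted(initial_ranges):
        if merged and n_start <= merged[-1][1]:
            # Overlaps (or touches) the current range: extend it.
            c_start, c_end = merged[-1]
            merged[-1] = (c_start, max(c_end, n_end))
        else:
            # Starts after the current range ends: open a new one.
            merged.append((n_start, n_end))
    return merged


print(merge_ranges([(1, 5), (2, 4), (3, 6)]))  # [(1, 6)]
print(merge_ranges([(1, 3), (2, 4), (5, 8)]))  # [(1, 4), (5, 8)]
```

Because the list is sorted, c_start is already the minimum, so only the end needs the max().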
Sort all boundaries then take all pairs where a boundary end is followed by a boundary start.
def mergeOverlapping(initialranges):
    def allBoundaries():
        for r in initialranges:
            yield r[0], True
            yield r[1], False
    def getBoundaries(boundaries):
        yield boundaries[0][0]
        for i in range(1, len(boundaries) - 1):
            if not boundaries[i][1] and boundaries[i + 1][1]:
                yield boundaries[i][0]
                yield boundaries[i + 1][0]
        yield boundaries[-1][0]
    return getBoundaries(sorted(allBoundaries()))
Hm, not that beautiful but was fun to write at least!
EDIT: Years later, after an upvote, I realised my code was wrong! This is the new version just for fun:
def mergeOverlapping(initialRanges):
    def allBoundaries():
        for r in initialRanges:
            yield r[0], -1
            yield r[1], 1
    def getBoundaries(boundaries):
        openrange = 0
        for value, boundary in boundaries:
            if not openrange:
                yield value
            openrange += boundary
            if not openrange:
                yield value
    def outputAsRanges(b):
        while b:
            yield (b.next(), b.next())
    return outputAsRanges(getBoundaries(sorted(allBoundaries())))
Basically I mark the boundaries with -1 or 1 and then sort them by value and only output them when the balance between open and closed braces is zero.
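The same balance-counting idea can be written for Python 3 without nested generators. This is a sketch rather than the answer's own code, and the name merge_by_boundaries is hypothetical. Starts are tagged -1 and ends +1 as above, so at equal values a start sorts before an end and touching ranges are merged.

```python
def merge_by_boundaries(ranges):
    """Merge overlapping (start, end) ranges by tracking open boundaries."""
    events = sorted(
        [(start, -1) for start, _ in ranges] + [(end, 1) for _, end in ranges]
    )
    merged = []
    balance = 0          # zero means no range is currently open
    current_start = None
    for value, kind in events:
        if balance == 0:
            current_start = value  # a new merged range opens here
        balance += kind
        if balance == 0:
            merged.append((current_start, value))  # the range just closed
    return merged


print(merge_by_boundaries([(1, 5), (2, 4), (3, 6)]))  # [(1, 6)]
print(merge_by_boundaries([(1, 3), (2, 4), (5, 8)]))  # [(1, 4), (5, 8)]
```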
Late, but might help someone looking for this. I had a similar problem but with dictionaries. Given a list of time ranges, I wanted to find overlaps and merge them when possible. A little modification to @samplebias's answer led me to this:
Merge function:
def merge_range(ranges: list, start_key: str, end_key: str):
    ranges = sorted(ranges, key=lambda x: x[start_key])
    saved = dict(ranges[0])
    for range_set in sorted(ranges, key=lambda x: x[start_key]):
        if range_set[start_key] <= saved[end_key]:
            saved[end_key] = max(saved[end_key], range_set[end_key])
        else:
            yield dict(saved)
            saved[start_key] = range_set[start_key]
            saved[end_key] = range_set[end_key]
    yield dict(saved)
Data:
data = [
{'start_time': '09:00:00', 'end_time': '11:30:00'},
{'start_time': '15:00:00', 'end_time': '15:30:00'},
{'start_time': '11:00:00', 'end_time': '14:30:00'},
{'start_time': '09:30:00', 'end_time': '14:00:00'}
]
Execution:
print(list(merge_range(ranges=data, start_key='start_time', end_key='end_time')))
Output:
[
{'start_time': '09:00:00', 'end_time': '14:30:00'},
{'start_time': '15:00:00', 'end_time': '15:30:00'}
]
When using Python 3.7, following the suggestion given in “RuntimeError: generator raised StopIteration” every time I try to run app, the method outputAsRanges from @UncleZeiv should be:
def outputAsRanges(b):
    while b:
        try:
            yield (next(b), next(b))
        except StopIteration:
            return
