Consecutive values in strings, getting indices - python

The following is a Python string a little over 1,000 characters long.
string1 = "XXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
len(string1) ## 1311
I would like to know the index of where the consecutive X's end and the non-X characters begin. Reading this string from left to right, the first non-X character is at index location 22, and the first non-X character from the right is at index location 1306.
How does one find these indices?
My guess would be:
for x in string1:
    if x != "X":
        print(string1.index(x))
The problem with this is it outputs all indices that are not X. It does not give me the index where the consecutive X's end.
Even more confusing for me is how to "check" for consecutive X's. Let's say I have this string:
string2 = "XXXXAAXAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
Here, the consecutive X's end at index 4, not index 7. How could I check several characters ahead whether this is really no longer consecutive?

Using a regex, split off the first and last groups of X's, then use their lengths to construct the indices.
import re
mystr = 'XXXXAAXAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX'
xs = re.split('[^X]+', mystr)  # split on every run of non-X characters
indices = (len(xs[0]), len(mystr) - len(xs[-1]) - 1)
# (4, 47)
I simply need the outputs for the indices. I'm then going to put them in randint(first_index, second_index)
It's possible to pass the indices to the function like this:
randint(*indices)
However, I suspect that you want to use the output of randint(first_index, last_index) to select a random character from the middle. If so, this would be a shorter alternative:
from random import choice
randchar = choice(mystr.strip('X'))

If I understood your question correctly, you can just do:
def getIndexs(string):
    lst = []
    flag = False
    for i, char in enumerate(string):
        if char == "x":
            flag = True
        if (char != "x") and flag:
            # a run of x's has just ended at the previous position
            lst.append(i-1)
            flag = False
    return lst
print(getIndexs("xxxxbbbxxxxaaaxxxbb"))
[3, 10, 16]
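The same scan can also be sketched with itertools.groupby, which groups consecutive equal characters. The function name run_ends and its parameters are made up for this illustration. Unlike getIndexs above, this version also reports a run that reaches the very end of the string.

```python
from itertools import groupby

def run_ends(text, ch="x"):
    """Return the index of the last character of every run of `ch` in `text`."""
    ends = []
    pos = 0  # number of characters consumed so far
    for key, group in groupby(text):
        pos += sum(1 for _ in group)  # length of this run
        if key == ch:
            ends.append(pos - 1)
    return ends

print(run_ends("xxxxbbbxxxxaaaxxxbb"))  # [3, 10, 16]
```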

If the sequences are, as you say, only in the beginning and at the end of your string, a simple loop / reversed loop would suffice:
string1 = "XXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
left_index = 0
for char in string1:
    left_index += 1
    if char != "X":
        break
right_index = len(string1)
for char in reversed(string1):
    if char != "X":
        break
    right_index -= 1
print(left_index)   # 22 (1-based position of the first non-X character)
print(right_index)  # 65 (index just past the last non-X character)

Regex can look ahead and identify characters that don't match the pattern:
>>> import re
>>> [match.span() for match in re.finditer(r'X{2,}((?=[^X])|$)', string2)]
[(0, 4), (48, 53)]
Breaking this down:
X - the character we're matching
{2,} - need to see at least two in a row to consider a match
((?=[^X])|$) - either of two conditions will satisfy the match:
    (?=[^X]) - lookahead for anything but an X
    $ - the end of the string
As a result, finditer returns each instance where there are multiple X's followed by a non-X or the end of the string. match.span() extracts the position of each match within the string.
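If only the leading and trailing runs matter, anchored patterns may be enough. The following is a minimal sketch rather than a definitive implementation; the function name x_run_bounds and its ch parameter are invented for the example. It returns the index of the first non-X character and the index of the last non-X character.

```python
import re

def x_run_bounds(text, ch="X"):
    """Return (first_non_ch_index, last_non_ch_index) for `text`."""
    run = re.escape(ch) + "*"
    # re.match is anchored at the start, so .end() is the length of the leading run.
    first = re.match(run, text).end()
    # \Z anchors at the very end, so .start() is where the trailing run begins.
    trailing_start = re.search(run + r"\Z", text).start()
    return first, trailing_start - 1

print(x_run_bounds("XXABCDXXXEFGHXXXXX"))  # (2, 12)
```

For a string made only of X's the two values cross over (first equals the length, last is -1), which a caller would need to check for.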

This will give you the first index and last index (of non-'X' character).
s = 'XXABCDXXXEFGHXXXXX'
first_index = len(s) - len(s.lstrip('X'))
last_index = len(s.rstrip('X')) - len(s) - 1
print(first_index, last_index)
2 -6
How it works:
For first_index:
We strip all the 'X' characters at the beginning of our string. Finding the difference in length between the original and shortened string gives us the index of the first non-'X' character.
For last_index:
Similarly, we strip the 'X' characters at the end of our string. We also subtract 1 from the difference, since reverse indexing in Python starts from -1.
Note:
If you just want to randomly select one of the characters between first_index and last_index, you can do:
import random
shortened_s = s.strip('X')
random.choice(shortened_s)
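If non-negative indices are needed, for instance to pass to random.randint as the question mentions, the same strip idea can be adapted. This is a sketch under that assumption; non_x_bounds is a hypothetical helper name.

```python
import random

def non_x_bounds(s, ch="X"):
    """Return non-negative indices of the first and last characters that are not `ch`."""
    first = len(s) - len(s.lstrip(ch))  # number of leading `ch` characters
    last = len(s.rstrip(ch)) - 1        # index of the last character kept by rstrip
    return first, last

s = "XXABCDXXXEFGHXXXXX"
first, last = non_x_bounds(s)
print(first, last)                 # 2 12
idx = random.randint(first, last)  # randint includes both endpoints
print(idx, s[idx])
```

Note that the picked position can still land on an X in the middle of the string (indices 6 to 8 here); only the outer runs are excluded.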

Related

StarKill riddle in Python

Riddle:
Return a version of the given string, where for every star (*) in the string the star and the chars immediately to its left and right are gone. So "ab*cd" yields "ad" and "ab**cd" also yields "ad".
I'm wondering if there's a pythonish way to improve this algorithm:
def starKill(string):
    result = ''
    for idx in range(len(string)):
        if(idx == 0 and string[idx] != '*'):
            result += string[idx]
        elif (idx > 0 and string[idx] != '*' and (string[idx-1]) != '*'):
            result += string[idx]
        elif (idx > 0 and string[idx] == '*' and (string[idx-1]) != '*'):
            result = result[0:len(result) - 1]
    return result
starKill("wacy*xko") yields wacko
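Here's a numpy solution just for fun:
import numpy as np
def star_kill(string, target='*'):
    arr = np.array(list(string))
    mask = arr != target    # True where the character is not a star
    mask[1:] &= mask[:-1]   # also drop the character right after a star
    mask[:-1] &= mask[1:]   # also drop the character right before a star
    arr = arr[mask]
    # view the remaining one-character elements as a single string
    return arr.view(dtype=f'U{arr.size}').item()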
Regular expression?
>>> import re
>>> for s in "ab*cd", "ab**cd", "wacy*xko", "*Mad*Physicist*":
...     print(re.sub(r'\w?\*\w?', '', s))
...
ad
ad
wacko
ahysicis
You can do this by iterating over the string three times in parallel. Each iteration will be shifted relative to the next by one character. The middle one is the one that will provide the valid letters, the other two let us check if adjacent characters are stars. The two flanking iterators require dummy values to represent "before the start" and "after the end" of the string. There are a variety of ways to set that up, I'm using itertools.chain (and .islice) to fill in None for the dummy values. But you could use plain string and iterator manipulation if you prefer (i.e. iter('x' + string) and iter(string[1:] + 'x')):
import itertools
def star_kill(string):
    main_iterator = iter(string)
    look_behind = itertools.chain([None], string)
    look_ahead = itertools.chain(itertools.islice(string, 1, None), [None])
    return "".join(a for a, b, c in zip(main_iterator, look_behind, look_ahead)
                   if a != '*' and b != '*' and c != '*')
Not sure whether or not it's "Pythonic," but the problem can be solved with regular expressions.
import re
def starkill(s):
    s = re.sub(".{0,1}\\*{1,}.{0,1}", "", s)
    return s
For those not familiar with regex, I'll break that long string down:
Prefix
".{0,1}"
This specifies we want the replaced section to begin with either 0 or 1 of any character. If there is a character before the star, we want to replace it; otherwise, we still want the expression to hit if the star is at the very beginning of the input string.
Star
"\\*{1,}"
This specifies that the middle of the expression must contain an asterisk character, but it can also contain more than one. For instance, "a****b" will still hit, even though there are four stars. We need a backslash before the asterisk because regex has asterisk as a reserved character, and we need a second backslash before that because Python strings reserve the backslash character.
Suffix
.{0,1}
Same as the prefix. The expression can either end with one or zero of any character.
Hope that helps!
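The same pattern can be written more compactly as a raw string, since `?` means the same as `{0,1}` and `+` the same as `{1,}`. This is a sketch for trying it against the riddle's examples; the names STAR_BLOCK and star_kill_regex are invented here, and re.DOTALL is an added assumption so that a newline next to a star is removed as well.

```python
import re

# optional neighbour, one or more stars, optional neighbour
STAR_BLOCK = re.compile(r".?\*+.?", re.DOTALL)

def star_kill_regex(text):
    """Remove every run of stars together with the characters touching it."""
    return STAR_BLOCK.sub("", text)

for s in ("ab*cd", "ab**cd", "wacy*xko", "*Mad*Physicist*"):
    print(star_kill_regex(s))
# ad
# ad
# wacko
# ahysicis
```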

Compare adjacent characters in string for differing case

I am working through a coding challenge in Python. The rule is to take a string and delete any two adjacent letters that are the same character but differ in case. The process is repeated until no matching letters of differing case sit side by side. Finally, the length of the string should be printed. I have made a solution below that iterates left to right, although I have been told there are better, more efficient ways.
list_of_elves=list(line)
n2=len(list_of_elves)
i=0
while i < len(list_of_elves):
    if list_of_elves[i-1].lower()==list_of_elves[i].lower() and list_of_elves[i-1] != list_of_elves[i]:
        del list_of_elves[i]
        del list_of_elves[i-1]
        if i<2:
            i-=1
        else:
            i-=2
        if len(list_of_elves)<2:
            break
    else:
        i+=1
    if len(list_of_elves)<2:
        break
print(len(list_of_elves))
I have made some pseudo code as well
PROBLEM STATEMENT
Take a given string of alphabetical characters
Build a process to count the initial string length and store it in a variable
Build a process to iterate through the list and identify the following rule:
    Two adjacent matching letters && of differing case
    Delete the pair
Repeat process
Count final length of string
For example, if we had a string with 'aAa' then 'aA' would be deleted, leaving 'a' behind.
In Python, if you want to do it with a regex, use
re.sub(r"([a-zA-Z])(?=(?!\1)(?i:\1))", "", s) # For ASCII only letters
re.sub(r"([^\W\d_])(?=(?!\1)(?i:\1))", "", s) # For any Unicode letters
See the Python demo
Details
([^\W\d_]) - Capturing group 1: any Unicode letter (or any ASCII letter if ([a-zA-Z]) is used)
(?=(?!\1)(?i:\1)) - a positive lookahead that requires the same char as matched in the first capturing group (case insensitive) (see (?i:\1)) that is not the same char as matched in Group 1 (see (?!\1))
This is a very similar problem to matching parentheses, but instead of a match being opposite pairs, the match is upper/lower case. You can use a similar technique of maintaining a stack. Iterate through the string and compare the current letter with the top of the stack. If they match, pop the element off the stack; if they don't, append the letter to the stack. In the end, the length of the stack will be your answer:
line = "cABbaC"
stack = []
match = lambda m, n: m != n and m.upper() == n.upper()
for c in line:
    if len(stack) == 0 or not match(c, stack[-1]):
        stack.append(c)
    else:
        stack.pop()
stack
# stack is empty because `Bb` `Aa` and `Cc` get deleted.
Similarly line = "cGBbgaCF" would result in a stack of ['c', 'a', 'C', 'F'] because Bb, then Gg are deleted.
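Wrapped in a function, the stack approach might look like the sketch below. The name reduced_length is made up for the example; the logic is the same single pass with a stack described above.

```python
def reduced_length(line):
    """Length of `line` after repeatedly deleting adjacent same-letter, opposite-case pairs."""
    stack = []
    for c in line:
        # Same letter ignoring case, but not the identical character: the pair reacts.
        if stack and stack[-1] != c and stack[-1].lower() == c.lower():
            stack.pop()
        else:
            stack.append(c)
    return len(stack)

print(reduced_length("aAa"))       # 1
print(reduced_length("cABbaC"))    # 0
print(reduced_length("cGBbgaCF"))  # 4
```

Each character is pushed and popped at most once, so the work grows linearly with the length of the string.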
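A method that should be very fast:
result = 1
pairs = zip(string, string[1:])
for a, b in pairs:
    if a.lower() == b.lower() and a != b:
        # skip the pair that overlaps this match; if there is none,
        # the match was at the very end of the string
        if next(pairs, None) is None:
            result -= 1
    else:
        result += 1
print(result)
First we create a zip of the input with the input sliced by one position. This gives us an iterator that returns all the adjacent pairs in the string, in order.
Then for every pair that doesn't match we increment the result; for every pair that does match we advance the iterator by one so that we skip the pair overlapping the match.
The result is then the length of what would be left. We don't need to store the remaining string, since the length is the only thing that needs to be returned and we can calculate it as we go along.
Note that this is a single left-to-right pass: pairs that only become adjacent after an earlier deletion (as in 'abBA') are not removed, so use the stack approach if the process has to repeat.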
You really only need a single assertion in the regex to match the pair and delete it.
re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
Code sample :
>>> import re
>>> strs = ["aAa","aaa","aAaAA"]
>>> for target in strs:
... modtarg = re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
... print( target, "\t--> (", len(modtarg), ") ", modtarg )
...
aAa --> ( 1 ) a
aaa --> ( 3 ) aaa
aAaAA --> ( 1 ) A
Info :
(?-i: # Disable Case insensitive if on
( [a-zA-Z] ) # (1), upper or lower case
(?! \1 ) # Not the same cased letter
(?i: \1 ) # Enable Case insensitive, must be the opposite cased letter
)
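A single re.sub call makes one pass, so a pair that only becomes adjacent after another deletion (as in 'abBA') survives it. One way to repeat the process is to loop with re.subn until nothing is replaced. This is a sketch; the names PAIR and reduce_pairs are invented for the example, and the pattern is the one shown above without the outer (?-i:...) wrapper, which is only needed when case-insensitive matching is already switched on.

```python
import re

# a letter, not followed by the identical letter, followed by the same letter in the other case
PAIR = re.compile(r"([a-zA-Z])(?!\1)(?i:\1)")

def reduce_pairs(text):
    """Delete opposite-case pairs repeatedly until no more are found."""
    count = 1
    while count:
        text, count = PAIR.subn("", text)  # subn also returns the number of replacements
    return text

for target in ("aAa", "aaa", "aAaAA", "abBA"):
    reduced = reduce_pairs(target)
    print(target, "-->", len(reduced), repr(reduced))
# aAa --> 1 'a'
# aaa --> 3 'aaa'
# aAaAA --> 1 'A'
# abBA --> 0 ''
```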

slice string skip specific character

I have a string like this in python3:
ab_cdef_ghilm__nop_q__rs
Starting from a specific character, chosen by its index position, I want to slice a window of 5 characters on each side of that character. If a _ character is found, it has to be skipped and the next character used instead. For example, for the character "i" in this string I want a final string of 11 characters around the "i", skipping every _ that occurs, which gives this output:
defghilmnop
My strings are long, and I want to choose the index position where this is done.
In this case index=10.
Is there a command that crops a string to a specific size while skipping a specific character?
For the moment, what I can do is remove the _ characters from the string while counting how many there are, use that count to shift the middle index position, and finally crop a window of the desired size. But I want something that works step by step, so it would be perfect if it could just jump every time it finds a "_".
Situation B) index=13
I want 5 characters on the left and 5 on the right of this index, discarding (and not counting) the _ characters, which gives this output:
ghilmnopqrs
So basically, when the index corresponds to a character, start from it; when the index corresponds to a _ character, shift to the right up to the next character, so that the result is still a string of 11 characters.
To make a long story short, the output is 11 characters with the index position in the middle. If the index position is a _, skip it and use the closest character as the middle one.
I don't think there's a specific command for this, but you could build your own.
For example:
s = 'ab_cdef_ghilm__nop_q__rs'
def get_slice(s, idx, n=5, ignored_chars='_'):
    if s[idx] in ignored_chars:
        # adjust idx to first valid on right side:
        idx = next((i for i, ch in enumerate(s[idx:], idx) if ch not in ignored_chars), None)
        if idx is None:
            return ''
    d = {i: ch for i, ch in enumerate(s) if ch not in ignored_chars}
    if idx in d:
        keys = [k for k in d.keys()]
        idx = keys.index(idx)
        return ''.join(d[k] for k in keys[max(0, idx-n):min(idx+n+1, len(s))])
print(get_slice(s, 10, 5, '_'))
print(get_slice(s, 13, 5, '_'))
Prints:
defghilmnop
ghilmnopqrs
In case print(get_slice(s, 1, 5, '_')):
abcdefg
EDIT: Added check for starting index equals ignored char.
You can define a function slice like the one below, which cuts the string so that it has the given number of non-"_" characters on the left and on the right of the index:
st = "ab_cdef_ghilm__nop_q__rs"
def slice(st, ind, c_count):
    cp = [char!="_" for char in st]
    # grow the window to the right until it holds c_count non-"_" characters
    for i in range(len(st)):
        if sum(cp[ind:ind+i]) == c_count:
            break
    right = ind + i
    # grow the window to the left in the same way
    for i in range(len(st)):
        if sum(cp[ind-i:ind]) == c_count:
            break
    left = ind - i
    return st[left:right+1]
slice(st, 10, 5)  # 'def_ghilm__nop' (underscores inside the window are kept)
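The "jump over every _" behaviour asked for in the question can also be sketched by walking outwards from the index and collecting characters lazily, without building a cleaned copy of the whole string. The name window_skipping and its parameters are assumptions made for this example.

```python
from itertools import islice

def window_skipping(s, idx, n=5, skip="_"):
    """Return up to n kept characters on each side of s[idx], ignoring `skip` characters."""
    # If idx points at a skipped character, move right to the next kept one.
    while idx < len(s) and s[idx] in skip:
        idx += 1
    if idx >= len(s):
        return ""
    # Walk outwards from idx; islice stops after n kept characters per side.
    left = islice((s[i] for i in range(idx - 1, -1, -1) if s[i] not in skip), n)
    right = islice((s[i] for i in range(idx + 1, len(s)) if s[i] not in skip), n)
    return "".join(reversed(list(left))) + s[idx] + "".join(right)

s = "ab_cdef_ghilm__nop_q__rs"
print(window_skipping(s, 10))  # defghilmnop
print(window_skipping(s, 13))  # ghilmnopqrs
print(window_skipping(s, 1))   # abcdefg
```

Because only the neighbourhood of idx is visited, the cost depends on the window size and the number of skipped characters nearby rather than on the length of the string.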

String manipulation algorithm to find string greater than original string

I have a few words (strings) like 'hefg', 'dhck', 'dkhc', 'lmno' that are to be converted into new words by swapping some or all of the characters, such that the new word is lexicographically greater than the original word and is also the smallest of all the words greater than the original.
For example, 'dhck'
should output 'dhkc' and not 'kdhc', 'dchk' or any other.
I have these inputs
hefg
dhck
dkhc
fedcbabcd
which should output
hegf
dhkc
hcdk
fedcbabdc
I have tried this code in Python; it worked for all except 'dkhc' and 'fedcbabcd'.
I have figured out that in the case of 'fedcbabcd' the first character is the max, so it is not getting swapped, and
I'm getting "ValueError: min() arg is an empty sequence".
How can I modify the algorithm to fix these cases?
list1=['d','k','h','c']
list2=[]
maxVal=list1.index(max(list1))
for i in range(maxVal):
    temp=list1[maxVal]
    list1[maxVal]=list1[i-1]
    list1[i-1]=temp
    list2.append(''.join(list1))
print(min(list2))
You can try something like this:
iterate the characters in the string in reverse order
keep track of the characters you've already seen, and where you saw them
if you've seen a character larger than the current character, swap it with the smallest larger character
sort all the characters after that position to get the minimum string
Example code:
def next_word(word):
    word = list(word)
    seen = {}
    for i in range(len(word)-1, -1, -1):
        if any(x > word[i] for x in seen):
            x = min(x for x in seen if x > word[i])
            word[i], word[seen[x]] = word[seen[x]], word[i]
            return ''.join(word[:i+1] + sorted(word[i+1:]))
        if word[i] not in seen:
            seen[word[i]] = i
for word in ["hefg", "dhck", "dkhc", "fedcbabcd"]:
    print(word, next_word(word))
Result:
hefg hegf
dhck dhkc
dkhc hcdk
fedcbabcd fedcbabdc
The max character and its position doesn't influence the algorithm in the general case. For example, for 'fedcbabcd', you could prepend an a or a z at the beginning of the string and it wouldn't change the fact that you need to swap the final two letters.
Consider the input 'dgfecba'. Here, the output is 'eabcdfg'. Why? Notice that the final six letters are sorted in decreasing order, so by changing anything there, you get a smaller string lexicographically, which is no good. It follows that you need to replace the initial 'd'. What should we put in its place? We want something greater than 'd', but as small as possible, so 'e'. What about the remaining six letters? Again, we want a string that's as small as possible, so we sort the letters lexicographically: 'eabcdfg'.
So the algorithm is:
start at the back of the string (right end);
go left while the symbols keep increasing;
let i be the rightmost position where s[i] < s[i + 1]; in our case, that's i = 0;
leave the symbols on position 0, 1, ..., i - 1 untouched;
find the position among i+1 ... n-1 containing the least symbol that's greater than s[i]; call this position j; in our case, j = 3;
swap s[i] and s[j]; in our case, we obtain 'egfdcba';
reverse the string s[i+1] ... s[n-1]; in our case, we obtain 'eabcdfg'.
Your problem can be reworded as finding the next lexicographical permutation of a string.
The algorithm is described as follows:
1) Find the longest non-increasing suffix
2) The character to the left of the suffix is our pivot
3) Find the right-most successor of the pivot in the suffix
4) Swap the successor and the pivot
5) Reverse the suffix
The above algorithm is especially interesting because it is O(n).
Code
def next_lexicographical(word):
    word = list(word)
    # Find the pivot and the successor
    pivot = next(i for i in range(len(word) - 2, -1, -1) if word[i] < word[i+1])
    successor = next(i for i in range(len(word) - 1, pivot, -1) if word[i] > word[pivot])
    # Swap the pivot and the successor
    word[pivot], word[successor] = word[successor], word[pivot]
    # Reverse the suffix
    word[pivot+1:] = word[-1:pivot:-1]
    # Reform the word and return it
    return ''.join(word)
The above algorithm will raise a StopIteration exception if the word is already the last lexicographical permutation.
Example
words = [
    'hefg',
    'dhck',
    'dkhc',
    'fedcbabcd'
]
for word in words:
    print(next_lexicographical(word))
Output
hegf
dhkc
hcdk
fedcbabdc
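If raising StopIteration for the last permutation is inconvenient, next() accepts a default value, so the function can return None instead. This is a sketch of that variation only; the name next_permutation_or_none is made up here, and the steps are the same five listed above.

```python
def next_permutation_or_none(word):
    """Return the next lexicographical permutation of `word`, or None if there is none."""
    chars = list(word)
    # Pivot: rightmost position whose character is smaller than its right neighbour.
    pivot = next((i for i in range(len(chars) - 2, -1, -1) if chars[i] < chars[i + 1]), None)
    if pivot is None:
        return None  # the whole word is non-increasing, so it is already the last permutation
    # Successor: rightmost character in the suffix that is greater than the pivot.
    successor = next(i for i in range(len(chars) - 1, pivot, -1) if chars[i] > chars[pivot])
    chars[pivot], chars[successor] = chars[successor], chars[pivot]
    # The suffix is still non-increasing, so reversing it sorts it in ascending order.
    chars[pivot + 1:] = reversed(chars[pivot + 1:])
    return "".join(chars)

print(next_permutation_or_none("dkhc"))     # hcdk
print(next_permutation_or_none("dgfecba"))  # eabcdfg
print(next_permutation_or_none("dcba"))     # None
```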

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

A program to detect an n-length pattern in a string, even without knowing where the pattern starts, could easily be written by creating a list of n-length substrings and checking whether, from some point on, the rest of the list consists of the same item. Without any information other than the string itself, is the only way to recognize the pattern to brute-force through all lengths and check, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code...)
Current code that only suits checking for starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses create a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally, the + requires that pattern to repeat 1 or more times (in addition to the first time that it occurred inside the capture group).
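To see where a repeated block starts and how often it repeats, the same pattern can be used with re.finditer. This is a small sketch; repeated_blocks is a hypothetical helper name. Because the quantifiers are greedy, the capture group takes the longest block that still repeats, so four copies of 'abc' would be reported as two copies of 'abcabc'.

```python
import re

def repeated_blocks(text):
    """Return (start_index, block, repetitions) for each repeated block found."""
    results = []
    for m in re.finditer(r"(.{2,})\1+", text):
        block = m.group(1)
        # The whole match is the block repeated back to back.
        results.append((m.start(), block, len(m.group(0)) // len(block)))
    return results

print(repeated_blocks("xyabcabcabcz"))  # [(2, 'abc', 3)]
```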
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
Given a string what you could do is -
You start with length = 1, and take two pointer variables i and j which you shall use to traverse the string.
Set i = 0 and j = i+length
if str[i]==str[j]:
    i++,j++ // repeat until j equals the length of the string
else:
    length = length + 1
    // increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer, which is the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm over some of your test cases and let me know the results.
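The two-pointer procedure above can be sketched in Python as follows. The name pattern_length is an assumption for the example. As in the description, the pattern is taken to start at index 0, and the whole string counts as its own pattern when nothing shorter fits.

```python
def pattern_length(s):
    """Smallest length such that s[i] == s[i + length] for every valid i."""
    n = len(s)
    for length in range(1, n):
        # i runs from 0 while j = i + length stays inside the string
        if all(s[i] == s[i + length] for i in range(n - length)):
            return length
    return n

print(pattern_length("abcdeabcde"))  # 5
print(pattern_length("aaaa"))        # 1
print(pattern_length("abcd"))        # 4
```

In the worst case this compares on the order of n squared character pairs, since every candidate length may require a scan of the string.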
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']
