slice string skip specific character - python

I have a string like this in Python 3:
ab_cdef_ghilm__nop_q__rs
Starting from a specific character, chosen by its index position, I want to slice a window of 5 characters on each side of it. Whenever the _ character is found, it has to be skipped and the next character taken instead. For example, taking the character "i" in this string, I want a final string of 11 characters around the "i", skipping every _ that occurs, like this:
defghilmnop
Keep in mind that I have long strings and I want to choose the index position where this is done.
In this case index=10.
Is there a command that crops a string to a specific size while skipping a specific character?
For the moment, what I'm able to do is remove the _ from the string while counting the number of _ occurrences, use that count to shift the middle index position, and finally crop a window of the desired size. I want something more direct, so it would be perfect if I could just jump ahead every time a "_" is found.
Situation B) index=13
I want 5 characters on the left and 5 on the right of this index, getting rid of (and not counting) the _ characters, which gives this output:
ghilmnopqrs
So basically, when the index corresponds to a character, start from it; when the index corresponds to a _ character, shift to the right up to the next character, so that the result is again a string of 11 characters.
To make a long story short, the output is 11 characters with the index position in the middle. If the index position is a _, we have to skip it and take the nearest character as the middle one.

I don't think there's a specific command for this, but you could build your own.
For example:
s = 'ab_cdef_ghilm__nop_q__rs'
def get_slice(s, idx, n=5, ignored_chars='_'):
    if s[idx] in ignored_chars:
        # adjust idx to first valid on right side:
        idx = next((i for i, ch in enumerate(s[idx:], idx) if ch not in ignored_chars), None)
        if idx is None:
            return ''
    d = {i: ch for i, ch in enumerate(s) if ch not in ignored_chars}
    if idx in d:
        keys = list(d)
        idx = keys.index(idx)
        return ''.join(d[k] for k in keys[max(0, idx-n):min(idx+n+1, len(s))])
print(get_slice(s, 10, 5, '_'))
print(get_slice(s, 13, 5, '_'))
Prints:
defghilmnop
ghilmnopqrs
For print(get_slice(s, 1, 5, '_')), where fewer than 5 characters are available on the left, the output is:
abcdefg
EDIT: Added check for starting index equals ignored char.
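As a further sketch of the same idea (not part of the answer above), the dictionary can be avoided: count how many kept characters come before the index, then slice the filtered string around that position. The name window_skipping and its parameters are made up for this illustration.

```python
def window_skipping(s, idx, n=5, skip="_"):
    """Return up to n kept characters on each side of s[idx], ignoring `skip`."""
    # If idx points at a skipped character, move right to the next kept one.
    while idx < len(s) and s[idx] in skip:
        idx += 1
    if idx == len(s):
        return ""  # nothing but skipped characters from idx onwards
    # Position of the centre character inside the filtered string.
    centre = sum(1 for ch in s[:idx] if ch not in skip)
    kept = "".join(ch for ch in s if ch not in skip)
    return kept[max(0, centre - n):centre + n + 1]


s = "ab_cdef_ghilm__nop_q__rs"
print(window_skipping(s, 10))  # defghilmnop
print(window_skipping(s, 13))  # ghilmnopqrs
```

This filters the whole string on every call, which is fine for occasional lookups; for many lookups on one long string it may be worth building the filtered string once.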

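You can define a function slice like the one below, which cuts the string so that it has the given number of non-"_" characters on the left and on the right of the index:
st = "ab_cdef_ghilm__nop_q__rs"
def slice(st, ind, c_count):  # note: this name shadows the built-in slice
    # True for every character that counts (anything except "_")
    cp = [char != "_" for char in st]
    # grow the window to the right until it holds c_count counted characters
    for i in range(len(st)):
        if sum(cp[ind:ind+i]) == c_count:
            break
    right = ind + i
    # grow the window to the left in the same way
    for i in range(len(st)):
        if sum(cp[ind-i:ind]) == c_count:
            break
    left = ind - i
    # the raw window still contains the "_" characters, so drop them
    return st[left:right+1].replace("_", "")
print(slice(st, 10, 5))  # defghilmnop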

How to substitute unstressed vowel?

I have a CSV file with the following data:
bel.lez.za;bellézza
e.la.bo.ra.re;elaboràre
a.li.an.te;alïante
u.mi.do;ùmido
The first value is the word divided into syllables and the second shows the stress.
I'd like to merge the two pieces of information and obtain the following output:
bel.léz.za
e.la.bo.rà.re
a.lï.an.te
ù.mi.do
I computed the position of the stressed vowel and tried to substitute it for the same unstressed vowel in the first value, but the full stops make indexing difficult. Is there a way to tell Python to ignore full stops while counting? Or is there an easier way to do it? Thanks.
After splitting the two values for each line I computed the position of the stressed vowels:
char_list=['ò','à','ù','ì','è','é','ï']
for character in char_list:
    if character in value[1]:
        position_of_stressed_vowel=value[1].index(character)
I'd suggest merging/aligning the two forms in parallel instead of trying to substitute things via indexing. The idea is to iterate through the plain form and take out one character from the accented form for every character from the plain form, keeping dots as they are.
(Or perhaps, the idea is to add the dots to the accented form instead of adding the accented characters to the syllabified form.)
def merge_accents(plain, accented):
    output = ""
    acc_chars = iter(accented)
    for char in plain:
        if char == ".":
            output += char
        else:
            output += next(acc_chars)
    return output
Test:
data = [['bel.lez.za', 'bellézza'],
        ['e.la.bo.ra.re', 'elaboràre'],
        ['a.li.an.te', 'alïante'],
        ['u.mi.do', 'ùmido']]
# Returns
# bel.léz.za
# e.la.bo.rà.re
# a.lï.an.te
# ù.mi.do
for plain, accented in data:
    print(merge_accents(plain, accented))
Is there a way to tell python to ignore full stops while counting?
Yes, by implementing it yourself with an index lookup that tells you which index in the dot-delimited string corresponds to a given index in the word:
i = 0
corrected_index = []
for char in value[0]:
    if char != ".":
        corrected_index.append(i)
    i += 1
Now you can correct the index and replace the character. Strings are immutable, so build a new string instead of assigning to an index:
pos = corrected_index[position_of_stressed_vowel]
value[0] = value[0][:pos] + character + value[0][pos + 1:]
Make sure each "stressed vowel" is stored as a single precomposed code point (for example, by applying unicodedata.normalize('NFC', ...) to the text) so that it occupies exactly one index.
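A self-contained sketch of this index-lookup idea follows. The function name place_stress and its stressed parameter are assumptions for illustration; the vowel list is the one from the question.

```python
import unicodedata


def place_stress(syllabified, accented, stressed="òàùìèéï"):
    """Copy stressed vowels from `accented` into the dotted `syllabified` form."""
    # Normalise so each accented vowel is one code point, i.e. one index.
    accented = unicodedata.normalize("NFC", accented)
    stressed = unicodedata.normalize("NFC", stressed)
    # letter_to_dotted[k] is the index, in the dotted string, of the k-th letter.
    letter_to_dotted = [i for i, ch in enumerate(syllabified) if ch != "."]
    chars = list(syllabified)  # strings are immutable, so edit a list
    for pos, ch in enumerate(accented):
        if ch in stressed:
            chars[letter_to_dotted[pos]] = ch
    return "".join(chars)


print(place_stress("bel.lez.za", "bellézza"))  # bel.léz.za
print(place_stress("u.mi.do", "ùmido"))        # ù.mi.do
```

The sketch assumes both forms contain the same number of letters; it does not check for a mismatch.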
You can loop over the two halves of each line: walk through the first half, keep track of the current letter index (not counting the dots), and append the character at that index from the second half to a buffer (the modified string). Like the code below:
data = ['bel.lez.za;bellézza',
        'e.la.bo.ra.re;elaboràre',
        'a.li.an.te;alïante',
        'u.mi.do;ùmido']
converted_data = []
# Loop over the data.
for pair in data:
    # Split the pair on ";".
    first_half, second_half = pair.split(';')
    # Create variables to keep track of the current letter and the modified string.
    current_letter = 0
    modified_second_half = ''
    # Loop over the letters of the first half of the string.
    for current_char in first_half:
        # If the current_char is a dot add it to the modified string.
        if current_char == '.':
            modified_second_half += '.'
        # If the current_char is not a dot add the current letter from the second half to the modified string,
        # and update the current letter value.
        else:
            modified_second_half += second_half[current_letter]
            current_letter += 1
    converted_data.append(modified_second_half)
print(converted_data)
data = ['bel.lez.za;bellézza',
        'e.la.bo.ra.re;elaboràre',
        'a.li.an.te;alïante',
        'u.mi.do;ùmido']
def slice_same(input, lens):
    # slices the given string into the given lengths.
    res = []
    strt = 0
    for size in lens:
        res.append(input[strt : strt + size])
        strt += size
    return res
# split into two.
data = [x.split(';') for x in data]
# Add third column that's the length of each piece.
data = [[x, y, [len(z) for z in x.split('.')]] for x, y in data]
# Put text and lens through function.
data = ['.'.join(slice_same(y, z)) for x, y, z in data]
print(data)
Output:
['bel.léz.za',
'e.la.bo.rà.re',
'a.lï.an.te',
'ù.mi.do']

String manipulation algorithm to find string greater than original string

I have a few words (strings) such as 'hefg', 'dhck', 'dkhc' and 'lmno', each of which is to be converted into a new word by swapping some or all of its characters, such that the new word is lexicographically greater than the original word and is also the smallest of all the words greater than the original.
For example, 'dhck'
should output 'dhkc' and not 'kdhc', 'dchk' or any other.
I have these inputs:
hefg
dhck
dkhc
fedcbabcd
which should output
hegf
dhkc
hcdk
fedcbabdc
I have tried this code in Python; it worked for all except 'dkhc' and 'fedcbabcd'.
I have figured out that the first character of 'fedcbabcd' is the max, so it does not get swapped, and
I'm getting "ValueError: min() arg is an empty sequence".
How can I modify the algorithm to fix these cases?
list1=['d','k','h','c']
list2=[]
maxVal=list1.index(max(list1))
for i in range(maxVal):
    temp=list1[maxVal]
    list1[maxVal]=list1[i-1]
    list1[i-1]=temp
    list2.append(''.join(list1))
print(min(list2))
You can try something like this:
iterate over the characters in the string in reverse order
keep track of the characters you've already seen, and where you saw them
if you've seen a character larger than the current character, swap the current character with the smallest such larger character
sort all the characters after that position to get the minimum string
Example code:
def next_word(word):
    word = list(word)
    seen = {}
    for i in range(len(word)-1, -1, -1):
        if any(x > word[i] for x in seen):
            x = min(x for x in seen if x > word[i])
            word[i], word[seen[x]] = word[seen[x]], word[i]
            return ''.join(word[:i+1] + sorted(word[i+1:]))
        if word[i] not in seen:
            seen[word[i]] = i
for word in ["hefg", "dhck", "dkhc", "fedcbabcd"]:
    print(word, next_word(word))
Result:
hefg hegf
dhck dhkc
dkhc hcdk
fedcbabcd fedcbabdc
The max character and its position don't influence the algorithm in the general case. For example, for 'fedcbabcd', you could prepend an a or a z at the beginning of the string and it wouldn't change the fact that you need to swap the final two letters.
Consider the input 'dgfecba'. Here, the output is 'eabcdfg'. Why? Notice that the final six letters are sorted in decreasing order, so by changing anything there, you get a smaller string lexicographically, which is no good. It follows that you need to replace the initial 'd'. What should we put in its place? We want something greater than 'd', but as small as possible, so 'e'. What about the remaining six letters? Again, we want a string that's as small as possible, so we sort the letters lexicographically: 'eabcdfg'.
So the algorithm is:
start at the back of the string (right end);
go left while the symbols keep increasing;
let i be the rightmost position where s[i] < s[i + 1]; in our case, that's i = 0;
leave the symbols on position 0, 1, ..., i - 1 untouched;
find the position among i+1 ... n-1 containing the least symbol that's greater than s[i]; call this position j; in our case, j = 3;
swap s[i] and s[j]; in our case, we obtain 'egfdcba';
reverse the string s[i+1] ... s[n-1]; in our case, we obtain 'eabcdfg'.
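The steps above can be sketched as follows. The function name next_permutation and the choice to return None when no larger arrangement exists are assumptions made for this illustration.

```python
def next_permutation(s):
    """Return the smallest rearrangement of s that is greater than s, or None."""
    chars = list(s)
    # Go left from the right end while the symbols keep increasing (right to left);
    # stop at the rightmost i with chars[i] < chars[i + 1].
    i = len(chars) - 2
    while i >= 0 and chars[i] >= chars[i + 1]:
        i -= 1
    if i < 0:
        return None  # assumption: None signals that s is already the largest
    # The suffix after i is non-increasing, so the rightmost symbol greater
    # than chars[i] is also the least such symbol.
    j = len(chars) - 1
    while chars[j] <= chars[i]:
        j -= 1
    chars[i], chars[j] = chars[j], chars[i]
    # Reversing the non-increasing suffix sorts it in ascending order.
    chars[i + 1:] = reversed(chars[i + 1:])
    return "".join(chars)


print(next_permutation("dgfecba"))  # eabcdfg
print(next_permutation("dkhc"))     # hcdk
```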
Your problem can be reworded as finding the next lexicographical permutation of a string.
The algorithm in the above link is described as follows:
1) Find the longest non-increasing suffix
2) The element left of the suffix is our pivot
3) Find the right-most successor of the pivot in the suffix
4) Swap the successor and the pivot
5) Reverse the suffix
The above algorithm is especially interesting because it is O(n).
Code
def next_lexicographical(word):
    word = list(word)
    # Find the pivot and the successor
    pivot = next(i for i in range(len(word) - 2, -1, -1) if word[i] < word[i+1])
    successor = next(i for i in range(len(word) - 1, pivot, -1) if word[i] > word[pivot])
    # Swap the pivot and the successor
    word[pivot], word[successor] = word[successor], word[pivot]
    # Reverse the suffix
    word[pivot+1:] = word[-1:pivot:-1]
    # Reform the word and return it
    return ''.join(word)
The above algorithm will raise a StopIteration exception if the word is already the last lexicographical permutation.
Example
words = [
    'hefg',
    'dhck',
    'dkhc',
    'fedcbabcd'
]
for word in words:
    print(next_lexicographical(word))
Output
hegf
dhkc
hcdk
fedcbabdc

partition a string by dash (-) python

I want to get a string and divide it into parts separated by "-".
Input:
aabbcc
And output:
aa-bb-cc
Is there a way to do so?
If you want to do it based on the same letter then you can use itertools.groupby() to do this, e.g.:
In []:
import itertools as it
s = 'aabbcc'
'-'.join(''.join(g) for k, g in it.groupby(s))
Out[]:
'aa-bb-cc'
Or if you want it in chunks of 2 you can use iter() and zip():
In []:
n = 2
'-'.join(''.join(p) for p in zip(*[iter(s)]*n))
Out[]:
'aa-bb-cc'
Note: if the string length is not divisible by n, this will drop the trailing characters. You can replace zip(...) with itertools.zip_longest(..., fillvalue='') to keep them, but it is unclear whether the OP has this issue.
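A minimal sketch of the zip_longest variant mentioned in the note, so a trailing partial chunk is kept instead of dropped. The helper name chunk_join and its parameters are made up for this example.

```python
from itertools import zip_longest


def chunk_join(s, n=2, sep="-"):
    """Join consecutive chunks of n characters with sep, keeping a short last chunk."""
    # The same iterator repeated n times makes zip_longest pull n characters per
    # group; fillvalue="" pads the final group without adding visible characters.
    groups = zip_longest(*[iter(s)] * n, fillvalue="")
    return sep.join("".join(group) for group in groups)


print(chunk_join("aabbcc"))  # aa-bb-cc
print(chunk_join("aabbc"))   # aa-bb-c
```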
If you want to create pairs separated by a dash, you can use the function below:
def pair_div(string):
    newString=str() #for storing the divided string
    for i,s in enumerate(string):
        if i%2!=0 and i<(len(string)-1): #we make sure the function divides every two chars but not after the last character of the string.
            newString+=s+'-' #If it is the second member of a pair, add a dash after it
        else:
            newString+=s #If not, just add the character
    return(newString)
For example:
[In]:pair_div("aazzxxcceewwqqbbvvaa")
[Out]:'aa-zz-xx-cc-ee-ww-qq-bb-vv-aa'
But if you want to group identical characters and separate the groups with a dash, you'd better use regex methods.
BR,
Shend
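One way the regex route mentioned in this answer could look, as a sketch rather than the answerer's own code: match each run of a repeated character with a backreference and join the runs with dashes. The function name group_runs is hypothetical.

```python
import re


def group_runs(s, sep="-"):
    """Join runs of identical consecutive characters with sep."""
    # (.) captures one character and \1* matches further repeats of it;
    # re.DOTALL lets "." match newlines as well.
    runs = (m.group(0) for m in re.finditer(r"(.)\1*", s, flags=re.DOTALL))
    return sep.join(runs)


print(group_runs("aabbcc"))    # aa-bb-cc
print(group_runs("aaabccdd"))  # aaa-b-cc-dd
```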
You can try
data = "aabbcc"
"-".join([data[x:x+2] for x in range(0, len(data), 2)])
If you want to divide the string into blocks of 2 characters, this will help:
import textwrap
s='aabbcc'
lst=textwrap.wrap(s,2)
print('-'.join(lst))
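The second argument (width) defines the number of characters you want in each group. Note that textwrap.wrap treats whitespace as a break point, so this is only suitable for strings without spaces.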
s = 'aabbccdd'
#index 01234567
new_s = ''
1)
for idx, char in enumerate(s):
    new_s+=char
    if idx%2 != 0:
        new_s += '-'
print(new_s.strip('-'))
# aa-bb-cc-dd
2)
new_s = ''.join([s[i]+'-' if i%2 != 0 else s[i] for i in range(len(s))]).strip('-')
print(new_s)
# aa-bb-cc-dd

Python script to make every combination of a string with placed characters

I'm looking for help in creating a script to add periods to a string in every place but first and last, using as many periods as needed to create as many combinations as possible:
The output for the string 1234 would be:
["1234", "1.234", "12.34", "123.4", "1.2.34", "1.23.4" etc. ]
And obviously this needs to work for all lengths of string.
You should solve this type of problem yourself; these are simple data-manipulation algorithms that you should know how to come up with.
However, here is the solution (long version for more clarity):
my_str = "1234" # original string
# recursive function for constructing dots
def construct_dot(s, t):
    # s - the string to put dots
    # t - number of dots to put
    # zero dots will return the original string in a list (stop criteria)
    if t==0: return [s]
    # allocation for results list
    new_list = []
    # iterate the next dot location, considering the remaining dots.
    for p in range(1,len(s) - t + 1):
        new_str = str(s[:p]) + '.' # put the dot in the location
        res_str = str(s[p:]) # crop the string from the dot to the end
        sub_list = construct_dot(res_str, t-1) # make a list with t-1 dots (recursive)
        # append concatenated strings
        for sl in sub_list:
            new_list.append(new_str + sl)
    # we result with a list of the string with the dots.
    return new_list
# now we will iterate the number of the dots that we want to put in the string.
# 0 dots will return the original string, and we can put maximum of len(string) -1 dots.
all_list = []
for n_dots in range(len(my_str)):
    all_list.extend(construct_dot(my_str,n_dots))
# and see the results
print(all_list)
Output is:
['1234', '1.234', '12.34', '123.4', '1.2.34', '1.23.4', '12.3.4', '1.2.3.4']
A concise solution without recursion: using binary combinations (think of 0, 1, 10, 11, etc) to determine where to insert the dots.
Between each letter, put a dot when there's a 1 at this index and an empty string when there's a 0.
your_string = "1234"
def dot_combinations(string):
    i = 0
    combinations = []
    # Iter while the binary representation length is smaller than the string size
    while i.bit_length() < len(string):
        current_word = []
        for index, letter in enumerate(string):
            current_word.append(letter)
            # Append a dot if there's a 1 in this position
            if (1 << index) & i:
                current_word.append(".")
        i+=1
        combinations.append("".join(current_word))
    return combinations
print(dot_combinations(your_string))
Output:
['1234', '1.234', '12.34', '1.2.34', '123.4', '1.23.4', '12.3.4', '1.2.3.4']
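The same "dot or no dot in each gap" idea can also be sketched with itertools.product, which enumerates every choice for the len(s) - 1 gaps directly. The function name dotted_variants is made up for this illustration, and the results come out in a different order from the outputs shown above.

```python
from itertools import product


def dotted_variants(s):
    """Return every way of placing dots in the gaps between characters of s."""
    if not s:
        return [s]
    variants = []
    # One choice per gap between characters: either nothing or a dot.
    for seps in product(("", "."), repeat=len(s) - 1):
        # zip stops at the shorter input, so the last character is added separately.
        pieces = [ch + sep for ch, sep in zip(s, seps)]
        variants.append("".join(pieces) + s[-1])
    return variants


print(len(dotted_variants("1234")))  # 8
```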

Consecutive values in strings, getting indices

The following is a Python string with a length of more than 1000 characters.
string1 = "XXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
len(string1) ## 1311
I would like to know the index of where the consecutive X's end and the non-X characters begin. Reading this string from left to right, the first non-X character is at index location 22, and the first non-X character from the right is at index location 1306.
How does one find these indices?
My guess would be:
for x in string1:
    if x != "X":
        print(string1.index(x))
The problem with this is it outputs all indices that are not X. It does not give me the index where the consecutive X's end.
Even more confusing for me is how to "check" for consecutive X's. Let's say I have this string:
string2 = "XXXXAAXAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
Here, the consecutive X's end at index 4, not index 7. How could I check several characters ahead whether this is really no longer consecutive?
Using regex, split off the first and last groups of Xs, then use their lengths to construct the indices.
import re
mystr = 'XXXXAAXAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX'
# split on every run of non-X characters; the first and last pieces are the outer runs of Xs
xs = re.split('[^X]+', mystr)
indices = (len(xs[0]), len(mystr) - len(xs[-1]) - 1)
# (4, 47)
I simply need the outputs for the indices. I'm then going to put them in randint(first_index, second_index)
It's possible to pass the indices to the function like this:
randint(*indices)
However, I suspect that you want to use the output of randint(first_index, last_index) to select a random character from the middle. If so, this would be a shorter alternative:
from random import choice
randchar = choice(mystr.strip('X'))
If I understood your question well, you can just do:
def getIndexs(string):
    lst =[]
    flag = False
    for i, char in enumerate(string):
        if char == "x":
            flag = True
        if ((char != "x") and flag):
            lst.append(i-1)
            flag = False
    return lst
print(getIndexs("xxxxbbbxxxxaaaxxxbb"))
[3, 10, 16]
If the sequences are, as you say, only in the beginning and at the end of your string, a simple loop / reversed loop would suffice:
string1 = "XXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
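left_index = 0
for char in string1:
    left_index += 1
    if char != "X":
        break
right_index = len(string1)
for char in reversed(string1):
    if char != "X":
        break
    right_index -= 1
# left_index counts the characters read up to and including the first non-X,
# so it is that character's 1-based position (its 0-based index plus one).
print(left_index) # 22
# right_index is the 0-based index just past the last non-X character.
print(right_index) # 65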
Regex can lookahead and identify characters that don't match the pattern:
>>>[match.span() for match in re.finditer(r'X{2,}((?=[^X])|$)', string2)]
[(0, 4), (48, 53)]
Breaking this down:
X - the character we're matching
{2,} - need to see at least two in a row to consider a match
((?=[^X])|$) - two conditions will satisfy the match
(?=[^X]) - lookahead for anything but an X
$ - the end of the string
As a result, finditer returns each instance where there are multiple X's, followed by a non-X or an end of line. match.span() extracts the position information from each match from the string.
This will give you the first index and last index (of non-'X' character).
s = 'XXABCDXXXEFGHXXXXX'
first_index = len(s) - len(s.lstrip('X'))
last_index = len(s.rstrip('X')) - len(s) - 1
print(first_index, last_index)
2 -6
How it works:
For first_index:
We strip all the 'X' characters at the beginning of our string. Finding the difference in length between the original and shortened string gives us the index of the first non-'X' character.
For last_index:
Similarly, we strip the 'X' characters at the end of our string. We also subtract 1 from the difference, since reverse indexing in Python starts from -1.
Note:
If you just want to randomly select one of the characters between first_index and last_index, you can do:
import random
shortened_s = s.strip('X')
random.choice(shortened_s)
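If non-negative indices are needed for randint, as the question describes, the same stripping idea can be sketched like this. The helper name non_x_bounds and the choice to return None for a string with no non-X characters are assumptions for illustration.

```python
import random


def non_x_bounds(s, pad="X"):
    """Return (first, last) 0-based indices of the non-pad core of s, or None."""
    first = len(s) - len(s.lstrip(pad))  # number of leading pad characters
    last = len(s.rstrip(pad)) - 1        # index of the last non-pad character
    if first > last:
        return None  # assumption: None means s is empty or consists only of pad
    return first, last


bounds = non_x_bounds("XXABCDXXXEFGHXXXXX")
print(bounds)  # (2, 12)
if bounds is not None:
    # randint includes both endpoints, so any index in the core can come up.
    print(random.randint(*bounds))
```

An index picked this way can still land on an X inside the core (positions 6 to 8 in this example); only the leading and trailing runs are excluded.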
