Getting rid of duplicate blocks in a string - python

I've got a string broken into pairs of letters, and I'm looking for a way to get rid of every pair of identical letters by inserting a character between them, which forms new pairs. Further, I want to split them up one pair at a time. What I've managed so far splits all identical pairs simultaneously, but that's not what I'm looking for. For example, "fr ee tr ee" should go to "fr eX et re e", not "fr eXe tr eXe".
Anyone got any ideas?
EDIT: To be more clear, I need to go through the string and, at the first instance of a "double block", insert an X and form new pairs from everything to the right of the X. So "AA BB" goes to "AX AB B".
So far I have
def FUN(text):
    if len(text) < 2:
        return text
    result = ""
    for i in range(1, len(text), 2):
        if text[i] == text[i - 1]:
            result += text[i - 1] + "X" + text[i]
        else:
            result += text[i-1:i+1]
    if len(text) % 2 != 0:
        result += text[-1]
    return result

How about this ? :
r = list()
S = "free tree"
S = "".join(S.split())
s = list()
for i in range(0,len(S)) :
    s.append(S[i])
while len(s) > 0 :
    c1 = s.pop(0)
    c2 = 'X'
    if len(s) > 0 :
        if s[0]!=c1 :
            c2 = s.pop(0)
    else :
        c2 = ''
    r.append("{0}{1}".format(c1,c2))
result = " ".join(r)
print(result)
Hope this helps :)

You could turn your string into a list and check each pair in a loop, then insert another character wherever a pair holds the same character twice.

my_string = "freetreebreefrost"
my_parts = [my_string[i:i+2] for i in range(0, len(my_string), 2)]
final_list = []
while len(my_parts):
    part = my_parts.pop(0)
    # a "double block" is a pair made of the same letter twice
    if len(part) == 2 and part[0] == part[1]:
        # push the second letter back and re-pair everything to the right
        tmp_str = part[1] + "".join(my_parts)
        my_parts = [tmp_str[i:i+2] for i in range(0, len(tmp_str), 2)]
        final_list.append(part[0] + "X")
    else:
        final_list.append(part)
print(final_list)
There is probably a much cooler way to do this.

Ok here it is:
s = "free tree aa"
def seperateStringEveryTwoChars(s):
    # get rid of any space
    s = s.replace(' ', '')
    x = ""
    for i, l in enumerate(s, 0):
        x += l
        if i % 2:
            x += ' '
    return x.rstrip()
def findFirstDuplicateEntry(stringList):
    for i, elem in enumerate(stringList, 0):
        if len(elem) > 1 and elem[0] == elem[1]:
            return i
    return None
def yourProgram(s):
    x = seperateStringEveryTwoChars(s)
    # print(x) # debug only
    splitX = x.split(' ')
    # print(splitX) # debug only
    i = findFirstDuplicateEntry(splitX)
    if i is None:
        return seperateStringEveryTwoChars(s)
    # print(i) # debug only
    splitX[i] = splitX[i][0] + "X" + splitX[i][1]
    # print(splitX) # debug only
    s = ''.join(splitX)
    # print(s) # debug only
    # print("Done") # debug only
    return yourProgram(s)
print(yourProgram(s))
Output:
fr eX et re ea a
With an input string of "aabbccddd" it outputs "aX ab bc cd dX d".

This is a simple three-line solution: a single pass, as easy as it gets.
No splitting, joining, arrays, for loops, nothing.
1. First, remove all spaces from the string: Replace_All \s+ with ""
2. Replace_All with callback ((.)(?:(?!\2)(.)|)(?!$))
   a. if (matched $3) replace with $1
   b. else replace with $1+"X"
3. Finally, put a space between every 2 chars: Replace_All (..) with $1 + " "
This is a test using Perl (don't know Python that well)
$str = 'ee ee rx xx tt bb ff fr ee tr ee';
$str =~ s/\s+//g;
$str =~ s/((.)(?:(?!\2)(.)|)(?!$))/ defined $3 ? "$1" : "$1X"/eg;
$str =~ s/(..)/$1 /g;
print $str,"\n";
# Output:
# eX eX eX er xX xX xt tb bf fX fr eX et re e
# Expanded regex
#
( # (1 start)
( . ) # (2)
(?:
(?! \2 ) # Not equal to the first char?
( . ) # (3) Grab the next one
|
# or matches the first, an X will be inserted here
)
(?! $ )
) # (1 end)
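Since the test above is in Perl, here is a rough Python sketch of the same three steps. The function name split_doubles and the filler parameter are made up for illustration. The pattern is also slightly different from the Perl one: instead of (?!$), the callback checks whether the match reaches the end of the string, so a lone final letter is left alone. Treat it as one possible port, not a drop-in translation.

```python
import re

def split_doubles(text, filler="X"):
    # Step 1: drop all whitespace.
    s = re.sub(r"\s+", "", text)

    def fix(m):
        if m.group(2) is not None:   # two different letters: keep the pair
            return m.group(0)
        if m.end() == len(s):        # lone final letter: nothing follows it
            return m.group(0)
        return m.group(1) + filler   # next letter is identical: pad with filler

    # Step 2: consume one letter plus, when it differs, the letter after it.
    s = re.sub(r"(.)(?:(?!\1)(.))?", fix, s)
    # Step 3: put a space between every two characters.
    return " ".join(s[i:i + 2] for i in range(0, len(s), 2))

print(split_doubles("fr ee tr ee"))  # fr eX et re e
```

With the question's second example, split_doubles("AA BB") gives "AX AB B".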

Related

Split string into list, but two words in quotation mark as one?

I'm trying to make my own little console in Python3 and I'm trying to split my given commands.
For example:
mkdir dir becomes arg[0] = mkdir, arg[1] = dir
I know I can do that with args.split(' '), but I'm trying to make it so that anything in quotation marks becomes one argument.
For example:
mkdir "New Folder" becomes arg[0] = mkdir, arg[1] = New Folder.
You can do the following using shlex:
import shlex
res = shlex.split('mkdir "New Folder"')
print(res)
# ['mkdir', 'New Folder']
Another option using re:
import re
[p for p in re.split("( |\\\".*?\\\"|'.*?')", 'mkdir "New Folder"') if p.strip()]
# ['mkdir', '"New Folder"']
Or:
import re
res3 = re.findall(r'(?:".*?"|\S)+', 'mkdir "New Folder"')
print(res3)
# ['mkdir', '"New Folder"']
Other option using csv:
import csv
res4 = list(csv.reader(['mkdir "New Folder"'], delimiter=' '))[0]
print(res4)
# ['mkdir', 'New Folder']
This should work:
def splitArgsIntoArguments(args):
    result = args.split(' ')
    i = 0
    while i < len(result):
        if result[i].startswith('"'):
            # Trim the opening quote
            result[i] = result[i][1:]
            # Then merge each next element into result[i] until it ends with '"'
            while not result[i].endswith('"') and i + 1 < len(result):
                # pop() removes the merged element, so the list shrinks as we go
                result[i] += " " + result.pop(i + 1)
            # Trim the closing quote, if one was found
            if result[i].endswith('"'):
                result[i] = result[i][:-1]
        i += 1
    return result
It's a bit ugly, but the function goes through each argument in the result list and if it starts with '"' it merges all elements to the right of it until it finds one that ends with '"'.
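For a console loop it may also help to wrap the shlex approach so that a missing closing quote does not crash the program. This is only a sketch: parse_command is a hypothetical helper name, and returning (None, []) for bad input is an arbitrary choice. shlex.split raises ValueError when a quotation is left open.

```python
import shlex

def parse_command(line):
    """Return (command, args), or (None, []) for an empty or malformed line."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # raised by shlex when a quotation mark is never closed
        return None, []
    if not tokens:
        return None, []
    return tokens[0], tokens[1:]

print(parse_command('mkdir "New Folder"'))  # ('mkdir', ['New Folder'])
```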

rotate a string n characters to the left, except the special characters

Hi, I need help rotating a string to the left n times. This is what I have so far (strings is a list of strings):
finaltext = ""
for i in strings:
    first = i[0 : n]
    second = i[n :]
    i = second + first
    finaltext += i
However, I'm not sure how to do this so that, in a given string such as "The intern", the space or any special characters do not move.
s1 = "The intern"
Right now my output is:
ternThe in
output I want:
ern eThein
Any ideas? I currently have a function that reports each special character and its index in a string, and I use that in a for loop to know when the current character is a special character. But when it comes to rotation, how would I avoid moving that character?
An intriguing question. How to rotate a string while ignoring specific characters?
Here we remove, rotate, reinsert characters.
Given
import collections as ct
def index(s):
    """Return a reversed dict of (char, [index, ...]) pairs."""
    dd = ct.defaultdict(list)
    for i, x in enumerate(s):
        dd[x].append(i)
    return dd
s1 = "The intern"
s2 = "Hello world!"
Code
def rotate(s, n=0, ignore=""):
    """Return string of rotated items, save ignored chars."""
    s0 = s[:]
    # Remove ignored chars
    for ig in ignore:
        s = s.replace(ig, "")
    # Rotate remaining string, equiv. to `res = s[-n:] + s[:-n]`
    tail = s[-n:]
    head = ""
    for c in s[:-n]:
        head += c
    res = tail + head
    # Reinsert ignored chars at their original positions, lowest index first,
    # so that earlier insertions never shift later ones
    if ignore:
        res = list(res)
        lookup = index(s0)
        for idx, ig in sorted((idx, ig) for ig in set(ignore) for idx in lookup[ig]):
            res.insert(idx, ig)
        res = "".join(res)
    return res
Tests
assert rotate(s1, n=0, ignore="") == "The intern"
assert rotate(s1, n=1, ignore="") == "nThe inter"
assert rotate(s1, n=1, ignore=" ") == "nTh einter"
assert rotate(s1, n=3, ignore=" ") == "ern Theint"
assert rotate(s2, n=12, ignore="") == "Hello world!"
assert rotate(s2, n=1, ignore="") == "!Hello world"
assert rotate(s2, n=1, ignore="H !") == "Hdell oworl!"
assert rotate(s2, n=1, ignore="!") == "dHello worl!"
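The remove, rotate, reinsert idea can also be sketched more compactly by walking the original string and pulling rotated characters from an iterator. The name rotate_keep is made up here, and like rotate above it moves characters to the right (tail first); swap the two slices for a left rotation.

```python
def rotate_keep(s, n=0, ignore=" "):
    # Characters that are allowed to move, in their original order
    movable = [c for c in s if c not in ignore]
    if movable:
        n %= len(movable)
    # Same rotation as tail + head in the answer above
    rotated = iter(movable[-n:] + movable[:-n])
    # Ignored characters stay where they are; every other slot takes
    # the next rotated character
    return "".join(c if c in ignore else next(rotated) for c in s)

print(rotate_keep("The intern", n=3, ignore=" "))  # ern Theint
```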

Python: replace string, matched from a list

Trying to match and mark character based n-grams. The string
txt = "how does this work"
is to be matched with n-grams from the list
ngrams = ["ow ", "his", "s w"]
and marked with <> – however, only if there is no preceding opened quote. The output I am seeking for this string is h<ow >does t<his w>ork (notice the double match in the second part, but within just one pair of the expected quotes).
The for loop I've tried for this doesn't, however, produce the wanted output at all:
switch = False
for i in txt:
    if i in "".join(ngrams) and switch == False:
        txt = txt.replace(i, "<" + i)
        switch = True
    if i not in "".join(ngrams) and switch == True:
        txt = txt.replace(i, ">" + i)
        switch = False
print(txt)
Any help would be greatly appreciated.
This solution uses the str.find method to find all copies of an ngram within the txt string, saving the indices of each copy to the indices set so we can easily handle overlapping matches.
We then copy txt, char by char, to the result list, inserting angle brackets where required. This strategy is more efficient than inserting the angle brackets using multiple .replace calls because each .replace call needs to rebuild the whole string.
I've extended your data slightly to illustrate that my code handles multiple copies of an ngram.
txt = "how does this work now chisolm"
ngrams = ["ow ", "his", "s w"]
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
    slen = len(s)
    lo = 0
    while True:
        i = txt.find(s, lo)
        if i == -1:
            break
        lo = i + slen
        print(s, i)
        indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
    if switch:
        if i in indices:
            result.append('<')
            switch = False
        result.append(u)
    else:
        result.append(u)
        if i not in indices:
            result.append('>')
            switch = True
print(''.join(result))
output
how does this work now chisolm
['ow ', 'his', 's w']
ow 1
ow 20
his 10
his 24
s w 12
{1, 2, 10, 11, 12, 13, 20, 21, 24, 25}
h<ow >does t<his w>ork n<ow >c<his>olm
If you want adjacent groups to be merged, we can easily do that using the str.replace method. But to make that work properly we need to pre-process the original data, converting all runs of whitespace to single spaces. A simple way to do that is to split the data and re-join it.
txt = "how does this\nwork now chisolm hisow"
ngrams = ["ow", "his", "work"]
# Convert all whitespace to single spaces
txt = ' '.join(txt.split())
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
    slen = len(s)
    lo = 0
    while True:
        i = txt.find(s, lo)
        if i == -1:
            break
        lo = i + slen
        print(s, i)
        indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
    if switch:
        if i in indices:
            result.append('<')
            switch = False
        result.append(u)
    else:
        result.append(u)
        if i not in indices:
            result.append('>')
            switch = True
# Convert the list to a single string
output = ''.join(result)
# Merge adjacent groups
output = output.replace('> <', ' ').replace('><', '')
print(output)
output
how does this work now chisolm hisow
['ow', 'his', 'work']
ow 1
ow 20
ow 34
his 10
his 24
his 31
work 14
{32, 1, 34, 10, 11, 14, 15, 16, 20, 24, 25, 31}
h<ow> does t<his work> n<ow> c<his>olm <hisow>
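For comparison, the same marking can be sketched with a boolean mask and itertools.groupby, which brackets each run of covered characters in one go. The helper name mark_ngrams is made up for this sketch. Unlike the code above it also counts overlapping copies of the same ngram, and it does not merge groups separated by a space.

```python
from itertools import groupby

def mark_ngrams(txt, ngrams):
    # covered[k] is True when character k belongs to at least one ngram match
    covered = [False] * len(txt)
    for ng in ngrams:
        start = txt.find(ng)
        while start != -1:
            for k in range(start, start + len(ng)):
                covered[k] = True
            start = txt.find(ng, start + 1)
    # Walk the runs of equal flags; wrap each covered run in angle brackets
    pieces = []
    pos = 0
    for is_match, run in groupby(covered):
        length = len(list(run))
        chunk = txt[pos:pos + length]
        pieces.append("<" + chunk + ">" if is_match else chunk)
        pos += length
    return "".join(pieces)

print(mark_ngrams("how does this work", ["ow ", "his", "s w"]))
# h<ow >does t<his w>ork
```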
This should work:
txt = "how does this work"
ngrams = ["ow ", "his", "s w"]
# first find where letters match ngrams
L = len(txt)
match = [False]*L
for ng in ngrams:
    l = len(ng)
    # + 1 so that an ngram at the very end of txt is also checked
    for i in range(L-l+1):
        if txt[i:i+l] == ng:
            for j in range(l):
                match[i+j] = True
# then sandwich matches with quotes
out = []
switch = False
for i in range(L):
    if not switch and match[i]:
        out.append('<')
        switch = True
    if switch and not match[i]:
        out.append('>')
        switch = False
    out.append(txt[i])
# close a group that runs to the end of txt
if switch:
    out.append('>')
print("".join(out))
Here's a method with only one for loop. I timed it and it's about as fast as the other answers to this question. I think it's a bit more clear, although that might be because I wrote it.
I iterate over the index of the first character in the n-gram, then if it matches, I use a bunch of if-else clauses to see whether I should add a < or > in this situation. I add to the end of the string output from the original txt, so I'm not really inserting in the middle of a string.
txt = "how does this work"
ngrams = set(["ow ", "his", "s w"])
n = 3
prev = -n
output = ''
shift = 0
open = False
for i in range(len(txt) - n + 1):
    ngram = txt[i:i + n]
    if ngram in ngrams:
        if i - prev > n:
            if open:
                output += txt[prev:prev + n] + '>' + txt[prev + n:i] + '<'
            elif not open:
                if prev > 0:
                    output += txt[prev + n:i] + '<'
                else:
                    output += txt[:i] + '<'
                open = True
        else:
            output += txt[prev:i]
        prev = i
if open:
    output += txt[prev:prev + n] + '>' + txt[prev + n:]
print(output)

Get the word around a given position

Let
s = 'hello you blablablbalba qyosud'
i = 17
How to get the word around position i? i.e. blablablbalba in my example.
I was thinking about this, but it seems unpythonic:
for j, c in enumerate(s):
    if c == ' ':
        if j < i:
            start = j
        else:
            end = j
            break
print(start, end)
print(s[start+1:end])
Here is another simple approach with regex,
import re
s = 'hello you blablablbalba qyosud'
i = 17
string_at_i = re.findall(r"(\w+)", s[i:])[0]
print(re.findall(r"\w*%s\w*" % string_at_i, s))
Updated: the previous pattern failed when there was a space. The current pattern takes care of it.
To answer your first question,
p = s[0 : i].rfind(' ')
Output: 9
For your second question,
s[ p + 1 : (s[p + 1 : ].find(' ') + p + 1) ]
Output: 'blablablbalba'
Description:
Extract the string from the starting to the ith position.
Find the index of the last occurrence of space. This will be your starting point for your required word (the second question).
Go from here to the next occurrence of space and extract the word in between.
The following consolidated code should work in all scenarios:
s = s + ' '
p = s[0 : i].rfind(' ')
s[ p + 1 : (s[p + 1 : ].find(' ') + p + 1) ]
You can split the string on spaces, then count the number of spaces up to the threshold parameter (i); that count is the index of the item in the split list.
Solution:
print (s.split()[s[:i].count(" ")])
EDIT:
If we have more than one space between words and we want to consider two spaces (or more) as one space we can do:
print (s.split()[" ".join(s[:i].split()).count(" ")])
Output:
blablablbalba
Explanation:
This returns 2, as there are two spaces before index 17.
s[:i].count(" ") # returns 2
This returns a list split on whitespace.
s.split()
['hello', 'you', 'blablablbalba', 'qyosud']
What you need is the index of the relevant item, which you got from s[:i].count(" ").
def func(s, i):
    s1 = s[0:i]
    k = s1.rfind(' ')
    pos1 = k
    s1 = s[k+1:len(s)]
    k = s1.find(' ')
    k = s[pos1+1:pos1+k+1]
    return k
s = 'hello you blablablbalba qyosud'
i = 17
k = func(s, i)
print(k)
output:
blablablbalba
You can use index or find to get the index of the space starting from a precise position. In this case it looks for the space character starting from start_index + 1. If it finds a space, it returns the word between the two indexes start_index and end; otherwise the word runs to the end of the string.
s = 'hello you blablablbalba qyosud'
def get_word(my_string, start_index):
    try:
        # str.index raises ValueError when there is no further space
        end = my_string.index(' ', start_index + 1)
    except ValueError:
        # no second space was found: slice to the end of the string
        end = None
    return my_string[start_index:end]
print(get_word(s, 10)) # 10 is the index where the word starts
Output: 'blablablbalba'
You can use rfind to search for the previous whitespace including s[i]:
>>> s = 'hello you blablablbalba qyosud'
>>> i = 17
>>> start = s.rfind(' ', 0, i + 1)
>>> start
9
Then you can use find to search the following whitespace again including s[i]:
>>> end = s.find(' ', i)
>>> end
23
And finally use slice to generate the word:
>>> s[start+1:(end if end != -1 else None)]
'blablablbalba'
The above yields the word when s[i] is not whitespace. When s[i] is whitespace, the result is an empty string.
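Wrapped into a function, the rfind/find steps might look like this. The name word_at is only for illustration, and it assumes words are separated by plain spaces.

```python
def word_at(s, i):
    # Last space at or before position i; rfind gives -1 when there is none,
    # so adding 1 lands on index 0, the start of the string
    start = s.rfind(" ", 0, i + 1) + 1
    # First space at or after position i; -1 means the word runs to the end
    end = s.find(" ", i)
    if end == -1:
        end = len(s)
    return s[start:end]

print(word_at("hello you blablablbalba qyosud", 17))  # blablablbalba
```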

How to correctly join words that have trailing or leading dashes? - python

I have a list of strings containing tokens that end or start with -. I need to join them up so that the words with dashes merge into the correct tokens, e.g.
[in]:
x = "ko- zo- fond- w- a (* nga- bantawana )."
y = "ngi -leth- el a -unfundi"
z = "ba- ya -gula buye incdai- fv -buye"
[out]:
kozofondwa (* ngabantawana ).
ngilethel aunfundi
bayagula buye incdaifvbuye
I've been doing it as below. It's really ugly and inelegant, especially when I need to call the function twice. Is there another way to achieve the same output, maybe with regex or something?
x = "ko- zo- fond- w- a (* nga- bantawana )."
y = "ngi -leth- el a -unfundi"
z = "ba- ya -gula buye incdai- fv -buye"
def join_morph(text):
    tempstr = ""
    outstr = []
    for i in text.split():
        if i.startswith('-'):
            outstr[len(outstr)-1]+=i
        elif i.endswith('-'):
            tempstr+=i
        else:
            tempstr+=i
            outstr.append(tempstr)
            tempstr = ""
    return " ".join(outstr)
# There is a problem because of the ordering of
# the if-else, it can only handle head or
# trailing dashes, not both
a = join_morph(x)
print(a)
a = join_morph(x).replace('-','')
print(a)
a = join_morph(join_morph(y)).replace('-','')
print(a)
a = join_morph(join_morph(z)).replace('-','')
print(a)
x = "ko- zo- fond- w- a (* nga- bantawana )." #or any other input
x = x.replace("- ", "").replace(" -", "")
It will remove all occurrences of "- " and " -" from the input, effectively transforming the strings as you need.
Maybe:
import re
re.sub(' *- *', '', txt)
Edit: if you know that there will always be exactly one space before or after the dash, then go with the replace solution. Otherwise, if you expect strings like high-rise (no space before or after the dash), high  -rise (more than one space), or high - rise (one space on both sides), then the regular expression may fit better.
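If you would rather keep in-word hyphens such as high-rise and only join across dashes that touch whitespace, a variant of the regex could look like the sketch below. The function name join_dashes is made up, and whether in-word hyphens should survive is an assumption about the data.

```python
import re

def join_dashes(text):
    # Remove a dash together with its surrounding whitespace, but only
    # when there is whitespace on at least one side of it
    return re.sub(r"\s*-\s+|\s+-\s*", "", text)

print(join_dashes("ko- zo- fond- w- a (* nga- bantawana )."))
# kozofondwa (* ngabantawana ).
print(join_dashes("ngi -leth- el a -unfundi"))
# ngilethel aunfundi
```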
