Find tokens that are connected

I wrote code that gets text-tokens as input:
tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]
The code should find all tokens that contain hyphens or are connected to each other with hyphens: Basically the output should be:
[["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]]
I wrote some code, but somehow I'm not getting the hyphen-connected tokens back. To try it out: http://goo.gl/iqov0q
def find_hyphens(self):
    tokens_with_hypens =[]
    for i in range(len(self.tokens)):
        hyp_leng = 0
        while self.hypen_between_two_tokens(i + hyp_leng):
            hyp_leng += 1
        if self.has_hypen_in_middle(i) or hyp_leng > 0:
            if hyp_leng == 0:
                tokens_with_hypens.append(self.tokens[i:i + 1])
            else:
                tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
                i += hyp_leng - 1
    return tokens_with_hypens
What am I doing wrong? Is there a more performant solution? Thanks

I found 3 mistakes in your code:
1) You are comparing the last 2 characters of tok1 here, rather than the last of tok1 and the first of tok2:
if "-" in joined[len(tok1) - 2: len(tok1)]:
# instead, do this:
if "-" in joined[len(tok1) - 1: len(tok1) + 1]:
2) You are omitting the last matching token here. Increase the end-index of your slice here by 1:
tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
# instead, do this:
tokens_with_hypens.append(self.tokens[i:i + 1 + hyp_leng])
3) You cannot manipulate the index of a for i in range loop in Python. The next iteration will just retrieve the next index, overwriting your change. Instead, you could use a while loop like this:
i = 0
while i < len(self.tokens):
    [...]
    i += 1
These 3 corrections lead to your test passing: http://goo.gl/fd07oL
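To illustrate point 3 in isolation, here is a minimal sketch. The function names visited_with_for and visited_with_while are made up for this example and are not part of the original code. It shows that an increment inside a for loop is discarded on the next iteration, while the same increment in a while loop really skips indices:

```python
def visited_with_for(n):
    """Try to skip ahead inside a for loop; the change to i is lost."""
    visited = []
    for i in range(n):
        visited.append(i)
        i += 2  # overwritten by the next value from range()
    return visited


def visited_with_while(n):
    """The same idea with a while loop; here the index really advances."""
    visited = []
    i = 0
    while i < n:
        visited.append(i)
        i += 3  # own step (1) plus the two skipped indices
    return visited


print(visited_with_for(6))    # [0, 1, 2, 3, 4, 5]
print(visited_with_while(6))  # [0, 3]
```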
Nonetheless, I couldn't resist writing an algorithm from scratch that solves your problem as simply as possible:
def get_hyphen_groups(tokens):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1
tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]
for group in get_hyphen_groups(tokens):
    print ("".join(group))
To exclude 1-element groups, as in your expected result, wrap the yield in this if:
if i_end - i_start > 1:
    yield tokens[i_start:i_end]
To include 1-element groups that already contain a hyphen, change that if to this, for example:
if i_end - i_start > 1 or "-" in tokens[i_start]:
    yield tokens[i_start:i_end]
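Putting the generator and the last filter together gives the following sketch. It only combines the pieces shown above and checks them against the expected output from the question; nothing else is assumed:

```python
def get_hyphen_groups(tokens):
    """Yield runs of tokens joined by a leading or trailing hyphen."""
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        # Extend the group while exactly one side of the boundary has a hyphen.
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        # Keep multi-token groups and single tokens that contain a hyphen.
        if i_end - i_start > 1 or "-" in tokens[i_start]:
            yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
expected = [["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"],
            ["Man", "-Hum", "-Zuh-UH-", "glit"]]
print(list(get_hyphen_groups(tokens)) == expected)  # True
```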

One thing that is wrong with your approach is trying to change the value of i inside the for i in range(len(self.tokens)) loop. It won't work, because i gets the next value from range in each iteration, ignoring your changes.
I changed your code to use an iterative algorithm that pops one item at a time from the list and decides what to do with it. It uses a buffer where it stores the items belonging to one chain until the chain is complete.
The full code is:
class Hyper:
    def __init__(self, tokens):
        self.tokens = tokens
    def find_hyphens(self):
        tokens_with_hypens =[]
        copy = list(self.tokens)
        buffer = []
        while len(copy) > 0:
            item = copy.pop(0)
            if self.has_hyphen_in_middle(item) and item[0] != '-' and item[-1] != '-':
                # words with hyphens that are not part of a bigger chain
                tokens_with_hypens.append([item])
            elif item[-1] == '-' or (len(copy) > 0 and copy[0][0] == '-'):
                # part of a chain - append to the buffer
                buffer.append(item)
            elif len(buffer) > 0:
                # the last word in a chain - the buffer contains the complete chain
                buffer.append(item)
                tokens_with_hypens.append(buffer)
                buffer = []
        return tokens_with_hypens
    @staticmethod
    def has_hyphen_in_middle(input):
        # [1:-1] drops only the first and last character, so "a-b" is detected too
        return len(input) > 2 and "-" in input[1:-1]
tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]
hyper = Hyper(tokens)
result = hyper.find_hyphens()
print(result)
print(result == [["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]])

Related

Ignoring Changed Index Check (Python)

I have made a script:
our_word = "Success"
def duplicate_encode(word):
    char_list = []
    final_str = ""
    changed_index = []
    base_wrd = word.lower()
    for k in base_wrd:
        char_list.append(k)
    for i in range(0, len(char_list)):
        count = 0
        for j in range(i + 1, len(char_list)):
            if j not in changed_index:
                if char_list[j] == char_list[i]:
                    char_list[j] = ")"
                    changed_index.append(j)
                    count += 1
            else:
                continue
        if count > 0:
            char_list[i] = ")"
        else:
            char_list[i] = "("
    print(changed_index)
    print(char_list)
    final_str = "".join(char_list)
    return final_str
print(duplicate_encode(our_word))
Essentially, the purpose of this script is to convert a string to a new string where each character is "(" if that character appears only once in the original string, or ")" if it appears more than once. I have made a rather layered script (I am relatively new to Python, so I didn't want to use any helpful built-in functions) that attempts to do this. My issue is that the check for whether the current index has already been edited (to prevent it from changing again) seems to be ignored. So instead of the intended )())()) I get )()((((. I'd really appreciate an insightful answer as to why I am getting this issue and ways to work around it, since I'm trying to build an intuitive understanding of Python. Thanks!

word = "Success"
print(''.join([')' if word.lower().count(c) > 1 else '(' for c in word.lower()]))
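A possible variant of the same idea, sketched under the assumption that using collections.Counter from the standard library is acceptable here. It counts every character once up front instead of calling count for each position:

```python
from collections import Counter


def duplicate_encode(word):
    """Return ')' for characters that repeat (case-insensitively), else '('."""
    lowered = word.lower()
    counts = Counter(lowered)  # one pass to count every character
    return "".join(")" if counts[c] > 1 else "(" for c in lowered)


print(duplicate_encode("Success"))  # )())())
```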

The issue here has nothing to do with your understanding of Python. It's purely algorithmic. If you retain this 'layered' algorithm, it is essential that you add one more check in the "i" loop.
our_word = "Success"
def duplicate_encode(word):
    char_list = list(word.lower())
    changed_index = []
    for i in range(len(word)):
        count = 0
        for j in range(i + 1, len(word)):
            if j not in changed_index:
                if char_list[j] == char_list[i]:
                    char_list[j] = ")"
                    changed_index.append(j)
                    count += 1
        if i not in changed_index:  # the new important check to avoid reversal of already assigned ')' to '('
            char_list[i] = ")" if count > 0 else "("
    return "".join(char_list)
print(duplicate_encode(our_word))

Your algorithm can be greatly simplified if you avoid using char_list as both the input and output. Instead, you can create an output list of the same length filled with ( by default, and then only change an element when a duplicate is found. The loops will simply walk along the entire input list once for each character looking for any matches (other than self-matches). If one is found, the output list can be updated and the inner loop will break and move on to the next character.
The final code should look like this:
def duplicate_encode(word):
    char_list = list(word.lower())
    output = list('(' * len(word))
    for i in range(len(char_list)):
        for j in range(len(char_list)):
            if i != j and char_list[i] == char_list[j]:
                output[i] = ')'
                break
    return ''.join(output)
for our_word in (
    'Success',
    'ChJsTk(u cIUzI htBp#qX)OTIHpVtHHhQ',
):
    result = duplicate_encode(our_word)
    print(our_word)
    print(result)
Output:
Success
)())())
ChJsTk(u cIUzI htBp#qX)OTIHpVtHHhQ
))(()(()))))())))()()((())))()))))

Recursive Decompression of Strings

I'm trying to decompress strings using recursion. For example, the input:
3[b3[a]]
should output:
baaabaaabaaa
but I get:
baaaabaaaabaaaabbaaaabaaaabaaaaa
I have the following code but it is clearly off. The first find_end function works as intended. I am absolutely new to using recursion and any help understanding / tracking where the extra letters come from or any general tips to help me understand this really cool methodology would be greatly appreciated.
def find_end(original, start, level):
    if original[start] != "[":
        message = "ERROR in find_error, must start with [:", original[start:]
        raise ValueError(message)
    indent = level * " "
    index = start + 1
    count = 1
    while count != 0 and index < len(original):
        if original[index] == "[":
            count += 1
        elif original[index] == "]":
            count -= 1
        index += 1
    if count != 0:
        message = "ERROR in find_error, mismatched brackets:", original[start:]
        raise ValueError(message)
    return index - 1
def decompress(original, level):
    # set the result to an empty string
    result = ""
    # for any character in the string we have not looked at yet
    for i in range(len(original)):
        # if the character at the current index is a digit
        if original[i].isnumeric():
            # the character of the current index is the number of repetitions needed
            repititions = int(original[i])
            # start = the next index containing the '[' character
            x = 0
            while x < (len(original)):
                if original[x].isnumeric():
                    start = x + 1
                    x = len(original)
                else:
                    x += 1
            # last = the index of the matching ']'
            last = find_end(original, start, level)
            # calculate a substring using `original[start + 1:last]`
            sub_original = original[start + 1 : last]
            # RECURSIVELY call decompress with the substring
            # sub = decompress(original, level + 1)
            # concatenate the result of the recursive call times the number of repetitions needed to the result
            result += decompress(sub_original, level + 1) * repititions
            # set the current index to the index of the matching ']'
            i = last
        # else
        else:
            # concatenate the letter at the current index to the result
            if original[i] != "[" and original[i] != "]":
                result += original[i]
    # return the result
    return result
def main():
    passed = True
    ORIGINAL = 0
    EXPECTED = 1
    # The test cases
    provided = [
        ("3[b]", "bbb"),
        ("3[b3[a]]", "baaabaaabaaa"),
        ("3[b2[ca]]", "bcacabcacabcaca"),
        ("5[a3[b]1[ab]]", "abbbababbbababbbababbbababbbab"),
    ]
    # Run the provided tests cases
    for t in provided:
        actual = decompress(t[ORIGINAL], 0)
        if actual != t[EXPECTED]:
            print("Error decompressing:", t[ORIGINAL])
            print(" Expected:", t[EXPECTED])
            print(" Actual: ", actual)
            print()
            passed = False
    # print that all the tests passed
    if passed:
        print("All tests passed")
if __name__ == '__main__':
    main()

From what I gathered from your code, it probably gives the wrong result because of the approach you've taken to find the last matching closing brace at a given level (I'm not 100% sure, the code was a lot). However, I can suggest a cleaner approach using stacks (almost similar to DFS, without the complications):
def decomp(s):
    stack = []
    for i in s:
        if i.isalnum():
            stack.append(i)
        elif i == "]":
            temp = stack.pop()
            count = stack.pop()
            if count.isnumeric():
                stack.append(int(count)*temp)
            else:
                stack.append(count+temp)
    for i in range(len(stack)-2, -1, -1):
        if stack[i].isnumeric():
            stack[i] = int(stack[i])*stack[i+1]
        else:
            stack[i] += stack[i+1]
    return stack[0]
print(decomp("3[b]")) # bbb
print(decomp("3[b3[a]]")) # baaabaaabaaa
print(decomp("3[b2[ca]]")) # bcacabcacabcaca
print(decomp("5[a3[b]1[ab]]")) # abbbababbbababbbababbbababbbab
This works on a simple observation: rather than evaluating a substring right after reading a [, evaluate it after encountering a ]. That allows you to build the result AFTER the pieces have been evaluated individually. (This is similar to stack-based evaluation of prefix/postfix expressions.)
(You can add error checking to this as well, if you wish. It would be easier to check if the string is semantically correct in one pass and evaluate it in another pass, rather than doing both in one go)

Here is a solution based on a similar idea to the one above:
We go through the string, putting everything on the stack until we find ']'. Then we go back until '[', taking everything off, find the number, multiply, and put the result back on the stack.
It is cheaper because we work with lists instead of repeatedly concatenating strings.
Note: the multiplier can't be more than 9, as it is parsed as a single-character string.
def decompress(string):
    stack = []
    letters = []
    for i in string:
        if i != ']':
            stack.append(i)
        elif i == ']':
            letter = stack.pop()
            while letter != '[':
                letters.append(letter)
                letter = stack.pop()
            word = ''.join(letters[::-1])
            letters = []
            stack.append(''.join([word for j in range(int(stack.pop()))]))
    return ''.join(stack)
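Since the question asks about recursion and the note above limits the multiplier to a single digit, here is a minimal recursive sketch that also accepts multi-digit counts. It is not taken from any of the answers. The helper name parse is invented for this example, and the code assumes well-formed input (every count is followed by a matching [ ... ] pair):

```python
def decompress(text):
    """Expand strings such as '3[b3[a]]'; assumes well-formed input."""

    def parse(pos):
        # Parse from pos up to the next unmatched ']' or the end of text.
        # Returns the expanded string and the position where parsing stopped.
        parts = []
        while pos < len(text) and text[pos] != "]":
            if text[pos].isdigit():
                start = pos
                while pos < len(text) and text[pos].isdigit():
                    pos += 1
                count = int(text[start:pos])
                # text[pos] is assumed to be '[', so the body starts at pos + 1
                inner, pos = parse(pos + 1)
                pos += 1  # step over the closing ']'
                parts.append(inner * count)
            else:
                parts.append(text[pos])
                pos += 1
        return "".join(parts), pos

    return parse(0)[0]


print(decompress("3[b3[a]]"))  # baaabaaabaaa
print(decompress("12[a]"))     # aaaaaaaaaaaa
```

Each recursive call handles exactly one bracket level and reports where it stopped, so the caller never has to search for the matching bracket separately.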

Write a recursive function matching_bracket(string, idx) to find the index of the close bracket matching the open bracket at string[idx]

While there are many questions on stackoverflow to check if the string is balanced, what I need is to find the index of the closing bracket of string[idx]. For example:
>>> matching_bracket('([])', 0)
3
>>> matching_bracket('([])', 1)
2
There are 3 conditions that will return -1:
the closing bracket is not of the same type
the nested brackets are not matched [IMPORTANT]
there are no more brackets available
Here is what I have so far:
def matching_bracket(string, idx):
    open_tup = ("(", "{", "<", "[")
    close_tup = (")", "}", ">", "]")
    chosen = string[idx]
    b_index = open_tup.index(chosen)
    n = len(string) - 1
    if string[idx + 1] in open_tup: # Case 1: Check if nested brackets match
        return matching_bracket(string, idx + 1)
    elif string[n] != close_tup[b_index]: # Case 2: Closing bracket not the same
        return matching_bracket(string[0 : n], idx)
    elif len(string) == 1: # Case 3: No more available brackets
        return -1
    else:
        return n
While I am running a recursive function to check if the nested brackets are closed as well, I am having difficulty getting the correct output as I end up returning the index of the closing bracket that is nested instead. See below:
>>> matching_bracket('([])', 0)
2
How should I modify my code?

In the above code, the first if condition checks whether the next bracket is an opening bracket. If it is, you call matching_bracket with the index of that next bracket and lose the index of the actual opening bracket whose closing bracket you want.
Check out the following solution using a stack:
def matching_bracket(string, idx):
    open_tup = ("(", "{", "<", "[")
    close_tup = (")", "}", ">", "]")
    dict_brackets = {"{": "}", "(": ")", "<": ">", "[": "]"}
    stack = []
    # check the bounds first, so string[idx] cannot raise an IndexError
    if idx >= len(string) or string[idx] in close_tup:
        return -1
    stack.append(string[idx])
    for t in range(idx + 1, len(string)):
        if string[t] in open_tup:
            stack.append(string[t])
        else:
            if string[t] != dict_brackets.get(stack.pop()):
                return -1
            elif len(stack) == 0:
                return t
    return -1

It's a little convoluted, but should do:
def matching_bracket(string, idx):
    bracket_dict = {'[':']', '(':')', '{':'}', '<':'>'}
    # Actual recursive function
    def inner_func(ix_open, ix_close):
        if string[ix_close] == bracket_dict[string[ix_open]]:
            return ix_open, ix_close
        else:
            if ix_close + 1 == len(string) - 1:
                return ix_open, -1
            else:
                return inner_func(ix_open+1, ix_close+1)
    if idx == len(string) - 1:
        return -1
    elif string[idx + 1] == bracket_dict[string[idx]]:
        return idx + 1
    elif idx == len(string) - 2:
        return -1
    else:
        ix_open, ix_close = idx+2, idx+1
        while ix_open != idx and ix_close != -1:
            ix_open, ix_close = ix_open - 1, ix_close + 1
            ix_open, ix_close = inner_func(ix_open, ix_close)
        return ix_close
PS: Wrote down the solution way back, forgot to post :p

If your goal is to match open delimiters and close them with corresponding delimiters, take a look at this library I made, perhaps the algorithm can help you, though it is in java.
Here's how it works-
First the class needs to know which opening delimiter matches which closing delimiter - you can use a dictionary for this in python
delim_dict = {}
delim_dict['('] = ')'
.....
Now if you're only interested in checking whether the closing and opening delimiters don't match - take a look at this function.
Simply put, you have to count the number of each closing delimiter and opening delimiter while iterating over the string in reverse. If the counts don't match, you know the delimiters are not matched either.
Now if you want to find the index of your desired delimiter - take a look at this function
It's designed to find the mathematical function in an expression, given its closing delimiter, but you can modify it to match your usecase. Since you want to find a closing delimiter, given opening delimiter, you should be iterating the expression in normal order, instead of reverse
# opening_delim is given as parameter
closing_delim = get_corresponding_delimiter(opening_delim)
closing_delim_count, opening_delim_count = 0, 0
i = 0
for item in expression:
    if expression[i] == opening_delim:
        opening_delim_count += 1
    elif expression[i] == closing_delim:
        closing_delim_count+= 1
    if opening_delim_count == closing_delim_count:
        return i
    i += 1
Of course, this code is only for the first index's delimiter and it also assumes the delimiters are matched correctly
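For comparison, here is a recursive sketch that works for any starting index and returns -1 for mismatched nesting, as the question requires. It is an illustration under stated assumptions, not the method of any answer above: the PAIRS table is a name introduced here, and characters that are not brackets are simply skipped.

```python
PAIRS = {"(": ")", "{": "}", "<": ">", "[": "]"}


def matching_bracket(string, idx):
    """Return the index of the bracket closing string[idx], or -1."""
    if idx >= len(string) or string[idx] not in PAIRS:
        return -1
    pos = idx + 1
    while pos < len(string):
        ch = string[pos]
        if ch in PAIRS:
            # A nested opener: find its partner recursively, then jump past it.
            inner = matching_bracket(string, pos)
            if inner == -1:
                return -1
            pos = inner + 1
        elif ch == PAIRS[string[idx]]:
            return pos
        elif ch in PAIRS.values():
            return -1  # a closing bracket of the wrong type
        else:
            pos += 1  # not a bracket, ignore it
    return -1  # ran out of characters


print(matching_bracket("([])", 0))  # 3
print(matching_bracket("([])", 1))  # 2
print(matching_bracket("([)]", 0))  # -1
```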

Replacing keywords in strings with sequence of symbols

For an exercise, I have to create a simple profanity filter in order to learn about classes.
The filter gets initialized with a list of offensive keywords and a replacement template. Every occurrence of any of these words should be replaced with a string that is generated from the template. If the word size is shorter than the template, a substring should be used that starts from the beginning, for longer sizes, the template should be repeated as often as necessary.
The following are my results so far, with an example.
class ProfanityFilter:
    def __init__(self, keywords, template):
        self.__keywords = sorted(keywords, key=len, reverse=True)
        self.__template = template
    def filter(self, msg):
        def __replace_letters__(old_word, replace_str):
            replaced_word = ""
            old_index = 0
            replace_index = 0
            while old_index <= len(old_word):
                if replace_index == len(replace_str):
                    replace_index = 0
                else:
                    replaced_word += replace_str[replace_index]
                    replace_index += 1
                old_index += 1
            return replaced_word
        for keyword in self.__keywords:
            idx = 0
            while idx < len(msg):
                index_l = msg.lower().find(keyword.lower(), idx)
                if index_l == -1:
                    break
                msg = msg[:index_l] + __replace_letters__(keyword, self.__template) + msg[index_l + len(keyword):]
                idx = index_l + len(keyword)
        return msg
f = ProfanityFilter(["duck", "shot", "batch", "mastard"], "?#$")
offensive_msg = "this mastard shot my duck"
clean_msg = f.filter(offensive_msg)
print(clean_msg) # should be: "this ?#$?#$? ?#$? my ?#$?"
The example should print:
this ?#$?#$? ?#$? my ?#$?
But it prints:
this ?#$?#$ ?#$? my ?#$?
For some reason it replaces the word "mastard" with 6 symbols instead of 7 (one for each letter). It works for the other keywords, why not for this one?
Also if you see anything else that seems off, feel free to tell me. Do keep in mind tho that I am a beginner and my "toolbox" is quite small atm.

Your problem is in the index logic. You have two errors.
Each time you reach the end of the replacement string, you skip a letter in the profanity:
while old_index <= len(old_word):
    if replace_index == len(replace_str):
        replace_index = 0
        # You don't replace a letter; you just reset the new index, but ...
    else:
        replaced_word += replace_str[replace_index]
        replace_index += 1
    old_index += 1 # ... but you still advance the old index.
The reason you didn't notice this is that you have a second bug: you run your old_index from 0 through len(old_word), which is one more character than you started with. For the canonical four-letter word (or words of 5 or 6 characters), the two errors cancel each other. You didn't see this because you didn't test enough. For instance, using:
f = ProfanityFilter(["StackOverflow", "PC"], "?#$")
offensive_msg = "StackOverflow on PC rulez!"
clean_msg = f.filter(offensive_msg)
Output:
?#$?#$?#$?# on ?#$ rulez!
The input words are 13 and 2 letters; the replacements are 11 and 3.
Fix those two errors: make old_index stay in bounds, and increment it only when you make a replacement.
while old_index < len(old_word):
    if replace_index == len(replace_str):
        replace_index = 0
    else:
        replaced_word += replace_str[replace_index]
        replace_index += 1
        old_index += 1
Future improvements:
Refactor this into a for loop.
Don't reset your replace_index; in fact, get rid of it. Simply use old_index % len(replace_str).
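The two suggested improvements could look roughly like this. It is a sketch only: replace_letters is a stand-in name for the nested helper from the question, and the template is assumed to be non-empty:

```python
def replace_letters(old_word, replace_str):
    """Build a replacement exactly as long as old_word by cycling the template."""
    # i % len(replace_str) wraps around the template, so no reset is needed.
    return "".join(
        replace_str[i % len(replace_str)] for i in range(len(old_word))
    )


print(replace_letters("mastard", "?#$"))  # ?#$?#$?
print(replace_letters("PC", "?#$"))       # ?#
```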

I'd do this with a regular expression instead, since re.sub() has a handy API for dynamic replacements:
import re
class ProfanityFilter:
    def __init__(self, keywords, template):
        # Build a regular expression that will match all of the profane words
        self.keyword_re = re.compile("|".join(re.escape(keyword) for keyword in keywords), re.I)
        self.template = template
    def _generate_replacement(self, word):
        l = len(word)
        # Figure out how many times to repeat the template
        repeats = (l // len(self.template)) + 1
        # Since we may end up with a string longer than the original,
        # slice to the correct length.
        return (self.template * repeats)[:l]
    def filter(self, msg):
        # Replace all occurrences of the regular expression with
        # a dynamically computed replacement value.
        return self.keyword_re.sub(
            lambda m: self._generate_replacement(m.group(0)),
            msg,
        )
f = ProfanityFilter(["duck", "shot", "batch", "mastard"], "?#$")
offensive_msg = "this mastard shot my duck"
print(f.filter(offensive_msg))

Couldn't make a one-liner, but here's a terrible implementation anyway. Don't do what VoNWooDSoN does:
def replace(msg, keywords=["duck", "shot", "batch", "mastard"], template="?#$"):
    for keyword in keywords * len(msg):
        msg = (template*len(keyword))[:len(keyword)].join([msg[:msg.find(keyword)], msg[msg.find(keyword)+len(keyword):]]) if msg.find(keyword) > 0 else msg
    return msg
offensive_msg = "this mastard shot my duck"
clean_msg = replace(offensive_msg)
print(clean_msg) # should be: "this ?#$?#$? ?#$? my ?#$?"
print(clean_msg=="this ?#$?#$? ?#$? my ?#$?")
edit
So, I guess that 3.8 has assignment expressions... So, but this'd be the one liner then (probably).
print ((lambda msg: [msg := (("?#$"*len(keyword))[:len(keyword)].join([msg[:msg.find(keyword)], msg[msg.find(keyword)+len(keyword):]]) if msg.find(keyword) > 0 else msg) for keyword in ["duck", "shot", "batch", "mastard"]])("this mastard shot my duck")[-1])

How to rearrange a string's characters such that none of it's adjacent characters are the same, using Python

In my attempt to solve the above question, I've written the following code:
Logic: Create a frequency dict for each character in the string (key = character, value = frequency of the character). If any character's frequency is greater than ceil(n/2), there is no solution. Else, print the most frequent character and then reduce its frequency in the dict.
import math, operator
def rearrangeString(s):
    # Fill this in.
    n = len(s)
    freqDict = {}
    for i in s:
        if i not in freqDict.keys():
            freqDict[i] = 1
        else:
            freqDict[i] += 1
    for j in list(freqDict.values()):
        if j > math.ceil(n / 2):
            return None
    return maxArrange(freqDict)[:-4]
temp = ""
def maxArrange(inp):
global temp
n = len(inp)
if list(inp.values()) != [0] * n:
resCh = max(inp.items(), key=operator.itemgetter(1))[0]
if resCh is not None and resCh != temp:
inp[resCh] -= 1
# Terminates with None
temp = resCh
return resCh + str(maxArrange(inp))
# Driver
print(rearrangeString("abbccc"))
# cbcabc
print(rearrangeString("abbcccc"))
On the first try, with input abbccc, it gives the right answer, i.e. cbcabc, but it fails for the input abbcccc: without the temp variable it returns ccbcabc, and with the temp variable it returns cbcabc, skipping one c altogether.
How should I modify the logic, or is there a better approach?
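One common way to implement the logic described above (always emit the most frequent remaining character, but never the one just emitted) is a max-heap in which the last used character is held back for one round. The sketch below is an illustration under that assumption, not a fix of the posted code. The name rearrange_string and the held variable are introduced here:

```python
import heapq
from collections import Counter


def rearrange_string(s):
    """Return a rearrangement with no equal neighbours, or None if impossible."""
    counts = Counter(s)
    # Same feasibility test as in the question: no count above ceil(n / 2).
    if max(counts.values(), default=0) > (len(s) + 1) // 2:
        return None
    # heapq is a min-heap, so store negative counts to pop the largest first.
    heap = [(-n, ch) for ch, n in counts.items()]
    heapq.heapify(heap)
    result = []
    held = None  # entry for the character used last, kept out for one round
    while heap:
        n, ch = heapq.heappop(heap)
        result.append(ch)
        if held is not None:
            heapq.heappush(heap, held)
        # n is negative; n + 1 moves it one step towards zero (one use less).
        held = (n + 1, ch) if n + 1 < 0 else None
    return "".join(result) if len(result) == len(s) else None


print(rearrange_string("abbccc"))
print(rearrange_string("abbcccc"))
print(rearrange_string("aaab"))  # None
```

Because the held entry is only pushed back after a different character has been emitted, the same character can never appear twice in a row, which replaces the global temp variable.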
