How to find matches around fixed strings in Python

I'm looking for Python functions that take a list of strings, such as ["I like ", " and ", " because "], and a single target string, such as "I like lettuce and carrots and onions because I do", and find all the ways the characters of the target string can be grouped so that each of the strings in the list appears, in order, as one of the groups.
For example:
solution(["I like ", " and ", " because ", "do"],
"I like lettuce and carrots and onions because I do")
should return:
[("I like ", "lettuce", " and ", "carrots and onions", " because ", "I ", "do"),
("I like ", "lettuce and carrots", " and ", "onions", " because ", "I ", "do")]
Notice that in each tuple, the strings from the list parameter appear in order, and the function returns every possible way of splitting up the target string to achieve this.
Another example, this time with only one possible way of organizing the characters:
solution(["take ", " to the park"], "take Alice to the park")
should give the result:
[("take ", "Alice", " to the park")]
Here's an example where there is no way to organize the characters correctly:
solution(["I like ", " because ", ""],
"I don't like cheese because I'm lactose-intolerant")
should give back:
[]
because there is no way to do it. Notice that the "I like " in the first parameter cannot be split up. The target string doesn't have the string "I like " in it, so there's no way it could match.
Here's a final example, again with multiple options:
solution(["I", "want", "or", "done"],
"I want my sandwich or I want my pizza or salad done")
should return
[("I", " ", "want", " my sandwich ", "or", " I want my pizza or salad ", "done"),
("I", " ", "want", " my sandwich or I want my pizza ", "or", " salad ", "done"),
("I", " want my sandwich or I", "want", " my pizza ", "or", " salad ", "done")]`
Notice that, again, each of the strings ["I", "want", "or", "done"] is included in each tuple, in order, and that the remaining characters are grouped around those strings in every way possible. The list of all the possible groupings is what is returned.
Note that it's also assumed that the first string in the list will appear at the start of the target string, and the last string in the list will appear at the end of the target string. (If they don't, the function should return an empty list.)
What Python functions will allow me to do this?
I've tried using regex functions, but they seem to fail in the cases where there's more than one option.
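For example, something along these lines (a rough sketch of the kind of pattern I tried, not my actual code) only ever produces one of the two valid groupings:
import re

# A single pattern like this only ever yields ONE of the valid groupings,
# because the regex engine commits to a single split:
pattern = r"(I like )(.*?)( and )(.*?)( because )(.*?)(do)"
m = re.match(pattern, "I like lettuce and carrots and onions because I do")
print(m.groups())
# ('I like ', 'lettuce', ' and ', 'carrots and onions', ' because ', 'I ', 'do')
# The second valid split ('lettuce and carrots' / 'onions') is never produced.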

I have a solution. It needs a fair bit of refactoring, but it seems to work.
I hope this helps; it was quite an interesting problem.
import itertools
import re
from collections import deque

def solution(search_words, search_string):
    found = deque()
    for search_word in search_words:
        # record every position at which this search word occurs in the target string
        found.append([m.start() for m in re.compile(search_word).finditer(search_string)])
    if len(found) != len(search_words) or len(found) == 0:
        return []  # no search words or not all words found
    # keep only the position combinations that are in ascending order
    word_positions_lst = [list(i) for i in itertools.product(*found) if sorted(list(i)) == list(i)]
    ret_lst = []
    for word_positions in word_positions_lst:
        # each search word contributes two cut points: its start and its end
        split_positions = list(itertools.chain.from_iterable(
            (split_position, split_position + len(search_word))
            for split_position, search_word in zip(word_positions, search_words)))
        last_search_word = search_string[split_positions[-1]:]
        ret_strs = [search_string[a:b] for a, b in zip(split_positions, split_positions[1:])]
        if last_search_word:
            ret_strs.append(last_search_word)
        # only keep splits that account for every character exactly once
        if len(search_string) == sum(map(len, ret_strs)):
            ret_lst.append(tuple(ret_strs))
    return ret_lst
print(solution(["I like ", " and ", " because ", "do"],
"I like lettuce and carrots and onions because I do"))
print([("I like ", "lettuce", " and ", "carrots and onions", " because ", "I ", "do"),
("I like ", "lettuce and carrots", " and ", "onions", " because ", "I ", "do")])
print()
print(solution(["take ", " to the park"], "take Alice to the park"))
print([("take ", "Alice", " to the park")])
print()
print(solution(["I like ", " because "],
"I don't like cheese because I'm lactose-intolerant"))
print([])
print()
Outputs:
[('I like ', 'lettuce', ' and ', 'carrots and onions', ' because ', 'I ', 'do'), ('I like ', 'lettuce and carrots', ' and ', 'onions', ' because ', 'I ', 'do')]
[('I like ', 'lettuce', ' and ', 'carrots and onions', ' because ', 'I ', 'do'), ('I like ', 'lettuce and carrots', ' and ', 'onions', ' because ', 'I ', 'do')]
[('take ', 'Alice', ' to the park')]
[('take ', 'Alice', ' to the park')]
[]
[]
[('I', ' ', 'want', ' my sandwich ', 'or', ' I want my pizza or salad ', 'done'), ('I', ' ', 'want', ' my sandwich or I want my pizza ', 'or', ' salad ', 'done'), ('I', ' want my sandwich or I ', 'want', ' my pizza ', 'or', ' salad ', 'done')]
[('I', ' ', 'want', ' my sandwich ', 'or', ' I want my pizza or salad ', 'done'), ('I', ' ', 'want', ' my sandwich or I want my pizza ', 'or', ' salad ', 'done'), ('I', ' want my sandwich or I', 'want', ' my pizza ', 'or', ' salad ', 'done')]
Edit: refactored the code to use meaningful variable names.
Edit 2: added the last case I forgot about.

EDIT: I've since learned some programming tactics and redone my answer to this problem.
To answer my question: you don't need any special functions. If you'd like a version that's relatively easy to code, see the previous answer further below. This solution is also less documented than the one below, but it uses dynamic programming and memoization, so it should be faster and less memory-intensive than the previous solution. It also deals with regex metacharacters (such as |) correctly, which the previous answer below does not.
def solution(fixed_strings, target_string):
    def get_middle_matches(s, fixed_strings):
        '''
        Gets the matches of the fixed strings, excluding the first and last fixed strings.
        For example, the parameters ("ABCBD", ["B"]) should give back [["A", "B", "CBD"], ["ABC", "B", "D"]].
        '''
        # in the form {(s, s_index, fixed_string_index, fixed_character_index):
        #              return value of recursive_get_middle_matches called with those parameters}
        lookup = {}

        def memoized_get_middle_matches(*args):
            '''memoize the recursive function'''
            try:
                ans = lookup[args]
                return ans
            except KeyError:
                ans = recursive_get_middle_matches(*args)
                lookup[args] = ans
                return ans

        def recursive_get_middle_matches(s, s_index, fixed_string_index, fixed_character_index):
            '''
            Takes a string, an index into that string, an index into the list of middle fixed strings,
            ...and an index into that middle fixed string.
            Returns what fixed_string_matches(s, fixed_strings[fixed_string_index:-1]) would return, and deals with edge cases.
            '''
            # base case: there are no fixed strings left to match
            try:
                fixed_string = fixed_strings[fixed_string_index]
            except IndexError:
                # we just finished matching the last fixed string, but there's some stuff left over
                return [[s]]
            # recursive case: we've finished matching a fixed string
            # note that this needs to go before the end-of-string base case
            # ...because otherwise the matched fixed string may not be added to the answer,
            # ...since getting to the end of the main string will short-circuit it
            try:
                fixed_character = fixed_string[fixed_character_index]
            except IndexError:
                # finished matching this fixed string
                upper_slice = s_index
                lower_slice = upper_slice - len(fixed_string)
                prefix = s[:lower_slice]
                match = s[lower_slice:upper_slice]
                postfix = s[upper_slice:]
                match_ans = [prefix, match]
                recursive_answers = memoized_get_middle_matches(postfix, 0, fixed_string_index + 1, 0)
                if fixed_string == '' and s_index < len(s):
                    recursive_skip_answers = memoized_get_middle_matches(s, s_index + 1, fixed_string_index, fixed_character_index)
                    return [match_ans + recursive_ans for recursive_ans in recursive_answers] + recursive_skip_answers
                else:
                    return [match_ans + recursive_ans for recursive_ans in recursive_answers]
            # base cases: we've reached the end of the string
            try:
                character = s[s_index]
            except IndexError:
                # nothing left to match in the main string
                if fixed_string_index >= len(fixed_strings):
                    # it completed matching everything it needed to
                    return [[""]]
                else:
                    # it didn't finish matching everything it needed to
                    return []
            # recursive cases: either we match this character or we don't
            character_matched = (character == fixed_character)
            starts_fixed_string = (fixed_character_index == 0)
            if starts_fixed_string:
                # if this character starts the fixed string, we're still searching for this same fixed string
                recursive_skip_answers = memoized_get_middle_matches(s, s_index + 1, fixed_string_index, fixed_character_index)
            if character_matched:
                recursive_take_answers = memoized_get_middle_matches(s, s_index + 1, fixed_string_index, fixed_character_index + 1)
                if starts_fixed_string:
                    # we have the option to either take the character as a match, or skip over it
                    return recursive_skip_answers + recursive_take_answers
                else:
                    # this character is past the start of the fixed string; we can no longer match this fixed string
                    # since we can't match one of the fixed strings, this is a failed path if we don't match this character
                    # thus, we're forced to take this character as a match
                    return recursive_take_answers
            else:
                if starts_fixed_string:
                    # we can't match it here, so we skip over and continue
                    return recursive_skip_answers
                else:
                    # this character is past the start of the fixed string; we can no longer match this fixed string
                    # since we can't match one of the fixed strings, there are no possible matches here
                    return []

        ## main code
        return memoized_get_middle_matches(s, 0, 0, 0)

    ## main code
    # doing the one-fixed-string case first because it happens a lot
    if len(fixed_strings) == 1:
        # if it matches, then there's just that one match; otherwise, there's none
        if target_string == fixed_strings[0]:
            return [target_string]
        else:
            return []
    if len(fixed_strings) == 0:
        # there are no matches because there are no fixed strings
        return []
    # separate the first and last from the middle
    first_fixed_string = fixed_strings[0]
    middle_fixed_strings = fixed_strings[1:-1]
    last_fixed_string = fixed_strings[-1]
    prefix = target_string[:len(first_fixed_string)]
    middle = target_string[len(first_fixed_string):len(target_string) - len(last_fixed_string)]
    postfix = target_string[len(target_string) - len(last_fixed_string):]
    # make sure the first and last fixed strings match the target string
    # if not, the target string does not match
    if not (prefix == first_fixed_string and postfix == last_fixed_string):
        return []
    else:
        # now, do the check for the middle fixed strings
        return [[prefix] + middle_match + [postfix] for middle_match in get_middle_matches(middle, middle_fixed_strings)]
print(solution(["I like ", " and ", " because ", "do"],
"I like lettuce and carrots and onions because I do"))
print([("I like ", "lettuce", " and ", "carrots and onions", " because ", "I ", "do"),
("I like ", "lettuce and carrots", " and ", "onions", " because ", "I ", "do")])
print()
print(solution(["take ", " to the park"], "take Alice to the park"))
print([("take ", "Alice", " to the park")])
print()
# Courtesy of #ktzr
print(solution(["I like ", " because "],
"I don't like cheese because I'm lactose-intolerant"))
print([])
print()
print(solution(["I", "want", "or", "done"],
"I want my sandwich or I want my pizza or salad done"))
print([("I", " ", "want", " my sandwich ", "or", " I want my pizza or salad ", "done"),
("I", " ", "want", " my sandwich or I want my pizza ", "or", " salad ", "done"),
("I", " want my sandwich or I", "want", " my pizza ", "or", " salad ", "done")])
Previous answer:
To answer my question: the itertools.product function and regex.finditer with the overlapped parameter were the two key functions in this solution. I figured I'd include my final code in case it helps someone else in a similar situation.
I really care about my code being super readable, so I ended up coding my own solution based on #ktzr's solution. (Thank you!)
My solution uses a couple of weird things.
First, it uses an overlapped parameter, which is only available through the third-party regex module. It must be installed (most likely via pip install regex) and then imported at the top with import regex as re. This makes it easy to search for overlapping matches in a string.
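For illustration (a toy example of mine, not part of the solution below), this is what the flag changes:
import regex as re  # third-party module: pip install regex

# overlapping vs. non-overlapping matches of "aa" in "aaaa"
print(re.findall("aa", "aaaa"))                   # ['aa', 'aa']
print(re.findall("aa", "aaaa", overlapped=True))  # ['aa', 'aa', 'aa']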
Second, my solution makes use of an itertools recipe that isn't included in the library itself, which you have to define like this:
import itertools

def itertools_pairwise(iterable):
    '''s -> (s0,s1), (s1,s2), (s2, s3), ...'''
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)
This function simply lets me iterate pairwise through a list, making sure each element (except for the first and last) in the list is encountered twice.
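For example (on Python 3.10 and later, the same helper is available in the standard library as itertools.pairwise):
print(list(itertools_pairwise([0, 7, 7, 12])))
# [(0, 7), (7, 7), (7, 12)]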
With those two things in place, here's my solution:
def solution(fixed_strings, target_string):
    # doing the one-fixed-string case first because it happens a lot
    if len(fixed_strings) == 1:
        # if it matches, then there's just that one match; otherwise, there's none
        if target_string == fixed_strings[0]:
            return [target_string]
        else:
            return []
    # make sure the first and last fixed strings match the target string
    # if not, the target string does not match
    if not (target_string.startswith(fixed_strings[0]) and target_string.endswith(fixed_strings[-1])):
        return []
    # get the fixed strings in the middle that it now needs to search for in the middle of the target string
    middle_fixed_strings = fixed_strings[1:-1]
    # where in the target string it found the middle fixed strings.
    # middle_fixed_strings_placements is in the form: [[where it found the 1st middle fixed string], ...]
    # [where it found the xth middle fixed string] is in the form: [(start index, end index), ...]
    middle_fixed_strings_placements = [[match.span() for match in re.finditer(string, target_string, overlapped=True)]
                                       for string in middle_fixed_strings]
    # if any of the fixed strings couldn't be found in the target string, there are no matches
    if [] in middle_fixed_strings_placements:
        return []
    # get all of the possible ways each of the middle strings could be found once within the target string
    all_placements = itertools.product(*middle_fixed_strings_placements)
    # remove the cases where the middle strings overlap or are out of order
    good_placements = [placement for placement in all_placements
                       if not (True in [placement[index][1] > placement[index + 1][0]
                                        for index in range(len(placement) - 1)])]
    # create a list of all the possible final matches
    matches = []
    target_string_len = len(target_string)  # cache for later
    # save the start and end spans, which are predetermined by their length and placement
    start_span = (0, len(fixed_strings[0]))
    end_span = (target_string_len - len(fixed_strings[-1]), target_string_len)
    for placement in good_placements:
        placement = list(placement)
        # add in the spans for the first and last fixed strings
        # this makes it so each placement is in the form: [1st fixed string span, ..., last fixed string span]
        placement.insert(0, start_span)
        placement.append(end_span)
        # flatten the placements list to get the places where we need to cut up the string.
        # we want to cut the string at the span values to get out the fixed strings
        cuts = [cut for span in placement for cut in span]
        match = []
        # go through the cuts and make them to create the list
        for start_cut, end_cut in itertools_pairwise(cuts):
            match.append(target_string[start_cut:end_cut])
        matches.append(match)
    return matches

Related

How to insert space by punctuation?

I have strings like so: hey what is up!, "what did you say?", "he said 'well'", etc., and a regex expression like so: [!%&'\(\)$#\"\/\\*+,-.:;<=>?#\[\]^_´{|}~]. These are my delimiters, and a space shall be inserted into the strings shown, like so: "hey what is up !", "what did you say ?", "he said ' well '". So if one of the delimiters is in front of another character sequence, add a space, and if it is after, add a space as well.
How can I achieve this? I do not want to split by these delimiters.
Here's my solution but I would be curious how to solve it with regex.
space = set("[!%&'()$#\"/\*+,-.:;<=>?#[]^_´`{|}~]")
for sent in self.sentences:
    sent = list(sent)
    for i, char in enumerate(sent):
        # Make sure to respect length of string when indexing
        if i != 0:
            # insert space in front if char is punctuation
            if sent[i] in space and sent[i - 1] != " ":
                sent.insert(i, " ")
        if i != len(sent) - 1:
            # insert space after if char is punctuation
            if sent[i] in space and sent[i + 1] != " ":
                sent.insert(i + 1, " ")
You could expand your pattern to catch optional spaces and then replace with the capture group plus spaces before and after (the loop is only for the demo, not necessary):
import re

strings = ["hey what is up!", "what did you say?", "he said 'well'"]
pattern = r'(\s?[!%&\'\(\)$#\"\/\\*+,-.:;<=>?#\[\]^_´{|}~]\s?)'
for string in strings:
    print(re.sub(pattern, r' \1 ', string))
This will give this output:
hey what is up !
what did you say ?
he said ' well '
Without the aid of the re module you could simply do this:
punctuation = "!%&'()$#\"/\\*+,-.:;<=>?#[]^_´{|}~"
mystring = "Well hello! How are you?"
mylist = list(mystring)
i = 0
for c in mystring:
    if c in punctuation:
        mylist.insert(i, ' ')
        i += 2
    else:
        i += 1
print(''.join(mylist))
You can make a loop that goes through your string, and when it finds a punctuation character, use slicing to cut your string in two and concatenate the halves with a space in between.
For example:
for i, c in enumerate(yourString):
    if c == '!':
        newString = yourString[:i] + " " + yourString[i:]
It only checks for "!", but you could replace that with a dictionary of punctuation characters.

How can I not count the spaces in my string

So I have, e.g., this string: ' I love python ' and I want to convert all the spaces to '_'. My problem is that I also need to delete the outside spaces, so I don't end up with the result '_I_love_python__' but rather with 'I_love_python'.
I searched and found out that I can do it with a single line of code, mystring.strip().replace(" ", "_"), which is unfortunately syntax that I can't apply in my essay.
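For reference, that one-liner does give exactly the result I'm after; I just can't use it here:
mystring = ' I love python '
print(mystring.strip().replace(' ', '_'))  # prints I_love_python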
So what I landed with was this:
frase = str(input('Introduza: '))
aux = ''
for car in frase:
    if car == ' ':
        car = '_'
        aux += car
    else:
        aux += car
print(aux)
My problem now is deleting those outside spaces. What I thought about was running another for loop from the start and another from the end of the string, stopping as soon as they find a non-space character. But unfortunately I haven't been able to do that...
I appreciate all the help you can supply!
I came up with the following solution:
You iterate over the string, but instead of replacing a space with an underscore as soon as it appears, you store the number of spaces encountered. Then, once a non-space character is reached, you add that number of underscores to the string. So if the string ends with lots of spaces, a non-space character is never reached and the underscores are never added.
For cutting off the spaces at the beginning, I just added a condition before adding the underscores: "Have I encountered a non-space character before?"
Here is the code:
text = " I love   python    e "
out = ""
string_started = False
underscores_to_add = 0
for c in text:
    if c == " ":
        underscores_to_add += 1
    else:
        if string_started:
            out += "_" * underscores_to_add
        underscores_to_add = 0
        string_started = True
        out += c
print(out)  # prints "I_love___python____e"
You can use the following trick to remove leading and trailing spaces in your string:
s = ' I love python '
ind1 = min(len(s) if c == ' ' else n for n, c in enumerate(s))
ind2 = max(0 if c == ' ' else n for n, c in enumerate(s))
s = ''.join('_' if c == ' ' else c for c in s[ind1:ind2 + 1])
print('*', s, '*', sep='')
Output:
*I_love_python*
If you are not allowed to use the strip() method:
def find(text):
    for i, s in enumerate(text):
        if s != " ":
            break
    return i

text = " I love   python    e "
text[find(text):len(text) - find(text[::-1])].replace(" ", "_")

texts = [" I love   python    e ", "I love   python    e", " I love   python    e", "I love   python    e ", "I love   python    e"]
for text in texts:
    print(text[find(text):len(text) - find(text[::-1])].replace(" ", "_"))
output:
I_love___python____e
I_love___python____e
I_love___python____e
I_love___python____e
I_love___python____e
Given a string, find will return the index of the first non-space character in it.
Use find on the string and on its reverse to locate the first and last non-space characters.
Get the substring using the indices found above.
Replace all spaces with _ in that substring.

How can I concatenate the strings in a "for clause"?

filecontents = f.read()
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(filecontents)
for sentence in sent_tokenize_list:
    sentence = "Start " + sentence + " End"
    print sentence
This results in something like:
"start ~~~~~~ end "
"start ~~~~~ end"
"start ~~~~~ end"
but I want to concatenate them all into a single string. How can I do that?
You can use a list comprehension and the join method on an empty string to join all of the sentences.
print "".join(["Start " + sentence + " End" for sentence in sent_tokenize_list])
You can also use a generator expression, as mentioned in the comments, for better performance.
print "".join("Start " + sentence + " End" for sentence in sent_tokenize_list)
You can just add each processed sentence to the result, like this
result = ''
for sentence in sent_tokenize_list:
    result += "Start " + sentence + " End"
print result
Alternatively (and in a more pythonic way, I feel), you can use a list comprehension instead of the for clause to make a list of modified sentences and then just join all the sentences together:
new_sentences = ['Start ' + sentence + ' End' for sentence in sent_tokenize_list]
result = ''.join(new_sentences)
One liner:
" ".join(["Start " + sentence + " End" for sentence in sent_tokenize_list])
Try a list comprehension:
result = "".join("Start {} end".format(sentence) for sentence in sent_tokenize_list)

How to use any item in a list as a split value in Python?

I want to do something to the effect of:
x = [" ","cats"]
def function(a):
    a.split([any of the elements in x])
    return a
How would I make Python accept any of the elements in x as an argument (so it splits at spaces and at "cats")?
Edit: Sorry, I should have probably specified: a is expected to be a list
(I'm really, really sorry for not being clear.)
input - a = "I love cats so much"
expected output - ["I", "love", "so", "much"]
Basically, I want it to perform the split function for " " and then for "cats".
Here's one way. Not the most readable of approaches but gets the job done.
import re
x = [" ", "cats"]
pattern = "|".join(x)
re.split(pattern, "I love cats very much")
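One caveat worth adding (my own note, not part of the answer above): if a delimiter could contain regex metacharacters, escape it first, and you may want to drop the empty strings that re.split leaves where two delimiters touch:
import re

x = [" ", "cats"]
pattern = "|".join(map(re.escape, x))  # escape each delimiter in case it contains metacharacters
parts = re.split(pattern, "I love cats so much")
print([p for p in parts if p])  # drop empty strings -> ['I', 'love', 'so', 'much']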
def split_many(s, *words):
    return filter(bool, reduce(lambda s, word:
        s.replace(word, words[0]), words, s).split(words[0]))

print split_many('I love cats so much', 'cats', ' ')
s = "I love cats so much"
x = [" ", "cats"]
for delim in x:
s = s.replace(delim, " ")
print s.split()
You're making a new list, without any of the cats.

Separate words into list, except for symbols

I'm creating a project where I'll receive a list of tweets (from Twitter) and then check whether their words are in a dictionary that maps words to certain values. I've gotten my code to take the words, but I don't know how to eliminate symbols like , . " :
Here's the code:
def getTweet(tweet, dictionary):
    score = 0
    seperate = tweet.split(' ')
    print seperate
    print "------"
    if(len(tweet) > 0):
        for item in seperate:
            if item in dictionary:
                print item
                score = score + int(dictionary[item])
        print "here's the score: " + str(score)
        return score
    else:
        print "you haven't tweeted a tweet"
        return 0
Here's the parameter/tweet that will be checked:
getTweet("you are the best loyal friendly happy cool nice", scoresDict)
Any ideas?
If you want to get rid of all the non-alphanumeric characters you can try:
import re
re.sub(r'[^\w]', ' ', string)
The character class [^\w] will do the trick for you!
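For example, applied to the tweet before splitting (a small sketch of my own, using the tweet from the question plus a few symbols):
import re

tweet = 'you are the best, loyal. friendly "happy" cool nice'
cleaned = re.sub(r'[^\w]', ' ', tweet)  # every non-alphanumeric character becomes a space
print(cleaned.split())                  # split() with no argument also collapses the extra spaces
# ['you', 'are', 'the', 'best', 'loyal', 'friendly', 'happy', 'cool', 'nice']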
Before doing the split, replace the characters with spaces, and then split on the spaces.
import re
line = ' a.,b"c'
line = re.sub('[,."]', ' ', line)
print line # ' a b c'
