Python: Split string every three words

I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code that is supposed to split the string after every three words:
import re
def splitTextToTriplet(Text):
    x = re.split(r'^((?:\S+\s+){2}\S+).*', Text)
    return x
print(splitTextToTriplet("Do you know how to sing"))
Currently the output is as such:
['', 'Do you know', '']
But I am actually expecting this output:
['Do you know', 'how to sing']
And if I print(splitTextToTriplet("Do you know how to")), it should also output:
['Do you know', 'how to']
How can I change the regex so it produces the expected output?

I believe re.split might not be the best approach for this since look-behind cannot take variable-length patterns.
Instead, you could use str.split and then join back words together.
def splitTextToTriplet(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 3]) for i in range(0, len(words), 3)]
    return grouped_words
splitTextToTriplet("Do you know how to sing")
# ['Do you know', 'how to sing']
splitTextToTriplet("Do you know how to")
# ['Do you know', 'how to']
Be advised, though, that with this solution any linebreaks among the whitespace are lost in the process.
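If linebreaks do matter, one possible workaround (a sketch; `split_triplets_per_line` is a hypothetical helper, not from the answer above) is to regroup words line by line, so that triplets never span a linebreak:

```python
def split_triplets_per_line(text):
    # Group words three at a time within each line, so words never
    # regroup across a linebreak.
    result = []
    for line in text.splitlines():
        words = line.split()
        result.extend(' '.join(words[i:i + 3]) for i in range(0, len(words), 3))
    return result

split_triplets_per_line("Do you know\nhow to sing")
# ['Do you know', 'how to sing']
```

Note that this regroups within each line, so a short line yields a short group rather than borrowing words from the next line.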

I used re.findall for the output you expected. To get a more generic split function, I replaced splitTextToTriplet with splitTextonWords, which takes numberOfWords as a parameter:
import re
def splitTextonWords(Text, numberOfWords=1):
    if numberOfWords > 1:
        text = Text.lstrip()
        pattern = r'(?:\S+\s*){1,' + str(numberOfWords - 1) + r'}\S+(?!=\s*)'
        x = re.findall(pattern, text)
    elif numberOfWords == 1:
        x = Text.split()
    else:
        x = None
    return x
print(splitTextonWords("Do you know how to sing", 3))
print(splitTextonWords("Do you know how to", 3))
print(splitTextonWords("Do you know how to sing how to dance how to", 3))
print(splitTextonWords("A sentence this code will fail at", 3))
print(splitTextonWords("A sentence this code will fail at ", 3))
print(splitTextonWords(" A sentence this code will fail at s", 3))
print(splitTextonWords(" A sentence this code will fail at s", 4))
print(splitTextonWords(" A sentence this code will fail at s", 2))
print(splitTextonWords(" A sentence this code will fail at s", 1))
print(splitTextonWords(" A sentence this code will fail at s", 0))
output:
['Do you know', 'how to sing']
['Do you know', 'how to']
['Do you know', 'how to sing', 'how to dance', 'how to']
['A sentence this', 'code will fail', 'at']
['A sentence this', 'code will fail', 'at']
['A sentence this', 'code will fail', 'at s']
['A sentence this code', 'will fail at s']
['A sentence', 'this code', 'will fail', 'at s']
['A', 'sentence', 'this', 'code', 'will', 'fail', 'at', 's']
None
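For comparison, the same generic behaviour (minus the findall subtleties) can be sketched without regex by reusing the slice-and-join idea from the earlier answer; `split_text_on_words` here is a hypothetical name:

```python
def split_text_on_words(text, n=1):
    # Group every n whitespace-separated words back into one string;
    # mirror the regex version's behaviour of returning None for n < 1.
    if n < 1:
        return None
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]

split_text_on_words("Do you know how to sing how to dance how to", 3)
# ['Do you know', 'how to sing', 'how to dance', 'how to']
```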

Using the grouper itertools recipe:
import itertools
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)
See also the more_itertools third-party library that implements this recipe for you.
Code
def split_text_to_triplet(s):
    """Return strings of three words."""
    return [" ".join(c) for c in grouper(s.split(), 3)]
split_text_to_triplet("Do you know how to sing")
# ['Do you know', 'how to sing']
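One caveat: when the word count is not a multiple of 3, zip_longest pads the last group with the default fillvalue=None, which would crash " ".join. Passing fillvalue="" and stripping handles that (a sketch building on the recipe above):

```python
import itertools

def grouper(iterable, n, fillvalue=None):
    # itertools recipe: collect data into fixed-length chunks or blocks
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

def split_text_to_triplet(s):
    # Pad short final groups with '' and strip the trailing spaces afterwards.
    return [" ".join(c).strip() for c in grouper(s.split(), 3, fillvalue="")]

split_text_to_triplet("Do you know how to")
# ['Do you know', 'how to']
```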

Related

How do I convert a list of strings to a proper sentence

How do I convert a list of strings to a proper sentence like this?
lst = ['eat', 'drink', 'dance', 'sleep']
string = 'I love'
output: "I love to eat, drink, dance and sleep."
Note: the "to" needs to be generated and not added manually to string
Thanks!
You can join all the verbs except the last with commas, and add the last with an "and":
def build(start, verbs):
    return f"{start} to {', '.join(verbs[:-1])} and {verbs[-1]}."

string = 'I love'
lst = ['eat', 'drink', 'dance', 'sleep']
print(build(string, lst))  # I love to eat, drink, dance and sleep.
lst = ['eat', 'drink', 'dance', 'sleep', 'run', 'walk', 'count']
print(build(string, lst))  # I love to eat, drink, dance, sleep, run, walk and count.
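Note that build() assumes at least two verbs: with a single verb, verbs[:-1] is empty and the output reads "I love to  and eat.". A hedged variant (build_safe is a hypothetical name) that special-cases short lists:

```python
def build_safe(start, verbs):
    # One verb: no comma/'and' machinery needed.
    if len(verbs) == 1:
        return f"{start} to {verbs[0]}."
    return f"{start} to {', '.join(verbs[:-1])} and {verbs[-1]}."

build_safe('I love', ['eat'])           # 'I love to eat.'
build_safe('I love', ['eat', 'drink'])  # 'I love to eat and drink.'
```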
One option, using list-to-string joining:
import re

lst = ['eat', 'drink', 'dance', 'sleep']
string = 'I love'
output = string + ' to ' + ', '.join(lst)
output = re.sub(r', (?!.*,)', ' and ', output)
print(output)  # I love to eat, drink, dance and sleep
Note that the call to re.sub above selectively replaces the final comma with and.
You can also concatenate the string elements of the list into a bigger string step by step (note that join is called on the separator string, not on the list):
verbs = ", ".join(lst[:-1])  # This will result in "eat, drink, dance"
verbs = verbs + " and " + lst[-1]  # This will result in "eat, drink, dance and sleep"
string = string + ' to ' + verbs  # This will result in "I love to eat, drink, dance and sleep"
print(string)

Split string on n or more whitespaces

I have a string like this:
sentence = 'This is   a nice   day'
I want to have the following output:
output = ['This is', 'a nice', 'day']
In this case, I split the string on n=3 or more whitespaces, which is why it is split as shown above.
How can I efficiently do this for any n?
You may try using Python's regex split:
import re

sentence = 'This is   a nice   day'
output = re.split(r'\s{3,}', sentence)
print(output)
['This is', 'a nice', 'day']
To handle this for an actual variable n, we can try:
n = 3
pattern = r'\s{' + str(n) + ',}'
output = re.split(pattern, sentence)
print(output)
['This is', 'a nice', 'day']
You can use the basic .split() function:
sentence = 'This is   a nice   day'
n = 3
sentence.split(' ' * n)
>>> ['This is', 'a nice', 'day']
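One difference worth knowing: .split(' ' * n) splits on each exact run of n spaces, so a longer run leaves leftover spaces in the pieces, while the regex \s{n,} swallows the whole run. A small comparison (a sketch):

```python
import re

sentence = 'This is    a nice day'  # four spaces between 'is' and 'a'
n = 3
print(sentence.split(' ' * n))             # ['This is', ' a nice day'] -- leftover space
print(re.split(r'\s{%d,}' % n, sentence))  # ['This is', 'a nice day']
```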
You can also split by n spaces, strip the results, and drop empty elements (which several consecutive long space runs would otherwise produce):
sentence = 'This is   a nice   day'
n = 3
parts = [part.strip() for part in sentence.split(' ' * n) if part.strip()]
x = 'This is   a nice   day'
result = [i.strip() for i in x.split(' ' * 3)]
print(result)
['This is', 'a nice', 'day']

split string by using regex in python

What is the best way to split a string like
text = "hello there how are you"
in Python?
So I'd end up with an array like such:
['hello there', 'there how', 'how are', 'are you']
I have tried this:
liste = re.findall(r'((\S+\W*){' + str(2) + '})', text)
for a in liste:
    print(a[0])
But I'm getting:
hello there
how are
you
How can I make the findall function move only one token when searching?
Here's a solution with re.findall:
>>> import re
>>> text = "hello there how are you"
>>> re.findall(r"(?=(?:(?:^|\W)(\S+\W\S+)(?:$|\W)))", text)
['hello there', 'there how', 'how are', 'are you']
Have a look at the Python docs for re: https://docs.python.org/3/library/re.html
(?=...) Lookahead assertion
(?:...) Non-capturing regular parentheses
If regex isn't required, you could do something like:
l = text.split(' ')
out = []
for i in range(len(l)):
    try:
        out.append(l[i] + ' ' + l[i+1])
    except IndexError:
        continue
Explanation:
First split the string on the space character. The result will be a list where each element is a word in the sentence. Instantiate an empty list to hold the result. Loop over the list of words, adding the two-word combinations separated by a space to the output list. This will throw an IndexError when accessing the last word in the list; just catch it and continue, since you don't seem to want that lone word in your result anyway.
I don't think you actually need regex for this.
I understand you want a list, in which each element contains two words, the latter also being the former of the following element. We can do this easily like this:
string = "Hello there how are you"
liste = string.split(" ")
for i in range(len(liste) - 1):
    liste[i] = liste[i] + " " + liste[i + 1]
liste.pop(-1)
# we remove the last index afterwards, as otherwise we'd have an element with only one word
I don't know if it's mandatory for you to use regex, but I'd do it this way.
First, you can get the list of words with the str.split() method.
>>> sentence = "hello there how are you"
>>> splited_sentence = sentence.split(" ")
>>> splited_sentence
['hello', 'there', 'how', 'are', 'you']
Then, you can make pairs.
>>> output = []
>>> for i in range(1, len(splited_sentence)):
...     output += [splited_sentence[i-1] + ' ' + splited_sentence[i]]
...
>>> output
['hello there', 'there how', 'how are', 'are you']
An alternative is just to split, zip, then join like so...
sentence = "Hello there how are you"
words = sentence.split()
[' '.join(i) for i in zip(words, words[1:])]
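The zip idea generalizes to any window size n by zipping n shifted slices of the word list (a sketch; ngrams is a hypothetical name):

```python
def ngrams(sentence, n):
    # Zip n shifted views of the word list to get overlapping n-word windows.
    words = sentence.split()
    return [' '.join(w) for w in zip(*(words[i:] for i in range(n)))]

ngrams("hello there how are you", 2)
# ['hello there', 'there how', 'how are', 'are you']
```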
Another possible solution using findall.
>>> liste = list(map(''.join, re.findall(r'(\S+(?=(\s+\S+)))', text)))
>>> liste
['hello there', 'there how', 'how are', 'are you']

Erasing list of phrases from list of texts in python

I am trying to erase specific words found in a list. Let's say that I have the following example:
a= ['you are here','you are there','where are you','what is that']
b = ['you','what is']
The desired output should be the following:
['are here', 'are there', 'where are', 'that']
I created the following code for that task:
import re
def _find_word_and_remove(w, strings):
    """
    w: (string)
    strings: (string)
    """
    temp = re.sub(r'\b({0})\b'.format(w), '', strings).strip()  # removes word from string
    return re.sub(r"\s{1,}", " ", temp)  # removes double spaces
def find_words_and_remove(words, strings):
    """
    words: (list)
    strings: (list)
    """
    if len(words) == 1:
        return [_find_word_and_remove(words[0], word_a) for word_a in strings]
    else:
        temp = [_find_word_and_remove(words[0], word_a) for word_a in strings]
        return find_words_and_remove(words[1:], temp)

find_words_and_remove(b, a)
>>> ['are here', 'are there', 'where are', 'that']
It seems that I am over-complicating the 'things' by using recursion for this task. Is there a more simple and readable way to do this task?
You can use list comprehension:
def find_words_and_remove(words, strings):
    return [" ".join(word for word in string.split() if word not in words) for string in strings]
That will work only when there are single words in b, but because of your edit and comment, I now know that you really do need _find_word_and_remove(). Your recursion way isn't really too bad, but if you don't want recursion, do this:
def find_words_and_remove(words, strings):
    strings_copy = strings[:]
    for word in words:
        strings_copy = [_find_word_and_remove(word, s) for s in strings_copy]
    return strings_copy
The simple way is to use regex:
import re
a= ['you are here','you are there','where are you','what is that']
b = ['you','what is']
here you go:
def find_words_and_remove(b, a):
    return [
        re.sub("|".join(b), "", x).strip()
        if len(re.sub("|".join(b), "", x).strip().split(" ")) < len(x.split(' '))
        else x
        for x in a
    ]
find_words_and_remove(b,a)
>> ['are here', 'are there', 'where are', 'that']
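A caveat with "|".join(b): if a phrase contains regex metacharacters (e.g. '+', '.', '('), it is interpreted as a pattern rather than a literal. Escaping each phrase with re.escape avoids that (a sketch):

```python
import re

b = ['you', 'what is', 'a+b']  # 'a+b' would otherwise mean "one or more 'a', then 'b'"
pattern = '|'.join(re.escape(phrase) for phrase in b)
print(re.sub(pattern, '', 'a+b is here').strip())  # is here
```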

Select strings by positions of words

For the following tuple
mysentence = ('i have a dog and a cat', 'i have a cat and a dog', 'i have a cat',
              'i have a dog')
How do I select only the strings 'i have a cat' and 'i have a dog', i.e. exclude strings having the word dog or cat in the middle?
You can do this with regular expressions. The regex .+(dog|cat).+ will match one or more characters, followed by dog or cat, and one of more characters afterwards. You can then use filter to find strings which don't match this regex:
import re
regex = re.compile(r'.+(dog|cat).+')
sentence = ('i have a dog and a cat', 'i have a cat and a dog', 'i have a cat',
            'i have a dog')
filtered_sentence = list(filter(lambda s: not regex.match(s), sentence))
You could use a Regular Expression to match the sentences you don't want.
We can build up the pattern as follows:
We want to match dog or cat - (dog|cat)
followed by a space, i.e. not at the end of the line
So our code looks like so:
>>> mysentence = ('i have a dog and a cat', 'i have a cat and a dog', 'i have a cat', 'i have a dog')
>>> import re
>>> pattern = re.compile("(dog|cat) ")
>>> [x for x in mysentence if not pattern.search(x)]
['i have a cat', 'i have a dog']
If the string should just end with a specific phrase then this will do the job:
phases = ("I have a cat", "I have a dog")
for sentence in mysentence:
    for phase in phases:
        if sentence.lower().endswith(phase.lower()):
            print(sentence)
Simplest thing that could possibly work:
In [10]: [phrase for phrase in mysentence if not ' and ' in phrase]
Out[10]: ['i have a cat', 'i have a dog']
You can use regex or string methods.
I see others answered with regex, so I'll try string methods: string.find() gives the position of a substring within a string. Then check whether it is in the middle of the sentence.
def filter_function(sentence, words):
    for word in words:
        p = sentence.find(word)
        if p > 0 and p < len(sentence) - len(word):
            return 0
    return 1

for sentence in mysentence:
    print('%s: %d' % (sentence, filter_function(sentence, ['dog', 'cat'])))
You must also define what to do when the sentence contains only 'cat'.
for items in mysentence:
    if (items.find("dog") >= 0) ^ (items.find("cat") >= 0):
        print(items)
You just need the xor operator (^) and the find function. No need to import anything.
