Split string on n or more whitespaces - python

I have a string like this:
sentence = 'This is   a nice   day'
I want to have the following output:
output = ['This is', 'a nice', 'day']
In this case the string is split wherever n = 3 or more consecutive whitespace characters occur, which is why it is split as shown above.
How can I efficiently do this for any n?

You may try using Python's regex split:
import re

sentence = 'This is   a nice   day'
output = re.split(r'\s{3,}', sentence)
print(output)
['This is', 'a nice', 'day']
To handle this for an arbitrary n, we can build the pattern dynamically:
n = 3
pattern = r'\s{' + str(n) + ',}'
output = re.split(pattern, sentence)
print(output)
['This is', 'a nice', 'day']
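If the same n is used repeatedly, the pattern can also be built with an f-string and precompiled; this is just a small sketch of that variation (split_on_whitespace is an illustrative name, not from the answer above):
import re

def split_on_whitespace(sentence, n=3):
    # \s{n,} matches n or more consecutive whitespace characters
    pattern = re.compile(rf'\s{{{n},}}')
    return pattern.split(sentence)

print(split_on_whitespace('This is   a nice   day'))  # ['This is', 'a nice', 'day']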

You can use the basic .split() function (note that this splits on exactly n consecutive spaces, not n or more):
sentence = 'This is   a nice   day'
n = 3
sentence.split(' ' * n)
>>> ['This is', 'a nice', 'day']

You can also split by n spaces, strip the results and drop any empty elements (longer runs of whitespace would otherwise produce them):
sentence = 'This is   a nice   day'
n = 3
parts = [part.strip() for part in sentence.split(' ' * n) if part.strip()]
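For example, with a gap wider than n spaces the plain split leaves empty pieces, which the strip-and-filter comprehension removes (the wider-gap input below is illustrative, not from the question):
sentence = 'This is      a nice   day'  # six spaces before 'a'
n = 3
print(sentence.split(' ' * n))
# ['This is', '', 'a nice', 'day']
parts = [part.strip() for part in sentence.split(' ' * n) if part.strip()]
print(parts)
# ['This is', 'a nice', 'day']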

x = 'This is   a nice   day'
result = [i.strip() for i in x.split(' ' * 3)]
print(result)
['This is', 'a nice', 'day']

Related

How do I convert a list of strings to a proper sentence

How do I convert a list of strings to a proper sentence like this?
lst = ['eat', 'drink', 'dance', 'sleep']
string = 'I love"
output: "I love to eat, drink, dance and sleep."
Note: the "to" needs to be generated and not added manually to string
Thanks!
You can join all the verbs except the last with commas, and append the last one with an "and":
def build(start, verbs):
    return f"{start} to {', '.join(verbs[:-1])} and {verbs[-1]}."

string = 'I love'
lst = ['eat', 'drink', 'dance', 'sleep']
print(build(string, lst))  # I love to eat, drink, dance and sleep.
lst = ['eat', 'drink', 'dance', 'sleep', 'run', 'walk', 'count']
print(build(string, lst))  # I love to eat, drink, dance, sleep, run, walk and count.
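One caveat worth noting: with a single verb, verbs[:-1] is empty and the f-string produces "I love to  and eat.". A small guard fixes that (this addition is my own sketch, not part of the answer above):
def build(start, verbs):
    if len(verbs) == 1:
        # with only one verb there is nothing to join with commas or "and"
        return f"{start} to {verbs[0]}."
    return f"{start} to {', '.join(verbs[:-1])} and {verbs[-1]}."

print(build('I love', ['eat']))                    # I love to eat.
print(build('I love', ['eat', 'drink', 'sleep']))  # I love to eat, drink and sleep.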
One option, using list-to-string joining:
import re

lst = ['eat', 'drink', 'dance', 'sleep']
string = 'I love'
output = string + ' to ' + ', '.join(lst)
output = re.sub(r', (?!.*,)', ' and ', output)
print(output)  # I love to eat, drink, dance and sleep
Note that the call to re.sub above selectively replaces the final comma with and.
Hey, you can build the bigger string from the list elements as follows:
verbs = ", ".join(lst[:-1])        # "eat, drink, dance"
verbs = verbs + " and " + lst[-1]  # "eat, drink, dance and sleep"
string = string + ' to ' + verbs   # "I love to eat, drink, dance and sleep"
print(string)

How to split a list of strings?

Is there a way to split a list of strings on a given character?
Here is a simple list that I want to split on "!":
name1 = ['hello! i like apples!', ' my name is ! alfred!']
first = name1.split("!")
print(first)
I know it's not expected to run; I essentially want a new list of strings whose elements are the pieces separated by "!". So the output could be:
["hello", "i like apples", "my name is", "alfred"]
Based on your given output, I've "solved" the problem.
So basically what I do is:
1.) Create one big string by simply concatenating all of the strings contained in your list.
2.) Split the big string by character "!"
Code:
lst = ['hello! i like apples!', 'my name is ! alfred!']
s = "".join(lst)
result = s.split('!')
print(result)
Output:
['hello', ' i like apples', 'my name is ', ' alfred', '']
Just loop on each string and flatten its split result to a new list:
name1 = ['hello! i like apples!', ' my name is ! alfred!']
print([s.strip() for sub in name1 for s in sub.split('!') if s])
Gives:
['hello', 'i like apples', 'my name is', 'alfred']
Try this:
name1 = ['hello! i like apples!', 'my name is ! alfred!']
new_list = []
for item in name1:
    new_list += item.split('!')
while '' in new_list:
    new_list.remove('')
print(new_list)
Prints:
['hello', ' i like apples', 'my name is ', ' alfred']

split string by using regex in python

What is the best way to split a string like
text = "hello there how are you"
in Python?
So that I'd end up with a list like this:
['hello there', 'there how', 'how are', 'are you']
I have tried this:
liste = re.findall('((\S+\W*){'+str(2)+'})', text)
for a in liste:
    print(a[0])
But I'm getting:
hello there
how are
you
How can I make the findall function move only one token when searching?
Here's a solution with re.findall:
>>> import re
>>> text = "hello there how are you"
>>> re.findall(r"(?=(?:(?:^|\W)(\S+\W\S+)(?:$|\W)))", text)
['hello there', 'there how', 'how are', 'are you']
Have a look at the Python docs for re: https://docs.python.org/3/library/re.html
(?=...) Lookahead assertion
(?:...) Non-capturing regular parentheses
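The lookahead matters because a plain findall consumes each match, so overlapping pairs are skipped; wrapping the pattern in (?=...) reports the captured group without consuming anything, letting the scan advance word by word. A quick illustration (a sketch using the same text as above):
import re

text = "hello there how are you"

# Without a lookahead each match consumes its text, so pairs cannot overlap.
print(re.findall(r"\S+\W\S+", text))
# ['hello there', 'how are']

# With the lookahead from the answer, every overlapping pair is found.
print(re.findall(r"(?=(?:(?:^|\W)(\S+\W\S+)(?:$|\W)))", text))
# ['hello there', 'there how', 'how are', 'are you']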
If regex isn't required you could do something like:
l = text.split(' ')
out = []
for i in range(len(l)):
    try:
        out.append(l[i] + ' ' + l[i+1])
    except IndexError:
        continue
Explanation:
First split the string on the space character; the result is a list where each element is a word in the sentence. Instantiate an empty list to hold the result. Loop over the list of words, adding each two-word combination separated by a space to the output list. This throws an IndexError when trying to access the word after the last one; just catch it and continue, since you don't seem to want that lone word in your result anyway.
I don't think you actually need regex for this.
I understand you want a list in which each element contains two words, the latter also being the former of the following element. We can do this easily like this:
string = "Hello there how are you"
liste = string.split(" ")
for i in range(len(liste) - 1):
    liste[i] = liste[i] + " " + liste[i+1]
liste.pop(-1)  # drop the last element, which otherwise holds only a single word
I don't know if it's mandatory for you to use regex, but I'd do it this way.
First, you can get the list of words with the str.split() method.
>>> sentence = "hello there how are you"
>>> splited_sentence = sentence.split(" ")
>>> splited_sentence
['hello', 'there', 'how', 'are', 'you']
Then, you can make pairs.
>>> output = []
>>> for i in range(1, len(splited_sentence)):
...     output += [splited_sentence[i-1] + ' ' + splited_sentence[i]]
...
>>> output
['hello there', 'there how', 'how are', 'are you']
An alternative is just to split, zip, then join like so...
sentence = "Hello there how are you"
words = sentence.split()
[' '.join(i) for i in zip(words, words[1:])]
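The same zip idea generalizes to word windows of any size; here is a small sketch of that (the window size n is a parameter I'm introducing, not part of the original answer):
sentence = "Hello there how are you"
words = sentence.split()
n = 3  # window size
print([' '.join(group) for group in zip(*(words[i:] for i in range(n)))])
# ['Hello there how', 'there how are', 'how are you']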
Another possible solution using findall.
>>> liste = list(map(''.join, re.findall(r'(\S+(?=(\s+\S+)))', text)))
>>> liste
['hello there', 'there how', 'how are', 'are you']
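For what it's worth, findall returns a pair of groups per match here (the word plus the lookahead's captured " next-word"), which is why the map(''.join, ...) step is needed; roughly:
import re
text = "hello there how are you"
print(re.findall(r'(\S+(?=(\s+\S+)))', text))
# [('hello', ' there'), ('there', ' how'), ('how', ' are'), ('are', ' you')]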

Erasing list of phrases from list of texts in python

I am trying to erase specific words found in a list. Let's say that I have the following example:
a= ['you are here','you are there','where are you','what is that']
b = ['you','what is']
The desired output should be the following:
['are here', 'are there', 'where are', 'that']
I created the following code for that task:
import re
def _find_word_and_remove(w, strings):
    """
    w: (string)
    strings: (string)
    """
    temp = re.sub(r'\b({0})\b'.format(w), '', strings).strip()  # removes the word from the string
    return re.sub(r"\s{1,}", " ", temp)  # collapses multiple spaces into one

def find_words_and_remove(words, strings):
    """
    words: (list)
    strings: (list)
    """
    if len(words) == 1:
        return [_find_word_and_remove(words[0], word_a) for word_a in strings]
    else:
        temp = [_find_word_and_remove(words[0], word_a) for word_a in strings]
        return find_words_and_remove(words[1:], temp)
find_words_and_remove(b,a)
>>> ['are here', 'are there', 'where are', 'that']
It seems that I am over-complicating things by using recursion for this task. Is there a simpler and more readable way to do it?
You can use list comprehension:
def find_words_and_remove(words, strings):
    return [" ".join(word for word in string.split() if word not in words) for string in strings]
That will work only when there are single words in b, but because of your edit and comment, I now know that you really do need _find_word_and_remove(). Your recursion way isn't really too bad, but if you don't want recursion, do this:
def find_words_and_remove(words, strings):
    strings_copy = strings[:]
    for word in words:
        strings_copy = [_find_word_and_remove(word, string) for string in strings_copy]
    return strings_copy
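A quick check with the question's data (this assumes the OP's _find_word_and_remove helper from above is already defined):
a = ['you are here', 'you are there', 'where are you', 'what is that']
b = ['you', 'what is']
print(find_words_and_remove(b, a))
# ['are here', 'are there', 'where are', 'that']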
The simple way is to use regex:
import re
a= ['you are here','you are there','where are you','what is that']
b = ['you','what is']
Here you go:
def find_words_and_remove(b, a):
    return [re.sub("|".join(b), "", x).strip() if len(re.sub("|".join(b), "", x).strip().split(" ")) < len(x.split(' ')) else x for x in a]
find_words_and_remove(b,a)
>> ['are here', 'are there', 'where are', 'that']
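That one-liner is fairly dense; unrolled, it does roughly the following (a sketch of the same logic, not a different algorithm):
import re

def find_words_and_remove(b, a):
    pattern = "|".join(b)  # e.g. 'you|what is'
    result = []
    for x in a:
        stripped = re.sub(pattern, "", x).strip()
        # only keep the stripped version if something was actually removed
        if len(stripped.split(" ")) < len(x.split(' ')):
            result.append(stripped)
        else:
            result.append(x)
    return result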

Search list: match only exact word/string

How do I match an exact string/word while searching a list? I have tried, but it's not correct. Below I have given the sample list, my code and the test results.
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
my code:
for str in list:
    if s in str:
        print str
test results:
s = "hello" ~ expected output: 'Hi, hello' ~ output I get: 'Hi, hello'
s = "123" ~ expected output: *nothing* ~ output I get: 'hi mr 12345'
s = "12345" ~ expected output: 'hi mr 12345' ~ output I get: 'hi mr 12345'
s = "come" ~ expected output: *nothing* ~ output I get: 'welcome sir'
s = "welcome" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
s = "welcome sir" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
My list contains more than 200K strings
It looks like you need to perform this search more than once, so I would recommend converting your list into a dictionary:
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> d = dict()
>>> for item in l:
...     for word in item.split():
...         d.setdefault(word, list()).append(item)
...
So now you can easily do:
>>> d.get('hi')
['hi mr 12345']
>>> d.get('come') # nothing
>>> d.get('welcome')
['welcome sir']
P.S. You will probably have to improve item.split() to handle commas, periods and other separators; maybe use a regex with \w.
P.P.S. As cularion mentioned, this won't match "welcome sir". If you want to match the whole string as well, it is just one additional line on top of the proposed solution, but if you have to match parts of a string bounded by spaces and punctuation, regex should be your choice.
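For example, the index could be built from \w tokens and the full string added as a key as well (a sketch combining those two suggestions, not the original answer's code):
import re

l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
d = dict()
for item in l:
    for word in re.findall(r'\w+', item):    # \w+ skips commas and other punctuation
        d.setdefault(word, list()).append(item)
    d.setdefault(item, list()).append(item)  # the extra line: index the whole string too

d.get('hello')        # ['Hi, hello']
d.get('come')         # None
d.get('welcome sir')  # ['welcome sir']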
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> search = lambda word: filter(lambda x: word in x.split(),l)
>>> search('123')
[]
>>> search('12345')
['hi mr 12345']
>>> search('hello')
['Hi, hello']
If you are searching for an exact word match:
for str in list:
    if set(s.split()) & set(str.split()):
        print str
Provided s only ever consists of just a few words, you could do
s = s.split()
n = len(s)
for x in my_list:
    words = x.split()
    if s in (words[i:i+n] for i in range(len(words) - n + 1)):
        print x
If s consists of many words, there are more efficient, but also much more complex, algorithms for this.
Use a regular expression here to match the exact word with the word boundary \b:
import re
.....
for str in list:
    if re.search(r'\b' + wordToLook + r'\b', str):
        print str
\b only matches a whole word, i.e. one that starts and ends at a word boundary such as a space or a line break.
Or do something like this to avoid typing the words you are searching for again and again:
import re
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
listOfWords = ['hello', 'Mr', '123']
reg = re.compile(r'(?i)\b(?:%s)\b' % '|'.join(listOfWords))
for str in list:
    if reg.search(str):
        print str
(?i) makes the search case-insensitive; if you want a case-sensitive search, remove it.
