Splitting words by whitespace without affecting brackets content using regex - python

I'm trying to tokenize sentences using re in Python, as in this example:
I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]
I wish to tokenize by splitting them using whitespace but without affecting the bracket set.
For example, I want the split list as:
["I", "want", "a", "(hot chocolate)[food]", "and", "(two)[quantity]", "boxes", "of", "(crispy bacon)[food]"]
How do I write the re.split expression to achieve this?

You can do this with the regex pattern: \s(?!\w+\))
import re
s = """I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"""
print(re.split(r'\s(?!\w+\))',s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
\s(?!\w+\))
The above pattern will NOT match any space that is followed by a word and a ), i.e. a space that sits directly before the last word inside ( ).
Test regex here: https://regex101.com/r/SRHEXO/1
Test python here: https://ideone.com/reIIcU
EDIT: Answer to the question from your comment:
Since your input has multiple words inside ( ), you can change the pattern to [\s,](?![\s\w]+\))
Test regex here: https://regex101.com/r/Ea9XlY/1
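As a rough illustration of why the second pattern is needed, the sketch below runs both patterns on a made-up sentence (not taken from the question) with three words inside the parentheses. The variable names are only for illustration.

```python
import re

# Hypothetical input with three words inside ( ), made up for this example.
s = "a (very hot chocolate)[food] now"

# First pattern: only protects the space right before the last word in ( ).
single_word = re.split(r'\s(?!\w+\))', s)

# Second pattern: protects every space that is followed only by
# words and spaces up to the closing ).
multi_word = re.split(r'[\s,](?![\s\w]+\))', s)

print(single_word)  # ['a', '(very', 'hot chocolate)[food]', 'now']
print(multi_word)   # ['a', '(very hot chocolate)[food]', 'now']
```

Note that the second pattern also splits on commas, because of the [\s,] character class.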

Regular expressions, no matter how clever, are not always the right answer.
def split(s):
    result = []
    brace_depth = 0
    temp = ''
    for ch in s:
        if ch == ' ' and brace_depth == 0:
            result.append(temp[:])
            temp = ''
        elif ch == '(' or ch == '[':
            brace_depth += 1
            temp += ch
        elif ch == ']' or ch == ')':
            brace_depth -= 1
            temp += ch
        else:
            temp += ch
    if temp != '':
        result.append(temp[:])
    return result
>>> s="I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
>>> split(s)
['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
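One thing a depth counter can handle that the lookahead patterns cannot is nesting. The sketch below is a compact restatement of the same idea, not the answer's exact code: the name split_outside_brackets and the nested input are made up for illustration, it assumes the brackets are balanced, and unlike the function above it drops empty tokens produced by repeated spaces.

```python
def split_outside_brackets(s):
    """Split on spaces that are not inside () or []. Assumes balanced brackets."""
    result, current, depth = [], [], 0
    for ch in s:
        if ch == ' ' and depth == 0:
            if current:  # skip empty tokens caused by repeated spaces
                result.append(''.join(current))
                current = []
            continue
        if ch in '([':
            depth += 1
        elif ch in ')]':
            depth -= 1
        current.append(ch)
    if current:
        result.append(''.join(current))
    return result


# Hypothetical input with nested parentheses and a double space.
print(split_outside_brackets("a (b (c d) e)[x]  f"))
# ['a', '(b (c d) e)[x]', 'f']
```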

The regex for whitespace is \s. So using this with re.split:
print(re.split(r"[\s]", "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"))
The output is ['I', 'want', 'a', '(hot', 'chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy', 'bacon)[food]']
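Splitting on \s alone breaks the bracketed groups apart, as the output shows. A possible alternative, sketched here under the assumption that every group has the exact form (text)[label] with no nesting, is to match the tokens with re.findall instead of splitting. The pattern is an illustration, not something taken from the answers.

```python
import re

s = "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"

# Try a whole "(...)[...]" group first; otherwise take a run of non-whitespace.
token = re.compile(r'\([^)]*\)\[[^\]]*\]|\S+')

tokens = token.findall(s)
print(tokens)
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]',
#  'boxes', 'of', '(crispy bacon)[food]']
```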

Related

Pulling out certain strings from split method

I would like my output to be only words in parentheses inside the string without using regex.
Code:
def parentheses(s):
    return s.split()
print(parentheses("Inside a (space?)"))
print(parentheses("To see (here) but color"))
print(parentheses("Very ( good (code"))
Output:
['Inside', 'a', '(space?)'] -> **space?**
['To', 'see', '(here)', 'but', 'color'] -> **here**
['Very', '(', 'good', '(code'] -> **empty**
Here is an old fashioned way of doing it with a loop and referencing the ends with a dictionary.
def parentheses(s):
    ends = {"(":[],")":[]}
    for i, char in enumerate(s):
        if char in ["(", ")"]:
            ends[char].append(i)
    if not ends["("] or not ends[")"]:
        return ""
    return s[min(ends["("]) + 1: max(ends[")"])]
print(parentheses("Inside a (space?)"))
print(parentheses("To see (here) but color"))
print(parentheses("Very ( good (code"))
OUTPUT:
space?
here
If you must use str.split this will work as well.
def parentheses(s):
    parts = s.split("(")
    if len(parts) > 1:
        s = "(".join(parts[1:])
        parts = s.split(")")
        if len(parts) > 1:
            return ")".join(parts[:-1])
    return ""
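The same "first opening, last closing" idea can probably be written more compactly with str.find and str.rfind. This is a minimal sketch, not one of the original answers; the function name parentheses_find is made up to avoid clashing with the versions above.

```python
def parentheses_find(s):
    """Return the text between the first '(' and the last ')', or '' if absent."""
    start = s.find("(")   # index of the first '(' or -1
    end = s.rfind(")")    # index of the last ')' or -1
    if start == -1 or end == -1 or end < start:
        return ""
    return s[start + 1:end]


print(parentheses_find("Inside a (space?)"))        # space?
print(parentheses_find("To see (here) but color"))  # here
print(parentheses_find("Very ( good (code"))        # prints an empty line
```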

Split string in Python while keeping the line break inside the generated list

As simple as it sounds, can't think of a straightforward way of doing the below in Python.
my_string = "This is a test.\nAlso\tthis"
list_i_want = ["This", "is", "a", "test.", "\n", "Also", "this"]
I need the same behaviour as with string.split(), i.e. remove any type and number of whitespaces, but excluding the line breaks \n in which case I need it as a standalone list item.
How could I do this?
Split String using Regex findall()
import re
my_string = "This is a test.\nAlso\tthis"
my_list = re.findall(r"\S+|\n", my_string)
print(my_list)
How it works:
"\S+": "\S" = non-whitespace characters. "+" is a greedy quantifier, so it finds any run of non-whitespace characters, i.e. words
"|": OR logic
"\n": Find "\n" so it's returned as well in your list
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']
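To see how the pattern treats runs of whitespace, here is a small sketch on a made-up string (not from the question) that contains double spaces, a tab and consecutive line breaks. Each \n becomes its own item, while other whitespace is dropped.

```python
import re

# Hypothetical sample with a double space, a tab and repeated newlines.
sample = "one  two\n\nthree\tfour\n"

items = re.findall(r"\S+|\n", sample)
print(items)
# ['one', 'two', '\n', '\n', 'three', 'four', '\n']
```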
Here's some code that works but is definitely not efficient/pythonic:
my_string = "This is a test.\nAlso\tthis"
l = my_string.splitlines() # Splitting lines
list_i_want = []
for i in l:
    list_i_want.extend(i.split()) # Extending the list with the words of each line
    list_i_want.append('\n') # Adding a newline character after each line
list_i_want.pop() # Removing the last newline character
print(list_i_want)
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']

Splitting the sentences in python

I am trying to split the sentences in words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
If you look at the word "morningthe" in the list, it used to have "--" between the two words. Is there any way I can split it into two words, like "morning", "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words, i.e. runs of word characters (letters, digits and underscores), ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer, which returns an iterator of match objects, is probably better, as you don't have to store the whole list of words at once.
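A minimal sketch of the re.finditer variant could look like this. The generator name iter_words is made up for illustration.

```python
import re


def iter_words(text):
    """Yield words one at a time instead of building the whole list."""
    for match in re.finditer(r'\w+', text):
        yield match.group()


for word in iter_words("The morning--the evening"):
    print(word)
# The
# morning
# the
# evening
```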
Alternatively, you may also use itertools.groupby along with str.isalpha to extract alphabetic-only words from the string:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner. I have mentioned this as a possible alternative.
Specific to OP: If all you want is to also split on -- in the resulting list, you can first replace hyphens '-' with a space ' ' before performing the split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will send you crazy e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
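If pulling in a tokenizer library is not an option, a slightly more forgiving pattern can keep straight apostrophes inside words. This is only a sketch under that narrow assumption (it does not handle curly apostrophes, hyphenated words or other cases a real tokenizer covers), and the function name is made up.

```python
import re


def to_words_keep_apostrophes(text):
    # A word is a run of word characters, optionally followed by
    # one or more "'xyz" parts, e.g. "Don't" or "O'Rourke's".
    return re.findall(r"\w+(?:'\w+)*", text)


print(to_words_keep_apostrophes("Don't read O'Rourke's books!"))
# ["Don't", 'read', "O'Rourke's", 'books']
```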
Besides the solutions given already, you could also improve your clean_up_list function to do a better job.
def clean_up_list(word_list):
    clean_word_list = []
    # Move the symbols string out of the loop so that it doesn't
    # have to be initialized every time.
    symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the block in for word in word_list: to the whole sentence to get the same result.
You could also do this:
import re
def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']

String split formatting in python 3

I'm trying to format this string below where one row contains five words. However, I keep getting this as the output:
I love cookies yes I do Let s see a dog
First, I am not getting 5 words in one line, but instead, everything in one line.
Second, why does the "Let's" get split? I thought in splitting the string using "words", it will only split if there was a space in between?
Suggestions?
import re
string = """I love cookies. yes I do. Let's see a dog."""
# split string
words = re.split(r'\W+', string)
words = [i for i in words if i != '']
counter = 0
output = ''
for i in words:
    if counter == 0:
        output += "{0:>15s}".format(i)
    # if counter == 5, new row
    elif counter % 5 == 0:
        output += '\n'
        output += "{0:>15s}".format(i)
    else:
        output += "{0:>15s}".format(i)
    # Increase the counter by 1
    counter += 1
print(output)
As a start, don't call a variable "string", since it shadows the module with the same name.
Secondly, use split() to do your word-splitting:
>>> s = """I love cookies. yes I do. Let's see a dog."""
>>> s.split()
['I', 'love', 'cookies.', 'yes', 'I', 'do.', "Let's", 'see', 'a', 'dog.']
From the re module documentation:
\W
Matches any character which is not a Unicode word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_] (but the flag affects the entire regular expression, so in such cases using an explicit [^a-zA-Z0-9_] may be a better choice).
Since the ' is not listed in the above, the regexp used splits the "Let's" string into two parts:
>>> words = re.split('\W+', s)
>>> words
['I', 'love', 'cookies', 'yes', 'I', 'do', 'Let', 's', 'see', 'a', 'dog', '']
This is the output I get using the split() approach above:
$ ./sp3.py
I love cookies. yes I
do. Let's see a dog.
The code could probably be simplified to this, since the counter == 0 branch and the else clause do the same thing. I threw in an enumerate as well to get rid of the counter:
#!/usr/bin/env python3
s = """I love cookies. yes I do. Let's see a dog."""
words = s.split()
output = ''
for n, i in enumerate(words):
    if n % 5 == 0:
        output += '\n'
    output += "{0:>15s}".format(i)
print(output)
words = string.split()
while words:
    # Print the first five words on one line
    for word in words[:5]:
        print(word, end=" ")
    print()
    # Drop the words that were just printed
    words = words[5:]
That's the basic concept, split it using the split() method
Then slice it using slice notation to get the first 5 words
Then slice off the first 5 words, and loop again
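The slicing idea above can also be sketched with a stepped range, which avoids re-slicing the list on every pass. The helper name rows_of and its size parameter are made up for illustration.

```python
def rows_of(words, size=5):
    """Group words into rows of at most `size` words, joined by spaces."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


s = "I love cookies. yes I do. Let's see a dog."
for row in rows_of(s.split()):
    print(row)
# I love cookies. yes I
# do. Let's see a dog.
```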

Python line.split to include a whitespace

If I have a string and want to return a word that includes a whitespace how would it be done?
For example, I have:
line = 'This is a group of words that include #this and #that but not ME ME'
response = [ word for word in line.split() if word.startswith("#") or word.startswith('#') or word.startswith('ME ')]
print(response)
['#this', '#that', 'ME']
So ME ME does not get printed because of the whitespace.
Thanks
You could just keep it simple:
line = 'This is a group of words that include #this and #that but not ME ME'
words = line.split()
result = []
pos = 0
try:
    while True:
        if words[pos].startswith(('#', '#')):
            result.append(words[pos])
            pos += 1
        elif words[pos] == 'ME':
            # Keep 'ME' together with the word that follows it
            result.append('ME ' + words[pos + 1])
            pos += 2
        else:
            pos += 1
except IndexError:
    # Ran past the end of the word list
    pass
print(result)
Think about speed only if it proves to be too slow in practice.
From the Python documentation:
string.split(s[, sep[, maxsplit]]): Return a list of the words of the string s. If the optional second
argument sep is absent or None, the words are separated by arbitrary
strings of whitespace characters (space, tab, newline, return,
formfeed).
So the first problem is the call to split(), which breaks the line on every whitespace:
print(line.split())
['This', 'is', 'a', 'group', 'of', 'words', 'that', 'include', '#this', 'and', '#that', 'but', 'not', 'ME', 'ME']
I recommend using re to split the string; see re.split(pattern, string, maxsplit=0, flags=0).
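As a rough sketch of the regex route, the example below uses re.findall rather than re.split, since matching the wanted tokens directly seems simpler here. It assumes, like the loop-based answer, that "ME" should be kept together with the single word that follows it; the pattern is an illustration, not part of the original answer.

```python
import re

line = 'This is a group of words that include #this and #that but not ME ME'

# Either a '#' followed by non-whitespace, or the word 'ME' plus the next word.
matches = re.findall(r'#\S+|\bME \S+', line)
print(matches)
# ['#this', '#that', 'ME ME']
```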
