Regex joining words split by whitespace and hyphen - python

My string is quite messy and looks something like this:
s="I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
I'd like to strip the hyphens (and, in some cases, the whitespace around them) and collect the rejoined words in one list. Desired output:
["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
I tried a lot of different variations, but nothing worked:
rgx = re.compile("([\w][\w'][\w\-]*\w)")
s = "My string'"
rgx.findall(s)

Here's one way:
[re.sub(r'\s*-\s*', '', i) for i in re.split(r'(?<!-)\s(?!-)', s)]
# ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own.', 'Would', 'you', 'help', 'me?']
Two operations here:
Split the text on whitespace that is neither preceded nor followed by a hyphen, using a negative lookbehind and a negative lookahead.
In each resulting piece, replace every hyphen, together with any whitespace directly before or after it, with the empty string.
You can see the first operation's demo here: https://regex101.com/r/ayHPvY/2
And the second: https://regex101.com/r/ayHPvY/1
Edit: To get the . and ? to be separated as well, use this instead:
[re.sub(r'\s*-\s*','', i) for i in re.split(r"(?<!-)\s(?!-)|([^\w\s'-]+)", s) if i]
# ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
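The catch was to also split on runs of characters that are not word characters, whitespace, hyphens, or apostrophes. Because that alternative is a capturing group, re.split keeps the matched punctuation in the result. The if i filter is necessary because the split also returns None and empty-string items.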
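The two steps above can be wrapped in a small helper. This is only a sketch: the function name join_hyphenated is made up for illustration, and it assumes that every hyphen in the input marks a broken word (a genuine compound such as "well-known" would be joined too).

```python
import re

def join_hyphenated(text):
    """Rejoin hyphen-broken words and return tokens with punctuation split off.

    Hypothetical helper; it assumes every hyphen marks a broken word.
    """
    # Split on whitespace that has no hyphen next to it. The capturing group
    # makes re.split keep runs of punctuation as separate items.
    pieces = re.split(r"(?<!-)\s(?!-)|([^\w\s'-]+)", text)
    # Drop None/'' items, then delete each hyphen and the whitespace around it.
    return [re.sub(r"\s*-\s*", "", piece) for piece in pieces if piece]

s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
result = join_hyphenated(s)
print(result)
```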

A quick, non-regex way to do it would be
''.join(map(lambda s: s.strip(), s.split('-'))).split()
That is: split on hyphens, strip the surrounding whitespace from each piece, join the pieces back into one string, and split on whitespace. This, however, doesn't separate the period or the question mark from the words.
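If the punctuation should be separated as well, one possible follow-up is a single re.findall over the rejoined string. This is a sketch, not part of the original answer; the pattern treats word characters and apostrophes as words and any other non-space character as its own token.

```python
import re

s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

# Non-regex step from the answer: split on hyphens, strip, and join again.
joined = "".join(part.strip() for part in s.split("-"))

# Whitespace split keeps punctuation attached ('own.', 'me?').
words = joined.split()

# Assumed follow-up: words (with apostrophes) or single punctuation marks.
tokens = re.findall(r"[\w']+|[^\w\s]", joined)

print(words)
print(tokens)
```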

How about this:
>>> s
"I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
>>> list(map(lambda x:re.sub(' *- *','',x), filter(lambda x:x, re.split(r'(?<!-) +(?!-)|([.?])',s))))
["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
The above uses a literal space ' ', but \s is better:
list(map(lambda x:re.sub(r'\s*-\s*','',x), filter(lambda x:x, re.split(r'(?<!-)\s+(?!-)|([.?])',s))))
(?<!-)\s+(?!-) matches a run of whitespace that has no - directly before or after it.
[.?] matches a single . or ?.
re.split(r'(?<!-)\s+(?!-)|([.?])',s) splits the string accordingly, but the result contains some None and empty-string '' items:
["I'm", None, 'hope-less', None, 'and', None, 'can -not', None, 'solve', None, 'this', None, 'pro- blem', None, 'on', None, 'my', None, 'own', '.', '', None, 'Wo - uld', None, 'you', None, 'help', None, 'me', '?', '']
This result is fed directly to filter to remove the None and '' items, and then to map to remove the whitespace and - inside each word.

Related

Joining only a part of a list of strings depending on its value in python

I have a list of strings, for example:
['this', 'is', 'an', 'example', 'of', 'list', 'of', 'strings']
I want to extract only some of these words and join them together, but only if they are enclosed in quotes.
For example, if my list was like...
['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list']
it would return "this part", because that is the part between the quotes.
I tried using str.find("\""),
but it found only the first quotation mark, so I couldn't do much with it. Does anyone have an idea how to do this? I appreciate all the help :))
You can use the following pattern with re.findall:
l = ['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list', '"another', 'example"']
re.findall(r'\"(.*?)\"', ' '.join(l))
# ['this part', 'another example']
str has a method rfind, which returns the index of the last occurrence (or -1 if not found), so you might do:
elements = ['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list']
txt = ' '.join(elements)
part = txt[txt.find('"')+1:txt.rfind('"')]
print(part) # this part
The +1 is required because the start of a slice is inclusive: without it, the opening quotation mark itself would be part of the result.
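The two answers behave differently once the list contains more than one quoted phrase. The comparison below is a sketch using the longer example list from the re.findall answer: find/rfind returns everything between the first and the last quotation mark, while the non-greedy pattern returns each quoted phrase separately.

```python
import re

elements = ['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list',
            '"another', 'example"']
txt = ' '.join(elements)

# Slice from just after the first quote up to (not including) the last quote.
between_first_and_last = txt[txt.find('"') + 1:txt.rfind('"')]

# Non-greedy match: one result per pair of quotation marks.
each_quoted = re.findall(r'"(.*?)"', txt)

print(between_first_and_last)  # this part" of the list "another example
print(each_quoted)             # ['this part', 'another example']
```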

discord.py - Dividing string to list [duplicate]

How do I split a sentence and store each word in a list? For example, given a string like "these are words", how do I get a list like ["these", "are", "words"]?
To split on other delimiters, see Split a string by a delimiter in python.
To split into individual characters, see How do I split a string into a list of characters?.
Given a string sentence, this stores each word in a list called words:
words = sentence.split()
To split the string text on any consecutive runs of whitespace:
words = text.split()
To split the string text on a custom delimiter such as ",":
words = text.split(",")
The words variable will be a list and contain the words from text split on the delimiter.
Use str.split():
Return a list of the words in the string, using sep as the delimiter
... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:
import nltk
words = nltk.word_tokenize(raw_sentence)
This has the added benefit of splitting out punctuation.
Example:
>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
'waking', 'it', '.']
This allows you to filter out any punctuation you don't want and use only words.
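As a sketch of that filtering step, the snippet below starts from the token list shown above (copied in literally, so NLTK is not needed to run it) and keeps only tokens that contain at least one letter or digit. That rule is an assumption for illustration; note that it keeps the clitic "'s".

```python
# Token list copied from the nltk.word_tokenize output above.
tokens = ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
          'waking', 'it', '.']

# Keep a token only if it has at least one alphanumeric character.
words_only = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(words_only)
```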
Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.
How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
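One thing to watch for with this approach: a token made up only of punctuation, such as a free-standing "--", is stripped down to an empty string. The sketch below uses a made-up sentence to show this and adds a filter for the empty results.

```python
import string

text = "wait -- what's 'this'?"

# Strip punctuation from both edges of every whitespace-separated token.
stripped = [word.strip(string.punctuation) for word in text.split()]

# Tokens that were pure punctuation are now '', so drop them.
words = [word for word in stripped if word]

print(stripped)  # ['wait', '', "what's", 'this']
print(words)     # ['wait', "what's", 'this']
```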
I want my python function to split a sentence (input) and store each word in a list
The str.split() method does this: it takes a string and splits it into a list:
>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3.0
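Note that split(" ") and split() are not interchangeable. The following sketch, using a made-up string with irregular spacing, shows the difference: an explicit " " separator splits on every single space and leaves tabs and newlines alone, while the no-argument form treats any run of whitespace as one separator.

```python
text = "  this  is a\tsentence\n"

# Explicit separator: every single space splits, producing empty strings.
by_space = text.split(" ")

# No separator: runs of any whitespace are collapsed and edges are trimmed.
by_whitespace = text.split()

print(by_space)       # ['', '', 'this', '', 'is', 'a\tsentence\n']
print(by_whitespace)  # ['this', 'is', 'a', 'sentence']
```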
If you want all the chars of a word/sentence in a list, do this:
print(list("word"))
# ['w', 'o', 'r', 'd']
print(list("some sentence"))
# ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:
>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']
NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
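A short sketch of both points: the same command line split with str.split() and with shlex.split(), followed by an ordinary English sentence whose apostrophe shlex reads as an unterminated quote.

```python
import shlex

command = "sudo echo 'foo && bar'"

plain = command.split()         # quotes stay attached to the words
shell = shlex.split(command)    # the quoted phrase becomes one item

print(plain)  # ['sudo', 'echo', "'foo", '&&', "bar'"]
print(shell)  # ['sudo', 'echo', 'foo && bar']

# An apostrophe in natural language looks like an opening quote to shlex.
error = None
try:
    shlex.split("you can't help that")
except ValueError as exc:
    error = str(exc)

print(error)
```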
Split the words without harming apostrophes inside words
See input_1 and input_2 in the examples below.
def split_into_words(line):
    import re
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)
#Example 1
input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)
# output
['computational', 'power', 'see', "Moore's", 'law', 'and']
#Example 2
input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""
split_into_words(input_2)
#output
['Oh',
'you',
"can't",
'help',
'that',
'said',
'the',
'Cat',
"we're",
'all',
'mad',
'here',
"I'm",
'mad',
"You're",
'mad']

Missing last word in a sentence when using regular expression

Code:
import re

def main():
    a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b=word_find(a)
    print(b)

def word_find(sentence_list):
    word_list=[]
    word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words=re.findall(word_reg,sentence_list[i])
        word_list.append(words)
    return word_list

main()
What I need is to break each sentence into a list of single words.
Now the output looks like this:
[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]
I found that the last word of each sentence is missing: 'about' in the first and 'remarkable' in the second.
There might be a problem in my regular expression:
word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
But if I add a question mark after the last part of this regular expression, like this:
[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?")
the result becomes many single letters instead of words. What can I do about it?
Edit:
The reason I didn't use str.split is that there are many ways people may separate words.
For example, when someone inputs a--b there is no space, but we still have to break it into 'a','b'
Using the right tools is always the winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do just that: break sentences into words.
import nltk
a = ['the mississippi is well worth reading about',
' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
I'd suggest a simpler solution, applied to each sentence in the list a:
b = [re.split(r"[\W_]", s) for s in a]
The regex [\W_] matches any single non-word character (anything that is not a letter, digit, or underscore) plus the underscore itself, which is practically enough.
Your current regex requires that each word be followed by one of the characters in your list. The last word is followed by the end of the line instead, which your list does not cover but which can be matched with $.
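To illustrate the role of $, here is a sketch with a deliberately simplified delimiter class ([ ,;] is an assumption, not the question's full list). The only difference between the two patterns is that the second one also accepts the end of the string after a word.

```python
import re

sentence = "the mississippi is well worth reading about"

# A word must be followed by a delimiter: the last word has none and is lost.
without_end = re.findall(r"[ ,;]?(.+?)[ ,;]", sentence)

# A word may be followed by a delimiter OR by the end of the string.
with_end = re.findall(r"[ ,;]?(.+?)(?:[ ,;]|$)", sentence)

print(without_end)
print(with_end)
```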
You can use re.split and filter (here a is a single string):
filter(None, re.split(r"[, \-!?:]+", a))
Where I have put the string "[, \-!?:]+", you should put whichever characters are your delimiters. filter just removes the empty strings produced by leading or trailing separators. In Python 3, wrap the call in list() to get a list instead of an iterator.
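A runnable sketch of this approach on a made-up string that includes the a--b case from the question:

```python
import re

text = "a--b, is it here? yes: here!"

# One or more delimiter characters in a row count as a single separator.
pieces = re.split(r"[, \-!?:]+", text)

# The trailing '!' leaves an empty string at the end; filter(None, ...) drops it.
words = list(filter(None, pieces))

print(pieces)
print(words)
```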
You can either find what you don't want and split on that:
>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
(You may need to filter the '' elements produced by re.split)
Or capture what you do want with re.findall and keep those elements:
>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
Thanks, everybody.
From the other answers, the solution is to use re.split(),
and there is the superstar NLTK in the top answer.
def word_find(sentence_list):
    word_list=[]
    for i in range(len(sentence_list)):
        word_list.append(re.split('\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;',sentence_list[i]))
    return word_list

Better split string method - split by multiple characters [duplicate]

The built-in <string>.split() procedure uses only whitespace to split the string.
I'd like to define a procedure, split_string, that takes two inputs: the string to split and a string containing all of the characters considered separators.
The procedure should return a list of strings that break the source string up by the characters in the list.
def split_string(source,list):
    ...
>>> print(split_string("This is a test-of the,string separation-code!",",!-"))
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']
re.split() works:
>>> import re
>>> s = "This is a test-of the,string separation-code!"
>>> re.split(r'[ \-\,!]+', s)
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code', '']
In your case searching for words seems more useful:
>>> re.findall(r"[\w']+", s)
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']
Here's a function you can reuse - that also escapes special characters:
def escape_char(char):
    special = ['.', '^', '$', '*', '+', '?', '\\', '[', ']', '|']
    return '\\{}'.format(char) if char in special else char

def split(text, *delimiters):
    return re.split('|'.join([escape_char(x) for x in delimiters]), text)
It doesn't automatically remove empty entries, e.g.:
>>> split('Python, is awesome!', '!', ',', ' ')
['Python', '', 'is', 'awesome', '']
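For the split_string procedure the question asks for, one possible sketch uses re.escape from the standard library instead of a hand-written escape list, and drops the empty entries. Two assumptions are made here: whitespace always counts as a separator (the expected output in the question implies this), and the second parameter is renamed separators to avoid shadowing the built-in list.

```python
import re

def split_string(source, separators):
    """Split source on whitespace and on any character in separators.

    Sketch only: treating whitespace as a separator is an assumption
    taken from the expected output in the question.
    """
    # re.escape makes characters such as '-', ']' or '^' safe inside [...].
    pattern = "[" + re.escape(separators) + r"\s]+"
    # Leading or trailing separators yield empty strings; filter them out.
    return [part for part in re.split(pattern, source) if part]

result = split_string("This is a test-of the,string separation-code!", ",!-")
print(result)
```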
