Modify python nltk.word_tokenize to exclude "#" as delimiter

I am using Python's NLTK library to tokenize my sentences.
If my code is
text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)
I get this as my output
['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
The symbols ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + isn't one and C++ therefore appears as a single token?
I want my output to be
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
I want C# to be considered as one token.

Since this is a multi-word tokenization problem, another way is to retokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer:
from nltk.tokenize import MWETokenizer

mwtokenizer = MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))  # use the capitalised 'C'; MWETokenizer matches tokens exactly as word_tokenize produces them
mwtokenizer.tokenize(nltk.word_tokenize(text))
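This returns the list the question asks for:
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']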

Another idea: instead of altering how the text is tokenized, loop over the tokens and join every '#' with the preceding token.
txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)
i_offset = 0
for i, t in enumerate(tokens):
i -= i_offset
if t == '#' and i > 0:
left = tokens[:i-1]
joined = [tokens[i - 1] + t]
right = tokens[i + 1:]
tokens = left + joined + right
i_offset += 1
>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
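A single-pass variant avoids the offset bookkeeping and the repeated list slicing; this is a sketch over the same word_tokenize output (join_hash is a hypothetical helper name):
def join_hash(tokens):
    # Merge each '#' into the token before it; keep everything else as-is.
    out = []
    for t in tokens:
        if t == '#' and out:
            out[-1] += t
        else:
            out.append(t)
    return out

join_hash(word_tokenize(txt))
# ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']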

NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.
I'll create an example for you where the text is split on any whitespace character (tab, newline, etc.) and a couple of other symbols, just as an illustration:
>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
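If you also want the punctuation kept as separate tokens, closer to the asker's desired output, a gaps=False pattern matching either words (allowing # and +) or single punctuation characters is one option; a sketch (note it does not reproduce NLTK's special "n't" splitting):
>>> regexp_tokenize(txt, pattern=r"[\w#+]+|[^\w\s]", gaps=False)
['C#', 'billion', 'dollars', ';', 'we', 'don', "'", 't', 'own', 'an', 'ounce', 'C++']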

Related

How to convert command line argument into python list [duplicate]

Marked as a duplicate of "How do I split a string into a list of words?"; see the answers under that question below.

How can you combine a list of tokens and characters (including punctuation and symbols) into a single sentence string in python?

I'm trying to join a list of words and characters, such as the list below (ls), into a single, correctly formatted sentence string (sentence), for a collection of such lists.
ls = ['"', 'Time', '"', 'magazine', 'said', 'the', 'film', 'was',
'"', 'a', 'multimillion', 'dollar', 'improvisation', 'that',
'does', 'everything', 'but', 'what', 'the', 'title', 'promises',
'"', 'and', 'suggested', 'that', '"', 'writer', 'George',
'Axelrod', '(', '"', 'The', 'Seven', 'Year', 'Itch', '"', ')',
'and', 'director', 'Richard', 'Quine', 'should', 'have', 'taken',
'a', 'hint', 'from', 'Holden', "'s", 'character', 'Richard',
'Benson', 'who', 'writes', 'his', 'movie', ',', 'takes', 'a',
'long', 'sober', 'look', 'at', 'what', 'he', 'has', 'wrought',
',', 'and', 'burns', 'it', '.', '"']
sentence = '"Time" magazine said the film was "a multimillion dollar improvisation that does everything but what the title promises" and suggested that "writer George Axelrod ("The Seven Year Itch") and director Richard Quine should have taken a hint from Holden's character Richard Benson who writes his movie, takes a long sober look at what he has wrought, and burns it."'
I've tried a rule-based approach that adds a space after an element depending on the contents of the next element, but it grew into a really long piece of code with rules for as many cases as I could think of, such as parentheses and quotations. Is there a way to join this list into a correctly formatted sentence more efficiently and effectively?
I think a simple for loop should do the trick:
sentence = ""
for word in ls:
    if (word == ',' or word == '.') and sentence != '':
        sentence = sentence[:-1]  # remove the space added after the previous word
    sentence += word
    if word != '"' and word != '(':  # 'and', not 'or': the original condition was always true
        sentence += ' '  # add a space after each word
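Alternatively, NLTK ships a detokenizer that already encodes many of these spacing rules; a sketch (it handles punctuation and contractions well, though the paired straight quotes in ls may still need post-processing):
from nltk.tokenize.treebank import TreebankWordDetokenizer

detok = TreebankWordDetokenizer()
sentence = detok.detokenize(ls)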

Regex cutting the word in python

import PyPDF2

fileReader = PyPDF2.PdfFileReader(file)
s = ""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Split the text into an array of sentences wherever a '.' occurs.
# (Still need a way to avoid breaking at abbreviations such as "Mr.")
sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index + 1:]

corpus = []
for sentence in sentences:
    corpus.append(tokenizer.tokenize(sentence))
print(corpus[20])
The above code reads the file and tokenizes the string. The output I get has words broken apart, but the desired output is:
['With', 'the', 'graph', 'of', 'COVID', '19', 'at', 'this', 'moment', 'have', 'started', 'a', 'slide', 'downward', 'trend', 'we', 'are', 'confident', 'of', 'all', 'our', 'brands', 'performing', 'strongly', 'in', 'the', 'coming', 'quarters']
i.e. the words should not get broken down. Is there any way to avoid this?
The string s is taken from a PDF.
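One note on the sentence splitting: rather than scanning for '.' manually, NLTK's Punkt-based sent_tokenize already avoids breaking at common abbreviations such as "Mr."; a sketch of that substitution (it won't repair words that the PDF extraction itself has broken, which has to be fixed in s):
from nltk.tokenize import sent_tokenize, RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# sent_tokenize uses the pre-trained Punkt model; nltk.download('punkt') may be needed first
corpus = [tokenizer.tokenize(sent) for sent in sent_tokenize(s)]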

discord.py - Dividing string to list [duplicate]

Marked as a duplicate of "How do I split a string into a list of words?"; see the answers under that question below.

How do I split a string into a list of words?

How do I split a sentence and store each word in a list? For example, given a string like "these are words", how do I get a list like ["these", "are", "words"]?
To split on other delimiters, see Split a string by a delimiter in python.
To split into individual characters, see How do I split a string into a list of characters?.
Given a string sentence, this stores each word in a list called words:
words = sentence.split()
To split the string text on any consecutive runs of whitespace:
words = text.split()
To split the string text on a custom delimiter such as ",":
words = text.split(",")
The words variable will be a list and contain the words from text split on the delimiter.
Use str.split():
Return a list of the words in the string, using sep as the delimiter
... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
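The difference between sep=None and an explicit space matters when the string has leading, trailing, or repeated whitespace; for example:
>>> "  a  b  ".split()
['a', 'b']
>>> "  a  b  ".split(' ')
['', '', 'a', '', 'b', '', '']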
Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:
import nltk
words = nltk.word_tokenize(raw_sentence)
This has the added benefit of splitting out punctuation.
Example:
>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
'waking', 'it', '.']
This allows you to filter out any punctuation you don't want and use only words.
Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.
How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edges of words, without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
I want my python function to split a sentence (input) and store each word in a list
The str.split() method does this: it takes a string and splits it into a list:
>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3
If you want all the chars of a word/sentence in a list, do this:
print(list("word"))
# ['w', 'o', 'r', 'd']
print(list("some sentence"))
# ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:
>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']
NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
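If you do want the quotes preserved in the tokens, shlex.split also accepts a posix flag (a standard parameter of shlex.split); for example:
>>> shlex.split("sudo echo 'foo && bar'", posix=False)
['sudo', 'echo', "'foo && bar'"]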
Split the words without harming apostrophes inside words
See input_1 and input_2 below; contractions and possessives such as Moore's stay intact.
def split_into_words(line):
    import re
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)

# Example 1
input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)
# output
['computational', 'power', 'see', "Moore's", 'law', 'and']

# Example 2
input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""
split_into_words(input_2)
# output
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
