import PyPDF2
fileReader = PyPDF2.PdfFileReader(file)
s = ""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index+1:]
# splits the text into an array of sentences based on where we see a '.' - need to account for how to avoid breaking at e.g. Mr.

corpus = []
for sentence in sentences:
    corpus.append(tokenizer.tokenize(sentence))
print(corpus[20])
The above is my code for reading the file and tokenizing the string. The output I get is as follows:
But the desired output is:
['With', 'the', 'graph', 'of', 'COVID', '19', 'at', 'this', 'moment', 'have', 'started', 'a', 'slide', 'downward', 'trend', 'we', 'are', 'confident', 'of', 'all', 'our', 'brands', 'performing', 'strongly', 'in', 'the', 'coming', 'quarters']
i.e. the words should not get broken down. Is there any way to avoid this?
The string s is taken from a PDF and looks something like this:
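One suggestion of my own (not part of the thread above) for the comment in the code about breaking at abbreviations such as Mr.: let NLTK's Punkt sentence tokenizer do the sentence splitting instead of scanning for '.' manually. This assumes the punkt model has been downloaded:

import nltk
from nltk.tokenize import sent_tokenize, RegexpTokenizer

nltk.download('punkt')  # one-time download of the Punkt sentence model
tokenizer = RegexpTokenizer(r'\w+')
sentences = sent_tokenize(s)  # handles abbreviations like "Mr." better than splitting on '.'
corpus = [tokenizer.tokenize(sentence) for sentence in sentences]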
I'm trying to join a list of words and characters such as the one below (ls), and convert it into a single, correctly formatted sentence string (sentence), for a collection of such lists.
ls = ['"', 'Time', '"', 'magazine', 'said' 'the', 'film', 'was',
'"', 'a', 'multimillion', 'dollar', 'improvisation', 'that',
'does', 'everything', 'but', 'what', 'the', 'title', 'promises',
'"', 'and', 'suggested', 'that', '"', 'writer', 'George',
'Axelrod', '(', '"', 'The', 'Seven', 'Year', 'Itch', '"', ')',
'and', 'director', 'Richard', 'Quine', 'should', 'have', 'taken',
'a', 'hint', 'from', 'Holden', "'s", 'character', 'Richard',
'Benson', 'who', 'writes', 'his', 'movie', ',', 'takes', 'a',
'long', 'sober', 'look', 'at', 'what', 'he', 'has', 'wrought',
',', 'and', 'burns', 'it', '.', '"']
sentence = '"Time" magazine said the film was "a multimillion dollar improvisation that does everything but what the title promises" and suggested that "writer George Axelrod ("The Seven Year Itch") and director Richard Quine should have taken a hint from Holden's character Richard Benson who writes his movie, takes a long sober look at what he has wrought, and burns it."'
I've tried a rule-based approach that adds a space after an element depending on the contents of the next element, but it ended up as a really long piece of code containing rules for as many cases as I could think of, such as parentheses and quotations. Is there a more efficient way to join this list into a correctly formatted sentence?
I think a simple for loop should do the trick:

sentence = ""
for word in ls:
    if (word == ',' or word == '.') and sentence != '':
        sentence = sentence[:-1]  # remove the space added after the previous word
    sentence += word
    if word != '"' and word != '(':
        sentence += ' '  # add a space after each word, except after an opening quote or parenthesis
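A shorter alternative worth knowing about (my own addition, not part of the answer above): NLTK ships a Treebank detokenizer that handles most of this punctuation spacing, although it will not reproduce every quote placement in the target sentence exactly:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detok = TreebankWordDetokenizer()
sentence = detok.detokenize(ls)  # attaches ',' and '.' to the preceding word and joins the rest with spaces
print(sentence)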
Within a dataframe I have a variable containing different abstracts of academic literature. Below you find an example of the first 3 observations:
abstract = ['Word embeddings are an active topic in the NLP', 'We propose a new shared task for tactical data', 'We evaluate a semantic parser based on a character']
I want to split the sentences in this variable into separate words and remove any periods '.'
The line of code in this case should return the following list:
abstractwords = ['Word', 'embeddings', 'are', 'an', 'active', 'topic', 'in', 'the', 'NLP', 'We', 'propose', 'a', 'new', 'shared', 'task', 'for', 'tactical', 'data', 'We', 'evaluate', 'a', 'semantic', 'parser', 'based', 'on', 'a', 'character']
You can use nested list comprehension:
abstract = ['Word embeddings are an active topic in the NLP.', 'We propose a new shared task for tactical data.', 'We evaluate a semantic parser based on a character.']
words = [word.strip('.') for sentence in abstract for word in sentence.split()]
print(words)
# ['Word', 'embeddings', 'are', 'an', 'active', 'topic', 'in', 'the', 'NLP', 'We', 'propose', 'a', 'new', 'shared', 'task', 'for', 'tactical', 'data', 'We', 'evaluate', 'a', 'semantic', 'parser', 'based', 'on', 'a', 'character']
If you want to remove '.' in the middle of the words as well, use word.replace('.', '') instead.
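For example, that variant is the same comprehension with strip swapped for replace:

words = [word.replace('.', '') for sentence in abstract for word in sentence.split()]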
Use a for loop to go through the elements, replace "." with a space, split each sentence, and concatenate the lists:
abstractwords = []
for sentence in abstract:
    sentence = sentence.replace(".", " ")
    abstractwords.extend(sentence.split())
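Printing abstractwords afterwards gives the same flat list as the comprehension above:

print(abstractwords)
# ['Word', 'embeddings', 'are', 'an', 'active', 'topic', 'in', 'the', 'NLP', 'We', 'propose', 'a', 'new', 'shared', 'task', 'for', 'tactical', 'data', 'We', 'evaluate', 'a', 'semantic', 'parser', 'based', 'on', 'a', 'character']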
I am using Python's NLTK library to tokenize my sentences.
If my code is
text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)
I get this as my output
['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
The symbols ;, . and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the way + isn't a delimiter and thus C++ appears as a single token?
I want my output to be
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
I want C# to be considered as one token.
Since this is a multi-word tokenization problem, another way would be to retokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer:
mwtokenizer = nltk.MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))
tokens = nltk.word_tokenize(text)  # ['C', '#', 'billion', 'dollars', ';', ...]
print(mwtokenizer.tokenize(tokens))  # ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
Another idea: instead of altering how text is tokenized, just loop over the tokens and join every '#' with the preceding one.
txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)
i_offset = 0
for i, t in enumerate(tokens):
i -= i_offset
if t == '#' and i > 0:
left = tokens[:i-1]
joined = [tokens[i - 1] + t]
right = tokens[i + 1:]
tokens = left + joined + right
i_offset += 1
>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.
I'll create an example for you where the text is split on any whitespace character (tab, newline, etc.) and on a couple of other symbols, just as an illustration:
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
I'm trying to split words, punctuation, and numbers from a sentence. However, my code produces unexpected output. How can I fix it?
This is my input text (in a text file):
"I 2changed to ask then, said that mildes't of men2,
And my code outputs this:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men2']
However, the expected output is:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men','2']
Here's my code:
import re

newlist = []
f = open("Inputfile2.txt", 'r')
out = f.readlines()
for line in out:
    word = line.strip('\n')
f.close()

lst = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]").findall(word)
print(lst)
In regular expressions, \w matches any word character, i.e. [a-zA-Z0-9_], so it also matches digits; the version below uses [a-zA-Z] for the word parts so that digits are kept as separate tokens.
Also, the first part of your regular expression should be \d+ to match more than one digit.
The second and third parts of your regular expression, \w+[\w']+|\w, can be merged into a single part by changing the + to *.
import re

with open('Inputfile2.txt', 'r') as f:
    for line in f:
        word = line.strip('\n')
        lst = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]").findall(word)
        print(lst)
This gives:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men', '2', ',']
Note that your expected output is incorrect. It is missing a ','.
In test.txt, I have 2 lines of sentences.
The heart was made to be broken.
There is no surprise more magical than the surprise of being loved.
The code:
import re

file = open('/test.txt', 'r')  # specify the file to open
data = file.readlines()
file.close()
for line in data:
    line_split = re.split(r'[ \t\n\r, ]+', line)
    print(line_split)
Results from the code:
['The', 'heart', 'was', 'made', 'to', 'be', 'broken.', '']
['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the', 'surprise', 'of', 'being', 'loved.']
How do I get only the words printed out (see the first sentence)? Expected result:
['The', 'heart', 'was', 'made', 'to', 'be', 'broken.']
['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the', 'surprise', 'of', 'being', 'loved.']
Any advice?
Instead of using split to match the delimiters, you can use findall with the negated character class to match the parts you want to keep:
line_split = re.findall(r'[^ \t\n\r., ]+',line)
To fix it, with a few other changes explained further on:
import re

with open("test.txt", "r") as file:
    for line in file:
        line_split = list(filter(bool, re.split(r'[ \t\n\r, ]+', line)))
        print(line_split)
Here we use filter() to remove any empty strings from the result (wrapped in list() so that it prints as a plain list on Python 3).
Note my use of the with statement to open the file. This is more readable and handles closing the file for you, even on exceptions.
We also loop directly over the file - this is a better idea as it doesn't load the entire file into memory at once, which is not needed and could cause problems with big files.
words = re.compile(r"[\w']+").findall(yourString)
Demo
>>> import re
>>> yourString = "Mary's lamb was white as snow."
>>> re.compile(r"[\w']+").findall(yourString)
["Mary's", 'lamb', 'was', 'white', 'as', 'snow']
If you really do want periods, you can add those as [\w'\.]
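Continuing the same demo, the pattern with the period added keeps the trailing dot on 'snow.':

>>> re.compile(r"[\w'\.]+").findall(yourString)
["Mary's", 'lamb', 'was', 'white', 'as', 'snow.']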
In [2]: with open('test.txt', 'r') as f:
   ...:     lines = f.readlines()
   ...:
In [3]: words = [l.split() for l in lines]
In [4]: words
Out[4]:
[['The', 'heart', 'was', 'made', 'to', 'be', 'broken.'],
['There',
'is',
'no',
'surprise',
'more',
'magical',
'than',
'the',
'surprise',
'of',
'being',
'loved.']]