I'm trying to split a sentence into words, punctuation, and numbers. However, my code doesn't produce the expected output. How can I fix it?
This is my input text (in a text file):
"I 2changed to ask then, said that mildes't of men2,
And my code outputs this:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men2']
However, the expected output is:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men','2']
Here's my code:
import re

newlist = []
f = open("Inputfile2.txt", 'r')
out = f.readlines()
for line in out:
    word = line.strip('\n')
f.close()
lst = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]").findall(word)
print(lst)
In regular expressions, '\w' matches any word character, i.e. [a-zA-Z0-9_]. Note that it includes digits, which is why the '2' is not split off from 'men2'.
Also, the first part of your regular expression should be '\d+' to match one or more digits.
The second and third parts of your regular expression, '\w+[\w']+|\w', can be merged into a single alternative by changing '+' to '*'; using '[a-zA-Z]' instead of '\w' there keeps digits out of the word tokens.
import re

with open('Inputfile2.txt', 'r') as f:
    for line in f:
        word = line.strip('\n')
        lst = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]").findall(word)
        print(lst)
This gives:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men', '2', ',']
Note that your expected output is incorrect. It is missing a ','.
I'm trying to join a list of words and characters, such as the list (ls) below, into a single, correctly formatted sentence string (sentence), for a whole collection of such lists.
ls = ['"', 'Time', '"', 'magazine', 'said' 'the', 'film', 'was',
'"', 'a', 'multimillion', 'dollar', 'improvisation', 'that',
'does', 'everything', 'but', 'what', 'the', 'title', 'promises',
'"', 'and', 'suggested', 'that', '"', 'writer', 'George',
'Axelrod', '(', '"', 'The', 'Seven', 'Year', 'Itch', '"', ')',
'and', 'director', 'Richard', 'Quine', 'should', 'have', 'taken',
'a', 'hint', 'from', 'Holden', "'s", 'character', 'Richard',
'Benson', 'who', 'writes', 'his', 'movie', ',', 'takes', 'a',
'long', 'sober', 'look', 'at', 'what', 'he', 'has', 'wrought',
',', 'and', 'burns', 'it', '.', '"']
sentence = '"Time" magazine said the film was "a multimillion dollar improvisation that does everything but what the title promises" and suggested that "writer George Axelrod ("The Seven Year Itch") and director Richard Quine should have taken a hint from Holden's character Richard Benson who writes his movie, takes a long sober look at what he has wrought, and burns it."'
I've tried a rule-based approach that adds a space after an element depending on the contents of the next element, but my method ended up as a really long piece of code containing rules for as many cases as I could think of, like those for parentheses or quotations. Is there a way to join this list into a correctly formatted sentence more efficiently?
I think a simple for loop should do the trick:
sentence = ""
for word in ls:
    if (word == ',' or word == '.') and sentence != '':
        sentence = sentence[:-1]  # remove the space added after the previous word
    sentence += word
    if word != '"' and word != '(':
        sentence += ' '  # add a space after each word
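If you'd rather not maintain those rules by hand, NLTK ships a detokenizer that encodes most of the usual spacing conventions. A minimal sketch (my suggestion, not part of the answer above; it assumes nltk is installed, and plain '"' quotes, as opposed to Treebank-style quotes, may still need a cleanup pass):
from nltk.tokenize.treebank import TreebankWordDetokenizer

# the detokenizer reattaches punctuation such as ',' and '.' to the preceding word
sentence = TreebankWordDetokenizer().detokenize(ls)
print(sentence)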
import PyPDF2

# 'file' is the PDF file object opened earlier
fileReader = PyPDF2.PdfFileReader(file)
s = ""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

# split the text into an array of sentences based on where we see a '.'
# - need to account for how to avoid breaking at e.g. Mr.
sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index+1:]

corpus = []
for sentence in sentences:
    corpus.append(tokenizer.tokenize(sentence))
print(corpus[20])
The above code reads the file and tokenizes the string. The output I get is as follows:
But the desired output is:
['With', 'the', 'graph', 'of', 'COVID', '19', 'at', 'this', 'moment', 'have', 'started', 'a', 'slide', 'downward', 'trend', 'we', 'are', 'confident', 'of', 'all', 'our', 'brands', 'performing', 'strongly', 'in', 'the', 'coming', 'quarters']
i.e. the words should not get broken down. Is there any way to avoid this?
The string 's' is taken from a pdf and looks something like this:
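As an aside on the sentence-splitting loop above: instead of scanning for '.' by hand, a trained sentence tokenizer such as NLTK's sent_tokenize already knows not to break at abbreviations like "Mr.". A rough sketch (my suggestion, assuming nltk and its punkt data are available):
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the sentence model

# s is the text extracted from the PDF above
sentences = sent_tokenize(s)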
I'm trying to come up with the regular expression to split a string up into a list based on white space or trailing punctuation.
e.g.
s = 'hel-lo this has whi(.)te, space. very \n good'
What I want is
['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
s.split() gets me most of the way there, except it doesn't take care of the trailing punctuation.
import re
s = 'hel-lo this has whi(.)te, space. very \n good'
[x for x in re.split(r"([.,!?]+)?\s+", s) if x]
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
You might need to tweak what "punctuation" is.
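For instance (my variation, not part of the answer), to also treat semicolons as trailing punctuation, widen the character class:
import re

s = 'hel-lo this has whi(.)te, space; very \n good'
print([x for x in re.split(r"([.,!?;]+)?\s+", s) if x])
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', ';', 'very', 'good']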
Rough solution using spacy, which already works pretty well for tokenizing words.
import spacy
s = 'hel-lo this has whi(.)te, space. very \n good'
nlp = spacy.load('en')
ls = [t.text for t in nlp(s) if t.text.strip()]
>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
However, it also splits words on '-', so I borrowed the solution from here to merge the hyphenated pieces back together.
merge = [(i-1, i+2) for i, s in enumerate(ls) if i >= 1 and s == '-']
for t in merge[::-1]:
    merged = ''.join(ls[t[0]:t[1]])
    ls[t[0]:t[1]] = [merged]
>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
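Alternatively (not what this answer does), spacy 2.1+ exposes Doc.retokenize, which merges the hyphenated pieces inside the Doc itself instead of patching the list afterwards. A sketch, assuming the en_core_web_sm model is installed:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('hel-lo this has whi(.)te, space. very \n good')

# merge each token/'-'/token triple back into a single token
with doc.retokenize() as retokenizer:
    for i, tok in enumerate(doc):
        if tok.text == '-' and 0 < i < len(doc) - 1:
            retokenizer.merge(doc[i - 1:i + 2])

ls = [t.text for t in doc if t.text.strip()]
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']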
I am using Python 3.6.1.
import re

s = 'hel-lo this has whi(.)te, space. very \n good'
a = []  # this list stores the items
for i in s.split():  # split on whitespace
    j = re.split(r'(,|\.)$', i)  # split on your definition of trailing punctuation marks
    if len(j) > 1:
        a.extend(j[:-1])
    else:
        a.append(i)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
The built-in <string>.split() procedure only uses whitespace to split the string.
I'd like to define a procedure, split_string, that takes two inputs: the string to split and a string containing all of the characters considered separators.
The procedure should return a list of strings that break the source string up by the characters in the list.
def split_string(source, list):
    ...
>>> print split_string("This is a test-of the,string separation-code!",",!-")
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']
re.split() works:
>>> import re
>>> s = "This is a test-of the,string separation-code!"
>>> re.split(r'[ \-\,!]+', s)
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code', '']
In your case searching for words seems more useful:
>>> re.findall(r"[\w']+", s)
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']
Here's a function you can reuse that also escapes special characters:
import re

def escape_char(char):
    special = ['.', '^', '$', '*', '+', '?', '\\', '[', ']', '|']
    return '\\{}'.format(char) if char in special else char

def split(text, *delimiters):
    return re.split('|'.join([escape_char(x) for x in delimiters]), text)
It doesn't automatically remove empty entries, e.g.:
>>> split('Python, is awesome!', '!', ',', ' ')
['Python', '', 'is', 'awesome', '']
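As a side note (my suggestion): the standard library's re.escape already escapes every regex metacharacter, including '(' and ')', which the special list above misses, and the empty entries can be dropped in the same function:
import re

def split(text, *delimiters):
    # re.escape handles every regex metacharacter for us
    pattern = '|'.join(re.escape(d) for d in delimiters)
    return [part for part in re.split(pattern, text) if part]

print(split('Python, is awesome!', '!', ',', ' '))
# ['Python', 'is', 'awesome']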
In test.txt, I have 2 lines of sentences.
The heart was made to be broken.
There is no surprise more magical than the surprise of being loved.
The code:
import re

file = open('/test.txt', 'r')  # specify file to open
data = file.readlines()
file.close()
for line in data:
    line_split = re.split(r'[ \t\n\r, ]+', line)
    print line_split
Results from the code:
['The', 'heart', 'was', 'made', 'to', 'be', 'broken.', '']
['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the', 'surprise', 'of', 'being', 'loved.']
How do I get only the words printed out (see the first sentence)? Expected result:
['The', 'heart', 'was', 'made', 'to', 'be', 'broken.']
['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the', 'surprise', 'of', 'being', 'loved.']
Any advice?
Instead of using split to match the delimiters, you can use findall with the negated regular expression to match the parts you want to keep:
line_split = re.findall(r'[^ \t\n\r, ]+', line)
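For example, with the two lines above saved in test.txt:
import re

with open('test.txt', 'r') as f:
    for line in f:
        print(re.findall(r'[^ \t\n\r, ]+', line))
# ['The', 'heart', 'was', 'made', 'to', 'be', 'broken.']
# ['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the',
#  'surprise', 'of', 'being', 'loved.']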
To fix it, with a few other changes explained below:
import re

with open("test.txt", "r") as file:
    for line in file:
        line_split = list(filter(bool, re.split(r'[ \t\n\r, ]+', line)))
        print(line_split)
Here we use filter() to remove any empty strings from the result, wrapped in list() so it prints as a list on Python 3.
Note my use of the with statement to open the file. This is more readable and handles closing the file for you, even on exceptions.
We also loop directly over the file - this is a better idea as it doesn't load the entire file into memory at once, which is not needed and could cause problems with big files.
words = re.compile(r"[\w']+").findall(yourString)
Demo
>>> yourString = "Mary's lamb was white as snow."
>>> re.compile(r"[\w']+").findall(yourString)
["Mary's", 'lamb', 'was', 'white', 'as', 'snow']
If you really do want periods, you can add those as [\w'\.]
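For instance, with that extended class the trailing period is kept:
>>> re.findall(r"[\w'\.]+", yourString)
["Mary's", 'lamb', 'was', 'white', 'as', 'snow.']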
In [2]: with open('test.txt','r') as f:
   ...:     lines = f.readlines()
   ...:
In [3]: words = [l.split() for l in lines]
In [4]: words
Out[4]:
[['The', 'heart', 'was', 'made', 'to', 'be', 'broken.'],
['There',
'is',
'no',
'surprise',
'more',
'magical',
'than',
'the',
'surprise',
'of',
'being',
'loved.']]