I'm trying to come up with a regular expression to split a string into a list based on whitespace or trailing punctuation.
e.g.
s = 'hel-lo this has whi(.)te, space. very \n good'
What I want is
['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
s.split() gets me most of the way there, except it doesn't split off the trailing punctuation.
import re
s = 'hel-lo this has whi(.)te, space. very \n good'
[x for x in re.split(r"([.,!?]+)?\s+", s) if x]
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
You might need to tweak what "punctuation" is.
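If you need a broader class, one option (a sketch of my own, not part of the answer above) is to build it from string.punctuation, minus the characters you want to keep inside words:
import re
import string

s = 'hel-lo this has whi(.)te, space. very \n good'

# Build the trailing-punctuation class from string.punctuation,
# dropping characters that may appear inside a word (here - ( ) ').
punct = re.escape(''.join(c for c in string.punctuation if c not in "-()'"))
print([x for x in re.split(rf"([{punct}]+)?\s+", s) if x])
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']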
A rough solution using spacy, which already does a good job of tokenizing words.
import spacy
s = 'hel-lo this has whi(.)te, space. very \n good'
nlp = spacy.load('en')
ls = [t.text for t in nlp(s) if t.text.strip()]
>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
However, it also splits words on -, so I borrowed a solution from here to merge the pieces around each - back together.
merge = [(i-1, i+2) for i, s in enumerate(ls) if i >= 1 and s == '-']
for t in merge[::-1]:
    merged = ''.join(ls[t[0]:t[1]])
    ls[t[0]:t[1]] = [merged]
>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
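If you are on a newer spaCy (v2+, where the small English model is typically loaded as en_core_web_sm), the retokenizer can merge the pieces around each - instead of patching the output list afterwards; this is a sketch under those assumptions:
import spacy

s = 'hel-lo this has whi(.)te, space. very \n good'
nlp = spacy.load('en_core_web_sm')  # 'en' was the shorthand in older releases
doc = nlp(s)

# Merge each ('x', '-', 'y') triple back into a single token.
with doc.retokenize() as retokenizer:
    i = 1
    while i < len(doc) - 1:
        if doc[i].text == '-':
            retokenizer.merge(doc[i-1:i+2])
            i += 3  # skip past the merged triple so spans don't overlap
        else:
            i += 1

print([t.text for t in doc if t.text.strip()])
# >> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']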
I am using Python 3.6.1.
import re
s = 'hel-lo this has whi(.)te, space. very \n good'
a = [] # this list stores the items
for i in s.split():                  # split on whitespace
    j = re.split(r'(,|\.)$', i)      # split on your definition of trailing punctuation marks
    if len(j) > 1:
        a.extend(j[:-1])             # keep the word and its trailing punctuation separately
    else:
        a.append(i)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
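For comparison, the same idea in a single pass with re.match; the [.,!?]+ class is my guess at "trailing punctuation", so adjust it to taste:
import re

s = 'hel-lo this has whi(.)te, space. very \n good'
a = []
for chunk in s.split():
    # peel an optional run of trailing punctuation off each chunk
    word, punct = re.match(r"(.*?)([.,!?]+)?$", chunk).groups()
    if word:
        a.append(word)
    if punct:
        a.append(punct)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']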
I'm trying to join a list of words and characters, such as the one below (ls), into a single, correctly formatted sentence string (sentence), for a whole collection of such lists.
ls = ['"', 'Time', '"', 'magazine', 'said' 'the', 'film', 'was',
'"', 'a', 'multimillion', 'dollar', 'improvisation', 'that',
'does', 'everything', 'but', 'what', 'the', 'title', 'promises',
'"', 'and', 'suggested', 'that', '"', 'writer', 'George',
'Axelrod', '(', '"', 'The', 'Seven', 'Year', 'Itch', '"', ')',
'and', 'director', 'Richard', 'Quine', 'should', 'have', 'taken',
'a', 'hint', 'from', 'Holden', "'s", 'character', 'Richard',
'Benson', 'who', 'writes', 'his', 'movie', ',', 'takes', 'a',
'long', 'sober', 'look', 'at', 'what', 'he', 'has', 'wrought',
',', 'and', 'burns', 'it', '.', '"']
sentence = '"Time" magazine said the film was "a multimillion dollar improvisation that does everything but what the title promises" and suggested that "writer George Axelrod ("The Seven Year Itch") and director Richard Quine should have taken a hint from Holden's character Richard Benson who writes his movie, takes a long sober look at what he has wrought, and burns it."'
I've tried a rule-based approach that adds a space after an element depending on the contents of the next element, but my method ended up as a really long piece of code containing rules for as many cases as I could think of, like those for parentheses or quotations. Is there a way to join this list into a correctly formatted sentence more efficiently and effectively?
I think a simple for loop should do the trick:
sentence = ""
for word in ls:
if (word == ',' or word == '.') and sentence != '':
sentence = sentence[:-1] #removing the last space added
sentence += word
if word != '\"' or word != '(':
sentence += ' ' #adding a space after each word
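If pulling in a library is an option, NLTK ships a Treebank detokenizer that handles spacing around most punctuation; quote pairing is imperfect, so treat this as an approximation of the target sentence rather than an exact match:
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Spacing around , . ) etc. is handled for you; paired quotes may
# not come out exactly as in the hand-written target string.
sentence = TreebankWordDetokenizer().detokenize(ls)
print(sentence)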
My string is quite messy and looks something like this:
s="I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
I'd like the hyphenated (and sometimes whitespace-separated) words joined back together in one list. Desired output:
list = ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
I tried a lot of different variations, but nothing worked.
rgx = re.compile("([\w][\w'][\w\-]*\w)")
s = "My string'"
rgx.findall(s)
Here's one way:
[re.sub(r'\s*-\s*', '', i) for i in re.split(r'(?<!-)\s(?!-)', s)]
# ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own.', 'Would', 'you', 'help', 'me?']
Two operations here:
Split the text on whitespace that has no adjacent hyphen, using a negative lookbehind and a negative lookahead.
In each split word, replace any hyphen, together with whitespace before or after it, with the empty string.
You can see the first operation's demo here: https://regex101.com/r/ayHPvY/2
And the second: https://regex101.com/r/ayHPvY/1
Edit: To get the . and ? to be separated as well, use this instead:
[re.sub(r'\s*-\s*','', i) for i in re.split(r"(?<!-)\s(?!-)|([^\w\s'-]+)", s) if i]
# ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
The catch is also splitting on runs of characters that are not alphanumeric, whitespace, hyphens, or apostrophes. The if i is necessary because the split may return some None and empty-string items.
A quick, non-regex way to do it would be
''.join(map(lambda s: s.strip(), s.split('-'))).split()
that is: split on hyphens, strip the extra whitespace, join back into a string, and split on spaces. This, however, doesn't separate dots or question marks.
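If you also want the . and ? split off with this approach, a final pass over each word can peel them; a sketch building on the same one-liner:
import re

s = "I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"

# Join across hyphens as above, then split trailing . / ? off each word.
joined = ''.join(part.strip() for part in s.split('-')).split()
tokens = []
for w in joined:
    tokens.extend(t for t in re.split(r'([.?]+)$', w) if t)
# tokens -> ["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']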
How about this:
>>> s
"I'm hope-less and can -not solve this pro- blem on my own. Wo - uld you help me?"
>>> list(map(lambda x:re.sub(' *- *','',x), filter(lambda x:x, re.split(r'(?<!-) +(?!-)|([.?])',s))))
["I'm", 'hopeless', 'and', 'cannot', 'solve', 'this', 'problem', 'on', 'my', 'own', '.', 'Would', 'you', 'help', 'me', '?']
The above used a simple space ' ', but using \s is better:
list(map(lambda x: re.sub(r'\s*-\s*', '', x), filter(lambda x: x, re.split(r'(?<!-)\s+(?!-)|([.?])', s))))
(?<!-)\s+(?!-) means spaces that don't have - before or after.
[.?] means a single . or ?.
re.split(r'(?<!-)\s+(?!-)|([.?])',s) will split the string accordingly, but will have some None and empty string '' inside:
["I'm", None, 'hope-less', None, 'and', None, 'can -not', None, 'solve', None, 'this', None, 'pro- blem', None, 'on', None, 'my', None, 'own', '.', '', None, 'Wo - uld', None, 'you', None, 'help', None, 'me', '?', '']
This result is fed directly to filter to remove the None and '' items, and then to map to remove the spaces and - inside each word.
(Code removed to stop classmates copying)
Right now it will create a text file with positions of each word, so for example if I wrote
"Hello, my name is mika, Hello"
The positions in that list would be [1,2,3,4,5,6,2,1], and it also lists each word/punctuation mark only once, so in this case it would be
['Hello', ',', 'my', 'name', 'is', 'mika']
The only thing left now is to get the words back into the original sentence using those positions in the list, which I can't seem to do.
I did try searching other posts, but they all seemed to be from people wanting the positions of the words rather than wanting to put the words back into a sentence using the positions.
I also thought it could be started by doing this:
for i in range(len(readlines[1])):
but I honestly have no idea how to go about doing this.
Edit: This has now been solved by @Abhishek, thank you.
indices = [1,2,3,4,5,6,2,1]
namelst = ['Hello', ',', 'my', 'name', 'is', 'mika']
newstr = " ".join([namelst[x-1] for x in indices])
print (newstr)
output:
>> 'Hello , my name is mika , Hello'
I agree there will be some stray spaces around the punctuation, but it will give you the complete sentence again.
Code (removes the space before each punctuation mark):
indices = [1, 2, 3, 4, 5, 6, 2, 1]
namelst = ['Hello', ',', 'my', 'name', 'is', 'mika']
recreated = ''
for i in indices:
    w = namelst[i-1]
    if w not in ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]:
        w = ' ' + w  # prepend a space, except before punctuation
    recreated = (recreated + w).strip()
print(recreated)
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Hello, my name is mika, Hello
C:\Users\dinesh_pundkar\Desktop>
You can use numpy to do this.
>>> import numpy as np
>>> indices = np.array([1,2,3,4,5,6,2,1])
>>> namelst = np.array(['Hello', ',', 'my', 'name', 'is', 'mika'])
>>> ' '.join(namelst[indices-1])
'Hello , my name is mika , Hello'
I am using Python's NLTK library to tokenize my sentences.
If my code is
text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)
I get this as my output
['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
The symbols ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + isn't a delimiter, so that C# appears as a single token like C++ does?
I want my output to be
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
I want C# to be considered as one token.
Since this is a multi-word tokenization problem, another way would be to retokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer:
import nltk

tokens = nltk.word_tokenize("C# billion dollars; we don't own an ounce C++")
mwtokenizer = nltk.MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))  # case-sensitive, so use 'C' to match what word_tokenize emits
print(mwtokenizer.tokenize(tokens))
Another idea: instead of altering how text is tokenized, just loop over the tokens and join every '#' with the preceding one.
txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)
i_offset = 0
for i, t in enumerate(tokens):
i -= i_offset
if t == '#' and i > 0:
left = tokens[:i-1]
joined = [tokens[i - 1] + t]
right = tokens[i + 1:]
tokens = left + joined + right
i_offset += 1
>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
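The same idea reads a little simpler as a small helper that appends each '#' to the token before it while building a new list (merge_token is a name I made up):
from nltk import word_tokenize

def merge_token(tokens, symbol='#'):
    """Join every `symbol` token onto the preceding token."""
    out = []
    for t in tokens:
        if t == symbol and out:
            out[-1] += t
        else:
            out.append(t)
    return out

print(merge_token(word_tokenize("C# billion dollars; we don't own an ounce C++")))
# ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']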
NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.
I'll create an example for you where the text is split on any whitespace character (tab, newline, etc.) and a couple of other symbols, just for illustration:
>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
I'm trying to split words, punctuation, and numbers out of a sentence. However, my code doesn't produce the output I expect. How can I fix it?
This is my input text (in a text file):
"I 2changed to ask then, said that mildes't of men2,
And my code outputs this:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men2']
However, the expected output is:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men','2']
Here's my code:
import re

newlist = []
f = open("Inputfile2.txt", 'r')
out = f.readlines()
for line in out:
    word = line.strip('\n')
f.close()
lst = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]").findall(word)
print(lst)
In regular expressions, \w matches any alphanumeric character plus the underscore, i.e. [a-zA-Z0-9_].
Also, in the first part of your regular expression, it should be \d+ to match more than one digit.
The second and third parts of your regular expression, \w+[\w']+|\w, can be merged into a single part by changing + to *.
import re

with open('Inputfile2.txt', 'r') as f:
    for line in f:
        word = line.strip('\n')
        lst = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]").findall(word)
        print(lst)
This gives:
['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men', '2', ',']
Note that your expected output is incorrect. It is missing a ','.