Tokenize list of strings without comma separation - python

I'm still new to Python and want to know how I can tokenize a list of strings without every word being separated by a comma.
For example, starting from a list like ['I have to get groceries.','I need some bananas.','Anything else?'], I want to obtain a list like this: ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']. The point is thus not to create a list with separate tokens necessarily, but to create a list with sentences in which all words and punctuation marks are separated from each other.
Any ideas? I only managed to create a list of comma separated tokens, using this code:
import nltk
nltk.download('punkt')
from nltk import word_tokenize

tokenized = []
for line in unique:
    tokenized.append(word_tokenize(line))

You can join the tokenized lines with a space; just use:
from nltk import word_tokenize
unique = ['I have to get groceries.','I need some bananas.','Anything else?']
tokenized = [" ".join(word_tokenize(line)) for line in unique]
print(tokenized)
# => ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']
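If you prefer to keep the explicit loop from your original code, the same join works there as well; a minimal sketch (assuming the punkt data has been downloaded, as in your snippet):
import nltk
from nltk import word_tokenize

nltk.download('punkt')  # the word tokenizer needs the punkt data

unique = ['I have to get groceries.', 'I need some bananas.', 'Anything else?']
tokenized = []
for line in unique:
    # word_tokenize returns a list of tokens; join them back with spaces
    tokenized.append(' '.join(word_tokenize(line)))
print(tokenized)
# => ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']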

Related

replace any words in string that match an entry in list with a single tag (python)

I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
So, as a small example, if
infrequent_words = ['dog','cat']
sentence = 'My dog likes to chase after cars'
Then after applying the transformation it should be
sentence = 'My unk likes to chase after cars'
I am having trouble finding an efficient way to do this. This function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text, infrequent_words):
    for word in infrequent_words:
        text = text.replace(word, 'unk')
    return text
Thank you!
infrequent_words = {'dog', 'cat'}
sentence = 'My dog likes to chase after cars'

def replace_infrequent_words(text, infrequent_words):
    words = text.split()
    for i in range(len(words)):
        if words[i] in infrequent_words:
            words[i] = 'unk'
    return ' '.join(words)

print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation, but it should be a performance improvement over what you posted.
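As a rough sketch of how this might be applied to your whole list of sentences (the variable names here are illustrative, and it assumes the sentences are plain whitespace-separated strings):
infrequent_words = {'dog', 'cat'}  # build this set once from your ~20k-word list
sentences = ['My dog likes to chase after cars',
             'My cat sleeps all day']

def replace_infrequent_words(text, infrequent_words):
    # set membership is O(1), so each sentence takes a single pass over its words
    return ' '.join('unk' if w in infrequent_words else w for w in text.split())

cleaned = [replace_infrequent_words(s, infrequent_words) for s in sentences]
print(cleaned)
# => ['My unk likes to chase after cars', 'My unk sleeps all day']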

NLTK tokenize text with dialog into sentences

I am able to tokenize non-dialog text into sentences but when I add quotation marks to the sentence the NLTK tokenizer doesn't split them up correctly. For example, this works as expected:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
This results in a list of three different sentences:
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
However, if I make it into a dialogue, the same process doesn't work.
text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)
This returns it as a single sentence:
['“Is this one sentence?” “This is separate.” “This is a third” he said.']
How can I make the NLTK tokenizer work in this case?
It seems the tokenizer doesn't know what to do with the directional (curly) quotes. Replace them with regular ASCII double quotes and the example works fine.
>>> import re
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']

Using itertools in python 3.5, how do I create only some variations of some lists, but all variations of others?

Right now I'm using itertools to create thousands of variations of meta descriptions for a website. The sentence structure goes like this:
(SentencePart1a|SentencePart1b) Keyword1, Keyword2, and Keyword3. (MiddleSentenceA|MiddleSentenceB). (FinalSentenceA|FinalSentenceB).
The short version of the script:
import itertools

metas = [
    ['Shop our store for ', 'Browse our shop for ,'],  # SentenceParts
    ['keyword,', 'keyword,', 'etc,'],
    ['keyword, and', 'keyword, and', 'etc, and'],
    ['keyword.', 'keyword.', 'etc.'],
    ['Sentence about our store.', 'A different sentence about our store.'],  # MiddleSentences
    ['A final sentence about our store.', 'A different final sentence about our store.']  # FinalSentences
]

variantmetas = list(itertools.product(*metas))
#print(variantmetas)

for s in variantmetas:
    print(' '.join(s))
Right now I get every variation of all of these things. My program spits out all sentence parts, all middle sentences, and all final sentences even if Keywords 1-3 are the same.
How do I make it so that keywords 1-3 only show up one time and with one random variation of Sentence parts? In other words: all variations of the keywords only once, with one SentencePart, one MiddleSentence, and one FinalSentence.
I am trying to minimize redundancy in the final list.
Just perform the product on the keywords, not on the sentences.
Use random.choice on the start & end sentences, and generate the result using a list comprehension.
Here is a proposal without any reorganization of the metas list (although it might be cleaner to isolate the keywords from the start/end sentences):
import itertools, random

metas = [
    ['Shop our store for ', 'Browse our shop for ,'],  # SentenceParts
    ['keyword,', 'keyword,', 'etc,'],
    ['keyword, and', 'keyword, and', 'etc, and'],
    ['keyword.', 'keyword.', 'etc.'],
    ['Sentence about our store.', 'A different sentence about our store.'],  # MiddleSentences
    ['A final sentence about our store.', 'A different final sentence about our store.']  # FinalSentences
]

variantmetas = [[random.choice(metas[0])] + list(l) + [random.choice(metas[-2]), random.choice(metas[-1])]
                for l in itertools.product(*metas[1:-2])]

for s in variantmetas:
    print(' '.join(s))
An extract of the result:
Shop our store for etc, keyword, and keyword. Sentence about our store. A different final sentence about our store.
Browse our shop for , etc, keyword, and etc. Sentence about our store. A different final sentence about our store.
Browse our shop for , etc, etc, and keyword. Sentence about our store. A final sentence about our store.
Shop our store for etc, etc, and keyword. Sentence about our store. A different final sentence about our store.
Shop our store for etc, etc, and etc. Sentence about our store. A final sentence about our store.

Removing stopwords from list using python3

I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I have tried using sample text in the code to validate it, but the result is the same. Below is my code; I would appreciate it if anyone could help me rectify the issue.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv

article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence'
           'cat in the hat']

tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:' + str(stop_models))
Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line tries to match a list of words to a stopword:
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
Instead, iterate again through words. Something like:
clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)
print(clean_models)
Off-topic useful hint:
To define a multi-line string, use parentheses and no commas (adjacent string literals are concatenated into a single string):
article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')
This version would work with your original code
word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.
This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:
>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
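A minimal sketch of the single-string variant (assuming one flat list of filtered tokens is what you actually want, rather than one list per sentence):
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me')

stopset = set(stopwords.words('english'))
tokens = word_tokenize(article)
# each item is now a single word, so the membership test behaves as intended
filtered = [t for t in tokens if t.lower() not in stopset]
print(filtered)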

filtering stopwords near punctuation

I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() produces elements like 'word.' that don't match the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then clean it of stopwords. A tokenizer usually recognizes punctuation.
import nltk

text = 'Son, if you really want something in this life, \
you have to work for it. Now quiet! They are about \
to announce the lottery numbers.'

stopwords = ['in', 'to', 'for', 'the']

sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))

print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])
This extracts the leading run of ASCII letters from each token (dropping attached punctuation) and checks that against your stopwords set or list. It also assumes that all of your words in stopwords are lowercase, which is why the extracted word is lowercased first; take that out if that's too strong an assumption.
Also, I'm not proficient in regex, so apologies if there's a cleaner or more robust way of doing this.
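One way to avoid re.match returning None on tokens that contain no letters at all is to strip punctuation only for the comparison while keeping the original token in the output; a sketch under that assumption:
import string

stopwords = {'in', 'to', 'for', 'the'}
text = 'Stay in the house, for now.'

clean = ' '.join(word for word in text.split()
                 # strip surrounding punctuation only for the stopword test,
                 # but keep the original token (with its punctuation) in the output
                 if word.strip(string.punctuation).lower() not in stopwords)
print(clean)
# => 'Stay house, now.'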
