Project Gutenberg Python problem? - python

I am trying to process various texts with regex and Python's NLTK (see http://www.nltk.org/book). I am trying to create a random text generator and I am having a hard time with one problem. First, here is my algorithm:
1. Enter a sentence as input (this is called the trigger string).
2. Get the longest word in the trigger string.
3. Search the entire Project Gutenberg database for sentences that contain this word (regardless of case).
4. Return the longest sentence containing the word found in step 2.
5. Append the sentences from step 1 and step 4 together.
6. Repeat the process, each time taking the longest word of the newly returned sentence, and so on.
So far I have been able to do this for the first two sentences, but I cannot perform a case-insensitive search. The entire sentence database of Project Gutenberg is available via the gutenberg.sents() function, but a case-insensitive regex search over it is awkward because gutenberg.sents() outputs the sentences of each book in a list-of-lists format:
EXAMPLE: all the sentences of Shakespeare's Macbeth are retrieved by typing
import nltk
from nltk.corpus import gutenberg
gutenberg.sents('shakespeare-macbeth.txt')
into the Python shell, and the output is:
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
['Actus', 'Primus', '.'], .......]
with [The Tragedie of Macbeth by William Shakespeare, 1603] and Actus Primus. being the first two sentences.
How can I find the word I'm looking for regardless of whether it is uppercase or lowercase? I'm desperately in need of help since I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.

Given a list L of words, and a target word t,
any(t.lower()==w.lower() for w in L)
tells you whether L has word t in a case-insensitive way. It's faster, of course, to do
lt = t.lower()
any(lt==w.lower() for w in L)
since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.
Given a list of lists lol, the longest sub-list including t can be found by
longest = max((L for L in lol if any(lt==w.lower() for w in L)), key=len)
If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.
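Applying that to gutenberg.sents(), a minimal sketch of steps 3 and 4 might look like this (the helper name is just illustrative; max() would raise ValueError if nothing matches, so a default is supplied):

from nltk.corpus import gutenberg

def longest_sentence_with(word):
    # Longest Gutenberg sentence (as a token list) containing `word`, compared case-insensitively.
    lw = word.lower()
    matches = (sent for sent in gutenberg.sents()   # all files; pass a fileid to search one book
               if any(lw == w.lower() for w in sent))
    return max(matches, key=len, default=[])

print(' '.join(longest_sentence_with('Macbeth')))

Scanning the whole corpus this way is slow but straightforward; restricting gutenberg.sents() to a single fileid (e.g. 'shakespeare-macbeth.txt') keeps the example quick.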

How about using the built-in method str.lower()?
Return a copy of the string converted to lowercase.
Then just compare the strings.
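A minimal illustration of that comparison (the variable names are just for the example):

word = 'Macbeth'
token = 'MACBETH'
print(word.lower() == token.lower())  # True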

Related

Python - counting frequency of a non-ngram sequence in a list of strings efficiently

As I've stated in the title, I'm trying to calculate the frequency of a given list of phrases (sequences) appearing in a list of strings. The problem is that the words in a phrase do not have to appear next to each other; there may be one or more words in between.
Example:
Sequence: ('able', 'help', 'number') in a sentence "Please call us, we may be able to help, our phone number is 1234"
I remove the stopwords (NLTK stopwords), remove punctuation, lowercase all letters and tokenize the sentence, so the processed sequence looks like ['please', 'call', 'us', 'able', 'help', 'phone', 'number', '1234']. I have about 30,000 sequences varying in length from 1 (single words) to 3, and I'm searching in almost 6,000 short sentences. My current approach is presented below:
from collections import Counter
from tqdm import tqdm
import itertools
import nltk

# Get term frequency per sentence
def get_bow(sen, vocab):
    vector = [0] * len(vocab)
    tokenized_sentence = nltk.word_tokenize(sen)
    combined_sentence = list(itertools.chain.from_iterable(
        [itertools.combinations(tokenized_sentence, 1),
         itertools.combinations(tokenized_sentence, 2),
         itertools.combinations(tokenized_sentence, 3)]))
    for el in combined_sentence:
        if el in vocab:
            cnt = combined_sentence.count(el)
            idx = vocab.index(el)
            vector[idx] = cnt
    return vector

sentence_vectors = []
for sentence in tqdm(text_list):
    sent_vec = get_bow(sentence, phrase_list)
    sentence_vectors.append(sent_vec)
phrase_list is a list of tuples with the sequences, and text_list is a list of strings. Currently the frequency calculation takes over an hour, and I'm trying to find a more efficient way to get the list of frequencies associated with the given terms. I've also tried sklearn's CountVectorizer, but there's a problem with processing sequences with gaps: they're not counted at all.
I'd be grateful for any insight into how to make my script more efficient. Thanks in advance!
EDIT:
Example of phrase_list: [('able',), ('able', 'us', 'software'), ('able', 'back'), ('printer', 'holidays'), ('printer', 'information')]
Example of text_list: ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look', 'absolutely one cat coyote peterson', 'accurate customs checks cause delays also causing issues expected delivery dates changing', 'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get', 'additional information though pass comments team thanks']
Expected output: [2, 0, 0, 1, 0] - a vector with the occurrence count of each phrase; the order of values should be the same as in phrase_list. My code returns a vector of phrase occurrences per sentence, because I was trying to implement something like a bag-of-words.
There are many aspects that could be made faster, but here is the main problem:
combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
itertools.combinations(tokenized_sentence, 2),
itertools.combinations(tokenized_sentence, 3)]))
You generate every possible combination of 1, 2 or 3 words of the sentence. That blows up combinatorially and is expensive no matter what you do with the result afterwards.
Sentence: "Master Yoda about sentence structure care does not."
If you really do want to treat this sentence as if it contained "Yoda does not", you should still not generate all combinations; there are much faster ways, but I will only spend time on this if that is indeed your goal.
If you want to treat this sentence as one that does NOT contain "Yoda does not", then I think you can figure out yourself how to speed up your code. Maybe look here.
I hope this helped. Let me know in case you need option 1.
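Purely as a hedged sketch of one faster route (under the assumption that a phrase occurs in a sentence when all of its words are present, gaps allowed and order ignored, and that a multi-word phrase counts min(per-word token counts) times per sentence, which happens to reproduce the example output): each sentence is tokenised once into a Counter and no combinations are generated.

from collections import Counter

def phrase_frequencies(phrase_list, text_list):
    # One Counter per pre-processed sentence, built once.
    sentence_counters = [Counter(sentence.split()) for sentence in text_list]
    freqs = []
    for phrase in phrase_list:
        total = 0
        for counts in sentence_counters:
            # Contributes 0 if any word of the phrase is missing from this sentence.
            total += min((counts[w] for w in phrase), default=0)
        freqs.append(total)
    return freqs

print(phrase_frequencies(phrase_list, text_list))  # with the example data above: [2, 0, 0, 1, 0]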

parsing emails to identify keywords

I'm looking to parse through a list of email texts to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check whether words from a keywords list appear in any of these sentences, using regex. I wouldn't want informations to be captured, only information:
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately, I would like to filter the current list down to the elements that contain the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))
print(results_lambda)
[['more information in this one']]
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords_to_match = ['filter', 'one']
result = [x for x in lists if any(k in x[0] for k in keywords_to_match)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all words are present in the list:
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    # the intersection of both sets should equal the keyword set
    return len(set(sentence.split(" ")) & keyword_set) == n

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final set of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() [maybe rename it to anyKeywdsPresent]:
def allKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
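With that any-match version and the example data from the question (reusing keyword_set and sentences defined above), the results line up with the expected output:

boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
# [False, True, False]
parsed_list = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# [['more information in this one']]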
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.), first split your sentences into words, and then match those against your keywords. The easiest, although not fool-proof, way is to use the regex \W (non-word character) class and split your sentences on such occurrences.
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!
WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work? Simple: we iterate over each of the sentences (lowercasing them for case-insensitivity), then split each sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing-fast comparisons (sets are hash-based, and hash-based intersections are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we take the set intersection against our keywords: if anything is returned, the two sets have at least one word in common, the condition evaluates to True, and, in that case, the current sentence gets added to the result.
Final note: beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace-only split), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as NLTK.
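As a hedged variant of the same idea, the \W+ split can be swapped for NLTK's tokenizer (this assumes the punkt tokenizer data has been downloaded via nltk.download('punkt'); sentences and keywords are the same as above):

import nltk

result = [s for s in sentences
          if set(nltk.word_tokenize(s[0].lower())) & keywords]
# [['more information in this one']]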

Python Count Number of Phrases in Text

I have a list of product reviews/descriptions in excel and I am trying to classify them using Python based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words e.g. 'computer' but I am struggling to make it work for phrases e.g. 'laptop case'.
I have looked through a few answers, but none were successful for me, including:
using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2. Because you need to split the text up, that does not work (and I think text.count does not accept lists either?)
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve the splitting of the body of text into individual words?
The code I currently have (that works for individual terms) is:
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if type(descriptions) is str:
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)

print(pool)
You're on the right track! You're currently splitting into single words, which facilitates finding occurrences of single words as you pointed out. To find phrases of length n you should split the text into chunks of length n, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams
sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
    print(gram)
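To go from generating n-grams to actually counting specific phrases, one possible follow-up sketch (the description string and the phrase list here are purely illustrative, echoing the question's 'laptop case' / 'laptop bag' example):

from collections import Counter
from nltk import ngrams

description = 'I have a laptop case and a laptop bag and a second laptop case'
phrases = ['laptop case', 'laptop bag']

bigram_counts = Counter(ngrams(description.lower().split(), 2))
counts = {p: bigram_counts[tuple(p.split())] for p in phrases}
print(counts)  # {'laptop case': 2, 'laptop bag': 1}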
Sklearn's CountVectorizer is the standard way
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)
And if you want to see the counts as a dict:
count_dict = {k:v for k,v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v>0}
print (count_dict)
The default is unigrams; you can include bigrams or higher-order n-grams with the ngram_range parameter.
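For instance, a sketch of the same pipeline counting two-word phrases as well (descriptions is assumed to be the same iterable of strings as above; newer scikit-learn versions spell the accessor get_feature_names_out()):

from sklearn.feature_extraction import text

vectorizer = text.CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec = vectorizer.fit_transform(descriptions)
count_dict = {k: v for k, v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v > 0}
print(count_dict)  # phrases such as 'laptop case' now appear as their own features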

Tokenizing first and last name as one token

Is it possible to tokenize a text into tokens such that first and last names are combined into one token?
For example if my text is:
text = "Barack Obama is the President"
Then:
text.split()
results in:
['Barack', 'Obama', 'is', 'the', 'President']
How can I recognize the first and last name, so that I get ['Barack Obama', 'is', 'the', 'President'] as tokens?
Is there a way to achieve it in Python?
What you are looking for is a named entity recognition system. I suggest you do not consider this as part of tokenization.
For python you can use https://pypi.python.org/pypi/ner/
Example from the site
>>> tagger.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Here's a regular expression that meets the needs of your question. It will find individual words beginning with a lowercase character, or match singleton or pairs of capitalized words.
import re
re.findall(r"[a-z]\w+|[A-Z]\w+(?: [A-Z]\w+)?",text)
outputs
['Barack Obama', 'is', 'the', 'President']

How to find collocations in text, python

How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
NLTK provides a bigrams function that returns word pairs:
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur together more often than the frequencies of the individual words would suggest. Any ideas how to put this into code?
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
... for sent in nltk.sent_tokenize(sentences.lower()):
... for word in nltk.word_tokenize(sent):
... yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
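To follow up on the BigramCollocationFinder pointer with a corpus large enough to produce results, a hedged sketch (this uses NLTK's bundled genesis corpus, so the corresponding corpus data needs to be downloaded; the frequency filter and the choice of PMI as the ranking measure are just illustrative):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = nltk.corpus.genesis.words('english-web.txt')
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)            # ignore bigrams seen fewer than 3 times
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 10))  # ten bigrams with the highest pointwise mutual information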
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
# Python 2 (izip, dict.iteritems and the print statement)
from itertools import izip

words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # advance by one so the two iterators are offset

count = {}
for bigram in izip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1

print sorted(((c, b) for b, c in count.iteritems()), reverse=True)
(words_iter is introduced to avoid copying the whole list of words, as izip(words, words[1:]) would.)
from collections import Counter

words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)  # skip the first word so zip pairs each word with its successor

freq = Counter(zip(words, nextword))
print(freq)
A collocation is a sequence of tokens that are better treated as a single token when parsing e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?
I agree with Tim McNamara on using nltk and on the problems with unicode. However, I like the Text class a lot: there is a hack you can use to get the collocations as a list, which I discovered by looking at the source code. Apparently, whenever you invoke the collocations method it saves the result as an instance attribute!
import nltk

def tokenize(sentences):
    for sent in nltk.sent_tokenize(sentences.lower()):
        for word in nltk.word_tokenize(sent):
            yield word

text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
Enjoy!
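On recent NLTK releases there is also a public method for this, so the private attribute may not be needed (hedged: Text.collocation_list() only exists on newer versions, so check yours):

pairs = text.collocation_list(num=20)
collocations = [" ".join(pair) for pair in pairs]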
