I'm using nltk (already imported) and playing around with the Gutenberg corpus.
import nltk
from nltk.corpus import gutenberg
I checked out the fileids to find one I could play with:
gutenberg.fileids()
I wrote a small piece of code to find the most common words (in order to choose a few for the graph):
kjv = gutenberg.words('bible-kjv.txt')
kjv_text = nltk.Text(kjv)
from collections import Counter
for words in [kjv_text]:
    c = Counter(words)
    print c.most_common()[:100]  # top 100
kjv_text.dispersion_plot(["LORD", "God", "Israel", "king", "people"])
Up to here everything works perfectly. Then I try to implement the ConditionalFreqDist, but I get a bunch of errors:
cfd2 = nltk.ConditionalFreqDist((target, fileid['bible-kjv.txt'])
                                for fileid in gutenberg.fileids()
                                for w in gutenberg.words(fileid)
                                for target in ['lord']
                                if w.lower().startswith(target))
cfd2.plot()
I have tried changing a few things, but I always get some error. Can any experts tell me what I'm doing wrong?
Thanks
Here is what was wrong:
The fileid in:
cfd2 = nltk.ConditionalFreqDist((target, fileid['bible-kjv.txt'])
should refer to which element it is (in this case the 4th in the list of Gutenberg texts).
So the line should instead say:
cfd2 = nltk.ConditionalFreqDist((target, fileid[3])
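For comparison, here is a variant that keeps the whole fileid as the sample instead of indexing into it; this is only a sketch of one possible way to get a plot per text, not necessarily what the original poster intended:
cfd2 = nltk.ConditionalFreqDist((target, fileid)
                                for fileid in gutenberg.fileids()
                                for w in gutenberg.words(fileid)
                                for target in ['lord']
                                if w.lower().startswith(target))
cfd2.plot()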
I am trying to find synonyms and antonyms for one word, using strings from a field in a dataframe and not a standard wordnet.synsets lexical database. I'm pretty sure this is possible, but I'm not sure how to feed in the appropriate data source (my specific field).
For instance, the code below works fine.
import nltk
from nltk.corpus import wordnet #Import wordnet from the NLTK
syn = list()
ant = list()
for synset in wordnet.synsets("fake"):
    for lemma in synset.lemmas():
        syn.append(lemma.name()) #add the synonyms
        if lemma.antonyms(): #When antonyms are available, add them into the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))
I tried to convert the field to an array, and use that...
import pandas as pd
import nltk.corpus
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
df = pd.read_csv("C:\\my_path\\dataset.csv")
df['review_text'] = df['review_text'].astype(str)
type(df)
df.dtypes
asarray = pd.array(df['review_text'])
import nltk
from nltk.corpus import wordnet #Import wordnet from the NLTK
syn = list()
ant = list()
for synset in wordnet.asarray('fake'):
    for lemma in df['review_text'].iterrows():
        syn.append(lemma.name()) #add the synonyms
        if lemma.antonyms(): #When antonyms are available, add them into the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))
When I run that, I get this error:
AttributeError: 'WordNetCorpusReader' object has no attribute 'asarray'
The field in the dataframe looks like this:
feels comfortable i wear day!
package came end missing box. since it’s gift i update actual fit.
birkenstock amazing shoe!!!! i wish i ten pairs!
delivered advertised.... shoe looks & fits expected. leather color seems bit lighter one seen store, still satisfactory.
second pair i had. nothing beats them.
These are the first 5 rows. Maybe the issue is related to this token (not sure): it’s. It could be a typo or something like that.
The error happens on:
for synset in wordnet.asarray('fake'):
where wordnet is an object from nltk.corpus, and thus a WordNetCorpusReader, which has no attribute called asarray.
Before the for loop you have
asarray = pd.array(df['review_text'])
which reads the Pandas array/series of that column into a variable named asarray. But that variable has nothing to do with the WordNetCorpusReader.
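If the goal is to look up synonyms and antonyms for the words that actually occur in that column, one possible approach (only a sketch, reusing the review_text column name and the word_tokenize import already in the question) is to tokenize each row and call wordnet.synsets on each token:
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

syn = list()
ant = list()
for text in df['review_text']:           # iterate over the rows of the column
    for word in word_tokenize(text):     # iterate over the words in each row
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                syn.append(lemma.name())                   # add the synonyms
                if lemma.antonyms():                       # add antonyms when available
                    ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))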
I am looking for algorithms that can tell me the language of a piece of text (e.g. Hello - English, Bonjour - French, Servicio - Spanish) and also correct typos in English words. I have already explored Google's TextBlob; it is very relevant, but I get a "Too many requests" error as soon as my code starts executing. I also started exploring Polyglot, but I am facing a lot of issues installing the library on Windows.
Code for TextBlob
import pandas as pd
from tkinter import filedialog
from textblob import TextBlob
import time
from time import sleep

colnames = ['Word']
x = filedialog.askopenfilename(title='Select the word list')
print("Data to be checked: " + x)
df = pd.read_excel(x, sheet_name='Sheet1', header=0, names=colnames, na_values='?', dtype=str)
words = df['Word']
i = 0
Language_detector = pd.DataFrame(columns=['Word', 'Language', 'corrected_word', 'translated_word'])
for word in words:
    b = TextBlob(word)
    language_word = b.detect_language()
    time.sleep(0.5)
    if language_word in ['en', 'EN']:
        corrected_word = b.correct()
        time.sleep(0.5)
        Language_detector.loc[i, ['corrected_word']] = corrected_word
    else:
        translated_word = b.translate(to='en')
        time.sleep(0.5)
        Language_detector.loc[i, ['translated_word']] = translated_word
    Language_detector.loc[i, ['Word']] = word
    Language_detector.loc[i, ['Language']] = language_word
    i = i + 1
filename = "Language detector test v 1.xlsx"
Language_detector.to_excel(filename, sheet_name='Sheet1')
print("Languages identified for the word list")
A common way to classify languages is to gather summary statistics on letter or word frequencies and compare them to a known corpus. A naive Bayesian classifier would suffice; see https://pypi.org/project/Reverend/ for a way to do this in Python.
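As a rough illustration of the corpus-comparison idea (this is only a sketch; it swaps the naive Bayes classifier for a simple stopword-overlap score, using NLTK's built-in stopword lists as tiny reference corpora, so it only covers the languages NLTK ships stopwords for):
from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.tokenize import wordpunct_tokenize

def guess_language(text):
    tokens = set(t.lower() for t in wordpunct_tokenize(text))
    # score each language by how many of its stopwords occur in the text
    scores = {lang: len(tokens & set(stopwords.words(lang)))
              for lang in stopwords.fileids()}
    return max(scores, key=scores.get)

print(guess_language("Je ne sais pas si vous avez le temps de le faire"))  # likely 'french'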
Correction of typos can also be done from a corpus, using a statistical model of the most likely words versus the likelihood of a particular typo. See https://norvig.com/spell-correct.html for an example of how to do this in Python.
You could use this, but it is hardly reliable:
https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb
Alternatively, you could give the Compact Language Detector (CLD v3) or fastText a chance, or you could use a corpus to check the frequencies of occurring words against the target text in order to find out whether the target text belongs to the language of the respective corpus. The latter is only possible if you know the set of languages to choose from.
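For a quick impression of the langdetect route from that notebook (only a sketch; assumes pip install langdetect):
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed
for text in ["Hello, how are you?", "Bonjour tout le monde", "Servicio al cliente"]:
    print(text, detect(text))  # roughly 'en', 'fr', 'es'; very short inputs can be unreliable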
For typo correction, you could use the Levenshtein algorithm, which computes an «edit distance». You can compare your words against a dictionary and choose the most likely word. For Python, you could use https://pypi.org/project/python-Levenshtein/.
See the concept of Levenshtein edit distance here: https://en.wikipedia.org/wiki/Levenshtein_distance
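A minimal sketch of that idea (the dictionary here is a made-up stand-in for a real word list):
import Levenshtein  # from the python-Levenshtein package

dictionary = ["hello", "language", "service", "detector"]  # hypothetical word list

def closest_word(word):
    # pick the dictionary entry with the smallest edit distance to the input
    return min(dictionary, key=lambda candidate: Levenshtein.distance(word.lower(), candidate))

print(closest_word("helo"))      # -> hello
print(closest_word("langauge"))  # -> language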
I would like to use textacy for key term extraction, but the function I am using, keyterms.key_terms.pagerank(doc), is just returning an empty list.
I have tried related functions, including the longer keyterms.key_terms_from_semantic_network(doc), with no success. I have also tried using longer pieces of text than the one shown below, but it still does not find any key terms. Other functions in textacy do seem to work, so it seems to be a problem just with the keyterms module.
import spacy
import textacy
test_string = "Textacy key term extraction is not working properly. Textacy is built on top of SpaCy."
doc = textacy.make_spacy_doc(test_string)
textacy.keyterms.textrank(doc)
I am getting an empty list rather than a list of tuples with terms and ranking scores as expected.
This works for me
Note the following additions:
I explicitly imported keyterms in line 2.
I passed the spaCy English model in line 4.
import textacy
from textacy import keyterms
test_string = "Textacy key term extraction is not working properly. Textacy is built on top of SpaCy."
doc = textacy.make_spacy_doc(test_string, lang='en_core_web_sm')
keyterms.textrank(doc)
Here are the results I got from your example sentence:
[('term', 0.24594541923542018),
('textacy', 0.24594541923542018),
('extraction', 0.2390545807645797),
('key', 0.13452729038228986),
('spacy', 0.13452729038228986)]
Here is an example working with the newest version (June 2021):
import textacy
from textacy.extract import keyterms as kt
test_string = "Textacy key term extraction is not working properly. Textacy is built on top of SpaCy."
doc = textacy.make_spacy_doc(test_string, lang='en_core_web_sm')
kt.textrank(doc)
I have a huge number of names from different sources.
I need to extract all the groups (parts of the names) that repeat from one name to another.
In the example below, the program should locate: Post, Office, Post Office.
I also need to get a popularity count.
So I want to extract a list of phrases sorted by popularity.
Here is an example of names:
Post Office - High Littleton
Post Office Pilton Outreach Services
Town Street Post Office
post office St Thomas
Basically, I need to find some algorithm, or better a library, to get results like these:
Post Office: 16999
Post: 17934
Office: 16999
Tesco: 7300
...
Here is the full example of names.
I wrote some code which is fine for single words, but not for phrases:
from textblob import TextBlob
import operator
title_file = open("names.txt", 'r')
blob = TextBlob(title_file.read())
list = sorted(blob.word_counts.items(), key=operator.itemgetter(1))
print list
You are not looking for clustering (and that is probably why "all of them suck" for @andrewmatte).
What you are looking for is word counting (or, more precisely, n-gram counting), which is actually a much easier problem. That is why you won't find a dedicated library for it...
Well, actually, you do have some libraries. In Python, for example, the collections module has the Counter class, which covers much of the reusable code.
Some untested, very basic code:
from collections import Counter

counter = Counter()
for s in sentences:  # sentences: an iterable of the name strings
    words = s.split(" ")
    for i in range(len(words)):
        counter[words[i]] += 1                    # count single words
        if i > 0:
            counter[(words[i-1], words[i])] += 1  # count adjacent word pairs (bigrams)
You can get the most frequent entries from counter. If you want words and word pairs kept separate, feel free to use two counters. If you need longer phrases, add an inner loop. You may also want to clean the sentences (e.g. lowercase them) and use a regexp for splitting.
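For example, reading the names from names.txt (the filename used in the code in the question) and printing the most popular entries might look roughly like this (only a sketch):
from collections import Counter

counter = Counter()
with open("names.txt") as f:
    for line in f:
        words = line.lower().split()
        for i in range(len(words)):
            counter[words[i]] += 1
            if i > 0:
                counter[(words[i-1], words[i])] += 1

for phrase, count in counter.most_common(10):
    print(phrase, count)  # the most popular single words and word pairs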
Are you looking for something like this?
workspace = {}
with open('names.txt', 'r') as f:
    for name in f:
        if len(name):  # makes sure the line isn't empty
            if name in workspace:
                workspace[name] += 1
            else:
                workspace[name] = 1
for name in workspace:
    print "{}: {}".format(name, workspace[name])
I have a very simple CherryPy web service that I hope will be the foundation of a larger project; however, I need to get NLTK to work the way I want.
My Python script imports NLTK and uses its collocation (bigram) function to do some analysis on pre-loaded data.
I have a couple of questions:
1) Why is the program not returning the collocations to my browser, but only to my console?
2) Why, if I specify from nltk.book import text4, does the program import the whole set of sample books (text1 to text9)?
Please keep in mind that I am a newbie, so the answer might be right in front of me, but I don't see it.
Main question: how do I pass the collocation results to the browser (web service) instead of the console?
Thanks
import cherrypy
import nltk
from nltk.book import text4
class BiGrams:
    def index(self):
        return text4.collocations(num=20)
    index.exposed = True

cherrypy.quickstart(BiGrams())
I have been doing some work with Moby Dick and I stumbled on the answer to the question of importing just one specific text the other day:
>>>import nltk.corpus
>>>from nltk.text import Text
>>>moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
Thus, all you really need is the fileid in order to assign the text of that file to your new Text object. Be careful, though, because only "literary" sources are in the gutenberg.words directory.
Anyway, for help with finding file ids for gutenberg, after import nltk.corpus above, you can use the following command:
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
This still doesn't answer the question for your specific corpus, the inaugural addresses, however. For that answer, I found this MIT paper: http://web.mit.edu/6.863/www/fall2012/nltk/ch2-3.pdf
(I recommend it to anyone beginning to work with nltk texts because it talks about grabbing all kinds of textual data for analysis). The answer to getting the inaugural address fileids comes on page 6 (edited a bit):
>>> nltk.corpus.inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
Thus, you should be able to import specific inaugural addresses as Texts (assuming you did "from nltk.text import Text" above) or you can work with them using the "inaugural" identifier imported above. For example, this works:
>>>address1 = Text(nltk.corpus.inaugural.words('2009-Obama.txt'))
In fact, you can treat all inaugural addresses as one document by calling inaugural.words without any arguments, as in the following example from this page:
>>>len(nltk.corpus.inaugural.words())
OR
addresses = Text(nltk.corpus.inaugural.words())
I remembered reading this thread a month ago when trying to answer this question myself, so perhaps this information, if coming late, will be helpful to someone somewhere.
(This is my first contribution to Stack Overflow. I've been reading for months and never had anything useful to add until now. Just want to say generally 'thanks to everyone for all the help.')
My guess is that what you get back from the collocations() call is not a string, and that you need to serialize it. Try this instead:
import cherrypy
import nltk
from nltk.book import text4
import simplejson
class BiGrams:
    def index(self):
        c = text4.collocations(num=20)
        return simplejson.dumps(c)
    index.exposed = True

cherrypy.quickstart(BiGrams())
Take a look at the source code (http://code.google.com/p/nltk/source/browse/trunk/nltk/) and you'll learn a lot (I know I did).
1) collocations() is printing to your console because that's what it is supposed to do.
help(text4.collocations)
will give you:
Help on method collocations in module nltk.text:

collocations(self, num=20, window_size=2) method of nltk.text.Text instance
    Print collocations derived from the text, ignoring stopwords.

    @seealso: L{find_collocations}
    @param num: The maximum number of collocations to print.
    @type num: C{int}
    @param window_size: The number of tokens spanned by a collocation (default=2)
    @type window_size: C{int}
Browse the source in text.py and you'll find that the method for collocations is pretty straightforward.
2) Importing nltk.book loads every text. You could just grab the bits you need from book.py and write a method that only loads the inaugural addresses.
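For instance, skipping nltk.book entirely and returning a string to the browser might look roughly like this (only a sketch; it uses BigramCollocationFinder instead of the Text.collocations helper and loads just the inaugural corpus):
import cherrypy
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

class BiGrams:
    def index(self):
        words = nltk.corpus.inaugural.words()      # load only the inaugural addresses
        finder = BigramCollocationFinder.from_words(words)
        finder.apply_freq_filter(3)                # ignore word pairs that occur rarely
        pairs = finder.nbest(BigramAssocMeasures().pmi, 20)
        return "; ".join(" ".join(pair) for pair in pairs)
    index.exposed = True

cherrypy.quickstart(BiGrams())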