I would like to lemmatize some textual data in Hungarian and ran into a strange feature of spaCy. The token.lemma_ attribute works well in terms of lemmatization; however, it returns some of the sentences without the first letter capitalized. This is quite annoying, as my next function, unnest_sentences (in R), requires capitalized first letters in order to identify and break the text down into individual sentences.
First I thought the problem was that I was using the latest version of spaCy, since I had gotten this warning:
UserWarning: [W031] Model 'hu_core_ud_lg' (0.3.1) requires spaCy v2.1
and is incompatible with the current spaCy version (2.3.2). This may
lead to unexpected results or runtime errors. To resolve this,
download a newer compatible model or retrain your custom model with
the current spaCy version.
So I went ahead and installed spacy 2.1, but the problem still persists.
The source of my data is some email messages I cannot share here, but here is a small, artificial example:
# pip install -U spacy==2.1 # takes 9 mins
# pip install hu_core_ud_lg # takes 50 mins
import spacy
from spacy.lemmatizer import Lemmatizer
import hu_core_ud_lg
import pandas as pd
nlp = hu_core_ud_lg.load()
a = "Tisztelt levélíró!"
b = "Köszönettel vettük megkeresését."
df = pd.DataFrame({'text':[a, b]})
output_lemma = []
for i in df.text:
    mondat = ""
    doc = nlp(i)
    for token in doc:
        mondat = mondat + " " + token.lemma_
    output_lemma.append(mondat)
output_lemma
which yields
[' tisztelt levélíró !', ' köszönet vesz megkeresés .']
but I would expect
[' Tisztelt levélíró !', ' Köszönet vesz megkeresés .']
When I pass my original data to the function, it returns some sentences with uppercase first letters and others with lowercase ones. For some strange reason I couldn't reproduce that pattern above, but I think the main point is visible: the function does not work as expected.
Any ideas how I could fix this?
I'm using Jupyter Notebook, Python 2.7, Win 7 and a Toshiba laptop (Portégé Z830-10R i3-2367M).
Lowercasing is the expected behavior of spaCy's lemmatizer for non-proper-noun tokens.
One workaround is to check whether each token is titlecased and restore the original casing after lemmatizing (this only affects the first character).
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'This is a test sentence.'
doc = nlp(text)
newtext = ' '.join([tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc])
print(newtext)
# This be a test sentence .
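Applied to the Hungarian loop in the question, a minimal sketch could look like this (my adaptation, not the original code; it assumes nlp, df and output_lemma are defined as above):
output_lemma = []
for i in df.text:
    doc = nlp(i)
    # restore title case on tokens that were titlecased in the input
    mondat = " ".join(tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc)
    output_lemma.append(" " + mondat)
which should give back [' Tisztelt levélíró !', ' Köszönet vesz megkeresés .'].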
I want to implement lemmatization with the spaCy package.
Here is my code:
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')
But this error message occurred when I ran that code:
C:\Users\yu\Anaconda3\lib\runpy.py:193: DeprecationWarning: Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the `words` keyword argument, for example:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[...])
"__main__", mod_spec)
How can I solve this problem?
To customize spaCy's tokenizer, you add a special case: the string that needs custom tokenization plus a list of dictionaries specifying the orths it should be split into. Here's an example based on the docs:
from spacy.attrs import ORTH, LEMMA
case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
nlp.tokenizer.add_special_case("don't", case)
If you're doing all this because you want to make a custom lemmatizer, you might be better off just creating a custom lemma list directly. You'd have to modify spaCy's language data itself, but the format is pretty simple:
"dustiest": ("dusty",),
"earlier": ("early",),
"earliest": ("early",),
"earthier": ("earthy",),
...
Those files live here for English.
I think your code runs fine; you are just getting a DeprecationWarning, which is not really an error.
Following the advice given by the warning, I think you can modify your code by replacing the tokenizer line with
from spacy.tokens import Doc
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))
and that should run fine with no warnings (it does today on my machine).
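For completeness, here is a sketch of the whole snippet with that substitution in place (assuming scikit-learn's TfidfVectorizer, as in the question; exact package versions may matter):
import re
import spacy
from spacy.tokens import Doc
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load('en')
# build a Doc directly from the regexp matches instead of the deprecated tokens_from_list
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')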
First, I must admit that I am a newbie to Python and R.
Here I am trying to create a file with a list of bi-grams / 2-grams along with their POS tags (NN, VB, etc.). This is meant to make it easy to identify meaningful bi-grams and their POS tag combinations.
For example, the bigram 'Gross Profit' has the POS tag combination JJ & NN, while the bigram 'quarter of' has the combination NN & IN. With this I can find meaningful POS combinations. It may not be accurate; that is fine, I just want to experiment with it.
For reference, please check the section "2-gram Results" on this page. My requirement is something like that, but it was done in R, so it was not useful to me.
From what I have come across in Python, POS tagging and creation of bi-grams can be done using the NLTK or TextBlob packages, but I am unable to find a way to assign POS tags to the bi-grams generated in Python. Please see below for the code and relevant output.
import nltk
from textblob import TextBlob
from nltk import word_tokenize
from nltk import bigrams
################# Code snippet using TextBlob Package #######################
text1 = """This is an example for using TextBlob Package"""
blobs = TextBlob(text1) ### Converting str to textblob object
blob_tags = blobs.tags ### Assigning POS tags to the word blobs
print(blob_tags)
blob_bigrams = blobs.ngrams(n=2) ### Creating bi-grams from word blobs
print(blob_bigrams)
################# Code snippet using NLTK Package #######################
text2 = """This is an example for using NLTK Package"""
tokens = word_tokenize(text2) ### Converting str object to List object
nltk_tags = nltk.pos_tag(tokens) ### Assigning POS tags to the word tokens
print(nltk_tags)
nltk_bigrams = bigrams(tokens) ### Creating bi-grams from word tokens
print(list(nltk_bigrams))
Any help is much appreciated. Thanks in advance.
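One possible direction (a hedged sketch, not from the original post, reusing the imports and text2 above): since nltk.pos_tag returns (word, tag) tuples, bi-grams can be built over those tuples directly, which gives each bi-gram its POS tag combination:
tagged = nltk.pos_tag(word_tokenize(text2))   # [('This', 'DT'), ('is', 'VBZ'), ...]
tagged_bigrams = list(bigrams(tagged))        # pairs of (word, tag) tuples
for (w1, t1), (w2, t2) in tagged_bigrams:
    print(w1, w2, t1 + '-' + t2)              # e.g. This is DT-VBZ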
Using nltk (already imported), playing around with the gutenberg corpus.
import nltk
from nltk.corpus import gutenberg
Checked out the fileids, to find one I could play with:
gutenberg.fileids()
I wrote a small piece of code to find the most common words (in order to choose a few for the graph):
kjv = gutenberg.words('bible-kjv.txt')   # load the KJV words from the gutenberg corpus
kjv_text = nltk.Text(kjv)
from collections import Counter
for words in [kjv_text]:
    c = Counter(words)
    print c.most_common()[:100]  # top 100
kjv_text.dispersion_plot(["LORD", "God", "Israel", "king", "people"])
Up to here it works perfectly. Then I try to implement the ConditionalFreqDist, but get a bunch of errors:
cfd2 = nltk.ConditionalFreqDist((target, fileid['bible-kjv.txt'])
for fileid in gutenberg.fileids()
for w in gutenberg.words(fileid)
for target in ['lord']
if w.lower().startswith(target))
cfd2.plot()
I have tried to change a few things, but always get some errors. Any experts that can tell me what I'm doing wrong?
Thanks
Here is what was wrong:
The fileid in:
cfd2 = nltk.ConditionalFreqDist((target, fileid['bible-kjv.txt'])
should reference which element it is (in this case the 4th item in the list of gutenberg texts).
So the line should instead say:
cfd2 = nltk.ConditionalFreqDist((target, fileid[3])
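As an aside (a sketch of my own, not part of the original answer): if the goal is simply counts of 'lord' per file, a variant that avoids indexing into fileid altogether is to use the fileid itself as the second coordinate:
cfd2 = nltk.ConditionalFreqDist(
    (target, fileid)
    for fileid in gutenberg.fileids()
    for w in gutenberg.words(fileid)
    for target in ['lord']
    if w.lower().startswith(target))
cfd2.plot()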
I have a very simple CherryPy web service that I hope will be the foundation of a larger project; however, I need to get NLTK to work the way I want.
My Python script imports NLTK and uses its collocation (bigram) function to do some analysis on pre-loaded data.
I have a couple of questions:
1) Why is the program returning the collocations only to my console, and not to my browser?
2) Why, if I specify from nltk.book import text4, does the program import the whole set of sample books (text1 to text9)?
Please, keep in mind that I am a newbie, so the answer might be in front of me, but I don't see it.
Main question: How do I pass the collocation results to the browser (webservice), instead of console?
Thanks
import cherrypy
import nltk
from nltk.book import text4
class BiGrams:
    def index(self):
        return text4.collocations(num=20)
    index.exposed = True

cherrypy.quickstart(BiGrams())
I have been doing some work with Moby Dick and I stumbled on the answer to the question of importing just one specific text the other day:
>>>import nltk.corpus
>>>from nltk.text import Text
>>>moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
Thus, all you really need is the fileid in order to assign the text of that file to your new Text object. Be careful, though, because only "literary" sources are in the gutenberg.words directory.
Anyway, for help with finding file ids for gutenberg, after import nltk.corpus above, you can use the following command:
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
This still doesn't answer the question for your specific corpus, the inaugural addresses, however. For that answer, I found this MIT paper: http://web.mit.edu/6.863/www/fall2012/nltk/ch2-3.pdf
(I recommend it to anyone beginning to work with nltk texts because it talks about grabbing all kinds of textual data for analysis). The answer to getting the inaugural address fileids comes on page 6 (edited a bit):
>>> nltk.corpus.inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
Thus, you should be able to import specific inaugural addresses as Texts (assuming you did "from nltk.text import Text" above) or you can work with them using the "inaugural" identifier imported above. For example, this works:
>>>address1 = Text(nltk.corpus.inaugural.words('2009-Obama.txt'))
In fact, you can treat all inaugural addresses as one document by calling inaugural.words without any arguments, as in the following example from this page:
>>>len(nltk.corpus.inaugural.words())
OR
addresses = Text(nltk.corpus.inaugural.words())
I remembered reading this thread a month ago when trying to answer this question myself, so perhaps this information, if coming late, will be helpful to someone somewhere.
(This is my first contribution to Stack Overflow. I've been reading for months and never had anything useful to add until now. Just want to say generally 'thanks to everyone for all the help.')
My guess is that what you get back from the collocations() call is not a string, and that you need to serialize it. Try this instead:
import cherrypy
import nltk
from nltk.book import text4
import simplejson
class BiGrams:
    def index(self):
        c = text4.collocations(num=20)
        return simplejson.dumps(c)
    index.exposed = True

cherrypy.quickstart(BiGrams())
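One caveat worth noting (a hedged note, not part of the original answer): Text.collocations() prints its output and returns None, so the dumps call above would serialize null. Newer NLTK versions provide Text.collocation_list(), which returns the collocations instead of printing them, so the handler could instead look like:
    def index(self):
        # collocation_list returns a value instead of printing, so it can be serialized
        return simplejson.dumps(text4.collocation_list(num=20))
    index.exposed = True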
Take a look at the source code (http://code.google.com/p/nltk/source/browse/trunk/nltk/) and you'll learn a lot (I know I did).
1) Collocations is returning to your console because that's what it is supposed to do.
help(text4.collocations)
will give you:
Help on method collocations in module nltk.text:
collocations(self, num=20, window_size=2) method of nltk.text.Text instance
Print collocations derived from the text, ignoring stopwords.
@seealso: L{find_collocations}
@param num: The maximum number of collocations to print.
@type num: C{int}
@param window_size: The number of tokens spanned by a collocation (default=2)
@type window_size: C{int}
Browse the source in text.py and you'll find that the collocations method is pretty straightforward.
2) Importing nltk.book loads every text. You could just grab the bits you need from book.py and write a method that loads only the inaugural addresses.
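For instance, a minimal sketch (mirroring what book.py appears to do for text4; treat the exact call as an assumption rather than a quote from that module):
from nltk.corpus import inaugural
from nltk.text import Text

# build only the inaugural-address Text instead of importing all of nltk.book
text4 = Text(inaugural.words(), name="Inaugural Address Corpus")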