NER tools for academic use - Python

I have a research project that needs the best possible NER results.
Can anyone recommend the best NER tools that have a Python library?

Talking about Java, Stanford NER seems to be the best, ceteris paribus.
There are also LingPipe, Illinois NER and others; take a look at the ACL list.
Also consider this paper for an experimental comparison of several NERC systems.
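If you want to drive Stanford NER from Python rather than Java, NLTK ships a wrapper for it. A minimal sketch, assuming you have downloaded the Stanford NER distribution (the jar and model paths below are placeholders you would point at your own copy):

    # Calling Stanford NER from Python through NLTK's wrapper.
    # Requires the NLTK 'punkt' tokenizer data and a local Stanford NER download.
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    st = StanfordNERTagger(
        "/path/to/english.all.3class.distsim.crf.ser.gz",  # trained CRF model
        "/path/to/stanford-ner.jar",
    )

    tokens = word_tokenize("Barack Obama visited Stanford University in California.")
    print(st.tag(tokens))
    # e.g. [('Barack', 'PERSON'), ('Obama', 'PERSON'), ..., ('California', 'LOCATION')]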

Related

Grammarly alternative - NLP

I'm trying to learn NLP with Python. Although I work with a variety of programming languages, I'm looking for some kind of from-the-ground-up solution that I can put together to build a product with a high standard of spelling and grammar, like Grammarly.
I've tried some approaches with Python: the inflect package (https://pypi.org/project/inflect/) and spaCy for parts of speech.
Could someone point me in the direction of some kind of fully fledged API that I can pull apart and work out how to get to a decent standard of English, like Grammarly?
Many thanks,
Vince.
I would suggest checking out Stanford CoreNLP and NLTK as a start. But if you want to train your own NER or do semantic modeling, look at Gensim.
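For the Gensim part, a minimal semantic-modeling sketch with Word2Vec (gensim 4.x API; the three-sentence corpus is a toy, a real model needs far more text):

    # Toy Word2Vec model with Gensim; hyperparameters are illustrative only.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "quick", "brown", "fox", "jumps"],
        ["a", "quick", "brown", "dog", "runs"],
        ["the", "lazy", "dog", "sleeps"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
    print(model.wv.most_similar("dog", topn=3))  # words with the closest vectors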

Python NLP differentiation of British English and American English

Currently I am working on a project using NLP and Python. I have some content and need to identify its language. I am using spaCy to detect the language, but the libraries only report the language as English. I need to find out whether it is British or American English. Any suggestions?
I tried spaCy, NLTK and lang-detect, but these libraries only report English; I need to display en-GB for British English and en-US for American English.
You can train your own model. A lot of geographically specific English data was collected by the University of Leipzig, but it does not include US English. The American National Corpus should have a free subset that you can use.
A popular library for language identification, langid.py, allows training your own model. They have a nice tutorial on GitHub. Their model is based on character tri-gram frequencies, which might not be a sufficiently distinctive statistic in this case.
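To make the tri-gram idea concrete, here is a rough sketch using scikit-learn rather than langid.py itself (the library choice and the four-sentence training set are my own illustrations, not part of langid.py's tutorial):

    # Character tri-gram classifier for en-GB vs en-US, illustrated with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "I love the colour of my neighbour's car.",   # en-GB spellings
        "The theatre is near the city centre.",
        "I love the color of my neighbor's car.",     # en-US spellings
        "The theater is near the city center.",
    ]
    train_labels = ["en-GB", "en-GB", "en-US", "en-US"]

    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(3, 3)),  # character tri-grams
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_texts, train_labels)
    print(clf.predict(["What colour is your favourite armour?"]))
    # should lean towards 'en-GB' on this toy data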
Another option is to train a classifier on top of BERT using, e.g., PyTorch and the transformers library. This will surely get very good results, but if you are not experienced with deep learning, it might actually be a lot of work for you.
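A bare-bones sketch of that option, assuming PyTorch and transformers are installed (the model name, label scheme and toy batch are illustrative only; a real setup would use a proper dataset and evaluation):

    # Fine-tuning a BERT classifier for en-GB vs en-US on a toy batch.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    texts = [
        "I love the colour of my neighbour's car.",  # en-GB
        "I love the color of my neighbor's car.",    # en-US
    ]
    labels = torch.tensor([0, 1])  # 0 = en-GB, 1 = en-US

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a few passes over the toy batch
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()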

Python frameworks for NLP? [closed]

I am working on a project wherein I have to extract the following information from a set of articles (the articles could be on anything):
People: find the names of any people mentioned, like "Barack Obama".
Topic: find the topic or related tags of the article, like "Parliament" or "World Energy".
Company/Organisation: I should be able to obtain the names of any companies or organisations mentioned, like "Apple" or "Google".
Is there an NLP framework/library of this sort available in Python which would help me accomplish this task?
#sel and #3kt gave really good answers. OP, you are looking for entity extraction, commonly referred to as named entity recognition (NER). There exist many APIs to perform this. But the first question you need to ask yourself is:
What is the structure of my DATA? or rather,
Are my sentences good English sentences?
That is, figure out whether the data you are working with is consistently grammatically correct, well capitalized and well structured. These factors are paramount when it comes to extracting entities. The data I worked with were tweets, and they were an absolute nightmare! I performed a detailed analysis of how various APIs perform on entity extraction, and I shall share with you what I found.
Here are some APIs that perform fabulous entity extraction:
NLTK has a handy reference book which talks in depth about its functions, with multiple examples. NLTK does not perform well on noisy data (tweets) because it has been trained on structured data. NLTK is absolute garbage for badly capitalized words (e.g., DUCK, Verb, CHAIR). Moreover, it is slightly less precise when compared to the other APIs. It is great for structured or curated data from news articles and scholarly reports, and it is a great learning tool for beginners.
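For reference, a minimal NLTK entity-extraction sketch (the example sentence is mine; the punkt, averaged_perceptron_tagger and maxent_ne_chunker data packages need to be downloaded first):

    # Basic NLTK named-entity chunking: tokenize, POS-tag, then chunk.
    import nltk

    sentence = "Barack Obama met executives from Apple and Google in Washington."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)      # part-of-speech tags
    tree = nltk.ne_chunk(tagged)       # NE-labelled parse tree

    for subtree in tree.subtrees():
        if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
            print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))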
Alchemy is simpler to implement and performs very well at categorizing named entities. It has great precision when compared to the other APIs I have mentioned. However, it has a certain transaction cost: you can only perform 1000 queries a day! It identifies Twitter handles and can handle awkward capitalization.
IMHO the spaCy API is probably the best. It's open source. It outperforms the Alchemy API but is not as precise; it categorizes entities almost as well as Alchemy.
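A minimal spaCy sketch, assuming an English model such as en_core_web_sm is installed:

    # spaCy NER: load a pretrained English pipeline and read off doc.ents.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Barack Obama met executives from Apple and Google in Washington.")

    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. Barack Obama PERSON, Apple ORG, ...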
Choosing an API should be a simple problem for you now that you know how each one is likely to behave on the data you have.
EXTRA -
POLYGLOT is yet another API.
Here is a blog post that performs entity extraction in NLTK.
There is a beautiful paper by Alan Ritter that might go over your head, but it is the standard for entity extraction (particularly on noisy data) at a professional level. You could refer to it every now and then to understand complex concepts like LDA or SVM for capitalisation.
What you are actually looking for is called, in the literature, 'named entity recognition' or NER.
You might like to take a look at this tutorial:
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
One easy way of partially solving this problem is to use regular expressions to extract words matching the patterns you can find in this paper for extracting people's names. This of course might lead to extracting all the categories you are looking for, i.e. the topics and the company names as well.
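The paper's actual patterns are not reproduced here; as a purely hypothetical stand-in, a regex over runs of capitalized words gives a feel for the approach (expect plenty of false positives):

    # Crude name-candidate extraction: runs of two or more capitalized words.
    import re

    text = "Barack Obama met Tim Cook while visiting Apple Park in California."
    candidates = re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+", text)
    print(candidates)  # ['Barack Obama', 'Tim Cook', 'Apple Park']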
There is also an API you can use that actually gives the results you are looking for, called Alchemy. Unfortunately, no documentation is available explaining the method they use to extract the topics or the people's names.
Hope this helps.
You should take a look at NLTK.
Finding names and companies can be achieved by tagging the recovered text and extracting proper nouns (tagged NNP). Finding the topic is a bit more tricky, and may require some machine learning on a given set of articles.
Also, since we're talking about articles, I recommend the newspaper module, which can retrieve them from their URLs and do some basic NLP operations (summary, keywords).
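A small sketch combining the two suggestions, with a placeholder URL (newspaper3k fetches and parses the article, NLTK keeps the proper nouns):

    # Fetch an article, then keep the proper nouns from NLTK's POS tags.
    import nltk
    from newspaper import Article

    article = Article("https://example.com/some-news-story")  # placeholder URL
    article.download()
    article.parse()

    tagged = nltk.pos_tag(nltk.word_tokenize(article.text))
    proper_nouns = [word for word, tag in tagged if tag in ("NNP", "NNPS")]
    print(proper_nouns[:20])  # candidate names of people, companies, places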

Using NLTK for chunking Arabic text

I have a project about chunking Arabic text.
I want to know whether it is possible to use NLTK to extract the NP, VP and PP chunks of Arabic text, and how I can use an Arabic corpus.
Please, can anyone help me?
It's far from perfect (largely because the linguistic properties of Arabic are significantly different from those of English), but a computer science student developed an Arabic language analysis toolkit in 2011 that looks promising. He developed "an integrated solution consisting of a part-of-speech tagger and a morphological analyser. The toolkit was trained on classical Arabic and tested on a sample text of modern standard Arabic." I would think a limitation of this tool would be that the training set was classical while the test set was MSA.
The paper is a great start because it addresses existing tools and their relative successes (and shortcomings). I also highly recommend this 2010 paper which looks like an outstanding reference. It is also available as a book in print or electronic format.
Also, as a personal note, I would love to see a native speaker who is NLP-savvy use Google ta3reeb (available as a Java open source utility) to develop better tools and libraries. Just some of my thoughts; my actual experience with Arabic NLP is very limited. There are a variety of companies that have developed search solutions applying Arabic NLP principles as well, although much of their work is likely proprietary (for instance, I am aware that Basis Technology has worked with this fairly extensively; I am not affiliated with Basis in any way, nor have I ever been).

Text mining: when to use parser, tagger, NER tool?

I'm doing a project on mining blog content and I need help deciding which tool to use. When do I use a parser, when do I use a tagger, and when do I need to use an NER tool?
For instance, I want to find out the most talked-about topics/subjects across several blogs; do I use a part-of-speech tagger to grab the nouns and do a frequency count? That would probably be insufficient because very generic terms can pop up, right? Or do I keep a list of categories and their synonyms that I can match on?
BTW, I'm using NLTK, but am looking at the Stanford tagger or parser since a couple of dudes said they were good.
Instead of trying to reinvent the wheel, you might want to read up on topic models, which basically create clusters of words that frequently occur together. Mallet has a readily available toolkit for doing such a task: http://mallet.cs.umass.edu/topics.php
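Mallet itself is Java; as a rough Python-side illustration of the same idea, here is a toy LDA with Gensim (my substitution, and the three tiny "documents" are far too small to produce meaningful topics):

    # Toy LDA topic model with Gensim: build a dictionary, a bag-of-words corpus,
    # and fit a two-topic model.
    from gensim import corpora, models

    docs = [
        ["solar", "energy", "panel", "grid", "power"],
        ["parliament", "vote", "energy", "policy", "minister"],
        ["battery", "electric", "car", "power", "grid"],
    ]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

    for topic_id, words in lda.print_topics():
        print(topic_id, words)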
To answer your original question: POS taggers, parsers, and NER tools are not typically used for topic identification, but are more heavily used for tasks like information extraction, where the goal is to identify within a document the specific actors, events, locations, times, etc. For example, if you had a simple sentence like "John gave the apple to Mary.", you might use a dependency parser to figure out that John is the subject, the apple is the object, and Mary is the prepositional object; thus you know John is the giver and Mary is the receiver, and not vice versa.
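That example can be reproduced with spaCy's dependency parser (assuming en_core_web_sm is installed); the exact labels may vary between model versions:

    # Dependency parse of the example sentence: print each token's relation
    # to its syntactic head.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John gave the apple to Mary.")

    for token in doc:
        print(f"{token.text:<6} {token.dep_:<6} head={token.head.text}")
    # Roughly: John -> nsubj of gave, apple -> dobj of gave, Mary -> pobj of to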
