Currently I am working on an NLP project in Python. I have some content and need to detect its language. I am using spaCy to detect the language, but the libraries only report the language as English; I need to know whether it is British or American English. Any suggestions?
I tried spaCy, NLTK and langdetect, but these libraries only return English. I need to display en-GB for British English and en-US for American English.
You can train your own model. A lot of geographically specific English data was collected by the University of Leipzig, but it does not include US English. The American National Corpus should have a free subset that you can use.
A popular language-identification library, langid.py, allows training your own model, and there is a nice tutorial on its GitHub page. Its model is based on character trigram frequencies, which might not be a sufficiently distinctive statistic in this case.
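For reference, a minimal sketch of the off-the-shelf usage (the stock model will only ever return "en" for English; distinguishing en-GB from en-US requires retraining with the tools from their tutorial):

```python
# Minimal sketch: off-the-shelf language identification with langid.py.
# The stock model only distinguishes languages (it returns "en" for both
# British and American text); separating en-GB from en-US would require
# retraining with the tools described in the langid.py GitHub tutorial.
import langid

# Optionally restrict the candidate languages.
langid.set_languages(["en", "fr", "de"])

text = "The colour of the aluminium kerb was grey."
lang, score = langid.classify(text)
print(lang, score)  # e.g. ('en', <confidence score>)
```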
Another option is to train a classifier on top of BERT using, e.g., PyTorch and the transformers library. This will surely get very good results, but if you are not experienced with deep learning, it might actually be a lot of work for you.
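A rough sketch of what that could look like with Hugging Face transformers; the checkpoint, label mapping and the two toy sentences are illustrative only, and real training needs a labelled en-GB/en-US corpus (e.g. built from the Leipzig collection and the ANC):

```python
# Sketch of a binary en-GB / en-US classifier on top of BERT.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # 0 = en-GB, 1 = en-US (your own mapping)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = [
    "I analysed the colour of my neighbour's flat.",      # British spelling
    "I analyzed the color of my neighbor's apartment.",   # American spelling
]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

# One optimisation step for illustration; in practice you would iterate over
# a real dataset for several epochs (or use the Trainer API).
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

print(outputs.logits.argmax(dim=-1))  # per-sentence predictions (untrained model)
```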
I'm trying to learn NLP with Python. Although I work with a variety of programming languages, I'm looking for some kind of from-the-ground-up solution that I can put together to build a product with a high standard of spelling and grammar, like Grammarly.
I've tried some approaches with Python, such as https://pypi.org/project/inflect/ and spaCy for parts of speech.
Could someone point me in the direction of some kind of fully fledged API that I can pull apart and use to work out how to get to a decent standard of English, like Grammarly?
Many thanks,
Vince.
I would suggest checking out Stanford CoreNLP and NLTK as a start. If you want to train your own NER or do semantic modeling, look at Gensim.
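A minimal NLTK starting point, just to show the kind of output a Grammarly-style checker would build its rules or models on top of (the misspelled example sentence is mine):

```python
# Tokenize and POS-tag a sentence with NLTK. A grammar/spelling checker would
# build rules or models on top of output like this.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Their going to the libary tomorow."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('Their', 'PRP$'), ('going', 'VBG'), ('to', 'TO'), ('the', 'DT'), ...]
```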
I've recently used the Language API to gather sentiment predictions for a work project. I had about 1,300 unlabeled documents, and we initially used NLTK's tools, which are based on a dictionary of terms with polarity estimates for each word. I then turned to the API, and after reviewing the predictions, the API produced much better results than NLTK.
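(For context, a dictionary-based NLTK baseline of the kind described might look roughly like the sketch below; VADER is just one such lexicon-based tool and is an assumption here, not necessarily the exact one we used.)

```python
# A lexicon-based NLTK sentiment baseline: each word carries a polarity
# estimate and the scores are aggregated per document.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The support team was helpful, but the product kept crashing."))
# e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```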
I understand that the engineers probably won't want to release the details of the prediction engine, but I am curious how it works at a high level. If anybody could enlighten me or point me in the right direction, I'd appreciate it. For example, "it uses a Neural Network, trained on billions of observations," would be a reasonable answer.
Again, I'm using this for a work project and I'd like to be able to give a brief justification of why I switched from NLTK to the API (the improved results should speak for themselves, but I will definitely get "well, how does it work?").
The language API is a pipeline of state-of-the-art machine-learned systems that are trained on a combination of public data (like the Penn Treebank) and proprietary data annotated by Google's linguists.
Performance improvements compared to something like NLTK come from a combination of more and better data for training, as well as cutting edge machine learning algorithms, including but not limited to neural networks.
Related links that discuss some of the algorithms:
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html (Parsing algorithms)
https://www.wired.com/2016/11/googles-search-engine-can-now-answer-questions-human-help/ (Mentions the linguist team)
https://research.googleblog.com/2016/08/acl-2016-research-at-google.html (Recent publications from the NLP research team)
I have a research project that needs the best possible NER results.
Can anyone please recommend the best NER tools that have a Python library?
As far as Java goes, Stanford NER seems to be the best, ceteris paribus.
There are also LingPipe, Illinois NER and others; take a look at the ACL list.
Also consider this paper for an experimental comparison of several NERCs.
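Since you need Python, note that Stanford NER can be driven from Python through NLTK's wrapper; a rough sketch follows (the jar and model paths are placeholders for your local Stanford NER download, and Java must be installed):

```python
# Calling Stanford NER from Python via NLTK's wrapper. The jar/model paths are
# placeholders -- point them at your local Stanford NER distribution.
import nltk
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

st = StanfordNERTagger(
    "/path/to/english.all.3class.distsim.crf.ser.gz",  # CRF model (placeholder path)
    "/path/to/stanford-ner.jar",                        # NER jar   (placeholder path)
    encoding="utf-8",
)

tokens = word_tokenize("Barack Obama visited Stanford University in California.")
print(st.tag(tokens))
# e.g. [('Barack', 'PERSON'), ('Obama', 'PERSON'), ..., ('California', 'LOCATION')]
```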
I have a project on chunking Arabic text.
I want to know whether it is possible to use NLTK to extract the NP, VP and PP chunks of Arabic text, and how I can use an Arabic corpus.
Please, can anyone help me?
It's far from perfect (largely because the linguistic properties of Arabic are significantly different from those of English), but a computer science student developed an Arabic language analysis toolkit in 2011 that looks promising. He developed "an integrated solution consisting of a part-of-speech tagger and a morphological analyser. The toolkit was trained on classical Arabic and tested on a sample text of modern standard Arabic." I would think a limitation of this tool would be that the training set was classical while the test set was MSA.
The paper is a great start because it addresses existing tools and their relative successes (and shortcomings). I also highly recommend this 2010 paper which looks like an outstanding reference. It is also available as a book in print or electronic format.
Also, as a personal note, I would love to see a native speaker who is NLP-savvy use Google ta3reeb (available as a Java open source utility) to develop better tools and libraries. Just some of my thoughts, my actual experience with Arabic NLP is very limited. There are a variety of companies that have developed search solutions that apply Arabic NLP principles as well, although much of their work is likely proprietary (for instance, I am aware that Basis Technology has worked with this fairly extensively; I am not affiliated with Basis in any way nor have I ever been).
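On the NLTK side specifically: once you have POS tags for the Arabic text (from a tagger like the one above, for instance), the chunking step itself is language-agnostic, since NLTK's RegexpParser will chunk any tagged sequence. A small sketch, where the grammar and the transliterated tagged tokens are purely illustrative:

```python
# NLTK's RegexpParser chunks any POS-tagged sequence, so the Arabic-specific
# part is really the tagger. The tags and grammar below are illustrative only.
import nltk

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # noun phrase
  PP: {<IN><NP>}            # prepositional phrase
  VP: {<VB.*><NP|PP>*}      # verb phrase
"""
chunker = nltk.RegexpParser(grammar)

# Pretend these tagged tokens came from an Arabic POS tagger (transliterated here).
tagged = [("qara'a", "VBD"), ("alwaladu", "NN"), ("alkitaba", "NN"),
          ("fi", "IN"), ("albayti", "NN")]
print(chunker.parse(tagged))
```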
How can I tell NLTK to treat text as being in a particular language?
Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain.
This question seems to address only different corpora, not changes in code/settings:
POS tagging in German
Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python?
I'm not sure what you mean by changes in code/settings. NLTK mostly relies on machine learning, and the "settings" are usually extracted from the training data.
When it comes to POS tagging, the results will depend on the tagger you use or train. Should you train your own, you will of course need some Spanish/Polish training data; the reason this might be hard to find is the lack of publicly available gold-standard material. There are tools out there that do this, but this one isn't for Python: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
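If you do get hold of tagged data (or use one of the tagged corpora NLTK ships, e.g. the Spanish CESS-ESP corpus), training a tagger is straightforward; a rough sketch with a simple backoff chain:

```python
# Training your own tagger in NLTK once you have tagged sentences. Here the
# bundled Spanish CESS-ESP corpus stands in for "your" gold-standard data.
import nltk
from nltk.corpus import cess_esp

nltk.download("cess_esp", quiet=True)

tagged_sents = list(cess_esp.tagged_sents())
train, test = tagged_sents[:5000], tagged_sents[5000:5500]

# Bigram tagger backing off to a unigram tagger backing off to a default tag.
t0 = nltk.DefaultTagger("ncms000")            # fallback tag (a common-noun tag in this tagset)
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)

print(t2.accuracy(test))                      # accuracy on held-out sentences
print(t2.tag("El gato duerme en la casa".split()))
```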
The nltk.tokenize.punkt.PunktSentenceTokenizer will tokenize sentences according to multilingual sentence boundaries, the details of which can be found in this paper: http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485.
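Note that NLTK ships pretrained Punkt models for a number of languages (Spanish and Polish included), and the tokenizer can also be trained, unsupervised, on your own raw text; a quick sketch (the corpus filename is a placeholder):

```python
# Two ways to get language-aware sentence splitting with Punkt:
# load one of NLTK's pretrained models, or train on your own raw text.
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

nltk.download("punkt", quiet=True)

# 1) A pretrained model shipped with NLTK (spanish, polish, german, ... are available).
spanish = nltk.data.load("tokenizers/punkt/spanish.pickle")
print(spanish.tokenize("El Dr. García llegó tarde. Nadie lo esperaba."))

# 2) Unsupervised training on your own raw corpus ("my_corpus.txt" is a placeholder).
raw_text = open("my_corpus.txt", encoding="utf-8").read()
custom = PunktSentenceTokenizer(raw_text)   # training happens at construction time
print(custom.tokenize(raw_text[:500]))
```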