I need to find a method to classify a large number of text samples (comments) written in Portuguese into two groups:
Brazilian Portuguese
European Portuguese
Google wasn't much help; maybe I missed something? Please help, thanks a lot!
It would be great if the solution were accessible from Python.
One of my ideas was to find the key grammatical differences between those Portuguese variants, but as far as I can see that won't get me very far. I have also been looking at ML models capable of doing this. To be honest, I have no working ideas at all, which is why I'm asking my first question on this platform.
I assume you have many documents which bear an is_brazilian label. Compute tf-idf statistics. Look for unigrams or bigrams that are moderately common and whose appearance strongly predicts one or the other label.
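One way to operationalize that, as a minimal sketch with scikit-learn: fit a tf-idf model over unigrams and bigrams, then inspect which features a simple linear classifier weights most heavily for each label. Here `texts` and `labels` are assumed to already exist; nothing below is specific to your data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# texts: list of comment strings; labels: 1 = Brazilian, 0 = European (assumed to exist)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True)
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Unigrams/bigrams whose presence most strongly predicts each label
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("Most European-leaning:", terms[order[:20]])
print("Most Brazilian-leaning:", terms[order[-20:]])
```

The min_df cutoff keeps only terms that are at least moderately common, matching the criterion above.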
Balanced collection policies are important here. If, for example, most of the collected Brazilian documents related to sport and the European documents related to finance, then such a term-frequency approach would really be predicting "sport vs. finance" instead of predicting language locale.
A part-of-speech tagger can help with this, so you focus on common function words rather than on, e.g., proper nouns that are specific to one geography or the other.
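For example, a rough sketch of that filtering step with spaCy, assuming its Portuguese model pt_core_news_sm is installed (any Portuguese-capable POS tagger would work just as well):

```python
import spacy

# Assumes: python -m spacy download pt_core_news_sm
nlp = spacy.load("pt_core_news_sm")

FUNCTION_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}

def function_words(text):
    """Keep only common function words; drop proper nouns and other content words."""
    return [tok.lower_ for tok in nlp(text) if tok.pos_ in FUNCTION_POS]

print(function_words("Você vai pegar o ônibus para o trabalho?"))
```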
Apparently English (rather than Latin) loan words appear more often in Brazilian usage. And the appearance of 'recepção' predicts Brazilian, while 'receção' predicts European. Compare 'ônibus' (Brazilian) with 'autocarro' (European). Usage of 'você' differs between the two. There must be many similar examples.
cf https://github.com/MarekCichy/pt-br-classifier
First, I want to explain my task. I have a dataset of 300k documents with an average length of 560 words (no stop-word removal yet); 75% are in German, 15% in English, and the rest in various other languages. The goal is to recommend similar documents based on an existing one. To begin with, I want to focus on the German and English documents.
To achieve this goal I have looked into several methods of feature extraction for document similarity. The word embedding methods in particular have impressed me because, in contrast to simple TF-IDF feature extraction with cosine similarity, they are context-aware.
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of them yet. I know for sure that my documents are too long for BERT, but there are FastText, Sent2Vec, Doc2Vec, and the Universal Sentence Encoder from Google. Based on my research, my favourite method is Doc2Vec, even though there are no (or only outdated) pre-trained models, which means I would have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on. Would I achieve good results if I trained the model on the English/German Wikipedia instead?
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225k German documents and 45k English documents are each plausibly large enough to use Doc2Vec, as they match or exceed the corpus sizes in some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia), and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)
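For reference, a minimal Doc2Vec training sketch with Gensim 4.x, where `docs` is assumed to be a list of already-tokenized documents; the parameters are illustrative, not tuned recommendations:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# docs: list of token lists, e.g. [["der", "vertrag", "wurde", ...], ...] (assumed to exist)
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]

model = Doc2Vec(vector_size=300, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Recommend the ten documents most similar to document 0
print(model.dv.most_similar(0, topn=10))
```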
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with Spacy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo: state-of-the-art results for your problem!
It's called zero-shot learning: state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
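For illustration, a minimal sketch of that zero-shot approach with the Hugging Face transformers pipeline; the model name and the candidate labels here are just examples, not recommendations:

```python
from transformers import pipeline

# Downloads a large NLI model the first time it runs
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["routing quality", "traffic estimates", "app usability", "advertising"]
result = classifier(
    "The detour time was way off and I ended up stuck in the carpool lane.",
    candidate_labels,
)
print(result["labels"][0], round(result["scores"][0], 3))
```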
The VADER tool works well for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Pay close attention to your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision: spaCy's rule-based match engines and components not only help you find the terms and sentences you are searching for, but, compared with regular expressions, also give you access to the tokens inside a text and their relationships.
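To make that concrete, here is a rough sketch of steps 1–2 of the proposed pipeline using spaCy v3's PhraseMatcher (the en_core_web_sm model and the n-gram list are assumptions); Gensim's Phrases model could then be run over the resulting token lists to discover additional common bi- and tri-grams:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(t) for t in known_ngrams])

def tokenize(text):
    """Tokenize, merging known n-grams into single tokens and dropping stopwords/punctuation."""
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retok:
        for span in filter_spans(spans):  # drop overlapping matches
            retok.merge(span)
    return [t.lemma_.lower() for t in doc
            if not (t.is_stop or t.is_punct or t.is_space)]

print(tokenize("The HOV lane detour took me way out of my way."))
```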
I am attempting to build a model that will attempt to identify the interest category / topic of supplied text. For example:
"Enjoyed playing a game of football earlier."
would resolve to a top level category like:
"Sport".
I'm not sure what the correct terminology is for what I am trying to achieve here so Google hasn't turned up any libraries that may be able to help. With that in mind, my approach would be something like:
Extract features from text. Use tagging to classify each feature / identify names / places. Would probably use NLTK for this, or Topia.
Run a Naive Bayes classifier for each interest category ("Sport", "Video Games", "Politics" etc.) and get a relevancy % for each category.
Identify which category has the highest relevancy % and categorise the text accordingly.
My approach would likely involve having individual corpora for each interest category and I'm sure the accuracy would be fairly miserable - I understand it will never be that accurate.
Generally looking for some advice on the viability of what I am trying to accomplish, but the crux of my question: a) is my approach is correct? b) are there any libraries / resources that may be of assistance?
You seem to know a lot of the right terminology. Try searching for "document classification." That is the general problem you are trying to solve. A classifier trained on a representative corpus will be more accurate than you think.
(a) There is no one correct approach. The approach you outline will work, however.
(b) Scikit-Learn is a wonderful library for this sort of work.
There is plenty of other information, including tutorials, online about this topic:
This Naive Bayesian Classifier on github probably already does most of what you want to accomplish.
This NLTK tutorial explains the topic in depth.
If you really want to get into it, I am sure a Google Scholar search will turn up thousands of academic articles in computer science and linguistics about exactly this topic.
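For a concrete starting point, here is a minimal sketch of the Naive Bayes approach with scikit-learn; the tiny training set is obviously a placeholder for real per-category corpora:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus; in practice, use a representative labelled corpus per category
train_texts = [
    "Enjoyed playing a game of football earlier.",
    "The new console exclusive looks stunning.",
    "Parliament votes on the budget next week.",
]
train_labels = ["Sport", "Video Games", "Politics"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

test = ["Enjoyed watching the football match."]
print(model.predict(test))        # most likely category
print(model.predict_proba(test))  # per-category relevancy scores
```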
You should check out Latent Dirichlet Allocation (LDA); it will give you categories without labels. As always, Edwin Chen's blog is a good place to start.
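A minimal LDA sketch with Gensim, assuming `tokenized_docs` is a list of token lists and that four topics is just an illustrative choice:

```python
from gensim import corpora
from gensim.models import LdaModel

# tokenized_docs: list of token lists, e.g. [["enjoyed", "playing", "football"], ...] (assumed)
dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(bow_corpus, num_topics=4, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```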
I think this is an interesting question, at least for me.
I have a list of words, let's say:
photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet
and I have a list of contexts:
Programming
World news
Technology
Web Design
I need to try and match words with the appropriate context/contexts if possible.
Maybe discovering word relationships in some way.
Any ideas?
Help would be much appreciated!
This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.
I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.
Where do these words come from? Do they come from real texts? If they do, then this is a classic data mining problem. What you need to do is turn your set of documents into a matrix where the rows represent the documents and the columns represent the words occurring in those documents.
For example if you have two documents like this:
D1: Need to find meaning.
D2: Need to separate Apples from oranges
your matrix will look like this:
Need to find meaning Apples Oranges Separate From
D1: 1 1 1 1 0 0 0 0
D2: 1 1 0 0 1 1 1 1
This is called a term-by-document matrix.
Having collected these statistics, you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means is a rather slow algorithm, though, so you can speed things up by first reducing the dimensionality with techniques such as SVD.
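A short sketch of that pipeline with scikit-learn (the documents and the number of clusters are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # placeholder documents
    "Need to find meaning",
    "Searching for meaning in a text",
    "Need to separate apples from oranges",
    "Apples and oranges are both fruit",
]

# Term-by-document counts (rows = documents, columns = words)
X = CountVectorizer().fit_transform(docs)

# Reduce dimensionality with truncated SVD before clustering
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)
print(kmeans.labels_)
```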
I just found this a couple days ago: ConceptNet
It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.
If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.
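For example, a quick sketch of querying the public ConceptNet REST API for a term's relations, assuming the api.conceptnet.io endpoint is still available and the requests library is installed (the edge structure used below is my reading of their docs, so treat it as an assumption):

```python
import requests

def related_concepts(term, limit=10):
    """Fetch (relation, concept) pairs for an English term from the public ConceptNet API."""
    url = f"http://api.conceptnet.io/c/en/{term}"
    data = requests.get(url, params={"limit": limit}).json()
    # Each edge carries a relation label and the concept at its other end
    return [(edge["rel"]["label"], edge["end"]["label"]) for edge in data["edges"]]

print(related_concepts("censorship"))
```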
The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this you need to figure out how fine-grained the concepts should be. WordNet, as it has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes which differentiate, e.g. (going up the hierarchy) human from animal, animates from plants, substances from solids, concrete from abstract things, etc.
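For instance, a small sketch with NLTK's WordNet interface that walks up the hypernym hierarchy toward those broad upper-level concepts (assuming the WordNet corpus has been downloaded):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

synset = wn.synsets("censorship")[0]
# Hypernym path from the most general concept down to the synset itself
path = synset.hypernym_paths()[0]
print([s.name() for s in path])
```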
Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea I just came up with, but there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (i.e. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do this last step (choosing the best of several given categories) may or may not be difficult.
See here for a list of other ontologies / knowledge bases you could use.
I was wondering how a semantic service like Open Calais figures out the names of companies, people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (@OpenCalais) or to email us at team@opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms together with some of the other entities in context and then use the results of that search to determine how confident you are that the extracted term is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognition (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g. the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text; given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation plays an important part in language understanding).
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
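As a concrete illustration of that tokenize → POS-tag → NER chain, here is a small sketch with NLTK (exact resource names can vary between NLTK versions):

```python
import nltk

# One-time downloads of the tokenizer, tagger and NE-chunker models
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Mr. Jones joined Thomson Reuters in New York."
tokens = nltk.word_tokenize(sentence)   # natural-language tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named-entity chunking

# Collect (entity text, entity type) pairs from the chunk tree
entities = [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != "S"]
print(entities)
```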
Open Calais probably uses language parsing technology and language statistics to guess which words or phrases are names, places, companies, etc. Then it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to return related results.
It certainly isn't easy.