NLP NaiveBayesClassifier for utf-8 in Python NLTK - python

I'm trying to use NLTK to perform some NLP classification on Arabic phrases. If I enter the native words as-is into the classifier, it complains about non-ASCII characters. Currently, I'm calling word.decode('utf-8') and passing the result to the trainer.
When I test the classifier, the results make some sense when there is an exact match. However, if I test a substring of one of the original training words, the results look somewhat random.
I just want to work out whether this is a bad classifier or whether something fundamental in the encoding degrades the classifier's performance. Is this a reasonable way to feed non-ASCII text into classifiers?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from textblob.classifiers import NaiveBayesClassifier
x = "الكتاب".decode('utf-8')
...
train = [
(x,'pos'),
]
cl = NaiveBayesClassifier(train)
t = "كتاب".decode('utf-8')
cl.classify(t)
The word in t is simply x with the first two letters removed. I'm running this with a much bigger dataset of course.

Your post contains, basically, two questions. The first is concerned with encoding, the second one is about predicting substrings of words seen in training.
For encoding, you should use unicode literals directly, so you can omit the decode() part. Like this:
x = u"الكتاب"
Then you will have a decoded representation already.
Concerning substrings, the classifier won't do that for you. If you ask for predictions for a token that wasn't included in the training in exactly the same spelling, then it will be treated as an unknown word – no matter if it's a substring of a word that occurred in training or not.
The substring case wouldn't be well-defined, anyway: Let's say you look up the single letter Alif – probably, a whole lot of words seen in training contain it. Which one should be used? A random one? The one with the highest probability? The sum of the probabilities of all matching ones? There's no easy answer to this.
I suspect that you are trying to match morphological variants of the same root. If this is the case, then you should try using a lemmatiser. So, before training, and also before prediction, you preprocess all tokens by converting them to their lemma (which is usually the root in Arabic, I think). I doubt that NLTK ships with a morphological model for Arabic, though, so you probably need to look for that elsewhere (but this is beyond the scope of this answer now).
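A minimal sketch of that preprocessing, assuming you have some Arabic lemmatiser available (the lemmatize() function below is a hypothetical placeholder, not something NLTK or TextBlob provides):
# -*- coding: utf-8 -*-
from textblob.classifiers import NaiveBayesClassifier

def lemmatize(text):
    # Hypothetical placeholder: call a real Arabic morphological
    # analyser / lemmatiser here; this stub returns the text unchanged.
    return text

train = [
    (lemmatize(u"الكتاب"), 'pos'),
    # ... more (lemmatised text, label) pairs
]
cl = NaiveBayesClassifier(train)

# Apply the same preprocessing at prediction time:
print(cl.classify(lemmatize(u"كتاب")))
The important part is that training and prediction go through exactly the same preprocessing, so that morphological variants collapse to the same token.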

Related

Should data feed into Universal Sentence Encoder be normalized?

I am currently working with TensorFlow's Universal Sentence Encoder (https://arxiv.org/pdf/1803.11175.pdf) for my B.Sc. thesis, where I study extractive summarisation techniques.
In the vast majority of techniques for this task (like https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11225/10855), the sentences are first normalized (lowercasing, stop word removal, lemmatisation), but I couldn't find a hint on whether sentences fed into the USE should first be normalized. Is that the case? Does it matter?
The choice really depends on the design of your application.
Regarding stop word removal and lemmatisation: these operations generally remove some content from the text, and hence they can remove information. However, if that doesn't have an impact, you can remove it. (It is always best to try both; in general the performance difference shouldn't be too big.)
Lowercasing depends on the pre-trained model you use (for example, BERT comes in bert-base-uncased and bert-base-cased variants) and on the application. One simple way to verify this is to feed a text into the USE model and obtain its sentence embedding, then lowercase the same text and obtain its embedding again. If the two are the same, the model is case-insensitive; if the embeddings differ, it is case-sensitive. (By running the program provided here, it appears that USE is case-sensitive.) Whether to lowercase is, again, application-dependent.
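A minimal sketch of that check, assuming TensorFlow 2.x and the tensorflow_hub package (the module handle below is the commonly used USE v4 one):
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

text = "The Quick Brown Fox Jumps Over The Lazy Dog"
emb_original = embed([text]).numpy()
emb_lower = embed([text.lower()]).numpy()

# A non-zero difference means the model is case-sensitive.
print("max abs difference:", np.abs(emb_original - emb_lower).max())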

Use proxy sentences from cleaned data

Gensim's Word2Vec model takes as input a list of lists, with each inner list containing the individual tokens/words of a sentence. As I understand it, Word2Vec is used to "quantify" the context of words within a text using vectors.
I am currently dealing with a corpus of text that has already been split into individual tokens and no longer contains an obvious sentence format (punctuation has been removed). I was wondering how should I input this into the Word2Vec model?
Say if I simply split the corpus into "sentences" of uniform length (10 tokens per sentence for example), would this be a good way of inputting the data into the model?
Essentially, I am wondering how the format of the input sentences (list of lists) affects the output of Word2Vec.
That sounds like a reasonable solution. If you have access to data that is similar to your cleaned data you could get average sentence length from that data set. Otherwise, you could find other data in the language you are working with (from wikipedia or another source) and get average sentence length from there.
Of course your output vectors will not be as reliable as if you had the correct sentence boundaries, but it sounds like word order was preserved so there shouldn't be too much noise from incorrect sentence boundaries.
Most typically, text is passed to Word2Vec in logical units (like sentences or paragraphs). Also, the published papers and early demo code tended to convert punctuation into tokens, as well.
But text without punctuation, and arbitrary breaks between texts, are a reasonable workaround and still give pretty good results.
For example, the text8/text9 corpuses often used in demos (including the word2vec intro Jupyter notebook bundled in gensim) are just giant runs-of-words, lacking punctuation and line-breaks. So, the utility LineSentence class used in gensim will break them into individual 10,000-token texts.
It's probably better to go larger with your arbitrary breaks (e.g. 10,000) rather than smaller (e.g. 10), for a couple of reasons:
source texts are usually longer than 10 words
often the source material that was run-together was still semantically-related across its original boundaries
the optimized algorithms work better on larger chunks of data
the harm of "false context windows" (created by the concatenation) is probably just noise with no net biasing effect, while keeping more "true windows" (by making as few false splits as possible) likely retains more of the original corpus's learnable word-to-word relationship signal
you can always simulate more-conservative contexts with a smaller window parameter (if the original source really did have tiny sentences that weren't sequentially-related)
But, gensim's cython-optimized training path has an implementation limit of 10,000 tokens per text – with any more being silently ignored – so you wouldn't intentionally want to supply longer texts for any reason.
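As a rough sketch of that kind of arbitrary chunking (parameter names assume gensim 4.x; in 3.x the dimensionality parameter is called size rather than vector_size):
from gensim.models import Word2Vec

def chunk_tokens(tokens, chunk_size=10000):
    # Break one long run of tokens into fixed-size pseudo-sentences,
    # staying at gensim's 10,000-token-per-text limit.
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]

# all_tokens stands in for your already-cleaned, punctuation-free token list.
all_tokens = ["some", "cleaned", "tokens", "without", "punctuation"] * 4000
sentences = list(chunk_tokens(all_tokens))

model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
print(model.wv.most_similar("cleaned"))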

NLP - When to lowercase text during preprocessing

I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.
Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.
Data: The data is in German and contains a lot of technical jargon.
My text corpus is in German and I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided to use/not use the following preprocessing steps:
no stopword removal
no lemmatization
replace all expressions with numbers by NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
However, I am not sure whether to convert the corpus to lowercase. When searching the web I found different opinions. Although lower-casing is quite common, it will cause my model to wrongly predict the capitalization of nouns, sentence beginnings, etc.
I also found the idea to convert only the words at the beginning of a sentence to lower-case on the following Stanford page.
What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?
Thanks a lot for any suggestions and experiences!
I think for your particular use-case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context. You probably won't need to predict sentence beginnings in your use-case, and if a noun is predicted you can capitalize it later. However, consider the other way round (assuming your corpus were in English): your model might treat a word that starts a sentence with a capital letter differently from the same word appearing later in the sentence without a capital letter. This can lead to a decline in accuracy, whereas I think lowercasing is the better trade-off. I did a project on a question-answering system, and converting the text to lowercase was a good trade-off there.
Edit: Since your corpus is in German, it would be better to retain the capitalization, since it is an important aspect of the German language.
If it is of any help, spaCy supports German. You can use it to train your model.
In general, tRuEcasIng helps.
Truecasing is the process of restoring case information to badly-cased or noncased text.
See
How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
Definitely convert the majority of the words to lowercase, but consider the following cases (a rough sketch of such selective lowercasing follows the list):
Acronyms, e.g. MIT: if you lowercase it to mit, which is a word (in German), you'll be in trouble
Initials e.g. J. A. Snow
Enumerations e.g. (I),(II),(III),APPENDIX A
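A minimal sketch of such selective lowercasing (the regular expressions below are illustrative assumptions, not a complete rule set):
import re

ACRONYM = re.compile(r"^[A-ZÄÖÜ]{2,}$")   # e.g. MIT, DNA
INITIAL = re.compile(r"^[A-ZÄÖÜ]\.$")     # e.g. J., A.
ENUM = re.compile(r"^\(?[IVXLCDM]+\)?$")  # e.g. (I), (II), III

def selective_lower(token):
    # Keep acronyms, initials and Roman-numeral enumerations as they are,
    # lowercase everything else.
    if ACRONYM.match(token) or INITIAL.match(token) or ENUM.match(token):
        return token
    return token.lower()

print([selective_lower(t) for t in ["Das", "MIT", "J.", "(II)", "Haus"]])
# ['das', 'MIT', 'J.', '(II)', 'haus']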
I would also advise against the <RARE> token. What percentage of your corpus is <RARE>? And what about unknown words?
Since you are dealing with German, and words can be arbitrary long and rare, you might need a way to break them down further.
Thus some sort of lemmatization and tokenization is needed.
I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you Matthew and Ines).

Testing the Keras sentiment classification with model.predict

I have trained the imdb_lstm.py on my PC.
Now I want to test the trained network by inputting some text of my own. How do I do it?
Thank you!
So what you basically need to do is as follows:
1. Tokenize sequences: convert each string into words (features). For example, "hello my name is georgio" becomes ["hello", "my", "name", "is", "georgio"].
2. Remove stop words (check Google for what stop words are).
3. This stage is optional and may lead to faulty results, but I think it's worth a try: stem your words (features). That way you'll reduce the number of features, which leads to a faster run. Again, it might cause some failures; for example, if you stem the word 'parking' you get 'park', which has a different meaning.
4. Create a dictionary (check Google for that): each word gets a unique number, and from this point on we will use this number only.
5. Computers understand numbers only, so we need to talk in their language. Take the dictionary from stage 4 and replace each word in the corpus with its matching number.
6. Split the data set into two groups: a training set and a testing set. One (training) will train our NN model and the second (testing) will help us figure out how good our NN is. You can use Keras' cross-validation function.
7. Define the maximum number of features our NN can get as input. Keras calls this parameter 'maxlen'. You don't really have to do this manually; Keras can do it automatically just by searching for the longest sentence in your corpus.
8. Say Keras found that the longest sentence in your corpus has 20 words (features), and one of your sentences is the example from the first stage, whose length is 5 (after removing stop words it would be even shorter). In that case we need to add zeros, 15 zeros actually. This is called padding the sequences; we do it so that every input sequence has the same length.
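A minimal sketch of stages 1, 4, 5 and 8 for a custom review, assuming model is the network trained by imdb_lstm.py with the stock settings (max_features=20000, maxlen=80) and a recent TensorFlow/Keras install (with standalone Keras, drop the tensorflow. prefix from the imports):
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_features = 20000   # vocabulary size used in imdb_lstm.py
maxlen = 80            # sequence length used in imdb_lstm.py

word_index = imdb.get_word_index()

def encode_review(text):
    # The Keras imdb encoding reserves 0 (padding), 1 (start) and 2 (OOV),
    # and shifts the raw word indices up by 3.
    ids = [1]
    for token in text.lower().split():
        idx = word_index.get(token)
        if idx is None or idx + 3 >= max_features:
            ids.append(2)          # unknown / out-of-vocabulary word
        else:
            ids.append(idx + 3)
    return pad_sequences([ids], maxlen=maxlen)

# model is assumed to be the LSTM you trained with imdb_lstm.py.
x = encode_review("this movie was wonderful and the cast was great")
print(model.predict(x))   # close to 1.0 means positive, close to 0.0 negative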
This might help.
http://keras.io/models/
Here is a sample usage:
How to use keras for XOR
You probably have to convert your corpus into an ndarray first and pass it to model.predict.
From what it seems so far, the input to model.predict for the trained model should be a sequence of 100 word indices, each entry being the index of a word in the dictionary. So if you want to test it with your own corpus, you have to convert the corpus according to that dictionary and see whether the result is 0 or 1.

How can I tag and chunk French text using NLTK and Python?

I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:
Identify important entities (people, places, concepts)
Find significant changes in the importance (~=frequency) of those entities over time (using the article sequence number as a proxy for time)
The steps I've taken so far:
Imported the data into a python list:
import json
json_articles=open('articlefile.json')
articlelist = json.load(json_articles)
Selected a single article to test, and concatenated the body text into a single string:
txt = ' '.join(articlelist[10000]['body'])
Loaded a French sentence tokenizer and split the string into a list of sentences:
import nltk
french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
sentences = french_tokenizer.tokenize(txt)
Attempted to split the sentences into words using the WhiteSpaceTokenizer:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
tokens = [wst.tokenize(s) for s in sentences]
This is where I'm stuck, for the following reasons:
NLTK doesn't have a built-in tokenizer which can split French into words. Whitespace doesn't work well, particularly because it won't correctly separate on apostrophes.
Even if I were to use regular expressions to split into individual words, there's no French PoS (parts of speech) tagger that I can use to tag those words, and no way to chunk them into logical units of meaning
For English, I could tag and chunk the text like so:
tagged = [nltk.pos_tag(token) for token in tokens]
chunks = nltk.batch_ne_chunk(tagged)
My main options (in order of current preference) seem to be:
Use nltk-trainer to train my own tagger and chunker.
Use the python wrapper for TreeTagger for just this part, as TreeTagger can already tag French, and someone has written a wrapper which calls the TreeTagger binary and parses the results.
Use a different tool altogether.
If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?
If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?
There is also TreeTagger (which supports a French corpus) with a Python wrapper. This is the solution I am currently using and it works quite well.
As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.
It should be possible to use this French tagger in NLTK, using Nitin Madnani's Interface to the Stanford POS-tagger
I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
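A rough sketch of what that could look like, assuming a recent NLTK (which provides the StanfordPOSTagger wrapper in nltk.tag.stanford) and a local download of the full Stanford tagger package; the paths below are placeholders:
from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths: point these at your Stanford POS tagger download.
jar = '/path/to/stanford-postagger.jar'
model = '/path/to/models/french.tagger'

st = StanfordPOSTagger(model, jar, encoding='utf-8')
print(st.tag("Je mange une pomme aujourd'hui".split()))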
Here are some suggestions:
WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with the RegexpTokenizer or directly with the re module (a rough pattern sketch is at the end of this answer).
Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.
The nltk only comes with the English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.
Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get best results if the genre of the training text matches your domain (articles). Also, you can use nltk-trainer but you could also use the NLTK features directly.
You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you must start with XMLCorpusReader and subclass it to provide a tagged_sents() method.
If you're not already on the nltk-users mailing list, I think you'll want to get on it.
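Picking up the apostrophe point from the first suggestion above, here is a rough RegexpTokenizer sketch; the pattern is only illustrative and will over-split forms like aujourd'hui:
from nltk.tokenize import RegexpTokenizer

# Treat an elided article or pronoun plus its apostrophe (l', d', j', ...)
# as its own token; otherwise split on runs of letters, digits, or single symbols.
tokenizer = RegexpTokenizer(r"[a-zA-ZÀ-ÿ]+'|[a-zA-ZÀ-ÿ]+|\d+|\S")
print(tokenizer.tokenize("J'ai mangé l'orange aujourd'hui."))
# ["J'", 'ai', 'mangé', "l'", 'orange', "aujourd'", 'hui', '.']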
