custom feature function with (python) crfsuite

custom feature function with (python) crfsuite - python

I have only read theorey about CRF so far and want to use python crfsuite in my master thesis for extracting ingredients from recipes. Every help is appreciated.
As far as I understand, I can provide training data to crfsuite in the form of the picture below, where w[0] provides the identity of the current word, w[i] the world relative to i and pos[i] its part-of-speech-tag relative to i.
And then crfsuite trains its own feature functions build on the given attributes.
But I can't find a way for providing custom feature functions like "w[i] is in a dictionary" (for example a dictionary of recipe ingredients) or "in the sentence is a negation" (for example "not", or "don't").
In general good tutorials are appreciated, because the manuals (https://python-crfsuite.readthedocs.io/en/latest/ or http://www.chokkan.org/software/crfsuite/manual.html) are not beginner-friendly from my point of view

With python-crfsuite (or sklearn-crfsuite) training data doesn't have to be in the form you've described; a single training sequence should be a list of {"feature_name": <feature_value>"} dicts, with features for each sequence elements (e.g. for a token in a sentence). Features don't have to be words or POS tags. There is a few other supported feature formats (see http://python-crfsuite.readthedocs.io/en/latest).
For a more complete example check https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb - it uses custom features.

Related

how to use Nlp Pos Tagging in other langauge like sindhi / urdu

I am working on a research paper on pos tagging in NLP but my question is that how to implement the pos tagging in another local language plz help me thank you.

It depends on the POS-Tagger you are using. Usually a (probabilistic) tagger has two language-specific components: a language model and a dictionary.
The dictionary contains all words with their possible tags, annotated by frequency. This can be created and edited manually, or derived from training data. If your language has a rich morphology, you might want to use a morphological analyser to support this, or you could simply have all inflected forms as dictionary entries in their own right.
The language model contains sequences of tags and their frequencies, usually trigrams (sequences of three items). It is extracted from training data, and reflects grammatical constraints on word class distribution.
So in order to adapt an existing tagger for a new language there are two main steps:
create a tag set for your language. While there is some overlap between tag sets for different languages (they usually all have nouns or verbs), you might want specific markers for cases or tenses, as they can help in disambiguation.
annotate training data. You need some texts to generate the language model (and possibly also the dictionary). This data you feed into the training algorithm to produce the language-specific resource files.
Annotating by hand is fairly tedious, but you can use an iterative process: annotate a smallish text, run it through the training mechanism, and use the tagger to annotate a longer text. This will have many errors, but it's easier to correct the errors than it is to annotate a text from scratch. Then add this text to your training data and repeat. You will find that the tagger's performance will gradually get better as you build up more training data,

How to get similar words from a custom input dictionary of word to vectors in gensim

I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word embedding model) and average them to get the document vector. I end up having a dictionary (say, my_dict) that maps each document in my collection to its vector.
I want to feed this dictionary to gensim and for each document, get other documents in 'my_dict' that are closer to it. How could I do that?

You might want to consider rephrasing your question (from the title, you are looking for word similarity, from the description I gather you want document similarity) and adding a little more detail in the description. Without more detailed info about what you want and what you have tried, it is difficult to help you achieve what you want, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without know what you want gensim to do. gensim is quite powerful, and offers lots of different functionality.
Assuming your dictionary is already in gensim format, you can load it like this:
from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')
There - now you can use it with gensim, and run analyses and model to your heart's desire. For similarities between words you can play around with such pre-made functions as gensim.word2vec.most_similar('word_one', 'word_two') etc.
For document similarity with a trained LDA model, see this stackoverflow question.
For a more detailed explanation, see this gensim tutorial which uses cosine similartiy as a measure of similarity between documents.
gensim has a bunch of premade functionality which do not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim, I would recommend looking at the documentation and examples.
Also, in order to avoid a bunch of pitfalls: Is there a specific reason to average the vectors by yourself (or even averaging them at all)? You do not need to do this (gensim has a few more sophisticated methods that achieve a mapping of documents to vectors for you, like models.doc2vec), and might lose valuable information.

NER(Named Entity Recognition) Similarity between sentences in documents

I have been using spacy to find the NER of sentences.My problem is I have to calculate the NER similarity between sentences of two different documents. Is there any formula or package available in python for the same?
TIA

I believe you are asking, how similar are two named entities?
This is not so trivial, since we have to define what "similar" means.
If we use a naive bag-of-words approach, two entities are more similar when more of their tokens are identical.
If we put the entity tokens into sets, the calculation would just be the jaccard coefficient.
Sim(ent1, ent2) = |ent1 ∩ ent2| / |ent1  ∪ ent2|
Which in python would be:
ent1 = set(map(str, spacy_entity1))
ent2 = set(map(str, spacy_entity2))
similarity = len(ent1 & ent2) / len(ent1 | ent2)
Where spacy_entity is one of the entities extracted by spacy
We then just create entity sets ent by creating a set of the strings that represent them.

I've been tackling the same problem and heres how I'm Solving.
Since youve given no context about your problem, ill keep the solution as general as possible
Links to tools:
Spacy Spacy Similarity
Flair NLP: Flair NLP zero shot few shot
Roam Research for research articles and their approach Roam Research home page
DISCLAIMER: THIS IS A TRIAL EXPERIMENT. AND NOT THE SOLUTION
Steps:
try a Large language model with a similarity coefficient (Like GPT-3 or TARS) on your dataset
(Instead of similarity you could also use the zero shot/few shot classification and then manually see the accuracy or compute it if you have labelled data)
I am then grab the verbs (assuming you have a corpus and not single word inputs) and calculating their similariy (attaching x-n and x+n words [exlude stopwords according to your domain] if x is position of verb)
This step mainly allows you give more context to the large language model so it is not biased on its large corpus (this is experimental)
And finally Im grabbing all the named entities and hopefully their labels (Example India: Country/Location) And Ask the same LLM (Large Language model) after youve constructed its Prompt to see how many of the same buckets/categories your entities fall into (probability comparison maybe)
Even if you dont reproduce these steps, what you must understand is that these tools give you atomic information about your raw input. you have to put mathematic functions to compare and make the algorithm
In my case im averaging all 3 similarity index of the above steps and ensuring that all classification is multilabel classification
And to validate any of this, human pattern matching maybe.

You need probably http://uima.apache.org/d/uimacpp-2.4.0/docs/Python.html/ plus a CoNLL-U parser attached to it https://universaldependencies.org/format.html. With this approach NERs are based to a dictionary in UIMA Pipeline. You need to develop some proprietary NER search/match algorithms (in Python or in other supported language).

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization and generating clusters and label those clusters. But as said earlier I have predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow. Appreciate any help.

Alright by what i can understand i think there are multiple ways to attend to this case.
there will be trade offs and the accuracy rate may vary. because of the well know fact and observation
Each single tweet is distinct!
(unless you are extracting data from twitter stream api based on tags and other keywords). Please define the source of data and how are you extracting it. i am assuming you're just getting general tweets which can be about anything
The thing you can do is to generate a set of dictionary for each class you have
(i.e Music => pop , jazz , rap , instruments ...)
which will contain relevant words to that class. You can use NLTK for python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP Lexical semantics slides. it will surely clear some of the concepts.
Once you have dictionaries for each classes. cross compare them with the tweets you have got. the tweet which has the most similarity (you can rank them according to the occurrences of words from the these dictionaries) you can label it to that class. This will make your tweets labeled like others.
Now the question is the accuracy! But it depends on the data and versatility of your classes. This may be an "Over kill" But it may come close to what you want.
Furthermore you can label some set of tweets this way and use Cosine Similarity to cross identify other tweets. This will help with the optimization part. But then again its up-to you. As you know what Trade offs you can bear
The real struggle will be the machine learning part and how you manage that.

Actually this seems as a typical use case of semi-supervised learning. There are plenty methods of use here, including clustering with constraints (where you force model to cluster samples from the same class together), transductive learning (where you try to extrapolate model from labeled samples onto distribution of unlabeled ones).
You could also simply cluster data as #Shoaib suggested, but then you will have to come up the the heuristic approach how to deal with clusters with mixed labeling. Futhermore - obviously solving optimziation problem not related to the task (labeling) will not be as good as actually using this knowledge.

You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.

Python NLTK difference between a sentiment and an incident

Hi i want to implement a system which can identify whether the given sentence is an incident or a sentiment.
I was going through python NLTK and found out that there is a way to find out positivity or negativity of a sentense.
Found out the ref link: ref link
I want to achieve like
My new Phone is not as good as I expected should be treated as sentiment
and Camera of my phone is not working should be considered as incident.
I gave a Idea of making my own clusters for training my system for finding out such but not getting a desired solution is there a built-in way to find that or any idea on how can be approach for solution of same.
Advance thanks for your time.

If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.