I am working on a requirement where I have a history of previous requests. Requests may be like "Send me a report of .." or "Get me this doc"; each one gets assigned to someone and that person responds.
I need to build an app which will analyse the previous requests, and if a new request arrives and matches any of the previous requests, I should recommend the previous request's solution.
I am trying to implement the above using Python, and after some research I found that doc2vec is one approach: convert the previous requests to vectors and match them with the vector of the new request. I want to know, is this the right approach or are better approaches available?
There are several different approaches for your problem. Actually, there's no right or wrong answer, just the one that fits your data, objectives and expected results best. To mention a few:
Vectorization (doc2vec)
This approach builds a vector representation of a document based on the individual word vectors from a pretrained source (these so-called embeddings can be more general, with worse results in very narrow contexts, or more specific, and thus a better fit for a particular type of text).
In order to match a new request against this vector representation of your documents, the new request has to share words with a closely related vector representation; otherwise it won't work.
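A minimal sketch of this route with the gensim library (the request texts below are made-up placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# previous_requests is a placeholder for your history of request texts
previous_requests = ["send me a report of last month's sales",
                     "get me the onboarding doc"]

# Tag each historical request so it can be looked up after training
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(previous_requests)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for the new request and find the closest old one
new_request = "please send the sales report"
vec = model.infer_vector(new_request.lower().split())
best_id, score = model.dv.most_similar([vec], topn=1)[0]
print(previous_requests[best_id], score)

Note that doc2vec needs a reasonably large history of requests to learn useful vectors; with only a handful of documents the similarities will be mostly noise.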
Keyword matching (or topicalization)
A simpler approach, where a document is characterized by its most representative keywords (using techniques such as TF-IDF or even simpler word-frequency counts).
To match a new request, this has to include the keywords of the document.
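A quick sketch of this route with scikit-learn's TF-IDF vectorizer (again, the request texts are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

previous_requests = ["send me a report of last month's sales",
                     "get me the onboarding doc"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(previous_requests)

# Project the new request into the same TF-IDF space and rank old requests
new_request = "please send the sales report"
query_vec = vectorizer.transform([new_request])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = scores.argmax()
print(previous_requests[best], scores[best])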
Graph Based Approach
I've worked with this approach for Question Answering in my Master's research. In it, each document is modeled as a graph node connected to its keywords (which are also nodes). Each word in the graph is related to other words, and together they compose a network through which the document is accessed.
To match a new request, the keywords from the request are retrieved and "spread" using one of many network traversal techniques, attempting to reach the closest document in the graph. You can see how I documented my approach here. However, this approach requires either an already existing set of inter-word relations (WordNet, for a simpler approach) or a good amount of time spent annotating word relations.
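This is not the author's exact method, just a toy illustration of the idea with networkx: documents and keywords are nodes, and a new request is matched by spreading from its keywords, here with personalized PageRank standing in for the traversal technique:

import networkx as nx

# Toy graph: document nodes connected to their keyword nodes
G = nx.Graph()
documents = {"doc_sales_report": ["sales", "report", "monthly"],
             "doc_onboarding":   ["onboarding", "doc", "newcomer"]}
for doc, keywords in documents.items():
    for kw in keywords:
        G.add_edge(doc, kw)

# Spread from the keywords of a new request using personalized PageRank
request_keywords = ["sales", "report"]
personalization = {n: (1.0 if n in request_keywords else 0.0) for n in G}
scores = nx.pagerank(G, personalization=personalization)

# Keep only document nodes and pick the best-scoring one
best_doc = max(documents, key=lambda d: scores[d])
print(best_doc)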
Final Words
However, if you're interested in matching "this document" to "Annex A from e-mail 5", that's a whooooole other problem, and one that is actually not solved. You can attempt to use coreference resolution for references inside the same paragraph or phrase, but that won't work across different documents (e-mails). If you want to win some notoriety in NLP (actually NLU, Natural Language Understanding), that's research worth delving into.
I would like to use TensorFlow to create a smart FAQ. I've seen how to manage a chatbot, but what I need is to let the user search for help, with the result being the most probable chapter or section of a manual.
For example the user can ask:
"What are the O.S. supported?"
The reply must be a list of all the sections of the manual in which the correct answer could be found.
My text record set for the training procedure is only the manual itself. I've followed the text classification example, but I don't think it is what I need, because in that case it would only tell whether a given text belongs to one category or another.
What's the best practice to accomplish this task (I use Python)?
Thank you in advance
An idea could be to build embeddings of your text using BERT or other pretrained models (take a look at transformers) and later compare such embeddings with your query (the question), for instance using cosine distance, taking the most similar ones as the sections or chapters most likely to contain the answer.
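A minimal sketch of that idea, assuming the sentence-transformers package and that sections holds the text of each manual chapter or section (both are assumptions, not part of the question):

from sentence_transformers import SentenceTransformer, util

# sections is a placeholder for the text of each chapter/section of the manual
sections = ["Chapter 2: Supported operating systems ...",
            "Chapter 5: Installation steps ..."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained model works
section_embeddings = model.encode(sections, convert_to_tensor=True)

query = "What are the O.S. supported?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank sections by cosine similarity to the question
scores = util.cos_sim(query_embedding, section_embeddings)[0]
ranked = sorted(zip(sections, scores.tolist()), key=lambda p: p[1], reverse=True)
for text, score in ranked[:3]:
    print(round(score, 3), text[:60])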
Currently I am researching viable approaches to identify a certain object with image processing techniques, but I am struggling to find them. For example, I have a CNN capable of detecting certain objects, like a person, and I can track the person as well. However, my issue is that I want to identify the detected and tracked person, e.g. saving its credentials and giving it an ID. I do not want to know who he/she is; I just want to assign an ID.
Any help/resource will be appreciated.
Create a database and store the credentials you need for later use, e.g. object type and some usable specifications, giving each entry a unique ID. The CNN has already recognized the object, so you just need to store it in the database; later on you can perform more processing on the generated data. That is a simple solution to the problem you are describing.
Okay, I understand your problem: you want to identify what kind of object is being tracked, because the CNN is only tracking, not identifying. For that purpose you have to train your CNN on some specific features and give them an identity, e.g. objectA has features [x, y, z]. Then the CNN will help you find the identity of the object.
You can use OpenCV to do this as well: store some features of specific objects, then use a distance-matching technique to match the live features with the stored features.
Thanks.
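A minimal sketch of the database idea with Python's built-in sqlite3; the table layout and the feature format are assumptions made for illustration:

import json
import sqlite3

conn = sqlite3.connect("tracked_objects.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS objects ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "object_type TEXT, "
    "features TEXT)"  # feature vector stored as a JSON string
)

def register_object(object_type, feature_vector):
    """Store the detection's credentials and return its unique ID."""
    cur = conn.execute(
        "INSERT INTO objects (object_type, features) VALUES (?, ?)",
        (object_type, json.dumps(feature_vector)),
    )
    conn.commit()
    return cur.lastrowid

# e.g. after the CNN detects a person and you extract some feature vector
new_id = register_object("person", [0.12, 0.87, 0.33])
print("assigned ID:", new_id)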
I think you are looking for something called ReID. There are a lot of papers about it in CVPR2018.
You can imagine that you would need some sort of stored characteristic vector for each person. For each detected person, assign a new ID if it does not match any previous record, or return the existing ID if it does match. The key is how to compute this characteristic vector. CNN features (from an intermediate layer) can be one option; Gaussian mixtures of the colours of the detected human patch can be another.
It is still a very active research field, and I think it would be quite hard to build an accurate one if you don't have many resources or much time at hand.
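A toy sketch of that ID-assignment logic with plain NumPy; the similarity threshold and the way the characteristic vector is computed (CNN features, colour histograms, etc.) are assumptions you would have to tune:

import numpy as np

gallery = {}            # person_id -> stored characteristic vector
next_id = 0
MATCH_THRESHOLD = 0.8   # assumed cosine-similarity threshold, needs tuning

def assign_id(feature_vector):
    """Return an existing ID if the vector matches a stored record,
    otherwise register the vector under a new ID."""
    global next_id
    vec = np.asarray(feature_vector, dtype=float)
    vec = vec / np.linalg.norm(vec)
    for person_id, stored in gallery.items():
        if float(stored @ vec) > MATCH_THRESHOLD:
            return person_id
    gallery[next_id] = vec
    next_id += 1
    return next_id - 1

print(assign_id([0.1, 0.9, 0.2]))   # new person -> 0
print(assign_id([0.1, 0.9, 0.2]))   # matches the stored record -> 0 again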
Update: How would one approach the task of classifying text on public forums (such as gaming forums or blogs) so that derogatory comments/texts are filtered before being posted?
Original: "
I want to filter out adult content from tweets (or any text for that matter).
For spam detection, we have datasets that check whether a particular text is spam or ham.
For adult content, I found a dataset I want to use (extract below):
arrBad = [
    'acrotomophilia',
    'anal',
    'anilingus',
    'anus',
    # ... etc.
    'zoophilia'
]
Question
How can I use that dataset to filter text instances?
"
I would approach this as a Text Classification problem, because using blacklists of words typically does not work very well to classify full texts. The main reason why blacklists don't work is that you will have a lot of false positives (one example: your list contains the word 'sexy', which alone isn't enough to flag a document as being for adults). To classify full texts properly you need a training set with documents tagged as being "adult content" and others "safe for work". So here is what I would do:
Check whether an existing labelled dataset can be used. You need several thousands of documents of each class.
If you don't find any, create one. For instance you can create a scraper and download Reddit content. Read for instance Text Classification of NSFW Reddit Posts
Build a text classifier with NLTK. If you don't know how, read: Learning to Classify Text
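A bare-bones sketch of step 3 with NLTK, assuming you already have (text, label) pairs from step 1 or 2 (the sample data here is made up):

import nltk

# Placeholder training data: (document text, label) pairs
labelled_docs = [("explicit text about ...", "adult"),
                 ("quarterly report on energy markets", "sfw")]

def bag_of_words(text):
    """Very simple feature extractor: which words are present."""
    return {word: True for word in text.lower().split()}

train_set = [(bag_of_words(text), label) for text, label in labelled_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(bag_of_words("new text to check")))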
This can be treated as a binary text classification problem. You should collect documents that contain adult content as well as documents that do not ('universal'). It may happen that a word/phrase you have included in the list arrBad is present in a 'universal' document, for example, 'girl on top' in the sentence 'She wanted to be the first girl on top of Mt. Everest.' You need to get a count vector of the number of times each word/phrase occurs in an 'adult-content' document and in a 'universal' document.
I'd suggest you consider using algorithms like Naive Bayes (which should work fairly well in your case). However, if you want to capture the context in which each phrase is used, you could consider the Support Vector Machines algorithm as well (but that would involve tweaking a lot of complex parameters).
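For illustration, the same idea expressed with scikit-learn's count vectors and Naive Bayes (swap in an SVM such as LinearSVC if you want that route); the training data is a placeholder:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["explicit text about ...",
         "she wanted to be the first girl on top of Mt. Everest"]
labels = ["adult", "universal"]

# Count vectors of word/phrase occurrences feed the Naive Bayes model
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["report on mountain climbing"]))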
You may be interested in something like TextRazor. By using their API you would be able to classify the input text.
For example, you can choose to remove all input texts that fall under categories or keywords you don't want.
I think you need to explore filtering algorithms more: study their usage, how multi-pattern searching works, and how you can use some of those algorithms (their implementations are freely available online, so it is not hard to find an existing implementation and customize it for your needs). Some pointers:
Check how the grep family of algorithms works, especially the bitap algorithm and the Wu-Manber implementation for fgrep. Depending on how accurate you want to be, it may require adding some fuzzy-logic handling (think of why people write 'fukc' instead of 'fuck', right?).
You may find the Bloom filter interesting, since it won't produce any false negatives for words in your data set; the downside is that it may produce false positives.
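A toy Bloom filter in pure Python to make that trade-off concrete; the bit-array size and number of hashes are arbitrary here, whereas a real deployment would derive them from the desired false-positive rate:

import hashlib

class BloomFilter:
    def __init__(self, size=10_000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, word):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def __contains__(self, word):
        # False means definitely absent; True means "probably present"
        return all(self.bits[pos] for pos in self._positions(word))

bf = BloomFilter()
for word in ['acrotomophilia', 'anal', 'anilingus', 'anus', 'zoophilia']:
    bf.add(word)

print('anus' in bf)      # True (never a false negative for stored words)
print('banana' in bf)    # almost certainly False, but could be a false positive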
I am working on a project wherein I have to extract the following information from a set of articles (the articles could be on anything):
People: find the names of any people present, like "Barack Obama".
Topic: the topic or related tags of the article, like "Parliament" or "World Energy".
Company/Organisation: I should be able to obtain the names of any companies or organisations mentioned, like "Apple" or "Google".
Is there an NLP framework/library of this sort available in Python which would help me accomplish this task?
#sel and #3kt gave really good answers. OP, you are looking for Entity Extraction, commonly referred to as Named Entity Recognition. There exist many APIs to perform this. But the first question you need to ask yourself is:
What is the structure of my DATA? or rather,
Are my sentences good English sentences?
In the sense of figuring out whether the data you are working with is consistently grammatically correct, well capitalized and well structured. These factors are paramount when it comes to extracting entities. The data I worked with were tweets. ABSOLUTE NIGHTMARE!! I performed a detailed analysis of the performance of various APIs on entity extraction, and I shall share with you what I found.
Here are APIs that perform fabulous entity extraction:
NLTK has a handy reference book which talks in depth about its functions, with multiple examples. NLTK does not perform well on noisy data (tweets) because it has been trained on structured data. NLTK is absolute garbage for badly capitalized words (e.g. DUCK, Verb, CHAIR). Moreover, it is slightly less precise when compared to other APIs. It is great for structured or curated data from news articles and scholarly reports, and it is a great learning tool for beginners.
Alchemy is simpler to implement and performs very well in categorizing the named entities. It has great precision when compared to the other APIs I have mentioned. However, it has a certain transaction cost: you can only perform 1000 queries in a day! It identifies twitter handles and can handle awkward capitalization.
IMHO the spaCy API is probably the best. It's open source. It outperforms the Alchemy API but is not as precise, and categorizes entities almost as well as Alchemy.
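For reference, entity extraction with spaCy boils down to a few lines (this assumes the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama addressed Parliament while Apple and Google watched.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. PERSON, ORG, GPE labels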
Choosing an API should be a simple problem for you now that you know how each one is likely to behave on the data you have.
EXTRA -
POLYGLOT is yet another API.
Here is a blog post that performs entity extraction in NLTK.
There is a beautiful paper by Alan Ritter that might go over your head, but it is the standard for entity extraction (particularly in noisy data) at a professional level. You could refer to it every now and then to understand complex concepts like LDA or SVM for capitalisation.
What you are actually looking for is called, in the literature, 'Named Entity Recognition' or NER.
You might like to take a look at this tutorial:
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
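The gist of that tutorial, assuming you have downloaded the Stanford NER jar and classifier model separately (the two paths below are placeholders for your local copies):

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Placeholder paths: point these at your local Stanford NER download
st = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",
    "stanford-ner.jar",
)

tokens = word_tokenize("Barack Obama met executives from Apple and Google.")
print(st.tag(tokens))   # e.g. ('Barack', 'PERSON'), ('Apple', 'ORGANIZATION'), ...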
One easy way of partially solving this problem is using regular expressions to extract words matching the patterns that you can find in this paper, in order to extract people's names. This of course might lead to extracting all the categories you are looking for, i.e. the topics and the companies' names as well.
There is also an API that you can use that actually gives the results you are looking for, which is called Alchemy. Unfortunately, no documentation is available to explain the method they use to extract the topics or the people's names.
Hope this helps.
You should take a look at NLTK.
Finding names and companies can be achieved by tagging the recovered text and extracting proper nouns (tagged NNP). Finding the topic is a bit more tricky, and may require some machine learning on a given set of articles.
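A small sketch of that proper-noun route with NLTK (it needs the punkt tokenizer and POS tagger data, available via nltk.download):

import nltk

text = "Barack Obama discussed World Energy policy with Apple and Google."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

# Keep tokens tagged as proper nouns (NNP / NNPS)
proper_nouns = [word for word, tag in tagged if tag.startswith("NNP")]
print(proper_nouns)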
Also, since we're talking about articles, I recommend the newspaper module, which can recover articles from their URLs and do some basic NLP operations (summary, keywords).
I am creating a library to support some standard graph traversals. Some of the graphs are defined explicitly: i.e., all edges are added by providing a data structure or by repeatedly calling a relevant method. Some graphs are only defined implicitly: i.e., I can only provide a function that, given a node, will return its children (in particular, all the infinite graphs I traverse must be defined implicitly, of course).
The traversal generator needs to be highly customizable. For example, I should be able to specify whether I want DFS post-order/pre-order/in-order, BFS, etc.; in which order the children should be visited (if I provide a key that sorts them); whether the set of visited nodes should be maintained; whether the back-pointer (pointer to parent) should be yielded along with the node; etc.
I am struggling with the API design for this library (the implementation is not complicated at all, once the API is clear). I want it to be elegant, logical, and concise. Is there any graph library that meets these criteria that I can use as a template (doesn't have to be in Python)?
Of course, if there's a Python library that already does all of this, I'd like to know, so I can avoid coding my own.
(I'm using Python 3.)
if you need to handle infinite graphs then you are going to need some kind of functional interface to graphs (as you say in the q). so i would make that the standard representation and provide helper functions that take other representations and generate a functional representation.
for the results, maybe you can yield (you imply a generator and i think that is a good idea) a series of result objects, each of which represents a node. if the user wants more info, like backlinks, they call a method on that, and the extra information is provided (calculated lazily, where possible, so that you avoid that cost for people that don't need it).
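a rough sketch of what that could look like, with made-up names rather than a real library: the graph is just a successors function, the traversal is a generator, and the parent link is looked up lazily only when asked for:

from collections import deque

class Visit:
    """one yielded node; the parent link is looked up lazily on request."""
    def __init__(self, node, parents):
        self.node = node
        self._parents = parents          # dict maintained by the traversal

    @property
    def parent(self):
        return self._parents.get(self.node)

def traverse(start, successors, order="bfs", key=None):
    """generic traversal of an implicitly defined graph.

    successors: callable node -> iterable of child nodes
    order: "bfs" or "dfs" (pre-order)
    key: optional sort key controlling the order children are visited in
    """
    parents = {start: None}
    seen = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft() if order == "bfs" else frontier.pop()
        yield Visit(node, parents)
        children = successors(node)
        if key is not None:
            children = sorted(children, key=key)
        for child in children:
            if child not in seen:
                seen.add(child)
                parents[child] = node
                frontier.append(child)

# works on an infinite graph because nothing is materialised up front
binary_tree = lambda n: (2 * n, 2 * n + 1)
for visit in traverse(1, binary_tree):
    if visit.node > 10:
        break
    print(visit.node, "<-", visit.parent)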
you don't mention if the graph is directed or not. obviously you can treat all graphs as directed and return both directions. but then the implementation is not as efficient. typically (eg jgrapht) libraries have different interfaces for different kinds of graph.
(i suspect you're going to have to iterate a lot on this, before you get a good balance between elegant api and efficiency)
finally, are you aware of the functional graph library? i am not sure how it will help, but i remember thinking (years ago!) that the api there was a nice one.
The traversal algorithm and the graph data structure implementation should be separate, and should talk to each other only through the standard API. (If they are coupled, each traversal algorithm would have to be rewritten for every implementation.)
So my question really has two parts:
How to design the API for the graph data structure (used by graph algorithms such as traversals and by client code that creates/accesses graphs)
How to design the API for the graph traversal algorithms (used by client code that needs to traverse a graph)
I believe C++ Boost Graph Library answers both parts of my question very well. I would expect it can be (theoretically) rewritten in Python, although there may be some obstacles I don't see until I try.
Incidentally, I found a website that deals with question 1 in the context of Python: http://wiki.python.org/moin/PythonGraphApi. Unfortunately, it hasn't been updated since Aug 2011.
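To make the two-part split concrete, here is one hypothetical way to cut it: the data-structure side only promises an adjacent(node) method, and the traversal is written purely against that promise, so explicit and implicit graphs are interchangeable:

from itertools import islice

class DictGraph:
    """Explicit graph: edges stored in an adjacency dict."""
    def __init__(self, edges):
        self._adj = {}
        for a, b in edges:
            self._adj.setdefault(a, []).append(b)

    def adjacent(self, node):
        return self._adj.get(node, [])

class ImplicitGraph:
    """Implicit graph: children are computed on demand by a function."""
    def __init__(self, successors):
        self._successors = successors

    def adjacent(self, node):
        return self._successors(node)

def dfs_preorder(graph, start):
    """Traversal written only against the adjacent() contract."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        yield node
        for child in graph.adjacent(node):
            if child not in seen:
                seen.add(child)
                stack.append(child)

# the same traversal works unchanged on both kinds of graph
explicit = DictGraph([(1, 2), (1, 3), (2, 4)])
implicit = ImplicitGraph(lambda n: [2 * n, 2 * n + 1])
print(list(dfs_preorder(explicit, 1)))
print(list(islice(dfs_preorder(implicit, 1), 5)))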