Automatic Summarization : Extraction Based [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
What is the algorithm for extraction-based automatic summarization? I googled a lot but couldn't find anything related to it. I want to implement the algorithm in Python.

There is not one single algorithm for extraction based summarization. There are several different algorithms to choose from. You should choose one that fits your specific needs.
There are two approaches to extraction based summarization:
Supervised learning - you give the program lots of examples of documents together with their keywords. The program learns what constitutes a keyword. Then you give it a new document, this time without any keywords, and the program extracts the keywords of this document based on what it learned during the training phase. There is a huge number of supervised learning techniques. To name a few, there are neural networks, decision trees, random forests and support vector machines.
Unsupervised learning - you simply give the program a document and it creates a list of keywords without relying on any past experience. A popular unsupervised algorithm for extraction based summarization is TextRank.
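To make the unsupervised route a bit more concrete, here is a minimal TextRank-style sketch that ranks whole sentences rather than keywords. It is an illustration under assumptions, not a reference implementation: the function names (split_sentences, similarity, textrank_summary), the naive regex sentence splitter, the use of unique-word sets, and the fixed iteration count are all my own choices, and there is no stop-word removal or stemming.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    # Word overlap normalised by the log of the sentence sizes.
    # Sentences with fewer than two distinct words are ignored to avoid log(1) = 0.
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    word_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)
    # Weighted, undirected graph: one node per sentence.
    weights = [[similarity(word_sets[i], word_sets[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # PageRank-style update; each sentence collects score from its neighbours.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(best)]
```

Calling textrank_summary(document_text, top_n=3) returns the three highest-ranked sentences in document order. A sentence that shares no words with the rest of the text gets no incoming weight and ends up at the bottom of the ranking.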

First off, I think you should learn more about how to find papers and research. It is hard to believe that a Google search turned up nothing. In any case, some of the extraction-based text summarization methods are:
Easy to implement methods based on word frequency
Bayesian methods
Graph-based methods, e.g. TextRank or LexRank, which are a good start
Clustering
Fuzzy systems for summarization
Neural network based systems
Methods based on optimization algorithms
I suggest googling these methods and seeing what you get. There are a lot of variations of these, and I can't really tell which method is best. Remember to find proper preprocessing tools as well.
Good luck.
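As an illustration of the first item in the list (the easy, word-frequency route), here is a rough sketch. Everything in it is an assumption made for the example: the tiny STOPWORDS set, the regex tokenizer, and the choice to score a sentence by the average corpus frequency of its words. Real systems usually use a proper tokenizer and a fuller stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); replace with a real one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "on", "for", "that", "this", "was"}


def frequency_summary(text, top_n=1):
    # Naive sentence split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # How often each non-stop word occurs in the whole document.
    freq = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    # Keep the top sentences, restored to document order.
    return [sentences[i] for i in sorted(ranked[:top_n])]
```

Averaging instead of summing is a design choice: summing tends to pick the longest sentences, averaging tends to pick short sentences dense in frequent words. Which works better depends on the documents.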

Related

Neural network for text generation - Reverse summarizer (Python / Keras) [closed]

Closed 4 years ago.
I am about to start working on a neural network for text generation. Inputs will be some words from a user (e.g. Brexit vote tomorrow chance of UK staying within EU slim) and the output will be a nice, well-written sentence (e.g. The Brexit vote will take place tomorrow and the UK is unlikely to stay within the European Union).
For the implementation, I am thinking about a sequence2sequence model but, before starting to code, I would like to check whether this subject has been addressed before. After many Google searches, it seems that nobody has done a similar project (although there are a lot of papers about text translation), which surprises me because such a tool would be useful for many people, such as journalists.
Have any of you seen useful Python code or relevant articles somewhere?
Sequence2Sequence is what comes to my mind too. Text generation code using RNN/LSTM just creates grammatically correct but meaningless sentences, as you discovered via Google.
Do you have a large corpus of examples to train a seq2seq model? Translation models require a very large corpus. One option for creating such a corpus could be to gather headlines and first paragraphs of news articles. Treat the headlines as the original language and the first paragraph or sentences of each article as the language to translate into.
Here's a blog post about using a second model, based on Doc2Vec, to filter the sentences generated by seq2seq.
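The headline-to-first-paragraph idea above could be prepared along these lines. This is only a sketch of the data preparation step, not of the seq2seq model itself. The article dictionaries with "headline" and "body" keys, the blank-line paragraph separator, and the build_pairs name are all hypothetical; adapt them to however your articles are stored.

```python
import re


def build_pairs(articles, max_target_sentences=1):
    """Turn articles into (source, target) training pairs.

    source = headline (the terse "language")
    target = first sentence(s) of the first paragraph (the fluent "language")
    """
    pairs = []
    for article in articles:
        headline = article["headline"].strip()
        body = article["body"].strip()
        if not headline or not body:
            continue  # skip incomplete records
        # Assumption: paragraphs are separated by a blank line.
        first_paragraph = body.split("\n\n", 1)[0]
        sentences = re.split(r"(?<=[.!?])\s+", first_paragraph)
        target = " ".join(sentences[:max_target_sentences])
        pairs.append((headline, target))
    return pairs
```

The resulting list of pairs can then be tokenized and fed to whatever seq2seq implementation you pick, with the headline on the encoder side and the sentence on the decoder side.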

Multiclass MultiOutput Classification with both categorical and continuous attributes without encoding in python [closed]

Closed 5 years ago.
I'm working on a machine learning (data mining) project. I'm done with the data exploration and data preparation steps, which were done in Python.
Now I'm facing this issue: I have categorical attributes in my dataset.
After some research I found that the most appropriate algorithm for that kind of data is a decision tree or a random forest classifier.
But I've read some similar questions about decision trees and categorical attributes and found that the library I'm using (scikit-learn) doesn't work with categorical attributes (check here and here). To make it work I need to encode my categorical variables into numerical ones, but I don't want to use encoding because I will lose some properties of my attributes and some information, according to this answer. Also, some of my attributes have more than 100 different values.
So I want to know:
Is there any other Python library that can build decision trees with categorical data without any encoding?
In this answer it was suggested that other libraries like WEKA can build decision trees with categorical attributes, so my question is: can I combine two languages in the same machine learning project?
Could I do data exploration and preparation in Python, train the model in WEKA (Java), and deploy it in a Python Flask web app?
Is that possible?
The answer you linked about encoding categorical inputs is just saying that you should avoid numerical encoding when your categories don't have an inherent order. It correctly recommends that you use a one-hot encoding in this case.
Simply put, machine learning models operate on numbers, so even if you find a library that takes your raw categories without explicit encoding, it will still have to internally encode them before it can perform any computation.
100 categories is not a lot, and most off-the-shelf libraries will handle such inputs just fine. I recommend you try xgboost.
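For reference, this is roughly what one-hot encoding does, written in plain Python so the mechanics are visible. The one_hot_encode helper is a made-up name for this sketch. In a real project you would more likely reach for pandas.get_dummies or scikit-learn's OneHotEncoder.

```python
def one_hot_encode(values):
    """Return (categories, rows), one 0/1 column per distinct category."""
    # Sort the categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["red", "green", "red", "blue"])
# categories is ["blue", "green", "red"]
# rows is [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

No order is implied between the columns, which is exactly why this avoids the problem of mapping unordered categories to 1, 2, 3. The cost is one extra column per category, so an attribute with 100 values becomes 100 columns.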

Sorting words into categories in Python [closed]

Closed 6 years ago.
I have about 3,000 words and I would like to group them into about 20-50 different categories. My words are typical phrases you might find in company names. "Face", "Book", "Sales", "Force", for example.
The libraries I have been looking at so far are pandas and scikit-learn. I'm wondering if there is a machine-learning or deep-learning algorithm that would be well suited for this?
The topics I have been looking at are Classification (identifying which category an object belongs to) and Dimensionality Reduction (reducing the number of random variables to consider).
When I search for putting words into categories on Google, it brings up kids' puzzles such as "things you do with a pencil" - draw. Or "parts of a house" - yard, room.
For deep learning to work on this you would have to develop a large dataset, most likely manually. The largest natural language processing dataset was, in fact, created manually.
But if you are able to find a dataset that a model can learn from, then a model such as gradient boosted trees would be one, amongst others, that is well suited to multi-class classification like this. A classic library for this is xgboost.

Machine Learning tools for python dealing with potential matches for terms within textual data [closed]

Closed 6 years ago.
I'm planning to write a script that reads in text input data. This would consist of certain terms, e.g. "red car".
What machine learning tools for Python should I use if I want to identify potential matches to a term in my text input data within a database of terms and sentences?
For example, I would want similarly spelled terms (e.g. misspelled terms) like "redd car" to be identified and listed in the output of my script.
Edit 1: I have a method of identifying string similarity using FuzzyWuzzy, which returns a number representing two strings' similarity to each other. My question now is how to divide the words in the database into "similar" and "not similar" using machine learning approaches.
Without knowing much about your setup, I would recommend using scikit-learn for your project. It has support for almost every aspect of machine learning, including but not limited to:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing
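Before reaching for a trained model, the "similar" / "not similar" split from the question's edit can be tried with a plain threshold on the similarity score. The sketch below uses the standard library's difflib.SequenceMatcher as a stand-in for FuzzyWuzzy, so the scores run from 0.0 to 1.0 rather than 0 to 100. The 0.8 threshold and the function names are assumptions for illustration, and the threshold would need tuning on your own data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Ratio in [0.0, 1.0]; 1.0 means the strings are identical (case-insensitive here).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def split_by_similarity(term, candidates, threshold=0.8):
    """Partition candidates into (similar, not_similar) relative to term."""
    similar, not_similar = [], []
    for candidate in candidates:
        if similarity(term, candidate) >= threshold:
            similar.append(candidate)
        else:
            not_similar.append(candidate)
    return similar, not_similar
```

If a fixed threshold turns out to be too crude, the same score can be used as one input feature for a scikit-learn classifier trained on pairs you have labelled by hand as matching or not matching.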

Topic or Tag suggestion algorithm [closed]

Closed 7 years ago.
Here is the problem: when given a block of text, I want to suggest possible topics. For example, a news article about Kobe Bryant would have suggested tags like: ‘basketball’, ‘nba’, ‘sports’.
I have a fairly large training dataset (350k+) that includes bodies of text and the tags that users have assigned to the text. There are about 40k pre-existing topics; however, many of the topics do not have many entries in them. I would say only about 5k of the topics have more than 10 entries. Users cannot assign topics that don’t already exist in the system. I'd also like to include that
Does anyone have any suggestions for algorithms to use?
If anyone has any suggestions of python libraries as well that would be awesome.
There have been attempts at similar problems. One example is right here on Stack Overflow: when you wrote your question, Stack Overflow itself suggested some tags without your intervention, though you can manually add or remove them.
Out-of-the-box classification would fail because the number of tags is really huge. There are two directions from which you could work on this problem.
Nearest Neighbors
Easy, fast and effective. You have a labelled training set. When a new document comes in, you look for the closest matches. For example, words like 'tags', 'training', 'dataset' and 'labels' helped map your question to other similar questions on Stack Overflow. Those questions had the machine-learning tag, so this tag was suggested. The best way to implement this is to index your training data (the search-engine tactic). You may use Lucene, Elastic Search or something similar. When a new document appears, use it as a query and search for the top 10 matching documents stored previously. Poll their tags. Sort the tags and use the scores of the documents to decide how important each tag is. Done.
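The polling step can be sketched in a few lines. This toy version replaces the search engine with brute-force cosine similarity over raw word counts, which is fine for understanding the idea but will not scale to 350k documents; that is what an index such as Lucene or Elastic Search is for. The function names, the (text, tags) corpus layout, and the choice to sum document scores per tag are assumptions for the example.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    # Bag of lowercase words (digits and punctuation are dropped).
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def suggest_tags(query, corpus, k=10, max_tags=3):
    """corpus is a list of (text, tags) pairs from the labelled training set."""
    q = vectorize(query)
    # Score every stored document against the query and keep the top k.
    scored = sorted(
        ((cosine(q, vectorize(text)), tags) for text, tags in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Poll the tags: each matching document votes with its similarity score.
    votes = defaultdict(float)
    for score, tags in scored:
        for tag in tags:
            votes[tag] += score
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, score in ranked[:max_tags] if score > 0]
```

Summing the scores favours tags that appear on many of the top matches. Other weightings (for example dividing by how common a tag is overall) are worth trying, since very frequent tags otherwise tend to dominate.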
Probabilistic Models
The idea is along the lines of classification, but off-the-shelf tools won't help you with that. Check works like Clayton Stanley, Predicting Tags for StackOverflow Posts; Darren Kuo, On Word Prediction Methods; or Schuster's report on Predicting Tags for StackOverflow Questions.
If you have this problem as part of a long-term academic project or research, working on Method 2 (probabilistic models) would be better. However, if you need an off-the-shelf solution, use Method 1 (nearest neighbors). Lucene is a great indexing tool that is used even in production. It is originally written in Java, but you can easily find wrappers for Python. Other alternatives are Elastic Search, Katta and many more.
P.S. A lot of experimentation is required when tuning the tag scores.
