Ready-made topics for using LDA to categorize documents? [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
I'm using LDA to categorize small documents of about 4-5 lines each.
I'm categorizing them into topics such as Technology, Politics, Art, Music, etc.
I'm downloading Wikipedia articles in each category (Technology, Politics, Art, etc.) and training LDA on each category.
Wikipedia is huge (about 8 GB compressed), so the computations take hours and use a lot of space on my hard drive.
Is there any toolkit that already provides "ready-made" generic topics that I can use directly for categorization?

There are quite a few online APIs that categorize text into a predefined set of topics. For example, https://www.textrazor.com/demo identifies topics such as Business, Law, and Politics. You can also take a look at MeaningCloud or AlchemyAPI. Most of these services are paid, but they have a free tier that may be sufficient, depending on your needs.

Related

Neural network for text generation - Reverse summarizer (Python / Keras) [closed]

Closed 4 years ago.
I am about to start working on a neural network for text generation. Inputs will be some words from a user (e.g. Brexit vote tomorrow chance of UK staying within EU slim) and the output will be a nice, well-written sentence (e.g. The Brexit vote will take place tomorrow and the UK is unlikely to stay within the European Union).
For the implementation, I am thinking about a sequence2sequence model but, before starting to code, I would like to check whether this subject has been addressed before. After many Google searches, it seems that nobody has done a similar project (although there are a lot of papers about text translation), which surprises me because such a tool would be useful for many people, such as journalists.
Have any of you seen useful Python code or relevant articles somewhere?
Sequence2Sequence is what comes to my mind too. Text generation code using an RNN/LSTM just creates grammatically correct but meaningless sentences, as you discovered via Google.
Do you have a large corpus of examples to train a seq2seq model? Translation models require a very large corpus. One option for creating such a corpus is to gather the headlines and first paragraphs of news articles: treat the headline as the source language and the first paragraph (or first sentences) of the article as the language to translate into.
Here's a blog post about using a second model, based on Doc2Vec, to filter the sentences generated by seq2seq.
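The headline/first-paragraph pairing suggested above could be sketched roughly as follows. This is only a minimal illustration: the `build_pairs` helper and the `headline`/`body` dictionary keys are hypothetical names, the sample text is the example from the question, and paragraphs are assumed to be separated by a blank line.

```python
def build_pairs(articles):
    """Turn articles into (source, target) pairs for seq2seq training.

    Assumes each article is a dict with hypothetical "headline" and "body"
    keys, and that paragraphs in the body are separated by a blank line.
    """
    pairs = []
    for article in articles:
        headline = article.get("headline", "").strip()
        body = article.get("body", "").strip()
        if not headline or not body:
            continue  # skip articles missing either side of the pair
        first_paragraph = body.split("\n\n", 1)[0].strip()
        pairs.append((headline, first_paragraph))
    return pairs


# Illustrative input only; a real corpus would need many thousands of articles.
articles = [
    {
        "headline": "Brexit vote tomorrow chance of UK staying within EU slim",
        "body": (
            "The Brexit vote will take place tomorrow and the UK is unlikely "
            "to stay within the European Union.\n\nSecond paragraph."
        ),
    },
    {"headline": "", "body": "An article without a headline is skipped."},
]

pairs = build_pairs(articles)
```

Each pair can then be fed to whatever seq2seq training code you use, with the headline as the input sequence and the paragraph as the target sequence.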

Dataset for predicting gender [closed]

Closed 4 years ago.
Can anyone point me towards a dataset of psychology-based questions or surveys which, when answered in full, can reveal the gender of the person taking the test?
I need it to create a tool for detecting patterns of fake profiles on social platforms.
I know a few groups that are gender-specific (e.g. for mothers, or for private talk among women), but people of the opposite gender try to trash them by getting into the group while pretending to be female.
I know it sounds silly for now, but anyone who wants to join these groups could go through the questionnaire and the AI could detect their gender.
Thank you in advance.
There is a dataset that I came across on Kaggle. It does not have question-answer pairs from surveys; the project was mainly about trying to predict gender from users' tweets. I'm not sure whether you need your dataset in questionnaire format, but if not, you can check this out:
Twitter User Gender Classification

Sorting words into categories in Python [closed]

Closed 6 years ago.
I have about 3,000 words and I would like to group them into about 20-50 different categories. My words are typical of what you might find in company names: "Face", "Book", "Sales", "Force", for example.
The libraries I have been looking at so far are pandas and scikit-learn. I'm wondering if there is a machine-learning or deep-learning algorithm that would be well suited for this?
The topics I have been looking at are Classification (identifying which category an object belongs to) and Dimensionality Reduction (reducing the number of random variables to consider).
When I search Google for putting words into categories, it brings up kids' puzzles such as "things you do with a pencil" - draw, or "parts of a house" - yard, room.
For deep learning to work on this, you would have to develop a large dataset, most likely manually. The largest natural language processing dataset was, in fact, created manually.
But if you were able to find a dataset that a model could learn from, then a model such as gradient boosted trees would be one, amongst others, that is well suited to multi-class classification like this. A classic library for this is xgboost.

Machine Learning tools for python dealing with potential matches for terms within textual data [closed]

Closed 6 years ago.
I'm planning to write a script that reads in text input data. This would consist of certain terms, e.g. "red car".
What machine learning tools for Python should I use if I want to identify potential matches for a term from my text input data within a database of terms and sentences?
For example, I would want similarly spelled terms (e.g. misspelled terms) like "redd car" to be identified and listed in the output of my script.
Edit 1: I have a method of identifying string similarity using FuzzyWuzzy, which returns a number representing how similar two strings are to each other. My question now is how to divide the words in the database into "similar" and "not similar" using machine learning approaches.
Without knowing much about your setup, I would recommend using scikit-learn for your project. It supports almost every aspect of machine learning, including but not limited to:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing
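Before reaching for a learned model, the "similar" / "not similar" split from the question can be tried with a plain threshold on the similarity score. The sketch below is not machine learning and does not use FuzzyWuzzy: it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in scorer so that it runs without extra dependencies. The function names and the threshold of 80 are assumptions to tune on your own data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 score; a stand-in for a fuzzy-matching ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def split_by_similarity(term, candidates, threshold=80):
    """Partition candidates into (similar, not_similar) relative to term.

    The threshold is an arbitrary starting point, not a recommended value.
    """
    similar, not_similar = [], []
    for candidate in candidates:
        if similarity(term, candidate) >= threshold:
            similar.append(candidate)
        else:
            not_similar.append(candidate)
    return similar, not_similar


similar, not_similar = split_by_similarity(
    "red car", ["redd car", "red cart", "blue truck"]
)
```

If a fixed threshold turns out to be too crude, the same scores could serve as input features for one of the scikit-learn classifiers or clustering methods listed above.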

Dictionary for finding orientation of the words [closed]

Closed 7 years ago.
I am looking for a dictionary that gives the orientation (positive/negative/neutral) of words, as part of analyzing the sentiment of a phrase. Preferably a source that can be imported into Python code.
You seem to be looking for something like OpinionFinder.
This particular link points to a lexicon of 8233 adjectives, verbs and nouns and their orientation.
You can download it, so you'll be able to simply read the file into Python.
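Reading such a lexicon and using it to score a phrase might look like the sketch below. The file layout here is an assumption made for illustration (one tab-separated word and orientation per line); the actual downloaded lexicon may be laid out differently, so the parsing step would need adjusting. The function names and the simple vote-counting rule are also hypothetical.

```python
import io

# Hypothetical lexicon content: "word<TAB>orientation", one entry per line.
SAMPLE_LEXICON = "good\tpositive\nbad\tnegative\ntable\tneutral\n"


def load_lexicon(handle):
    """Read word/orientation pairs from an open text file into a dict."""
    lexicon = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, orientation = line.split("\t")
        lexicon[word.lower()] = orientation
    return lexicon


def phrase_orientation(phrase, lexicon):
    """Count positive and negative words; unknown words count as neutral."""
    score = 0
    for token in phrase.lower().split():
        orientation = lexicon.get(token, "neutral")
        if orientation == "positive":
            score += 1
        elif orientation == "negative":
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# io.StringIO stands in for open("path/to/lexicon") in this sketch.
lexicon = load_lexicon(io.StringIO(SAMPLE_LEXICON))
result = phrase_orientation("a good table", lexicon)
```

Summing word orientations ignores negation and context, so treat it as a baseline rather than a full sentiment analyzer.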
From SentiWordNet website:
SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity.
There are a lot of Python frameworks that use WordNet and SentiWordNet, such as NLTK or Pattern.
