Decoding tracebacks using machine learning - python

I am trying to solve a problem where I have files containing decoded tracebacks (stack call traces) that are produced whenever there is a crash (in the Linux world), and I have a unique ID to track each crash occurrence.
I want to build a classifier that learns from the previous decoded tracebacks and predicts whether an ID already exists for the traceback currently seen.
This is my first machine learning project. I did a trial using the CountVectorizer and TF-IDF approaches in Python.
I want to know which features to consider for classification and which text-classification algorithm is suitable for this problem.

Great to hear that this is your first machine learning project! For my first NLP project, I used the Amazon product reviews. Have you tried the bag-of-words (BOW) model? You can try n-grams too. You can also consider a Naive Bayes classifier and evaluate your classification. Comparing the results will tell you which algorithm solves the problem best.
Extra reading (if you like) : https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/
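The bag-of-words, n-gram and Naive Bayes suggestions above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not a tuned solution: the function names in the tracebacks, the `CRASH-1`/`CRASH-2` IDs and the `tokenize`/`NaiveBayes` helpers are all made up for the example. Each traceback is turned into a bag of frame names plus bigrams of adjacent frames, and a multinomial Naive Bayes with add-one smoothing picks the most likely known ID.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(traceback_text):
    """Split a decoded traceback into frame names plus adjacent-frame bigrams."""
    frames = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", traceback_text)
    bigrams = [f"{a}>{b}" for a, b in zip(frames, frames[1:])]
    return frames + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: decoded tracebacks paired with known crash IDs.
train_texts = [
    "kmalloc alloc_skb tcp_sendmsg sock_sendmsg",
    "alloc_skb tcp_sendmsg inet_sendmsg sock_sendmsg",
    "ext4_readpage do_page_fault handle_mm_fault",
    "handle_mm_fault do_page_fault ext4_file_mmap",
]
train_ids = ["CRASH-1", "CRASH-1", "CRASH-2", "CRASH-2"]

model = NaiveBayes().fit(train_texts, train_ids)
predicted_id = model.predict("tcp_sendmsg sock_sendmsg alloc_skb")
print(predicted_id)
```

Note that a classifier like this always returns one of the IDs it was trained on, so deciding that a traceback is genuinely new would need something extra, for example a threshold on a similarity score.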

Related

Suggestions for nonparametric machine learning models

I am new to machine learning, but I have decent experience in Python. I am faced with a problem: I need to find a machine learning model that would work well to predict the speed of a boat given current environmental and physical conditions. I have looked into scikit-learn, PyTorch, and TensorFlow, but I am having trouble finding information on what type of model I should use. I am almost certain that linear regression models would be useless for this task. I have been told that non-parametric regression models would be ideal for this, but I am unable to find many in the scikit-learn library. Should I be trying to use regression models at all, or should I be looking more into neural networks? I'm open to any suggestions, thanks in advance.
I think a multiple linear regression model would work well for your case. I am assuming that the input data is just a set of environmental parameters and that you have a boat speed corresponding to each set. For such problems, regression usually works well. I would not recommend neural networks unless you have a lot of training data and each input sample is also quite large.
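For comparison with the non-parametric option raised in the question, here is a minimal sketch of k-nearest-neighbours regression, which makes no assumption about the shape of the relationship. The feature choice (wind speed, wave height), the numbers and the `knn_predict` helper are invented for illustration. scikit-learn provides the same idea as `KNeighborsRegressor`.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical observations:
# (wind speed in knots, wave height in metres) -> boat speed in knots.
conditions = [(5.0, 0.2), (10.0, 0.5), (15.0, 1.0), (20.0, 1.8), (25.0, 3.0)]
boat_speed = [3.0, 5.5, 7.0, 6.5, 4.0]

estimate = knn_predict(conditions, boat_speed, query=(14.0, 0.9), k=2)
print(round(estimate, 2))
```

In practice the features would need to be put on a comparable scale first, otherwise the one with the largest numeric range dominates the distance.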

Election winning Prediction of 5 Candidates using Linear Regression in python

I have a project in which I need to build a prediction model using linear regression. The case study is to predict the winner among 5 candidates in an election. I don't have any data and need to build it on my own, but I am not able to work out which parameters to use. Can anybody help me with building the data? It would be highly helpful.
You can start by using previous years' election winners as your training data. If you don't have any training data, you will have a problem using linear regression (or any supervised learning). After that, if you want to use Python, try these steps:
Use some code from this tutorial: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/ or any other good beginner tutorial. You can also join a community like https://www.kaggle.com/ and get ideas from its kernels about data processing and parameter tuning.
As I understand this question, you need to create a model based on data you don't have yet. Presumably you will get the data later on, by which time the model should already be implemented. You can create a fake data set using the numpy.random library. We'd need more details on what exactly you're trying to do, though.
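A rough sketch of the fake-data idea, to make it concrete. It uses the standard-library `random` module as a stand-in for numpy.random so that it runs without extra packages. The feature names (`campaign_budget`, `poll_rating`, `is_incumbent`) and the formula linking them to vote share are pure assumptions, invented only so that there is something to fit a model against.

```python
import random


def make_fake_election(n_candidates=5, seed=0):
    """Generate one synthetic election: candidate features plus vote shares."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_candidates):
        rows.append({
            "candidate": f"candidate_{i + 1}",
            # Hypothetical features, invented purely for illustration.
            "campaign_budget": rng.uniform(0.1, 5.0),  # in millions
            "poll_rating": rng.uniform(0.0, 1.0),
            "is_incumbent": rng.random() < 0.2,
        })
    # Assumed relationship: the raw score grows with poll rating and budget,
    # plus Gaussian noise. It is clipped to stay positive.
    raw = [
        max(1e-6, 0.7 * r["poll_rating"] + 0.05 * r["campaign_budget"]
            + rng.gauss(0, 0.05))
        for r in rows
    ]
    total = sum(raw)
    for row, score in zip(rows, raw):
        row["vote_share"] = score / total  # shares sum to 1
    return rows


election = make_fake_election()
winner = max(election, key=lambda r: r["vote_share"])
print(winner["candidate"], round(winner["vote_share"], 3))
```

Calling `make_fake_election` with different seeds gives as many synthetic elections as needed for training, but a model fitted to them can only recover the assumed formula, not anything about real elections.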

Subject Extraction of a paragraph/document using NLP

I am trying to build a subject extractor: simply put, read all the sentences of a paragraph and make a calculated guess as to what the subject of the paragraph/article/document is. I might even upgrade it to a summarizer depending on the progress I make.
There is a great deal of information on the internet. It is difficult to understand all of it and select a correct path, as I am not well versed with NLP.
I was hoping someone with some experience could point me in the right direction.
I am NOT looking for a linguistic computation model, but rather an n-gram or neural network approach, something that has been done recently.
I am also looking into coreference resolution using n-grams, if anyone has any leads on that, it is much appreciated. Slightly familiar with the Stanford Coreferential Solver, but don't want to use it as is.
Any information, ideas and opinions are welcome.
@Dagger,
For finding the 'topic' of the whole document, there are several approaches you can try and research. The unsupervised approaches are faster and will get you started, but they may not differentiate between closely related documents that have similar topics. They also don't require a neural network. The supervised techniques recognise differences between similar documents better, but they require training a model first. You should be able to easily find blogs about implementing these in your desired programming language.
Unsupervised
K-Means Clustering using TF-IDF On Text Words - see intro here
Latent Dirichlet Allocation
Supervised
Text Classification models using SVM, Logistic Regression and neural nets
LSTM/RNN models using neural net
The neural net models will require training on a set of known documents with associated topics first. They are best suited for picking the ONE most likely topic, but multi-label implementations that assign several topics to a document are possible.
If you post example data and/or domain along with programming language, I can give some more specifics for you to explore.
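As a starting point before K-Means or LDA, the TF-IDF weighting mentioned above can already be used on its own to guess a subject: rank each document's words by how frequent they are in that document and how rare they are elsewhere. The sketch below is a simplified stand-in using only the standard library. The toy corpus and the `top_terms` helper are hypothetical, and it does no stop-word removal or stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, doc_index, n=3):
    """Rank the terms of one document by TF-IDF against the whole corpus."""
    tokenized = [tokenize(d) for d in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_freq = Counter(tokenized[doc_index])
    n_docs = len(documents)
    n_tokens = len(tokenized[doc_index])
    scores = {
        term: (count / n_tokens) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Hypothetical toy corpus.
docs = [
    "The court ruled that the contract was void and the contract terms unenforceable.",
    "The team won the match after the striker scored twice in the match.",
    "The bank raised interest rates and the bank expects inflation to fall.",
]
print(top_terms(docs, 0, n=1))
```

Words that occur in every document (such as "the") get a score of zero, which is why they drop out without an explicit stop-word list.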

Text summarization using deep learning techniques

I am trying to summarize text documents that belong to the legal domain.
I am referring to the site deeplearning.net on how to implement the deep learning architectures. I have read quite a few research papers on document summarization (both single document and multidocument) but I am unable to figure to how exactly the summary is generated for each document.
Once the training is done, the network stabilizes during testing phase. So even if I know the set of features (which I have figured out) that are learnt during the training phase, it would be difficult to find out the importance of each feature (because the weight vector of the network is stabilized) during the testing phase where I will be trying to generate summary for each document.
I tried to figure this out for a long time but it's in vain.
If anybody has worked on it or has any idea regarding the same, please give me some pointers. I really appreciate your help. Thank you.
I think you need to be a little more specific. When you say "I am unable to figure to how exactly the summary is generated for each document", do you mean that you don't know how to interpret the learned features, or don't you understand the algorithm? Also, "deep learning techniques" covers a very broad range of models - which one are you actually trying to use?
In the general case, deep learning models do not learn features that are humanly interpretable (although you can of course try to look for correlations between the given inputs and the corresponding activations in the model). So, if that's what you're asking, there really is no good answer. If you're having difficulties understanding the model you're using, I can probably help you :-) Let me know.
This is a blog series that explains in detail, from the very beginning, how text summarization works. Recent research uses seq2seq deep learning models, and the series starts by explaining this architecture before working up to the newest research approaches.
Also, this repo collects multiple implementations of text summarization models. It runs them on Google Colab and hosts the data on Google Drive, so no matter how powerful your computer is, you can train your deep models on Google Colab, which is free.
If you would like to see text summarization in action, you can use this free API.
I truly hope this helps

I'm trying to do sentiment analysis, need advice on how to start

I have decided to start a sentiment analysis project and am not sure exactly how to start and what to use.
I have heard of the Naive Bayes classifier, but I'm not sure how to use it in Python or how it works.
If I have to find the net sentiment, I have to find the value of each word. Is there an existing database of words and their sentiment, or must I create a list? Thanks.
Haven't looked at this for a couple years but...
at that time NLTK was the thing to use for natural language processing. There were a few papers and projects, with source code, demonstrating sentiment analysis with NLTK.
I just googled and found http://text-processing.com/demo/sentiment/. The page has examples and links to others, including "Training a naive bayes classifier for sentiment analysis using movie reviews".
There are plenty of datasets you can use as training datasets but I'd recommend you build your own to ensure your classifier is set up properly for your domain.
For example, you could start with the Cornell Movie Review dataset. I'd be very surprised if this dataset worked for your needs, but it would allow you to get started building your classification system.
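On the "value of each word" idea from the question, a lexicon-based net sentiment score can be sketched in a few lines. This is only an illustration of the approach, not a substitute for a trained classifier. The `LEXICON` and `NEGATIONS` tables below are tiny hand-made placeholders, and a real project would load a published lexicon (NLTK distributes some, such as the one used by its VADER analyser) or build one for its own domain.

```python
import re

# Hypothetical hand-made lexicon: word -> sentiment value.
LEXICON = {
    "good": 1, "great": 2, "love": 2, "excellent": 2,
    "bad": -1, "poor": -1, "terrible": -2, "hate": -2,
}
NEGATIONS = {"not", "never", "no"}


def net_sentiment(text):
    """Sum per-word scores, flipping a word's sign if it follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)  # unknown words count as neutral
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


print(net_sentiment("I love this excellent film"))
print(net_sentiment("not good"))
```

A positive total suggests positive sentiment and a negative total the opposite. Mixed sentences can cancel out to zero, which is one reason a trained classifier such as Naive Bayes usually does better.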
