I am carrying out a project in which I wish to create summaries of podcast transcripts.
I am aware of tools like NLTK and spaCy that can be used for text summarization. Is there a way I could use dialogue-focused datasets, such as MediaSum, SAMSum, or the Spotify Podcast Dataset, with these tools to produce appropriate summaries?
Do I have to create a model and train it on the corpora mentioned above?
How would I do this, and how would I then use the model in Python?
Any help or insight into this task would be greatly appreciated.
I am in the research phase and have identified datasets that I believe could help me summarize dialogue transcripts, but I do not yet know how to use them, together with an existing or newly trained model, to achieve my final goal of generating appropriate summaries of dialogue.
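For illustration, this is the kind of usage I have in mind, assuming the Hugging Face transformers library and a checkpoint already fine-tuned on SAMSum (the model name here is an assumption on my part; any dialogue-summarization checkpoint from the Hub would do):

    # Minimal sketch: summarize a short dialogue with a pretrained checkpoint.
    # The model name is illustrative; real podcast transcripts would need
    # chunking because the model's input length is limited.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")

    transcript = (
        "Host: Welcome back to the show. Today we're talking about sleep.\n"
        "Guest: Thanks for having me. Most adults need seven to nine hours.\n"
        "Host: And what happens when we don't get that?"
    )

    print(summarizer(transcript, max_length=60, min_length=10)[0]["summary_text"])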
I am working on a task involving NLP and transformers, and I would like to identify relevant features in a corpus of text. If I were to extract the relevant features from a job description, for instance the tools used on the job (PowerPoint, Excel, Java, etc.) and the level of proficiency required, would this task be better suited to a Named Entity Recognition (NER) model or a Question Answering (QA) model?
If I approached it as a NER task, I would attach a label to all the relevant tools in the training data and hope the model generalizes well. Alternatively, I could approach the problem with a QA model and ask things like "What tools does this job require?", supplying a description as context.
I plan to use the transformers library unless I am missing a better tool for this task. There are many features I am looking to extract, so not all of them may be as simple as grabbing keywords from a list (programming languages, Microsoft Office, etc.).
Would one of these approaches be a better fit, or am I missing a better way to approach the problem?
Any help appreciated. Thank you!
From what you say, it seems to be an entity recognition task. However, the questions you should ask and answer yourself are the following (a sketch of both options follows the list):
How will your user interact with the model?
Structured information → entity recognition.
Chatbot → QA.
Is there a predefined set of entities that you are going to extract from the text?
Yes → entity recognition.
No → QA.
What does the training data you have for fine-tuning look like?
Only a few labeled examples → entity recognition.
Plenty of question-answer pairs → QA.
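A minimal sketch of what the two options look like with the transformers pipelines (the default checkpoints are only illustrative; either approach would need fine-tuning on your labeled job descriptions before it recognizes tools):

    from transformers import pipeline

    description = "We need someone proficient in Excel and comfortable writing Java."

    # Option 1: token classification (NER) -- returns typed spans.
    # A stock checkpoint only knows generic types (PER, ORG, ...); a
    # fine-tuned one would emit your own labels, e.g. TOOL.
    ner = pipeline("token-classification", aggregation_strategy="simple")
    for entity in ner(description):
        print(entity["entity_group"], entity["word"], round(entity["score"], 2))

    # Option 2: extractive QA -- returns one answer span per question.
    qa = pipeline("question-answering")
    answer = qa(question="What tools does this job require?", context=description)
    print(answer["answer"], round(answer["score"], 2))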
I am trying to learn how to perform Named Entity Recognition.
I have a set of discharge summaries containing medical information about patients. I converted my unstructured data into structured data. Now, I have a DataFrame that looks like this:
Text | Target
normal coronary arteries... | R060
The Text column contains information about the diagnosis of a patient, and the Target column contains the code that will need to be predicted in a further task.
I have also constructed a dictionary that looks like this:
Code (Key) | Term (Value)
A00 | Cholera
This dictionary maps each diagnosis to its corresponding code. The Term column will be used to identify the clinical entities in the corpus.
I will need to train a classifier to predict the code, in order to automate the process of assigning codes to the discharge summaries (I am explaining this so you have an idea of the task I'm performing).
So far, I have converted my data into structured form. Now I am trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure what the preliminary steps are. Should I tokenize, stem, or lemmatize first? Or should I first find the medical terminology, given that clinical named entities are often multi-token terms with nested structures that include other named entities? Also, which Python packages or tools would you recommend?
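For the matching step, this is the kind of thing I have in mind (a minimal sketch using only the standard library; the similarity cutoff and the example strings are illustrative):

    # Direct vs. fuzzy lookup of a candidate term against the code dictionary.
    import difflib

    code_dict = {"A00": "Cholera"}  # the Code -> Term dictionary described above
    term_to_code = {term.lower(): code for code, term in code_dict.items()}

    def match_term(candidate):
        key = candidate.lower()
        # Direct match: exact string lookup.
        if key in term_to_code:
            return term_to_code[key], "direct"
        # Fuzzy match: closest dictionary term above a similarity cutoff.
        close = difflib.get_close_matches(key, list(term_to_code), n=1, cutoff=0.8)
        if close:
            return term_to_code[close[0]], "fuzzy"
        return None, "no match"

    print(match_term("cholera"))    # direct hit
    print(match_term("choleraa"))   # fuzzy hit despite the typo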
I am new to this field, so any help will be appreciated! Thanks!
If you are asking about building a classification model, then you should consider deep learning; it tends to perform well on classification tasks.
When dealing with this type of language processing task, I recommend first tokenizing your text and padding the sequences. Basic tokenization should be enough, but you can do more preprocessing, such as basic string cleaning, because proper preprocessing can often improve your model's accuracy by 3% or 4%. For basic string processing, you can use regular expressions (the built-in re module) in Python.
https://docs.python.org/3/library/re.html
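A minimal sketch of that tokenize-and-pad step with tf.keras (the sample texts, vocabulary size, and sequence length are illustrative):

    import re
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    texts = ["Normal coronary arteries.", "No acute distress."]  # illustrative

    # Basic string preprocessing with regex: lowercase, strip punctuation.
    texts = [re.sub(r"[^a-z0-9\s]", " ", t.lower()) for t in texts]

    tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)

    # Pad so every sequence has the same length for batching.
    padded = pad_sequences(sequences, maxlen=50, padding="post")
    print(padded.shape)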
I think you are doing the mapping step after preprocessing. Mapping alone can be enough for tasks like classification, but I recommend learning about word embeddings; they will improve your model.
For all these tasks, I recommend using TensorFlow. TensorFlow is a popular framework for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation; all the learning material is provided in the tutorials section.
https://www.tensorflow.org/tutorials/
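To make the embedding suggestion concrete, a minimal Keras sketch (the sizes are placeholders, padded is the output of the tokenization step above, and the targets would be your codes mapped to integers):

    import tensorflow as tf

    num_codes = 100  # hypothetical number of distinct diagnosis codes

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # learned word embeddings
        tf.keras.layers.GlobalAveragePooling1D(),                   # average over tokens
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_codes, activation="softmax"),     # one unit per code
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    # model.fit(padded, integer_targets, epochs=5)  # targets: codes as integers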
I think this will help you. All the best with your work!
Thank you.
Let's say I have a set of images of passports. I am working on a project where I have to identify the name on each passport and eventually extract that name as text.
For the very first part, determining where the name is on each passport (labeling, or maybe classification; I'm a beginner here), how would I go about that?
What techniques / software can I use to accomplish this?
An explanation in great detail, or any links, would be great. I'm trying to figure out exactly how this is done so I can begin coding.
I know training a model is possibly involved, but I'm just not sure.
I'm using Python if that matters.
Thanks!
There are two routes you can take: one where you have labeled data (or you want to label data yourself), and one where you don't.
Let's start with the latter. Say you have an image of a passport. You want to detect where the text in the image is and what that text says. You can achieve this using a library called pytesseract, a Python wrapper around the Tesseract OCR engine, which does exactly this for you. It works well because it has been trained on a lot of other images, so it's good at detecting text in most images.
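A minimal sketch of that (the image path is a placeholder, and the Tesseract engine itself has to be installed on the system separately from the Python package):

    from PIL import Image
    import pytesseract

    image = Image.open("passport.png")  # placeholder path

    # Plain OCR: all recognized text as one string.
    print(pytesseract.image_to_string(image))

    # Word-level bounding boxes: useful for locating WHERE the name is.
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    for word, x, y in zip(data["text"], data["left"], data["top"]):
        if word.strip():
            print(word, x, y)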
If you have labels, you might be able to improve on the results you get with pytesseract, but this is a lot harder. If you want to learn it anyway, I would recommend learning TensorFlow and using "transfer learning" to build your own model.
I'm trying to learn dynamic topic modeling (to capture semantic changes in words over time) using data scraped from PubMed. I was able to get the data as XML, extract the "abstract" text and the date information from it, and save that in CSV format. (But this is just a part of the data.)
Format obtained:
Year | month | day | abstractText
I'm planning on using gensim's LDA implementation for my model.
I've never really done topic modeling before and need your help with guiding me through this process one step at a time.
Questions:
Is CSV a preferred format to feed into gensim's LDA?
For dynamic modeling, how should the time aspect of the data be captured and used in the model?
Is there a better way to organize the data than CSV files?
Should I use the body text instead of the abstract for this?
Hope I learn a lot from this. Thanks in advance.
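For reference, this is the kind of pipeline I understand the gensim side to involve (a minimal sketch; the file and column names follow my format above, and num_topics is an arbitrary guess):

    import pandas as pd
    from gensim import corpora, models
    from gensim.utils import simple_preprocess

    df = pd.read_csv("pubmed_abstracts.csv")  # placeholder file name

    # gensim wants lists of tokens, not raw CSV rows, so the storage
    # format matters less than the preprocessing.
    tokenized = [simple_preprocess(text) for text in df["abstractText"].astype(str)]

    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

    lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
    for topic in lda.print_topics(num_topics=5, num_words=5):
        print(topic)

    # For the dynamic variant, gensim's LdaSeqModel takes the same corpus
    # sorted by date plus a time_slice list (documents per period), e.g.:
    # models.LdaSeqModel(corpus=corpus, time_slice=[40, 35, 25], num_topics=10)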
Is there any efficient way to extract sub-topic explanations from a review using Python and the NLTK library? As an example, a user review of a mobile phone could be "This phone's battery is good but display is a bullshit".
I want to extract the two features above, like:
"Battery is good"
"display is a bullshit"
The purpose of the above is that I'm going to develop a rating system for products with respect to their features.
The polarity analysis part is done.
But extracting the features of a review is somewhat difficult for me. However, I found a way to extract features using POS tag patterns with regular expressions, like
<NN.?><VB.?>?<JJ.?>
treating this pattern as a sub-topic. But the problem is that there could be many patterns in a review, depending on how users phrase their descriptions.
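This is roughly what I have so far (a minimal sketch; NLTK's tokenizer and POS tagger models need to be downloaded first via nltk.download):

    import nltk

    review = "This phone's battery is good but display is a bullshit"
    tagged = nltk.pos_tag(nltk.word_tokenize(review))

    grammar = "FEATURE: {<NN.?><VB.?>?<JJ.?>}"  # the pattern from above
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(tagged)

    for subtree in tree.subtrees(filter=lambda t: t.label() == "FEATURE"):
        print(" ".join(word for word, tag in subtree.leaves()))
    # Prints chunks like "battery is good"; "display is a bullshit" needs a
    # broader pattern (it has a determiner and a noun in the predicate),
    # which is exactly the proliferation-of-patterns problem I described.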
Is there any way to solve my problem efficiently?
Thank you!
The question you posed is multi-faceted and not straightforward to answer.
Conceptually, you may want to go through the following steps:
Identify the names of the features of phones (and maybe create an ontology based on these features).
Create lists of synonyms for the feature names (and similarly for evaluative phrases, e.g. nice, bad, sucks).
Use one of NLTK's taggers to parse the reviews.
Create rules for the extraction of features and their evaluations (the information extraction part); I am not sure whether NLTK can directly support you with this (see the sketch after this list).
Evaluate and refine the approach.
Or: create a larger annotated corpus and train a deep learning model on it using TensorFlow, Theano, or something similar.
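To make steps 1, 2, and 4 concrete, a minimal rule-based sketch (every list entry here is illustrative; the input chunks would come from an NLTK tagger/chunker as in step 3):

    # Map an extracted chunk onto a canonical feature name and a polarity
    # using hand-built synonym lists.
    FEATURE_SYNONYMS = {
        "battery": {"battery", "battery life", "charge"},
        "display": {"display", "screen", "resolution"},
    }
    POSITIVE = {"good", "great", "nice"}
    NEGATIVE = {"bad", "bullshit", "sucks"}

    def classify_chunk(chunk):
        chunk_l = chunk.lower()
        # Find the first feature whose synonym appears in the chunk.
        feature = next((name for name, syns in FEATURE_SYNONYMS.items()
                        if any(s in chunk_l for s in syns)), None)
        words = set(chunk_l.split())
        if words & POSITIVE:
            return feature, "positive"
        if words & NEGATIVE:
            return feature, "negative"
        return feature, "neutral"

    print(classify_chunk("battery is good"))        # ('battery', 'positive')
    print(classify_chunk("display is a bullshit"))  # ('display', 'negative')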