How to extract information? - Python

Objective: I am trying to do a project on Natural Language Processing (NLP), where I want to extract information and represent it in graphical form.
Description:
I am considering news articles as the input to my project.
Removing unwanted data from the input and cleaning it up.
Performing NLP and extracting information/knowledge.
Representing the information/knowledge in graphical form.
Is this possible?

If you want to use nltk, you can start here. It has some explanation of tokenizing, part-of-speech tagging, parsing and more.
Check this page for an example of named entity detection using nltk.
The graphical representation can be done with igraph or matplotlib.
Also, scikit-learn has great text feature extraction methods, in case you want to run some more sophisticated models.
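For instance, a minimal sketch of that nltk pipeline (the sample sentence is made up, and the download calls are only needed once if the corpora are missing):

import nltk

# One-time downloads of the models used below (uncomment if missing):
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

sentence = "Apple opened a new office in Berlin last year."

tokens = nltk.word_tokenize(sentence)   # ['Apple', 'opened', 'a', ...]
tagged = nltk.pos_tag(tokens)           # [('Apple', 'NNP'), ...]
entities = nltk.ne_chunk(tagged)        # tree with ORGANIZATION / GPE chunks

# Pull out the named entities as (label, text) pairs.
for subtree in entities.subtrees():
    if subtree.label() != 'S':
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))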

The first step is to try to do this job yourself by hand with a pencil. Try it on not just one but a collection of news stories. You really do have to do this and not just think about it. Draw the graphics just as you'd want the computer to.
What this does is force you to create rules about how information is transformed into graphics. This is NOT always possible, so doing it by hand is a good test. If you can't do it, then you can't program a computer to do it.
Assuming you have found a paper-and-pencil method, what I like to do is work BACKWARDS. Your method starts with the text. No. Start with the numbers you need to draw the graphic. Then think about where these numbers are in the stories and what words you have to look at to get them. Your job is now more like a hunting trip: you know the data is there, but how do you find it?
Sorry for the lack of details, but I don't know your exact problem; this approach works in every case. First learn to do the job yourself on paper, then work backwards from the output to the input.
If you try to design this software in the forward direction you get stuck quickly, because you can't possibly know what to do with your text when you don't yet know what you need; it's like pushing a rope, it doesn't work. Go to the other end and pull the rope. Do the graphic work FIRST, then pull the needed data from the news stories.

Related

How to get all the syllables of a word in Python?

I am looking to split a word into its syllables. I am trying to build a speech-to-text system focused on transcribing medical terms.
Consider a doctor/pharmacist who instead of typing out the medicine dosage would just speak into the microphone and a digital prescription would be generated automatically.
I want to avoid ML/DL-based approaches since I want the system to work in real time. Therefore I want to tackle this problem via a dictionary-based approach. I have scraped rxlist.com to get all the possible medicine names.
Currently, I am using the webspeech API (https://www.google.com/intl/en/chrome/demos/speech.html). This works well but often messes up the medicine names.
Panadol twice a day for three days would become panel twice a day for three days
It works sometimes (super unstable). Also, it is important to consider that panadol is a relatively simple term. Consider Vicodin (changed to why couldn't), Abacavir Sulfate, etc.
Here is the approach I thought could perhaps work.
Maintain a dictionary of all medicines.
Once the detections are there (I append all the detections instead of just using the last output), compare the string distance to each medicine (the list could be huge, so sorting is important here) and replace the word with the match with minimum error.
If nothing matches (maintain an error threshold in step 2), compare the syllables of the prediction and of each medicine name and replace with the one with the lowest error.
So now that I have the list, I was hoping to find a library/dictionary API that could give me the syllables of medicine names. Typing How to pronounce vicodin into Google brings up the Learn to Pronounce panel which has: vai·kuh·dn. I would want something similar; I could scrape it off Google, but I don't get results for all the medicine names.
Any help would be appreciated.
Thanks.
You can use a library called pyphen. It's pretty easy to use. To install it run the following command in your terminal:
pip install pyphen
After this, you can find the syllables of a word:
import pyphen

# Create a hyphenator for English and print the word with hyphens
# inserted at the syllable boundaries.
a = pyphen.Pyphen(lang='en')
print(a.inserted('vicodin'))
I hope you find this useful.
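For the dictionary-matching step described in the question, a rough sketch could combine pyphen with difflib; the medicine list below is a tiny hypothetical stand-in for the scraped rxlist.com data, and the threshold is just a starting guess to tune:

import difflib
import pyphen

dic = pyphen.Pyphen(lang='en')
medicines = ['panadol', 'vicodin', 'abacavir sulfate']   # hypothetical subset

def correct_word(word, threshold=0.6):
    """Return the closest medicine name, or the word itself if nothing is close."""
    # Step 1: plain string similarity against the dictionary.
    matches = difflib.get_close_matches(word, medicines, n=1, cutoff=threshold)
    if matches:
        return matches[0]
    # Step 2: fall back to comparing the hyphenated (syllable-like) forms.
    word_syll = dic.inserted(word)
    scored = [(difflib.SequenceMatcher(None, word_syll, dic.inserted(m)).ratio(), m)
              for m in medicines]
    score, best = max(scored)
    return best if score >= threshold else word

print(correct_word('panel'))     # should map to 'panadol'
print(correct_word('aspirin'))   # likely below the threshold, returned unchanged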

How would I go about image labeling/Classification?

Let's say I have a set of images of passports. I am working on a project where I have to identify the name on each passport and eventually transform that object into text.
For the very first part, labeling (or classification, I think; I'm a beginner here) where the name is on each passport, how would I go about that?
What techniques / software can I use to accomplish this?
Any detail or links would be great. I'm trying to figure out exactly how this is done so I can begin coding.
I know training a model is possibly involved, but I'm just not sure.
I'm using Python, if that matters.
Thanks.
There are two routes you can take: one where you have labeled data (or you want to label the data yourself), and one where you don't.
Let's start with the latter. Say you have an image of a passport. You want to detect where the text in the image is and what that text says. You can achieve this using a library called pytesseract, a wrapper around the Tesseract OCR engine that does exactly this for you. It works well because it has been trained on a lot of other images, so it is good at detecting text in any image.
If you have labels you might be able to improve on the model you build with pytesseract, but this is a lot harder. If you want to learn it anyway, I would recommend learning TensorFlow and using "transfer learning" to improve your model.
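For the first route, here is a minimal sketch of the pytesseract usage (it assumes the Tesseract engine is installed on your system, and 'passport.jpg' is a placeholder filename):

from PIL import Image
import pytesseract

image = Image.open('passport.jpg')

# Plain OCR over the whole page.
text = pytesseract.image_to_string(image)
print(text)

# image_to_data also returns word-level bounding boxes, which helps when you
# want to locate a specific field (e.g. the name) rather than all the text.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, x, y, w, h in zip(data['text'], data['left'], data['top'],
                            data['width'], data['height']):
    if word.strip():
        print(word, (x, y, w, h))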

How to make a semantic label image?

Now I want to train my own image data in Caffe using SegNet.
But as a first step we need to label our own images like these:
I have tried searching GitHub but cannot find anything. So my question is: does anyone know which tool can make semantic label images?
Check out a tool called sloth: https://github.com/cvhciKIT/sloth. It is an open-source tool written in Python with PyQt for creating ground-truth computer vision datasets for a wide array of applications, such as semantically labelled data like you have above.
If you don't like sloth, you can use any image editing software, like GIMP, where you would make one layer per label and use polygons and flood fill in different hues to create your data. You would then merge all of the layers together to make the final image that you would use for your purposes.
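Once each class is painted in its own flat colour and the layers are merged, a small sketch (the colour-to-class mapping and filenames below are hypothetical) can convert the RGB image into the single-channel integer label map that SegNet/Caffe expects:

import numpy as np
from PIL import Image

COLOR_TO_CLASS = {
    (0, 0, 0): 0,        # background
    (255, 0, 0): 1,      # e.g. road
    (0, 255, 0): 2,      # e.g. vegetation
}

rgb = np.array(Image.open('labels_rgb.png').convert('RGB'))
label_map = np.zeros(rgb.shape[:2], dtype=np.uint8)

for color, class_id in COLOR_TO_CLASS.items():
    mask = np.all(rgb == color, axis=-1)   # pixels painted with this colour
    label_map[mask] = class_id

Image.fromarray(label_map).save('labels.png')   # grayscale label image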
However, as user Miki mentioned (see the discussion thread below), creating a new dataset from scratch will take a considerable amount of effort. It is highly advisable that you don't do this on your own, as you need a lot of data to ensure your algorithms perform correctly. You'll need the help of other (hopefully willing) PhD students, preferably those you know personally or who work with you in your lab or workplace, to help manually curate this data for you.
If this isn't an option, you can use crowdsourcing platforms like Amazon Mechanical Turk, where you outsource the work to willing individuals: you describe the task at hand and pay a small amount per image. This would be something to consider if you can't find many people to help you.
All in all, this will take a considerable amount of effort, not only in terms of time but also in terms of people, if you want to create a large dataset within a short span of time. I would recommend you simply use established datasets, such as the one you referenced from Cambridge, or, as Miki suggested, LabelMe by Antonio Torralba, which is not only a toolbox for annotating images from his LabelMe dataset but also lets you do the same for your own images.
Good luck!
As answered by #rayryeng, a tool called sloth is great for finishing these tasks in a simple way. However, if I have more than 20 objects waiting for me to classify, sloth is not an ideal tool. Thus I developed a simple tool called IsLabel to solve this problem with a few algorithms.
And the result looks like this (input and output example images not included here); using IsLabel took me just 40s.
I know it's not perfect, but it works fine for me.
I would recommend using https://www.labelbox.io/. They open sourced a lot of their code and have a hosting platform to manage the whole labeling process end to end.
Here is an example of segmentation
And you can export labels with a mask.

Problems with Echo Nest Earworm analyzing small mp3s

Alright everyone, this one is super niche:
I am attempting to use the earworm.py code to analyze the timbre and pitch features of very short mp3s/tracks (1 second minimum); however, the code returns no features and an empty graph.
The issue seems to stem from the function get_central(analysis, member='segments'). With short tracks, member = getattr(analysis, member) returns empty.
Why is this? Is there a quick fix I could use, like changing member='segments' to something more fine-grained?
Is there a way to extract timbre and pitch features from such short tracks using EchoNest?

Guess tags of a paragraph programmatically using python

I've been trying to read about NLP in general and nltk in particular to use with Python. I don't know for sure whether what I am looking for exists out there, or whether I perhaps need to develop it.
I have a program that collects text from different files; the text is extremely random and talks about different things. Each file contains a paragraph, or three at most, and my program opens the files and stores them in a table.
My question is: can I guess tags for what a paragraph is about? If anyone knows of an existing technology or approach, I would really appreciate it.
Thanks,
Your task is called "document classification", and the nltk book has a whole chapter on it. I'd start with that.
It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (select the N most important words or phrases in the text)?
You should train a classifier. The easiest one to develop (and you don't really need to develop it, as NLTK provides one) is the naive Bayesian. The catch is that you'll need to manually classify a corpus of observations and then have the program guess which tag best fits a given paragraph (needless to say, the bigger the training corpus, the more precise your classifier will be; IMHO you can reach 80-85% correctness). Take a look at the docs.
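As a minimal sketch of that approach with nltk's built-in classifier (the tiny hand-labelled corpus below is purely illustrative; in practice you would label a much larger set of paragraphs):

import nltk

train = [
    ("The team won the match after a late goal", "sports"),
    ("The striker was transferred for a record fee", "sports"),
    ("The central bank raised interest rates again", "finance"),
    ("Stocks fell sharply after the earnings report", "finance"),
]

def features(text):
    # Bag-of-words features: each lowercase token becomes a boolean feature.
    return {word.lower(): True for word in nltk.word_tokenize(text)}

train_set = [(features(text), tag) for text, tag in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

paragraph = "Bond yields rose as the bank signalled further rate hikes"
print(classifier.classify(features(paragraph)))      # hopefully 'finance'
classifier.show_most_informative_features(5)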
