I am looking to split words into their syllables. I am building a speech-to-text system focused on transcribing medical terms.
Consider a doctor/pharmacist who instead of typing out the medicine dosage would just speak into the microphone and a digital prescription would be generated automatically.
I want to avoid ML/DL-based approaches since I want the system to work in real time, so I am tackling this problem with a dictionary-based approach. I have scraped rxlist.com to get all the possible medicine names.
Currently, I am using the Web Speech API (https://www.google.com/intl/en/chrome/demos/speech.html). This works well but often messes up the medicine names.
"Panadol twice a day for three days" would become "panel twice a day for three days".
It works sometimes, but it is super unstable. Also, it is important to consider that Panadol is a relatively simple term. Consider Vicodin (transcribed as "why couldn't"), Abacavir Sulfate, etc.
Here is the approach I thought could perhaps work:
Maintain a dictionary of all medicines.
Once the detections are in (I append all the detections instead of just using the last output), compare the string distance of each detected word against each medicine name (the list could be huge, so sorting is important here) and replace the word with the minimum-error match (a rough sketch of this step follows the list).
If nothing matches (maintain an error threshold in step 2), compare the syllables of the prediction with those of each medicine name and replace it with the lowest-error match.
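A minimal sketch of the matching step, assuming the scraped names sit in a plain Python list; difflib's similarity ratio stands in for a proper edit distance here, and the candidate list and cutoff are placeholders to tune:

import difflib

# Hypothetical slice of the dictionary scraped from rxlist.com
medicine_names = ["panadol", "vicodin", "abacavir sulfate"]

def correct_word(word, cutoff=0.6):
    # get_close_matches ranks candidates by difflib's similarity ratio
    # (not Levenshtein distance, but it serves the same purpose here)
    matches = difflib.get_close_matches(word.lower(), medicine_names, n=1, cutoff=cutoff)
    return matches[0] if matches else word  # no match: fall through to the syllable step

print(correct_word("panel"))  # maps to "panadol" with these toy candidates
print(correct_word("twice"))  # unchanged

If the distance check fails, the same loop could fall back to comparing syllabified forms, which is where a syllable library or API would plug in.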
So now I have the list, and I was hoping to find a library/dictionary API that could give me the syllables of medicine names. Typing "How to pronounce vicodin" into Google brings up the Learn to Pronounce panel, which shows: vai·kuh·dn. I would want something similar. I could scrape it off Google, but I don't get results for all the medicine names.
Any help would be appreciated.
Thanks.
You can use a library called pyphen. It's pretty easy to use. To install it, run the following command in your terminal:
pip install pyphen
After this, you can find the syllable breaks in a string:
import pyphen

# 'en_US' is one of the hyphenation dictionaries bundled with pyphen;
# a bare 'en' may not be available depending on the version.
dic = pyphen.Pyphen(lang='en_US')
print(dic.inserted('vicodin'))  # prints the word with hyphens at the break points
I hope you find this useful.
Related
I am trying to use spaCy to extract word relations/dependencies, but I am a little unsure about how to use the information it gives me. I understand how to generate the visual dependency tree for debugging.
Specifically, I don’t see a way to map the list of children of a token to a specific token. There is no index—just a list of words.
Looking at the example here: https://spacy.io/usage/linguistic-features#dependency-parse
nlp("Autonomous cars shift insurance liability toward manufacturers")
Also, if the sentence were nlp("Autonomous cars shift insurance liability toward manufacturers of cars"), how would I disambiguate between the two instances of cars?
The only thing I can think of is that maybe these tokens are actually reference types that I can map to indices myself. Is that the case?
Basically, I am looking to start with getting the predicates and args to understand “who did what to whom and how/using what”.
Yeah, when you print a token it looks like a string. It's not. It's an object with tons of metadata, including token.i, which is the index you are looking for.
If you're just getting started with spaCy, the best use of your time is the official course; it's quick and practical.
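Here is a small sketch of what that looks like in practice (the en_core_web_sm model name is an assumption, and the exact indices depend on the tokenization):

import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser works
doc = nlp("Autonomous cars shift insurance liability toward manufacturers of cars")

for token in doc:
    # token.i is the token's position in the doc, so the two occurrences of
    # "cars" get distinct indices; children are Token objects too, so they
    # carry their own .i values.
    print(token.i, token.text, token.dep_,
          "head:", token.head.i,
          "children:", [child.i for child in token.children])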
I am relatively new to the field of NLP/text processing. I would like to know how to identify domain-related important keywords from a given text.
For example, if I have to build a Q&A chatbot to be used in the banking domain, a question would look like: What is the maturity date for TRADE:12345 ?
From the Q, I would like to extract the keywords: maturity date & TRADE:12345.
From the extracted information, I would frame a SQL-like query, search the DB, retrieve the SQL output and provide the response back to the user.
Any help would be appreciated.
Thanks in advance.
So, this is where the work comes in.
Normally people start with a stop word list. There are several; choose wisely. More than likely you'll experiment and/or start from a base list and then add more words to it.
Depending on the list, it will take out
"what", "is", "the", "for", "?"
Since this is a pretty easy example, they'll all do that. But you'll notice that what is being done is just the opposite of what you asked for: you wanted domain-specific words, but what is happening is the removal of everything else that the library considers cruft.
From here it will depend on what you use. NLTK or spaCy are common choices. Regardless of which you pick, get a real understanding of the concepts or they can bite you (like pretty much anything in data science).
Expect to start thinking in terms of linguistic patterns, so, in your example:
What is the maturity date for TRADE:12345 ?
'What' is an interrogative, 'the' is a definite article, 'for' starts a prepositional phrase.
There may be other clues, such as the ':' or the fact that TRADE is in all caps. But those clues might not always be there.
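Here is a rough sketch of that idea with spaCy; the en_core_web_sm model, the regex for the trade identifier, and the exact noun-chunk output are all assumptions to adapt to your data (the two lists may overlap):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_keywords(question):
    # Identifiers like TRADE:12345 are easiest to grab with a pattern;
    # adjust the regex to whatever your real IDs look like.
    ids = re.findall(r"[A-Z]+:\d+", question)
    doc = nlp(question)
    # Noun chunks give multi-word candidates such as "the maturity date";
    # dropping stop words trims them down to "maturity date".
    chunks = [" ".join(t.text for t in chunk if not t.is_stop)
              for chunk in doc.noun_chunks]
    return ids + [c for c in chunks if c]

print(extract_keywords("What is the maturity date for TRADE:12345 ?"))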
That should get you started but you might look at some of the other StackExchange sites for deeper expertise.
Finally, you want to break a question like this into more than one question (assuming that you've done the research and determined the question hasn't already been asked, repeatedly). The NLTK/NLP part is relatively new territory, but SQL queries are usually a Google search away.
I need a spell checker in python.
I've looked at previous answers and they all seem to be outdated now or not applicable:
Python spell checker using a trie This question is more about the data structure.
Python Spell Checker This is a spelling corrector, given two strings.
http://norvig.com/spell-correct.html Often referenced and quite interesting, but this is also a spelling corrector, and the accuracy isn't quite good enough, though I'll probably use it in combination with a checker.
Spell Checker for Python Uses pyenchant which isn't maintained anymore.
Python: check whether a word is spelled correctly Also suggests Pyenchant which isn't maintained.
Some details of what I need:
A function that accepts a string (word) and returns a boolean indicating whether the word is valid English or not. The unit test would expect True for an input of "car" and False for an input of "ijjk".
Accuracy needs to be above 90%, but it doesn't need to be much higher than that. I'm just using this to exclude words during preprocessing for document classification. Most of the errors will be picked up anyway as words that appear too seldom (though not all). Spell correcting won't work in all cases because many of the errors are OCR issues that are too far off to fix.
If it can deal with legal terms that would be a big plus. Otherwise I might need to manually add certain terms to the dictionary.
What's the best approach here? Are there any maintained libraries? Do I need to download a dictionary and check against it?
Two recent Python libraries, both based on Levenshtein minimum edit distance and optimized for the task:
symspellpy, released at the end of 2019, and
spello, released in 2020.
It should be mentioned that the symspellpy link above is the Python port of the original SymSpell C# implementation; its description is here. The original SymSpell GitHub repository includes a dictionary with word frequencies.
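Typical symspellpy usage looks roughly like this (based on the usage shown in its README; the bundled dictionary filename may differ between versions):

import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# The package ships a frequency dictionary; term_index/count_index describe its columns.
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup returns candidate corrections ordered by edit distance and word frequency.
for suggestion in sym_spell.lookup("scienc", Verbosity.CLOSEST, max_edit_distance=2):
    print(suggestion.term, suggestion.distance, suggestion.count)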
Spello includes a basic model pre-trained on 30K news and 30K Wikipedia articles, but it's better to train it on a custom corpus from your domain.
If you need a simple per-word check, you just need a corpus of words (preferably matching your terminology): read it into a Python set and do a membership check for every single word, one by one.
Once (or if) you run into issues with this naive implementation, you can drill down into the concrete problems.
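A minimal sketch of that naive check, assuming a plain one-word-per-line file (the filename is a placeholder; you could append your legal terms to the same file):

def load_vocabulary(path):
    # One word per line, e.g. a system word list plus your own legal terms.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

vocabulary = load_vocabulary("words.txt")  # placeholder path

def is_valid_word(word):
    # True if the word is in the vocabulary, False otherwise.
    return word.lower() in vocabulary

print(is_valid_word("car"))   # True, provided "car" is in the word list
print(is_valid_word("ijjk"))  # False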
You can use a dedicated spell-checking library in Python called enchant.
To check whether a word's spelling is correct, i.e. whether such a word exists in English, all you have to do is this:
import enchant
d = enchant.Dict("en_US")
d.check("scienc")
This will give an output:
False
The best part about this library is it suggests the right spelling of the words. For example:
d.suggest("scienc")
will give an output:
['science', 'scenic', 'sci enc', 'sci-enc', 'scientist']
There are more features in this library. For example, in the sample code above I used the US English dictionary ("en_US"). You can use other English dictionaries, such as "en_AU" for Australian English, or "en_CA" and "en_GB" for Canada and Great Britain respectively, to name a few. Non-English languages are supported too, such as "fr_FR" for French!
For advanced usage, this library can check words against a custom list of words (this feature comes in handy when you have a set of proper nouns). This is simply a file listing the words to be considered, one word per line. The following example creates a Dict object for the personal word list stored in "my_custom_words.txt":
custom_d = enchant.request_pwl_dict("my_custom_words.txt")
To check out more features and other aspects of it, refer:
http://pyenchant.github.io/pyenchant/
Objective: I am trying to do a project on Natural Language Processing (NLP), where I want to extract information and represent it in graphical form.
Description:
I am considering news articles as input to my project.
Removing unwanted data from the input and putting it into a clean format.
Performing NLP and extracting information/knowledge.
Representing the information/knowledge in graphical form.
Is this possible?
If you want to use NLTK, you can start here. It has some explanation about tokenizing, part-of-speech tagging, parsing and more.
Check this page for an example of named entity detection using nltk.
The Graphical representation can be performed using igraph or matplotlib.
Also, scikit-learn has great text feature extraction methods, in case you want to run some more sophisticated models.
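A small NLTK sketch covering those steps (the commented nltk.download calls are one-time model downloads; the sentence is just an example):

import nltk

# One-time downloads of the models this pipeline needs:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Apple is opening a new office in Berlin next year."
tokens = nltk.word_tokenize(sentence)   # tokenizing
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named entity detection
print(tree)  # the detected entities could then be drawn with igraph or matplotlib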
The first step is to try and do this job yourself by hand with a pencil. Try it on not just one but a collection of news stories. You really do have to do this and not just think about it. Draw the graphics just as you'd want the computer to.
What this does is force you to create rules about how information is transformed into graphics. This is NOT always possible, so doing it by hand is a good test. If you can't do it, then you can't program a computer to do it.
Assuming you have found a paper-and-pencil method, what I like to do is work BACKWARDS. Your method starts with the text. No. Start with the numbers you need to draw the graphic. Then think about where these numbers are in the stories and what words you have to look at to get them. Your job is now more like a hunting trip: you know the data is there, but how do you find it?
Sorry for the lack of details but I don't know your exact problem but this works in every case. First learn to do the job yourself on paper then work backwards from the output to the input.
If you try to design this software in the forward direction you get stuck soon, because you can't possibly know what to do with your text when you don't know what you need; it's like pushing a rope, it doesn't work. Go to the other end and pull the rope: do the graphic work FIRST, then pull the needed data from the news stories.
I think this is an interesting question, at least for me.
I have a list of words, let's say:
photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet
and I have a list of contexts:
Programming
World news
Technology
Web Design
I need to try and match words with the appropriate context/contexts if possible.
Maybe discovering word relationships in some way.
Any ideas?
Help would be much appreciated!
This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.
I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.
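If you do go the WordNet route via NLTK, a quick sketch looks like this (requires the WordNet data download; whether the hypernym chain reaches something as specific as 'Web Design' is another matter):

from nltk.corpus import wordnet as wn  # needs: nltk.download("wordnet")

for synset in wn.synsets("image"):
    # Hypernyms walk up toward broader concepts, which is one crude way
    # to relate a word to a wider category.
    print(synset.name(), "->", [h.name() for h in synset.hypernyms()])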
Where do these words come from? Do they come from real texts? If they do, then it is a classic data mining problem. What you need to do is turn your set of documents into a matrix where the rows represent the documents the words came from and the columns represent the words in the documents.
For example if you have two documents like this:
D1: Need to find meaning.
D2: Need to separate Apples from oranges
your matrix will look like this:
      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:    1     1    1      1        0       0        0       0
D2:    1     1    0      0        1       1        1       1
This is called a term-by-document matrix.
Having collected these statistics, you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means is a very slow algorithm, so you can try to optimize it using techniques such as SVD.
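A compact sketch of that pipeline with scikit-learn, using the two toy documents above (get_feature_names_out needs a reasonably recent scikit-learn):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Need to find meaning.",
    "Need to separate Apples from oranges",
]

# Build the term-by-document matrix (rows = documents, columns = terms).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Group similar documents; n_clusters is the number of concepts you expect.
# For large matrices, TruncatedSVD can reduce the dimensionality first.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)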
I just found this a couple days ago: ConceptNet
It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.
If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.
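These days the easiest way in from Python may be ConceptNet's public HTTP API rather than the full database download; a rough sketch (endpoint and field names as documented at api.conceptnet.io, so treat them as assumptions if your version differs):

import requests

# Query the edges ConceptNet knows about for the concept "photo".
data = requests.get("http://api.conceptnet.io/c/en/photo").json()
for edge in data["edges"][:10]:
    # Each edge links the concept to another one via a relation such as RelatedTo.
    print(edge["rel"]["label"], "->", edge["end"]["label"])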
The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this, you need to figure out how fine-grained the concepts should be. WordNet, as has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous, but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism, since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes which differentiate, e.g. (going up the hierarchy), humans from animals, animate beings from plants, substances from solids, concrete from abstract things, etc.
Another kind of taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea I came up with; there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a Wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (e.g. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do, this last step (choosing the best of several given categories) may or may not be difficult.
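If you would rather stay in Python than use the Java Wikipedia Library, the same idea can be sketched against the public MediaWiki API (endpoint and parameters as documented on mediawiki.org; the article title is just an example):

import requests

# Ask the MediaWiki API for the categories of the article "CSS".
params = {
    "action": "query",
    "titles": "CSS",
    "prop": "categories",
    "cllimit": "max",
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
for page in data["query"]["pages"].values():
    for category in page.get("categories", []):
        print(category["title"])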
See here for a list of other ontologies / knowledge bases you could use.