I need a spell checker in python.
I've looked at previous answers and they all seem to be outdated now or not applicable:
Python spell checker using a trie: this question is more about the data structure.
Python Spell Checker: this is a spelling corrector, given two strings.
http://norvig.com/spell-correct.html: often referenced and quite interesting, but it is also a spelling corrector, and its accuracy isn't quite good enough, though I'll probably use it in combination with a checker.
Spell Checker for Python: uses pyenchant, which isn't maintained anymore.
Python: check whether a word is spelled correctly: also suggests pyenchant, which isn't maintained.
Some details of what I need:
A function that accepts a string (word) and returns a boolean indicating whether the word is valid English or not. The unit test would expect True for an input of "car" and False for an input of "ijjk".
Accuracy needs to be above 90%, but it doesn't need to be much higher than that. I'm just using this to exclude words during preprocessing for document classification. Most of the errors will be picked up anyway as words that appear too seldom (though not all). Spell correcting won't work in all cases because a lot of the errors are OCR issues that are too far off to fix.
If it can deal with legal terms that would be a big plus. Otherwise I might need to manually add certain terms to the dictionary.
What's the best approach here? Are there any maintained libraries? Do I need to download a dictionary and check against it?
Two recent Python libraries, both based on Levenshtein minimum edit distance and optimized for the task:
symspellpy, released at the end of 2019, and
spello, released in 2020
It should be mentioned that the symspellpy link above is the Python port of the original SymSpell C# implementation; its description is here. The original SymSpell GitHub repository includes a dictionary with word frequencies.
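For the yes/no check described in the question, symspellpy can double as a plain dictionary lookup. A minimal sketch, assuming the English frequency dictionary bundled with the symspellpy package (the file name and loading details may vary with the version you install):
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# frequency_dictionary_en_82_765.txt ships with the symspellpy package
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def is_english_word(word):
    # edit distance 0 means only exact dictionary hits are accepted
    return bool(sym_spell.lookup(word, Verbosity.TOP, max_edit_distance=0))

print(is_english_word("car"))   # expected: True
print(is_english_word("ijjk"))  # expected: False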
Spello includes a basic model pre-trained on 30K news and 30K Wikipedia articles, but it's better to train it on a custom corpus from your domain.
If you need a simple per-word check, you just need a corpus of words (preferably matching your terminology): read it into a Python set and do a membership check for every word, one by one.
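A minimal sketch of that naive check, assuming a plain text word list with one word per line ("words.txt" here is a placeholder; on many Unix systems /usr/share/dict/words would do):
# Load the word list once into a set for O(1) membership checks
with open("words.txt", encoding="utf-8") as f:
    vocabulary = {line.strip().lower() for line in f if line.strip()}

def is_valid_word(word):
    return word.lower() in vocabulary

print(is_valid_word("car"))   # True if "car" is in your word list
print(is_valid_word("ijjk"))  # almost certainly False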
Once (or if) you hit issues with this naive implementation, you can drill down into the concrete problems.
You can use a dedicated spell-checking library in Python called enchant.
To check whether a word's spelling is correct, i.e. whether such a word exists in English, all you have to do is this:
import enchant
d = enchant.Dict("en_US")
d.check("scienc")
This will give an output:
False
The best part about this library is that it suggests the right spelling of words. For example:
d.suggest("scienc")
will give an output:
['science', 'scenic', 'sci enc', 'sci-enc', 'scientist']
There are more features in this library. For example, in the sample code above I used the US English dictionary ("en_US"). You can use other English dictionaries such as "en_AU" for Australian English, or "en_CA" and "en_GB" for Canadian and British English respectively, to name a few. Non-English languages are also supported, such as "fr_FR" for French!
For advanced usage, this library can be used to check words against a custom list of words (this feature will come in handy when you have a set of Proper Nouns). This is simply a file listing the words to be considered, one word per line. The following example creates a Dict object for the personal word list stored in “my_custom_words.txt”:
custom_d = enchant.request_pwl_dict("my_custom_words.txt")
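If you want the standard dictionary and your custom list (e.g. legal terms, as in the question) checked together, PyEnchant also provides DictWithPWL, which combines a language dictionary with a personal word list. A small sketch; "legal_terms.txt" is a hypothetical file with one term per line:
import enchant

combined = enchant.DictWithPWL("en_US", "legal_terms.txt")
print(combined.check("car"))          # True, from the en_US dictionary
print(combined.check("subrogation"))  # True if the term is in either en_US or legal_terms.txt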
To check out more features and other aspects of it, refer to:
http://pyenchant.github.io/pyenchant/
Related
I am trying to take a person's ailment, and return what they should do (from a predetermined set of "solutions").
For example,
person's ailment
My head is not bleeding
predetermined set of "solutions"
[take medicine, go to a doctor, call the doctor]
I know I need to first remove common words from the sentence (such as 'my' and 'is') but also preserve "common" words such as 'not,' which are crucial to the solution and important to the context.
Next, I'm pretty sure I'll need to take a set of processed inputs matched to outputs and train a model which will attempt to identify the "solution" for a given string.
Are there any other libraries I should be using (other than nltk, and scikit-learn)?
You should check out gensim, and you might want to look up these keywords for a start: tokenization, word stemming, lemmatization. Good luck!
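For the "remove common words but keep negations like 'not'" step mentioned in the question, an NLTK sketch along these lines may help; the whitelist of words to keep is your own choice:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download("punkt") and nltk.download("stopwords") may be needed first
keep = {"not", "no", "never"}                    # negations to preserve
stops = set(stopwords.words("english")) - keep

tokens = word_tokenize("My head is not bleeding")
filtered = [t for t in tokens if t.lower() not in stops]
print(filtered)   # expected: ['head', 'not', 'bleeding']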
How can I calculate the string similarity (semantic meaning) between two strings?
For example, if I have two strings like "Display" and "Screen", the string similarity should be close to 100%.
If I have "Display" and "Color", the string similarity should be close to 0%.
I'm writing my script in Python... My question is whether there is some library or framework to do this kind of thing... Alternatively, can someone suggest a good approach?
Based on your examples, I think you are looking for semantic similarity. You can do this, for instance, by using WordNet, but you will have to specify that you are working with nouns and possibly iterate over the different meanings of the word. The link shows two examples that calculate the similarity according to various implementations.
Most implementations are however computationally expensive: they make use of a large amount of text to calculate how often two words are close to each other, etc.
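As a rough sketch of the WordNet route using NLTK (picking only the first noun sense of each word is a simplification; as noted above, you would normally iterate over the different senses):
from nltk.corpus import wordnet as wn
# nltk.download("wordnet") may be needed first

display = wn.synsets("display", pos=wn.NOUN)[0]
screen = wn.synsets("screen", pos=wn.NOUN)[0]
color = wn.synsets("color", pos=wn.NOUN)[0]

# Wu-Palmer similarity returns a score between 0 and 1
print(display.wup_similarity(screen))
print(display.wup_similarity(color))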
What you're looking to solve is an NLP problem, which can be a hassle if you're not familiar with the field. The most popular library out there is NLTK, which has a lot of AI tools. A quick Google search for what you're looking for yields the chapter on the logic of semantics: http://www.nltk.org/book/ch10.html
This is a computationally heavy process, since it involves loading a dictionary of the entire English language. If you have a small subset of examples, you might be better off creating a mapping yourself.
I am not good at NLP, but I think the Levenshtein distance algorithm can help you solve this problem. I have used it to calculate the similarity between two strings, and the performance is not bad. My implementation is in C++, but if you understand dynamic programming you can port it to Python easily.
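For reference, a plain Python version of the classic dynamic-programming edit distance looks roughly like this:
def levenshtein(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

# smaller distance means more similar spelling (not meaning)
print(levenshtein("Display", "Screen"))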
Have a look at the following libraries:
https://simlibrary.wordpress.com/
http://sourceforge.net/projects/semantics/
http://radimrehurek.com/gensim/
Check out word2vec as implemented in the Gensim library. One of its features is to compute word similarity.
https://radimrehurek.com/gensim/models/word2vec.html
More details and demos can be found here.
I believe this is the state of the art right now.
As another user suggested, the Gensim library can do this using the word2vec technique. The below example is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download word vectors pretrained on Google News, and load them up:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. From your examples:
word_vectors.similarity('display', 'color')
>>> 0.3068566
word_vectors.similarity('display', 'screen')
>>> 0.32314363
Compare those resulting numbers and you will see the words display and screen are more similar than display and color are.
I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, generate all valid sentences from that grammar, then select a random one.
Load in corpora to train ngrams, which then can be used to construct a sentence.
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm getting it right, and if the purpose is to test yourself on the vocabulary you have already learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use ngrams. They stick together words at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
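If you do go the grammar route, NLTK can enumerate sentences from a context-free grammar. A toy sketch; a realistic grammar and lexicon would be far larger:
import random
from nltk import CFG
from nltk.parse.generate import generate

# a deliberately tiny, non-recursive grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'ball' | 'student'
V -> 'sees' | 'throws'
""")

sentences = [" ".join(words) for words in generate(grammar)]
print(random.choice(sentences))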
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There's any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
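A minimal sketch of that template idea; the word lists and sentence frames here are purely illustrative:
import random

nouns = ["dog", "teacher", "ball", "book"]
verbs = ["see", "find", "throw", "read"]          # base forms, regular verbs
templates = [
    "The {noun1} will {verb} the {noun2}.",
    "A {noun1} never {verb}s a {noun2}.",
]

def random_sentence():
    noun1, noun2 = random.sample(nouns, 2)        # two different nouns
    return random.choice(templates).format(
        noun1=noun1, verb=random.choice(verbs), noun2=noun2)

print(random_sentence())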
I am asking a related question here, but this question is more general. I have taken a large corpus and annotated some words with their named entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named entities. For example, the following is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there anyway I can train a classifier to recognize new named-entities? For instance, given a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of 2 approaches.
First is pattern matching over words in sentence. Something like this (pseudocode, though it is similar to NLTK chunk parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These two patterns can (roughly) catch the two parts of your first sentence. This is a good choice if you don't have too many kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend or modify.
Another approach is to mine dependencies between words in a sentence, for example with the Stanford Dependency Parser. Among other things, it lets you extract the subject, object and predicate, which seems very close to what you want: in your first sentence "robot" is the subject, "had" is the predicate and "glitch" is the object.
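As a rough illustration of that dependency idea, here is a sketch with spaCy (a different parser, used here only to show the kind of output you get):
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("When the robot had a technical glitch, the object was thrown "
          "but was later caught by another robot.")

for token in doc:
    if token.dep_ in ("nsubj", "nsubjpass", "dobj", "pobj"):
        # print each noun together with its grammatical role and head word
        print(token.text, token.dep_, "->", token.head.text)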
You could try object role modeling at http://www.ormfoundation.com/ which looks at the semantics (facts) between one or more entities or names and their relationships with other objects. There are also tools to convert ORM models into XML and other languages and vice versa. See http://orm.sourceforge.net/
I want to parse a text and categorize the sentences according to their grammatical structure, but I have a very small understanding of NLP so I don't even know where to start.
As far as I have read, I need to parse the text and find out (or tag?) the part-of-speech of every word. Then I search for the verb clause or whatever other defining characteristic I want to use to categorize the sentences.
What I don't know is if there is already some method to do this more easily or if I need to define the grammar rules separately or what.
Any resources on NLP that discuss this would be great. Program examples are welcome as well. I have used NLTK before, but not extensively. Other parsers or languages are OK too!
The Python Natural Language Toolkit (NLTK) is a library suitable for this kind of work. As with any NLP library, you will have to download the training datasets separately; the corpora (data) and scripts for training are available too.
There are also example tutorials that will help you identify the parts of speech of words. By all means, I think nltk.org should be the place to go for what you are looking for.
Specific questions could be posted here again.
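As a concrete starting point, tokenizing and POS-tagging a sentence with NLTK takes only a couple of lines (the required models have to be downloaded once; exact resource names vary slightly between NLTK versions):
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]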
Maybe you simply need to define patterns like "noun verb noun" etc. for each type of grammatical structure and search for matches in the part-of-speech tagger's output sequence.
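One simple way to realize such patterns is NLTK's RegexpParser, which chunks a POS-tagged sentence using regular expressions over the tags. A rough sketch; the pattern and sentence are only illustrative:
import nltk
# requires the punkt tokenizer and averaged_perceptron_tagger models

# "noun (verb) (optional determiner) noun" as a chunk pattern over POS tags
grammar = "NVN: {<NN.*><VB.*><DT>?<NN.*>}"
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The dog chased the cat."))
print(chunker.parse(tagged))   # should mark "dog chased the cat" as an NVN chunk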