Spell Checker for Python

I'm fairly new to Python and NLTK. I am working on an application that performs spell checks (replacing an incorrectly spelled word with the correct one).
I'm currently using Python 2.7 with PyEnchant (a binding to the Enchant library) and NLTK. The code below is a class that handles the correction/replacement.
import enchant
from nltk.metrics import edit_distance

class SpellingReplacer:
    def __init__(self, dict_name='en_GB', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word
I have written a function that takes in a list of words and executes replace() on each word and then returns a list of those words, but spelled correctly.
def spell_check(word_list):
    checked_list = []
    replacer = SpellingReplacer()  # one replacer is enough; reuse it for every word
    for item in word_list:
        checked_list.append(replacer.replace(item))
    return checked_list
>>> word_list = ['car', 'colour']
>>> spell_check(word_list)
['car', 'color']
Now, I don't really like this because it isn't very accurate, and I'm looking for a better way to achieve spelling checks and replacements on words. I also need something that can pick up spelling mistakes like "caaaar". Are there better ways to perform spelling checks out there? If so, what are they? How does Google do it? Their spelling suggester is very good.
Any suggestions?

You can use the autocorrect library to spell-check in Python.
Example Usage:
from autocorrect import Speller
spell = Speller(lang='en')
print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))
Result:
caesar
message
service
the

I'd recommend starting by carefully reading this post by Peter Norvig. (I had to do something similar and I found it extremely useful.)
The following function, in particular, contains the ideas you need to make your spell checker more sophisticated: splitting the word, then deleting, transposing, replacing, and inserting characters to generate candidate 'corrections'.
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
Note: The above is one snippet from Norvig's spelling corrector
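For context, here is a condensed sketch of how edits1 feeds into picking a correction, following the same post (it assumes a plain-text corpus such as Norvig's big.txt for estimating word frequencies):
import re
from collections import Counter

# Word frequencies estimated from a large plain-text corpus.
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def known(words):
    # Keep only candidates that actually occur in the corpus.
    return set(w for w in words if w in WORDS)

def correction(word):
    # Prefer the word itself if known, then candidates one edit away,
    # and fall back to the input unchanged.
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])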
And the good news is that you can incrementally add to and keep improving your spell-checker.
Hope that helps.

The best approaches for spell checking in Python are SymSpell, a BK-tree, or Peter Norvig's method.
The fastest one is SymSpell.
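Since the two methods below cover the Norvig-style and SymSpell approaches, here is also a minimal BK-tree sketch for completeness. This is an illustration only (it assumes nltk's edit_distance as the metric), not a production implementation:
from nltk.metrics import edit_distance

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        # Collect all stored words within max_dist edits of `word`.
        results, stack = [], [self.root]
        while stack:
            node_word, children = stack.pop()
            d = edit_distance(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            # Triangle inequality: only subtrees keyed by a distance in
            # [d - max_dist, d + max_dist] can contain matches.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(['car', 'care', 'card', 'cart', 'cat', 'color', 'colour'])
print(tree.search('caar', 1))  # [(1, 'car')]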
Method 1: pyspellchecker (reference link: pyspellchecker)
This library is based on Peter Norvig's implementation.
pip install pyspellchecker
from spellchecker import SpellChecker
spell = SpellChecker()
# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])
for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))
Method 2: symspellpy
pip install -U symspellpy
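A minimal usage sketch, based on the symspellpy documentation (it assumes the frequency dictionary that ships with the package):
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# Look up suggestions within edit distance 2 of the input term
for suggestion in sym_spell.lookup("hapenning", Verbosity.CLOSEST, max_edit_distance=2):
    print(suggestion.term, suggestion.distance, suggestion.count)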

Maybe it is too late, but I am answering for future searches.
To perform spelling-mistake correction, you first need to make sure the word is not absurd or slang, like caaaar or amazzzing, with repeated letters. English words usually have at most two repeated letters (e.g. hello), so we first remove the extra repetitions from the word and only then check its spelling.
For removing the extra letters, you can use Python's re (regular expression) module.
Once this is done, use the pyspellchecker library to correct the spelling.
For implementation visit this link: https://rustyonrampage.github.io/text-mining/2017/11/28/spelling-correction-with-python-and-nltk.html
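For reference, here is a minimal sketch of the idea (the repeat-collapsing regex is my own illustration; it assumes pyspellchecker is installed):
import re
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_word(word):
    # Collapse runs of 3+ identical letters down to 2 ("caaaar" -> "caar"),
    # since English words rarely repeat a letter more than twice.
    collapsed = re.sub(r'(\w)\1{2,}', r'\1\1', word)
    # If the result is still unknown, try collapsing repeats to a single letter.
    if spell.unknown([collapsed]):
        collapsed = re.sub(r'(\w)\1+', r'\1', word)
    return spell.correction(collapsed)

print(correct_word('caaaar'))     # car
print(correct_word('amazzzing'))  # amazing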

Try jamspell - it works pretty well for automatic spelling correction:
import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')
corrector.FixFragment('Some sentnec with error')
# u'Some sentence with error'
corrector.GetCandidates(['Some', 'sentnec', 'with', 'error'], 1)
# ('sentence', 'senate', 'scented', 'sentinel')

In the terminal:
pip install gingerit
Code:
from gingerit.gingerit import GingerIt

text = input("Enter text to be corrected: ")
result = GingerIt().parse(text)
corrections = result['corrections']
correct_text = result['result']

print("Correct Text:", correct_text)
print()
print("CORRECTIONS")
for d in corrections:
    print("________________")
    print("Previous:", d['text'])
    print("Correction:", d['correct'])
    print("Definition:", d['definition'])

You can also try:
pip install textblob
from textblob import TextBlob

txt = "machne learnig"
b = TextBlob(txt)
print("after spell correction: " + str(b.correct()))
Output:
after spell correction: machine learning

Spell corrector:
You need a corpus (dictionary file) on your desktop; if you store it elsewhere, change the path in the code. I have added a bit of GUI using Tkinter, and this only tackles non-word errors!
def min_edit_dist(word1, word2):
    len_1 = len(word1)
    len_2 = len(word2)
    # the matrix whose last element is the edit distance
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    for i in range(0, len_1 + 1):
        # initialization of base case values
        x[i][0] = i
    for j in range(0, len_2 + 1):
        x[0][j] = j
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    return x[len_1][len_2]
from Tkinter import *  # on Python 3 this would be "from tkinter import *"

def retrieve_text():
    word1 = app_entry.get()
    path = r"C:\Documents and Settings\Owner\Desktop\Dictionary.txt"
    ffile = open(path, 'r')
    # strip newlines so the edit distance is not inflated by one
    lines = [line.strip() for line in ffile.readlines()]
    ffile.close()
    print "Suggestions coming right up, count till 10"
    for entry in lines:
        if min_edit_dist(word1, entry) <= 2:
            print entry
            print " "

if __name__ == "__main__":
    app_win = Tk()
    app_win.title("spell")
    app_label = Label(app_win, text="Enter the incorrect word")
    app_label.pack()
    app_entry = Entry(app_win)
    app_entry.pack()
    app_button = Button(app_win, text="Get Suggestions", command=retrieve_text)
    app_button.pack()
    # Start the GUI event loop
    app_win.mainloop()

pyspellchecker is one of the best solutions for this problem. The pyspellchecker library is based on Peter Norvig's blog post.
It uses a Levenshtein distance algorithm to find permutations within an edit distance of 2 from the original word.
There are two ways to install this library; the official documentation recommends using the pipenv package.
Install using pip:
pip install pyspellchecker
Install from source:
git clone https://github.com/barrust/pyspellchecker.git
cd pyspellchecker
python setup.py install
The following code is the example provided in the documentation:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))

You can use the autocorrect package's spell function. You need to install the package first (Anaconda works too), and it only works on single words, not sentences, so that's a limitation you're going to face.
from autocorrect import spell
print(spell('intrerpreter'))
# output: interpreter

pip install scuse
from scuse import scuse
obj = scuse()
checkedspell = obj.wordf("spelling you want to check")
print(checkedspell)

Spark NLP is another option that I used, and it works excellently. A simple tutorial can be found here: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/spell-check-ml-pipeline/Pretrained-SpellCheckML-Pipeline.ipynb
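A rough sketch of the pretrained-pipeline pattern Spark NLP uses (this assumes spark-nlp and pyspark are installed; the 'check_spelling' pipeline name and the 'checked' output key are taken from the John Snow Labs model hub and may vary between releases):
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a local Spark session with Spark NLP loaded
pipeline = PretrainedPipeline('check_spelling', lang='en')
result = pipeline.annotate('somtimes i wrrite wordz erong')
print(result['checked'])  # the spell-checked tokens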

Related

Spell Correction with Python (pyspellchecker)

I want to build a spell corrector using Python, and I'm trying pyspellchecker because I need to build my own dictionary, and I think pyspellchecker is easy to use with a custom model or dictionary. My problem is: how do I load my words and have them returned with case_sensitive on?
I have tried this:
spell = SpellChecker(language=None, case_sensitive=True)
but when I load my file, which contains text like 'Hello', with this code:
spell.word_frequency.load_text_file('myfile.txt')
and then call spell.correction('Hello'), it returns 'hello' (lower case).
Do you know how to build a custom model or dictionary where the letters are not lower-cased, i.e. where uppercase is preserved?
Or, if you have a recommendation for spell checking with a custom model, please let me know. Thank you!
Try this:
from spellchecker import SpellChecker

spell = SpellChecker(language=None, case_sensitive=True)
spell.word_frequency.load_words(["Hello", "HELLO", "I", "AM", "Alok", "Mishra"])

# find those words that may be misspelled
misspelled = spell.unknown(["helo", "Alk", "Mishr"])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))
Output:
Alok
{'Alok'}
Hello
{'Hello'}
Mishra
{'Mishra'}

Using PyDictionary to check if a word exists

I'm very new to the PyDictionary library and have had some trouble finding proper documentation for it. So, I've come here to ask:
A) Does anybody know how to check whether a word (in English) exists, using PyDictionary?
B) Does anybody know of more complete documentation for PyDictionary?
If you read the code here, in theory there is this:
meaning(term, disable_errors=False)
so you should be able to pass True to avoid printing the error when the word is not in the dictionary. I tried, but I guess the version I installed via pip does not contain that code...
To further expound on what @Daniele stated, you can just pass True to the meaning function. If the word is not found in the dictionary, the function returns None.
from PyDictionary import PyDictionary

def check_if_word_in_dictionary(word):
    dictionary = PyDictionary()
    if dictionary.meaning(word, True) is None:
        print(f"It appears '{word}' is NOT a word found in the dictionary.")
    else:
        print(f"You're in luck, '{word}' IS found in the dictionary!")

check_if_word_in_dictionary("fingle")
Output: It appears 'fingle' is NOT a word found in the dictionary.
check_if_word_in_dictionary("finger")
Output: You're in luck, 'finger' IS found in the dictionary!

Determine if text is in English?

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether a string is in English or not. Is this functionality not offered in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?
I'm using Python, so libraries in Python would be preferable, but I can switch languages if needed; I just thought Python would be the best for this.
There is a library called langdetect. It is ported from Google's language-detection, available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
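For example, a quick check on the sample sentences (assuming langdetect is installed via pip):
from langdetect import detect

print(detect("this is some text written in English"))  # en
print(detect("Ce n'est pas en anglais"))               # fr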
You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
CLD-2 is pretty good and extremely fast
lang-detect is a tiny bit better, but much slower
langid is good, but CLD-2 and lang-detect are much better
NLTK's Textcat is neither efficient nor effective.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
Pretrained Fast Text Model Worked Best For My Similar Needs
I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.
After experimenting to find what worked best for my needs, which were making sure 60,000+ text files were in English, I found that fasttext was an excellent tool.
With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.
import fasttext

class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        # that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
        # fasttext doesn't like newline characters, but it can take
        # an array of lines from a file. The two list comprehensions
        # below just clean up the lines in fla
        fla = [line.rstrip('\n').strip(' ') for line in fla]
        fla = [line for line in fla if len(line) > 0]
        for line in fla:  # Language-predict each line of the file
            language_tuple = self.model.predict(line)
            # The next two lines simply get at the top language prediction
            # string AND the confidence value for that prediction.
            prediction = language_tuple[0][0].replace('__label__', '')
            value = language_tuple[1][0]
            # Each top language prediction for the lines in the file
            # becomes a unique key for the this_D dictionary.
            # Every time that language is found, add the confidence
            # score to the running tally for that language.
            if prediction not in this_D.keys():
                this_D[prediction] = 0
            this_D[prediction] += value
        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)
        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate a relative confidence of the max confidence to all
        # confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]
        # Only want to know if this is English or not.
        return max_key == 'en'
Below is the application / instantiation and use of the above class for my needs.
file_list = []  # some tool to get my specific list of files to check for English
en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
This is what I've used some time ago.
It works for texts longer than 3 words and with less than 3 non-recognized words.
Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.
from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length
print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Use the enchant library
import enchant

dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
dictionary.check("Hello")  # returns True
dictionary.check("Helo")   # returns False
This example is taken directly from their website
If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
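As a toy illustration of the approach (this is not the linked recipe itself; english_sample.txt is a hypothetical reference corpus):
import math
from collections import Counter

def trigram_profile(text):
    # Keep letters and spaces, lowercase, then count all 3-character windows.
    text = ''.join(ch.lower() for ch in text if ch.isalpha() or ch == ' ')
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

english = trigram_profile(open('english_sample.txt').read())
score = cosine_similarity(trigram_profile("this is some text written in English"), english)
print(score)  # compare against a threshold tuned on your own documents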
import enchant

def check(text):
    # True only if every whitespace-separated word passes the dictionary check.
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    return all(dictionary.check(word) for word in text.split())

How to segment text into sub-sentences based on enumerators?

I am segmenting sentences from a text in Python using NLTK's PunktSentenceTokenizer(). However, many long sentences appear in an enumerated way, and in such cases I need to get the sub-sentences.
Example:
The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.
The required output would be :
"The api allows the user to achieve following goals aXXXXX. ", "The api allows the user to achieve following goals bXXXXX." and "The api allows the user to achieve following goals cXXXXX. "
How can I achieve this goal?
To get the sub-sequences you could use a RegExp tokenizer.
An example of how to use it to split the sentence could look like this:
from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)
start_of_sentence = parts.pop(0)
for part in parts:
    print(" ".join((start_of_sentence, part)))
I'll just skip over the obvious question (being: "What have you tried so far?"). As you may have found out already, PunktSentenceTokenizer isn't really going to help you here, since it will leave your input sentence in one piece.
The best solution depends heavily on the predictability of your input. The following will work on your example, but as you can see it relies on there being a colon and some commas. If they're not there, it's not going to help you.
import re
from nltk import PunktSentenceTokenizer

s = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
# sents = PunktSentenceTokenizer().tokenize(s)  # leaves the sentence in one piece
p = s.split(':')
for l in p[1:]:
    i = l.split(',')
    for j in i:
        j = re.sub(r'\([a-z]\)', '', j).strip()
        print("%s: %s" % (p[0], j))

Python - English translator

What is the best way to approach writing a program in Python to translate English words and/or phrases into other languages?
AJAX Language API
This is an incredibly difficult problem -- language is very, very complicated. Think about all the things you'd have to do: parse the phrase, work out what the words mean, and translate them. A literal translation probably won't be idiomatic, so you'll need special cases for different syntaxes. Many, many special cases. You'll need to work out where the syntax of the foreign language differs from English -- "the big green ball" becomes "the ball big green" in Spanish, for instance.
Don't reinvent the wheel. Google provides an API to their translation service, which has undoubtedly had many clever people thinking really quite hard about it.
I think you should look into the Google Translate API. Here is a library implemented specifically for this purpose in python.
The simplest way to do this is to make a dictionary that maps one language's words to another language's words. However, this is extremely naive: it would not take grammar into account at all, and it would take a very long time to create a translator this way, especially if you plan to support multiple languages. If grammar is not important to you (for example, if you were creating your own language for a game or story whose grammar doesn't differ from English), then you could get away with dictionaries and a function that simply looks up each requested word, along the lines of the toy sketch below.
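A toy word-for-word "translator" that makes the limitation concrete (the tiny vocabulary here is invented for illustration):
# Word-for-word lookup; ignores grammar and word order entirely.
en_to_es = {'the': 'el', 'big': 'grande', 'green': 'verde', 'ball': 'pelota'}

def word_for_word(sentence):
    # Fall back to the original word when there is no dictionary entry.
    return ' '.join(en_to_es.get(w.lower(), w) for w in sentence.split())

print(word_for_word('The big green ball'))  # 'el grande verde pelota' -- wrong order for real Spanish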
# command: pip install mtranslate
>>> from mtranslate import translate
>>> translate("Translating to kannada language (my mother tongue) ", to_language="kn")
'ಕನ್ನಡ ಭಾಷೆಗೆ ಅನುವಾದ (ನನ್ನ ಮಾತೃಭಾಷೆ)'
You can use the Goslate package for that; it's very easy to use.
Example:
import goslate

print(goslate.Goslate().translate('hello world', 'ar'))
The first argument is the text you want to translate, and the second is the language code you want to translate it into.
I hope you find the answer useful.
# Install the translate package first: pip install translate
from translate import Translator

class clsTranslate():
    def translateText(self, strString, strTolang):
        self.strString = strString
        self.strTolang = strTolang
        translator = Translator(to_lang=self.strTolang)
        translation = translator.translate(self.strString)
        return str(translation)

# Create a class object and call the translate function.
# Pass the language as a parameter to the function (de: German, zh: Chinese, etc.)
objTrans = clsTranslate()
strTranslatedText = objTrans.translateText('How are you', 'de')
print(strTranslatedText)
It's very, very easy if you use deep-translator! Here's the source code (make sure to install the deep-translator module first):
from deep_translator import GoogleTranslator
import time

def start():
    while True:
        line_to_translate = input('Which line/phrase/word do you want to translate?\n')
        to_lang = input('Into which language do you want to translate it?\n').lower()
        translation = GoogleTranslator(source='auto', target=to_lang).translate(text=line_to_translate)
        print(translation)
        time.sleep(1)
        esc = input("Enter 'q' to exit and 'r' to restart.\n")
        while esc.lower() not in {'q', 'r'}:
            print('Please enter a valid option!!')
            time.sleep(1)
            esc = input("Enter 'q' to exit and 'r' to restart.\n")
        if esc.lower() == 'q':
            return

start()
