Text Spinner using Naive Bayes - python

I am writing a text spinner which is working as it should, but the readability of the output sentences is low because it just uses a synonym dictionary that I fetch from a database, which returns spintax like this:
{Your} {home| house| residence| property} {is} {your} {castle| mansion| fortress| palace}
This is passed to a function that randomly selects a synonym for each slot and outputs a sentence based on the user's original input. For example, the input:
Your home is your castle.
will return
Your property is your mansion.
Now I want to bring in some artificial intelligence to make my output sentences more readable. Specifically, I want to know how to make a better word selection using naive Bayes. I know I will probably need to train a model to get better results.
Here is my current word-selection method, which is really simple right now:
import re
import random

def spin(spintax):
    while True:
        word, n = re.subn('{([^{}]*)}', lambda m: random.choice(m.group(1).split("|")), spintax)
        if n == 0:
            break
        spintax = word  # keep substituting until no {...} groups remain
    return word.strip()
Thank you in advance; if you need me to post more code, let me know.

This will probably get closed, as there is no concise answer to your question, but you might want to check out NLTK's WordNet interface:
https://pythonprogramming.net/wordnet-nltk-tutorial/
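For example, a quick sketch of pulling candidate synonyms for a word out of WordNet (this assumes the wordnet corpus has already been downloaded with nltk.download('wordnet')):
from nltk.corpus import wordnet

def synonyms(word):
    results = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            results.add(lemma.name().replace('_', ' '))
    return results

print(synonyms("home"))  # e.g. {'home', 'house', 'dwelling', 'domicile', ...}
You could then rank those candidates instead of picking one at random.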

Maybe you could download the dataset Google collected from all English books and generate sentences using n-grams? https://books.google.com/ngrams
One way to implement this is with a Markov chain, where the downloaded data provides the probabilities for choosing the next word.
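A minimal sketch of that idea, with a toy bigram table and a spin_with_bigrams helper standing in for the real Google ngram counts (both are illustrative, not actual data or library code):
import re

# Toy bigram counts; in practice these would come from the downloaded ngram data.
bigram_counts = {
    ('your', 'home'): 300, ('your', 'house'): 120,
    ('your', 'property'): 80, ('your', 'castle'): 60,
    ('your', 'mansion'): 20, ('your', 'fortress'): 5,
}

def spin_with_bigrams(spintax, counts):
    """Walk the spintax left to right and pick, at each {a|b|c} choice,
    the option with the highest count given the previously emitted word."""
    tokens = re.findall(r'\{([^{}]*)\}', spintax)
    words = []
    for token in tokens:
        options = [w.strip() for w in token.split('|')]
        prev = words[-1].lower() if words else None
        words.append(max(options, key=lambda w: counts.get((prev, w.lower()), 0)))
    return ' '.join(words)

print(spin_with_bigrams(
    '{Your} {home| house| residence| property} {is} {your} '
    '{castle| mansion| fortress| palace}', bigram_counts))
# -> 'Your home is your castle'
With real counts you could also normalise to probabilities, but for picking the best option the raw counts are enough.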

Related

How to get the list of matched featured names along with the predict_prob in CalibratedClassifierCV?

I am trying to find the profanity score of a given text received in chats.
For this, I went through a couple of Python libraries and found some relevant ones:
profanity-check
alt-profanity-check -- (currently using)
profanity-filter
detoxify
Now, the one I am using (profanity-check) gives me proper results when using
predict and predict_prob against the calibrated classifier used under the hood after training.
The problem is that I am unable to identify the words that were used to make the prediction or calculate the probability; in short, the list of feature names (profane words) present in the test data passed as input.
I know there are no methods that return this, but I would like to fork and use the library.
I wanted to understand if we can add something at this place (edit) to create a method for it.
e.g.
text = ["this is crap"]
predict([text])       -> array([1])
predict_prob([text])  -> array([0.99868968])
predict_words([text]) -> array(["crap"])   # <-- NEED THIS
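For what it's worth, here is a hedged sketch of the general idea in scikit-learn terms. The vectorizer and the predict_words helper below are illustrative, not part of profanity-check's API, and this only lists the vocabulary terms present in the text rather than weighting them by the classifier's coefficients:
from sklearn.feature_extraction.text import CountVectorizer

# Toy vectorizer standing in for the one the library fits during training.
vectorizer = CountVectorizer()
vectorizer.fit(["you are nice", "this is crap", "have a good day", "what a load of crap"])

def predict_words(texts):
    """Return, for each input text, the vocabulary terms it contains."""
    rows = vectorizer.transform(texts)
    feature_names = vectorizer.get_feature_names_out().tolist()
    return [[feature_names[i] for i in row.nonzero()[1]] for row in rows]

print(predict_words(["this is crap"]))  # e.g. [['crap', 'is', 'this']]
scikit-learn's CountVectorizer.inverse_transform does essentially the same mapping, if the vectorizer the library fits is exposed after forking.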

How to find the root of a word from its present participle or other variations in Python?

I'm working on an NLP project, and right now I'm stuck on detecting antonyms for phrases that aren't in their "standard" forms (verbs, adjectives, nouns) but are instead present participles, past-tense forms, or something to that effect. For instance, if I have the phrase "arriving" or "arrived", I need to convert it to "arrive". Similarly, "came" should become "come". Lastly, "dissatisfied" should be "dissatisfy". Can anyone help me out with this? I have tried several stemmers and lemmatizers in NLTK with Python, to no avail; most of them don't produce the correct root. I've also thought about the ConceptNet semantic network and other dictionary APIs, but it seems far too complicated for what I need. Any advice is helpful. Thanks!
If you know you'll be working with a limited set, you could create a dictionary.
Example:
look_up = {'arriving': 'arrive',
           'arrived': 'arrive',
           'came': 'come',
           'dissatisfied': 'dissatisfy'}
test = 'arrived'
print(look_up[test])
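If a word isn't in the table, dict.get lets you fall back to the original word instead of raising a KeyError:
print(look_up.get('arrived', 'arrived'))  # 'arrive'
print(look_up.get('walking', 'walking'))  # not in the table, returned unchanged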

What are the best algorithms to determine the language of text and to correct typos in python?

I am looking for algorithms that can tell me the language of a text (e.g. Hello - English, Bonjour - French, Servicio - Spanish) and also correct typos in English words. I have already explored Google's TextBlob; it is very relevant, but it throws a "Too many requests" error as soon as my code starts executing. I also started exploring Polyglot, but I am facing a lot of issues installing the library on Windows.
Code for TextBlob
import pandas as pd
from tkinter import filedialog
from textblob import TextBlob
import time

colnames = ['Word']
x = filedialog.askopenfilename(title='Select the word list')
print("Data to be checked: " + x)
df = pd.read_excel(x, sheet_name='Sheet1', header=0, names=colnames, na_values='?', dtype=str)
words = df['Word']
i = 0
Language_detector = pd.DataFrame(columns=['Word', 'Language', 'corrected_word', 'translated_word'])
for word in words:
    b = TextBlob(word)
    language_word = b.detect_language()
    time.sleep(0.5)
    if language_word in ['en', 'EN']:
        corrected_word = b.correct()
        time.sleep(0.5)
        Language_detector.loc[i, ['corrected_word']] = corrected_word
    else:
        translated_word = b.translate(to='en')
        time.sleep(0.5)
        Language_detector.loc[i, ['Word']] = word
        Language_detector.loc[i, ['Language']] = language_word
        Language_detector.loc[i, ['translated_word']] = translated_word
    i = i + 1
filename = "Language detector test v 1.xlsx"
Language_detector.to_excel(filename, sheet_name='Sheet1')
print("Languages identified for the word list")
A common way to classify languages is to gather summary statistics on letter or word frequencies and compare them to a known corpus. A naive Bayesian classifier would suffice. See https://pypi.org/project/Reverend/ for a way to do this in Python.
Correction of typos can also be done from a corpus, using a statistical model of the most likely words versus the likelihood of a particular typo. See https://norvig.com/spell-correct.html for an example of how to do this in Python.
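As a rough illustration of the letter-frequency idea (this is not the Reverend library's API, and the sample texts below are tiny stand-ins for real corpora):
import math
from collections import Counter

samples = {   # toy per-language samples; a real system would use much larger corpora
    'english': "hello this is some english text about houses and castles",
    'french':  "bonjour ceci est un texte en francais sur les maisons",
    'spanish': "hola este es un texto en espanol sobre casas y castillos",
}

def letter_probs(text):
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

models = {lang: letter_probs(text) for lang, text in samples.items()}

def classify(text):
    def score(lang):
        # Sum of log-probabilities; letters unseen in the sample get a small floor value.
        return sum(math.log(models[lang].get(c, 1e-6))
                   for c in text.lower() if c.isalpha())
    return max(models, key=score)

print(classify("Bonjour tout le monde"))  # picks the best-scoring language model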
You could use this, but it is hardly reliable:
https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb
Alternatively, you could give Compact Language Detector (CLD v3) or fastText a chance, or you could use a corpus to compare word frequencies against the target text in order to find out whether the target text belongs to the language of that corpus. The latter is only possible if you know the set of languages to choose from.
For typo correction, you could use the Levenshtein algorithm, which computes an edit distance. You can compare your words against a dictionary and choose the most likely word. For Python, you could use: https://pypi.org/project/python-Levenshtein/
See the concept of Levenshtein edit distance here: https://en.wikipedia.org/wiki/Levenshtein_distance
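A minimal sketch of that approach, assuming python-Levenshtein is installed and you already have an English word list (the toy list below is just for illustration):
import Levenshtein

word_list = ["hello", "help", "world", "service", "bonjour"]  # toy word list

def correct(word):
    # Pick the dictionary word with the smallest edit distance to the input.
    return min(word_list, key=lambda candidate: Levenshtein.distance(word, candidate))

print(correct("servce"))  # 'service' (one insertion away)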

Vectorized form of cleaning function for NLP

I made the following function to clean the text notes of my dataset:
import spacy
nlp = spacy.load("en")

def clean(text):
    """
    Text preprocessing for English text
    """
    # Apply spaCy to the text
    doc = nlp(text)
    # Lemmatization, removal of noise (stopwords, digits, punctuation and single characters)
    tokens = [token.lemma_.strip() for token in doc if
              not token.is_stop and not nlp.vocab[token.lemma_].is_stop  # remove stopwords
              and not token.is_punct   # remove punctuation
              and not token.is_digit   # remove digits
              ]
    # Recreation of the text
    text = " ".join(tokens)
    return text.lower()
The problem is that when I want to clean all the text in my dataset, it takes hours and hours (the dataset has 70k rows with between 100 and 5000 words per row).
I tried to use swifter to run the apply method on multiple threads, like this: data.note_line_comment.swifter.apply(clean)
But it didn't really help, as it still took almost an hour.
I was wondering if there is any way to make a vectorized form of my function, or some other way to speed up the process. Any ideas?
Short answer
This type of problem inherently takes time.
Long answer
Use regular expressions
Change the spacy pipeline
The more information about the strings you need to make a decision, the longer it will take.
Good news is, if your cleaning of the text is relatively simplified, a few regular expressions might do the trick.
Otherwise you are using the spacy pipeline to help remove bits of text which is costly since it does many things by default:
Tokenisation
Lemmatisation
Dependency parsing
NER
Chunking
Alternatively, you can try your task again and turn off the parts of the spaCy pipeline you don't need, which may speed it up quite a bit.
For example, maybe turn off named entity recognition, tagging and dependency parsing...
nlp = spacy.load("en", disable=["parser", "tagger", "ner"])
Then try again; it should speed things up.
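If it helps, here is a rough sketch of the question's cleaning step with those components disabled and the rows streamed through nlp.pipe (note that lemma quality can drop when the tagger is off; the column name is taken from the question):
import spacy

# spaCy 2.x style loading, as in the question; heavy components disabled.
nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

def clean_all(texts, batch_size=1000):
    cleaned = []
    # nlp.pipe streams the texts in batches, which is much faster than
    # calling nlp(text) once per row.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        tokens = [token.lemma_.strip() for token in doc
                  if not token.is_stop and not token.is_punct and not token.is_digit]
        cleaned.append(" ".join(tokens).lower())
    return cleaned

# e.g. cleaned_notes = clean_all(data.note_line_comment.astype(str).tolist())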

Determine if text is in English?

I am using both Nltk and Scikit Learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that lets me recognize whether strings are in English or not. Is this something that is not offered as functionality in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?
I'm using Python, so libraries in Python would be preferable, but I can switch languages if needed; I just thought Python would be the best for this.
There is a library called langdetect. It is ported from Google's language-detection available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
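A minimal usage example (langdetect is randomized internally, so fixing the seed keeps results stable):
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic across runs

print(detect("this is some text written in English"))  # 'en'
print(detect("Ce n'est pas en anglais"))               # 'fr'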
You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
CLD-2 is pretty good and extremely fast
lang-detect is a tiny bit better, but much slower
langid is good, but CLD-2 and lang-detect are much better
NLTK's Textcat is neither efficient nor effective.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
Pretrained Fast Text Model Worked Best For My Similar Needs
I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.
After experimenting to find what worked best for my needs, which were making sure 60,000+ text files were in English, I found that fastText was an excellent tool.
With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.
import fasttext

class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        # that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
        # fasttext doesn't like newline characters, but it can take
        # an array of lines from a file. The two list comprehensions
        # below just clean up the lines in fla.
        fla = [line.rstrip('\n').strip(' ') for line in fla]
        fla = [line for line in fla if len(line) > 0]
        for line in fla:  # Language predict each line of the file
            language_tuple = self.model.predict(line)
            # The next two lines simply get at the top language prediction
            # string AND the confidence value for that prediction.
            prediction = language_tuple[0][0].replace('__label__', '')
            value = language_tuple[1][0]
            # Each top language prediction for the lines in the file
            # becomes a unique key for the this_D dictionary.
            # Every time that language is found, add the confidence
            # score to the running tally for that language.
            if prediction not in this_D.keys():
                this_D[prediction] = 0
            this_D[prediction] += value
        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)
        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate a relative confidence of the max confidence to all
        # confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]
        # Only want to know if this is english or not.
        return max_key == 'en'
Below is the application / instantiation and use of the above class for my needs.
file_list = # some tool to get my specific list of files to check for English
en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
This is what I used some time ago.
It works for texts longer than 3 words and with fewer than 3 unrecognized words.
Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.
from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > max_error_count) or (len(quote.split()) < min_text_length)) else True
print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Use the enchant library
import enchant
dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc
dictionary.check("Hello") # prints True
dictionary.check("Helo") #prints False
This example is taken directly from their website
If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
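A rough illustration of that approach; the tiny reference text and the 0.3 threshold below are placeholders, and a real profile needs far more data:
import math
from collections import Counter

def trigram_profile(text):
    text = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    common = set(p) & set(q)
    dot = sum(p[t] * q[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

english_profile = trigram_profile("this is some text written in English "
                                  "and a little more English text")

def looks_english(sentence, threshold=0.3):
    return cosine_similarity(trigram_profile(sentence), english_profile) >= threshold

print(looks_english("this is some more text written in English"))  # True
print(looks_english("Ce n'est pas en anglais"))                    # most likely False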
import enchant

def check(text):
    text = text.split()
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    for i in range(len(text)):
        if dictionary.check(text[i]) == False:
            o = "False"
            break
        else:
            o = "True"
    return o
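One usage note on the helper above: it returns the strings 'True' and 'False' rather than booleans, for example:
print(check("this is some text written in English"))  # 'True'
print(check("qwzrty flurbish"))                       # 'False' (made-up words fail the dictionary check)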
