I work on sentiment analysis. Abbreviations are among the most widely used constructions in natural language. I used a spell checker to correct spelling mistakes, but one problem with this method is that it converts abbreviations into the closest English word, which affects sentiment detection. Is there any code or method by which these abbreviations can be expanded according to their neighboring words?
Hello, here is an example that might be useful:
import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_web_sm")

# Build the abbreviation detector and add it to the pipeline (scispacy with the spaCy v2-style API)
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    print(doc._.abbreviations)
    for abrv in doc._.abbreviations:
        # Swap the short-form token for the detected long form
        altered_tok[abrv.start] = str(abrv._.long_form)
    return " ".join(altered_tok)

replace_acronyms(text)
replace_acronyms("Top executives of Microsoft (MS) and General Motors (GM) met today in New York")
I am looking for algorithms that can tell me the language of a text (e.g. Hello - English, Bonjour - French, Servicio - Spanish) and also correct typos in English words. I have already explored Google's TextBlob; it is very relevant, but it throws a "Too many requests" error as soon as my code starts executing. I also started exploring Polyglot, but I am facing a lot of issues installing the library on Windows.
Code for TextBlob
import pandas as pd
from tkinter import filedialog
from textblob import TextBlob
import time

colnames = ['Word']
x = filedialog.askopenfilename(title='Select the word list')
print("Data to be checked: " + x)
df = pd.read_excel(x, sheet_name='Sheet1', header=0, names=colnames, na_values='?', dtype=str)
words = df['Word']

i = 0
Language_detector = pd.DataFrame(columns=['Word', 'Language', 'corrected_word', 'translated_word'])
for word in words:
    b = TextBlob(word)
    language_word = b.detect_language()
    time.sleep(0.5)
    if language_word in ['en', 'EN']:
        corrected_word = b.correct()
        time.sleep(0.5)
        Language_detector.loc[i, ['corrected_word']] = corrected_word
    else:
        translated_word = b.translate(to='en')
        time.sleep(0.5)
        Language_detector.loc[i, ['translated_word']] = translated_word
    Language_detector.loc[i, ['Word']] = word
    Language_detector.loc[i, ['Language']] = language_word
    i = i + 1

filename = "Language detector test v 1.xlsx"
Language_detector.to_excel(filename, sheet_name='Sheet1')
print("Languages identified for the word list")
A common way to classify languages is to gather summary statistics on letter or word frequencies and compare them to a known corpus. A naive Bayes classifier would suffice. See https://pypi.org/project/Reverend/ for one way to do this in Python.
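To illustrate the idea (this is not the Reverend API, just a minimal letter-frequency naive Bayes sketch, and the tiny training corpora are made up for the example):

import math
from collections import Counter

# Toy training corpora; in practice these would be large texts per language
corpora = {
    "english": "hello how are you the quick brown fox jumps over the lazy dog",
    "french":  "bonjour comment allez vous le renard brun saute par dessus le chien",
    "spanish": "hola como estas el rapido zorro marron salta sobre el perro perezoso",
}

# Per-language letter frequencies with add-one smoothing
models = {}
for lang, text in corpora.items():
    counts = Counter(c for c in text if c.isalpha())
    total = sum(counts.values())
    models[lang] = {c: (counts[c] + 1) / (total + 26) for c in "abcdefghijklmnopqrstuvwxyz"}

def guess_language(text):
    scores = {}
    for lang, freqs in models.items():
        # Sum of log-probabilities of each letter under the language model
        scores[lang] = sum(math.log(freqs.get(c, 1e-6)) for c in text.lower() if c.isalpha())
    return max(scores, key=scores.get)

print(guess_language("bonjour"))  # likely "french" with these toy corpora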
Correction of typos can also be done from a corpus, using a statistical model of the most likely words versus the likelihood of a particular typo. See https://norvig.com/spell-correct.html for an example of how to do this in Python.
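The core of that approach, sketched very roughly (the frequency table here is a made-up stand-in for counts taken from a real corpus): generate every word within one edit of the typo and keep the most frequent known one.

# Rough sketch of frequency-based typo correction; WORD_FREQ would come from a real corpus
WORD_FREQ = {"functional": 120, "national": 80, "fictional": 15, "the": 5000}

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    candidates = [w for w in edits1(word) if w in WORD_FREQ] or [word]
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

print(correct("functionnl"))  # -> "functional"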
You could use this, but it is hardly reliable:
https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb
Alternatively, you could give compact language detector (CLD v3) or fastText a chance, or you could use a corpus to check the frequencies of occurring words against the target text in order to find out whether the target text belongs to the language of the respective corpus. The latter is only possible if you know the set of languages to choose from.
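A minimal sketch of that corpus-overlap idea (the word sets below are stand-ins for real per-language corpora):

# Toy per-language vocabularies; in practice these would be built from large corpora
vocab = {
    "english": {"hello", "how", "are", "you", "the", "and"},
    "french":  {"bonjour", "comment", "vous", "le", "et"},
    "spanish": {"hola", "como", "estas", "el", "servicio"},
}

def guess_language(text):
    words = text.lower().split()
    # Score each language by the fraction of the target-text words found in its vocabulary
    scores = {lang: sum(w in v for w in words) / max(len(words), 1) for lang, v in vocab.items()}
    return max(scores, key=scores.get)

print(guess_language("hola como estas"))  # -> "spanish"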
For typo correction, you could use the Levenshtein algorithm, which computes an "edit distance". You can compare your words against a dictionary and choose the most likely word. For Python, you could use: https://pypi.org/project/python-Levenshtein/
See the concept of Levenshtein edit distance here: https://en.wikipedia.org/wiki/Levenshtein_distance
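A minimal sketch of the dictionary comparison, assuming the python-Levenshtein package is installed (the tiny word list is only for the example):

import Levenshtein  # pip install python-Levenshtein

dictionary = ["hello", "service", "functional", "language"]  # example word list

def closest_word(word):
    # Pick the dictionary entry with the smallest edit distance to the input
    return min(dictionary, key=lambda w: Levenshtein.distance(word, w))

print(Levenshtein.distance("kitten", "sitting"))  # 3
print(closest_word("servicio"))                   # "service"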
I need a language detection script. I tried the TextBlob library, which currently gives me the two-letter abbreviation of the language. How can I get the full language name instead?
This detects the language and returns its two-letter abbreviation:
from textblob import TextBlob
b = TextBlob("cómo estás")
language = b.detect_language()
print(language)
Actual Results : es
Expected Results : Spanish
I have the list of languages and their abbreviations from this link:
https://developers.google.com/admin-sdk/directory/v1/languages
The code you're using gives you a two-letter abbreviation that conforms to the ISO 639-1 standard. You could look up a list of these correspondences (e.g. this page) and rig up a method to map one to the other, but given that you're programming in Python, someone has already done that for you.
I recommend pycountry - a general-purpose library for this type of task that also contains a number of other standards. Example of using it for this problem:
from textblob import TextBlob
import pycountry
b = TextBlob("நீங்கள் எப்படி இருக்கிறீர்கள்")
iso_code = b.detect_language()
# iso_code = "ta"
language = pycountry.languages.get(alpha_2=iso_code)
# language = Language(alpha_2='ta', alpha_3='tam', name='Tamil', scope='I', type='L')
print(language.name)
and that prints Tamil, as expected. Same works for Spanish:
>>> pycountry.languages.get(alpha_2='es').name
'Spanish'
and probably most other languages you'll encounter in whatever it is you're doing.
simple example: func-tional --> functional
The story is that I got a Microsoft Word document, which was converted from PDF format, and some words remain hyphenated (such as func-tional, broken because of a line break in the PDF). I want to recover those broken words while normal ones (i.e., where the "-" is not from a line break) are kept.
To make it clearer, one long example (the source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script searches for hyphenated words and removes the hyphen.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find combinations of word-word
    sOut = s
    for m in matchList:
        new = m.replace("-", "")
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to regular expressions, I recommend this site:
https://regex101.com/
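One caveat: this also removes hyphens from words that are legitimately hyphenated. If that matters, a rough refinement (a sketch only, with a made-up word list standing in for a real dictionary) is to join the two halves only when the merged form is a known word:

import re

known_words = {"functional", "compendium"}  # stand-in for a real dictionary or word list

def repair_linebreak_hyphens(s):
    def maybe_join(match):
        joined = match.group(0).replace("-", "")
        # Only drop the hyphen if the joined form is a known word
        return joined if joined.lower() in known_words else match.group(0)
    return re.sub(r"\w+-\w+", maybe_join, s)

print(repair_linebreak_hyphens("the Func-tional Check Flight Compendi-um and a well-known issue"))
# -> "the Functional Check Flight Compendium and a well-known issue"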
I have a list of strings (sentences) that might contain one or more Dutch city names. I also have a list of Dutch cities, and their various spellings. I am currently working in Python, but a solution in another language would also work.
What would be the best and most efficient way to retrieve a list of cities mentioned in the sentences?
What I do at the moment is loop through the sentence list and, within that loop, loop through the cities list, checking one by one whether place_name in sentence.lower(), so I have:
for sentence in sentences:
    for place_name in place_names:
        if place_name in sentence.lower():
            places[place_name] = places[place_name] + 1
Is this the most efficient way to do this? I also run into the problem that cities like "Ee" exist in Holland, and words with "ee" in them are quite common. For now I solved this by checking place_name + ' ' in sentence.lower(), but this is of course suboptimal and ugly: it would miss sentences like "Huis in Amsterdam", since that doesn't end with a space, and it won't work well with punctuation either. I tried using regex, but that is of course way too slow. Is there a better way to solve this particular problem, or to solve this kind of problem in general? I am leaning somewhat towards an NLP solution, but I also feel like that would be massive overkill.
You may look into Named Entity Recognition solutions in general. This can be done in NLTK as well, but here is a sample in spaCy; cities would be marked with the GPE label (GPE stands for "Geopolitical Entity": countries, states, cities, etc.):
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(u'Some company is looking at buying an Amsterdam startup for $1 billion')
for ent in doc.ents:
print(ent.text, ent.label_)
Prints:
Amsterdam GPE
$1 billion MONEY
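Since you already have the list of city names and their spellings, another option (a sketch, assuming spaCy v3 and a hypothetical place_names list) is spaCy's PhraseMatcher, which matches on token boundaries, so punctuation and substrings like "ee" inside other words are not an issue:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
place_names = ["Amsterdam", "Ee", "Den Haag"]  # hypothetical city list

# attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CITIES", [nlp.make_doc(name) for name in place_names])

doc = nlp("Huis in Amsterdam, maar niet in Den Haag.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# Amsterdam
# Den Haag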
I have been trying to make an artificial intelligence in Python. What I am trying to do is have input commands trigger responses based on a single target word. So, for example, if the user types in "whats your name", it should give the same response as "name" by targeting the word "name". How can I do this?
What you're looking for is a library for part-of-speech tagging. Luckily it's pretty well-trodden ground, and there are libraries for lots of different languages, including Python. Have a look at the Natural Language Toolkit (NLTK), which can wrap the Stanford POS tagger. Here's an example from the linked article:
>>> from nltk.tag.stanford import POSTagger
>>> english_postagger = POSTagger('models/english-bidirectional-distsim.tagger', 'stanford-postagger.jar')
>>> english_postagger.tag('this is stanford postagger in nltk for python users'.split())
[(u'this', u'DT'),
 (u'is', u'VBZ'),
 (u'stanford', u'JJ'),
 (u'postagger', u'NN'),
 (u'in', u'IN'),
 (u'nltk', u'NN'),
 (u'for', u'IN'),
 (u'python', u'NN'),
 (u'users', u'NNS')]
The NN, VBZ, etc. that you see are part-of-speech tags. It looks like you're after the nouns (NN).
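If you don't want to set up the Stanford jar, here is a rough sketch using NLTK's built-in tagger (this assumes the punkt and averaged_perceptron_tagger data have been downloaded, and the response table is a hypothetical example) that extracts the nouns and keys the responses on them:

import nltk

# One-time downloads; uncomment if the data is not present yet
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

responses = {"name": "My name is Bot."}  # hypothetical keyword -> response table

def respond(user_input):
    tagged = nltk.pos_tag(nltk.word_tokenize(user_input))
    # Collect the nouns (tags starting with NN) and look them up in the response table
    nouns = [word.lower() for word, tag in tagged if tag.startswith("NN")]
    for noun in nouns:
        if noun in responses:
            return responses[noun]
    return "I don't understand."

print(respond("whats your name"))  # -> "My name is Bot."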