I generate text with transformer models and I am looking for a way to measure its grammatical quality.
For example, the text "Today is a good day. I slept well and got up good in the morning."
should be rated higher than "Yesterday I went into bed and. got Breakfast son."
Are there any models that can do this job which I have not found yet, or is there another way of measuring the grammatical quality of the generated text?
What I found so far is that spaCy can show whether a text contains a grammatical error, but I am more interested in a score that takes into account the length of the text and the number of errors it contains.
I also looked into NLTK readability, but that aims at how well a text can be understood, which depends on more than grammar alone.
Thank you!
So I found what I was looking for:
In this paper the researchers tested different measures for their ability to detect grammar mistakes in text without reference sentences (which is what the GLEU score can be used for). They also tested language_tool_python, a Python wrapper around LanguageTool, which is also used for spell checking in OpenOffice. This tool reports the number of grammar mistakes in a text. For my purpose, I simply divide the number of errors by the number of words in the text, which gives me an error metric.
Maybe this helps someone who has the same issue. Here is the example code, based on the PyPI documentation:
import language_tool_python

# Load the LanguageTool checker for US English
tool = language_tool_python.LanguageTool('en-US')

text = "this is a test tsentence, to check if all erors are found"

# Each match is one detected grammar or spelling issue
matches = tool.check(text)
print(len(matches))
# >>> 3
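As a rough sketch of the errors-per-word metric described above (the helper name grammar_error_rate is my own, not part of the library):

import language_tool_python

def grammar_error_rate(text, tool):
    # Number of detected issues divided by number of words (0.0 = clean text)
    words = text.split()
    if not words:
        return 0.0
    return len(tool.check(text)) / len(words)

tool = language_tool_python.LanguageTool('en-US')
print(grammar_error_rate("Today is a good day. I slept well.", tool))
print(grammar_error_rate("Yesterday I went into bed and. got Breakfast son.", tool))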
I have around 7,000 sentences for which I have done a refined Named-Entity Recognition (i.e., for specific entities) using spaCy. Now I want to do relationship extraction (basically causal inference), and I do not know how to use the NER output to build a training set.
As far as I have read, there are different approaches to performing relationship extraction:
1) Handwritten patterns
2) Supervised machine learning
3) Semi-supervised machine learning
Since I want to use supervised machine learning, I need training data.
It would be nice if anyone could give me some direction, many thanks. Here is a screenshot of my data frame; the entities are provided by a customised spaCy model. I also have access to the syntactic dependencies and part-of-speech tags of each sentence, as given by spaCy:
It seems that your dataset is some kind of technical writing and very well structured, so maybe part-of-speech tags are enough to do the extraction you want.
I would recommend reading the paper Identifying Relations for Open Information Extraction and understanding the POS-tag-based pattern it uses.
The piece of code below tags a sentence with part-of-speech tags and then looks for sequences that match the so-called ReVerb pattern.
import nltk
# One-time download: nltk.download('averaged_perceptron_tagger')

# ReVerb-style pattern: a (passive) verb, optionally followed by
# intervening words and ending in a preposition
verb = "<ADV>*<AUX>*<VBN><IN|PART>*<ADV>*"
word = "<NOUN|ADJ|ADV|DET|ADP>"
preposition = "<ADP|ADJ>"
rel_pattern = "( %s (%s* (%s)+ )? )+ " % (verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
reverb_pattern = nltk.RegexpParser(grammar_long)

sent = "where the equation caused by the eccentricity is maximum"
sent_pos_tags = nltk.pos_tag(sent.split())

# Walk the chunk tree and print the token spans matched by REL_PHRASE
for x in reverb_pattern.parse(sent_pos_tags):
    if isinstance(x, nltk.Tree) and x.label() == 'REL_PHRASE':
        rel_phrase = " ".join([t[0] for t in x.leaves()])
        print(rel_phrase)
There is a bit missing, which is to find the closest noun phrases to the right and left of the pattern, but I leave that as an exercise. I also wrote a blog post with a more detailed example. I hope it helps.
We have a requirement that we receive different types of documents from clients, such as student admission documents, mark sheets, etc., and we want to create an algorithm that identifies which type a given document is. For this we chose specific keywords that identify the document type: admission documents contain keywords like fee, admission, etc., and mark sheet documents contain keywords like marks, grade, etc. So we can predict the document type by comparing keyword frequencies.
Which algorithm should I implement for this requirement? I was planning to implement the multinomial Naive Bayes algorithm, but I cannot figure out how to fit my data into it.
FYI: I am using the Python sklearn module.
Can anyone please tell me which algorithm is suitable for the above requirement? If possible, could you also provide an example with code so that I can easily figure out the solution?
You are looking for a topic modelling solution, and there are plenty of ways to solve this problem. With Python and scikit-learn, I recommend taking a look at this article.
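Since you mention multinomial Naive Bayes and sklearn, here is a minimal sketch of keyword-frequency classification; the training documents and labels below are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: document text -> document type
train_docs = [
    "application for admission with fee structure attached",
    "admission confirmed after payment of the fee",
    "marks obtained in each subject with final grade",
    "grade card listing marks per semester",
]
train_labels = ["admission", "admission", "marksheet", "marksheet"]

# CountVectorizer turns each document into keyword-frequency counts,
# which is the kind of input multinomial Naive Bayes expects
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["please find the fee receipt and admission letter"]))
# -> ['admission']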
So I have started to learn gensim for both word2vec and doc2vec, and it works; the similarity scores actually work really well. For an experiment, however, I wanted to optimize a keyword-based search algorithm by comparing a single word against a piece of text and measuring how similar they are.
What is the best way to do this? I considered averaging the word vectors of all words in the text (maybe removing filler and stop words first) and comparing this average to the search word, but that is really just intuition. What would be the best approach?
Averaging all the word-vectors of a longer text is one crude but somewhat effective way to get a single vector for the full text. The resulting vector might then be usefully comparable to single word-vectors.
The Doc2Vec modes that train word-vectors into the same 'space' as the doc-vectors – PV-DM (dm=1), or PV-DBOW if word-training is added (dm=0, dbow_words=1) – could be considered. The doc-vectors closest to a single word-vector might work for your purposes.
Another technique for calculating a 'closeness' of two sets-of-word-vectors is "Word Mover's Distance" ('WMD'). It's more expensive to calculate than those techniques that reduce a text to a single vector, because it's essentially considering many possible cost-minimizing ways of correlating the sets-of-vectors. I'm not sure how well it works in the degenerate case of one 'text' being just a single word (or very short phrase), but it could be worth trying. (The wmdistance() method on gensim's word-vectors offers this.)
I've also seen mention of another calculation, called 'Soft Cosine Similarity', that may be more efficient than WMD but offer similar benefits. It's also now available in gensim; there's a Jupyter notebook intro tutorial as well.
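A minimal sketch of the averaging approach with gensim, using a tiny made-up corpus just to have a trained model (with so little data the scores are meaningless, but the mechanics are the same; the helper names are my own):

import numpy as np
from gensim.models import Word2Vec

# Toy corpus; in practice, train on your own text collection
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "common", "pets"],
    ["the", "weather", "today", "is", "sunny"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50, seed=1)

def text_vector(words, kv):
    # Average the vectors of in-vocabulary words (crude single vector for a text)
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def word_text_similarity(word, words, kv):
    # Cosine similarity between one word-vector and the averaged text vector
    v1, v2 = kv[word], text_vector(words, kv)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(word_text_similarity("cat", ["dogs", "and", "cats", "are", "common", "pets"], model.wv))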
Given words like "romantic" or "underground", I'd like to use python to go through a list of text data and retrieve entries that contain those words and associated words such as "girlfriend" or "hole-in-the-wall".
It's been suggested that I work with NLTK to do this, but I have no idea where to start and I know nothing about language processing or linguistics. Any pointers would be much appreciated.
You haven't given us much to go on. But let's assume you have a paragraph of text. Here's one I just stole from a Yelp review:
What a beautiful train station in the heart of New York City. I've grown up seeing memorable images of GCT on newspapers, in movies, and in magazines, so I was well aware of what the interior of the station looked like. However, it's still a gem. To stand in the centre of the main hall during rush hour is an interesting experience- commuters streaming vigorously around you, sunlight beaming in through the massive windows, announcements booming on the PA system. It's a true NY experience.
Okay, there are a bunch of words there. What kind of words do you want? Adjectives? Adverbs? NLTK will help you "tag" the words, so you can find all the ad-words: "beautiful", "memorable", "interesting", "massive", "true".
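A minimal sketch of that tagging step with NLTK (paragraph shortened; in the Penn Treebank tagset the adjective tags start with JJ):

import nltk
# One-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

paragraph = ("What a beautiful train station in the heart of New York City. "
             "To stand in the centre of the main hall during rush hour is an "
             "interesting experience.")

tagged = nltk.pos_tag(nltk.word_tokenize(paragraph))

# Keep only words tagged as adjectives (JJ, JJR, JJS)
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
print(adjectives)  # e.g. ['beautiful', 'main', 'interesting']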
Now, what are you going to do with them? Maybe you can throw in some verbs and nouns, "beaming" sounds pretty good. But "announcements" isn't so interesting.
Regardless, you can build an associations database. This ad-word appears in a paragraph with these other words.
Maybe you can count the frequency of each word, over your total corpus. Maybe "restaurant" appears a lot, but "pesthole" is relatively rare. So you can filter that way? (Only keep "interesting" words.)
Or maybe you go the other way, and extract synonyms: if "romantic" and "girlfriend" appear together a lot, then call them "correlated words" and use them as part of your search engine?
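A rough sketch of that co-occurrence counting idea; the toy corpus below is made up, and each entry is assumed to be already lower-cased and tokenized:

from collections import Counter
from itertools import combinations

corpus = [
    ["romantic", "dinner", "with", "my", "girlfriend"],
    ["romantic", "walk", "with", "my", "girlfriend", "by", "the", "river"],
    ["an", "underground", "bar", "a", "real", "hole-in-the-wall"],
]

# Overall frequency of each word across the corpus
word_counts = Counter(word for entry in corpus for word in entry)

# How often two words appear together in the same entry
pair_counts = Counter()
for entry in corpus:
    for a, b in combinations(sorted(set(entry)), 2):
        pair_counts[(a, b)] += 1

print(word_counts["romantic"])                  # 2
print(pair_counts[("girlfriend", "romantic")])  # 2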
We don't know what you're trying to accomplish, so it's hard to make suggestions. But yes, NLTK can help you select certain subgroups of words, IF that's actually relevant.
I have to run a project involving a NAO robot programmed in Python. What I have to do is assign some knowledge about what is shown to NAO.
For example:
A person shows NAO a picture (drawn by hand on a whiteboard)
The person says "House" (let's say the person draws a house)
NAO now knows that the picture shown represents a house
The problem I have encountered is in the speech recognition module: only words from a predefined vocabulary can be recognized. But in my project setting, a person should draw on a whiteboard and tell NAO what is drawn there. This means I cannot know what the person is going to draw, so I cannot set the vocabulary in advance.
My starting point is this tutorial. As you can see from the tutorial, only certain words belonging to the vocabulary can be recognized, as in these lines of code:
wordList=["yes","no","hello Nao","goodbye Nao"]
asr.setWordListAsVocabulary(wordList)
During the recognition, an event called WordRecognized is raised. It has this structure:
Event: "WordRecognized"
callback(std::string eventName, AL::ALValue value, std::string subscriberIdentifier)
It is raised when one of the words specified with ALSpeechRecognitionProxy::setWordListAsVocabulary() has been recognized. When no word is currently recognized, this value is reinitialized.
So I suppose the key to my answer is here, but I need some help.
How could I solve this problem? Is there any better documentation I can refer to?
Thanks in advance!
The problem is that the NAO speech recognition module is proprietary, and I highly doubt you can do such things with it.
However, if you consider the ROS platform and an open-source engine like CMUSphinx, you can definitely do what you want. It is easy to include a placeholder word in the grammar which will be matched against an unknown word and can later be added to the dictionary.
Learning the vocabulary through voice interaction is a highly complicated research question, but it has been done before. As an example, you can read this publication:
Combined systems for automatic phonetic transcription of proper nouns
A. Laurent, T. Merlin, S. Meignier, Y. Esteve, P. Deleglise
http://www.lrec-conf.org/proceedings/lrec2008/pdf/455_paper.pdf
The only thing is that you will have to work with the recognizer at a very low level.