Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Given words like "romantic" or "underground", I'd like to use Python to go through a list of text data and retrieve entries that contain those words and associated words such as "girlfriend" or "hole-in-the-wall".
It's been suggested that I work with NLTK to do this, but I have no idea where to start and I know nothing about language processing or linguistics. Any pointers would be much appreciated.
You haven't given us much to go on. But let's assume you have a paragraph of text. Here's one I just stole from a Yelp review:
What a beautiful train station in the heart of New York City. I've grown up seeing memorable images of GCT on newspapers, in movies, and in magazines, so I was well aware of what the interior of the station looked like. However, it's still a gem. To stand in the centre of the main hall during rush hour is an interesting experience- commuters streaming vigorously around you, sunlight beaming in through the massive windows, announcements booming on the PA system. It's a true NY experience.
Okay, there are a bunch of words there. What kind of words do you want? Adjectives? Adverbs? NLTK will help you "tag" the words, so you can find all the ad-words: "beautiful", "memorable", "interesting", "massive", "true".
Now, what are you going to do with them? Maybe you can throw in some verbs and nouns: "beaming" sounds pretty good, but "announcements" isn't so interesting.
Regardless, you can build an associations database. This ad-word appears in a paragraph with these other words.
Maybe you can count the frequency of each word, over your total corpus. Maybe "restaurant" appears a lot, but "pesthole" is relatively rare. So you can filter that way? (Only keep "interesting" words.)
Or maybe you go the other way, and extract synonyms: if "romantic" and "girlfriend" appear together a lot, then call them "correlated words" and use them as part of your search engine?
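The association idea can be sketched with nothing but the standard library. This is a minimal illustration, not NLTK code: the sample entries, the tiny STOPWORDS set, and the function names are all made up for the example, and a real corpus would need a proper stop-word list and a higher co-occurrence threshold.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, just large enough for the sample entries below.
STOPWORDS = {"a", "an", "the", "for", "my", "your", "here", "in"}

def tokenize(text):
    """Lowercase the text and return its words, keeping hyphenated words whole."""
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_cooccurrence(entries):
    """Count, for every pair of distinct words, how many entries contain both."""
    pair_counts = Counter()
    for entry in entries:
        words = sorted(set(tokenize(entry)))
        pair_counts.update(combinations(words, 2))
    return pair_counts

def associated_words(seed, pair_counts, min_count=2):
    """Return (word, count) pairs that share at least `min_count` entries with `seed`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if n < min_count:
            continue
        if a == seed:
            related[b] = n
        elif b == seed:
            related[a] = n
    return related.most_common()

def matching_entries(seed, entries, pair_counts):
    """Return entries that contain the seed word or any of its associated words."""
    wanted = {seed} | {w for w, _ in associated_words(seed, pair_counts)}
    return [e for e in entries if wanted & set(tokenize(e))]

# Made-up sample data.
entries = [
    "A romantic dinner spot, perfect for taking your girlfriend.",
    "Took my girlfriend here for a romantic evening.",
    "Underground bar, a real hole-in-the-wall.",
]
pairs = build_cooccurrence(entries)
related = associated_words("romantic", pairs)
hits = matching_entries("romantic", entries, pairs)
print(related)
print(hits)
```

Counting each pair once per entry (rather than once per occurrence) keeps one long, repetitive review from dominating the associations.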
We don't know what you're trying to accomplish, so it's hard to make suggestions. But yes, NLTK can help you select certain subgroups of words, IF that's actually relevant.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I generate text via transformer models and I am looking for a way of measuring the grammatical text-quality.
For example, the text: "Today is a good day. I slept well and got up good in the morning."
should be rated higher than: "Yesterday I went into bed and. got Breakfast son."
Are there any models that can do this and that I haven't found yet, or is there another way to measure the grammatical quality of the generated text?
I found that spaCy can show whether a text has a grammatical error, but I am more interested in a score that takes into account both the length of the text and the number of errors it contains.
I also looked into NLTK readability, but that measures how easily the text can be understood, which depends on more than grammar alone.
Thank you!
So I found what I was looking for:
In this paper the researchers tested how well different measures detect grammar mistakes in text without references (what the GLEU score can be used for). They also tested LanguageTool (available in Python as language_tool_python), which is also used for spell checking in OpenOffice. This tool can count the grammar mistakes in a text. For my purpose, I will just divide the number of errors by the number of words in the text, which gives me an error metric.
Maybe this helps someone who has the same issue. Here is the example code, based on the PyPI page:
import language_tool_python
tool = language_tool_python.LanguageTool('en-US')
text = "this is a test tsentence, to check if all erors are found"
matches = tool.check(text)
print(len(matches))  # reported output: 3
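The errors-per-word metric described above could look like the following sketch. The function name error_rate and the toy_count_errors stand-in are hypothetical; the stand-in exists only so the example runs without LanguageTool installed, and in practice you would pass a callable built on tool.check from the snippet above.

```python
def error_rate(text, count_errors):
    """Return grammar errors per word.

    `count_errors` is any callable that maps a text to a number of errors,
    so the metric itself stays independent of the checker that is used.
    """
    words = text.split()
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    return count_errors(text) / len(words)

def toy_count_errors(text):
    """Stand-in checker for illustration only: flags tokens from a fixed set."""
    known_typos = {"tsentence,", "erors"}
    return sum(1 for word in text.split() if word in known_typos)

text = "this is a test tsentence, to check if all erors are found"
rate = error_rate(text, toy_count_errors)
print(round(rate, 3))

# With LanguageTool, assuming `tool` was created as in the snippet above:
#   rate = error_rate(text, lambda t: len(tool.check(t)))
```

Lower is better with this metric; if you want a quality score where higher is better, something like 1 - rate (clipped at 0) is one option.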
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I'm trying to build a machine learning algorithm for address classification (or address similarity) in rural areas (villages). I have historical data that includes Address (independent variable), Village Name (independent variable), Pin-Code (independent variable), Customer Mobile Number, and Route No (dependent variable). The Route No identifies a delivery cart's route and helps the cart cover the maximum number of delivery destinations in that area.
Challenges -
"Address" can be misspelled.
"Village Name" can be null.
"Pin-codes" can be wrong.
Good Thing -
Not all the independent variables can be wrong/null at the same time.
The point of this algorithm is to select the best Route Number on the basis of "Address", "Village Name", "Pin-Code", and the historical data (in which we manually selected the route for the delivery carts).
I'm a beginner, and I'm confused about how to do this and which process to use.
Tasks I have done:
Address cleaning: removed short words, removed long words, removed stop words.
Now I'm trying to do it with word vectors, but I'm not able to get that working.
For this, you'll first have to build a dataset containing the names of as many villages as you can. Many villages have similar names that differ by only one or two letters, so identifying a typo is difficult and risky. A bigger dataset is better.
Then try TF-IDF on the combination of village name and PIN code (this link may be helpful for Indian data), or you can go for fuzzy logic.
Hope it helps! Happy coding!
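As a rough illustration of the fuzzy option, assuming it means fuzzy string matching against the list of known village names, the standard library's difflib can already do a first pass. The village list, the function name match_village, and the 0.8 cutoff are assumptions made for this sketch, not values from the question.

```python
import difflib

# Hypothetical reference list; in practice this is the big village dataset.
KNOWN_VILLAGES = ["Rampur", "Ramnagar", "Sultanpur", "Shivpur"]

def match_village(raw_name, known=KNOWN_VILLAGES, cutoff=0.8):
    """Return the closest known village name, or None if nothing is close enough."""
    if not raw_name:
        return None  # null or empty village name: nothing to match
    lookup = {name.lower(): name for name in known}
    hits = difflib.get_close_matches(
        raw_name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None

print(match_village("Rampurr"))  # a typo of "Rampur"
print(match_village("Xyz"))      # no plausible match
print(match_village(None))       # missing value
```

A high cutoff is deliberate here: because many real village names differ by only a letter or two, it is safer to return None and fall back on the PIN code or address than to guess the wrong village.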
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I find that I am often a little inconsistent in my naming conventions for variables, and I'm just wondering what people consider to be the best approach. The specific convention I am talking about is when a variable needs to be described by a noun and an adjective, and whether the adjective should come before or after the noun. The question is general across all programming languages, although personally I use C++ and Python.
For example, consider writing a GUI, which has two buttons; one on the right, and one on the left. I now need to create two variables to store them. One option would be to have the adjective before the noun, and call them left_button, and right_button. The other option would be to have the adjective after the noun, and call them button_left, and button_right. With the first case, it makes more sense when reading out loud, because in English you would always place the adjective before the noun. However, with the second case, it helps to structure the data semantically, because button reveals the most information about the variable, and left or right is supplementary information.
So what do you think? Should the adjective come before or after the noun? Or is it entirely subjective?
I try to use noun_adj because it conforms to what I use for functions. When I write functions I tend to use verb_noun_adj, for instance:
def get_button_id():
    """Get id property of button object."""
    pass
To me this reads a bit more clearly than get_id_button, because with that name it is not entirely clear what you are getting: is it the button.id, a button called 'id', or maybe something else? The alternative is to expand the name to be clearer, like get_id_of_button, which may be a bit too verbose for you.
There's probably an equally valid argument against what I'm doing here, but at least I'm being consistent in my madness?
I think that writing the adjective before the noun is the more common convention. But if you think about it for a second, writing it after the noun makes your code easier to read: the adjective can easily be seen as an attribute of the noun. Logically, this is the better way in my opinion.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Here is the problem: when given a block of text, I want to suggest possible topics. For example, a news article about Kobe Bryant would have suggested tags like: ‘basketball’, ‘nba’, ‘sports’.
I have a fairly large training dataset (350k+) that includes bodies of text and tags that users have assigned to the text. There are about 40k pre-existing topics; however, many of the topics do not have many entries in them. I would say only about 5k of the topics have more than 10 entries. Users cannot assign topics that don’t already exist in the system. I'd also like to include that
Does anyone have any suggestions for algorithms to use?
If anyone has any suggestions of python libraries as well that would be awesome.
There have been attempts at similar problems; one example is right here on Stack Overflow. When you write a question, Stack Overflow itself suggests some tags without your intervention, though you can manually add or remove them.
Out-of-the-box classification would fail because the number of tags is huge. There are two directions from which you could approach this problem.
Nearest Neighbors
Easy, fast and effective. You have a labelled training set. When a new document comes in, you look for the closest matches; e.g. words like 'tags', 'training', 'dataset', 'labels' etc. helped map your question to other similar questions on Stack Overflow. Those questions carried the machine-learning tag, so that tag was suggested. The best way to implement this is to index your training data (a search-engine tactic). You may use Lucene, Elasticsearch or something similar. When a new document appears, use it as a query and search for the top 10 matching documents stored previously. Poll their tags. Sort the tags and use the scores of the documents to find how important the tags are. Done.
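The nearest-neighbour recipe can be prototyped in plain Python before committing to a search engine. The sketch below is an assumption-laden stand-in: it uses a hand-rolled TF-IDF with cosine similarity instead of Lucene's scoring, the three training documents are invented, and the names build_index and suggest_tags are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(tokens, idf):
    """Build a unit-length TF-IDF vector (dict) from tokens known to the index."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {w: count * idf[w] for w, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}

def build_index(docs):
    """Index (text, tags) pairs; returns the IDF table and the document vectors."""
    tokenized = [tokenize(text) for text, _ in docs]
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    n = len(docs)
    idf = {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}
    vectors = [(vectorize(tokens, idf), tags)
               for tokens, (_, tags) in zip(tokenized, docs)]
    return idf, vectors

def suggest_tags(text, idf, vectors, k=10, top=3):
    """Query with the new text, take the k best documents, and poll their tags."""
    query = vectorize(tokenize(text), idf)
    scored = []
    for vec, tags in vectors:
        score = sum(weight * vec.get(w, 0.0) for w, weight in query.items())
        if score > 0:
            scored.append((score, tags))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for score, tags in scored[:k]:
        for tag in tags:
            votes[tag] += score  # each tag is weighted by its document's score
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top]]

# Made-up training data.
docs = [
    ("Kobe Bryant scores 40 points as the Lakers win", ["basketball", "nba", "sports"]),
    ("Lakers trade rumors ahead of the NBA deadline", ["basketball", "nba"]),
    ("Central bank raises interest rates again", ["finance", "economy"]),
]
idf, vectors = build_index(docs)
suggestions = suggest_tags("Bryant leads Lakers to a playoff win", idf, vectors)
print(suggestions)
```

A linear scan like this will not scale to 350k documents; that is where an inverted index such as Lucene or Elasticsearch comes in, while the tag-polling step stays the same.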
Probabilistic Models
The idea is along the lines of classification, but off-the-shelf tools won't help you with that. Check works such as Clayton Stanley, Predicting Tags for StackOverflow Posts; Darren Kuo, On Word Prediction Methods; or Schuster's report on Predicting Tags for StackOverflow Questions.
If you have this problem as part of a long-term academic project or research, working on Method 2 would be better. However, if you need an off-the-shelf solution, use Method 1. Lucene is a great indexing tool that is used even in production. It is originally written in Java, but you can easily find wrappers for Python. Other alternatives are Elasticsearch, Katta and many more.
p.s. A lot of experimentation is required while playing with the tag-scores.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have to run a project involving a NAO robot programmed in Python. What I have to do is teach NAO some knowledge about what it is shown.
For example:
A person shows NAO a picture (drawn by hand on a whiteboard)
The person says "House" (let's say the person draws a house)
NAO now knows that the picture shown represents a house
The problem I have encountered is in the speech recognition module: only words in a predefined vocabulary can be recognized. But in my project setting, a person draws on a whiteboard and tells NAO what is drawn there. This means I cannot know what the person is going to draw, so I cannot set the vocabulary in advance.
My starting point is this tutorial here. As you can see from the tutorial, only words belonging to the vocabulary can be recognized, as in these lines of code:
wordList=["yes","no","hello Nao","goodbye Nao"]
asr.setWordListAsVocabulary(wordList)
During the recognition, an event called WordRecognized is raised. It has this structure:
Event: "WordRecognized"
callback(std::string eventName, AL::ALValue value, std::string subscriberIdentifier)
It is raised when one of the specified words with ALSpeechRecognitionProxy::setWordListAsVocabulary() has been recognized. When no word is currently recognized, this value is reinitialized.
So I suppose the key to my answer is here, but I need some help.
How could I solve this problem? Is there any better documentation I can refer to?
Thanks in advance!
The problem is that the NAO speech recognition module is proprietary, and I highly doubt you can do such things with it.
However, if you consider the ROS platform and an open-source engine like CMUSphinx, you can definitely do what you want. It's easy to include a placeholder word in a grammar; it will be matched against an unknown word, which can later be placed in the dictionary.
Learning vocabulary by voice interaction is a highly complicated research question, but it has been done before. As an example, you can read this publication:
Combined systems for automatic phonetic transcription of proper nouns
A. Laurent, T. Merlin , S. Meignier, Y. Esteve, P. Deleglise
http://www.lrec-conf.org/proceedings/lrec2008/pdf/455_paper.pdf
The only caveat is that you will need to work with the recognizer at a very low level.