Spell check program in Python

Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I wasn't given solutions to the problem, so can somebody tell me how I did and what the correct answer should be?
As a disclaimer, none of this parses in my Python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
Oh yeah... I'm very sure I've indented everything correctly, but I don't know how to format my question here so that it doesn't come out as a block like it has. Sorry.
What do you think?

There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
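# The filename needs to be a string value - put "C:\..." in quotes!
# Better still, use a raw string (r"C:\...") so that backslash sequences
# such as \b are not interpreted as escape characters.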
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right
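To make the comments above concrete, here is a minimal sketch (Python 3 syntax; the sample text and word list are made up for illustration) showing what iterating over a string gives you, how split() gives you words instead, and what "not in" actually returns:

```python
# Hypothetical sample data, standing in for the file contents and the word list.
data = "the cat szat"
word_list = ["the", "cat", "sat"]

# Iterating over a string yields single characters, not words.
letters = [item for item in data][:3]

# split() with no arguments breaks the string on whitespace into words.
words = data.split()

# "not in" produces one boolean for one value; it does not build a collection.
is_wrong = "szat" not in word_list

# To collect every unknown word, test each word individually.
wrong_words = [word for word in words if word not in word_list]

print(letters)      # ['t', 'h', 'e']
print(words)        # ['the', 'cat', 'szat']
print(is_wrong)     # True
print(wrong_words)  # ['szat']
```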

I don't know what you are trying to iterate over, but why don't you just iterate over the words in your text first, and for every one of those words check whether or not it is in the word list (which is in the variable a, I guess)?
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.

It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
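For readers who are past the exercise, the set-difference idea can be sketched roughly as follows. This is only one possible reading of the task, written for Python 3: the function name, the sample data, and the choice to lowercase everything and treat runs of letters and apostrophes as words are all assumptions, not part of the original problem statement.

```python
import re

def misspelled_words(text, word_list):
    """Return the unique words in text that are not in word_list, sorted."""
    known = {word.lower() for word in word_list}
    # Assumption: a "word" is a run of letters or apostrophes, compared case-insensitively.
    found = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(found - known)

# Hypothetical sample data; in the exercise these would come from the two files.
word_list = ["the", "cat", "sat", "on", "mat"]
text = "The cat szat on the matt. The cat sat."
print(misspelled_words(text, word_list))  # ['matt', 'szat']
```

Because both collections are sets, duplicates disappear on their own, which covers the "unique" requirement.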

Related

In python, is there a way to remove all text following the last instance of a delimiter?

I'm trying to create a random text generator in Python. I'm using Markovify to produce the required text and a filter that stops it from generating text unless the first word is capitalized. To prevent it from ending "mid sentence", I want the program to search from the back of the output to the front and remove all text after the last (for instance) period. I want it to ignore all other instances of the selected delimiter(s). I have no idea how many instances of the delimiter will occur in the generated text, nor do I have any way to know in advance.
While looking into this I found rsplit(), and tried using that, but ran into a problem.
    tweet = buff.rsplit('.')[-1]
The above is what I tried first, and I thought it was working until I noticed that all of the lines printed with that had only a single sentence in them. Never more than that. The problem seems to be that the text is being dumped into an array of strings, and the [-1] bit is calling just one entry from that array.
    tweet = buff.rsplit('.') - buff.rsplit('.')[-1]
Next I tried the above. The thinking was that it would remove the last entry in the array, and then I could just print what remained. It... didn't go to plan. I get an "unsupported operand type" error, specifically tied to the attempt to subtract. Not sure what I'm missing at this point.
.rsplit has an optional second argument, maxsplit, which is the maximum number of splits to perform. You could use it in the following way:
txt = 'some.text.with.dots'
all_but_last = txt.rsplit('.', 1)[0]
print(all_but_last)
Output:
some.text.with
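Note that rsplit drops the delimiter itself, so the result above no longer ends in a period. If the final period should be kept, str.rpartition is one alternative. The sketch below is only an illustration (Python 3); the helper name trim_after_last is made up, and it assumes the text should be returned unchanged when no delimiter is present.

```python
def trim_after_last(text, delimiter="."):
    """Drop everything after the last delimiter, keeping the delimiter itself."""
    head, sep, _tail = text.rpartition(delimiter)
    # rpartition returns an empty separator when the delimiter is not found.
    return head + sep if sep else text

print(trim_after_last("First one. Second one. Third bit that"))  # First one. Second one.
print(trim_after_last("no delimiter here"))                      # no delimiter here
```

As for the subtraction error in the question: rsplit returns a list, and Python lists do not support the - operator, which is why that attempt fails.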

Searching a string for an exact match from a list in Python

I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, except that it also matches when the string appears inside a longer word (for instance, if the desired string was only "man" but they wrote "manager", it would get retweeted). I'm still pretty new to Python; my hunch is that regex is the way to go, but my attempts have proved useless thus far.
    if tweet["user"]["screen_name"] in friends:
        for phrase in list:
            if phrase in tweet["text"].lower():
                print tweet
                api.retweet(tweet["id"])
                return True
Since you only want to match whole words the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
    if tweet["user"]["screen_name"] in friends:
        tweet_words = set(tweet["text"].lower().split())
        for phrase in list:
            if phrase in tweet_words:
                print tweet
                api.retweet(tweet["id"])
                return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a .split() method call.
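A function of the kind described might look something like this. It is a rough sketch in Python 3, not a definitive tokenizer: the name words_in is made up, and it only strips ASCII punctuation from the ends of each whitespace-separated token.

```python
import string

def words_in(text):
    """Return the set of lowercased words in text, without surrounding punctuation."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    # Tokens that were pure punctuation become empty strings; drop them.
    return {word for word in stripped if word}

print("man" in words_in("Hey, MAN! Nice one."))     # True
print("man" in words_in("Ask the manager, please")) # False
```

The result could then take the place of set(tweet["text"].lower().split()) in the code above.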
Given that optimization it occurred to me that iteration in Python could be avoided altogether if the phrases were a set also (the iteration will still happen, but at C speeds rather than Python speeds). So in the code that follows let's suppose that you have during initialization executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
    list = ['Python', 'Perl', 'COBOL']
    tweets = [
        "This vacation just isn't worth the bother",
        "Goodness me she's a great Perl programmer",
        "This one slides by under the radar",
        "I used to program COBOL but I'm all right now",
        "A visit to the doctor is not reported"
    ]
    tweet_words = set(w.lower() for w in list)
    for tweet in tweets:
        if set(tweet.lower().split()) & tweet_words:
            print(tweet)
If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
    import re
    pattern = re.compile(r"\bman\b")
    if re.search(pattern, tweet["text"].lower()):
        pass  # do your thing
\b matches a word boundary in a regex, so prefixing and suffixing your pattern with it will match the pattern only where it appears as a whole word. Hope it helps.
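Since the question has a list of phrases rather than a single one, the same idea can be extended by joining the phrases into one alternation. The following is a sketch under a few assumptions (Python 3, a non-empty phrase list, and a made-up helper name); re.escape is used so that any punctuation inside a phrase is matched literally.

```python
import re

def build_whole_word_pattern(phrases):
    """Compile one regex that matches any of the phrases as a whole word."""
    # Assumes phrases is non-empty; an empty alternation would match at every boundary.
    alternatives = "|".join(re.escape(phrase) for phrase in phrases)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_whole_word_pattern(["man", "perl"])
print(bool(pattern.search("He is the Man")))    # True
print(bool(pattern.search("Ask the manager")))  # False
```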

PyEnchant 'correcting' words in dictionary to words not in dictionary

I'm attempting to take large amounts of natural language from a web forum and correct the spelling with PyEnchant. The text is often informal, and about medical issues, so I have created a text file "test.pwl" containing relevant medical words, chat abbreviations, and so on. In some cases, little bits of html, urls, etc do unfortunately remain in it.
My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and correct them to the first suggestion of d.suggest totally automatically. It prints a list of misspelled words, then a list of words that had no suggestions, and writes the corrected text to 'spellfixed.txt':
    import enchant
    import codecs

    def spellcheckfile(filepath):
        d = enchant.DictWithPWL("en_US","test.pwl")
        try:
            f = codecs.open(filepath, "r", "utf-8")
        except IOError:
            print "Error reading the file, right filepath?"
            return
        textdata = f.read()
        mispelled = []
        words = textdata.split()
        for word in words:
            # if spell check failed and the word is also not in
            # mis-spelled list already, then add the word
            if d.check(word) == False and word not in mispelled:
                mispelled.append(word)
        print mispelled
        for mspellword in mispelled:
            #get suggestions
            suggestions=d.suggest(mspellword)
            #make sure we actually got some
            if len(suggestions) > 0:
                # pick the first one
                picksuggestion=suggestions[0]
            else: print mspellword
            #replace every occurence of the bad word with the suggestion
            #this is almost certainly a bad idea :)
            textdata = textdata.replace(mspellword,picksuggestion)
        try:
            fo=open("spellfixed.txt","w")
        except IOError:
            print "Error writing spellfixed.txt to current directory. Who knows why."
            return
        fo.write(textdata.encode("UTF-8"))
        fo.close()
        return
The issue is that the output often contains 'corrections' for words that were in either the dictionary or the pwl. For instance, when the first portion of the input was:
My NEW doctor feels that I am now bi-polar . This , after 9 years of being considered majorly depressed by everyone else
I got this:
My NEW dotor feels that I am now bipolar . This , aftER 9 years of being considERed majorly depressed by evERyone else
I could handle the case changes, but doctor --> dotor is no good at all. When the input is much shorter (for example, when the above quotation is the entire input), the result is what I want:
My NEW doctor feels that I am now bipolar . This , after 9 years of being considered majorly depressed by everyone else
Could anybody explain to me why? In very simple terms, please, as I'm very new to programming and newer to Python. A step-by-step solution would be greatly appreciated.
I think your problem is that you're replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that doesn't mean that you should change "considered" to "considERed".
You can use regexes instead of simple text replacement to ensure that you replace only full words. "\b" in a regex means "word boundary":
>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'
#replace every occurence of the bad word with the suggestion
#this is almost certainly a bad idea :)
You were right, that is a bad idea. This is what's causing "considered" to be replaced by "considERed". Also, you're doing a replacement even when you don't find a suggestion. Move the replacement to the if len(suggestions) > 0 block.
As for replacing every instance of the word, what you want to do instead is save the positions of the misspelled words along with the text of the misspelled words (or maybe just the positions and you can look the words up in the text later when you're looking for suggestions), allow duplicate misspelled words, and only replace the individual word with its suggestion.
I'll leave the implementation details and optimizations up to you, though. A step-by-step solution won't help you learn as much.
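For anyone who wants to see the shape of such a fix after trying it themselves, one possible approach is to let re.sub visit each word where it stands and decide on the spot. This is a hedged sketch in Python 3, not the asker's or the answerers' code: correct_text and the toy dictionary are made up, and is_known and suggest are stand-ins for whatever checker is in use (with PyEnchant, passing d.check and d.suggest would presumably work).

```python
import re

def correct_text(text, is_known, suggest):
    """Replace each unknown word, where it stands, with its first suggestion."""
    def fix(match):
        word = match.group(0)
        if is_known(word):
            return word
        suggestions = suggest(word)
        # Leave the word untouched when there is nothing to suggest.
        return suggestions[0] if suggestions else word
    # The callback runs once per whole word, so letters inside other words are never touched.
    return re.sub(r"[A-Za-z'-]+", fix, text)

# Toy stand-ins (hypothetical) for a real dictionary.
known = {"my", "doctor", "considered", "the"}
fixes = {"er": ["ER"], "dotcor": ["doctor"]}
result = correct_text(
    "My dotcor considered the er",
    is_known=lambda w: w.lower() in known,
    suggest=lambda w: fixes.get(w, []),
)
print(result)  # My doctor considered the ER
```

Here "considered" is left alone even though "er" is corrected, because each replacement applies to one whole word at its own position.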

Python comparing elements in two lists

I have two lists:
a - dictionary which contains keywords such as ["impeccable", "obvious", "fantastic", "evident"] as elements of the list
b - sentences which contains sentences such as ["I am impeccable", "you are fantastic", "that is obvious", "that is evident"]
The goal is to use the dictionary list as a reference.
The process is as follows:
1. Take an element from the sentences list and run it against each element in the dictionary list. If any of the elements exists in it, then spit out that sentence to a new list.
2. Repeat step 1 for each of the elements in the sentences list.
Any help would be much appreciated.
Thanks.
Below is the code:
sentences = "The book was awesome and envious","splendid job done by those guys", "that was an amazing sale"
dictionary = "awesome","amazing", "fantastic","envious"
##Find Matches
    for match in dictionary:
        if any(match in value for value in sentences):
            print match
Now that you've fixed the original problem, and fixed the next problem with doing the check backward, and renamed all of your variables, you have this:
    for match in dictionary:
        if any(match in value for value in sentences):
            print match
And your problem with it is:
The way I have the code written i can get the dictionary items but instead i want to print the sentences.
Well, yes, your match is a dictionary item, and that's what you're printing, so of course that's what you get.
If you want to print the sentences that contain the dictionary item, you can't use any, because the whole point of that function is to just return True if any elements are true. It won't tell you which ones; in fact, if there is more than one, it'll stop at the first one.
If you don't understand functions like any and the generator expressions you're passing to them, you really shouldn't be using them as magic invocations. Figure out how to write them as explicit loops, and you will be able to answer these problems for yourself easily. (Note that the any docs directly show you how to write an equivalent loop.)
For example, your existing code is equivalent to:
    for match in dictionary:
        for value in sentences:
            if match in value:
                print match
                break
Written that way, it should be obvious how to fix it. First, you want to print the sentence instead of the word, so print value instead of match (and again, it would really help if you used meaningful variable names like sentence and word instead of meaningless names like value and misleading names like match…). Second, you want to print all matching sentences, not just the first one, so don't break. So:
    for match in dictionary:
        for value in sentences:
            if match in value:
                print value
And if you go back to my first answer, you may notice that this is the exact same structure I suggested.
You can simplify or shorten this by using comprehensions and iterator functions, but not until you understand the simple version, and how those comprehensions and iterator functions work.
First translate your algorithm into pseudocode instead of a vague description, like this:
    for each sentence:
        for each element in the dictionary:
            if the element is in the sentence:
                spit out the sentence to a new list
The only one of these steps that isn't completely trivial to convert to Python is "spit out the sentence to a new list". To do that, you'll need to have a new list before you get started, like a_new_list = [], and then you can call append on it.
Once you convert this to Python, you will discover that "I am impeccable and fantastic" gets spit out twice. If you don't want that, you need to find the appropriate place to break out of the inner loop and move on to the next sentence. Which is also trivial to convert to Python.
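Translated fairly literally, the pseudocode might come out as below. This is a sketch in Python 3 with made-up sample data and a made-up function name; note that "in" on strings is a substring test, so "evident" would also match inside "evidently".

```python
def sentences_with_keywords(sentences, dictionary):
    """Collect each sentence that contains at least one dictionary word."""
    a_new_list = []
    for sentence in sentences:
        for word in dictionary:
            if word in sentence:
                a_new_list.append(sentence)
                break  # one match is enough; avoids adding the sentence twice
    return a_new_list

dictionary = ["impeccable", "obvious", "fantastic", "evident"]
sentences = ["I am impeccable and fantastic", "you are fantastic", "that is plain"]
print(sentences_with_keywords(sentences, dictionary))
# ['I am impeccable and fantastic', 'you are fantastic']

# Once the loop version is clear, the same result can be written with any():
shorter = [s for s in sentences if any(w in s for w in dictionary)]
```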
Now that you've posted your code… I don't know what problem you were asking about, but there's at least one thing obviously wrong with it.
sentences is a list of sentences.
So, for partial in sentences means each partial will be a sentence, like "I am impeccable".
dictionary is a list of words. So, for value in dictionary means each value will be a word, like "impeccable".
Now, you're checking partial in value for each value for each partial. That will never be true. "I am impeccable" is not in "impeccable".
If you turn that around, and check whether value in partial, it will give you something that's at least true sometimes, and that may even be what you actually want, but I'm not sure.
As a side note, if you used better names for your variables, this would be a lot more obvious. partial and value don't tell you what those things actually are; if you'd called them sentence and word it would be pretty clear that sentence in word is never going to be true, and that word in sentence is probably what you wanted.
Also, it really helps to look at intermediate values to debug things like this. When you use an explicit for statement, you can print(partial) to see each thing that partial holds, or you can put a breakpoint in your debugger, or you can step through in a visualizer. If you have to break the any(genexpr) up into an explicit loop to do that, then do so. (If you don't know how, then you probably don't understand what generator expressions or the any function do, and have just copied and pasted random code you didn't understand and tried changing random things until it worked… in which case you should stop doing that and learn what they actually mean.)

for loop only returning the last word in a list of many words

    for word in wordStr:
        word = word.strip()
    print word
When the above code analyzes a .txt with thousands of words, why does it only return the last word in the .txt file? What do I need to do to get it to return all the words in the text file?
Because you're overwriting word in the loop, not really a good idea. You can try something like:
    wordlist = ""
    for word in wordStr:
        wordlist = "%s %s"%(wordlist,word.strip())
    print wordlist[1:]
This is fairly primitive Python and I'm sure there's a more Pythonic way to do it with list comprehensions and all that new-fangled stuff :-) but I usually prefer readability where possible.
What this does is to maintain a list of the words in a separate string and then add each stripped word to the end of that list. The [1:] at the end is simply to get rid of the initial space that was added when the first word was tacked on to the end of the empty word list.
It will suffer eventually as the word count becomes substantial since tacking things on to the end of a string is less optimal than other data structures. However, even up to 10,000 words (with the print removed), it's still well under a second of execution time.
At 50,000 words it becomes noticeable, taking 3 seconds on my box. If you're going to be processing that sort of quantity, you would probably opt for a real list-based solution like (equivalent to above but with a different underlying data structure):
    wordlist = []
    for word in wordStr:
        wordlist.append (word.strip())
    print wordlist
That takes about 0.22 seconds (without the print) to do my entire dictionary file, some 110,000 words.
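If a single space-separated string is still what you want at the end, str.join builds it in one pass instead of growing a string inside the loop. A small sketch in Python 3 follows; the function name and sample input are made up, and no timing claim is made for it here.

```python
def join_stripped(lines):
    """Strip each item and join the results with single spaces."""
    return " ".join(line.strip() for line in lines)

# Hypothetical input, standing in for the lines of the word file.
print(join_stripped(["apple\n", " banana \n", "cherry"]))  # apple banana cherry
```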
To print all the words in wordStr (assuming that wordStr is some kind of iterable that yields strings), you can simply write
    for word in wordStr:
        word = word.strip()
        print word # Notice that the only difference is the indentation on this line
Python cares about indentation, so in your code the print statement is outside the loop and is only executed once. In the modified version, the print statement is inside the loop and is executed once per word.
That is because the variable word always holds the current word, so once the loop has finished with the file, it holds the last one. Collect the words as you go instead:
    def stuff():
        words = []
        for word in wordStr:
            words.append(word.strip())
        print words
        return words
List comprehensions should make your code snazzier and more pythonic.
wordStr = 'here are some words'
print [word.strip() for word in wordStr.split()]
returns
['here', 'are', 'some', 'words']
If you don't know why the code in your example is "returning" only the last word (actually it's not returning anything, it's printing a single word), then I'm afraid nothing here is going to help you very much.
I know that sounds hostile, but I don't wish to be. It seems from your question that you are throwing bits of Python together with little real understanding of the fundamental basics of programming in Python.
Now don't get me wrong, trying stuff out is a great early learning activity, and having a task to motivate your learning is also a great way to do it. So I don't want to tell you to stop! But whether you're trying to learn to program or just have a task that needs you to write a program, you aren't going to get very far without developing an understanding of the fundamental underlying issues that make your example not do what you want it to do.
We can tell you here that this:
    for word in wordStr:
        word = word.strip()
    print word
is a program that roughly means "for every word in wordStr, bind word to the result of word.strip(); then after all that, print the contents of word", whereas what you wanted is likely:
    for word in wordStr:
        word = word.strip()
        print word
which is a program that roughly means "for every word in wordStr, bind word to the result of word.strip() and then print word". And that solves your immediate problem. But you're going to run into many more problems of a very similar nature, and without an understanding of the very basics you won't be able to see that they're all "of a kind", and you'll just end up asking more questions here.
What you need is to gain a basic understanding of how variables, statements, loops, etc work in Python. You will probably eventually gain that if you just keep trying to apply code and ask questions here. But Stack Overflow is not the most efficient resource for gaining that understanding; a much better bet would be to find yourself a good book or tutorial (there's an official one at http://docs.python.org/tutorial/).
Here endeth the soap box.
