Using Python to check words

I'm stuck on a simple problem. I've got a dictionary of words in the English language, and a sample text that is to be checked. I've got to check every word in the sample against the dictionary, and the code I'm using is wrong.
for word in checkList:      # iterates through every word in the sample
    if word not in refDict: # checks if the word is not in the dictionary
        print(word)         # just to see if it's recognizing misspelled words
The only problem is, as it goes through the loop it prints out every word, not just the misspelled ones. Can someone explain this and offer a solution possibly? Thank you so much!

The snippet you have is functional. See for example:
>>> refDict = {'alpha': 1, 'bravo': 2, 'charlie': 3, 'delta': 4}
>>> s = 'he said bravo to charlie O\'Brian and jack Alpha'
>>> for word in s.split():
...     if word not in refDict:
...         print(repr(word))  # by temporarily using repr() we can see
...                            # exactly what the words are like
...
'he'
'said'
'to'
"O'Brian"
'and'
'jack'
'Alpha'    # note how 'Alpha' was not found in refDict (upper/lower case difference)
Therefore, the dictionary contents must differ from what you think, or the words out of checkList are not exactly as they appear (e.g. with extra whitespace or different capitalization; the repr() call (*) in the print above helps identify the latter).
Debugging suggestion: focus on the first word from checkList (or the first one you expect to be found in the dictionary). Then, for this word and this word only, print it in detail: its length, with brackets on either side, etc., for both the word out of checkList and the corresponding key in the dictionary.
(*) repr() was a suggestion from John Machin. Instead, I often use brackets or other characters, as in print('[' + word + ']'), but repr() is more exacting in its output.
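For example, a minimal sketch of that side-by-side diagnostic (checkList and refDict as in the question; 'alpha' is a placeholder for whichever key you expect the word to match):
word = checkList[0]            # focus on a single suspect word
key = 'alpha'                  # replace with the key you expect it to match
print(repr(word), len(word))   # exact contents and length of the word
print(repr(key), len(key))     # same for the dictionary key, for comparison
print('[' + word + ']')        # the bracket trick mentioned above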

Consider stripping your words of any whitespace that might be there, and changing all the words of both sets to the same case. Like this:
word.strip().lower()
That way you can make sure you're comparing apples to apples.
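A minimal sketch of that normalization applied to both sides (assuming refDict's keys are strings):
normalized = {key.strip().lower() for key in refDict}
for word in checkList:
    if word.strip().lower() not in normalized:
        print(word)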

Clearly "word not in refDict" always evaluates to True. This is probably because the contents of refDict or checkList are not what you think they are. Are they both tuples or lists of strings?

The code you have would work if the keys in refDict are the correctly spelt words. If the correctly spelt words are the values in your dict, then you need something like this:
for word in checkList:
    if word not in refDict.values():
        print(word)
Is there a reason your dictionary is stored as a mapping rather than a list or a set? A Python dict contains key-value pairs; for example, the mapping {"dog": 23, "cat": 45, "pony": 67} could store words and the page numbers they are found on in some book. In your case, your dict is a mapping of what to what?

Are the words in the refDict the keys or the values?
Your code will only see keys, e.g.:
refDict = {'w': 'x', 'y': 'z'}
for word in ['w', 'x', 'y', 'z']:
    if word not in refDict:
        print(word)
prints:
x
z
Otherwise you want:
if word not in refDict.values():
Of course, this rather assumes that your dictionary is an actual Python dictionary, which seems an odd way to store a list of words.

Your refDict is probably wrong. The in keyword checks whether the value is among the keys of the dictionary, and I believe you've put your words in as values.
I'd propose using a set instead of a dictionary.
knownwords = {"dog", "cat"}
knownwords.add("apple")

text = "The dog eats an apple."
for word in text.split(" "):
    # the word is converted to lowercase to ignore case
    if word.lower() not in knownwords:
        print(word)
# The
# eats
# an
# apple. <- doesn't work because of the dot
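To handle that trailing dot, a minimal sketch that strips punctuation before the lookup (reusing knownwords and text from above):
import string

for word in text.split():
    cleaned = word.strip(string.punctuation).lower()  # "apple." -> "apple"
    if cleaned not in knownwords:
        print(cleaned)
# the
# eats
# an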

Related

Using Regex to encompass a group of keys in a dictionary and match them inside of a list of strings

I'm new to text-cleaning in Python, but I created a dictionary with various slang words/acronyms/contractions that looks something like this:
fulltext = {'BYOB': 'bring your own beer', "couldn't": 'could not', 'finna': 'going to'} ... etc.
and I have another large corpus of text data:
uncleaned_text = ["This is finna be crazy", "I don't know why we couldn't be there", "I should have known when the event was BYOB that it would be terrible"]
which I am trying to 'clean' by replacing the words inside the list of strings that match the dictionary keys with their corresponding values. So my ideal output would be:
cleaned_text = ["This is going to be crazy", "I don't know why we could not be there", "I should have known when the event was bring your own beer that it would be terrible"]
I know I should be using regex in some way, and I know I should be using loops, but I am definitely not even close to what I should be doing, I think, because the error I get is "builtin function not iterable"...
Any suggestions?
for sentence in uncleaned_text:
    for word in sentence:
        if word in fulltext.keys:
            word.replace(word, fulltext.key)
The error you are receiving is because dictionary.keys is a function, not a list. So to get all the keys, you would call fulltext.keys(), not fulltext.keys; the keys member of the dictionary class is a method that returns a view of the keys. The more Pythonic way of checking whether a specific word exists among the dictionary's keys is if key in dictionary: the in operator checks the keys directly, so you don't have to call .keys() at all.
For the rest of the function I'd do the following:
clean_text = []
for sentence in uncleaned_text:
    for word in sentence.split():
        if word in fulltext:
            sentence = sentence.replace(word, fulltext[word])
    clean_text.append(sentence)
The changes I made, explained:
You'll need to split the sentence into words. The sentence is just a long string, so iterating over it directly would give you every character individually; the .split() method splits it on whitespace by default.
The replace method doesn't change the string in place, so you have to catch the result in another variable.
To get a value out of the dictionary, you need its key. word is our key in this case, so I changed fulltext.key to fulltext[word], which gets the value associated with word from the fulltext dictionary.
Added a list to append the changed sentences to.
This will leave the original list (uncleaned_text) unchanged.
This might be helpful:
import re

fulltext = {"BYOB": "bring your own beer", "couldn't": "could not", "finna": "going to"}
uncleaned_text = ["This is finna be crazy", "I don't know why we couldn't be there", "I should have known when the event was BYOB that it would be terrible"]
cleaned_text = []
keys = fulltext.keys()
for text in uncleaned_text:
    for key in keys:
        if key in text:
            text = re.sub(key, fulltext[key], text)
    cleaned_text.append(text)  # append once per sentence, after all keys are applied
print("cleaned_text => ", cleaned_text)
However, this code will take a long time to run if you have lots of data because of the nested for-loop.
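A hedged alternative that avoids rescanning each sentence once per key: compile a single alternation of all the keys and let a replacement function look up whichever key matched (re.escape guards keys that happen to contain regex metacharacters):
import re

pattern = re.compile('|'.join(re.escape(key) for key in fulltext))
cleaned_text = [pattern.sub(lambda m: fulltext[m.group(0)], text)
                for text in uncleaned_text]
print(cleaned_text)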

Difference between "word for word" and "char for char" in Python

For a sentence called txt that I have split into words, I am trying to filter the words of 4+ characters into a list.
I've tried both char for char in words if len(char) >= 4 and word for word in words if len(word) >= 4, and they give the same output. What's the difference between them? Which one should I use?
They give you the same output because you're actually asking Python to do the same thing. Try doing:
foo for foo in words if len(foo) >= 4
In this case, foo is just the name of the variable - in the list comprehension it 'becomes' each word within the words list. The variable is also referenced in the if statement.
char and word are not special commands in this context - they are just variables named in a way that describes what they are.
This is the same with a variable declared in a loop:
for word in words:
    print(word)
or:
for foo in words:
    print(foo)
do the same thing - arguably the first is better because the name of the variable word helps give context to people reading the code.
In this context, 'char' and 'word' are just variable names. All the code is saying in either example is that you want to iterate over a list called 'words' and refer to the current item as 'word'/'char'. Either is acceptable, but you should read up on list comprehensions to understand more.
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
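A tiny demonstration that the variable name is irrelevant (sample list assumed):
words = ["tree", "cat", "houses", "sun"]
print([word for word in words if len(word) >= 4])  # ['tree', 'houses']
print([char for char in words if len(char) >= 4])  # identical output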

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so-called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter); the expected output word is then all characters up to the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy; all characters are included up to the next space)
without (since it comes after coding; the number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)
You can solve it using regex, e.g.:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and creates a non-capturing group of all the words in the list.
\s+? is a lazy quantifier that matches the spaces after any of those occurrences, up to the next character that isn't a space.
([^\s]+) captures the text right after your key_words, up to the next space.
Note: if you're running this many times, e.g. inside a loop, you ought to use re.compile on the regex string beforehand to improve performance.
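For instance, a sketch of that compiled form (same pattern as above, with re.escape added as a guard in case a keyword ever contains regex metacharacters):
import re

key_words = ["happy", "coding"]
pattern = re.compile(r'(?:{0})\s+?([^\s]+)'.format('|'.join(map(re.escape, key_words))))
print(pattern.findall("happy yes_no!?. why coding without paus happy yes"))
# ['yes_no!?.', 'without', 'yes']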
We will use Python's re module to split your string on whitespace.
Then the idea is to go over each word and check whether it is one of your keywords. If it is, we set take_it to True, so that on the next pass through the loop the following word is appended to taken, which stores all the words you're looking for.
import re

def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:                  # the previous word was a keyword
            taken.append(word)
        take_it = word in keywords
    return taken

print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

Separating between Hebrew and English strings

So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string

data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)
And it works, but it is of course very slow, and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters, which was much faster, but it only filters out the strings that start with an English letter and leaves the "mixed" ones in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check whether a string contains any ASCII letters (i.e. non-Hebrew), use:
re.search('[' + string.ascii_letters + ']', s)
If this returns a match, your string is not pure Hebrew.
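Put together, a minimal sketch (assuming slist holds your strings, as in the question):
import re
import string

ascii_letter = re.compile('[' + string.ascii_letters + ']')
data = [s for s in slist if not ascii_letter.search(s)]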
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: just noticed that this selects the English-only strings, but you need it the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
Python has extensive Unicode support, so it depends on what you're asking for. Is a Hebrew word one that contains only Hebrew characters and whitespace, or simply one that contains no Latin characters? Either way, you can do so directly: just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iterating through string.ascii_letters.
Please note that I do not speak Hebrew, so I may have missed a letter or two of the alphabet.
import string

def is_hebrew(word):
    hebrew = set("אבגדהוזחטיכךלמנסעפצקרשתםןףץ" + string.whitespace)
    for char in word:
        if char not in hebrew:
            return False
    return True

def contains_latin(word):
    # a generator expression like this is a terser way of
    # expressing the same membership test
    return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())

hebrew_words = [word for word in words if is_hebrew(word)]
non_latin_words = [word for word in words if not contains_latin(word)]
Another option would be to create a dictionary of Hebrew words:
hebrew_words = {...}
Then you iterate through the list of words and compare them against this dictionary. This works much faster than the other approaches (O(n), where n is the length of your list of words).
The downside is that you need to get all or most Hebrew words from somewhere. I think it's possible to find such a list on the web in CSV or some other form; parse it and put it into a Python dictionary.
However, this only makes sense if you need to parse such lists of words often and quickly. Another problem is that the dictionary may not contain every Hebrew word, which would not give a completely right answer.
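A minimal sketch of that lookup (hebrew_vocab here is a tiny hypothetical stand-in for a full word list loaded from such a source; slist as in the question):
hebrew_vocab = {"שלום", "ספר", "בית"}  # stand-in for the full Hebrew word list
hebrew_only = [s for s in slist if s in hebrew_vocab]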
Try this:
>>> import re
>>> filter(lambda x: re.match(r'^[^\w]+$', x), slist)
(Note that in Python 3, \w also matches Hebrew letters, so this pattern only behaves as intended under Python 2's ASCII-only \w.)

Find max length word from arbitrary letters

I have 10 arbitrary letters and need to check the maximum-length match from a words file.
I started to learn regular expressions only some time ago and can't seem to find a suitable pattern.
The first idea that came to mind was using a character set, [10 chars], but that also lets the included chars repeat, and I don't know how to avoid that.
I started to learn Python recently, before regular expressions, so maybe regex is not needed and this can be solved without it.
Using a "for this in that:" loop seems inappropriate, but maybe itertools can do it easily (I'm not familiar with it).
I guess the solution is known even to novice programmers/scripters, but not to me.
Thanks
I'm guessing this is something like finding possible words given a set of Scrabble tiles, so that a character can be repeated only as many times as it is repeated in the original list.
The trick is to efficiently test each character of each word in your word file against a set containing your source letters. For each character, if found in the test set, remove it from the test set and proceed; otherwise, the word is not a match, and go on to the next word.
Python has a nice function all for testing a set of conditions based on elements in a sequence. all has the added feature that it will "short-circuit", that is, as soon as one item fails the condition, then no more tests are done. So if your first letter of your candidate word is 'z', and there is no 'z' in your source letters, then there is no point in testing any more letters in the candidate word.
My first shot at writing this was simply:
matches = []
for word in wordlist:
    testset = set(letters)
    if all(c in testset for c in word):
        matches.append(word)
Unfortunately, the bug here is that if the source letters contained a single 'm', a word with several 'm's would erroneously match, since each 'm' would separately match the given 'm' in the source testset. So I needed to remove each letter as it was matched.
I took advantage of the fact that set.remove(item) returns None, which Python treats as a Boolean False, and expanded my generator expression used in calling all. For each c in word, if it is found in testset, I want to additionally remove it from testset, something like (pseudo-code, not valid Python):
all(c in testset and "remove c from testset" for c in word)
Since set.remove returns a None, I can replace the quoted bit above with "not testset.remove(c)", and now I have a valid Python expression:
all(c in testset and not testset.remove(c) for c in word)
Now we just need to wrap that in a loop that checks each word in the list (be sure to build a fresh testset before checking each word, since our all test has now become a destructive test):
for word in wordlist:
    testset = set(letters)
    if all(c in testset and not testset.remove(c) for c in word):
        matches.append(word)
The final step is to sort the matches by descending length. We can pass a key function to sort. The builtin len would be good, but that would sort by ascending length. To change it to a descending sort, we use a lambda to give us not len, but -1 * len:
matches.sort(key=lambda wd: -len(wd))
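(An equivalent, arguably clearer spelling is matches.sort(key=len, reverse=True), using the reverse flag instead of negating len.)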
Now you can just print out the longest word, at matches[0], or iterate over all matches and print them out.
(I was surprised that this brute force approach runs so well. I used the 2of12inf.txt word list, containing over 80,000 words, and for a list of 10 characters, I get back the list of matches in about 0.8 seconds on my little 1.99GHz laptop.)
I think this code will do what you are looking for:
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(words.split()))
If you require more sophisticated tokenising, for example if you're not using Latin text, you should use NLTK:
>>> import nltk
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(nltk.word_tokenize(words)))
I assume you are trying to find the longest word that can be made from your 10 arbitrary letters.
You can keep your 10 arbitrary letters in a dict along with the frequency with which each occurs.
e.g., if your 4 (using 4 instead of 10 for simplicity) arbitrary letters are e, w, l, l, this would be the dict:
{'e': 1, 'w': 1, 'l': 2}
Then for each word in the text file, see whether all of the letters of that word can be found, often enough, in your dict of arbitrary letters. If so, it is one of your candidate words.
So, given:
we
wall
well
all of the letters in "well" can be found in your dict of arbitrary letters, so save it and its length for comparison against other words.
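A minimal sketch of this frequency idea using collections.Counter (sample word list assumed; unlike the set-based version in the earlier answer, this also copes with repeated letters among the source letters):
from collections import Counter

def can_build(word, letters):
    # every letter of `word` must occur at least as often in `letters`
    available = Counter(letters)
    needed = Counter(word)
    return all(available[c] >= needed[c] for c in needed)

words = ["we", "wall", "well"]
candidates = [w for w in words if can_build(w, "ewll")]
print(max(candidates, key=len))  # well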
