How to check generated strings against a text file - python

I'm trying to have the user input a string of characters with one asterisk. The asterisk indicates a character that can be subbed out for a vowel (a,e,i,o,u) in order to see what substitutions produce valid words.
Essentially, I want to take an input "l*g" and have it return "lag, leg, log, lug" because "lig" is not a valid English word. Below I have invalid words to be represented as "x".
I've gotten it to properly output each possible combination (e.g., including "lig"), but once I try to compare these words with the text file I'm referencing (for the list of valid words), it'll only return 5 lines of x's. I'm guessing it's that I'm improperly importing or reading the file?
Here's the link to the file I'm looking at so you can see the formatting:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip
Using the "en" file ~2.5MB
It's not in a dictionary layout, i.e. no corresponding keys/values, just lines (maybe I could use the line number as the index, but I don't know how to do that). What can I change so I can check the test words against the text file and narrow down which are valid words?
import os

with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.readlines()

inputted_word = input("Enter a word with ' * ' as the missing letter: ")

letters = []
for l in inputted_word:
    letters.append(l)

### find the index of the blank
asterisk = inputted_word.index('*')  # also used a redundant int(), works fine

### sub in vowels
vowels = ['a', 'e', 'i', 'o', 'u']
list_of_new_words = []
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)
    list_of_new_words.append(new_word)

for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
There are probably more efficient ways to do this, but I'm brand new to this. The last two for loops could probably be combined but debugging it was tougher that way.

print(list_of_new_words)
gives
['lag', 'leg', 'lig', 'log', 'lug']
So far, so good.
But this :
for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
Here you print new_word, which is defined in the previous for loop:
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)  # <----
    list_of_new_words.append(new_word)
So after the loop, new_word still holds the last value it was assigned: "lug" (if the script input was l*g).
You probably meant w instead?
for w in list_of_new_words:
    if w in words:
        print(w)
    else:
        print('x')
But it still prints five x's...
So that means that w in words is always False. How can that be?
Looking at words:
print(words[0:10])  # the first 10 will suffice
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
All the words from the dictionary contain a newline character (\n) at the end; I guess you were not aware that this is what readlines does. So I recommend using:
words = f.read().splitlines()
instead.
With these two modifications (w and splitlines):
Enter a word with ' * ' as the missing letter: l*g
lag
leg
x
log
lug
🎉
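On the efficiency point the asker raised: membership tests against a list are O(n) per lookup, so for a large word list a set is the usual choice, and the two loops can be merged. A minimal sketch of the whole program with both fixes applied (the word list here is invented stand-in data; the real script would load ~/Downloads/words/en with splitlines as above):

```python
def find_valid_words(pattern, words):
    """Substitute each vowel for the '*' in pattern; return the word if
    it is in the word list, else 'x'."""
    valid = set(words)                 # set gives O(1) membership tests
    asterisk = pattern.index('*')
    results = []
    for v in 'aeiou':
        candidate = pattern[:asterisk] + v + pattern[asterisk + 1:]
        results.append(candidate if candidate in valid else 'x')
    return results

# Stand-in for: words = f.read().splitlines() on the real dictionary file
words = ['lag', 'leg', 'log', 'lug', 'aardvark']
print(find_valid_words('l*g', words))   # ['lag', 'leg', 'x', 'log', 'lug']
```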

Related

python file handling can't search all words

I'm trying to count the number of times a word appears in a file using Python file handling. For example, I was searching for 'believer' in the lyrics of the song Believer to see how many times it comes up. It appears 18 times, but my program gives 12. What conditions am I missing?
def no_words_function():
    f = open("believer.txt", "r")
    data = f.read()
    cnt = 0
    ws = input("Enter word to find: ")
    word = data.split()
    for w in word:
        if w in ws:
            cnt += 1
    f.close()
    print(ws, "found", cnt, "times in the file.")

no_words_function()
If you want the search to be case-insensitive (assuming the entered word is in lower case), you can use the code below:
for w in word:
    if ws.lower() in w.lower():
        cnt += 1
You are not cleaning the data of trailing characters (commas, quotes, periods, and so on). This means your code will not find "believer," in the text.
You are also not doing case-insensitive comparisons. This means your code will not find "Believer" in the text. Based on your search needs you might want to do that.
For cleaning data:
word = data.split()
word = [w.strip("'\".,") for w in word] # Add other trailing characters you do not want
For case-insensitive search:
word = [w.lower() for w in word]
The reason you only find 12 of the 18 times "believer" occurs is because of your test inside the for loop.
Instead of writing
if w in ws:
    cnt += 1
you should reverse the order
if ws in w:
    cnt += 1
To understand why, let's look at one of the lines in your test: You break me down, you build me up, believer, believer. If you split this line you get the following result:
line = "You break me down, you build me up, believer, believer"
line.split()
Out[26]:
['You', 'break', 'me', 'down,',
 'you', 'build', 'me', 'up,',
 'believer,', 'believer']
As you can see, the ninth element in this list is believer,. If you test 'believer,' in 'believer' the result will be False. However, if you test 'believer' in 'believer,' the result will be True.
As others have mentioned, it is also a good idea to convert the search string and your search word to lower case, if you want to ignore case.
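Putting the points above together (strip trailing punctuation, compare in lower case, and test ws in w rather than w in ws), a corrected version of the counting logic might look like this sketch (the count_word helper and the exact set of stripped characters are assumptions, not the original code):

```python
def count_word(text, target):
    """Count whitespace-separated tokens containing target,
    ignoring case and common leading/trailing punctuation."""
    cnt = 0
    for token in text.split():
        token = token.strip("'\".,!?").lower()
        if target.lower() in token:
            cnt += 1
    return cnt

line = "You break me down, you build me up, believer, believer"
print(count_word(line, "believer"))  # prints 2
```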

Python: How to delete numbers in list

Python learner here. So I have a wordlist.txt file, one word on each line. I want to filter out specific words starting and ending with specific letters. But in my wordlist.txt, words are listed with their occurrence numbers.
For example:
food 312
freak 36
cucumber 1
Here is my code
wordList = open("full.txt", "r", encoding="utf-8")
word = wordList.read().splitlines()
for i in word:
    if i.startswith("h") and i.endswith("e"):
        print(i)
But since each item in the list has numbers at the end I can't filter correct words. I could not figure out how to omit those numbers.
Try splitting the line using space as the delimiter and use the first value [0] which is the word in your case
for i in word:
    if i.split(" ")[0].startswith("h") and i.split(" ")[0].endswith("e"):
        print(i.split(" ")[0])
Or you can just perform the split once:
for i in word:
    w = i.split(" ")[0]
    if w.startswith("h") and w.endswith("e"):
        print(w)
EDIT: Based on the comment below, you may want to use no argument or None to split in case there happen to be two spaces or a tab as a field delimiter.
w = i.split()[0]
Try this:
import re

# use a name other than str, to avoid shadowing the built-in
s = "This must not b3 delet3d, but the number at the end yes 12345"
s = re.sub(r" \d+", "", s)
After this, s will be:
"This must not b3 delet3d, but the number at the end yes"

remove only the unknown words from a text but leave punctuation and digits

I have a text in French containing words that are separated by a space (e.g. répu blique). I want to remove these separated words from the text and append them to a list, while keeping punctuation and digits in the text. My code works for appending the separated words, but it does not keep the digits in the text.
import nltk
from nltk.tokenize import word_tokenize
import re

with open('french_text.txt') as tx:  # opening text containing the separated words
    text = word_tokenize(tx.read().lower())  # stores the text with the separated words

with open('Fr-dictionary.txt') as fr:  # opens the dictionary
    dic = word_tokenize(fr.read().lower())  # stores the first dictionary

pat = re.compile(r'[.?\-",:]+|\d+')
out_file = open("newtext.txt", "w")  # defining name of output file
valid_words = []  # empty list to append the words checked by the dictionary
invalid_words = []  # empty list to append the errors found

for word in text:
    reg = pat.findall(word)
    if reg is True:
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word)  # appending to a list the words checked
    else:
        invalid_words.append(word)  # appending the invalid_words

a = ' '.join(valid_words)  # converting list into a string
print(a)  # print converted list
print(invalid_words)  # print errors found
out_file.write(a)  # writing the output to a file
out_file.close()
so, with this code, my list of errors come with the digits.
['ments', 'prési', 'répu', 'blique', 'diri', 'geants', '»', 'grand-est', 'elysée', 'emmanuel', 'macron', 'sncf', 'pepy', 'montparnasse', '1er', '2017.', 'geoffroy', 'hasselt', 'afp', 's', 'empare', 'sncf', 'grand-est', '26', 'elysée', 'emmanuel', 'macron', 'sncf', 'saint-dié', 'epinal', '23', '2018', 'etat', 's', 'vosges', '2018']
I think the problem is with the regular expression. Any suggestions? Thank you!!
The problem is with your if statement where you check reg is True. pat.findall(word) returns a list, and a list is never the object True, so the is operator is the wrong way to check whether you had a matching word.
You can do this instead:
for word in text:
    if pat.match(word):
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word)  # appending to a list the words checked
    else:
        invalid_words.append(word)  # appending the invalid_words
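To see concretely why the original test never fires, note that findall always returns a list, and an identity comparison between a list and True is always False, even on a successful match:

```python
import re

pat = re.compile(r'[.?\-",:]+|\d+')

print(pat.findall('2018'))          # ['2018'] -- a non-empty list
print(pat.findall('2018') is True)  # False: a list is never the object True
print(bool(pat.findall('2018')))    # True: truthiness is what was intended
```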
Caveat user: this is actually a complex problem, because it all depends on what we define to be a word:
is l’Académie a single word, how about j’eus ?
is gallo-romanes a single word, or c'est-à-dire?
how about J.-C.?
and xiv(e) (with superscript, as in 14e siècle)?
and then QDN or QQ1 or LOL?
Here's a direct solution, that's summarised as:
break up text into "words" and "non-words" (punctuation, spaces)
validate "words" against a dictionary
import re

# Adjust this to your locale
WORD = re.compile(r'\w+')

text = "foo bar, baz"
while True:
    m = WORD.search(text)
    if not m:
        if text:
            print(f"punctuation: {text!r}")
        break
    start, end = m.span()
    punctuation = text[:start]
    word = text[start:end]
    text = text[end:]
    if punctuation:
        print(f"punctuation: {punctuation!r}")
    print(f"possible word: {word!r}")
possible word: 'foo'
punctuation: ' '
possible word: 'bar'
punctuation: ', '
possible word: 'baz'
I get a feeling that you are trying to deal with intentionally misspelt / broken up words, e.g. if someone is trying to get around forum blacklist rules or speech analysis.
Then, a better approach would be:
identify what might be a "word" or "non-word" using a dictionary
then break up the text
If the original text was made to evade computers but be readable by humans, your best bet would be ML/AI, most likely a neural network, like RNN's used to identify objects in images.

Python 3 - How to capitalize first letter of every sentence when translating from morse code

I am trying to translate morse code into words and sentences and it all works fine... except for one thing. My entire output is lowercased and I want to be able to capitalize every first letter of every sentence.
This is my current code:
text = input()
if is_morse(text):
    lst = text.split(" ")
    text = ""
    for e in lst:
        text += TO_TEXT[e].lower()
    print(text)
Each element in the split list is equal to a character (but in morse) NOT a WORD. 'TO_TEXT' is a dictionary. Does anyone have a easy solution to this? I am a beginner in programming and Python btw, so I might not understand some solutions...
Maintain a flag telling you whether or not this is the first letter of a new sentence. Use that to decide whether the letter should be upper-case.
text = input()
if is_morse(text):
    lst = text.split(" ")
    text = ""
    first_letter = True
    for e in lst:
        if first_letter:
            this_letter = TO_TEXT[e].upper()
        else:
            this_letter = TO_TEXT[e].lower()
        # Period heralds a new sentence.
        first_letter = this_letter == "."
        text += this_letter
    print(text)
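The flag-based approach can be exercised with a tiny stand-in TO_TEXT table (the real dictionary maps every Morse symbol; this four-entry subset is invented purely for illustration):

```python
# Minimal stand-in for the question's TO_TEXT dictionary.
TO_TEXT = {".....": "h", ".": "e", "-.--": "y", ".-.-.-": "."}

def translate(morse):
    text = ""
    first_letter = True
    for symbol in morse.split(" "):
        letter = TO_TEXT[symbol]
        text += letter.upper() if first_letter else letter.lower()
        # A period heralds a new sentence.
        first_letter = letter == "."
    return text

print(translate("..... . -.-- .-.-.- ..... ."))  # prints "Hey.He"
```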
From what is understandable from your code, you can use Python's title() method.
For a more stringent result, you can use the capwords() function from the string module.
This is what you get from Python docs on capwords:
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.
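Note that, per the documentation quoted above, both title() and capwords() capitalize every word, not just the first word of each sentence, so they don't exactly match the requirement here. A quick check:

```python
import string

s = "hey there. how are you."
print(s.title())           # 'Hey There. How Are You.'
print(string.capwords(s))  # 'Hey There. How Are You.'
```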

Python - Bug in code

This might be an easy one, but I can't spot where I am making the mistake.
I wrote a simple program to read words from a wordfile (don't have to be dictionary words), sum the characters and print them out from lowest to highest. (PART1)
Then, I wrote a small script after this program to filter and search for only those words which contain only alphabetic characters. (PART2)
While the first part works correctly, the second part prints nothing. I think the error is at the line 'print ch' where a character of a list converted to string is not being printed. Please advise what could be the error
#!/usr/bin/python

# compares two words and checks if word1 has smaller sum of chars than word2
def cmp_words(word_with_sum1, word_with_sum2):
    (word1_sum, __) = word_with_sum1
    (word2_sum, __) = word_with_sum2
    return word1_sum.__cmp__(word2_sum)

# PART1
word_data = []
with open('smalllist.txt') as f:
    for l in f:
        word = l.strip()
        word_sum = sum(map(ord, list(word)))
        word_data.append((word_sum, word))

word_data.sort(cmp_words)
for index, each_word_data in enumerate(word_data):
    (word_sum, word) = each_word_data

# PART2
# we only display words that contain alphabetic characters and numbers
valid_characters = [chr(ord('A') + x) for x in range(0, 26)] + [x for x in range(0, 10)]

# returns true if only alphabetic characters found
def only_alphabetic(word_with_sum):
    (__, single_word) = word_with_sum
    map(single_word.charAt, range(0, len(single_word)))
    for ch in list(single_word):
        print ch  # problem might be in this loop -- can't see ch
        if not ch in valid_characters:
            return False
    return True

valid_words = filter(only_alphabetic, word_data)
for w in valid_words:
    print w
Thanks in advance,
John
The problem is that charAt does not exist in Python, so map(single_word.charAt, ...) raises an AttributeError before anything is printed.
You can iterate directly: for ch in single_word.
Notes:
you can use the builtin str.isalnum() for your test
valid_characters contains only the uppercase version of the alphabet, and its digits are added as integers (0-9) rather than characters ('0'-'9'), so digit characters will never match
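Using str.isalnum() as suggested, the filter shrinks to a few lines and sidesteps the hand-built character list entirely. A sketch in Python 3 (the sample word_data tuples are invented for illustration):

```python
def only_alphanumeric(word_with_sum):
    """True if the word consists solely of letters and digits."""
    _, single_word = word_with_sum
    return single_word.isalnum()

word_data = [(300, "abc"), (250, "a-b"), (150, "x1"), (90, "!")]
valid_words = list(filter(only_alphanumeric, word_data))
print(valid_words)   # [(300, 'abc'), (150, 'x1')]
```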
