Python learner here. So I have a wordlist.txt file, one word on each line. I want to filter out specific words starting and ending with specific letters. But in my wordlist.txt, words are listed with their occurrence numbers.
For example:
food 312
freak 36
cucumber 1
Here is my code
wordList = open("full.txt","r", encoding="utf-8")
word = wordList.read().splitlines()
for i in word:
    if i.startswith("h") and i.endswith("e"):
        print(i)
But since each item in the list has a number at the end, I can't filter the correct words. I could not figure out how to omit those numbers.
Try splitting each line using a space as the delimiter and taking the first value, [0], which is the word in your case:
for i in word:
    if i.split(" ")[0].startswith("h") and i.split(" ")[0].endswith("e"):
        print(i.split(" ")[0])
Or you can just perform the split once:
for i in word:
    w = i.split(" ")[0]
    if w.startswith("h") and w.endswith("e"):
        print(w)
EDIT: Based on the comment below, you may want to use no argument or None to split in case there happen to be two spaces or a tab as a field delimiter.
w = i.split()[0]
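Putting the pieces together, here is a minimal sketch of the whole filter. The function name words_matching and the sample lines are made up for illustration; in practice the lines would come from the word list file, read with read().splitlines() as in the question.

```python
def words_matching(lines, first, last):
    """Return the words (counts dropped) that start with `first` and end with `last`."""
    matches = []
    for line in lines:
        fields = line.split()      # splits on any run of spaces or tabs
        if not fields:             # skip blank lines
            continue
        word = fields[0]           # the count, if present, is fields[1]
        if word.startswith(first) and word.endswith(last):
            matches.append(word)
    return matches


# Hypothetical sample data in the same "word count" layout as the question.
sample = ["food 312", "freak 36", "", "house\t12", "cucumber 1", "hose  4"]
print(words_matching(sample, "h", "e"))  # ['house', 'hose']
```

Checking for an empty result from split() before indexing avoids an IndexError on blank lines, which word lists often have at the end.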
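Try this:
import re
# named s rather than str, so the built-in str is not shadowed
s = "This must not b3 delet3d, but the number at the end yes 12345"
# remove whitespace followed by digits, but only at the end of the string
s = re.sub(r"\s+\d+$", "", s)
s will then be:
"This must not b3 delet3d, but the number at the end yes"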
I'm trying to have the user input a string of characters with one asterisk. The asterisk indicates a character that can be subbed out for a vowel (a,e,i,o,u) in order to see what substitutions produce valid words.
Essentially, I want to take an input "l*g" and have it return "lag, leg, log, lug" because "lig" is not a valid English word. Below I have invalid words to be represented as "x".
I've gotten it to properly output each possible combination (e.g., including "lig"), but once I try to compare these words with the text file I'm referencing (for the list of valid words), it'll only return 5 lines of x's. I'm guessing it's that I'm improperly importing or reading the file?
Here's the link to the file I'm looking at so you can see the formatting:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip
Using the "en" file ~2.5MB
It's not in a dictionary layout i.e. no corresponding keys/values, just lines (maybe I could use the line number as the index, but I don't know how to do that). What can I change to check the test words to narrow down which are valid words based on the text file?
import os
with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.readlines()
inputted_word = input("Enter a word with ' * ' as the missing letter: ")
letters = []
for l in inputted_word:
    letters.append(l)
### find the index of the blank
asterisk = inputted_word.index('*') # also used a redundant int(), works fine
### sub in vowels
vowels = ['a','e','i','o','u']
list_of_new_words = []
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)
    list_of_new_words.append(new_word)
for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
There are probably more efficient ways to do this, but I'm brand new to this. The last two for loops could probably be combined but debugging it was tougher that way.
print(list_of_new_words)
gives
['lag', 'leg', 'lig', 'log', 'lug']
So far, so good.
But this:
for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
Here you print new_word, which is defined in the previous for loop:
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters) # <----
    list_of_new_words.append(new_word)
So after the loop, new_word still holds the last value assigned to it: "lug" (if the script input was l*g).
You probably meant w instead?
for w in list_of_new_words:
    if w in words:
        print(w)
    else:
        print('x')
But it still prints 5 x's...
So that means that w in words is always False. How come?
Looking at words:
print(words[0:10]) # the first 10 will suffice
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
All the words from the dictionary have a newline character (\n) at the end. I guess you were not aware that this is what readlines does: it keeps the line terminators. So I recommend using:
words = f.read().splitlines()
instead.
With these 2 modifications (w and splitlines):
Enter a word with ' * ' as the missing letter: l*g
lag
leg
x
log
lug
🎉
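For reference, the two fixes can be folded into one short function. This is only a sketch: the name fill_vowels and the tiny valid_words set are stand-ins, and it assumes the pattern contains exactly one asterisk. Using a set instead of a list is a swapped-in choice that makes each membership test fast on a large word list.

```python
VOWELS = "aeiou"


def fill_vowels(pattern, valid_words):
    """Return the valid words obtained by replacing the single '*' in
    `pattern` with each vowel in turn."""
    candidates = [pattern.replace("*", v, 1) for v in VOWELS]
    return [w for w in candidates if w in valid_words]


# A tiny stand-in for the real word list; in practice this would be
# set(f.read().splitlines()) on the opened "en" file.
valid_words = {"lag", "leg", "log", "lug", "bat", "bet"}
print(fill_vowels("l*g", valid_words))  # ['lag', 'leg', 'log', 'lug']
```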
I want to find the total number of words in a file (text/string). I was able to get an output with my code, but I'm not sure if it is correct. Here are some sample files for y'all to try and see what you get.
Also note, use of any modules/libraries is not permitted.
sample1 - https://www.dropbox.com/s/kqwvudflxnmldqr/sample1.txt?dl=0
sample2 - https://www.dropbox.com/s/7xph5pb9bdf551h/sample2.txt?dl=0
sample3 - https://www.dropbox.com/s/4mdb5hgnxyy5n2p/sample3.txt?dl=0
There are some things you must consider before counting the words.
A sentence is a sequence of words followed by a full stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence ends a quote or spoken utterance) or by white space (space, tab or new-line character).
If a full stop is not at the end of a sentence, it is to be regarded as white space, so it serves to end a word.
For example, 3.42 would be two words, and P.yth.on would be three words.
A double hyphen (--) is to be regarded as a space character.
That being said, first of all, I opened and read the file to get all the text. I then replaced all the useless characters with blank space so it is easier to count the words. This includes '--' as well.
Then I split the text into words, created a dictionary to store count of the words. After completing the dictionary, I added all the values to get the total number of words and printed this. See below for code:
def countwords():
    filename = input("Name of file? ")
    text = open(filename, "r").read()
    text = text.lower()
    for ch in '!.?"#$%&()*+/:<=>#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    text = text.replace('--', ' ')
    text = text.rstrip("\n")
    words = text.split()
    count = {}
    for w in words:
        count[w] = count.get(w,0) + 1
    wordcount = sum(count.values())
    print(wordcount)
So for the sample1 text file, my word count is 321.
For sample2: 542
For sample3: 139
I was hoping to compare these answers with some Python pros here to see whether my results are correct and, if they are not, what I'm doing wrong.
You can try this solution using regex.
# word counter using regex
import re
while True:
    string = input("Enter the string: ")
    if string == "Done":  # command to terminate the loop
        break
    # each run of letters or underscores counts as one word
    count = len(re.findall("[a-zA-Z_]+", string))
    print(count)
print("Terminated")
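Since the question rules out modules, here is a module-free sketch of the counting step for comparison. It makes a simplifying assumption: every full stop, question mark and exclamation mark is treated as white space, which yields the same word count as the stated rules because a mark that really ends a sentence is followed by white space or a quotation mark anyway. The name count_words and the PUNCTUATION constant are illustrative; the character set is the one from the question. Because only the total is wanted, the per-word dictionary is not needed.

```python
PUNCTUATION = '!.?"#$%&()*+/:<=>[\\]^_`{|}~'


def count_words(text):
    """Count whitespace-separated words after treating '--' and the
    characters in PUNCTUATION as spaces. Uses no imports."""
    text = text.replace("--", " ")
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    return len(text.split())


print(count_words("Pi is about 3.14 -- or so they say."))  # 9
```

To check a file, pass it the file's contents, for example count_words(open(filename).read()).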
I need to write a function that returns the first letter of each word, in uppercase, for any text, like:
shortened = shorten("Don't repeat yourself")
print(shortened)
Expected output:
DRY
and:
shortened = shorten("All terrain armoured transport")
print(shortened)
Expected output:
ATAT
Use a list comprehension and join:
shortened = "".join([x[0] for x in text.title().split(' ') if x])
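Wrapped into the shorten function the question asks for, that idea might look like the sketch below. It swaps title() for upper() on the first character only, and uses split() with no argument so that repeated spaces do not produce empty words; both are choices of this sketch rather than part of the answer above.

```python
def shorten(text):
    """Build an acronym from the first character of each word."""
    # split() with no argument ignores repeated or leading/trailing spaces,
    # so no empty "words" need filtering out.
    return "".join(word[0].upper() for word in text.split())


print(shorten("Don't repeat yourself"))           # DRY
print(shorten("All terrain armoured transport"))  # ATAT
```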
Using regex you can match all characters except the first letter of each word, replace them with an empty string to remove them, then capitalize the resulting string:
import re
def shorten(sentence):
    return re.sub(r"\B[\S]+\s*","",sentence).upper()
print(shorten("Don't repeat yourself"))
Output:
DRY
text = 'this is a test'
output = ''.join(char[0] for char in text.title().split(' '))
print(output)
TIAT
Let me explain how this works.
My first step is to capitalize the first letter of each word:
text.title()
Now I want to separate the words at the spaces in between, which gives a list:
text.title().split(' ')
With that I'd end up with 'This','Is','A','Test', so now I only want the first character of each word in the list:
for word in text.title().split(' '):
    print(word[0]) # T I A T
Now I can lump all that into something called a list comprehension:
output = [char[0] for char in text.title().split(' ')]
# ['T','I','A','T']
I can use ''.join() to combine them together. I don't need the [] brackets anymore because it doesn't need to be a list:
output = ''.join(char[0] for char in text.title().split(' '))
I need to delete the first letters of each word of a string. I know that by using something like
st = "testing"
st = st[3:]
I can delete the first 3 letters from that word.
I need to do this for many words in the same string now.
For example if I get
"hello this is a test"
I need to delete the first 2 letters (I chose 2 arbitrarily) from each word, but only if the length of that word is >= 2.
The output of this example should be:
llo is a st
(note that "is" got deleted because it has a length of 2 letters)
Assuming that a word is any sequence that does not contain spaces:
1. Split the text into words
2. Use a list comprehension to modify the words that qualify
3. Join the results
If by "word" you mean real English words (that do not include punctuation, etc.), then use nltk.word_tokenize(st) instead of st.split().
" ".join([(word[2:] if len(word) >= 2 else word) for word in st.split()])
# 'llo is  a st' (two spaces, because "is" becomes an empty string)
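A sketch that wraps this approach in a function (the name drop_prefix and the parameter n are made up for illustration). It also leaves out words that become empty, an assumption based on the single-spaced expected output in the question.

```python
def drop_prefix(text, n=2):
    """Remove the first `n` characters from every word that has at least
    `n` characters; shorter words are kept unchanged."""
    trimmed = (word[n:] if len(word) >= n else word for word in text.split())
    # Words of exactly `n` characters become empty strings; leave them out
    # so they do not produce doubled spaces in the result.
    return " ".join(word for word in trimmed if word)


print(drop_prefix("hello this is a test"))  # llo is a st
```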
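You can do this directly:
' '.join([s[2:] for s in st.split()])
But to keep words that are shorter than two characters:
' '.join([s[2:] or s for s in st.split()])
This uses or because it returns the first operand that is truthy: '' or 'a' evaluates to 'a'. Note that a word of exactly two characters, such as "is", also slices to '' and is therefore kept unchanged.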
I am trying to translate morse code into words and sentences and it all works fine... except for one thing. My entire output is lowercased and I want to be able to capitalize every first letter of every sentence.
This is my current code:
text = input()
if is_morse(text):
    lst = text.split(" ")
    text = ""
    for e in lst:
        text += TO_TEXT[e].lower()
    print(text)
Each element in the split list is equal to a character (but in morse), NOT a WORD. 'TO_TEXT' is a dictionary. Does anyone have an easy solution to this? I am a beginner in programming and Python btw, so I might not understand some solutions...
Maintain a flag telling you whether or not this is the first letter of a new sentence. Use that to decide whether the letter should be upper-case.
text = input()
if is_morse(text):
    lst = text.split(" ")
    text = ""
    first_letter = True
    for e in lst:
        if first_letter:
            this_letter = TO_TEXT[e].upper()
        else:
            this_letter = TO_TEXT[e].lower()
        # Period heralds a new sentence.
        first_letter = this_letter == "."
        text += this_letter
    print(text)
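The same flag idea can be tried without the Morse table. The sketch below works on already-decoded characters and differs slightly from the loop above: it assumes the decoded text may contain spaces after a period, so the flag stays set until the next actual letter, and it also treats "!" and "?" as sentence endings. The function name capitalize_sentences is hypothetical.

```python
def capitalize_sentences(chars):
    """Upper-case the first letter of each sentence in a sequence of
    characters and lower-case everything else."""
    out = []
    start_of_sentence = True
    for ch in chars:
        if start_of_sentence and ch.isalpha():
            out.append(ch.upper())
            start_of_sentence = False
        else:
            out.append(ch.lower())
            if ch in ".!?":  # sentence-ending punctuation re-arms the flag
                start_of_sentence = True
    return "".join(out)


print(capitalize_sentences("hello there. how are you? fine."))
# Hello there. How are you? Fine.
```

In the Morse loop, the argument could be the decoded characters, for example a generator such as (TO_TEXT[e] for e in lst).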
Judging from your code, you could use Python's title() string method.
For stricter behavior, you can use the capwords() function from the string module. Note that both capitalize the first letter of every word, not just of every sentence.
This is what the Python docs say about capwords:
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.
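A short comparison shows the practical difference between the two; the sample text is made up for illustration. title() starts a new "word" after any non-letter, including an apostrophe, while capwords() only splits on whitespace and collapses runs of it.

```python
import string

text = "they're here.  it's late."

print(text.title())            # They'Re Here.  It'S Late.
print(string.capwords(text))   # They're Here. It's Late.
```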