Reading whitespaces inside of a list of strings - python

I'm having a problem trying to count whitespaces in a list in python.
Here's my code
Data = ''
index = 0
num_words = 0
# Open a file for reading.
infile = open('article.txt', 'r')
# Read the contents of the file into a list.
data = infile.readlines()
# Strip the \n from each element.
while index < len(data):
    data[index] = data[index].rstrip('\n')
    index += 1
for ch in data:
    if ch.isspace():
        num_words += 1
# Close the file.
infile.close()
# Print the contents of the list.
print(num_words)
The contents of the article.txt is just a list of sentences so the list is just a list of strings such as:
data = ['this is sentence one.', 'this is sentence two.' , 'this is sentence three.' , 'this is sentence four.' , 'this is sentence five.' , 'this is sentence six.' ]
I think I know what the problem is because I did:
print(ch)
Which results in 'false' getting printed 6 times. I'm thinking this is because the for loop is searching to see if the whole string is a space rather than checking for spaces inside of the string.
I know I could just do:
data = infile.read()
But I need each line in a list. Is there anything I can change so the for loop searches for spaces in each string in the list or am I out of luck?

Python has a handy method for that on strings, called str.split. When passed no arguments, it will split on whitespace. If you count the items in the resulting list, you will have the number of words.
Handles multiple spaces:
>>> line = "this is some string."
>>> len(line.split())
4
Handles empty lines:
>>> line = " "
>>> len(line.split())
0
Handles extra space before and after:
>>> line = " space before and after. "
>>> len(line.split())
4
Here is some sample code:
lines = 0
words = 0
with open('yourfile', 'rt') as yourfile:
    for line in yourfile:
        lines += 1
        words += len(line.split())
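If you really do want to count the whitespace characters themselves while keeping each line in a list, one possible sketch is to loop over the characters of every string instead of testing the whole string. The helper names count_whitespace and count_words are made up for illustration, and the short list stands in for the contents of article.txt (Python 3 assumed):

```python
def count_whitespace(lines):
    """Count whitespace characters across all strings in a list."""
    return sum(1 for line in lines for ch in line if ch.isspace())


def count_words(lines):
    """Count whitespace-separated words across all strings in a list."""
    return sum(len(line.split()) for line in lines)


# Stand-in for the list read from article.txt (newlines already stripped).
data = ['this is sentence one.', 'this is sentence two.']
print(count_whitespace(data))  # 6
print(count_words(data))       # 8
```

Counting spaces and counting words only agree by accident, so it is probably safer to use split() when the number of words is what you are after.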

Related

How to take out punctuation from string and find a count of words of a certain length?

I am trying to create a function that opens a .txt file and counts the words that have the same length as the number specified by the user.
The .txt file is:
This is a random text document. How many words have a length of one?
How many words have the length three? We have the power to figure it out!
Is a function capable of doing this?
I'm able to open and read the file, but I am unable to exclude punctuation and find the length of each word.
def samplePractice(number):
    fin = open('sample.txt', 'r')
    lstLines = fin.readlines()
    fin.close
    count = 0
    for words in lstLines:
        words = words.split()
    for i in words:
        if len(i) == number:
            count += 1
    return count
You can try using the replace() on the string and pass in the desired punctuation and replace it with an empty string("").
It would look something like this:
puncstr = "Hello!"
nopuncstr = puncstr.replace(".", "").replace("?", "").replace("!", "")
I have written a sample code to remove punctuations and to count the number of words. Modify according to your requirement.
import re
fin = """This is a random text document. How many words have a length of one? How many words have the length three? We have the power to figure it out! Is a function capable of doing this?"""
fin = re.sub(r'[^\w\s]','',fin)
print(len(fin.split()))
The above code prints the number of words. Hope this helps!!
Instead of chaining replace() calls, you can use a single strip() call that lists all the punctuation characters to trim from both ends of each word.
Edit: a cleaner version
pl = '?!."\'' # punctuation list
def samplePractice(number):
    with open('sample.txt', 'r') as fin:
        words = fin.read().split()
    # clean words
    words = [w.strip(pl) for w in words]
    count = 0
    for word in words:
        if len(word) == number:
            print(word, end=', ')
            count += 1
    return count
result = samplePractice(4)
print('\nResult:', result)
output:
This, text, many, have, many, have, have, this,
Result: 8
Your code is almost OK; the second for block is just in the wrong position. It has to be nested inside the first one:
pl = '?!."\'' # punctuation list
def samplePractice(number):
    fin = open('sample.txt', 'r')
    lstLines = fin.readlines()
    fin.close()  # the parentheses are needed to actually close the file
    count = 0
    for words in lstLines:
        words = words.split()
        for i in words:
            i = i.strip(pl) # clean the word by strip
            if len(i) == number:
                count += 1
    return count
result = samplePractice(4)
print(result)
output:
8
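A variation on the strip() approach, sketched here under a few assumptions: it uses string.punctuation from the standard library rather than a hand-written list, takes the text as an argument instead of opening sample.txt, and the name count_words_of_length is invented for this example.

```python
import string


def count_words_of_length(text, number):
    """Count words whose length is `number` after trimming punctuation."""
    count = 0
    for word in text.split():
        # strip() only trims the ends, so "it's" keeps its apostrophe.
        cleaned = word.strip(string.punctuation)
        if len(cleaned) == number:
            count += 1
    return count


sample = ("This is a random text document. How many words have a length of one? "
          "How many words have the length three? We have the power to figure it out! "
          "Is a function capable of doing this?")
print(count_words_of_length(sample, 4))  # 8
```

To use it with the file, you could pass fin.read() as the text.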

How to search words from txt file to python

How can I show the words whose length is 20 in a text file?
To list all the words, I know I can use the following code:
#Program for searching words is in 20 words length in words.txt file
def main():
    file = open("words.txt","r")
    lines = file.readlines()
    file.close()
    for line in lines:
        print (line)
    return
main()
But I'm not sure how to pick out and show only the words with 20 letters.
Big thanks
If your lines have lines of text and not just a single word per line, you would first have to split them, which returns a list of the words:
words = line.split(' ')
Then you can iterate over each word in this list and check whether its length is 20.
for word in words:
    if len(word) == 20:
        print(word)  # Do what you want to do here
If each line has a single word, you can just operate on line directly and skip the for loop. You may need to strip the trailing end-of-line character though, word = line.strip('\n'). If you just want to collect them all, you can do this:
words_longer_than_20 = []
for word in words:
    if len(word) > 20:
        words_longer_than_20.append(word)
If your file has one word only per line, and you want only the words with 20 letters you can simply use:
with open("words.txt", "r") as f:
    words = f.read().splitlines()
found = [x for x in words if len(x) == 20]
You can then print the list or print each word separately.
You can try this:
f = open('file.txt')
new_file = f.read().splitlines()
words = [i for i in new_file if len(i) == 20]  # iterate over the lines already read, not the exhausted file
f.close()
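The two cases above (one word per line, or several words per line) can be covered by one helper, since split() with no arguments also drops the trailing newline. This is only a sketch; words_of_length is an invented name and the sample list stands in for the lines of words.txt:

```python
def words_of_length(lines, n):
    """Return every whitespace-separated word of exactly n characters."""
    return [word for line in lines for word in line.split() if len(word) == n]


# Stand-in for the lines of words.txt; an open file object could be passed instead.
sample_lines = ["short words here\n", "electroencephalogram counterrevolutionary\n"]
print(words_of_length(sample_lines, 20))
```

With a real file you could call it as words_of_length(f, 20) inside a with open(...) block.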

Python make a list of words from a file

I'm trying to make a list of words from a file that includes only words that do not contain any duplicate letters such as 'hello' but 'helo' would be included.
My code works perfectly when I use a list that I create by just typing in words; however, when I try to do it with the list read from the file, it just prints all the words, even if they include duplicate letters.
words = []
length = 5
file = open('dictionary.txt')
for word in file:
    if len(word) == length+1:
        words.insert(-1, word.rstrip('\n'))
alpha = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
x = 0
while x in range(0, len(alpha)):
    i = 0
    while i in range(0, len(words)):
        if words[i].count(alpha[x]) > 1:
            del(words[i])
            i = i - 1
        else:
            i = i + 1
    x = x + 1
print(words)
This snippet adds words, and removes duplicated letters before inserting them
words = []
length = 5
file = open('dictionary.txt')
for word in file:
    clean_word = word.strip('\n')
    if len(clean_word) == length:  # the newline is already stripped, so no + 1
        words.append(''.join(set(clean_word)))
We convert the string to a set, which removes duplicates, and then we join the set into a string again (a set is unordered, so the order of the letters is not guaranteed):
>>> word = "helloool"
>>> set(word)
set(['h', 'e', 'l', 'o'])
>>> ''.join(set(word))
'helo'
I am not 100% sure how you want to remove duplicates like this, so I've assumed no letter can be more than once in the word (as your question specifies "duplicate letter" and not "double letter").
What does your dictionary.txt look like? Your code should work so long as each word is on a separate line (for x in file iterates through lines) and at least some of the words have 5 non-repeating letters.
Also, couple of tips:
You can read lines from a file into a list by calling file.readlines()
You can check for repeats in a list or string by using sets. Sets remove all duplicate elements, so checking if len(word) == len(set(word)) will tell you if there are duplicate letters in much less code :)
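A minimal sketch of the set tip applied to the original goal (keep only words of a given length with no repeated letter). The function names are made up for this example, and the sample list stands in for the lines of dictionary.txt:

```python
def has_unique_letters(word):
    """True when no letter occurs more than once in the word."""
    return len(word) == len(set(word))


def unique_letter_words(lines, length):
    """Keep words of the given length that contain no repeated letter."""
    result = []
    for line in lines:
        word = line.strip()  # drop the trailing newline and surrounding spaces
        if len(word) == length and has_unique_letters(word):
            result.append(word)
    return result


# Stand-in for the lines of dictionary.txt.
sample = ["hello\n", "world\n", "helo\n", "apple\n", "crane\n"]
print(unique_letter_words(sample, 5))  # ['world', 'crane']
```

Building a new list this way also avoids deleting items from a list while looping over it by index.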

Python Pig Latin convertor and line/word counter

I have to create a python file that prompts the user for a file path to a text document and then convert it into pig Latin and do a line/word count.
• A function to generate the pig Latin version of a single word
• A function to print line and word counts to standard output
• Correct pig Latin output with identical formatting as the original text file
• Correct line and word counts
I can't figure out why the pig latin is coming out wrong. My teacher said that I need another string.strip("\n") because it is making the words convert wrong but I have no idea where I am supposed to put that.
Also my line counter is broken. It counts but it always says 222 lines.
How can I make it just count the lines with words ?
#Step 1: User enters text file.
#Step 2: Pig Latin function rewrites file and saves as .txt.
#Step 3: Tracks how many lines and words it rewrites.
vowels = ("A", "a", "E", "e", "I", "i", "O", "o", "U", "u")
# Functions
def pig_word(string):
    line = string.strip("\n")
    for word in string.split(" "):
        first_letter = word[0]
        if first_letter in vowels:
            return word + "way"
        else:
            return word[1:] + first_letter + "ay"
def pig_sentence(sentence):
    word_list = sentence.split(" ")
    convert = " "
    for word in word_list:
        convert = convert + pig_word(word)
        convert = convert + " "
    return convert
def line_counter(s):
    line_count = 0
    for line in s:
        line_count += 1
    return line_count
def word_counter(line):
    word_count = 0
    list_of_words = line.split()
    word_count += len(list_of_words)
    return word_count
# File path conversion
text = raw_input("Enter the path of a text file: ")
file_path = open(text, "r")
out_file = open("pig_output.txt", "w")
s = file_path.read()
pig = pig_sentence(s)
out_file.write(pig+" ")
out_file.write("\n")
linecount = line_counter(s)
wordcount = word_counter(s)
file_path.close()
out_file.close()
# Results
print "\n\n\n\nTranslation finished and written to pig_output.txt"
print "A total of {} lines were translated successfully.".format(linecount)
print "A total of {} words were translated successfully.".format(wordcount)
print "\n\n\n\n"
Your first problem is here:
def pig_word(string):
    line = string.strip("\n") #!!!! line is NEVER USED !!!
    for word in string.split(" "): #you want *line*.split here
The second issue is caused by iterating over a string: the loop goes through every character instead of every line, as it would with a file:
>>> for i in "abcd":
...     print(i)
a
b
c
d
So in your line_counter, instead of doing:
for line in s:
    line_count += 1
you just need to do:
for line in s.split("\n"):
    line_count += 1
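Since the question also asks how to count only the lines that have words on them, here is one way that might be done. It is a sketch, not the only option: count_nonblank_lines is an invented name, and it uses str.splitlines() so a trailing newline does not produce an extra empty line.

```python
def count_nonblank_lines(text):
    """Count the lines that contain at least one non-whitespace character."""
    return sum(1 for line in text.splitlines() if line.strip())


sample = "Hello world\n\nhow are you\n   \npig latin\n"
print(count_nonblank_lines(sample))  # 3
```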
The first reason why you're not getting the output you want is that your pig_word(string) function returns after the first word in the string, because the return is inside the for loop. Also, your teacher was talking about taking all the lines into the function and iterating over each line via str.split('\n'). \n represents the "new-line" character.
You can try something like this to correct that.
def pig_sentence(string):
    lines = []
    for line in string.split('\n'):
        new_string = ""
        for word in line.split(" "):
            first_letter = word[0]
            if first_letter in vowels:
                new_string += word + "way"
            else:
                new_string += word[1:] + first_letter + "ay"
            lines.append(new_string)
    return lines
The Changes Made
Initialized a new list lines that we can append to throughout the loops.
Iterate over each line in the passed in string.
For each line, create a new string new_string.
Use your code, but instead of returning we add it to new_string, then append new_string to our list of new lines, lines.
Note that this removes the need for two functions. Also note that I renamed pig_word to pig_sentence.
The second error is in your function line_counter(s). You are iterating over each character rather than each line. Here add that str.split('\n') again to get the output you want by splitting the string into a list of lines then iterating over the list.
Here is the modified function:
def line_counter(s):
    line_count = 0
    for _ in s.split('\n'):
        line_count += 1
    return line_count
(Since there is nothing erroneous with your file i.o., I'm just going to use a string literal here for the testing.)
Test
paragraph = """\
Hello world
how are you
pig latin\
"""
lines = line_counter(paragraph)
words = sum([word_counter(line) for line in paragraph.split('\n')])
out = pig_sentence(paragraph)
print(lines, words, out)
The output is what we expect!
3 7 ['elloHay', 'elloHayorldway', 'owhay', 'owhayareway', 'owhayarewayouyay', 'igpay', 'igpayatinlay']
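Note that the list printed above has one entry per word, with the words of each line running together. If you would rather get one translated string per line with the spaces kept, a possible variation is sketched below. It keeps the same simple translation rule; the names VOWELS and pig_lines are assumptions made for this example.

```python
VOWELS = "AEIOUaeiou"


def pig_word(word):
    """Translate one word using the same simple first-letter rule."""
    if word[0] in VOWELS:
        return word + "way"
    return word[1:] + word[0] + "ay"


def pig_lines(text):
    """Return one translated string per input line, words separated by spaces."""
    return [" ".join(pig_word(w) for w in line.split()) for line in text.split("\n")]


print(pig_lines("Hello world\nhow are you\npig latin"))
# ['elloHay orldway', 'owhay areway ouyay', 'igpay atinlay']
```

Because line.split() returns an empty list for a blank line, blank lines come out as empty strings instead of raising an IndexError on word[0].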
You are splitting only on the space character; you need to split on all whitespace, including end-of-line characters. Replace
split(" ")
with
split()
Your sentence list is the equivalent of
sentence = 'Hello there.\nMy name is Roxy.\nHow are you?'
If you print after split(" ") and split() you will see the difference and you will get the results that you expect.
Additionally, you will get incorrect results because "there" is translated to "heretay". You need to keep looping so that it comes out as "erethay".
That is, move every leading consonant to the end before adding 'ay', so that the new word starts with a vowel.
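The rule described above could look roughly like this. It is a sketch under a few assumptions: punctuation is not handled, "y" is treated as a consonant, and a word with no vowel at all simply gets "ay" appended.

```python
VOWELS = "aeiouAEIOU"


def pig_word(word):
    """Move every leading consonant to the end, then add 'ay'."""
    for i, letter in enumerate(word):
        if letter in VOWELS:
            if i == 0:
                return word + "way"  # vowel-initial words just get "way"
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowel found: assumed fallback


print(pig_word("there"))  # erethay
print(pig_word("apple"))  # appleway
```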

reading and checking the consecutive words in a file

I want to read the words in a file and, for example, check whether a word is "1". If the word is "1", I have to check whether the next word is "two". After that I have to do some other task. Can you help me check for the occurrence of "1" and "two" consecutively?
I have used
filne = raw_input("name of existing file to be proceesed:")
f = open(filne, 'r+')
for word in f.read().split():
    for i in xrange(len(word)):
        print word[i]
        print word[i+1]
but it's not working.
The easiest way to deal with consecutive items is with zip:
with open(filename, 'r') as f: # better way to open file
    for line in f: # for each line
        words = line.strip().split() # all words on the line
        for word1, word2 in zip(words, words[1:]): # iterate through pairs
            if word1 == '1' and word2 == 'two': # test the pair
                pass # found the pair; do the other task here
At the moment, your indices (i and i+1) are within each word (i.e. characters) not for words within the list.
I think you want to print two consecutive words from the file.
In your code you are iterating over each character instead of each word in the file.
If printing consecutive words is what you intend, you can do it in the following way:
f = open('yourFileName')
str1 = f.read().split()
for i in xrange(len(str1)-1): # -1 otherwise it will be index out of range error
    print str1[i]
    print str1[i+1]
If you want to check whether some word is present and then look at the word next to it, use
if 'wordYouWantToCheck' in str1:
    index = str1.index('wordYouWantToCheck')
Now you have index for the word you are looking for, you can check for the word next to it using str1[index+1].
But 'index' function will return only the first occurrence of the word. To accomplish your intent here, you can use 'enumerate' function.
indices = [i for i,x in enumerate(str1) if x == "1"]
This will return a list containing the indices of all occurrences of the word '1'.
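Putting enumerate and zip together, a small sketch that finds every place where one word is directly followed by another might look like this. It is written for Python 3, find_pairs is an invented name, and the sample string stands in for f.read() on your file:

```python
def find_pairs(words, first, second):
    """Return each index i where words[i] == first and words[i + 1] == second."""
    return [i for i, (a, b) in enumerate(zip(words, words[1:]))
            if a == first and b == second]


# Stand-in for f.read().split() on the input file.
words = "1 two 3 1 four 1 two".split()
print(find_pairs(words, "1", "two"))  # [0, 5]
```

Because zip stops at the shorter sequence, there is no need to worry about an index error on the last word.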
