How to find length of words in a file in Python - python

I am fairly new to files in Python and want to find the words in a file that have, say, 8 letters in them, print them, and keep a running total of how many there are. Can you look through a file as if it were one very large string, or is there a specific way that it has to be done?

You could use Python's Counter for doing this:
from collections import Counter
import re
with open('input.txt') as f_input:
    text = f_input.read().lower()
words = re.findall(r'\b(\w+)\b', text)
word_counts = Counter(w for w in words if len(w) == 8)
for word, count in word_counts.items():
    print(word, count)
This works as follows:
It reads the file input.txt into one very long string.
It then converts the text to lowercase so that the same word in different cases is counted as one word.
It uses a regular expression to extract all of the words from the text as a list.
It uses a generator expression to pass every word that is exactly 8 characters long to a Counter.
It displays each matching word along with its count.
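The question also asks for a running total, which the steps above do not print. A minimal sketch of one way to get it, assuming the text is already in a string; the function name words_of_length and the sample sentence are made up for illustration:

```python
import re
from collections import Counter


def words_of_length(text, length=8):
    """Return a Counter of the lower-cased words in text with exactly `length` characters."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if len(w) == length)


sample = "Elephant and elephant met a kangaroo. Absolute nonsense!"
counts = words_of_length(sample)
# Summing the counts includes repeated words; len(counts) would count distinct words only.
total = sum(counts.values())
for word, count in counts.items():
    print(word, count)
print("Total:", total)
```

To use it on a file, read the file into a string first and pass that string in place of sample.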

Try this code, where "eight_l_words" is a list of all the eight-letter words and "number_of_8lwords" is the number of eight-letter words:
# defines text to be used
with open("file_location", "r") as your_file:
    text = your_file.read()
# divides the text into lines and defines some lists
lines = text.split("\n")
words = []
eight_l_words = []
# iterating through "lines", adding each separate word to the "words" list
for each in lines:
    words += each.split(" ")
# checking to see if each word in the "words" list is 8 chars long, and if so
# appending that word to the "eight_l_words" list
for each in words:
    if len(each) == 8:
        eight_l_words.append(each)
# finding the number of eight-letter words
number_of_8lwords = len(eight_l_words)
# displaying results
print(eight_l_words)
print("There are " + str(number_of_8lwords) + " eight letter words")
Running the code with
text = "boomhead shot\nshamwow slapchop"
yields the results:
['boomhead', 'slapchop']
There are 2 eight letter words

There's a useful post from 2 years ago called "How to split a text file to its words in python?"
It describes splitting each line by whitespace. If you have punctuation such as commas and full stops in there, then you'll have to be a bit more sophisticated. There's help in the post "Python - Split Strings with Multiple Delimiters".
You can use the function len() to get the length of each individual word.
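As a rough illustration of splitting on several delimiters at once and then applying len(), here is a small sketch. The helper name split_words and the chosen delimiter set are assumptions; adjust the character class to the punctuation your file actually contains:

```python
import re


def split_words(line):
    """Split on runs of whitespace, commas, full stops, semicolons, ! and ?."""
    # re.split can leave empty strings at the ends, so filter them out.
    return [w for w in re.split(r"[\s,.;!?]+", line) if w]


line = "Overheat, overflow; underrun. Done!"
for word in split_words(line):
    print(word, len(word))
```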

Related

Duplicates with in a sentence of a text file in python

Hi, I want to write code that reads a text file and identifies the sentences in which a word is duplicated within that sentence. I was thinking of putting each sentence of the file in a dictionary and finding which sentences have duplicates. Since I am new to Python, I need some help writing the code.
This is what I have so far:
def Sentences():
    def Strings():
        l = string.split('.')
        for x in range(len(l)):
            print('Sentence', x + 1, ': ', l[x])
        return
    text = open('Rand article.txt', 'r')
    string = text.read()
    Strings()
    return
The code above converts files to sentences.
Suppose you have a file where each line is a sentence, e.g. "sentences.txt":
I contain unique words.
This sentence repeats repeats a word.
The strategy could be to split the sentence into its constituent words, then use set to find the unique words in the sentence. If the resulting set is shorter than the list of all words, then you know that the sentence contains at least one duplicated word:
sentences_with_dups = []
with open("sentences.txt") as fh:
    for sentence in fh:
        words = sentence.split(" ")
        if len(set(words)) != len(words):
            sentences_with_dups.append(sentence)
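The set comparison above treats "Repeats" and "repeats", or "word" and "word.", as different words. A possible variant that ignores case and punctuation is sketched below; the function name has_duplicate_word and the word pattern (letters and apostrophes only) are assumptions made for illustration:

```python
import re


def has_duplicate_word(sentence):
    """Return True if any word occurs more than once, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return len(set(words)) != len(words)


sentences = [
    "I contain unique words.",
    "This sentence repeats Repeats a word.",
    "The cat saw the dog.",
]
sentences_with_dups = [s for s in sentences if has_duplicate_word(s)]
print(sentences_with_dups)
```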

Count total number of words in a file?

I want to find the total number of words in a file (text/string). I was able to get an output with my code, but I'm not sure if it is correct. Here are some sample files for y'all to try and see what you get.
Also note, use of any modules/libraries is not permitted.
sample1: https://www.dropbox.com/s/kqwvudflxnmldqr/sample1.txt?dl=0
sample2 - https://www.dropbox.com/s/7xph5pb9bdf551h/sample2.txt?dl=0
sample3 - https://www.dropbox.com/s/4mdb5hgnxyy5n2p/sample3.txt?dl=0
There are some things you must consider before counting the words.
A sentence is a sequence of words followed by either a full-stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character).
E.g. if a full-stop is not at the end of a sentence, it is to be regarded as white space, so it serves to end words.
So 3.42 would be two words, and P.yth.on would be 3 words.
A double hyphen (--) is to be regarded as a space character.
That being said, first of all, I opened and read the file to get all the text. I then replaced all the useless characters with blank space so it is easier to count the words. This includes '--' as well.
Then I split the text into words, created a dictionary to store count of the words. After completing the dictionary, I added all the values to get the total number of words and printed this. See below for code:
def countwords():
    filename = input("Name of file? ")
    text = open(filename, "r").read()
    text = text.lower()
    for ch in '!.?"#$%&()*+/:<=>#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    text = text.replace('--', ' ')
    text = text.rstrip("\n")
    words = text.split()
    count = {}
    for w in words:
        count[w] = count.get(w,0) + 1
    wordcount = sum(count.values())
    print(wordcount)
So for the sample1 text file, my word count is 321.
For sample2: 542
For sample3: 139
I was hoping to compare these answers with some Python pros here to see whether my results are correct and, if they are not, what I'm doing wrong.
You can try this solution using regex.
# word counter using regex
import re
while True:
    string = input("Enter the string: ")
    if string == "Done":  # command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", string))
    print(count)
print("Terminated")
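Since the question rules out modules, here is a rough import-free sketch of the stated rules. It assumes that every character other than a letter, digit or apostrophe can be treated as white space, which covers "3.42", "P.yth.on" and "--" but is a simplification of the sentence rules; the function name count_words is hypothetical:

```python
def count_words(text):
    """Count words, treating anything that is not a letter, digit or apostrophe as white space."""
    cleaned = "".join(ch if ch.isalnum() or ch == "'" else " " for ch in text)
    # split() with no argument collapses runs of white space, including newlines.
    return len(cleaned.split())


print(count_words("3.42"))
print(count_words("P.yth.on"))
print(count_words("well--known fact"))
```

Whether apostrophes should stay inside words (so that "don't" counts once) is a judgment call that the question does not settle.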

How can you use Python to count the unique words (without special characters/ cases interfering) in a text document

I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find common used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
words = p.split()
wordCount = len(words)
print("The total word count is:", wordCount)
So far I have this Python program to print unique words and their frequency: (it's not in order and sees words such as: dog, dog., "dog, and dog, as different words)
file = open("/Users/name/Desktop/20words.txt", "r")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)
Thank you for any help you can give!
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
# most_common() yields (item, count) pairs, most frequent first
for s, n in sentences.most_common():
    print(s, n)
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print(w, n)
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
str.strip Docs
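Pulling the suggestions in this thread together, the following sketch covers the first four items and the frequency list from the question. It is only one possible approach: the function name analyze, the word pattern, and the deliberately naive sentence rule (a run of ., ! or ? followed by white space or the end of the text) are all assumptions, and abbreviations such as "Mr." will still be miscounted:

```python
import re
from collections import Counter


def analyze(text):
    """Return basic word and sentence statistics for text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Naive sentence split; empty pieces left over by the split are dropped.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    counts = Counter(words)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "by_frequency": counts.most_common(),  # descending frequency
    }


report = analyze('The dog barked. "Dog, dog!" said the cat.')
for key, value in report.items():
    print(key, value)
```

Reading from STDIN or a file named on the command line could be layered on top by passing sys.stdin.read() or the file contents to the same function.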

Why is my list comprehension causing the code to return my entire text document rather than only matches of words from permutations of a word?

My code:
from itertools import permutations
original = str(input('What word would you like to unscramble?: '))
inputFile = open('dic.txt', 'r')
compare = inputFile.read().split('\n')
inputFile.close()
for now in permutations(original):
    car = [print(now) for now in compare if now in compare] #supposed to compare iterations of input word to text file.
I am trying to unscramble a word by finding all permutations of a word and running each permutation through a text file of English words to see if it is a real word or not. My previous version stored all the permutations in a list (now I know that's a bad idea). This code here just prints my entire text file, and I'm not entirely sure why. I would like to know what I'm doing wrong with the list comprehension that prints the entire text file of words rather than only iterating through the permutations of the input word.
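Your comprehension reuses the name now for its own loop over compare, so it never looks at the permutation, and now in compare is true for every word in the file, which is why the whole file is printed. You can also make your code more idiomatic in several ways:
from itertools import permutations
original = input('What word would you like to unscramble?: ')
with open('dic.txt') as input_file:
    # strip the trailing newline from each word; a set makes lookups fast
    compare = set(line.strip() for line in input_file)
for permutation in permutations(original):
    candidate = ''.join(permutation)  # permutations() yields tuples of characters
    if candidate in compare:
        print(candidate)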
Does this do what you are looking for?
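One detail worth keeping in mind is that itertools.permutations yields tuples of characters, not strings, and produces repeats when the input has repeated letters. A small sketch, using a made-up in-memory dictionary in place of dic.txt and a hypothetical helper named candidate_words:

```python
from itertools import permutations


def candidate_words(letters):
    """Return the distinct strings formed by rearranging letters."""
    # Joining each tuple gives a string; the set drops repeated arrangements.
    return {"".join(p) for p in permutations(letters)}


dictionary = {"tea", "eat", "ate", "tree"}  # stand-in for the words read from dic.txt
matches = sorted(candidate_words("aet") & dictionary)
print(matches)
```

The set of permutations grows factorially with word length, so this is only reasonable for short inputs.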

Automatically separating words into letters?

So I have this code:
import sys ## The 'sys' module lets us read command line arguments
words1 = open(sys.argv[2],'r') ##sys.argv[2] is your dictionary text file
words = str((words1.read()))
def main():
    # Get the dictionary to search
    if (len(sys.argv) != 3) :
        print("Proper format: python filename.py scrambledword filename.txt")
        exit(1) ## the non-zero return code indicates an error
    scrambled = sys.argv[1]
    print(sys.argv[1])
    unscrambled = sorted(scrambled)
    print(unscrambled)
    for line in words:
        print(line)
When I print words, it prints the words in the dictionary, one word at a time, which is great. But as soon as I try to do anything with those words, as in my last two lines, it automatically separates the words into letters and prints one letter per line. Is there any way to keep the words together? My end goal is to do ordered=sorted(line), and then, if (ordered==unscrambled), have it print the original word from the dictionary.
Your words variable is a str, so iterating over it yields single characters. Use split to iterate over the words instead:
for word in words.split():
    print(word)
A for-loop takes one element at a time from the "sequence" you pass it. You have read the contents of your file into a single string, so python treats it as a sequence of letters. What you need is to convert it into a list yourself: Split it into a list of strings that are as large as you like:
lines = words.splitlines() # Makes a list of lines
for line in lines:
    ....
Or
wordlist = words.split() # Makes a list of "words", by splitting at whitespace
for word in wordlist:
    ....
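With the words split into a list, the end goal from the question (comparing sorted letters) could look roughly like this. The function name unscramble and the in-memory word list are made up for illustration; in the original script the text would come from the dictionary file:

```python
def unscramble(scrambled, wordlist):
    """Return the words whose sorted letters match those of scrambled."""
    target = sorted(scrambled.lower())
    return [word for word in wordlist if sorted(word.lower()) == target]


text = "listen\nsilent\nenlist\ngoogle\n"  # stand-in for the contents of the dictionary file
print(unscramble("tinsel", text.split()))
```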
