I want to find the total number of words in a file (text/string). I was able to get an output with my code, but I'm not sure if it is correct. Here are some sample files for you to try and see what you get.
Also note, use of any modules/libraries is not permitted.
sample1: https://www.dropbox.com/s/kqwvudflxnmldqr/sample1.txt?dl=0
sample2 - https://www.dropbox.com/s/7xph5pb9bdf551h/sample2.txt?dl=0
sample3 - https://www.dropbox.com/s/4mdb5hgnxyy5n2p/sample3.txt?dl=0
There are some things you must consider before counting the words.
A sentence is a sequence of words followed by either a full-stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character).
E.g. if a full-stop is not at the end of a sentence, it is to be regarded as white space, and so serves to end a word.
For example, 3.42 would be two words, and P.yth.on would be three words.
A double hyphen (--) is to be regarded as a space character.
That being said, I first opened and read the file to get all the text. I then replaced all the irrelevant characters with spaces so it is easier to count the words; this includes '--' as well.
Then I split the text into words and created a dictionary to store the count of each word. After completing the dictionary, I summed all the values to get the total number of words and printed it. See below for the code:
def countwords():
    filename = input("Name of file? ")
    text = open(filename, "r").read()
    text = text.lower()
    for ch in '!.?"#$%&()*+/:<=>#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    text = text.replace('--', ' ')
    text = text.rstrip("\n")
    words = text.split()
    count = {}
    for w in words:
        count[w] = count.get(w, 0) + 1
    wordcount = sum(count.values())
    print(wordcount)
So for the sample1 text file, my word count is 321.
For sample2: 542
For sample3: 139
I was hoping I could compare these answers with some Python pros here and see if my results are correct, and if they are not, what I'm doing wrong.
You can try this solution using a regex:
# word counter using regex
import re

while True:
    string = input("Enter the string: ")
    if string == "Done":  # command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", string))
    print(count)
print("Terminated")
I have tried the code below; any suggestions and help are appreciated. To be more specific, I want to create a Python program that can count and identify the acronyms in a text file. The output of the program should display every acronym present in the specified text file and how many times each of those acronyms occurred in the file.
*Note: the code below is not giving the desired output. Any help and suggestions are appreciated.
Link for the text file, so you can have a look: https://drive.google.com/file/d/1zlqsmJKqGIdD7qKicVmF0W6OgF5-g7Qk/view?usp=sharing
The text file contains various acronyms. I basically want to write a Python script to identify those acronyms and count how many times each of them occurred. The acronyms are of various types: they can be 2 or more letters, and either lowercase or uppercase. For further reference about the acronyms, please have a look at the text file on Google Drive.
Any updated code is also appreciated.
acronyms = 0  # number of acronyms
# open file Larex_text_file.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())
import re
print(re.sub("([a-zA-Z]\.*){2,}s?", "", text))
for line in text:  # for every line in file
    for word in line.split(' '):  # for every word in line
        if word.isupper():  # if word is all uppercase letters
            acronyms += 1
print("Number of acronyms:", acronyms)  # print number of acronyms
In building a small text file and then trying out your code, I came up with a couple of tweaks to simplify it while still acquiring the count of words within the text file that are all uppercase letters.
acronyms = 0  # number of acronyms
# open file Larex_text_file.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())
for word in text.split(' '):  # for every word in the text
    if word.isupper() and word.isalpha():  # word is all uppercase letters
        acronyms += 1
print("Number of words that are all uppercase:", acronyms)
First off, just a simple loop is used over the words split out from the read text, and then the program checks that each word is all alphabetic and that all of its letters are uppercase.
To test, I built a small text file with some words all in uppercase.
NASA and UCLA have teamed up with the FBI and JPL.
also UNICEF and the WWE have teamed up.
With that, there should be five words that are all uppercase.
And, when run, this was the output on the terminal.
#Una:~/Python_Programs/Acronyms$ python3 Acronym.py
Number of words that are all uppercase: 5
You will note that I am being a bit pedantic here referring to the count of "uppercase" words and not calling them acronyms. I am not sure if you are attempting to actually derive true acronyms, but if you are, this link might help:
Acronyms
Give that a try to see if it meets the spirit of your project.
Answer to the question:
acronyms = 0       # number of acronyms
acronym_word = []  # list to store the acronyms found in the file
# open file Larex_text_file.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())
for word in text.split(' '):  # for every word in the text
    if word.isupper() and word.isalpha():  # word is all uppercase letters
        if len(word) == 1:  # ignore single characters; they are not acronyms
            pass
        else:
            acronyms += 1
            acronym_word.append(word)  # store every acronym found in the file
uniqWords = sorted(set(acronym_word))  # remove duplicates and sort the list of acronyms
for word in uniqWords:
    print(word, ":", acronym_word.count(word))
From your comments, it sounds like every acronym appears at least once as an all-uppercase word, then can appear several more times in lowercase.
I suggest making two passes on the text: a first time to collect all uppercase words, and a second pass to search for every occurrence, case-insensitive, of the words you collected on the first pass.
You can use collections.Counter to quickly count words.
You can use ''.join(filter(str.isalpha, word.lower())) to strip a word of its non-alphabetical characters and disregard its case.
In the code snippet below, I used io.StringIO to emulate opening a text file.
from io import StringIO
from collections import Counter

text = '''First we have CR and PU as uppercase words. A word which first
appeared as uppercase can also appear as lowercase.
For instance, cr and pu appear in lowercase, and pu appears again.
And again: here is a new occurrence of pu.
An acronym might or might not have punctuation or numbers in it: CR-1,
C.R., cr.
A word that contains only a single letter will look like an acronym
if it ever appears as the first word of a sentence.'''

# with open('path/to/file.txt', 'r') as f:
with StringIO(text) as f:
    counts = Counter(''.join(filter(str.isalpha, word.lower()))
                     for line in f for word in line.split())
    f.seek(0)
    uppercase_words = set(''.join(filter(str.isalpha, word.lower()))
                          for line in f
                          for word in line.split() if word.isupper())

acronyms = Counter({w: c for w, c in counts.items() if w in uppercase_words})
print(acronyms)
# Counter({'cr': 5, 'a': 5, 'pu': 4})
Python learner here. I have a wordlist.txt file with one word on each line. I want to filter out specific words starting and ending with specific letters. But in my wordlist.txt, the words are listed with their occurrence counts.
For example:
food 312
freak 36
cucumber 1
Here is my code:
wordList = open("full.txt", "r", encoding="utf-8")
word = wordList.read().splitlines()
for i in word:
    if i.startswith("h") and i.endswith("e"):
        print(i)
But since each item in the list has a number at the end, I can't filter the correct words. I could not figure out how to omit those numbers.
Try splitting the line using a space as the delimiter and use the first field [0], which is the word in your case:
for i in word:
    if i.split(" ")[0].startswith("h") and i.split(" ")[0].endswith("e"):
        print(i.split(" ")[0])
Or you can just perform the split once:
for i in word:
    w = i.split(" ")[0]
    if w.startswith("h") and w.endswith("e"):
        print(w)
EDIT: Based on the comment below, you may want to pass no argument (or None) to split, in case there happen to be two spaces or a tab as the field delimiter.
w = i.split()[0]
Try this:
import re

s = "This must not b3 delet3d, but the number at the end yes 12345"
s = re.sub(r" \d+", "", s)
Afterwards, s will be:
"This must not b3 delet3d, but the number at the end yes"
I need to delete the first letters of each word of a string. I know that by using something like
st = "testing"
st = st[3:]
I can delete the first 3 letters from that word.
I need to do this for many words in the same string now.
For example if I get
"hello this is a test"
I need to delete the first 2 letters (2 was chosen arbitrarily) from each word, but only if the length of that word is >= 2.
the output of this example should be:
llo is a st
(note that "is" got deleted because it has a length of 2 letters)
Assuming that a word is any sequence that does not contain spaces:
Split the text into words
Use list comprehension to modify the words that qualify
Join the results
If by "word" you mean real English words (that do not include punctuation, etc.), then use nltk.word_tokenize(st) instead of st.split().
" ".join([(word[2:] if len(word) >= 2 else word) for word in st.split()])
#'llo is a st'
Directly, you can do:
' '.join([s[2:] for s in st.split()])
But if you wanted to keep the words of length two or less (for which s[2:] is empty):
' '.join([s[2:] or s for s in st.split()])
`or` works here because it takes the first truthy operand: '' or 'a' evaluates to 'a'.
I am fairly new to files in Python and want to find the words in a file that have, say, 8 letters in them, print them, and keep a numerical total of how many there actually are. Can you look through a file as if it were one very large string, or is there a specific way it has to be done?
You could use Python's Counter for doing this:
from collections import Counter
import re

with open('input.txt') as f_input:
    text = f_input.read().lower()

words = re.findall(r'\b(\w+)\b', text)
word_counts = Counter(w for w in words if len(w) == 8)

for word, count in word_counts.items():
    print(word, count)
This works as follows:
It reads in a file called input.txt, as one very long string.
It then converts it all to lowercase to make sure the same words with different case are counted as the same word.
It uses a regular expression to split all of the text into a list of words.
It uses a generator expression to store any word that has a length of 8 characters into a Counter.
It displays all of the matching entries along with the counts.
Try this code, where "eight_l_words" is an array of all the eight letter words and "number_of_8lwords" is the number of eight letter words:
# defines text to be used
your_file = open("file_location", "r+")
text = your_file.read()

# divides the text into lines and defines some arrays
lines = text.split("\n")
words = []
eight_l_words = []

# iterating through "lines", adding each separate word to the "words" array
for each in lines:
    words += each.split(" ")

# checking to see if each word in the "words" array is 8 chars long, and if so
# appending that word to the "eight_l_words" array
for each in words:
    if len(each) == 8:
        eight_l_words.append(each)

# finding the number of eight letter words
number_of_8lwords = len(eight_l_words)

# displaying results
print(eight_l_words)
print("There are " + str(number_of_8lwords) + " eight letter words")
Running the code with
text = "boomhead shot\nshamwow slapchop"
Yields the results:
['boomhead', 'slapchop']
There are 2 eight letter words
There's a useful post from 2 years ago called "How to split a text file to its words in python?":
How to split a text file to its words in python?
It describes splitting the line by whitespace. If you have punctuation such as commas and full stops in there, then you'll have to be a bit more sophisticated. There's help here: "Python - Split Strings with Multiple Delimiters?"
You can use the function len() to get the length of each individual word.
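Putting those pieces together, here is a minimal sketch; the file name, its contents, and the exact delimiter set are assumptions for illustration:

```python
import re

# Write a small sample file so the example is self-contained (made-up content).
with open("words.txt", "w") as f:
    f.write("boomhead shot, shamwow slapchop.\nanother line here")

# Read the whole file back as one large string.
with open("words.txt") as f:
    text = f.read()

# Split on commas, full stops and whitespace, as the linked posts suggest.
words = [w for w in re.split(r"[,.\s]+", text) if w]

# Keep only the words whose len() is exactly 8, and count them.
eight_letter_words = [w for w in words if len(w) == 8]
print(eight_letter_words)       # ['boomhead', 'slapchop']
print(len(eight_letter_words))  # 2
```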
I use the following code to open a text file, remove the HTML, and search for words before and after a certain keyword:
import nltk
import re

text = nltk.clean_html(open('file.txt').read())
text = text.lower()
pattern = re.compile(r'''(?x) ([^\(\)0-9]\.)+ | \w+(-\w+)* | \.\.\. ''')
text = nltk.regexp_tokenize(text, pattern)

# remove the digits from text
text = [i for i in text if not i.isdigit()]

# text is now a list of words from file.txt
# I now loop over the list to find all words before and after a specific keyword
keyword = ['foreign']
for i, w in enumerate(text):  # it gives the list items numbers
    if w in keyword:
        before_word = text[i-5:i-1] if i > 0 else ''
        before_word = ' '.join(word for word in before_word)
        after_word = text[i+1:i+5] if i+1 < len(text) else ''
        after_word = ' '.join(word for word in after_word)
        print "%s <%s> %s" % (before_word, w, after_word)
This code works well if the keyword is a single word. But what if I want to find the 5 words before and after 'foreign currency'? The issue is that every space-separated word in text is a separate item in the list, so I can't do keyword = ['foreign currency']. How can I solve this issue?
Sample .txt file here.
Have you considered a regex?
This will match and capture five words before, and five words after, foreign currency
((\w+ ){5})foreign currency(( \w+){5})
Edit: this regex breaks on things like tabs, quotes, commas, parentheses, etc., and the provided sample text doesn't have 5 words following the phrase, so it wouldn't match.
Here's an updated regex which captures 5 words up to, and 1-5 words following, the phrase.
It uses 'non-space' characters separated by 'non-word' characters for the words,
and it captures the whole thing as one group, including the search text:
((\S+\W){5}foreign currency(\W\S+){1,5})
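For illustration, that second pattern could be used from Python like this; the sample sentence is made up, and re.search only finds the first occurrence (use re.finditer for all of them):

```python
import re

text = ("the rate quoted for any foreign currency by the bank applies "
        "only to large transfers made during business hours")

# 5 words before the phrase, then 1-5 words after it, as one overall match.
pattern = r"(\S+\W){5}foreign currency(\W\S+){1,5}"

match = re.search(pattern, text)
if match:
    print(match.group(0))
    # the rate quoted for any foreign currency by the bank applies only
```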
Otherwise, you could try:
Join the text all into one line, no newlines
Use something = text.find('foreign currency') to find the first position of that text
Count backwards from there, character by character looking for spaces, for 5 words
Count forwards from the end, character by character looking for spaces, for 5 words
Loop all of this, using something = text.find('foreign currency', previous_end_pos) to tell it to look starting after the end of the previous step, to find the next instance.
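A rough sketch of that find()-based loop; it uses split() to grab up to five words on each side rather than walking character by character, and the helper name and sample text here are made up:

```python
def context(text, phrase, n=5):
    """Yield the up-to-n words before and after each occurrence of phrase."""
    start = text.find(phrase)
    while start != -1:
        before = text[:start].split()[-n:]               # up to n words before
        after = text[start + len(phrase):].split()[:n]   # up to n words after
        yield ' '.join(before), ' '.join(after)
        # keep looking after the end of the previous occurrence
        start = text.find(phrase, start + len(phrase))

text = "rates for any foreign currency held abroad vary a lot"
for before, after in context(text, "foreign currency"):
    print(before, "<foreign currency>", after)
```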
Have you thought about using a variable for the number of words in the "keyword" and iterating through the text by that number of items at a time?
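That idea can be sketched by treating the keyword as a list of tokens and comparing a slice of the same length at each position; the token list below is made up:

```python
text = ['the', 'bank', 'sold', 'its', 'entire', 'foreign', 'currency',
        'reserve', 'at', 'a', 'loss', 'last', 'week']
keyword = ['foreign', 'currency']
n = len(keyword)

for i in range(len(text) - n + 1):
    if text[i:i + n] == keyword:                   # a slice of n tokens matches
        before = ' '.join(text[max(i - 5, 0):i])   # up to 5 words before
        after = ' '.join(text[i + n:i + n + 5])    # up to 5 words after
        print("%s <%s> %s" % (before, ' '.join(keyword), after))
```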