I'm trying to count the number of instances several words appear in a file.
Here is my code:
#!/usr/bin/env python
file = open('my_output', 'r')
word1 = 'wordA'
print('wordA', file.read().split().count(word1))
word2 = 'wordB'
print('wordB', file.read().split().count(word2))
word3 = 'wordC'
print('wordC', file.read().split().count(word3))
The issue in the code is that it only counts the number of instances of word1. How can this code be fixed to count word2 and word3?
Thanks!
Instead of repeatedly reading and splitting the file, this code would work better if you built a term-frequency dictionary once; that way you can look up the frequency of any number of words found in the file:
file = open('my_output', 'r')
s = file.read()
s = s.split()
w = set(s)              # the unique words in the file
tf = {}
for i in w:
    tf[i] = s.count(i)
print(tf)
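With tf in hand you can then look up any particular words, for example the three from the question (dict.get() returns 0 for words that never appear in the file):

for word in ('wordA', 'wordB', 'wordC'):
    print(word, tf.get(word, 0))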
The main problem is that file.read() consumes the file. Thus the second time you search you end up searching an empty file. The simplest solution is to read the file once (if it is not too large) and then just search the previously read text:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    text = file.read()

word1 = 'wordA'
print('wordA', text.split().count(word1))
word2 = 'wordB'
print('wordB', text.split().count(word2))
word3 = 'wordC'
print('wordC', text.split().count(word3))
To improve performance it is also possible to split only once:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    split_text = file.read().split()

word1 = 'wordA'
print('wordA', split_text.count(word1))
word2 = 'wordB'
print('wordB', split_text.count(word2))
word3 = 'wordC'
print('wordC', split_text.count(word3))
Using with will also ensure that the file closes correctly after being read.
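To see the original problem in isolation, here is a minimal demonstration (assuming 'my_output' exists) of how a second read() on the same file object comes back empty:

with open('my_output', 'r') as file:
    first = file.read()   # returns the whole file contents
    second = file.read()  # the file object is already at end-of-file, so this is just ''
print(repr(second))       # always prints ''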
Can you try this:
file = open('my_output', 'r')
splitFile = file.read().split()
lst = ['wordA', 'wordB', 'wordC']
for wrd in lst:
    print(wrd, splitFile.count(wrd))
Short solution using collections.Counter object:
import collections

with open('my_output', 'r') as f:
    wordnames = ('wordA', 'wordB', 'wordC')
    counts = (i for i in collections.Counter(f.read().split()).items() if i[0] in wordnames)
    for c in counts:
        print(c[0], c[1])
For the following sample text line:
'wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd wordB wordC wordC sdfsdfsdf wordA'
we would obtain the output:
wordB 1
wordC 2
wordA 3
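The generator above only yields words that actually occur in the file; if you also want a 0 printed for words that never appear, a small variant (a sketch, not part of the original answer) can index the Counter directly, since a Counter returns 0 for missing keys:

import collections

with open('my_output', 'r') as f:
    counter = collections.Counter(f.read().split())

for name in ('wordA', 'wordB', 'wordC'):
    # missing keys simply count as 0
    print(name, counter[name])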
In your code the file is consumed (exhausted) by the first line, so the following lines have nothing left to count: the first file.read() reads the whole contents of the file and returns it as a string; the second file.read() has nothing left to read and just returns an empty string '', as does the third file.read().
This is a version that should do what you want:
from collections import Counter

counter = Counter()
with open('my_output', 'r') as file:
    for line in file:
        counter.update(line.split())
print(counter)
You may have to do some preprocessing in order to get rid of special characters such as , and . and so on.
Counter is in the Python standard library and is very useful for exactly this kind of thing.
Note that this way you iterate over the file only once and you never have to store the whole file in memory.
If you only want to keep track of certain words, you can select just those instead of passing the whole line to the counter:
from collections import Counter
import string

counter = Counter()
words = ('wordA', 'wordB', 'wordC')
chars_to_remove = str.maketrans('', '', string.punctuation)

with open('my_output', 'r') as file:
    for line in file:
        line = line.translate(chars_to_remove)
        w = (word for word in line.split() if word in words)
        counter.update(w)
print(counter)
This also includes an example of what I meant by preprocessing: punctuation is removed before counting.
from collections import Counter

# Create an empty word_list which stores each of the words from a line.
word_list = []

# file_handle refers to the file object
file_handle = open(r'my_file.txt', 'r+')

# read all the lines in the file
for line in file_handle.readlines():
    # take each line, split it into a list of words
    # and extend the word_list with those words
    word_list.extend(line.split())

# close the file object
file_handle.close()

# Pass the word_list to Counter() and get the dictionary of word counts
dictionary_of_words = Counter(word_list)
print(dictionary_of_words)
Related
article = open("article.txt", encoding="utf-8")
for i in article:
    print(max(i.split(), key=len))
The text is written with line breaks, and it gives me the longest words from each line. How to get the longest word from all of the text?
One approach would be to read the entire text file into a Python string, remove newlines, and then find the largest word:
import re

with open('article.txt', 'r') as file:
    data = re.sub(r'\r?\n', ' ', file.read())
longest_word = max(re.findall(r'\w+', data), key=len)
longest = 0
curr_word = ""
with open("article.txt", encoding="utf-8") as f:
    for line in f:
        for word in line.split(" "):  # iterate line by line so the whole file is never loaded into memory
            word = word.strip()
            if (wl := len(word)) > longest:  # Python 3.8+, otherwise use 2 lines
                longest = wl
                curr_word = word
print(curr_word)
Instead of iterating through each line, you can read the entire text of the file and then split it using article.read().split():
article = open("test.txt", encoding="utf-8")
print(max(article.readline().split(), key=len))
article.close()
There are many ways you could do that. This would work:
with open("article.txt", encoding="utf-8") as article:
txt = [word for item in article.readlines() for word in item.split(" ")]
biggest_word = sorted(txt, key=lambda word: (-len(word), word), )[0]
Note that I am using a with statement to close the file when the reading is done, that I use readlines to read the entire file, returning a list of lines, and that I unpack the split items twice to get a flat list of words. The last line of code sorts the list and uses -len(word) to invert the sorting from ascending to descending.
I hope this is what you are looking for :)
If your file is small enough to fit in memory, you can read all of it at once.
file = open("article.txt", encoding="utf-8", mode='r')
all_text = file.read()
longest = max(i.split(), key=len)
print(longest)
I have this text file made up of numbers and words, for example like this - 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician and I want to split it so that each word or number will come up as a new line.
A whitespace separator would be ideal as I would like the words with the dashes to stay connected.
This is what I have so far:
f = open('words.txt', 'r')
for word in f:
    print(word)
Not really sure how to go on from here; I would like this to be the output:
09807754
18
n
03
aristocrat
...
Given this file:
$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6
If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):
with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)
Prints:
line1
word1
word2
line2
...
word6
Similarly, if you want to flatten the file into a single flat list of words, you might do something like this:
with open('words.txt') as f:
    flat_list = [word for line in f for word in line.split()]
>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']
This can produce the same output as the first example with print('\n'.join(flat_list)).
Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):
with open('words.txt') as f:
    matrix = [line.split() for line in f]
>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
If you want a regex solution, which would allow you to filter wordN vs lineN type words in the example file:
import re

with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            print(word)  # prints each wordN, with no lineN
Or, if you want that to be a line by line generator with a regex:
with open("words.txt") as f:
(word for line in f for word in re.findall(r'\w+', line))
f = open('words.txt')
for word in f.read().split():
    print(word)
As a supplement,
if you are reading a very large file and you don't want to read all of its content into memory at once, you might consider using a buffer and then returning each word with yield:
def read_words(inputfile):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(10240)
            if not buf:
                break

            # make sure we end on a space (word boundary)
            while not str.isspace(buf[-1]):
                ch = f.read(1)
                if not ch:
                    break
                buf += ch

            words = buf.split()
            for word in words:
                yield word
        yield ''  # handle the case where the file is empty

if __name__ == "__main__":
    for word in read_words('./very_large_file.txt'):
        process(word)
What you can do is use nltk to tokenize the words and then store all of them in a list; here's what I did.
If you don't know nltk, it stands for Natural Language Toolkit and is used to process natural language. Here is a resource if you want to get started:
http://www.nltk.org/book/
import nltk
from nltk.tokenize import word_tokenize

file = open("abc.txt", newline='')
result = file.read()
words = word_tokenize(result)
for i in words:
    print(i)
The output will be this:
09807754
18
n
03
aristocrat
0
blue_blood
0
patrician
with open(filename) as file:
    words = file.read().split()
words is then a list of all the words in your file.
import re

with open(filename) as file:
    words = re.findall(r"([a-zA-Z\-]+)", file.read())
Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:
Note: for Python 3, replace itertools.imap with map (a Python 3 variant is sketched at the end of this answer).
import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
                            itertools.imap(mfile.read,
                                           itertools.repeat(1))), str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
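For completeness, here is what a Python 3 version of the function might look like (a sketch only, under the note above; the name readwords_py3 is just illustrative, and map plays the role of itertools.imap since it is lazy in Python 3):

import itertools

def readwords_py3(mfile):
    # read one character at a time, stop at end-of-file (read(1) returns ''),
    # and group runs of whitespace / non-whitespace characters together
    char_stream = itertools.groupby(
        itertools.takewhile(bool, map(mfile.read, itertools.repeat(1))),
        str.isspace)
    # join each non-whitespace group back into a word
    return ("".join(group) for pred, group in char_stream if not pred)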
I want to ignore repeated words in a .txt text file, and I've tried many things but none of them worked.
def words_in_file(filename):
    f = open(filename)
    word = []
    for line in f:
        word.append(line)
    return word
You can use a set, which ignores duplicates.
def words_in_file(filename):
    words = []
    with open(filename) as f:
        for line in f:
            words.append(line)
    return set(words)
Note that I used with, which will make sure to close the file after using it or if there is an error.
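This collects unique lines; if each line may contain several words and you want unique words instead, a small variant along the same lines (the function name is just illustrative) would split each line first:

def unique_words_in_file(filename):
    # collects the set of whitespace-separated words across all lines
    words = set()
    with open(filename) as f:
        for line in f:
            words.update(line.split())
    return words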
I don't know your file, but judging from your code it doesn't quite do what you want yet. You can find the unique items like this: list(set(another_list)). If each line contains just one word, you can change your code like this:
def words_in_file(filename):
    f = open(filename)
    word = []
    for line in f:
        word.append(line)
    another_list = list(set(word))
    return another_list
Edit: and at the end, don't forget to close your file.
I am trying to write a program where I count the most frequently used words from one file but those words should not be available in another file. So basically I am reading data from test.txt file and counting the most frequently used word from that file, but that word should not be found in test2.txt file.
Below are sample data files, test.txt and test2.txt
test.txt:
The Project is for testing. doing some testing to find what's going on. the the the.
test2.txt:
a
about
above
across
after
afterwards
again
against
the
Below is my script, which parses files test.txt and test2.txt. It finds the most frequently used words from test.txt, excluding words found in test2.txt.
I thought I was doing everything right, but when I execute my script, it gives "the" as the most frequent word. But actually, the result should be "testing", as "the" is found in test2.txt but "testing" is not found in test2.txt.
from collections import Counter
import re

dgWords = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'rb')
sWords = [line.strip() for line in f]
print(len(dgWords))
for sWord in sWords:
    print(sWord)
    print(dgWords)
    while sWord in dgWords: dgWords.remove(sWord)
print(len(dgWords))
mostFrequentWord = Counter(dgWords).most_common(1)
print(mostFrequentWord)
Here's how I'd go about it, using sets:
from collections import Counter
import re

all_words = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'r')
stop_words = [line.strip() for line in f]

set_all = set(all_words)
set_stop = set(stop_words)
all_only = set_all - set_stop

print(Counter(filter(lambda w: w in all_only, all_words)).most_common(1))
This should also be slightly faster, since the Counter only has to count the words that survive the all_only filter.
I simply changed the following line of your original code
f = open('test2.txt', 'rb')
to
f = open('test2.txt', 'r')
and it worked. Simply read your text as a string instead of as binary; otherwise the stop words are bytes objects and never match the strings produced by the regex. Tested on Python 3.4, Eclipse PyDev, Win7 x64.
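A quick illustration of the type mismatch behind this (just a sketch, not part of the original answer):

line_from_binary_file = b'the\n'   # what iterating a file opened with 'rb' gives you
word_from_regex = 'the'            # re.findall on a str returns str objects
print(line_from_binary_file.strip() == word_from_regex)            # False: bytes != str
print(line_from_binary_file.strip().decode() == word_from_regex)   # True after decoding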
OFFTOPIC:
It's more pythonic to open files using with statements. In this case, write
with open('test2.txt', 'r') as f:
and indent the file-processing statements accordingly. That keeps you from forgetting to close the file stream.
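Applied to the relevant part of the question's script, that might look like this (a sketch only, keeping the rest of the script as it was):

with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]
# the file is closed automatically here, even if an exception occurred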
import re
from collections import Counter

with open('test.txt') as testfile, open('test2.txt') as stopfile:
    stopwords = set(line.strip() for line in stopfile)
    words = Counter(re.findall(r'\w+', testfile.read().lower()))

for word in stopwords:
    if word in words:
        words.pop(word)

print("the most frequent word is", words.most_common(1))
I'm a beginner with Python and would like to know how to use two .txt files to count the characters as well as find the 10 most common characters. Also, how do I convert all characters in the files to lower case and eliminate all characters other than a-z?
Here's what I've tried and had no luck with:
from string import ascii_lowercase
from collections import Counter
with open('document1.txt', 'document2.txt') as f:
    print Counter(letter for line in f
                  for letter in line.lower()
                  if letter in ascii_lowercase)
Try it like this:
>>> from collections import Counter
>>> import re
>>> words = re.findall(r'\w+', "{} {}".format(open('your_file1').read().lower(), open('your_file2').read().lower()))
>>> Counter(words).most_common(10)
Here is a simple example. You can adapt this code to fit your needs
from string import ascii_lowercase
from collections import Counter

with open('file1.txt', 'r') as file1data:  # open and read file one
    file1 = file1data.read().lower()       # convert the entire file contents to lower case
with open('file2.txt', 'r') as file2data:  # open and read file two
    file2 = file2data.read().lower()

# The contents of both file 1 and file 2 are now stored in the file1 and file2 variables.
# Example of how to work with one file; repeat for the second file.
file1_list = []
for ch in file1:
    if ch in ascii_lowercase:            # only lowercase letters are appended; all non-alphabet characters are removed
        file1_list.append(ch)
    elif ch in [" ", ".", ",", "'"]:     # remove this elif block if you just want the letters
        file1_list.append(ch)            # keeps basic punctuation

print "".join(file1_list)                  # this line is not needed; it just shows what the text looks like now
print Counter(file1_list).most_common(10)  # prints the top ten characters
print Counter(file1_list)                  # prints each character and how many times it appears
Now that you've reviewed that mess above and have an idea of what each line is doing, here is a cleaner version that does what you were looking for.
from string import ascii_lowercase
from collections import Counter

with open('file1.txt', 'r') as file1data:
    file1 = file1data.read().lower()
with open('file2.txt', 'r') as file2data:
    file2 = file2data.read().lower()

file1_list = []
for ch in file1:
    if ch in ascii_lowercase:
        file1_list.append(ch)

file2_list = []
for ch in file2:
    if ch in ascii_lowercase:
        file2_list.append(ch)

all_counter = Counter(file1_list + file2_list)
top_ten_counter = Counter(file1_list + file2_list).most_common(10)

print sorted(all_counter.items())
print sorted(top_ten_counter)
To find the most common words across multiple files:
from collections import Counter
import re

with open('document1.txt') as f1, open('document2.txt') as f2:
    words = re.findall(r'\w+', f1.read().lower()) + re.findall(r'\w+', f2.read().lower())

>>> Counter(words).most_common(10)
"wil give you most 10 common words"
If you want the 10 most common characters:
>>> Counter(f1.read() + f2.read()).most_common(10)
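Since the question also asks to lowercase everything and keep only a-z, a small sketch combining that with Counter might look like this (the file names are the question's; this is not part of the original answer):

from collections import Counter
from string import ascii_lowercase

with open('document1.txt') as f1, open('document2.txt') as f2:
    text = (f1.read() + f2.read()).lower()

# keep only the characters a-z before counting
letters = [ch for ch in text if ch in ascii_lowercase]
print(Counter(letters).most_common(10))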