How many times a word occurs in a file? - python

In my Python homework my assignment is to: "Write a complete python program that reads a file trash.txt and outputs how many times the word Bob occurs in the file."
My code is:
count=0
f=open('trash.txt','r')
bob_in_trash=f.readlines()
for line in bob_in_trash:
    if "Bob" in line:
        count=count+1
print(count)
f.close()
Is there any way to make this code more efficient? It counted 5 correctly but I was wondering if there's anything I could modify.

You can just read the whole file and count the number of occurrences of "Bob":
data = open('trash.txt').read()
count = data.count('Bob')
Although this is fine for smaller files, loading the whole file into memory might be a problem when you're dealing with bigger files.
Reading it line by line is more memory-efficient, but use str.count instead of "Bob" in line (which only tells you how many lines contain "Bob", not how many times it occurs).
count = 0
with open('trash.txt') as f:
    for line in f:
        count += line.count("Bob")

This way you're always counting one "Bob" per line... How about using the count method, so you could sum any number of occurrences per line:
for line in bob_in_trash:
    count=count+line.count("Bob")

For more versatility use regex to distinguish bob, Bob, bobcat, etc.
import re
with open('trash.txt','r') as f:
    count = sum(len(re.findall(r'\bbob\b', line)) for line in f)
Options:
r'\bbob\b' # matches bob only
r'(?i)\bbob\b' # matches bob, Bob
r'bob' # matches bob and the bob in bobcat, but not Bob (case-sensitive)
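To see how the three patterns above differ, here is a small sketch that runs them against a made-up sample string. The sample text and the helper name count_matches are assumptions for illustration only, not part of the answer.

```python
import re

def count_matches(pattern, text):
    """Return how many non-overlapping matches of pattern occur in text."""
    return len(re.findall(pattern, text))

sample = "bob met Bob and a bobcat"  # made-up test text

print(count_matches(r'\bbob\b', sample))      # whole word, case-sensitive
print(count_matches(r'(?i)\bbob\b', sample))  # whole word, any case
print(count_matches(r'bob', sample))          # any substring 'bob', case-sensitive
```

For the homework in the question, which asks for the word Bob, the pattern would presumably be r'\bBob\b'.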

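>>> count = 0
>>> abuffer = bytearray(4096)
>>> with open('trash.txt', 'rb') as fp:  # binary mode: readinto() fills a bytes buffer
...     while True:
...         n = fp.readinto(abuffer)
...         if n == 0:
...             break
...         count += abuffer.count(b'Bob', 0, n)  # search only the n bytes just read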
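One caveat with a fixed-size chunked read like the one above: a "Bob" that straddles two chunks is never seen whole, so it would be missed. A minimal sketch of one way around that, assuming a binary file object and a search term such as b"Bob" that cannot overlap itself, is to carry the last len(needle) - 1 bytes over into the next chunk. The function name count_in_chunks and its parameters are made up for this example.

```python
import io

def count_in_chunks(fp, needle, chunk_size=4096):
    """Count occurrences of the bytes `needle` in binary file object `fp`,
    reading at most `chunk_size` bytes at a time."""
    count = 0
    tail = b""  # last len(needle) - 1 bytes of the data seen so far
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        count += buf.count(needle)
        # Keep just enough bytes to complete a match that straddles the boundary.
        # The tail is shorter than the needle, so no match is counted twice.
        keep = len(needle) - 1
        tail = buf[-keep:] if keep else b""
    return count

# In-memory stand-in for open('trash.txt', 'rb'); with 4-byte chunks
# both occurrences of "Bob" are split across a chunk boundary.
demo = io.BytesIO(b"xxBobxxBob")
print(count_in_chunks(demo, b"Bob", chunk_size=4))
```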

Because you're looking for only whole words, it's best to use a regex:
import re
i = 0
with open('trash.txt','r') as file:
    for result in re.finditer(r'\bBob\b', file.read()):
        i += 1
print('Number of Bobs in file: ' + str(i))
Note that the regular expression is \bBob\b, where the \b at the beginning and end means that Bob must be a whole word, not part of a word. Also, I used finditer instead of findall because the former yields matches one at a time rather than building a full list, which uses less memory when there are many matches.
To save even more memory, combine with line-by-line reading:
i = 0
with open('trash.txt','r') as file:
    for line in file:
        for result in re.finditer(r'\bBob\b', line):
            i += 1
print('Number of Bobs in file: ' + str(i))

Related

How do I convert each of the words to a number?

I am trying to read a file and overwrite its contents with numbers. That means for the first word it would be 1, for the second word it would be 2, and so on.
This is my code:
file=open("reviews.txt","r+")
i=1
for x in file:
    line=file.readline()
    word=line.split()
    file.write(word.replace(word,str(i)))
    i+=1
file.close()
Input file:
This movie is not so good
This movie is good
Expected output file:
1 2 3 4 5 6
7 8 9 10
When I run it I keep getting the error: AttributeError: 'list' object has no attribute 'replace'. Which one is the list object? All the variables are strings as far as I know. Please help me.
It might be simpler to build the whole output first, by any method you like, and then write it to the file once. Calling file.write inside the loop isn't necessary.
Steps
We open the file, read all of its content, and close it.
Using the re module in DOTALL mode, the first capturing group, (\S+) (or (\w+), etc.), matches whatever we want to replace, and the second capturing group, (.+?), collects every other character. re.findall then returns a list of two-element tuples, in each of which exactly one element is non-empty.
We then loop over that list: wherever the first group matched we append an incrementing counter, otherwise we append the second group untouched, concatenating everything step by step into string_out.
Finally, we reopen the file for writing (which empties it), write string_out, and close it.
Test
import re
file = open("reviews.txt","r+")
word_finder, counter, string_out = re.findall(r"(\S+)|(.+?)", file.read(), re.DOTALL), 0, ''
file.close()
for item in word_finder:
    if item[0]:
        counter += 1
        string_out += str(counter)
    else:
        string_out += item[1]
try:
    file = open("reviews.txt","w")
    file.write(string_out)
finally:
    file.close()
Output
1 2 3 4 5 6
7 8 9 10
RegEx
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.
Reference
re — Regular expression operations
The call to split is returning a list, which you need to iterate to handle the replacement of each word:
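with open("reviews.txt", "r+") as file:
    lines = file.readlines()   # read everything before writing anything
    file.seek(0)               # go back to the start to overwrite the content
    i = 1
    for line in lines:
        numbers = []
        for item in line.split():
            numbers.append(str(i))
            i += 1
        file.write(' '.join(numbers) + '\n')
    file.truncate()            # drop leftover text if the new content is shorter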
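For comparison, the renumbering itself can also be sketched as a single re.sub call with a function as the replacement, which keeps the original whitespace and line breaks untouched. The helper name number_words is made up for this example, and it works on a string, so reading and writing reviews.txt would still have to be done around it.

```python
import itertools
import re

def number_words(text):
    """Replace every whitespace-separated word with a running number."""
    counter = itertools.count(1)  # yields 1, 2, 3, ...
    # re.sub calls the lambda once per match and inserts its return value.
    return re.sub(r'\S+', lambda match: str(next(counter)), text)

sample = "This movie is not so good\nThis movie is good\n"
print(number_words(sample), end="")
```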

How to count number of replacements made in string

I am currently working on a beginner problem
(https://www.reddit.com/r/beginnerprojects/comments/1i6sax/challenge_count_and_fix_green_eggs_and_ham/).
The challenge is to read through a file, replacing lower case 'i' with 'I' and writing a new corrected file.
I am at a point where the program reads the input file, replaces the relevant lower case characters, and writes a new corrected file. However, I need to also count the number of corrections.
I have looked through the .replace() documentation and I cannot see that it is possible to find out the number of replacements made. Is it possible to count corrections using the replace method?
def capitalize_i(file):
    file = file.replace('i ', 'I ')
    file = file.replace('-i-', '-I-')
    return file
with open("green_eggs.txt", "r") as f_open:
    file_1 = f_open.read()
file_2 = open("result.txt", "w")
file_2.write(capitalize_i(file_1))
You can just use the count function:
i_count = file.count('i ')
file = file.replace('i ', 'I ')
i_count += file.count('-i-')
file = file.replace('-i-', '-I-')
i_count will hold the total number of replacements made. You can also keep the two counts separate by using a variable for each.
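str.replace does not report a count, but re.subn from the standard library returns both the new string and the number of substitutions. Here is a minimal sketch, assuming a standalone lower-case i is what should be fixed. Note that the word-boundary pattern \bi\b is my substitution for the two replace calls in the question, so it behaves slightly differently (for example, it leaves the i in "hi " alone).

```python
import re

def capitalize_i(text):
    """Return (corrected_text, number_of_replacements)."""
    # \b marks a word boundary, so the 'i' inside 'think' or 'in' is left alone.
    return re.subn(r'\bi\b', 'I', text)

corrected, n_fixed = capitalize_i("i am Sam. Sam i am. would-i-could, i think")
print(corrected)
print(n_fixed)
```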

Print contents of a file and count the number of lines in the file - Python

Taking an intro to programming course using python.
Our task this week is to create a text file called 'randomnum.txt' and print to it using a python script. I was able to successfully create the file but am met with the second part of the assignment to print the contents of the file AND count the number of numbers (lines) in the .txt.
I have been able to print the contents or count the number of lines but never both. I'm pretty bad at Python and would like some help.
with open ('randomnum.txt','r') as random_numbers:
    num_nums = 0
    contents = random_numbers.read()
    for lines in random_numbers:
        num_nums += 1
print('List of random numbers in randomnum.txt')
print(contents)
print('Random number count: ', num_nums)
This way it gives me a random number count of 0.
any help would be super appreciated!
This is a good question because you're observing that a file object can only be read through once: after random_numbers.read() the file position is at the end, so the for loop that follows has nothing left to read.
What I recommend is, instead of calling .read(), iterating over .readlines(), which returns a list of all the lines in the file. While iterating through the lines, add one to your counter and print the current line:
with open("file.txt", "r") as myfile:
    total = 0
    for line in myfile.readlines():
        print(line, end="")
        total += 1
    print("Total: " + str(total))
Note the second parameter I pass through to print (end=""). This is because by default print adds a newline, but since the file will already have newlines at the end of the line, you would be printing two new lines. end="" stops the behaviour of print printing a trailing newline.
Try this: read the file with readlines, strip each line with map and the str.rstrip method descriptor, then use len to get the number of lines:
with open ('randomnum.txt','r') as random_numbers:
    l=random_numbers.readlines()
print('List of random numbers in randomnum.txt')
print(''.join(map(str.rstrip,l)))
print('Random number count: ', len(l))
Output of code with randomnum.txt as:
1
2
3
4
5
6
Output:
List of random numbers in randomnum.txt
123456
Random number count: 6
Calling .read() and iterating with for ... in ... both consume the contents of the file. You can't do both unless you call .seek(0) in between. Alternatively, instead of calling .read(), you could capture the lines in your for loop (maybe switch to .readlines()) and then not worry about it.
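The .seek(0) approach can be sketched like this. The helper name read_and_count is made up for the example, and the demo writes a throwaway temporary file as a stand-in for randomnum.txt.

```python
import os
import tempfile

def read_and_count(path):
    """Return (contents, line_count) using a single open file handle."""
    with open(path, 'r') as f:
        contents = f.read()             # consumes the file; position is now at the end
        f.seek(0)                       # rewind to the start
        line_count = sum(1 for _ in f)  # iterating now sees every line again
    return contents, line_count

# Demo on a throwaway file standing in for randomnum.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("4\n8\n15\n")
contents, line_count = read_and_count(tmp.name)
os.remove(tmp.name)

print('List of random numbers in randomnum.txt')
print(contents, end="")
print('Random number count:', line_count)
```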

MemoryError Python, in file 99999999 string

Windows 10 Pro 64-bit, with the 64-bit version of Python installed.
The file is 1.80 GB.
How can I fix this error and print all the strings?
def count():
    reg = open('link_genrator.txt', 'r')
    s = reg.readline().split()
    print(s)
reg.read().split('\n') will give a list of all lines.
Why don't you just do s = reg.read(65536).splitlines()? This will give you a hint on the structure of the content and you can then play with the size you read in a chunk.
Once you know a bit more, you can run that line in a loop and sum up the number of lines.
After looking at the other answers and trying to understand what the initial question could be, I have come to a more complete answer than my previous one.
Looking at the question and the code in the sample function, I now assume the following:
it seems the asker wants to separate the contents of a file into words and print them
from the function name, I suppose they would like to count all these words
the whole file is quite big and thus Python stops with a memory error
Handling such large files obviously asks for a different treatment than the usual ones. For example, I do not see any use in printing all the separated words of such a file on the console. Of course it might make sense to count these words or search for patterns in them.
To show how one might treat such big files, I wrote the following example. It is meant as a starting point for further refinements and changes according to your own requirements.
MAXSTR = 65536
MAXLP = 999999999
WORDSEP = ';'
lineCnt = 0
wordCnt = 0
lpCnt = 0
fn = 'link_genrator.txt'
fin = open(fn, 'r')
try:
    while lpCnt < MAXLP:
        pos = fin.tell()
        s = fin.read(MAXSTR)
        lines = s.splitlines(True)
        if len(lines) == 0:
            break
        # count words of line
        k = 0
        for l in lines:
            lineWords = l.split(WORDSEP)  # semi-colon separates each word
            k += len(lineWords)  # sum up words of each line
        wordCnt += k - 1  # last word most probably not complete: subtract one
        # count lines
        lineCnt += len(lines)-1
        # correction when line ends with \n
        if lines[len(lines)-1][-1] == '\n':
            lineCnt += 1
            wordCnt += 1
        lpCnt += 1
        print('{0} {4} - {5} act Pos: {1}, act lines: {2}, act words: {3}'.format(lpCnt, pos, lineCnt, wordCnt, lines[0][0:10], lines[len(lines)-1][-10:]))
finally:
    fin.close()
lineCnt += 1
print('Total line count: {}'.format(lineCnt))
That code works for files up to 2GB (tested with 2.1GB). The two constants at the beginning let you play with the size of the read in chunks and limit the amount of text processed. During testing you can then just process a subset of the whole data which goes much faster.
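If the file is made up of reasonably short lines (an assumption; one gigantic line without newlines would still be read into memory in one piece), a simpler sketch is to iterate over the file object, which hands out one line at a time. The function name count_lines_and_words is made up, and the ';' separator simply follows WORDSEP in the example above.

```python
import io

def count_lines_and_words(lines, sep=';'):
    """Count lines and sep-separated words from any iterable of lines."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        stripped = line.rstrip('\n')
        if stripped:  # an empty line contributes no words
            word_count += len(stripped.split(sep))
    return line_count, word_count

# A file object is an iterable of lines, so for the real file this would be:
#     with open('link_genrator.txt', 'r') as f:
#         print(count_lines_and_words(f))
demo = io.StringIO("a;b;c\nd;e\n\nf\n")  # in-memory stand-in for the big file
print(count_lines_and_words(demo))
```

Only the current line and two integers are held in memory at any time, so the size of the file does not matter.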

Read a line store it in a variable and then read another line and come back to the first line. Python 2

This is a tricky question and I've read a lot of posts about it, but I haven't been able to make it work.
I have a big file. I need to read it line by line, and once I reach a line of the form "Total is: (any decimal number)", take this string and save the number in a variable. If the number is bigger than 40.0, then I need to find the fourth line above the Total line (for example, if the Total line was line 39, this line would be line 35). This line will be in the format "(number).(space)(substring)". Finally, I need to parse this substring out and do further processing on it.
This is an example of what an input file might look like:
many lines that we don't care about
many lines that we don't care about
...
1. Hi45
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
...
more lines we don't care about
and then more lines and then
again we get
2. How144
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
I have tried many things, including using the re.search() method to capture what I need from each line I need to focus on.
Here is my code which I modified from another stackoverflow Q & A:
import re
import linecache
number = ""
higher_line = ""
found_line = ""
with open("filename_with_many_lines.txt") as aFile:
    for num, line in enumerate(aFile, 1):
        searchObj = re.search(r'(\bTotal\b)(\s)(\w+)(\:)(\s)(\d+.\d+)', line)
        if searchObj:
            print "this is on line", line
            print "this is the line number:", num
            var1 = searchObj.group(6)
            print var1
            if float(var1) > 40.0:
                number = num
                higher_line = number - 4
                print number
                print higher_line
                found_line = linecache.getline("filename_with_many_lines.txt", higher_line)
                print "found the", found_line
The expected output would be:
this is on line Total is: 45.5
this is the line number: 14857
14857
14853
found the 1. Hi45
this is on line Total is: 62.1
this is the line number: 14985
14985
14981
found the 2. How144
If the line you need is always four lines above the Total is: line, you could keep the previous lines in a bounded deque.
from collections import deque
with open(filename, 'r') as file:
    previous_lines = deque(maxlen=4)
    for line in file:
        if line.startswith('Total is: '):
            try:
                higher_line = previous_lines[-4]
                # store higher_line, do calculations, whatever
                break  # if you only want to do this once; else let it keep going
            except IndexError:
                # we don't have four previous lines yet
                # I've elected to simply skip this total line in that case
                pass
        previous_lines.append(line)
A bounded deque (one with a maximum length) will discard an item from the opposite side if adding a new item would cause it to exceed its maximum length. In this case, we're appending strings to the right side of the deque, so once the length of the deque reaches 4, each new string we append to the right side will cause it to discard one string from the left side. Thus, at the beginning of the for loop, the deque will contain the four lines prior to the current line, with the oldest line at the far left (index 0).
In fact, the documentation on collections.deque mentions use cases very similar to ours:
Bounded length deques provide functionality similar to the tail filter in Unix. They are also useful for tracking transactions and other pools of data where only the most recent activity is of interest.
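Putting the bounded deque together with the threshold check from the question might look like the sketch below. It uses Python 3 syntax, and the function name find_high_totals, the two regular expressions, and the sample lines are all assumptions made for illustration. It takes any iterable of lines, so an open file object can be passed in directly.

```python
import re
from collections import deque

TOTAL_RE = re.compile(r'^Total is: (\d+(?:\.\d+)?)')  # "Total is: 45.5"
HEADER_RE = re.compile(r'^\d+\.\s*(\S+)')             # "1. Hi45" -> "Hi45"

def find_high_totals(lines, threshold=40.0, lines_back=4):
    """Yield (line_number, total, substring) for each total above threshold."""
    previous = deque(maxlen=lines_back)  # holds only the last lines_back lines
    for num, line in enumerate(lines, 1):
        total = TOTAL_RE.match(line)
        if total and float(total.group(1)) > threshold and len(previous) == lines_back:
            header = HEADER_RE.match(previous[0])  # oldest stored line
            if header:
                yield num, float(total.group(1)), header.group(1)
        previous.append(line)

# Made-up sample shaped like the input in the question.
sample = [
    "1. Hi45\n", "People: bla bla\n", "\n", "bla bla\n", "Total is: 45.5\n",
    "bla\n",
    "2. How144\n", "People: bla bla\n", "\n", "bla bla\n", "Total is: 12.0\n",
]
print(list(find_high_totals(sample)))
```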
This stores the line which starts with a number and a dot into a variable called prevline. We print the prevline only if re.search returns a match object.
import re
with open("file") as aFile:
    prevline = ""
    for num, line in enumerate(aFile,1):
        m = re.match(r'\d+\.\s*.*', line) # stores the match object of the line which starts with a number and a dot
        if m:
            prevline += re.match(r'\d+\.\s*(.*)', line).group() # If there is any match found then this would append the whole line to the variable prevline. You could also write this line as prevline += m.group()
        searchObj = re.search(r'(\bTotal\b\s+\w+:\s+(\d+\.\d+))', line) # Search for the line which contains the string Total plus a word plus a colon and a float number
        if searchObj: # if there is any
            score = float(searchObj.group(2)) # then the float number is assigned to the variable called score
            if score > 40.0: # Do all the below operations only if the float number we fetched was greater than 40.0
                print "this is the line number: ", num
                print "this is the line", searchObj.group(1)
                print num
                print num-4
                print "found the", prevline
            prevline = ""
Output:
this is on line Total is: 45.5
this is the line number: 8
8
4
found the 1. Hi45
this is on line Total is: 62.1
this is the line number: 20
20
16
found the 2. How144
I suggested an edit to Blacklight Shining's post that built on its deque solution, but it was rejected with the suggestion that it instead be made into an answer. Below, I show how Blacklight's solution does solve your problem, if you were to just stare at it for a moment.
with open(filename, 'r') as file:
    # Clear: we don't care about checking the first 4 lines for totals.
    # Instead, we just store them for later.
    previousLines = []
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    # The earliest we should expect a total is at line 5.
    for lineNum, line in enumerate(file, 5):
        if line.startswith('Total is: '):
            prevLine = previousLines[0]
            high_num = prevLine.split()[1] # A
            score = float(line.strip("Total_is: ").strip("\n").strip()) # B
            if score > 40.0:
                # That's all! We've now got everything we need.
                # Display results as shown in example code.
                print "this is the line number : ", lineNum
                print "this is the line ", line.strip('\n')
                print lineNum
                print (lineNum - 4)
                print "found the ", prevLine
        # Critical - remove old line & push current line onto deque.
        previousLines = previousLines[1:] + [line]
I don't take advantage of deque, but my code accomplishes the same thing imperatively. I don't think it's necessarily a better answer than either of the others; I'm posting it to show how the problem you're trying to solve can be addressed with a very simple algorithm and simple tools. (Compare Avinash's clever 17 line solution with my dumbed-down 18 line solution.)
This simplified approach won't make you look like a wizard to anyone reading your code, but it also won't accidentally match on anything in the intervening lines. If you're dead set on hitting your lines with a regex, then just modify lines A and B. The general solution still works.
The point is, an easy way to remember what was on the line 4 lines back is to just store the last four lines in memory.
