How do I convert each of the words to a number? - python

I am trying to read a file and overwrite its contents with numbers. That means for the first word it would be 1, for the second word it would be 2, and so on.
This is my code:
file=open("reviews.txt","r+")
i=1
for x in file:
    line=file.readline()
    word=line.split()
    file.write(word.replace(word,str(i)))
    i+=1
file.close()
Input file:
This movie is not so good
This movie is good
Expected output file:
1 2 3 4 5 6
7 8 9 10
When I run it I keep getting this error: AttributeError: 'list' object has no attribute 'replace'. Which one is the list object? All the variables are strings as far as I know. Please help me.

It may be easier to build the whole output first, by whatever method you like, and then write it to the file once. Calling file.write inside the loop is not really necessary.
Steps
We open the file, read all of its content, and close it.
Using the re module in DOTALL mode, we capture whatever we want to replace in the first capturing group, here with (\S+) (or (\w+), etc.), and any other single character in the second capturing group with (.+?). re.findall then returns a list of two-element tuples, and in each tuple exactly one of the two groups is non-empty.
We then loop over that list: each non-empty first group is replaced with an incrementing counter, which is the idea here, each second group is kept untouched, and both are concatenated step by step into string_out.
Finally we reopen the file in write mode (which empties it), write string_out, and close it.
Test
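import re

# Read the whole file, then close it before rewriting.
with open("reviews.txt", "r") as file:
    content = file.read()

# Group 1 is a word; group 2 is any other single character (whitespace here).
word_finder = re.findall(r"(\S+)|(.+?)", content, re.DOTALL)
counter, string_out = 0, ''

for item in word_finder:
    if item[0]:
        counter += 1
        string_out += str(counter)
    else:
        string_out += item[1]

# Opening with "w" truncates the file before writing.
with open("reviews.txt", "w") as file:
    file.write(string_out)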
Output
1 2 3 4 5 6
7 8 9 10
RegEx
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and also shows how it matches against sample inputs.
Reference
re — Regular expression operations
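As a shorter variant of the steps above (a sketch, not the code from this answer), re.sub accepts a function as the replacement, so the counter can be applied in one pass. The name number_words and the in-memory sample string are made up for illustration; reading and rewriting reviews.txt would work the same way as in the code above.

```python
import itertools
import re


def number_words(text, start=1):
    """Replace every run of non-whitespace characters with a running counter."""
    counter = itertools.count(start)
    # re.sub calls the function once per match, left to right;
    # whitespace between words (including newlines) is left untouched.
    return re.sub(r"\S+", lambda match: str(next(counter)), text)


sample = "This movie is not so good\nThis movie is good\n"
print(number_words(sample), end="")
```

Because only the \S+ runs are replaced, the original line breaks and spacing are kept without needing a second capturing group.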

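The call to split returns a list, so you need to iterate over it to handle each word. Read all the lines before writing, so that the writes do not interfere with the reads:
with open("reviews.txt", "r+") as file:
    lines = file.readlines()
    file.seek(0)  # go back to the start before overwriting
    i = 1
    for line in lines:
        numbers = []
        for item in line.split():
            numbers.append(str(i))
            i += 1
        file.write(' '.join(numbers) + '\n')
    file.truncate()  # drop any leftover original text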
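To answer the "which one is the list object?" part directly, here is a small check you can run in isolation. It is only an illustration on one sample line; the variable names are not taken from the question.

```python
line = "This movie is not so good\n"
words = line.split()

print(type(line).__name__)   # the line read from the file is a str
print(type(words).__name__)  # the result of split() is a list

# A list has no .replace(); build the replacement text from its items instead.
numbered = " ".join(str(n) for n, _ in enumerate(words, start=1))
print(numbered)
```

So in the question's code, word is the list, and calling word.replace(...) is what raises the AttributeError.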

Related

Iterate over pair lines and count using python

I have a text file that looks like:
>MN00153:75:000H37WNG:1:11102:13823:1502
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1504
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1506
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2_rc : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1508
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
EIF2_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
I am interested in lines 3 and 4 of each bucket (each bucket starts with '>'). I want to count 1 if both line 3 and line 4 start with CASP3 (regardless of what comes afterward), so the output should be
3
because the first, second, and third buckets have a CASP3 pair in lines 3 and 4, and the last one does not.
Thanks
If your file is not too large, you can use .readlines to get a list of lines:
with open('filename.txt', 'r') as f:
    lines = f.readlines()
then use enumerate and str methods as follows:
cnt = 0
for inx, line in enumerate(lines):
    if line.startswith('>') and lines[inx+2].startswith('CASP3') and lines[inx+3].startswith('CASP3'):
        cnt += 1
print(cnt)
My solution requires that there are at least 3 lines after the last line starting with >.
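If the last bucket might be cut short, a slice can replace the direct indexing, since slicing past the end of a list returns a shorter list instead of raising IndexError. This is a sketch under that assumption; count_casp3_pairs and the sample list are hypothetical names used only here.

```python
def count_casp3_pairs(lines, keyword="CASP3"):
    """Count '>' headers whose 3rd and 4th bucket lines both start with keyword."""
    count = 0
    for idx, line in enumerate(lines):
        if not line.startswith(">"):
            continue
        pair = lines[idx + 2:idx + 4]  # slicing never raises IndexError
        if len(pair) == 2 and all(item.startswith(keyword) for item in pair):
            count += 1
    return count


sample_lines = [
    ">read1", "ACGT", "CASP3_fw1_rc : A", "CASP3_fw2 : B", "Distance is 44",
    ">read2", "ACGT", "CASP3_fw1_rc : A", "EIF2_fw2 : B", "Distance is 44",
    ">read3", "ACGT",  # truncated bucket: skipped instead of crashing
]
print(count_casp3_pairs(sample_lines))
```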
Without reading the whole file into memory:
def startswith_casp(iterator):
    # grab the next three lines after the '>' header
    chunk = [line for _, line in zip(range(3), iterator)]
    # use a slice here to avoid index errors; a truncated bucket does not count
    return len(chunk) == 3 and all(c.startswith('CASP3') for c in chunk[1:3])

with open('yourfile.txt') as fh:
    count = 0
    for line in fh:
        if not line.strip():
            continue
        elif line.startswith('>'):
            # Function returns a boolean, so True will add 1 while False adds 0
            count += startswith_casp(fh)
        else:
            continue
print(count)
Here I am splitting by \n\n to get the "buckets", then by \n to get the lines within each bucket, then checking the 3rd ([2]) and 4th ([3]) line in each bucket for the pattern:
with open('genes.txt') as file:
    data = file.read()
by_bucket = [i.split('\n') for i in data.split('\n\n')]
count = 0
for bucket in by_bucket:
    count += (bucket[2].startswith('CASP3') and bucket[3].startswith('CASP3'))
print(count)
It would probably help if you explained what you have already tried. That being said, here's a general approach:
Use the .split() method to split at any character(s) you want. This gives you a list in which each entry is one bucket.
Loop over this list with for VariableName in ExampleList: to check each bucket on its own.
You can optionally check whether the first entry starts with a >; if you did the splitting correctly you may not need to.
Separate each bucket into another list where each entry is one line by using bucket.splitlines().
Then check whether the first characters of the 3rd and 4th entries in this list are CASP3 by testing bucket[2][:5]=="CASP3" (for the third line) and bucket[3][:5]=="CASP3" (for the fourth line).
Add a counter to the function that is increased by 1 whenever a bucket is valid.
Return this counter.
If you have additional questions, feel free to ask.
Here's an example that takes a string and returns the value you need:
def getValue(string):
    counter=0
    splitList=string.split("\n\n")
    for bucket in splitList:
        bucket=bucket.splitlines()
        if bucket[2][:5]=="CASP3" and bucket[3][:5]=="CASP3":
            counter+=1
    return counter
Note that this function relies on the buckets being separated by an empty line, but you can change that to separate on any other character(s).
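If the buckets are not separated by empty lines (the sample in the question is shown without any), one option is to start a new bucket at every line that begins with '>'. This is a sketch under that assumption; split_buckets and count_casp3 are hypothetical helper names.

```python
def split_buckets(text):
    """Group lines into buckets, starting a new bucket at each '>' header."""
    buckets = []
    for line in text.splitlines():
        if line.startswith(">"):
            buckets.append([line])
        elif buckets and line.strip():  # ignore blank lines and anything before the first header
            buckets[-1].append(line)
    return buckets


def count_casp3(text, keyword="CASP3"):
    """Count buckets whose 3rd and 4th lines both start with keyword."""
    return sum(
        1
        for bucket in split_buckets(text)
        if len(bucket) >= 4
        and bucket[2].startswith(keyword)
        and bucket[3].startswith(keyword)
    )


sample_text = (
    ">read1\nACGT\nCASP3_fw1_rc : A\nCASP3_fw2 : B\nDistance is 44\n"
    ">read2\nACGT\nCASP3_fw1_rc : A\nEIF2_fw2 : B\nDistance is 44\n"
)
print(count_casp3(sample_text))
```

Since blank lines are skipped, the same function gives the same count whether or not the buckets are separated by empty lines.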
My solution reads file.txt into a dictionary of text sections (where a section spans the lines between two greater-than symbols, i.e. '>'), which then allows you to easily perform some comparisons.
file_path = './file.txt'
keyword="CASP3"
section_ID = 0
count = 0
all_sections = {}
with open(file_path,'r') as f:
    for line in f:
        if line.startswith(">"):
            if line not in all_sections:
                section_ID += 1
                all_sections[section_ID] = {}
                all_sections[section_ID]['entries'] = []
        all_sections[section_ID]['entries'].append(line)
for sec_id in all_sections:
    if all_sections[sec_id]['entries'][2].startswith(keyword) and all_sections[sec_id]['entries'][3].startswith(keyword):
        count+=1
print('count : ', count)
The output using your file would be:
count : 3

Defining a function to count the number of lines in a file, containing a certain substring

I'm kinda new to Python. I'm trying to define a function that can count the number of lines in a file, containing a particular substring. I also want to count the lines which have multiple values of my substring as just 1.
Here's my code:
def CLT(filename):
    with open(filename,'r') as f:
        pattern='ing'
        count=a=0
        k=f.readlines()
        for line in k:
            if pattern in k[a:]:
                count += 1
        return count
print( CLT('random_file.txt') )
Assume that my file has 25 instances where a string 'str' appears but it has 2 lines where 2 'str' appear on the same line. So the ideal output to this problem should be 23.
But it's returning 0 as the number of lines. I also recognize that my code doesn't handle the part where lines with multiple occurrences of the substring are counted just once. What can I do to improve this code?
You've got a slight error in your code:
if pattern in k[a:]:
should be:
if pattern in line[a:]:
It looks like you're positioning yourself to use a to keep track of when you've already found the string in the line and you're now looking for an additional occurrence, but if not, you should remove it as it complicates the logic.
Otherwise, if you use a to show the index of where you already found an occurrence of the string in the line, you need to make sure to start looking again at index a + 1 so that you don't find the same occurrence again and again and end up in an infinite loop when you add a loop to check for further occurrences in the same line.
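To make the difference concrete, here is a sketch of both counts: lines that contain the pattern at least once (what the question asks for) and the total number of occurrences, found by restarting the search one position past the previous hit. The helper name count_occurrences and the sample lines are made up for illustration.

```python
def count_occurrences(line, pattern):
    """Count every occurrence of pattern in line, using an explicit start index."""
    count = 0
    start = line.find(pattern)
    while start != -1:
        count += 1
        # Resume one past the previous hit so the same hit is not found again.
        start = line.find(pattern, start + 1)
    return count


lines = ["singing in the rain\n", "nothing here\n", "plain text\n"]

# A line counts once no matter how many times the pattern appears in it.
lines_with_pattern = sum(1 for line in lines if "ing" in line)
total_occurrences = sum(count_occurrences(line, "ing") for line in lines)

print(lines_with_pattern, total_occurrences)
```

For the count the question wants, the plain "pattern in line" test is enough, and the index a is not needed at all.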
Here is the code you might want to try:
def CLT(filename):
    with open(filename, 'r') as f:
        pattern = 'ing'
        count = 0
        for line in f:
            if pattern in line:
                count += 1
    return count
print(CLT('random_file.txt'))
Hope this helps you!

What qualifies collection of strings to become a line?

The following code takes every character and runs the loop once per character. But when I save the same line in a text file and perform the same operation, the loop runs only once for the one line. It is a bit confusing. A possible reason I can think of is that the first method runs the loop by treating "a" as a list. Kindly correct me if I am wrong. Also let me know how to create a line in the code itself rather than first saving it in a file and then using it.
>>> a="In this world\n"
>>> i=0
>>> for lines in a:
...     i=i+1
...     print i
...
1
2
3
4
5
6
7
8
9
10
11
12
13
You're trying to loop over a, which is a string. Regardless of how many newlines you have in a string, when you loop over it, you're going to go character by character.
If you want to loop through a bunch of lines, you have to use a list:
lines = ["this is line 1", "this is another line", "etc"]
for line in lines:
    print(line)
If you have a string containing a bunch of newlines and want to convert it to a list of lines, use the split method:
text = "This is line 1\nThis is another line\netc"
lines = text.split("\n")
for line in lines:
    print(line)
The reason why you go line by line when reading from a file is because the people who implemented Python decided that it would be more useful if iterating over a file yielded a collection of lines instead of a collection of characters.
However, a file and a string are different things, and you should not necessarily expect that they work in the same way.
Just change the name of the variable when looping over the string:
i = 0
worldLine = "In this world\n"
for character in worldLine:
    i = i + 1
    print(i)
and compare it with looping over a file:
count = 0
readFile = open('myFile','r')
for line in readFile:
    count += 1
Now it should be clear what's going on.
Keeping meaningful names will save you a lot of debugging time.
Consider doing the following:
i = 0
worldLine = ["In this world\n"]
for line in worldLine:
    i = i + 1
    print(i)
if you want to loop over a list of lines consisting of worldLine only.
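For the "create a line in the code itself" part, one option not mentioned above is io.StringIO from the standard library, which wraps a string in a file-like object that iterates line by line, just like an open file. The sample text below is made up; this is a sketch written for Python 3.

```python
import io

text = "In this world\nwe have\nthree lines\n"

# Iterating over the string itself goes character by character.
chars = sum(1 for _ in text)

# splitlines() gives a list of lines without the trailing newlines.
lines = text.splitlines()

# StringIO behaves like a file: iterating yields whole lines, newlines included.
file_like = io.StringIO(text)
from_file = [line for line in file_like]

print(chars, len(lines), len(from_file))
```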

Python 3.4 - Capture block of text based on single string

I have searched far and wide and I hope someone can either point me to the link I missed or help me out with this logic.
We have a script that goes out and collects logs from various devices and places them in text files. Within these text files there is a time stamp, and we need to collect the few lines of text before and after this time stamp.
I already have a script that matches the time stamps and removes them for certain reports (included below) but I cannot figure out how to match the time stamp and then capture the surrounding lines.
import re

regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
pre_list = []
with open(filename, 'r') as f:
    h = f.readlines()
    for line in h:
        if regex_time_stamp.search(line) is not None:
            new_line = re.sub(regex_time_stamp, '', line)
            pre_list.append(new_line)
        else:
            pre_list.append(line)
Any assistance would be greatly appreciated! Thanks for taking the time to read this.
The basic algorithm is to remember the three most recently read lines. When you match a header, read the next two lines and then combine them with the header and the last three lines that you've saved.
Alternately, since you're saving all of the lines in a list, simply keep track of which element is the current element, and when you find a header you can go back and get the previous two and next two elements.
Catch with duplicated lines
I agree with the basic algorithm by @Bryan-Oakley and @TigerhawkT3; however, there's a catch:
What if several lines match consecutively?
You could end up duplicating "context" lines by printing the last 2 lines of the first match, and then the last 2 lines of the second match... that would actually also contain the previous matched line.
The solution is to keep track of which line number was last printed, in order to print just enough lines before the current matched line.
Flexible context parameter
What if you also want to print 3 lines before and after instead of 2? Then you need to keep track of more lines.
What if you want only 1 ?
Then your number of lines to print needs to be a parameter and the algorithm needs to use it.
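One way to picture both points (no duplicated context lines, configurable number of lines) is to collect the indices to keep in a set, so overlapping windows collapse automatically. This is only a sketch and not the solution given further down: lines_with_context and its context parameter are hypothetical names, the lines are assumed to be in a list already, and the timestamp is not stripped here.

```python
import re


def lines_with_context(lines, pattern, context=2):
    """Return matching lines plus `context` lines before and after, without duplicates."""
    regex = re.compile(pattern)
    wanted = set()
    for idx, line in enumerate(lines):
        if regex.search(line):
            first = max(idx - context, 0)              # do not run past the start
            last = min(idx + context, len(lines) - 1)  # or past the end
            wanted.update(range(first, last + 1))
    # A set holds each index once, so overlapping windows are not repeated.
    return [lines[idx] for idx in sorted(wanted)]


sample = ["NOT 0", "NOT 1", "NOT 2", "MATCH 3", "MATCH 4",
          "NOT 5", "NOT 6", "NOT 7", "NOT 8"]
print(lines_with_context(sample, "MATCH", context=1))
```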
Sample input and output
Here's a file sample that contains the word MATCH instead of your timestamp, for clarity. The other lines contain NOT + line number
==
NOT 0
NOT 1
NOT 2
NOT 3
NOT 4
MATCH LINE 5
NOT 6
NOT 7
NOT 8
NOT 9
MATCH LINE 10
MATCH LINE 11
NOT 12
MATCH LINE 13
NOT 14
==
The output should be:
==
NOT 3
NOT 4
LINE 5
NOT 6
NOT 8
NOT 9
LINE 10
LINE 11
NOT 12
LINE 13
NOT 14
==
Solution
This solution iterates on the file and keeps track of:
what is the last line that was printed? This will take care of not duplicating "context" lines if matched lines come in sequence.
what is the last line that was matched? This will tell the program to print the current line if it is "close" to the last matched line. How close? This is determined by your "number of lines to print" parameter. Then we also set the last_line_printed variable to the current line index.
Here's a simplified algorithm in English:
When matching a line we will:
print the last N lines, from the last_line_printed variable to the current index
print the current line after stripping the timestamp
set the last_line_printed = last_line_matched = current line index
continue
When not matching a line we will:
print the current line if current_index < last_line_matched index + number_of_lines_to_print
Of course we're taking care of whether we're close to the beginning of file by checking limits
Not print but return an array
This solution doesn't print directly but returns an array with all the lines to print. That's just a bit classier.
I like to name my "return" variable result but that's just me. It makes it obvious what is the result variable during the whole algorithm.
Code
You can try this code with the input above, it'll print the same output.
def search_timestamps_context(filename, number_of_lines_to_print=2):
    import re
    result = []
    regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
    # for my test
    regex_time_stamp = re.compile('MATCH')
    with open(filename, 'r') as f:
        h = f.readlines()
        # Remember which is the last line printed and matched
        last_line_printed = -1
        last_line_matched = -1
        for idx, line in enumerate(h):
            if regex_time_stamp.search(line) is not None:
                # We want to print the last "number_of_lines_to_print" lines
                # ...unless they were already printed
                # We want to return last lines from idx - number_of_lines_to_print
                # print ('** Matched ', line, idx, last_line_matched, last_line_printed)
                if last_line_printed == -1:
                    lines_to_print = max(idx - number_of_lines_to_print, 0)
                else:
                    # Unless we've already printed those lines because of a previous match, then we continue
                    lines_to_print = max(idx - number_of_lines_to_print, last_line_printed + 1)
                for l in h[lines_to_print:idx]:
                    result.append(l)
                # Now print the stripped line
                new_line = re.sub(regex_time_stamp, '', line)
                result.append(new_line)
                # Update the last line printed
                last_line_printed = last_line_matched = idx
            else:
                # If not a match, we still need to print the current line if we had a match N lines before
                if last_line_matched != -1 and idx < last_line_matched + number_of_lines_to_print:
                    result.append(line)
                    last_line_printed = idx
    return result

filename = 'test_match.txt'
lines = search_timestamps_context(filename, number_of_lines_to_print=2)
print (''.join(lines))
Improvements
The usage of readlines() is inefficient: we are reading the whole file before starting.
It would be more efficient to just iterate, but then we need to remember the last lines in case we need to print them. To achieve that, we would maintain a list of the last N lines, and not more.
That's an exercise left to the reader :)
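For readers who want a starting point for that exercise, here is one possible sketch using collections.deque with maxlen to remember only the last N unprinted lines. It is not the author's code: iter_context and its parameters are hypothetical names, and unlike the sample output above it emits a full N lines after each match.

```python
import re
from collections import deque


def iter_context(lines, pattern, context=2):
    """Yield matched lines (pattern stripped) with `context` lines around them."""
    regex = re.compile(pattern)
    before = deque(maxlen=context)  # most recent lines that were not emitted yet
    remaining_after = 0             # how many trailing lines are still owed
    for line in lines:
        if regex.search(line):
            # Flush the buffered lines, then the match itself.
            yield from before
            before.clear()
            yield regex.sub('', line)
            remaining_after = context
        elif remaining_after > 0:
            # Emitted lines never enter the buffer, so nothing is duplicated.
            yield line
            remaining_after -= 1
        else:
            before.append(line)


sample = ["NOT 0", "NOT 1", "NOT 2", "MATCH 3", "NOT 4", "NOT 5", "NOT 6"]
print(list(iter_context(sample, "MATCH", context=1)))
```

Since the function accepts any iterable of lines, an open file object can be passed directly (for example iter_context(f, regex_time_stamp.pattern)), so the file is never held in memory as a whole.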

How many times a word occurs in a file?

In my Python homework my assignment is to: "Write a complete python program that reads a file trash.txt and outputs how many times the word Bob occurs in the file."
My code is:
count=0
f=open('trash.txt','r')
bob_in_trash=f.readlines()
for line in bob_in_trash:
    if "Bob" in line:
        count=count+1
print(count)
f.close()
Is there any way to make this code more efficient? It counted 5 correctly but I was wondering if there's anything I could modify.
You can just read the whole file and count the number of "Bob":
with open('trash.txt') as f:
    data = f.read()
count = data.count('Bob')
This works fine for smaller files, but loading the whole file into memory might be a problem when you're dealing with bigger files.
Reading it line by line is more memory-efficient, but use str.count instead of "Bob" in line (which only counts how many lines have "Bob" in them).
count = 0
with open('trash.txt') as f:
    for line in f:
        count += line.count("Bob")
This way you're counting at most one "Bob" per line. How about using the count method, so you can sum any number of occurrences per line:
for line in bob_in_trash:
    count=count+line.count("Bob")
For more versatility use regex to distinguish bob, Bob, bobcat, etc.
import re
with open('trash.txt','r') as f:
    count = sum(len(re.findall(r'\bbob\b', line)) for line in f)
Options:
r'\bbob\b' # matches bob
r'(?i)\bbob\b' # matches bob, Bob
r'bob' # matches bob and the bob in bobcat, but not Bob
>>> count = 0
>>> abuffer = bytearray(4096)
>>> with open('trash.txt', 'rb') as fp:
...     n = fp.readinto(abuffer)
...     while n > 0:
...         # count only the bytes just read; a "Bob" split across two chunks is missed
...         count += abuffer[:n].count(b'Bob')
...         n = fp.readinto(abuffer)
Because you're looking for only whole words, it's best to use a regex:
import re
i = 0
with open('trash.txt','r') as file:
    for result in re.finditer(r'\bBob\b', file.read()):
        i += 1
print('Number of Bobs in file: ' + str(i))
Note that the regular expression is \bBob\b, where the \b at the beginning and end means that Bob must be a whole word, not part of a word. Also, I used finditer instead of findall because it yields the matches one at a time instead of building a list of all of them.
To save even more memory, combine with line-by-line reading:
i = 0
with open('trash.txt','r') as file:
    for line in file:
        for result in re.finditer(r'\bBob\b', line):
            i += 1
print('Number of Bobs in file: ' + str(i))
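To see how the word-boundary and case options discussed in the answers above differ, here is a small comparison on a made-up sentence (the sample text and the labels are only for illustration):

```python
import re

text = "bob met Bob and a bobcat; BOB left."

patterns = {
    "whole word, case-sensitive": r"\bbob\b",
    "whole word, case-insensitive": r"(?i)\bbob\b",
    "plain substring": r"bob",
}

# Count the matches of each pattern on the same text.
counts = {name: len(re.findall(pattern, text)) for name, pattern in patterns.items()}

for name, value in counts.items():
    print(name, value)
```

The plain substring pattern also counts the bob inside bobcat, which is why the whole-word form is the safer choice when the assignment asks for the word Bob.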
