Iterate over pair lines and count using python - python

I have a text file that looks like:
>MN00153:75:000H37WNG:1:11102:13823:1502
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1504
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1506
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2_rc : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1508
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
EIF2_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
I am interested in lines 3 and 4 of each bucket (a bucket starts with '>'). I want to add 1 to the count when lines 3 and 4 both start with CASP3 (regardless of what comes afterward), so the output should be
3
because the first, second, and third buckets each have CASP3 on both lines 3 and 4, while the last one does not.
Thanks

If your file is not too large, you can use the .readlines method to get a list of lines:
with open('filename.txt', 'r') as f:
    lines = f.readlines()
Then use enumerate and str methods as follows:
cnt = 0
for inx, line in enumerate(lines):
    if line.startswith('>') and lines[inx+2].startswith('CASP3') and lines[inx+3].startswith('CASP3'):
        cnt += 1
print(cnt)
This solution requires at least 3 lines after the last line starting with >.
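The lookahead above raises an IndexError when fewer than three lines follow a '>' header. A minimal sketch of a bounds-safe variant, assuming the lines are already in a list; the function name count_casp3_pairs, its keyword parameter, and the shortened sample records are illustrative and not part of the original answer:

```python
def count_casp3_pairs(lines, keyword="CASP3"):
    """Count '>' records whose 3rd and 4th lines both start with keyword."""
    count = 0
    for idx, line in enumerate(lines):
        if not line.startswith(">"):
            continue
        # Slicing never raises, even when the record is truncated.
        pair = lines[idx + 2: idx + 4]
        if len(pair) == 2 and all(p.startswith(keyword) for p in pair):
            count += 1
    return count


# Hypothetical, shortened records in the same layout as the question's file.
sample = [
    ">read1", "ACGT", "CASP3_fw1_rc : A", "CASP3_fw2 : C", "Distance is 44",
    ">read2", "ACGT", "CASP3_fw1_rc : A", "EIF2_fw2 : C", "Distance is 44",
    ">read3", "ACGT",  # truncated record, silently skipped
]
print(count_casp3_pairs(sample))  # prints 1
```

A truncated final record is simply not counted here, which may or may not be the behavior you want.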

Without reading the whole file into memory:
def startswith_casp(iterator):
    # grab the next three lines of your file (sequence, line 3, line 4)
    chunk = [line for _, line in zip(range(3), iterator)]
    # use a slice here to avoid index errors
    return all(c.startswith('CASP3') for c in chunk[1:3])
with open('yourfile.txt') as fh:
    count = 0
    for line in fh:
        if not line.strip():
            continue
        elif line.startswith('>'):
            # Function returns a boolean, so True will add 1 while False adds 0
            count += startswith_casp(fh)
        else:
            continue
print(count)

Here I split on \n\n to get the "buckets" (this assumes the buckets are separated by blank lines), then on \n to get the lines within each bucket, then check the 3rd ([2]) and 4th ([3]) line in each bucket for the pattern:
with open('genes.txt') as file:
    data = file.read()
by_bucket = [i.split('\n') for i in data.split('\n\n')]
count = 0
for bucket in by_bucket:
    count += (bucket[2].startswith('CASP3') and bucket[3].startswith('CASP3'))
print(count)

It would probably help if you were to explain what you already tried. That being said here's a general approach:
Use the .split() method to split at any character(s) you want. This results in you getting a list with each entry being one bucket.
Loop over this list with for VariableName in ExampleList: to check each bucket on their own.
You can optionally check if the first entry is a > or if you did the splitting correctly you may not need to.
Separate each bucket into another list where each entry is one line by using bucket.splitlines().
Then check whether the first characters of the 3rd and 4th entries in this list are CASP3 by testing bucket[2][:5]=="CASP3" (for the third line) and bucket[3][:5]=="CASP3" (for the fourth line).
Add a counter to the function that is increased by 1 whenever a bucket is valid.
Return this counter.
If you have additional questions, feel free to ask.
Here's an example that takes a string and returns the value you need:
def getValue(string):
    counter=0
    splitList=string.split("\n\n")
    for bucket in splitList:
        bucket=bucket.splitlines()
        if bucket[2][:5]=="CASP3" and bucket[3][:5]=="CASP3":
            counter+=1
    return counter
Note that this function relies on the buckets being separated by an empty line, but you can change that to separate on any other character(s).

My solution reads file.txt into a dictionary of text sections (a section spans the lines from one '>' symbol to the next), which then makes the comparisons easy.
file_path = './file.txt'
keyword="CASP3"
section_ID = 0
count = 0
all_sections = {}
with open(file_path,'r') as f:
    for line in f:
        if line.startswith(">"):
            # a '>' header opens a new section
            section_ID += 1
            all_sections[section_ID] = {}
            all_sections[section_ID]['entries'] = []
        all_sections[section_ID]['entries'].append(line)
for sec_id in all_sections:
    if all_sections[sec_id]['entries'][2].startswith(keyword) and all_sections[sec_id]['entries'][3].startswith(keyword):
        count+=1
print('count :', count)
The output using your file would be:
count : 3

Related

If the first 3 characters are the same, delete the line in a text file?

Goal: Open the text file. Check whether the first 3 characters of each line are the same in subsequent lines. If yes, delete the bottom one.
The contents of the text file:
cat1
dog4
cat3
fish
dog8
Desired output:
cat1
dog4
fish
Attempt at code:
line = open("text.txt", "r")
for num in line.readlines():
    a = line[num][0:3] #getting first 3 characters
    for num2 in line.readlines():
        b = line[num2][0:3]
        if a in b:
            line[num2] = ""
Open the file and read one line at a time. Note the first 3 characters (prefix). Check if the prefix has been previously observed. If not, keep that line and add the prefix to a set. For example:
with open('text.txt') as infile:
    out_lines = []
    prefixes = set()
    for line in map(str.strip, infile):
        if (prefix := line[:3]) not in prefixes:
            out_lines.append(line)
            prefixes.add(prefix)
print(out_lines)
Output:
['cat1', 'dog4', 'fish']
Note:
Requires Python 3.8+
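For interpreters older than 3.8, the same idea can be written without the assignment expression. This is a minimal sketch, assuming the lines are already in memory; the function name dedupe_by_prefix and its width parameter are illustrative:

```python
def dedupe_by_prefix(lines, width=3):
    """Keep only the first line seen for each leading `width` characters."""
    seen = set()
    kept = []
    for line in lines:
        prefix = line[:width]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept


print(dedupe_by_prefix(["cat1", "dog4", "cat3", "fish", "dog8"]))
# prints ['cat1', 'dog4', 'fish']
```

To update the file itself, one option is to read its lines, pass them through this function, and write the result back to the same path.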
You can use a dictionary keyed by the first 3 characters and check it while reading. Sample code below:
first_three_char_dict = {}
with open("text.txt", "r") as infile:
    for line in infile:
        line = line.rstrip("\n")
        prefix = line[0:3] # getting first 3 characters
        if prefix not in first_three_char_dict:
            # first time this prefix is seen: keep the line
            first_three_char_dict[prefix] = line
print(list(first_three_char_dict.values()))
Read each line and add it to a dict whose key is the first 3 characters and whose value is the line itself, skipping keys that are already present. At the end, the dict keys are unique and their values are your desired result.
You just need to check whether the prefix already exists in a temporary list:
seen = []
result = []
with open("text.txt", "r") as infile:
    for line in infile:
        line = line.rstrip("\n")
        data = line[0:3] # getting first 3 characters
        if data not in seen: # check whether the prefix is already in the list
            seen.append(data) # remember the prefix
            result.append(line) # and keep the line

replace lines in a larger file with ID

Hello everyone, I have a problem replacing lines that have the same content with the same ID, e.g.:
ONE -----------> 1
TWO -----------> 2
THREE-----------> 3
HELLO-----------> 4
SEVEN-----------> 5
ONE-----------> 1
ONE-----------> 1
ONE-----------> 1
TWO-----------> 2
I have worked on the code below, but with no results:
NOTE: filein.txt and filein2.txt hold the same words as the example above.
# opening the file in read mode
file = open("filein.txt", "r")
# opening the file in read and write mode
file2 = open("filein2.txt", "r+")
replacement = ""
count=1
# using the for loop
for line in file:
    for line2 in file2:
        line = line.strip()
        if line == line2 :
            changes = line.replace(line, str(count))
            replacement = replacement + changes + "\n"
            file2.seek(0)
            file2.write(replacement)
            count=count+1
file.close()
filein.txt and filein2.txt contain the same values:
ONE
TWO
THREE
HELLO
SEVEN
ONE
ONE
ONE
TWO
To my understanding this is what you want; compare two files line by line, if the corresponding lines are equal, assign an ID to them, if the lines repeat somewhere else in the file assign the same ID as before, if the lines have not occurred before assign a new ID. If the lines are different, get both of their contents. In the end write either the ID or line content to a new file:
index_dct = dict()
id_ = 1
with open('text.txt') as f1, open('text1.txt') as f2, open('result.txt', 'w') as result:
    for line1, line2 in zip(f1, f2):
        line1, line2 = line1.strip(), line2.strip()
        if line1 == line2:
            text = index_dct.get(line1)
            if text is None:
                text = index_dct[line1] = id_
                id_ += 1
        else:
            text = f'{line1} {line2}'
        result.write(f'{text}\n')
A quick overview of how this works:
First, a dictionary stores each value and its corresponding ID, so that a repeated value gets the same ID.
Then a context manager (with) opens three files: the two inputs and the output.
Next, zip iterates over the first two files in parallel and compares the lines. If they match, look up the ID for that value; if the value is not yet in the dictionary, store the line as a key with the current ID as its value, then increase the ID by one.
If the lines don't match, just concatenate them.
Finally, write the resulting value to the third file.
If you're trying to make each unique word have a unique ID, you could use a dictionary:
inputText = "ONE TWO THREE HELLO SEVEN ONE ONE ONE TWO"
indexDictionary = {}
count = 1
outList = []
for word in inputText.split(" "):
    if word not in indexDictionary.keys():
        indexDictionary[word] = count
        count += 1
    outList.append(indexDictionary[word])
print(outList)
print(indexDictionary)
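Since the question states that both files hold the same lines, a single pass over one list of lines may be enough. A minimal sketch using dict.setdefault; the function name assign_ids is hypothetical:

```python
def assign_ids(lines):
    """Map each distinct line to an ID in order of first appearance."""
    ids = {}
    result = []
    for line in lines:
        key = line.strip()
        # setdefault stores len(ids) + 1 only when key is new,
        # and otherwise returns the ID that was stored earlier.
        result.append(ids.setdefault(key, len(ids) + 1))
    return result


words = ["ONE", "TWO", "THREE", "HELLO", "SEVEN", "ONE", "ONE", "ONE", "TWO"]
print(assign_ids(words))  # prints [1, 2, 3, 4, 5, 1, 1, 1, 2]
```

The returned list can then be written out one ID per line.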

How do I convert each of the words to a number?

I am trying to read a file and overwrite its contents with numbers. That means for the first word it would be 1, for the second word it would be 2, and so on.
This is my code:
file=open("reviews.txt","r+")
i=1
for x in file:
    line=file.readline()
    word=line.split()
    file.write(word.replace(word,str(i)))
    i+=1
file.close()
Input file:
This movie is not so good
This movie is good
Expected output file:
1 2 3 4 5 6
7 8 9 10
When I run it I keep getting the error AttributeError: 'list' object has no attribute 'replace'. Which one is the list object? All the variables are strings as far as I know. Please help me.
It is simpler to build the whole output first, with any method you like, and then write it to the file once; calling file.write inside the loop is not necessary.
Steps
We open the file, read all of its content, and close it.
Using the re module in DOTALL mode, the first capturing group, (\S+) (or (\w+), etc.), matches whatever we want to replace, and the second capturing group, (.+?), collects every other character. re.findall then returns a list of two-element tuples, in which exactly one element of each tuple is non-empty.
We then loop over the tuples: where the first group matched, we append an incrementing counter, and otherwise we append the second group untouched, concatenating step by step into string_out.
Finally, we reopen the file for writing, write string_out, and close it.
Test
import re
file = open("reviews.txt","r+")
word_finder, counter, string_out = re.findall(r"(\S+)|(.+?)", file.read(), re.DOTALL), 0, ''
file.close()
for item in word_finder:
    if item[0]:
        counter += 1
        string_out += str(counter)
    else:
        string_out += item[1]
try:
    file = open("reviews.txt","w")
    file.write(string_out)
finally:
    file.close()
Output
1 2 3 4 5 6
7 8 9 10
RegEx
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.
Reference
re — Regular expression operations
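The same numbering can also be sketched with re.sub and a replacement callback, which avoids the second capturing group. This is an illustration rather than the answer's method; the function name number_words is hypothetical, and the text is assumed to be already read into a string:

```python
import itertools
import re


def number_words(text):
    """Replace every whitespace-delimited word with a running counter."""
    counter = itertools.count(1)
    # re.sub calls the function once per match, left to right,
    # and leaves all whitespace between matches untouched.
    return re.sub(r"\S+", lambda match: str(next(counter)), text)


print(number_words("This movie is not so good\nThis movie is good\n"))
```

Because only the words are substituted, line breaks and spacing in the input carry over to the output unchanged.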
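The call to split returns a list, which you need to iterate over to replace each word. Reading and writing through the same r+ handle can give unexpected results, so this version reads all lines first and then rewrites the file:
with open("reviews.txt", "r") as file:
    lines = file.readlines()
i = 1
with open("reviews.txt", "w") as file:
    for line in lines:
        numbers = []
        for _ in line.split():
            numbers.append(str(i))
            i += 1
        file.write(' '.join(numbers) + '\n')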

Python 3.4 - Capture block of text based on single string

I have searched far and wide and I hope someone can either point me to the link I missed or help me out with this logic.
We have a script that goes out and collects logs from various devices and places them in text files. Within these text files there is a time stamp, and we need to collect the few lines of text before and after this time stamp.
I already have a script that matches the time stamps and removes them for certain reports (included below) but I cannot figure out how to match the time stamp and then capture the surrounding lines.
import re
regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
pre_list = []
with open(filename, 'r') as f:
    h = f.readlines()
    for line in h:
        if regex_time_stamp.search(line) is not None:
            new_line = re.sub(regex_time_stamp, '', line)
            pre_list.append(new_line)
        else:
            pre_list.append(line)
Any assistance would be greatly appreciated! Thanks for taking the time to read this.
The basic algorithm is to remember the three most recently read lines. When you match a header, read the next two lines and then combine them with the header and the lines that you've saved.
Alternately, since you're saving all of the lines in a list, simply keep track of which element is the current element, and when you find a header you can go back and get the previous two and next two elements.
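The list-based alternative can be sketched as follows, assuming all lines are already in a list. The function name lines_with_context, its before and after parameters, and the sample log lines are illustrative; overlapping windows are returned once per match, so nearby matches will repeat some lines:

```python
import re


def lines_with_context(lines, pattern, before=2, after=2):
    """Return (index, context) pairs for each line matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for idx, line in enumerate(lines):
        if regex.search(line):
            start = max(idx - before, 0)  # clamp at the start of the list
            stop = idx + after + 1        # slicing clamps at the end
            hits.append((idx, lines[start:stop]))
    return hits


# Hypothetical log lines; only the third one carries a time stamp.
log = ["boot", "init", "12:30:45 link down", "retry", "retry", "ok"]
for idx, context in lines_with_context(log, r"\d{2}:\d{2}:\d{2}"):
    print(idx, context)
```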
Catch with duplicated lines
Agreed with the basic algorithm by @Bryan-Oakley and @TigerhawkT3, however there's a catch:
What if several lines match consecutively?
You could end up duplicating "context" lines by printing the last 2 lines of the first match, and then the last 2 lines of the second match... that would actually also contain the previous matched line.
The solution is to keep track of which line number was last printed, in order to print just enough lines before the current matched line.
Flexible context parameter
What if you also want to print 3 lines before and after instead of 2? Then you need to keep track of more lines.
What if you want only 1 ?
Then your number of lines to print needs to be a parameter and the algorithm needs to use it.
Sample input and output
Here's a file sample that contains the word MATCH instead of your timestamp, for clarity. The other lines contain NOT + line number
==
NOT 0
NOT 1
NOT 2
NOT 3
NOT 4
MATCH LINE 5
NOT 6
NOT 7
NOT 8
NOT 9
MATCH LINE 10
MATCH LINE 11
NOT 12
MATCH LINE 13
NOT 14
==
The output should be:
==
NOT 3
NOT 4
LINE 5
NOT 6
NOT 8
NOT 9
LINE 10
LINE 11
NOT 12
LINE 13
NOT 14
==
Solution
This solution iterates on the file and keeps track of:
what is the last line that was printed? This will take care of not duplicating "context" lines if matched lines come in sequence.
what is the last line that was matched? This will tell the program to print the current line if it is "close" to the last matched line. How close? This is determined by your "number of lines to print" parameter. Then we also set the last_line_printed variable to the current line index.
Here's a simplified algorithm in English:
When matching a line we will:
print the last N lines, from the last_line_printed variable to the current index
print the current line after stripping the timestamp
set the last_line_printed = last_line_matched = current line index
continue
When not matching a line we will:
print the current line if current_index < last_line_matched index + number_of_lines_to_print
Of course we're taking care of whether we're close to the beginning of file by checking limits
Not print but return an array
This solution doesn't print directly but returns an array with all the lines to print. That's just a bit classier.
I like to name my "return" variable result but that's just me. It makes it obvious what is the result variable during the whole algorithm.
Code
You can try this code with the input above, it'll print the same output.
def search_timestamps_context(filename, number_of_lines_to_print=2):
    import re
    result = []
    regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
    # for my test
    regex_time_stamp = re.compile('MATCH')
    with open(filename, 'r') as f:
        h = f.readlines()
        # Remember which is the last line printed and matched
        last_line_printed = -1
        last_line_matched = -1
        for idx, line in enumerate(h):
            if regex_time_stamp.search(line) is not None:
                # We want to print the last "number_of_lines_to_print" lines
                # ...unless they were already printed
                # We want to return last lines from idx - number_of_lines_to_print
                # print ('** Matched ', line, idx, last_line_matched, last_line_printed)
                if last_line_printed == -1:
                    lines_to_print = max(idx - number_of_lines_to_print, 0)
                else:
                    # Unless we've already printed those lines because of a previous match, then we continue
                    lines_to_print = max(idx - number_of_lines_to_print, last_line_printed + 1)
                for l in h[lines_to_print:idx]:
                    result.append(l)
                # Now print the stripped line
                new_line = re.sub(regex_time_stamp, '', line)
                result.append(new_line)
                # Update the last line printed
                last_line_printed = last_line_matched = idx
            else:
                # If not a match, we still need to print the current line if we had a match N lines before
                if last_line_matched != -1 and idx < last_line_matched + number_of_lines_to_print:
                    result.append(line)
                    last_line_printed = idx
    return result
filename = 'test_match.txt'
lines = search_timestamps_context(filename, number_of_lines_to_print=2)
print (''.join(lines))
Improvements
The usage of readlines() is inefficient: we are reading the whole file before starting.
It would be more efficient to just iterate, but then we need to remember the last lines in case we need to print them. To achieve that, we would maintain a list of the last N lines, and not more.
That's an exercise left to the reader :)
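One possible way to approach that exercise is a generator that keeps only the last n unprinted lines in a collections.deque. This is a sketch under stated assumptions, not the answer's own code: the name iter_context and its parameters are hypothetical, and it emits up to n lines after each match, so its output may differ by a line from the listing above.

```python
import re
from collections import deque


def iter_context(lines, pattern, n=2):
    """Yield matching lines plus up to n lines before and after, no repeats."""
    regex = re.compile(pattern)
    before = deque(maxlen=n)  # most recent lines not yet emitted
    remaining_after = 0       # trailing context lines still owed
    for line in lines:
        if regex.search(line):
            yield from before  # pending leading context
            before.clear()
            yield regex.sub('', line)  # strip the matched text
            remaining_after = n
        elif remaining_after > 0:
            yield line
            remaining_after -= 1
        else:
            before.append(line)


sample = ["NOT 0", "NOT 1", "NOT 2", "MATCH LINE 3", "NOT 4", "NOT 5", "NOT 6"]
for out in iter_context(sample, "MATCH", n=1):
    print(out)
```

Because the function accepts any iterable of lines, an open file object can be passed directly and is then read one line at a time.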

Read line once without removing it Python using .readline()

I would like to count the occurrences of missing values (NA) in every line of a txt file.
foo.txt file:
1 1 1 1 1 NA # so, Missings: 1
1 1 1 NA 1 1 # so, Missings: 1
1 1 NA 1 1 NA # so, Missings: 2
But I would also like to obtain the number of elements in the first line (assuming this is the same for all lines).
miss = []
with open("foo.txt") as f:
    for line in f:
        miss.append(line.count("NA"))
>>> miss
[1, 1, 2] # correct
The problem arises when I try to determine the number of elements. I did this with the following code:
miss = []
with open("foo.txt") as f:
    first_line = f.readline()
    elements = first_line.count(" ") # given that values are separated by space
    for line in f:
        miss.append(line.count("NA"))
>>> (elements + 1)
6 # True, this is correct
>>> miss
[1,2] # misses the first item due to readline() removing lines.
How can I read the first line once without removing it for the further operation?
Try f.seek(0). This will reset the file handle to the beginning of the file.
Complete example would then be:
miss = []
with open("foo.txt") as f:
    first_line = f.readline()
    elements = first_line.count(" ") # given that values are separated by space
    f.seek(0)
    for line in f:
        miss.append(line.count("NA"))
Even better would be to read every line, including the first, only once, and to check the number of elements only once:
miss = []
elements = None
with open("foo.txt") as f:
    for line in f:
        if elements is None:
            elements = line.count(" ") # given that values are separated by space
        miss.append(line.count("NA"))
BTW: wouldn't the number of elements be line.count(" ") + 1?
I'd recommend using len(line.split()), as this also handles tabs, double spaces, leading/trailing spaces etc.
Provided all lines have the same number of items, you can just count the items in the last line:
miss = []
with open("foo.txt") as f:
    for line in f:
        miss.append(line.count("NA"))
elements = len(line.split())
A better way to count is probably:
elements = len(line.split())
because this also counts items separated with multiple spaces or tabs.
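The suggestions above can be combined in one pass. This is a minimal sketch in which the function name summarize is hypothetical and the input is any iterable of lines; it counts whole fields equal to "NA", so a value such as "NAN" would not be counted, unlike with line.count("NA"):

```python
def summarize(lines):
    """Return (element count of the first line, list of NA counts per line)."""
    elements = None
    miss = []
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if elements is None:
            elements = len(fields)
        miss.append(fields.count("NA"))  # whole fields only
    return elements, miss


rows = ["1 1 1 1 1 NA\n", "1 1 1 NA 1 1\n", "1 1 NA  1 1\tNA\n"]
print(summarize(rows))  # prints (6, [1, 1, 2])
```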
You can also just treat the first line separately:
with open("foo.txt") as f:
    first_line = next(f)
    elements = first_line.count(" ") # given that values are separated by space
    miss = [first_line.count("NA")]
    for line in f:
        miss.append(line.count("NA"))
