How to extract text between matching patterns in Python [duplicate]

I am new to Python and want to extract the text between two matching patterns in each line of my tab-delimited text file (mydata).
mydata.txt:
Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
Name tRNA # Begin End Type Codon Begin End Score
-------- ------ ---- ------ ---- ----- ----- ---- ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33 1 1 73 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33 1 1 72 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
Code I tried:
lines = []  # Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)
I want to grab the text between [locus_tag= and ][db_xref and get these as my results:
SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127

If I'm understanding correctly, this should work for a given line of your data:
data = line.split("locus_tag=")[1].split("][db_xref")[0]
The idea is to split the string on locus_tag=, take the second element, then split that string on ][db_xref and take the first element.
If you want help with the outer loop, it could look like:
with open(file_path, 'r') as f:
    for line in f:
        if "locus_tag" in line:
            data = line.split("locus_tag=")[1].split("][db_xref")[0]
            print(data)
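Not part of the original answer, but worth noting: str.partition returns empty strings instead of raising IndexError when a separator is missing, so the same extraction can be sketched without the membership test:
# Variant using str.partition: yields '' rather than raising IndexError
# on lines that do not contain the tag.
with open(file_path, 'r') as f:
    for line in f:
        tag = line.partition("locus_tag=")[2].partition("][db_xref")[0]
        if tag:
            print(tag)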

You can use re.search with positive lookbehind and positive lookahead patterns:
import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
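Filling in the elided setup, a self-contained sketch of this approach could look like the following; the file name is a placeholder for the path used in the question, and the lookahead is spelled out in full as ][db_xref:
import re

# 'mydata.txt' is a placeholder for the real path to the data file.
with open('mydata.txt') as input_data:
    for line in input_data:
        match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xref)', line)
        if match:
            print(match.group())  # e.g. SS1G_20133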

Related

Python to remove extra delimiter

We have a 100 MB pipe-delimited file with 5 columns (4 delimiters) per row. However, there are a few rows where the second column contains an extra pipe, so those rows have 5 delimiters in total.
For example, of the 4 rows below, the 3rd is problematic because it has an extra pipe.
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
Is there any way to remove the extra pipe from the second column wherever a row's delimiter count is 5? Post-correction, the file should look like this:
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
Please note that the file size is 100 MB. Any help is appreciated.
Source: my_file.txt
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4
Code
# If using Python 3.10+, the two open() calls can be grouped with a
# parenthesized context manager:
# https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers
with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst:
    for line in file_src.readlines():
        # Split the line on the character '|'
        line_list = line.split('|')
        if len(line_list) <= 5:
            # If the number of columns doesn't exceed 5, write the original line as is.
            file_dst.write(line)
        else:
            # If the number of columns exceeds 5, count the columns that should be merged.
            to_merge_columns_count = (len(line_list) - 5) + 1
            # Merge the columns from index 1 through the last column to be merged.
            merged_column = "".join(line_list[1:1 + to_merge_columns_count])
            # Replace those items with the single merged column.
            line_list[1:1 + to_merge_columns_count] = [merged_column]
            # Write the updated line.
            file_dst.write("|".join(line_list))
Result: my_file_parsed.txt
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
E|1 9 2 8 Not a text!!!|3|7|4
A simple regular expression pattern like this works on Python 3.7.3:
from re import compile

bad_pipe_re = compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n")
with open("input", "r") as fp_1, open("output", "w") as fp_2:
    line = fp_1.readline()
    while line != "":  # 'while line is not ""' would test identity, not equality
        mo = bad_pipe_re.fullmatch(line)
        if mo is not None:
            # Remove the captured extra pipe by slicing around group 1.
            line = line[:mo.start(1)] + line[mo.end(1):]
        fp_2.write(line)
        line = fp_1.readline()
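As an aside (not from either answer), the merge can also be written by keeping the first field and the last three fields and joining whatever lies between them, which handles any number of extra pipes in one pass; this sketch assumes, as in the question, that only the second column ever overflows:
# Sketch: everything between the first field and the last three fields
# must belong to the second column, so join it back together.
with open("my_file.txt") as src, open("my_file_parsed.txt", "w") as dst:
    for line in src:
        parts = line.rstrip("\n").split("|")
        if len(parts) > 5:
            parts = [parts[0], "".join(parts[1:-3])] + parts[-3:]
        dst.write("|".join(parts) + "\n")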

Looking for first line of data with python

I have a data file that looks like this (I read it in as a list of lines).
############################################################
# Tool
# File: test
#
# mass: mass in GeV
# spectrum: from 1 to 100 GeV
###########################################################
# mass (GeV) spectrum (1-100 GeV)
10 0.2822771608053263
20 0.8697454394829301
30 1.430461657476815
40 1.9349004472432392
50 2.3876849629827412
60 2.796620869276766
70 3.1726347734996727
80 3.5235401505002244
90 3.8513847250834106
100 4.157478780924807
To read the data I would normally count how many lines come before the first row of numbers and then loop over the file from that index. In this file it's 8 lines:
spectrum = []
mass = []
with open('test.in') as m:
    test = m.readlines()
    for i in range(8, len(test)):
        single_line = test[i].split('\t')
        mass.append(float(single_line[0]))
        spectrum.append(float(single_line[1]))
Let's say I didn't want to open the file first to check how many intro lines there are before the first data point. How would I make Python start automatically at the first line of data points and read through to the end of the file?
This is a general solution, but it should work in your specific case: for each line, check whether it starts with a number.
Pseudo-code:
for line in test:
    if line.split()[0].isdigit():
        DoStuffWithData
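A runnable version of that idea might look like the sketch below; it also guards against blank lines, which the pseudo-code above does not:
spectrum = []
mass = []
with open('test.in') as m:
    for line in m:
        fields = line.split()
        # Keep only rows whose first field is a number.
        if fields and fields[0].isdigit():
            mass.append(float(fields[0]))
            spectrum.append(float(fields[1]))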
spectrum = []
mass = []
with open('test.in') as m:
    test = m.readlines()
    for line in test:
        if line[0] == '#':
            continue
        single_line = line.split('\t')
        mass.append(float(single_line[0]))
        spectrum.append(float(single_line[1]))
You can filter out all lines that start with # using a regex or the string's startswith method:
import re
spectrum = []
mass = []
with open('test.in') as m:
    test = [i for i in m.readlines() if not re.match("^#.*", i)]
    for i in test:
        single_line = i.split('\t')
        mass.append(float(single_line[0]))
        spectrum.append(float(single_line[1]))
OR
spectrum = []
mass = []
with open('test.in') as m:
    test = [i for i in m.readlines() if not i.startswith("#")]
    for i in test:
        single_line = i.split('\t')
        mass.append(float(single_line[0]))
        spectrum.append(float(single_line[1]))
This will filter out all the lines that start with #.
Pseudo-code:
for r in m:
    if r.startswith('#'):
        continue
    spt = r.split('\t')
    if len(spt) < 2:
        continue
    ## todo: .....

Find next missing number in list

I am trying to make a very simple login script to learn about accessing files and lists but I'm a bit stuck.
newaccno = str(1)
with open("C:\\Python\\Test\\userpasstest.txt", "r+") as loginfile:
    for line in loginfile.readlines():
        line = line.strip()
        logininfo = line.split(" ")
        print(newaccno in logininfo[0])
    while newaccno in logininfo[0]:  # issue is here, also tried ==
        newaccno += 1
    print(newaccno)
    loginfile.write(newaccno)
My logic is that it will search logininfo[0] for newaccno and, if found, increase newaccno by 1 and search again until it is not found, then write it to the file (so if the file already has 1, 2 and 3, newaccno will end up as 4).
Edit: this is how the txt file looks; the first number on each line is what becomes logininfo[0] after the split.
1 abc qwe
2 123 456
(adapted from comment)
Your while loop needs to be inside your for loop for it to work. If it is outside, logininfo[0] will always hold only the last line's first field.
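A minimal sketch of the corrected flow, shown as an illustration (a slight variation on the comment: it collects the existing numbers first and keeps newaccno as an int; the path is shortened for the example):
# Collect existing account numbers, then find the next free one.
# Assumes the first whitespace-separated field is the account number.
with open("userpasstest.txt", "r+") as loginfile:
    existing = {line.split()[0] for line in loginfile if line.strip()}
    newaccno = 1
    while str(newaccno) in existing:
        newaccno += 1
    loginfile.write(str(newaccno) + "\n")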

Python: a way to ignore/account for newlines with read()

So I am having a problem extracting text from a large (gigabyte-scale) text file. The file is structured as follows:
>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere
Now I have the information that in the entry with header2 I need to extract the text from position X to position Y (the A's in this example), starting with 1 as the first letter in the line below the header.
BUT: the positions do not account for newline characters. So basically when it says from 1 to 95 it really means just the letters from 1 to 80 and the following 15 of the next line.
My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when that stretches over newline(s) I get too few characters extracted.
Is there a way to solve this with another Python function than read()? I thought about just replacing all newlines with empty strings, but the file may be quite large (millions of lines).
I also tried to account for the newlines by adding extractLength // 80 to the length, but this is problematic in cases like the example: when 95 characters are split 2 + 80 + 13 across 3 lines, I actually need 2 additional positions, but 95 // 80 is 1.
UPDATE:
I modified my code to use Biopython:
for s in SeqIO.parse(sys.argv[2], "fasta"):
    # foundClusters stores the information for substrings I want extracted
    currentCluster = foundClusters.get(s.id)
    if(currentCluster is not None):
        for i in range(len(currentCluster)):
            outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")
            flanking = 25
            start = currentCluster[i][0]
            end = currentCluster[i][1]
            left = currentCluster[i][2]
            if(start - flanking < 0):
                start = 0
            else:
                start = start - flanking
            if(end + flanking > end + left):
                end = end + left
            else:
                end = end + flanking
            # for debugging only
            print(currentCluster)
            print(start)
            print(end)
            outputFile.write(s.seq[start, end+1])
But I get the following error:
[[1, 55, 2782]]
0
80
Traceback (most recent call last):
  File "findClaClusters.py", line 92, in <module>
    outputFile.write(s.seq[start, end+1])
  File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
    return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers
UPDATE2:
Changed outputFile.write(s.seq[start, end+1]) to:
outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")
and it's working :)
With Biopython:
from Bio import SeqIO
X = 66
Y = 130
for s in SeqIO.parse("test.fst", "fasta"):
    if "header2" == s.id:
        print(s.seq[X:Y + 1])
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Biopython lets you parse a FASTA file and access its id, description and sequence easily. You then have a Seq object that you can manipulate conveniently without recoding everything yourself (reverse complement and so on).
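For reference, a plain-Python sketch of the same idea, included only to illustrate why joining the wrapped lines makes the positions newline-free (the file name and slice convention follow the answer above):
# Join the wrapped sequence lines for the target header so that X and Y
# index the sequence with the newlines already removed.
X, Y = 66, 130
seq_parts = []
in_target = False
with open("test.fst") as fh:
    for line in fh:
        if line.startswith(">"):
            in_target = line[1:].strip().startswith("header2")
        elif in_target:
            seq_parts.append(line.strip())
seq = "".join(seq_parts)
print(seq[X:Y + 1])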

Printing on the same line as a header in python

I am writing a function that calculates a score for a matrix and outputs the score, along with some other variables, as a header. My code for the output is as follows:
header=">"+motif+" "+gene+" "+str(score)
append_copy = open(newpwmfile, "r")
original_text = append_copy.read()
append_copy.close()
append_copy = open(newpwmfile, "w")
append_copy.write(header)
append_copy.write(original_text)
append_copy.close()
However the header is printing the score on the next line instead of the same line, as follows:
>ATGC ABC/CDF
5.8
0.23076923076923 0 0.69230769230769 0.076923076923077
0.46153846153846 0.23076923076923 0.23076923076923 0.076923076923077
0 0 1 0
0 1 0 0
1 0 0 0
What could be the reason? I also tried interchanging the variables, and then the header is printed on the same line; however, the order of the fields matters in this case.
When reading fields from a file, it is good practice to remove possible extra whitespace (including trailing newlines) with the strip() function.
As an example, this is a typical workflow to manually get the fields from a csv file:
with open(fname) as f:
    for line in f:
        linefields = [field.strip() for field in line.strip().split(',')]
This removes both the trailing newline at the end of each line and any whitespace around the fields.
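Applied to the question's header, a hypothetical fix would strip each field before concatenating, on the assumption that gene (or another field) still carries a trailing newline from the file it was read from:
# Strip the fields so a trailing newline in `gene` cannot push
# the score onto the next line.
header = ">" + motif.strip() + " " + gene.strip() + " " + str(score).strip()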
