Extracting sequence and header from fasta file by known sequences - python

I am trying to do compare two files and extract the sequences which have the subset of others. And, I want to extract the identifiers too. However, what I can do is being able to extract the sequences including subsets. The example files are :
text.fa
>header1
ETTTHAASCISATTVQEQ*TLFRLLP
>header2
SKSPCSDSDY**AAA
>header3
SSGAVAAAPTTA
and,
textref.fa
>textref.fa
CISA
AAAP
AATP
When I run the code, I am having this output :
ETTTHAASCISATTVQEQ*TLFRLLP
SSGAVAAAPTTA
However, my expected output is with headers:
>header1
ETTTHAASCISATTVQEQ*TLFRLLP
>header3
SSGAVAAAPTTA
My code is in two parts, first I create the file with these sequences, and then I try to extract them with their headers from the original fasta file :
def get_nucl(filename):
with open(filename,'r') as fd:
nucl = []
for line in fd:
if line[0]!='>':
nucl.append(line.strip())
return nucl
def finding(filename,reffile):
nucl = get_nucl(filename)
with open(reffile,'r') as reffile2:
for line in reffile2:
for element in nucl:
if line.strip() in element:
yield(element)
with open('sequencesmatched.txt','w') as output:
results = finding('text.fa','textref.fa',)
for res in results:
print(res)
output.write(res + '\n')
So, in this sequencesmatched.txt, I am having the sequences of text.fa that have the substrings of textref.fa. as :
ETTTHAASCISATTVQEQ*TLFRLLP
SSGAVAAAPTTA
So in the other part, to retrieve the respective headers and these sequences :
def finding(filename,seqfile):
with open(filename,'r') as fastafile:
with open(seqfile,'r') as sequf:
alls=[]
for line in fastafile:
alls.append(line.strip())
print(alls)
sequfs = []
for line2 in sequf:
sequfs.append(line2.strip())
if str(line.strip()) == str(line2.strip()):
num = alls.index(line.strip())
print(alls[num-1] + line)
print(finding('text.fa','sequencesmatched.txt'))
However, as an output, I can only retrieve one sequence,which is the first match:
>header1
ETTTHAASCISATTVQEQ*TLFRLLP
Maybe I could do it without the second file, but I could not make the right loops to get the sequences and their respective headers. Therefore, I went to the long way..
I would be happy if you can help!

You can do something much easier if your file is always the same structure:
def get_nucl(filename):
with open(filename, 'r') as fd:
headers = {}
key = ''
for line in fd.readlines():
if '>' in line:
key = line.strip()[1:] # to remove the '>'
else:
headers[key] = line.strip()
return headers
Here i'm assuming your file begin with ">headern" whatever, if not you have to add some test. Now you have a dictionnary like headers['header1'] = 'ETTTHAASCISATTVQEQ*TLFRLLP'.
So now to find the matches you just use that dict :
def finding(filename, reffile):
headers = get_nucl(filename)
with open(reffile, 'r') as f:
matches = {}
for line in f.readlines():
for key, value in headers.items():
if line.stip() in value and key not in matches:
matches[key] = value
return matches
So as you have a dict with headers matching their values, you can just check in the dict if you have a sub string, and you already have the header value as a key.
Just saw you did print(finding(....) , your function already print, so just call it.

Related

How to read a value of file separate by tabs in Python?

I have a text file with this format
ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
John Legend
Type
Student
Number
s1054520
I would like to get the value of Name or Type or Number
How do I get it?
I tried with this method, but it does not solve my problem.
import re
f = open("Data.txt", "r")
file = f.read()
Name = re.findall("Name", file)
print(Name)
My expectation output is John Legend
Anyone can help me please. I really appreciated. Thank you
First of all re.findall is used to search for “all” occurrences that match a given pattern. So in your case. you are finding every "Name" in the file. Because that's what you are looking for.
On the other hand, the computer will not know the "John Legend" is the name. it will only know that's the line after the word "Name".
In your case I will suggest you can check this link.
Find the "Name"'s line number
Read the next line
Get the name without the white space
If there is more than 1 Name. this will work as well
the final code is like this
def search_string_in_file(file_name, string_to_search):
"""Search for the given string in file and return lines containing that string,
along with line numbers"""
line_number = 0
list_of_results = []
# Open the file in read only mode
with open(file_name, 'r') as read_obj:
# Read all lines in the file one by one
for line in read_obj:
# For each line, check if line contains the string
line_number += 1
if string_to_search in line:
# If yes, then add the line number & line as a tuple in the list
list_of_results.append((line_number, line.rstrip()))
# Return list of tuples containing line numbers and lines where string is found
return list_of_results
file = open('Data.txt')
content = file.readlines()
matched_lines = search_string_in_file('Data.txt', 'Name')
print('Total Matched lines : ', len(matched_lines))
for i in matched_lines:
print(content[i[0]].strip())
Here I'm going through each line and when I encounter Name I will add the next line (you can directly print too) to the result list:
import re
def print_hi(name):
result = []
regexp = re.compile(r'Name*')
gotname = False;
with open('test.txt') as f:
for line in f:
if gotname:
result.append(line.strip())
gotname = False
match = regexp.match(line)
if match:
gotname = True
print(result)
if __name__ == '__main__':
print_hi('test')
Assuming those label lines are in the sequence found in the file you
can simply scan for them:
labelList = ["Name","Type","Number"]
captures = dict()
with open("Data.txt","rt") as f:
for label in labelList:
while not f.readline().startswith(label):
pass
captures[label] = f.readline().strip()
for label in labelList:
print(f"{label} : {captures[label]}")
I wouldn't use a regex, but rather make a parser for the file type. The rules might be:
The first line can be ignored
Any lines that start with ; can be ignored.
Every line with no leading whitespace is a key
Every line with leading whitespace is a value belonging to the last
key
I'd start with a generator that can return to you any unignored line:
def read_data_lines(filename):
with open(filename, "r") as f:
# skip the first line
f.readline()
# read until no more lines
while line := f.readline():
# skip lines that start with ;
if not line.startswith(";"):
yield line
Then fill up a dict by following rules 3 and 4:
def parse_data_file(filename):
data = {}
key = None
for line in read_data_lines(filename):
# No starting whitespace makes this a key
if not line.startswith(" "):
key = line.strip()
# Starting whitespace makes this a value for the last key
else:
data[key] = line.strip()
return data
Now at this point you can parse the file and print whatever key you want:
data = parse_data_file("Data.txt")
print(data["Name"])

Use a file to search another file and print lines matching a pattern to first file

Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that id number is unique, but the number assigned to it may repeat. I have several files like this all with the same number of headers (~500).
File2 has all numbers of File1 and an appended sequence
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that sequence id is unique, as all sequences after it. I have the same amount of files as File1 but the number of sequences within each File2 may vary(~150).
My desired output is the File1 with the sequence from File2, it is important that File1 maintains original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. here is what I achieved:
#!/usr/bin/env python
import re
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
with open(datafile, 'r') as f:
datafile_lines = set([line.strip() for line in f]) #maybe I could use regex to get only lines with number as pattern?
print (datafile_lines)
outputlist = []
with open(schemaseqs, 'r') as f:
for line in f:
seqs = line.split(',')[0]
if seqs[1:-1] in datafile_lines:
outputlist.append(line)
print (outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
Ps: I am open to modifications in files structure, I tried substituting \n in File2 for "," with no avail.
import re
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
datafile_lines = []
d = {}
prev = None
with open(datafile, 'r') as f:
i = 0
for line in f:
if i % 2 == 0:
d[line.strip()]=0
prev = line.strip()
else:
d[prev] = line.strip()
i+=1
new_d = {}
with open(schemaseqs, 'r') as f:
i=0
prev = None
for line in f:
if i % 2 == 0:
new_d[line.strip()]=0
prev = line.strip()
else:
new_d[prev] = line.strip()
i+=1
for key, value in d.items():
if value in new_d:
d[key] = new_d[value]
print(d)
with open(datafile,'w') as filee:
for k,v in d.items():
filee.writelines(k)
filee.writelines('\n')
filee.writelines(v)
filee.writelines('\n')
creating two dictionary would be easy and then map both dictionary values.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}
with open(file2, 'r') as id_to_gene_file:
for line_number, line in enumerate(id_to_gene_file, start=1):
if line_number % 2 == 1: # Update ID on odd numbered lines, including line 1
current_id = line
else:
id_to_gene_map[current_id] = line # Map previous line's ID to this line's value
with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
for line in input_file:
if not line.startswith(">"): # Keep ">id1" lines unchanged
line = id_to_gene_map[line] # Otherwise, replace with the corresponding gene
output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.

How to extract a subset from a text file and store it in a separate file?

I am currently trying to extract information from a text file using Python. I want to extract a subset from the file and store it in a separate file from everywhere it occurs in the text file. To give you an idea of what my file looks like, here is a sample:
C","datatype":"double","value":25.71,"measurement":"Temperature","timestamp":1573039331258250},
{"unit":"%RH","datatype":"double","value":66.09,"measurement":"Humidity","timestamp":1573039331258250}]
Here, I want to extract "value" and the corresponding number beside it. I have tried various techniques but have been unsuccessful. I tried to iterate through the file and stop at where I have "value" but that did not work.
Here is a sample of the code:
with open("DOTemp.txt") as openfile:
for line in openfile:
for part in line.split():
if "value" in part:
print(part)
A simple solution to return the value marked by the "value" key:
with open("DOTemp.txt") as openfile:
for line in openfile:
line = line.replace('"', '')
for part in line.split(','):
if "value" in part:
print(part.split(':')[1])
Note that by default str.split() splits on whitespace. In the last line, if we printed element zero of the list it would just be "value". If you wish to use this as an int or float, simply cast it as such and return it.
First split using , (comma) as the delimiter, then split the corresponding strings using : as the delimiter.
if required trim leading and trailing "" then compare with value
Below code will work for you:
file1 = open("untitled.txt","r")
data = file1.readlines()
#Convert to a single string
val = ""
for d in data:
val = val + d
#split string at comma
comma_splitted = val.split(',')
#find the required float
for element in comma_splitted:
if 'value' in element:
out = element.split('"value":')[1]
print(float(out))
I assume your input file is a json string(list of dictionaries) (looking at the file sample). If that's the case, perhaps you can try this.
import json
#Assuming each record is a dictionary
with open("DOTemp.txt") as openfile:
lines = openfile.readlines()
records = json.loads(lines)
out_lines = list(map(lambda d: d.get('value'), records))
with open('DOTemp_out.txt', 'w') as outfile:
outfile.write("\n".join(out_lines))

Extracting data from a file using regular expressions and storing in a list to be compiled into a dictionary- python

I've been trying to extract both the species name and sequence from a file as depicted below in order to compile a dictionary with the key corresponding to the species name (FOX2_MOUSE for example) and the value corresponding to the Amino Acid sequence.
Sample fasta file:
>sp|P58463|FOXP2_MOUSE
MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
>sp|Q8MJ98|FOXP2_PONPY
MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
I've tried using my code below:
import re
InFileName = "foxp2.fasta"
InFile = open(InFileName, 'r')
Species = []
Sequence = []
reg = re.compile('FOXP2_\w+')
for Line in InFile:
Species += reg.findall(Line)
print Species
reg = re.compile('(^\w+)')
for Line in Infile:
Sequence += reg.findall(Line)
print Sequence
dictionary = dict(zip(Species, Sequence))
InFile.close()
However, my output for my lists are:
[FOX2_MOUSE, FOXP2_PONPY]
[]
Why is my second list empty? Are you not allowed to use re.compile() twice? Any suggestions on how to circumvent my problem?
Thank you,
Christy
If you want to read a file twice, you have to seek back to the beginning.
InFile.seek(0)
You can do it in a single pass, and without regular expressions:
def load_fasta(filename):
data = {}
species = ""
sequence = []
with open(filename) as inf:
for line in inf:
line = line.strip()
if line.startswith(";"): # is comment?
# skip it
pass
elif line.startswith(">"): # start of new record?
# save previous record (if any)
if species and sequence:
data[species] = "".join(sequence)
species = line.split("|")[2]
sequence = []
else: # continuation of previous record
sequence.append(line)
# end of file - finish storing last record
if species and sequence:
data[species] = "".join(sequence)
return data
data = load_fasta("foxp2.fasta")
On your given file, this produces data ==
{
'FOXP2_PONPY': 'MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK',
'FOXP2_MOUSE': 'MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK'
}
You could also do this in a single pass with a multiline regex:
import re
reg = re.compile('(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
with open("foxp2.fasta", 'r') as file:
data = dict(reg.findall(file.read()))
The downside is that you have to read the whole file in at once. Whether this is a problem depends on likely file sizes.

read fastq file into dictionary

I have a fastq file like this (part of the file):
#A80HNBABXX:4:1:1344:2224#0/1
AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG
+
\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
#A80HNBABXX:4:1:1515:2211#0/1
TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA
+
ee^e^\`ad`eeee\dd\ddddYeebdd\ddaYbdcYc`\bac^YX[V^\Ybb]]^bdbaZ]ZZ\^K\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_
#A80HNBABXX:4:1:1538:2220#0/1
CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT
+
fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y
#A80HNBABXX:4:1:1666:2222#0/1
CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT
+
deeee`bbcddddad\bbbbeee\ecYZcc^dd^ddd\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. Line 2 is the DNA sequence letters. Line 3 begins with a '+' character. Line 4 encodes the quality values for the sequence in Line 2 (the part after "+" and before the next "#", and must contain the same number of symbols as letters in the sequence.
i want to read the fastq file into a dictionary like this (the key is the DNA sequence and the value is the quality value, and the line starting with "#" and "+" can be discarded):
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG':'\YYWX\PX^YT[TVYaTY]^\^H`^a\UZU__TTbSbb^\a^^^[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT':'fff^fd\c^d^Ycacdcdcdedeffdfedb]beeeeecd^ddccdddddfffeaeeeffdTecacaLV[QRPa\a`]aY]ZZ[XYcccYcZ\]Y ',
....}
I write the following code but it does not give me what I want. Can anyone help me to fix/improve my code?
class fastq(object):
def __init__(self,filename):
self.filename = filename
self.__sequences = {}
def parse_file(self):
symbol=['#','+']
"""Stores both the sequence and the quality values for the sequence"""
f = open(self.filename,'rU')
for lines in self.filename:
if symbol not in lines.startwith()
data = f.readlines()
return data
Here's a pretty quick and efficient way of doing it:
def parse_file(self):
with open(self.filename, 'r') as f:
content = f.readlines()
# Recreate content without lines that start with # and +
content = [line for line in content if not line[0] in '#+']
# Now the lines you want are alternating, so you can make a dict
# from key/value pairs of lists content[0::2] and content[1::2]
data = dict(zip(content[0::2], content[1::2]))
return data
I don't think use the reads as the key is good idea, what if you got exactly the same read. But any way if you want to do it:
In [9]:
with open('temp.fastq') as f:
lines=f.readlines()
head=[item[:-1] for item in lines[::4]] #get rid of '\n'
read=[item[:-1] for item in lines[1::4]]
qual=[item[:-1] for item in lines[3::4]]
dict(zip(read, qual))
Out[9]:
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG': '\\\\YYWX\\PX^YT[TVYaTY]^\\^H\\`^`a`\\UZU__TTbSbb^\\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT': 'fff^fd\\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\\\a\\`]aY]ZZ[XYcccYcZ\\\\]Y',
'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT': 'deeee`bbcddddad\\bbbbeee\\ecYZcc^dd^ddd\\\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB',
'TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA': 'ee^e^\\`ad`eeee\\dd\\ddddYeebdd\\ddaYbdcYc`\\bac^YX[V^\\Ybb]]^bdbaZ]ZZ\\^K\\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_'}
you can use function from Bio, like this:
from Bio import SeqIO
myf=mydir+myfile
startlist=[]
for record in SeqIO.parse(myf, "fastq"):
startlist.append(str(record.seq)) #or without 'str'

Categories