Extracting fasta sequences in list files order - python

I need to extract some fasta sequences from the "goodProteins.fasta" file (first input) using ID list files stored in a separate folder (second input).
The format of the fasta sequence file is:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
>1_12259
DJHFDSQWERTYUHKDJKDJOKK......
>1_12260
ADKKHDFHJQWERTYUHKDJKDJOKK......
and the format of one of the id file is:
1_12258
1_12256
1_12257
I'm using the following script:
from Bio import SeqIO
import glob
def process(wanted_file, result_file):
    fasta_file = "goodProteins.fasta"  # First input (Fasta sequence)
    wanted = set()
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)
    fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")
listFilesArr = glob.glob("My_folder\*txt")  # takes all .txt files as
                                            # Second input in My_folder
for wanted_file in listFilesArr:
    result_file = wanted_file[0:-4] + ".fasta"
    process(wanted_file, result_file)
It should extract fasta sequence based on the information and order list in the id file and the desired output would be:
>1_12258
QWERTYUHKDJKDJOKK......
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
but I get:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
That is, in my final output I get the header sorted according to their lower values, but I want them in exactly the same order as described in the list files. I'm not sure how to do it...please help.

I think the root cause of the ordering problem is that wanted is a set, and sets are unordered. On top of that, your loop walks through the fasta file and only tests each record for membership in wanted, so the output follows the order of the fasta file. Since you want the sequence ids in each wanted_file to determine the ordering, you need to store them in something that preserves order, like a list.
Alternatively, you can just process each line of the wanted_file as it's read. A problem with that approach is that it could require reading through the "goodProteins.fasta" file many times, perhaps once for each line of the wanted_file if its contents aren't in sorted order.
To avoid that, the entire file can be read in to a memory-resident dictionary whose keys are the sequence ids once using the SeqIO.to_dict() function, and then reused for each wanted_file. You say the file is 50-60 MB, but that isn't too much for most of today's hardware.
Anyway, here's code that attempts to do this. To avoid global variables there's a Process class that reads in the "goodProteins.fasta" file and converts it into a dictionary when an instance of it is created. Instances are callable and reusable, meaning that the same process object can be used with each of the wanted_files without repeatedly reading the sequences file.
Note that the code is untested because I don't have the data files or the Bio module installed on my system — but hopefully it's close enough to help.
from Bio import SeqIO
import glob
class Process(object):
    def __init__(self, fasta_file_name):
        # read entire fasta file into memory as a dictionary indexed by ID
        # (plain "r" mode: the old "rU" mode no longer exists in current Python 3)
        with open(fasta_file_name, "r") as fasta_file:
            self.fasta_sequences = SeqIO.to_dict(
                SeqIO.parse(fasta_file, 'fasta'))
    def __call__(self, wanted_file_name, results_file_name):
        with open(wanted_file_name, "r") as wanted, \
             open(results_file_name, "w") as results:
            for seq_id in (line.strip() for line in wanted):
                if seq_id:
                    SeqIO.write(self.fasta_sequences[seq_id], results, "fasta")
process = Process("goodProteins.fasta")  # create process object
# process each wanted file using it
for wanted_file_name in glob.glob(r"My_folder\*.txt"):
    results_file_name = wanted_file_name[:-4] + ".fasta"
    process(wanted_file_name, results_file_name)
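For anyone who wants to try the idea without Biopython, here is a minimal standard-library sketch of the same approach: load the sequences into a dictionary once, then emit them in the order of the id list. The names read_fasta and extract_in_order are hypothetical helpers made up for this illustration, and the tiny parser assumes well-formed FASTA text in which the id is the first word of each header.

```python
import io


def read_fasta(handle):
    """Return {id: (header, sequence)} from a FASTA file-like object."""
    records = {}
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records[header.split()[0]] = (header, "".join(chunks))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records[header.split()[0]] = (header, "".join(chunks))
    return records


def extract_in_order(records, wanted_ids):
    """Yield FASTA text for each wanted id, in the order the ids are given."""
    for seq_id in wanted_ids:
        if seq_id in records:  # ids missing from the FASTA data are skipped
            header, sequence = records[seq_id]
            yield f">{header}\n{sequence}\n"


# Shortened stand-ins for goodProteins.fasta and one id file
fasta_text = ">1_12256\nFSKV\n>1_12257\nSKJF\n>1_12258\nQWER\n"
wanted = ["1_12258", "1_12256", "1_12257"]  # a list keeps the id file's order

records = read_fasta(io.StringIO(fasta_text))
result = "".join(extract_in_order(records, wanted))
print(result, end="")
```

The dictionary gives fast lookups by id, while the list of wanted ids is what decides the output order.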

Related

Separating a file by lines in python

I have a .fastq file (cannot use Biopython) that consists of multiple samples in different lines. The file contents look like this:
#sample1
ACGTC.....
+
IIIIDDDDDFF
#sample2
AGCGC....
+
IIIIIDFDFD
.
.
.
#sampleX
ACATAG
+
IIIIIDDDFFF
I want to take the file and separate out each individual set of samples (i.e. lines 1-4, 5-8 and so on until the end of the file) and write each of them to a separate file (i.e. sample1.fastq contains the contents of sample 1, lines 1-4, and so on). Is this doable using loops in python?
You can use defaultdict and regex for this
import re
from collections import defaultdict
# Get file contents
with open("test.fastq", "r") as f:
    content = f.read()
samples = defaultdict(list)  # Make defaultdict of empty lists
identifier = ""
# Iterate through every line in file
for line in content.split("\n"):
    # Find strings which start with #
    if re.match("^#.*", line):
        # Set identifier to match following lines to this section
        identifier = line.replace("#", "")
    else:
        # Add the line to its identifier
        samples[identifier].append(line)
Now all you have to do is save the contents of this default dictionary into multiple files:
# Loop through all samples (and their contents)
for sample_name, sample_items in samples.items():
    # Create new file with the name of its sample_name.fastq
    # (You might want to change the naming)
    with open(f"{sample_name}.fastq", "w") as f:
        # Write each element of the sample_items to new line
        f.write("\n".join(sample_items))
It might be helpful for you to also include #sample_name in the beginning of the file (first line), but I'm not sure you want that so I haven't added that.
Note that you can adjust the regex to match only #sample[number] instead of every line starting with #. To do that, use re.match(r"^#sample\d+", line) instead.
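Since the question describes the records as fixed groups of four lines, another option is to count lines instead of matching header text. That also avoids misreading a quality line that happens to begin with the header character. The sketch below is only an illustration; split_records and record_name are hypothetical helper names, and it assumes every record has exactly four lines.

```python
from itertools import islice


def split_records(lines, lines_per_record=4):
    """Yield successive lists of `lines_per_record` lines from an iterable."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, lines_per_record))
        if not record:
            return
        yield record


def record_name(record):
    """Use the header line, minus its one-character marker, as the name."""
    return record[0].strip()[1:]


# Shortened stand-in for the lines of test.fastq
example = [
    "#sample1\n", "ACGTC\n", "+\n", "IIIID\n",
    "#sample2\n", "AGCGC\n", "+\n", "IIIII\n",
]
records = list(split_records(example))
names = [record_name(record) for record in records]
print(names)

# Writing each record to its own file would then look like this:
# with open("test.fastq") as source:
#     for record in split_records(source):
#         with open(f"{record_name(record)}.fastq", "w") as out:
#             out.writelines(record)
```

Because the file is consumed four lines at a time, only one record is held in memory at once.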

How can I edit my python script so it can select the whole text of a fasta sequence?

I have 2 files: one is a text file that contains a series of IDs, and the other is a multifasta file that contains fasta sequences corresponding to the IDs in the first file. I have a python script that can select the matching IDs from both files. It looks like this:
from Bio import SeqIO
fasta=SeqIO.parse("fasta1.fasta","fasta")
seq_dict={}
for record in fasta:
    seq_dict[record.id]=record.seq
#print (seq_dict)
for line in open("names","r"):
    line=line.strip()
    print(line)
for cle in seq_dict.keys():
    print(cle)
I need to edit my script so it can select the text of the sequence next to its corresponding ID. Can you help me please to do that? Thank you.
After playing a bit with Bio.SeqIO, I concluded that @Bazingaa is probably correct. I adapted your code like so:
from Bio import SeqIO
fasta=SeqIO.parse("fasta1.fasta","fasta")
seq_dict={}
for record in fasta:
    seq_dict[record.id]=record.description
#print (seq_dict)
for line in open("names","r"):
    line=line.strip()
    print(line)
for cle, desc in seq_dict.items():
    print(cle)
    print(desc)
You seem to be new to python, so here's what I did:
Instead of keeping record.seq, I stored record.description.
for a, b in <some dictionary>.items() iterates through the dictionary items, returning key, value pairs into the a, b variables.
Hope this helps.
Edit:
Here's a somewhat more "pythonic" version. I don't really understand what fasta is so I assumed you'd want to read lines from names, take the 'tr|something|something' part as ids (without the leading '>') and print out the ones from 'fasta1.fasta' if they're in names:
from Bio import SeqIO
fasta = SeqIO.parse("fasta1.fasta","fasta")
# read all the names
with open("names", "r") as f:  # this takes care to close the file afterwards
    names = [line.strip().lstrip('>') for line in f]
print("Names: ", names)
for record in fasta:
    print("Record:", record.id)
    if record.id in names:
        print("Matching record:", record.id, record.seq, record.description)
If you want to extract only the items from the names file, it is probably more efficient to read the names into memory first.
from Bio import SeqIO
wanted = dict()
with open("names","r") as lines:
    for line in lines:
        wanted[line.strip()] = 1
for record in SeqIO.parse("fasta1.fasta","fasta"):
    if record.id in wanted:
        print(record.seq)
See if this works:
from Bio import SeqIO
fasta=SeqIO.parse("fasta1.fasta","fasta")
seq_dict = {}
for record in fasta:
    seq_dict[record.id.strip()] = record.seq
with open("names","r") as lines:
    for line in lines:
        # drop a leading '>' in case the names were copied from fasta headers
        l = line.strip().lstrip('>')
        if l in seq_dict:
            print(l)  # ID
            print(seq_dict[l])  # sequence
Note this assumes that the IDs obtained from the fasta file are the same as the IDs in the names file. If this is not the case, please provide further detail of exactly what each of the two files contains (with examples).

how to subsample a fasta file based on the headers if headers contain certain strings?

I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys
wanted = []
for line in open(sys.argv[2]):
    titles = line.strip()
    wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq = []
for seq in seqiter:
    new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
Use a program like this:
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# close the file yourself. Furthermore, the file
# is closed even if an error occurs
with open(sys.argv[2], "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
    for record in SeqIO.parse(inF, "fasta"):
        # now check if this record in the fasta file
        # has an id we are searching for
        name = record.description
        gi = int(name.split("|")[1])
        # debug output goes to stderr so it stays out of the redirected fasta
        print(gi, gi in numbersInTxtFile, file=sys.stderr)
        if gi in numbersInTxtFile:
            # we need to output
            print(">%s" % name)
            print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Import your "keeper" IDs into a dictionary rather than a list. This will be much faster, because the list doesn't have to be searched thousands of times.
keepers = {}
with open("ids.txt", "r") as id_handle:
    for curr_id in id_handle:
        # strip the trailing newline, otherwise no lookup can ever match
        keepers[curr_id.strip()] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout without a good reason. If you want to output to STDOUT, just use print or sys.stdout.write().
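To illustrate the "one entry at a time" advice, here is a rough standard-library sketch that streams through the fasta lines and keeps only the records whose number is in the id set. It is an illustration rather than a drop-in replacement: filter_fasta is a hypothetical helper name, and it assumes the number is always the second |-separated field of the header, as in the sample headers from the question.

```python
import io


def filter_fasta(lines, keepers):
    """Yield the lines of every record whose second '|' field is in keepers."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split("|")
            keep = len(fields) > 1 and fields[1] in keepers
        if keep:
            yield line


# Shortened stand-ins for the id list and the fasta file
ids_text = "373248686\n930076459\n296280703\n"
keepers = {line.strip() for line in io.StringIO(ids_text) if line.strip()}

fasta_text = (
    ">gi|373248686|emb|HE586118.1| Streptomyces albus\n"
    "GGATGCGAAG\n"
    ">gi|1139489917|gb|KX622588.1| Hyalangium minutum\n"
    "ATGCGCAAGC\n"
    ">gi|930076459|gb|KR364704.1| Streptomyces sioyaensis\n"
    "GCCGGCAGGT\n"
)
kept = list(filter_fasta(io.StringIO(fasta_text), keepers))
print("".join(kept), end="")
```

With real files, the generator can be passed straight to the writelines() method of an output file opened with open(), so no more than one line is held in memory at a time.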

in python loop print lines from alternating files

I am trying to use python to find four-line blocks of interest in two separate files then print out some of those lines in controlled order. Below are the two input files and an example of the desired output file. Note that the DNA sequence in the Input.fasta is different than the DNA sequence in Input.fastq because the .fasta file has been read corrected.
Input.fasta
>read1
AAAGGCTGT
>read2
AGTCTTTAT
>read3
CGTGCCGCT
Input.fastq
#read1
AAATGCTGT
+
'(''%$'))
#read2
AGTCTCTAT
+
&---+2010
#read3
AGTGTCGCT
+
0-23;:677
DesiredOutput.fastq
#read1
AAAGGCTGT
+
'(''%$'))
#read2
AGTCTTTAT
+
&---+2010
#read3
CGTGCCGCT
+
0-23;:677
Basically I need the sequence lines "AAAGGCTGT", "AGTCTTTAT", and "CGTGCCGCT" from "input.fasta" and all other lines from "input.fastq". This allows the restoration of quality information to a read corrected .fasta file.
Here is my closest failed attempt:
fastq = open(Input.fastq, "r")
fasta = open(Input.fasta, "r")
ReadIDs = []
IDs = []
with fastq as fq:
    for line in fq:
        if "read" in line:
            ReadIDs.append(line)
            print(line.strip())
            for ID in ReadIDs:
                IDs.append(ID[1:6])
            with fasta as fa:
                for line in fa:
                    if any(string in line for string in IDs):
                        print(next(fa).strip())
            next(fq)
            print(next(fq).strip())
            print(next(fq).strip())
I think I am running into trouble by trying to nest "with" calls to two different files in the same loop. This prints the desired lines for read1 correctly, but it does not continue to iterate through the remaining lines and throws the error "ValueError: I/O operation on closed file".
I suggest you use Biopython, which will save you a lot of trouble as it provides nice parsers for these file formats, which handle not only the standard cases but also for example multi-line fasta.
Here is an implementation that replaces the fastq sequence lines with the corresponding fasta sequence lines:
from Bio import SeqIO
fasta_dict = {record.id: record.seq for record in
              SeqIO.parse('Input.fasta', 'fasta')}
def yield_records():
    for record in SeqIO.parse('Input.fastq', 'fastq'):
        record.seq = fasta_dict[record.id]
        yield record
SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
If you don't want to use the headers but just rely on the order, then the solution is even simpler and more memory efficient (just make sure the order and number of records are the same). There is no need to define the dictionary first; just iterate over the records together:
from Bio import SeqIO
fasta_records = SeqIO.parse('Input.fasta', 'fasta')
fastq_records = SeqIO.parse('Input.fastq', 'fastq')
def yield_records():
    for fasta_record, fastq_record in zip(fasta_records, fastq_records):
        fastq_record.seq = fasta_record.seq
        yield fastq_record
SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
## Open the files (and close them after the 'with' block ends)
with open("Input.fastq", "r") as fq, open("Input.fasta", "r") as fa:
    ## Read in the Input.fastq file and save its lines to a list
    fastq = fq.read().splitlines()
    ## Do the same for the Input.fasta file
    fasta = fa.read().splitlines()
## For every four-line record in the Input.fastq file
for i in range(len(fastq) // 4):
    print(fastq[4 * i])        # header line from the fastq file
    print(fasta[(2 * i) + 1])  # corrected sequence from the fasta file
    print(fastq[(4 * i) + 2])  # "+" separator line
    print(fastq[(4 * i) + 3])  # quality line
I like the Biopython solution by @Chris_Rands better for small files, but here is a solution that only uses the batteries included with Python and is memory efficient. It assumes that the fasta and fastq files contain the same number of reads in the same order.
with open('Input.fasta') as fasta, open('Input.fastq') as fastq, open('DesiredOutput.fastq', 'w') as fo:
    for i, line in enumerate(fastq):
        if i % 4 == 1:
            # replace the fastq sequence line with the fasta one
            # (the first readline() skips the fasta header)
            for j in range(2):
                line = fasta.readline()
        print(line, end='', file=fo)

How to jump between lines in for loop in python?

I have a file of sequence information, so the file will be structured like this,
[SEQUENCE ID]
atgctagctagatcga
[SEQUENCE ID]
agatcgatggctagatc
What I've been doing is comparing between files to see what sequence IDs are shared, which is simple enough, but now I want to pull out the actual sequence associated with the ID. The files I'm using are huge (10 GB+) so using a dictionary or anything that would involve reading all the lines into the system memory is out.
Basically what the code is intended to do is if the sequence ID from file 1 isn't found in file 2, then return the line after the sequence ID from file 1. Any tips?
So you only need line N and line N+1? In this case read the file in chunks of two lines. Then you always have access to both the sequence ID and the sequence.
with open('data.txt', 'r') as f:
    # zipping the same iterator with itself yields the lines two at a time
    for line1, line2 in zip(*(iter(f),) * 2):
        print(line1, line2)
Short answer: you will have to use a third-party Python library to keep one of the data sequences searchable in better than O(n).
If they are not sorted, you will have to sort at least one of the files. Think of it this way:
I get a sequence ID from file 1, and to check whether it is present in file 2, I'd have to read the whole file. Doing that for every ID is much less feasible than reading the file once.
Then, better than sorting, it would be useful to have a data structure that holds the sorted data on disk in a way that provides fast searches and is still able to grow. That would facilitate sorting as well, as all you'd have to do in a first step would be to read the entries in file 2 and insert them into this growing, sorted, disk-persisted data structure.
While you could certainly roll your own data structure to do this, I'd suggest the use of ZODB, Zope's object-oriented database, with a btree folder, and having your "2 lines of data" made into a minimal object for your task.
Assuming the [SEQUENCE ID]s do fit in memory, and that the bulk of your data is actually on the sequence line (unlike the examples provided), you have the option to parse a file (file2 in your question) and record not only each [SEQUENCE ID] but also the file position of each such identifier. This approach would enable you to proceed without breaking much of your current workflow (like having to learn about a database):
def get_indexes(filename):
    with open(filename, "rt") as file:
        sequences = {}
        while True:
            position = file.tell()
            id = file.readline()
            if not id:
                break
            sequences[id.strip()] = position
            # skip corresponding data line:
            file.readline()
    return sequences
def fetcher(filename1, filename2, sequences):
    with open(filename1, "rt") as file1, open(filename2, "rt") as file2:
        while True:
            id = file1.readline()
            data = file1.readline()
            if not id:
                break
            id = id.strip()
            if id in sequences:
                # position file2 reading at the identifier:
                file2.seek(sequences[id])
                # throw away id:
                file2.readline()
                data = file2.readline()
                yield id, data
if __name__ == "__main__":
    sequences = get_indexes("/data/file2")
    for id, data in fetcher("/data/file1", "/data/file2", sequences):
        print("%s\n%s" % (id, data))
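The question itself asks for the sequences whose ID in file 1 is not found in file 2. If the IDs alone fit in memory (the same assumption as above), a plain set of file 2's IDs may be enough, with file 1 streamed two lines at a time. This is only a sketch under that assumption; read_pairs and missing_from_second are hypothetical names, and it expects strictly alternating ID and sequence lines.

```python
import io


def read_pairs(handle):
    """Yield (sequence_id, sequence) from alternating ID / sequence lines."""
    while True:
        seq_id = handle.readline()
        sequence = handle.readline()
        if not seq_id:
            return
        yield seq_id.strip(), sequence.strip()


def missing_from_second(file1, file2):
    """Yield (id, sequence) pairs of file1 whose ID does not occur in file2."""
    # Only the IDs of file2 are kept in memory, never its sequences
    ids_in_file2 = {seq_id for seq_id, _ in read_pairs(file2)}
    for seq_id, sequence in read_pairs(file1):
        if seq_id not in ids_in_file2:
            yield seq_id, sequence


# In-memory stand-ins for the two large files
file1 = io.StringIO("[A]\natgc\n[B]\nagat\n[C]\nttga\n")
file2 = io.StringIO("[B]\nagat\n")
missing = list(missing_from_second(file1, file2))
print(missing)
```

With real data, the two StringIO objects would be replaced by files opened with open(), and the results consumed one at a time instead of being collected into a list.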
