How to jump between lines in for loop in python? - python

I have a file of sequence information, so the file will be structured like this,
[SEQUENCE ID]
atgctagctagatcga
[SEQUENCE ID]
agatcgatggctagatc
What I've been doing is comparing files to see which sequence IDs are shared, which is simple enough, but now I want to pull out the actual sequence associated with each ID. The files I'm using are huge (10 GB+), so using a dictionary or anything else that involves reading all the lines into memory is out.
Basically, the code is intended to do this: if a sequence ID from file 1 isn't found in file 2, return the line after that sequence ID from file 1. Any tips?

So you only need line N and line N+1? In that case, read the file in chunks of two lines. Then you always have access to both the sequence ID and the sequence.
# Python 2; on Python 3, use the built-in zip() and print() instead
from itertools import izip
with open('data.txt', 'r') as f:
    # the same iterator is passed twice, so each step consumes two lines
    for line1, line2 in izip(*(iter(f),) * 2):
        print line1, line2
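The two-line grouping can also be sketched in Python 3 without itertools. This is only an illustration: the helper name `pairs` and the sample lines are made up here, and it assumes every ID line is followed by exactly one sequence line.

```python
import io

def pairs(lines):
    """Yield (id_line, sequence_line) tuples from an iterable of lines."""
    it = iter(lines)
    # zip pulls from the same iterator twice per step, so it consumes
    # the input two lines at a time; a trailing unpaired line is dropped.
    for id_line, seq_line in zip(it, it):
        yield id_line.rstrip("\n"), seq_line.rstrip("\n")

# io.StringIO stands in for a real file opened with open(...)
sample = io.StringIO("[ID1]\natgc\n[ID2]\nagat\n")
result = list(pairs(sample))
print(result)
```

Because the generator never holds more than two lines at once, the same function should work unchanged on a very large file handle.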

Short answer: you will have to use a third-party Python library to keep one of the data sets searchable in better than O(n).
If the files are not sorted, you will have to sort at least one of them. Think of it this way:
I get a sequence ID from file 1, and to check that it is not present in file 2, I'd have to read the whole of file 2. Doing that for every ID is much less feasible than reading the file once.
Then, better than sorting, it would be useful to have a data structure that holds the sorted data on disk in a way that allows fast searches and can still grow. That would make sorting easier as well, since all you'd have to do in a first step is read the entries in file 2 and insert them into this growing, sorted, disk-persisted data structure.
While you could certainly roll your own data structure to do this, I'd suggest using ZODB, Zope's object-oriented database, with a BTree folder, and making your "2 lines of data" into a minimal object for your task.
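As a rough illustration of the disk-backed index idea, here is a sketch that uses the standard library's sqlite3 module instead of ZODB (a swapped-in technique, not what this answer proposes). The function names, the table name, and the assumed file layout (strictly alternating ID and sequence lines) are all assumptions made for the example.

```python
import sqlite3

def build_id_index(path, db_path):
    """Store every ID line of `path` in an on-disk SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS ids (id TEXT PRIMARY KEY)")
    with open(path) as handle:
        # Lines 0, 2, 4, ... are IDs in the assumed layout.
        ids = ((line.strip(),) for n, line in enumerate(handle) if n % 2 == 0)
        conn.executemany("INSERT OR IGNORE INTO ids VALUES (?)", ids)
    conn.commit()
    return conn

def missing_sequences(path, conn):
    """Yield (id, sequence) pairs from `path` whose ID is not in the index."""
    with open(path) as handle:
        # The file object is its own iterator, so zip reads two lines per step.
        for id_line, seq_line in zip(handle, handle):
            seq_id = id_line.strip()
            row = conn.execute(
                "SELECT 1 FROM ids WHERE id = ?", (seq_id,)
            ).fetchone()
            if row is None:
                yield seq_id, seq_line.strip()
```

One would call `build_id_index` on file 2 first and then iterate over `missing_sequences` for file 1; each file is streamed once and the lookup table lives on disk rather than in memory.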

Assuming the [SEQUENCE ID]s do fit in memory, and that the bulk of your data is actually on the sequence lines (unlike the examples provided), you have the option of parsing one file (file2 in your question) and recording not only each [SEQUENCE ID] but also the file position of each such identifier. This approach would let you proceed without breaking much of your current workflow (for example, by having to learn about a database):
def get_indexes(filename):
    # map each sequence ID to the file offset where its ID line starts
    with open(filename, "rt") as file:
        sequences = {}
        while True:
            position = file.tell()
            id = file.readline()
            if not id:
                break
            sequences[id.strip()] = position
            # skip corresponding data line:
            file.readline()
    return sequences
def fetcher(filename1, filename2, sequences):
    with open(filename1, "rt") as file1, open(filename2, "rt") as file2:
        while True:
            id = file1.readline()
            data = file1.readline()
            if not id:
                break
            id = id.strip()
            if id in sequences:
                # position file2 reading at the identifier:
                file2.seek(sequences[id])
                # throw away id:
                file2.readline()
                data = file2.readline()
                yield id, data
if __name__ == "__main__":
    sequences = get_indexes("/data/file2")
    for id, data in fetcher("/data/file1", "/data/file2", sequences):
        print("%s\n%s" % (id, data))

Related

Joblib too slow using "if not in" loop

I am working with amino acid sequences using the Biopython parser. The data are in FASTA format (you can think of each record as a string of letters preceded by its ID, as shown below). My problem is that I have a huge amount of data, and even after trying to parallelize with joblib, the estimated run time for this simple code is 400 hours.
Basically, I have a file containing a series of IDs (ids_to_drop) that I have to remove from the original dataset (original_dataset) to create a new file (new_dataset) containing every record of the original dataset except those in ids_to_drop.
I've tried everything I can think of, but I don't know how else to do it and I'm stuck. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r: #output #removing file
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")
Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried in the following way after the suggestion and it works.
with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq,filtered,"fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set one time, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation will be I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
    ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
# just in case there are blank lines
if "" in ids_to_remove:
    ids_to_remove.remove("")
with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
    filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
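If the data are real multi-line FASTA (a '>' header followed by several sequence lines), a line-by-line filter would drop only the header line of an unwanted record. A minimal sketch of a record-aware variant using only the standard library follows. The function name `drop_records`, the sample data, and the assumption that the ID is the first whitespace-separated token after '>' belong to this sketch, not to the answer above.

```python
def drop_records(lines, ids_to_remove):
    """Yield the lines of every FASTA record whose ID is not in ids_to_remove."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # Header line: decide once for the whole record.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep = record_id not in ids_to_remove
        if keep:
            yield line

sample = [
    ">WP_1 first protein\n", "MSSAAQ\n", "TPEATS\n",
    ">WP_2\n", "KFFGGG\n",
    ">WP_3\n", "YAERWS\n",
]
filtered = list(drop_records(sample, {"WP_2"}))
print("".join(filtered), end="")
```

With files, the open input handle can be passed as `lines` and the generator fed to `writelines` on the output handle, so only one line is held in memory at a time.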

how to subsample a fasta file based on the headers if headers contain certain strings?

I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys
wanted = []
for line in open(sys.argv[2]):
    titles = line.strip()
    wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq = []
for seq in seqiter:
    new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
Use a program like this:
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# close the file yourself; it is also closed
# properly if an error occurs
with open(sys.argv[2], "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
    for record in SeqIO.parse(inF, "fasta"):
        # now check if this record in the fasta file
        # has an id we are searching for
        name = record.description
        id = int(name.split("|")[1])
        # no debug printing here: stdout is redirected to the output file
        if id in numbersInTxtFile:
            # we need to output
            print(">%s" % name)
            print(record.seq)
which you can then call from the command line like so:
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Import your "keeper" IDs into a dictionary rather than a list. This will be much faster, because a list would have to be searched thousands of times.
keepers = {}
with open("ids.txt", "r") as id_handle:
    for curr_id in id_handle:
        # strip the trailing newline so the key matches the bare ID
        keepers[curr_id.strip()] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout without a good reason. If you want to output to STDOUT, just use print or sys.stdout.write().
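For reference, the SyntaxError in the question arises because `seq if i in seq.id` is a conditional expression without an `else`; in a comprehension, the filter goes after the `for` clause. A small sketch of both a substring test and an exact field test, using plain header strings in place of SeqRecord objects (the sample data and variable names are illustrative):

```python
headers = [
    "gi|373248686|emb|HE586118.1|",
    "gi|1139489917|gb|KX622588.1|",
    "gi|930076459|gb|KR364704.1|",
]
wanted = ["373248686", "930076459", "296280703"]

# Substring test, as in the question: the condition follows the for clause.
by_substring = [h for h in headers if any(w in h for w in wanted)]

# Exact test on the second "|"-separated field, backed by a set for fast lookup.
wanted_set = set(wanted)
by_field = [h for h in headers if h.split("|")[1] in wanted_set]

print(by_substring)
print(by_field)
```

The exact-field variant avoids false positives such as a short number matching inside a longer one, and the set lookup does not slow down as the ID list grows.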

in python loop print lines from alternating files

I am trying to use python to find four-line blocks of interest in two separate files then print out some of those lines in controlled order. Below are the two input files and an example of the desired output file. Note that the DNA sequence in the Input.fasta is different than the DNA sequence in Input.fastq because the .fasta file has been read corrected.
Input.fasta
>read1
AAAGGCTGT
>read2
AGTCTTTAT
>read3
CGTGCCGCT
Input.fastq
@read1
AAATGCTGT
+
'(''%$'))
@read2
AGTCTCTAT
+
&---+2010
@read3
AGTGTCGCT
+
0-23;:677
DesiredOutput.fastq
@read1
AAAGGCTGT
+
'(''%$'))
@read2
AGTCTTTAT
+
&---+2010
@read3
CGTGCCGCT
+
0-23;:677
Basically I need the sequence lines "AAAGGCTGT", "AGTCTTTAT", and "CGTGCCGCT" from "Input.fasta" and all other lines from "Input.fastq". This allows the restoration of quality information to a read-corrected .fasta file.
Here is my closest failed attempt:
fastq = open("Input.fastq", "r")
fasta = open("Input.fasta", "r")
ReadIDs = []
IDs = []
with fastq as fq:
    for line in fq:
        if "read" in line:
            ReadIDs.append(line)
            print(line.strip())
            for ID in ReadIDs:
                IDs.append(ID[1:6])
            with fasta as fa:
                for line in fa:
                    if any(string in line for string in IDs):
                        print(next(fa).strip())
            next(fq)
            print(next(fq).strip())
            print(next(fq).strip())
I think I am running into trouble by trying to nest "with" calls to two different files in the same loop. This prints the desired lines for read1 correctly but does not continue to iterate through the remaining lines and throws an error "ValueError: I/O operation on closed file"
I suggest you use Biopython, which will save you a lot of trouble as it provides nice parsers for these file formats, which handle not only the standard cases but also for example multi-line fasta.
Here is an implementation that replaces the fastq sequence lines with the corresponding fasta sequence lines:
from Bio import SeqIO
fasta_dict = {record.id: record.seq for record in
              SeqIO.parse('Input.fasta', 'fasta')}
def yield_records():
    for record in SeqIO.parse('Input.fastq', 'fastq'):
        record.seq = fasta_dict[record.id]
        yield record
SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
If you don't want to use the headers and would rather rely on record order, the solution is even simpler and more memory efficient: there is no need to build the dictionary first, just iterate over the two sets of records together. Make sure the order and number of records are the same in both files.
fasta_records = SeqIO.parse('Input.fasta', 'fasta')
fastq_records = SeqIO.parse('Input.fastq', 'fastq')
def yield_records():
    for fasta_record, fastq_record in zip(fasta_records, fastq_records):
        fastq_record.seq = fasta_record.seq
        yield fastq_record
SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
## Open the files (and close them after the 'with' block ends)
with open("Input.fastq", "r") as fq, open("Input.fasta", "r") as fa:
    ## Read in the Input.fastq file and save its content to a list
    fastq = fq.readlines()
    ## Do the same for the Input.fasta file
    fasta = fa.readlines()
## For every four-line record in the Input.fastq file
for i in range(len(fastq) // 4):
    print(fastq[4 * i], end="")      # header line from the fastq
    print(fasta[2 * i + 1], end="")  # corrected sequence from the fasta
    print(fastq[4 * i + 2], end="")  # "+" separator line
    print(fastq[4 * i + 3], end="")  # quality line
I like the Biopython solution by @Chris_Rands better for small files, but here is a solution that uses only the batteries included with Python and is memory efficient. It assumes that the fasta and fastq files contain the same number of reads in the same order.
with open('Input.fasta') as fasta, open('Input.fastq') as fastq, open('DesiredOutput.fastq', 'w') as fo:
    for i, line in enumerate(fastq):
        if i % 4 == 1:
            # replace the fastq sequence line with the fasta sequence line
            for j in range(2):
                line = fasta.readline()
        print(line, end='', file=fo)

filtering a weird text file in python

I have a text file in which each ID line starts with > and the following line(s) contain a sequence of characters. The line after the sequence is another ID line starting with >. For some IDs, however, I have “Sequence unavailable” instead of a sequence. The sequence after an ID line can span one or more lines.
Like this example:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
>ENSG00000004139|ENST00000003834
Sequence unavailable
I want to filter out those IDs with “Sequence unavailable”. The output should look like this:
output:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
Do you know how to do that in Python?
Unlike the other answers, I’d strongly recommend against parsing the FASTA format manually. It’s not too hard, but there are pitfalls, and it’s completely unnecessary since efficient, well-tested implementations exist.
Use Bio.SeqIO from Biopython; for example:
from Bio import SeqIO
# outfile must be an open, writable file handle: passing a file name here
# would reopen and overwrite the file on every call
for record in SeqIO.parse(filename, 'fasta'):
    if str(record.seq) != 'Sequenceunavailable':
        SeqIO.write(record, outfile, 'fasta')
Note the missing space in 'Sequenceunavailable': reading the sequences in FASTA format will omit spaces.
How about this:
with open(filename, 'r+') as f:
    data = f.read()
    data = data.split('>')
    result = ['>{}'.format(item) for item in data if item and 'Sequence unavailable' not in item]
    f.seek(0)
    for line in result:
        f.write(line)
    # drop leftover bytes, since the filtered content is shorter
    f.truncate()
def main():
    with open('text.txt') as sequence_file:
        filterFile(sequence_file)
def filterFile(SequenceFile):
    # assumes each ID line is followed by exactly one sequence line
    with open('outfile', 'w') as outfile:
        for line in SequenceFile:
            if line.startswith('>'):
                sequence = next(SequenceFile)
                if sequence.startswith('Sequence unavailable'):
                    continue  # skip this record
                # both lines already end with a newline
                outfile.write(line + sequence)
main()
I unfortunately can't test this code right now; I wrote it off the top of my head! Please test it and let me know what the outcome is so I can adjust the code :-)
I don't know exactly how large these files will get, so just in case, I'm doing it without loading the whole file into memory:
with open(filename) as fh:
    with open(filename+'.new', 'w+') as fh_new:
        for idline, geneseq in zip(*[iter(fh)] * 2):
            if geneseq.strip() != 'Sequence unavailable':
                fh_new.write(idline)
                fh_new.write(geneseq)
It works by creating a new file. The zip call is an idiom for reading the file two lines at a time: because the same iterator is passed to zip twice, idline gets the first line of each pair and geneseq the second.
This solution should be relatively cheap in computing power, but it creates an extra output file.
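The question notes that a sequence may span several lines, which the two-line pairing does not cover. A possible standard-library sketch groups each header with all of its sequence lines before deciding whether to keep the record. The names `read_records` and `keep_available` are illustrative, and the sketch assumes the file starts with a '>' header (any lines before the first header are ignored).

```python
def read_records(lines):
    """Yield (header, sequence_lines) for each '>' record in an iterable of lines."""
    header, body = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line, []
        elif header is not None:
            body.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, body

def keep_available(lines):
    """Yield the lines of records whose sequence is not 'Sequence unavailable'."""
    for header, body in read_records(lines):
        if "".join(body).strip() != "Sequence unavailable":
            yield header
            yield from body

sample = [
    ">ENSG1|ENST1\n", "AAGCAG\n", "CCGGCG\n",
    ">ENSG2|ENST2\n", "Sequence unavailable\n",
]
kept = list(keep_available(sample))
print("".join(kept), end="")
```

Only one record is buffered at a time, so memory use stays bounded by the longest single sequence rather than by the file size.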

Extracting fasta sequences in list files order

I need to extract some fasta sequences from the "goodProteins.fasta" file (first input) using ID list files stored in a separate folder (second input).
The format of the fasta sequence file is:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
>1_12259
DJHFDSQWERTYUHKDJKDJOKK......
>1_12260
ADKKHDFHJQWERTYUHKDJKDJOKK......
and the format of one of the ID files is:
1_12258
1_12256
1_12257
I'm using the following script:
from Bio import SeqIO
import glob
def process(wanted_file, result_file):
    fasta_file = "goodProteins.fasta" # First input (Fasta sequence)
    wanted = set()
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)
    fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")
listFilesArr = glob.glob("My_folder\*txt") # takes all .txt files as
                                           # Second input in My_folder
for wanted_file in listFilesArr:
    result_file = wanted_file[0:-4] + ".fasta"
    process(wanted_file, result_file)
It should extract the fasta sequences listed in the ID file, in the order given there, so the desired output would be:
>1_12258
QWERTYUHKDJKDJOKK......
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
but I get:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
That is, in my final output the headers are sorted in ascending order, but I want them in exactly the same order as in the list files. I'm not sure how to do it... please help.
I think the root cause of the ordering problem is that wanted is a set, and sets are unordered. Since you want the sequence IDs in each wanted_file to determine the ordering, you'd need to store them in something that preserves order, like a list.
Alternatively, you can just process each line of the wanted_file as it's read. A problem with that approach is that it could require reading through the "goodProteins.fasta" file many times, perhaps once for each line of the wanted_file if its contents aren't in sorted order.
To avoid that, the entire file can be read once into a memory-resident dictionary, keyed by sequence ID, using the SeqIO.to_dict() function, and then reused for each wanted_file. You say the file is 50-60 MB, but that isn't too much for most of today's hardware.
Anyway, here's code that attempts to do this. To avoid global variables, there's a Process class that reads in the "goodProteins.fasta" file and converts it into a dictionary when an instance is created. Instances are callable and reusable, meaning that the same process object can be used with each of the wanted_files without repeatedly reading the sequences file.
Note that the code is untested because I don't have the data files or the Bio module installed on my system, but hopefully it's close enough to help.
from Bio import SeqIO
import glob
class Process(object):
    def __init__(self, fasta_file_name):
        # read entire fasta file into memory as a dictionary indexed by ID
        # (plain "r" mode: the "U" flag was removed in Python 3.11)
        with open(fasta_file_name, "r") as fasta_file:
            self.fasta_sequences = SeqIO.to_dict(
                SeqIO.parse(fasta_file, 'fasta'))
    def __call__(self, wanted_file_name, results_file_name):
        with open(wanted_file_name, "r") as wanted, \
                open(results_file_name, "w") as results:
            for seq_id in (line.strip() for line in wanted):
                if seq_id:
                    SeqIO.write(self.fasta_sequences[seq_id], results, "fasta")
process = Process("goodProteins.fasta")  # create process object
# process each wanted file using it
for wanted_file_name in glob.glob(r"My_folder\*.txt"):
    results_file_name = wanted_file_name[:-4] + ".fasta"
    process(wanted_file_name, results_file_name)
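Driving the loop from the ID list, with the sequences held in a dictionary, is what makes the output follow the list order. A small sketch without Biopython shows the idea. The sample records and the helper name `in_listed_order` are made up for the illustration; IDs missing from the index are skipped rather than raising KeyError, which is a design choice of this sketch.

```python
def in_listed_order(index, wanted_ids):
    """Yield (id, sequence) pairs following the order of wanted_ids."""
    for seq_id in wanted_ids:
        seq_id = seq_id.strip()
        # Skip blank lines and IDs that are absent from the index.
        if seq_id and seq_id in index:
            yield seq_id, index[seq_id]

# Index built once from the FASTA file (ID -> sequence).
index = {
    "1_12256": "FSKVJLKDFJFDAKJ",
    "1_12257": "SKJFHKDAJHL",
    "1_12258": "QWERTYUHKDJ",
}
# Lines as they would be read from one ID list file.
wanted_lines = ["1_12258\n", "1_12256\n", "\n", "1_99999\n", "1_12257\n"]

ordered = list(in_listed_order(index, wanted_lines))
for seq_id, seq in ordered:
    print(">" + seq_id)
    print(seq)
```

The same index can be reused for every ID list file, so the FASTA file only has to be read once.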
