Joining DNA sequences from two files under the same species name - python

I have two FASTA files with DNA sequences coding for two different proteins. I want to join the sequences for the different proteins belonging to the same species into one long sequence.
For example, I have:
Protein 1
>sce
AGTAGATGACAGCT
>act
GCTAGCTAGCT
Protein 2
>sce
GCTACGATCGACT
>act
TACGATCAGCTA
Protein 1+2
>sce
AGTAGATGACAGCTGCTACGATCGACT
>act
GCTAGCTAGCTTACGATCAGCTA
Something that might be a bit of an issue is that the species don't appear in the same order in both files, and a few sequences are found in one file but not in the other (the files are about 110 species long, with a discrepancy of 4 or 5).
My first attempt at writing code for it was:
gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
outfile = open("joined.fas", 'w')
for line in gamma:
    if line.startswith(">"):
        for line2 in spc:
            if line2.startswith(">"):
                if line == line2:
                    outfile.write(line)
    else:
        outfile.write(line)
fh.close()
but since the DNA sequences are very long and span many lines of the file, I don't know how to select them.
Please help!

Since you tagged Biopython, here is a compact solution. Note it puts the whole file into memory (as most simple approaches will):
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

d = SeqIO.to_dict(SeqIO.parse('1.fasta', 'fasta'))
for r in SeqIO.parse('2.fasta', 'fasta'):
    # default to an empty SeqRecord, so ids present only in 2.fasta
    # still end up as SeqRecords that SeqIO.write can serialise
    d[r.id] = d.setdefault(r.id, SeqRecord(Seq(''), id=r.id, description='')) + r.seq
SeqIO.write(d.values(), 'output.fasta', 'fasta')
Here 1.fasta and 2.fasta are your two input fasta files, and output.fasta is your merged output file.
Also note that, biologically, I think this is an odd thing to do: concatenating sequences across multiple files can create 'fake' contiguous sequences, and the order of concatenation is surely important, so be careful.
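If memory does become a concern, Bio.SeqIO.index offers a dict-like view of a FASTA file that loads records from disk on demand. Here is a minimal sketch under the same 1.fasta/2.fasta naming assumption, writing plain FASTA by hand rather than through SeqIO.write:
from Bio import SeqIO

idx = SeqIO.index('2.fasta', 'fasta')  # dict-like, reads records lazily from disk
seen = set()
with open('output.fasta', 'w') as out:
    for r in SeqIO.parse('1.fasta', 'fasta'):
        # concatenate when the id exists in both files, else keep the record as-is
        seq = r.seq + idx[r.id].seq if r.id in idx else r.seq
        out.write('>%s\n%s\n' % (r.id, seq))
        seen.add(r.id)
    for key in idx:  # species present only in the second file
        if key not in seen:
            out.write('>%s\n%s\n' % (key, idx[key].seq))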

By using a dictionary, you can append the FASTA sequences belonging to each ID, and then write them to the output file.
outfile = open("joined.fas", 'w')
d = dict()
for file in ('gamma.fas', 'spc98.fas'):
    with open(file, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                key = line
            else:
                d.setdefault(key, '')
                d[key] += line
for key, seq in d.items():
    outfile.write(key + "\n" + seq + "\n")
outfile.close()
EDIT: By the way, you are opening your two input files for writing, which will clobber them:
gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
They should be opened with r instead of w.
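Given the handful of species present in only one file, it may also be worth listing the discrepancies before merging. A minimal sketch using plain-Python sets, assuming the same gamma.fas/spc98.fas file names as above:
def fasta_ids(path):
    """Collect the header lines (without the leading '>') of a FASTA file."""
    with open(path) as f:
        return {line[1:].strip() for line in f if line.startswith('>')}

gamma_ids = fasta_ids('gamma.fas')
spc_ids = fasta_ids('spc98.fas')
print('Only in gamma.fas:', sorted(gamma_ids - spc_ids))
print('Only in spc98.fas:', sorted(spc_ids - gamma_ids))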

Use a file to search another file and print lines matching a pattern to first file

Python noob here. I've been smashing my head trying to do this, tried several Unix tools, and I'm convinced that Python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that each id is unique, but the number assigned to it may repeat. I have several files like this, all with the same number of headers (~500).
File2 has all the numbers of File1, each followed by a sequence:
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that each sequence id is unique, as is the sequence after it. I have as many of these files as I have File1-type files, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2; it is important that File1 maintains its original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract the numbers from File1 and use them as patterns to match in File2. First I am trying to make this work with only a pair of files. Here is what I have so far:
#!/usr/bin/env python
import re

datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f])  # maybe I could use regex to get only lines with number as pattern?

print(datafile_lines)

outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)

print(outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
PS: I am open to modifications in the file structure; I tried substituting \n in File2 for "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1

new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')
Creating two dictionaries is easy, and then you can map the values between them.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # Update ID on odd-numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # Map previous line's ID to this line's value

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # Keep ">id1" lines unchanged
            line = id_to_gene_map[line]  # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
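For completeness, here is a minimal sketch of the list-based variant mentioned above; it assumes the IDs in File2 really are the continuous, ordered sequence 1, 2, 3, ... so the ID lines can be skipped entirely:
genes = []
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 0:  # gene sequences sit on even-numbered lines
            genes.append(line)

# the gene for ID n is then genes[n - 1], e.g.:
gene_for_id_77 = genes[77 - 1]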

how to subsample a fasta file based on the headers if headers contain certain strings?

I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains a number from the text file:
write the match (header + sequence) to a new output.fasta file.
How can I do this in Python? It seems easy: some for loops should do the job, but somehow I cannot make that happen, and if my files are really big, a loop inside another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys

wanted = []
for line in open(sys.argv[2]):
    titles = line.strip()
    wanted.append(titles)

seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq = []
for seq in seqiter:
    new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
Use a program like this:
from Bio import SeqIO
import sys

# read in the text file
numbersInTxtFile = set()
# hint: use 'with', then you don't need to program the file closing
# yourself; furthermore, error handling comes along with it too
with open(sys.argv[2], "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "":
            continue
        numbersInTxtFile.add(int(line))

# read in the fasta file
with open(sys.argv[1], "r") as inF:
    for record in SeqIO.parse(inF, "fasta"):
        # now check if this record in the fasta file
        # has an id we are searching for
        name = record.description
        id = int(name.split("|")[1])
        if id in numbersInTxtFile:
            # we need to output
            print(">%s" % name)
            print(record.seq)
which you can then call like so from the command line:
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers = {}
with open("ids.txt", "r") as id_handle:
    for curr_id in id_handle:
        keepers[curr_id.strip()] = True  # strip the newline, or lookups will never match
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout without a good reason; if you want to output to STDOUT, just use print or sys.stdout.write().
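For example, a minimal streaming sketch (the file names are placeholders; keepers is the dictionary built above):
from Bio import SeqIO

with open('output.fasta', 'w') as out_handle:
    for record in SeqIO.parse('input.fasta', 'fasta'):  # one record at a time
        if record.id in keepers:
            SeqIO.write(record, out_handle, 'fasta')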

in python loop print lines from alternating files

I am trying to use Python to find four-line blocks of interest in two separate files and then print out some of those lines in a controlled order. Below are the two input files and an example of the desired output file. Note that the DNA sequence in Input.fasta differs from the one in Input.fastq because the .fasta file has been read-corrected.
Input.fasta
>read1
AAAGGCTGT
>read2
AGTCTTTAT
>read3
CGTGCCGCT
Input.fastq
@read1
AAATGCTGT
+
'(''%$'))
@read2
AGTCTCTAT
+
&---+2010
@read3
AGTGTCGCT
+
0-23;:677
DesiredOutput.fastq
@read1
AAAGGCTGT
+
'(''%$'))
@read2
AGTCTTTAT
+
&---+2010
@read3
CGTGCCGCT
+
0-23;:677
Basically I need the sequence lines "AAAGGCTGT", "AGTCTTTAT", and "CGTGCCGCT" from Input.fasta and all other lines from Input.fastq. This allows the restoration of quality information to a read-corrected .fasta file.
Here is my closest failed attempt:
fastq = open("Input.fastq", "r")
fasta = open("Input.fasta", "r")
ReadIDs = []
IDs = []

with fastq as fq:
    for line in fq:
        if "read" in line:
            ReadIDs.append(line)
            print(line.strip())
            for ID in ReadIDs:
                IDs.append(ID[1:6])
            with fasta as fa:
                for line in fa:
                    if any(string in line for string in IDs):
                        print(next(fa).strip())
            next(fq)
            print(next(fq).strip())
            print(next(fq).strip())
I think I am running into trouble by trying to nest "with" calls to two different files in the same loop. This prints the desired lines for read1 correctly but does not continue to iterate through the remaining lines, throwing "ValueError: I/O operation on closed file".
I suggest you use Biopython, which will save you a lot of trouble as it provides nice parsers for these file formats, which handle not only the standard cases but also for example multi-line fasta.
Here is an implementation that replaces the fastq sequence lines with the corresponding fasta sequence lines:
from Bio import SeqIO

fasta_dict = {record.id: record.seq for record in
              SeqIO.parse('Input.fasta', 'fasta')}

def yield_records():
    for record in SeqIO.parse('Input.fastq', 'fastq'):
        record.seq = fasta_dict[record.id]
        yield record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
If you don't want to use the headers but just rely on the order, then the solution is even simpler and more memory-efficient (just make sure the order and number of records are the same). There is no need to define the dictionary first; just iterate over the records together:
fasta_records = SeqIO.parse('Input.fasta', 'fasta')
fastq_records = SeqIO.parse('Input.fastq', 'fastq')

def yield_records():
    for fasta_record, fastq_record in zip(fasta_records, fastq_records):
        fastq_record.seq = fasta_record.seq
        yield fastq_record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')  # write out as before
## Open the files (and close them after the 'with' block ends)
with open("Input.fastq", "r") as fq, open("Input.fasta", "r") as fa:
    ## Read in the Input.fastq file and save its content to a list
    fastq = fq.readlines()
    ## Do the same for the Input.fasta file
    fasta = fa.readlines()

## For every four-line record in the fastq file print the fastq header,
## the corrected sequence from the fasta file, then the '+' and quality lines
for i in range(len(fastq) // 4):
    print(fastq[4 * i], end='')
    print(fasta[2 * i + 1], end='')
    print(fastq[4 * i + 2], end='')
    print(fastq[4 * i + 3], end='')
I like the Biopython solution by @Chris_Rands better for small files, but here is a solution that only uses the batteries included with Python and is memory-efficient. It assumes the fasta and fastq files contain the same number of reads in the same order.
with open('Input.fasta') as fasta, open('Input.fastq') as fastq, open('DesiredOutput.fastq', 'w') as fo:
    for i, line in enumerate(fastq):
        if i % 4 == 1:  # this is the sequence line of a fastq record
            for j in range(2):
                line = fasta.readline()  # skip the fasta header, keep the sequence
        print(line, end='', file=fo)

filtering a weird text file in python

I have a text file in which each ID line starts with > and the next line(s) are a sequence of characters. The line after the sequence is another ID line starting with >. But in some of them, instead of a sequence I have "Sequence unavailable". The sequence after the ID line can span one or more lines,
like this example:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
>ENSG00000004139|ENST00000003834
Sequence unavailable
I want to filter out those IDs with “Sequence unavailable”. The output should look like this:
output:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
Do you know how to do that in Python?
Unlike the other answers, I'd strongly recommend against parsing the FASTA format manually. It's not too hard, but there are pitfalls, and it's completely unnecessary since efficient, well-tested implementations exist:
Use Bio.SeqIO from Biopython; for example:
from Bio import SeqIO

for record in SeqIO.parse(filename, 'fasta'):
    if record.seq != 'Sequenceunavailable':
        SeqIO.write(record, outfile, 'fasta')
Note the missing space in 'Sequenceunavailable': reading the sequences in FASTA format will omit spaces.
How about this:
with open(filename, 'r+') as f:
    data = f.read()
    data = data.split('>')
    result = ['>{}'.format(item) for item in data
              if item and 'Sequence unavailable' not in item]
    f.seek(0)
    for line in result:
        f.write(line)
    f.truncate()  # discard whatever is left of the original, longer file
def main():
    filename = open('text.txt', 'r')
    filterFile(filename)

def filterFile(SequenceFile):
    outfile = open('outfile', 'w')
    for line in SequenceFile:
        if line.startswith('>'):
            sequence = next(SequenceFile)  # the line right after the header
            if sequence.startswith('Sequence unavailable'):
                pass  # nothing should happen, I suppose?
            else:
                outfile.write(line + sequence)
    outfile.close()

main()
I unfortunately can't test this code right now; I wrote it off the top of my head! Please test it and let me know what the outcome is so I can adjust the code :-)
So I don't know exactly how large these files will get; just in case, I'm doing it without loading the whole file into memory:
with open(filename) as fh:
    with open(filename + '.new', 'w+') as fh_new:
        for idline, geneseq in zip(*[iter(fh)] * 2):
            if geneseq.strip() != 'Sequence unavailable':
                fh_new.write(idline)
                fh_new.write(geneseq)
It works by creating a new file; the zip expression is a little magic that reads the file two lines at a time, so idline gets the first line of each pair and geneseq the second.
This solution should be relatively cheap in computer power but will create an extra output file.
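Since the question says a sequence can span more than one line, the two-lines-at-a-time pairing above would mis-align on such input. Here is a minimal standard-library sketch that instead accumulates sequence lines until the next header (filename as in the snippet above):
def fasta_records(handle):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, seq_lines = None, []
    for line in handle:
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(seq_lines)
            header, seq_lines = line, []
        else:
            seq_lines.append(line.strip())
    if header is not None:
        yield header, ''.join(seq_lines)

with open(filename) as fh, open(filename + '.new', 'w') as fh_new:
    for idline, geneseq in fasta_records(fh):
        if geneseq != 'Sequence unavailable':
            fh_new.write(idline)
            fh_new.write(geneseq + '\n')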

comparing parts of lines in two tsv files in python

So I want to sum/analyse values pertaining to a given line in one file whenever that line matches a line in another file.
The format of the first file I wish to compare against is:
Acetobacter cibinongensis Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
Acetobacter ghanensis Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
Acetobacter pasteurianus Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
And the second file is like:
Blochmannia endosymbiont of Polyrhachis (Hedomyrma) turneri Candidatus Blochmannia Enterobacteriaceae Enterobacteriales Proteobacteria Bacteria 1990 7.511 14946.9
Blochmannia endosymbiont of Polyrhachis (Hedomyrma) turneri Candidatus Blochmannia Enterobacteriaceae Enterobacteriales Proteobacteria Bacteria 2061 6.451 13295.5
Calyptogena okutanii thioautotrophic gill symbiont Proteobacteria-undef Proteobacteria-undef Proteobacteria-undef Proteobacteria Bacteria 7121 2.466 17560.4
What I want to do is parse every line in the first file, and for every line in the second file where the first 6 fields match, perform analysis on the numbers in the 3 fields following the species info.
My code is as follows:
with open('file1', 'r') as file1:
    with open('file2', 'r') as file2:
        for line in file1:
            count = 0
            line = line.split("\t")
            for l in file2:
                l = l.split("\t")
                if l[0:6] == line[0:6]:
                    count+=1
            count = str(count)
            print line + '\t' + count +'\t'+'\n'
Which I'm hoping will give me the line from the first file and the number of times that species was found in the second file.
I know there's probably a better way of doing THIS particular part of the analysis, but I wanted to give a simple example of the objective.
Anyway, I don't get any matches, i.e. I never see an instance where
l[0:6] == line[0:6]
is True.
Any ideas?? :-S
The root cause is that you consume file2 during the first iteration of the outer loop; after that, every pass over it iterates over nothing.
Quick fix: read file2 fully and put it in a list. However, this is rather inefficient in terms of speed (O(N^2): a double loop). It would be better to create a dictionary whose key is a tuple of the first 6 values.
with open('file2', 'r') as f:
    file2 = list(f)

with open('file1', 'r') as file1:
    for line in file1:
        count = 0
        line = line.split("\t")
        for l in file2:
            l = l.split("\t")
            if l[0:6] == line[0:6]:
                count += 1
        # join the fields back together for printing (line is now a list)
        print("\t".join(line).rstrip() + '\t' + str(count))
Also, using the csv module configured with TAB as the separator would avoid some surprises in the future.
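For instance, a minimal sketch of reading file2 with the csv module (file name and column layout as assumed above):
import csv

with open('file2', newline='') as f:
    for fields in csv.reader(f, delimiter='\t'):
        species_info = fields[0:6]  # the six taxonomy columns
        numeric_data = fields[6:]   # the trailing numeric columns
        print(species_info, numeric_data)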
Better version, using a dictionary for faster access to the data of file2. The first 6 elements form the key (note that we cannot use a list as a key since it is mutable, so we convert it to a tuple), and because several lines in file2 can share the same species fields, each key stores a list of the trailing fields:
d = dict()
# create the dictionary from file2
with open('file2', 'r') as file2:
    for l in file2:
        fields = l.rstrip('\n').split("\t")
        # append, so several file2 lines for the same species are all kept
        d.setdefault(tuple(fields[0:6]), []).append(fields[6:])

# iterate through file1, and use dict lookup on the data of file2
# much, much faster if file2 contains a lot of data
with open('file1', 'r') as file1:
    for line in file1:
        fields = line.rstrip('\n').split("\t")
        matches = d.get(tuple(fields[0:6]), [])  # dictionary lookup
        count = len(matches)  # number of matching file2 lines
        # the numeric data of each match is available in `matches`
        print('\t'.join(fields) + '\t' + str(count))
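To go beyond counting and actually analyse the numbers, as the question asks, the stored field lists can be converted to floats. A minimal sketch reusing the dictionary d built above, assuming the first trailing column is the one to sum:
with open('file1', 'r') as file1:
    for line in file1:
        fields = line.rstrip('\n').split('\t')
        matches = d.get(tuple(fields[0:6]), [])
        if matches:
            # e.g. sum the first numeric column over every matching file2 line
            total = sum(float(m[0]) for m in matches)
            print('\t'.join(fields) + '\t' + str(total))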
