How to get sequence counts (in a fasta file) with conditions using Python?

I have a fasta file (a fasta file is one in which each header line starts with > and is followed by the sequence line for that header). I want to get the counts of sequences matching TRINITY, and the total number of sequences starting with >K that follow each >TRINITY sequence. I was able to get the counts for the >TRINITY sequences, but I am not sure how to count the >K sequences for each corresponding >TRINITY group. How can I get this done in Python?
myfasta.fasta:
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>K00363:119:HTJ23BBXX:1:2228:9678:46223_(reversed) 1:N:0:CGATGTAT
TAGATTTAAAATAGACGCTTCCATAGA
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>TRINITY_DN555_c0_g1_i1
>K00363:119:HTJ23BBXX:1:2228:9658:46188_(reversed) 1:N:0:CGATGTAT
CGATGCTAGATTTAAAATAGACG
>K00363:119:HTJ23BBXX:1:2106:15260:10387_(reversed) 1:N:0:CGATGTAT
TTAAAATAGACGCTTCCATAGAGA
Result I want:
reference reference_counts Corresponding_K_sequences
>TRINITY_DN12824_c0_g1_i1 2 3
>TRINITY_DN555_c0_g1_i1 1 2
Here is the code I have written, which only counts the >TRINITY sequences; I couldn't extend it to the bit where it also counts the corresponding >K sequences, so any help would be appreciated.
To Run:
python code.py myfasta.fasta output.txt
import sys
from Bio import SeqIO
from collections import defaultdict

filename = sys.argv[1]
outfile = sys.argv[2]

dedup_records = defaultdict(list)
for record in SeqIO.parse(filename, "fasta"):
    if record.id.startswith('TRINITY'):
        # Use the sequence as the key and then have a list of id's as the value
        dedup_records[str(record.seq)].append(record.id)

with open(outfile, 'w') as output:
    # to get the counts of duplicated TRINITY ids (sorted order)
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        #output.write("{} {}\n".format(ids, len(ids)))
        print(ids, len(ids))

You have the right idea, but you need to keep track of the last header that starts with "TRINITY" and slightly alter your data structure:
from Bio import SeqIO
from collections import defaultdict

# Map each TRINITY id to [trinity_count, k_count]
TRIN, d = None, defaultdict(lambda: [0, 0])

for r in SeqIO.parse('myfasta.fasta', 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1
    elif r.id.startswith('K'):
        if TRIN:
            d[TRIN][1] += 1

print('reference\treference_counts\tCorresponding_K_sequences')
for k, v in d.items():
    print('{}\t{}\t{}'.format(k, v[0], v[1]))
Outputs:
reference reference_counts Corresponding_K_sequences
TRINITY_DN12824_c0_g1_i1 2 3
TRINITY_DN555_c0_g1_i1 1 2
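If you want the table written to output.txt instead of printed, to match the python code.py myfasta.fasta output.txt invocation above, here is a minimal sketch along the same lines; the two positional arguments are carried over from the original script:

import sys
from Bio import SeqIO
from collections import defaultdict

filename, outfile = sys.argv[1], sys.argv[2]

# Map each TRINITY id to [trinity_count, k_count]
TRIN, d = None, defaultdict(lambda: [0, 0])
for r in SeqIO.parse(filename, 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1            # another occurrence of this TRINITY header
    elif r.id.startswith('K') and TRIN:
        d[TRIN][1] += 1            # K header credited to the last TRINITY seen

with open(outfile, 'w') as output:
    output.write('reference\treference_counts\tCorresponding_K_sequences\n')
    for ref, (ref_count, k_count) in d.items():
        output.write('{}\t{}\t{}\n'.format(ref, ref_count, k_count))

The "and TRIN" guard simply skips any >K records that appear before the first >TRINITY header.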

Related

How do I get my code to list index number in my sequences

My code is supposed to find and list the index numbers of the sequence "ACYT" within a larger sequence input as a fasta file. This is what I have so far, but all it does is print ['AC[CT]T']:
from Bio.Seq import Seq
from Bio import SeqIO
from Bio import SeqUtils

fasta = input("Enter Fasta File: ")
sequences = SeqIO.parse(fasta, "fasta")
for record in sequences:
    event = Seq("ACYT")
    results = SeqUtils.nt_search(str(sequences), event)
    print(results)
nt_search returns a list whose first element is the expanded pattern and whose remaining elements are the match positions, so you are supposed to index into the result: for example, [1] if you want the first occurrence.
Next, you are supposed to pass str(record.seq) as the first parameter, not str(sequences): stringifying the SeqIO.parse iterator just gives its repr, which is why you only ever saw the pattern with no positions. I couldn't check exactly for your problem as you haven't provided the data, but it works with mine.
for record in sequences:
    event = 'A'                        # search pattern; use "ACYT" for your case
    sequence_string = str(record.seq)  # nt_search wants a plain string, not a Seq or a parser
    results = SeqUtils.nt_search(sequence_string, event)[1]  # first match position (IndexError if none)
    print(results)
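For reference, nt_search expands IUPAC ambiguity codes (Y matches C or T) and returns the expanded pattern followed by every 0-based match position, so slicing with [1:] gives all occurrences; a small standalone illustration:

from Bio import SeqUtils

result = SeqUtils.nt_search("ACCTGGACTT", "ACYT")
print(result)      # ['AC[CT]T', 0, 6]: the pattern, then the 0-based match positions
print(result[1:])  # [0, 6]: just the positions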

How to get the count of duplicated sequences in fasta file using python

I have a fasta file like this:
test_fasta.fasta
>XXKHH_1
AAAAATTTCTGGGCCCC
>YYYXXKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11
TTTGGGAATTAAACCCT
>ID_2SS
TTTGGGAATTAAACCCT
>YKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>YKHSH_1S
TTAAAAATTTCTGGGCCCCGGGAAAAAA
I want to get the count of duplicated sequences and append the total count for each sequence to its id, sorted from the most to the least frequent, to get the result shown below:
>YYYXXKHH_1_counts3
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11_counts2
TTTGGGAATTAAACCCT
>XXKHH_1_counts1
AAAAATTTCTGGGCCCC
I have this code, which finds the duplicated sequences and joins their ids together; but instead of joining them together, I just want the count of duplicates appended to the id, as shown above in the result.
from Bio import SeqIO
from collections import defaultdict

dedup_records = defaultdict(list)
for record in SeqIO.parse("test_fasta.fasta", "fasta"):
    # Use the sequence as the key and then have a list of id's as the value
    dedup_records[str(record.seq)].append(record.id)

with open("Output.fasta", 'w') as output:
    for seq, ids in dedup_records.items():
        # Join the ids and write them out as the fasta
        output.write(">{}\n".format('|'.join(ids)))
        output.write(seq + "\n")
Since you already have the IDs of each duplicated record in the ids list in your output loop, you can simply output the first ID (which is apparently what you want, per your expected output), followed by the length of the ids list:
for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
    output.write(">{}_counts{}\n".format(ids[0], len(ids)))
    output.write(seq + "\n")
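If you ever need just the duplicate counts without tracking ids, a collections.Counter over the sequence strings gives them directly; a minimal sketch, assuming the same test_fasta.fasta:

from Bio import SeqIO
from collections import Counter

# Count how many records share each exact sequence string.
seq_counts = Counter(str(rec.seq) for rec in SeqIO.parse("test_fasta.fasta", "fasta"))
for seq, n in seq_counts.most_common():  # most_common() sorts from most to least frequent
    print(n, seq)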

Counting DNA sequences with python/biopython

My script below is counting the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' from a standard FASTA file:
>contig00001
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT
The script counts the CCCCAAAA sequence here 3 times:
CCCCAAAACCCCAAAACCCCAAAA (trailing CCCC not counted)
Can somebody please advise how I would include the CCCC sequence at the end as a half count, so that this returns a value of 3.5?
I've been unsuccessful in my attempts so far.
My script is as follows...
from Bio import SeqIO

input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv', 'w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')

for cur_record in SeqIO.parse(input_file, "fasta"):
    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')
    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)

    # My non-working attempt: split() and startswith() are passed ints rather
    # than substrings, and the last two lines assign to an expression, which
    # is a SyntaxError.
    splittedContig1 = contig.split(CCCCAAAA_count)
    splittedContig2 = contig.split(GGGGTTTT_count)
    cnt1 = len(splittedContig1) - 1
    cnt2 = len(splittedContig2)
    cnt1 + sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)]) = CCCCAAAA_count
    cnt2 + sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)]) = GGGGTTTT_count

    output_line = '%s\t%i\t%i\n' % (contig, CCCCAAAA_count, GGGGTTTT_count)
    output_file.write(output_line)

output_file.close()
input_file.close()
You can use split and a startswith list comprehension as follows:
contig = "CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase = "CCCCAAAA"
halfBase = "CCCC"

splittedContig = contig.split(splitbase)
cnt = len(splittedContig) - 1
print(cnt + sum([0.5 for e in splittedContig if e.startswith(halfBase)]))
Output:
3.5
split the string on CCCCAAAA; this gives a list from which every CCCCAAAA has been removed
len(splittedContig) - 1 gives the number of occurrences of CCCCAAAA
among the split elements, look for ones that start with CCCC; for each one found, add 0.5 to the count (the sketch below folds this into your Biopython loop)
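Folding this back into your script might look like the following sketch; note the %.1f format in place of your %i, since the counts are no longer integers (the file and column names are carried over from your script):

from Bio import SeqIO

def half_count(seq, full, half):
    # Occurrences of `full`, plus 0.5 for each leftover fragment starting with `half`.
    parts = seq.split(full)
    return len(parts) - 1 + sum(0.5 for part in parts if part.startswith(half))

with open('telomer.test.fasta') as input_file, \
     open('telomer.test1.out.tsv', 'w') as output_file:
    output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')
    for cur_record in SeqIO.parse(input_file, "fasta"):
        seq = str(cur_record.seq)
        output_file.write('%s\t%.1f\t%.1f\n' % (cur_record.name,
                                                half_count(seq, 'CCCCAAAA', 'CCCC'),
                                                half_count(seq, 'GGGGTTTT', 'GGGG')))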

Create dictionary and see if key always has same value

If I had a file of lines starting with a number followed by some text, how could I check whether each number is always followed by the same text or by different texts? For example:
0 Brucella abortus Brucellaceae
0 Brucella ceti Brucellaceae
0 Brucella canis Brucellaceae
0 Brucella ceti Brucellaceae
So here, I'd like to know that 0 is followed by 3 different "types" of text.
Ideally I could read a file into a python script that would have output something like this:
1:250
2:98
3:78
4:65
etc.
The first number would be the number of different "texts", and the number after the : would be how many of the leading numbers have that many different texts.
I have the following script, which calculates how many times a "text" is found under different numbers, so I'm wondering how to reverse it so that I know how many different texts each number has, and how many numbers share each such count. This script reads the file of numbers and "texts" into a dictionary, but I'm unsure how to manipulate this dictionary to get what I want.
#!/usr/bin/env python
# Dictionary mapping each number (cluster) to species, genus, family entries
fileIn = 'usearchclusternumgenus.txt'

d = {}
with open(fileIn, "r") as f:
    for line in f:
        clu, gen, spec, fam = line.split()
        d.setdefault(clu, []).append(spec)

# Iterate through and find out how many times each value occurs
vals = {}  # A dictionary to store how often each value occurs.
for i in d.values():
    for j in set(i):  # Convert to a set to remove duplicates
        vals[j] = 1 + vals.get(j, 0)  # If we've seen this value, increment the count;
                                      # otherwise we get the default of 0 and increment that
#print(vals)

# Iterate through each possible frequency and find how many values have that count.
counts = {}  # A dictionary to store the final frequencies.
# We will iterate from 0 (which is a valid count) to the maximum count.
for i in range(0, max(vals.values()) + 1):
    # Find all values that have the current frequency, count them,
    # and add them to the frequency dictionary.
    counts[i] = len([x for x in vals.values() if x == i])

for key in sorted(counts.keys()):
    if counts[key] > 0:
        print(key, ":", counts[key])
Use a collections.defaultdict() object with a set as the factory to track different lines, then print out the sizes of the collected sets:
from collections import defaultdict

unique_clu = defaultdict(set)
with open(fileIn) as infh:
    for line in infh:
        clu, gen, spec, rest = line.split(None, 3)
        unique_clu[clu].add(spec)

for key in sorted(unique_clu):
    count = len(unique_clu[key])
    if count:
        print('{}:{}'.format(key, count))
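That prints, per number, how many distinct texts follow it. If you also want the output keyed the other way around, as in your 1:250 example (how many numbers have a given count of distinct texts), you can feed the set sizes to a collections.Counter; a small sketch building on the unique_clu mapping above:

from collections import Counter

# How many numbers have 1 distinct text, 2 distinct texts, and so on.
size_freq = Counter(len(texts) for texts in unique_clu.values())
for n_texts in sorted(size_freq):
    print('{}:{}'.format(n_texts, size_freq[n_texts]))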

count for number of repetition of values in a list and generate outfile

I have a file with a few columns, like:
PAIR 1MFK 1 URANIUM 82 HELIUM 112 2.5506
PAIR 2JGH 2 PLUTONIUM 98 POTASSIUM 88 5.3003
PAIR 345G 3 SODIUM 23 CARBON 14 1.664
PAIR 4IG5 4 LITHIUM 82 ARGON 99 2.5506
PAIR 234G 5 URANIUM 99 KRYPTON 89 1.664
Now what I want to do is read the last column, count the repetitions of each value, and generate an output file containing two columns, 'VALUE' and 'NO OF TIMES REPEATED'.
I have tried like:
inp = ('filename'.'r').read().strip().replace('\t', ' ').split('\n')
from collections import defaultdict
D = defaultdict(line)
for line in map(str.split, inp):
    k = line[-1]
    D[k].append(line)
I'm stuck here. Please help!
There are a number of issues with the code as posted: the file should be opened with open('filename', 'r'), and the argument to defaultdict should be list, not line. Here is a fixed-up version of your code:
from collections import defaultdict

D = defaultdict(list)
for line in open('filename', 'r'):
    k = line.split()[-1]
    D[k].append(line)

print('VALUE NO TIMES REPEATED')
print('----- -----------------')
for value, lines in D.items():
    print('%-6s %d' % (value, len(lines)))
Another way to do it is to use collections.Counter to conveniently sum the number of repetitions. That lets you simplify the code a bit:
from collections import Counter

D = Counter()
for line in open('filename', 'r'):
    k = line.split()[-1]
    D[k] += 1

print('VALUE NO TIMES REPEATED')
print('----- -----------------')
for value, count in D.items():
    print('%-6s %d' % (value, count))
Now what I want to do is read the last column, count the repetitions of each value, and generate an output file containing two columns, 'VALUE' and 'NO OF TIMES REPEATED'.
So use collections.Counter to count the number of times each value appears, not a defaultdict. (It's not at all clear what you're trying to do with the defaultdict, and your initialization won't work, anyway; defaultdict is constructed with a callable that will create a default value. In your case, the default value you apparently had in mind is an empty list, so you would use list to initialize the defaultdict.) You don't need to store the lines to count them. The Counter counts them for you automatically.
Also, reading the entire file ahead of time is a bit ugly, since you can iterate over the file directly and get lines one at a time, which does part of the processing for you. In fact, you can do that iteration right inside the Counter constructor.
Here is a complete solution:
from collections import Counter

with open('input', 'r') as data:
    histogram = Counter(line.split('\t')[-1].strip() for line in data)

with open('output', 'w') as result:
    for item in histogram.items():
        result.write('%s\t%s\n' % item)
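With the five sample lines above, and assuming the columns are tab-separated, the output file would contain one line per distinct value (in first-seen order on Python 3.7+):

2.5506	2
5.3003	1
1.664	2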
