Counting DNA sequences with python/biopython

My script below counts the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' in a standard FASTA file:
>contig00001
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT
The script counts the CCCCAAAA sequence here 3 times
CCCCAAAACCCCAAAACCCCAAAA(CCCC not counted)
Can somebody please advise how I would include the CCCC sequence at the end as a half count to return a value of 3.5 for this.
I've been unsuccessful in my attempts so far.
My script is as follows...
from Bio import SeqIO

input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv', 'w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')

for cur_record in SeqIO.parse(input_file, "fasta"):
    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')
    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)
    splittedContig1 = contig.split(CCCCAAAA_count)
    splittedContig2 = contig.split(GGGGTTTT_count)
    cnt1 = len(splittedContig1) - 1
    cnt2 = len(splittedContig2)
    cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
    cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count
    output_line = '%s\t%i\t%i\n' % \
        (CONTIG, CCCCAAAA_count, GGGGTTTT_count)
    output_file.write(output_line)

output_file.close()
input_file.close()

You can use split and a startswith list comprehension as follows:
contig = "CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase = "CCCCAAAA"
halfBase = "CCCC"
splittedContig = contig.split(splitbase)
cnt = len(splittedContig) - 1
print(cnt + sum([0.5 for e in splittedContig if e.startswith(halfBase)]))
Output:
3.5
Split the string on CCCCAAAA. This gives a list whose elements have the CCCCAAAA separators removed.
The length of the split list minus 1 gives the number of occurrences of CCCCAAAA.
Among the split elements, look for ones that start with CCCC. For each occurrence found, add 0.5 to the count.
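Folding this back into the original Biopython script could look something like the sketch below. half_count is a helper name introduced here for illustration; note that, like the snippet above, it scores 0.5 for any split piece that begins with the half motif, not only a trailing one.
from Bio import SeqIO

def half_count(seq, full, half):
    # Count whole motifs, then add 0.5 for every split piece that
    # begins with the half motif (e.g. a trailing partial repeat).
    pieces = str(seq).split(full)
    return len(pieces) - 1 + sum(0.5 for p in pieces if p.startswith(half))

with open('telomer.test.fasta') as input_file, \
     open('telomer.test1.out.tsv', 'w') as output_file:
    output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')
    for record in SeqIO.parse(input_file, 'fasta'):
        ccccaaaa = half_count(record.seq, 'CCCCAAAA', 'CCCC')
        ggggtttt = half_count(record.seq, 'GGGGTTTT', 'GGGG')
        # %g prints whole counts without a decimal point and half counts as .5
        output_file.write('%s\t%g\t%g\n' % (record.name, ccccaaaa, ggggtttt))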

Related

How can I clean this data for easier visualizing?

I'm writing a program to read a set of data rows and quantify matching sets. I have the code below; however, I would like to cut or filter out the trailing numbers, which keep rows from being recognized as a match.
import collections

a = "test.txt" #This can be changed to a = input("What's the filename? ", )
line_file = open(a, "r")
print(line_file.readable()) #Readable check.
#print(line_file.read()) #Prints each individual line.

#Code for quantity counter.
counts = collections.Counter() #Creates a new counter.
with open(a) as infile:
    for line in infile:
        for number in line.split():
            counts.update((number,))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()
This is what it outputs; however, I'd like it to not read the numbers at the end and to pair the matching sets accordingly.
A2-W-FF-DIN-22: x1
A2-FF-DIN: x1
A2-W-FF-DIN-11: x1
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
C1-GH-KK-LOP: x1
What I'm aiming for is for it to ignore the "-77" here and instead count the total as x3:
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
Split each element on the dashes and check whether the last element is a number. If so, remove it, then continue on.
from collections import Counter

def trunc(s):
    parts = s.split('-')
    if parts[-1].isnumeric():
        return '-'.join(parts[:-1])
    return s

with open('data.txt') as f:
    data = [trunc(x.rstrip()) for x in f.readlines()]

counts = Counter(data)
for k, v in counts.items():
    print(k, v)
Output
A2-W-FF-DIN 2
A2-FF-DIN 1
B12-H-BB-DD 3
C1-GH-KK-LOP 1
You could use a regular expression to create a matching group for a digit suffix. If each number is its own string, e.g. "A2-W-FF-DIN-11", then a regular expression like (?P<base>.+?)(?:-(?P<suffix>\d+))?\Z could work.
Here, (?P<base>.+?) is a non-greedy match of any character except for a newline grouped under the name "base", (?:-(?P<suffix>\d+))? matches 0 or 1 occurrences of something like -11 occurring at the end of the "base" group and puts the digits in a group named "suffix", and \Z is the end of the string.
This is what it does in action:
>>> import re
>>> regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
>>> regex.match("A2-W-FF-DIN-11").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': '11'}
>>> regex.match("A2-W-FF-DIN").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': None}
So you can see that, in this instance, whether or not the string has a digit suffix, the base is the same.
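Equivalently, if you only need the base, you could strip the suffix with re.sub instead of pulling out the named group (a sketch of the same pattern):
>>> import re
>>> re.sub(r"-\d+\Z", "", "A2-W-FF-DIN-11")
'A2-W-FF-DIN'
>>> re.sub(r"-\d+\Z", "", "A2-W-FF-DIN")
'A2-W-FF-DIN'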
All together, here's a self-contained example of how it might be applied to data like this:
import collections
import re

regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

sample_data = [
    "A2-FF-DIN",
    "A2-W-FF-DIN-11",
    "A2-W-FF-DIN-22",
    "B12-H-BB-DD",
    "B12-H-BB-DD",
    "B12-H-BB-DD-77",
    "C1-GH-KK-LOP"
]

counts = collections.Counter()

# Iterates through the data and updates the counter.
for datum in sample_data:
    # Isolates the base of the number from any digit suffix.
    number = regex.match(datum)["base"]
    counts.update((number,))

# Prints each number and how many instances were found.
for key, count in counts.items():
    print(f"{key}: x{count}")
For which the output is
A2-FF-DIN: x1
A2-W-FF-DIN: x2
B12-H-BB-DD: x3
C1-GH-KK-LOP: x1
Or in the example code you provided, it might look like this:
import collections
import re

# Compiles a regular expression to match the base and suffix
# of a number in the file.
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

a = "test.txt"
line_file = open(a, "r")
print(line_file.readable()) # Readable check.

# Creates a new counter.
counts = collections.Counter()
with open(a) as infile:
    for line in infile:
        for number in line.split():
            # Isolates the base match of the number.
            counts.update((regex.match(number)["base"],))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()

How to get the sequence counts (in fasta) with conditions using python?

I have a FASTA file (FASTA is a format in which each header line starts with > and is followed by the sequence line corresponding to that header). I want to get the counts of sequences matching TRINITY, and the total number of sequences starting with >K after each >TRINITY sequence. I was able to get the counts for the >TRINITY sequences, but I'm not sure how to get the counts of >K sequences for the corresponding >TRINITY group. How can I get this done in Python?
myfasta.fasta:
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>K00363:119:HTJ23BBXX:1:2228:9678:46223_(reversed) 1:N:0:CGATGTAT
TAGATTTAAAATAGACGCTTCCATAGA
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>TRINITY_DN555_c0_g1_i1
>K00363:119:HTJ23BBXX:1:2228:9658:46188_(reversed) 1:N:0:CGATGTAT
CGATGCTAGATTTAAAATAGACG
>K00363:119:HTJ23BBXX:1:2106:15260:10387_(reversed) 1:N:0:CGATGTAT
TTAAAATAGACGCTTCCATAGAGA
Result I want:
reference reference_counts Corresponding_K_sequences
>TRINITY_DN12824_c0_g1_i1 2 3
>TRINITY_DN555_c0_g1_i1 1 2
Here is the code I have written, which only accounts for the >TRINITY sequence counts; I couldn't extend it to also count the corresponding >K sequences, so any help would be appreciated.
To Run:
python code.py myfasta.fasta output.txt
import sys
import os
from Bio import SeqIO
from collections import defaultdict

filename = sys.argv[1]
outfile = sys.argv[2]

dedup_records = defaultdict(list)
for record in SeqIO.parse(filename, "fasta"):
    #print(record)
    #print(record.id)
    if record.id.startswith('TRINITY'):
        #print(record.id)
        # Use the sequence as the key and then have a list of id's as the value
        dedup_records[str(record.seq)].append(record.id)
        #print(dedup_records)

with open(outfile, 'w') as output:
    # to get the counts of duplicated TRINITY ids (sorted order)
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        #output.write("{} {}\n".format(ids, len(ids)))
        print(ids, len(ids))
You have the correct kind of thinking but you need to keep track of the last header that starts with "TRINITY" and slightly alter your structure:
from Bio import SeqIO
from collections import defaultdict

TRIN, d = None, defaultdict(lambda: [0, 0])

for r in SeqIO.parse('myfasta.fasta', 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1
    elif r.id.startswith('K'):
        if TRIN:
            d[TRIN][1] += 1

print('reference\treference_counts\tCorresponding_K_sequences')
for k, v in d.items():
    print('{}\t{}\t{}'.format(k, v[0], v[1]))
Outputs:
reference reference_counts Corresponding_K_sequences
TRINITY_DN12824_c0_g1_i1 2 3
TRINITY_DN555_c0_g1_i1 1 2
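If you want to keep the original command-line interface (python code.py myfasta.fasta output.txt) and write the table to the output file instead of printing it, here is a minimal sketch along the same lines:
import sys
from collections import defaultdict
from Bio import SeqIO

filename, outfile = sys.argv[1], sys.argv[2]

# d[reference] = [reference_count, corresponding_K_count]
TRIN, d = None, defaultdict(lambda: [0, 0])
for r in SeqIO.parse(filename, 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1
    elif r.id.startswith('K') and TRIN:
        d[TRIN][1] += 1

with open(outfile, 'w') as output:
    output.write('reference\treference_counts\tCorresponding_K_sequences\n')
    for k, v in d.items():
        output.write('{}\t{}\t{}\n'.format(k, v[0], v[1]))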

Find first line of text according to value in Python

How can I search a "file.txt" list of "latitude,longitude" coordinates in Python for a value, find the first matching line, and get the 3 rows above and 3 rows below it?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut, and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
You can try:
with open("text_filter.txt") as f:
    text = f.readlines() # read text lines to list

filter = "37.0459"
match = [i for i, x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
    if len(text) >= match[0] + 3: # if list has 3 items after filter, print it
        print("".join(text[match[0]:match[0] + 3]).strip())
    print(text[match[0]].strip())
    if match[0] >= 3: # if list has 3 items before filter, print it
        print("".join(text[match[0] - 3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
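If you would rather print the seven lines in file order (and clamp the window at the file boundaries), a small variant of the same idea (variable names here are just illustrative):
with open("file.txt") as f:
    text = f.readlines()

value = "37.0459"
match = [i for i, x in enumerate(text) if value in x]
if match:
    i = match[0]
    start = max(i - 3, 0)        # don't run past the start of the file
    end = min(i + 4, len(text))  # don't run past the end of the file
    print("".join(text[start:end]).strip())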
You can use pandas to import the data into a dataframe and then easily manipulate it. As per your question, the value to check is not an exact match, and therefore I have converted it to a string.
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"]) # imports text file as dataframe
value_to_check = 37.0459 # user defined

for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break

print(data.iloc[i - 3:i + 4, :])
Output:
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
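The same lookup can also be written without the explicit loop, using a string prefix match on the latitude column; a sketch (check mask.any() first, since idxmax would otherwise just return the first row):
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])
value_to_check = 37.0459

# True for rows whose latitude starts with the value, compared as text.
mask = data["latitude"].astype(str).str.startswith(str(value_to_check))
if mask.any():
    i = mask.idxmax()  # index of the first matching row
    print(data.iloc[max(i - 3, 0):i + 4, :])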
A solution with iterators that only keeps the necessary lines in memory and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice

def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there is less than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
This solution will print the before and after elements even if there are fewer than 3 of them.
Also, I am using strings, as the question implies that you also want partial matches, i.e. 37.0459 will match 37.04597.
search_term = '37.04462'

with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines] # remove '\n'

for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break

left = 0
right = 0
for k in range(1, 4): # bcoz last one is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1

for i in range(index - left, index + right + 1): # bcoz last one is not included
    print(lines[i][0], lines[i][1])

How to extract and trim the fasta sequence using biopython

Hello everybody, I am new to Python and struggling to do a small task using Biopython. I have two files: one containing a list of IDs and an associated number, e.g.
id.txt
tr_F6LMO6_F6LMO6_9LE 25
tr_F6ISE0_F6ISE0_9LE 17
tr_F6HSF4_F6HSF4_9LE 27
tr_F6PLK9_F6PLK9_9LE 19
tr_F6HOT8_F6HOT8_9LE 29
The second file contains a large set of FASTA sequences, e.g. below:
fasta_db.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
MLAPETRRKRLFSLIFLCTILTTRDLLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTN
PANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPIT
EKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>tr|F6ISE0|F6ISE0_9LEHG peptidase domain protein OMat str.
MPILKVAFVSFVLLVFSLPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASS
GTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>tr|F6HSF4|F6HSF4_9LEHG hypothetical protein,
MNLRSYIREIQVGLLCILVFLMSLYLLYFESKSRGASVKEILGNVSFRYKTAQRKFPDRM
LWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>tr|F6PLK9|F6PLK9_9LEHG Uncharacterized protein mano str.
MRKITGSYSKISLLTLLFLIGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRD
RKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
I want to match the IDs from the first file against the second file and print the matched sequences to a new file after removing the given length from the start (positions 1 to 25 in the example).
Example output (25, the value associated with the ID in the first file, amino acids removed from the start when the ID matches):
fasta_pruned.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSL
PYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYV
PGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
The Biopython cookbook was way above my head, being new to Python programming. Thanks for any help you can give.
I tried and messed up. Here it is:
from Bio import SeqIO
from Bio import Seq

f1 = open('fasta_pruned.fasta', 'w')

lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        split_IDlength = line.strip().split(' ')
        lengthdict[split_IDlength[0]] = split_IDlength[1]

with open("species.fasta", "rU") as spe:
    for record in SeqIO.parse(spe, "fasta"):
        if record[0] == '>':
            split_header = line.split('|')
            accession_ID = split_header[1]
            if accession_ID in lengthdict:
                f1.write(str(seq_record.id) + "\n")
                f1.write(str(seq_record_seq[split_IDlength[1]-1:]))
f1.close()
Your code has almost everything except for a couple of small things which prevent it from giving the desired output:
Your file id.txt has two spaces between the id and the location. You take the 2nd element which would be empty in this case.
When the file is read, the values are interpreted as strings, but you want the position to be an integer:
lengthdict[split_IDlength[0]] = int(split_IDlength[-1])
Your IDs are very similar but not identical; the only identical part is the 6-character accession, which could be used to map the two files (double-check that before you assume it works). Having identical keys makes mapping much easier.
f1 = open('fasta_pruned.fasta', 'w')

fasta = dict()
with open("species.fasta", "rU") as spe:
    for record in SeqIO.parse(spe, "fasta"):
        fasta[record.id.split('|')[1]] = record

lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        split_IDlength = line.strip().split(' ')
        lengthdict[split_IDlength[0].split('_')[1]] = int(split_IDlength[-1])

for k, v in lengthdict.items():
    if fasta.get(k) is None:
        continue
    print('>' + k)
    print(fasta[k].seq[v:])
    f1.write('>{}\n'.format(k))
    f1.write(str(fasta[k].seq[v:]) + '\n')
f1.close()
Output:
>F6LMO6
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>F6ISE0
LPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASSGTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>F6HSF4
YFESKSRGASVKEILGNVSFRYKTAQRKFPDRMLWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>F6PLK9
IGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRDRKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
>F6HOT8
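If you also want to keep the full original headers and wrap the trimmed sequences into fixed-width lines, as in the fasta_pruned.fasta you showed, here is a sketch building on the same accession mapping (the 60-character line width is an assumption):
import textwrap
from Bio import SeqIO

# Map the middle accession (e.g. F6LMO6) to the number of residues to trim.
lengths = {}
with open('seqid_len.txt') as seqlengths:
    for line in seqlengths:
        seq_id, length = line.split()
        lengths[seq_id.split('_')[1]] = int(length)

with open('species.fasta') as spe, open('fasta_pruned.fasta', 'w') as out:
    for record in SeqIO.parse(spe, 'fasta'):
        accession = record.id.split('|')[1]
        if accession in lengths:
            out.write('>{}\n'.format(record.description))  # full header line
            trimmed = str(record.seq)[lengths[accession]:]
            out.write(textwrap.fill(trimmed, width=60) + '\n')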

How to sort a large number of lists to get a top 10 of the longest lists

So I have a text file with around 400,000 lists that mostly look like this.
100005 127545 202036 257630 362970 376927 429080
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 296858 300258 341525 348922 359832 365744
382502 390538 410857 433453 479170 489980 540746
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 300258 341525 348922 359832 365744 382502
So far I have a for loop that goes line by line and turns the current line into a temporary list.
How would I create a top-ten list of the lists with the most elements in the whole file?
This is the code I have now:
file = open('node.txt', 'r')
adj = {}
top_ten = []
at_least_3 = 0

for line in file:
    data = line.split()
    adj[data[0]] = data[1:]
And this is what one of the lists looks like:
['99995', '110038', '330533', '333808', '344852', '376948', '470766', '499315']
# collect the lines
lines = []
with open("so.txt") as f:
    for line in f:
        # split each line into a list
        lines.append(line.split())

# sort the lines by length, descending
lines = sorted(lines, key=lambda x: -len(x))

# print the first 10 lines
print(lines[:10])
Why not use collections to display the top 10? i.e.:
import re
import collections
file = open('numbers.txt', 'r')
content = file.read()
numbers = re.findall(r"\d+", content)
counter = collections.Counter(numbers)
print(counter.most_common(10))
When wanting to count and then find the one(s) with the highest counts, collections.Counter comes to mind:
from collections import Counter

lists = Counter()
with open('node.txt', 'r') as file:
    for line in file:
        values = line.split()
        lists[tuple(values)] = len(values)

print('Length Data')
print('====== ====')
for values, length in lists.most_common(10):
    print('{:2d} {}'.format(length, list(values)))
Output (using sample file data):
Length Data
====== ====
10 ['191044', '246142', '265214', '290507', '300258', '341525', '348922', '359832', '365744', '382502']
10 ['191044', '246142', '265214', '290507', '296858', '300258', '341525', '348922', '359832', '365744']
10 ['10001', '27638', '51569', '88226', '116422', '126227', '159947', '162938', '184977', '188045']
7 ['382502', '390538', '410857', '433453', '479170', '489980', '540746']
7 ['100005', '127545', '202036', '257630', '362970', '376927', '429080']
Use a for loop and max() maybe? You say you've got a for loop that's placing the values into a temp array. From that you could use "max()" to pick out the largest value and put that into a list.
As an open for loop, something like appending max() to a new list:
newlist = []
for x in data:
    largest = max(x)
    newlist.append(largest)
Or as a list comprehension:
newlist = [max(x) for x in data]
Then from there you have to do the same process on the new list(s) until you get to the desired top 10 scenario.
EDIT: I've just realised that I've misread your question. You want to get the lists with the most elements, not the highest values. OK.
len() is a good one for this.
for x in data:
    if len(templist) > x:
        newlist.append(templist)
That would give you the current highest and from there you could create a top 10 list of lengths or of the temp lists themselves, or both.
If your data is really as shown with each number the same length, then I would make a dictionary with key = line, value = length, get the top value / key pairs in the dictionary and voila. Sounds easy enough.
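A minimal sketch of that dictionary idea (the file name is taken from the question; heapq.nlargest is one convenient way to take the top 10, and note that using the line itself as the key collapses duplicate lines):
import heapq

# key = the raw line, value = how many numbers it contains
lengths = {}
with open('node.txt') as f:
    for line in f:
        lengths[line.strip()] = len(line.split())

# Top 10 (line, length) pairs by length, longest first.
top_ten = heapq.nlargest(10, lengths.items(), key=lambda kv: kv[1])
for line, length in top_ten:
    print(length, line)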
