I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). Meaning that the second column (sseqid) is repeating with different values in the 11th and 12th columns, which areevalueandbitscore, respectively.
The ones that I would like to get are having the lowestevalueand the highestbitscore(whenevalues are the same, the rest of the columns can be ignored and the data is down below).
So, I have made a short code which uses the second columns as a key for the dictionary. I can get five different items from the second column with lists of qseqid+evalueandqseqid+bitscore.
Here is the code:
#!usr/bin/python
filename = "data.txt"
readfile = open(filename,"r")
d = dict()
for i in readfile.readlines():
i = i.strip()
i = i.split("\t")
d.setdefault(i[1], []).append([i[0],i[10]])
d.setdefault(i[1], []).append([i[0],i[11]])
for x in d:
print(x,d[x])
readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
Thedata.txtfile (including the header row and with»representing tab characters)
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to this manually, but for comparison I'll show how it can be done using the pandas library which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too.
import pandas as pd
df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort(["evalue", "bitscore"],ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
#!usr/bin/python
import csv
DATA = "data.txt"
class Sequence:
def __init__(self, row):
self.qseqid = row[0]
self.sseqid = row[1]
self.pident = float(row[2])
self.length = int(row[3])
self.mismatch = int(row[4])
self.gapopen = int(row[5])
self.qstart = int(row[6])
self.qend = int(row[7])
self.sstart = int(row[8])
self.send = int(row[9])
self.evalue = float(row[10])
self.bitscore = float(row[11])
def __str__(self):
return (
"{qseqid}\t"
"{sseqid}\t"
"{pident}\t"
"{length}\t"
"{mismatch}\t"
"{gapopen}\t"
"{qstart}\t"
"{qend}\t"
"{sstart}\t"
"{send}\t"
"{evalue}\t"
"{bitscore}"
).format(**self.__dict__)
def entries(fname, header_rows=1, dtype=list, **kwargs):
with open(fname) as inf:
incsv = csv.reader(inf, **kwargs)
# skip header rows
for i in range(header_rows):
next(incsv)
for row in incsv:
yield dtype(row)
def main():
bestseq = {}
for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
# see if a sequence with the same sseqid already exists
prev = bestseq.get(seq.sseqid, None)
if (
prev is None
or seq.evalue < prev.evalue
or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
):
bestseq[seq.sseqid] = seq
# display selected sequences
keys = sorted(bestseq)
for key in keys:
print(bestseq[key])
if __name__ == "__main__":
main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using thepandaslibrary, it's quite possible to do what you want without resorting to third-party modules. The following uses thecollections.defaultdictclass to facilitate creation of dictionaries of variable-length lists of records. The use of theAttrDictclass is optional, but it makes accessing the fields of each dictionary-based records easier and is less awkward-looking than the usualdict['fieldname']syntax otherwise required.
import csv
from collections import defaultdict, namedtuple
from itertools import imap
from operator import itemgetter
data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)
# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
(itemgetter(col.strip()), 1)) for col in columns]
def comparer(left, right):
for fn, mult in comparers:
result = cmp(fn(left), fn(right))
if result:
return mult * result
else:
return 0
return sorted(items, cmp=comparer)
# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
with open(data_file_name, 'rb') as data_file:
reader = csv.DictReader(data_file, delimiter=DELIMITER)
format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
for rec in (AttrDict(r) for r in reader):
# Convert the two sort fields to numeric values for proper ordering.
rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
ssqeid_dict[rec.sseqid].append(rec)
for ssqeid in sorted(ssqeid_dict):
# Sort each group of recs with same ssqeid. The first record after sorting
# will be the one sought that has the lowest evalue and highest bitscore.
selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
print format_spec.format(**selected)
Output (»represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
filename = 'data.txt'
readfile = open(filename,'r')
d = dict()
sseqid=[]
lines=[]
for i in readfile.readlines():
sseqid.append(i.rsplit()[1])
lines.append(i.rsplit())
sorted_sseqid = sorted(set(sseqid))
sdqDict={}
key =None
for sorted_ssqd in sorted_sseqid:
key=sorted_ssqd
evalue=[]
bitscore=[]
qseid=[]
for line in lines:
if key in line:
evalue.append(line[10])
bitscore.append(line[11])
qseid.append(line[0])
sdqDict[key]=[qseid,evalue,bitscore]
print sdqDict
print 'TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1])
##I think you can do the list manipulation below to find out the qseqid
readfile.close()
Apologies for the long question, I have been trying to solve this bug but I cant work out what Im doing wrong! I have included an example of the data so you can see what Im working with.
I have data output from a BLAST search as below:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANfive
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 7 hits found
Cryptocephalus M00964:19:000000000-A4YV1:1:2110:23842:21326 99.6 250 1 0 125 374 250 1 1.00E-128 457
Cryptocephalus M00964:19:000000000-A4YV1:1:1112:19704:18005 85.37 246 36 0 90 335 246 1 4.00E-68 255
Cryptocephalus M00964:19:000000000-A4YV1:1:2106:14369:15227 77.42 248 50 3 200 444 245 1 3.00E-34 143
Cryptocephalus M00964:19:000000000-A4YV1:1:2102:5533:11928 78.1 137 30 0 3 139 114 250 2.00E-17 87.9
Cryptocephalus M00964:19:000000000-A4YV1:1:1110:28729:12868 81.55 103 19 0 38 140 104 2 6.00E-17 86.1
Cryptocephalus M00964:19:000000000-A4YV1:1:1113:11427:16440 78.74 127 27 0 3 129 124 250 6.00E-17 86.1
Cryptocephalus M00964:19:000000000-A4YV1:1:2110:12170:20594 78.26 115 25 0 3 117 102 216 1.00E-13 75
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANfive
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 10 hits found
Cryptocephalus M00964:19:000000000-A4YV1:1:2111:20990:19930 97.2 250 7 0 119 368 250 1 1.00E-118 424
Cryptocephalus M00964:19:000000000-A4YV1:1:1105:20676:23942 86.89 206 27 0 5 210 209 4 7.00E-61 231
Cryptocephalus M00964:19:000000000-A4YV1:1:1113:6534:23125 97.74 133 3 0 1 133 133 1 3.00E-60 230
Cryptocephalus M00964:21:000000000-A4WJV:1:2104:11955:19015 89.58 144 15 0 512 655 1 144 2.00E-46 183
Cryptocephalus M00964:21:000000000-A4WJV:1:1109:14814:10240 88.28 128 15 0 83 210 11 138 2.00E-37 154
Cryptocephalus M00964:21:000000000-A4WJV:1:1105:4530:13833 79.81 208 42 0 3 210 211 4 6.00E-37 152
Cryptocephalus M00964:19:000000000-A4YV1:1:2108:13133:14967 98.7 77 1 0 1 77 77 1 2.00E-32 137
Cryptocephalus M00964:19:000000000-A4YV1:1:1109:14328:3682 100 60 0 0 596 655 251 192 1.00E-24 111
Cryptocephalus M00964:19:000000000-A4YV1:1:1105:19070:25181 100 53 0 0 1 53 53 1 8.00E-21 99
Cryptocephalus M00964:19:000000000-A4YV1:1:1109:20848:27419 100 28 0 0 1 28 28 1 6.00E-07 52.8
# BLASTN 2.2.29+
# Query: Cryptocephalus cynarae
# Database: SANfive
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 2 hits found
Cryptocephalus M00964:21:000000000-A4WJV:1:2107:12228:15885 90.86 175 16 0 418 592 4 178 5.00E-62 235
Cryptocephalus M00964:21:000000000-A4WJV:1:1110:20463:5044 84.52 168 26 0 110 277 191 24 2.00E-41 167
and I have saved this as a csv, again shown below
# BLASTN 2.2.29+,,,,,,,,,,,
# Query: Cryptocephalus androgyne,,,,,,,,,,,
# Database: SANfive,,,,,,,,,,,
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 7 hits found,,,,,,,,,,,
Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:23842:21326,99.6,250,1,0,125,374,250,1,1.00E-128,457
Cryptocephalus,M00964:19:000000000-A4YV1:1:1112:19704:18005,85.37,246,36,0,90,335,246,1,4.00E-68,255
Cryptocephalus,M00964:19:000000000-A4YV1:1:2106:14369:15227,77.42,248,50,3,200,444,245,1,3.00E-34,143
Cryptocephalus,M00964:19:000000000-A4YV1:1:2102:5533:11928,78.1,137,30,0,3,139,114,250,2.00E-17,87.9
Cryptocephalus,M00964:19:000000000-A4YV1:1:1110:28729:12868,81.55,103,19,0,38,140,104,2,6.00E-17,86.1
Cryptocephalus,M00964:19:000000000-A4YV1:1:1113:11427:16440,78.74,127,27,0,3,129,124,250,6.00E-17,86.1
Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:12170:20594,78.26,115,25,0,3,117,102,216,1.00E-13,75
# BLASTN 2.2.29+,,,,,,,,,,,
# Query: Cryptocephalus aureolus,,,,,,,,,,,
# Database: SANfive,,,,,,,,,,,
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found,,,,,,,,,,,
Cryptocephalus,M00964:19:000000000-A4YV1:1:2111:20990:19930,97.2,250,7,0,119,368,250,1,1.00E-118,424
Cryptocephalus,M00964:19:000000000-A4YV1:1:1105:20676:23942,86.89,206,27,0,5,210,209,4,7.00E-61,231
Cryptocephalus,M00964:19:000000000-A4YV1:1:1113:6534:23125,97.74,133,3,0,1,133,133,1,3.00E-60,230
Cryptocephalus,M00964:21:000000000-A4WJV:1:2104:11955:19015,89.58,144,15,0,512,655,1,144,2.00E-46,183
Cryptocephalus,M00964:21:000000000-A4WJV:1:1109:14814:10240,88.28,128,15,0,83,210,11,138,2.00E-37,154
Cryptocephalus,M00964:21:000000000-A4WJV:1:1105:4530:13833,79.81,208,42,0,3,210,211,4,6.00E-37,152
Cryptocephalus,M00964:19:000000000-A4YV1:1:2108:13133:14967,98.7,77,1,0,1,77,77,1,2.00E-32,137
Cryptocephalus,M00964:19:000000000-A4YV1:1:1109:14328:3682,100,60,0,0,596,655,251,192,1.00E-24,111
Cryptocephalus,M00964:19:000000000-A4YV1:1:1105:19070:25181,100,53,0,0,1,53,53,1,8.00E-21,99
Cryptocephalus,M00964:19:000000000-A4YV1:1:1109:20848:27419,100,28,0,0,1,28,28,1,6.00E-07,52.8
I have designed a short script that goes through the percentage identity and if it is above a threshold finds the queryID and adds it to a list before removing duplicates from the list.
import csv
from pylab import plot,show
#Making a function to see if a string is a number or not
def is_number(s):
try:
float(s)
return True
except ValueError:
return False
#Importing the CSV file, using sniffer to check the delimiters used
#In the first 1024 bytes
ImportFile = raw_input("What is the name of your import file? ")
csvfile = open(ImportFile, "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
#Finding species over 98%
Species98 = []
Species95to97 = []
Species90to94 = []
Species85to89 = []
Species80to84 = []
Species75to79 = []
SpeciesBelow74 = []
for line in reader:
if is_number(line[2])== True:
if float(line[2])>=98:
Species98.append(line[0])
elif 97>=float(line[2])>=95:
Species95to97.append(line[0])
elif 94>=float(line[2])>=90:
Species90to94.append(line[0])
elif 89>=float(line[2])>=85:
Species85to89.append(line[0])
elif 84>=float(line[2])>=80:
Species80to84.append(line[0])
elif 79>=float(line[2])>=75:
Species75to79.append(line[0])
elif float(line[2])<=74:
SpeciesBelow74.append(line[0])
def f7(seq):
seen = set()
seen_add = seen.add
return [ x for x in seq if x not in seen and not seen_add(x)]
Species98=f7(Species98)
print len(Species98), "species over 98"
Species95to97=f7(Species95to97) #removing duplicates
search_set = set().union(Species98)
Species95to97 = [x for x in Species95to97 if x not in search_set]
print len(Species95to97), "species between 95-97"
Species90to94=f7(Species90to94)
search_set = set().union(Species98, Species95to97)
Species90to94 = [x for x in Species90to94 if x not in search_set]
print len(Species90to94), "species between 90-94"
Species85to89=f7(Species85to89)
search_set = set().union(Species98, Species95to97, Species90to94)
Species85to89 = [x for x in Species85to89 if x not in search_set]
print len(Species85to89), "species between 85-89"
Species80to84=f7(Species80to84)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89)
Species80to84 = [x for x in Species80to84 if x not in search_set]
print len(Species80to84), "species between 80-84"
Species75to79=f7(Species75to79)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89,Species80to84)
Species75to79 = [x for x in Species75to79 if x not in search_set]
print len(Species75to79), "species between 75-79"
SpeciesBelow74=f7(SpeciesBelow74)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89,Species80to84, Species75to79)
SpeciesBelow74 = [x for x in SpeciesBelow74 if x not in search_set]
print len(SpeciesBelow74), "species below 74"
#Finding species 95-97%
The script works perfectly most of the time but every so often I get the error shown below
File "FindingSpeciesRepresentation.py", line 35, in <module>
if is_number(line[2])== "True":
IndexError: list index out of range
But if I change the script so it prints line[2] it prints all the identities as I would expect. Do you have any idea what could be going wrong? Again apologies for the wall of data.
This has been partly taken from my earlier question: Extracting BLAST output columns in CSV form with python