How to print a dictionary's totals in increasing order? - python

My program is a calorie counter that reads the foods and their calories from a text file, and each person's name + the foods they ate from standard input. At the end, the program should output the names + total calories (starting with the lowest first). The output values are printing correctly, although they aren't in the correct order.
Does anyone know why this is happening?
import sys
file = "foods.txt"
line = sys.stdin
fridge = {}
with open(file, "r") as f:
    for a in f:
        a = a.strip().split()
        food = " ".join(a[:-1])
        calorie = a[-1]
        fridge[food] = calorie
for i in line:
    i = i.strip().split(",")
    name = i[0]
    foods = i[1:]
    total_calories = 0
    for k in foods:
        calorie = fridge.get(k)
        if k in fridge:
            total_calories += int(calorie)
    print("{} : {}".format(name, total_calories))
#My_Output
#joe : 2375
#mary : 785
#sandy : 2086
#trudy : 875
#Expected_Output
#trudy : 875
#mary : 985
#sandy : 2186
#joe : 2375
#foods.txt
#almonds 795
#apple pie 405
#asparagus 15
#avocdo 340
#banana 105
#blackberries 75
#blue cheese 100
#blueberries 80
#muffin 135
#blueberry pie 380
#broccoli 40
#butter 100
#cabbage 15
#carrot cake 385
#cheddar cheese 115
#cheeseburger 525
#cherry pie 410
#chicken noodle soup 75
#chocolate chip cookie 185
#cola 160
#cranberry juice 145
#croissant 235
#danish pastry 235
#egg 75
#grapefruit juice 95
#ice cream 375
#lamb 315
#lemon meringue pie 355
#lettuce 5
#macadamia nuts 960
#mayonnaise 100
#mixed grain bread 65
#orange juice 110
#potatoes 120
#pumpkin pie 320
#rice 230
#salmon 150
#spaghetti 190
#spinach 55
#strawberries 45
#taco 195
#tomatoes 25
#tuna 135
#veal 230
#waffles 205
#watermelon 50
#white bread 65
#wine 75
#yogurt 230
#zuchini 16
#sys.stdin
#joe,almonds,almonds,blue cheese,cabbage,mayonnaise,cherry pie,cola
#mary,apple pie,avocado,broccoli,butter,danish pastry,lettuce,apple
#sandy,zuchini,yogurt,veal,tuna,taco,pumpkin pie,macadamia nuts,brazil nuts
#trudy,waffles,waffles,waffles,chicken noodle soup,chocolate chip cookie

In the last statement of your code, instead of printing, append the values to a predefined list using list.append((name, total_calories)).
Then define a takeSecond function
def takeSecond(elem):
    return elem[1]
and sort the list using
l.sort(key=takeSecond)
Now you should get your required result.
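For reference, here is a minimal sketch of that change applied to the reading loop from the question, using the takeSecond function defined above (the list name results is just an illustrative choice):
results = []
for i in line:
    i = i.strip().split(",")
    name = i[0]
    foods = i[1:]
    total_calories = 0
    for k in foods:
        if k in fridge:
            total_calories += int(fridge[k])
    # collect the totals instead of printing them immediately
    results.append((name, total_calories))

# sort by total calories (the second element of each tuple), then print
results.sort(key=takeSecond)
for name, total_calories in results:
    print("{} : {}".format(name, total_calories))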


Finding Common Elements (Amazon SDE-1)

Given two lists V1 and V2 of sizes n and m respectively, return the list of elements common to both lists, in sorted order. Duplicates may appear in the output list.
Link to the problem : LINK
Example:
Input:
5
3 4 2 2 4
4
3 2 2 7
Output:
2 2 3
Explanation:
The first list is {3 4 2 2 4}, and the second list is {3 2 2 7}.
The common elements in sorted order are {2 2 3}
Expected Time complexity : O(N)
My code:
class Solution:
    def common_element(self, v1, v2):
        dict1 = {}
        ans = []
        for num1 in v1:
            dict1[num1] = 0
        for num2 in v2:
            if num2 in dict1:
                ans.append(num2)
        return sorted(ans)
Problem with my code:
Since lookups in a dictionary take constant time, my time complexity is reduced, but one of the hidden test cases is failing. My logic is very simple and straightforward, and everything seems to be on point. What's your take? Is the logic wrong, or is the question description missing some vital details?
New Approach
Now I am generating two hashmaps/dictionaries for the two arrays. If a number is present in both arrays, I check the minimum frequency and then append that number to the answer that many times.
class Solution:
    def common_element(self, arr1, arr2):
        dict1 = {}
        dict2 = {}
        ans = []
        for num1 in arr1:
            dict1[num1] = 0
        for num1 in arr1:
            dict1[num1] += 1
        for num2 in arr2:
            dict2[num2] = 0
        for num2 in arr2:
            dict2[num2] += 1
        for number in dict1:
            if number in dict2:
                minFreq = min(dict1[number], dict2[number])
                for _ in range(minFreq):
                    ans.append(number)
        return sorted(ans)
The code is outputting nothing for this test case
Input:
64920
83454 38720 96164 26694 34159 26694 51732 64378 41604 13682 82725 82237 41850 26501 29460 57055 10851 58745 22405 37332 68806 65956 24444 97310 72883 33190 88996 42918 56060 73526 33825 8241 37300 46719 45367 1116 79566 75831 14760 95648 49875 66341 39691 56110 83764 67379 83210 31115 10030 90456 33607 62065 41831 65110 34633 81943 45048 92837 54415 29171 63497 10714 37685 68717 58156 51743 64900 85997 24597 73904 10421 41880 41826 40845 31548 14259 11134 16392 58525 3128 85059 29188 13812.................
Its Correct output is:
4 6 9 14 17 19 21 26 28 32 33 42 45 54 61 64 67 72 77 86 93 108 113 115 115 124 129 133 135 137 138 141 142 144 148 151 154 160 167 173 174 192 193 195 198 202 205 209 215 219 220 221 231 231 233 235 236 238 239 241 245 246 246 247 254 255 257 262 277 283 286 290 294 298 305 305 307 309 311 312 316 319 321 323 325 325 326 329 329 335 338 340 341 350 353 355 358 364 367 369 378 385 387 391 401 404 405 406 406 410 413 416 417 421 434 435 443 449 452 455 456 459 460 460 466 467 469 473 482 496 503 .................
And Your Code's output is:
Please find the solution below:
def sorted_common_elemen(v1, v2):
    res = []
    for elem in v2:
        res.append(elem)
        v1.pop(0)
    return sorted(res)
Your code ignores the number of times a given element occurs in the list. I think this is a good way to fix that:
class Solution:
    def common_element(self, l0, l1):
        li = []
        for i in l0:
            if i in l1:
                l1.remove(i)
                li.append(i)
        return sorted(li)
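As a side note, here is a minimal alternative sketch (not from either answer above) using collections.Counter as a standalone function; intersecting two Counters keeps the smaller count of each shared element, which is exactly the minimum-frequency logic described above:
from collections import Counter

def common_element(v1, v2):
    common = Counter(v1) & Counter(v2)   # keeps min(count in v1, count in v2)
    return sorted(common.elements())

print(common_element([3, 4, 2, 2, 4], [3, 2, 2, 7]))  # prints [2, 2, 3]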

Python 3: Getting information from list in list

I have a txt file in Python 3 like this (the cities are just examples):
Tokyo 0 267 308 211 156 152 216 27 60 70 75
London 267 0 155 314 111 203 101 258 254 199 310
Paris 308 155 0 429 152 315 216 330 295 249 351
Vienna 211 314 429 0 299 116 212 184 271 265 252
Tallinn 156 111 152 299 0 183 129 178 143 97 199
Helsinki 152 203 313 116 183 0 99 126 212 151 193
Stockholm 216 101 216 212 129 99 0 189 252 161 257
Moscow 27 258 330 184 178 126 189 0 87 73 68
Riga 60 254 295 271 143 212 252 87 0 91 71
Melbourne 70 199 249 265 97 151 161 73 91 0 128
Oslo 75 310 351 252 199 193 257 68 71 128 0
I want the program to work like this, for example:
Please enter starting point: Paris
Now please enter ending point: Riga
Distance between Paris and Riga is 295 km.
I'm fairly new to Python and I don't know how to read a distance out of the list of lists.
What I have managed to do so far:
cities = []
distances = []
file = open("cities.txt")
for city_info in file:
    city_info = city_info.strip()
    city = city_info.split()
    cities.append(city[0])
    distances2 = []
    for dist in city[1:]:
        distances2.append(int(dist))
    distances.append(distances2)
# to check, if lists are good to go
print(distances)
print(cities)
file.close()
amount = len(cities)
for x in range(amount):
    for y in range(amount):
        startpoint = cities[x]
        endpoint = cities[y]
        dist1 = distances[x][y]
startpoint = input("Enter start point: ").capitalize()
if startpoint not in cities:
    print("Start point doesn't exist in our database: ", startpoint)
else:
    endpoint = input("Enter end point: ").capitalize()
    if endpoint not in cities:
        print("Start point doesn't exist in our database: ", endpoint)
    else:
        print("Distance between", startpoint, "and", endpoint, "is", dist1, "kilometers.")
As I'm not very competent in the Python language, I don't know what I'm doing wrong.
For example I want to get distance between cities[1] and cities[4], so it should find distance from distances[1][4].
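For reference, that lookup could look like this minimal sketch, assuming the cities and distances lists have already been filled by the reading loop above:
start_index = cities.index("Paris")       # index 2 for the sample file
end_index = cities.index("Riga")          # index 8 for the sample file
print(distances[start_index][end_index])  # 295 for the sample data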
Try this:
# reading from file:
with open('cities.txt') as f:
    lines = f.readlines()
# pre-processing
indices = {line.split()[0]: i for i, line in enumerate(lines)}
distances = [line.split()[1:] for line in lines]
# user input:
start = input("Please enter starting point: ")
end = input("Now please enter ending point: ")
# evaluation:
distance = distances[indices[start]][indices[end]]
# output:
print("Distance between {start} and {end} is {distance} km.".format(**locals()))
Another approach, not too different from the other answer.
cities = {}
with open('cities.txt') as f:
    for i, line in enumerate(f.read().splitlines()):
        vals = line.split()
        cities[vals[0]] = {'index': i, 'distances': [int(i) for i in vals[1:]]}
startpoint = input("Enter start point: ").capitalize()
if startpoint in cities:
    endpoint = input("Enter end point: ").capitalize()
    if endpoint in cities:
        index = cities[startpoint]['index']
        distance = cities[endpoint]['distances'][index]
        print('The distance from %s to %s is %d' % (startpoint, endpoint, distance))
    else:
        print('city %s does not exist' % endpoint)
else:
    print('city %s does not exist' % startpoint)

How to use a list as a parameter for a function

My first function creates a list from my input file. I'm trying to use the list I created as a parameter for my second function. How would I do this? I understand that each function has its own namespace, so the way I'm doing it is wrong. I'm assuming I need to assign this variable in the global namespace.
def get_data(file_object):
    while True:
        try:
            file_object=input("Enter the name of the input file: ")
            input_file=open(file_object, "r")
            break
        except FileNotFoundError:
            print("Error: file not found\n:")
    student_db=[]
    for line in input_file:
        fields=(line.split())
        name=int(fields[0])
        exam1=int(fields[1])
        exam2=int(fields[2])
        exam3=int(fields[3])
        in_class=int(fields[4])
        projects=int(fields[5])
        exercises=int(fields[6])
        record=[name,exam1,exam2,exam3,in_class,projects,exercises]
        student_db.append(record)
    student_db.sort()
    return student_db

#def calculate_grade(a_list):
#    print(a_list)
#how do I use student_db as a parameter??

def main():
    # a_list=student_db
    # b=calculate_grade(a_list)
    # print(b)
    a=get_data("data.tiny.txt")
    print(a)
Here is the input file I am using
031 97 108 113 48 217 14
032 97 124 147 45 355 15
033 140 145 175 50 446 14
034 133 123 115 46 430 15
035 107 92 136 45 278 13
036 98 115 130 37 387 15
037 117 69 131 34 238 12
038 134 125 132 50 434 15
039 125 116 178 50 433 15
040 125 142 156 50 363 15
041 77 51 68 45 219 15
042 122 142 182 50 447 15
043 103 123 102 46 320 15
044 106 100 127 50 362 15
045 125 110 140 50 396 15
046 120 98 129 48 325 13
047 89 70 80 46 302 14
048 99 130 103 50 436 15
049 100 87 148 17 408 13
050 104 47 91 37 50 9
Your main (and the commented-out code) should look like:
def calculate_grade(a_list):
    print(a_list)

def main():
    a_list=get_data("data.tiny.txt")
    calculate_grade(a_list)

main()
Remember this:
If your function returns a value, then you assign it to a variable in the global namespace and use it at different points. If it has a print statement instead, then you do not need to print again when you call it.
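A tiny illustration of that principle (the names here are made up for the example):
def make_list():        # returns a value instead of printing it
    return [3, 1, 2]

data = make_list()      # assign the returned value at the call site
print(sorted(data))     # ...and reuse it wherever it is needed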
In your example, the student_db is (as I understood it) stored in the variable a. You can just pass that variable to the second function, so simply add calculate_grade(a) to your main function (after defining calculate_grade, obviously).
The function get_data() returns a list, which can be assigned to a local variable and passed to other functions, just like you are doing now.
a=get_data("data.tiny.txt")
calculate_grade(a)
We can't directly use student_db[] because it is local to get_data(). It could be declared/used as a global, but globals ultimately make your program less flexible and more likely to contain errors that will be difficult to spot.
Another approach would be using methods (the object-oriented mechanism).
class FileList:
    def get_data(self, file_object):
        while True:
            try:
                file_object=input("Enter the name of the input file: ")
                input_file=open(file_object, "r")
                break
            except FileNotFoundError:
                print("Error: file not found\n:")
        self.student_db=[]
        for line in input_file:
            fields=(line.split())
            name=int(fields[0])
            exam1=int(fields[1])
            exam2=int(fields[2])
            exam3=int(fields[3])
            in_class=int(fields[4])
            projects=int(fields[5])
            exercises=int(fields[6])
            record=[name,exam1,exam2,exam3,in_class,projects,exercises]
            self.student_db.append(record)
        self.student_db.sort()

    def print_object(self):
        print(self.student_db)

def main():
    myobj=FileList()
    myobj.get_data("data.tiny.txt")
    myobj.print_object()

main()

Python selecting items by comparing values in a table using dictionary

I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). The second column (sseqid) repeats with different values in the 11th and 12th columns, which are evalue and bitscore, respectively.
The ones that I would like to get are those having the lowest evalue and the highest bitscore (when evalues are the same); the rest of the columns can be ignored, and the data is down below.
So, I have made a short piece of code which uses the second column as a key for the dictionary. I can get five different items from the second column with lists of qseqid+evalue and qseqid+bitscore.
Here is the code:
#!usr/bin/python
filename = "data.txt"
readfile = open(filename,"r")
d = dict()
for i in readfile.readlines():
    i = i.strip()
    i = i.split("\t")
    d.setdefault(i[1], []).append([i[0],i[10]])
    d.setdefault(i[1], []).append([i[0],i[11]])
for x in d:
    print(x,d[x])
readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
The data.txt file (including the header row and with » representing tab characters):
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to do this manually, but for comparison I'll show how it can be done using the pandas library, which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too.
import pandas as pd
df = pd.read_csv("acla1.dat", sep="\t")
# sort_values is the current name for the long-deprecated DataFrame.sort
df = df.sort_values(["evalue", "bitscore"], ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
#!usr/bin/python
import csv

DATA = "data.txt"

class Sequence:
    def __init__(self, row):
        self.qseqid = row[0]
        self.sseqid = row[1]
        self.pident = float(row[2])
        self.length = int(row[3])
        self.mismatch = int(row[4])
        self.gapopen = int(row[5])
        self.qstart = int(row[6])
        self.qend = int(row[7])
        self.sstart = int(row[8])
        self.send = int(row[9])
        self.evalue = float(row[10])
        self.bitscore = float(row[11])

    def __str__(self):
        return (
            "{qseqid}\t"
            "{sseqid}\t"
            "{pident}\t"
            "{length}\t"
            "{mismatch}\t"
            "{gapopen}\t"
            "{qstart}\t"
            "{qend}\t"
            "{sstart}\t"
            "{send}\t"
            "{evalue}\t"
            "{bitscore}"
        ).format(**self.__dict__)

def entries(fname, header_rows=1, dtype=list, **kwargs):
    with open(fname) as inf:
        incsv = csv.reader(inf, **kwargs)
        # skip header rows
        for i in range(header_rows):
            next(incsv)
        for row in incsv:
            yield dtype(row)

def main():
    bestseq = {}
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
        # see if a sequence with the same sseqid already exists
        prev = bestseq.get(seq.sseqid, None)
        if (
            prev is None
            or seq.evalue < prev.evalue
            or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
        ):
            bestseq[seq.sseqid] = seq
    # display selected sequences
    keys = sorted(bestseq)
    for key in keys:
        print(bestseq[key])

if __name__ == "__main__":
    main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using the pandas library, it's quite possible to do what you want without resorting to third-party modules. The following uses the collections.defaultdict class to facilitate creation of dictionaries of variable-length lists of records. The use of the AttrDict class is optional, but it makes accessing the fields of each dictionary-based record easier and is less awkward-looking than the usual dict['fieldname'] syntax otherwise required.
import csv
from collections import defaultdict, namedtuple
from itertools import imap
from operator import itemgetter

data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)

# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
                  (itemgetter(col.strip()), 1)) for col in columns]

    def comparer(left, right):
        for fn, mult in comparers:
            result = cmp(fn(left), fn(right))
            if result:
                return mult * result
        else:
            return 0

    return sorted(items, cmp=comparer)

# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

with open(data_file_name, 'rb') as data_file:
    reader = csv.DictReader(data_file, delimiter=DELIMITER)
    format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
    for rec in (AttrDict(r) for r in reader):
        # Convert the two sort fields to numeric values for proper ordering.
        rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
        ssqeid_dict[rec.sseqid].append(rec)

for ssqeid in sorted(ssqeid_dict):
    # Sort each group of recs with same ssqeid. The first record after sorting
    # will be the one sought that has the lowest evalue and highest bitscore.
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
    print format_spec.format(**selected)
Output (» represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
filename = 'data.txt'
readfile = open(filename,'r')
d = dict()
sseqid=[]
lines=[]
for i in readfile.readlines():
    sseqid.append(i.rsplit()[1])
    lines.append(i.rsplit())
sorted_sseqid = sorted(set(sseqid))
sdqDict={}
key =None
for sorted_ssqd in sorted_sseqid:
    key=sorted_ssqd
    evalue=[]
    bitscore=[]
    qseid=[]
    for line in lines:
        if key in line:
            evalue.append(line[10])
            bitscore.append(line[11])
            qseid.append(line[0])
    sdqDict[key]=[qseid,evalue,bitscore]
print sdqDict
print 'TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1])
##I think you can do the list manipulation below to find out the qseqid
readfile.close()
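As a hedged sketch of the "list manipulation" hinted at in that comment (this is not part of the original answer): sdqDict[key] holds the parallel lists [qseid, evalue, bitscore], so you can pick the index of the numerically smallest evalue (breaking ties by the largest bitscore) and read off the matching qseqid:
qseids, evalues, bitscores = sdqDict['TBB']
# convert the stored strings to floats so the comparison is numeric, not lexicographic
best = min(range(len(evalues)), key=lambda i: (float(evalues[i]), -float(bitscores[i])))
print('TBB best qseqid ----> ' + qseids[best])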

Index Error when using python to read BLAST output in csv format

Apologies for the long question; I have been trying to solve this bug but I can't work out what I'm doing wrong! I have included an example of the data so you can see what I'm working with.
I have data output from a BLAST search as below:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANfive
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 7 hits found
Cryptocephalus M00964:19:000000000-A4YV1:1:2110:23842:21326 99.6 250 1 0 125 374 250 1 1.00E-128 457
Cryptocephalus M00964:19:000000000-A4YV1:1:1112:19704:18005 85.37 246 36 0 90 335 246 1 4.00E-68 255
Cryptocephalus M00964:19:000000000-A4YV1:1:2106:14369:15227 77.42 248 50 3 200 444 245 1 3.00E-34 143
Cryptocephalus M00964:19:000000000-A4YV1:1:2102:5533:11928 78.1 137 30 0 3 139 114 250 2.00E-17 87.9
Cryptocephalus M00964:19:000000000-A4YV1:1:1110:28729:12868 81.55 103 19 0 38 140 104 2 6.00E-17 86.1
Cryptocephalus M00964:19:000000000-A4YV1:1:1113:11427:16440 78.74 127 27 0 3 129 124 250 6.00E-17 86.1
Cryptocephalus M00964:19:000000000-A4YV1:1:2110:12170:20594 78.26 115 25 0 3 117 102 216 1.00E-13 75
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANfive
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 10 hits found
Cryptocephalus M00964:19:000000000-A4YV1:1:2111:20990:19930 97.2 250 7 0 119 368 250 1 1.00E-118 424
Cryptocephalus M00964:19:000000000-A4YV1:1:1105:20676:23942 86.89 206 27 0 5 210 209 4 7.00E-61 231
Cryptocephalus M00964:19:000000000-A4YV1:1:1113:6534:23125 97.74 133 3 0 1 133 133 1 3.00E-60 230
Cryptocephalus M00964:21:000000000-A4WJV:1:2104:11955:19015 89.58 144 15 0 512 655 1 144 2.00E-46 183
Cryptocephalus M00964:21:000000000-A4WJV:1:1109:14814:10240 88.28 128 15 0 83 210 11 138 2.00E-37 154
Cryptocephalus M00964:21:000000000-A4WJV:1:1105:4530:13833 79.81 208 42 0 3 210 211 4 6.00E-37 152
Cryptocephalus M00964:19:000000000-A4YV1:1:2108:13133:14967 98.7 77 1 0 1 77 77 1 2.00E-32 137
Cryptocephalus M00964:19:000000000-A4YV1:1:1109:14328:3682 100 60 0 0 596 655 251 192 1.00E-24 111
Cryptocephalus M00964:19:000000000-A4YV1:1:1105:19070:25181 100 53 0 0 1 53 53 1 8.00E-21 99
Cryptocephalus M00964:19:000000000-A4YV1:1:1109:20848:27419 100 28 0 0 1 28 28 1 6.00E-07 52.8
# BLASTN 2.2.29+
# Query: Cryptocephalus cynarae
# Database: SANfive
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 2 hits found
Cryptocephalus M00964:21:000000000-A4WJV:1:2107:12228:15885 90.86 175 16 0 418 592 4 178 5.00E-62 235
Cryptocephalus M00964:21:000000000-A4WJV:1:1110:20463:5044 84.52 168 26 0 110 277 191 24 2.00E-41 167
and I have saved this as a csv, again shown below
# BLASTN 2.2.29+,,,,,,,,,,,
# Query: Cryptocephalus androgyne,,,,,,,,,,,
# Database: SANfive,,,,,,,,,,,
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 7 hits found,,,,,,,,,,,
Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:23842:21326,99.6,250,1,0,125,374,250,1,1.00E-128,457
Cryptocephalus,M00964:19:000000000-A4YV1:1:1112:19704:18005,85.37,246,36,0,90,335,246,1,4.00E-68,255
Cryptocephalus,M00964:19:000000000-A4YV1:1:2106:14369:15227,77.42,248,50,3,200,444,245,1,3.00E-34,143
Cryptocephalus,M00964:19:000000000-A4YV1:1:2102:5533:11928,78.1,137,30,0,3,139,114,250,2.00E-17,87.9
Cryptocephalus,M00964:19:000000000-A4YV1:1:1110:28729:12868,81.55,103,19,0,38,140,104,2,6.00E-17,86.1
Cryptocephalus,M00964:19:000000000-A4YV1:1:1113:11427:16440,78.74,127,27,0,3,129,124,250,6.00E-17,86.1
Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:12170:20594,78.26,115,25,0,3,117,102,216,1.00E-13,75
# BLASTN 2.2.29+,,,,,,,,,,,
# Query: Cryptocephalus aureolus,,,,,,,,,,,
# Database: SANfive,,,,,,,,,,,
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found,,,,,,,,,,,
Cryptocephalus,M00964:19:000000000-A4YV1:1:2111:20990:19930,97.2,250,7,0,119,368,250,1,1.00E-118,424
Cryptocephalus,M00964:19:000000000-A4YV1:1:1105:20676:23942,86.89,206,27,0,5,210,209,4,7.00E-61,231
Cryptocephalus,M00964:19:000000000-A4YV1:1:1113:6534:23125,97.74,133,3,0,1,133,133,1,3.00E-60,230
Cryptocephalus,M00964:21:000000000-A4WJV:1:2104:11955:19015,89.58,144,15,0,512,655,1,144,2.00E-46,183
Cryptocephalus,M00964:21:000000000-A4WJV:1:1109:14814:10240,88.28,128,15,0,83,210,11,138,2.00E-37,154
Cryptocephalus,M00964:21:000000000-A4WJV:1:1105:4530:13833,79.81,208,42,0,3,210,211,4,6.00E-37,152
Cryptocephalus,M00964:19:000000000-A4YV1:1:2108:13133:14967,98.7,77,1,0,1,77,77,1,2.00E-32,137
Cryptocephalus,M00964:19:000000000-A4YV1:1:1109:14328:3682,100,60,0,0,596,655,251,192,1.00E-24,111
Cryptocephalus,M00964:19:000000000-A4YV1:1:1105:19070:25181,100,53,0,0,1,53,53,1,8.00E-21,99
Cryptocephalus,M00964:19:000000000-A4YV1:1:1109:20848:27419,100,28,0,0,1,28,28,1,6.00E-07,52.8
I have designed a short script that goes through the percentage identity and, if it is above a threshold, finds the query ID and adds it to a list before removing duplicates from the list.
import csv
from pylab import plot,show

#Making a function to see if a string is a number or not
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

#Importing the CSV file, using sniffer to check the delimiters used
#In the first 1024 bytes
ImportFile = raw_input("What is the name of your import file? ")
csvfile = open(ImportFile, "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)

#Finding species over 98%
Species98 = []
Species95to97 = []
Species90to94 = []
Species85to89 = []
Species80to84 = []
Species75to79 = []
SpeciesBelow74 = []
for line in reader:
    if is_number(line[2])== True:
        if float(line[2])>=98:
            Species98.append(line[0])
        elif 97>=float(line[2])>=95:
            Species95to97.append(line[0])
        elif 94>=float(line[2])>=90:
            Species90to94.append(line[0])
        elif 89>=float(line[2])>=85:
            Species85to89.append(line[0])
        elif 84>=float(line[2])>=80:
            Species80to84.append(line[0])
        elif 79>=float(line[2])>=75:
            Species75to79.append(line[0])
        elif float(line[2])<=74:
            SpeciesBelow74.append(line[0])

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [ x for x in seq if x not in seen and not seen_add(x)]

Species98=f7(Species98)
print len(Species98), "species over 98"
Species95to97=f7(Species95to97) #removing duplicates
search_set = set().union(Species98)
Species95to97 = [x for x in Species95to97 if x not in search_set]
print len(Species95to97), "species between 95-97"
Species90to94=f7(Species90to94)
search_set = set().union(Species98, Species95to97)
Species90to94 = [x for x in Species90to94 if x not in search_set]
print len(Species90to94), "species between 90-94"
Species85to89=f7(Species85to89)
search_set = set().union(Species98, Species95to97, Species90to94)
Species85to89 = [x for x in Species85to89 if x not in search_set]
print len(Species85to89), "species between 85-89"
Species80to84=f7(Species80to84)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89)
Species80to84 = [x for x in Species80to84 if x not in search_set]
print len(Species80to84), "species between 80-84"
Species75to79=f7(Species75to79)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89,Species80to84)
Species75to79 = [x for x in Species75to79 if x not in search_set]
print len(Species75to79), "species between 75-79"
SpeciesBelow74=f7(SpeciesBelow74)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89,Species80to84, Species75to79)
SpeciesBelow74 = [x for x in SpeciesBelow74 if x not in search_set]
print len(SpeciesBelow74), "species below 74"

#Finding species 95-97%
The script works perfectly most of the time, but every so often I get the error shown below:
File "FindingSpeciesRepresentation.py", line 35, in <module>
if is_number(line[2])== "True":
IndexError: list index out of range
But if I change the script so it prints line[2] it prints all the identities as I would expect. Do you have any idea what could be going wrong? Again apologies for the wall of data.
This has been partly taken from my earlier question: Extracting BLAST output columns in CSV form with python
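One hedged guess, for what it's worth: an IndexError on line[2] can only happen on a row that splits into fewer than three fields (for example a blank line or a comment row, depending on the delimiter the sniffer detects), even if every data row prints line[2] fine. A minimal defensive tweak to the loop from the script above would be:
for line in reader:
    # skip rows that split into fewer than three fields,
    # which would otherwise make line[2] raise an IndexError
    if len(line) < 3:
        continue
    if is_number(line[2]):
        pass  # the existing elif chain from the script goes here, unchanged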
