For loop with non-consecutive indices - python

I'm quite new to Python and working with data frames, so this might be a very simple problem.
I successfully imported some measurement data (1 minute resolution) and did some calculations on them. I want to recalculate some data processing on a 15 minute basis (not average), for which I extracted every row at :00, :15, :30 and :45 from the original data frame.
df_interval = df[(df['DateTime'].dt.minute == 0) | (df['DateTime'].dt.minute == 15) | (df['DateTime'].dt.minute == 30) | (df['DateTime'].dt.minute == 45)]
This seems to work fine. Now I want to recalculate the concentration every 15 minute based on what the instrument is internally doing, which is a simple formula.
So what I tried is:
for i in df_interval.index:
    if np.isnan(df_interval.ATN[i]) == False and np.isnan(df_interval.ATN[i+1]) == False:
        df_15min = (0.785 *((df_interval.ATN[i+1]-df_interval.ATN[i])/100))/(df_interval.Flow[i]*(1-0.07)*10.8*(1-df_interval.K[i]*df_interval.ATN[i])*15)
However, I end up with a KeyError: 226, and I don't understand why.
Update:
Here is the data and in the last column (df_15min) also the result that I want to get:
index  ATN          Flow  K        df_15min
150                 3647  0.00994
165                 3634  0.00996
180                 3634  0.00995
195                 3621  0.00995
210                 3615  0.00994
225    1.703678939  3754  0.00994  3.75E-08
240    4.356519267  3741  0.00994  3.84E-08
255    6.997422571  3741  0.00994  3.94E-08
270    9.627710046  3736  0.00995  4.02E-08
285    12.23379251  3728  0.01007  3.89E-08
300    14.67175418  3727  0.01026  3.76E-08
315    16.9583747   3714  0.01043  3.73E-08
330    19.1497249   3714  0.01061  3.96E-08
345    21.39628083  3709  0.01079  3.87E-08
360    23.51512717  3701  0.01086  4.02E-08
375    25.63995721  3700  0.01083  3.90E-08
390    27.63886191  3688  0.0108   3.47E-08
405    29.36343728  3688  0.01076  3.68E-08
420    31.14291069  3677  0.01072  3.90E-08
I do a lot of things in Igor, so that is how I would do it there (unfortunately for me, it has to be in python this time):
variable i
For (i=0; i<numpnts(ATN)-1; i+=1)
    df_15min[i] = (0.785 *((ATN[i+1]-ATN[i])/100))/(Flow[i]*(1-0.07)*10.8*(1-K[i]*ATN[i])*15)
endfor
Any help would be appreciated, thanks!

You can write the same operation as vectorized code. Just use the whole columns, with shift(-1) to get the "next" row.
df['df_15min'] = (0.785 *((df['ATN'].shift(-1)-df['ATN'])/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
Or using diff:
df['df_15min'] = (0.785 *((-df['ATN'].diff(-1))/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
output:
ATN Flow K df_15min
index
150 NaN 3647 0.00994 NaN
165 NaN 3634 0.00996 NaN
180 NaN 3634 0.00995 NaN
195 NaN 3621 0.00995 NaN
210 NaN 3615 0.00994 NaN
225 1.703679 3754 0.00994 3.745468e-08
240 4.356519 3741 0.00994 3.844700e-08
255 6.997423 3741 0.00994 3.937279e-08
270 9.627710 3736 0.00995 4.019633e-08
285 12.233793 3728 0.01007 3.886148e-08
300 14.671754 3727 0.01026 3.763219e-08
315 16.958375 3714 0.01043 3.734876e-08
330 19.149725 3714 0.01061 3.955360e-08
345 21.396281 3709 0.01079 3.870011e-08
360 23.515127 3701 0.01086 4.017342e-08
375 25.639957 3700 0.01083 3.897022e-08
390 27.638862 3688 0.01080 3.473242e-08
405 29.363437 3688 0.01076 3.675232e-08
420 31.142911 3677 0.01072 NaN
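As for the KeyError itself: df_interval.ATN[i+1] is most likely a lookup by index label, not by position, and after filtering the labels are 15 apart, so a label such as 226 does not exist. A minimal sketch of the difference, assuming a toy frame built from a few rows of the table above (the variable names label_lookup_failed, next_atn and result are only for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a non-consecutive index, like the one left after filtering.
df_interval = pd.DataFrame(
    {
        "ATN": [np.nan, 1.703679, 4.356519, 6.997423],
        "Flow": [3615, 3754, 3741, 3741],
        "K": [0.00994, 0.00994, 0.00994, 0.00994],
    },
    index=[210, 225, 240, 255],
)

# Label lookup: 226 is not an index label, so this raises KeyError.
try:
    df_interval["ATN"][225 + 1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional "next row": shift(-1) moves every value up by one row,
# whatever the index labels happen to be.
next_atn = df_interval["ATN"].shift(-1)

result = (0.785 * ((next_atn - df_interval["ATN"]) / 100)) / (
    df_interval["Flow"] * (1 - 0.07) * 10.8
    * (1 - df_interval["K"] * df_interval["ATN"]) * 15
)
print(label_lookup_failed)
print(result)
```

Rows where either ATN value is NaN simply come out as NaN, so the explicit np.isnan check from the loop is not needed.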

Your if condition checks bc_interval.row1[i+1] for nan and then you access df_interval.row1[i+1]. Looks like you wanted to check df_interval.row1[i+1] instead.

Related

Filling an empty list in python

I am trying to create a new list using data from a pandas Dataframe. The Dataframe in question has a column of Dates as well as a column for Units Sold as seen below:
Peep = Xsku[['new_date', 'cum_sum']]
Peep.head(15)
Out[159]:
new_date cum_sum
18 2011-01-17 214
1173 2011-01-24 343
2328 2011-01-31 407 #Save Entry in List
3483 2011-02-07 71
4638 2011-02-14 159
5793 2011-02-21 294
6948 2011-02-28 425 #Save Entry in List
8103 2011-03-07 113
9258 2011-03-14 249
10413 2011-03-21 347
11568 2011-03-28 463 #Save Entry in List
12723 2011-04-04 99
13878 2011-04-11 186
15033 2011-04-18 291
16188 2011-04-25 385
I am trying to make a new list, where the list contains the maximum 'cum_sum' before the number is reset (i.e. becomes smaller). For example, in the first four entries above, the cum_sum reaches 407 and then goes back down to 71. I am thus trying to save the number 407 as well as the corresponding 'new_date' (2011-01-31 in this example) and do this for every entry.
My final List will thus have all the maximum 'cum_sum' values before it is reset.
For example, it will look as follows:
(First Three Expected Values)
MyList
Out[]:
new_date cum_sum
2011-01-31 407
2011-02-28 425
2011-03-28 463
...
I have been trying to do something as a for loop, but continually run into problems:
MyList= [] ##My Empty List
for i in range(len(Peep['new_date'])):
    if Peep.iloc[i,1] > Peep.iloc[i + 1,1]:
        MyList.append(Peep.iloc[i,1])
Can anyone help me in this regard?
Use .diff and filter like
In [17]: df[df['cum_sum'].diff(-1).ge(0)]
Out[17]:
new_date cum_sum
2 2011-01-31 407
6 2011-02-28 425
10 2011-03-28 463
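For reference, the loop in the question fails on its last iteration because Peep.iloc[i + 1, 1] runs past the end of the frame. Here is a self-contained sketch of the .diff approach, assuming the 15 rows shown in the question (the default integer index and the names is_peak and peaks are my own choices):

```python
import pandas as pd

# The 15 weekly rows from the question, rebuilt with a default index.
Peep = pd.DataFrame(
    {
        "new_date": pd.date_range("2011-01-17", periods=15, freq="7D"),
        "cum_sum": [214, 343, 407, 71, 159, 294, 425, 113,
                    249, 347, 463, 99, 186, 291, 385],
    }
)

# diff(-1) is "this row minus the next row"; a value >= 0 means the
# running total does not increase on the next row, i.e. it was reset.
is_peak = Peep["cum_sum"].diff(-1).ge(0)
peaks = Peep[is_peak]

# If a plain Python list is really needed, build it from the filtered frame.
MyList = list(zip(peaks["new_date"].dt.strftime("%Y-%m-%d"), peaks["cum_sum"]))
print(peaks)
```

Note that the very last row is never selected, because it has no following row to compare against (its diff is NaN).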

How to compare two values at a specific location in a loop, and append data in a range of values in Pandas Dataframe

I have a dataframe, from where I extracted some sample data:
Time Val
0 70000 -322
1 70500 -439
2 71000 -528
3 71500 -606
4 72000 -642
5 72500 -663
6 73000 -620
7 73500 -561
8 74000 -592
9 74500 -614
10 75000 -630
11 75500 -719
12 80000 -613
13 80500 -127
14 81000 -235
15 81500 -186
16 82000 -82
17 82500 836
18 83000 1137
183 70000 -106
184 70500 -117
185 71000 -626
186 71500 -810
187 72000 -822
188 72500 -676
189 73000 -639
190 73500 -664
191 74000 -708
192 74500 -515
193 75000 -61
194 75500 -121
195 80000 -145
196 80500 -57
197 81000 -133
198 81500 101
199 82000 235
200 82500 585
201 83000 550
366 70000 18
367 70500 138
368 71000 22
369 71500 -68
370 72000 -146
371 72500 -163
372 73000 -251
373 73500 -230
374 74000 -218
375 74500 -137
376 75000 -126
Now I would like to compare the value from 'Val' at time 73000 with the value [i-3].
If the value is less, then append the continuous values to the list until Time has reached 80000.
I wrote this loop, but the problem is that it compares 'Val' with the value at [i-3] for ALL rows between 73000 and 80000. I want the comparison to happen ONLY at 73000 and, if the condition is true, to write the data to the list (until Time 80000).
box = []
for i in df.index:
    if df.Time[i] >= 73000 and df.Time[i] <= 80000 and df.Val[i] < df.Val[i-3]:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
How could I change the code in order to achieve this?
You need to remember the result of the comparison in another variable, and reset it whenever you encounter a time value outside your desired interval. The code would look like this.
box = []
writeToList = False
for i in df.index:
    if df.Time[i] < 73000 or df.Time[i] > 80000:
        writeToList = False
    if df.Time[i] == 73000 and df.Val[i] < df.Val[i-3]:
        writeToList = True
    if writeToList and df.Time[i] >= 73000 and df.Time[i] <= 80000:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
Hope this helps.
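A runnable variant of the same flag idea, as a sketch only: it assumes the first two blocks of the sample data with a default integer index, and it uses shift(3) to get the value three rows earlier by position, so it does not depend on the index labels being consecutive. The names prev3, write and in_window are mine.

```python
import pandas as pd

times = [70000, 70500, 71000, 71500, 72000, 72500, 73000, 73500, 74000,
         74500, 75000, 75500, 80000, 80500, 81000, 81500, 82000, 82500, 83000]
df = pd.DataFrame(
    {
        "Time": times * 2,
        "Val": [-322, -439, -528, -606, -642, -663, -620, -561, -592, -614,
                -630, -719, -613, -127, -235, -186, -82, 836, 1137,
                -106, -117, -626, -810, -822, -676, -639, -664, -708, -515,
                -61, -121, -145, -57, -133, 101, 235, 585, 550],
    }
)

prev3 = df["Val"].shift(3)  # value three rows earlier, by position

rows = []
write = False
for t, v, p in zip(df["Time"], df["Val"], prev3):
    in_window = 73000 <= t <= 80000
    if not in_window:
        write = False      # leaving the window resets the flag
    elif t == 73000:
        write = v < p      # decide once, at the start of the window
    if write and in_window:
        rows.append({"Time": t, "newVAL": v})

box = pd.DataFrame(rows, columns=["Time", "newVAL"])
print(box)
```

With this data only the first block is written, since -620 < -606 holds there while -639 < -810 does not hold in the second block.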

I want to compare values in a dataframe column and report the index for the value that satisfy a conditional argument?

Unnamed: 4 GDP in billions of chained 2009 dollars.1
214 2000q1 12359.1
215 2000q2 12592.5
216 2000q3 12607.7
217 2000q4 12679.3
218 2001q1 12643.3
219 2001q2 12710.3
220 2001q3 12670.1
221 2001q4 12705.3
222 2002q1 12822.3
223 2002q2 12893.0
224 2002q3 12955.8
225 2002q4 12964.0
226 2003q1 13031.2
227 2003q2 13152.1
228 2003q3 13372.4
229 2003q4 13528.7
230 2004q1 13606.5
231 2004q2 13706.2
232 2004q3 13830.8
233 2004q4 13950.4
234 2005q1 14099.1
235 2005q2 14172.7
236 2005q3 14291.8
237 2005q4 14373.4
238 2006q1 14546.1
239 2006q2 14589.6
240 2006q3 14602.6
241 2006q4 14716.9
242 2007q1 14726.0
243 2007q2 14838.7
... ... ...
250 2009q1 14375.0
251 2009q2 14355.6
252 2009q3 14402.5
253 2009q4 14541.9
254 2010q1 14604.8
255 2010q2 14745.9
256 2010q3 14845.5
257 2010q4 14939.0
258 2011q1 14881.3
259 2011q2 14989.6
260 2011q3 15021.1
261 2011q4 15190.3
262 2012q1 15291.0
263 2012q2 15362.4
264 2012q3 15380.8
265 2012q4 15384.3
266 2013q1 15491.9
267 2013q2 15521.6
268 2013q3 15641.3
269 2013q4 15793.9
270 2014q1 15747.0
271 2014q2 15900.8
272 2014q3 16094.5
273 2014q4 16186.7
274 2015q1 16269.0
275 2015q2 16374.2
276 2015q3 16454.9
277 2015q4 16490.7
278 2016q1 16525.0
279 2016q2 16583.1
I have the above dataframe. I want to compare the values in the column GDP in billions of chained 2009 dollars.1 and report the index and value of the row for which the value of the column is consecutively lower than the two values above it. I am using the following code, but I am not getting the expected result.
datan = pd.read_excel('gdplev.xls', skiprows = 5)
datan.drop(datan.iloc[0:230, 0:4], inplace = True, axis = 1)
datan = datan[214:]
datan = datan.drop(['GDP in billions of current dollars.1', 'Unnamed: 7'], axis = 1)
datan
for item in datan['GDP in billions of chained 2009 dollars.1']:
    if item > item+1 and item+1 > item+2:
        print(item+2)
Please help
I suggest the following:
# First I reproduce a DataFrame similar to yours
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({"quarter" : pd.date_range("2000q1", freq="Q", periods = 10),
"gdp": np.random.rand(10)*10000})
df["quarter"] = pd.Series(df["quarter"].dt.year).astype("str") + "q" + pd.Series(df["quarter"].dt.quarter).astype("str")
# Then I create two columns that are the lags of gdp
df["gdpN_1"] = df["gdp"].shift()
df["gdpN_2"] = df["gdpN_1"].shift()
# I create a top when gdp is below gdp at past quarter and the quarter before that
df["top"] = (df["gdp"] < df["gdpN_1"]) & (df["gdp"] < df["gdpN_2"])
# I only select rows for which top is True
new_df = df.loc[df["top"], ["quarter", "gdp"]]
And the result for new_df is :
quarter gdp
2 2000q3 2268.514536
5 2001q2 4231.064601
8 2002q1 4809.319015
9 2002q2 3921.175182
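The flag above marks rows that are below both of the two previous values. If what is wanted is instead two declines in a row (each value lower than the one before it, which is how I read the loop in the question), a small variation with diff might work. This is a sketch on made-up numbers, not the real GDP figures, and the names change and two_declines are hypothetical:

```python
import pandas as pd

# Made-up values for illustration only.
gdp = pd.Series(
    [100.0, 98.0, 97.0, 99.0, 98.5, 101.0],
    index=["q1", "q2", "q3", "q4", "q5", "q6"],
)

change = gdp.diff()  # this quarter minus the previous quarter

# True where this quarter fell AND the previous quarter also fell.
two_declines = (change < 0) & (change.shift() < 0)

result = gdp[two_declines]
print(result)
```

Here only q3 is reported: q2 fell from q1 and q3 fell again from q2, whereas q5 fell after a rise in q4.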

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a ValueError: Can only compare identically-labeled Series objects error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output - grower_moo['Fat'] of 13.60 is less than 14 Fat, therefore gets a price per ton of $430
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores _ instead of spaces .
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
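One side effect of the merge above is that the result comes back sorted by Fat rather than in the original row order. A minimal sketch of one way to restore that order, assuming a default integer index and a trimmed copy of the question's data (only the columns needed here; the helper column name "index" is simply what reset_index() produces):

```python
import pandas as pd

pricing = pd.DataFrame(
    {"Price_Per_Ton": [728, 750, 766], "Wet_Fat": [20, 21, 22]}
)
grower_moo = pd.DataFrame(
    {
        "Load_Ticket": ["L2019000011817", "L2019000011816", "L2019000011815",
                        "L2019000011161", "L2019000011160"],
        "Fat": [21.92, 21.12, 21.36, 21.41, 20.06],
    }
)

# Keep the original row position in an "index" column before sorting.
left = grower_moo.reset_index().sort_values("Fat")
# Both key columns must have the same dtype and be sorted.
right = pricing.astype({"Wet_Fat": "float64"}).sort_values("Wet_Fat")

res = (
    pd.merge_asof(left, right, left_on="Fat", right_on="Wet_Fat")
    .sort_values("index")      # back to the original row order
    .set_index("index")
    .drop(columns="Wet_Fat")
)
res.index.name = None
print(res)
```

The prices then line up with the original load order, matching the expected output in the question.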
# Concatenate column-wise (axis=1); rows are paired by index label
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

Python selecting items by comparing values in a table using dictionary

I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). The second column (sseqid) repeats, with different values in the 11th and 12th columns, which are evalue and bitscore, respectively.
The ones that I would like to get are those with the lowest evalue and the highest bitscore (when evalues are the same; the rest of the columns can be ignored, and the data is shown below).
So, I have written a short script which uses the second column as a key for the dictionary. I can get five different items from the second column with lists of qseqid + evalue and qseqid + bitscore.
Here is the code:
#!/usr/bin/python
filename = "data.txt"
readfile = open(filename,"r")
d = dict()
for i in readfile.readlines():
    i = i.strip()
    i = i.split("\t")
    d.setdefault(i[1], []).append([i[0],i[10]])
    d.setdefault(i[1], []).append([i[0],i[11]])
for x in d:
    print(x,d[x])
readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
The data.txt file (including the header row and with » representing tab characters)
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to do this manually, but for comparison I'll show how it can be done using the pandas library, which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too.
import pandas as pd
df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort_values(["evalue", "bitscore"], ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
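To make that concrete without needing the data file, here is a small sketch that reads a few rows inline (copied from the question, with the columns trimmed to the ones that matter) and also shows the "separately" variant mentioned above. The names best, lowest_evalue and highest_bitscore are only for illustration, and it assumes a pandas version that has sort_values:

```python
import io
import pandas as pd

# A few rows from the question's data.txt, tab-separated, columns trimmed.
raw = (
    "qseqid\tsseqid\tevalue\tbitscore\n"
    "ACLA_074730\tCALM\t7.00E-08\t41.2\n"
    "ACLA_096170\tCALM\t1.00E-13\t55.1\n"
    "ACLA_016630\tCALM\t5.00E-12\t51.2\n"
    "ACLA_031930\tRPB2\t0\t734\n"
    "ACLA_065630\tRPB2\t0\t1691\n"
    "ACLA_082370\tRPB2\t7.00E-110\t365\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Combined criterion: lowest evalue, ties broken by highest bitscore.
best = (
    df.sort_values(["evalue", "bitscore"], ascending=[True, False])
    .groupby("sseqid", as_index=False)
    .first()
)

# The two criteria separately: one row per sseqid for each.
lowest_evalue = df.loc[df.groupby("sseqid")["evalue"].idxmin()]
highest_bitscore = df.loc[df.groupby("sseqid")["bitscore"].idxmax()]

print(best)
```

For RPB2 the two evalues of 0 tie, so the combined version picks the row with the higher bitscore, while idxmin on its own just returns the first of the tied rows.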
#!/usr/bin/python
import csv
DATA = "data.txt"
class Sequence:
    def __init__(self, row):
        self.qseqid = row[0]
        self.sseqid = row[1]
        self.pident = float(row[2])
        self.length = int(row[3])
        self.mismatch = int(row[4])
        self.gapopen = int(row[5])
        self.qstart = int(row[6])
        self.qend = int(row[7])
        self.sstart = int(row[8])
        self.send = int(row[9])
        self.evalue = float(row[10])
        self.bitscore = float(row[11])
    def __str__(self):
        return (
            "{qseqid}\t"
            "{sseqid}\t"
            "{pident}\t"
            "{length}\t"
            "{mismatch}\t"
            "{gapopen}\t"
            "{qstart}\t"
            "{qend}\t"
            "{sstart}\t"
            "{send}\t"
            "{evalue}\t"
            "{bitscore}"
        ).format(**self.__dict__)
def entries(fname, header_rows=1, dtype=list, **kwargs):
    with open(fname) as inf:
        incsv = csv.reader(inf, **kwargs)
        # skip header rows
        for i in range(header_rows):
            next(incsv)
        for row in incsv:
            yield dtype(row)
def main():
    bestseq = {}
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
        # see if a sequence with the same sseqid already exists
        prev = bestseq.get(seq.sseqid, None)
        if (
            prev is None
            or seq.evalue < prev.evalue
            or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
        ):
            bestseq[seq.sseqid] = seq
    # display selected sequences
    keys = sorted(bestseq)
    for key in keys:
        print(bestseq[key])
if __name__ == "__main__":
    main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using the pandas library, it's quite possible to do what you want without resorting to third-party modules. The following uses the collections.defaultdict class to facilitate creation of dictionaries of variable-length lists of records. The use of the AttrDict class is optional, but it makes accessing the fields of each dictionary-based record easier and is less awkward-looking than the usual dict['fieldname'] syntax otherwise required.
import csv
from collections import defaultdict
from functools import cmp_to_key
from operator import itemgetter
data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)
# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
                  (itemgetter(col.strip()), 1)) for col in columns]
    def comparer(left, right):
        for fn, mult in comparers:
            a, b = fn(left), fn(right)
            result = (a > b) - (a < b)  # Python 3 stand-in for cmp(a, b)
            if result:
                return mult * result
        else:
            return 0
    # Python 3: sorted() has no cmp argument, so wrap the comparer
    return sorted(items, key=cmp_to_key(comparer))
# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self
with open(data_file_name, newline='') as data_file:
    reader = csv.DictReader(data_file, delimiter=DELIMITER)
    format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
    for rec in (AttrDict(r) for r in reader):
        # Convert the two sort fields to numeric values for proper ordering.
        rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
        ssqeid_dict[rec.sseqid].append(rec)
for ssqeid in sorted(ssqeid_dict):
    # Sort each group of recs with same ssqeid. The first record after sorting
    # will be the one sought that has the lowest evalue and highest bitscore.
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
    print(format_spec.format(**selected))
Output (»represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
filename = 'data.txt'
readfile = open(filename,'r')
d = dict()
sseqid=[]
lines=[]
for i in readfile.readlines():
    sseqid.append(i.rsplit()[1])
    lines.append(i.rsplit())
sorted_sseqid = sorted(set(sseqid))
sdqDict={}
key =None
for sorted_ssqd in sorted_sseqid:
    key=sorted_ssqd
    evalue=[]
    bitscore=[]
    qseid=[]
    for line in lines:
        if key in line:
            evalue.append(line[10])
            bitscore.append(line[11])
            qseid.append(line[0])
    sdqDict[key]=[qseid,evalue,bitscore]
print(sdqDict)
# The values are still strings, so compare them numerically with key=float
print('TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1], key=float))
##I think you can do the list manipulation below to find out the qseqid
readfile.close()
