What is the internal load factor of a set in Python?

I am trying to find out what the internal load factor is for Python sets. A dictionary uses a hash table with a load factor of 0.66 (2/3): the number of buckets starts at 8, and when the 6th key is inserted the number of buckets increases to 16.
The table below shows the shift in buckets:

bucket   shift
8        5
16       10
32       21
64       42
128      85
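Note that each shift value equals floor(2 * bucket / 3): the dictionary grows once the number of entries would exceed two thirds of the bucket count.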
This can be seen with the following Python code, where the sizes of a dictionary and a set are shown using sys.getsizeof:
import sys

d = {}
s = set()
for x in range(25):
    d[x] = 1
    s.add(x)
    print(len(d), sys.getsizeof(d), sys.getsizeof(s))
# of elements   memory used for dict   memory used for set
1               232                    216
2               232                    216
3               232                    216
4               232                    216
5               232                    728
6               360                    728
7               360                    728
8               360                    728
9               360                    728
10              360                    728
11              640                    728
12              640                    728
13              640                    728
14              640                    728
15              640                    728
16              640                    728
17              640                    728
18              640                    728
19              640                    2264
20              640                    2264
21              640                    2264
22              1176                   2264
23              1176                   2264
24              1176                   2264
25              1176                   2264
The table above shows that the bucket shifts are correct for the dictionary, but not for the set: the set's memory grows at different element counts.
I am trying to find out what the load factor is for a set. Is it also 2/3? Or am I doing something wrong in the code?

Currently, it's about 3/5. See the source:
if ((size_t)so->fill*5 < mask*3)
    return 0;
return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
fill is the number of occupied table cells (including "deleted entry" markers), and mask is 1 less than the total table capacity.
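So a set resizes once it is roughly 3/5 full, and per the source above a small set then grows to four times its used count, rounded up to the next power of two. As a sanity check against the getsizeof table in the question, here is a small sketch (my code, assuming an initial table of 8 slots and no deletions, so fill == used) that predicts which inserts trigger a resize:

def predicted_resizes(n_inserts):
    # Predict which inserts trigger a set resize, per the 3/5 rule above.
    slots = 8                       # assumed initial table size (PySet_MINSIZE)
    used = 0
    resizes = []
    for i in range(1, n_inserts + 1):
        used += 1
        mask = slots - 1
        if used * 5 >= mask * 3:    # i.e. NOT (fill*5 < mask*3)
            resizes.append(i)
            target = used * 4       # small-set branch (used <= 50000)
            slots = 8
            while slots <= target:  # round up to the next power of two
                slots *= 2
    return resizes

print(predicted_resizes(25))        # [5, 19]

This matches the table above: the set's memory jumps at 5 and 19 elements, consistent with a load factor of about 3/5.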

Related

Sampling from static data set to create dataframe, ignore index in Python

I am trying to create some random samples (of a given size) from a static dataframe. The goal is to create multiple columns for each sample (and each sample drawn is the same size). I'm expecting to see multiple columns of the same length (i.e. sample size) in the fully sampled dataframe, but maybe append isn't the right way to go. Here is the code:
import numpy as np
import pandas as pd

# create sample dataframe
target_df = pd.DataFrame(np.arange(1000))
target_df.columns = ['pl']

# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len

sampled_df = pd.DataFrame()
for i in range(sample_num):
    rndm_start = np.random.choice(df_max_row, 1)[0]
    rndm_end = rndm_start + sample_len
    slicer = target_df.iloc[rndm_start:rndm_end]['pl']
    sampled_df = sampled_df.append(slicer, ignore_index=True)
sampled_df = sampled_df.T
The output of this is shown in the pic below - the red line shows the index I want to remove.
The desired output is shown below that. How do I make this happen?
Thanks!
I would create a new column using
sampled_df[i] = slicer.reset_index(drop=True)
Eventually I would use str(i) for the column name, because later it is simpler to select a column using a string than a number.
import pandas as pd
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len

sampled_df = pd.DataFrame()

for i in range(1, sample_num+1):
    start = random.randint(0, df_max_row)
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)

sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 735 396 646 534 769
2 736 397 647 535 770
3 737 398 648 536 771
4 738 399 649 537 772
5 739 400 650 538 773
6 740 401 651 539 774
7 741 402 652 540 775
8 742 403 653 541 776
9 743 404 654 542 777
10 744 405 655 543 778
But to get really random values I would first shuffle the values
np.random.shuffle(target_df['pl'])
and then I don't have to use random to select the start.
shuffle changes the original column in place, so its result can't be assigned to a new variable.
This way values don't repeat across samples.
import pandas as pd
#import numpy as np
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10

sampled_df = pd.DataFrame()

#np.random.shuffle(target_df['pl'])
random.shuffle(target_df['pl'])

for i in range(1, sample_num+1):
    start = i * sample_len
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)

sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 638 331 171 989 170
2 22 643 47 136 764
3 969 455 211 763 194
4 859 384 174 552 566
5 221 829 62 926 414
6 4 895 951 967 381
7 758 688 594 876 873
8 757 691 825 693 707
9 235 353 34 699 121
10 447 81 36 682 251
If values can repeat then you could use
sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)
import pandas as pd

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10

sampled_df = pd.DataFrame()

for i in range(1, sample_num+1):
    sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)

sampled_df.index += 1
print(sampled_df)
EDIT
You may also get the shuffled values as a numpy array and use reshape - then convert back to a DataFrame with many columns, and later keep only some of those columns.
import pandas as pd
import random

target_df = pd.DataFrame({'pl': range(1000)})

# create the sampler:
sample_num = 5
sample_len = 10

random.shuffle(target_df['pl'])

sampled_df = pd.DataFrame(target_df['pl'].values.reshape([sample_len, -1]))
sampled_df = sampled_df.iloc[:, 0:sample_num]
sampled_df.index += 1
print(sampled_df)
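A note on the reshape (my reading, not spelled out in the answer): reshape([sample_len, -1]) turns the 1000 shuffled values into a 10x100 array, so each of the 100 columns is a 10-value sample, and iloc[:, 0:sample_num] simply keeps the first five of them.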

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a ValueError: Can only compare identically-labeled Series objects error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output - grower_moo['Fat'] of 13.60 is less than 14 Fat, therefore gets a price per ton of $430
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores (_) instead of spaces.
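For a self-contained run, the two example frames can be rebuilt from the tables above (my reconstruction, not part of the original answer):

import pandas as pd

pricing = pd.DataFrame({
    'Price_Per_Ton': [306, 339, 382, 430, 481, 532, 580, 625,
                      665, 700, 728, 750, 766, 778, 788, 797],
    'Wet_Fat': list(range(10, 26)),
})
grower_moo = pd.DataFrame({
    'Load_Ticket': ['L2019000011817', 'L2019000011816', 'L2019000011815',
                    'L2019000011161', 'L2019000011160'],
    'Net_Fruit_Weight': [56660, 53680, 53560, 62320, 57940],
    'Net_MOO': [833, 1409, 1001, 2737, 1129],
    'Percent_MOO': [1.448872, 2.557679, 1.834644, 4.207080, 1.911324],
    'Fat': [21.92, 21.12, 21.36, 21.41, 20.06],
})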
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

prediction based on updating data

I'm currently working on an online advertisement optimizer project.
Let's assume that the only thing I can change is the CPC (cost per click).
I don't have much data, as the data is updated only once a day.
I want to predict net_income as a function of CPC, and have the program suggest the best CPC value to maximize tomorrow's net_income, based on the data that updates every day.
cpc margin
0 440 -95224.0
1 840 -81620.0
2 530 -57496.0
3 590 -47287.0
4 560 -45681.0
5 590 -52766.0
6 500 -60852.0
7 650 -59653.0
8 480 -48905.0
9 620 -56496.0
10 680 -53614.0
11 590 -44440.0
12 460 -34066.0
13 720 -31086.0
14 590 -23177.0
15 680 -12803.0
16 760 -10625.0
17 590 -20548.0
18 800 -15136.0
19 650 -12804.0
20 420 -63435.0
21 400 -7566.0
22 400 21136.0
23 400 -58585.0
24 400 -14166.0
25 420 -23065.0
26 400 -28533.0
27 380 -14454.0
28 400 -50819.0
29 380 -26356.0
30 400 -26322.0
31 380 -19107.0
32 400 -28270.0
33 380 -88439.0
34 360 -32207.0
35 340 -27632.0
36 340 -18050.0
37 340 -71574.0
38 340 -18050.0
39 320 -20735.0
40 300 -17984.0
41 290 -9426.0
42 280 -16555.0
43 290 2961.0
For instance, say the above data is df.
I tried to use sklearn and LinearRegression to get a prediction:
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df['cpc'], df['margin'])
prediction = model.predict([[300]])
print(prediction[0])
margin is net_income, btw.
So by doing this, I thought I might get the prediction based on the data when CPC is 300, but it returned an error saying:
ValueError: Expected 2D array, got 1D array instead:
array=[440 840 530 590 560 590 500 650 480 620 680 590 460 720 590 680 760 590
800 650 420 400 400 400 400 420 400 380 400 380 400 380 400 380 360 340
340 340 340 320 300 290 280 290].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've been looking for examples using linear regression or logistic regression models, but they all use a 2-D array as input, which doesn't seem to fit my case. I only have one factor that I can change, and the result is simply the net_income (or margin).
How would I use sklearn on my project? Or is there maybe another better way to solve the problem?
I'm pretty new to programming and have no knowledge of math and statistics which makes it harder for me to understand or get keywords to study... please guide me on this.
---------------------------------updated-------------------------------------
Alright, let me give you another df:
cpc margin
0 440 -35224.0
1 340 -11574.0
2 380 -68439.0
3 420 -23435.0
4 840 -81620.0
5 400 -38585.0
6 530 -37496.0
7 590 -7287.0
8 560 -5681.0
9 590 -32766.0
10 500 -60852.0
11 400 -30819.0
12 650 -59653.0
13 480 -28905.0
14 620 -56496.0
15 680 -53614.0
16 590 -44440.0
17 460 -14066.0
18 420 16935.0
19 360 -12207.0
20 400 -8533.0
21 400 -6322.0
22 400 25834.0
23 720 -31086.0
24 400 121136.0
25 400 -28270.0
26 340 1950.0
27 340 1950.0
28 300 2016.0
29 340 -27632.0
30 400 32434.0
31 380 -26356.0
32 590 -23177.0
33 680 7197.0
34 320 -20735.0
35 760 9375.0
36 590 -20548.0
37 290 10574.0
38 380 -19107.0
39 290 42961.0
40 280 -16555.0
41 800 -15136.0
42 380 -14454.0
43 650 -12804.0
Thanks to your answers, I could go further, as shown below.
Once I could run my code without error, I thought that by looping over the input I would be able to find the optimal CPC value.
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame(final_db)
model = LogisticRegression()
x = df[['cpc']]
model.fit(x, df['margin'])

previous_prediction = -99999999999999
df_prediction = []
for i in list(range(10, 1000, 10)):
    prediction = model.predict([[i]])
    df_prediction.append({'cpc': i, 'margin': prediction})
    if prediction > previous_prediction:
        previous_prediction = prediction
        previous_i = i
The result wasn't very satisfying. Based on the data I have, is there any better model to use? To achieve my goal, any other suggestions?
I guess it is complaining about this line:
model.fit(df['cpc'], df['margin'])
The first parameter should be a two-dimensional array. You can use DataFrame indexing,
df[['cpc']]
to get a DataFrame instead of a Series, which will fix the issue.
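A minimal end-to-end sketch of that fix (my example; the three-row df stands in for the question's data):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in data: the first three rows of the question's cpc/margin table
df = pd.DataFrame({'cpc': [440, 840, 530], 'margin': [-95224.0, -81620.0, -57496.0]})

model = LinearRegression()
model.fit(df[['cpc']], df['margin'])  # X is 2-D: a one-column DataFrame

# Predict with the same one-column shape to avoid the 1-D/2-D mismatch
print(model.predict(pd.DataFrame({'cpc': [300]}))[0])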

Converting from AMPL to Pyomo

I'm trying to convert an AMPL model to Pyomo (something I have no experience with). I'm finding the syntax hard to adapt to, especially the constraint and objective sections. I've already set up Python, Anaconda, Pyomo, and GLPK on my computer, and just need to get the actual code down. I'm a beginner coder, so forgive me if my code is poorly written. Still trying to get the hang of this!
Here is the data from the AMPL code:
set PROD := 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30;
set PROD1:= 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30;
ProdCost 414 3 46 519 876 146 827 996 922 308 568 176 58 13 20 974 121 751 130 844 280 123 275 843 717 694 72 413 65 631
HoldingCost 606 308 757 851 148 867 336 44 364 960 69 428 778 485 285 938 980 932 199 175 625 513 536 965 366 950 632 88 698 744
Demand 105 70 135 67 102 25 147 69 23 84 32 41 81 133 180 22 174 80 24 156 28 125 23 137 180 151 39 138 196 69
And here is the model:
set PROD; # set of production amounts
set PROD1; # set of holding amounts
param ProdCost {PROD} >= 0; # parameter set of production costs
param Demand {PROD} >= 0; # parameter set of demand at each time
param HoldingCost {PROD} >= 0; # parameter set of holding costs
var Inventory {PROD1} >= 0; # variable that sets inventory amount at each time
var Make {p in PROD} >= 0; # variable of amount produced at each time
minimize Total_Cost: sum {p in PROD} ((ProdCost[p] * Make[p]) + (Inventory[p] * HoldingCost[p]));
# Objective: minimize total cost from all production and holding cost
subject to InventoryConstraint {p in PROD}: Inventory[p] = Inventory[p-1] + Make[p] - Demand[p];
# excess production transfers to inventory
subject to MeetDemandConstraint {p in PROD}: Make[p] >= Demand[p] - Inventory[p-1];
# Constraint: holding and production must exceed demand
subject to InitialInventoryConstraint: Inventory[0] = 0;
# Constraint: Inventory must start at 0
Here's what I have so far. Not sure if it's right or not:
from pyomo.environ import *

demand=[105,70,135,67,102,25,147,69,23,84,32,41,81,133,180,22,174,80,24,156,28,125,23,137,180,151,39,138,196,69]
holdingcost=[606,308,757,851,148,867,336,44,364,960,69,428,778,485,285,938,980,932,199,175,625,513,536,965,366,950,632,88,698,744]
productioncost=[414,3,46,519,876,146,827,996,922,308,568,176,58,13,20,974,121,751,130,844,280,123,275,843,717,694,72,413,65,631]

model = ConcreteModel()
model.I = RangeSet(1, 30)
model.J = RangeSet(0, 30)
model.x = Var(model.I, within=NonNegativeIntegers)
model.y = Var(model.J, within=NonNegativeIntegers)

model.obj = Objective(expr=sum(model.x[i]*productioncost[i] + model.y[i]*holdingcost[i] for i in model.I))

def InventoryConstraint(model, i):
    return model.y[i-1] + model.x[i] - demand[i] <= model.y[i]
InvCont = Constraint(model, rule=InventoryConstraint)

def MeetDemandConstraint(model, i):
    return model.x[i] >= demand[i] - model.y[i-1]
DemCont = Constraint(model, rule=MeetDemandConstraint)

def Initial(model):
    return model.y[0] == 0
model.Init = Constraint(rule=Initial)

opt = SolverFactory('glpk')
results = opt.solve(model, load_solutions=True)
model.solutions.store_to(results)
results.write()
Thanks!
The only issues I see are in some of your constraint declarations. You need to attach the constraints to the model and the first argument passed in should be the indexing set (which I'm assuming should be model.I).
def InventoryConstraint(model, i):
    return model.y[i-1] + model.x[i] - demand[i] <= model.y[i]
model.InvCont = Constraint(model.I, rule=InventoryConstraint)

def MeetDemandConstraint(model, i):
    return model.x[i] >= demand[i] - model.y[i-1]
model.DemCont = Constraint(model.I, rule=MeetDemandConstraint)
The syntax that you're using to solve the model is a little outdated but should work. Another option would be:
opt = SolverFactory('glpk')
opt.solve(model,tee=True) # The 'tee' option prints the solver output to the screen
model.display() # This will print a summary of the model solution
Another command that is useful for debugging is model.pprint(). This will display the entire model including the expressions for Constraints and Objectives.
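One more pitfall worth flagging (my observation, beyond the original answer): model.I runs from 1 to 30, while the Python lists demand, holdingcost, and productioncost are 0-based, so demand[i] inside a rule is shifted by one and raises an IndexError at i=30. One way to line them up is to key the data by 1..30, for example:

# my sketch, not from the original answer: 1-based dict to match model.I
demand_d = {i: demand[i-1] for i in range(1, 31)}

def InventoryConstraint(model, i):
    return model.y[i-1] + model.x[i] - demand_d[i] <= model.y[i]
model.InvCont = Constraint(model.I, rule=InventoryConstraint)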

Python selecting items by comparing values in a table using dictionary

I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). The second column (sseqid) repeats with different values in the 11th and 12th columns, which are evalue and bitscore, respectively.
The ones that I would like to get are those having the lowest evalue and the highest bitscore (when evalues are the same); the rest of the columns can be ignored, and the data is down below.
So, I have made a short piece of code which uses the second column as a key for the dictionary. I can get five different items from the second column with lists of qseqid+evalue and qseqid+bitscore.
Here is the code:
#!/usr/bin/python
filename = "data.txt"
readfile = open(filename, "r")

d = dict()
for i in readfile.readlines():
    i = i.strip()
    i = i.split("\t")
    d.setdefault(i[1], []).append([i[0], i[10]])
    d.setdefault(i[1], []).append([i[0], i[11]])

for x in d:
    print(x, d[x])

readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
The data.txt file (including the header row, and with » representing tab characters):
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to do this manually, but for comparison I'll show how it can be done using the pandas library, which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too (see the short note after the explanation below).
import pandas as pd

df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort_values(["evalue", "bitscore"], ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
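And the "separately" variant mentioned above is indeed a pair of one-liners (my addition, not in the original answer):

lowest_evalue = df.groupby("sseqid")["evalue"].min()
highest_bitscore = df.groupby("sseqid")["bitscore"].max()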
#!/usr/bin/python
import csv

DATA = "data.txt"

class Sequence:
    def __init__(self, row):
        self.qseqid = row[0]
        self.sseqid = row[1]
        self.pident = float(row[2])
        self.length = int(row[3])
        self.mismatch = int(row[4])
        self.gapopen = int(row[5])
        self.qstart = int(row[6])
        self.qend = int(row[7])
        self.sstart = int(row[8])
        self.send = int(row[9])
        self.evalue = float(row[10])
        self.bitscore = float(row[11])

    def __str__(self):
        return (
            "{qseqid}\t"
            "{sseqid}\t"
            "{pident}\t"
            "{length}\t"
            "{mismatch}\t"
            "{gapopen}\t"
            "{qstart}\t"
            "{qend}\t"
            "{sstart}\t"
            "{send}\t"
            "{evalue}\t"
            "{bitscore}"
        ).format(**self.__dict__)

def entries(fname, header_rows=1, dtype=list, **kwargs):
    with open(fname) as inf:
        incsv = csv.reader(inf, **kwargs)
        # skip header rows
        for i in range(header_rows):
            next(incsv)
        for row in incsv:
            yield dtype(row)

def main():
    bestseq = {}
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
        # see if a sequence with the same sseqid already exists
        prev = bestseq.get(seq.sseqid, None)
        if (prev is None
                or seq.evalue < prev.evalue
                or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)):
            bestseq[seq.sseqid] = seq
    # display selected sequences
    keys = sorted(bestseq)
    for key in keys:
        print(bestseq[key])

if __name__ == "__main__":
    main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using the pandas library, it's quite possible to do what you want without resorting to third-party modules. The following uses the collections.defaultdict class to facilitate the creation of dictionaries of variable-length lists of records. The use of the AttrDict class is optional, but it makes accessing the fields of each dictionary-based record easier and less awkward-looking than the usual dict['fieldname'] syntax otherwise required.
import csv
from collections import defaultdict
from functools import cmp_to_key
from operator import itemgetter

data_file_name = 'data.txt'
DELIMITER = '\t'

ssqeid_dict = defaultdict(list)

def cmp(a, b):
    # replacement for the cmp() built-in that Python 3 removed
    return (a > b) - (a < b)

# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
                  (itemgetter(col.strip()), 1)) for col in columns]
    def comparer(left, right):
        for fn, mult in comparers:
            result = cmp(fn(left), fn(right))
            if result:
                return mult * result
        return 0
    return sorted(items, key=cmp_to_key(comparer))

# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

with open(data_file_name) as data_file:
    reader = csv.DictReader(data_file, delimiter=DELIMITER)
    format_spec = '\t'.join('{%s}' % field for field in reader.fieldnames)
    for rec in (AttrDict(r) for r in reader):
        # Convert the two sort fields to numeric values for proper ordering.
        rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
        ssqeid_dict[rec.sseqid].append(rec)

for ssqeid in sorted(ssqeid_dict):
    # Sort each group of recs with the same ssqeid. The first record after
    # sorting is the one sought: lowest evalue, then highest bitscore.
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
    print(format_spec.format(**selected))
Output (» represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
filename = 'data.txt'
readfile = open(filename, 'r')

d = dict()
sseqid = []
lines = []
for i in readfile.readlines():
    sseqid.append(i.rsplit()[1])
    lines.append(i.rsplit())

sorted_sseqid = sorted(set(sseqid))

sdqDict = {}
key = None
for sorted_ssqd in sorted_sseqid:
    key = sorted_ssqd
    evalue = []
    bitscore = []
    qseid = []
    for line in lines:
        if key in line:
            evalue.append(line[10])
            bitscore.append(line[11])
            qseid.append(line[0])
    sdqDict[key] = [qseid, evalue, bitscore]

print(sdqDict)
# note: the evalues are still strings here, so min() compares them
# lexicographically; convert with float() for a numeric minimum
print('TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1]))
# I think you can do the list manipulation below to find out the qseqid
readfile.close()
