Pandas Merge Error: MemoryError - python

Problem:
I'm trying to merge two relatively small datasets together, but the merge raises a MemoryError. I have two datasets of aggregates of country trade data that I'm trying to merge on the keys year and country, so the data needs to be placed precisely. This unfortunately rules out concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!

In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.
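For comparison, the same aligned result can probably also be had with DataFrame.join, which joins on the index by default (a sketch using the df1/df2 frames above; how='outer' mirrors concat's default alignment):
both = df1.set_index(['A','B']).join(df2.set_index(['A','B']), how='outer').reset_index()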

How to simplify data.table logic and make it doable in pandas?

I have a dataframe with multiple numerical columns. I want to add a new column that compares the values of those columns and holds, as a label, the name of the column with the maximum. I already understand the logic in R, but I'm wondering how to do it easily in Python. Can anyone show me how to add a new column that compares the values of multiple columns and assigns the name of the column holding the max value in each row? Any ideas?
reproducible example
this is a 100% working reproducible example in R:
library(data.table)
df <- data.frame(a = sample(seq(1:10), size = 10), b = sample(LETTERS[1:10], size = 10),
                 cnt = sample(seq(1:100), size = 5),
                 RECENT_MOV = sample(seq(1:1000), size = 10),
                 RETIRED = sample(seq(1:200), size = 10),
                 SERV_EMPL = sample(seq(1:500), size = 10),
                 SUB_BUS = sample(seq(1:2000), size = 10),
                 WORK_HOME = sample(seq(1:1200), size = 10))
dt <- as.data.table(df)
write.csv(dt, "sample.csv")
label = c("RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME")
df$category <- NA_character_
df[, row_ind:= 1:nrow(df)]
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
current output is:
> dput(dt)
structure(list(a = c(5L, 10L, 1L, 6L, 7L, 3L, 2L, 8L, 4L, 9L),
b = c("E", "A", "D", "H", "J", "F", "G", "I", "C", "B"),
cnt = c(13L, 88L, 45L, 92L, 70L, 13L, 88L, 45L, 92L, 70L),
RECENT_MOV = c(70L, 195L, 620L, 572L, 354L, 648L, 798L, 657L,
233L, 672L), RETIRED = c(189L, 195L, 191L, 88L, 148L, 186L,
39L, 78L, 158L, 55L), SERV_EMPL = c(65L, 151L, 415L, 383L,
255L, 207L, 210L, 470L, 181L, 188L), SUB_BUS = c(894L, 829L,
1798L, 502L, 897L, 1461L, 744L, 1991L, 260L, 1697L), WORK_HOME = c(553L,
739L, 454L, 137L, 435L, 1042L, 316L, 697L, 517L, 1158L),
category = c("SUB_BUS", "SUB_BUS", "SUB_BUS", "RECENT_MOV",
"SUB_BUS", "SUB_BUS", "RECENT_MOV", "SUB_BUS", "WORK_HOME",
"SUB_BUS"), row_ind = 1:10), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000015a64b61ef0>)
my current python attempt
import pandas as pd
df=pd.read_csv("sample.csv", index_col=None, header=0)
label = ["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]
df['category'] = pd.NA
df['row_ind'] = range(1, len(df) + 1)  # 1:nrow(df) in R includes both endpoints
however, I'm having trouble writing this line in a pythonic way:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
basically, this line says: create a new column called category; compare the columns listed in label, and whichever column has the max value, assign its name as the value in the category column. How can I do this easily in python?
logic translation:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
this line says: first filter on the cnt column, keeping rows where cnt > 2; then compare the values of df[["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS", "WORK_HOME"]] row-wise, pick the column with the highest value in each row, and assign that column's name as the value: df['category'] = col_name_with_highest_value_in_each_row.
desirable output
this is desirable output that I want to produce in python:
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 5 E 13 70 189 65 894 553 SUB_BUS 1
2 10 A 88 195 195 151 829 739 SUB_BUS 2
3 1 D 45 620 191 415 1798 454 SUB_BUS 3
4 6 H 92 572 88 383 502 137 RECENT_MOV 4
5 7 J 70 354 148 255 897 435 SUB_BUS 5
6 3 F 13 648 186 207 1461 1042 SUB_BUS 6
7 2 G 88 798 39 210 744 316 RECENT_MOV 7
8 8 I 45 657 78 470 1991 697 SUB_BUS 8
9 4 C 92 233 158 181 260 517 WORK_HOME 9
10 9 B 70 672 55 188 1697 1158 SUB_BUS 10
This is actually really simple with pandas. Have a list of the columns to search in, and then use idxmax with axis=1:
# Filter out rows where `cnt` is less than or equal to 2
# (.copy() avoids a SettingWithCopyWarning on the assignments below)
df = df[df['cnt'] > 2].copy()
# Determine the category for each row: the name of the max-valued search column
search_cols = ['RECENT_MOV', 'RETIRED', 'SERV_EMPL', 'SUB_BUS', 'WORK_HOME']
df['category'] = df[search_cols].idxmax(axis=1)
# Assign row indexes
df['row_ind'] = df.index
Output:
>>> df
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 1 C 76 452 62 55 115 247 RECENT_MOV 1
2 7 E 14 50 165 337 1165 810 SUB_BUS 2
3 2 A 46 523 167 423 784 707 SUB_BUS 3
4 3 H 3 38 144 473 745 437 SUB_BUS 4
5 5 I 59 743 127 261 351 190 RECENT_MOV 5
6 8 J 76 143 49 470 1612 935 SUB_BUS 6
7 4 D 14 818 101 418 1919 314 SUB_BUS 7
8 6 F 46 714 9 446 1432 938 SUB_BUS 8
9 10 B 3 585 160 14 107 489 RECENT_MOV 9
10 9 G 59 814 73 449 937 287 SUB_BUS 10

Python Pandas apply qcut to grouped by level 0 of multi-index in multi-index dataframe

I have a multi-index dataframe in pandas (date and entity_id) and for each date/entity I have observations of a number of variables (A, B ...). My goal is to create a dataframe with the same shape but where the values are replaced by their decile scores.
My test data looks like this:
I want to apply qcut to each column, grouped by level 0 of the multi-index. The issue I have is creating a result DataFrame.
This code
def qcut_sub_index(df_with_sub_index):
    # create empty return value same shape as passed dataframe
    df_return = pd.DataFrame()
    for date, sub_df in df_with_sub_index.groupby(level=0):
        df_return = df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False, duplicates='drop')))
    print(df_return)
    return df_return

print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
returns
A
as_at_date entity_id
2008-01-27 2928 0
2932 3
3083 6
3333 9
2008-02-27 2928 3
2935 9
3333 0
3874 6
2008-03-27 2928 1
2932 2
2934 0
2936 9
2937 4
2939 9
2940 7
2943 3
2944 0
2945 8
2946 6
2947 5
2949 4
B
as_at_date entity_id
2008-01-27 2928 9
2932 6
3083 0
3333 3
2008-02-27 2928 6
2935 0
3333 3
3874 9
2008-03-27 2928 0
2932 9
2934 2
2936 8
2937 7
2939 6
2940 3
2943 1
2944 4
2945 9
2946 5
2947 4
2949 0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-104-72ff0e6da288> in <module>
11
12
---> 13 print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7546 kwds=kwds,
7547 )
-> 7548 return op.get_result()
7549
7550 def applymap(self, func) -> "DataFrame":
~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
272
273 # wrap results
--> 274 return self.wrap_results(results, res_index)
275
276 def apply_series_generator(self) -> Tuple[ResType, "Index"]:
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results(self, results, res_index)
313 # see if we can infer the results
314 if len(results) > 0 and 0 in results and is_sequence(results[0]):
--> 315 return self.wrap_results_for_axis(results, res_index)
316
317 # dict of scalars
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results_for_axis(self, results, res_index)
369
370 try:
--> 371 result = self.obj._constructor(data=results)
372 except ValueError as err:
373 if "arrays must all be same length" in str(err):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
466
467 elif isinstance(data, dict):
--> 468 mgr = init_dict(data, index, columns, dtype=dtype)
469 elif isinstance(data, ma.MaskedArray):
470 import numpy.ma.mrecords as mrecords
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
281 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
282 ]
--> 283 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
284
285
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
76 # figure out the index, if necessary
77 if index is None:
---> 78 index = extract_index(arrays)
79 else:
80 index = ensure_index(index)
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
so something is preventing the second application of the lambda function.
I'd appreciate your help, thanks for taking a look.
p.s. if this can be done implicitly without using apply, I'd love to hear it. thanks
Your solution appears overcomplicated. Your terminology is also non-standard: multi-indexes have levels, so the task is better stated as qcut() by level 0 of the multi-index (not in terms of sub-frames, which are not a pandas concept).
Bringing it all together:
- use the **kwargs approach to pass arguments to assign() for all columns in the data frame
- groupby(level=0) is as_at_date
- transform() to get a row back for every entry in the index
import datetime as dt
import numpy as np
import pandas as pd

s = 12
df = pd.DataFrame({"as_at_date": np.random.choice(pd.date_range(dt.date(2020,1,27), periods=3, freq="M"), s),
                   "entity_id": np.random.randint(2900, 3500, s),
                   "A": np.random.random(s),
                   "B": np.random.random(s) * (10**np.random.randint(8,10,s))
                   }).sort_values(["as_at_date","entity_id"])
df = df.set_index(["as_at_date","entity_id"])
df2 = df.assign(**{c: df.groupby(level=0)[c].transform(lambda x: pd.qcut(x, 10, labels=False))
                   for c in df.columns})
df
A B
as_at_date entity_id
2020-01-31 2926 0.770121 2.883519e+07
2943 0.187747 1.167975e+08
2973 0.371721 3.133071e+07
3104 0.243347 4.497294e+08
3253 0.591022 7.796131e+08
3362 0.810001 6.438441e+08
2020-02-29 3185 0.690875 4.513044e+08
3304 0.311436 4.561929e+07
2020-03-31 2953 0.325846 7.770111e+08
2981 0.918461 7.594753e+08
3034 0.133053 6.767501e+08
3355 0.624519 6.318104e+07
df2
A B
as_at_date entity_id
2020-01-31 2926 7 0
2943 0 3
2973 3 1
3104 1 5
3253 5 9
3362 9 7
2020-02-29 3185 9 9
3304 0 0
2020-03-31 2953 3 9
2981 9 6
3034 0 3
3355 6 0
Using concat inside an iteration on the original dataframe does the trick, but is there a smarter way to do this?
thanks
def qcut_sub_index(df_with_sub_index):
    # create empty return value same shape as passed dataframe
    df_return = pd.DataFrame()
    for date, sub_df in df_with_sub_index.groupby(level=0):
        df_return = df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False,
                                                          duplicates='drop')))
    return df_return

df_x = pd.DataFrame()
for (columnName, columnData) in df_values.iteritems():
    df_x = pd.concat([df_x, qcut_sub_index(columnData)], axis=1, join="outer")
df_x
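For what it's worth, the transform-based approach in the answer above removes both loops entirely; a minimal sketch, assuming df_values is the multi-indexed frame from the question:
# Decile-score every column within each level-0 (date) group in one call
df_x = df_values.groupby(level=0).transform(
    lambda x: pd.qcut(x, 10, labels=False, duplicates='drop'))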

Pandas: How to (cleanly) unpivot two columns with same category?

I'm trying to unpivot two columns inside a pandas dataframe. The transformation I seek would be the inverse of this question.
We start with a dataset that looks like this:
import pandas as pd
import numpy as np
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4,5)),
                       columns=['accuracy','time_a','time_b','memory_a', 'memory_b'])
df_orig
accuracy time_a time_b memory_a memory_b
0 6 118 170 102 239
1 241 9 166 159 162
2 164 70 76 228 121
3 228 121 135 128 92
I wish to unpivot both the memory and time columns, obtaining this dataset as the result:
df
accuracy memory category time
0 6 102 a 118
1 241 159 a 9
2 164 228 a 70
3 228 128 a 121
12 6 239 b 170
13 241 162 b 166
14 164 121 b 76
15 228 92 b 135
So far I have managed to get my desired output using df.melt() twice plus some extra commands:
df = df_orig.copy()
# Unpivot memory columns
df = df.melt(id_vars=['accuracy','time_a', 'time_b'],
             value_vars=['memory_a', 'memory_b'],
             value_name='memory',
             var_name='mem_cat')
# Unpivot time columns
df = df.melt(id_vars=['accuracy','memory', 'mem_cat'],
             value_vars=['time_a', 'time_b'],
             value_name='time',
             var_name='time_cat')
# Keep only the 'a'/'b' suffix as the category
df.mem_cat = df.mem_cat.str[-1]
df.time_cat = df.time_cat.str[-1]
# Keep only the rows whose categories match (DIRTY!)
df = df[df.mem_cat == df.time_cat]
# Remove the duplicated category column.
df = df.drop(columns='time_cat').rename(columns={"mem_cat": 'category'})
Given how easy it was to solve the inverse question, I believe my code is way too complex. Can anyone do it better?
Use wide_to_long:
np.random.seed(123)
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4,5)),
                       columns=['accuracy','time_a','time_b','memory_a', 'memory_b'])

df = (pd.wide_to_long(df_orig.reset_index(),
                      stubnames=['time','memory'],
                      i='index',
                      j='category',
                      sep='_',
                      suffix=r'\w+')
        .reset_index(level=1)
        .reset_index(drop=True)
        .rename_axis(None))
print(df)
category accuracy time memory
0 a 254 109 66
1 a 98 230 83
2 a 123 57 225
3 a 113 126 73
4 b 254 126 220
5 b 98 17 106
6 b 123 214 96
7 b 113 47 32
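An alternative sketch without wide_to_long, assuming the same df_orig: split the column names into a two-level column MultiIndex and stack the suffix level.
# Turn time_a, memory_b, ... into ('time','a'), ('memory','b'), ... then stack
tmp = df_orig.set_index('accuracy')
tmp.columns = pd.MultiIndex.from_tuples(
    [tuple(c.rsplit('_', 1)) for c in tmp.columns], names=[None, 'category'])
df = tmp.stack('category').reset_index()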

Pandas: Get highest value from a column for each unique value in another column

How do you get the highest value in one column for each unique value in another column, returning the same dataframe structure back?
Here is an example pandas dataframe:
reg.nr counter value ID2 categ date
1 37367 421 231385 93 A 20.01.2004
2 37368 428 235156 93 B 21.01.2004
3 37369 408 234251 93 C 22.01.2004
4 37372 403 196292 93 D 23.01.2004
5 55523 400 247141 139 E 24.01.2004
6 55575 415 215818 139 F 25.01.2004
7 55576 402 204404 139 A 26.01.2004
8 69940 402 62244 175 B 27.01.2004
9 69941 402 38274 175 C 28.01.2004
10 69942 404 55171 175 D 29.01.2004
11 69943 416 55495 175 E 30.01.2004
12 69944 407 90231 175 F 31.01.2004
13 69945 411 75382 175 A 01.02.2004
14 69948 405 119129 175 B 02.02.2004
I want to return the row with the highest value of column "counter" for each unique value of column "ID2". Afterwards, the new pandas dataframe should look like this:
reg.nr counter value ID2 categ date
1 37368 428 235156 93 B 21.01.2004
2 55575 415 215818 139 F 25.01.2004
3 69943 416 55495 175 E 30.01.2004
One way using drop_duplicates
In [332]: df.sort_values('counter', ascending=False).drop_duplicates(['ID2'])
Out[332]:
reg.nr counter value ID2 categ date
2 37368 428 235156 93 B 21.01.2004
11 69943 416 55495 175 E 30.01.2004
6 55575 415 215818 139 F 25.01.2004
For the desired output, you could sort on two columns and reset the index:
In [336]: (df.sort_values(['ID2', 'counter'], ascending=[True, False])
.drop_duplicates(['ID2']).reset_index(drop=True)
)
Out[336]:
reg.nr counter value ID2 categ date
0 37368 428 235156 93 B 21.01.2004
1 55575 415 215818 139 F 25.01.2004
2 69943 416 55495 175 E 30.01.2004
df.loc[df.groupby('ID2')['counter'].idxmax(), :].reset_index()
index reg.nr counter value ID2 categ date
0 2 37368 428 235156 93 B 21.01.2004
1 6 55575 415 215818 139 F 25.01.2004
2 11 69943 416 55495 175 E 30.01.2004
First, you group your dataframe by column ID2. Then you take the counter column and calculate the index of the (first) maximal element of this column in each group. Then you use these indexes to filter your initial dataframe. Finally, you reset the index (if you need to).

Python selecting items by comparing values in a table using dictionary

I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). The second column (sseqid) repeats with different values in the 11th and 12th columns, which are evalue and bitscore, respectively.
The ones that I would like to get are those having the lowest evalue and the highest bitscore (when evalues are the same; the rest of the columns can be ignored, and the data is down below).
So, I have made a short piece of code which uses the second column as a key for the dictionary. I can get five different items from the second column with lists of qseqid+evalue and qseqid+bitscore.
Here is the code:
#!usr/bin/python
filename = "data.txt"
readfile = open(filename, "r")
d = dict()
for i in readfile.readlines():
    i = i.strip()
    i = i.split("\t")
    d.setdefault(i[1], []).append([i[0], i[10]])
    d.setdefault(i[1], []).append([i[0], i[11]])
for x in d:
    print(x, d[x])
readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
The data.txt file (including the header row and with » representing tab characters):
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to do this manually, but for comparison I'll show how it can be done using the pandas library, which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too (see the sketch after the explanation below).
import pandas as pd
df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort_values(["evalue", "bitscore"], ascending=[True, False])  # df.sort() was removed in later pandas versions
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
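And for the "separately" reading mentioned above, a minimal sketch (assuming the same df; idxmin/idxmax keep the first row on ties):
# Rows holding each sseqid's lowest evalue and highest bitscore, independently
lowest_evalue = df.loc[df.groupby("sseqid")["evalue"].idxmin()]
highest_bitscore = df.loc[df.groupby("sseqid")["bitscore"].idxmax()]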
#!usr/bin/python
import csv

DATA = "data.txt"

class Sequence:
    def __init__(self, row):
        self.qseqid = row[0]
        self.sseqid = row[1]
        self.pident = float(row[2])
        self.length = int(row[3])
        self.mismatch = int(row[4])
        self.gapopen = int(row[5])
        self.qstart = int(row[6])
        self.qend = int(row[7])
        self.sstart = int(row[8])
        self.send = int(row[9])
        self.evalue = float(row[10])
        self.bitscore = float(row[11])

    def __str__(self):
        return (
            "{qseqid}\t"
            "{sseqid}\t"
            "{pident}\t"
            "{length}\t"
            "{mismatch}\t"
            "{gapopen}\t"
            "{qstart}\t"
            "{qend}\t"
            "{sstart}\t"
            "{send}\t"
            "{evalue}\t"
            "{bitscore}"
        ).format(**self.__dict__)

def entries(fname, header_rows=1, dtype=list, **kwargs):
    with open(fname) as inf:
        incsv = csv.reader(inf, **kwargs)
        # skip header rows
        for i in range(header_rows):
            next(incsv)
        for row in incsv:
            yield dtype(row)

def main():
    bestseq = {}
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
        # see if a sequence with the same sseqid already exists
        prev = bestseq.get(seq.sseqid, None)
        if (
            prev is None
            or seq.evalue < prev.evalue
            or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
        ):
            bestseq[seq.sseqid] = seq
    # display selected sequences
    keys = sorted(bestseq)
    for key in keys:
        print(bestseq[key])

if __name__ == "__main__":
    main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using the pandas library, it's quite possible to do what you want without resorting to third-party modules. The following uses the collections.defaultdict class to facilitate creation of dictionaries of variable-length lists of records. The use of the AttrDict class is optional, but it makes accessing the fields of each dictionary-based record easier and is less awkward-looking than the usual dict['fieldname'] syntax otherwise required.
import csv
from collections import defaultdict
from operator import itemgetter

data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)

# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
                  (itemgetter(col.strip()), 1)) for col in columns]
    def comparer(left, right):
        for fn, mult in comparers:
            result = cmp(fn(left), fn(right))
            if result:
                return mult * result
        else:
            return 0
    return sorted(items, cmp=comparer)

# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

with open(data_file_name, 'rb') as data_file:
    reader = csv.DictReader(data_file, delimiter=DELIMITER)
    format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
    for rec in (AttrDict(r) for r in reader):
        # Convert the two sort fields to numeric values for proper ordering.
        rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
        ssqeid_dict[rec.sseqid].append(rec)

for ssqeid in sorted(ssqeid_dict):
    # Sort each group of recs with same ssqeid. The first record after sorting
    # will be the one sought that has the lowest evalue and highest bitscore.
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
    print format_spec.format(**selected)
Output (» represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
filename = 'data.txt'
readfile = open(filename, 'r')
sseqid = []
lines = []
for i in readfile.readlines()[1:]:   # [1:] skips the header row
    sseqid.append(i.rsplit()[1])
    lines.append(i.rsplit())
sorted_sseqid = sorted(set(sseqid))

sdqDict = {}
key = None
for sorted_ssqd in sorted_sseqid:
    key = sorted_ssqd
    evalue = []
    bitscore = []
    qseid = []
    for line in lines:
        if key in line:
            evalue.append(line[10])
            bitscore.append(line[11])
            qseid.append(line[0])
    sdqDict[key] = [qseid, evalue, bitscore]

print(sdqDict)
# compare evalues as numbers, not strings, when taking the minimum
print('TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1], key=float))
##I think you can do the list manipulation below to find out the qseqid
readfile.close()
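A possible completion of that last step, as a sketch rather than part of the original answer: for each sseqid, pick the qseqid with the lowest evalue, breaking ties by the highest bitscore (note the float() conversions, since the lists hold strings):
for key in sorted(sdqDict):
    qseqids, evalues, bitscores = sdqDict[key]
    # index of the entry with the lowest evalue, then the highest bitscore
    best = min(range(len(qseqids)),
               key=lambda j: (float(evalues[j]), -float(bitscores[j])))
    print("%s %s %s %s" % (key, qseqids[best], evalues[best], bitscores[best]))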
