Python dict timing mystery

I'm doing sequence alignment and have run into a rather mysterious timing issue related to the origin of my dict data structure.
I have a function alignment(s1, s2, scores), which takes two strings s1 and s2 and a scoring matrix (as a Python dict) for each possible pair drawn from the 20 amino acids and a gap '-'. So scores has 440 (char1, char2) keys, with integer values.
Here is the mystery: If I read scores from a text file (call it scores1) and run
alignment(s1, s2, scores1)
for some 1000-ish long strings s1, s2 of amino acids I get the following timing (using cProfile and not showing the function output):
2537776 function calls in 11.796 seconds
Now if I create exactly the same dict directly in my code (call it scores2) and run
alignment(s1, s2, scores2)
I get the same output, but in about a third of the time:
2537776 function calls in 4.263 seconds
The output in both cases is identical, it is just the timing that is different.
Running print scores1 == scores2 results in True, so they contain identical information.
I verified that using an arbitrary function (instead of alignment) that accesses the dict
many times yields the same factor of 3 timing discrepancy in the two cases.
There must be some metadata related to where the dicts originated from that is slowing down my function (when from a file), even though in both cases I actually read in the file.
I tried creating a new dict object for each via scores1 = dict(scores1) etc., but the same timing discrepancy persists. Quite confusing, but I'm pretty sure there will be a good lesson in this if I can figure it out.
scores1 = create_score_dict_from_file('lcs_scores.txt')
scores2 = create_score_dict(find_alp(s1, s2), match=1, mismatch=0, indel=0)
print scores1 == scores2 # True
alignment(s1, s2, scores1) # gives right answer in about 12s
alignment(s1, s2, scores2) # gives right answer in about 4s
EDIT: Added code and results below:
Here is a simplified version of the code:
import numpy as np
from time import time

def create_scores_from_file(score_file, sigma=0):
    """
    Creates a dict of the scores for each pair in an alphabet,
    as well as each indel (an amino acid, paired with '-'), which is scored -sigma.
    """
    f = open(score_file, 'r')
    alp = f.readline().strip().split()
    scores = []
    for line in f:
        scores.append(map(int, line.strip().split()[1:]))
    f.close()
    scores = np.array(scores)
    score_dict = {}
    for c1 in range(len(alp)):
        score_dict[(alp[c1], '-')] = -sigma
        score_dict[('-', alp[c1])] = -sigma
        for c2 in range(len(alp)):
            score_dict[(alp[c1], alp[c2])] = scores[c1, c2]
    return score_dict

def score_matrix(alp=('A', 'C', 'G', 'T'), match=1, mismatch=0, indel=0):
    score_dict = {}
    for c1 in range(len(alp)):
        score_dict[(alp[c1], '-')] = indel
        score_dict[('-', alp[c1])] = indel
        for c2 in range(len(alp)):
            score_dict[(alp[c1], alp[c2])] = match if c1 == c2 else mismatch
    return score_dict

def use_dict_in_function(n, d):
    start = time()
    count = 0
    for i in xrange(n):
        for k in d.keys():
            count += d[k]
    print "Time: ", time() - start
    return count

def timing_test():
    alp = tuple('A C D E F G H I K L M N P Q R S T V W Y'.split())
    scores1 = create_scores_from_file('lcs_scores.txt')
    scores2 = score_matrix(alp, match=1, mismatch=0, indel=0)
    print type(scores1), id(scores1)
    print type(scores2), id(scores2)
    print repr(scores1)
    print repr(scores2)
    print type(list(scores1)[0][0])
    print type(list(scores2)[0][0])
    print scores1 == scores2
    print repr(scores1) == repr(scores2)
    n = 10000
    use_dict_in_function(n, scores1)
    use_dict_in_function(n, scores2)

if __name__ == "__main__":
    timing_test()
The results are:
<type 'dict'> 140309927965024
<type 'dict'> 140309928036128
{('S', 'W'): 0, ('G', 'G'): 1, ('E', 'M'): 0, ('P', '-'): 0,... (440 key: values)
{('S', 'W'): 0, ('G', 'G'): 1, ('E', 'M'): 0, ('P', '-'): 0,... (440 key: values)
<type 'str'>
<type 'str'>
True
True
Time: 1.51075315475
Time: 0.352770090103
Here is the contents of the file lcs_scores.txt:
A C D E F G H I K L M N P Q R S T V W Y
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
N 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Which version of Python? And print the repr() of each dict to ensure they really are the same (not just that they compare equal); I can't guess otherwise. For example, perhaps you're using Python 2, and in one case your char1 and char2 are plain strings but in the other they're Unicode strings. Comparison would then say they're the same, but repr() will show the difference:
>>> d1 = {"a": 1}
>>> d2 = {u"a": 1}
>>> d1 == d2
True
>>> print repr(d1), repr(d2)
{'a': 1} {u'a': 1}
In any case, in CPython there is absolutely no internal "metadata" recording where any object came from.
EDIT - something to try
Wonderful job whittling down the problem! This is becoming a pleasure :-) I'd like you to try something. First comment out this line:
scores = np.array(scores)
Then change this line:
score_dict[(alp[c1], alp[c2])] = scores[c1, c2]
to:
score_dict[(alp[c1], alp[c2])] = scores[c1][c2]
^^^^^^
When I do that, the two methods return essentially identical times. I'm not a numpy expert, but my guess is that your "from file" code is using a machine-native numpy integer type for the dict values, and that there's substantial overhead to convert those into Python integers whenever the values are used.
Or maybe not - but that's my guess for now, and I'm sticking to it ;-)
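Tim's guess is easy to demonstrate in isolation. The sketch below (my addition, written for Python 3) builds two equal dicts whose values differ only in type, plain int versus numpy.int64: they compare equal, yet consuming the numpy-valued one is noticeably slower because every use of a value goes through numpy-scalar machinery.

```python
import numpy as np
from timeit import timeit

# Same keys, same values; only the value *types* differ.
py_dict = {(a, b): 1 if a == b else 0 for a in 'ACGT' for b in 'ACGT'}
np_dict = {k: np.int64(v) for k, v in py_dict.items()}

print(py_dict == np_dict)              # True, just like scores1 == scores2
print(type(py_dict[('A', 'A')]))       # <class 'int'>
print(type(np_dict[('A', 'A')]))       # <class 'numpy.int64'>

# Repeatedly consuming the values exposes the numpy-scalar overhead.
t_py = timeit(lambda: sum(py_dict.values()), number=20000)
t_np = timeit(lambda: sum(np_dict.values()), number=20000)
print("int: %.3fs  np.int64: %.3fs" % (t_py, t_np))
```

This is exactly the situation the "from file" path creates: indexing `scores[c1, c2]` on a NumPy array yields numpy integers, while `scores[c1][c2]` on a plain list of lists yields Python ints.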

Related

Parsing values to specific columns in Pandas

I would like to use Pandas to parse Q26 Challenges into the subsequent columns, with a "1" representing its presence in the original unparsed column. So the data frame initially looks like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      0      0      0
2   1,2             0      0      0      0      0      0      0
3   1,3,7           0      0      0      0      0      0      0
And I want it to look like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      1      0      0
2   1,2             1      1      0      0      0      0      0
3   1,3,7           1      0      1      0      0      0      1
You can iterate over the range of values in Q26 Challenges, using str.contains to check whether each value appears in the string, then convert the boolean result to an integer. The \b word boundaries keep, say, 1 from matching inside 11. For example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Q26 Challenges': ['0', '1,2', '2', '1,2,6,7', '3,4,5,11']})
for i in range(1, 12):
    df[f'Q26_{i}'] = df['Q26 Challenges'].str.contains(rf'\b{i}\b').astype(int)
df
Output:
id Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7 Q26_8 Q26_9 Q26_10 Q26_11
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1,2 1 1 0 0 0 0 0 0 0 0 0
2 3 2 0 1 0 0 0 0 0 0 0 0 0
3 4 1,2,6,7 1 1 0 0 0 1 1 0 0 0 0
4 5 3,4,5,11 0 0 1 1 1 0 0 0 0 0 1
str.get_dummies can be used on the 'Q26 Challenges' column to create the indicator values. This indicator DataFrame can be reindexed to include the complete result range (note column headers will be of type string). add_prefix can be used to add the 'Q26_' to the column headers. Lastly, join back to the original DataFrame:
df = df.join(
    df['Q26 Challenges'].str.get_dummies(sep=',')
      .reindex(columns=[str(i) for i in range(1, 8)], fill_value=0)
      .add_prefix('Q26_')
)
The reindexing can also be done dynamically based on the resulting columns. It is necessary to convert the resulting column headers to numbers first to ensure numeric order, rather than lexicographic ordering:
s = df['Q26 Challenges'].str.get_dummies(sep=',')
# Convert to numbers to correctly access min and max
s.columns = s.columns.astype(int)
# Add back to DataFrame
df = df.join(s.reindex(
    # Build range from the min column to max column values
    columns=range(min(s.columns), max(s.columns) + 1),
    fill_value=0
).add_prefix('Q26_'))
Both options produce:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
0 1 5 0 0 0 0 1 0 0
1 2 1,2 1 1 0 0 0 0 0
2 3 1,3,7 1 0 1 0 0 0 1
Given initial input:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3],
'Q26 Challenges': ['5', '1,2', '1,3,7']
})
ID Q26 Challenges
0 1 5
1 2 1,2
2 3 1,3,7
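Putting those pieces together, a self-contained sketch of the get_dummies route on the question's data (the fixed range 1..7 is assumed from the question's column headers):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Q26 Challenges': ['5', '1,2', '1,3,7']
})

# One indicator column per challenge 1..7, named Q26_1 .. Q26_7.
out = df.join(
    df['Q26 Challenges'].str.get_dummies(sep=',')
      .reindex(columns=[str(i) for i in range(1, 8)], fill_value=0)
      .add_prefix('Q26_')
)
print(out)
```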

zip_longest(*d, fillvalue='0') doesn't work for all sizes of d

So, I'm using zip_longest to write columns with differents sizes on the same file. I'm using like this:
import csv
from itertools import zip_longest

d = event_time
export_data = zip_longest(*d, fillvalue='0')
with open(filenameOut, 'w', encoding="ISO-8859-1", newline='') as Output:
    wr = csv.writer(Output, delimiter=' ')
    wr.writerows(export_data)
event_time is a list of N elements, where each element is another list of unknown size.
The problem is that the zip_longest doesn't work for all values of N.
If N = 10, it works and on my output file I get something like this on the last lines:
(nine zeros and one different)
0 0 0 0 0 99916 0 0 0 0
0 0 0 0 0 99918 0 0 0 0
0 0 0 0 0 99922 0 0 0 0
0 0 0 0 0 99924 0 0 0 0
0 0 0 0 0 99932 0 0 0 0
0 0 0 0 0 99998 0 0 0 0
But if N=100, I expected something similar with 99 zeros; instead, at the end I got a single column of zeros.
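For what it's worth, zip_longest itself is indifferent to the number of iterables, which suggests the problem lies in how event_time is built for N=100 (for example, a flat list instead of a list of lists, or strings that get zipped character by character). A minimal sketch of the intended behavior:

```python
from itertools import zip_longest

# Four "columns" of unequal length, transposed into rows padded with 0.
event_time = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]
rows = list(zip_longest(*event_time, fillvalue=0))
for row in rows:
    print(*row)
```

The first row is `1 4 5 7`; shorter columns are padded with the fillvalue, so the last row is `0 0 0 10`.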

Pandas - Map - Dummy Variables - Assign value of 1

I have two dataframes, x.head() looks like this:
top mid adc support jungle
Irelia Ahri Jinx Janna RekSai
Gnar Ahri Caitlyn Leona Rengar
Renekton Fizz Sivir Annie Rengar
Irelia Leblanc Sivir Thresh JarvanIV
Gnar Lissandra Tristana Janna JarvanIV
and dataframe fullmatrix.head() that I have created looks like this:
Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra ... XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ...
Now what I cannot figure out is how to assign a value of 1, for each name in the x dataframe, to the column of the same name in the fullmatrix dataframe, row by row (both dataframes have the same number of rows).
I'm sure this can be improved but one advantage is that it only requires the first DataFrame, and it's conceptually nice to chain operations until you get the desired solution.
fullmatrix = (x.stack()
              .reset_index(name='names')
              .pivot(index='level_0', columns='names', values='names')
              .applymap(lambda v: int(pd.notna(v)))  # pivot fills gaps with NaN, not None
              .reset_index(drop=True))
Note that only the names that appear in your x DataFrame will appear as columns in fullmatrix. If you want the additional columns, you can simply perform a join.
Consider adding a key = 1 column and then iterating through each column to build a list of pivoted dfs, which you then merge horizontally with pd.concat. Finally, run DataFrame.update() to update the original fullmatrix with values from pvt_df, aligned on indices.
x['key'] = 1
dfs = []
for col in x.columns[:-1]:
    # pivot against x.index (df is not defined in this scope)
    dfs.append(x.pivot_table(index=x.index, columns=[col], values='key').fillna(0))
pvt_df = pd.concat(dfs, axis=1).astype(int)
fullmatrix.update(pvt_df)
fullmatrix = fullmatrix.astype(int)
fullmatrix # ONLY FOR VISIBLE COLUMNS IN ORIGINAL POST
# Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
# 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The OP wants to create a table of dummy variables from a set of data points. Each data point has 5 attributes, and there are N unique attributes in total.
We will use a simplified dataset to demonstrate:
5 unique attributes
3 data entries
each data entry contains 3 attributes
x = pd.DataFrame([['a', 'b', 'c'],
                  ['b', 'd', 'e'],
                  ['e', 'b', 'a']])
fullmatrix = pd.DataFrame([[0 for _ in range(5)] for _ in range(3)],
                          columns=['a', 'b', 'c', 'd', 'e'])
""" fullmatrix:
a b c d e
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
"""
# each row of x, joined into a single string of attributes delimited by ","
x_row_joined = pd.Series((",".join(row[1]) for row in x.iterrows()))
fullmatrix = x_row_joined.str.get_dummies(sep=',')
The method is inspired by offbyone's answer and uses pandas.Series.str.get_dummies. We first join each row of x with a chosen delimiter, then Series.str.get_dummies splits on that same delimiter and generates the dummy-variable table for you. (Caution: don't pick a sep that already occurs in x.)
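For completeness, here is a compact variant (my addition, not from the answers above) that skips the string round-trip: one-hot encode the stacked cells, then collapse back to one row per original row.

```python
import pandas as pd

x = pd.DataFrame([['a', 'b', 'c'],
                  ['b', 'd', 'e'],
                  ['e', 'b', 'a']])

# Stack all cells into one Series, one-hot encode them,
# then take the row-wise max to merge each row's indicators.
fullmatrix = pd.get_dummies(x.stack()).groupby(level=0).max().astype(int)
print(fullmatrix)
```

Like the get_dummies approach, this only produces columns for names that actually occur in x; reindex against the full column list if you need the rest.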

How to index data by elements of lists in columns?

I have the following DataFrame (named df2 later):
recipe_id ingredients
0 3332 [11307, 11322, 11632, 11338, 11478, 11438]
1 3333 [11322, 11338, 11632, 11314, 11682, 11478, 108...
2 3334 [11632, 11682, 11338, 11337, 10837, 11435, 113...
3 3335 [11149, 11322, 11532, 11996, 10616, 10837, 113...
4 3336 [11330, 11632, 11422, 11256, 11338, 11314, 114...
5 3812 [959, 92, 3, 554, 12271, 202]
...
I want to create another DataFrame with the columns ['ingredients', "recipe_id1", "recipe_id2", ..., "recipe_idn"], where n is the total number of recipes in the database. I did that with the following snippet:
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
After I create this DataFrame (which I did already) and populate it (the problem I'm having), the output should look like this:
In [1]:
# Create and populate ingredient_df by some method
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
ingredient_df = populate_df(ingredient_df, df2)
Out [1]:
In [2]:
ingredient_df
Out[2]:
ingredient ... 3332 3333 3334 3335 3336 ...
...
11322 ... 1 1 0 1 0 ...
...
In the example above, the value at (11322, 3334) is 0 because ingredient 11322 is not present in the recipe with id 3334.
In other words, for every ingredient I want the mapping (ingredient, recipe_id) = 1 if the ingredient is present in that recipe, and 0 otherwise.
I've managed to do this by iterating over all recipes and through all ingredients, but this is very slow. How can I do this in a more robust and elegant way using Pandas methods (if that is possible at all)?
setup
df = pd.DataFrame(
    dict(
        recipe_id=list('abcde'),
        ingredients=[list('xyz'),
                     list('tuv'),
                     list('ytw'),
                     list('vy'),
                     list('zxs')]
    )
)[['recipe_id', 'ingredients']]
df
recipe_id ingredients
0 a [x, y, z]
1 b [t, u, v]
2 c [y, t, w]
3 d [v, y]
4 e [z, x, s]
method 1
df.set_index('recipe_id').ingredients.apply(pd.value_counts) \
.fillna(0).astype(int).T.rename_axis('ingredients')
recipe_id a b c d e
ingredients
s 0 0 0 0 1
t 0 1 1 0 0
u 0 1 0 0 0
v 0 1 0 1 0
w 0 0 1 0 0
x 1 0 0 0 1
y 1 0 1 1 0
z 1 0 0 0 1
method 2
idx = np.repeat(df.index.values, df.ingredients.str.len())
df1 = df.drop('ingredients', 1).loc[idx]
df1['ingredients'] = df.ingredients.sum()
df1.groupby('ingredients').recipe_id.apply(pd.value_counts) \
.unstack(fill_value=0).rename_axis('recipe_id', 1)
recipe_id a b c d e
ingredients
s 0 0 0 0 1
t 0 1 1 0 0
u 0 1 0 0 0
v 0 1 0 1 0
w 0 0 1 0 0
x 1 0 0 0 1
y 1 0 1 1 0
z 1 0 0 0 1
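A more recent idiom (my addition; assumes pandas 0.25+ for DataFrame.explode) gets the same table with explode plus crosstab on the setup data above:

```python
import pandas as pd

df = pd.DataFrame({'recipe_id': list('abcde'),
                   'ingredients': [list('xyz'), list('tuv'), list('ytw'),
                                   list('vy'), list('zxs')]})

# One row per (recipe, ingredient) pair, then count co-occurrences.
exp = df.explode('ingredients')
table = pd.crosstab(exp['ingredients'], exp['recipe_id'])
print(table)
```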

How to apply logical operator OR in some of the list item?

I want to know whether it is possible to include a logical operator OR in the list items. For example, change this line of code:
CHARS = ['X','Y','Z']
to something like this (I know this is not the correct way):
CHARS = ['X','Y','Z','X OR Y','Y OR Z','X OR Z']
Can anyone help me?
Example code:
import numpy as np

seqs = ["XYZXYZ", "YZYZYZ"]
CHARS = ['X', 'Y', 'Z']
CHARS_COUNT = len(CHARS)
maxlen = max(map(len, seqs))

res = np.zeros((len(seqs), CHARS_COUNT * maxlen), dtype=np.uint8)
for si, seq in enumerate(seqs):
    seqlen = len(seq)
    arr = np.chararray((seqlen,), buffer=seq)
    for ii, char in enumerate(CHARS):
        res[si][ii*seqlen:(ii+1)*seqlen][arr == char] = 1
print res
It scans through the sequence to detect X first (each occurrence is awarded a 1), then Y, and lastly Z.
Output:
[[1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1]
[0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1]]
Expected output after include logical OR:
[[1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 1]
[0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1]]
The example below is a bit contrived, but using itertools.combinations would be a way to generate combinations of size n for a given list. Combine this with str.join() and you'd be able to generate strings as exemplified in the first part of your question:
import itertools

CHARS = ['X', 'Y', 'Z']
allCombinations = [" OR ".join(x)
                   for i in range(1, len(CHARS))
                   for x in itertools.combinations(CHARS, i)]
print repr(allCombinations)
Output:
['X', 'Y', 'Z', 'X OR Y', 'X OR Z', 'Y OR Z']
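To actually produce the 36-column expected output, one option (my sketch in Python 3, not the original code) is to represent each entry as a set of acceptable characters, so that 'X OR Y' becomes membership in {'X', 'Y'}:

```python
import numpy as np

seqs = ["XYZXYZ", "YZYZYZ"]
# Each group is a set; "X OR Y" is simply membership in {'X', 'Y'}.
groups = [{'X'}, {'Y'}, {'Z'}, {'X', 'Y'}, {'Y', 'Z'}, {'X', 'Z'}]

maxlen = max(map(len, seqs))
res = np.zeros((len(seqs), len(groups) * maxlen), dtype=np.uint8)
for si, seq in enumerate(seqs):
    for gi, group in enumerate(groups):
        for ci, ch in enumerate(seq):
            if ch in group:           # OR over the group's characters
                res[si, gi * maxlen + ci] = 1
print(res)
```

This reproduces the expected output shown in the question: the first 18 columns match the original encoding, and the remaining 18 encode the three OR'd pairs.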
