There are 101,878 cluster_name values and 1,508,931 maxpaths.
The dataframe looks like this:
| cluster_name | maxpath             | chrom_pos           |
|--------------+---------------------+---------------------|
| cluster_1007 | 5_G,6_C,7_A,8_A,9_T | chr11:611117-799999 |
| cluster_1007 | 5_G,6_C,7_A,8_A,9_T | chr11:611117-799999 |
| cluster_1007 | 3_G,4_C,5_A,6_A,7_T | chr12:823492-102341 |
What I'm trying to do: for a given cluster, I want to compare each maxpath to every other maxpath in that cluster. I'm interested in the clusters where the maxpaths do not overlap and their annotations on the chromosome are also disjoint. For example, "5_G,6_C,7_A,8_A,9_T" and "3_G,4_C,5_A,6_A,7_T" do not overlap and have different annotations. One of the biggest pain points in my code is that I compare each maxpath to all the other maxpaths in that cluster; some clusters have over 1000 maxpaths, so it slows down a lot. I tried to reduce the number of pairwise comparisons by returning the cluster name as soon as two maxpaths fulfilling the criteria are found. I've also tried moving certain parts of the code into NumPy arrays, but it's still extremely slow.
My code looks like this. Does anyone have any ideas that could help?
import os
import pandas as pd
import numpy as np

def find_cluster(cluster, maxpaths):
    """
    returns clusters with disjoint maxpaths annotated
    to different genomes or annotated on the same
    genome min of 10kb apart
    """
    for idx, item in enumerate(maxpaths):
        unique = set(item.split(','))
        for j in range(idx + 1, len(maxpaths)):
            compare = set(maxpaths[j].split(','))
            # subset_df is the per-cluster slice created in the loop below
            chrom_string1 = subset_df[
                subset_df['maxpath'] == item]['chrom_pos'].values[0]
            chrom_string2 = subset_df[
                subset_df['maxpath'] == maxpaths[j]]['chrom_pos'].values[0]
            chrom1 = chrom_string1.split(':')[0]
            chrom_end1 = int(chrom_string1.split('-')[1])
            chrom2 = chrom_string2.split(':')[0]
            chrom_end2 = int(chrom_string2.split('-')[1])
            if len(unique.intersection(compare)) == 0:
                if chrom1 != chrom2:
                    return cluster
                elif chrom1 == chrom2:
                    if abs(chrom_end1 - chrom_end2) > 10000:
                        return cluster

file_number = input.df_subset.split('_')[-1].split('.')[0]
df = pd.read_csv(input.df_subset)
cluster_names = df['cluster_name'].unique()
disjoint_clusters = []
for i in cluster_names:
    subset_df = df[df['cluster_name'] == i]
    maxpaths_array = subset_df['maxpath'].as_matrix()
    cluster = find_cluster(i, maxpaths_array)
    disjoint_clusters.append(cluster)
disjoint_maxpaths = pd.DataFrame({"clusters_with_disjoint_maxpaths": disjoint_clusters})
disjoint_maxpaths.to_csv(
    os.path.abspath('results/disjoint_maxpath_clusters_{}.csv'.format(file_number)),
    index=False)
After incorporating some of the suggested ideas, this is what I came up with. I lost readability, but what I needed was performance. The whole run took about 2 hours to finish on my main computer.
def find_cluster(cluster_name, dataframe):
    """
    returns clusters with disjoint maxpaths annotated
    to different genomes or annotated on the same
    genome min of 10kb apart
    """
    cluster = cluster_name
    df = dataframe
    maxpaths = [set(item) for item in [_.split(',') for _ in df['maxpath'].as_matrix()]]
    chrom_string = df['chrom_pos'].as_matrix()
    chrom = [_.split(':')[0] for _ in df['chrom_pos'].as_matrix()]
    chrom_end = [int(_.split('-')[1]) for _ in df['chrom_pos'].as_matrix()]
    for idx, item in enumerate(maxpaths):
        for j in range(idx + 1, len(maxpaths)):
            if item.isdisjoint(maxpaths[j]):
                if chrom[idx] != chrom[j]:
                    return cluster
                elif chrom[idx] == chrom[j]:
                    if abs(chrom_end[idx] - chrom_end[j]) > 10000:
                        return cluster

def df_to_dict(dataframe):
    """
    Create dict with cluster as key and
    subset of dataframe at cluster as value
    """
    df = dataframe
    unique_clusters = df['cluster_name'].unique()
    sub_dfs = []
    for i in unique_clusters:
        sub_dfs.append(df[df['cluster_name'] == i])
    return dict(zip(unique_clusters, sub_dfs))

def get_clusters(dataframe):
    """
    return disjoint clusters
    """
    df = pd.read_csv(dataframe, index_col=False)
    df_dict = df_to_dict(df)
    disjoint = [find_cluster(k, v) for k, v in df_dict.items() if find_cluster(k, v)]
    return disjoint

def test_new():
    cluster = ["cluster_689"]
    disjoint_cluster = []
    for i in cluster:
        found = find_cluster(i, subset_df)
        disjoint_cluster.append(found)
    return disjoint_cluster

def test_old():
    cluster = ["cluster_689"]
    disjoint_cluster = []
    for i in cluster:
        maxpaths_array = subset_df['maxpath'].as_matrix()
        found = old_find_cluster(i, maxpaths_array)
        disjoint_cluster.append(found)
    return disjoint_cluster
new = %timeit for x in range(3): test_new()
old = %timeit for x in range(3): test_old()
yields
New find_cluster
247 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Old find_cluster
138 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The speedup when looking for disjoint maxpaths is immense. I wasn't able to time the whole script because the last few large clusters never finished running after over 24 hours, and that was after splitting the dataframe into 100 smaller dataframes as well. But I'm sure the script as a whole became faster outside of the find_cluster function too. Thanks for all the help, everyone.
There is some scope for improvement. Your code is basically pythonic, but if the performance loss (relative to the readability gain) is unacceptable, you should look a bit deeper at things such as repeated iterations, and at convenience functions that take iterables as input and iterate over (possibly the same) data again and again.
Avoid repeating iterations.
You are repeating iterations: df['cluster_name'].unique() walks over all rows, the for loop walks over them again, and the subsetting df[df['cluster_name'] == i] is a conditional scan that gets repeated many times for the same values. Instead, you can iterate over the whole data once, keep a dict-like mapping per cluster (e.g. cluster name mapped to a list of sets of max_paths), check whether the max_paths for the current cluster-name entry are already present in that dictionary's corresponding entry, and accumulate the clusters passing your conditions; a sketch is given at the end of this answer. This also avoids the repeated construction of exactly the same sets that happens in your current code.
Another example of repeated heavy computation:
cluster_names = df['cluster_name'].unique()
disjoint_clusters = []
for i in cluster_names:
    subset_df = df[df['cluster_name'] == i]
You can at least keep (cache) and reuse subsets instead of re-computing.
Do not compute unnecessary expressions, and look for items breaking your conditions. E.g. len(unique.intersection(compare)) == 0 is slower than unique.isdisjoint(compare) because the latter short-circuits, and your data seem to hit that case more frequently. Basically, you don't need the intersection itself; you just need to check whether the sets are disjoint.
Here, again, the same sets are being constructed repeatedly (and building a set itself iterates over its input); with the dict approach above, those repetitions are avoided.
There is one shortcoming with the suggested approach, though: it may consume more memory. But I guess that shouldn't be a troublesome difference.
Also, I assume you need all clusters passing your conditions, and do not want to stop at just two. In that case it can be made even faster.
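To make this concrete, here is a minimal sketch of the single-pass, dict-based accumulation described above (column names follow your dataframe; returning the qualifying cluster names as a set is my own choice, and the 10 kb threshold comes from your docstring):

import pandas as pd

def disjoint_clusters(df, min_distance=10000):
    """One pass over the rows, grouping per cluster as we go and comparing
    each new maxpath only against what its cluster has already seen."""
    seen = {}      # cluster_name -> list of (maxpath_set, chrom, chrom_end)
    found = set()  # clusters already known to satisfy the condition
    for name, maxpath, chrom_pos in zip(df['cluster_name'],
                                        df['maxpath'],
                                        df['chrom_pos']):
        if name in found:
            continue
        path_set = frozenset(maxpath.split(','))
        chrom, _, span = chrom_pos.partition(':')
        chrom_end = int(span.split('-')[1])
        for other_set, other_chrom, other_end in seen.get(name, ()):
            if path_set.isdisjoint(other_set) and (
                    chrom != other_chrom
                    or abs(chrom_end - other_end) > min_distance):
                found.add(name)
                break
        else:
            # Optionally skip exact-duplicate maxpaths here to keep the list short.
            seen.setdefault(name, []).append((path_set, chrom, chrom_end))
    return found

# usage (path is yours to fill in): found = disjoint_clusters(pd.read_csv(path))

Compared to the original, each set and each chrom_pos string is parsed exactly once, and a cluster that has already qualified is never touched again.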
Related
Here's the current implementation:
from scipy.sparse import csr_matrix
import numpy as np

def nonzero_indexes_by_row(input):
    return [
        np.nonzero(row)[1]
        for row in csr_matrix(input.T)
    ]
The matrix is very large (1.5M x 500K); since I'm accessing rows, I have to convert from CSC to CSR first. The result is a 2-D list where each inner list contains the indexes of the nonzero entries in the corresponding row of the original matrix.
The current process takes 20 minutes. Is there a faster way?
Sure. You're pretty close to having an ideal solution, but you're allocating some unnecessary arrays. Here's a faster way:
from scipy import sparse
import numpy as np

def my_impl(csc):
    csr = csc.tocsr()
    return np.split(csr.indices, csr.indptr[1:-1])

def your_impl(input):
    return [
        np.nonzero(row)[1]
        for row in sparse.csr_matrix(input)
    ]
## Results
# demo data
csc = sparse.random(15000, 5000, format="csc")
your_result = your_impl(csc)
my_result = my_impl(csc)
## Tests for correctness
# Same result
assert all(np.array_equal(x, y) for x, y in zip(your_result, my_result))
# Right number of rows
assert len(my_result) == csc.shape[0]
## Speed
%timeit my_impl(csc)
# 31 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit your_impl(csc)
# 1.49 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Side question, why are you transposing the matrix? Wouldn't you then be getting the non-zero entries of the columns? If that's what you want, you don't even need to convert to csr and can just run:
np.split(csc.indices, csc.indptr[1:-1])
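For intuition, here is a tiny made-up matrix showing the CSR internals that the np.split one-liner relies on: indices holds the column index of every stored nonzero, and indptr marks where each row's slice of indices begins and ends.

from scipy import sparse
import numpy as np

m = sparse.csr_matrix(np.array([[0, 1, 0],
                                [2, 0, 3],
                                [0, 0, 0]]))
print(m.indices)   # [1 0 2]   -> column of each stored nonzero
print(m.indptr)    # [0 1 3 3] -> row boundaries into `indices`
print(np.split(m.indices, m.indptr[1:-1]))
# one array of column indices per row (the last one empty)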
An interesting alternative to your code is to convert your array to
the COOrdinate format and then read its row and col attributes:
def nonzero_indices_by_coo(input):
    cx = input.T.tocoo()
    res = [[] for i in range(cx.shape[0])]
    for i, j in zip(cx.row, cx.col):
        res[i].append(j)
    return res
It returns a list of plain pythonic lists instead of NumPy arrays, but this should not make any important difference.
I noticed that your code internally transposes the source array (the T operator), so I did the same in my code.
To compare execution speed, I created the following sparse array (2000 by 300):
r = 2000; c = 300
x = scipy.sparse.lil_matrix((r, c))
for _ in range(r):
    x[np.random.randint(0, r-1), np.random.randint(0, c-1)] = np.random.randint(1, 100)
and my code ran about 12 times faster than yours.
A yet quicker solution (in another format)
Or maybe it will be better to generate a 2-D (Numpy) array,
with 2 rows:
first row - row indices of consecutive non-zero elements,
second row - column indices.
To generate such result, you can use the following code:
def nonzero_indices_2d(input):
    cx = input.T.tocoo()
    return np.array([cx.row, cx.col])
which runs 4 times faster than my first solution.
Of course, other parts of your code would then have to be reworked to consume the indices given in this other format.
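If some later step still needs per-row lists, the 2-row output can be regrouped; this hypothetical helper makes no assumption about the ordering the COO conversion happens to return:

import numpy as np

def regroup_by_row(pairs, n_rows):
    """pairs is the 2-row array of (row indices, column indices)."""
    rows, cols = pairs
    order = np.argsort(rows, kind="stable")            # sort the entries by row
    cuts = np.searchsorted(rows[order], np.arange(1, n_rows))
    return np.split(cols[order], cuts)                 # one array of columns per row

# e.g. regroup_by_row(nonzero_indices_2d(x), x.T.shape[0])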
Sparse arrays also have their own nonzero method:
arr.nonzero()
which generates a pair of index arrays (row indices and column indices). It runs yet a few percent faster.
So, assuming that the 2-D result format is acceptable (instead of a list of lists), maybe you don't need any function of your own to get these indices.
Another detail to consider: whether transposition should be used at all (in any of the versions).
Your choice, but without the transposition each version of the code would run a bit faster.
Summarize the Problem
I am trying to optimize some code I have written. In its current form it works as intended; however, because of the sheer number of loops required, the script takes a very long time to run.
I'm looking for a method of speeding up the below-described code.
Detail the problem
Within this dataframe, called master, there are 3,936,192 rows. The Position column represents a genomic window, and each value appears in the dataframe 76 times, so that master[master['Position'] == 300] returns a dataframe of 76 rows, and similarly for every unique value of Position. I do some operations on each of these subsets of the dataframe.
The data can be found here
My current code takes the form:
import pandas as pd
import numpy as np

master = pd.read_csv(data_location)
windows = sorted(set(master['Position']))
window_factor = []

# Loop through all the windows, look at the cohort of samples, and ignore anything that is not CNV == 2.
# If that means ignoring everything, drop the window entirely;
# else record half the mean of that window's normalised coverage across all samples.
for window in windows:
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] == 0:
        window_factor.append('drop')
    else:
        window_factor.append(
            np.mean(current_window[current_window['CNV'] == 2]['Normalised_coverage']) / 2)
However, this takes an exceptionally long time to run and I can't figure out a way to speed this up, though I know there must be one.
Your df is not that big, and there are a few problems in your code:
- If you use np.mean and one value is np.nan, it returns np.nan.
- You can divide by 2 after calculating the mean.
- It seems to me a perfect case for groupby.
- You return a string while the other results are floats; you might consider using np.nan instead.
import pandas as pd

df = pd.read_csv("master.csv")

def fun(x):
    t = x[x["CNV"] == 2]
    return t["Normalised_coverage"].mean() / 2  # returns np.nan when len(t) == 0

out = df.groupby('Position').apply(fun)
CPU times: user 34.7 s, sys: 72.5 ms, total: 34.8 s
Wall time: 34.7 s
Or, even faster, filter before the groupby:
%%time
out = df[df["CNV"]==2].groupby("Position")["Normalised_coverage"].mean()/2
CPU times: user 82.5 ms, sys: 8.03 ms, total: 90.5 ms
Wall time: 87.8 ms
UPDATE: In the last case if you really need to keep track of groups where df["CNV"]!=2 you can use this code:
import numpy as np

bad = df[df["CNV"] != 2]["Position"].unique()
bad = list(set(bad) - set(out.index))
out = out.reset_index(name="value")
out1 = pd.DataFrame({"Position": bad,
                     "value": [np.nan] * len(bad)})
out = pd.concat([out, out1], ignore_index=True)\
        .sort_values("Position")\
        .reset_index(drop=True)
Which is going to add 160ms to your computation.
I think .groupby() function is what you need here:
fac = []
for name, group in master.groupby('Position'):
    if all(group['CNV'] != 2):
        fac.append('drop')
    else:
        fac.append(np.mean(group[group['CNV'] == 2]['Normalised_coverage']) / 2)
I downloaded your data master.csv; the generated result is exactly the same, and the running time decreased from 6 min to 30 sec on my laptop.
Hope it helps.
You can do several things:
- Instead of using a Python list for window_factor, consider using an np.array, since you know the length of the array.
- t is already current_window[current_window['CNV'] == 2]; use t when calculating np.mean (both points are sketched below).
- You can also use a profiler to see if there are operations that are expensive, or just consider using C++ and reimplementing the code (it's very simple).
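A minimal sketch of the first two points, reusing the question's own windows and master variables (using np.nan in place of the 'drop' string so a preallocated float array works; that substitution is my own):

import numpy as np

# Preallocate the result instead of appending to a Python list,
# and reuse `t` instead of recomputing the CNV == 2 subset for the mean.
window_factor = np.full(len(windows), np.nan)   # np.nan stands in for 'drop'
for k, window in enumerate(windows):
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] > 0:
        window_factor[k] = t['Normalised_coverage'].mean() / 2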
Using groupby and query was the solution I went with.
import pandas as pd
import numpy as np
master = pd.read_csv("/home/sean/Desktop/master.csv", index_col=0)
windows = sorted(set(master['Position']))
g = master.groupby("Position")
master.query("Position == 24386700").shape
g = master.query("CNV == 2").groupby("Position")
p = g.Normalised_coverage.mean() / 2
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than I'd expect for fewer than 1,000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete (substring) matches, while .map requires complete values to be supplied in the dictionary (otherwise it returns NaN). The fast but generic solution (that can handle substrings) is to first apply .replace to the small set of all distinct values (obtained e.g. with .value_counts().index) to build a dict, and then go over all rows of the Series with that dict and .map. This combination can handle, for instance, special national-character replacements (full substrings) on 1M-row columns in a quarter of a second, where .replace alone would take 15 seconds.
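A sketch of that combination (the example strings and substring mapping are made up; the point is that the expensive .replace only runs over the handful of unique values, and .map does the cheap full-column pass):

import pandas as pd

s = pd.Series(["żółw", "złoto", "żółw", "dom"] * 250000)    # 1M rows, few unique values
subs = {"ż": "z", "ó": "o", "ł": "l"}                       # substring replacements

uniques = pd.Series(s.value_counts().index)                       # the few distinct values
mapping = dict(zip(uniques, uniques.replace(subs, regex=True)))   # slow .replace on a tiny input
result = s.map(mapping)                                           # fast full-column pass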
Thanks to #root: I did the benchmark again and found different results on pandas v1.1.4.
I found series.map(dictionary) to be the fastest; it also returns NaN if a key is not present.
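If you want to re-run the comparison on your own pandas version, something like the following works with the series and dictionary from the SSCCE above (measured before the in-place replace; no numbers claimed here, since they clearly shift between versions):

import timeit

for stmt in ("series.map(dictionary.get)",
             "series.map(lambda x: dictionary.get(x, x))",
             "series.map(dictionary)",
             "series.replace(dictionary)"):   # the slow one, hence the small `number`
    t = timeit.timeit(stmt, number=10, globals=globals())
    print("{:45s} {:10.1f} µs per call".format(stmt, t / 10 * 1e6))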
I have a very large (200k+) set of key/value pairs, from which I need to retrieve a very large number (sometimes all) of the values. The obvious way to do this is with a dictionary, such that
values = {lookup.get(key) for key in key_set}
This is getting very time consuming in my code, and I'm wondering if there's a faster way to implement this with a NumPy array. I've been experimenting with using an array with two columns and n rows, such that for any individual key:
value = lookup_array[lookup_array[:,0] == key, 1]
But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:
values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]
but this also seems time consuming.
Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?
If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.
- The keys must be integers.
- You have enough memory to create a NumPy array whose size is as big as the maximum key value you wish to look up (so that every key corresponds to a valid index into the array).
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
result = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as np
import pandas as pd

def using_dict(lookup_dict, key_set):
    return {lookup_dict.get(key) for key in key_set}

def using_array(lookup_array, key_set):
    return lookup_array[key_set]

def using_pandas(df, key_set):
    return df.loc[df['a'].isin(key_set)]

M = 10**6
N = 2*10**5
K = 10**4
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop
In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop
In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
Here's an approach with np.searchsorted -
row_idx = np.searchsorted(lookup_array[:,0],key_set)[key_set.argsort()]
values = lookup_array[row_idx,1]
This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.
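As an illustration of that unsorted case (variable names follow the answer above; this assumes every key in key_set is actually present in the first column):

import numpy as np

# Let searchsorted work through an argsort permutation instead of
# physically sorting lookup_array by its first column.
sorter = np.argsort(lookup_array[:, 0])
pos = np.searchsorted(lookup_array[:, 0], key_set, sorter=sorter)
row_idx = sorter[pos]                  # map back to positions in the unsorted array
values = lookup_array[row_idx, 1]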
Loading a dictionary this huge into memory is not great, and then there's the added overhead of lookups. If this is a data structure you use quite frequently, how about using a database engine? There are key/value databases if you don't like SQL; they are highly optimized for lookups.
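A minimal sketch of that idea using only the standard library's sqlite3 module (file and table names are made up, and keys, values, and key_set are the arrays from the example above; for very large key sets you would batch the IN clause, since SQLite limits the number of bound parameters):

import sqlite3

conn = sqlite3.connect("lookup.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                 zip(keys.tolist(), values.tolist()))
conn.commit()

# One query for the whole batch instead of a round trip per key.
placeholders = ",".join("?" * len(key_set))
rows = conn.execute("SELECT key, value FROM kv WHERE key IN (%s)" % placeholders,
                    [int(k) for k in key_set]).fetchall()
result = dict(rows)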
I am writing some tests, and I am using Pandas DataFrames to house a large dataset ~(600,000 x 10). I have extracted 10 random rows from the source data (using Stata), and now I want to write a test to see whether those rows are in the DataFrame in my test suite.
As a small example
np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5,3), columns=['one', 'two', 'three'])
random_sample = raw_data.ix[1]
Here raw_data is:
And random_sample is derived to guarantee a match and is:
Currently I have written:
for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print "match"
        break
Which works but on the large dataset is very slow. Is there a more efficient way to check if an entire row is contained in the DataFrame?
BTW: My example also needs to be able to compare np.NaN equality which is why I am using the equals() method
equals doesn't seem to broadcast, but we can always do the equality comparison manually:
>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
0 1 2 3 4 5 6 \
599999 0.07832 0.064828 0.502513 0.851816 0.976464 0.761231 0.275242
7 8 9
599999 0.426393 0.91632 0.569807
which is much faster than the iterative version for me (which takes > 30s.)
But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like
def finder(df, row):
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df
gives me
>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop
which is roughly an order of magnitude faster, because after the first column there's only one row left.
(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)
The best I have come up with is to take a filtering approach which seems to work quite well and prevents a lot of comparisons when the dataset is large:
tmp = raw_data
for idx, val in random_sample.iteritems():
    try:
        if np.isnan(val):
            continue
    except:
        pass
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1: print "match"
Note: this is actually slower for the small example above, but on a large dataset it is ~9 times faster than the basic iteration.