I am writing some tests and I am using Pandas DataFrames to house a large dataset (~600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to see whether those rows are in the DataFrame in my test suite.
As a small example:

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]
Here raw_data is a 5x3 frame of random floats, and random_sample is row 1 of raw_data, so a match is guaranteed.
Currently I have written:
for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break
This works, but on the large dataset it is very slow. Is there a more efficient way to check whether an entire row is contained in the DataFrame?
BTW: my test also needs to treat np.nan values as equal, which is why I am using the equals() method.
equals doesn't seem to broadcast, but we can always do the equality comparison manually:
>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
0 1 2 3 4 5 6 \
599999 0.07832 0.064828 0.502513 0.851816 0.976464 0.761231 0.275242
7 8 9
599999 0.426393 0.91632 0.569807
which is much faster than the iterative version for me (which takes > 30s.)
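If the check is needed in more than one test, the vectorized comparison can be wrapped in a small helper. A minimal sketch (the helper name row_in_df is mine, not from the original answer):

import numpy as np
import pandas as pd

def row_in_df(df, row):
    # True if `row` matches some row of `df`, treating NaN == NaN
    matches = ((df == row) | (df.isnull() & row.isnull())).all(axis=1)
    return bool(matches.any())

df = pd.DataFrame(np.random.rand(600000, 10))
sample = df.iloc[-1]
assert row_in_df(df, sample)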
But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like
def finder(df, row):
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df
gives me
>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop
which is roughly an order of magnitude faster, because after the first column there's only one row left.
(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)
The best I have come up with is to take a filtering approach which seems to work quite well and prevents a lot of comparisons when the dataset is large:
tmp = raw_data
for idx, val in random_sample.items():
    try:
        if np.isnan(val):
            continue
    except TypeError:
        pass
    tmp = tmp[tmp[idx] == val]

if len(tmp) == 1: print("match")
Note: this is actually slower for the small example above, but on the large dataset it is roughly 9 times faster than the basic iteration.
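A small variation on the same idea, sketched but not benchmarked: pd.isna handles non-numeric values without the try/except, and the loop can stop as soon as no candidate rows remain:

tmp = raw_data
for idx, val in random_sample.items():
    if pd.isna(val):
        continue                 # skip NaN columns, as in the original approach
    tmp = tmp[tmp[idx] == val]
    if tmp.empty:
        break                    # no candidates left; later columns need not be checked
print("match" if len(tmp) == 1 else "no match")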
Related
cluster_name size is 101,878; maxpaths size is 1,508,931.
The DataFrame looks like this:
| cluster_name | maxpath             | chrom_pos           |
|--------------+---------------------+---------------------|
| cluster_1007 | 5_G,6_C,7_A,8_A,9_T | chr11:611117-799999 |
| cluster_1007 | 5_G,6_C,7_A,8_A,9_T | chr11:611117-799999 |
| cluster_1007 | 3_G,4_C,5_A,6_A,7_T | chr12:823492-102341 |
What I'm trying to do: for a given cluster, compare each maxpath to the other maxpaths in that cluster. I'm interested in the clusters where the maxpaths do not overlap and their annotations on the chromosome are also disjoint. For example, "5_G,6_C,7_A,8_A,9_T" and "3_G,4_C,5_A,6_A,7_T" do not overlap and have different annotations. One of the biggest pain points in my code is that I compare each maxpath to all the other maxpaths in its cluster; some clusters have over 1000 maxpaths, so it slows down a lot. I tried to reduce the number of pairwise comparisons by returning the name of the cluster as soon as two maxpaths fulfill the criteria. I've also tried moving certain parts of the code into numpy arrays, but it's still extremely slow.
My code looks like this. Does anyone have any ideas that could help?
import os
import pandas as pd
import numpy as np
def find_cluster(cluster, maxpaths):
    """
    returns clusters with disjoint maxpaths annotated
    to different genomes or annotated on the same
    genome min of 10kb apart
    """
    for idx, item in enumerate(maxpaths):
        unique = set(item.split(','))
        for j in range(idx+1, len(maxpaths)):
            compare = set(maxpaths[j].split(','))
            chrom_string1 = subset_df[
                subset_df['maxpath'] == item]['chrom_pos'].values[0]
            chrom_string2 = subset_df[
                subset_df['maxpath'] == maxpaths[j]]['chrom_pos'].values[0]
            chrom1 = chrom_string1.split(':')[0]
            chrom_end1 = int(chrom_string1.split('-')[1])
            chrom2 = chrom_string2.split(':')[0]
            chrom_end2 = int(chrom_string2.split('-')[1])
            if len(unique.intersection(compare)) == 0:
                if chrom1 != chrom2:
                    return(cluster)
                elif chrom1 == chrom2:
                    if abs(chrom_end1 - chrom_end2) > 10000:
                        return(cluster)

file_number = input.df_subset.split('_')[-1].split('.')[0]
df = pd.read_csv(input.df_subset)
cluster_names = df['cluster_name'].unique()
disjoint_clusters = []
for i in cluster_names:
    subset_df = df[df['cluster_name'] == i]
    maxpaths_array = subset_df['maxpath'].as_matrix()
    cluster = find_cluster(i, maxpaths_array)
    disjoint_clusters.append(cluster)
disjoint_maxpaths = pd.DataFrame({"clusters_with_disjoint_maxpaths": disjoint_clusters})
disjoint_maxpaths.to_csv(os.path.abspath('results/disjoint_maxpath_clusters_{}.csv'.format(file_number)), index=False)
After incorporating some of the suggested ideas, this is what I came up with. I lost some readability, but what I needed was performance; the whole run took about two hours to finish on my main computer.
def find_cluster(cluster_name, dataframe):
    """
    returns clusters with disjoint maxpaths annotated
    to different genomes or annotated on the same
    genome min of 10kb apart
    """
    cluster = cluster_name
    df = dataframe
    maxpaths = [set(item) for item in [_.split(',') for _ in df['maxpath'].as_matrix()]]
    chrom_string = df['chrom_pos'].as_matrix()
    chrom = [_.split(':')[0] for _ in df['chrom_pos'].as_matrix()]
    chrom_end = [int(_.split('-')[1]) for _ in df['chrom_pos'].as_matrix()]
    for idx, item in enumerate(maxpaths):
        for j in range(idx+1, len(maxpaths)):
            if item.isdisjoint(maxpaths[j]):
                if chrom[idx] != chrom[j]:
                    return(cluster)
                elif chrom[idx] == chrom[j]:
                    if abs(chrom_end[idx] - chrom_end[j]) > 10000:
                        return(cluster)
def df_to_dict(dataframe):
    """
    Create dict with cluster as key and
    subset of dataframe at cluster as value
    """
    df = dataframe
    unique_clusters = df['cluster_name'].unique()
    sub_dfs = []
    for i in unique_clusters:
        sub_dfs.append(df[df['cluster_name'] == i])
    return(dict(zip(unique_clusters, sub_dfs)))
def get_clusters(dataframe):
    """
    return disjoint clusters
    """
    df = pd.read_csv(dataframe, index_col=False)
    df_dict = df_to_dict(df)
    disjoint = [find_cluster(k, v) for k, v in df_dict.items() if find_cluster(k, v)]
    return(disjoint)
def test_new():
    cluster = ["cluster_689"]
    disjoint_cluster = []
    for i in cluster:
        found = find_cluster(i, subset_df)
        disjoint_cluster.append(found)
    return(disjoint_cluster)

def test_old():
    cluster = ["cluster_689"]
    disjoint_cluster = []
    for i in cluster:
        maxpaths_array = subset_df['maxpath'].as_matrix()
        found = old_find_cluster(i, maxpaths_array)
        disjoint_cluster.append(found)
    return(disjoint_cluster)
new = %timeit for x in range(3): test_new()
old = %timeit for x in range(3): test_old()
yields
New find_cluster
247 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Old find_cluster
138 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The speed-up when looking for disjoint maxpaths is immense. I wasn't able to time the whole script because the last few large clusters never finished running after more than 24 hours, even after splitting the DataFrame into 100 smaller DataFrames. But I'm sure the script overall became faster outside of the find_clusters function. Thanks for all the help, everyone.
There is some scope for improvement. Your code is basically pythonic, but if the performance loss (for the readability gain) is unacceptable, you should look a bit deeper at things such as the repeated iterations you do, and at convenience functions that take iterables as input and repeatedly iterate over (possibly the same) data.
Avoid repeating iterations.
You are repeating iterations: df['cluster_name'].unique() iterates over all rows, the for loop iterates them again, and the subsetting at df[df['cluster_name'] == i] is a conditional scan that is repeated many times for the same values. Instead, you can iterate over the whole data once, keep a dict-like mapping for the clusters (e.g. cluster name mapped to a list of sets of maxpaths), check whether the maxpaths for the current cluster name are already present in that dictionary's entry, and accumulate the clusters that pass your conditions. This also avoids repeatedly creating exactly the same sets, which happens with your current code.
Another example of repeated heavy computation:
cluster_names = df['cluster_name'].unique()
disjoint_clusters = []
for i in cluster_names:
    subset_df = df[df['cluster_name'] == i]
You can at least keep (cache) and reuse subsets instead of re-computing.
Do not compute unnecessary expressions, and look for items that break your conditions early. E.g. len(unique.intersection(compare)) == 0 is slower than unique.isdisjoint(compare), because the latter short-circuits, and your data seem to hit that case frequently. You don't actually need the intersection; you only need to know whether the two sets are disjoint.
Here, again, the same sets are being constructed repeatedly (each construction itself iterates over its input); with the dict approach above, those repetitions are avoided.
There is one shortcoming with the suggested approach, though: it may consume more memory. I guess that shouldn't be a troublesome difference.
Also, I assume you need all clusters passing your conditions rather than stopping at just two; in that case it can be made even faster.
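To make the one-pass idea concrete, here is a minimal sketch under the same three-column layout; the function name find_disjoint_clusters and the use of groupby are my choices, not part of the original suggestion:

import pandas as pd

def find_disjoint_clusters(df):
    # One pass over the data: group rows by cluster, split each maxpath into a
    # set and parse chrom_pos once per row, then do the pairwise check per cluster.
    disjoint = []
    for cluster, sub in df.groupby('cluster_name', sort=False):
        maxpaths = [set(m.split(',')) for m in sub['maxpath']]
        chrom = [c.split(':')[0] for c in sub['chrom_pos']]
        chrom_end = [int(c.split('-')[1]) for c in sub['chrom_pos']]
        for i in range(len(maxpaths)):
            hit = False
            for j in range(i + 1, len(maxpaths)):
                if maxpaths[i].isdisjoint(maxpaths[j]):
                    if chrom[i] != chrom[j] or abs(chrom_end[i] - chrom_end[j]) > 10000:
                        disjoint.append(cluster)
                        hit = True
                        break
            if hit:
                break
    return disjoint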
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running the replace takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
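If you want the speed of supplying the dict (or dictionary.get) directly but still need to keep the original values for missing keys, one option (an assumption about the desired behavior, not from the original answer) is to fill the resulting NaNs back from the original series:

# map returns NaN for missing keys; fillna restores the original values there
series = series.map(dictionary).fillna(series)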
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete (substring) matches, while .map requires complete values to be supplied in the dictionary (otherwise it returns NaN). A fast but still general solution (one that can handle substrings) is to first use .replace on a dict of all possible values (obtained e.g. with .value_counts().index) and then go over all rows of the Series with this dict and .map. This combination can handle, for instance, replacement of special national characters (full substrings) on 1M-row columns in a quarter of a second, where .replace alone would take 15 seconds.
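A rough sketch of that combination (variable names are mine; pass regex=True to .replace if the dictionary holds substring patterns):

# run the expensive .replace only once per unique value ...
uniq = pd.Series(series.unique())
lookup = dict(zip(uniq, uniq.replace(dictionary)))
# ... then broadcast the result to every row with the cheap .map
series = series.map(lookup)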
Thanks to #root: I ran the benchmark again and found different results on pandas v1.1.4: series.map(dictionary) was the fastest, and it also returns NaN if a key is not present.
I have two DataFrames, "A" and "B". Each has two columns, "key1" and "key2", but a unique key is the combination of the two. I want to select from the second DataFrame all rows whose combination of "key1" and "key2" is contained in DataFrame "A".
Simple example:
A = pd.DataFrame({'a':list(range(20000))*100,
'b':np.repeat(list(range(100)),20000)})
B = pd.DataFrame({'a':list(range(40000))*100,
'b':np.repeat(list(range(100)),40000),
'c':np.random.randint(4000000, size = 4000000)})
Solution 1:
%%time
A['marker'] = True
C = B.merge(A, on=['a','b'], how='inner').drop('marker', axis=1)
1.26 s
Solution 2:
%%time
A['marker'] = A['a'].astype(str) + '_' + A['b'].astype(str)
B['marker'] = B['a'].astype(str) + '_' + B['b'].astype(str)
C = B[B.marker.isin(A.marker)]
20.4 s
This works, but is there a more elegant (and fast) solution?
You could try taking a look at pd.MultiIndex and using multi-level indices instead of plain/meaningless integer ones. Not sure if it would be a lot faster in the real data, but modifying your example data slightly:
index1 = pd.MultiIndex.from_arrays([list(range(20000))*100, np.repeat(list(range(100)), 20000)])  # former A
index2 = pd.MultiIndex.from_arrays([list(range(40000))*100, np.repeat(list(range(100)), 40000)])  # index of B[['a', 'b']]
s = pd.Series(np.random.randint(4000000, size = 4000000), index=index2) #former B['c']
In [93]: %timeit c = s[index1]
1 loops, best of 3: 803 ms per loop
Indexing s with a different index (index1) from its original index (index2) is roughly equivalent to your merge operation.
Usually, operations on the index tend to be faster than those performed on regular DataFrame columns. But either way, you are probably looking at marginal improvements here; I don't think you can get this done on the microsecond scale.
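Another option worth timing (a sketch on newer pandas, not benchmarked against the numbers above): build a MultiIndex from the key columns of each frame and filter B with Index.isin:

# membership test on the (a, b) key pairs, keeping B's other columns intact
keys_A = pd.MultiIndex.from_frame(A[['a', 'b']])
keys_B = pd.MultiIndex.from_frame(B[['a', 'b']])
C = B[keys_B.isin(keys_A)]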
I have a data frame, and I would like to work on a small partition at a time, for particular tuples of values of 'a', 'b' and 'c'.
df = pd.DataFrame({'a':np.random.randint(0,10,10000),
'b':np.random.randint(0,10,10000),
'c':np.random.randint(0,10,10000),
'value':np.random.randint(0,100,10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
(df['b'].values == 1) & (df['c'].values == 7)] # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
Hmm, I forgot to mention that the number of tuples is smaller than the total Cartesian product of the distinct values of 'a', 'b' and 'c' in df. Wouldn't groupby do an excess amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this; the applied function can be almost anything:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
certain operations are cythonized so very fast
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a multi-index one index at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the cartesian product of the groups indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
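One way to avoid paying the lookup cost per tuple (a sketch of my own, not from the original answer) is to build the group index once and reuse it for every tuple:

gb = df.groupby(['a', 'b', 'c'])
groups = gb.groups                      # dict-like: (a, b, c) tuple -> row labels, built once
for key in [(2, 1, 7), (3, 3, 3)]:      # your ~1k tuples
    if key in groups:                   # tuples absent from the data are skipped cheaply
        part = df.loc[groups[key]]
        # ... work on `part` ...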
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.
df1 = pd.DataFrame(np.ones((1000,5000)),dtype=int).stack()
df1 = pd.DataFrame(df1, columns = ['total'])
df2 = pd.concat([df1.iloc[50:55],df1.iloc[2000:2005]]) # df2 is tiny subset of df1
Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).
import time

starttime = time.time()
df1.total.add(df2.total, fill_value=0).sum()
print("Method 1 took %f seconds" % (time.time() - starttime))
Manually iterating over rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent tries.
starttime = time.time()
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
print("Method 2 took %f seconds" % (time.time() - starttime))
The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.
Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?
Also, how do I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run is appreciably faster. How do I get this performance on the first time? Sorting using sort_index helps only marginally.
You don't need a for loop:
df1.total[df2.index] += df2.total
As HYRY answers, the more efficient thing to do in this situation is to only look at the small subset of df2's index. You can do this, with the slightly more robust add function (which can fill NaNs):
df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
Although the syntax here is not very DRY...
To compare some timeit information, we can see that add isn't drastically slower, and both are an enormous improvement on your naive for loop:
In [11]: %%timeit
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
100 loops, best of 3: 17.9 ms per loop
In [12]: %timeit df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
1000 loops, best of 3: 325 µs per loop
In [13]: %timeit df1.total[df2.index] += df2.total
1000 loops, best of 3: 283 µs per loop
It's an interesting question (and I may fill this in later) at what relative sizes this remains faster, but certainly in this extreme case there is a huge win...
The thing to take away from this:
If you are writing a for loop (in python) to speed something up, you're doing it wrong! :)
I think your second method might be faster in this specific case because you're iterating through the smaller dataset (a small amount of work) and then accessing only a handful of components of the larger dataset (an efficient operation, thanks to the pandas developers).
However, with the .add method, pandas has to look at the entirety of both indices.
If df1 and df2 are the same length, your first method takes 54 ms, but the second method takes >2 minutes (on my machine, obviously, YMMV).