Faster alternative to Series.add function in pandas - python

I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.
import time
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
df1 = pd.DataFrame(df1, columns=['total'])
df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])  # df2 is a tiny subset of df1
Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).
starttime = time.time()
df1.total.add(df2.total,fill_value=0).sum()
print "Method 1 took %f seconds" % (time.time() - starttime)
Manually iterating over rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent tries.
starttime = time.time()
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
print "Method 2 took %f seconds" % (time.time() - starttime)
The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.
Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?
Also, how do I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run is appreciably faster. How do I get this performance on the first time? Sorting using sort_index helps only marginally.
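For concreteness, the sort_index attempt mentioned above looks roughly like this (an illustrative sketch using the setup code from the start of the question):
# Sketch: sort the MultiIndex once up front, then re-run Method 1 on the sorted objects.
df1 = df1.sort_index()
df2 = df2.sort_index()
df1.total.add(df2.total, fill_value=0).sum()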

You don't need a for loop:
df1.total[df2.index] += df2.total

As HYRY's answer shows, the more efficient thing to do in this situation is to look only at the small subset of df2's index. You can do this with the slightly more robust add function (which can fill NaNs):
df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
Although the syntax here is not very DRY...
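One way to make this a little less repetitive (an illustrative sketch, not necessarily any faster) is to name the shared index once:
# Sketch: bind the shared index to a name so the selection isn't written twice.
idx = df2.index
df1.loc[idx, 'total'] = df1.loc[idx, 'total'].add(df2.total, fill_value=0)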
To compare some timeit information, we can see that add isn't drastically slower, and both are an enormous improvement on your naive for loop:
In [11]: %%timeit
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
100 loops, best of 3: 17.9 ms per loop
In [12]: %timeit df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
1000 loops, best of 3: 325 µs per loop
In [13]: %timeit df1.total[df2.index] += df2.total
1000 loops, best of 3: 283 µs per loop
It's an interesting question (and I may fill this in later) up to what relative size this remains faster, but certainly in this extreme case there is a huge win...
The thing to take away from this:
If you are writing a for loop (in python) to speed something up, you're doing it wrong! :)

I think your second method might be faster in this specific case because you're iterating through the smaller dataset (a small amount of work) and then accessing only a handful of elements of the larger dataset (an efficient operation, thanks to the pandas developers).
However, with the .add method, pandas has to look at the entirety of both indices.
If df1 and df2 are the same length, your first method takes 54 ms, but the second method takes >2 minutes (on my machine, obviously, YMMV).

Related

Poor performance using .loc in Dataframe

I'm trying to combine certain columns of this dataframe by index, which I achieved using a simple .cov() call, to calculate the variances and covariances of u_centro, v_centro and w_centro.
However, when I try to slice some of these values using .loc, the performance is very slow (much too slow!).
For example:
df_uu = df.loc[(iz_centro,'u_centro'),'u_centro']
where I want all the combinations of u_centro by u_centro. The result is exactly what I wanted, but the time spent to complete this is absurd: more than 10 minutes.
The whole dataset:
https://raw.githubusercontent.com/AlessandroMDO/LargeEddySimulation/master/sd.csv
There are different ways to do this, but the best performance comes from vectorized selection, for example with xs (thanks @Paul H) or a boolean mask:
from datetime import datetime

starttime = datetime.now()
mask = df.index.get_level_values(1) == 'u_centro'
df.loc[mask]
print(datetime.now() - starttime)  # 0:00:00.001417
I don't know if 1417 µs are a big deal in this case.
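For reference, the xs route mentioned above looks roughly like this (a sketch; the variable name is hypothetical, and unlike the boolean mask, xs drops the matched level by default):
# Sketch of the .xs alternative: cross-section where the second index level
# equals 'u_centro' (pass drop_level=False to keep that level).
u_centro_rows = df.xs('u_centro', level=1)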

How to avoid poor performance of pandas mean() with datetime columns

I have a pandas (version 0.25.3) DataFrame containing a datetime64 column. I'd like to calculate the mean of each column.
import numpy as np
import pandas as pd
n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(pd.datetime.today(), periods=n, freq="1H").tolist()
})
Calculating the mean of individual columns is pretty much instantaneous.
df["x"].mean()
## 1000 loops, best of 3: 1.35 ms per loop
df["d"].mean()
## 100 loops, best of 3: 2.91 ms per loop
However, when I use the DataFrame's .mean() method, it takes a really long time.
%timeit df.mean()
## 1 loop, best of 3: 9.23 s per loop
It isn't clear to me where the performance penalty comes from.
What is the best way to avoid the slowdown? Should I convert the datetime64 column to a different type? Is using the DataFrame-level .mean() method considered bad form?
You could restrict it to the numeric values:
df.mean(numeric_only=True)
Then it runs very fast as well.
Here is the text from the documentation:
numeric_only : bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
--
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html
This is a pandas bug.
On the current master (so probably in pandas >= 1.3), the minimal example from the question is also fast when running df.mean(), but the column d is no longer included in the result. You still have to call df["d"].mean() to get it. I guess this is done to avoid breaking changes, but I am not sure.
Passing numeric_only=True to .mean(), or calling .mean() on the individual columns rather than on the whole DataFrame, are good workarounds.
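A minimal sketch of the per-column workaround (the variable name is hypothetical); each mean is computed on its own fast path, including the datetime64 column:
# Sketch: compute each column's mean separately instead of DataFrame.mean().
column_means = {col: df[col].mean() for col in df.columns}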
Note: things are not very intuitive if your DataFrame contains a column with a non-numeric data type, such as strings or dates. Pandas then tries to compute a sum (whatever that means for the data type), convert it to a number, and divide by the number of rows. For strings this leads to weird results: "42" + "42" + "42" is "424242", which is then converted to 424242.0 and divided by 3. For non-numeric values this can be pretty slow. If the concatenation of the strings cannot be converted to a number, the result is either omitted from df.mean(), or an error is raised (with pandas >= 1.3, or if you call mean() on the column directly).

Pandas replace/dictionary slowness

Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than expected for fewer than 1,000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
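If you would rather have missing keys keep their original values instead of becoming NaN, one common pattern (a sketch, not specific to this question's data) is to fill back from the original series:
# Sketch: unmapped keys come back as NaN, so fall back to the original values.
series = series.map(dictionary).fillna(series)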
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (otherwise it returns NaN). The fast but still generic solution (one that can handle substrings) is to first use .replace on the distinct values actually present (obtained e.g. with .value_counts().index) to build a dict, and then push every row of the Series through that dict with .map. This combination can handle, for instance, replacements of special national characters (full substring handling) on 1M-row columns in a quarter of a second, where .replace alone would take 15 seconds.
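A minimal sketch of that combination (the series name and the substring rules here are hypothetical): run the expensive substring .replace only over the distinct values, then propagate the result to all rows with the cheap .map:
# Hypothetical substring rules; big_series stands in for a large string Series.
substring_rules = {'á': 'a', 'ø': 'o'}
# Apply the substring replacement only to the distinct values...
unique_vals = pd.Series(big_series.value_counts().index)
replaced = unique_vals.replace(substring_rules, regex=True)
# ...then map every row through the resulting (small) exact-value dict.
big_series = big_series.map(dict(zip(unique_vals, replaced)))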
Thanks to @root: I did the benchmarking again and found different results on pandas v1.1.4.
There, series.map(dictionary) was the fastest; it also returns NaN if a key is not present.

Efficiently find matching rows (based on content) in a pandas DataFrame

I am writing some tests and I am using Pandas DataFrames to house a large dataset ~(600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to see if those rows are in the DataFrame in my test suite.
As a small example
import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.ix[1]
Here raw_data is:
And random_sample is derived to guarantee a match and is:
Currently I have written:
for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print "match"
        break
This works, but on the large dataset it is very slow. Is there a more efficient way to check whether an entire row is contained in the DataFrame?
BTW: my example also needs to compare np.NaN values as equal, which is why I am using the equals() method.
equals doesn't seem to broadcast, but we can always do the equality comparison manually:
>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
0 1 2 3 4 5 6 \
599999 0.07832 0.064828 0.502513 0.851816 0.976464 0.761231 0.275242
7 8 9
599999 0.426393 0.91632 0.569807
which is much faster than the iterative version for me (which takes > 30s.)
But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like
def finder(df, row):
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df
gives me
>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop
which is roughly an order of magnitude faster, because after the first column there's only one row left.
(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)
The best I have come up with is to take a filtering approach which seems to work quite well and prevents a lot of comparisons when the dataset is large:
tmp = raw_data
for idx, val in random_sample.iteritems():
    try:
        if np.isnan(val):
            continue
    except:
        pass
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1: print "match"
Note: this is actually slower for the small example above, but on the large dataset it is ~9 times faster than the basic iteration.

Approach to speed up pandas multilevel index selection?

I have a data frame, and would like to work on a small partition of it each time, for particular tuples of values of 'a', 'b' and 'c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
(df['b'].values == 1) & (df['c'].values == 7)] # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
Update:
Hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of the distinct values of 'a', 'b' and 'c' in df. Wouldn't groupby do an excess amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this; the applied function can be almost anything:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
certain operations are cythonized so very fast
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a multi-index one index at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
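A rough sketch of the filter idea (the threshold here is hypothetical): drop the groups you don't care about first, then aggregate what is left:
# Sketch: pre-filter groups by a criterion on their summed 'value',
# then aggregate only the surviving rows.
kept = df.groupby(['a', 'b', 'c']).filter(lambda g: g['value'].sum() > 500)
result = kept.groupby(['a', 'b', 'c']).sum()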
As noted above, the cartesian product of the groups indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
the result is an ndarray without an index.
