When writing to a DataFrame in pandas, there are a couple of ways to do it, as suggested by this answer and this answer.
One is
df.set_value(r, c, some_value) and the other is
df.iloc[r][c] = some_value.
What is the difference? Which is faster? Is either a copy?
The difference is that set_value returns a DataFrame object, while the assignment operator assigns the value into the existing DataFrame object.
After calling set_value you will potentially have two DataFrame objects (which does not necessarily mean two copies of the data, since DataFrame objects can "reference" one another), while the assignment operator changes data in the single DataFrame object.
It appears to be faster to use set_value, as it is probably optimized for that use case, while the assignment approach generates intermediate slices of the data:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df=pd.DataFrame(np.random.rand(100,100))
In [4]: %timeit df[10][10]=7
The slowest run took 6.43 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 89.5 µs per loop
In [5]: %timeit df.set_value(10,10,11)
The slowest run took 10.89 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 3.94 µs per loop
The result of set_value may be a copy, but the documentation is not really clear (to me) on this:
Returns:
frame : DataFrame
If label pair is contained, will be reference to calling DataFrame, otherwise a new object
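Worth noting (this is not part of the quoted documentation): in recent pandas versions set_value has been removed (as of pandas 1.0), and .at / .iat are the supported fast scalar accessors. A minimal sketch of the equivalent assignments:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

# Label-based fast scalar assignment (replacement for the removed set_value)
df.at[10, 10] = 7

# Position-based equivalent
df.iat[10, 10] = 7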
Related
Pandas to_dict("records") seems to have a much inferior performance compared to a naive implementation. Below is the code snippet of my implementation:
def fast_to_dict_records(df):
data = df.values.tolist()
columns = df.columns.tolist()
return [
dict(zip(columns, datum))
for datum in data
]
To compare the performance, try the below code snippet:
import pandas as pd
import numpy as np
df_test = pd.DataFrame(
    np.random.normal(size=(10000, 300)),
    columns=range(300)
)
%timeit df_test.to_dict('records')
%timeit fast_to_dict_records(df_test)
And the outputs are:
2.21 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
293 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Namely, my implementation is ~7.5x faster than the pandas native implementation. It should also be easy to verify that the two methods produce the same result. I have also tested the performance against different sizes of dataframes, and my implementation seems to consistently outperform its counterpart (although the magnitude may differ).
Am I missing anything here? I am just not convinced that pandas's native implementation, which I had assumed was quite competitive, can be beaten that much by a not-so-complicated alternative...
TL;DR: Pandas is mostly written in pure Python, like your implementation, although it often uses vectorized Numpy calls internally to speed the computation up. Unfortunately, that is not the case here. As a result, the Pandas implementation is inefficient. Your implementation is faster, but it requires more memory.
In-depth study:
You can find the implementation of to_dict here. It iterates over the data using itertuples internally (see here for its code). The resulting (slightly simplified) Pandas code as of 12 March 2021 is the following:
def maybe_box_native(value: Scalar) -> Scalar:
    if is_datetime_or_timedelta_dtype(value):  # branch never taken here
        value = maybe_box_datetimelike(value)
    elif is_float(value):     # branch always taken here
        value = float(value)  # slow manual conversion for EACH value!
    elif is_integer(value):
        value = int(value)
    elif is_bool(value):
        value = bool(value)
    return value

def pandas_to_list(df):
    # From itertuples:
    fields = list(df.columns)
    arrays = [df.iloc[:, k] for k in range(len(df.columns))]
    tmpRes = zip(*arrays)

    # From to_dict:
    columns = df.columns.tolist()
    rows = (dict(zip(columns, row)) for row in tmpRes)
    return [dict((k, maybe_box_native(v)) for k, v in row.items()) for row in rows]
Your implementation generates a big temporary list in memory using values.tolist(), while Pandas works with Python generators internally. This list should not be a problem in practice in most simple cases, since the resulting dicts will eventually be much bigger anyway.
However, values.tolist() (in your implementation) also converts the Numpy types efficiently using vectorized Numpy calls internally, while Pandas uses a very slow approach: it checks and converts all the values one by one with the pure-Python maybe_box_native function and its slow if/else chain. It is thus not surprising that the Pandas implementation is slower. That being said, note that your code might behave differently with dates.
The current Pandas implementation is inefficient and could clearly be improved in the future (possibly without requiring much more memory).
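To illustrate a related dtype caveat (a small sketch, not from the original answer): because df.values homogenizes dtypes across columns, the fast implementation from the question can return different Python types than to_dict on a mixed-dtype frame:
import pandas as pd

# Assumes fast_to_dict_records from the question is in scope
df_mixed = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

print(df_mixed.to_dict("records"))
# [{'a': 1, 'b': 0.5}, {'a': 2, 'b': 1.5}]      <- 'a' stays an int
print(fast_to_dict_records(df_mixed))
# [{'a': 1.0, 'b': 0.5}, {'a': 2.0, 'b': 1.5}]  <- 'a' upcast to float by df.values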
I am building a new method to parse a DataFrame into a Vincent-compatible format. This requires a standard Index (Vincent can't parse a MultiIndex).
Is there a way to detect whether a Pandas DataFrame has a MultiIndex?
In: type(frame)
Out: pandas.core.index.MultiIndex
I've tried:
In: if type(result.index) is 'pandas.core.index.MultiIndex':
        print True
    else:
        print False
Out: False
If I try without quotations I get:
NameError: name 'pandas' is not defined
Any help appreciated.
(Once I have the MultiIndex, I'm then resetting the index and merging the two columns into a single string value for the presentation stage.)
You can use isinstance to check whether an object is an instance of a class (or of one of its subclasses):
if isinstance(result.index, pandas.MultiIndex):
You can use nlevels to check how many levels there are:
df.index.nlevels
df.columns.nlevels
If nlevels > 1, your dataframe certainly has a MultiIndex (on the index or the columns, respectively).
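For example, both checks on a small illustrative frame (a sketch):
import pandas as pd

df = pd.DataFrame(
    {"v": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]]),
)

print(isinstance(df.index, pd.MultiIndex))  # True
print(df.index.nlevels)                     # 2
print(df.columns.nlevels)                   # 1 (flat columns)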
There's also
len(result.index.names) > 1
but it is considerably slower than either isinstance or type:
timeit(len(result.index.names) > 1)
The slowest run took 10.95 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.12 µs per loop
timeit(isinstance(result.index, pd.MultiIndex))
The slowest run took 30.53 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 177 ns per loop
timeit(type(result.index) == pd.MultiIndex)
The slowest run took 22.86 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 200 ns per loop
Maybe the shortest way is if type(result.index)==pd.MultiIndex:
Today I started working with a large dataframe in Python and tried to replace a column with a scalar multiple of itself.
For example, if some_column held [1, 2, 3, 4, 5], multiplying by 10 should give [10, 20, 30, 40, 50].
I tried two different ways to accomplish this:
First way:
df['some_column'] = df['some_column']*10
Second way:
df['some_column'] = df['some_column'].apply(lambda x: x*10)
I used the timeit function and noticed they both had similar runtimes. With the first way, the Jupyter Notebook kernel crashed when I included the entire dataset, so I assumed it ran out of memory. The second way worked as intended, so I'm guessing it's less memory intensive.
Question: Am I correct in assuming lambda functions use less memory? If so, is it best practice to use Lambdas as frequently as possible? Are there any comparable ways that might be less resource-intensive than the two I listed here?
Thanks!
Edit: I tried this in a different environment and it didn't crash so my earlier assumption on memory is not correct. In the set I am working with, the data type of 'some_column' is int64. The runtimes of the two ways were 35ms and 56ms, respectively.
Your assumption is wrong on both time and memory consumption. And the question is slightly misframed: it is not about lambda alone, but about apply combined with lambda.
Profiling shows how much less efficient this approach is than the equivalent vectorized operations implemented in pandas.
apply and Lambdas are slower
In [1]: import pandas as pd
In [2]: s = pd.Series(range(10000000))
In [3]: %timeit s * 10
100 loops, best of 3: 13.7 ms per loop
In [4]: %timeit s.multiply(10) # Using the function itself gives
100 loops, best of 3: 13.8 ms per loop # same thing as above
In [5]: %timeit s.apply(lambda x: x * 10)
1 loop, best of 3: 2.92 s per loop # Factor 200 for timing
apply and Lambdas use more memory
In [1]: %load_ext memory_profiler
In [2]: import pandas as pd
In [3]: s = pd.Series(range(10000000))
In [4]: %memit s * 10
peak memory: 163.02 MiB, increment: 38.15 MiB
In [5]: %memit s.multiply(10) # Using the function itself gives
peak memory: 163.01 MiB, increment: 37.96 MiB # same thing as above
In [6]: %memit s.apply(lambda x: x * 10)
peak memory: 1202.03 MiB, increment: 1077.40 MiB # Factor 7 for memory
Lambdas and apply are useful in some cases, but they should not be overused.
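As an illustration (a hedged sketch, not from the original answer) of a case where apply is a fair choice because there is no vectorized pandas operation for the per-element work, e.g. parsing arbitrary JSON strings:
import json
import pandas as pd

s = pd.Series(['{"a": 1}', '{"a": 2, "b": 3}'])

# Each element needs a Python-level call; no vectorized equivalent exists here
parsed = s.apply(json.loads)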
Could you try some profiling on your operations, maybe using a subset of your dataframe to avoid the crash? I would be surprised if this were a memory issue occurring with multiply and not with apply.
Bonus: A bit of reading about pandas performance that you may find interesting.
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than expected for performing <1000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
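A tiny sketch of how the three variants handle a key that is missing from the dictionary:
import pandas as pd

s = pd.Series([1, 2, 99])
d = {1: "a", 2: "b"}

s.map(lambda x: d.get(x, x))  # missing key (99) is kept as-is
s.map(d.get)                  # missing key becomes None/NaN
s.map(d)                      # missing key becomes NaN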
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (or it returns NaN). The fast but generic solution (that can handle substrings) is to first use .replace on a dict of all possible values (obtained e.g. with .value_counts().index) and then go over all rows of the Series with this dict and .map. This combo can handle, for instance, special national character replacements (full substrings) on 1m-row columns in a quarter of a second, where .replace alone would take 15 seconds.
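A minimal sketch of that two-step combo (the data and replacement dict here are illustrative):
import pandas as pd

s = pd.Series(["café", "naïve", "café", "crème"] * 250_000)

# 1) Run the substring-capable (but slow) .replace only on the unique values
uniques = pd.Series(s.unique())
replaced = uniques.replace({"é": "e", "ï": "i", "è": "e"}, regex=True)

# 2) Build a full-value dict and apply it to the whole column with the fast .map
mapping = dict(zip(uniques, replaced))
s_clean = s.map(mapping)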
Thanks to @root: I did the benchmarking again and found different results on pandas v1.1.4.
Found series.map(dictionary) to be the fastest; it also returns NaN if a key is not present.