I am building a new method to parse a DataFrame into a Vincent-compatible format. This requires a standard Index (Vincent can't parse a MultiIndex).
Is there a way to detect whether a Pandas DataFrame has a MultiIndex?
In: type(frame)
Out: pandas.core.index.MultiIndex
I've tried:
In: if type(result.index) is 'pandas.core.index.MultiIndex':
        print True
    else:
        print False
Out: False
If I try without quotations I get:
NameError: name 'pandas' is not defined
Any help appreciated.
(Once I have the MultiIndex, I'm then resetting the index and merging the two columns into a single string value for the presentation stage.)
You can use isinstance to check whether an object is an instance of a class (or of one of its subclasses):
if isinstance(result.index, pandas.MultiIndex):
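For example, a minimal sketch (the DataFrame here is illustrative, not the asker's data):

import pandas as pd

# Illustrative frame with a two-level row index
result = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]]))

print(isinstance(result.index, pd.MultiIndex))                # True
print(isinstance(result.reset_index().index, pd.MultiIndex))  # False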
You can use nlevels to check how many levels there are:
df.index.nlevels
df.columns.nlevels
If nlevels > 1, the corresponding axis of your DataFrame certainly has a MultiIndex.
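For example, a minimal sketch (illustrative data, assuming a two-level row index):

import pandas as pd

df = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]]))

print(df.index.nlevels)    # 2 -> the row index is a MultiIndex
print(df.columns.nlevels)  # 1 -> flat columns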
There's also
len(result.index.names) > 1
but it is considerably slower than either isinstance or type:
timeit(len(result.index.names) > 1)
1000000 loops, best of 3: 1.12 µs per loop

timeit(isinstance(result.index, pd.MultiIndex))
10000000 loops, best of 3: 177 ns per loop

timeit(type(result.index) == pd.MultiIndex)
1000000 loops, best of 3: 200 ns per loop
Maybe the shortest way is if type(result.index) == pd.MultiIndex:
I would like to have a quick index lookup when using a pandas dataframe. As noted here and there, I understand that I need to keep my index unique, otherwise all hope is lost.
I made sure that the index is sorted and unique:
df = df.sort_index()
assert df.index.is_unique
I measured the lookup speed:
%timeit df.at[tuple(q.values), 'column']
1000 loops, best of 3: 185 µs per loop
Then, when I moved the index to a separate python dictionary:
index = {}
for i in df.index:
    index[i] = np.random.randint(1000)
assert len(index) == len(df.index)
I got a huge speedup:
%timeit index[tuple(q.values)]
100000 loops, best of 3: 2.7 µs per loop
Why is it so? Am I doing something wrong? Is there a way to replicate python dict's speed (or something in <5x range) in a pandas index?
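For reference, a minimal sketch to reproduce the comparison; the index contents, size, and lookup key here are assumptions, not the original data:

import numpy as np
import pandas as pd

# Assumed setup: a sorted, unique two-level integer index
n = 100000
idx = pd.MultiIndex.from_arrays([np.arange(n), np.arange(n)])
df = pd.DataFrame({'column': np.random.rand(n)}, index=idx).sort_index()
assert df.index.is_unique

key = (12345, 12345)                        # example lookup key
lookup = dict(zip(df.index, df['column']))  # plain-dict equivalent

# In IPython: %timeit df.at[key, 'column']  vs  %timeit lookup[key]
assert df.at[key, 'column'] == lookup[key]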
If I have a Python Pandas DataFrame containing two columns of people and sequence respectively like:
people  sequence
John           1
Rob            2
Bob            3
How can I return the person where sequence is maximal? In this example I want to return 'Bob'.
pandas.Series.idxmax
Is the method that tells you the index value where the maximum occurs.
Then use that to get at the value of the other column.
df.at[df['sequence'].idxmax(), 'people']
'Bob'
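Put together as a runnable snippet, using the example data from the question:

import pandas as pd

df = pd.DataFrame({'people': ['John', 'Rob', 'Bob'],
                   'sequence': [1, 2, 3]})

row = df['sequence'].idxmax()  # index label of the maximum, here 2
print(df.at[row, 'people'])    # -> 'Bob'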
I like the solution @user3483203 provided in the comments. The reason I provided a different one is to show that the same thing can be done with fewer objects created.
In this case, df['sequence'] accesses an internally stored object, and we then call the idxmax method on it. With the resulting label, we access a specific cell in the dataframe df with the at accessor.
We can see that we are accessing the internally stored object because we can access it in two different ways and validate that it is the same object.
df['sequence'] is df.sequence
True
While
df['sequence'] is df.sequence.copy()
False
On the other hand, df.set_index('people') creates a new object and that is expensive.
Clearly this is over a ridiculously small data set but:
%timeit df.loc[df['sequence'].idxmax(), 'people']
%timeit df.at[df['sequence'].idxmax(), 'people']
%timeit df.set_index('people').sequence.idxmax()
10000 loops, best of 3: 65.1 µs per loop
10000 loops, best of 3: 62.6 µs per loop
1000 loops, best of 3: 556 µs per loop
Over a much larger data set:
df = pd.DataFrame(dict(
    people=range(10000),
    sequence=np.random.permutation(range(10000))
))
%timeit df.loc[df['sequence'].idxmax(), 'people']
%timeit df.at[df['sequence'].idxmax(), 'people']
%timeit df.set_index('people').sequence.idxmax()
10000 loops, best of 3: 107 µs per loop
10000 loops, best of 3: 101 µs per loop
1000 loops, best of 3: 816 µs per loop
The relative difference is consistent.
I'm looking for the most efficient way to select multiple columns from a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,8), columns = list('abcdefgh'))
I want to select only the columns a, c, e, f, g, which can be done by using indexing:
df.ix[:,[0,2,4,5,6]]
For a large data frame of many columns this seems an inefficient method, and I would much rather specify consecutive column indexes by range, if at all possible. But attempts such as the following both throw syntax errors:
df.ix[:,[0,2,4:6]]
or
df.ix[:,[0,2,[4:6]]]
As soon as you select non-adjacent columns, you pay a performance cost.
If your data is homogeneous, falling back to NumPy gives you a notable improvement.
%timeit df[['a','c','e','f','g']]
100 loops, best of 3: 2.67 ms per loop

%timeit df.values[:,[0,2,4,5,6]]
10000 loops, best of 3: 58.7 µs per loop

%timeit df.ix[:,[0,2,4,5,6]]
1000 loops, best of 3: 1.81 ms per loop

%timeit pd.DataFrame(df.values[:,[0,2,4,5,6]], columns=df.columns[[0,2,4,5,6]])
1000 loops, best of 3: 568 µs per loop
I think you can use range:
print [0,2] + range(4,7)
[0, 2, 4, 5, 6]
print df.ix[:, [0,2] + range(4,7)]
a c e f g
0 0.278231 0.192650 0.653491 0.944689 0.663457
1 0.416367 0.477074 0.582187 0.730247 0.946496
2 0.396906 0.877941 0.774960 0.057290 0.556719
3 0.119685 0.211581 0.526096 0.213282 0.492261
Pandas is relatively well thought out; the shortest way is the most efficient:
df[['a','c','e','f','g']]
You don't need ix, as it does a search in your data; but for this you obviously need the names of the columns.
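Note that .ix was later deprecated and removed from pandas; a sketch of the positional equivalent, using .iloc with np.r_ to get the by-range selection the question asked for:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

# np.r_ concatenates plain indices and slices into one integer array
print(df.iloc[:, np.r_[0, 2, 4:7]])  # columns a, c, e, f, g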