Where is the MultiIndex information stored? [duplicate]

I am building a new method to parse a DataFrame into a Vincent-compatible format. This requires a standard Index (Vincent can't parse a MultiIndex).
Is there a way to detect whether a Pandas DataFrame has a MultiIndex?
In: type(frame.index)
Out: pandas.core.index.MultiIndex
I've tried:
In: if type(result.index) is 'pandas.core.index.MultiIndex':
        print True
    else:
        print False
Out: False
If I try without quotations I get:
NameError: name 'pandas' is not defined
Any help appreciated.
(Once I have the MultiIndex, I'm then resetting the index and merging the two columns into a single string value for the presentation stage.)

The string comparison is always False because type() returns a type object, not a string. You can use isinstance to check whether an object is an instance of a class (or of one of its subclasses):
if isinstance(result.index, pandas.MultiIndex):

You can use nlevels to check how many levels there are:
df.index.nlevels
df.columns.nlevels
If nlevels > 1, the corresponding axis of your dataframe certainly has multiple index levels.
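A minimal sketch combining both checks (the frame below is made up for illustration):
import pandas as pd

df = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['key', 'num']),
)
print(isinstance(df.index, pd.MultiIndex))  # True
print(df.index.nlevels)                     # 2
print(df.columns.nlevels)                   # 1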

There's also
len(result.index.names) > 1
but it is considerably slower than either isinstance or type:
timeit(len(result.index.names) > 1)
The slowest run took 10.95 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.12 µs per loop

timeit(isinstance(result.index, pd.MultiIndex))
The slowest run took 30.53 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 177 ns per loop

timeit(type(result.index) == pd.MultiIndex)
The slowest run took 22.86 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 200 ns per loop

Maybe the shortest way is if type(result.index) == pd.MultiIndex:

Related

Find value for column where value for a separate column is maximal

If I have a Python Pandas DataFrame containing two columns of people and sequence respectively like:
people sequence
John 1
Rob 2
Bob 3
How can I return the person where sequence is maximal? In this example I want to return 'Bob'.
pandas.Series.idxmax is the method that tells you the index value where the maximum occurs. Then use that to get at the value of the other column:
df.at[df['sequence'].idxmax(), 'people']
'Bob'
I like the solution #user3483203 provided in the comments. The reason I provided a different one is to show that the same thing can be done with fewer objects created.
In this case, df['sequence'] accesses an internally stored object, and we subsequently call the idxmax method on it. We then access a specific cell in the dataframe df with the at accessor.
We can see that we are accessing the internally stored object because we can access it in two different ways and validate that it is the same object.
df['sequence'] is df.sequence
True
While
df['sequence'] is df.sequence.copy()
False
On the other hand, df.set_index('people') creates a new object and that is expensive.
Clearly this is over a ridiculously small data set but:
%timeit df.loc[df['sequence'].idxmax(), 'people']
%timeit df.at[df['sequence'].idxmax(), 'people']
%timeit df.set_index('people').sequence.idxmax()
10000 loops, best of 3: 65.1 µs per loop
10000 loops, best of 3: 62.6 µs per loop
1000 loops, best of 3: 556 µs per loop
Over a much larger data set:
df = pd.DataFrame(dict(
    people=range(10000),
    sequence=np.random.permutation(range(10000))
))
%timeit df.loc[df['sequence'].idxmax(), 'people']
%timeit df.at[df['sequence'].idxmax(), 'people']
%timeit df.set_index('people').sequence.idxmax()
10000 loops, best of 3: 107 µs per loop
10000 loops, best of 3: 101 µs per loop
1000 loops, best of 3: 816 µs per loop
The relative difference is consistent.

What's the difference between set_value and = in pandas

In writing to a dataframe in pandas, we see we have a couple of ways to do it, as provided by this answer and this answer.
We have the method of
df.set_value(r, c, some_value) and the method of
df.iloc[r][c] = some_value.
What is the difference? Which is faster? Is either a copy?
The difference is that set_value returns an object, while the assignment operator assigns the value into the existing DataFrame object.
After calling set_value you will potentially have two DataFrame objects (this does not necessarily mean you'll have two copies of the data, as DataFrame objects can "reference" one another), while the assignment operator changes data in the single DataFrame object.
It appears to be faster to use set_value, as it is probably optimized for that use case, while the assignment approach will generate intermediate slices of the data:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df=pd.DataFrame(np.random.rand(100,100))
In [4]: %timeit df[10][10]=7
The slowest run took 6.43 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 89.5 µs per loop
In [5]: %timeit df.set_value(10,10,11)
The slowest run took 10.89 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 3.94 µs per loop
The result of set_value may be a copy, but the documentation is not really clear (to me) on this:
Returns:
frame : DataFrame
If label pair is contained, will be reference to calling DataFrame, otherwise a new object
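A side note for newer pandas: set_value was deprecated (around pandas 0.21) and later removed, with .at/.iat as the supported fast scalar setters. A minimal sketch of the modern equivalents of the timing example above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))
df.at[10, 10] = 7.0   # label-based scalar assignment, mutates df in place
df.iat[10, 10] = 7.0  # position-based scalar assignment, also in place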

Pandas replace/dictionary slowness

Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than expected for what should be fewer than 1,000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary, you can get a very slight performance boost by not creating a lambda and directly supplying the dictionary.get function. Any keys that are not present will map to NaN with this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
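A tiny demonstration of the missing-key behavior described above (the values are made up):
import pandas as pd

s = pd.Series([1, 2, 3])
d = {1: 'one', 2: 'two'}
print(s.map(lambda x: d.get(x, x)).tolist())  # ['one', 'two', 3] -- falls back to the original value
print(s.map(d).tolist())                      # ['one', 'two', nan] -- missing keys become NaN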
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (or it returns NaN). The fast but generic solution (one that can handle substrings) is to first use .replace on a dict of all possible values (obtained e.g. with .value_counts().index) and then map all rows of the Series through the resulting dict with .map. This combo can handle, for instance, special national character replacements (full substrings) on 1m-row columns in a quarter of a second, where .replace alone would take 15 seconds.
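A minimal sketch of that combo, assuming the replacement dict holds regex/substring rules (the helper name fast_replace is made up):
import pandas as pd

def fast_replace(series, replacements):
    # run the expensive .replace only over the unique values...
    uniques = pd.Series(series.value_counts().index)
    translated = uniques.replace(replacements, regex=True)  # handles substrings
    mapping = dict(zip(uniques, translated))
    # ...then broadcast it with the cheap .map, which needs complete values
    return series.map(mapping)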
Thanks to #root: I ran the benchmark again and found different results on pandas v1.1.4. series.map(dictionary) was the fastest; note that it also returns NaN if a key is not present.

Speeding up .ix in pandas

I'm trying to speed up the following code. 'db' is a dictionary of DataFrames. Is there a better/different way to structure things which would speed this up?
for date in dates:  # 3,800 days
    for instrument in instruments:  # 100 instruments
        s = instrument.ticker
        current_bar = db[s].ix[date]
        # (current_bar.xxx then gets used for difference calculations.)
Here are the results:
%timeit speedTest()
1 loops, best of 3: 1min per loop
This is for each individual call:
%timeit current_bar = db[s].ix[date]
10000 loops, best of 3: 154 µs per loop
Any help/suggestions would be appreciated.
Thanks
I don't think a dict of DataFrames is a good idea. Try structuring all the DataFrames in one: stack them vertically and use the dict key as a level of a MultiIndex, as in the sketch below.
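A minimal sketch of that restructuring, where db is the dict of DataFrames from the question and each frame is assumed to be indexed by date (note that .ix has since been removed from pandas; .loc is its replacement):
import pandas as pd

# the dict keys (tickers) become the outer level of a MultiIndex
combined = pd.concat(db, names=['ticker', 'date'])
combined = combined.sort_index()  # a sorted index makes .loc lookups fast

# one indexed lookup per (ticker, date) pair:
# current_bar = combined.loc[(s, date)]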
