pandas dataframe's `apply` slow when applied on index - python

I have a pandas dataframe df with an index of type DatetimeIndex with parameters: dtype='datetime64[ns]', name='DateTime', length=324336, freq=None. The dataframe has 22 columns, all numerical.
I want to create a new column Date with only the date-part of DateTime (to be used for grouping later).
My first attempt
df['Date'] = df.apply(lambda row: row.name.date(), axis=1)
takes ca. 13.5 seconds.
But when I make DateTime a regular column, it goes faster, even including the index operations:
df.reset_index(inplace=True)
df['Date'] = df.apply(lambda row: row['DateTime'].date(), axis=1)
df.set_index('DateTime', inplace=True)
This takes ca. 6.3 s, i.e., it is twice as fast. Furthermore, applying apply directly on the series (since it depends only on one column) is faster still:
df.reset_index(inplace=True)
df['Date'] = df['DateTime'].apply(lambda dt: dt.date())
df.set_index('DateTime', inplace=True)
takes ca. 1.1 s, more than 10 times faster than the original solution.
This brings me to my questions:
Is applying apply on a series generally faster than doing it on the dataframe?
Is using apply on an index generally slower than on columns?
More generally: what is the advantage of keeping a column as an index? Or, conversely, what would I lose by resetting the index before doing any operations?
Finally, is there an even better/faster way of adding the column?

Use DatetimeIndex.date, which should be a faster solution:
df['Date'] = df.index.date
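As a minimal, self-contained sketch (the frame below is made up to mirror the question's shape, and normalize is an alternative not mentioned in the original answer):
import numpy as np
import pandas as pd

# hypothetical frame: a DatetimeIndex named 'DateTime' and 22 numeric columns
idx = pd.date_range('2015-01-01', periods=324336, freq='min', name='DateTime')
df = pd.DataFrame(np.random.randn(len(idx), 22), index=idx)

df['Date'] = df.index.date           # vectorized; yields datetime.date objects
# df['Date'] = df.index.normalize()  # alternative: keeps datetime64 dtype
#                                    # (timestamps at midnight), handy for grouping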
Is applying apply on a series generally faster than doing it on the dataframe?
Is using apply on an index generally slower than on columns?
apply is essentially a loop under the hood, so it is obviously slower than vectorized pandas methods.
More general: what is the advantage of keeping a column as an index? Or, conversely, what would I loose by resetting the index before doing any operations?
You can check this:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
Enables automatic and explicit data alignment.
Allows intuitive getting and setting of subsets of the data set.
Also, when working with time series, many methods such as resample require a DatetimeIndex, and a DatetimeIndex also enables convenient datetime-based indexing.
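For instance, reusing the sketch frame from above (a rough illustration, not from the original answer):
daily = df.resample('D').mean()   # resample requires a datetime-like index
january = df.loc['2015-01']       # partial-string datetime indexing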

Related

assign not working in grouped pandas dataframe

In a complex chained method using pandas, one of the steps is grouping data by a column and then calculating some metrics. This is a simplified example of the procedure I want to achieve. I have many more assignments in the workflow, but it is failing miserably at the first one.
import pandas as pd
import numpy as np
data = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'], 'first': [1, 12, 4, 5, 4, 3], 'last': [5, 3, 4, 5, 2, 7]})
data.groupby('Group').assign(average_ratio=lambda x: np.mean(x['first']/x['last']))
>>>> AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
I know I could use apply this way:
data.groupby('Group').apply(lambda x: np.mean(x['first']/x['last']))
Group
A    1.733333
B    1.142857
dtype: float64
or much better, renaming the column in the same step:
data.groupby('Group').apply(lambda x: pd.Series({'average_ratio':np.mean(x['first']/x['last'])}))
       average_ratio
Group
A           1.733333
B           1.142857
Is there any way of using .assign to obtain the same?
To answer your last question: no, for these needs you cannot. DataFrame.assign simply adds new columns or replaces existing ones; it returns a DataFrame with the same index, with the new/adjusted columns.
You are attempting a grouped aggregation that reduces the rows to group level, thereby changing the index and the DataFrame granularity from unit level to aggregated group level. Therefore you need to run your groupby operations without assign.
To encapsulate multiple assigned aggregated columns that aligns to chained process, use a defined method and then apply it accordingly:
def aggfunc(grp):
    # grp is the sub-DataFrame for one group; each aggregated value is
    # broadcast to every row of that group
    grp['first_mean'] = np.mean(grp['first'])
    grp['last_mean'] = np.mean(grp['last'])
    grp['average_ratio'] = np.mean(grp['first'].div(grp['last']))
    return grp

agg_data = data.groupby('Group').apply(aggfunc)
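If what you actually want is the reduced, one-row-per-group result rather than group-level columns broadcast to every row, a hedged alternative sketch is named aggregation (available since pandas 0.25); the output column names here are only illustrative:
result = (
    data.assign(ratio=data['first'] / data['last'])   # per-row ratio first
        .groupby('Group')
        .agg(first_mean=('first', 'mean'),
             last_mean=('last', 'mean'),
             average_ratio=('ratio', 'mean'))
)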

Difference between "as_index = False", and "reset_index()" in pandas groupby

I just wanted to know what is the difference in the function performed by these 2.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index() :
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False :
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
  ID  value
0  A     18
1  B      6
2  C      6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False, because it saves you some typing and an unnecessary pandas operation ;)
However, sometimes, you want to apply more complicated operations on your groups. In those occasions, you might find out that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, and then to sum the values over axis 0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe back into the desired form, as sketched below.
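A minimal sketch of Example 1 (the frame and column names are made up for illustration):
import pandas as pd

scores = pd.DataFrame({"ID": ["A", "B", "A", "C"],
                       "x": [1, 2, 3, 4],
                       "y": [10, 20, 30, 40],
                       "z": [5, 6, 7, 8]})

per_group = scores.groupby("ID").sum()   # as_index=True: ID becomes the index
totals = per_group.sum(axis=1)           # sum x+y+z per group, no column names needed
grand_total = totals.sum()               # then sum over axis 0
tidy = per_group.reset_index()           # restore ID as a column when done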
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on a common column rather than on an index, which is often much easier.
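A minimal sketch of Example 2, reusing the question's df (the condition and the assigned value are arbitrary):
agg = df.groupby("ID", as_index=False)["value"].sum()
# "ID" is an ordinary column here, so the condition is a plain comparison
agg.loc[agg["ID"] == "A", "value"] = 0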
At some point you might come across a KeyError when applying operations on groups. That is often because you are trying to use, in your aggregate function, a column that is currently an index of your GroupBy object.

How to find the difference between each subsequent pair of DataFrame.index values in pandas?

I have created a DataFrame in order to process some data, and I want to find the difference in time between each pair of data in the DataFrame. Prior to using pandas, I was using two numpy arrays, one describing the data and the other describing time (an array of datetime.datetimes). With the data in arrays, I could do timearray[1:] - timearray[:-1] which resulted in an array (of n-1 elements) describing the gap in time between each pair of data.
In pandas, doing DataFrame.index[1] - DataFrame.index[0] gives me the result I want – the difference in time between the two indices I've picked out. However, doing DataFrame.index[1:] - DataFrame.index[:-1] does not yield an array of similar results; instead, the result is simply equal to DataFrame.index[-1]. Why is this, and how can I replicate the numpy behaviour in pandas?
Alternatively, what is the best way to find datagaps in a DataFrame in pandas?
You can use shift to offset the date and use it to calculate the difference between rows.
# create dummy data
import pandas as pd
rng = pd.date_range('1/1/2011', periods=90, freq='h')
# shift a copy of the date column and subtract from the original date
df = pd.DataFrame({'value':range(1,91),'date':rng})
df['time_gap'] = df['date'] - df['date'].shift(1)
To apply this to your case, temporarily turn your index into a column with .reset_index(), then use .set_index('date') to make the date column the index again if required.
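As a side note (not part of the original answer), Series.diff computes the same pairwise difference in a single call:
df['time_gap'] = df['date'].diff()   # equivalent to df['date'] - df['date'].shift(1)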

Best way to set a multiindex on a pandas dataframe

I have a Dataframe df with these columns:
Group
Year
Gender
Feature_1
Feature_2
Feature_3
...
I want to use MultiIndex to stack the data later, and I tried this way:
df.index = pd.MultiIndex.from_arrays([df['Group'], df['Year'], df['Gender']])
This instruction successfully creates a MultiIndex for my Dataframe, but is there a better way that also removes the original columns?
Indexing in pandas is easier than this. You do not need to create your own instance of the MultiIndex class.
The pandas DataFrame has a method called .set_index() which takes either a single column as argument or a list of columns. Supplying a list of columns will set a multiindex for you.
Like this:
df.set_index(['Group', 'Year', 'Gender'], inplace=True)
Note the inplace=True, which I highly recommend.
When you are dealing with huge dataframes that barely fit in memory, inplace operations can literally halve your peak memory usage.
Consider this:
df2 = df1.set_index('column') # Don't do this
del df1 # Don't do this
When this operation is done, the memory usage is back to about the same as before, but only because we do del df1. In the time between these two commands, there are two copies of the same dataframe in memory and therefore double the usage.
Doing this is implicitly the same:
df1 = df1.set_index('column') # Don't do this either
and it still momentarily takes double the memory compared to doing it inplace.

MultiIndexing rows vs. columns in pandas DataFrame

I am working with multiindexing dataframe in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd

# pd.MultiIndex.from_product replaces the long-removed
# pd.tools.util.cartesian_product + from_arrays combination
colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I choose to multiindex the column, with the rationale that pandas dataframe consists of series, and my data ultimately is a bunch of time series (hence row-indexed by time here).
I have this question because there seems to be some asymmetry between rows and columns for multiindexing. For example, this documentation page shows how query works for a row-multiindexed dataframe, but if the dataframe is column-multiindexed, the command from the documentation has to be replaced by something like df.T.query('color == "red"').T.
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column-propensity of some common operations for DataFrame:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, it seems multiindexing rows is slightly more convenient.
A natural question of mine: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing dataframes, yet one slices columns and the others slice rows, which seems a bit odd.
