Best way to set a multiindex on a pandas dataframe - python

I have a Dataframe df with these columns:
Group
Year
Gender
Feature_1
Feature_2
Feature_3
...
I want to use MultiIndex to stack the data later, and I tried this way:
df.index = pd.MultiIndex.from_arrays([df['Group'], df['Year'], df['Gender']])
This successfully creates a MultiIndex for my DataFrame, but is there a better way that also removes the original columns?

Indexing in pandas is easier than this. You do not need to create your own instance of the MultiIndex class.
The pandas DataFrame has a method called .set_index() which takes either a single column or a list of columns as its argument. Supplying a list of columns will set a MultiIndex for you.
Like this:
df.set_index(['Group', 'Year', 'Gender'], inplace=True)
Note the inplace=True, which I highly recommend.
When you are dealing with huge DataFrames that barely fit in memory, in-place operations can literally halve your memory usage.
Consider this:
df2 = df1.set_index('column') # Don't do this
del df1 # Don't do this
When this operation is done, the memory usage will be about the same as before, but only because we do del df1. Between these two commands there are two copies of the same DataFrame in memory, i.e. double the memory.
Doing this is implicitly the same:
df1 = df1.set_index('column') # Don't do this either
And it will still briefly take double the memory compared to doing it in place.
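To make this concrete, here is a minimal sketch (the data values are made up for illustration) showing that set_index moves the chosen columns into the index, which also answers the original question about removing them from the regular columns:
import pandas as pd

# Toy data with the same column names as in the question
df = pd.DataFrame({'Group': ['A', 'A', 'B'],
                   'Year': [2020, 2021, 2020],
                   'Gender': ['F', 'M', 'F'],
                   'Feature_1': [1.0, 2.0, 3.0]})

df.set_index(['Group', 'Year', 'Gender'], inplace=True)

print(df.index.names)    # ['Group', 'Year', 'Gender']
print(list(df.columns))  # ['Feature_1'] -- the index columns are no longer regular columns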

Related

How do I turn a Pandas DataFrame object with 1 main column into a Pandas Series with the index column from the original DataFrame

Say I have a simple data frame as below where I set the index of this dataframe to be the time column.
E.g
import pandas as pd
df = pd.DataFrame({'time':['2021-02-20','2021-02-21','2021-02-22','2021-02-23'], 'price':[1,2,3,4]})
df.set_index('time', inplace=True)
Now this dataframe has only 1 main column (price), so I want to know the best way to take this dataframe and simply convert it to a Series.
I feel like this can be done using the pandas squeeze() method, but I want to know if there are any other alternatives or better ways; also correct me if my method seems wrong.
E.g
# Convert the original DataFrame to a Series; since there is only 1 main column
# we can pass 'columns' to the squeeze method
df = df.squeeze('columns')
I think it is simpler to just select the column:
s = df.set_index('time')['price']
I think inplace is not good practice; see this and this.
df["price"] would also give you the same thing.

pandas dataframe's `apply` slow when applied on index

I have a pandas dataframe df with an index of type DatetimeIndex with parameters: dtype='datetime64[ns]', name='DateTime', length=324336, freq=None. The dataframe has 22 columns, all numerical.
I want to create a new column Date with only the date-part of DateTime (to be used for grouping later).
My first attempt
df['Date'] = df.apply(lambda row: row.name.date(), axis=1)
takes ca. 13.5 seconds.
But when I make DateTime a regular column, it goes faster, even including the index operations:
df.reset_index(inplace=True)
df['Date'] = df.apply(lambda row: row['DateTime'].date(), axis=1)
df.set_index('DateTime', inplace=True)
This takes ca. 6.3 s, i.e., it is twice as fast. Furthermore, applying apply directly on the series (since it depends only on one column) is faster still:
df.reset_index(inplace=True)
df['Date'] = df['DateTime'].apply(lambda dt: dt.date())
df.set_index('DateTime', inplace=True)
takes ca. 1.1 s, more than 10 times faster than the original solution.
This brings me to my questions:
Is applying apply on a series generally faster than doing it on the dataframe?
Is using apply on an index generally slower than on columns?
More generally: what is the advantage of keeping a column as an index? Or, conversely, what would I lose by resetting the index before doing any operations?
Finally, is there an even better/faster way of adding the column?
Use DatetimeIndex.date, which should be a faster solution:
df['Date'] = df.index.date
Is applying apply on a series generally faster than doing it on the dataframe?
Is using apply on an index generally slower than on columns?
I think apply is a loop under the hood, so it is obviously slower than vectorized pandas methods.
More generally: what is the advantage of keeping a column as an index? Or, conversely, what would I lose by resetting the index before doing any operations?
You can check this:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
Enables automatic and explicit data alignment.
Allows intuitive getting and setting of subsets of the data set.
Also, if you are working with time series, there are many methods like resample that work with a DatetimeIndex, and partial string indexing with a DatetimeIndex is possible as well.
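A short sketch of the vectorized approach, plus two of the DatetimeIndex conveniences mentioned above (the toy index and column are made up; resample and partial string indexing are only shown as examples of index-aware operations):
import numpy as np
import pandas as pd

# Toy frame with a DatetimeIndex, roughly like the question's setup
idx = pd.date_range('2021-01-01', periods=1000, freq='h', name='DateTime')
df = pd.DataFrame({'value': np.random.randn(len(idx))}, index=idx)

# Vectorized: one call on the whole index, no per-row Python function
df['Date'] = df.index.date

# Index-aware conveniences: resampling and partial string indexing
daily = df['value'].resample('D').mean()
january = df.loc['2021-01']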

Aggregate Python DF based on column

I have a big dataframe (approximately 35 columns), where one column, concat_strs, is a concatenation of 8 columns in the dataframe. This is used to detect duplicates. What I want to do is to aggregate rows where concat_strs has the same value, on the columns val, abs_val, price, abs_price (using sum).
I have done the following:
agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum'}
final_df = df.groupby('concat_strs', as_index=False).aggregate(agg_attributes)
But, when I look at final_df, I notice 2 issues:
Other columns are removed, so I have only 5 columns. I have tried to do final_df.reindex(columns=df.columns), but then all of the other columns are NaN
The number of rows in the final_df remains the same as in the df (ca. 300k rows). However, it should be reduced (checked manually)
The question is - what is done wrong and is there any improvement suggestion?
You group by concat_strs, so only concat_strs and the columns in agg_attributes are kept; in a groupby operation, pandas does not know what to do with the other columns.
You can include the other columns with a 'first' aggregation to keep the first value of each column (if duplicated), or 'last', etc., depending on what you need (see the sketch below).
Also, I am not sure this concatenated-string way to dedup is a good approach; can you simply drop all the duplicates?
You don't need concat_strs either, as groupby supports a list of columns to group on.
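A hedged sketch of both suggestions; the column names other_col and key_1 ... key_8 are placeholders, since the real column names are not given in the question:
# Keep the remaining columns by taking their first value within each group
agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum',
                  'other_col': 'first'}  # repeat 'first' for every other column you want to keep
final_df = df.groupby('concat_strs', as_index=False).agg(agg_attributes)

# Or skip concat_strs entirely and group on the original key columns directly
key_cols = ['key_1', 'key_2', 'key_3', 'key_4', 'key_5', 'key_6', 'key_7', 'key_8']
final_df = df.groupby(key_cols, as_index=False).agg(agg_attributes)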
Not sure if I understood the question correctly, but you can try this:
final_df = df.groupby(['concat_strs']).sum()

Can I get concat() to ignore column names and work only based on the position of the columns?

The docs, at least as of version 0.24.2, specify that pandas.concat can ignore the index, with ignore_index=True, but
Note the index values on the other axes are still respected in the join.
Is there a way to avoid this, i.e. to concatenate based on the position only, and ignoring the names of the columns?
I see two options:
rename the columns so they match, or
convert to numpy, concatenate in numpy, then from numpy back to pandas
Are there more elegant ways?
For example, if I want to add the series s as an additional row to the dataframe df, I can:
convert s to frame
transpose it
rename its columns so they are the same as those of df
concatenate
It works, but it seems very "un-pythonic"!
A toy example is below; this example is with a dataframe and a series, but the same concept applies with two dataframes.
import pandas as pd
df = pd.DataFrame()
df['a'] = [1]
df['x'] = 'this'
df['y'] = 'that'
s = pd.Series([3, 'txt', 'more txt'])
st = s.to_frame().transpose()
st.columns = df.columns
out = pd.concat([df, st], axis=0, ignore_index=True)
In the case of 1 dataframe and 1 series, you can do:
df.loc[df.shape[0], :] = s.values
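For the general case of two DataFrames, a small sketch of both options mentioned in the question (df1 and df2 are hypothetical frames with the same number of columns but different column names):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'x': ['this', 'that']})
df2 = pd.DataFrame({'c': [3, 4], 'z': ['more', 'txt']})

# Option 1: align the column names, then concatenate as usual
out1 = pd.concat([df1, df2.set_axis(df1.columns, axis=1)], ignore_index=True)

# Option 2: go through numpy, which has no notion of column names
out2 = pd.DataFrame(np.vstack([df1.to_numpy(), df2.to_numpy()]), columns=df1.columns)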

MultiIndexing rows vs. columns in pandas DataFrame

I am working with a multiindexed dataframe in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd
# build the Cartesian product of the levels directly;
# pd.MultiIndex.from_product replaces the old pd.tools.util.cartesian_product + from_arrays combination
colidxs = pd.MultiIndex.from_product([['condition1', 'condition2'],
                                      ['patient1', 'patient2'],
                                      ['measure1', 'measure2', 'measure3']],
                                     names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I choose to multiindex the columns, with the rationale that a pandas DataFrame consists of Series, and my data ultimately is a bunch of time series (hence row-indexed by time here).
I have this question because it seems there is some asymmetry between rows and columns for multiindexing. For example, this documentation page shows how query works for a row-multiindexed dataframe, but if the dataframe is column-multiindexed then the command in the documentation has to be replaced by something like df.T.query('color == "red"').T.
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column-propensity of some common operations for DataFrame:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, it seems multiindexing rows is slightly more convenient.
A natural question of mine: why don't pandas developers unify the row/column propensity of DataFrame operations? For example, that [] and loc/iloc/ix are two most common ways of indexing dataframes but one slices columns and the others slice rows seems a bit odd.
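To illustrate the asymmetry on the toy data built above, a brief sketch (nothing here is new API; it just shows the axis=1 argument for xs versus the transpose trick needed for query):
# xs is "row-first" but accepts axis=1, so it can slice the column MultiIndex directly
subset = data.xs('condition1', level='condition', axis=1)

# query is "row-only": for a column-multiindexed frame you have to transpose,
# filter on the (now) row index levels, and transpose back
filtered = data.T.query("patient == 'patient1'").T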
