I am working with a multi-indexed DataFrame in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd
# Note: pd.tools.util.cartesian_product has been removed from pandas;
# pd.MultiIndex.from_product builds the same product index directly.
colidxs = pd.MultiIndex.from_product([['condition1', 'condition2'],
                                      ['patient1', 'patient2'],
                                      ['measure1', 'measure2', 'measure3']],
                                     names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I chose to multiindex the columns, on the rationale that a pandas DataFrame is a collection of Series, and my data is ultimately a bunch of time series (hence the rows are indexed by time here).
I ask because there seems to be some asymmetry between rows and columns for multiindexing. For example, the documentation page shows how query works for a row-multiindexed DataFrame, but if the DataFrame is column-multiindexed, the command from the documentation has to be replaced by something like df.T.query('color == "red"').T.
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column propensity of some common DataFrame operations:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, it seems multiindexing rows is slightly more convenient.
A natural follow-up question: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing dataframes, yet the fact that one selects columns and the others select rows seems a bit odd.
Related
I have a dataframe with a column MultiIndex whose level-0 values are virtual and actual.
I would like to select (or get a subset of) the dataframe with different columns for each level-0 value (i.e. x_mm and y_mm from virtual, and z_mm, rx_deg, ry_deg, rz_deg from actual). From what I have read I think I might be able to use pandas IndexSlice, but I'm not entirely sure how to use it in this context.
So far my workaround is to use pd.concat, selecting the two sets of columns independently. I have the feeling that this can be done more neatly with slicing.
You can programmatically generate the tuples to slice your MultiIndex:
from itertools import product

# Each entry pairs a level-0 label with the level-1 columns wanted under it.
cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))

# Expand each pair into explicit (level_0, level_1) column tuples and select them.
out = df[[t for x in cols for t in product(*x)]]
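For reference, the comprehension just expands to the explicit column tuples, so the selection above is equivalent to:
out = df[[('virtual', 'x_mm'), ('virtual', 'y_mm'),
          ('actual', 'z_mm'), ('actual', 'rx_deg'),
          ('actual', 'ry_deg'), ('actual', 'rz_deg')]]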
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The 'ID' column in dataframe DT is a subset of the 'ID' column in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows of MR whose 'ID' values also appear in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe, but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe of combined records for matching IDs, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
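A minimal sketch with made-up data (hypothetical stand-ins for MR and DT, since the real frames are not shown):
import pandas as pd

MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'x': list('abcde')})
DT = pd.DataFrame({'ID': [2, 4], 'y': [10, 20]})

filtered = MR.loc[MR.ID.isin(DT.ID), :]   # rows of MR whose ID also appears in DT
combined = pd.merge(MR, DT, on='ID')      # columns from both, only for matching IDs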
I'm trying to find a less manual, more convenient way to slice a pandas DataFrame based on multiple boolean conditions. To illustrate what I'm after, here is a simplified example:
df = pd.DataFrame({'col1':[True,False,True,False,False,True],'col2':[False,False,True,True,False,False]})
Suppose I am interested in the subset of the DataFrame where both 'col1' and 'col2' are True. I can find this by running:
df[(df['col1']==True) & (df['col2']==True)]
This is manageable enough in a small example like this one, but the real one can have up to a hundred columns, so rather than string together a long boolean expression like the one above, I would rather read the columns of interest into a list, e.g.
['col1','col2']
and select the rows where all of the listed columns are True.
If you need all columns:
df[df.all(axis=1)]
If you have a list of columns:
df[df[COLS].all(axis=1)]
For the opposite, negate the mask:
df[~df.all(axis=1)]
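A quick check with the example frame from the question, with the columns of interest in a list:
COLS = ['col1', 'col2']
df[df[COLS].all(axis=1)]   # keeps only row 2, where both col1 and col2 are True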
I'm working with tables containing a number of image-derived features, with a row of features for every timepoint of an image series, and multiple images. An image identifier is contained in one column. To condense this (primarily to be used in a parallel coordinates plot), I want to reduce/aggregate the columns to a single value. The reduce operation has to be chosen depending on the feature in the column (for example mean, max, or something custom). DataFrame.agg seemed like the natural thing to do, but it gives me a table with multiple rows. Right now I'm doing something like this:
result_df = pd.DataFrame()
for col in df.columns:
    if col in ReduceThisColumnByMean:
        result_df[col] = [df[col].mean()]   # wrap in a list so the frame gets a single row
    elif col in ReduceThisColumnByMax:
        result_df[col] = [df[col].max()]
This seems like a detour to me and might not scale well (not a big concern, as the number of reduce operations will most probably not grow beyond a few). Is there a more pandas-esque way to aggregate multiple columns with specific operations into a single row?
You can select the columns by list, compute the mean and the max, join the results with concat, and finally convert the resulting Series to a one-row DataFrame with Series.to_frame followed by a transpose:
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
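A minimal sketch with hypothetical column names and groups (the real feature table is not shown):
import pandas as pd

df = pd.DataFrame({'intensity_mean': [1.0, 2.0, 3.0],
                   'area_max': [10, 30, 20]})
ReduceThisColumnByMean = ['intensity_mean']
ReduceThisColumnByMax = ['area_max']

result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
# result_df is a single row: intensity_mean = 2.0, area_max = 30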
I have two dataframes with exactly the same index labels and exactly the same column labels, but different values in those tables. The number of rows and columns is also exactly the same. Let's call them df1 and df2.
df1 = {'A':['a1','a2','a3','a4'],'B':['b1','b2','b3','b4'],'C':['c1','c2','c3','c4']}
df2 = {'A':['d1','d2','d3','d4'],'B':['e1','e2','e3','e4'],'C':['f1','f2','f3','f4']}
I want to perform several operations on these matrices, e.g.
Multiplication - create the following matrix:
df2 = {'A':['a1*d1','a2*d2','a3*d3','a4*d4'],'B':['b1*e1','b2*e2','b3*e3','b4*e4'],'C':['c1*f1','c2*f2','c3*f3','c4*f4']}
as well as addition, subtraction, and division using the exact same logic.
Please note that the question is more about generic code that can be replicated, since the matrices I am using have hundreds of rows and columns.
This is pretty trivial to achieve with the pandas library. The data type of the columns is unclear from the OP's question, but if they are numeric, then the code below will run.
Try:
import pandas as pd
pd.DataFrame(df1) * pd.DataFrame(df2)
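Assuming numeric values, as noted above, the other elementwise operations follow the same pattern:
a, b = pd.DataFrame(df1), pd.DataFrame(df2)
a + b   # addition     (or a.add(b))
a - b   # subtraction  (or a.sub(b))
a / b   # division     (or a.div(b))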
If you don't want to import pandas just for this operation, you can do it with the following code:
df1_2 = {key: [x*y for x,y in zip(df1[key],df2[key])] for key in df1.keys()}
NOTE: This works only if the values are numeric. If they are strings (as in the example), use concatenation instead, e.g. x + '*' + y, and replace '*' with your desired operation.
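For the string-valued example in the question, the comprehension would become something like:
df1_2 = {key: [x + '*' + y for x, y in zip(df1[key], df2[key])] for key in df1}
# {'A': ['a1*d1', 'a2*d2', 'a3*d3', 'a4*d4'], 'B': ['b1*e1', ...], 'C': ['c1*f1', ...]}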