list of booleans to slice DataFrame - python

I'm trying to find a less manual, more convenient way to slice a Pandas DataFrame based on multiple boolean conditions. To illustrate what I'm after, here is a simplified example:
df = pd.DataFrame({'col1':[True,False,True,False,False,True],'col2':[False,False,True,True,False,False]})
Suppose I am interested in the subset of the DataFrame where both 'col1' and 'col2' are True. I can find this by running:
df[(df['col1']==True) & (df['col2']==True)]
This is manageable enough in a small example like this one, but the real DataFrame can have up to a hundred columns, so rather than string together a long boolean expression like the one above, I would rather read the columns of interest into a list, e.g.
['col1','col2']
and select the rows where all of the listed columns are True.

If you need all columns to be True:
df[df.all(axis=1)==True]
If you have a list of columns:
df[df[COLS].all(axis=1)==True]
For the opposite, compare against False:
df[df.all(axis=1)==False]
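Putting that together with the question's example, here is a minimal sketch (cols stands in for whatever list of columns you build; the ==True comparison is optional, since all() already returns booleans):
import pandas as pd
df = pd.DataFrame({'col1': [True, False, True, False, False, True],
                   'col2': [False, False, True, True, False, False]})
cols = ['col1', 'col2']                # the columns of interest, read in as a list
both_true = df[df[cols].all(axis=1)]   # rows where every listed column is True
any_false = df[~df[cols].all(axis=1)]  # the opposite: at least one listed column is False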

Related

Filtered pandas dataframe containing boolean version of dataframe

In pandas I have a dataframe (df1) with dimensions X by Y containing values.
Then I have an identical pandas dataframe (df2, same shape as df1) containing True/False values.
I want to return only the elements from df1 where the value at the same location in df2 is True.
What is the fastest way to do this? Is there a way to do this without converting to numpy array?
Without a reproducible example I may be missing a couple of tweaks/details here, but I think you may be able to accomplish this by dataframe multiplication
df1.mul(df2)
This will multiply each element by the corresponding element in the other dataframe, so True keeps the value and False zeroes it out (note that this gives 0 rather than a missing value).
It is also possible to use where
df1.where(df2)
This is similar to df1[df2] and replaces the hidden values with NaN, although you can choose the replacement value via the other argument
A quick benchmark on a 10x10 dataframe suggests that the df.mul approach is ~5 times faster
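A small illustration of the difference between the two approaches, on made-up 2x2 frames rather than the asker's data:
import pandas as pd
df1 = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
df2 = pd.DataFrame({'a': [True, False], 'b': [False, True]})
print(df1.mul(df2))    # True keeps the value, False turns it into 0.0
print(df1[df2])        # True keeps the value, False becomes NaN
print(df1.where(df2))  # same result as df1[df2]; pass other= to choose the fill value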

How to check if only the integer portion of the elements in two pandas data columns match?

I checked the answer here, but it doesn't work for me:
How to get the integer portion of a float column in pandas
This is because I need to write further conditional statements that will perform operations on the exact values in the columns and the corresponding values in other columns.
So basically I am hoping that for my two dataframes df1 and df2 I will form a concatenated dataframe using
dfn_c = pd.concat([df1, df2], axis=1)
then write something like
dfn_cn = dfn_c.loc[df1.X1.isin(df2['X2'])]
where X1 and X2 are the said columns respectively. The above line of course makes an exact comparison whereas I want to compare only the integer portion and then form the new dataframe.
IIUC, try casting to int then compare.
dfn_cn = dfn_c.loc[df1['X1'].astype(int).isin(df2['X2'].astype(int))]
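For example, with two small made-up frames (not the asker's data), the line above keeps only the rows whose X1 shares its integer portion with some value in X2. Note that .astype(int) truncates toward zero and raises if the column contains NaN:
import pandas as pd
df1 = pd.DataFrame({'X1': [1.2, 2.7, 5.5]})
df2 = pd.DataFrame({'X2': [2.1, 3.9]})
dfn_c = pd.concat([df1, df2], axis=1)
dfn_cn = dfn_c.loc[df1['X1'].astype(int).isin(df2['X2'].astype(int))]
print(dfn_cn)  # only the X1 == 2.7 row survives: its integer part, 2, appears in X2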

How do I iterate through two dataframes of different sizes?

Specifically I want to iterate through two dataframes, one being large and one being small.
Ultimately, I would like to compare values within a certain column.
I tried creating a nested for loop, the outer loop iterating through the large dataframe and the inner loop iterating through the small dataframe; however, I am having difficulties.
I'm looking for a way to identify that the "name" and "value" in my large dataframe that matches my small dataframe.
Background info: I am using the pandas library.
Large dataframe: (table not reproduced here)
Small dataframe:
Name   Value
SF     12.84
TH    -49.45
If the goal is to iterate through one, or especially more than one, DataFrame, then explicit for loops are usually the wrong move. In this case, because you're trying to
identify that the "name" and "value" in my large dataframe that matches my small dataframe,
the operation you're looking for is either pd.merge or pd.DataFrame.join, which do the comparisons "under the hood" and return matching information. So, say you have the two DataFrames and they're called large and small. Then:
import pandas as pd
new_large = pd.merge(left=large,
                     right=small,
                     how='left',
                     on=('Name', 'Value'),
                     indicator=True)
new_large['_merge'] = new_large['_merge'].apply(lambda x: 1 if x == 'both' else 0)
By doing a left join between large and small (how='left'), pd.merge keeps every row of large and marks whether it has a match in small on the ('Name', 'Value') pair. Then, most of the heavy lifting is done by the indicator keyword which, quoting the pd.merge version 0.25.0 docs:
If True, adds a column to output DataFrame called "_merge" with
information on the source of each row.
Information column is Categorical-type and takes on a value of "left_only"
for observations whose merge key only appears in 'left' DataFrame,
"right_only" for observations whose merge key only appears in 'right'
DataFrame, and "both" if the observation's merge key is found in both.
So, new_large is the original large DataFrame with a new column called _merge, whose entries are 'both' for rows of large whose ('Name', 'Value') pair is also found in small, and 'left_only' for rows with no such match. The last step converts 'both' and 'left_only' to 1 and 0, as you specified.
Note that a left join always keeps every row of large, so new_large has the same rows as large; rows without a match simply come back as 'left_only'. If small carried extra columns that are not join keys, the unmatched rows would get pd.NaN in those columns and you'd have to employ a few more tricks to get the nice Boolean (integer) column showing what matched and what didn't. HTH.
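To make the merge concrete, here is a sketch using the small frame from the question and a made-up large frame (the 'AB' row and the second Value in small are invented for illustration):
import pandas as pd
large = pd.DataFrame({'Name': ['SF', 'TH', 'AB'],
                      'Value': [12.84, -49.45, 3.50]})
small = pd.DataFrame({'Name': ['SF', 'TH'],
                      'Value': [12.84, -10.00]})
new_large = pd.merge(left=large, right=small, how='left',
                     on=('Name', 'Value'), indicator=True)
new_large['_merge'] = new_large['_merge'].apply(lambda x: 1 if x == 'both' else 0)
print(new_large)  # the SF row gets 1 (Name and Value both match); the TH and AB rows get 0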

Why do I get Pandas data frame with only one column vs Series?

I've noticed single-column data frames a couple of times, much to my chagrin (examples below); but in most other cases a one-column data frame would just be a Series. Is there any rhyme or reason as to why a one-column DF would be returned?
Examples:
1) when indexing columns by a boolean mask where the mask only has one true value:
df = pd.DataFrame([list('abc'), list('def')], columns = ['foo', 'bar', 'tar'])
mask = [False, True, False]
type(df.ix[:,mask])
2) when setting an index on DataFrame that only has two columns to begin with:
df = pd.DataFrame([list('ab'), list('de'), list('fg')], columns = ['foo', 'bar'])
type(df.set_index('foo'))
I feel like if I'm expecting a DF with only one column, I can deal with it by just calling
pd.Series(df.values.ravel(), index=df.index)
But in most other cases a one-column data frame would just be a Series. Is there any rhyme or reason as to why a one column DF would be returned?
In general, a one-column DataFrame will be returned when the operation could return a multicolumn DataFrame. For instance, when you use a boolean column index, a multicolumn DataFrame would have to be returned if there was more than one True value, so a DataFrame will always be returned, even if it has only one column. Likewise when setting an index, if your DataFrame had more than two columns, the result would still have to be a DataFrame after removing one for the index, so it will still be a DataFrame even if it has only one column left.
In contrast, if you do something like df.ix[:,'col'], it returns a Series, because there is no way that passing one column name to select can ever select more than one column.
The idea is that doing an operation should not sometimes return a DataFrame and sometimes a Series based on features specific to the operands (i.e., how many columns they happen to have, how many values are True in your boolean mask). When you do df.set_index('col'), it's simpler if you know that you will always get a DataFrame, without having to worry about how many columns the original happened to have.
Note that there is also the DataFrame method .squeeze() for turning a one-column DataFrame into a Series.
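A small sketch of the behaviour described in this answer, written with .loc since .ix has since been removed from pandas:
import pandas as pd
df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
mask = [False, True, False]
print(type(df.loc[:, mask]))            # DataFrame: the mask could have selected several columns
print(type(df['bar']))                  # Series: a single label can never select more than one column
print(type(df.loc[:, mask].squeeze()))  # Series: squeeze() collapses the one-column DataFrame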

MultiIndexing rows vs. columns in pandas DataFrame

I am working with a multi-indexed dataframe in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like the DataFrame produced by the following code:
import numpy as np
import pandas as pd
arrays = [['condition1', 'condition2'],
          ['patient1', 'patient2'],
          ['measure1', 'measure2', 'measure3']]
colidxs = pd.MultiIndex.from_product(arrays,
                                     names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I chose to multiindex the columns, with the rationale that a pandas DataFrame consists of Series, and my data is ultimately a bunch of time series (hence row-indexed by time here).
I have this question because there seems to be some asymmetry between rows and columns for multiindexing. For example, this documentation page shows how query works for a row-multiindexed dataframe, but if the dataframe is column-multiindexed, the command from the documentation has to be replaced by something like df.T.query('color == "red"').T.
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column-propensity of some common operations for DataFrame:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, it seems multiindexing rows is slightly more convenient.
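As a small, self-contained illustration of that asymmetry (rebuilding the question's frame with MultiIndex.from_product):
import numpy as np
import pandas as pd
cols = pd.MultiIndex.from_product(
    [['condition1', 'condition2'], ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
data = pd.DataFrame(np.random.randn(4, len(cols)),
                    index=pd.Index(range(4), name='time'), columns=cols)
patient1 = data.xs('patient1', level='patient', axis=1)         # xs wants axis=1 for columns
measure2 = data.loc[:, (slice(None), slice(None), 'measure2')]  # loc wants the column slot filled
cond1 = data.T.query('condition == "condition1"').T             # query needs a round trip through .T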
A natural question of mine: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing dataframes, yet one slices columns first while the others slice rows first, which seems a bit odd.
