How to conditionally slice a dataframe in pandas - python

Consider a pandas DataFrame constructed like:
df = pandas.DataFrame({'a':['one','two','three']})
then I can locate the specific row of the dataframe containing 'two' like:
df[df.a == 'two']
but so far the only way I have found to subset the DataFrame up to this row is like:
df[:df[df.a == 'two'].index[0]]
but that is quite ugly, so:
Is there a more appropriate way to accomplish this subsetting?
Specifically, I am interested in how to slice the DataFrame between row indices where a given column matches some arbitrary text string (in this case 'two'). For this particular case it would be equivalent to df[:2]. In general however, the ability to locate an index for the start and/or end of a slice based on column values seems like a reasonable thing?
one last example, maybe will help; I would expect to be able to do something like this:
df[df.a == 'one' : df.a == 'three']
to get a slice containing rows 1 & 2 of the DataFrame, equivalent to df[0:3]

You want to identify the indices for particular start and stop values and get the matching rows plus all the rows in between. One way is to find the indices and build a range, but you already said that you don't like that approach. Here is a general solution using boolean logic that should work for you.
First, let's make a more interesting example:
import pandas as pd
df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})
Suppose start = "two" and stop = "four". That is, you want to get the following output DataFrame:
a
1 two
2 three
3 four
We can find the index of the bounding rows via:
df["a"].isin({start, stop})
#0 False
#1 True
#2 False
#3 True
#4 False
#Name: a, dtype: bool
If the value for index 2 were True, we would be done as we could just use this output as a mask. So let's find a way to create the mask we need.
First we can use cummax() and the boolean XOR operator (^) to achieve:
(df["a"]==start).cummax() ^ (df["a"]==stop).cummax()
#0 False
#1 True
#2 True
#3 False
#4 False
#Name: a, dtype: bool
This is almost what we want, except we are missing the stop value's index. So let's just bitwise OR (|) in the stop condition:
(df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
#0 False
#1 True
#2 True
#3 True
#4 False
#Name: a, dtype: bool
This gets the result we are looking for. So create a mask, and index the dataframe:
mask = (df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
print(df[mask])
# a
#1 two
#2 three
#3 four
We can extend these findings into a function that also supports indexing up to a row or indexing from a row to the end:
def get_rows(df, col, start, stop):
    if start is None:
        # keep everything up to and including the stop value
        mask = ~((df[col] == stop).cummax() ^ (df[col] == stop))
    else:
        mask = (df[col] == start).cummax() ^ (df[col] == stop).cummax() | (df[col] == stop)
    return df[mask]
# get rows between "two" and "four" inclusive
print(get_rows(df=df, col="a", start="two", stop="four"))
# a
#1 two
#2 three
#3 four
# get rows from "two" until the end
print(get_rows(df=df, col="a", start="two", stop=None))
# a
#1 two
#2 three
#3 four
#4 five
# get rows up to "two"
print(get_rows(df=df, col="a", start=None, stop="two"))
# a
#0 one
#1 two
Update:
For completeness, here is the indexing based solution.
def get_rows_indexing(df, col, start, stop):
    # fall back to the first/last row when start/stop is missing (or None)
    min_ind = min(df.index[df[col] == start].tolist() or [0])
    max_ind = max(df.index[df[col] == stop].tolist() or [len(df)])
    return df[min_ind:max_ind + 1]
This function does essentially the same thing as the other version, but it may be easier to understand. Also this is more robust, as the other version relies on None not being a value in the desired column.
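For example, a quick check (a sketch, assuming the df and the function defined above):
print(get_rows_indexing(df=df, col="a", start="two", stop="four"))
# a
#1 two
#2 three
#3 four
print(get_rows_indexing(df=df, col="a", start=None, stop="two"))
# a
#0 one
#1 two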

If you temporarily use column 'a' as the index, then the loc indexer does exactly what you are asking.
df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})
start = 'two'
stop = 'four'
df = df.set_index('a').loc[start:stop].reset_index()
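That leaves df looking like this (a sketch of the expected output, assuming the same data):
print(df)
# a
#0 two
#1 three
#2 four
Note that this relies on the values in column 'a' being unique, since loc slices by label.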

Related

If Condition Based On 2 Columns

Trying to conditionally execute a query, only when ColumnA = 'New' and ColumnB = 'Left' (in each individual row). I know that str.contains() works when I only have 1 condition, however, I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be successfully applied, for my given scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    # do something...
Very basic example of the dataframe:
ColumnA  ColumnB
New      Left
Used     Right
Scrap    Down
New      Right
First row would be the desired row to carry forward (since it meets the criteria).
You have the right idea; however, your code isn't expressing exactly what you want to do.
df1['ColumnA'].str.contains('New')
returns a boolean Series with True/False values at the indices where the condition holds, not a single True/False for whether the entire column contains 'New'. To check whether a value appears anywhere in the column, consider something like the following:
'New' in df1['ColumnA'].values
If you are trying to do it on a row-by-row basis, then you must combine the boolean Series with the bitwise AND operator (&).
This will return a boolean like you expected, hopefully this helps (:
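For example, a minimal row-by-row sketch (column names from the question; the "do something" branch is a placeholder):
mask = df1['ColumnA'].str.contains('New') & df1['ColumnB'].str.contains('Left')
if mask.any():  # True if at least one row meets both conditions
    rows_to_carry_forward = df1[mask]
    # do something with rows_to_carry_forward...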
Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
A B
0 New Left
1 Used Right
2 Scrap Down
3 New Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
A B C
0 New Left True
1 Used Right False
2 Scrap Down False
3 New Right False
>>>

Create a new variable based on 4 other variables

I have a dataframe in Python called df1 with 4 dichotomous variables called Ordering_1, Ordering_2, Ordering_3, Ordering_4 holding True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. Meaning, when Ordering_1 == True, then Clean == Ordering_1; when Ordering_2 == True, then Clean == Ordering_2; and so on. Clean would then be a combination of all the true values from Ordering_1, Ordering_2, Ordering_3, Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone be able to help me do this in Python?
Universal solution if there are multiple Trues per row - filter the columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
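For example, a small sketch with hypothetical data (one or more True per row):
import pandas as pd
df = pd.DataFrame({'Ordering_1': [True, False, False],
                   'Ordering_2': [False, True, False],
                   'Ordering_3': [False, True, True],
                   'Ordering_4': [False, False, False]})
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
print(df['Clean'])
#0 Ordering_1
#1 Ordering_2,Ordering_3
#2 Ordering_3
#Name: Clean, dtype: object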
If there is only one "True" value per row, you can use the booleans of each column "Ordering_1", "Ordering_2", etc. together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can use that boolean Series to filter on the "True" rows, in this case only row 0.
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df.Ordering_1 = True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
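The remaining columns can be handled with a short loop, something like this sketch (later assignments overwrite earlier ones, so it assumes at most one True per row):
df1["clean"] = ""
for col in ["Ordering_1", "Ordering_2", "Ordering_3", "Ordering_4"]:
    # rows where this column is True get that column's name
    df1.loc[df1[col], "clean"] = col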

Get first column value in Pandas DataFrame where row matches condition

Say I have a pandas dataframe that looks like this:
color number
0 red 3
1 blue 4
2 green 2
3 blue 2
I want to get the first value from the number column where the color column has the value 'blue' which in this case would return 4.
I know this can be done using loc in something like this:
df[df['color'] == 'blue']['number'].iloc[0]
I'm wondering if there is any more optimal approach given that I only ever need the first occurrence.
Use head; this will return the first row if the color exists, and an empty Series otherwise.
col = 'blue'
df.query('color == @col').head(1).loc[:, 'number']
1 4
Name: number, dtype: int64
Alternatively, to get a single item, check whether the result is empty via obj.empty:
u = df.query('color == @col').head(1)
if not u.empty:
    print(u.at[u.index[0], 'number'])
# 4
Difference between head and idxmax for invalid color:
df.query('color == "blabla"').head(1).loc[:, 'number']
# Series([], Name: number, dtype: int64)
df.loc[(df['color'] == 'blabla').idxmax(),'number']
# 3
Using idxmax
df.loc[(df['color'] == 'blue').idxmax(),'number']
Out[698]: 4
Using iloc with np.where:
idx = next(iter(df['number'].iloc[np.where(df['color'].eq('blue'))]), -1) # 4
Note this also handles the case where the colour does not exist. In comparison, df['color'].eq('orange').idxmax() gives 0 even though 'orange' does not exist in the series. The above logic will give -1.
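If you need this in more than one place, the pattern can be wrapped in a small helper (a sketch; the function name and the default value are my own):
import numpy as np

def first_number_for_color(df, color, default=None):
    # positions (not labels) of the rows where the color matches
    pos = np.where(df['color'].eq(color))[0]
    # first match if there is one, otherwise the default
    return df['number'].iloc[pos[0]] if len(pos) else default

first_number_for_color(df, 'blue')    # 4
first_number_for_color(df, 'orange')  # None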
numba
"I'm wondering if there is any more optimal approach given that I only ever need the first occurrence."
Yes! For a more efficient solution, see Efficiently return the index of the first value satisfying condition in array. Numba allows you to iterate row-wise efficiently. In this case, you will need to factorize your strings first so that you feed numeric arrays only to Numba:
import numpy as np
from numba import njit

# factorize the series; pd.factorize maintains order,
# i.e. the first item in values gets code 0
idx, values = pd.factorize(df['color'])
idx_search = np.where(values == 'blue')[0][0]  # numeric code for 'blue'

@njit
def get_first_index_nb(A, k):
    # return the position of the first element equal to k, or -1 if none
    for i in range(len(A)):
        if A[i] == k:
            return i
    return -1

res = df['number'].iat[get_first_index_nb(idx, idx_search)]  # 4
Of course, for a one-off calculation, this is inefficient. But for successive calculations, the solution will likely be a factor faster than solutions which check for equality across the entire series / array.

pandas lookup with long and nested conditions

I want to perform a lookup in a dataframe via pandas, but it will be created by a series of nested if/else statements, similar to what is outlined in Pandas dataframe add a field based on multiple if statements.
But I want to use up to 13 different variables, which seems likely to descend into chaos pretty quickly. Is there some notation or other nice feature which allows me to specify such long and nested conditions in pandas?
So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html might be my best option.
Is there a shortcut if I would only match for equality in all conditions?
Am I forced to write out each conditional filter? Or could I just have a single expression that produces the (single) lookup value?
Edit: Ideally I would not want to match
df.loc[df['column_name'] == some_value]
for each value, i.e. 13 * the number of categorical levels (let's assume 7) would be a lot of different values; especially if combinations of conditions like df.loc[df['first'] == some_value][df['second'] == otherValue1] occur, i.e. they are all chained.
edit
A minimal example
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
'first2DigitsOfPostcode': ['12', '23', '12', '12'],
'valueOfProduct': ['low', 'medium', 'high', 'low'],
'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
This defines the lookup table, which was generated by a SQL query grouping by all the columns and aggregating the values (so, due to the Cartesian product, all value combinations should be represented in the lookup table).
A new record could look like
new_values = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct': ['low']})
How can I automate the lookup over all the conditions, assuming all conditions require a match by equality (if this makes it easier)?
I found
pd.lookup Vectorized look-up of values in Pandas dataframe which seems to work for a single column / condition
maybe a merge could be a solution? Python Pandas: DataFrame as a Lookup Table, but that does not really produce the desired lookup result.
edit 2
The second answer seems to be pretty interesting. But
mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])
will unfortunately just return NaN for the lookup value.
Now that I better know what you're after, a dataframe merge is likely a much better choice:
IN: df.merge(new_values, how='inner')
OUT: ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0 1 12 foo low
1 1 12 baz low
Certainly shorter than the other answer I gave! I'll leave the old one though in case it inspires someone else.
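If you only need the looked-up value(s) rather than the whole merged frame, a slight variation works (a sketch, assuming the df and new_values from the question; how='left' keeps a NaN row for records with no match):
result = new_values.merge(df, how='left')
print(result['lookup_join_value'].tolist())
# ['foo', 'baz']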
I think df.isin() is along the lines of what you're looking for.
Using your example df, and these two:
exists = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct' : 'low'})
new = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct' : 'high'})
Then you can check to see what values match, if all, or just some:
df.isin(exists.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct
0 True True True
1 False False False
2 False True False
3 True True True
df.isin(new.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct
0 True True False
1 False False False
2 False True True
3 True True False
If your "query" wasn't a dataframe but instead a list it wouldn't need the ".values[0]" bit. The problem with a dictionary is it tries to match the index as well.
It's not clear to me from your question exactly what you want returned, but you could then subset based on whether all (or some) of the rows are the same:
# Returns matching rows
df[df.isin(exists.values[0]).all(axis=1)]
# Returns rows where the first two columns match
matches = df.isin(new.values[0]).values
df[[item==[True,True,False] for item in matches.tolist()]]
...There might be a smarter way to write that last one.
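One slightly smarter way to write that last one: restrict isin to the columns you actually want to match and reduce with all(axis=1) (a sketch, assuming the df and new defined above):
cols = ['ageGroup', 'first2DigitsOfPostcode']
mask = df[cols].isin(new[cols].values[0]).all(axis=1)
df[mask]
# returns rows 0 and 3, where the first two columns match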
df.query is an option if you can write the query as an expression using the column names:
so you can do:
query_string = 'some long (but valid) boolean query'
example from pandas:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
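For the "only match for equality in all conditions" shortcut, the query string can be built programmatically from a dict of values (a sketch; repr handles the quoting of strings versus numbers):
criteria = {'ageGroup': 1, 'first2DigitsOfPostcode': '12', 'valueOfProduct': 'low'}
query_string = ' and '.join('{} == {!r}'.format(col, val) for col, val in criteria.items())
# "ageGroup == 1 and first2DigitsOfPostcode == '12' and valueOfProduct == 'low'"
df.query(query_string)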

Python pandas, how can I pass a colon ":" to an indexer through a variable

I am working through a method that ultimately will be working with data slices from a large multi-index pandas dataframe. I can generate masks to use for each indexer (essentially lists of values to define the slice):
df.loc[idx[a_mask,b_mask],idx[c_mask,d_mask]]
This would be fine but in some scenarios I'd really like to select everything along some of those axes, something equivalent to:
df.loc[idx[a_mask,b_mask],idx[:,d_mask]]
Is there a way for me to pass that colon ":" that replaces the c_mask in the second example in as a variable? Ideally I'd just set the c_mask to a value like ":", but of course that doesn't work (and shouldn't because what if we had a column named that...). But is there any way to pass in a value by variable that communicates "whole axis" along one of those indexers?
I do realize I could generate a mask that would select everything by gathering together all the values along the appropriate axis, but this is nontrivial and adds a lot of code. Likewise I could break the dataframe access into 5 scenarios (one each for having a : in it and one with four masks) but that doesn't seem to honor the DRY principle and is still brittle because it can't handle multiple direction whole slice selection.
So, is there anything I can pass in through a variable that will select an entire direction in an indexer, like a : would? Or is there a more elegant way to optionally select an entire direction?
idx[slice(None)] is equivalent to idx[:]
So these are all equivalent.
In [11]: df = DataFrame({'A' : np.random.randn(9)},index=pd.MultiIndex.from_product([range(3),list('abc')],names=['first','second']))
In [12]: df
Out[12]:
A
first second
0 a -0.668344
b -1.679159
c 0.061876
1 a -0.237272
b 0.136495
c -1.296027
2 a 0.554533
b 0.433941
c -0.014107
In [13]: idx = pd.IndexSlice
In [14]: df.loc[idx[:,'b'],]
Out[14]:
A
first second
0 b -1.679159
1 b 0.136495
2 b 0.433941
In [15]: df.loc[idx[slice(None),'b'],]
Out[15]:
A
first second
0 b -1.679159
1 b 0.136495
2 b 0.433941
In [16]: df.loc[(slice(None),'b'),]
Out[16]:
A
first second
0 b -1.679159
1 b 0.136495
2 b 0.433941
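So, to answer the "pass it through a variable" part directly: store slice(None) in the mask variable (a sketch, reusing the hypothetical mask names from the question):
c_mask = slice(None)  # selects everything along that axis/level
df.loc[idx[a_mask, b_mask], idx[c_mask, d_mask]]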
