Pandas: error when checking for a binary flag pattern [duplicate] - python

This question already has an answer here:
Bitwise operations in Pandas that return numbers rather than bools?
(1 answer)
Closed 1 year ago.
I have a dataframe where one of the columns of type int is storing a binary flag pattern:
import pandas as pd
df = pd.DataFrame({'flag': [1, 2, 4, 5, 7, 3, 9, 11]})
I tried selecting rows whose value matches the flag 4 the way it is typically done (with the bitwise AND operator):
df[df['flag'] & 4]
But it failed with:
KeyError: "None of [Int64Index([0, 0, 4, 4, 4, 0, 0, 0], dtype='int64')] are in the [columns]"
How do I actually select rows matching a binary pattern?

The bitwise AND itself works as you’d expect:
>>> df['flag'] & 4
0 0
1 0
2 4
3 4
4 4
5 0
6 0
7 0
Name: flag, dtype: int64
However, if you pass this to df.loc[] you are asking for the index labels 0 and 4 over and over, and if you use df[] directly you are asking for a column whose header is that whole Int64Index[...] — hence the KeyError.
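For illustration (a quick sketch of my own, not part of the original answer), passing that integer result straight to df.loc[] really does fetch the rows labelled 0 and 4 repeatedly:
>>> df.loc[df['flag'] & 4]
flag
0 1
0 1
4 7
4 7
4 7
0 1
0 1
0 1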
Instead, you should force the conversion to a boolean indexer:
>>> (df['flag'] & 4) != 0
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 False
Name: flag, dtype: bool
>>> df[(df['flag'] & 4) != 0]
flag
2 4
3 5
4 7

Even though & and | are commonly used in pandas as logical operators to combine conditions, they are really bitwise operators: applied to an integer Series they return numbers, not a Series of booleans.
Knowing that, you can use any of the following approaches to select rows based on a binary pattern.
Since the result of <int> & <FLAG> is either 0 or <FLAG> (for a single-bit flag), you can use:
df[df['flag'] & 4 == 4]
which (due to the precedence of operators) evaluates as:
df[(df['flag'] & 4) == 4]
Alternatively, you can use apply and map the result directly to a bool:
df[df['flag'].apply(lambda v: bool(v & 4))]
but this looks rather cumbersome and is likely to be much slower.
In either case, the result is as expected:
flag
2 4
3 5
4 7
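Not mentioned above, but as another variant (my addition) you can cast the bitwise result to bool and index with that directly; it gives the same rows:
>>> df[(df['flag'] & 4).astype(bool)]
flag
2 4
3 5
4 7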

Related

Count how many consecutive TRUEs on each row in a dataframe

I am trying to count how many consecutive TRUEs there are in each row, and I solved that part myself, but I need a solution for this part: if a row starts with FALSE, then the result must be 0. There is a sample dataset below. Can you recommend tips on how to solve this?
PS. my original question is at the link below.
how to find number of consecutive decreases(increases)
Sample data, .csv file
idx,Expected Results,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1002,3,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,4,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1006,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1007,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1008,1,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE
1010,1,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE
1013,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1014,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1017,2,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
After John's solution:
How can I count the Trues until I see the first False?
result = df.where(df[0], 0)
idx,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,0,0,0,0,0,0,0,0,0,0,0
1002,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,0,0,0,0,0,0,0,0,0,0,0
1006,0,0,0,0,0,0,0,0,0,0,0,0
1007,0,0,0,0,0,0,0,0,0,0,0,0
1008,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,0,0,0,0,0,0,0,0,0,0,0
1010,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,0,0,0,0,0,0,0,0,0,0,0
1013,0,0,0,0,0,0,0,0,0,0,0,0
1014,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,0,0,0,0,0,0,0,0,0,0,0
1017,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,0,0,0,0,0,0,0,0,0,0,0
You can use np.argmin. You needn't prefilter your df, it will handle rows starting with False correctly.
df.loc[:, 'M_1':'M_12'].values.argmin(1)
#array([0, 3, 1, 4, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0])
Note that this assumes there is at least one False in every row.
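If some rows could be all True (so there is no False for argmin to find), a guarded variant, my addition assuming the M_ columns are already boolean, falls back to the full row length:
import numpy as np
vals = df.loc[:, 'M_1':'M_12'].values
np.where(vals.all(axis=1), vals.shape[1], vals.argmin(axis=1))   # row length when no False exists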
Another option is a cumulative logical AND along each row, which keeps only the leading run of Trues, followed by a row-wise sum:
df.loc[:, 'M_1':'M_12'].apply(np.logical_and.accumulate, axis=1).sum(axis=1)
Another approach: invert the values of columns M_1 to M_12 with negation ~ (True becomes False and vice versa), apply cummax along each row so that everything from the first original False onward becomes True, negate again, and finally sum:
(~(~df.drop(columns=['idx'])).cummax(axis=1)).sum(axis=1)
Out[503]:
0 0
1 3
2 1
3 4
4 0
5 0
6 0
7 1
8 0
9 1
10 0
11 0
12 1
13 1
14 0
15 2
16 0
dtype: int64

Python Data Frame: cumulative sum of column until condition is reached and return the index

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached. Register the index at which the condition is achieved and get the correspondent value from Num_authors.
Example:
cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):
10 = 15±2? No, then continue;
10+1 =15±2? No, then continue
10+1+4 = 15±2? Yes, stop.
Condition reached at index 2. Then get Num_Authors at that index: Num_Authors(2)=4
I would like to see if there's a function already implemented in pandas, before I start thinking how to do it with a while/for loop....
I would also like to specify the column from which to retrieve the value at the relevant index (this comes in handy when I have, e.g., 4 columns and I want to sum the elements in column 1; once the condition is reached, get the corresponding value from column 2, then do the same with columns 3 and 4).
Opt - 1:
You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.
Through np.flatnonzero, return the ordinal values of the indices for which the True condition holds. We select the first instance of a True value.
Finally, use .iloc to retrieve value of the column name you require based on the index computed earlier.
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val] # for faster access, use .iat
4
Performing np.isclose on the cumsum series converted to an array gives:
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
Opt - 2:
Use pd.Index.get_loc on the computed cumsum series; with the 'nearest' method it also supports a tolerance parameter.
val = pd.Index(df.Num_Albums.cumsum()).get_loc(15, 'nearest', tolerance=2)
df.get_value(val, 'Num_authors')
4
Opt - 3:
Use idxmax to find the first index of a True value for the boolean mask created after sub and abs operations on the cumsum series:
df.get_value(df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors')
4
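For the multi-column wish at the end of the question, here is a small helper, just a sketch (the name value_at_threshold and the fixed tolerance rule are my assumptions), that wraps Opt - 1 so you can choose both the column being summed and the column being read:
import numpy as np

def value_at_threshold(df, sum_col, value_col, target, tol):
    # cumulative sum of sum_col, then the first position within target +/- tol
    hits = np.flatnonzero(np.isclose(df[sum_col].cumsum().values, target, atol=tol))
    return df[value_col].iloc[hits[0]]

value_at_threshold(df, 'Num_Albums', 'Num_authors', 15, 2)   # -> 4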
I think you can directly add a column with the cumulative sum as:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
And then apply the condition you want to the cumsum column. For instance, you can use where to get the full row matching the filter, setting the tolerance tol:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
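From there, pulling out the single Num_authors value is one more step (assuming the first matching row is the one wanted; note that where/dropna leaves float dtypes):
cond['Num_authors'].iloc[0]   # -> 4.0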
This could even be done with the following code:
def your_function(df):
    sum = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        sum += i
        index += 1
        if sum == ("your_condition"):
            return (index, df.loc[df.Num_Albums == i, 'Num_authors'])
This would actually return a tuple of your index and the corresponding value of Num_authors as soon as the "your condition" is reached.
or it could even be returned as an array by:
def your_function(df):
    sum = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        sum += i
        index += 1
        if sum == ("your_condition"):
            return df.loc[df.Num_Albums == i, 'Num_authors'].index.values
I was not able to figure out the exact cumulative-sum condition for when to stop summing, so I left it as "your_condition" in the code. I am also new, so I hope this helps!

Get index of a row of a pandas dataframe as an integer

Assume an easy dataframe, for example
A B
0 1 0.810743
1 2 0.595866
2 3 0.154888
3 4 0.472721
4 5 0.894525
5 6 0.978174
6 7 0.859449
7 8 0.541247
8 9 0.232302
9 10 0.276566
How can I retrieve an index value of a row, given a condition?
For example:
dfb = df[df['A']==5].index.values.astype(int)
returns [4], but what I would like to get is just 4. This is causing me trouble later in the code.
Based on some conditions, I want to have a record of the indexes where that condition is fulfilled, and then select rows between.
I tried
dfb = df[df['A']==5].index.values.astype(int)
dfbb = df[df['A']==8].index.values.astype(int)
df.loc[dfb:dfbb,'B']
for a desired output
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
but I get TypeError: '[4]' is an invalid key
The easiest fix is to add [0] to select the first value of the one-element array:
dfb = df[df['A']==5].index.values.astype(int)[0]
dfbb = df[df['A']==8].index.values.astype(int)[0]
Or convert the first index value to int directly:
dfb = int(df[df['A']==5].index[0])
dfbb = int(df[df['A']==8].index[0])
But if it is possible that some values do not match, an error is raised, because the first value does not exist.
The solution is to use next with iter to get a default value if nothing matches:
dfb = next(iter(df[df['A']==5].index), 'no match')
print (dfb)
4
dfb = next(iter(df[df['A']==50].index), 'no match')
print (dfb)
no match
Then it seems you need to subtract 1:
print (df.loc[dfb:dfbb-1,'B'])
4 0.894525
5 0.978174
6 0.859449
Name: B, dtype: float64
Another solution with boolean indexing or query:
print (df[(df['A'] >= 5) & (df['A'] < 8)])
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
print (df.loc[(df['A'] >= 5) & (df['A'] < 8), 'B'])
4 0.894525
5 0.978174
6 0.859449
Name: B, dtype: float64
print (df.query('A >= 5 and A < 8'))
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
To answer the original question on how to get the index as an integer for the desired selection, the following will work :
df[df['A']==5].index.item()
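One caveat (my addition, not in the original answer): .item() insists on exactly one matching row and raises a ValueError otherwise, so guard it if the condition might match several rows or none:
idx = df[df['A'] == 5].index
value = idx.item() if len(idx) == 1 else None   # .item() raises unless there is exactly one match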
A little summary for searching by row:
This can be useful if you don't know the column values or if the columns hold non-numeric values.
If you want to get the index number as an integer, you can also do:
item = df[4:5].index.item()
print(item)
4
It also works with numpy arrays and lists:
numpy = df[4:7].index.to_numpy()[0]
lista = df[4:7].index.to_list()[0]
In [x] you pick the position within the slice [4:7]; for example, if you want 6:
numpy = df[4:7].index.to_numpy()[2]
print(numpy)
6
for DataFrame:
df[4:7]
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
or:
df[(df.index>=4) & (df.index<7)]
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
The nature of wanting to include the row where A == 5 and all rows up to but not including the row where A == 8 means we will end up using iloc (loc includes both ends of a slice).
In order to get the index labels we use idxmax. This returns the index label of the first occurrence of the maximum value. I run this on a boolean series where A == 5 (then where A == 8), which returns the index value of where A == 5 first happens (same for A == 8).
Then I use searchsorted to find the ordinal position of where the index label (that I found above) occurs. This is what I use in iloc.
i5, i8 = df.index.searchsorted([df.A.eq(5).idxmax(), df.A.eq(8).idxmax()])
df.iloc[i5:i8]
numpy
You can further enhance this by using the underlying numpy objects and the analogous numpy functions. I wrapped it up into a handy function.
def find_between(df, col, v1, v2):
    vals = df[col].values
    mx1, mx2 = (vals == v1).argmax(), (vals == v2).argmax()
    idx = df.index.values
    i1, i2 = idx.searchsorted([mx1, mx2])
    return df.iloc[i1:i2]
find_between(df, 'A', 5, 8)
Or you can add a for loop:
for i in dfb:
    dfb = i
for j in dfbb:
    dfbb = j
This way the element 4 is taken out of the list.

Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one
import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])
>> df
A B C
0 1 2 1
1 1 3 2
2 4 6 3
3 4 3 4
4 5 4 5
The original table is more complicated with more columns and rows.
I want to get the first row that fulfils some criteria. Examples:
Get first row where A > 3 (returns row 2)
Get first row where A > 4 AND B > 3 (returns row 4)
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
But if there isn't any row that fulfils the specific criteria, then I want to get the first one after sorting descending by A (or in other cases by B, C, etc.)
Get first row where A > 6 (returns row 4 by ordering it by A desc and get the first one)
I was able to do it by iterating over the dataframe (I know that's crap :P), so I would prefer a more pythonic way to solve it.
This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:
>>> df[condition]
This will return a slice of your dataframe which you can index using iloc. Here are your examples:
Get first row where A > 3 (returns row 2)
>>> df[df.A > 3].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].
Get first row where A > 4 AND B > 3:
>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A 5
B 4
C 5
Name: 4, dtype: int64
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:
>>> def series_or_default(X, condition, default_col, ascending=False):
... sliced = X[condition]
... if sliced.shape[0] == 0:
... return X.sort_values(default_col, ascending=ascending).iloc[0]
... return sliced.iloc[0]
>>>
>>> series_or_default(df, df.A > 6, 'A')
A 5
B 4
C 5
Name: 4, dtype: int64
As expected, it returns row 4.
For existing matches, use query:
df.query(' A > 3' ).head(1)
Out[33]:
A B C
2 4 6 3
df.query(' A > 4 and B > 3' ).head(1)
Out[34]:
A B C
4 5 4 5
df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]:
A B C
2 4 6 3
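The query approach alone does not cover the fallback case; a small sketch (my addition) using .empty handles it:
first = df.query('A > 6').head(1)
if first.empty:   # nothing matched, fall back to the largest A
    first = df.sort_values('A', ascending=False).head(1)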
You can take care of the first 3 items with boolean indexing and head:
df[df.A > 3].head(1)
df[(df.A > 4) & (df.B > 3)].head(1)
df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].head(1)
The case where nothing comes back can be handled with a try or an if:
try:
    output = df[df.A >= 6].head(1)
    assert len(output) == 1
except AssertionError:
    output = df.sort_values('A', ascending=False).head(1)
For the point 'return the value as soon as you find the first row/record that meets the requirements and do NOT iterate over other rows', the following code would work:
def pd_iter_func(df):
    for row in df.itertuples():
        # Define your criteria here
        if row.A > 4 and row.B > 3:
            return row
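Called on the sample frame this returns the namedtuple for row 4 (shown as a comment; the exact repr can vary slightly between pandas versions):
pd_iter_func(df)   # Pandas(Index=4, A=5, B=4, C=5)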
It is more efficient than Boolean Indexing when it comes to a large dataframe.
To make the function above more flexible, one can accept the criteria as a lambda function:
from typing import Callable, NamedTuple, Optional
from pandas import DataFrame

def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
    for row in df.itertuples():
        if criteria(row):
            return row
pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)
As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.
def pd_idxmax_func(df, mask):
    return df.loc[mask.idxmax()]
pd_idxmax_func(df, (df.A > 4) & (df.B > 3))
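One caveat worth adding (my note, not from the original answer): idxmax on an all-False mask still returns the first index label, so with no matching rows you would silently get row 0; a guard on mask.any() avoids that:
def pd_idxmax_func(df, mask):
    if not mask.any():   # no row satisfies the criteria
        return None
    return df.loc[mask.idxmax()]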

Pandas : determine mapping from unique rows to original dataframe

Given the following inputs:
In [18]: input
Out[18]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
2 1 5 9 1
3 1 5 9 1
In [26]: df = input.drop_duplicates()
Out[26]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
How would I go about getting an array that has the indices of the rows from the subset that are equivalent, eg:
resultant = [0, 1, 0, 0]
I.e. the '1' here is basically stating that (row[1] in input) == (row[1] in df). Since there are fewer unique rows, multiple values in 'resultant' will map to the same row in df, i.e. (row[k] in input) == (row[k+N] in input) == (row[1] in df) could be the case.
I am looking for the actual row-number mapping from input to df.
While this example is trivial, in my case I have a ton of dropped rows that might map to one index.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.
One way would be to treat it as a groupby on all columns:
>>> input.groupby(list(input.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = input.sort_values(list(input.columns))
>>> eqs = (ds != ds.shift()).all(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64
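If all you want is the group-id array itself, a sketch using groupby(...).ngroup() (my suggestion, not part of the answers above) maps each row of the original frame straight to the position of its unique row:
>>> input.groupby(list(input.columns), sort=False).ngroup()
0 0
1 1
2 0
3 0
dtype: int64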
Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[df_slimmed.columns]):
            return index

def rows_mappings(df, df_slimmed):
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))
This is if you are interested in generating the resultant list in your example; I don't quite follow the latter part of your reasoning.
