Pandas taking values in column order - python

Given this df:
Name  i  j  k
A     1  0  3
B     0  5  4
C     0  0  4
D     0     5
My goal is to add a column "Final" that takes the first non-zero value in the order i, j, k:
Name  i  j  k  Final
A     1  0  3  1
B     0  5  4  5
C     0  0  4  4
D     0     5        <-- this one is tricky: j is null here, and the null does count, so Final ends up null as well.
Here is my attempt: df['Final'] = df[['i', 'j', 'k']].bfill(axis=1).iloc[:, 0]. This doesn't work: bfill only fills over NaN, not zeros, so it always takes the value of column i. Any help would be appreciated. :)
Many thanks!

If by "taking values in column order", you mean "taking the first non-zero value in each row, or zero if all values are zero", you could use DataFrame.lookup after doing a boolean comparison:
In [113]: df["final"] = df.lookup(df.index,(df[["i","j","k"]] != 0).idxmax(axis=1))
In [114]: df
Out[114]:
Name i j k final
0 A 1 0.0 3 1.0
1 B 0 5.0 4 5.0
2 C 0 0.0 4 4.0
3 D 0 NaN 5 NaN
where first we compare everything with zero:
In [115]: df[["i","j","k"]] != 0
Out[115]:
i j k
0 True False True
1 False True True
2 False False True
3 False True True
and then we use idxmax to find the first True (or the first False if you have a row of zeroes):
In [116]: (df[["i","j","k"]] != 0).idxmax(axis=1)
Out[116]:
0 i
1 j
2 k
3 j
dtype: object
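Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on current pandas you need an equivalent. A minimal sketch of the same idea using numpy indexing (reusing df and the column names from above):

import numpy as np

sub = df[["i", "j", "k"]]
# label of the first non-zero (or NaN, since NaN != 0) column per row
first = (sub != 0).idxmax(axis=1)
# translate labels to integer positions and pick one value per row
pos = sub.columns.get_indexer(first)
df["final"] = sub.to_numpy()[np.arange(len(sub)), pos]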

Is this what you need?
df['Final'] = (df[['i', 'j', 'k']].mask((df == '') | (df == 0))  # turn blanks and zeros into NaN
                 .bfill(axis=1).iloc[:, 0][(df != '').all(1)])   # first remaining value; rows with blanks stay NaN
df
Out[1290]:
Name i j k Final
0 A 1 0 3 1.0
1 B 0 5 4 5.0
2 C 0 0 4 4.0
3 D 0 5 NaN

Using pandas.Series.nonzero the solution can be expressed succinctly.
df['Final'] = df.apply(lambda x: x.iloc[x.nonzero()[0][0]], axis=1)
How this works:
nonzero() returns the indices of elements that are not zero (and will match np.nan as well).
We take the first index location and return the value at that location to construct the Final Column.
We apply this on the dataframe using axis=1 to apply it row by row.
A benefit of this approach is that it does not depend on naming the individual columns ['i', 'j', 'k'] (it does assume Name is the index, so that every cell in a row is numeric).
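One caveat: Series.nonzero was deprecated in pandas 0.24 and removed in 1.0. On current pandas the same idea can be written with np.flatnonzero; a sketch under the same Name-as-index assumption:

import numpy as np

# np.flatnonzero returns the positions of truthy values; NaN is truthy,
# so nulls are still picked up just like with Series.nonzero
df['Final'] = df.apply(lambda x: x.iloc[np.flatnonzero(x)[0]], axis=1)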

Related

How do I iteratively select rows in pandas based on column values?

I'm a complete newbie at pandas so a simpler (though maybe not the most efficient or elegant) solution is appreciated. I don't mind a bit of brute force if I can understand the answer better.
If I have the following Dataframe:
A B C
0 0 1
0 1 1
I want to loop through columns "A", "B" and "C" in that order, and during each iteration select all the rows for which the current column is 1 and none of the previous columns are, saving the result and also using it in the next iteration.
So when looking at column A, I wouldn't select anything. Then when looking at column B I would select the second row because B==1 and A==0. Then when looking at column C I would select the first row because A==0 and B==0.
Create a boolean mask: a cell qualifies when it equals 1 and the row-wise cumulative sum at that cell is also 1, which can only happen if every earlier column in the row held 0:
m = (df == 1) & (df.cumsum(axis=1) == 1)
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].sum()}
Output:
>>> m
A B C
0 False False True
1 False True False
2 False False True
>>> d
{'B': [1], 'C': [0, 2]}
I slightly modified your dataframe:
>>> df
A B C
0 0 0 1
1 0 1 1
2 0 0 1
Update
For the expected output on my sample:
for col in df.columns:
    if m[col].sum():
        print(f"\n=== {col} ===")
        print(df[m[col]])
Output:
=== B ===
A B C
1 0 1 1
=== C ===
A B C
0 0 0 1
2 0 0 1
Seems like you need a direct use of idxmax
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
>>> df.idxmax()
A 0
B 1
C 0
dtype: int64
The values above are the indexes for which your constraints are met: 1 for B means that the second row was selected, and 0 for C means the first row was. The only issue is that, if nothing is found, it'll also return 0.
To address that, you can use where
>>> df.idxmax().where(~df.eq(0).all())
This will make sure that NaNs are returned for all-zero columns.
A NaN
B 1.0
C 0.0
dtype: float64
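If you then want a {column: row} mapping rather than a Series, one possible follow-up (my sketch, not part of the answer):

sel = df.idxmax().where(~df.eq(0).all()).dropna().astype(int).to_dict()
# {'B': 1, 'C': 0} -- column A drops out because it is all zeros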

Find begin and end index of consecutive ones in pandas dataframe

I have the following dataframe:
A B C
0 1 1 1
1 0 1 0
2 1 1 1
3 1 0 1
4 1 1 0
5 1 1 0
6 0 1 1
7 0 1 0
of which I want to know the start and end index when the values are 1 for 3 or more consecutive values per column. Desired outcome:
Column From To
A      2    5
B      0    2
B      4    7
First I filter out the runs that are not 3 or more consecutive values:
filtered_df = df.copy().apply(filter, threshold=3)
where
def filter(col, threshold=3):
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    mask &= col.eq(1)
    col.update(col.loc[mask].replace(1, 0))
    return col
filtered_df now look as:
A B C
0 0 1 0
1 0 1 0
2 1 1 0
3 1 0 0
4 1 1 0
5 1 1 0
6 0 1 0
7 0 1 0
If the dataframe had only one column with zeros and ones, the result could be achieved as in How to use pandas to find consecutive same data in time series. However, I am struggling to do something similar for multiple columns at once.
Use DataFrame.pipe to apply a function to the whole DataFrame.
The first solution gets the first and last index of each run of consecutive 1s per column, appends each column's output to a list, and finally concats:
def f(df, threshold=3):
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        g = (df[col] != df[col].shift()).cumsum()[m]
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        output = filt.groupby(col)['index'].agg(['first', 'last'])
        output.insert(0, 'col', col)
        out.append(output)
    return pd.concat(out, ignore_index=True)
Or first reshape with unstack and then apply the same idea:
def f(df, threshold=3):
    df1 = df.unstack().rename_axis(('col', 'idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                     .agg(['first', 'last'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
col first last
0 A 2 5
1 B 0 2
2 B 4 7
filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
col first last
0 A 2 5
1 B 0 2
2 B 4 7
3 C 2 3
You can use rolling to create a window over the data frame. Then you can apply all your conditions and shift the window back to its start location:
length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
mask = mask.shift(1 - length)
print(mask)
which prints:
A B C
0 False True False
1 False False False
2 True False False
3 True False False
4 False True False
5 False True False
6 NaN NaN NaN
7 NaN NaN NaN
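The mask above only flags the starting row of each length-3 window, so a little post-processing is still needed to get the From/To table. A sketch of one way to do it (folding overlapping window starts into runs is my addition, not part of the answer):

import pandas as pd

records = []
starts = mask.fillna(False).astype(bool)
for col in starts.columns:
    s = starts[col]
    block_id = (s != s.shift()).cumsum()  # label consecutive True blocks
    for _, block in s[s].groupby(block_id[s]):
        # a block of window starts from i to j covers rows i .. j+length-1
        records.append({'Column': col,
                        'From': block.index[0],
                        'To': block.index[-1] + length - 1})
print(pd.DataFrame(records))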

pandas groupby apply list from column based on binary column

I have a dataframe:
id to from flag
1 a x 1
1 a y 0
2 c z 1
2 c m 1
2 b v 0
2 b p 0
and I want to groupby(['id', 'to']) and return a list of the elements in from that have a flag 1 only. If no element has a flag 1, then the resulting output should be 'None'. The desired output should be:
id to from
1 a ['x']
2 c ['z','m']
2 b None
I can do it with apply, i.e.
out_df = df.groupby(['id', 'to']).apply(
    lambda x: match_to_list(x['from'], x['flag'])).reset_index()
where:
def match_to_list(to, flag):
    matches = list(to.iloc[flag.nonzero()[0]])
    if len(matches) == 0:
        return 'None'
    else:
        return matches
but this is taking too long and I think there must be a better way that I am missing.
Any help/insights would be very appreciated! TIA
IIUC, first create the MultiIndex, then do groupby with agg:
idx = pd.MultiIndex.from_tuples([tuple(x) for x in df[['id', 'to']].drop_duplicates().values])
yourdf = df.loc[df.flag == 1].groupby(['id', 'to'])['from'].agg(list).reindex(idx).reset_index()
yourdf
Out[13]:
level_0 level_1 from
0 1 a [x]
1 2 c [z, m]
2 2 b NaN
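If you'd rather get the original column names back than level_0/level_1 from reset_index, from_tuples accepts a names argument (a small tweak, not in the answer above):

idx = pd.MultiIndex.from_tuples(
    [tuple(x) for x in df[['id', 'to']].drop_duplicates().values],
    names=['id', 'to'])  # reset_index now yields 'id'/'to' columns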
Or just using apply, less efficient but more readable:
df.groupby(['id', 'to']).apply(
    lambda x: x['from'][x['flag'] == 1].tolist() if (x['flag'] == 1).any() else None
).reset_index()
Out[17]:
id to 0
0 1 a [x]
1 2 b None
2 2 c [z, m]
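Similarly, the apply version leaves the list column named 0; passing name to reset_index fixes that (my tweak, not part of the answer):

out = (df.groupby(['id', 'to'])
         .apply(lambda x: x['from'][x['flag'] == 1].tolist()
                if (x['flag'] == 1).any() else None)
         .reset_index(name='from'))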

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B, columns B and C, or any number of other columns.
I have created code to save the column names (other than A) in a list, as well as the unique permutations of the values in the other columns in another list. For instance, in this example, we have columns B and C saved in col:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
a[a[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this to keep it dynamic, when sometimes there will be multiple columns? Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations in the columns other than A.
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np

def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given
I figured out a solution. Aaron's above works well if I only have two columns. I need a solution that works regardless of the size of the df (as size will be 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
pay attention to the result: isin only tests whether each cell's value appears somewhere in the permutation list, ignoring which column it belongs to, so rows like (2, 0, 0) slip through even though they don't match (0, 2, 2):
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2

How to apply a condition to a large number of columns in a pandas dataframe

I want to eliminate all rows that are equal to certain values (or in a certain range) within a dataframe with a large number of columns. For example, if I had the following dataframe:
a b
0 1 0
1 2 1
2 3 2
3 0 3
and wanted to remove all rows containing 0, I could use:
a_df[(a_df['a'] != 0) & (a_df['b'] !=0)]
but this becomes a pain when you're dealing with a large number of columns. It could be done as:
for i in a_df.columns.values:
    a_df = a_df[a_df[i] != 0]
but this seems inefficient. Is there a better way to do this?
Just do it for the whole df and call dropna:
In [45]:
df[df != 0].dropna()
Out[45]:
a b
1 2 1
2 3 2
The condition df != 0 produces a boolean mask:
In [47]:
df != 0
Out[47]:
a b
0 True False
1 True True
2 True True
3 False True
When this is combined with the df it produces NaN values where the condition is not met:
In [48]:
df[df != 0]
Out[48]:
a b
0 1 NaN
1 2 1
2 3 2
3 NaN 3
Calling dropna then drops any row containing a NaN value.
Here's a variant of EdChum's approach. You could do df != 0 and then use all to make your selector:
>>> (df != 0).all(axis=1)
0 False
1 True
2 True
3 False
dtype: bool
and then use this to select:
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 1
2 3 2
The advantage of this is that it keeps NaNs if you want, e.g.
>>> df
a b
0 1 0
1 2 NaN
2 3 2
3 0 3
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 NaN
2 3 2
>>> df[(df != 0)].dropna()
a b
2 3 2
As you've mentioned in your question, you may also need to drop rows that have a value in a certain set of values; you can do that as follows. Suppose the values are 0, 10, 20:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
mask = frame.applymap(lambda x: x not in [0, 10, 20])
frame[mask.all(axis=1)]
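Two hedged notes on that last snippet: applymap was renamed to DataFrame.map in pandas 2.1, and for a plain membership test isin avoids the Python-level lambda entirely; for an actual numeric range a vectorized comparison works too:

import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# drop rows containing any of the listed values
print(frame[~frame.isin([0, 10, 20]).any(axis=1)])

# drop rows with any value inside the inclusive range [-5, 0]
print(frame[~((frame >= -5) & (frame <= 0)).any(axis=1)])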
