Find column names when a row element meets a criterion - pandas / Python

This is a basic question. I've got a square array with the rows and columns summed up, e.g.:
df = pd.DataFrame([[0,0,1,0], [0,0,1,0], [1,0,0,0], [0,1,0,0]], index = ["a","b","c","d"], columns = ["a","b","c","d"])
df["sumRows"] = df.sum(axis = 1)
df.loc["sumCols"] = df.sum()
This returns:
In [100]: df
Out[100]:
         a  b  c  d  sumRows
a        0  0  1  0        1
b        0  0  1  0        1
c        1  0  0  0        1
d        0  1  0  0        1
sumCols  1  1  2  0        4
I need to find the column labels in the sumCols row whose value is 0. At the moment I am doing this:
[df.loc["sumCols"] == 0].index
But this returns a strange index-type object. All I want is a list of the labels that match this criterion, i.e. ['d'] in this case.

There are two ways (the Index object can be converted to an iterable like a list).
Do it with the columns:
columns = df.columns[df.sum() == 0]
columns = list(columns)
Or you can transpose the DataFrame and treat the columns as rows:
list(df.T[df.T.sumCols == 0].index)
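Both give the same result on the question's DataFrame; a quick check (a minimal sketch reusing the setup from the question):
import pandas as pd

df = pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]],
                  index=["a", "b", "c", "d"], columns=["a", "b", "c", "d"])
df["sumRows"] = df.sum(axis=1)
df.loc["sumCols"] = df.sum()

print(list(df.columns[df.sum() == 0]))      # ['d']
print(list(df.T[df.T.sumCols == 0].index))  # ['d']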

You can use a lambda expression to filter a Series, and if you want a list instead of an Index as the result, you can call .tolist() on the Index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols row, if you want to check which columns sum to zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: the lambda-expression approach is about as fast as vectorized subsetting, reads in a functional style, and fits easily into a one-liner if you prefer.

Here's a simple method using query after transposing:
df.T.query('sumCols == 0').index.tolist()

Related

Matching value with column to retrieve index value

Please see the example dataframe below.
I'm trying to match the values of column X with the column names and retrieve the value from the matched column, so that:
A  B  C  X  result
1  2  3  B       2
5  6  7  A       5
8  9  1  C       1
Any ideas?
Here are a couple of methods:
# Apply method:
df['result'] = df.apply(lambda x: df.loc[x.name, x['X']], axis=1)
# List comprehension method (assumes the default 0..n-1 RangeIndex):
df['result'] = [df.loc[i, x] for i, x in enumerate(df.X)]
# Pure pandas method:
df['result'] = (df.melt('X', ignore_index=False)
                .loc[lambda x: x['X'].eq(x['variable']), 'value'])
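For reference, these one-liners assume a frame like the one in the question; a minimal setup sketch to try them:
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 8],
                   'B': [2, 6, 9],
                   'C': [3, 7, 1],
                   'X': ['B', 'A', 'C']})
# Any of the three assignments above then gives result == [2, 5, 1]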
Here I just build a dataframe from your example and call it df:
data = {
    'A': (1, 5, 8),
    'B': (2, 6, 9),
    'C': (3, 7, 1),
    'X': ('B', 'A', 'C'),
}
df = pd.DataFrame(data)
You can extract the value from another column based on 'X' using the following code. There may be a better way to do this without having to convert to a list first and take the first element.
list(df.loc[df['X'] == 'B', 'B'])[0]
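As a side note, the list round-trip can be avoided by taking .iloc[0] of the selection directly (just an alternative sketch):
df.loc[df['X'] == 'B', 'B'].iloc[0]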
I'm going to create a column called 'result', fill it with 'NA', and then replace the values based on your conditions. The loop below extracts each value and uses .loc to place it in your dataframe.
df['result'] = 'NA'
for idx, val in enumerate(df['X']):
    extracted = list(df.loc[df['X'] == val, val])[0]
    df.loc[idx, 'result'] = extracted
Here it is as a function:
def search_replace(dataframe, search_col='X', new_col_name='result'):
    dataframe[new_col_name] = 'NA'
    for idx, val in enumerate(dataframe[search_col]):
        extracted = list(dataframe.loc[dataframe[search_col] == val, val])[0]
        dataframe.loc[idx, new_col_name] = extracted
    return dataframe
and the output:
>>> search_replace(df)
   A  B  C  X result
0  1  2  3  B      2
1  5  6  7  A      5
2  8  9  1  C      1

Trim last rows of a pandas dataframe based on a condition

Let's assume a dataframe like this:
idx  x  y
  0  a  3
  1  b  2
  2  c  0
  3  d  2
  4  e  5
How can I trim the bottom rows based on a condition, so that any row after the last one matching the condition is removed?
For example, with the condition y == 0 the output would be:
idx  x  y
  0  a  3
  1  b  2
  2  c  0
The condition can occur many times, but the last occurrence is the one that triggers the cut.
Method 1:
Using index.max & iloc:
index.max to get the last row where the condition y == 0 holds
iloc to slice the dataframe up to the index found with df['y'].eq(0)
idx = df.query('y.eq(0)').index.max()+1
# idx = df.query('y==0').index.max()+1 -- if pandas < 0.25
df.iloc[:idx]
Output
   x  y
0  a  3
1  b  2
2  c  0
Method 2:
Using np.where:
import numpy as np
idx = np.where(df['y'].eq(0), df.index, 0).max()+1
df.iloc[:idx]
Output
   x  y
0  a  3
1  b  2
2  c  0
You could do the following. Here np.where(df.y == 0) returns a tuple, so we take the array of matching indexes as its first element; the last occurrence is then the last element of that array, and finally we add 1 so the slice includes the row of the last occurrence:
df_cond = df.iloc[:np.where(df.y == 0)[0][-1]+1, :]
Or you could do:
df_cond = df[ :df.y.eq(0).cumsum().idxmax()+1 ]
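To see why the cumsum/idxmax version works, here are the intermediate values on the example data (a small illustrative sketch):
df.y.eq(0)                    # [False, False, True, False, False]
df.y.eq(0).cumsum()           # [0, 0, 1, 1, 1] -- running count of matches
df.y.eq(0).cumsum().idxmax()  # 2 -- first index where the count reaches its maximum,
                              #      i.e. the last row where y == 0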
Set up your dataframe:
data = [
    ['a', 3],
    ['b', 2],
    ['c', 0],
    ['d', 2],
    ['e', 5],
]
df = pd.DataFrame(data, columns=['x', 'y']).reset_index().rename(columns={'index': 'idx'}).sort_values('idx')
Then find your cutoff (assuming the idx column is already sorted):
cutoff = df[df['y'] == 0].idx.max()
df['y'] == 0 is your condition. Then take the largest idx that meets that condition (the last matching row) and save it as the cutoff.
Finally, create a new dataframe using your cutoff:
df_new = df[df.idx <= cutoff].copy()
Output:
df_new
   idx  x  y
0    0  a  3
1    1  b  2
2    2  c  0
I would do something like this:
df.iloc[:df['y'].eq(0).idxmax()+1]
Just look for the largest index where your condition is true.
EDIT
So the above code won't work, because idxmax() only takes the first index where the value is True. So we can do the following to trick it:
df.iloc[:df['y'].eq(0).sort_index(ascending = False).idxmax()+1]
Flip the index, so the last index is the first index that idxmax picks up.
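A quick illustration of the difference, using a hypothetical series with two zeros:
s = pd.Series([3, 0, 2, 0, 5])
s.eq(0).idxmax()                              # 1 -- first zero
s.eq(0).sort_index(ascending=False).idxmax()  # 3 -- last zero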

Find all duplicate columns in a collection of data frames

Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input are 3 data frames df1, df2 and df3:
df1 = pd.DataFrame({'a': [1, 5], 'b': [3, 9], 'e': [0, 7]})
   a  b  e
0  1  3  0
1  5  9  7
df2 = pd.DataFrame({'d': [2, 3], 'e': [0, 7], 'f': [2, 1]})
   d  e  f
0  2  0  2
1  3  7  1
df3 = pd.DataFrame({'b': [3, 9], 'c': [8, 2], 'e': [0, 7]})
   b  c  e
0  3  8  0
1  9  2  7
The output is the list ['b', 'e'].
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns of a dataframe, so we can use this with map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
Here is my code for this problem, for comparing only two data frames, without concatenating them:
def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1': df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in df1
    for x in range(df1.shape[1]):
        # Select column at xth index.
        col = df1.iloc[:, x]
        # Iterate over all the columns in df2
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select column at yth index.
            otherCol = df2.iloc[:, y]
            # Check if the two columns are equal
            if col.equals(otherCol):
                duplicateColumnNames.append(df2.columns.values[y])
        df_compare.loc[df_compare["df1"] == df1.columns.values[x], "df2"] = str(duplicateColumnNames)
    return df_compare
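Called on the df1 and df2 from the question, this should give something like the following (a sketch of the expected output; only column 'e' is shared between those two frames):
getDuplicateColumns(df1, df2)
#   df1    df2
# 0   a     []
# 1   b     []
# 2   e  ['e']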

Python pandas - get row based on previous row value

I have a big data set where I'm trying to filter only the rows that match certain criteria. More specifically, I want to get every row where Type == 'A' for the nodes whose Type == 'B' row has Value == 2.
So in the following example the result would be row 2: Node-1 A 1.
>>> import pandas as pd
>>> data = [['Node-0', 'A', 1],['Node-0', 'B', 1],['Node-1','A', 1],['Node-1', 'B', 2]]
>>> df = pd.DataFrame(data,columns=['Node','Type','Value'])
>>> print df
     Node Type  Value
0  Node-0    A      1
1  Node-0    B      1
2  Node-1    A      1
3  Node-1    B      2
I can filter the rows using df.loc[df['Type'] == 'A'], but that gives me lines 0 and 2.
IIUC, using some masking with groupby.
m = df.Type.eq('B') & df.Value.eq(2)
df[m.groupby(df.Node).transform('any') & df.Type.eq('A')]
     Node Type  Value
2  Node-1    A      1
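To unpack what the mask does, a step-by-step sketch on the example frame:
m = df.Type.eq('B') & df.Value.eq(2)            # True only for row 3 (Node-1, B, 2)
per_node = m.groupby(df.Node).transform('any')  # broadcast: True for both Node-1 rows
df[per_node & df.Type.eq('A')]                  # keep the Type 'A' rows of those nodes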
I bet there is a better solution, but this should sort it out for the time being:
condition1 = (df['Node'].isin(df.query("Type=='B' & Value==2")['Node']))
#All the 'Node' values whose 'Type' and 'Value' columns have values 'B' and 2
#.isin() filters to rows that match the above criteria
condition2 = (df['Type']=='A')
#all the rows where 'Type' is 'A'
df.loc[condition1&condition2]
#intersection of above conditions
#      Node Type  Value
# 2  Node-1    A      1
Consider the following:
# Get rows matching the first criteria
dd1 = df[(df.Type == 'A') & (df.Value == 1)]
# Get "previous" rows matching the second criteria
df2 = df.shift(-1)
dd2 = df2[(df2.Type == 'B') & (df2.Value == 2)]
# Find the intersection
pd.merge(dd1, dd2, how='inner', on='Node')
Result:
     Node Type_x  Value_x Type_y  Value_y
0  Node-1      A        1      B      2.0
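If only the original columns are wanted, the merge suffixes can be controlled and the extra columns dropped afterwards (a small follow-up sketch):
out = pd.merge(dd1, dd2, how='inner', on='Node', suffixes=('', '_next'))
out[['Node', 'Type', 'Value']]
#      Node Type  Value
# 0  Node-1    A      1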

Contains one of several values

I have a dataframe df with column x and a list lst =["apple","peach","pear"]
df
x
apple234
pear231
banana233445
If a row in df["x"] contains any of the values in lst, then y should be 1, else 0.
Final data should look like this:
df
x y
apple234 -- 1
pear231 -- 1
banana233445 - 0
Use str.contains with regex alternation (|) to join all values of the list, then cast the boolean mask to 0/1 with astype:
lst =["apple","peach","pear"]
df['y'] = df['x'].str.contains('|'.join(lst)).astype(int)
print (df)
              x  y
0      apple234  1
1       pear231  1
2  banana233445  0
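One caveat worth noting: if the list items could contain regex metacharacters such as . or +, escape them before joining, for example:
import re
df['y'] = df['x'].str.contains('|'.join(map(re.escape, lst))).astype(int)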
