Python pandas - get row based on previous row value

I have a big data set where I'm trying to keep only the rows that match certain criteria. More specifically, I want all rows where Type == 'A', but only for nodes whose Type == 'B' row has Value == 2.
So in the following example it would result in the row 2 Node-1 A 1
>>> import pandas as pd
>>> data = [['Node-0', 'A', 1],['Node-0', 'B', 1],['Node-1','A', 1],['Node-1', 'B', 2]]
>>> df = pd.DataFrame(data,columns=['Node','Type','Value'])
>>> print(df)
Node Type Value
0 Node-0 A 1
1 Node-0 B 1
2 Node-1 A 1
3 Node-1 B 2
I can filter the rows using df.loc[df['Type'] == 'A'], but that gives me rows 0 and 2.

IIUC, using some masking with groupby.
m = df.Type.eq('B') & df.Value.eq(2)
df[m.groupby(df.Node).transform('any') & df.Type.eq('A')]
Node Type Value
2 Node-1 A 1
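To make the masking approach above easier to try, here is a self-contained sketch. The helper name a_rows_with_matching_b and its b_value parameter are made up for illustration; the logic is the same groupby/transform idea and assumes that the A and B rows of one node share the same Node label.

```python
import pandas as pd

df = pd.DataFrame(
    [['Node-0', 'A', 1], ['Node-0', 'B', 1], ['Node-1', 'A', 1], ['Node-1', 'B', 2]],
    columns=['Node', 'Type', 'Value'],
)

def a_rows_with_matching_b(frame, b_value=2):
    """Return the Type == 'A' rows whose Node also has a Type == 'B' row with b_value."""
    # True only on the B rows that carry the wanted value
    b_hit = frame['Type'].eq('B') & frame['Value'].eq(b_value)
    # Broadcast that hit to every row of the same Node
    node_has_hit = b_hit.groupby(frame['Node']).transform('any')
    return frame[node_has_hit & frame['Type'].eq('A')]

result = a_rows_with_matching_b(df)
print(result)
```

Because the match is made per Node, this does not depend on the B row sitting directly after the A row.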

I bet there is a better solution, but this should sort it out for the time being:
condition1 = (df['Node'].isin(df.query("Type=='B' & Value==2")['Node']))
#All the 'Node' values whose 'Type' and 'Value' columns have values 'B' and 2
#.isin() filters to rows that match the above criteria
condition2 = (df['Type']=='A')
#all the rows where 'Type' is 'A'
df.loc[condition1&condition2]
#intersection of above conditions
# Node Type Value
#2 Node-1 A 1

Consider the following:
# Get rows matching the first criteria
dd1 = df[(df.Type == 'A') & (df.Value == 1)]
# shift(-1) moves every row up by one, so each row of df2 holds the values of the row that follows it in df
df2 = df.shift(-1)
dd2 = df2[(df2.Type == 'B') & (df2.Value == 2)]
# Find the intersection
pd.merge(dd1, dd2, how='inner', on='Node')
Result:
Node Type_x Value_x Type_y Value_y
0 Node-1 A 1 B 2.0
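If the B row is guaranteed to sit directly after its A row, the shift idea can also be applied per Node, so that rows from different nodes are never paired with each other. The following is only a sketch under that ordering assumption, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['Node-0', 'A', 1], ['Node-0', 'B', 1], ['Node-1', 'A', 1], ['Node-1', 'B', 2]],
    columns=['Node', 'Type', 'Value'],
)

# For every row, fetch Type and Value of the following row within the same Node.
# The last row of each Node has no successor and is filled with NaN.
following = df.groupby('Node')[['Type', 'Value']].shift(-1)

# Keep the A rows whose immediate successor is a B row with Value 2
mask = df['Type'].eq('A') & following['Type'].eq('B') & following['Value'].eq(2)
result = df[mask]
print(result)
```

Unlike the merge above, this keeps the original columns and index of df.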

Related

Matching value with column to retrieve index value

Please see the example dataframe below.
I'm trying to match the values in column X against the column names and retrieve the value from the matched column,
so that:
A B C X result
1 2 3 B 2
5 6 7 A 5
8 9 1 C 1
Any ideas?
Here are a couple of methods:
# Apply Method:
df['result'] = df.apply(lambda x: df.loc[x.name, x['X']], axis=1)
# List comprehension Method:
df['result'] = [df.loc[i, x] for i, x in enumerate(df.X)]
# Pure Pandas Method:
df['result'] = (df.melt('X', ignore_index=False)
.loc[lambda x: x['X'].eq(x['variable']), 'value'])
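For larger frames, a NumPy-based lookup may be worth considering. The sketch below uses the factorize/reindex pattern that, as far as I know, the pandas documentation suggests as a replacement for the old DataFrame.lookup. It assumes every value in X is an existing column name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': (1, 5, 8), 'B': (2, 6, 9), 'C': (3, 7, 1), 'X': ('B', 'A', 'C')})

# codes[i] is the position of row i's target column inside `cols`
codes, cols = pd.factorize(df['X'])
# Reorder the value columns to match `cols`, then pick one cell per row
values = df.reindex(cols, axis=1).to_numpy()
df['result'] = values[np.arange(len(df)), codes]
print(df)
```

This avoids a Python-level loop over the rows, which is where apply and the list comprehension spend most of their time.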
Here I just build a dataframe from your example and call it df
import pandas as pd
# named `data` rather than `dict` to avoid shadowing the built-in
data = {
    'A': (1, 5, 8),
    'B': (2, 6, 9),
    'C': (3, 7, 1),
    'X': ('B', 'A', 'C')}
df = pd.DataFrame(data)
You can extract the value from another column based on 'X' using the following code. There may be a better way to do this without having to convert first to list and retrieving the first element.
list(df.loc[df['X'] == 'B', 'B'])[0]
I'm going to create a column called 'result' and fill it with 'NA' and then replace the value based on your conditions. The loop below, extracts the value and uses .loc to replace it in your dataframe.
df['result'] = 'NA'
for idx, val in enumerate(df['X']):
    # first row whose X equals val; this assumes the values in X are unique
    extracted = list(df.loc[df['X'] == val, val])[0]
    df.loc[idx, 'result'] = extracted
Here it is as a function:
def search_replace(dataframe, search_col='X', new_col_name='result'):
    dataframe[new_col_name] = 'NA'
    for idx, val in enumerate(dataframe[search_col]):
        # first row whose search_col equals val; assumes those values are unique
        extracted = list(dataframe.loc[dataframe[search_col] == val, val])[0]
        dataframe.loc[idx, new_col_name] = extracted
    return dataframe
and the output
>>> search_replace(df)
A B C X result
0 1 2 3 B 2
1 5 6 7 A 5
2 8 9 1 C 1

Filter a pandas dataframe columns and rows using values from a dict

I need to filter a data frame with a dictionary whose keys are column names and whose values are the value, or list of values, to keep in that column:
dict_filter = {'A':'high', 'B':'medium', 'C':['bottom', 'high']}
df = pd.DataFrame({'id':[1,2], 'A':['high', 'high'], 'B':['high','medium'],'C':['high','bottom']})
The dataframe looks like this:
id A B C
0 1 'high' 'high' 'high'
1 2 'high' 'medium' 'bottom'
I would like to get the dataframe filtered as follows:
id A B C
1 2 'high' 'medium' 'bottom'
I tried the following method, but it doesn't work because the last value of the dictionary is a list:
df.loc[(df[list(dict_filter)] == pd.Series(dict_filter )).all(axis=1)]
Any suggestions ?
Solution
We can use isin to create a boolean mask, but first make sure that every value in dict_filter is list-like (np.atleast_1d wraps a scalar in a one-element array):
import numpy as np
d = {k: np.atleast_1d(v) for k, v in dict_filter.items()}
df[df[list(d)].isin(d).all(axis=1)]
id A B C
1 2 high medium bottom
Another option is to build one mask per key and combine them:
bool_arr = []
for k, v in dict_filter.items():
    bool_arr.append(df.loc[:, k].isin(pd.Series(v)))
df.loc[pd.concat(bool_arr, axis=1).all(axis=1)]
# id A B C
# 1 2 high medium bottom
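One-liner (this assumes id is the index, e.g. after df = df.set_index('id'); otherwise the id column has no entry in dict_filter and every row is dropped):
filtered = df[df.apply(lambda col: col.isin(pd.Series(dict_filter.get(col.name, [])))).all(axis=1)]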
Output:
>>> filtered
A B C
id
2 high medium bottom
You can use:
d = {k: v if isinstance(v, list) else [v]
     for k, v in dict_filter.items()}
mask = (df[list(dict_filter)]
        .apply(lambda c: c.isin(d[c.name]))
        .all(axis=1)
       )
df2 = df[mask]
Output:
id A B C
1 2 high medium bottom
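The answers above all normalize the dictionary values and then combine per-column isin masks. A self-contained sketch of that idea follows; the function name filter_by_dict is hypothetical, and it assumes every key of the dictionary is a column of the frame.

```python
import pandas as pd

dict_filter = {'A': 'high', 'B': 'medium', 'C': ['bottom', 'high']}
df = pd.DataFrame({
    'id': [1, 2],
    'A': ['high', 'high'],
    'B': ['high', 'medium'],
    'C': ['high', 'bottom'],
})

def filter_by_dict(frame, criteria):
    """Keep rows where every column named in `criteria` holds one of its allowed values."""
    mask = pd.Series(True, index=frame.index)
    for column, allowed in criteria.items():
        # Wrap scalars so that isin always receives a list
        allowed = allowed if isinstance(allowed, list) else [allowed]
        mask &= frame[column].isin(allowed)
    return frame[mask]

filtered = filter_by_dict(df, dict_filter)
print(filtered)
```

Columns that are not mentioned in the dictionary, such as id here, are simply left out of the mask.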

Is there a way to select the columns whose value in a given row matches an expected value?

I am working with a df and want to select the columns that meet a condition on the values of one row.
I only know one clumsy way: query each cell value in a loop, then collect the columns I expect.
>>> df = pd.DataFrame({'A':list('abcd'),'B':list('1bfe'),'C':list('ghgk')})
>>> df
A B C
0 a 1 g
1 b b h
2 c f g
3 d e k
>>> # get columns; condition: second row equals 'b'
>>> cols = list()
>>> for val in df:
...     if df.loc[1, val] == 'b':
...         cols.append(val)
...
>>> cols
['A', 'B']
Use:
df.columns[df.loc[1]=='b']
There is this nice way to make queries in pandas dataframes.
df = pd.DataFrame({'A':list('abcd'),'B':list('1bfe'),'C':list('ghgk')})
df.query("A == 'a' and B == '1'")
This query will return the first row of the dataframe based on the fact that column A matches a and column B matches 1
On the phone, can’t test but this would run:
df.columns[df.loc[1].eq('b')]
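As a quick check of the boolean-mask suggestions above, here is a minimal sketch on the frame from the question. The variable names matches, cols and subset are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcd'), 'B': list('1bfe'), 'C': list('ghgk')})

# Boolean Series indexed by column name: True where row 1 equals 'b'
matches = df.loc[1].eq('b')

cols = df.columns[matches].tolist()   # just the column labels
subset = df.loc[:, matches]           # the matching columns themselves
print(cols)
print(subset)
```

The same mask works both for listing the labels and for slicing the frame, so there is no need to loop over the columns.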

Is there any way to replace values with different values in the same row if they meet a certain condition using a for loop?

I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1 :
        df['col2']
    else:
I am hoping for every value in column 1 to be replaced with its respective value in column 2 if the value in column 1 is greater than 1.
No need to loop through the entire dataframe.
idx=df['col1']>1
df.loc[idx,'col1']=df.loc[idx,'col2']
Using a for loop:
for i, row in df.iterrows():
    if row['col1'] > 1:
        # iterrows yields copies of the rows, so write back through df.loc
        df.loc[i, 'col1'] = row['col2']
    # elif other_condition:
    #     put further assignments here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
output
A B
0 1 4
1 2 4
2 3 6
----------
A B
0 1 4
1 4 4
2 3 6
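Another way to express the same replacement without a loop is Series.mask, which swaps in values from another Series wherever a condition holds. This is a sketch with made-up data; col1 and col2 are the column names from the question.

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 2, 5], 'col2': [10, 20, 30]})

# Where col1 > 1 take the value from col2, otherwise keep col1
df['col1'] = df['col1'].mask(df['col1'] > 1, df['col2'])
print(df)
```

Only the rows where the condition is True are changed; the first row keeps its original col1 value.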

Find column names when row element meets a criteria Pandas

This is a basic question. I've got a square array with the rows and columns summed up. Eg:
df = pd.DataFrame([[0,0,1,0], [0,0,1,0], [1,0,0,0], [0,1,0,0]], index = ["a","b","c","d"], columns = ["a","b","c","d"])
df["sumRows"] = df.sum(axis = 1)
df.loc["sumCols"] = df.sum()
This returns:
In [100]: df
Out[100]:
a b c d sumRows
a 0 0 1 0 1
b 0 0 1 0 1
c 1 0 0 0 1
d 0 1 0 0 1
sumCols 1 1 2 0 4
I need to find the column labels whose value in the sumCols row is 0. At the moment I am doing this:
[df.loc["sumCols"] == 0].index
But this return a strange index type object. All I want is a list of values that match this criteria i.e: ['d'] in this case.
There are two ways (the Index object can be converted to an iterable such as a list).
Do that with the columns:
columns = df.columns[df.sum()==0]
columns = list(columns)
Or you can rotate the Dataframe and treat columns as rows:
list(df.T[df.T.sumCols == 0].index)
You can use a lambda expression to filter series and if you want a list instead of index as result, you can call .tolist() on the index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols and if you want to check which column has sum of zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: The lambda expression is only a functional way of writing the boolean mask, so it should perform about as well as ordinary vectorized subsetting, and it fits easily in a one-liner if you prefer.
Here's a simple method using query after transposing:
df.T.query('sumCols == 0').index.tolist()
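For completeness, here is a self-contained sketch that rebuilds the frame from the question and pulls out the labels with a plain boolean mask on the sumCols row. The names totals and zero_cols are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]],
    index=['a', 'b', 'c', 'd'],
    columns=['a', 'b', 'c', 'd'],
)
df['sumRows'] = df.sum(axis=1)
df.loc['sumCols'] = df.sum()

# The sumCols row is a Series whose index holds the column labels
totals = df.loc['sumCols']
zero_cols = totals.index[totals.eq(0)].tolist()
print(zero_cols)
```

Calling .tolist() on the resulting Index gives an ordinary Python list rather than an Index object.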
