Pandas: Remove rows where all values equal a certain value - python

I have a DataFrame with regex search results. I need to remove any row where there were no matches for any of the terms. Not all columns are search results, only columns 2 - 6.
Have tried ( NF = "Not Found" ):
cond1 = (df['term1'] != "NF") & (df['term2'] != "NF") & (df['term3'] != "NF") & (df['term4'] != "NF") & (df['term5'] != "NF")
df_pos_results = df[cond1]
For some reason this is removing positive results.

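I think you need .any:
df = df[df.iloc[:, 1:6].ne('NF').any(axis=1)]
That keeps every row with at least one value other than NF, so it removes only the rows where every value in columns 2 - 6 is equal to NF. Your condition chains the comparisons with &, which keeps only rows with no NF at all, and that is why positive results disappear.
For multiple values:
df = df[~df.iloc[:, 1:6].isin(['NF', 'ABC', 'DEF']).all(axis=1)]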
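The difference between any and all is easy to check on a toy frame. Here is a minimal sketch, assuming an id column followed by five term columns; the sample data and the doc column are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: an id column followed by five search-result columns.
df = pd.DataFrame({
    "doc": ["a", "b", "c"],
    "term1": ["NF", "x", "NF"],
    "term2": ["NF", "NF", "y"],
    "term3": ["NF", "NF", "NF"],
    "term4": ["NF", "z", "NF"],
    "term5": ["NF", "NF", "NF"],
})

terms = df.iloc[:, 1:6]  # columns 2 - 6 by position

# Keeps only rows with no "NF" at all (the effect of chaining with &).
strict = df[terms.ne("NF").all(axis=1)]

# Keeps rows with at least one match, i.e. drops rows that are all "NF".
kept = df[terms.ne("NF").any(axis=1)]

print(strict)
print(kept)
```

With this sample, strict comes back empty while kept holds rows b and c, which is the behaviour the question asks for.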

Related

Set a new column using Pandas

I have a dataframe like this:
     A       Status_A  Invalid_A
0       Null OR Blank       True
1  NaN  Null OR Blank       True
2   Xv          Valid      False
I want a dataframe like this:
     A         Status_A  Invalid_A
0       Null OR Blank A       True
1  NaN  Null OR Blank A       True
2   Xv            Valid      False
I want to append column name to the Status_A column when I create df using
def checkNull(ele):
    if pd.isna(ele) or (ele == ''):
        return ("Null OR Blank", True)
    else:
        return ("Valid", False)
df[['Status_A', 'Invalid_A']] = df['A'].apply(checkNull).tolist()
I want to pass column name in this function.
You have a couple of options here.
One option is to pass additional arguments to pd.Series.apply when you create the columns:
def checkNull(ele, suffix):
    if pd.isna(ele) or (ele == ''):
        return (f"Null OR Blank {suffix}", True)
    else:
        return ("Valid", False)
df[['Status_A', 'Invalid_A']] = df['A'].apply(checkNull, args=('A',)).tolist()
Another option is to post-process the dataframe to add the suffix:
df.loc[df['Invalid_A'], 'Status_A'] += ' A'
That being said, the two columns are redundant, which is usually a code smell. Consider just using the boolean series pd.isna(df['A']) | (df['A'] == '') as a mask instead.
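To illustrate the mask idea, here is a small sketch. The sample values mirror the question (a blank string, a NaN and a real value), and the name invalid_a is just an illustrative choice:

```python
import numpy as np
import pandas as pd

# Made-up sample matching the question: blank string, NaN, and a real value.
df = pd.DataFrame({"A": ["", np.nan, "Xv"]})

# One boolean mask carries the same information as Status_A and Invalid_A.
invalid_a = pd.isna(df["A"]) | (df["A"] == "")

print(df[invalid_a])   # rows whose A is null or blank
print(df[~invalid_a])  # rows whose A is valid
```

The mask can be reused wherever the status is needed, so there are no derived columns to keep in sync with A.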
The more efficient way is to use np.where (with numpy imported as np):
mask = df['A'].isna() | (df['A'] == '')
df['Status_A'] = np.where(mask, 'Null OR Blank A', 'Valid')
df['Invalid_A'] = mask  # a real boolean column, not the strings 'True'/'False'
Maybe something like this:
def append_col_name(df, col_name):
    col = f"Status_{col_name}"
    df[col] = df[col].apply(lambda x: x + " " + col_name if x != "Valid" else x)
    return df
Then with your df:
append_col_name(df, "A")
If you're checking each element, you can use a vectorised operation and return an entire dataframe, as opposed to operating on a column.
def str_col_check(colname: str,
                  dataframe: pd.DataFrame) -> pd.DataFrame:
    suffix = colname.split('_')[-1]
    # update only the rows flagged as null or blank
    mask = dataframe[colname].isin(['Null OR Blank', ''])
    dataframe.loc[mask, colname] = dataframe[colname] + ' ' + suffix
    return dataframe

Data Cleaning with Pandas

I have a dataframe column consisting of text data and I need to filter it according to the following conditions:
The character "M", if it's present in the string, it can only be at the n-2 position
The n-1 position of the string always has to be a "D".
ex:
KFLL
KSDS
KMDK
MDDL
In this case, for example, I would have to remove the first string, since the character at the n-1 position is not a "D", and the last one, since the character "M" appears out of the n-2 position.
How can I apply this to a whole dataframe column?
Here's one way with a list comprehension:
l = ['KFLL', 'KSDS', 'KMDK', 'MDDL']
[x for x in l if ((('M' not in x) or (x[-3] == 'M')) and (x[-2] == 'D'))]
Output:
['KSDS', 'KMDK']
This does what you want. It could probably be written more compactly with list comprehensions, but at least this is readable. It assumes that every string has at least 3 characters; otherwise you get an IndexError, and you would need to add a try/except.
from collections import Counter
import pandas as pd
df = pd.DataFrame(data=["KFLL", "KSDS", "KMDK", "MDDL"], columns=["code"])
print("original")
print(df)
mask = list()
for code in df["code"]:
    flag = False
    if code[-2] == "D":
        counter = Counter(code)
        if counter["M"] == 0 or (counter["M"] == 1 and code[-3] == "M"):
            flag = True
    mask.append(flag)
df["mask"] = mask
df2 = df[df["mask"]].copy()
df2.drop("mask", axis=1, inplace=True)
print("new")
print(df2)
Output looks like this
original
code
0 KFLL
1 KSDS
2 KMDK
3 MDDL
new
code
1 KSDS
2 KMDK
Thank you all for your help.
I ended up implementing it like this:
l = {"Sequence": [ 'KFLL', 'KSDS', 'KMDK', 'MDDL', "MMMD"]}
df = pd.DataFrame(data= l)
print(df)
df = df[df.Sequence.str[-2] == 'D']
df = df[~df.Sequence.apply(lambda x: ("M" in x and x[-3]!='M') or x.count("M") >1 )]
print(df)
Output:
Sequence
0 KFLL
1 KSDS
2 KMDK
3 MDDL
4 MMMD
Sequence
1 KSDS
2 KMDK
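The same two rules can also be written with the .str accessor only, without a Python-level loop or apply. This is just a sketch on the sample data from the thread; the intermediate names (seq, m_count, has_d, m_ok) are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Sequence": ["KFLL", "KSDS", "KMDK", "MDDL", "MMMD"]})

seq = df["Sequence"]
m_count = seq.str.count("M")

# Second-to-last character must be "D".
has_d = seq.str[-2] == "D"
# Either no "M" at all, or exactly one "M" sitting third from the end.
m_ok = (m_count == 0) | ((m_count == 1) & (seq.str[-3] == "M"))

result = df[has_d & m_ok]
print(result)
```

Out-of-range positions come back as NaN from the .str accessor, so strings shorter than 3 characters are filtered out rather than raising an IndexError.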

Pandas CSV : Check for each row if a column is empty

I want to test, for each row of a CSV file, whether some columns are empty, and change the value of another column depending on that.
Here is what I have:
df = df.replace(r'^\s*$', np.NaN, regex=True)
df['Multi-line'] = pd.Series(dtype=object)
for i, row in df.iterrows():
    if (row['Directory Number 1'] != np.NaN and row['Directory Number 2'] != np.NaN and row['Directory Number 3'] != np.NaN and row['Directory Number 4'] != np.NaN):
        df.at[i,'Multi-line'] = 'Yes'
If 2 or more "Directory Number X" columns are not empty, I want the "Multi-line" column to be "Yes"; if only 1 or 0 are not empty, "Multi-line" should be "No".
The code above has only one if, just to show how it looks, but in my test sample every Multi-line value is set to "Yes". The problem seems to be inside the if condition, in the comparison of the row value with np.NaN, but I don't know how to check whether a row value is empty or not.
Thanks for your help!
I assume that you executed df = df.replace(r'^\s*$', np.NaN, regex=True)
before.
Then, to generate the new column, run:
df['Multi-line'] = df.apply(lambda row: 'Yes' if row.notna().sum() >= 2 else 'No', axis=1)
There is no need for an explicit call to iterrows, as apply runs just such
a loop, invoking the passed function for each row.
If your DataFrame also has other columns, especially ones that can
contain NaN values, then this lambda function should be applied
only to the 4 columns of interest.
In this case run:
cols = [f'Directory Number {i}' for i in range(1, 5)]
df['Multi-line'] = df[cols].apply(
    lambda row: 'Yes' if row.notna().sum() >= 2 else 'No', axis=1)
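The per-row count can also be computed without apply, by summing the notna() mask across columns. This is a sketch on made-up data that uses the four column names from the question; the name filled is illustrative:

```python
import numpy as np
import pandas as pd

# Made-up sample with the four columns named in the question.
df = pd.DataFrame({
    "Directory Number 1": ["1001", "1002", np.nan],
    "Directory Number 2": ["2001", np.nan, np.nan],
    "Directory Number 3": [np.nan, np.nan, np.nan],
    "Directory Number 4": ["4001", np.nan, np.nan],
})

cols = [f"Directory Number {i}" for i in range(1, 5)]

# Count non-empty cells per row, then map the comparison to Yes/No.
filled = df[cols].notna().sum(axis=1)
df["Multi-line"] = np.where(filled >= 2, "Yes", "No")
print(df["Multi-line"].tolist())
```

Here the three rows have 3, 1 and 0 filled columns, so only the first one is marked "Yes".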
Note also that a check like if (row[s] != np.NaN): is a bad approach:
NaN by definition is not equal to anything, including another NaN,
so the != comparison is always True and tells you nothing.
To check it try:
s = np.nan
s2 = np.nan
s != s2 # True
s == s2 # False
Then store an actual string in s, by running s = 'xx', and repeat:
s != s2 # True
s == s2 # False
with just the same result.
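For completeness, a short sketch of checks that do distinguish NaN from a real value, using pd.isna and the standard library's math.isnan:

```python
import math

import numpy as np
import pandas as pd

value = np.nan

print(value != np.nan)    # True: NaN is unequal to everything, itself included
print(value == np.nan)    # False
print(pd.isna(value))     # True: the reliable test
print(pd.isna("xx"))      # False
print(math.isnan(value))  # True, but math.isnan only accepts numbers
```

pd.isna (or its counterpart pd.notna) is the safer choice inside a row loop, because it accepts strings as well as floats.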
You can use a counter instead:
df = df.replace(r'^\s*$', np.nan, regex=True)
df['Multi-line'] = pd.Series(dtype=object)
cols = ['Directory Number 1', 'Directory Number 2', 'Directory Number 3', 'Directory Number 4']
for i, row in df.iterrows():
    cnt = 0
    for c in cols:
        # NaN never compares equal to anything, so test with pd.notna
        if pd.notna(row[c]):
            cnt += 1
    # two or more non-empty directory numbers means multi-line
    df.at[i, 'Multi-line'] = 'Yes' if cnt >= 2 else 'No'

How to fill column based on the condition in dataframe?

I am trying to fill one column based on a condition, but I am not getting the expected result. Can you please help me with this?
Example:
df:
applied_sql_function1 and_or_not_oprtor_pre comb_fld_order_1
CASE WHEN
WHEN AND
WHEN AND
WHEN
WHEN AND
WHEN OR
WHEN
WHEN dummy
WHEN dummy
WHEN
Expected Output:
applied_sql_function1 and_or_not_oprtor_pre comb_fld_order_1 new
CASE WHEN CASE WHEN
WHEN AND
WHEN AND
WHEN WHEN
WHEN AND
WHEN OR
WHEN WHEN
WHEN dummy
WHEN dummy
WHEN WHEN
I have written some logic for this but it is not working:
df_main1['new'] = ''
for index, row in df_main1.iterrows():
    new = ''
    if((str(row['applied_sql_function1']) != '') and (str(row['and_or_not_oprtor_pre']) == '') and (str(row['comb_fld_order_1']) == '')):
        new += str(row['applied_sql_function1'])
        print(new)
    if(str(row['applied_sql_function1']) != '') and (str(row['and_or_not_oprtor_pre']) != ''):
        new += ''
        print(new)
    else:
        new += ''
    row['new'] = new
print(df_main1['new'])
Using loc:
mask = df.and_or_not_oprtor_pre.fillna("").eq("") \
       & df.comb_fld_order_1.fillna("").eq("")
df.loc[mask, 'new'] = df.loc[mask, 'applied_sql_function1']
Try this one; it should work quickly:
indexes = df.index[(df['and_or_not_oprtor_pre'].isna()) & (df['comb_fld_order_1'].isna())]
df.loc[indexes, 'new'] = df.loc[indexes, 'applied_sql_function1']
Go with np.where all the way! It's easy to understand and vectorized, so the performance is good on really large datasets.
import pandas as pd, numpy as np
df['new'] = ''
df['new'] = np.where((df['and_or_not_oprtor_pre'] == '') & (df['comb_fld_order_1'] == ''), df['applied_sql_function1'], df['new'])
df
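Whether the blanks arrive as empty strings or as NaN depends on how the data was loaded, and the answers above each assume one or the other. Here is a sketch that treats both alike, on made-up rows using the column names from the question:

```python
import numpy as np
import pandas as pd

# Made-up rows mixing empty strings and NaN in the two condition columns.
df = pd.DataFrame({
    "applied_sql_function1": ["CASE WHEN", "WHEN", "WHEN", "WHEN"],
    "and_or_not_oprtor_pre": ["", "AND", np.nan, ""],
    "comb_fld_order_1": ["", "", np.nan, "dummy"],
})

# Treat NaN and "" alike so the mask does not depend on how blanks were loaded.
blank = (
    df["and_or_not_oprtor_pre"].fillna("").eq("")
    & df["comb_fld_order_1"].fillna("").eq("")
)
df["new"] = np.where(blank, df["applied_sql_function1"], "")
print(df)
```

Only the first and third rows have both condition columns blank, so only those rows copy applied_sql_function1 into new.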

Splitting strings in tuples within a pandas dataframe column

I have a pandas dataframe where a column contains tuples:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "CC.You")]})
I'd like to split each string in the tuple on the period (.), take the second part of each split string, and see whether any of them match a list of strings:
p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: [True if len(set(x).intersection(set(["Hi", "My"])))>0 else False])
This works as intended, but my dataframe has more than 100k rows, and apply doesn't seem very efficient at these sizes. Is there a way to optimize/vectorize the above code?
Use a nested list and set comprehension and, for the test, convert the sets to bools; an empty set converts to False:
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, CC.You) False
EDIT:
If splitting yields only 1 or 2 parts, you can select the last part by indexing with [-1]:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "You")]})
print (p)
sentence
0 (A.Hi, B.My, C.Friend)
1 (AA.How, BB.Are, You)
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, You) False
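A pandas-only alternative is to explode the tuples into one row per element and aggregate back by the original index. This is a sketch on the sample data above, with targets as an illustrative name; whether it beats the list comprehension depends on the data, so it is worth timing both:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi", "B.My", "C.Friend"),
                               ("AA.How", "BB.Are", "You")]})
targets = ["Hi", "My"]

# One row per tuple element; the original row label is kept in the index.
parts = p["sentence"].explode()
# Text after the last "." (or the whole string when there is no ".").
last = parts.str.split(".").str[-1]
# A row matches if any of its elements is in the target list.
p["tmp"] = last.isin(targets).groupby(level=0).any()
print(p)
```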
