This seems like a simple question, but I couldn't find it asked before (this and this are close but the answers aren't great).
The question: I want to search for a value somewhere in my df (I don't know which column it's in) and return all rows with a match.
What's the most Pandaic way to do it? Is there anything better than:
for col in list(df):
    try:
        df[col] == var  # bare comparison; raises TypeError on incomparable dtypes
        return df[df[col] == var]
    except TypeError:
        continue
?
You can perform the equality comparison on the entire DataFrame and keep the rows where any column matches:
df[df.eq(var).any(axis=1)]
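For example, with a small made-up frame (a minimal sketch; var is the value to search for):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'bal1', 'y']})
var = 'bal1'
print(df[df.eq(var).any(axis=1)])
#    A     B
# 1  2  bal1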
You can use isin. Note that this returns which columns contain the value; if you want the rows instead, check the answer above :-)
df.isin(['bal1']).any()
A False
B True
C False
CLASS False
dtype: bool
Or
df[df.isin(['bal1'])].stack()  # level 0 of the index is the row index; level 1 is the column containing the value
0 B bal1
1 B bal1
dtype: object
You can try the code below:

import pandas as pd

x = pd.read_csv(r"filePath")
x.columns = x.columns.str.lower().str.replace(' ', '_')
z = x.columns.tolist()

print("Note: the search is case-sensitive.")
keyWord = input("Type a keyword to search: ")
for k in range(len(z)):
    try:
        l = x[x[z[k]].str.match(keyWord)]
        print(l.head(10))
    except (AttributeError, TypeError):
        # .str.match only works on string columns; skip the others
        continue
This solution returns the actual column(s) containing the value:
df.columns[df.isin(['Yes']).any()]
Minimal solution:

import pandas as pd
import numpy as np

def locate_in_df(df, value):
    # return the (row, column) integer position of the first match
    a = df.to_numpy()
    row = np.where(a == value)[0][0]
    col = np.where(a == value)[1][0]
    return row, col
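For example, with hypothetical data (note that the [0][0] indexing raises an IndexError when the value is absent, so you may want to guard against that):

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'bal1']})
print(locate_in_df(df, 'bal1'))  # (1, 1): second row, second column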
I have a dataframe like this:
A Status_A Invalid_A
0 Null OR Blank True
1 NaN Null OR Blank True
2 Xv Valid False
I want a dataframe like this:
A Status_A Invalid_A
0 Null OR Blank A True
1 NaN Null OR Blank A True
2 Xv Valid False
I want to append the column name to the Status_A column when I create the df using:
def checkNull(ele):
    if pd.isna(ele) or (ele == ''):
        return ("Null OR Blank", True)
    else:
        return ("Valid", False)

df[['Status_A', 'Invalid_A']] = df['A'].apply(checkNull).tolist()
I want to pass the column name into this function.
You have a couple of options here.
One option is that when you create the dataframe, you can pass additional arguments to pd.Series.apply:
def checkNull(ele, suffix):
    if pd.isna(ele) or (ele == ''):
        return (f"Null OR Blank {suffix}", True)
    else:
        return ("Valid", False)

df[['Status_A', 'Invalid_A']] = df['A'].apply(checkNull, args=('A',)).tolist()
Another option is to post-process the dataframe to add the suffix:
df.loc[df['Invalid_A'], 'Status_A'] += '_A'
That being said, both columns are redundant, which is usually code smell. Consider just using the boolean series pd.isna(df['A']) | (df['A'] == '') as an index instead.
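For instance, a minimal sketch of that idea:

invalid = pd.isna(df['A']) | (df['A'] == '')
df.loc[invalid]  # just the rows whose 'A' is null or blank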
A more efficient way is to use np.where:

import numpy as np

df['Status%s' % '_A'] = np.where(df['A'].isnull() | (df['A'] == ''), 'Null or Blank', 'Valid')
df['Invalid%s' % '_A'] = np.where(df['A'].isnull() | (df['A'] == ''), True, False)
Maybe something like this:

def append_col_name(df, col_name):
    col = f"Status_{col_name}"
    df[col] = df[col].apply(lambda x: x + " " + col_name if x != "Valid" else x)
    return df
Then with your df
append_col_name(df, "A")
If you're checking each element, you can use a vectorised operation and return an entire dataframe, as opposed to operating on a column:

def str_col_check(colname: str, dataframe: pd.DataFrame) -> pd.DataFrame:
    suffix = colname.split('_')[-1]
    mask = dataframe['Status_A'].isin(['Null OR Blank', ''])
    dataframe.loc[mask, 'Status_A'] = dataframe.loc[mask, 'Status_A'] + '_' + suffix
    return dataframe
When iterating through a column's elements (Y, Y, nan, Y in my case), for some reason I can't add a new element when a condition is met: when Y, Y is encountered twice in a row, I want to replace the last Y with "encountered", or simply add it or rewrite it, since I keep track of the index number.
I have a dataframe
col0 col1
1 A Y
2 B Y
3 B nan
4 C Y
code:
count = 0
for i, e in enumerate(df['col1']):
    if 'Y' in e:
        count += 1
    else:
        count = 0
    if count == 2:
        df['col1'][i] = 'encountered'  # IndexError: list index out of range
error message:
IndexError: list index out of range
Even if I specify the index of the column cell I'd like to add the message to, I get the same error:
code:
df['col1'][1] = 'or this'
A direct example of the main idea:
df['col1'][2] = 'under index 2 in column1 add this msg'
Is it because PyPDF2/utils is interfering?
warning:
File "C:\Users\path\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
error:
IndexError: list index out of range
last_index = df[df['col1'] == 'Y'].index[-1]
df.loc[last_index, 'col1'] = 'encountered'
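Applied to the sample frame above, this marks the last 'Y' in the column (a quick sketch):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col0': ['A', 'B', 'B', 'C'], 'col1': ['Y', 'Y', np.nan, 'Y']})
last_index = df[df['col1'] == 'Y'].index[-1]
df.loc[last_index, 'col1'] = 'encountered'
print(df)
#   col0         col1
# 0    A            Y
# 1    B            Y
# 2    B          NaN
# 3    C  encountered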
Here's how I would go about solving this:
prev_val = None
# Iterate through rows to utilize the index
for idx, row in df[['col1']].iterrows():
    # unpack your row; a bit more overhead, but highly readable
    val = row['col1']
    # Use the previous value instead of a counter; it is easier to read and more accurate
    if val == 'Y' and val == prev_val:
        df.loc[idx, 'col1'] = 'encountered'
    # now set the prev value to current:
    prev_val = val
What could possibly be the issue with your code is the way you are iterating over your dataframe, and also the indexing. Another issue is that you are trying to set the value you are iterating over to a new value; that is bound to give you issues later.
Does this work for you?
count = 0
df['encounter'] = np.nan
for i in df.itertuples():
    if getattr(i, 'col1') == 'Y':
        count += 1
    else:
        count = 0
    if count == 2:
        df.loc[i[0], 'encounter'] = 'encountered'
print(df)
col0 col1 encounter
0 A Y NaN
1 B Y encountered
2 B NaN NaN
3 C Y NaN
I want to test, for each row of a CSV file, whether some columns are empty or not, and change the value of another column depending on that.
Here is what I have:
df = df.replace(r'^\s*$', np.NaN, regex=True)
df['Multi-line'] = pd.Series(dtype=object)
for i, row in df.iterrows():
    if (row['Directory Number 1'] != np.NaN and row['Directory Number 2'] != np.NaN and
            row['Directory Number 3'] != np.NaN and row['Directory Number 4'] != np.NaN):
        df.at[i, 'Multi-line'] = 'Yes'
If 2 "Directory Number X" or more are not empty, I want the "Multi-line" column to be "Yes" and if 1 or 0 "Directory Number X" are not empty then "Multi-line" should be "No".
Here is only one if, just to show you how it looks, but in my test sample all Multi-line values are set to "Yes". It seems like the problem is inside the if condition, with the row value and np.nan, but I don't know how to check whether a row value is empty or not.
Thanks for your help!
I assume that you executed df = df.replace(r'^\s*$', np.NaN, regex=True) before.
Then, to generate the new column, run:
df['Multi-line'] = df.apply(lambda row: 'Yes' if row.notna().sum() >= 2 else 'No', axis=1)
No need for an explicit call to iterrows, as apply arranges just such a loop, invoking the passed function for each row.
If your DataFrame also has other columns, especially ones that can have NaN values, then the application of this lambda function should be limited to just the 4 columns of interest.
In this case run:
cols = [f'Directory Number {i}' for i in range(1, 5)]
df['Multi-line'] = df[cols].apply(lambda row: 'Yes' if row.notna().sum() >= 2 else 'No', axis=1)
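As a side note, the same result can be obtained without apply, counting non-NaN values in a vectorized way (a sketch under the same assumptions about cols):

import numpy as np

df['Multi-line'] = np.where(df[cols].notna().sum(axis=1) >= 2, 'Yes', 'No')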
Note also that a check like if row[s] != np.NaN: is a bad approach, since NaN by definition is not equal to another NaN, so you can't just compare two NaNs.
To check it try:
s = np.nan
s2 = np.nan
s != s2 # True
s == s2 # False
Then save any "true" string in s, running s = 'xx' and repeat:
s != s2 # True
s == s2 # False
with just the same result.
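The reliable way to test for NaN is pd.isna / pd.notna (a quick sketch):

import numpy as np
import pandas as pd

s = np.nan
pd.isna(s)   # True
pd.notna(s)  # False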
You can use a counter instead:

df = df.replace(r'^\s*$', np.NaN, regex=True)
df['Multi-line'] = pd.Series(dtype=object)
cnt = 0
cols = ['Directory Number 1', 'Directory Number 2',
        'Directory Number 3', 'Directory Number 4']
for i, row in df.iterrows():
    for s in cols:
        if pd.notna(row[s]):  # comparing with != np.NaN is always True, so use pd.notna
            cnt += 1
    if cnt >= 2:
        df.at[i, 'Multi-line'] = 'Yes'
    else:
        df.at[i, 'Multi-line'] = 'No'
    cnt = 0
I am trying to fill records in one column based on some condition, but I am not getting the result. Can you please help me with how to do this?
Example:
df:
applied_sql_function1 and_or_not_oprtor_pre comb_fld_order_1
CASE WHEN
WHEN AND
WHEN AND
WHEN
WHEN AND
WHEN OR
WHEN
WHEN dummy
WHEN dummy
WHEN
Expected Output:
applied_sql_function1 and_or_not_oprtor_pre comb_fld_order_1 new
CASE WHEN CASE WHEN
WHEN AND
WHEN AND
WHEN WHEN
WHEN AND
WHEN OR
WHEN WHEN
WHEN dummy
WHEN dummy
WHEN WHEN
I have written some logic for this, but it is not working:

df_main1['new'] = ''
for index, row in df_main1.iterrows():
    new = ''
    if (str(row['applied_sql_function1']) != '' and str(row['and_or_not_oprtor_pre']) == ''
            and str(row['comb_fld_order_1']) == ''):
        new += str(row['applied_sql_function1'])
        print(new)
    if str(row['applied_sql_function1']) != '' and str(row['and_or_not_oprtor_pre']) != '':
        new += ''
        print(new)
    else:
        new += ''
    row['new'] = new  # note: this assigns to a copy of the row, not to df_main1 itself
print(df_main1['new'])
Using loc:
mask = df.and_or_not_oprtor_pre.fillna("").eq("") \
& df.comb_fld_order_1.fillna("").eq("")
df.loc[mask, 'new'] = df.loc[mask, 'applied_sql_function1']
Try this one; it works in a quick way:
indexes = df.index[(df['and_or_not_oprtor_pre'].isna()) & (df['comb_fld_order_1'].isna())]
df.loc[indexes, 'new'] = df.loc[indexes, 'applied_sql_function1']
Go with np.where all the way! It's easy to understand and vectorized, so the performance is good on really large datasets.
import pandas as pd, numpy as np
df['new'] = ''
df['new'] = np.where((df['and_or_not_oprtor_pre'] == '') & (df['comb_fld_order_1'] == ''), df['applied_sql_function1'], df['new'])
df
I have this DataFrame:
df = pd.DataFrame({'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017',
                           '1/5/2017', '1/6/2017', '1/7/2017'],
                   'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny', 'Sunny'],
                   'temperature': [32, 35, 28, 24, 32, 31, ''],
                   'windspeed': [6, 7, 2, 7, 4, 2, '']})
df
I am trying to find the headers for the missing values on index 6:
for x in df.loc[6]:
    if x == '':
        print(df.columns.values)
    else:
        print(x)
I have tried searching, and the closest I could get was what I have now. Ultimately I'm trying to insert these values into the dataframe: temperature = 34, windspeed = 8.
But my first step was simply trying to build the loop/if statement that says "if x == '' and [COLUMN_NAME] == 'temperature'...", and that is where I got stuck. I'm new to Python, just trying to learn Pandas. I need to return only the column I'm on, not a list of all the columns.
There are better ways to do this, but this works.
for col, val in df.loc[6].items():  # iteritems() was removed in pandas 2.0
    if not val:  # this is the same as saying "if val == '':" for string values
        print(col)
    else:
        print(val)
Modified from your code:
for i, x in enumerate(df.loc[6]):
    if x == '':
        print(df.columns[i])
    else:
        print(x)
I would use list comprehension as follows:
listOfNulls = [ind for ind in df.loc[6].index if df.loc[6][ind] == '']
and when I print the listOfNulls, I get:
>>> print(listOfNulls)
['temperature', 'windspeed']
The key here is to understand that df.loc[6] is a pandas Series, which has an index. We are using the values of the Series to select entries from its index.
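Equivalently, you can let pandas do the filtering (a small sketch of the same idea):

listOfNulls = df.loc[6][df.loc[6] == ''].index.tolist()
print(listOfNulls)  # ['temperature', 'windspeed']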