I have a problem assigning a label based on whether a condition is satisfied. Specifically, I would like to assign True (or 1) to rows that contain at least one of these words
my_list=["maths", "science", "geography", "statistics"]
in one of these fields:
path | Subject | Notes
and look for these websites webs=["www.stanford.edu", "www.ucl.ac.uk", "www.sorbonne-universite.fr"] in column web.
To do this I am using the following code:
def part_is_in(x, values):
    output = False
    for val in values:
        if val in str(x):
            return True
    return output
def assign_value(filename):
    my_list = ["maths", "", "science", "geography", "statistics"]
    filename['Label'] = filename[['path','subject','notes']].apply(part_is_in, values=my_list)
    filename['Low_Subject'] = filename['Subject']
    filename['Low_Notes'] = filename['Notes']
    lower_cols = [col for col in filename if col not in ['Subject','Notes']]
    filename[lower_cols] = filename[lower_cols].apply(lambda x: x.astype(str).str.lower(), axis=1)
    webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]
    # NEW COLUMN
    filename['Label'] = pd.Series(index=filename.index, dtype='object')
    for index, row in filename.iterrows():
        value = row['web']
        if any(x in str(value) for x in webs):
            filename.at[index, 'Label'] = True
        else:
            filename.at[index, 'Label'] = False
    for index, row in filename.iterrows():
        value = row['Subject']
        if any(x in str(value) for x in my_list):
            filename.at[index, 'Label'] = True
        else:
            filename.at[index, 'Label'] = False
    for index, row in filename.iterrows():
        value = row['Notes']
        if any(x in str(value) for x in my_list):
            filename.at[index, 'Label'] = True
        else:
            filename.at[index, 'Label'] = False
    for index, row in filename.iterrows():
        value = row['path']
        if any(x in str(value) for x in my_list):
            filename.at[index, 'Label'] = True
        else:
            filename.at[index, 'Label'] = False
    return filename
My dataset is
web path Subject Notes
www.stanford.edu /maths/ NA NA
www.ucla.com /history/ History of Egypt NA
www.kcl.ac.uk /datascience/ Data Science 50 students
...
The expected output is:
web path Subject Notes Label
www.stanford.edu /maths/ NA NA 1 # contains the web and maths
www.ucla.com /history/ History of Egypt NA 0
www.kcl.ac.uk /datascience/ Data Science 50 students 1 # contains the word science
...
Using my code, I am getting all values False. Are you able to spot the issue?
The final values in Label are Booleans.
If you want ints, use df.Label = df.Label.astype(int)
def test_words
fill all NaNs, which are float type, with '', which is str type
convert all words to lowercase
replace all / with ' '
split on ' ' to make a list
combine all the lists into a single set
use set methods to determine if the row contains a word in my_list
set.intersection
{'datascience'}.intersection({'science'}) returns an empty set, because there is no intersection.
{'data', 'science'}.intersection({'science'}) returns {'science'}, because there's an intersection on that word.
lambda x: any(x in y for y in webs)
for each value in webs, check if web is in that value
'www.stanford.edu' in 'https://www.stanford.edu' is True
evaluates as True if any are True.
import pandas as pd
import numpy as np
# test data and dataframe
data = {'web': ['www.stanford.edu', 'www.ucla.com', 'www.kcl.ac.uk'],
'path': ['/maths/', '/history/', '/datascience/'],
'Subject': [np.nan, 'History of Egypt', 'Data Science'],
'Notes': [np.nan, np.nan, '50 students']}
df = pd.DataFrame(data)
# given my_list
my_list = ["maths", "science", "geography", "statistics"]
my_list = set(map(str.lower, my_list)) # convert to a set and verify words are lowercase
# given webs; all values should be lowercase
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]
# function to test for word content
def test_words(v: pd.Series) -> bool:
    v = v.fillna('').str.lower().str.replace('/', ' ').str.split(' ')  # replace NaN, lowercase, split into lists
    s_set = {st for row in v for st in row if st}  # join all the values in the lists into one set
    return bool(s_set.intersection(my_list))  # True if there is a word intersection between the sets
# test for conditions in the word columns and the web column
df['Label'] = df[['path', 'Subject', 'Notes']].apply(test_words, axis=1) | df.web.apply(lambda x: any(x in y for y in webs))
# display(df)
web path Subject Notes Label
0 www.stanford.edu /maths/ NaN NaN True
1 www.ucla.com /history/ History of Egypt NaN False
2 www.kcl.ac.uk /datascience/ Data Science 50 students True
Notes Regarding Original Code
It's not a good idea to use iterrows multiple times. For a large dataset it will be very time-consuming and error-prone.
It was easier to write a new function than to interpret the different code blocks for each column.
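As an illustration, the four iterrows loops in the original code can be collapsed into one vectorised expression. This is only a sketch using the sample data above; note it does plain substring matching (so 'datascience' would also count as containing 'science', which happens to match the expected output here), and it checks the URL containment in the right direction (the short 'web' value inside the full URL):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'web': ['www.stanford.edu', 'www.ucla.com', 'www.kcl.ac.uk'],
                   'path': ['/maths/', '/history/', '/datascience/'],
                   'Subject': [np.nan, 'History of Egypt', 'Data Science'],
                   'Notes': [np.nan, np.nan, '50 students']})

my_list = ["maths", "science", "geography", "statistics"]
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk",
        "http://www.sorbonne-universite.fr"]

# one regex alternation over all the words; NaN cells count as "no match"
pattern = '|'.join(my_list)
word_hit = (df[['path', 'Subject', 'Notes']]
            .apply(lambda s: s.str.contains(pattern, case=False, na=False))
            .any(axis=1))

# the short 'web' value must appear inside one of the full URLs
web_hit = df['web'].apply(lambda v: any(v in w for w in webs))

df['Label'] = (word_hit | web_hit).astype(int)  # [1, 0, 1]
```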
Related
So I have a column of strings that is listed as "compounds"
Composition (column title)
ZrMo3
Gd(CuS)3
Ba2DyInTe5
I have another column that has strings of metal elements from the periodic table, and I'll call that column "metals"
Elements (column title)
Li
Be
Na
The objective is to check each string from "compounds" against every single string listed in "metals", and if any string from metals is there, it should be classified as True. Any ideas how I can code this?
Example: (if "metals" has Zr, Ag, and Te)
ZrMo3 True
Gd(CuS)3 False
Ba2DyInTe5 True
I recently tried using the code below, but I ended up getting all False
asd = subset['composition'].isin(metals['Elements'])
print(asd)
I also tried this code and got all False as well
subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
Assuming you are using pandas, you can use a generator expression inside your lambda, since you essentially need to iterate over all elements in the elements list
import pandas as pd
elements = ['Li', 'Be', 'Na', 'Te']
compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']
df = pd.DataFrame(compounds, columns=['compounds'])
print(df)
output
compounds
0 ZrMo3
1 Gd(CuS)3
2 Ba2DyInTe5
df['boolean'] = df.compounds.apply(lambda x: any(el in x for el in elements))
print(df)
output
compounds boolean
0 ZrMo3 False
1 Gd(CuS)3 False
2 Ba2DyInTe5 True
if you are not using pandas, you can apply the lambda function to the lists with the map function
out = list(
    map(lambda x: any(el in x for el in elements), compounds)
)
print(out)
output
[False, False, True]
Here is a more complex version which also tackles the potential errors #Ezon mentioned, based on the regular-expression module re. Since this approach loops not only over the elements to compare against a single compound string but also over each constituent of the compounds, I made two helper functions to keep it readable.
import re
import pandas as pd
def split_compounds(c):
    # remove all non-alphabetic characters
    c_split = re.sub(r"[^a-zA-Z]", "", c)
    # split the string at capital letters
    c_split = '-'.join(re.findall('[A-Z][^A-Z]*', c_split))
    return c_split

def compare_compound(compound, element):
    # split the compound into a list of constituent elements
    compound_list = compound.split('-')
    return any(element == c for c in compound_list)
# build sample data
compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']
df = pd.DataFrame(compounds, columns=['compounds'])
# split compounds into elements
df['compounds_elements'] = [split_compounds(x) for x in compounds]
print(df)
output
compounds compounds_elements
0 SiO2 Si-O
1 Ba2DyInTe5 Ba-Dy-In-Te
2 ZrMo3 Zr-Mo
3 Gd(CuS)3 Gd-Cu-S
# check if any item from 'elements' is in the compounds
df['boolean'] = df.compounds_elements.apply(
    lambda x: any(compare_compound(x, el) for el in elements)
)
print(df)
output
compounds compounds_elements boolean
0 SiO2 Si-O False
1 Ba2DyInTe5 Ba-Dy-In-Te True
2 ZrMo3 Zr-Mo False
3 Gd(CuS)3 Gd-Cu-S True
I have a dataframe like this:
A Status_A Invalid_A
0 Null OR Blank True
1 NaN Null OR Blank True
2 Xv Valid False
I want a dataframe like this:
A Status_A Invalid_A
0 Null OR Blank A True
1 NaN Null OR Blank A True
2 Xv Valid False
I want to append column name to the Status_A column when I create df using
def checkNull(ele):
    if pd.isna(ele) or (ele == ''):
        return ("Null OR Blank", True)
    else:
        return ("Valid", False)
df[['Status_A', 'Invalid_A']] = df['A'].apply(checkNull).tolist()
I want to pass column name in this function.
You have a couple of options here.
One option is that when you create the dataframe, you can pass additional arguments to pd.Series.apply:
def checkNull(ele, suffix):
    if pd.isna(ele) or (ele == ''):
        return (f"Null OR Blank {suffix}", True)
    else:
        return ("Valid", False)
df[['Status_A', 'Invalid_A']] = df['A'].apply(checkNull, args=('A',)).tolist()
Another option is to post-process the dataframe to add the suffix:
df.loc[df['Invalid_A'], 'Status_A'] += ' A'
That being said, both columns are redundant, which is usually code smell. Consider just using the boolean series pd.isna(df['A']) | (df['A'] == '') as an index instead.
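For instance, the boolean series can be used directly wherever the invalid rows are needed, without materialising the two extra columns (a minimal sketch, assuming a column named 'A' as in the question):

```python
import pandas as pd

df = pd.DataFrame({'A': ['', None, 'Xv']})

# True where A is NaN or the empty string
invalid = df['A'].isna() | (df['A'] == '')

# use the mask directly, e.g. to select or count the invalid rows
print(df.loc[invalid])
print(invalid.sum())  # 2
```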
A more efficient way is to use np.where:
import numpy as np
df['Status_A'] = np.where(df['A'].isnull() | (df['A'] == ''), 'Null or Blank', 'Valid')
df['Invalid_A'] = np.where(df['A'].isnull() | (df['A'] == ''), True, False)
Maybe something like this
def append_col_name(df, col_name):
    col = f"Status_{col_name}"
    df[col] = df[col].apply(lambda x: x + " " + col_name if x != "Valid" else x)
    return df
Then with your df
append_col_name(df, "A")
If you're checking each element, you can use a vectorised operation and return an entire dataframe, as opposed to operating on a single column.
def str_col_check(colname: str,
                  dataframe: pd.DataFrame) -> pd.DataFrame:
    suffix = colname.split('_')[-1]
    mask = dataframe[colname].isin(['Null OR Blank', ''])
    dataframe.loc[mask, colname] = dataframe[colname] + ' ' + suffix
    return dataframe
I have this dataset:
Date ID Tweet Note
01/20/2020 4141 The cat is on the table I bought a table
01/20/2020 4142 The sky is blue Once upon a time
01/20/2020 53 What a wonderful day I have no words
I would like to select rows containing in Tweet or Note one of the following words:
w=["sky", "table"]
To do this, I am using the following:
def part_is_in(x, values):
    output = False
    for val in values:
        if val in str(x):
            return True
    return output
def fun_1(filename):
    w = ["sky", "table"]
    filename['Logic'] = filename[['Tweet','Note']].apply(part_is_in, values=w)
    filename['Low_Tweet'] = filename['Tweet']
    filename['Low_Note'] = filename['Note']
    lower_cols = [col for col in filename if col not in ['Tweet','Note']]
    filename[lower_cols] = filename[lower_cols].apply(lambda x: x.astype(str).str.lower(), axis=1)
    # NEW COLUMN
    filename['Logic'] = pd.Series(index=filename.index, dtype='object')
    filename['TF'] = pd.Series(index=filename.index, dtype='object')
    for index, row in filename.iterrows():
        value = row['ID']
        if any(x in str(value) for x in w):
            filename.at[index, 'Logic'] = True
        else:
            filename.at[index, 'Logic'] = False
            filename.at[index, 'TF'] = False
    for index, row in filename.iterrows():
        value = row['Tweet']
        if any(x in str(value) for x in w):
            filename.at[index, 'Logic'] = True
        else:
            filename.at[index, 'Logic'] = False
            filename.at[index, 'TF'] = False
    for index, row in filename.iterrows():
        value = row['Note']
        if any(x in str(value) for x in w):
            filename.at[index, 'Logic'] = True
        else:
            filename.at[index, 'Logic'] = False
            filename.at[index, 'TF'] = False
    return filename
What it should do is find rows having at least one of the words in the list above (w) and assign a value:
if Tweet or Note contains the word, then assign True, else False.
My expected output would be:
Date ID Tweet Note Logic TF
01/20/2020 4141 The cat is on the table I bought a table True False
01/20/2020 4142 The sky is blue Once upon a time True False
01/20/2020 53 What a wonderful day I have no words False False
Manually checking, I found that some words are not correctly assigned. What could be wrong in my code?
I'm new to pandas as well, so you'll need to take this answer with a grain of salt. I got the impression from the tutorial that if you are iterating through the DataFrame, you are not using pandas as it was intended.
To that end, I'll point this out:
df['Logic'] = df['Tweet'].str.contains('table')
df['Logic'] |= df['Tweet'].str.contains('sky')
Yields:
Date ID Tweet Note Logic
0 1/20/20 4141 The cat is on the table I bought a table True
1 1/20/20 4142 The sky is blue Once upon a time True
2 1/20/20 53 What a wonderful day I have no words False
As I understand it, if the keywords are in those specific columns then Logic is True; otherwise TF is False and Logic is False. I couldn't figure out when TF is False and Logic is True, so I'm not sure if this helps, but
pattern = '|'.join(w)
df['Logic'] = df.Tweet.str.contains(pattern) | df.Note.str.contains(pattern)
this code may help you avoid apply.
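One caveat, assuming the real data may contain mixed case or missing values: str.contains is case-sensitive by default and propagates NaN, so passing case=False and na=False keeps the result a clean boolean. A small sketch with made-up rows:

```python
import pandas as pd

w = ["sky", "table"]
pattern = '|'.join(w)

df = pd.DataFrame({"Tweet": ["The SKY is blue", None, "no match here"]})

# case=False matches regardless of capitalisation; na=False maps NaN to False
df["Logic"] = df["Tweet"].str.contains(pattern, case=False, na=False)
print(df["Logic"].tolist())  # [True, False, False]
```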
I am trying to match the names in two columns of the same dataframe. I want to create a function that returns True if the name in one column is an acronym of the other, even if they contain the same acronym as a substring.
pd.DataFrame([['Global Workers Company gwc', 'gwc'], ['YTU', 'your team united']] , columns=['Name1','Name2'])
Desired Output:
Name1 Name2 Match
0 Global Workers Company gwc gwc True
1 YTU your team united True
I have tried creating a lambda function to extract just the acronym but haven't been able to do so
t = 'Global Workers Company gwc'
[x[0] for x in t.split()]
['G', 'W', 'C', 'g']
"".join(word[0][0] for word in test1.Name2.str.split()).upper()
You can use the DataFrame.apply function along with the axis=1 parameter to apply a custom function row-wise on the dataframe. Then you can use regular expressions to compare the acronym with the corresponding longer name or phrase.
Try this:
import re
def func(x):
    s1 = x["Name1"]
    s2 = x["Name2"]
    acronym = s1 if len(s1) < len(s2) else s2
    fullform = s2 if len(s1) < len(s2) else s1
    fmtstr = ""
    for a in acronym:
        fmtstr += (r"\b" + a + r".*?\b")
    return bool(re.search(fmtstr, fullform, flags=re.IGNORECASE))
df["Match"] = df.apply(func, axis=1)
print(df)
Output:
Name1 Name2 Match
0 Global Workers Company gwc gwc True
1 YTU your team united True
I would use a mapper: a lookup dictionary that transforms the data to a common form we can check for equality.
import pandas as pd
#data
df = pd.DataFrame([['Global Workers Company', 'gwc'], ['YTU', 'your team united']] , columns=['Name1','Name2'])
# create a mapper
mapper = {'gwc':'Global Workers Company',
'YTU': 'your team united'}
def replacer(value, mapper=mapper):
    '''Take a value and return its mapped form;
    if not found, return the original value.
    '''
    return mapper.get(value, value)
# create the checker column and keep the result (assign returns a new frame)
df = df.assign(
    checker=lambda column: column['Name1'].map(replacer) == column['Name2'].map(replacer)
)
print(df)
I have a pandas dataframe where a column contains tuples:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "CC.You")]})
I'd like to split each string in the tuple on the punctuation '.', take the second part of the split string and see how many match a list of strings:
p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: [True if len(set(x).intersection(set(["Hi", "My"])))>0 else False])
This works as intended, but my dataframe has more than 100k rows, and apply doesn't seem very efficient at these sizes. Is there a way to optimize/vectorize the above code?
Use a nested list and set comprehension, and convert the sets to bools for the test: an empty set converts to False.
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, CC.You) False
EDIT:
If the values have only 1 or 2 parts after the split, you can select the last value by indexing with [-1]:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "You")]})
print (p)
sentence
0 (A.Hi, B.My, C.Friend)
1 (AA.How, BB.Are, You)
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, You) False
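Another option worth benchmarking on the 100k rows is to explode the tuples and push the split through pandas string methods; whether it beats the comprehension depends on the data, so treat this as a sketch:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi", "B.My", "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})
s = {"Hi", "My"}

# explode keeps the original index, giving one string per row
exploded = p["sentence"].explode()

# take the part after the last '.', test set membership,
# then aggregate back to one boolean per original row
hit = exploded.str.split(".").str[-1].isin(s)
p["tmp"] = hit.groupby(level=0).any()
print(p)
```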