I have this dataset:
Date ID Tweet Note
01/20/2020 4141 The cat is on the table I bought a table
01/20/2020 4142 The sky is blue Once upon a time
01/20/2020 53 What a wonderful day I have no words
I would like to select rows where Tweet or Note contains one of the following words:
w=["sky", "table"]
To do this, I am using the following:
def part_is_in(x, values):
    output = False
    for val in values:
        if val in str(x):
            return True
            break
    return output
def fun_1(filename):
    w=["sky", "table"]
    filename['Logic'] = filename[['Tweet','Note']].apply(part_is_in, values=w)
    filename['Low_Tweet']=filename['Tweet']
    filename['Low_Note']=filename['Note']
    lower_cols = [col for col in filename if col not in ['Tweet','Note']]
    filename[lower_cols]= filename[lower_cols].apply(lambda x: x.astype(str).str.lower(),axis=1)
    # NEW COLUMN
    filename['Logic'] = pd.Series(index = filename.index, dtype='object')
    filename['TF'] = pd.Series(index = filename.index, dtype='object')
    for index, row in filename.iterrows():
        value = row['ID']
        if any(x in str(value) for x in w):
            filename.at[index,'Logic'] = True
        else:
            filename.at[index,'Logic'] = False
            filename.at[index,'TF'] = False
    for index, row in filename.iterrows():
        value = row['Tweet']
        if any(x in str(value) for x in w):
            filename.at[index,'Logic'] = True
        else:
            filename.at[index,'Logic'] = False
            filename.at[index,'TF'] = False
    for index, row in filename.iterrows():
        value = row['Note']
        if any(x in str(value) for x in w):
            filename.at[index,'Logic'] = True
        else:
            filename.at[index,'Logic'] = False
            filename.at[index,'TF'] = False
    return(filename)
What it should do is find rows that contain at least one of the words in the list above (w) and assign a value: if Tweet or Note contains one of the words, assign True, otherwise False.
My expected output would be:
Date ID Tweet Note Logic TF
01/20/2020 4141 The cat is on the table I bought a table True False
01/20/2020 4142 The sky is blue Once upon a time True False
01/20/2020 53 What a wonderful day I have no words False False
Manually checking, I found that some words are not correctly assigned. What could be wrong in my code?
I'm new to pandas as well, so you'll need to take this answer with a grain of salt. I got the impression from the tutorial that if you are iterating through the DataFrame, you are not using pandas as it was intended.
To that end, I'll point this out:
df['Logic'] = df['Tweet'].str.contains('table')
df['Logic'] |= df['Tweet'].str.contains('sky')
Yields:
Date ID Tweet Note Logic
0 1/20/20 4141 The cat is on the table I bought a table True
1 1/20/20 4142 The sky is blue Once upon a time True
2 1/20/20 53 What a wonderful day I have no words False
As I understand it, if the keywords are in those specific columns then Logic is True; otherwise TF is False and Logic is False. I couldn't figure out when TF would be False while Logic is True, so I'm not sure if this helps, but
pattern = '|'.join(w)
df['Logic'] = df.Tweet.str.contains(pattern) | df.Note.str.contains(pattern)
this code may help you avoid apply.
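As for why the original function misassigns some rows: each of the three iterrows loops sets Logic unconditionally, so whatever the Tweet loop found is overwritten by the Note loop (and the first loop checks ID, which never matches the word list). A minimal sketch combining the two suggestions above, assuming the frame is called df and that matching should be case-insensitive and NaN-safe (the TF column is simply initialised to False, as in the expected output):
import pandas as pd

w = ["sky", "table"]
pattern = '|'.join(w)

# True if Tweet OR Note contains any of the words, ignoring case and missing values
df['Logic'] = (df['Tweet'].str.contains(pattern, case=False, na=False)
               | df['Note'].str.contains(pattern, case=False, na=False))
df['TF'] = False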
Related
So I have a column of strings that is listed as "compounds"
Composition (column title)
ZrMo3
Gd(CuS)3
Ba2DyInTe5
I have another column that has strings of metal elements from the periodic table, and I'll call that column "metals"
Elements (column title)
Li
Be
Na
The objective is to check each string from "compounds" against every single string listed in "metals"; if any string from metals is present, it should be classified as True. Any ideas how I can code this?
Example: (if "metals" has Zr, Ag, and Te)
ZrMo3 True
Gd(CuS)3 False
Ba2DyInTe5 True
I recently tried using this code below, but I ended up getting all false
asd = subset['composition'].isin(metals['Elements'])
print(asd)
also tried this code and got all false as well
subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
Assuming you are using pandas, you can use a list comprehension inside your lambda, since you essentially need to iterate over all the elements in the elements list. (Your isin attempt returned all False because isin tests whole values for exact equality, not substring containment.)
import pandas as pd
elements = ['Li', 'Be', 'Na', 'Te']
compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']
df = pd.DataFrame(compounds, columns=['compounds'])
print(df)
output
compounds
0 ZrMo3
1 Gd(CuS)3
2 Ba2DyInTe5
df['boolean'] = df.compounds.apply(lambda x: any([True if el in x else False for el in elements]))
print(df)
output
compounds boolean
0 ZrMo3 False
1 Gd(CuS)3 False
2 Ba2DyInTe5 True
if you are not using pandas, you can apply the lambda function to the lists with the map function
out = list(
    map(lambda x: any([True if el in x else False for el in elements]), compounds)
)
print(out)
output
[False, False, True]
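As a side note, the True if el in x else False inside these comprehensions is redundant, since the membership test already yields a boolean; the same check can be written a little more compactly:
df['boolean'] = df.compounds.apply(lambda x: any(el in x for el in elements))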
Here is a more complex version which also addresses the potential errors #Ezon mentioned, based on the regular-expression matching module re. Since this approach essentially loops not only over the elements to compare against a single compound string, but also over each constituent of the compounds, I wrote two helper functions to keep it readable.
import re
import pandas as pd
def split_compounds(c):
    # remove all non-alphabet elements
    c_split = re.sub(r"[^a-zA-Z]", "", c)
    # split string at capital letters
    c_split = '-'.join(re.findall('[A-Z][^A-Z]*', c_split))
    return c_split

def compare_compound(compound, element):
    # split compound into list
    compound_list = compound.split('-')
    return any([element == c for c in compound_list])
# build sample data
compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']
df = pd.DataFrame(compounds, columns=['compounds'])
# split compounds into elements
df['compounds_elements'] = [split_compounds(x) for x in compounds]
print(df)
output
compounds compounds_elements
0 SiO2 Si-O
1 Ba2DyInTe5 Ba-Dy-In-Te
2 ZrMo3 Zr-Mo
3 Gd(CuS)3 Gd-Cu-S
# check if any item from 'elements' is in the compounds
df['boolean'] = df.compounds_elements.apply(
    lambda x: any([True if compare_compound(x, el) else False for el in elements])
)
print(df)
output
compounds compounds_elements boolean
0 SiO2 Si-O False
1 Ba2DyInTe5 Ba-Dy-In-Te True
2 ZrMo3 Zr-Mo False
3 Gd(CuS)3 Gd-Cu-S True
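A possibly shorter variant of the same idea, assuming that element symbols always consist of one capital letter optionally followed by one lowercase letter (which holds for the periodic table): extract the symbols with a regular expression and test for a set intersection.
import re
import pandas as pd

compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = {'Li', 'Be', 'Na', 'Te', 'S'}
df = pd.DataFrame(compounds, columns=['compounds'])

# '[A-Z][a-z]?' picks out each element symbol, e.g. 'Ba2DyInTe5' -> ['Ba', 'Dy', 'In', 'Te']
df['boolean'] = df.compounds.apply(
    lambda c: bool(set(re.findall(r'[A-Z][a-z]?', c)) & elements)
)
print(df)
output
compounds boolean
0 SiO2 False
1 Ba2DyInTe5 True
2 ZrMo3 False
3 Gd(CuS)3 True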
The following code will tell me if there are partial matches (via the True values in the final column):
import pandas as pd
x = {'Non-Suffix' : ['1234567', '1234568', '1234569', '1234554'], 'Suffix' : ['1234567:C', '1234568:VXCF', '1234569-01', '1234554-01:XC']}
x = pd.DataFrame(x)
x['"Non-Suffix" Partial Match in "Suffix"?'] = x.apply(lambda row: row['Non-Suffix'] in row['Suffix'], axis=1)
x
However, if I re-arrange the values in the second column, I'll get False values:
x = {'Non-Suffix' : ['1234567', '1234568', '1234569', '1234554'], 'Suffix' : ['1234568:VXCF', '1234567:C', '1234554-01:XC', '1234569-01']}
x = pd.DataFrame(x)
x['"Non-Suffix" Partial Match in "Suffix"?'] = x.apply(lambda row: row['Non-Suffix'] in row['Suffix'], axis=1)
x
Is there a way I can get the second block of code to find these partial matches even if they're not in the same row?
Also, instead of 'True/False' values, is there a way for me to have the value of 'Partial Match Exists!' instead of True, and 'Partial Match Does Not Exist!' instead of False?
You can join the Non-Suffix column values with | and then use Series.str.contains to check whether Suffix contains any of them:
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Suffix'].str.contains('|'.join(x['Non-Suffix']))
print(x)
Non-Suffix Suffix "Non-Suffix" Partial Match in "Suffix"?
0 1234567 1234568:VXCF True
1 1234568 1234567:C True
2 1234569 1234554-01:XC True
3 1234554 1234569-01 True
The solution above checks whether Suffix contains any of the Non-Suffix values; if you want to do the reverse (check whether each Non-Suffix value appears in any Suffix), you might do
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(lambda v: x['Suffix'].str.contains(v).any())
print(x)
Non-Suffix Suffix "Non-Suffix" Partial Match in "Suffix"?
0 879 1234568:VXCF False
1 1234568 1234567:C True
2 1234569 1234554-01:XC True
3 1234554 1234569-01 True
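To get the 'Partial Match Exists!' / 'Partial Match Does Not Exist!' labels asked for instead of booleans, one option is to map the boolean column built above onto the desired strings (a small sketch reusing the same column name):
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['"Non-Suffix" Partial Match in "Suffix"?'].map(
    {True: 'Partial Match Exists!', False: 'Partial Match Does Not Exist!'}
)
print(x)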
I have a problem assigning a label depending on whether a condition is satisfied. Specifically, I would like to assign True (or 1) to rows which contain at least one of these words
my_list=["maths", "science", "geography", "statistics"]
in one of these fields:
path | Subject | Notes
and look for these websites webs=["www.stanford.edu", "www.ucl.ac.uk", "www.sorbonne-universite.fr"] in column web.
To do this I am using the following code:
def part_is_in(x, values):
    output = False
    for val in values:
        if val in str(x):
            return True
            break
    return output
def assign_value(filename):
    my_list=["maths", "", "science", "geography", "statistics"]
    filename['Label'] = filename[['path','subject','notes']].apply(part_is_in, values= my_list)
    filename['Low_Subject']=filename['Subject']
    filename['Low_Notes']=filename['Notes']
    lower_cols = [col for col in filename if col not in ['Subject','Notes']]
    filename[lower_cols]= filename[lower_cols].apply(lambda x: x.astype(str).str.lower(),axis=1)
    webs=["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]
    # NEW COLUMN
    filename['Label'] = pd.Series(index = filename.index, dtype='object')
    for index, row in filename.iterrows():
        value = row['web']
        if any(x in str(value) for x in webs):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    for index, row in filename.iterrows():
        value = row['Subject']
        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    for index, row in filename.iterrows():
        value = row['Notes']
        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    for index, row in filename.iterrows():
        value = row['path']
        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    return(filename)
My dataset is
web path Subject Notes
www.stanford.edu /maths/ NA NA
www.ucla.com /history/ History of Egypt NA
www.kcl.ac.uk /datascience/ Data Science 50 students
...
The expected output is:
web path Subject Notes Label
www.stanford.edu /maths/ NA NA 1 # contains the web and maths
www.ucla.com /history/ History of Egypt NA 0
www.kcl.ac.uk /datascience/ Data Science 50 students 1 # contains the word science
...
Using my code, I am getting all values False. Are you able to spot the issue?
The final values in Label are Booleans.
If you want ints, use df.Label = df.Label.astype(int)
test_words:
- fill all NaNs, which are float type, with '', which is str type
- convert all words to lowercase
- replace all '/' with ' '
- split on ' ' to make a list
- combine all the lists into a single set
- use set methods to determine whether the row contains a word from my_list
set.intersection:
- {'datascience'}.intersection({'science'}) returns an empty set, because there is no intersection.
- {'data', 'science'}.intersection({'science'}) returns {'science'}, because there is an intersection on that word.
lambda x: any(x in y for y in webs):
- for each value in webs, check whether web is in that value
- 'www.stanford.edu' in 'https://www.stanford.edu' is True
- evaluates to True if any are True.
import numpy as np
import pandas as pd

# test data and dataframe
data = {'web': ['www.stanford.edu', 'www.ucla.com', 'www.kcl.ac.uk'],
        'path': ['/maths/', '/history/', '/datascience/'],
        'Subject': [np.nan, 'History of Egypt', 'Data Science'],
        'Notes': [np.nan, np.nan, '50 students']}
df = pd.DataFrame(data)
# given my_list
my_list = ["maths", "science", "geography", "statistics"]
my_list = set(map(str.lower, my_list)) # convert to a set and verify words are lowercase
# given webs; all values should be lowercase
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]
# function to test for word content
def test_words(v: pd.Series) -> bool:
    v = v.fillna('').str.lower().str.replace('/', ' ').str.split(' ')  # replace na, lower case, convert to list
    s_set = {st for row in v for st in row if st}  # join all the values in the lists into one set
    return True if s_set.intersection(my_list) else False  # True if there is a word intersection between sets
# test for conditions in the word columns and the web column
df['Label'] = df[['path', 'Subject', 'Notes']].apply(test_words, axis=1) | df.web.apply(lambda x: any(x in y for y in webs))
# display(df)
web path Subject Notes Label
0 www.stanford.edu /maths/ NaN NaN True
1 www.ucla.com /history/ History of Egypt NaN False
2 www.kcl.ac.uk /datascience/ Data Science 50 students True
Notes Regarding Original Code
It's not a good idea to use iterrows multiple times. For a large dataset it will be very time-consuming and error prone.
It was easier to write a new function than to interpret the different code blocks for each column.
I have a DataFrame likes below:
IDS Metric
1,2 100
1,3 200
3 300
...
I want to find whether any two given IDs exist in the same row. For example, the pairs "1,2" and "1,3" each occur in a row, but "2,3" have no direct relationship (meaning no competition between them in business).
I want a function that judges, for any two IDs, whether they co-exist in some row and returns True/False.
Just for judging whether any two IDs co-exist, I think the following could work:
target_list = ['1', '2']
df["IDS"].apply(lambda ids: all(id in ids for id in target_list)).any()
# return True
target_list = ['2', '3']
df["IDS"].apply(lambda ids: all(id in ids for id in target_list)).any()
# return False
However, the lambda function iterates over every row in df, which may be inefficient, because I only need to know whether such a row exists.
I would like it to return as soon as the first co-occurrence is found.
Could anyone help me with that? Thanks a lot.
Use:
df["IDS"].str.split(',', expand=True).isin(target_list).all(axis=1).any()
Another idea with sets:
target_list = ['1', '2']
s = set(target_list)
a = any(s.issubset(x.split(',')) for x in df["IDS"])
print (a)
True
Details:
print (df["IDS"].str.split(',', expand=True))
0 1
0 1 2
1 1 3
2 3 None
print (df["IDS"].str.split(',', expand=True).isin(target_list))
0 1
0 True True
1 True False
2 False False
print (df["IDS"].str.split(',', expand=True).isin(target_list).all(axis=1))
0 True
1 False
2 False
dtype: bool
print (df["IDS"].str.split(',', expand=True).isin(target_list).all(axis=1).any())
True
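Regarding the wish to return as soon as the first co-occurrence is found: the set-based version already short-circuits, because any consumes a generator and stops at the first row that qualifies. If you also want the index of that first matching row, a similar sketch with next (assuming the same df and target_list) would be:
s = set(target_list)
first_match = next(
    (idx for idx, ids in df['IDS'].items() if s.issubset(ids.split(','))),
    None,  # default if no row matches
)
print(first_match)  # index of the first row containing both IDs, or None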
I have researched, but found no answer to the question below.
How can I do a boolean comparison for a list of substrings in a list of strings?
Below is the code:
string = {'strings_1': ['AEAB', 'AC', 'AI'],
          'strings_2': ['BB', 'BA', 'AG'],
          'strings_3': ['AABD', 'DD', 'PP'],
          'strings_4': ['AV', 'AB', 'BV']}
df_string = pd.DataFrame(data = string)
substring_list = ['AA', 'AE']

for row in df_string.itertuples(index = False):
    combine_row_str = [row[0], row[1], row[2]]
    # below is the main operation
    print(all(substring in row_str for substring in substring_list for row_str in combine_row_str))
The output I get is:
False
False
False
The output I want is:
True
False
False
Here's one way using pd.DataFrame.sum and a list comprehension:
df = pd.DataFrame(data=string)
lst = ['AA', 'AE']
df['test'] = [all(val in i for val in lst) for i in df.sum(axis=1)]
print(df)
strings_1 strings_2 strings_3 strings_4 test
0 AEAB BB AABD AV True
1 AC BA DD AB False
2 AI AG PP BV False
Since you are using pandas, you can invoke apply row-wise and use str.contains with a regex to find whether the strings match. The first step is to find whether any of the values match the strings in substring_list:
df_string.apply(lambda x: x.str.contains('|'.join(substring_list)), axis=1)
this returns:
strings_1 strings_2 strings_3 strings_4
0 True False True False
1 False False False False
2 False False False False
Now, what is not clear is whether you want to return True when both substrings are present within a row, or when either of them is. If either is enough, you can simply add any() after the contains() call:
df_string.apply(lambda x: x.str.contains('|'.join(substring_list)).any(), axis=1)
this returns:
0 True
1 False
2 False
dtype: bool
For the second case, jpp provides a one-line solution that concatenates the row elements into one string, but note that it will not work for corner cases where a row has, say, "BBA" and "ABB" and you try to match "AA": the concatenated string "BBAABB" still matches "AA", which is wrong. I would like to propose a solution with apply and an extra function, so that the code is more readable:
def areAllPresent(vals, patterns):
    result = []
    for pat in patterns:
        result.append(any([pat in val for val in vals]))
    return all(result)

df_string.apply(lambda x: areAllPresent(x.values, substring_list), axis=1)
With your sample dataframe it still returns the same result, but it also works for cases where matching both substrings is necessary:
0 True
1 False
2 False
dtype: bool
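Another way to express the "all substrings present somewhere in the row" requirement, without concatenating the row into one string (so it also avoids the corner case described above), is to run a vectorised contains check per pattern and then combine the per-pattern results; a sketch assuming the df_string and substring_list from the question:
import pandas as pd

# one boolean Series per pattern: does ANY column in the row contain it?
per_pattern = [
    df_string.apply(lambda col: col.str.contains(pat, regex=False)).any(axis=1)
    for pat in substring_list
]
# require that every pattern was found somewhere in the row
df_string['test'] = pd.concat(per_pattern, axis=1).all(axis=1)
print(df_string['test'])
0 True
1 False
2 False
Name: test, dtype: bool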