How to find whether any two IDs co-exist in a Pandas column - python

I have a DataFrame like the one below:
IDS  Metric
1,2  100
1,3  200
3    300
...
I want to find out whether any two IDs exist in the same row; for example, "1" and "2" appear together in one row, and so do "1" and "3", but "2" and "3" never share a row (meaning there is no direct business relationship, i.e. no competition, between them).
I want a function that, for any two given IDs, judges whether they co-exist in some row and returns True/False.
Just for the co-existence check itself, I think the following could work:
target_list = ['1', '2']
df["IDS"].apply(lambda ids: all(id in ids for id in target_list)).any()
# return True
target_list = ['2', '3']
df["IDS"].apply(lambda ids: all(id in ids for id in target_list)).any()
# return False
However, the lambda is applied to every row of df, which may be inefficient because I only need to know whether such a row exists.
Ideally it should return as soon as the first co-occurrence is found.
Could anyone help me with that? Thanks a lot.

Use:
df["IDS"].str.split(',', expand=True).isin(target_list).all(axis=1).any()
Another idea with sets:
target_list = ['1', '2']
s = set(target_list)
a = any(s.issubset(x.split(',')) for x in df["IDS"])
print (a)
True
Details:
print (df["IDS"].str.split(',', expand=True))
0 1
0 1 2
1 1 3
2 3 None
print (df["IDS"].str.split(',', expand=True).isin(target_list))
0 1
0 True True
1 True False
2 False False
print (df["IDS"].str.split(',', expand=True).isin(target_list).all(axis=1))
0 True
1 False
2 False
dtype: bool
print (df["IDS"].str.split(',', expand=True).isin(target_list).all(axis=1).any())
True
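Since the generator passed to any() stops at the first match, the set-based idea above already has the early-exit behaviour the question asks for. A minimal sketch wrapping it into a reusable check (the function name ids_coexist is just illustrative):
def ids_coexist(df, target_list):
    # returns True as soon as one row contains every ID in target_list
    s = set(target_list)
    return any(s.issubset(x.split(',')) for x in df["IDS"])

print(ids_coexist(df, ['1', '2']))   # True
print(ids_coexist(df, ['2', '3']))   # False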

Related

How can I classify a column of strings with true and false values by comparing with another column of strings

So I have a column of strings that is listed as "compounds"
Composition (column title)
ZrMo3
Gd(CuS)3
Ba2DyInTe5
I have another column that contains strings of metal elements from the periodic table, and I'll call that column "metals"
Elements (column title)
Li
Be
Na
The objective is to check each string from "compounds" against every string listed in "metals"; if any string from "metals" is present, the row should be classified as True. Any ideas how I can code this?
Example: (if "metals" has Zr, Ag, and Te)
ZrMo3 True
Gd(CuS)3 False
Ba2DyInTe5 True
I recently tried the code below, but I ended up getting all False:
asd = subset['composition'].isin(metals['Elements'])
print(asd)
I also tried this code and got all False as well:
subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
Assuming you are using pandas, you can use a list comprehension inside your lambda, since you essentially need to iterate over all elements in the elements list:
import pandas as pd
elements = ['Li', 'Be', 'Na', 'Te']
compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']
df = pd.DataFrame(compounds, columns=['compounds'])
print(df)
output
compounds
0 ZrMo3
1 Gd(CuS)3
2 Ba2DyInTe5
df['boolean'] = df.compounds.apply(lambda x: any([True if el in x else False for el in elements]))
print(df)
output
compounds boolean
0 ZrMo3 False
1 Gd(CuS)3 False
2 Ba2DyInTe5 True
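The inner True if ... else False is equivalent to the membership test itself, so the same check can be written a bit more compactly; this is only a stylistic variant of the line above, not a change in behavior:
df['boolean'] = df.compounds.apply(lambda x: any(el in x for el in elements))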
If you are not using pandas, you can apply the lambda function to the lists with the map function:
out = list(
    map(
        lambda x: any([True if el in x else False for el in elements]), compounds)
)
print(out)
output
[False, False, True]
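If you do stay in pandas, a vectorized variant of the same substring check is str.contains with a joined pattern (a sketch using the elements list from above; note that naive substring matching can still give false positives, e.g. an element 'S' would match the 'S' in 'SiO2', which the regex-based version below addresses):
df['boolean'] = df.compounds.str.contains('|'.join(elements))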
Here is a more complex version that also handles the potential errors #Ezon mentioned, based on the regular-expression module re. Since this approach loops not only over the elements to compare against a single compound string but also over each constituent of the compound, I made two helper functions to keep it readable.
import re
import pandas as pd

def split_compounds(c):
    # remove all non-alphabetic characters
    c_split = re.sub(r"[^a-zA-Z]", "", c)
    # split string at capital letters
    c_split = '-'.join(re.findall('[A-Z][^A-Z]*', c_split))
    return c_split

def compare_compound(compound, element):
    # split compound into a list of elements
    compound_list = compound.split('-')
    return any([element == c for c in compound_list])
# build sample data
compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']
df = pd.DataFrame(compounds, columns=['compounds'])
# split compounds into elements
df['compounds_elements'] = [split_compounds(x) for x in compounds]
print(df)
output
compounds compounds_elements
0 SiO2 Si-O
1 Ba2DyInTe5 Ba-Dy-In-Te
2 ZrMo3 Zr-Mo
3 Gd(CuS)3 Gd-Cu-S
# check if any item from 'elements' is in the compounds
df['boolean'] = df.compounds_elements.apply(
    lambda x: any([True if compare_compound(x, el) else False for el in elements])
)
print(df)
output
compounds compounds_elements boolean
0 SiO2 Si-O False
1 Ba2DyInTe5 Ba-Dy-In-Te True
2 ZrMo3 Zr-Mo False
3 Gd(CuS)3 Gd-Cu-S True
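For reference, a quick sanity check of the two helpers on single compounds (using the definitions above):
print(split_compounds('Gd(CuS)3'))         # Gd-Cu-S
print(compare_compound('Gd-Cu-S', 'S'))    # True  (exact element match)
print(compare_compound('Si-O', 'S'))       # False (no partial match against 'Si')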

Assign a value within a new column if a condition is satisfied

I have this dataset:
Date        ID    Tweet                    Note
01/20/2020  4141  The cat is on the table  I bought a table
01/20/2020  4142  The sky is blue          Once upon a time
01/20/2020  53    What a wonderful day     I have no words
I would like to select rows that contain, in Tweet or Note, one of the following words:
w=["sky", "table"]
To do this, I am using the following:
def part_is_in(x, values):
    output = False
    for val in values:
        if val in str(x):
            return True
            break
    return output

def fun_1(filename):
    w=["sky", "table"]
    filename['Logic'] = filename[['Tweet','Note']].apply(part_is_in, values=w)
    filename['Low_Tweet']=filename['Tweet']
    filename['Low_ Note']=filename['Note']
    lower_cols = [col for col in filename if col not in ['Tweet','Note']]
    filename[lower_cols]= filename[lower_cols].apply(lambda x: x.astype(str).str.lower(),axis=1)
    # NEW COLUMN
    filename['Logic'] = pd.Series(index = filename.index, dtype='object')
    filename['TF'] = pd.Series(index = filename.index, dtype='object')
    for index, row in filename.iterrows():
        value = row['ID']
        if any(x in str(value) for x in w):
            filename.at[index,'Logic'] = True
        else:
            filename.at[index,'Logic'] = False
            filename.at[index,'TF'] = False
    for index, row in filename.iterrows():
        value = row['Tweet']
        if any(x in str(value) for x in w):
            filename.at[index,'Logic'] = True
        else:
            filename.at[index,'Logic'] = False
            filename.at[index,'TF'] = False
    for index, row in filename.iterrows():
        value = row['Note']
        if any(x in str(value) for x in w):
            filename.at[index,'Logic'] = True
        else:
            filename.at[index,'Logic'] = False
            filename.at[index,'TF'] = False
    return(filename)
What it should do is find the rows containing at least one of the words in the list above (w) and assign a value:
if Tweet or Note contains the word, then assign True, else False.
My expected output would be:
Date        ID    Tweet                    Note              Logic  TF
01/20/2020  4141  The cat is on the table  I bought a table  True   False
01/20/2020  4142  The sky is blue          Once upon a time  True   False
01/20/2020  53    What a wonderful day     I have no words   False  False
Manually checking, I found that some words are not correctly assigned. What could be wrong in my code?
I'm new to pandas as well, so you'll need to take this answer with a grain of salt. I got the impression from the tutorial that if you are iterating through the DataFrame, you are not using pandas as it was intended.
To that end, I'll point this out:
df['Logic'] = df['Tweet'].str.contains('table')
df['Logic'] |= df['Tweet'].str.contains('sky')
Yields:
      Date    ID                    Tweet              Note  Logic
0  1/20/20  4141  The cat is on the table  I bought a table   True
1  1/20/20  4142          The sky is blue  Once upon a time   True
2  1/20/20    53     What a wonderful day   I have no words  False
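The question also asks to search the Note column and the full word list w; the same pattern extends with a loop (a sketch building on the two lines above):
df['Logic'] = False
for word in w:
    df['Logic'] |= df['Tweet'].str.contains(word) | df['Note'].str.contains(word)
df['TF'] = False  # TF is always False in the expected output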
As I understand it, if the keywords are in those specific columns then Logic is True, otherwise TF is False and Logic is False. I couldn't figure out when TF is False while Logic is True, so I'm not sure if this helps, but:
pattern = '|'.join(w)
df['Logic'] = df.Tweet.str.contains(pattern) | df.Note.str.contains(pattern)
This code may help you avoid apply.
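If the text columns may contain missing values or mixed case, the same one-liner can be hardened a little; the case=False and na=False arguments are optional safeguards I am assuming could be useful here, not part of the original answer:
pattern = '|'.join(w)
df['Logic'] = (df['Tweet'].str.contains(pattern, case=False, na=False)
               | df['Note'].str.contains(pattern, case=False, na=False))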

Splitting strings in tuples within a pandas dataframe column

I have a pandas dataframe where a column contains tuples:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "CC.You")]})
I'd like to split each string in the tuple on the punctuation ".", take the second part of each split string, and see whether any of them match a list of strings:
p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: [True if len(set(x).intersection(set(["Hi", "My"])))>0 else False])
This works as intended, but my dataframe has more than 100k rows - and apply doesn't seem very efficient at these sizes. Is there a way to optimize/vectorize the above code?
Use a nested list and set comprehension, and for the test convert the sets to bools - an empty set returns False:
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, CC.You) False
EDIT:
If splitting produces 1 or 2 values, you can select the last value by indexing with [-1]:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "You")]})
print (p)
sentence
0 (A.Hi, B.My, C.Friend)
1 (AA.How, BB.Are, You)
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, You) False
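An explode-based alternative is also possible on newer pandas versions (a sketch, assuming pandas >= 0.25 for Series.explode and using the set s from above; it produces the same tmp column):
exploded = p["sentence"].explode().str.split(".").str[-1]
p["tmp"] = exploded.isin(s).groupby(level=0).any()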

Boolean comparison of a list of substrings with a list of strings

I have researched, but found no answer to the question below.
How can I do a boolean comparison for a list of substrings in a list of strings?
Below is the code:
string = {'strings_1': ['AEAB', 'AC', 'AI'],
          'strings_2': ['BB', 'BA', 'AG'],
          'strings_3': ['AABD', 'DD', 'PP'],
          'strings_4': ['AV', 'AB', 'BV']}
df_string = pd.DataFrame(data = string)
substring_list = ['AA', 'AE']

for row in df_string.itertuples(index = False):
    combine_row_str = [row[0], row[1], row[2]]
    # below is the main operation
    print(all(substring in row_str for substring in substring_list for row_str in combine_row_str))
The output I get is:
False
False
False
The output I want is:
True
False
False
Here's one way using pd.DataFrame.sum and a list comprehension:
df = pd.DataFrame(data=string)
lst = ['AA', 'AE']
df['test'] = [all(val in i for val in lst) for i in df.sum(axis=1)]
print(df)
strings_1 strings_2 strings_3 strings_4 test
0 AEAB BB AABD AV True
1 AC BA DD AB False
2 AI AG PP BV False
Since you are using pandas, you can invoke apply row-wise and use str.contains with a regex to find whether the strings match. The first step is to find if any of the values match the strings in substring_list:
df_string.apply(lambda x: x.str.contains('|'.join(substring_list)), axis=1)
this returns:
strings_1 strings_2 strings_3 strings_4
0 True False True False
1 False False False False
2 False False False False
Now, what is not clear is whether you want to return True only if both substrings are present within a row, or if either of them is enough. If either is enough, you can simply add any() after the contains() method:
df_string.apply(lambda x: x.str.contains('|'.join(substring_list)).any(), axis=1)
this returns:
0 True
1 False
2 False
dtype: bool
For the second case, jpp provides a one-line solution that concatenates the row elements into one string, but please note it will not work for corner cases where a row contains, say, "BBA" and "ABB" and you try to match "AA": the concatenated string "BBAABB" will still match "AA", which is wrong. I would like to propose a solution with apply and an extra function, so that the code is more readable:
def areAllPresent(vals, patterns):
    result = []
    for pat in patterns:
        result.append(any([pat in val for val in vals]))
    return all(result)

df_string.apply(lambda x: areAllPresent(x.values, substring_list), axis=1)
With your sample dataframe it still returns the same result, but it also works for cases where matching both substrings is required:
0 True
1 False
2 False
dtype: bool
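To make the corner case above concrete, here is a tiny illustration (hypothetical row data):
row = ['BBA', 'ABB']
print('AA' in ''.join(row))               # True  - false positive from concatenation
print(any('AA' in val for val in row))    # False - per-element check is correct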

Data check routine isn't working

I have a file containing a 5 x 7 table.
I want a validation check that there is either a 5, 7, or 9, but none of them repeated, i.e. there must be only one occurrence of each of these numbers. 5 and 7 are required, 9 is optional, and the remaining three columns can be 0. I have written the code below, but it doesn't work. I also want to store the valid rows in a separate list.
My attempt at the program in Python is as follows:
def validation():
    numlist = open("scores.txt","r")
    invalidnum = 0
    for line in numlist:
        x = line.count("0")
        inv1 = line.count("1")
        inv2 = line.count("2")
        inv3 = line.count("3")
        if x > 2 or inv1 > 1 or inv2 > 1 or inv3 > 1 or line not in ("0","5","7","9"):
            invalidnum = invalidnum + 1
            print(invalidnum, "Invalid numbers found")
        else:
            print("All numbers are valid in the list")
I would appreciate it if someone could help me with this.
Here is an example that uses a set:
lolwat = []
for line in open('scores.txt'):
    numbers = set(line.split(','))
    if '5' in numbers and '7' in numbers:
        print('okay 5,7')
    elif '9' in numbers:
        print('okay 9')
    lolwat.append(numbers)

do_stuff_with(lolwat)
The set de-duplicates the numbers, ensuring that each of 5, 7, and 9 occurs only once.
You need to learn how to break a problem like that down into smaller pieces:
E.g., You want to check each row:
for row in open('scores.txt'):
    check_row(row)

def check_row(row):
    ...
You want to save good rows to a list:
good_rows = []
for row in ...:
    if check_row(row): good_rows.append(row)
A good row contains exactly one '5':
def check_row(row):
    number_of_fives = count_number_of(row, '5')
    if number_of_fives != 1:
        return False
    ...
    return True

def count_number_of(row, digit):
    ...
And so on.
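Putting those pieces together, one possible sketch (assuming comma-separated rows, as in the answer above; the validity rules - exactly one 5, exactly one 7, at most one 9 - are taken from the question):
def count_number_of(row, digit):
    # count how many fields in the comma-separated row equal the digit
    return row.strip().split(',').count(digit)

def check_row(row):
    # 5 and 7 are required exactly once; 9 is optional but must not repeat
    return (count_number_of(row, '5') == 1
            and count_number_of(row, '7') == 1
            and count_number_of(row, '9') <= 1)

good_rows = [row for row in open('scores.txt') if check_row(row)]
print(len(good_rows), "valid rows")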
