I have the following DataFrame a:
a = pd.DataFrame([[1, 'bayern'], [2, 'bayern_leverkusen'], [3, 'Chelsea'],
                  [4, 'manunited'], [5, 'westhamunited'], [6, 'mancity']],
                 columns=['no', 'club'])
I would like to compare every value in column club with all the other values in club and select only those rows where there is a match of 4 or more consecutive characters.
For example, bayern and bayern_leverkusen should be selected because they contain the same substring bayern. Similarly, manunited and westhamunited should be selected because they contain the same substring united.
mancity should not be selected because its matching substring man is only 3 characters long.
Expected Output:
   no               club
0   1             bayern
1   2  bayern_leverkusen
3   4          manunited
4   5      westhamunited
import itertools
import pandas as pd

selector = pd.Series(False, index=a.index)
for first_index, second_index in itertools.combinations(a.index, 2):
    club1 = a['club'][first_index]
    club2 = a['club'][second_index]
    # slide a window of four characters over club1 and look for it in club2
    for start in range(len(club1) - 3):
        if club1[start:start + 4] in club2:
            selector[first_index] = True
            selector[second_index] = True
            break
new_df = a.loc[selector]
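As a hedged alternative sketch, the standard-library difflib.SequenceMatcher can find the longest common substring between two names directly, which avoids the manual sliding window (same DataFrame a as above; the threshold of 4 mirrors the question):
import itertools
from difflib import SequenceMatcher

selector = pd.Series(False, index=a.index)
for i, j in itertools.combinations(a.index, 2):
    club1, club2 = a['club'][i], a['club'][j]
    # find_longest_match returns the longest block common to both strings
    match = SequenceMatcher(None, club1, club2).find_longest_match(
        0, len(club1), 0, len(club2))
    if match.size >= 4:
        selector[i] = True
        selector[j] = True
print(a.loc[selector])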
I want to compare two columns. If they are the same, the result should be True; if not, False, like this:
filtering        lemmatization   check
[hello, world]   [hello, world]  True
[grape, durian]  [apple, grape]  False
The output from my code is all False, but the data is not actually all different. Why?
You can get my data on GitHub.
import pandas as pd
dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])
There is a difference between the columns: in one column the quotes around the strings are missing, so its values are plain strings rather than list literals. A possible solution is to convert both columns to lists and compare them with Series.eq (which works like ==):
import ast

dc = pd.read_excel('data clean (spaCy).xlsx')
# strip the surrounding [] and split on ', ' (no quotes around the items in this column)
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
# this column has string delimiters, so ast.literal_eval works
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)
# element-wise comparison
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print(dc.head())
label filtering \
0 2 [ppkm, ya]
1 2 [mohon, informasi, pgs, pasar, turi, ppkm, buk...
2 2 [rumah, ppkm]
3 1 [pangkal, penanganan, pandemi, indonesia, terk...
4 1 [ppkm, mikro, anjing]
lemmatization check
0 [ppkm, ya] True
1 [mohon, informasi, pgs, pasar, turi, ppkm, buk... True
2 [rumah, ppkm] True
3 [pangkal, tangan, pandemi, indonesia, kesan, s... False
4 [ppkm, mikro, anjing] True
The reason for all False is that Series.equals returns a single scalar for the whole Series comparison, and here that scalar is False.
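A minimal sketch of the difference, on toy data rather than the question's file:
import pandas as pd

s1 = pd.Series([['hello', 'world'], ['grape', 'durian']])
s2 = pd.Series([['hello', 'world'], ['apple', 'grape']])
print(s1.equals(s2))  # False - a single scalar for the whole Series
print(s1.eq(s2))      # element-wise: True, False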
I am trying to categorize a dataset based on the strings that contain the names of the different objects of the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category']; the Category and Sub_Category columns are empty.
For each row I would like to check, against different lists of words, whether the name of the object contains at least one word from one of the lists, and based on this first check attribute a value to the Category column. If it finds words in 2 different lists, I would like to attribute both values to the object in the Category column.
Moreover, I would like to be able to identify which word was matched in which list, in order to attribute a value to the Sub_Category column.
Until now I have been able to do it with only one list, but I am not able to identify which word was matched, and the code is very slow to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd

df = pd.DataFrame()
df['Name'] = ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex',
              'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name": ['vitrine murale vintage', 'commode ancienne', 'lustre antique',
                            'solex', 'sculpture médievale', 'jante voiture',
                            'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

# put the categories into a dataframe
dfcat = pd.DataFrame([{"category": "furniture", "values": furniture_check},
                      {"category": "vehicle", "values": vehicle_check},
                      {"category": "art", "values": art_check}])

# turn the space-delimited "name" column into a list of words
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             # explode the list so it can be used in a join; reset_index() keeps a copy of the original index
             .explode("name").reset_index()
             # merge the exploded names against the exploded category values
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # where there are multiple categories, collect them into a list
             .groupby("index", as_index=False).agg({"category": lambda s: list(s)})
             # put the original index back...
             .set_index("index")
             )

# a simple join gives the names and the list of associated categories
df.join(dfcatlist)
                     name       category
0  vitrine murale vintage            NaN
1        commode ancienne  ['furniture']
2          lustre antique            NaN
3                   solex    ['vehicle']
4     sculpture médievale            NaN
5           jante voiture    ['vehicle']
6          lit et matelas  ['furniture']
7          turbine moteur            NaN
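To also record which word matched (for the Sub_Category column the question mentions), a hedged extension of the same explode-and-merge idea is to aggregate the matched values alongside the categories; sub_category is an assumed column name:
dfsub = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
         .explode("name").reset_index()
         .merge(dfcat.explode("values"), left_on="name", right_on="values")
         # collect both the categories and the matched words per original row
         .groupby("index", as_index=False).agg({"category": list, "values": list})
         .set_index("index")
         .rename(columns={"values": "sub_category"}))  # assumed name for the matched words

df.join(dfsub)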
New to Python; I want to ask a quick question on how to replace multiple characters simultaneously, given that the entries may have different data types. I just want to change the strings and keep everything else as it is:
import pandas as pd

def test_me(text):
    replacements = [("ID", ""), ("u", "a")]
    return [text.replace(a, b) for a, b in replacements if type(text) == str]

cars = {'Brand': ['HonduIDCivic', 1, 3.2, 'CarIDA4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
df['Brand'] = df['Brand'].apply(test_me)
resulting in
Brand Price
0 [HonduCivic, HondaIDCivic] 22000
1 [] 25000
2 [] 27000
3 [CarA4, CarIDA4] 35000
rather than
Brand Price
0 HondaCivic 22000
1 1 25000
2 3.2 27000
3 CarA4 35000
Appreciate any suggestions!
If the replacements never have identical search phrases, it will be easier to convert the list of tuples into a dictionary and then use re.sub with a lambda replacement:
import re
#...

def test_me(text):
    replacements = dict([("ID", ""), ("u", "a")])
    if type(text) == str:
        pattern = "|".join(sorted(map(re.escape, replacements.keys()), key=len, reverse=True))
        return re.sub(pattern, lambda x: replacements[x.group()], text)
    else:
        return text
The "|".join(sorted(map(re.escape, replacements.keys()),key=len,reverse=True)) part will create a regular expression out of re.escaped dictionary keys starting with the longest so as to avoid issues when handling nested search phrases that share the same prefix.
Pandas test:
>>> df['Brand'].apply(test_me)
0 HondaCivic
1 1
2 3.2
3 CarA4
Name: Brand, dtype: object
I'm wondering if there's a more general way to do the below. Specifically, is there a way to create the st function so that I can search a non-predefined number of strings?
So for instance, being able to create a generalized st function, and then type st('Governor', 'Virginia', 'Google').
Here's my current function, but it predefines the number of words you can use (df is a pandas DataFrame):
def search(word1, word2, word3, df):
    """
    Allows you to search an intersection of three terms.
    """
    return df[df.Name.str.contains(word1) & df.Name.str.contains(word2) & df.Name.str.contains(word3)]

search('Governor', 'Virginia', 'Google', newauthdf)
You could use np.logical_and.reduce:
import pandas as pd
import numpy as np
def search(df, *words):  # 1
    """
    Return a sub-DataFrame of those rows whose Name column matches all the words.
    """
    return df[np.logical_and.reduce([df['Name'].str.contains(word) for word in words])]  # 2

df = pd.DataFrame({'Name': ['Virginia Google Governor',
                            'Governor Virginia',
                            'Governor Virginia Google']})
print(search(df, 'Governor', 'Virginia', 'Google'))
prints
Name
0 Virginia Google Governor
2 Governor Virginia Google
1. The * in def search(df, *words) allows search to accept an unlimited number of positional arguments. It collects all the arguments after the first and places them in a tuple called words.
2. np.logical_and.reduce([X, Y, Z]) is equivalent to X & Y & Z, but it can handle an arbitrarily long list of masks.
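A tiny check of that equivalence, with toy masks assumed for illustration:
import numpy as np

X = np.array([True, True, False])
Y = np.array([True, False, True])
Z = np.array([True, True, True])
# reduce chains & across the whole list, however long it is
assert (np.logical_and.reduce([X, Y, Z]) == (X & Y & Z)).all()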
str.contains can take a regex, so you can use '|'.join(words) as the pattern; to be safe, map with re.escape as well:
>>> df
Name
0 Test
1 Virginia
2 Google
3 Google in Virginia
4 Apple
[5 rows x 1 columns]
>>> words = ['Governor', 'Virginia', 'Google']
'|'.join(map(re.escape, words)) would be the search pattern:
>>> import re
>>> pat = '|'.join(map(re.escape, words))
>>> df.Name.str.contains(pat)
0 False
1 True
2 True
3 True
4 False
Name: Name, dtype: bool
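Note that '|' gives any-of matching. If you instead need rows containing all of the words, as the original question asked, one hedged variant is a pattern built from lookaheads (df and words as above):
import re

pat_all = ''.join('(?=.*{})'.format(re.escape(w)) for w in words)
df[df.Name.str.contains(pat_all)]  # keeps only rows matching every word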
I have the following DataFrame, and I have an input list of values.
I want to match each item from the input list against the Symbol and Synonyms columns in the DataFrame and extract only those rows where the input value appears in either the Symbol column or the Synonyms column (please note that here the values are separated by the '|' symbol).
In the output DataFrame I need an additional column Input_symbol which denotes the matching value, so the desired output should look like the image below.
How can I do the same?
IIUIC, use
In [346]: df[df.Synonyms.str.contains('|'.join(mylist))]
Out[346]:
Symbol Synonyms
0 A1BG A1B|ABG|GAB|HYST2477
1 A2M A2MD|CPAMD5|FWP007|S863-7
2 A2MP1 A2MP
6 SERPINA3 AACT|ACT|GIG24|GIG25
Check both columns with str.contains, chain the conditions with | (or), and last filter by boolean indexing:
mylist = ['GAB', 'A2M', 'GIG24']
m1 = df.Synonyms.str.contains('|'.join(mylist))
m2 = df.Symbol.str.contains('|'.join(mylist))
df = df[m1 | m2]
Another solution is to np.logical_or.reduce all the masks created by a list comprehension:
import numpy as np

masks = [df[x].str.contains('|'.join(mylist)) for x in ['Symbol', 'Synonyms']]
m = np.logical_or.reduce(masks)
Or use apply, then DataFrame.any to check for at least one True per row:
m = df[['Symbol', 'Synonyms']].apply(lambda x: x.str.contains('|'.join(mylist))).any(axis=1)
df = df[m]
print (df)
Symbol Synonyms
0 A1BG A1B|ABG|GAB|HYST2477
1 A2M A2MD|CPAMD5|FWP007|S863-7
2 A2MP1 A2MP
6 SERPINA3 AACT|ACT|GIG24|GIG25
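One hedged caveat: '|'.join(mylist) is interpreted as a regex, so if the input values may contain metacharacters it is safer to escape them first:
import re

pat = '|'.join(map(re.escape, mylist))
m = df['Symbol'].str.contains(pat) | df['Synonyms'].str.contains(pat)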
The question has changed. What you want to do now is to look through the two columns (Symbol and Synonyms) and, if you find a value that is inside mylist, return it. If there is no match you can return 'No match!' (for instance).
import pandas as pd
import io
s = '''\
Symbol,Synonyms
A1BG,A1B|ABG|GAB|HYST2477
A2M,A2MD|CPAMD5|FWP007|S863-7
A2MP1,A2MP
NAT1,AAC1|MNAT|NAT-1|NATI
NAT2,AAC2|NAT-2|PNAT
NATP,AACP|NATP1
SERPINA3,AACT|ACT|GIG24|GIG25'''
mylist = ['GAB', 'A2M', 'GIG24']
df = pd.read_csv(io.StringIO(s))
# Build the lookup series: Symbol and Synonyms joined by '|', then split into lists
lookup_serie = df['Symbol'].str.cat(df['Synonyms'], '|').str.split('|')
# Lambda returning the first value found in mylist, or 'No match!' when the generator is exhausted
f = lambda x: next((i for i in x if i in mylist), 'No match!')
df.insert(0, 'Input_Symbol', lookup_serie.apply(f))
print(df)
Returns
Input_Symbol Symbol Synonyms
0 GAB A1BG A1B|ABG|GAB|HYST2477
1 A2M A2M A2MD|CPAMD5|FWP007|S863-7
2 No match! A2MP1 A2MP
3 No match! NAT1 AAC1|MNAT|NAT-1|NATI
4 No match! NAT2 AAC2|NAT-2|PNAT
5 No match! NATP AACP|NATP1
6 GIG24 SERPINA3 AACT|ACT|GIG24|GIG25
Old solution:
f = lambda x: [i for i in x.split('|') if i in mylist] != []
m1 = df['Symbol'].apply(f)
m2 = df['Synonyms'].apply(f)
df[m1 | m2]