Python function to find similarity between differently formatted strings

I have two Excel files with item names. I want to compare the items, but the only remotely similar column is the name column, and even there the names are formatted differently, for example:
KIDS-Piano as kids piano
Butter Gel 100mg as Butter-Gel-100MG
I know it can't be 100% accurate, so I would ask the person running the code to make the final verification. But how do I show the closest matching names?

One way of doing this is to normalize the names with a regular expression.
But the plain string methods below might do the trick as well:
column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG"]
new_column_a = []
for i in column_a:
    # convert strings into lowercase
    a = i.lower()
    # replace dashes with spaces
    a = a.replace('-', ' ')
    new_column_a.append(a)
# do the same for column b
new_column_b = []
for i in column_b:
    # convert strings into lowercase
    a = i.lower()
    # replace dashes with spaces
    a = a.replace('-', ' ')
    new_column_b.append(a)
as_not_found_in_b = []
for i in new_column_a:
    if i not in new_column_b:
        as_not_found_in_b.append(i)
bs_not_found_in_a = []
for i in new_column_b:
    if i not in new_column_a:
        bs_not_found_in_a.append(i)
# find the problematic ones and manually fix them
print(as_not_found_in_b)
print(bs_not_found_in_a)
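The question also asks how to show the closest matching names for a human to verify. A minimal sketch of one way to do that uses difflib.get_close_matches from the standard library on the normalized names. The helper names normalize and closest_matches, and the cutoff of 0.6, are assumptions chosen for illustration, not part of the original answer.

```python
import difflib

def normalize(name):
    # Lowercase, treat dashes as spaces, and collapse repeated whitespace.
    return " ".join(name.lower().replace("-", " ").split())

def closest_matches(names_a, names_b, n=3, cutoff=0.6):
    # Map each normalized name in B back to its original spelling.
    lookup = {normalize(b): b for b in names_b}
    result = {}
    for a in names_a:
        # get_close_matches returns the best candidates, most similar first.
        candidates = difflib.get_close_matches(normalize(a), list(lookup), n=n, cutoff=cutoff)
        result[a] = [lookup[c] for c in candidates]
    return result

column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG"]
matches = closest_matches(column_a, column_b)
for name, candidates in matches.items():
    print(name, "->", candidates)
```

Each name from the first file is listed with up to n candidates from the second file, so the operator only has to confirm or reject the suggestions. Lowering cutoff shows more, weaker candidates.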

Related

Python: Replace string in one column from list in other column

I need some help please.
I have a dataframe with multiple columns, two of which are:
Content_Clean: a column of content strings
Removals: a list of strings to be removed from the Content_Clean column
Problem: I am trying to replace words in Content_Clean with spaces if they appear in the Removals column:
Example:
Content Clean: 'Johnny and Mary went to the store'
Removals: ['Johnny','Mary']
Output: 'and went to the store'
Example Code:
for i in data_eng['Removals']:
    for u in i:
        data_eng['Content_Clean_II'] = data_eng['Content_Clean'].str.replace(u,' ')
This does not work because the Removals column contains a list per row.
Another Example:
data_eng['Content_Clean_II'] = data_eng['Content_Clean'].apply(lambda x: re.sub(data_eng.loc[data_eng['Content_Clean'] == x, 'Removals'].values[0], '', x))
This does not work because the code only looks for one string.
The problem is that each row of the Removals column holds a list, and I want to use that list to remove (or replace with spaces) words in the Content_Clean column on a per-row basis.

Here you go. This worked on my test data. Let me know if it works for you.
def repl(row):
    for word in row['Removals']:
        row['Content_Clean'] = row['Content_Clean'].replace(word, '')
    return row
data_eng = data_eng.apply(repl, axis=1)
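Plain str.replace also removes the word when it is part of a longer word (for example "Mary" inside "Maryland") and leaves extra spaces behind. As a rough sketch, assuming the removals are ordinary words, a word-boundary regex avoids both. The helper name remove_words is hypothetical.

```python
import re

def remove_words(text, removals):
    # Nothing to remove: return the text unchanged.
    if not removals:
        return text
    # \b keeps "Mary" from matching inside "Maryland"; re.escape guards
    # against regex metacharacters in the words themselves.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in removals) + r")\b"
    cleaned = re.sub(pattern, " ", text)
    # Collapse the leftover runs of spaces and trim the ends.
    return " ".join(cleaned.split())

print(remove_words('Johnny and Mary went to the store', ['Johnny', 'Mary']))
```

With the dataframe from the question, this could presumably be applied per row with data_eng.apply(lambda row: remove_words(row['Content_Clean'], row['Removals']), axis=1).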

You can call the str.replace(old, new) method to remove unwanted words from a string.
Here is one small example I have done.
a_string = "I do not like to eat apples and watermelons"
stripped_string = a_string.replace(" do not", "")
print(stripped_string)
This removes " do not" (including the leading space) from the sentence.

How can I show a specific word in a data set?

I just started to learn Python. I have a question about matching some of the words in my dataset from Excel.
words_list contains some of the words I would like to find in the dataset.
words_list = ('tried','mobile','abc')
df is a single column extracted from the Excel file.
df =
0 to make it possible or easier for someone to do ...
1 unable to acquire a buffer item very likely ...
2 The organization has tried to make...
3 Broadway tried a variety of mobile Phone for the..
I would like to get the result like this:
'None',
'None',
'tried',
'tried','mobile'
I tried this in Jupyter:
list = [ ]
for word in df:
    if any(aa in word for aa in words_list):
        list.append(word)
    else:
        list.append('None')
print(list)
But the result shows the whole sentence from df:
'None'
'None'
'The organization has tried to make...'
'Broadway tried a variety of mobile Phone for the..'
Can I show only the matching words from the words list?
Sorry for my English and
thank you all

I'd suggest a manipulation on the DataFrame (that should always be your first thought, use the power of pandas)
import pandas as pd
words_list = {'tried', 'mobile', 'abc'}
df = pd.DataFrame({'col': ['to make it possible or easier for someone to do',
                           'unable to acquire a buffer item very likely',
                           'The organization has tried to make',
                           'Broadway tried a variety of mobile Phone for the']})
df['matches'] = df['col'].str.split().apply(lambda x: set(x) & words_list)
print(df)
                                                col          matches
0   to make it possible or easier for someone to do               {}
1       unable to acquire a buffer item very likely               {}
2                The organization has tried to make          {tried}
3  Broadway tried a variety of mobile Phone for the  {mobile, tried}

The reason it's printing the whole line has to do with your:
for word in df:
Your "word" variable is actually taking the whole line. Then it's checking the whole line to see if it contains your search word. If it does find it, then it basically says, "yes, I found ____ in this line, so append the line to your list."
What it sounds like you want to do is first split the line into words, and THEN check.
results = []
for line in df:
    words = line.split()
    found = False
    for word in words_list:
        if word in words:
            found = True
            results.append(word)
    # append "None" if nothing was found in this line
    if not found:
        results.append("None")
print(results)
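The loop above puts all matches into one flat list, while the desired output in the question groups them per row. A small sketch of the per-row variant, using only the standard library, is below. The list lines stands in for the single column from the question, and find_words is a hypothetical helper name.

```python
import re

words_list = ('tried', 'mobile', 'abc')
# Stand-in for the single Excel column in the question.
lines = [
    'to make it possible or easier for someone to do ...',
    'unable to acquire a buffer item very likely ...',
    'The organization has tried to make...',
    'Broadway tried a variety of mobile Phone for the..',
]

def find_words(line, wanted):
    # Split into lowercase word tokens so punctuation such as "..." is ignored.
    tokens = set(re.findall(r"\w+", line.lower()))
    # Keep the order of the wanted words, not the order in the sentence.
    return [w for w in wanted if w in tokens]

# One entry per row: the list of matches, or None when nothing matched.
results = [find_words(line, words_list) or None for line in lines]
print(results)
```

Because each row keeps its own entry, the result can be assigned straight back to a new column of the same length.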
As a side note, you may want to use pprint instead of print when working with lists. It prints lists, dictionaries, etc. in easier-to-read layouts. pprint is part of the standard library, so there is nothing to install. Usage would be something like:
from pprint import pprint
dictionary = {'firstkey':'firstval','secondkey':'secondval','thirdkey':'thirdval'}
pprint(dictionary)

Python matching various keyword from dictionary issues

I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
The following finds my keywords and categorizes them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
Keywords that contain a space, like "Drug Delivery", are not recognized and therefore not categorized.
I was not able to make the pattern case insensitive, so words like MEDICINE are not recognized. I tried adding (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas df, but they are shown wrapped in []. I tried looping over the result again to strip the brackets, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern,soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
   Website       Sector                Sub-sector  Therapeutical Area            Focus  URL status
0  url3.com      [med tech]            []          []                            []     []
1  www.url1.com  [med tech, services]  []          [oncology, gastroenterology]  []     []
2  www.url2.com  [med tech, services]  []          [orthopedy]                   []     []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!

Here are some hints on the problems that can readily be spotted:
Why can't it match keywords like "Drug Delivery" that are separated by a space? Because the regex pattern r'[a-zA-Z0-9]+' does not match a space, so re.findall only returns single words. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) to match spaces as well. Note, however, that findall then returns whole runs of words, and x in <list> tests for an exact element, so a phrase embedded in a longer run is still missed. To support other kinds of whitespace (e.g. \t, \n), the pattern needs further changes.
Why isn't the match case insensitive? Your code fragment any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have exactly the same upper/lower case in both the result of re.findall and in sector[cat]. This constraint cannot be bypassed even by setting flags=re.I in the re.findall() call, because the flag affects what the pattern matches, not the later comparison. I suggest converting both sides to the same case before checking, for example to lower case: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here .lower() is applied to both text and x.
With these two changes, you should be able to capture some of the categorized keywords.
Actually, for this particular case, you may not need regular expressions or re.findall at all. You can simply check, e.g., sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
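A plain substring check also matches inside longer words (for example "medicine" inside "biomedicine"). As a hedged sketch, the same idea can be written with word boundaries and re.IGNORECASE, which handles multi-word keywords and mixed case in one pass per category. The function name categorize is an assumption, and joining the result into a string is just one possible way to avoid the [] in the DataFrame cells.

```python
import re

def categorize(text, sector):
    found = []
    for category, keywords in sector.items():
        # One alternation per category; re.escape keeps keywords literal.
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(category)
    return found

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

print(categorize(text, sector))
# Store a plain string instead of a list, so empty results show as '' rather than [].
print(", ".join(categorize(text, sector)))
```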

You could try an approach other than regex.
I would suggest difflib when you have two similar words to match.

findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']

Python replace strings using regex on large dataset

I have recently started using the re package in order to clean up transaction descriptions.
Example of original transaction descriptions:
['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
For a list of expressions I would like to replace the current description with a better one, e.g. if one of the list entries contains 'facebook' or 'amazon' preceded by a space (or at the beginning of the string), I want to replace the entire list entry by the word 'facebook' or 'amazon' respectively, i.e.:
['bread', 'facebook', 'milk', 'savings', 'amazon', 'holiday_amazon']
As I only want to pick it up if the word facebook is preceded by a space or if it is at the beginning of a word, I have created regex that represent this, e.g. (^|\s)facebook. Note that this is only an example, in reality I want to filter out more complex expressions as well.
In total I have a dataframe with 90 such expressions that I want to replace.
My current code (with minimum workable example) is:
import pandas as pd
import re
def specialCases(list_of_narratives, replacement_dataframe):
    # Create output array
    new_narratives = []
    special_cases_identifiers = replacement_dataframe["REGEX_TEST"]
    # For each string element of the list
    for memo in list_of_narratives:
        index_count = 0
        found_count = 0
        for i in special_cases_identifiers:
            regex = re.compile(i)
            if re.search(regex, memo.lower()) is not None:
                new_narratives.append(replacement_dataframe["NARRATIVE_TO"].values[index_count].lower())
                index_count += 1
                found_count += 1
                break
            else:
                index_count += 1
        if found_count == 0:
            new_narratives.append(memo.lower())
    return new_narratives
# Minimum example creation
list_of_narratives = ['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
list_of_regex_expressions = [r'(^|\s)facebook', r'(^|\s)amazon']
list_of_regex_replacements = ['facebook', 'amazon']
replacement_dataframe = pd.DataFrame({'REGEX_TEST': list_of_regex_expressions, 'NARRATIVE_TO': list_of_regex_replacements})
# run code
new_narratives = specialCases(list_of_narratives, replacement_dataframe)
However, with over 1 million list entries and 90 different regex expressions to be replaced (i.e. len(list_of_regex_expressions) is 90) this is extremely slow, presumably due to the double for loop.
Could someone help me improve the performance of this code?
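No answer is recorded here, but one direction that may help is to compile each pattern once (instead of once per memo) and to use a single combined pattern as a pre-filter, so memos that match nothing skip the inner loop. The sketch below is untested on data of that size. The name build_replacer is hypothetical, and the combined pattern assumes the individual expressions contain no numbered backreferences or inline flags.

```python
import re

def build_replacer(patterns, replacements):
    # Compile every pattern exactly once, outside the per-memo loop.
    compiled = [(re.compile(p), r.lower()) for p, r in zip(patterns, replacements)]
    # Combined pre-filter: a memo matching none of the alternatives
    # is returned immediately without trying each pattern in turn.
    any_pattern = re.compile("|".join(f"(?:{p})" for p in patterns))

    def replace(memo):
        memo = memo.lower()
        if not any_pattern.search(memo):
            return memo
        # First matching pattern wins, as in the original loop.
        for regex, replacement in compiled:
            if regex.search(memo):
                return replacement
        return memo

    return replace

list_of_narratives = ['bread', 'payment to facebook.com', 'milk', 'savings',
                      'amazon.com $xx ased lux', 'holiday_amazon']
replace = build_replacer([r'(^|\s)facebook', r'(^|\s)amazon'], ['facebook', 'amazon'])
new_narratives = [replace(memo) for memo in list_of_narratives]
print(new_narratives)
```

If many descriptions repeat, caching the result per distinct memo (for example with functools.lru_cache around replace) might reduce the work further.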

Need to search a string for a "two word" pattern in python

I’m trying to search a long string of characters for a country name. The country name is sometimes more than one word, such as Costa Rica.
Here is my code:
eol = len(CountryList)
for c in range(0, eol):
    country = str(CountryList[c])
    countrymatch = re.search(country, fullsampledata)
    if countrymatch:
        ...
fullsampledata is a long string with all the data in one line. I’m trying to parse out the country by cycling thru a list of valid country names. If country is only one word, such as ‘Holland’, it finds it. However, if country is two or more words, ‘Costa Rica’, it doesn’t find it. Why?

You can search for a substring in a string using the .find() method, which returns -1 when the substring is absent. The .index() method returns the same position but raises ValueError instead of returning -1:
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
fullsampledata.find("Morocco")
-1
fullsampledata.index("Costa Rica")
17
So you can write your if statement as follows
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
country = "Costa Rica"
if fullsampledata.find(country) != -1:
    # Found
    pass
else:
    # Not Found
    pass

In [1]: long_string = 'asdfsadfCosta Ricaasdkj asdfsd asdjas USA alsj'
In [2]: 'Costa Rica' in long_string
Out[2]: True
You don't have your code properly shown and I'm a little too lazy to parse it. Hope this helps.
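The question does not show the data, so the cause of the failure is unknown. One possible cause, offered only as an assumption, is that the words are separated by something other than a single ordinary space (a tab, a newline, or a non-breaking space). A sketch that normalizes whitespace before the substring test is below. find_countries and the sample string are made up for illustration.

```python
import re

def find_countries(text, country_list):
    # Collapse every run of whitespace (tabs, newlines, non-breaking
    # spaces) into one ordinary space before searching.
    normalized = re.sub(r"\s+", " ", text)
    return [country for country in country_list if country in normalized]

# Hypothetical sample: "Costa" and "Rica" are joined by a non-breaking space.
sample = "x17Costa\u00a0Rica;2021;Holland;"
print(find_countries(sample, ["Holland", "Costa Rica", "Morocco"]))
```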
