I have recently started using the re package in order to clean up transaction descriptions.
Example of original transaction descriptions:
['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
For a list of expressions I would like to replace the current description with a better one, e.g. if one of the list entries contains 'facebook' or 'amazon' preceded by a space (or at the beginning of the string), I want to replace the entire list entry by the word 'facebook' or 'amazon' respectively, i.e.:
['bread', 'facebook', 'milk', 'savings', 'amazon', 'holiday_amazon']
As I only want to pick the word facebook up if it is preceded by a space or at the beginning of the string, I have created a regex that represents this, e.g. (^|\s)facebook. Note that this is only an example; in reality I want to filter out more complex expressions as well.
In total I have a dataframe with 90 such expressions that I want to replace.
My current code (with minimum workable example) is:
import pandas as pd
import re
def specialCases(list_of_narratives, replacement_dataframe):
    # Create output array
    new_narratives = []
    special_cases_identifiers = replacement_dataframe["REGEX_TEST"]
    # For each string element of the list
    for memo in list_of_narratives:
        index_count = 0
        found_count = 0
        for i in special_cases_identifiers:
            regex = re.compile(i)
            if re.search(regex, memo.lower()) is not None:
                new_narratives.append(replacement_dataframe["NARRATIVE_TO"].values[index_count].lower())
                index_count += 1
                found_count += 1
                break
            else:
                index_count += 1
        if found_count == 0:
            new_narratives.append(memo.lower())
    return new_narratives
# Minimum example creation
list_of_narratives = ['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
list_of_regex_expressions = [r'(^|\s)facebook', r'(^|\s)amazon']
list_of_regex_replacements = ['facebook', 'amazon']
replacement_dataframe = pd.DataFrame({'REGEX_TEST': list_of_regex_expressions, 'NARRATIVE_TO': list_of_regex_replacements})
# run code
new_narratives = specialCases(list_of_narratives, replacement_dataframe)
However, with over 1 million list entries and 90 different regex expressions to be replaced (i.e. len(list_of_regex_expressions) is 90) this is extremely slow, presumably due to the double for loop.
Could someone help me improve the performance of this code?
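(For reference, a minimal sketch of one common speed-up, assuming the column names from the example above: compile every pattern once outside the loops instead of once per narrative, and pull the replacements out of the dataframe up front. The function name special_cases_fast is mine.)

import re
import pandas as pd

def special_cases_fast(list_of_narratives, replacement_dataframe):
    # Compile each pattern once, paired with its lowercased replacement
    compiled = [(re.compile(pattern), replacement.lower())
                for pattern, replacement in zip(replacement_dataframe["REGEX_TEST"],
                                                replacement_dataframe["NARRATIVE_TO"])]
    new_narratives = []
    for memo in list_of_narratives:
        memo_lower = memo.lower()
        for regex, replacement in compiled:
            if regex.search(memo_lower):
                new_narratives.append(replacement)
                break
        else:  # no pattern matched this narrative
            new_narratives.append(memo_lower)
    return new_narratives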
Related
I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that contain a space are not recognized and therefore not categorized.
I was not able to make the pattern case insensitive, as words like MEDICINE are not recognized. I tried to add (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas df, but they are printed inside brackets ([]). I tried looping over the script again to take them out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern, soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
Website Sector Sub-sector Therapeutical Area Focus URL status
0 url3.com [med tech] [] [] [] []
1 www.url1.com [med tech, services] [] [oncology, gastroenterology] [] []
2 www.url2.com [med tech, services] [] [orthopedy] [] []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints on the problems that can readily be spotted:
Why can't it match keywords like "Drug Delivery" that contain a space? This is because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you also want to match a space. However, if you want to support other kinds of whitespace (e.g. \t, \n), you need to change this regex pattern further.
Why doesn't it support case-insensitive matching? Your code fragment any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest you convert them all to the same case before checking, for example all to lower case before matching: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]) Here we added .lower() to both text and x.
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need to use regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
Can you try a different approach other than regex? I would suggest difflib when you have two similar words to match.
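For instance, a minimal sketch (my example, reusing the sector dict from the question) that slides each keyword over same-length word windows of the text and keeps a category when any window is similar enough:

import difflib

text = 'Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine',
                       'medical technology', 'bio cell']}

def close_match(keyword, tokens, cutoff=0.8):
    # Compare the keyword against every window with the same word count
    n = len(keyword.split())
    windows = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return bool(difflib.get_close_matches(keyword.lower(), windows, n=1, cutoff=cutoff))

tokens = text.lower().replace(',', ' ').split()
print([cat for cat, keywords in sector.items()
       if any(close_match(k, tokens) for k in keywords)])  # ['med tech']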
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
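One caveat worth adding (my note, not part of the original answer): the keywords are used directly as regex patterns here, so if any of them might contain regex metacharacters, escape them first:

[cat for cat in sector if any(re.search(re.escape(word), text, re.I) for word in sector[cat])]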
I'm trying to use the list of shortened words to select and retrieve the corresponding full words, each identified by its initial sequence of characters:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']
Trying this regex match with a single shortened word:
import re
shortword = 'deve'
retrieved = filter(lambda i: re.match(r'{}'.format(shortword), i), fullwords)
print(*retrieved)
returns
developing
So the regex match works but the question is how to adapt the code to iterate through the shortwords list and retrieve the full words?
EDIT: The solution needs to preserve the order from the shortwords list.
Maybe using a dictionary
# Using a dictionary
test = 'appe is a deve arm'
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Building the dictionary
d = {}
for i in range(len(shortwords)):
    d[shortwords[i]] = fullwords[i]

# Apply the dictionary to the test string
res = " ".join(d.get(s, s) for s in test.split())

# Print the test data after dictionary mapping
print(res)
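For the test string above, this prints: appearance is a developing armour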
That is one way to do it:
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Dict comprehension
words = {short: full for short, full in zip(shortwords, fullwords)}

# Solving the problem
keys = ['deve', 'arm', 'pony']
values = [words[key] for key in keys]
print(values)
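Output: ['developing', 'armour', 'pony']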
This is a classical key-value problem. Use a dictionary for that, or consider pandas in the long term.
Your question text seems to indicate that you're looking for your shortwords at the start of each word. That should be easy then:
matched_words = [word for word in fullwords if any(word.startswith(shortword) for shortword in shortwords)]
If you'd like to regex this for some reason (it's unlikely to be faster), you could do that with a large alternation:
regex_alternation = '|'.join(re.escape(shortword) for shortword in shortwords)
matched_words = [word for word in fullwords if re.match(rf"^(?:{regex_alternation})", word)]
Alternately if your shortwords are always four characters, you could just slice the first four off:
shortwords = set(shortwords)  # sets have O(1) lookups, so this will save
                              # a significant amount of time if either
                              # shortwords or fullwords is long
matched_words = [word for word in fullwords if word[:4] in shortwords]
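Note that the comprehensions above preserve the order of fullwords; to honour the edit's requirement of keeping the shortwords order, iterate the short words in the outer loop instead (a sketch, using the original shortwords list rather than the set):

matched_words = [word
                 for shortword in shortwords
                 for word in fullwords
                 if word.startswith(shortword)]
print(matched_words)  # with the question's lists: ['appearance', 'kid', 'developing', 'colored', 'armour']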
This snippet has the functionality I wanted. It builds a regular expression pattern at each loop iteration in order to accommodate varying word-length parameters, and it maintains the original order of the wordroots list. In essence, it looks at each word in wordroots and fills out the full word from the dataset. This is useful when working with the bip-0039 word list, which contains words of 3-8 characters in length that are uniquely identifiable by their initial 4 characters. Recovery phrases are built by randomly selecting a sequence of words from the bip-0039 list; order is important. Observed security practice is often to abbreviate each word to a maximum of its four initial characters. Here is code which would rebuild a recovery phrase from its abbreviation:
import re

wordroots = ['sun', 'sunk', 'sunn', 'suns']
dataset = ['sun', 'sunk', 'sunny', 'sunshine']

retrieved = []
for root in wordroots:
    # (exact match) or ((exact match at beginning of word when root is 4 or more characters) else (exact match))
    pattern = r"(^" + root + "$|" + ("^" + root + "[a-zA-Z]+)" if len(root) >= 4 else "^" + root + "$)")
    retrieved.extend(filter(lambda i: re.match(pattern, i), dataset))

print(*retrieved)
Output:
sun sunk sunny sunshine
I am currently putting together a script for topic modelling scraped tweets, but I am having a couple of issues. I want to be able to search for all instances of a word and return each instance, plus the words before and after it, to provide better context for how the word is used.
I have tokenised all the tweets, and added them to a Series where the relative index position is used to identify surrounding words.
The code I currently have is:
myseries = pd.Series(["it", 'was', 'a', 'bright', 'cold', 'day', 'in', 'april'],
                     index=[0, 1, 2, 3, 4, 5, 6, 7])

def phrase(w):
    search_word = myseries[myseries == w].index[0]
    before = myseries[[search_word - 1]].index[0]
    after = myseries[[search_word + 1]].index[0]
    print(myseries[before], myseries[search_word], myseries[after])
The code mostly works, but will return an error if the first or last word is searched, as it falls outside the index range of the Series. Is there a way to ignore out of range indexes and simply return what is within range?
The current code also only returns the word before and after the searched word. I want to be able to input a number into the function which then returns a range of words before and after, but my current code is hard coded. Is there a way to have it return a designated range of elements?
I am also having issues creating a loop to search the entire series. Depending on what I write it either returns the first element and nothing else, or repeatedly prints the first element over and over again rather than continuing on with the search. The offending bit of code that keeps repeating the first element is:
def ws(word):
    for element in tokened_df:
        if word == element:
            search_word = tokened_df[tokened_df == word].index[0]
            before = tokened_df[[search_word - 1]].index[0]
            after = tokened_df[[search_word + 1]].index[0]
            print(tokened_df[before], word, tokened_df[after])
There is obviously something simple I've overlooked, but I can't for the life of me figure out what it is. How can I modify the code so that if the same word is repeated in the series, it will return each instance of the word, plus the surrounding words? The way I want it to work follows the logic of 'if the condition is true, execute the phrase function; if not, continue down the series'.
Something like this? I have added a repeated word ("bright") to your example, and added n_before and n_after parameters to specify the number of surrounding words:
import pandas as pd

myseries = pd.Series(["it", 'was', 'a', 'bright', 'bright', 'cold', 'day', 'in', 'april'],
                     index=[0, 1, 2, 3, 4, 5, 6, 7, 8])

def phrase(w, n_before=1, n_after=1):
    search_words = myseries[myseries == w].index
    for index in search_words:
        start_index = max(index - n_before, 0)
        end_index = min(index + n_after + 1, myseries.shape[0])
        print(myseries.iloc[start_index: end_index])

phrase("bright", n_before=2, n_after=3)
This gives:
1 was
2 a
3 bright
4 bright
5 cold
6 day
dtype: object
2 a
3 bright
4 bright
5 cold
6 day
7 in
dtype: object
This is not very elegant, but you probably need some conditionals to account for words that come at the beginning or the end of your phrase. To account for repeated words, find all instances of the repeated word and loop through your print statements. In the variable myseries I repeated the word cold twice, so there should be two print statements:
import pandas as pd

myseries = pd.Series(["it", 'was', 'a', 'cold', 'bright', 'cold', 'day', 'in', 'april'],
                     index=[0, 1, 2, 3, 4, 5, 6, 7, 8])

def phrase(w):
    for i in myseries[myseries == w].index.tolist():
        search_word = i
        if search_word == 0:
            print(myseries[search_word], myseries[i + 1])
        elif search_word == len(myseries) - 1:
            print(myseries[i - 1], myseries[search_word])
        else:
            print(myseries[i - 1], myseries[search_word], myseries[i + 1])
Output:
>>> myseries
0 it
1 was
2 a
3 cold
4 bright
5 cold
6 day
7 in
8 april
dtype: object
>>> phrase("was")
it was a
>>> phrase("cold")
a cold bright
bright cold day
I have 2 Excel files with names of items. I want to compare the items, but the only remotely similar column is the name column, which also formats the names differently, like
KIDS-Piano as kids piano
Butter Gel 100mg as Butter-Gel-100MG
I know it can't be 100% accurate, so I would instead ask the human operating the code to make the final verification, but how do I show the closest matching names?
The proper way of doing this is writing a regular expression.
But the vanilla code below might do the trick as well:
column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG"]

new_column_a = []
for i in column_a:
    # convert strings into lowercase
    a = i.lower()
    # replace dashes with spaces
    a = a.replace('-', ' ')
    new_column_a.append(a)

# do the same for column b
new_column_b = []
for i in column_b:
    # convert strings into lowercase
    a = i.lower()
    # replace dashes with spaces
    a = a.replace('-', ' ')
    new_column_b.append(a)

as_not_found_in_b = []
for i in new_column_a:
    if i not in new_column_b:
        as_not_found_in_b.append(i)

bs_not_found_in_a = []
for i in new_column_b:
    if i not in new_column_a:
        bs_not_found_in_a.append(i)

# find the problematic ones and manually fix them
print(as_not_found_in_b)
print(bs_not_found_in_a)
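Since the question asks to show the closest matching names for a human to verify, difflib can rank candidates after the same normalization (a sketch, my addition, reusing new_column_a and new_column_b from above):

import difflib

for name in new_column_a:
    candidates = difflib.get_close_matches(name, new_column_b, n=3, cutoff=0.6)
    print(name, '->', candidates)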
I am trying to create groups of words. First I count all words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10; each group consists of all the words that appear before and after its top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
1 234 Data is very important to use because ...
2 234 We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab the first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print(result)
I have been manually creating groups of words - but is there a way of doing it in Python?
So, based on the example shown above, the group with word counts would look like this:
group "data":
data: 2
important: 1
value: 1
need: 1
then when it goes through the whole file, there would be another group:
group "analytics":
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do that manually if it's not possible.
Then I want to establish the 10 most-used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
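One caveat worth noting (my addition, not part of the original answer): without word boundaries, the pattern also strips those letter sequences inside longer words (the 'is' in 'this', for example). Wrapping the alternation in \b avoids that:

def remove_words(sentence, bad_words):
    # \b restricts the removal to whole words only
    pat = r'\b(?:{})\b'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)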
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting words into one big list.
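An equivalent nested comprehension (my rewrite, same result) may be easier to read:

# res is a list of (before, after) string tuples from re.findall
list_of_words = [word
                 for pair in res        # each findall hit is a 2-tuple
                 for side in pair       # the before/after capture strings
                 for word in side.split()]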
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list of words we want to collect information about, key_words:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
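For instance, a toy illustration of that one-liner (my example data):

from collections import Counter

sentence_list = ["data is important data", "we value data and analytics"]
key_words, _ = zip(*Counter(
    w for sentence in sentence_list for w in sentence.split()
).most_common(10))
print(key_words)  # 'data' comes first, with a count of 3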