How to add a if condition in re.sub in python - python

I am using the following code to replace the strings in words with words[0] in the given sentences.
import re
sentences = ['industrial text minings', 'i love advanced data minings and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]
start_terms = sorted(words, key=lambda x: len(x), reverse=True)
start_re = "|".join(re.escape(item) for item in start_terms)
results = []
for sentence in sentences:
for terms in words:
if terms in sentence:
result = re.sub(start_re, words[0], sentence)
results.append(result)
break
print(results)
My expected output is as follows:
[industrial text minings', 'i love data mining and data mining]
However, what I am getting is:
[industrial data minings', 'i love data mining and data mining]
In the first sentence text minings is not in words. However, it contains "text mining" in the words list, so the condition "text mining" in "industrial text minings" becomes True. Then post replacement, it "text mining" becomes "data mining", with the 's' character staying at the same place. I want to avoid such situations.
Therefore, I am wondering if there is a way to use if condition in re.sub to see if the next character is a space or not. If a space, do the replacement, else do not do it.
I am also happy with other solutions that could resolve my issue.

I modifed your code a bit:
# Using Python 3.6.1
import re
sentences = ['industrial text minings and data minings and data', 'i love advanced data mining and text mining as data mining has become a trend']
words = ["data mining", "advanced data mining", "data minings", "text mining", "data", 'text']
# Sort by length
start_terms = sorted(words, key=len, reverse=True)
results = []
# Loop through sentences
for sentence in sentences:
# Loop through sorted words to replace
result = sentence
for term in start_terms:
# Use exact word matching
exact_regex = r'\b' + re.escape(term) + r'\b'
# Replace matches with blank space (to avoid priority conflicts)
result = re.sub(exact_regex, " ", result)
# Replace inserted blank spaces with "data mining"
blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'
result = re.sub(blank_regex, words[0] , result)
results.append(result)
# Print sentences
print(results)
Output:
['industrial data mining minings and data mining and data mining', 'i love data mining and data mining as data mining has become a trend']
The regex can be a bit confusing so here's a quick breakdown:
\bword\b matches exact phrases/words since \b is a word boundary (more on that here)
^\s(?=\s) matches a space at the beginning followed by another space.
(?<=\s)\s$ matches a space at the end preceded by another space.
(?<=\s)\s(?=\s) matches a space with a space on both sides.
For more info on positive look behinds (?<=...) and positive look aheads (?=...) see this Regex tutorial.

You can use a word boundary \b to surround your whole regex:
start_re = "\\b(?:" + "|".join(re.escape(item) for item in start_terms) + ")\\b"
Your regex will become something like:
\b(?:data mining|advanced data mining|data minings|text mining)\b
(?:) denotes a non-capturing group.

Related

How to replace multiple substrings in a list of sentences using regex in python?

I have a list of sentences as below :
sentences = ["I am learning to code", "coding seems to be intresting in python", "how to code in python", "practicing how to code is the key"]
Now I wish to replace few substrings in this list of sentences using dictionary of words and its replacements.
word_list = {'intresting': 'interesting', 'how to code': 'learning how to code', 'am learning':'love learning', 'in python': 'using python'}
I tried the following code:
replaced_sentences = [' '.join([word_list.get(w, w) for w in sentence.split()])
for sentence in sentences]
But only the one word string is getting replaced and not the keys with more than one word. It is because am using sentence.split() which tokenizes sentences word by word and misses out replacing substrings greater than one word.
How do I get to replace the substring with exact match using regex or any other suggestions?
expected output:
sentences = ["I love learning to code", "coding seems to be interesting using python", "learning how to code using python", "practicing learning how to code is the key"]
Thanks in advance.
It's probably easiest to read if you break this into a function that replaces all the words for a single sentence. Then you can apply it to all the sentences in the list. Here we make a single regex by concaving all the keys of the dict with '|'. Then use re.sub grab the found value associated with the key, and return it as the replacement.
import re
def replace_words(s, word_lookup):
rx = '|'.join(word_lookup.keys())
return re.sub(rx, lambda match: word_lookup[match.group(0)], s)
[replace_words(s, word_list) for s in sentences]
This will result in:
['I love learning to code',
'coding seems to be interesting using python',
'learning how to code using python',
'practicing learning how to code is the key']
You could optimize a bit by making the regex once instead of each time in the function. This would allow you to do something like:
import re
rx = re.compile('|'.join(word_list.keys()))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]

replace any words in string that match an entry in list with a single tag (python)

I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
(so as a small example, if
infrequent_words = ['dog','cat']
sentence = 'My dog likes to chase after cars'
Then after applying the transformation it should be
sentence = 'My unk likes for chase after cars'
I am having trouble finding an efficient way to do this. This function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text,infrequent_words):
for word in infrequent_words:
text = text.replace(word,'unk')
return text
Thank you!
infrequent_words = {'dog','cat'}
sentence = 'My dog likes to chase after cars'
def replace_infrequent_words(text, infrequent_words):
words = text.split()
for i in range(len(words)):
if words[i] in infrequent_words:
words[i] = 'unk'
return ' '.join(words)
print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation but this should be a performance improvement from what you posted.

How to replace words with their synonyms of word-net?

i want to do data augmentation for sentiment analysis task by replacing words with it's synonyms from wordnet but replacing is random i want to loop over the synonyms and replace word with all synonyms one at the time to increase data-size
sentences=[]
for index , r in pos_df.iterrows():
text=normalize(r['text'])
words=tokenize(text)
output = ""
# Identify the parts of speech
tagged = nltk.pos_tag(words)
for i in range(0,len(words)):
replacements = []
# Only replace nouns with nouns, vowels with vowels etc.
for syn in wordnet.synsets(words[i]):
# Do not attempt to replace proper nouns or determiners
if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT':
break
# The tokenizer returns strings like NNP, VBP etc
# but the wordnet synonyms has tags like .n.
# So we extract the first character from NNP ie n
# then we check if the dictionary word has a .n. or not
word_type = tagged[i][1][0]
if syn.name().find("."+word_type+"."):
# extract the word only
r = syn.name()[0:syn.name().find(".")]
replacements.append(r)
if len(replacements) > 0:
# Choose a random replacement
replacement = replacements[randint(0,len(replacements)-1)]
print(replacement)
output = output + " " + replacement
else:
# If no replacement could be found, then just use the
# original word
output = output + " " + words[i]
sentences.append([output,'positive'])
Even I'm working with a similar kind of project, generating new sentences from a given input but without changing the context from the input text.
While coming across this, I found a data augmentation technique. Which seems to work well on the augmentation part. EDA(Easy Data Augmentation) is a paper[https://github.com/jasonwei20/eda_nlp].
Hope this helps you.

Replace space between indicated phrases with underscore

I've a text file where important phrases are indicated with special symbols. To be exact, they will start with <highlight> and end with <\highlight>.
For example,
"<highlight>machine learning<\highlight> is gaining more popularity, so do <highlight>block chain<\highlight>."
In this sentence, important phrases are segmented by <highlight> and <\highlight>.
I need to remove the <highlight> and <\highlight>, and replace the space connecting words surrounded by them with underscore. Namely, convert "<highlight>machine learning<\highlight>" to "machine_learning". The whole sentence after processing will be "machine_learning is gaining more popularity, so do block_chain".
Try this:
>>> text = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
>>> re.sub(r"<highlight>(.*?)<\\highlight>", lambda x: x.group(1).replace(" ", "_"), text)
'machine_learning is gaining more popularity, so do block_chain.'
There you go:
import re
txt = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
words = re.findall('<highlight>(.*?)<\\\highlight', txt)
for w in words:
txt = txt.replace(w, w.replace(' ', '_'))
txt = txt.replace('<highlight>', '')
txt = txt.replace('<\highlight>', '')
print(txt)

Tagging words based on a dictionary/list in Python

I have the following dictionary of gene names:
gene_dict = {"repA1":1, "leuB":1}
# the actual dictionary is longer, around ~30K entries.
# or in list format
# gene_list = ["repA1", "leuB"]
What I want to do is given any sentence, we search for terms that is contained in the above dictionary and then tagged them.
For example given this sentence:
mytext = "xxxxx repA1 yyyy REPA1 zzz."
It will be then tagged as:
xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
Is there any efficient way to do that? In practicality we would process couple of millions of sentences.
If you "gene_list" in not really-really-really long, you could use a compiled regular expression, like
import re
gene_list = ["repA1", "leuB"]
regexp = re.compile('|'.join(gene_list), flags=re.IGNORECASE)
result = re.sub(regexp, r'<GENE>\g<0></GENE>', 'xxxxx repA1 yyyy REPA1 zzz.')
and put in a loop for all your sentences. I think this should be quite fast.
If most of the sentences are short and separated by single spaces, something like:
gene_dict = {"repA1":1, "leuB":1}
format_gene = "<GENE>{}</GENE>".format
mytext = " ".join(format_gene(word) if word in gene_dict else word for word in mytext.split())
is going to be faster.
For slightly longer sentences or sentences you cannot reform with " ".join it might be more efficient or more correct to use several .replaces:
gene_dict = {"repA1":1, "leuB":1}
genes = set(gene_dict)
format_gene = "<GENE>{}</GENE>".format
to_replace = genes.intersection(mytext.split())
for gene in to_replace:
mytext = mytext.replace(gene, format_gene(gene))
Each of these assume that splits of the sentences will not take extortionate time, which is fair assuming genes_dict is a much longer than the sentences.

Categories