Replace space between indicated phrases with underscore - python

I have a text file where important phrases are marked with special symbols. To be exact, they start with <highlight> and end with <\highlight>.
For example,
"<highlight>machine learning<\highlight> is gaining more popularity, so do <highlight>block chain<\highlight>."
In this sentence, important phrases are delimited by <highlight> and <\highlight>.
I need to remove the <highlight> and <\highlight>, and replace the space connecting words surrounded by them with underscore. Namely, convert "<highlight>machine learning<\highlight>" to "machine_learning". The whole sentence after processing will be "machine_learning is gaining more popularity, so do block_chain".

Try this — re.sub accepts a function as the replacement, so a lambda can rewrite just the captured phrase:
>>> import re
>>> text = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
>>> re.sub(r"<highlight>(.*?)<\\highlight>", lambda x: x.group(1).replace(" ", "_"), text)
'machine_learning is gaining more popularity, so do block_chain.'

There you go:
import re

txt = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
# find each phrase between the markers, then underscore its internal spaces
words = re.findall(r'<highlight>(.*?)<\\highlight>', txt)
for w in words:
    txt = txt.replace(w, w.replace(' ', '_'))
# finally strip the markers themselves
txt = txt.replace('<highlight>', '')
txt = txt.replace('<\\highlight>', '')
print(txt)


How can I split a string into a list by sentences, but keep the \n?

I want to split text into sentences but keep the \n such as:
Civility vicinity graceful is it at. Improve up at to on mention
perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty
though elinor direct.
into sentences like:
['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\nArrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']
Right now I'm using this code with re to split the sentences:
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"
def remove_urls(text):
    text = re.sub(r'http\S+', '', text)
    return text

def split_into_sentences(text):
    print("in")
    print(text)
    text = " " + text + " "
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    if "..." in text: text = text.replace("...", ".<prd>")
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    print(sentences)
    return sentences
However, the code gets rid of the \n, which I need. I need the \n because I'm using the text in moviepy, and moviepy has no built-in functions to space out text with \n, so I must create my own. The only way I can do that is by having \n as a signifier in the text, but when I split my sentences it also gets rid of the \n. What should I do?
You can use a lookbehind (?<=...) so the split consumes only the space or newline while the period it follows stays with the sentence:
import re
s = 'Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.\n\nArrived totally in as between private. Favour of so as on pretty though elinor direct.'
re.split(r'(?<=\.)[ \n]', s)
output:
['Civility vicinity graceful is it at.',
'Improve up at to on mention perhaps raising.',
'Way building not get formerly her peculiar.',
'\nArrived totally in as between private.',
'Favour of so as on pretty though elinor direct.']
You could use split by '.':
text = '''Civility vicinity graceful is it at. Improve up at to on mention
perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty though elinor
direct.'''
text.split('.')
>>> ['Civility vicinity graceful is it at', ' Improve up at to on mention\nperhaps raising', ' Way building not get formerly her peculiar', '\nArrived totally in as between private', ' Favour of so as on pretty though elinor\ndirect', '']
Also check this: Split by comma and strip whitespace in Python
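Combining the split with the stripping — a minimal sketch that re-appends the period and strips only spaces (strip(' ') rather than strip()) so the \n the question needs survives:
text = '''Civility vicinity graceful is it at. Improve up at to on mention
perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty though elinor
direct.'''
# strip(' ') removes spaces but keeps '\n'; the if clause drops the empty last chunk
sentences = [s.strip(' ') + '.' for s in text.split('.') if s.strip()]
print(sentences)
# ['Civility vicinity graceful is it at.', 'Improve up at to on mention\nperhaps raising.', 'Way building not get formerly her peculiar.', '\nArrived totally in as between private.', 'Favour of so as on pretty though elinor\ndirect.']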
I have been able to reproduce your output using this:
txt = 'Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar. \nArrived totally in as between private. Favour of so as on pretty though elinor direct.'
Code:
updated_text = [a if a.endswith('.') else a+'.' for a in txt.split('. ')]
Output:
['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\nArrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']

General regex about english words, numbers, length and specific symbols

I am working on an NLP task and I want to do a general cleaning for a specific purpose that isn't worth explaining further.
I want a function that:
Remove non English words
Remove words that are full in capital
Remove words that have '-' in the beginning or the end
Remove words that have length less than 2 characters
Remove words that have only numbers
For example if I have the following string
'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
the output should be
'George play 123_134 _123pantis'
The function that I have created already is the following:
def clean(text):
    # remove words that aren't English words (isn't working)
    #text = re.sub(r'^[a-zA-Z]+', '', text)
    # remove words that are fully in capitals
    text = re.sub(r'(\w*[A-Z]+\w*)', '', text)
    # remove words that start with or contain - (isn't working)
    text = re.sub(r'(\s)-\w+', '', text)
    # remove words that have length less than 2 characters (is working)
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # remove words with only numbers (isn't working)
    text = re.sub(r'[0-9]+', '', text)
    return text
The output is
- play _ foot-ball _ _pantis ελλαδα
which is not what I need. Thank you very much for your time and help!
You can do this in a single re.sub call.
Search using this regex:
(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*
and replace with empty string.
Code:
import re

s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
# One pass removes: word chars directly before a hyphen ("foot"), a hyphen plus what
# follows it ("-wants", "-ball"), whole two-character words ("to", "in"), digit-only
# words ("123"), all-capital words ("FOOTBALL"), and words containing a non-ASCII
# character ("ελλαδα"); the trailing \s* swallows the space after each removed token.
r = re.sub(r'(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*', '', s)
print(r)
# George play 123_134 _123pantis

How to add an if condition in re.sub in Python

I am using the following code to replace the strings in words with words[0] in the given sentences.
import re

sentences = ['industrial text minings', 'i love advanced data minings and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]
start_terms = sorted(words, key=lambda x: len(x), reverse=True)
start_re = "|".join(re.escape(item) for item in start_terms)
results = []
for sentence in sentences:
    for terms in words:
        if terms in sentence:
            result = re.sub(start_re, words[0], sentence)
            results.append(result)
            break
print(results)
My expected output is as follows:
['industrial text minings', 'i love data mining and data mining']
However, what I am getting is:
['industrial data minings', 'i love data mining and data mining']
In the first sentence, "text minings" is not in words. However, the words list contains "text mining", so the condition "text mining" in "industrial text minings" is True. Then, after replacement, "text mining" becomes "data mining", with the trailing 's' staying where it was. I want to avoid such situations.
Therefore, I am wondering if there is a way to use an if condition in re.sub to check whether the next character is a space: if it is, do the replacement; otherwise, don't.
I am also happy with other solutions that could resolve my issue.
I modified your code a bit:
# Using Python 3.6.1
import re

sentences = ['industrial text minings and data minings and data', 'i love advanced data mining and text mining as data mining has become a trend']
words = ["data mining", "advanced data mining", "data minings", "text mining", "data", 'text']
# Sort by length
start_terms = sorted(words, key=len, reverse=True)
results = []
# Loop through sentences
for sentence in sentences:
    result = sentence
    # Loop through sorted words to replace
    for term in start_terms:
        # Use exact word matching
        exact_regex = r'\b' + re.escape(term) + r'\b'
        # Replace matches with blank space (to avoid priority conflicts)
        result = re.sub(exact_regex, " ", result)
    # Replace inserted blank spaces with "data mining"
    blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'
    result = re.sub(blank_regex, words[0], result)
    results.append(result)
# Print sentences
print(results)
Output:
['industrial data mining minings and data mining and data mining', 'i love data mining and data mining as data mining has become a trend']
The regex can be a bit confusing so here's a quick breakdown:
\bword\b matches exact phrases/words since \b is a word boundary (more on that here)
^\s(?=\s) matches a space at the beginning followed by another space.
(?<=\s)\s$ matches a space at the end preceded by another space.
(?<=\s)\s(?=\s) matches a space with a space on both sides.
For more info on positive lookbehinds (?<=...) and positive lookaheads (?=...), see this Regex tutorial.
You can use a word boundary \b to surround your whole regex:
start_re = "\\b(?:" + "|".join(re.escape(item) for item in start_terms) + ")\\b"
Your regex will become something like:
\b(?:data mining|advanced data mining|data minings|text mining)\b
(?:) denotes a non-capturing group.
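For example, a minimal sketch of the question's loop with the boundary-wrapped pattern (the second sample sentence here says "advanced data mining" without the trailing s, since the whole point of \b is that the pattern no longer matches inside longer words like "minings"):
import re

sentences = ['industrial text minings', 'i love advanced data mining and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]
start_terms = sorted(words, key=len, reverse=True)
# \b on both sides means only whole-phrase occurrences are replaced
start_re = "\\b(?:" + "|".join(re.escape(item) for item in start_terms) + ")\\b"
results = [re.sub(start_re, words[0], sentence) for sentence in sentences]
print(results)
# ['industrial text minings', 'i love data mining and data mining']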

Tagging words based on a dictionary/list in Python

I have the following dictionary of gene names:
gene_dict = {"repA1":1, "leuB":1}
# the actual dictionary is longer, around ~30K entries.
# or in list format
# gene_list = ["repA1", "leuB"]
What I want to do is: given any sentence, search for terms contained in the above dictionary and tag them.
For example given this sentence:
mytext = "xxxxx repA1 yyyy REPA1 zzz."
It will be then tagged as:
xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
Is there any efficient way to do that? In practice we would process a couple of million sentences.
If your gene_list is not really-really-really long, you could use a compiled regular expression, like
import re
gene_list = ["repA1", "leuB"]
regexp = re.compile('|'.join(re.escape(g) for g in gene_list), flags=re.IGNORECASE)  # escape names in case any contain regex metacharacters
result = re.sub(regexp, r'<GENE>\g<0></GENE>', 'xxxxx repA1 yyyy REPA1 zzz.')
and put in a loop for all your sentences. I think this should be quite fast.
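A minimal sketch of that loop (the input list below is a made-up stand-in for your own sentences):
tagged = []
for sentence in ["xxxxx repA1 yyyy REPA1 zzz.", "no genes here."]:  # hypothetical input
    tagged.append(regexp.sub(r'<GENE>\g<0></GENE>', sentence))
print(tagged)
# ['xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.', 'no genes here.']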
If most of the sentences are short and separated by single spaces, something like:
gene_dict = {"repA1":1, "leuB":1}
format_gene = "<GENE>{}</GENE>".format
mytext = "xxxxx repA1 yyyy REPA1 zzz."  # the sample text from the question
mytext = " ".join(format_gene(word) if word in gene_dict else word for word in mytext.split())
is going to be faster.
For slightly longer sentences or sentences you cannot reform with " ".join it might be more efficient or more correct to use several .replaces:
gene_dict = {"repA1":1, "leuB":1}
genes = set(gene_dict)
format_gene = "<GENE>{}</GENE>".format
mytext = "xxxxx repA1 yyyy REPA1 zzz."  # the sample text from the question
to_replace = genes.intersection(mytext.split())
for gene in to_replace:
    mytext = mytext.replace(gene, format_gene(gene))
Each of these assumes that splitting the sentences will not take an extortionate amount of time, which is fair assuming gene_dict is much longer than the sentences.

Python regex to print all sentences that contain two identified classes of markup

I wish to read in an XML file, find all sentences that contain both the markup <emotion> and the markup <LOCATION>, then print those entire sentences to a unique line. Here is a sample of the code:
import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
The regex here grabs all sentences with "wonderful" and "omaha" in them, and returns:
Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.
Which is perfect, but I really want to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion," the regex fails to return any output. So the following code yields no result:
import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
My question is: How can I modify my regular expression in order to grab only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.
(For what it's worth, I'm working on parsing my text in BeautifulSoup as well, but wanted to give regular expressions one last shot before throwing in the towel.)
Your problem appears to be that your regex is expecting a space (\s) to follow the matching word, as seen with:
emotion(?=\s|\.|$)
Since when it's part of a tag, it's followed by a >, rather than a space, no match is found since that lookahead fails. To fix it, you can just add the > after emotion, like:
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
Upon testing, this seems to solve your problem. Make sure to treat "LOCATION" similarly:
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
If I don't misunderstand, what you are trying to do is remove <emotion> </emotion> and <LOCATION> </LOCATION>?
Well, if that is what you want to do, you can do this:
import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')

def remove_xml_tags(xml):
    content = re.compile(r'<.*?>')
    return content.sub('', xml)

data = remove_xml_tags(text)
out.write(data + '\n')
out.close()
I have just discovered that the regex may be bypassed altogether. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it might help others who find themselves where I found myself, I'll post my code:
# read in your file
f = open('sampleinput.txt', 'r')
# use the read method to convert the read data object into a string
readfile = f.read()
#########################
# now use the replace() method to clean data
#########################
# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')
# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')
# replace all ? with .
noquestions = nocommas.replace('?', '.')
# replace all ! with .
noexclamations = noquestions.replace('!', '.')
# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')
######################
# now use replace() to get rid of periods that don't end sentences
######################
# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr')
# replace 'Mrs.' with 'Mrs', etc.
cleantext = nomisters
# now, having cleaned the input, find all sentences that contain your two target words.
# To find markup, just replace "Toby" and "pipe" with <markupclassone> and <markupclasstwo>
periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        print(x)
