How to add a missing closing parenthesis to a string in Python?

I have multiple strings to postprocess, where a lot of the acronyms have a missing closing bracket. Assume the string text below, but also assume that this type of missing bracket happens often.
My code below only works by adding the closing bracket to the missing acronym independently, but not to the full string/sentence. Any tips on how to do this efficiently, and preferably without needing to iterate?
import re
#original string
text = "The dog walked (ABC in the park"
#Desired output:
desired_output = "The dog walked (ABC) in the park"
#My code:
acronyms = re.findall(r'\([A-Z]*\)?', text)
for acronym in acronyms:
    if ')' not in acronym: #find those without a closing bracket ')'.
        print(acronym + ')') #add the closing bracket ')'.
#current output:
>>'(ABC)'

You may use
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
With this approach, you can also get rid of the check for whether the text already has a ) in it; see a demo on regex101.com.
In full:
import re
#original string
text = "The dog walked (ABC in the park"
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
print(text)
This yields
The dog walked (ABC) in the park
See a working demo on ideone.com.
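For instance, a quick sketch (the second acronym and the extra text here are made up for illustration) shows that acronyms which already have a closing bracket are left alone:
import re
text = "The dog walked (ABC in the park near (XYZ) street"
print(re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text))
# The dog walked (ABC) in the park near (XYZ) street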

For the typical example you have provided, I don't see the need for using regex.
You can just use some string methods:
text = "The dog walked (ABC in the park"
withoutClosing = [word for word in text.split() if word.startswith('(') and not word.endswith(')') ]
withoutClosing
Out[45]: ['(ABC']
Now you have the words without closing parenthesis, you can just replace them:
for eachWord in withoutClosing:
    text = text.replace(eachWord, eachWord+')')
text
Out[46]: 'The dog walked (ABC) in the park'
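If you need this for many strings, the same idea can be wrapped in a small helper (just a sketch of the approach above):
def close_acronyms(text):
    # words that open with '(' but never close
    unclosed = [word for word in text.split() if word.startswith('(') and not word.endswith(')')]
    for word in unclosed:
        text = text.replace(word, word + ')')
    return text

print(close_acronyms("The dog walked (ABC in the park"))
# The dog walked (ABC) in the park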

Related

Search a string for a word/sentence and print the following word

I have a string that has around 10 lines of text. What I am trying to do is find a sentence that has a specific word (or words) in it, and display the word that follows.
Example String:
The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat
I want the script to search for 'The slow', then print the following word, so in this case, 'donkey'.
I have tried using the Find function, but that just prints the location of the word(s).
Example code:
sSearch = output.find("destination-pattern")
print(sSearch)
Any help would be greatly appreciated.
output = "The slow donkey brown fox"
patt = "The slow"
sSearch = output.find(patt)
print(output[sSearch+len(patt)+1:].split(' ')[0])
output:
donkey
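Note that str.find returns -1 when the pattern is absent, so if the pattern might be missing it is worth guarding for that (a small sketch of the same idea):
output = "The slow donkey brown fox"
patt = "The slow"
idx = output.find(patt)
if idx != -1:
    # take the text after the match, drop the leading space, keep the first word
    print(output[idx + len(patt):].lstrip().split(' ')[0])  # donkey
else:
    print("pattern not found")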
You could work with regular expressions. Python has a built-in library called re.
Example usage:
s = "The slow donkey some more text"
finder = "The slow"
idx_finder_end = s.find(finder) + len(finder)
next_word_match = re.match(r"\s\w*\s", s[idx_finder_end:])
next_word = next_word_match.group().strip()
# donkey
I would do it using regular expressions (the re module) in the following way:
import re
txt = '''The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat'''
words = re.findall(r'(?<=The slow) (\w*)',txt)
print(words) # prints ['donkey']
Note that words is now a list of words; if you are sure that there is exactly 1 word to be found, you could then do:
word = words[0]
print(word) # prints donkey
Explanation: I used a so-called lookbehind assertion in the first argument of re.findall, which means I am looking for something preceded by The slow. \w* means any substring consisting of letters, digits, and underscores (_). I enclosed it in a group (brackets) so that only the word is returned; the leading space is not part of the word.
You can do it using regular expressions:
>>> import re
>>> r=re.compile(r'The slow\s+\b(\w+)\b')
>>> r.match('The slow donkey')[1]
'donkey'
>>>

Python RE: excluding some results

I'm new to RE and I'm trying to take song lyrics and isolate the verse titles, the backing vocals, and main vocals:
Here's an example of some lyrics:
[Intro]
D.A. got that dope!
[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)...
The verse titles include the square brackets and any words between them. They can be successfully isolated with
r'\[{1}.*?\]{1}'
The backing vocals are similar to the verse titles, but between (). They've been successfully isolated with:
r'\({1}.*?\){1}'
For the main vocals, I've used
r'\S+'
which does isolate the main_vocals, but also the verse titles and backing vocals. I cannot figure out how to isolate only the main vocals with simple REs.
Here's a python script that gets the output I desire, but I'd like to do it with REs (as a learning exercise) and cannot figure it out through documentation.
import re
file = 'D:/lyrics.txt'
with open(file, 'r') as f:
    lyrics = f.read()
def find_spans(pattern, string):
    pattern = re.compile(pattern)
    return [match.span() for match in pattern.finditer(string)]
verses = find_spans(r'\[{1}.*?\]{1}', lyrics)
backing_vocals = find_spans(r'\({1}.*?\){1}', lyrics)
main_vocals = find_spans(r'\S+', lyrics)
exclude = verses
exclude.extend(backing_vocals)
not_main_vocals = []
for span in exclude:
    start, stop = span
    not_main_vocals.extend(list(range(start, stop)))
main_vocals_temp = []
for span in main_vocals:
    append = True
    start, stop = span
    for i in range(start, stop):
        if i in not_main_vocals:
            append = False
            continue
    if append == True:
        main_vocals_temp.append(span)
main_vocals = main_vocals_temp
Try this Demo:
pattern = r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(]+)'
You can use re.finditer to isolate the groups.
breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for p in re.finditer(pattern, song):
    for key, item in p.groupdict().items():
        if item: breakdown[key].append(item)
Result:
{
    'Verse': [
        '[Intro]',
        '[Chorus: Travis Scott]'
    ],
    'Backing': [
        '(Freeze)',
        '(Skrrt, Skrrt)'
    ],
    'Lyrics': [
        '\nD.A. got that dope!\n\n',
        '\nIce water, turned Atlantic ',
        "\nNightcrawlin' in the Phantom ",
        '...'
    ]
}
To elaborate a bit further on the pattern, it's using named groups to separate the three distinct groups. Using [^\]]+ and similar just means to find everything that is not ] (and likewise [^\)]+ means everything that is not )). In the Lyrics part we exclude anything that starts with [ or (. The link to the demo on regex101 explains the components in more detail if you need it.
If you don't care for the newlines in the main lyrics, use (?P<Lyrics>[^\[\(\n]+) (which excludes the \n) to get your Lyrics without newlines:
'Lyrics': [
    'D.A. got that dope!',
    'Ice water, turned Atlantic ',
    "Nightcrawlin' in the Phantom ",
    '...'
]
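Putting it together as a self-contained sketch (assuming the lyrics live in a string named song, as used above):
import re

song = """[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""

pattern = r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(\n]+)'
breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for p in re.finditer(pattern, song):
    for key, item in p.groupdict().items():
        if item:
            breakdown[key].append(item)
print(breakdown)
This yields the same breakdown as above, with the newlines dropped from the Lyrics entries.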
You could search for the text between close-brackets and open-brackets, using regex groups. If you have a single group (sub-pattern inside round-brackets) in your regex, re.findall will just return the contents of those brackets.
For example, "\[(.*?)\]" would find you just the section labels, not including the square brackets (since they're outside the group).
The regex "\)(.*?)\(" would find just the last line ("\nNightcrawlin' in the Phantom ").
Similarly, we could find the first line with "\](.*?)\[".
Combining the two types of brackets into a character class, the (significantly messier looking) regex "[\]\)](.*?)[\[\(]" captures all of the lyrics.
It will miss lines that don't have brackets before or after them (i.e. at the very start before [Intro], if there are any, or at the end if there are no backing vocals afterwards). A possible workaround is to prepend a "]" character and append a "[" character to force a match at the start/end of the string. Note we need to add the DOTALL option to make sure the wildcard "." will match the newline character "\n".
import re
lyrics = """[Intro]
D.A. got that dope!
[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""
matches = re.findall(r"[\]\)](.*?)[\[\(]", "]" + lyrics + "[", re.DOTALL)
main_vocals = '\n'.join(matches)

How to add an if condition in re.sub in Python

I am using the following code to replace the strings in words with words[0] in the given sentences.
import re
sentences = ['industrial text minings', 'i love advanced data minings and text mining']
words = ["data mining", "advanced data mining", "data minings", "text mining"]
start_terms = sorted(words, key=lambda x: len(x), reverse=True)
start_re = "|".join(re.escape(item) for item in start_terms)
results = []
for sentence in sentences:
    for terms in words:
        if terms in sentence:
            result = re.sub(start_re, words[0], sentence)
            results.append(result)
            break
print(results)
My expected output is as follows:
['industrial text minings', 'i love data mining and data mining']
However, what I am getting is:
['industrial data minings', 'i love data mining and data mining']
In the first sentence, "text minings" is not in words. However, the words list contains "text mining", so the condition "text mining" in "industrial text minings" becomes True. Then, after the replacement, "text mining" becomes "data mining", with the 's' character staying in the same place. I want to avoid such situations.
Therefore, I am wondering if there is a way to use an if condition in re.sub to check whether the next character is a space or not. If it is a space, do the replacement; otherwise, do not.
I am also happy with other solutions that could resolve my issue.
I modified your code a bit:
# Using Python 3.6.1
import re
sentences = ['industrial text minings and data minings and data', 'i love advanced data mining and text mining as data mining has become a trend']
words = ["data mining", "advanced data mining", "data minings", "text mining", "data", 'text']
# Sort by length
start_terms = sorted(words, key=len, reverse=True)
results = []
# Loop through sentences
for sentence in sentences:
    # Loop through sorted words to replace
    result = sentence
    for term in start_terms:
        # Use exact word matching
        exact_regex = r'\b' + re.escape(term) + r'\b'
        # Replace matches with blank space (to avoid priority conflicts)
        result = re.sub(exact_regex, " ", result)
    # Replace inserted blank spaces with "data mining"
    blank_regex = r'^\s(?=\s)|(?<=\s)\s$|(?<=\s)\s(?=\s)'
    result = re.sub(blank_regex, words[0], result)
    results.append(result)
# Print sentences
print(results)
Output:
['industrial data mining minings and data mining and data mining', 'i love data mining and data mining as data mining has become a trend']
The regex can be a bit confusing so here's a quick breakdown:
\bword\b matches exact phrases/words since \b is a word boundary (more on that here)
^\s(?=\s) matches a space at the beginning followed by another space.
(?<=\s)\s$ matches a space at the end preceded by another space.
(?<=\s)\s(?=\s) matches a space with a space on both sides.
For more info on positive lookbehinds (?<=...) and positive lookaheads (?=...) see this Regex tutorial.
You can use a word boundary \b to surround your whole regex:
start_re = "\\b(?:" + "|".join(re.escape(item) for item in start_terms) + ")\\b"
Your regex will become something like:
\b(?:data mining|advanced data mining|data minings|text mining)\b
(?:) denotes a non-capturing group.
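A minimal sketch of that approach (the sample sentence is made up for illustration; terms that only occur as part of a longer word are left alone):
import re

words = ["data mining", "advanced data mining", "data minings", "text mining"]
start_terms = sorted(words, key=len, reverse=True)
start_re = r"\b(?:" + "|".join(re.escape(item) for item in start_terms) + r")\b"
sentence = "industrial text minings, but i love text mining"
print(re.sub(start_re, words[0], sentence))
# industrial text minings, but i love data mining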

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if(result):
    print ("Trovato: "+result.group())
But what I need is to obtain the first 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
Demo
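A small sketch of that suggestion on the question's example sentence (note that the captured groups keep their trailing spaces):
import re

description = "Parking here is horrible, this shop sucks."
pattern = r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})'
print(re.findall(pattern, description, re.IGNORECASE))
# [('Parking ', 'horrible, this ')]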
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
This way you will always have 4 groups (they might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
Corrected demo link
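A short sketch showing what those four groups look like for the example sentence (trimmed with strip, as noted):
import re

line = "Parking here is horrible, this shop sucks."
pattern = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(pattern, line, re.IGNORECASE)
if m:
    print([g.strip() for g in m.groups()])
# ['Parking', '', 'horrible,', 'this']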
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

Python Error"TypeError: coercing to Unicode: need string or buffer, list found"

The purpose of this code is to make a program that searches a person's name (on Wikipedia, specifically) and uses keywords to come up with reasons why that person is significant.
I'm having issues with this specific line "if fact_amount < 5 and (terms in sentence.lower()):" because I get this error ("TypeError: coercing to Unicode: need string or buffer, list found")
If you could offer some guidance it would be greatly appreciated, thank you.
import requests
import nltk
import re
#You will need to install requests and nltk
terms = ['pronounced',
         'was a significant',
         'major/considerable influence',
         'one of the (X) most important',
         'major figure',
         'earliest',
         'known as',
         'father of',
         'best known for',
         'was a major']
names = ["Nelson Mandela","Bill Gates","Steve Jobs","Lebron James"]
#List of people that you need to get info from
for name in names:
    print name
    print '==============='
    #Goes to the wikipedia page of the person
    r = requests.get('http://en.wikipedia.org/wiki/%s' % (name))
    #Parses the raw html into text
    raw = nltk.clean_html(r.text)
    #Tries to split each sentence.
    #sort of buggy though
    #For example St. Mary will split after St.
    sentences = re.split('[?!.][\s]*',raw)
    fact_amount = 0
    for sentence in sentences:
        #I noticed that important things came after 'he was' and 'she was'
        #Seems to work for my sample list
        #Also there may be buggy sentences, so I return 5 instead of 3
        if fact_amount < 5 and (terms in sentence.lower()):
            #remove the reference notation that wikipedia has
            #ex [ 33 ]
            sentence = re.sub('[ [0-9]+ ]', '', sentence)
            #removes newlines
            sentence = re.sub('\n', '', sentence)
            #removes trailing and leading whitespace
            sentence = sentence.strip()
            fact_amount += 1
            #sentence is formatted. Print it out
            print sentence + '.'
    print
You should be checking it the other way
sentence.lower() in terms
terms is a list and sentence.lower() is a string. You can check if a particular string is in a list, but you cannot check if a list is in a string.
You might mean if any(t in sentence_lower for t in terms), to check whether any term from the terms list is in the sentence string.
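For example, a tiny sketch of that check (with a made-up sentence and a shortened terms list):
sentence = "He was a significant figure in modern history"
terms = ['pronounced', 'was a significant', 'major figure']
print(any(t in sentence.lower() for t in terms))  # True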
