I'm new to regex and I can't figure out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. There will be only one per sentence. I have to ignore the words between [] because they are never the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative lookarounds to ensure that the title is not within []:
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b Matches word boundary.
(?<!\[) Negative lookbehind. Checks that the matched string is not preceded by [
[A-Z ]{2,} Matches 2 or more uppercase letters or spaces, so a match can include the spaces around the title.
(?!\]) Negative lookahead. Ensures that the string is not followed by ]
Example
>>> import re
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
>>>
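If the surrounding spaces in the matches are unwanted, one possible variant (a sketch, not part of the answer above) matches whole uppercase words separated by single spaces and keeps the same lookarounds. The names TITLE_RE and find_titles are made up for this illustration, and the pattern assumes that a title consists only of uppercase words of two or more letters.

```python
import re

# Assumed variant: uppercase words of 2+ letters joined by single spaces,
# not directly preceded by "[" and not directly followed by "]".
TITLE_RE = re.compile(r'(?<!\[)\b[A-Z]{2,}(?: [A-Z]{2,})*\b(?!\])')

def find_titles(text):
    """Return all candidate titles found in text (hypothetical helper)."""
    return TITLE_RE.findall(text)

sample = """Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol"""

print(find_titles(sample))  # ['JURASSIC WORLD', 'BATMAN']
```

Because the spaces are only allowed between words, no strip() call is needed afterwards.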
Related
sentence = "I love the grand mother bag i bought . I love my sister's ring "
import re
regex = re.search('(\w+){2}the grand mother bag(\w+){2}', sentence)
print(regex.groups())
I should have extracted: I love and I bought.
Any idea where I went wrong?
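Change your regex: \w matches a single word character, not a whole word, and (\w+){2} leaves no room for the spaces between words, so your pattern cannot match two words. Match two words separated by whitespace on each side instead:
>>> re.search(r'(\w+\s+\w+)\s+the grand mother bag\s+(\w+\s+\w+)', sentence).groups()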
('I love', 'i bought')
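The same idea can be generalized to any phrase and any number of surrounding words. The following is only a sketch: the function name words_around and its parameter n are invented for this example, and it assumes the neighbouring words consist purely of \w characters.

```python
import re

def words_around(sentence, phrase, n=2):
    """Return the n words before and the n words after phrase, or None."""
    # One word followed by (n - 1) more words, each separated by whitespace.
    words = r'\w+(?:\s+\w+){%d}' % (n - 1)
    pattern = r'(%s)\s+%s\s+(%s)' % (words, re.escape(phrase), words)
    match = re.search(pattern, sentence)
    return match.groups() if match else None

sentence = "I love the grand mother bag i bought . I love my sister's ring "
print(words_around(sentence, "the grand mother bag"))  # ('I love', 'i bought')
```

Returning None when nothing matches avoids the AttributeError that calling .groups() on a failed re.search would raise.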
I have multiple strings to postprocess, where a lot of the acronyms have a missing closing bracket. Assume the string text below, but also assume that this type of missing bracket happens often.
My code below only works by adding the closing bracket to the missing acronym independently, but not to the full string/sentence. Any tips on how to do this efficiently, and preferably without needing to iterate ?
import re
#original string
text = "The dog walked (ABC in the park"
#Desired output:
desired_output = "The dog walked (ABC) in the park"
#My code:
acronyms = re.findall(r'\([A-Z]*\)?', text)
for acronym in acronyms:
    if ')' not in acronym: #find those without a closing bracket ')'.
        print(acronym + ')') #add the closing bracket ')'.
#current output:
>>'(ABC)'
You may use
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
With this approach, you no longer need to check whether the acronym already has a closing ); see a demo on regex101.com.
In full:
import re
#original string
text = "The dog walked (ABC in the park"
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
print(text)
This yields
The dog walked (ABC) in the park
See a working demo on ideone.com.
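As a quick check that acronyms which are already closed are left alone, here is a small sketch that reuses the pattern from this answer. The names UNCLOSED and close_acronyms and the second test string are made up for the illustration.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
UNCLOSED = re.compile(r'(\([A-Z]+(?!\))\b)')

def close_acronyms(text):
    """Append ')' to every '(ABC'-style acronym that lacks one."""
    return UNCLOSED.sub(r'\1)', text)

texts = [
    "The dog walked (ABC in the park",
    "A (XYZ) is fine, but (DEF is not",
]
for t in texts:
    print(close_acronyms(t))
# The dog walked (ABC) in the park
# A (XYZ) is fine, but (DEF) is not
```

A single re.sub call fixes every unclosed acronym in a string, so the only loop left is the one over the strings themselves.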
For the typical example you have provided, I don't see the need for regex.
You can just use some string methods:
text = "The dog walked (ABC in the park"
withoutClosing = [word for word in text.split() if word.startswith('(') and not word.endswith(')') ]
withoutClosing
Out[45]: ['(ABC']
Now that you have the words without a closing parenthesis, you can just replace them:
for eachWord in withoutClosing:
    text = text.replace(eachWord, eachWord+')')
text
Out[46]: 'The dog walked (ABC) in the park'
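One caveat: text.replace works on substrings, so if the same acronym appears both closed and unclosed in one string, the closed one would end up with a doubled bracket. A possible way around that, sketched here with a hypothetical helper close_brackets, is to rebuild the string word by word. It assumes words are separated by single spaces and that an unclosed acronym has no trailing punctuation.

```python
def close_brackets(text):
    """Add ')' to each space-separated word that opens with '(' but never closes."""
    fixed = []
    for word in text.split(' '):
        if word.startswith('(') and ')' not in word:
            word += ')'
        fixed.append(word)
    return ' '.join(fixed)

print(close_brackets("The dog walked (ABC in the park"))
# The dog walked (ABC) in the park
print(close_brackets("See (ABC) and (ABC again"))
# See (ABC) and (ABC) again
```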
I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use Python regex to find the names "Oubre Jr." and "Nurkic" in sentence 1, and "Nurkic" and "Wall" in sentence 2.
p = r'\s*(\w+?)\s[(]'
Using this pattern, I can find ['Nurkic', 'Wall'] in sentence 2, but in sentence 1 I only find ['Nurkic'] and miss "Oubre Jr.".
Can anyone help me?
You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Matches 1 uppercase letter
[a-z] - Matches 1 lowercase letter
[\s\.a-z]* - Matches whitespace (' '), periods ('.') or lowercase letters 0+ times
(?=\s\() - Positive lookahead: the main pattern matches only if it is followed by the string ' ('
import re
# named text rather than str, to avoid shadowing the built-in str type
text = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall( r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', text )
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1
Here is one approach:
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
results = re.findall(r"([A-Z][\w+']+(?: [JS][r][.]?)?)(?= \([A-Z]+\))", line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.
You need a pattern that you can match. For your sentence you could try to match the things before (XXX) and include a list of possible "suffixes" as well; you would need to extract those from your sources.
import re
suffs = ["Jr."] # append more to list
# escape each suffix so that "." is matched literally, then join with | (OR)
rsu = r"(?:"+"|".join(re.escape(s) for s in suffs)+r")? ?"
# combine with suffixes
regex = r"(\w+ "+rsu+r")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
matches = re.finditer(regex, test_str, re.MULTILINE)
names = []
for match in matches:
    names.append(match.group(1)) # group 1 is the name plus the optional suffix
print(names)
Output:
['Oubre Jr. ', 'Nurkic ', 'Nurkic ', 'Wall ']
This should work as long as you do not have names with non-\w characters in them. Each name keeps its trailing space, so call .strip() on the results if you need clean names. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as a starting point.
Explanation:
r"(?:"+"|".join(re.escape(s) for s in suffs)+r")? ?" --> all items in the list suffs are escaped and strung together via | (OR) inside a non-capturing group (?:...), which is made optional and followed by an optional space.
r"(\w+ "+rsu+r")\(\w{3}\)" --> the regex looks for word characters and a space, followed by the optional suffix group we just built, followed by a literal ( then three word characters and another literal ).
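If the trailing spaces are a nuisance, a lookahead can keep the space and the team code out of the match entirely. This is a sketch under assumptions, not the method of any answer above: the helper name find_names and the default suffix list are invented here, and the team code is assumed to be exactly three uppercase letters in parentheses.

```python
import re

def find_names(text, suffixes=("Jr.", "Sr.")):
    """Return names that are directly followed by ' (XXX)'."""
    # Escape each suffix so that '.' is matched literally.
    suffix_alt = "|".join(re.escape(s) for s in suffixes)
    # One word, an optional suffix, then a lookahead for ' (ABC)'.
    pattern = r"(\w+(?: (?:%s))?)(?= \([A-Z]{3}\))" % suffix_alt
    return re.findall(pattern, text)

test_str = ("RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held "
            "ball is correctly called. Nurkic (POR) maintains legal guarding "
            "position and makes incidental contact with Wall (WAS) that does "
            "not affect his driving shot attempt.")
print(find_names(test_str))  # ['Oubre Jr.', 'Nurkic', 'Nurkic', 'Wall']
```

Like the suffix-list approach, this only picks up a single word before the optional suffix, so multi-word names would need a wider pattern.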
I've been struggling to split my string by a regex expression in Python.
I have a text file which I load that is in the format of:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I'd like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. Kyle went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
I'm looking to split my string by a new line and a capital letter or a bullet point in Python.
I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.
Here's what I have so far:
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This just gives me:
[u'\nKyle', u'\nSome']
which is just the first word. I've tried variations of that regex expression but I don't know how to get the rest of the line.
I assume that to also split by the bullet point, I would just include an OR regex expression that is in the same format as the regex of splitting by a capital letter. Is this the best way?
I hope this makes sense and I'm sorry if my question is in any way unclear. :)
You can use this split function:
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]
Code Demo
You can split at a \n followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Or, without using the symbol for the bullet character:
new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
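The splits above keep the newlines that sit inside each piece, while the desired output in the question has them joined into one line. A possible follow-up step, sketched here for Python 3 with a hypothetical helper split_lines, is to collapse the whitespace inside each piece after splitting. It assumes that collapsing every run of whitespace to a single space is acceptable.

```python
import re

def split_lines(text):
    """Split before capitalized or bulleted lines, then tidy each piece."""
    # Split at newlines that are followed by a capital letter or a bullet.
    pieces = re.split(r'\n(?=[A-Z\u2022])', text)
    # Collapse the remaining newlines and runs of spaces inside each piece.
    return [' '.join(piece.split()) for piece in pieces if piece.strip()]

s = ("Peter went to the gym; \nhe worked out for two hours \n"
     "Kyle ate lunch at Kate's house. Kyle went home at 9. \n"
     "Some other sentence here\n\u2022Here's a bulleted line")
for item in split_lines(s):
    print(item)
```

Putting both alternatives into one character class inside a single lookahead is equivalent to the two-branch pattern above.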
I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the name is between 3 and 5 characters long.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 characters long but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by any 2 to 4 characters and then a literal dot/period. You can tighten that up if required, e.g. [A-Z][a-z]{2,4}\. would match an uppercase character followed by 2 to 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
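Since the names in the sample always open a line, the tightened pattern can also be anchored to the start of each line. The following is a sketch that assumes exactly that layout; find_speakers is a made-up helper name and the text is a shortened excerpt of the sample.

```python
import re

text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. No more that Thane of Cawdor shall deceiue
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

def find_speakers(text):
    """Return 3 to 5 letter capitalized names that start a line and end in '.'."""
    # re.MULTILINE makes ^ match at the start of every line.
    return re.findall(r'^[A-Z][a-z]{2,4}\.', text, flags=re.MULTILINE)

print(find_speakers(text))
# ['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
print(sorted(set(find_speakers(text))))
# ['King.', 'Rosse.']
```

Sorting the set gives a stable order, which a bare set does not guarantee.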
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods and (IMO) the code is easier to understand using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # starts with a capital, ends with '.', and the name itself is 3 to 5 characters long
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)