Python Regex - Sentence not including strings - python

I have a series of sentences I am trying to decipher. Here are two examples:
Valid for brunch on Saturdays and Sundays
and
Valid for brunch
I want to compose a regex that identifies the word brunch, but only in the case where the sentence does not include the word saturday or sunday. How can I modify the following regex to do this?
re.compile(r'\bbrunch\b',re.I)

^(?!.*saturday)(?!.*sunday).*(brunch)
You can try it this way. Compile it with re.I so that the lookaheads are case-insensitive, then grab the capture group. See demo.
https://regex101.com/r/nL5yL3/18
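For illustration, here is a minimal sketch of how this lookahead pattern could be used from Python. It assumes the pattern is compiled with re.I (so that "Saturdays" and "Sundays" are caught) and adds \b around brunch, as in the question's own regex. The names pattern, sentences and matches are only for this example.

```python
import re

# Assumption: the lookahead pattern from the answer, made case-insensitive
# and with word boundaries around "brunch".
pattern = re.compile(r'^(?!.*saturday)(?!.*sunday).*\b(brunch)\b', re.I)

sentences = [
    "Valid for brunch on Saturdays and Sundays",
    "Valid for brunch",
]

# Keep only the sentences where the whole pattern matches.
matches = [s for s in sentences if pattern.search(s)]
print(matches)  # ['Valid for brunch']
```

Because the lookaheads sit right after ^, they inspect the whole sentence before brunch is matched at all.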

Use a list comprehension. If you have all the sentences in a list named sentences, you can use the following comprehension:
import re
[re.search(r'\bbrunch\b', s, re.I) for s in sentences if 'saturday' not in s.lower() and 'sunday' not in s.lower()]

I would do it like this:
>>> import re
>>> sent = ["Valid for brunch on Saturdays and Sundays", "Valid for brunch"]
>>> sent
['Valid for brunch on Saturdays and Sundays', 'Valid for brunch']
>>> for i in sent:
...     if not re.search(r'(?i)(?:saturday|sunday)', i) and re.search(r'brunch', i):
...         print(i)
...
Valid for brunch

Related

Regex to detect a phone number

I need to find a phone number in a given paragraph text, with the conditions as below.
The word Phone/Ph/tel/telephone should exist in the sentence where the phone number is present.
For ex: (consider the below paragraph.)
This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
As you can see this paragraph has a phone number signified, and it has the word "Phone" in the sentence (31 characters before the phone number).
So I would like to detect this as a phone number if and only if one of the words Phone/Ph/tel/telephone appears within 50 characters before or after the phone number.
I tried using lookarounds in the regex, but it did not work.
import re
phno = re.compile(r'(?<=Ph\s)(?<=Phone\s)(?<=tel\s)telephone(?<=telephone\s)\b([0-9]{3}[-][0-9]{3}[-][0-9]{4})\b',re.MULTILINE)
data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
l = phno.findall(data)
print(l)
I get an empty list [] as output because the word 'Phone' is not detected by the regex (I need it to be detected within 50 characters before or after the phone number).
import re
data = """This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 999-123-4567 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
And 555-555-1212 is my telephone."""
phno = re.compile(r'\b(?:phone|ph|tel|telephone)\b.{0,49}\b(\d{3}[-]\d{3}[-]\d{4})\b|\b(\d{3}[-]\d{3}[-]\d{4})\b.{0,49}\b(?:phone|ph|tel|telephone)\b', flags=re.I)
phones = [m.group(1) if m.group(1) else m.group(2) for m in phno.finditer(data)]
print(phones)
Prints:
['999-888-7894', '555-555-1212']
Assuming you only want to detect hyphen-separated US phone numbers containing area codes, you could use the following regex pattern with re.findall:
\b\d{3}-\d{3}-\d{4}\b
Script:
import re
sentence = "This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', sentence)
print(numbers)
This prints:
['999-888-7894']
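The "within 50 characters" condition can also be checked explicitly in Python instead of inside one large pattern. The following is only a sketch under stated assumptions: the distance is measured between the end of the keyword and the start of the number (or the other way round), and the names NUMBER, KEYWORD and phones_near_keyword are made up for this example.

```python
import re

NUMBER = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')
KEYWORD = re.compile(r'\b(?:phone|ph|tel|telephone)\b', re.I)

def phones_near_keyword(text, window=50):
    """Return numbers that have a keyword at most `window` characters away."""
    keywords = [(k.start(), k.end()) for k in KEYWORD.finditer(text)]
    found = []
    for m in NUMBER.finditer(text):
        for k_start, k_end in keywords:
            # Gap between keyword and number, whichever of the two comes first.
            gap = m.start() - k_end if k_end <= m.start() else k_start - m.end()
            if 0 <= gap <= window:
                found.append(m.group())
                break
    return found

data = "This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
print(phones_near_keyword(data))  # ['999-888-7894']
```

Searching for keywords in the full text and comparing positions avoids cutting a word in half at the edge of a 50-character slice.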

How to extract a substring beginning with a specific substring and ends with another specific substring?

I have something like that:
"1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)"
And I want to extract
Salzburg (SZG), Sarajevo (SJJ), Bourgas (BOJ), Varna (VAR)
import re
sentence = "1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)"
regs = re.findall(r'[A-Za-z]+\s\([A-Z]{3}\)', sentence)
print(regs)
This follows from my logic in the comments.
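A note on the character class used for the city name: [A-Za-z] matches only ASCII letters, whereas the range [A-z] also covers the six punctuation characters that sit between Z and a in ASCII. The short check below illustrates the difference; the variable name between is just for this example.

```python
import re

between = '[\\]^_`'  # the six ASCII characters between 'Z' and 'a'

# The wide range picks up all six punctuation characters.
wide = re.findall(r'[A-z]', between)
# The two explicit ranges match letters only, so nothing is found here.
strict = re.findall(r'[A-Za-z]', between)

print(wide)    # ['[', '\\', ']', '^', '_', '`']
print(strict)  # []
```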

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>) however other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything lying between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, the word on may also appear later in the string, outside <temp></temp>, and I wouldn't want to replace it there.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
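If the word can occur several times inside one <temp> block, a two-stage substitution is one possible approach: match each block first, then replace the word only inside the matched text. This is a sketch, not the approach from the comments mentioned above; replace_inside_temp and its parameters are hypothetical names, and it assumes <temp> blocks are not nested.

```python
import re

def replace_inside_temp(text, word="on", replacement="{replace}"):
    """Replace whole-word `word` only inside <temp>...</temp> blocks."""
    inner = re.compile(r'\b' + re.escape(word) + r'\b')

    def fix_block(match):
        # A function replacement keeps `replacement` literal (no backslash processing).
        return inner.sub(lambda _: replacement, match.group(0))

    return re.sub(r'<temp>.*?</temp>', fix_block, text, flags=re.DOTALL)

string = "Turn it on. <temp>The sale happened on February 22nd</temp> Later on."
print(replace_inside_temp(string))
# Turn it on. <temp>The sale happened {replace} February 22nd</temp> Later on.
```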
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True).decode())  # tostring() returns bytes; decode for readable output
Output:
<temp>The sale happened {replace} February 22nd</temp>

NLTK: How can I extract information based on sentence maps?

I know you can use noun extraction to get nouns out of sentences but how can I use sentence overlays/maps to take out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
Input:
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal noun extraction won't work because the phrases won't always be nouns.
The way sentences are phrased differs, so it can't be words[x] ... because the positioning of the words changes.
You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> import re
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
...     print(mo.group("action"))
...
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)
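Putting the two steps together, a small helper could convert a #name overlay into a regexp and return the captured phrases as a dictionary. This is only a sketch: template_to_regex and extract are hypothetical names, the groups use a lazy .+? instead of .*, and the whole sentence is required to match via re.fullmatch.

```python
import re

def template_to_regex(template):
    """Turn an overlay such as 'First, #action; ...' into a regexp string."""
    escaped = re.escape(template)
    # re.escape turns "#name" into "\#name"; swap each one for a named group.
    return re.sub(r'\\#(\w+)', r'(?P<\g<1>>.+?)', escaped)

def extract(template, sentence):
    """Return a dict of captured placeholders, or None if the overlay does not fit."""
    mo = re.fullmatch(template_to_regex(template), sentence)
    return mo.groupdict() if mo else None

overlay = "First, #action; Second, Foobar"
print(extract(overlay, "First, Dance and Code; Second, Foobar"))
# {'action': 'Dance and Code'}
```

Trying several overlays in turn and keeping the first one that returns a dictionary would handle differently phrased sentences.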

Extract words between the 2nd and the 3rd comma

I am a total newbie to regex, so this question might seem trivial to many of you.
I would like to extract the words between the second and the third comma, like in the sentence:
Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012
I have tried : (?<=,\s)[^,]+(?=,) but this doesn't return what I want...
data = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"
import re
print(re.match(r".*?,.*?,\s*(.*?),.*", data).group(1))
Output
Cuvee Celine
But for this simple task, you can simply split the string on commas, like this:
data.split(",")[2].strip()
In this case I find it easier to use a simple split on commas.
>>> s = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"
>>> s.split(',')[2]
' Cuvee Celine'
Why not just split the string by commas using str.split() ?
data.split(",")[2]
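To see why the lookaround attempt from the question did not return the expected result, it helps to print everything it matches: it finds every field that has a comma on both sides, not just the one between the second and third comma. The sketch below shows that, next to a small split-based helper; nth_field is a hypothetical name and returns None for a missing field instead of raising.

```python
import re

data = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"

# The lookbehind/lookahead attempt matches every field enclosed by commas.
middle_fields = re.findall(r'(?<=,\s)[^,]+(?=,)', data)
print(middle_fields)  # ['Bordeaux blanc', 'Cuvee Celine']

def nth_field(text, n, sep=","):
    """Return the n-th (0-based) separator-delimited field, stripped, or None."""
    parts = [part.strip() for part in text.split(sep)]
    return parts[n] if 0 <= n < len(parts) else None

print(nth_field(data, 2))  # Cuvee Celine
print(nth_field(data, 7))  # None
```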
