Regex: match address string if multiple words


Disclaimer: I know from this answer that regex isn't great for U.S. addresses since they're not regular. However, this is a fairly small project and I want to see if I can reduce the number of false positives.
My challenge is to distinguish (i.e. match) between addresses like "123 SOUTH ST" and "123 SOUTH MAIN ST". The best solution I can come up with is to check if more than 1 word comes after the directional word.
My python regex is of the form:
^(NORTH|SOUTH|EAST|WEST)(\s\S*\s\S*)+$
Explanation:
^(NORTH|SOUTH|EAST|WEST) matches direction at the start of the string
(\s\S*\s\S*)+$ attempts to match a space, a word of any length, another space, and another word of any length 1 or more times
But my expression doesn't seem to distinguish between the 2 types of term. Where's my error (besides using regex for U.S. addresses)?
Thanks for your help.

Your regex misses the number at the beginning of the address and treats the optional word (MAIN in this case) as mandatory. Try this:
^\d+ (NORTH|SOUTH|EAST|WEST)((\s\S*)?\s\S*)+$
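Note that the pattern above accepts both "123 SOUTH ST" and "123 SOUTH MAIN ST". If the goal is to tell the two forms apart, one possible sketch is to require at least two words after the directional word. The pattern and the helper name has_street_name below are illustrative assumptions, not part of the original answer.

```python
import re

# Assumed pattern: house number, direction, then at least two more words.
MULTIWORD_AFTER_DIRECTION = re.compile(
    r"^\d+\s+(?:NORTH|SOUTH|EAST|WEST)(?:\s+\S+){2,}$"
)

def has_street_name(address):
    """Return True if two or more words follow the directional word."""
    return MULTIWORD_AFTER_DIRECTION.match(address) is not None

print(has_street_name("123 SOUTH ST"))       # False
print(has_street_name("123 SOUTH MAIN ST"))  # True
```

Using \S+ rather than \S* avoids matching empty "words", which is one reason the pattern in the question could not separate the two cases.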

Related

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks. To clean it, I'm using regular expressions:
import re

def remove_duplicate_characters(text):
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Collapse runs of question marks (and any whitespace before them) into one
    text = re.sub(r"\s*\?+", "?", text)
    return text

remove_duplicate_characters(msg)
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it does work, but it does not look like the best approach if I want to add more characters to remove. Is there a better way to solve this?

To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left where you capture all the contexts where you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) and a :// string (commonly seen in URLs), and the rest is the original regex with the adjusted backreference (since now there are two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
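As a quick check of the two patterns, here is a minimal sketch. The function names and the sample string are made up for illustration; the regexes are the ones quoted in this answer.

```python
import re

def collapse_all(text):
    # Generic version: any repeated punctuation char becomes one occurrence.
    return re.sub(r"([^\w\s]|_)\1+", r"\1", text)

def collapse_punctuation(text):
    # Group 1 keeps "..." and "://" intact; group 2 collapses everything else.
    # Whichever group did not take part in the match is inserted as an
    # empty string.
    pattern = r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+"
    return re.sub(pattern, r"\1\2", text)

sample = "Wait... what??? See http://example.com!!"
print(collapse_all(sample))          # Wait. what? See http:/example.com!
print(collapse_punctuation(sample))  # Wait... what? See http://example.com!
```

The first call shows why the exceptions matter: without them the ellipsis and the URL scheme separator are collapsed as well.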

regex for returning first sentence from a bigger text Python3

I want to get the first sentence from a text, and I am encountering various text formats. Using Python 3's re.split(), the regex I wrote is '.*\. [A-Z]', meaning take anything until this format appears.
This works for 90% of the cases, but a 'Dr. Firstname Lastname' in the first sentence breaks the pattern: it returns the first sentence only up to Firstname. I was thinking of trying to exclude substrings like 'Dr. [A-Z]' but cannot figure out a way to do it. Any ideas? Thanks
Sample: The rain in U.S.A. and Spain is researched by Dr. Martin Laurance. This is the latest U.S.A. study. Anything else will just be ignored.

Don't reinvent the wheel, the problem has been tackled before.
When using Python (what your link suggests), give nltk a try:
from nltk import sent_tokenize
string = "The rain in U.S.A. and Spain is researched by Dr. Martin Laurance. This is the latest U.S.A. study. Anything else will just be ignored."
for sent in sent_tokenize(string):
    print(sent)
This yields
The rain in U.S.A. and Spain is researched by Dr. Martin Laurance.
This is the latest U.S.A. study.
Anything else will just be ignored.

Wanted to kill a minute or two (or 25 ;)) so I came up with this (not at all foolproof) solution:
(?i).*?\b((?=[a-z']*[aoueiy])(?=[a-z']*[^aoueiy])\w{2,}\.)
It identifies a word followed by a full stop. To separate this word from abbreviations, it searches for a sequence of at least two characters ({2,}) that contains at least one vowel and one consonant. This is achieved using two lookaheads prior to matching the word.
Lookahead to find a vowel in the word: (?=[a-z']*[aoueiy])
[a-z']* = any number of letters (or apostrophes) followed by the character class [aoueiy] - a vowel.
The consonant check is the same, only with a negated character class [^aoueiy], which matches any consonant (and also any non-letter, but since the match is letters only it doesn't matter ;)
Note that this is of course nothing close to a complete language parser, but it may work in many cases. One thing it would miss is sentences ending with the one-letter word "I", like "We're good together you and I."
See it here at regex101
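To try the pattern from Python, a minimal sketch could look like this. The helper name first_sentence is an assumption for illustration; the pattern is the one from this answer, and the sample is the text from the question.

```python
import re

# Pattern from the answer; the leading (?i) makes it case-insensitive.
FIRST_SENTENCE = re.compile(
    r"(?i).*?\b((?=[a-z']*[aoueiy])(?=[a-z']*[^aoueiy])\w{2,}\.)"
)

def first_sentence(text):
    """Return the text up to the first word+period that is not an abbreviation."""
    match = FIRST_SENTENCE.match(text)
    return match.group(0) if match else None

sample = ("The rain in U.S.A. and Spain is researched by Dr. Martin Laurance. "
          "This is the latest U.S.A. study. Anything else will just be ignored.")
print(first_sentence(sample))
# The rain in U.S.A. and Spain is researched by Dr. Martin Laurance.
```

"U.", "S." and "A." are skipped because they are single letters, and "Dr." is skipped because it contains no vowel.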

How can I get a regular expression to find the correct instance of a word?

I'm trying to write a regular expression in python to identify instances of the phrases "played for" and "plays for" in a text, with the potential for finding instances where words come between the two, for example, "played guitar for". I only want this to find the first instance of the word "for" after "plays" or "played", however, I cannot work out how to write the regular expression.
The code I have at the moment is like this:
import re

def play_finder(doc):
    playre = re.compile(r'\bplay[s|e][d]?\b.*\bfor\b\s\b')
    if playre.findall(doc):
        for inst in playre.findall(doc):
            playstr = inst
            print(playstr)

mytext = "He played for four hours last night. He plays guitar for the foo pythers. He won an award for his guitar playing."
play_finder(mytext)
I would like to be able to pull out two instances from mytext: "played for four" and "plays guitar for the".
Instead, what my code is finding is:
"He played for four hours last night. He plays guitar for the foo pythers. He won an award for".
So it's skipping the first and second for, and only finding the last.
How can I rewrite the regular expression to get it to stop skipping over the first and second instance of "for" in the sentence, and to identify both of them?
Edit: Another problem has become apparent to me after applying a solution I was offered. Given more than one sentence, such as:
"He played an eight hour set. It seemed like he went on for ever."
I don't want the regex to identify "He played an eight hour set. It seemed like he went on for" as matching the pattern. Is there a way to stop it looking for the "for" if it encounters a full stop?

You can try this,
\bplay(?:s|ed).*?for\b
Demo
There are some faults in the regex of your script.
playre = re.compile(r'\bplay[s|e][d]?\b.*\bfor\b\s\b')
[s|e] : does not work as an alternation, because [] is a character class that matches exactly one of the listed characters (here s, | or e).
.* : the greedy quantifier (*) matches the longest possible string, so it runs through to the last for.
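The difference between the greedy and the lazy quantifier can be seen on the text from the question. This is only a small sketch; the variable names are made up for illustration.

```python
import re

mytext = ("He played for four hours last night. He plays guitar for the foo "
          "pythers. He won an award for his guitar playing.")

# Greedy: .* runs to the end of the text and backtracks to the LAST "for".
greedy = re.findall(r"\bplay(?:s|ed).*for\b", mytext)

# Lazy: .*? stops at the FIRST "for" after each "plays"/"played".
lazy = re.findall(r"\bplay(?:s|ed).*?for\b", mytext)

print(len(greedy))  # 1
print(lazy)         # ['played for', 'plays guitar for']
```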

Somebody answered that I needed the lazy .*? then deleted their answer. I'm not sure why, because that worked. Hence, the code I'm using now is:
(r'\bplay[s|e][d]?\b.*?\bfor\b\s\b')
@ThmLee I tried your suggestion:
\bplay(s|ed).*?for\b
I'm (clearly) no expert with Regex, but it seemed not to work as well. Instead of outputting the lines "played for" and " plays guitar for" it just outputs "s" and "ed".

You misunderstand the use of square brackets. They create a character class which matches a single character out of the set of characters enumerated between the brackets. So [s|e] matches s or | or e.
Also, the word boundary is simply an assertion. It matches if the previous character was a "word" character and the next one isn't, or vice versa; but it doesn't advance the position within the string. So, for example, \s\bfor\b\s is redundant; we already know that \s matches whitespace (which is non-word) and for consists of word characters. You mean simply \sfor\s because the dropped \b conditions don't change what is being matched.
Try
r'\bplay(?:s|ed)?\s+(?:\w+\s+)??for\s+\w+'
The (?:\w+\s+)?? allows for a single optional word before for. The second question mark makes the quantifier non-greedy, i.e. it matches the shortest possible string which still allows the expression to match, instead of the longest. You will not want to allow unlimited repetitions (because then you'd match e.g. "played another game before he sat down for") but you might consider replacing the ?? with e.g. {0,3}? to allow for up to three words before "for".
We use (?:...) instead of (...) to make the grouping parentheses non-capturing; otherwise, findall will return a list of the captured submatches rather than the entire match.
The if findall: for findall is a minor inefficiency; you just need for match in findall which will simply iterate zero times if there are no matches.
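A short sketch of these points, using the sample sentences from the question (the variable names are illustrative assumptions):

```python
import re

mytext = ("He played for four hours last night. He plays guitar for the foo "
          "pythers. He won an award for his guitar playing.")

# With a capturing group, findall returns only what the group captured.
capturing = re.findall(r"\bplay(s|ed).*?for\b", mytext)

# With non-capturing groups, findall returns the whole match.
pattern = re.compile(r"\bplay(?:s|ed)?\s+(?:\w+\s+)??for\s+\w+")
whole = pattern.findall(mytext)

# At most one word is allowed before "for", so this finds nothing.
other = "He played an eight hour set. It seemed like he went on for ever."
across_sentences = pattern.findall(other)

print(capturing)         # ['ed', 's']
print(whole)             # ['played for four', 'plays guitar for the']
print(across_sentences)  # []

# No "if" needed: the loop simply runs zero times when nothing matches.
for match in across_sentences:
    print(match)
```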
More generally, using regex for higher-level grammatical patterns is very often unsatisfactory. A grammatical parser (even some type of shallow parsing) is better at telling you when some words are constituents of an optional attribute or modifier for a noun phrase, or when "play" should be analyzed as a noun. Consider
He played - or rather, tapped his fingers and hummed - for three minutes.
I play another silly but not completely outrageous role for the third time in a year.
She plays what for many is considered offensive gameplay for the Hawks.
Brett plays the oboe although he thinks it's for wimps.
Some plays are for fools.

Python Regular expression search specific string beside number

I need help here.
I have a list and string.
What I want to do is find all the numbers in the string, and also match the words from the list that appear beside numbers in the string.
import re

# Triple quotes are needed for a string literal that spans several lines;
# the names text and words avoid shadowing the built-ins str and list.
text = """Lily goes to school everyday at 9:00. Her House is near to her school.
Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,
washington. Her school name is kids International."""
words = ['school', 'international', 'house', 'flat no']
I wrote a regex which can pull numbers
x = re.findall(r'([0-9]+[\S]+[0-9]+|[0-9]+)', text, re.I | re.M)
Output I want:
Numbers - ['9:00', '203', '14th']
Flat No.203 (because flat no is beside 203)
14 is also beside a string, but I don't want it because that string is not contained in the list.
But how can I write the regex to satisfy the second condition, that is, to check in the same regex whether flat no is beside 203 or not?

There you go:
(\d{1,2}:\d{1,2})|(?:No\. (\d+))|(\d+\w{2})
Demo on Regex101.com can be found here
What does it do and how does it work?
I use two pipes (|) to gather different number "types" you want:
First alternative (\d{1,2}:\d{1,2}) - captures time using 1-2 digits followed by a colon and another set of 1-2 digits (probably you could go for 2 digits only).
Second alternative (?:No\. (\d+)) - matches the literal "No. " prefix (note the space at the end), and then captures the following number, no matter how long (at least one digit).
The third and last alternative (\d+\w{2}) - simply captures any number of digits (again, at least one) followed by two word characters. You could further improve this part of the regex to match only the st, nd, rd, and th suffixes, but I will leave this up to you.
Also to get rid of further unneeded matches you could use lookarounds, but again - I'll leave this up to you to implement.
General note - rather than using one regex to rule... erm - match them all, you should focus on creating many simple regexes. Not only will this improve legibility, but also maintainability of the regexes. This also allows you to search for timestamps, building numbers and positional numerals separately, easily allowing you to split this information to specific variables.
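The "many simple regexes" idea could be sketched like this. The three patterns and their names are assumptions written for this example, not taken from the answer; they are deliberately stricter than the combined regex (two-digit minutes, explicit ordinal suffixes, case-insensitive "flat no").

```python
import re

text = ("Lily goes to school everyday at 9:00. Her House is near to her school. "
        "Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall, "
        "washington. Her school name is kids International.")

# One small pattern per kind of number (all three are illustrative).
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")
FLAT_NO = re.compile(r"\bflat no\.\s*(\d+)", re.IGNORECASE)
ORDINAL = re.compile(r"\b\d+(?:st|nd|rd|th)\b")

times = TIME.findall(text)
flat_numbers = FLAT_NO.findall(text)  # findall returns the captured digits
ordinals = ORDINAL.findall(text)

print(times)         # ['9:00']
print(flat_numbers)  # ['203']
print(ordinals)      # ['14th']
```

Each result lands in its own variable, so there is no need to work out afterwards which alternative produced a given match.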

Non greedy python regex

I'm trying to work my way through some regular expressions; I'm using python.
My task right now is to scrape newspaper articles and look for instances where people have died. Once I have a relevant article, I'm trying to snag the death count for some other things. I'm trying to come up with a few patterns, but I'm having difficulty with one in particular. Take this sample article section:
SANAA, Oct 21 (Reuters) - Three men thought to be al Qaeda militants
were killed in an apparent U.S. drone attack on a car in Yemen on
Sunday, tribal sources and local officials said.
The code that I'm using to snag the 'three' first does a replace on the entire document, so that the 'three' becomes a '3' before any patterns at all are applied. The pattern relevant to this example is this:
re.compile(r"(\d+)\s(:?men|women|children|people)?.*?(:?were|have been)? killed")
The idea is that this pattern will start with a number, be followed by an optional noun such as one of the ones listed, then have a minimum amount of clutter before finding 'dead' or 'died'. I want to leave room so that this pattern would catch:
3 people have been killed since Sunday
and still catch the instance in the example:
3 men thought to be al qaeda militants were killed
The problem is that the pattern I'm using is collecting the date from the first part of the article, and returning a count of 21. No amount of fiddling so far has enabled me to limit the scope to the digit right beside the word men, followed by the participial phrase, then the relevant 'were killed'.
Any help would be much appreciated. I'm definitely no guru when it comes to RE.

Don't make the men|women|children optional, i.e. take out the question mark after the closing parenthesis. The regex engine will match at the first possible place, regardless of whether repetition operators are greedy or stingy.
Alternatively, or additionally, make the "anything here" pattern only match non-numbers, i.e. replace .*? with \D*?
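Both suggestions can be checked against the sample sentence (with "Three" already replaced by "3", as the question describes). This is a sketch with made-up variable names, and it uses the (?:...) non-capturing syntax rather than the (:?...) from the question.

```python
import re

text = ("SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants "
        "were killed in an apparent U.S. drone attack on a car in Yemen on "
        "Sunday, tribal sources and local officials said.")

# Optional noun: the match may start at the date and skip the noun entirely.
optional_noun = re.compile(
    r"(\d+)\s(?:men|women|children|people)?.*?(?:were|have been)? killed")

# Mandatory noun: the number must be directly followed by one of the nouns.
required_noun = re.compile(
    r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")

# Optional noun, but no digits allowed between the number and "killed".
no_digits_between = re.compile(
    r"(\d+)\s(?:men|women|children|people)?\D*?(?:were|have been)? killed")

print(optional_noun.search(text).group(1))      # 21
print(required_noun.search(text).group(1))      # 3
print(no_digits_between.search(text).group(1))  # 3
```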

This is because you used the quantifier ?, which matches 0 or 1 occurrences of your (:?men|women|children|people) group after the digits. So 21 will match, since it is followed by 0 of them.
Try removing the quantifier after the group, to match exactly one of them:
re.compile(r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")
UPDATE: To use the ? quantifier and still get the required result, you need a negative lookahead to make sure that your digits are not followed by a string containing a hyphen (-), as in your example.
re.compile(r"(\d+)(?!.*?-.*?)\s(?:men|women|children|people)?.*?(?:were|have been)? killed")

You used the wrong syntax, (:?...). You probably wanted (?:...).
Use regex pattern
(\d+).*?\b(?:men|women|children|people|)\b.*?\b(?:were|have been|)\b.*?\bkilled\b
or if just spaces are allowed between those words, then
(\d+)\s+(?:men|women|children|people|)\s+(?:were|have been|)\s+killed\b
