Regex pattern only catches first occurrence of US address

I have a transcribed text in which a customer and an agent talk to each other. I want to match addresses. I have a regex pattern:
\d+ (.)?(dr|drive|circle|highway|way|street|st|road|rd|boulevard|blvd|parkway|avenue|ave\b|court|ct|cove\b|crossing|estate|junction|loop|park|\bpike\b|ridge|square|terrace|trail|turnpike|village) .*? \d{4,6}
It catches the first address, but not the second one. How can I catch all addresses instead of only the first occurrence? The second address, 15620 e glenwood..., is appended to the end of the sample text. I provide my pattern and sample text below:
Regex 101

I added ( \w+ )? near the start of your original regex to cover the case where the speaker says the name of the road.
This should do the trick:
\d+ (.)?( \w+ )?(dr|drive|circle|highway|way|street|st|road|rd|boulevard|blvd|parkway|avenue|ave\b|court|ct|cove\b|crossing|estate|junction|loop|park|\bpike\b|ridge|square|terrace|trail|turnpike|village).*?\d{4,6}
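To collect every address rather than only the first, the revised pattern can be applied with re.finditer. This is a minimal sketch: the transcript string is invented for illustration, and the re.IGNORECASE flag is an assumption, since the original does not say which flags were used.

```python
import re

# Revised pattern from the answer above, split across lines for readability.
ADDRESS_RE = re.compile(
    r"\d+ (.)?( \w+ )?(dr|drive|circle|highway|way|street|st|road|rd|boulevard|blvd"
    r"|parkway|avenue|ave\b|court|ct|cove\b|crossing|estate|junction|loop|park"
    r"|\bpike\b|ridge|square|terrace|trail|turnpike|village).*?\d{4,6}",
    re.IGNORECASE,  # assumption: flag not stated in the original
)

# Hypothetical transcript containing two addresses.
transcript = (
    "ship it to 15620 e glenwood drive tampa florida 33601 "
    "and bill 742 n evergreen terrace springfield 62704"
)

# finditer yields one match object per non-overlapping match; group(0) is the
# full matched text (findall would return tuples of the capture groups instead).
addresses = [m.group(0) for m in ADDRESS_RE.finditer(transcript)]
print(addresses)
```

Because the pattern contains capturing groups, re.findall would return tuples of group contents, which is why the sketch reads group(0) from each match instead.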

Going in reverse with RegEx

I'm writing a Python script and I need to extract two pieces of information from the following text:
The user XXXXXXXX (XXXXXXX#XXXXXX.com) was involved in an impossible travel incident. The user connected from two countries within 102 minutes, from these IP addresses: Country1 (111.111.111.111) and Country2 (222.222.222.222). Another irrelevant staff...
I need "Country1" and "Country2". I already extracted the IPs so I can look for them in my expression.
With this regex: (?> )(.*)(?= \(111\.111\.111\.111)
I take all this:
The user XXXXXXXX (XXXXXXX#XXXXXX.com) was involved in an impossible travel incident. The user connected from two countries within 102 minutes, from these IP addresses: Country1
Is there a way to take the characters going backward and stop at the first space, so that I get just "Country1"?
Or does anyone know a better way to extract "Country1" and "Country2" with a regex or directly with Python?

You can use
\S+(?=\s*\(\d{1,3}(?:\.\d{1,3}){3}\))
See the regex demo.
Details:
\S+ - one or more non-whitespace chars
(?=\s*\(\d{1,3}(?:\.\d{1,3}){3}\)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
\s* - zero or more whitespaces
\( - a ( char
\d{1,3}(?:\.\d{1,3}){3} - one to three digits and then three repetitions of . and one to three digits
\) - a ) char.
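A short sketch of this lookahead pattern in Python, run against the message from the question (the trailing text is abbreviated here):

```python
import re

# Sample message from the question (trailing text abbreviated).
message = (
    "The user XXXXXXXX (XXXXXXX#XXXXXX.com) was involved in an impossible travel "
    "incident. The user connected from two countries within 102 minutes, from "
    "these IP addresses: Country1 (111.111.111.111) and Country2 (222.222.222.222)."
)

# \S+ grabs a run of non-whitespace characters, but only where the lookahead
# sees optional whitespace followed by a parenthesised dotted quad.
COUNTRY_RE = re.compile(r"\S+(?=\s*\(\d{1,3}(?:\.\d{1,3}){3}\))")

countries = COUNTRY_RE.findall(message)
print(countries)  # ['Country1', 'Country2']
```

Since \S+ stops at whitespace, a country name made of several words would yield only its last word with this pattern.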

If your message pattern is always the same you can get the countries like this using Python:
your_string = 'The user XXXXXXXX (XXXXXXX#XXXXXX.com) ...'
# Keep the text after the first ': ' and split it at ' and '
parts = your_string.split(': ')[1].split(' and ')
# The country name is whatever precedes ' (' in each part
first_country = parts[0].split(' (')[0]
second_country = parts[1].split(' (')[0]

With your shown samples, please try the following regex, written and tested in Python3. I am using Python3's re library and its findall function here.
import re
var="""...""" ##Place your value here.
re.findall(r'(\S+)\s\((?:\d{1,3}\.){3}\d{1,3}\)',var)
['Country1', 'Country2']
Here is the online demo for the regex used above.

Multiline regex in pdf file

I am interested in extracting some information from some PDF files that look like this. I only need the information on page 2 and later, which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means each entry always starts with a number, a dot and a country, and finishes with text in brackets, which may also continue onto the next line.
My implementation in Python is the following:
Use pdfminer's extract_text function to get the whole text.
Then use re.findall on the whole text with the regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ and the re.MULTILINE option.
I have noticed that this extracts the first line of every paragraph that I am interested in, but I cannot find a way to grab everything up to the end of the paragraph, which is the bracketed text (.*).
I was wondering if anyone could help with this. I was hoping to match this with a single regex. Otherwise I might split the text by line and iterate through each one.
Thanks in advance.

You could update the pattern by using a negated character class that matches up to the first occurrence of :, and then match at least the word on after it.
To match all the following lines, you can match a newline and use a negative lookahead to assert that the next line does not contain only spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of a line (with re.MULTILINE)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, a dot, a whitespace char, (u) and another whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, and assert not only spaces followed by a newline
)* Close non capture group and optionally repeat
Regex demo
For example
import re
# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
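To see how the repeated newline group extends a match up to the next blank line, here is a small self-contained sketch. The extracted_text value is a made-up stand-in for the pdfminer output, which is not shown in the question.

```python
import re

PATTERN = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

# Hypothetical stand-in for the text returned by pdfminer's extract_text.
extracted_text = (
    "1. (U) France: On 3 March officials met in Paris.\n"
    "(Source: press\n"
    "release)\n"
    "\n"
    "2. (U) Spain: On 12 April talks resumed. (Source: radio)"
)

# re.M lets ^ match at each line start; re.I lets (u) and on match (U) and On.
paragraphs = re.findall(PATTERN, extracted_text, re.M | re.I)
for paragraph in paragraphs:
    print(repr(paragraph))
```

The first match spans three lines and stops before the blank line, because the negative lookahead refuses to consume a newline that is followed by an empty (or whitespace-only) line.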

Regex number removal from text

I am trying to clean up text for use in a machine learning application. Basically these are specification documents that are "semi-structured", and I am trying to remove the section numbers that are interfering with NLTK's sent_tokenize() function.
Here is a sample of the text I am working with:
and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
I am trying to remove all the section markers (e.g. 2.3.3, 2.4, (b)), but not the numbers in dates.
Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.
Unfortunately it matches part of the date in the last paragraph (2019. turns into 201), and I really don't know how to fix this, being a non-expert at regex.
Thanks for any help!

You may try replacing the following pattern with an empty string:
((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))
import re
# text holds the document as a single string
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)
This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears at the start of a line. It also matches letter section headers as \([a-z]+\).
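As a quick check, the substitution can be run on an abridged copy of the sample from the question. This is only a sketch; the text variable name and the shortened sample are choices made here for illustration.

```python
import re

# Abridged from the sample in the question.
text = (
    "aforementioned deposit, and the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "whichever first occurs.\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019."
)

# The lookbehinds only allow a match at the very start of the string or
# directly after a newline, so mid-line numbers such as "2019." are left alone.
pattern = r"((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))"
output = re.sub(pattern, "", text)
print(output)
```

The removed markers leave empty lines behind, and any line that happens to begin with a plain number would lose that number too, since \d+(?:\.\d+)* also matches digits without a dot.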

For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to differentiate the section markers.

The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored and will match either 0+ digits, a dot and a single digit, or a single digit and a dot.
It does not take the markers between parentheses into account.
Assuming that the section breaks are at the start of a line and might be preceded by spaces or tabs, you could update your pattern with the alternation to:
^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
^ Start of a line (with re.MULTILINE)
[\t ]* Match 0+ times a space or tab
(?: Non capturing group
\d+(?:\.\d+)+ Match 1+ digits and repeat 1+ times a dot and 1+ digits to match at least a single dot to match 2.3.3 or 2.4
|
\([a-z]+\) Match 1+ times a-z between parenthesis
) Close non capturing group
Regex demo | Python demo
For example, using re.MULTILINE, where s is your string:
import re
pattern = r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
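The following sketch runs the anchored pattern, including the optional leading spaces or tabs, on a small invented string. The sample lines are assumptions chosen to show that a line starting with a plain number and a mid-line date are both left untouched.

```python
import re

# Hypothetical sample: an indented section number, a lettered marker,
# and a line that starts with a plain number ("15 days").
s = "  2.3.3\n(b)\nuntil thirty-five days\n2.4\nAGREEMENT\n15 days before October 15, 2019."

# (?:\.\d+)+ requires at least one ".digits" part, so "15" alone never matches.
pattern = r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
print(result)
```

Passing flags by keyword avoids confusing the flags argument with the positional count argument of re.sub.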

Regex Pattern For Finding String Before First Dot in Python

I need a regex pattern to grab the string before the first dot:
google.com.com
yahoo.com
192.168.1.4
I need a regex that gives google and yahoo, but my pattern grabs IP addresses too. My regex is r'(.*)\.(.*)'
Any advice would be appreciated.

I'm assuming that the text you provided is one string with newlines. In that case, this would be a quick solution to pull the results into a list.
re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)
Here text is the input string. The re.MULTILINE flag simplifies things by allowing '^' to match at the beginning of each line.
This results in ['google', 'yahoo'].
For an explanation of the regex, see the verbose version below (also useful for documentation purposes).
re.findall(r'''
^ # beginning of each line (multiline mode)
\s* # zero or more whitespace characters
([a-zA-Z]+) # captures one or more letters a-z, upper or lower case
\. # matches '.'
''', text, re.MULTILINE | re.VERBOSE)

I wasn't sure whether the spaces at the beginning of the rows have any meaning, so I took them into account in this regex:
(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])
It matches substrings right after either the beginning of the row, or a space, that are followed by a dot and then a letter (as opposed to a digit).
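A minimal sketch of this lookaround approach in Python. The dot inside the lookahead is written as \. here so that it matches only a literal dot, as the description intends; the indentation on the second row is an assumption.

```python
import re

# The dot in the lookahead is escaped so that it matches a literal "."
LABEL_RE = re.compile(r"(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])")

# Sample rows from the question; the leading spaces on the second row are assumed.
rows = "google.com.com\n  yahoo.com\n192.168.1.4"

labels = LABEL_RE.findall(rows)
print(labels)  # ['google', 'yahoo']
```

No re.MULTILINE flag is needed in this sketch, because a newline counts as whitespace and is therefore covered by the (?<=\s) lookbehind.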

This Regex is validating a URL and only this URL wrongly. Why?

&copy 2014 Fairfax New Zealand Limited<br/>
Privacy<!-- |
The above is the offending section in my HTML document.
Below is my regex. It works on every other URL in my document except this one.
import re
from collections import defaultdict
# lines holds the HTML document as a string
urliter = re.finditer(r'(http://|https://)([\w]+\.[\w\.]+\/?)([\w\/\.]+")', lines)
urlMatches = defaultdict(list)
for match in urliter:
    urlMatches[match.group(2)].append(match.group())
When I view the output, for some reason www.fairfaxmedia.co.nz loses the z at the end, so group(2) only shows www.fairfaxmedia.co.n.
I can't figure out why this would be.
Also, question #2: how would I search only for URLs in quotation marks, but leave the quotation marks out of the match?

Your regex uses capturing groups:
(http://|https://) matches (and captures in group 1) the http part
([\w]+\.[\w\.]+\/?) captures in the second group
([\w\/\.]+") captures in the third group
Since you put a + in ([\w\/\.]+"), the character class [\w\/\.] must match at least one character. This means that in http://www.fairfaxmedia.co.nz" the last group has to match at least z".
Hence, the z cannot be in the second group (which is the one you're reading with group(2)), illustration here.
If you want to simply separate the domain name from the rest of your URL, you can tweak your regex to:
"(https?://(\w+\.[\w.]+)(/?[\w/.-]*))"
The whole URL (without quotes) is in capturing group 1, the domain name in capturing group 2, the rest in capturing group 3: see demo here.
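The difference between the two patterns can be reproduced with a short sketch. The HTML fragment is hypothetical; only the URL itself comes from the question.

```python
import re

# Hypothetical HTML fragment; only the URL itself comes from the question.
html = '<a href="http://www.fairfaxmedia.co.nz">Privacy</a>'

original = re.compile(r'(http://|https://)([\w]+\.[\w\.]+\/?)([\w\/\.]+")')
tweaked = re.compile(r'"(https?://(\w+\.[\w.]+)(/?[\w/.-]*))"')

old = original.search(html)
new = tweaked.search(html)

# Group 3 of the original must consume at least one character before the
# quote, so backtracking hands it the final "z" from group 2.
print(old.group(2), old.group(3))
# In the tweaked pattern group 3 may be empty, so group 2 keeps the full domain.
print(new.group(1), new.group(2))
```

In the tweaked pattern the quotes sit outside group 1, so group(1) is the URL without quotes even though the overall match still includes them.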

To search for text in quotation marks while leaving the quotation marks out of the match, you can use lookaround assertions.
For example (core regexp taken from Robin's answer):
(?<=\")(https?://(\w+\.[\w.]+)(/?[\w\/\.]*))(?=\")
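A brief sketch of the lookaround version in Python, assuming an input string invented here that contains one quoted and one unquoted URL:

```python
import re

# Lookbehind and lookahead require the surrounding quotes without consuming them.
QUOTED_URL_RE = re.compile(r'(?<=\")(https?://(\w+\.[\w.]+)(/?[\w\/\.]*))(?=\")')

# Hypothetical input: one quoted URL and one unquoted URL.
text = 'see "http://www.fairfaxmedia.co.nz" and http://unquoted.example.org/page'

# group(0) is the whole match; the quotes are asserted but not part of it.
urls = [m.group(0) for m in QUOTED_URL_RE.finditer(text)]
print(urls)  # ['http://www.fairfaxmedia.co.nz']
```

The unquoted URL is skipped because the lookbehind fails when the preceding character is not a quotation mark.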
