Going in reverse with RegEx - python

I'm writing a Python script and I need to extract two pieces of information from the following text:
The user XXXXXXXX (XXXXXXX#XXXXXX.com) was involved in an impossible travel incident. The user connected from two countries within 102 minutes, from these IP addresses: Country1 (111.111.111.111) and Country2 (222.222.222.222). Another irrelevant staff...
I need "Country1" and "Country2". I already extracted the IPs so I can look for them in my expression.
With this regex: (?> )(.*)(?= \(111\.111\.111\.111)
I take all this:
The user XXXXXXXX (XXXXXXX#XXXXXX.com) was involved in an impossible travel incident. The user connected from two countries within 102 minutes, from these IP addresses: Country1
Is there a way to take all the characters going backward and make it stop at the first space, to take just "Country1" ?
Or does anyone know a better way to extract "Country1" and "Country2" with a regex or directly with Python?

You can use
\S+(?=\s*\(\d{1,3}(?:\.\d{1,3}){3}\))
Details:
\S+ - one or more non-whitespace chars
(?=\s*\(\d{1,3}(?:\.\d{1,3}){3}\)) - a positive lookahead that requires the following pattern to appear immediately at the right of the current location:
\s* - zero or more whitespaces
\( - a ( char
\d{1,3}(?:\.\d{1,3}){3} - one to three digits and then three repetitions of . and one to three digits
\) - a ) char.
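As a minimal sketch of how this pattern could be used from Python (the variable names and the shortened sample text are assumptions for illustration, not part of the original answer), re.findall returns both country names directly:

```python
import re

# Shortened copy of the alert text from the question (illustrative sample).
text = (
    "The user connected from two countries within 102 minutes, "
    "from these IP addresses: Country1 (111.111.111.111) and "
    "Country2 (222.222.222.222). Another irrelevant stuff..."
)

# One or more non-whitespace chars, only when followed by "(IPv4-like address)".
pattern = re.compile(r"\S+(?=\s*\(\d{1,3}(?:\.\d{1,3}){3}\))")

countries = pattern.findall(text)
print(countries)  # ['Country1', 'Country2']
```

Because \S+ stops at whitespace, a multi-word country name such as "United States" would yield only its last word; the pattern would need adjusting for that case.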

If your message pattern is always the same you can get the countries like this using Python:
your_string = 'The user XXXXXXXX (XXXXXXX#XXXXXX.com) ...'
your_string = your_string.split(': ')[1].split(' and ')
first_country = your_string[0].split(' (')[0]
second_country = your_string[1].split(' (')[0]

With your shown samples, please try the following regex, written and tested in Python3. I am using the findall function of Python3's re library here.
import re
var="""...""" ##Place your value here.
re.findall(r'(\S+)\s\((?:\d{1,3}\.){3}\d{1,3}\)',var)
['Country1', 'Country2']


Store output after finding matching string using regex and pexpect

I'm writing a Python script and am having trouble capturing the output of a command I send and storing it in a variable. I don't want the entire output of that command; I only want to store the rest of one specific line after a certain word.
To illustrate - say I have a command that outputs hundreds of lines that all represent certain details of a specific product.
Color: Maroon Red
Height: 187cm
Number Of Seats: 6
Number Of Wheels: 4
Material: Aluminum
Brand: Toyota
#and hundreds of more lines...
I want to parse the entire output of the command I sent, which prints the details above, and store only the material of the product in a variable.
Right now I have something like:
child.sendline('some command that lists details')
variable = child.expect(["Material: .*"])
print(variable)
child.expect(prompt)
The sendline and expect prompt parts list the details correctly and all, but I'm having trouble figuring out how to parse the output of that command, look for a part that says "Material: " and only store the Aluminum string in a variable.
So instead of having variable equal to and print a value of 0 which is what currently prints right now, it should instead print the word "Aluminum".
Is there a way to do this using regex? I'm trying to get used to using regex expressions so I would prefer a solution using that but if not, I'd still appreciate any help! I'm also editing my code in vim and using linux if that helps.
You only need to look for the substring Material: . For this you can place the string you want to match (I am using a dot character, which means "match any character") in between a positive lookbehind for Material: and a positive lookahead for \r\n:
(?<=Material:\s).*(?=[\r\n])
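A small sketch of that lookaround pattern applied with Python's re module. The sample output and variable names are assumptions for illustration; the CRLF line endings mimic what a pseudo-terminal typically returns.

```python
import re

# Illustrative sample with CRLF line endings (not real command output).
output = (
    "Color: Maroon Red\r\n"
    "Material: Aluminum\r\n"
    "Brand: Toyota\r\n"
)

match = re.search(r"(?<=Material:\s).*(?=[\r\n])", output)
# "." also matches "\r", so the greedy ".*" can include a trailing
# carriage return; strip() removes it.
material = match.group(0).strip() if match else None
print(material)  # Aluminum
```

The strip() call matters here: without it the stored value would end with a carriage return whenever the line ends in \r\n.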
As you are using Python, you can use a capture group and store the value in for example my_var in the example code.
^Material:\s*(.+)
The pattern matches:
^ Start of string (with re.MULTILINE, the start of each line)
Material:\s* Match Material: and optional whitespace chars
(.+) Capture group 1 match 1+ times any char except a newline
For example
import re
regex = r"^Material:\s*(.+)"
s = ("Color: Maroon Red\n"
"Height: 187cm\n"
"Number Of Seats: 6\n"
"Number Of Wheels: 4\n"
"Material: Aluminum\n"
"Brand: Toyota \n"
"#and hundreds of more lines...")
match = re.search(regex, s, re.MULTILINE)
if match:
    my_var = match.group(1)
    print(my_var)
Output
Aluminum

Extracting the inside of an expression using REGEX

I currently have this regular expression that I use to match the result of an SQL query: [^\\n]+(?=\\r\\n\\r\\n\(1 rows affected\)). However, it is not working as intended....
'\r\n----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------\r\nCS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n'
What I get from the expression above is Date whereas I would want to match CS: GPS on Date. It's fine if there's leading and following spaces... Nothing Python's trim can't handle. How do I change my regular expression so that the match is done properly?
Thanks in advance.
Edit: The Python version I am using is Python 3.6
You get your current match because the character class [^\\n]+ matches 1+ times any char except \ or n.
Then the positive lookahead asserts what is on the right is \r\n\r\n(1 rows affected) which results in matching Date.
See https://regex101.com/r/wDzq8l/1
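To illustrate the difference, here is a short sketch (the sample strings and variable names are made up for the demonstration) comparing a class that excludes a backslash and the letter n with one that excludes only the newline character:

```python
import re

# In a raw string, [^\\n] is a negated class of two chars: a backslash and "n".
literal_class = re.findall(r"[^\\n]+", "on Date.")
print(literal_class)  # ['o', ' Date.']

# [^\n] excludes only the newline character, so "n" is matched normally.
newline_class = re.findall(r"[^\n]+", "on Date.\nnext")
print(newline_class)  # ['on Date.', 'next']
```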
You could use a non greedy .+? in a capturing group and match what follows instead of using a positive lookahead.
In the code use re.DOTALL to let the dot match a newline.
-\\r\\n(.+?) ?\\r\\n\\r\\n\(\d+ rows affected\)
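A hedged sketch of this approach in Python. It assumes the query result contains real carriage-return and line-feed characters, so the pattern is written with single backslashes in a raw string; if your data holds literal backslash sequences instead, keep the doubled backslashes shown above. The sample text is a shortened stand-in for the real output.

```python
import re

# Shortened, illustrative stand-in for the SQL output (real CR/LF characters).
text = "\r\n" + "-" * 20 + "\r\nCS: GPS on Date. \r\n\r\n(1 rows affected)\r\n"

pattern = re.compile(
    r"-\r\n(.+?) ?\r\n\r\n\(\d+ rows affected\)",
    re.DOTALL,  # let "." cross line breaks inside the captured value
)

match = pattern.search(text)
result = match.group(1) if match else None
print(result)  # CS: GPS on Date.
```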
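An expression such as
-{5,}\s*([A-Za-z][^.]+\.)
might also extract that text: it skips a run of five or more dashes and any whitespace, then captures from the first letter up to and including the next period.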
Test
import re
regex = r'-{5,}\s*([A-Za-z][^.]+\.)'
string = '''
----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------
CS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n
'''
print(re.findall(regex, string, re.DOTALL))
Output
['CS: GPS\non Date.']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.

Regex number removal from text

I am trying to clean up text for use in a machine learning application. Basically these are specification documents that are "semi-structured" and I am trying to remove the section number that is messing with NLTK sent_tokenize() function.
Here is a sample of the text I am working with:
and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
I am trying to remove all the section breaks (ex. 2.3.3, 2.4, (b)), but not the date numbers.
Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.
Unfortunately it matches part of the date in the last paragraph (2019. turns into 201), and as a non-expert at regex I really don't know how to fix this.
Thanks for any help!
You may try replacing the following pattern with an empty string:
((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))
import re
# text holds the document; the name avoids shadowing the built-in input()
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)
This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears as the start of a line. It also matches letter section headers as \([a-z]+\).
For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to differentiate the section numbers.
The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored: it matches 0+ digits, a dot and a single digit, or a single digit and a dot.
It also does not take the section markers between parentheses into account.
Assuming that the section breaks are at the start of a line, possibly preceded by spaces or tabs, you could update your pattern with the alternation to:
^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
^ Start of string (with re.MULTILINE, the start of each line)
[\t ]* Match 0+ times a space or tab
(?: Non capturing group
\d+(?:\.\d+)+ Match 1+ digits, then repeat 1+ times a dot and 1+ digits, so at least one dot is required (matches 2.3.3 or 2.4)
|
\([a-z]+\) Match 1+ times a-z between parentheses
) Close non capturing group
For example, using re.MULTILINE, where s is your string:
import re
pattern = r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
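For a fuller picture, here is a self-contained sketch that runs the anchored pattern over a shortened excerpt of the sample text. The excerpt, the variable names, and the extra step that drops the now-empty lines are assumptions added for illustration.

```python
import re

# Shortened excerpt of the specification text from the question.
sample = (
    "aforementioned deposit, and the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "whichever first occurs.\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019.\n"
)

# Section markers only count when they start a line (after optional indent).
section_re = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))", re.MULTILINE)

cleaned = section_re.sub("", sample)
# Removing a marker leaves an empty line behind; drop those as well.
lines = [line for line in cleaned.splitlines() if line.strip()]
print("\n".join(lines))
```

The date at the end of the last line survives because "2019." is not at the start of a line.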

Hyphen character '-' creating issues when using regular expressions for BeautifulSoup

I am learning how to webscrape with Python using a Wikipedia article. I managed to get the data I needed, the tables, by using the .get_text() method on the table rows.
I am cleaning up the data in Pandas, and one of the routines involves getting the date a book or movie was published. There are many ways in which this can occur, such as:
(1986)
(1986-1989)
(1986-present)
Currently, I am using the code below which works on a test sentence:
# get the first columns of row 19 from the table and get its text
test = data_collector[19].find_all('td')[0]
text = test.get_text()
#create and test the pattern
pattern = re.compile(r'\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')
I get the expected output on the test sentence.
['(1857)', '(1987-1868)', '(1678- Present)']
However, when I test it on a particular piece of text from the wiki article 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n', I am able to extract (1892), but NOT (1891-1892).
text = test.get_text()
re.findall(pattern, text)
o/p: ['(1892)']
Even as I type this, I can see that the hyphen that I am using and the one on the text are different. I am sure that this is the issue and was hoping if someone could tell me what this particular symbol is called and how I can actually "type" it using my keyboard.
Thank you!
I suggest enhancing the pattern to search for the most common hyphens, -, – and —, and fix the present pattern from a character class to a char sequence (so as not to match sent with [ Ppresent]*):
re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)
Note that the re.I flag makes the regex match in a case insensitive way.
Details
\( - a (
\d{4} - four digits ({4} is a limiting quantifier that repeats the pattern it modifies four times)
(?:[\s–—-]+(?:\d{4}|present))? - an optional (as there is a ? at the end) non-capturing (due to ?:) group matching 1 or 0 occurrences of
[\s–—-]+ - 1 or more whitespaces, -, — or –
(?:\d{4}|present) - either 4 digits or present
\) - a ) char.
If you plan to match any hyphens use [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\s]+ instead of [\s–—-]+.
Or, to match any 1+ non-word chars at that location, probably, other than ( and ), use [^\w()]+ instead: re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I).
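A brief sketch that checks the suggested pattern against both strings from the question. The dashes are written as \u2013 and \u2014 escapes so they are unambiguous in source code; the variable names are assumptions. The last lines use the standard unicodedata module to look up what the character from the wiki text is called.

```python
import re
import unicodedata

# Same pattern as above, with the en dash and em dash written as escapes.
pattern = re.compile(
    r"\(\d{4}(?:[\s\u2013\u2014-]+(?:\d{4}|present))?\)", re.I
)

wiki_text = (
    "The Adventures of Sherlock Holmes (1891\u20131892) (series), "
    "(1892) (novel), Arthur Conan Doyle\n"
)
test_sentence = (
    "This is Agent (1857), the years were (1987-1868), "
    "which lasted from (1678- Present)"
)

wiki_matches = pattern.findall(wiki_text)
test_matches = pattern.findall(test_sentence)
print(wiki_matches)
print(test_matches)

# Name of the dash used in the wiki text.
dash_name = unicodedata.name("\u2013")
print(dash_name)  # EN DASH
```

Writing the character as "\u2013" in a Python string also sidesteps the question of how to type it on a keyboard.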

Prevent Catastrophic Backtracking in Regex

I have code to scrape a million websites and detect contact info from their homepages.
For some reason, when I run the code it gets stuck and does not proceed after about 60k requests. I am marking the website URLs in my DB as status=done.
I have run the code several times, but it always gets stuck at around 60k requests.
It doesn't get stuck on a particular website.
Here is Regex I am using
emails = re.findall(r'[\w\.-]+@[\w-]+\.[\w\.-]+', lc_body)
mobiles = re.findall(r"(\(?(?<!\d)\d{3}\)?-? *\d{3}-? *-?\d{4})(?!\d)|(?<!\d)(\+\d{11})(?!\d)", lc_body)
abns = re.findall('[a][-\.\s]??[b][-\.\s]??[n][-\:\.\s]?[\:\.\s]?(\d+[\s\-\.]?\d+[\s\-\.]?\d+[\s\-\.]?\d+)', lc_body)
licences = re.findall(r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)", lc_body, re.IGNORECASE)
My thought is that the licences regex is causing the issue. How can I simplify it? How can I remove the backtracking?
I want to find all Licence numbers possible.
It can be License No: 2543 , License: 2543, License # 2543, License #2543, License# 2543 and many other combinations as well.
The issue is caused by the third group: (\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+) - all alternatives start with \s* here. This causes lots of redundant backtracking, as these alternatives can match at the same location in a string. The best practice is to use alternatives in an alternation group that cannot match at the same location.
Now, looking at the strings you need to match, I suggest using
Lic(?:en[cs]e)?(?:\W*No:)?\W*\d+
Make the pattern more specific and linear, get rid of as many alternations as possible, use optional non-capturing groups and character classes.
Details:
Lic(?:en[cs]e)? - Lic followed with 1 or 0 occurrences (the (?:...)? is an optional non-capturing group, since the ? quantifier matches 1 or 0 occurrences of the quantified subpattern) of ence or ense (the character class [cs] matches either c or s and is much more efficient than (c|s))
(?:\W*No:)? - a non-capturing group that matches 1 or 0 occurrences of 0+ non-word chars (with \W*) followed with No: substring
\W* - 0 or more non-word chars
\d+ - 1 or more digits.
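A minimal sketch of how the simplified pattern might be applied. The capturing group around \d+ is an addition (so that findall returns just the numbers), and the sample string and names are made up for illustration.

```python
import re

# Same linear pattern as above, plus a capturing group around the digits
# (an addition for this example) so findall returns only the numbers.
licence_re = re.compile(
    r"Lic(?:en[cs]e)?(?:\W*No:)?\W*(\d+)", re.IGNORECASE
)

sample = (
    "License No: 2543 , License: 2544, License # 2545, "
    "License #2546, License# 2547, Lic 2548"
)

numbers = licence_re.findall(sample)
print(numbers)  # ['2543', '2544', '2545', '2546', '2547', '2548']
```

With re.IGNORECASE the leading Lic can also match inside longer words (for example the end of "public" followed by a number), so a \b before Lic may be worth adding if that produces false positives.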
