Prevent Catastrophic Backtracking in Regex - python

I have code that scrapes a million websites and detects contact info on their homepages.
For some reason, when I run the code, it gets stuck and stops making progress after about 60k requests. I am marking the website URLs in my DB as status=done.
I have run the code several times, but it always gets stuck at around 60k requests.
It doesn't get stuck on a particular website.
Here are the regexes I am using:
emails = re.findall(r'[\w\.-]+@[\w-]+\.[\w\.-]+', lc_body)
mobiles = re.findall(r"(\(?(?<!\d)\d{3}\)?-? *\d{3}-? *-?\d{4})(?!\d)|(?<!\d)(\+\d{11})(?!\d)", lc_body)
abns = re.findall('[a][-\.\s]??[b][-\.\s]??[n][-\:\.\s]?[\:\.\s]?(\d+[\s\-\.]?\d+[\s\-\.]?\d+[\s\-\.]?\d+)', lc_body)
licences = re.findall(r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)", lc_body, re.IGNORECASE)
My guess is that the licences regex is causing the issue. How can I simplify it? How can I avoid the backtracking?
I want to find all possible licence numbers.
The text can be License No: 2543, License: 2543, License # 2543, License #2543, License# 2543, and many other combinations.

The issue is caused by the third group: (\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+) - all alternatives start with \s* here. This causes a lot of redundant backtracking, because several alternatives can match at the same location in the string. The best practice is to write alternation groups whose alternatives cannot match at the same location.
Now, looking at the strings you need to match, I suggest using
Lic(?:en[cs]e)?(?:\W*No:)?\W*\d+
See the regex demo
Make the pattern more specific and linear, get rid of as many alternations as possible, use optional non-capturing groups and character classes.
Details:
Lic(?:en[cs]e)? - Lic followed by 1 or 0 occurrences (the (?:...)? is an optional non-capturing group, since the ? quantifier matches 1 or 0 occurrences of the quantified subpattern) of ence or ense (the character class [cs] matches either c or s and is much more efficient than (c|s))
(?:\W*No:)? - a non-capturing group that matches 1 or 0 occurrences of 0+ non-word chars (with \W*) followed with No: substring
\W* - 0 or more non-word chars
\d+ - 1 or more digits.
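As a quick check, the suggested pattern can be run against the sample strings from the question. This is only a minimal sketch: the capturing group around \d+ is an addition made here for illustration, re.IGNORECASE is carried over from the question's original call, and LICENCE_RE and samples are hypothetical names.

```python
import re

# The answer's pattern, with a capturing group added around the digits
# (an assumption for this sketch) so findall returns only the number.
LICENCE_RE = re.compile(r"Lic(?:en[cs]e)?(?:\W*No:)?\W*(\d+)", re.IGNORECASE)

samples = [
    "License No: 2543",
    "License: 2543",
    "License # 2543",
    "License #2543",
    "License# 2543",
]

# Map each sample string to the list of licence numbers found in it.
results = {s: LICENCE_RE.findall(s) for s in samples}
for text, found in results.items():
    print(f"{text!r} -> {found}")
```

Every sample yields the number 2543. Since the pattern has no alternation whose branches all begin with the same subpattern, the engine has far fewer paths to retry when a match attempt fails.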

Related

Python regex conditional, don't match if

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that strips out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
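A minimal sketch of how the lookahead behaves in Python, assuming the pattern is applied to the raw identifiers before hyphens are stripped (otherwise AB1201-02 would already have become AB120102). ID_RE and classify are hypothetical names used only for this illustration.

```python
import re

# Pattern from the answer above: the negative lookahead rejects a
# hyphen followed by a digit right after the base identifier.
ID_RE = re.compile(r"^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?")

def classify(identifier):
    """Return the base identifier, or None if it should be flagged."""
    match = ID_RE.match(identifier)
    return match.group(1) if match else None

for raw in ["AB1201", "AB1201-T", "AB1201-TER", "AB1201-02"]:
    print(raw, "->", classify(raw))
```

An identifier for which classify returns None can then be raised as an exception or logged, depending on how the surrounding code handles bad input.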
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.

Regex number removal from text

I am trying to clean up text for use in a machine learning application. Basically these are specification documents that are "semi-structured" and I am trying to remove the section number that is messing with NLTK sent_tokenize() function.
Here is a sample of the text I am working with:
and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
I am trying to remove all the section breaks (ex. 2.3.3, 2.4, (b)), but not the date numbers.
Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.
Unfortunately, it also matches part of the date in the last paragraph (2019. turns into 201), and as a non-expert at regex I really don't know how to fix this.
Thanks for any help!
You may try replacing the following pattern with empty string
((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))
import re
# text holds the specification document as a single string
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)
This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears as the start of a line. It also matches letter section headers as \([a-z]+\).
For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to differentiate the section.
The pattern you tried [0-9]*\.[0-9]|[0-9]\. is not anchored and will match 0+ digits, a dot and single digit or | a single digit and a dot
It does not take the match between parenthesis into account.
Assuming that the section breaks are at the start of the string and perhaps might be preceded with spaces or tabs, you could update your pattern with the alternation to:
^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
^ Start of string
[\t ]* Match 0+ times a space or tab
(?: Non capturing group
\d+(?:\.\d+)+ Match 1+ digits and repeat 1+ times a dot and 1+ digits to match at least a single dot to match 2.3.3 or 2.4
|
\([a-z]+\) Match 1+ times a-z between parenthesis
) Close non capturing group
Regex demo | Python demo
For example, using re.MULTILINE, where s is your string:
pattern = r"^(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
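To see the effect on the text from the question, here is a small self-contained sketch. The sample string is an abridged copy of the question's excerpt, and SECTION_RE, sample and cleaned are hypothetical names.

```python
import re

# Anchored pattern from the answer above; re.MULTILINE makes ^ match
# at the start of every line, not just at the start of the string.
SECTION_RE = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))", re.MULTILINE)

sample = (
    "aforementioned deposit, and the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019.\n"
)

# Section markers are removed; the date at the end of the last line
# is untouched because it does not start a line.
cleaned = SECTION_RE.sub("", sample)
print(cleaned)
```

The substitution leaves empty lines where the section markers were; those can be dropped in a later step if the tokenizer needs it.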

Hyphen character '-' creating issues when using regular expressions for BeautifulSoup

I am learning how to web scrape with Python using a Wikipedia article. I managed to get the data I needed, the tables, by using the .get_text() method on the table rows.
I am cleaning up the data in Pandas, and one of the routines involves getting the date a book or movie was published. There are many ways in which this can occur, such as:
(1986)
(1986-1989)
(1986-present)
Currently, I am using the code below which works on a test sentence:
# get the first columns of row 19 from the table and get its text
test = data_collector[19].find_all('td')[0]
text = test.get_text()
#create and test the pattern
pattern = re.compile('\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')
I get the expected output on the test sentence.
['(1857)', '(1987-1868)', '(1678- Present)']
However, when I test it on a particular piece of text from the wiki article 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n', I am able to extract (1892), but NOT (1891-1892).
text = test.get_text()
re.findall(pattern, text)
o/p: ['(1892)']
Even as I type this, I can see that the hyphen I am using and the one in the text are different. I am sure that this is the issue, and I was hoping someone could tell me what this particular symbol is called and how I can actually "type" it using my keyboard.
Thank you!
I suggest enhancing the pattern to search for the most common hyphens, -, – and —, and fix the present pattern from a character class to a char sequence (so as not to match sent with [ Ppresent]*):
re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)
See the regex demo. Note that re.I flag will make the regex match in a case insensitive way.
Details
\( - a (
\d{4} - four digits ({4} is a limiting quantifier that repeats the pattern it modifies four times)
(?:[\s–—-]+(?:\d{4}|present))? - an optional (as there is a ? at the end) non-capturing (due to ?:) group matching 1 or 0 occurrences of
[\s–—-]+ - 1 or more whitespaces, -, — or –
(?:\d{4}|present) - either 4 digits or present
\) - a ) char.
If you plan to match any hyphens use [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\s]+ instead of [\s–—-]+.
Or, to match any 1+ non-word chars at that location, probably, other than ( and ), use [^\w()]+ instead: re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I).
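A short sketch that applies the first suggested pattern to both strings from the question. The variable names are hypothetical, and the en dash in the Wikipedia title is written as the escape \u2013 so it is easy to tell apart from the ASCII hyphen.

```python
import re

# Pattern from the answer: accepts whitespace, hyphen, en dash and em dash
# between the two years, and matches "present" case-insensitively.
YEARS_RE = re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

test_sentence = ('This is Agent (1857), the years were (1987-1868), '
                 'which lasted from (1678- Present)')
# \u2013 is the en dash used in the Wikipedia text.
wiki_text = ('The Adventures of Sherlock Holmes (1891\u20131892) (series), '
             '(1892) (novel), Arthur Conan Doyle\n')

from_test = YEARS_RE.findall(test_sentence)
from_wiki = YEARS_RE.findall(wiki_text)
print(from_test)
print(from_wiki)
```

Both the hyphenated ranges in the test sentence and the en-dash range in the Wikipedia text are found, while (series) and (novel) are skipped because they do not start with four digits.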

RegEx to Capture Two Parts of String

I'm scraping some data. One of the data points is tournament prize pools. There are many different currencies in the data. I'd like to extract the amount and currency from each value, so that I can use Google to convert these to a base currency. However, it's been a while since I've used regular expressions, so I'm rusty to say the least. Possible formats of the data are as follows:
$534
$22,136.20
3,200,000 Ft HUF
12,500 kr DKK
50,000 kr SEK
$3,800 AUD
$10,000 NZD
€4,500 EUR
¥100,000 CNY
₹7,000,000 INR
R$39,000 BRL
Below is the first regular expression I came up with.
[0-9,.]+(.+)[A-Z]{3}
But that obviously doesn't capture the amount and currency, so I changed it.
([0-9,.]+).+([A-Z]{3})
However, there are issues with this regular expression that I can't figure out.
([0-9,.]+) by itself works fine to capture just the amount.
When I add .+ to that expression, for some reason it stops capturing the trailing 4 and 0 in the first and second test cases respectively. Why?
Then when I add ([A-Z]{3}), it seems to work perfectly for all of the test cases, but obviously selects nothing in the first two.
So I changed it to ([A-Z]{0,3}), which seems to break everything.
What's happening? How can I change the expression so that it works?
This is where I'm at: ([0-9,.]+)((?:.+)([A-Z]{3}))?
This should work:
([0-9,.]+).*?([A-Z]{3})?$
A few changes I made:
I changed the .+ to .*? because there isn't always something after the number (like the first two cases). I used lazy matching here because otherwise it would match everything till the end.
I made group 2 optional with a ? because there isn't always a currency (first 2 cases)
I added an end of line anchor $ to make the lazy .*? match something instead of nothing.
If you don't know what "lazy" means in this context, see this post.
Demo
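The lost trailing digit from the question, and the effect of the fix, can be reproduced with a few lines. This is a sketch with hypothetical names (FIXED_RE, parse); the input strings are taken from the question's list.

```python
import re

# .+ must consume at least one character, so the greedy [0-9,.]+ has to
# give back its last character for the overall match to succeed.
greedy_amount = re.search(r"([0-9,.]+).+", "$534").group(1)
print(greedy_amount)

# Pattern from the answer above: lazy .*?, optional currency, end anchor.
FIXED_RE = re.compile(r"([0-9,.]+).*?([A-Z]{3})?$")

def parse(value):
    """Return (amount, currency) as strings; currency may be None."""
    match = FIXED_RE.search(value)
    return match.groups() if match else None

for value in ["$534", "$22,136.20", "12,500 kr DKK", "R$39,000 BRL"]:
    print(value, "->", parse(value))
```

With the original .+ the first group keeps only 53, because the final 4 is handed to .+. With .*? the group keeps the full amount, and the $ anchor forces the lazy part to stretch up to the currency code when there is one.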
For the example data, you could use an optional non capturing group to match the space and the characters before the currency:
([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?
Regex demo
That will match
( Capture group
[0-9,.]+ match 1+ times what is listed in the character class
) Close capture group
(?: Non capturing group
(?: [A-Za-z]+)? Optional group to match a space and 1+ times a-zA-Z
([A-Z]{3}) Match a space (before the group), then capture 3 uppercase chars
)? Close non capturing group and make it optional
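A possible way to use this pattern in Python, sketched under the assumption that commas are always thousands separators and the dot is the decimal point (true for the sample data, not guaranteed in general). PRIZE_RE and split_prize are hypothetical names.

```python
import re

# Pattern from the answer above: amount, then an optional local unit
# (e.g. "Ft", "kr") and an optional three-letter currency code.
PRIZE_RE = re.compile(r"([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?")

def split_prize(value):
    """Return (amount as float, currency code or None), or None if no amount."""
    match = PRIZE_RE.search(value)
    if match is None:
        return None
    # Assumption: commas are thousands separators, so they are dropped.
    amount = float(match.group(1).replace(",", ""))
    return amount, match.group(2)

for value in ["$534", "$22,136.20", "3,200,000 Ft HUF", "€4,500 EUR"]:
    print(value, "->", split_prize(value))
```

Values without a currency code come back with None in the second position, so the caller can decide on a default (for example, treating a bare $ as USD).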

Python: Extracting URLs using regex or other means

I’m stumped on a problem. I have a large data frame where two of the columns are like this:
pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'], ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
What I’m trying to do is keep only the URL that includes the word “twitter” in each cell and remove the rest. The pattern is that the URLs I want always include the word “twitter” and end with “/” + a one-digit number. In cases where there are two identical URLs in the same cell, only one should remain. Like this:
Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
Test2
I’m new to Python, and after a lot of googling I’ve started to understand that something called regex is the answer, but that is as far as I’ve got. One of the posts here on Stack Overflow led me to regex101.com, and after playing around, this is what I’ve come up with; it doesn't work:
r'^[https]+(:)(//)(.*?)(/)(\d)'
Can anyone tell me how to solve this problem?
Thanks in advance.
Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.
Your current pattern is incorrect because:
^ Matches the following pattern at the start of string.
[https]+ This is a character set, meaning it will match h, s, ps, or any other combination of one or more of the letters inside the [] brackets, not just the strings http and https, which is what you are after.
(:) You don't need to put this : in a capturing group here.
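(//) In Python, / does not need to be escaped; writing \/ is harmless and is only required in regex flavors that use / as a pattern delimiter. No need for a capturing group here either.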
(.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
(/) As discussed above.
(\d) Matches and captures a digit. The capturing group here is also redundant for your task.
You may use the following expression:
https?:\/\/twitter\.com[^,]+(?<=\/\d$)
https? Matches literal substrings http or https.
:\/\/twitter\.com Matches literal substring ://twitter.com.
[^,]+ Anything that is not a comma, one or more.
(?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.
Regex demo here.
Python demo:
import pandas as pd
df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)
Prints:
ID URLs
0 a https://twitter.com/dog_rates/status/890971913173991426/photo/1
1 b https://twitter.com/dog_rates/status/890971913173991426/photo/1
2 c https://twitter.com/dog_rates/status/890971913173991430/video/1
