writing flexible regex expressions - python

I am trying to write a program that reads text from screenshots and then identifies various kinds of PII in it. I use pytesseract to read in the text and am writing regexes for URLs, email IDs, etc. Here is an example of a function that takes a string and returns True for email IDs and False otherwise:
import re

def email_regex(text):
    pattern = re.compile(r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
    return bool(pattern.match(text))
This function works well for all email IDs in the proper format (abc#xyz.dd), but since the input comes from pytesseract, the text is not guaranteed to be properly formatted. For example, my function returns False for abc#xyzdd. I'm running into the same issue with my URL regex, domain name regex, etc. Is there a way to make my regexes more robust to read errors from pytesseract?
I have tried following the accepted solution to this answer, but that leads to the regex functions returning True for random words as well. Any help to resolve this would be greatly appreciated.
EDIT: Here are my URL and domain regexes, where I'm running into the same problem as with my email regex. Any help with these would be very useful.
def domain_regex(text):  # the def line was missing from the post; this name is assumed
    pattern = re.compile(r'\b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b', re.IGNORECASE)
    return pattern.match(text)

def url_regex(text):
    pattern = re.compile(r'(http|https://)?:(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F])+)', re.IGNORECASE)
    return pattern.match(text)

Perhaps add some flags, such as IGNORECASE and DOTALL (so that . also matches newlines). Note that re.compile takes a single flags argument, so flags are combined with |:
# Match email ID:
my_pattern = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]?\w{2,3}$", re.I | re.S)
# Match URLs:
# https://gist.github.com/gruber/8891611
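As a rough illustration of the flag handling and the optional dot, here is a minimal sketch. It assumes the # in the patterns above stands in for @, and the names EMAIL_PATTERN and looks_like_email are made up for this example. Making the dot optional is what lets an OCR-damaged address through, at the cost of also accepting some strings that were never addresses.

```python
import re

# Assumption: "@" is the real separator; the thread shows it as "#".
# Flags are combined with bitwise OR and passed as one argument.
EMAIL_PATTERN = re.compile(
    r"^[a-z0-9]+[._]?[a-z0-9]+@\w+[.]?\w{2,3}$",
    re.I | re.S,
)

def looks_like_email(text):
    """Return True if text matches the relaxed email pattern."""
    return bool(EMAIL_PATTERN.match(text))

print(looks_like_email("abc@xyz.dd"))   # well-formed address
print(looks_like_email("abc@xyzdd"))    # dot lost by OCR, still accepted
print(looks_like_email("hello"))        # plain word, rejected
```

Because the pattern still requires the separator, ordinary words are rejected, which is the behaviour the question asks for.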

Related

how to write a regular expression to return the keyword in the url?

We want to write a regular expression that finds URLs based on a keyword.
For example, when we input 'google', the regular expression should match URLs such as:
https://www.google.com
https://api.google.com/help
https://www.apigoogle.com/example/02.js
https://www.googleapi.com/02/example/02.js
Currently my regex is as follows, where 'sites' is the input value:
^http(s)?://([a-z0-9-]+.)+(" + sites + ").(com|net)/?$
It only matches the first one. How can I finish my regex?
The main purpose is to check whether the keyword is inside the domain part.
^(http\w?.{3}) matches either protocol, http or https, followed by three characters (://)
([^\/]*?google[^\/]*?) checks whether the domain part contains the keyword. To avoid matching beyond the domain, it never matches /
(?=\/|$) the domain part must be followed by the end of the text or by /
Code:
import re
regex = lambda keyword: r"^(http\w?.{3})([^\/]*?%s[^\/]*?)(?=\/|$)"%keyword
text = """
https://www.google.com
https://api.google.com/help
https://www.apigoogle.com/example/02.js
https://www.googleapi.com/02/example/02.js
https://www.abcd.com/red?=www.google.com
https://www.googleapi.com/02/example/03.js
"""
for e in text.split():
    if re.search(regex("google"), e):
        print(e)
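A slightly stricter variation of the same idea, offered as a sketch rather than a drop-in replacement: it spells the protocol out as https?:// and passes the keyword through re.escape so that characters such as . in the keyword are matched literally. The function name domain_contains is hypothetical.

```python
import re

def domain_contains(url, keyword):
    """Return True if keyword occurs in the host part of an http(s) URL."""
    # [^/]* cannot cross a slash, so the keyword has to sit before the path.
    pattern = r"^https?://[^/]*" + re.escape(keyword) + r"[^/]*(?=/|$)"
    return re.search(pattern, url) is not None

urls = [
    "https://www.google.com",
    "https://api.google.com/help",
    "https://www.apigoogle.com/example/02.js",
    "https://www.googleapi.com/02/example/02.js",
    "https://www.abcd.com/red?=www.google.com",
]

matches = [u for u in urls if domain_contains(u, "google")]
for u in matches:
    print(u)
```

The URL that only mentions google in its query string is filtered out, because the keyword must appear before the first / that follows the protocol.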
This should work fine for you.
^((https)\:\/\/)(([a-z0-9])+\.)*(google|apigoogle\.com)
Test

How to filter out specific strings from a string

Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a reddit bot using PRAW to comb through posts and remove a specific set of characters (Steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regex expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()
This is just to show that the things I should be using in a filter function are a combination of steamKey15/25/17 and submission.selftext.
Here is the part where I am confused. I can't find a function that does what I want. My goal is to remove all the text from submission.selftext (the body of the post) except the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go about this? I've looked into re.sub and .translate, but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
Can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
    print(m.group(0))
Also note that a dot . means any char in a regexp. If you want to match only dots, you should use \.. You can probably write your regexp like this instead:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begins or ends with a key, or has a key in the middle. That can cause you problems, as your 15-character key regexp is contained in the 25-character one! To fix that, use negative lookahead and negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
That will only find the keys if there are no extraneous characters before or after them.
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.
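The lookaround and findall advice can be combined roughly as follows. This is a sketch on made-up sample text: the keys are invented, and the 25-character pattern is an assumed extension of the 15-character one from this answer.

```python
import re

# Lookbehind/lookahead reject a match that is glued to more key-like text.
KEY15 = re.compile(r"(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])")
# Assumed 25-character variant: five groups of five, same separators.
KEY25 = re.compile(r"(?<![\w.-])(?:\w{5}[-.]){4}\w{5}(?![\w.-])")

body = "first ABCDE-FGHIJ-KLMNO then AAAAA-BBBBB-CCCCC-DDDDD-EEEEE end"

keys15 = KEY15.findall(body)
keys25 = KEY25.findall(body)
print(keys15)
print(keys25)
```

Without the lookarounds, the 15-character pattern would also report the first three groups of the longer key.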
A couple of things first: . means any character in regex. I think you know that, but just to be sure. Also, \w\w\w\w\w can be replaced with \w{5}, which specifies five word characters (letters, digits, or underscore). I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        finds_15 = re.findall(steamKey15, submission.selftext)
        finds_25 = re.findall(steamKey25, submission.selftext)
        finds_17 = re.findall(steamKey17, submission.selftext)

Search for email address with the pattern [at]/(at) in python

I am developing a web scraper code. The basic thing which I am retrieving is email address from the HTML source. I am using the following code
r = re.compile(r"\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
On a few websites the email address is in the format abcd[at]gmail.com or abcd(at)gmail.com. I need a generic regex that will retrieve email addresses in any of the three formats abcd[at]gmail.com, abcd(at)gmail.com, and abcd#gmail.com. I tried the following code but didn't get the expected result. Can anyone help me?
r = re.compile(r"\b[A-Z0-9._%+-]+[#|(at)|[at]][A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Solution: Replace # by (#|\(at\)|\[at\]) as such:
r = re.compile(r"\b[A-Z0-9._%+-]+(#|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Explanation: In your attempt, you did [one|two|three], you cannot do that. […] is used for single characters or for sets ([a-z] is the same as [abcd…xyz]). You must use (one|two|three) instead. [1]
Also, you attempt to match () and [], which are special characters in regex, so they have special functionality. If you want to match them literally (and not use their special functionality), you must escape them by putting a \ in front of them. The same goes for . ? + * etc.
Suggestion: You can also try to match [dot] and (dot) in the very same way if you wish.
Just remember that there are a ton of ways to obfuscate email addresses out there, including some you might not be aware of.
Also, validating email addresses (and so trying to catch them with regex) can be very tricky:
The actual official REGEX is (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]).
(EDIT: Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html Looks like it could be even worse than the above REGEX!!)
[1] Beware that using (…) will capture its content; if you do not want this content to be captured, use (?:…) instead.
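To make the difference between capturing and non-capturing groups concrete, here is a small sketch. It assumes the # shown in this thread stands in for @, and the sample text is invented. With re.findall, a single capturing group means only the group's content is returned, which is rarely what you want here.

```python
import re

# Assumption: "@" is the real separator; the thread shows it as "#".
CAPTURING = re.compile(
    r"\b[A-Z0-9._%+-]+(@|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
NON_CAPTURING = re.compile(
    r"\b[A-Z0-9._%+-]+(?:@|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)

html = "Contact: abcd[at]gmail.com, efgh(at)gmail.com or ijkl@gmail.com today"

separators = CAPTURING.findall(html)     # only what the group captured
addresses = NON_CAPTURING.findall(html)  # the full matches
print(separators)
print(addresses)
```

If the capturing form is kept, re.finditer with match.group(0) is another way to get the whole address.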
r = re.compile(r"\b[A-Z0-9._%+-]+(?:#|[(\[]at[\])])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
^^^^^^^^^^^^^^^^^^
emailAddresses = r.findall(html)
See demo.
https://regex101.com/r/nD5jY4/5#python

RegEx in Python for WikiMarkup

I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help? Thanks!
Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).
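If only the text between the tags is needed, or a reference spans several lines, a capturing group and re.DOTALL can be added. This is a sketch on an invented string; it does not handle tags with attributes such as a named ref, which would need a looser opening-tag pattern.

```python
import re

# The non-greedy (.*?) stops at the first closing tag;
# DOTALL lets . match newlines inside a reference.
REF_RE = re.compile(r"<ref>(.*?)</ref>", re.DOTALL)

markup = "Text<ref>*first\nnote</ref> more <ref>second</ref>"

contents = REF_RE.findall(markup)
print(contents)
```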
srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
    pass  # Successful match
else:
    pass  # Match attempt failed
Note that, from your post, it is assumed that the * after <ref> always occurs, and that the only variable part is the text after it, in your example "Any_Character_Could_Be_Here".
If this is not the case let me know and I will tweak the expression.

finding email address in a web page using regular expression

I'm a beginner-level student of Python. Here is the code I have to find instances of email addresses from a web page.
page = urllib.request.urlopen("http://website/category")
reg_ex = re.compile(r'[-a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE
m = reg_ex.search_all(page)
m.group()
When I ran it, the Python module said that there is an invalid syntax and it is on the line:
m = reg_ex.search_all(page)
Would anyone tell me why it is invalid?
Consider an alternative:
import re
## Suppose we have a text with many email addresses
text = 'purple alice#google.com, blah monkey bob#abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+#[\w\.-]+', text)
## ['alice#google.com', 'bob#abc.com']
for email in emails:
    # do something with each found email string
    print(email)
Source: https://developers.google.com/edu/python/regular-expressions
Besides, reg_ex has no search_all method. And you should pass in page.read().
You don't have closing ) at this line:
reg_ex = re.compile(r'[a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE)
Plus, your regex is not valid, try this instead:
"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
FYI, validating email using regex is not that trivial, see these threads:
Python check for valid email address?
Using a regular expression to validate an email address
There is no .search_all method in the re module.
Maybe the one you are looking for is .findall.
You can try:
re.findall(r"(\w(?:[-.+]?\w+)+\#(?:[a-zA-Z0-9](?:[-+]?\w+)*\.)+[a-zA-Z]{2,})", text)
I assume text is the text to search; in your case that should be text = page.read().
Or you can compile the regex first:
r = re.compile(r"(\w(?:[-.+]?\w+)+\#(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
results = r.findall(text)
Note:
.findall returns a list of matches
if you need to iterate to get a match object, you can use .finditer
(from the example before)
r = re.compile(r"(\w(?:[-.+]?\w+)+\#(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
for email_match in r.finditer(text):
    email_addr = email_match.group()  # or anything else you need from the match object
Now the problem is what Regex you have to use :)
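Putting these pieces together, a minimal sketch might look like this. It assumes the # in this thread stands in for @, uses a deliberately simple pattern that is illustrative rather than a validator, and introduces hypothetical helper names. In Python 3, page.read() returns bytes, so the content is decoded before matching; the utf-8 default is an assumption about the page.

```python
import re

# Illustrative pattern only: word characters, dots, plus and hyphen
# around "@", with at least one dot in the domain part.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(text):
    """Return every email-like substring of text, in order."""
    return [m.group() for m in EMAIL_RE.finditer(text)]

def emails_from_bytes(raw, encoding="utf-8"):
    """Decode raw page content (e.g. the result of page.read()) and search it."""
    return find_emails(raw.decode(encoding, errors="replace"))

sample = b"purple alice@google.com, blah monkey bob@abc.com blah dishwasher"
found = emails_from_bytes(sample)
print(found)
```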
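The leading - in r'[-a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+' is not the cause: a hyphen placed first in a character class is a literal hyphen, so the pattern itself is valid. Avoid ranges such as [aA-zZ], because A-z also covers the punctuation characters between Z and a; with re.IGNORECASE, [a-z] already matches both cases.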
