I am developing a web scraper. The main thing I am retrieving is email addresses from the HTML source. I am using the following code:
r = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
On a few websites the email address is in the format abcd[at]gmail.com or abcd(at)gmail.com. I need a generic regex that will retrieve email addresses in any of the three formats abcd[at]gmail.com, abcd(at)gmail.com, or abcd@gmail.com. I tried the following code, but didn't get the expected result. Can anyone help me?
r = re.compile(r"\b[A-Z0-9._%+-]+[@|(at)|[at]][A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Solution: Replace @ by (@|\(at\)|\[at\]) as such:
r = re.compile(r"\b[A-Z0-9._%+-]+(@|\(at\)|\[at\])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
Explanation: In your attempt, you wrote [one|two|three]; you cannot do that. […] is used for single characters or for sets ([a-z] is the same as [abcd…xyz]). You must use (one|two|three) instead. [1]
Also, you attempt to match ( ) and [ ], which are special characters in regex, so they have special functionality. If you want to actually match them (and not use their special functionality), you must remember to escape them by putting a \ in front of them. The same goes for . ? + * etc.
Suggestion: You can also try to match [dot] and (dot) that very same way if you wish so.
Just remember that there are a ton of ways to obfuscate email addresses out there, including some you might not be aware of.
Also, validating email addresses (and thus trying to catch them with regex) can be very tricky:
The actual official REGEX is (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]).
(EDIT: Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html Looks like it could be even worse than the above REGEX!!)
[1] Beware that using (…) will capture its content; if you wish this content not to be captured, you have to use (?:…) instead.
r = re.compile(r"\b[A-Z0-9._%+-]+(?:@|[(\[]at[\])])[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
emailAddresses = r.findall(html)
See demo: https://regex101.com/r/nD5jY4/5#python
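To see the non-capturing variant in action, here is a minimal sketch (the sample html string is made up for illustration):

```python
import re

# Non-capturing group (?:...) so findall returns whole addresses,
# not just the separator that matched.
pattern = re.compile(
    r"\b[A-Z0-9._%+-]+(?:@|[(\[]at[\])])[A-Z0-9.-]+\.[A-Z]{2,6}\b",
    re.IGNORECASE)

html = "Contact: abcd[at]gmail.com, efgh(at)gmail.com, ijkl@gmail.com"
print(pattern.findall(html))
# ['abcd[at]gmail.com', 'efgh(at)gmail.com', 'ijkl@gmail.com']
```

With the capturing form (@|\(at\)|\[at\]), findall would instead return only the separators, which is exactly why footnote [1] matters.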
Related
I have this regex: If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ "/:.#]+settings<\/a>. It works on regexr but not when I am using the re library in Python:
data = "<my text (comes from a file)>"
search = "If you don't want these messages, please [a-zA-Z0-9öäüÖÄÜ<>\n\-=#;&?_ \"/:.#]+settings<\/a>" # this search string comes from a database, so it's not hardcoded into my script
print(re.search(search, data))
Is there something I don't see?
Thank you!
The pattern you are using on regexr contains \- but your example shows \\-, which may give an incorrect regex. (Also add the r in front of the string, as jupiterby said.)
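A quick sketch of why that distinction matters (nothing here is specific to the original pattern; it just shows what Python string literals do to backslashes):

```python
# In a normal string literal, "\\-" is two characters: backslash + dash.
print(len("\\-"))   # 2
# A raw string r"\-" is the same two characters, written without doubling.
print(len(r"\-"))   # 2
# But "\n" in a non-raw string is a single newline character, which would
# silently change a pattern read from a database or pasted from regexr.
print(len("\n"))    # 1
```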
I am trying to write a program that reads text from screenshots and then identifies various PII in it. Using pytesseract to read in the text, I am trying to write regexes for URLs, email IDs, etc. Here is an example of a function which takes in a string and returns True for email IDs and False otherwise:
def email_regex(text):
    pattern = re.compile(r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
    return bool(pattern.match(text))
This function works well for all email IDs in a proper format (abc@xyz.dd), but since the input to the function is text read in by pytesseract, the text is not guaranteed to be properly formatted. My function returns False for abc@xyzdd. I'm running into the same issues with my URL regex, domain name regex, etc. Is there a way to make my regexes more robust to reading errors from pytesseract?
I have tried following the accepted solution to this answer, but that leads to the regex functions returning True for random words as well. Any help to resolve this would be greatly appreciated.
EDIT: Here are my URL and domain regexes, where I'm running into the same problem as with my email regex. Any help with these will be very useful for me.
pattern = re.compile(r'\b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b', re.IGNORECASE)
return pattern.match(text)

def url_regex(text):
    pattern = re.compile(r'(http|https://)?:(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F])+)', re.IGNORECASE)
    return pattern.match(text)
Perhaps add some flags, such as re.IGNORECASE and re.DOTALL for newlines (note that multiple flags must be combined with |, not passed as separate arguments):
# Match email ID:
my_pattern = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$", re.I | re.S)
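As a runnable sketch (the sample addresses are made up), with the flags combined via |:

```python
import re

# re.I | re.S combines the flags into one value; passing them as separate
# positional arguments is a TypeError-prone misuse of re.compile.
my_pattern = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$",
                        re.I | re.S)

print(bool(my_pattern.match("Abc.Def@Example.com")))  # True (thanks to re.I)
# Note this looser pattern makes the dot before the TLD optional,
# so it even accepts the OCR-mangled form from the question:
print(bool(my_pattern.match("abc@xyzdd")))            # True
```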
Match URLs:
https://gist.github.com/gruber/8891611
Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a Reddit bot using PRAW to comb through posts and remove a specific set of characters (Steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regex expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()
So this is just to show that the things I should be using in a filter function is a combination of steamKey15/25/17, and submission.selftext.
So here is the part where I am confused. I can't find a function that works or does what I want. My goal is to remove all the text from submission.selftext (the body of the post) EXCEPT the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go about this? I've looked into re.sub and .translate, but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
Can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
    print(m.group(0))
Also note that a dot . means any char in a regexp. If you want to match only dots, you should escape it as \. instead. You can probably write your regexp like this:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begin or end with a key, or has a key in the middle - that can cause you problems as your 15-char key regexp is contained in the 25-key one! To fix that use negative lookahead/negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
that will only find the keys if there are no extraneous characters before and after them
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.
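A self-contained sketch of the lookaround version (the sample keys are invented):

```python
import re

# The lookbehind/lookahead reject matches glued to extra key characters,
# so the 15-char pattern no longer fires inside a 25-char key.
key_re = re.compile(r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])')

text = "valid: ABCDE-FGHIJ-KLMNO but not inside ABCDE-FGHIJ-KLMNO-PQRST-UVWXY"
print(key_re.findall(text))  # ['ABCDE-FGHIJ-KLMNO']
```

Without the lookarounds, findall would also report the first 17 characters of the longer key as a bogus match.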
So, a couple of things: first, . means any character in regex. I think you know that, but just to be sure. Also, \w\w\w\w\w can be replaced with \w{5}, which specifies 5 alphanumerics. I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        finds_15 = re.findall(steamKey15, submission.selftext)
        finds_25 = re.findall(steamKey25, submission.selftext)
        finds_17 = re.findall(steamKey17, submission.selftext)
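For illustration, here is how the condensed 15-character pattern behaves on a made-up selftext (re.findall returns every occurrence, not just the first):

```python
import re

steamKey15 = r'(?:\w{5}.){2}\w{5}'   # 5 chars + separator, twice, then 5 more

selftext = "keys: ABCDE-FGHIJ-KLMNO and QRSTU.VWXYZ.ABCDE"
print(re.findall(steamKey15, selftext))
# ['ABCDE-FGHIJ-KLMNO', 'QRSTU.VWXYZ.ABCDE']
```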
I want to replace consecutive symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
text = re.sub(r"([^\s\w])(\s*\1)+", r"\1", text)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix)          # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
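Since the single-regex replacement runs into those Python limitations, here is an alternative sketch that sidesteps them: find URLs first with a simple URL pattern, and collapse repeated symbols only in the text between them. The URL pattern and the simpler \1+ repetition (adjacent repeats only, unlike the original (\s*\1)+) are my assumptions, not the original author's code:

```python
import re

# Assumed, deliberately simple URL matcher; adjust to taste.
url_re = re.compile(r'(?:https?|ftp|file)://\S+|www\.\S+', re.I)
# Collapses runs of the same non-word, non-space symbol ("???" -> "?").
dup_re = re.compile(r'([^\s\w])\1+')

def collapse(text):
    parts, last = [], 0
    for m in url_re.finditer(text):
        parts.append(dup_re.sub(r'\1', text[last:m.start()]))  # clean prose
        parts.append(m.group())                                # keep URL as-is
        last = m.end()
    parts.append(dup_re.sub(r'\1', text[last:]))
    return ''.join(parts)

print(collapse("this is a dog??? see http://example.com/this--is-a-page.html"))
# this is a dog? see http://example.com/this--is-a-page.html
```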
So, here's my question:
I have a crawler that goes and downloads web pages and strips those of URLs (for future crawling). My crawler operates from a whitelist of URLs which are specified in regular expressions, so they're along the lines of:
(http://www.example.com/subdirectory/)(.*?)
...which would allow URLs that followed the pattern to be crawled in the future. The problem I'm having is that I'd like to exclude certain characters in URLs, so that (for example) addresses such as:
(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)
...in the case above, as an example, I'd like to be able to exclude URLs that feature ?, #, and = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:
(http://www.example.com/)([^=\?#](.*?))
etc. Any help would be really appreciated!
EDIT: Sorry, I should've mentioned this is written in Python, and I'm normally fairly proficient at regex (although this has me stumped).
EDIT 2: VoDurden's answer (the accepted one below) almost yields the correct result, all it needs is the $ character at the end of the expression and it works perfectly - example:
(http://www.example.com/)([^=\?#]*)$
(http://www.example.com/)([^=?#]*?)
Should do it, this will allow any URL that does not contain the characters you don't want.
It might, however, be a little hard to extend this approach. A better option is to make the system two-tiered, i.e. one set of matching regexes and one set of blocking regexes. Then only URLs that pass both of these will be allowed. I think this solution will be a bit more transparent and flexible.
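A minimal sketch of that two-tier idea (the pattern lists here are illustrative assumptions):

```python
import re

# A URL is crawled only if it matches at least one allow pattern
# and no block pattern.
allow = [re.compile(r'^http://www\.example\.com/subdirectory/')]
block = [re.compile(r'[=?#]')]

def crawlable(url):
    return (any(a.search(url) for a in allow)
            and not any(b.search(url) for b in block))

print(crawlable('http://www.example.com/subdirectory/index.php'))  # True
print(crawlable('http://www.example.com/subdirectory/page?x=1'))   # False
```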
This expression should be what you're looking for:
(http://www.example.com/subdirectory/)([^=?#]*)$
[^=\?#] will match anything except for the characters you specified.
For Example:
http://www.example.com/subdirectory/ Match
http://www.example.com/subdirectory/index.php Match
http://www.example.com/subdirectory/somepage?param=1&param=5#print No Match
http://www.example.com/subdirectory/index.php?param=1 No Match
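A quick runnable check of this pattern with the trailing $ anchor (dots escaped here, which the original answer leaves literal):

```python
import re

allowed_re = re.compile(r'(http://www\.example\.com/subdirectory/)([^=?#]*)$')

for url in ('http://www.example.com/subdirectory/index.php',
            'http://www.example.com/subdirectory/index.php?param=1'):
    # First prints True, second prints False: [^=?#]* cannot cross the "?"
    # and the $ forbids leaving it unmatched.
    print(url, '->', bool(allowed_re.match(url)))
```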
You will need to crawl the pages up to ?param=1&param=5, because param=1 and param=2 could normally give you completely different web pages.
Pick any WordPress website to confirm that.
Try this one; it will try to match everything just before the # char:
(http://www.example.com/)([^#]*?)
I'm not sure what you want. If you want to match anything that doesn't contain any ?, #, or =, then the regex is
([^=?#]*)
As an alternative there's always the urlparse module which is designed for parsing urls.
from urlparse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]
for url in urls:
    # in python 2.5+ you can use urlparse(url).query instead
    if not urlparse(url)[4]:
        print url
Provides the following:
http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php
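For anyone on Python 3, where urlparse moved into urllib.parse, the same filter looks like this (sketch):

```python
from urllib.parse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

# Keep only URLs with an empty query string.
clean = [u for u in urls if not urlparse(u).query]
print(clean)
# ['http://www.example.com/subdirectory/', 'http://www.example.com/subdirectory/index.php']
```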