Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com shouldn't be matched, but google.com should. However, this gets complicated with ccTLDs like .co.uk. Does anyone know a solution? Thanks in advance.
EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.
It sounds like you are looking for the information available through the Public Suffix List project.
A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.
There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
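For example, in Python the tldextract library is one that wraps the Public Suffix List; a quick sketch:

import tldextract  # pip install tldextract; uses the Public Suffix List under the hood

ext = tldextract.extract("john.doe.google.co.uk")
print(ext.subdomain)          # john.doe
print(ext.registered_domain)  # google.co.uk

# A name is "atomic" exactly when it equals its registered domain:
def is_atomic(domain):
    return tldextract.extract(domain).registered_domain == domain

print(is_atomic("google.com"))       # True
print(is_atomic("docs.google.com"))  # False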
Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names down to only first-class domains, e.g. google.com, amazon.co.uk.
First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.
suffixes = parse_suffix_list("suffix_list.txt")
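For completeness, a minimal parse_suffix_list might look like this (a sketch: it keeps every non-empty, non-comment line and, for brevity, ignores the list's wildcard "*." and exception "!" rules):

def parse_suffix_list(path):
    # Comment lines in the Public Suffix List start with "//"
    suffixes = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("//"):
                suffixes.append(line)
    return suffixes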
Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:
def is_domain(d):
    for suffix in suffixes:
        # Require a dot before the suffix so e.g. "notcom" doesn't match "com"
        if d.endswith('.' + suffix):
            # Get the base domain name without the suffix (and its dot)
            base_name = d[:-(len(suffix) + 1)]
            # If it still contains '.', it's a subdomain.
            if '.' not in base_name:
                return True
    # If we get here, no matches were found
    return False
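Filtering a list of names then becomes (assuming suffixes contains "com" and "co.uk"):

domains = ["google.com", "docs.google.com", "amazon.co.uk", "john.doe.google.co.uk"]
print([d for d in domains if is_domain(d)])
# ['google.com', 'amazon.co.uk']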
I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):
tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i
I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e. you could always construct a subdomain that looks like a TLD if you knew how the regex worked).
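For the Pythonistas, an equivalent sketch (alternation truncated to a few TLDs for illustration):

import re

tld_alternation = '|'.join([r'\.com', r'\.co\.uk', r'\.eu', r'\.org'])
regex = re.compile(
    r'^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(' + tld_alternation + r')$',
    re.IGNORECASE,
)

print(bool(regex.match('google.com')))       # True
print(bool(regex.match('docs.google.com')))  # False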
I want to extract, for example, 2 entities from a sentence, e.g.:
str1 = 'i am tom and i have a car'
I want to extract the word 'tom' or 'jack' as name if it exists.
I also want to extract the word 'car' or 'bike' as property if it exists.
Now I can simply write 2 regular expressions:
re.search(r"(?P<name>tom|jack)", s).group('name')
re.search(r"(?P<property>car|bike)", s).group('property')
But I wonder if I can combine these two together.
The problem is that I can't know the order of name and property in advance. So the following code
re.search(r"(?P<name>tom|jim).*(?P<property>car|bike)", s)
does not work for:
str2 = 'i have a car and i am tom'
I tried to simply combine two order situation
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property>car|bike).*(?P<name>tom|jack)))", s2)
it gives me "redefinition of group name" error unless I changed to
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property2>car|bike).*(?P<name2>tom|jack)))", s2)
Question
How can I write a regular expression to extract tom/jack as name and car/bike as property without considering the order?
Moreover
I don't want to simply list all the possible orders, because there would be too many combinations if I want to extract n kinds of entities.
Yes, it's possible, but only within lookarounds; otherwise characters are consumed and the engine's pointer doesn't go back for another look-up.
\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))
Live demo
Every pattern in a regex must match for the overall match to succeed. If some patterns are not mandatory, make them optional.
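Applied to both orderings, a quick sketch:

import re

# Each anchored lookahead scans the whole string without consuming it,
# so the relative order of name and property doesn't matter.
pattern = re.compile(r'\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))')

for s in ('i am tom and i have a car', 'i have a car and i am tom'):
    m = pattern.search(s)
    print(m.group('name'), m.group('property'))
# tom car
# tom car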
My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it, and I have found that when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
I've put a space after every one of them because this happens after the regular strip and text processing (so I'm only working with parts of a URL separated by spaces), and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect, but it works. However, I thought I might try to preemptively remove URLs from Twitter only, since I know that the spaces which break the regular URL strip will always be in the same places, in the hope that this improves my classifier's accuracy. This would get rid of the string of characters that occurs after a link... specifically pictures, which make up a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes, as some regex flavors' literal syntax requires (Python's re does not need this); I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind does is specify some mandatory text before the current position of the cursor; in other words, it puts the cursor after the expression you feed it (if said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. That is a problem, because lookbehinds do not work with variable lengths (Python's re requires fixed-width lookbehind patterns).
So you can just skip the fixed https://twitter.com/ prefix
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
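Putting both steps together in Python (a sketch; the sample URL is illustrative):

import re

url = "https://twitter.com/username/status/ID"

# The lookbehind is fixed-length, so it can only skip the constant prefix;
# the variable username is split off afterwards.
m = re.search(r'(?<=https://twitter\.com/).*', url)
if m:
    print(m.group(0).split('/', 1)[1])  # status/ID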
As a side note, blanks/spaces aren't allowed in URLs and, if necessary, are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
I need to find a word (case-insensitive), let's say "port", within a string that could be port, Port, SUPPORT, Support, support, etc., which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However, my strings often contain 2 or more instances in a single line, and I need to wrap each of those instances in
<b><font color=\"red\">"+instance+"</font></b> without changing their case.
The problem with my approach is that I am iterating over each of the instances found with findall (exact match),
while multiple identical matches can also be found within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I could avoid this if I could find out the exact part of the string that pattern.sub substitutes at the moment of doing it;
however, I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way I could insert <b><font color="red">instance</font></b> without replacing instance for all matches (case-insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
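Something along these lines (a sketch with made-up sample text; \g<0> in the replacement re-inserts exactly what was matched, so each occurrence keeps its original case and is wrapped only once):

import re

string_to_search = "support the port, SUPPORT the Port"
word = "port"

highlighted = re.sub(
    re.escape(word),
    r'<b><font color="red">\g<0></font></b>',
    string_to_search,
    flags=re.IGNORECASE,
)
print(highlighted)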
Okay, so here are two ways I did quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces with the lowercase search term.
import re

FILE = open("testing.txt", "r")
word = "port"

# THIS LOOP IS CASE-SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print(newline)

# THIS LOOP IS CASE-INSENSITIVE
FILE.seek(0)  # rewind: the first loop consumed the file
pattern = re.compile(word, re.IGNORECASE)
for line in FILE:
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print(newline)
I am writing a web crawler using Scrapy, and as a result I get a set of URLs like these (dummy URLs):
http://matrix.com/en/Zion
http://matrix.com/en/Machine_World
http://matrix.com/en/Matrix:Banner_guidelines
http://matrix.com/en/File:Link_Banner.jpg
http://matrix.com/wiki/en/index.php
In the Scrapy rules, I want to add a regex that allows URLs ONLY of the kind "http://matrix.com/en/Machine_World" or "http://matrix.com/en/Zion",
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Constraints :
The string after "/en/" could be of any length, so I cannot ask it to look only at the first 10 or 20 characters. E.g. when I use the regex [a-zA-Z,]{1,20} or [a-zA-Z,]{1,}, it still matches URLs like http://matrix.com/en/Matrix:Banner_guidelines because it finds the "http://matrix.com/en/Matrix" part of the URL to be a successful match. I want it to look at the string starting after "/en/" until the end of the URL and only then apply this rule.
Unfortunately I cannot extract that string and write a subroutine of any kind. It has to be done using a regex only!
i.e. URLs that contain anything outside of the set "http://matrix.com/en/<[a-zA-Z,_]>" must not be allowed.
Have you tried using this character class in your regex? Looks like you aren't including underscores.
Try
[a-zA-Z,_]+
The plus sign means "one or more", which is the same as {1,}, just a nice shorthand :)
If you want to exclude items with .php or .jpg, feel free to add a $ sign to the end, like so:
[a-zA-Z,_]+$
The $ means "end of line", meaning that your matching sequence must run to the end of the line. As full stops are not included in the character class, those options will be excluded.
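In a Scrapy CrawlSpider rule, that might look like this (a sketch; the callback name is illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        # Anchored character class: follow only /en/ pages whose remainder
        # consists solely of letters, commas and underscores
        LinkExtractor(allow=[r'/en/[a-zA-Z,_]+$']),
        callback='parse_page',  # illustrative callback name
    ),
)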
Let me know if that works,
Elliott
Reproducible evidence that the suggested regex works:
grep("matrix.com\\/en\\/[a-zA-Z,_]+$", x, perl=TRUE, value=TRUE)
#[1] "http://matrix.com/en/Zion"
#[2] "http://matrix.com/en/Machine_World"
Data
x <- c("http://matrix.com/en/Zion", "http://matrix.com/en/Machine_World",
"http://matrix.com/en/Matrix:Banner_guidelines",
"http://matrix.com/en/File:Link_Banner.jpg",
"http://matrix.com/wiki/en/index.php")
I want to replace consecutive symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
However, I notice that this might also replace symbols in URLs that happen to appear in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix)  # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
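For what it's worth: since Python 3.5, re.sub substitutes unmatched groups with the empty string, so the template replacement above runs there. On older versions, a replacement function sidesteps the unmatched-group error entirely; a sketch (function name illustrative, URL pattern simplified to \S+):

import re

def collapse_outside_urls(text):
    # Alternation: URLs match first and are kept verbatim;
    # runs of a repeated symbol elsewhere collapse to one.
    # (If your Python rejects \s* before a backreference, use \s{0,100}.)
    pattern = re.compile(
        r'(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)\S+)'
        r'|(?P<rpt>[^\s\w])(?:\s*(?P=rpt))+'
    )
    def repl(m):
        return m.group('URL') if m.group('URL') else m.group('rpt')
    return pattern.sub(repl, text)

print(collapse_outside_urls(
    "this is a dog??? see http://example.com/this--is-a-page.html !!"
))
# this is a dog? see http://example.com/this--is-a-page.html !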