Python URL splitting

I have a string like google.com in Python, which I would like split into two parts: google and .com. The problem is where I have a URL such as subdomain.google.com, which I would like to be split into subdomain.google and .com.
How do I separate the rest of the URL from the TLD? It can't operate based on the last . in the URL because of TLDs such as .co.uk. Note the URL does not contain http:// or www.

tldextract looks like what you need. It deals with the .co.uk issue.
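For reference, a minimal sketch of how tldextract could be applied to the example in the question (assuming it has been installed with pip install tldextract):

import tldextract

ext = tldextract.extract('subdomain.google.com')
print(ext.subdomain, ext.domain, ext.suffix)  # subdomain google com
# rebuild the two pieces the question asks for
print('.'.join(part for part in (ext.subdomain, ext.domain) if part))  # subdomain.google
print('.' + ext.suffix)  # .com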

I tried the tld and urllib packages, but did not find them satisfying.
I kept finding this question while searching for how to parse a URL.
After a while, I took the time to write a regex for it and turn it into an open-source package.
It handles URLs that have a second-level top domain like co.uk, and also supports national URLs with special characters.
url-parser on PyPI
URL Parser on GitHub
For you, it would be easy to use it like this:
Step one:
pip install url-parser
Step two:
from url_parser import parse_url
url = parse_url('subdomain.google.com')
url['subdomain'] # subdomain
url['domain'] # google
url['top_domain'] # com
You can use these keys to get the different parts of the URL:
protocol
www
sub_domain
domain
top_domain
dir
file
fragment
query

To do this, you will need a list of valid domain suffixes. The top-level ones (.com, .org, etc.) and the country codes (.us, .fr, etc.) are easy to find; try http://www.icann.org/en/resources/registries/tlds.
For the second-level ones (.co.uk, .org.au) you might need to look up each country code to see which second-level domains it uses. Wikipedia is your friend.
Once you have the lists, take the last two parts of the name you have (google.com or co.uk) and check whether they are in your second-level list. If not, take the last part and check whether it is in your top-level list.
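A rough sketch of that lookup, with small hypothetical lists standing in for the real ICANN and country-code data:

# toy suffix lists; the real ones would be built from the registry data above
SECOND_LEVEL = {'co.uk', 'org.au'}
TOP_LEVEL = {'com', 'org', 'net', 'uk', 'au', 'us', 'fr'}

def split_tld(hostname):
    parts = hostname.split('.')
    if '.'.join(parts[-2:]) in SECOND_LEVEL:
        return '.'.join(parts[:-2]), '.' + '.'.join(parts[-2:])
    if parts[-1] in TOP_LEVEL:
        return '.'.join(parts[:-1]), '.' + parts[-1]
    return hostname, ''

print(split_tld('subdomain.google.com'))  # ('subdomain.google', '.com')
print(split_tld('news.bbc.co.uk'))        # ('news.bbc', '.co.uk')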

Related

How to retrieve the domain of a web archived website using the archived url in Python?

Given a URL such as:
http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
Is there a way (using some library, package, or vanilla Python) to retrieve the domain "www.feralhouse.com"?
I thought of simply splitting at "www", splitting the second item of the result at "com", and regrouping the pieces like this:
url = "http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html"
url1 = url.split("www")
url2 = url1[1].split("com")
desired_output = "www" + url2[0] + "com"
print(desired_output)
#www.feralhouse.com
But there are some exceptions to this method (sites with no www; I assume they rely on the browser adding it automatically). I would prefer a less "hacky" approach if possible. Thanks in advance!
NOTE: I don't want a solution just for this SPECIFIC URL, I want a solution for all possible archived URLs.
EDIT: Another example url
http://web.archive.org/web/20000614170338/http://www.clonejesus.com/
Two methods, one with split, one with the re module:
s = 'http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html'
print(s.split('/', 5)[-1])
import re
print(re.findall(r'\d{14}/(.*)', s)[0])
Prints:
www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
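If only the host is wanted rather than the full remainder, a small follow-up to the split above might look like this (assuming, as in both examples, that the archived address follows the fourteen-digit timestamp):

rest = s.split('/', 5)[-1]
# the second example keeps its own 'http://' after the timestamp,
# so strip a leading scheme if one is present before taking the host
rest = rest.split('://', 1)[-1]
print(rest.split('/', 1)[0])  # www.feralhouse.com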

web-scraping, regex and iteration in python

I have the following URL: 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex to access the URLs from page=2 through page=12.
For example, this URL is needed: 'http://www.alriyadh.com/file/278?&page=4', but not page=14.
I reckon what will work is a function that iterates over the specified pages to access all the URLs within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those URLs using the newspaper package. I simply want this data for my research.
Thanks in advance
This does not require a regex; a simple preset loop will do.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.alriyadh.com/file/278?&page='
for page in range(2, 13):
    html = requests.get(url + str(page)).text
    soup = bs(html, 'html.parser')  # parse each page; pull the article content from soup here
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
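If the idea is to filter links that have already been collected rather than to build them, one way that alternation might be anchored is shown below; the surrounding pattern is an assumption about the link format, not something from the question:

import re

page_re = re.compile(r'[?&]page=([2-9]|1[012])$')
links = ['http://www.alriyadh.com/file/278?&page=%d' % n for n in (1, 4, 12, 14)]
print([link for link in links if page_re.search(link)])
# keeps only the page=4 and page=12 links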
Judging by what you have now, I am unsure that your regex will work as you intend it to. Perhaps I am misinterpreting it altogether, but is the '?=' intended to be a lookahead?
Or are you actually searching for a '?' immediately followed by an '=' immediately followed by any digit 2-9?
How familiar are you with regexes in general? This particular one seems too vague to produce a meaningful match.

How to reliably extract URLs contained in URLs with Python?

Many search engines track clicked URLs by adding the result's URL to the query string, which can take a format like: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask
In the above example the result URL is part of the query string, but in some cases it takes the form http://www.example.com/http://www.stackoverflow.com/questions/ask, or URL encoding is used.
The first approach I tried is splitting on "http://": searchengineurl.split("http://"). Some obvious problems with this:
it would return all parts of the query string that follow the result URL and not just the result URL. This would be a problem with a URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None
it does not distinguish between any additional parts of the search engine tracking URL's query string and the result URL's query string. This would be a problem with a URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
it fails if the "http://" is omitted in the result URL
What is the most reliable, general and non-hacky way in Python to extract URLs contained in other URLs?
I would try using urlparse.urlparse; it will probably get you most of the way there, and a little extra work on your end will get what you want.
This works for me.
from urlparse import urlparse
from urllib import unquote

urls = ["http://www.example.com/http://www.stackoverflow.com/questions/ask",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None",
        "http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask%3Fshowauthor%3DFalse%26display%3DNonee"]

def clean(url):
    # case 1: the embedded URL sits directly in the path
    path = urlparse(url).path
    index = path.find("http")
    if not index == -1:
        return path[index:]
    else:
        # case 2: the embedded URL sits in the query string
        query = urlparse(url).query
        index = query.index("http")
        query = query[index:]
        index_questionmark = query.find("?")
        index_ampersand = query.find("&")
        if index_questionmark == -1 or index_questionmark > index_ampersand:
            return unquote(query[:index_ampersand])
        else:
            return unquote(query)

for url in urls:
    print clean(url)
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
I don't know about Python specifically, but I would use a regular expression to get the parts (key=value) of the query string, with something like...
(?:\?|&)[^=]+=([^&]*)
That captures the "value" parts. I would then decode those and check them against another pattern (probably another regex) to see which one looks like a URL. I would just check the first part, then take the whole value. That way your pattern doesn't have to account for every possible type of URL (and presumably they didn't combine the URL with something else within a single value field). This should work with or without the protocol being specified (it's up to your pattern to determine what looks like a URL).
As for the second type of URL... I don't think there is a non-hacky way to parse that. You could URL-decode the entire URL, then look for the second instance of http:// (or https://, and/or any other protocols you might run across). You would have to decide whether any query strings are part of "your" URL or the tracker URL. You could also not decode the URL and attempt to match on the encoded values. Either way will be messy, and if they don't include the protocol it will be even worse! If you're working with a set of specific formats, you could work out good rules for them... but if you just have to handle whatever they happen to throw at you... I don't think there's a reliable way to handle the second type of embedding.
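A sketch of the first idea, staying with the Python 2 idiom of the answer above; the helper name and the URL-detection pattern are only illustrative, not a complete rule for what "looks like a URL":

import re
from urllib import unquote

def extract_embedded_url(url):
    # pull every value out of the query string, as in the pattern above
    for value in re.findall(r'(?:\?|&)[^=]+=([^&]*)', url):
        candidate = unquote(value)
        # keep the first value that looks like a URL
        if re.match(r'(?:https?://|www\.)', candidate):
            return candidate
    return None

print extract_embedded_url("http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask")
# http://www.stackoverflow.com/questions/ask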

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link:<a href='http://example.com'>example.com</a>
A simple Python replace that substitutes example with <a href='http://example.com'>example</a> would also hit the example inside the existing anchor's href and text, so the markup comes out mangled, but I want:
Here is an <a href='http://example.com'>example</a> link:<a href='http://example.com'>example.com</a>
Is there any Python plugin that is capable of this? Thanks a lot!
This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)

# temporarily mark existing link text with '|' so it is left alone below
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

# replace the bare word 'example' in every text node with the link markup
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# remove the temporary '|' markers again
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm walking over all the text nodes in the post body and replacing the word example with the given link, without touching the existing link text, which is protected by the '|' markers during the pass.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub, which lets you pass in a function; but unless you are operating on plain text, you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of an element.
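For the plain-text case, a small sketch of re.sub with a replacement function; the keyword table here is hypothetical:

import re

keywords = {'example': 'http://example.com'}  # hypothetical keyword/link pairs

def linkify(match):
    word = match.group(0)
    return '<a href="%s">%s</a>' % (keywords[word], word)

pattern = r'\b(?:%s)\b' % '|'.join(re.escape(k) for k in keywords)
print(re.sub(pattern, linkify, 'Here is an example link.'))
# Here is an <a href="http://example.com">example</a> link.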

In Python, how do I check if 2 different links actually point to the same page?

For example, these 2 links point to the same location:
http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html
How do I check this in Python?
Call geturl() on the result of urllib2.urlopen(). geturl() "returns the URL of the resource retrieved, commonly used to determine if a redirect was followed."
For example:
#!/usr/bin/env python
# coding: utf-8
import urllib2
url1 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html'
url2 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html'
for url in [url1, url2]:
    result = urllib2.urlopen(url)
    print result.geturl()
The output is:
http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
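The same check could also be written for Python 3 with the third-party requests library, which follows redirects by default and exposes the final address as response.url:

import requests

url1 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html'
url2 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html'

# compare the final URLs after any redirects have been followed
print(requests.get(url1).url == requests.get(url2).url)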
It's impossible to discern that merely from the URLs, obviously.
You could fetch the content and compare it, but then I imagine you'd have to use a smart criterion to decide when two pages are the same: say, both point to the same article, but the advertising comes out different, or the related articles change depending on other factors.
Design your program in such a way that the criterion for matching pages is easily replaced, even dynamically, and keep trying until you find one that doesn't fail; for a newspaper page, for example, you could try matching the headlines.
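A rough sketch of that headline heuristic, assuming requests and bs4 are installed and that both pages expose their headline in an <h1> tag, which is only an assumption about this site's markup:

import requests
from bs4 import BeautifulSoup

def headline(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else None

# url1 and url2 as in the answer above
print(headline(url1) == headline(url2))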
