How to reliably extract URLs contained in URLs with Python?

Many search engines track clicked URLs by adding the result's URL to the query string which can take a format like: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask
In the above example the result URL is part of the query string but in some cases it takes the form http://www.example.com/http://www.stackoverflow.com/questions/ask or URL encoding is used.
The approach I tried first is splitting the URL with searchengineurl.split("http://"). Some obvious problems with this:
it would return all parts of the query string that follow the result URL, and not just the result URL. This would be a problem with a URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None
it does not distinguish between any additional parts of the search engine tracking URL's query string and the result URL's query string. This would be a problem with a URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
it fails if "http://" is omitted from the result URL
What is the most reliable, general and non-hacky way in Python to extract URLs contained in other URLs?

I would try using urlparse.urlparse; it will probably get you most of the way there, and a little extra work on your end will get you the rest.
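For the simplest case, where the result URL is the value of a single query parameter (track in the example above), a minimal sketch using urlparse and parse_qs could look like this; the parameter name is an assumption taken from the example:
from urlparse import urlparse, parse_qs
# Assumes the result URL is stored in a "track" query parameter, as in the example above.
tracking_url = "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None"
params = parse_qs(urlparse(tracking_url).query)
print params["track"][0]  # http://www.stackoverflow.com/questions/ask
This only covers the well-formed case; the path-embedded and URL-encoded variants need more work.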

This works for me.
from urlparse import urlparse
from urllib import unquote

urls = ["http://www.example.com/http://www.stackoverflow.com/questions/ask",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None",
        "http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask%3Fshowauthor%3DFalse%26display%3DNone"]

def clean(url):
    # if the result URL is embedded in the path, return everything from "http" onwards
    path = urlparse(url).path
    index = path.find("http")
    if index != -1:
        return path[index:]
    else:
        # otherwise look for it in the query string
        query = urlparse(url).query
        index = query.index("http")
        query = query[index:]
        index_questionmark = query.find("?")
        index_ampersand = query.find("&")
        if index_ampersand != -1 and (index_questionmark == -1 or index_questionmark > index_ampersand):
            # the result URL ends at the tracker's next "&" parameter
            return unquote(query[:index_ampersand])
        else:
            return unquote(query)

for url in urls:
    print clean(url)
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None

I don't know about Python specifically, but I would use a regular expression to get the parts (key=value) of the query string, with something like...
(?:\?|&)[^=]+=([^&]*)
That captures the "value" parts. I would then decode those and check them against another pattern (probably another regex) to see which one looks like a URL. I would just check the first part, then take the whole value. That way your pattern doesn't have to account for every possible type of URL (and presumably they didn't combine the URL with something else within a single value field). This should work with or without the protocol being specified (it's up to your pattern to determine what looks like a URL).
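A minimal Python sketch of that idea follows; the query-value regex is the one above, and the "looks like a URL" check is only an illustrative assumption, not an exhaustive definition:
import re
from urllib import unquote

def embedded_urls(tracking_url):
    # grab the "value" side of each key=value pair in the query string
    values = re.findall(r'(?:\?|&)[^=]+=([^&]*)', tracking_url)
    decoded = [unquote(value) for value in values]
    # crude, assumed "looks like a URL" check: optional scheme, then host.tld
    url_like = re.compile(r'^(?:https?://)?[\w.-]+\.[a-z]{2,}(/|$)', re.I)
    return [value for value in decoded if url_like.match(value)]

print embedded_urls("http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask&display=None")
# ['http://www.stackoverflow.com/questions/ask']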
As for the second type of URL... I don't think there is a non-hacky way to parse that. You could URL-decode the entire URL, then look for the second instance of http:// (or https://, and/or any other protocols you might run across). You would have to decide whether any query strings are part of "your" URL or the tracker URL. You could also not decode the URL and attempt to match on the encoded values. Either way will be messy, and if they don't include the protocol it will be even worse! If you're working with a set of specific formats, you could work out good rules for them... but if you just have to handle whatever they happen to throw at you... I don't think there's a reliable way to handle the second type of embedding.

Related

How can I use urllib to parse a url but input multiple url's into a text prompt?

I'm using urllib to parse a URL, but I want it to take input from a text box so I can put in multiple URLs whenever I need to, instead of changing the code to parse just one URL. I tried using tkinter, but I couldn't figure out how to get urllib to grab the input from that.
You haven't provided much information on your use case but let's pretend you have multiple URLs already and that part is working.
def retrieve_input(list_of_urls):
    for url in list_of_urls:
        # do parsing as needed
        pass
Now if you wanted to have a way to get more than one URL and put them in a list, maybe you would do something like:
list_of_urls = []
while True:
    url = input('What is your URL?')
    if url != 'Stop':
        list_of_urls.append(url)
    else:
        break
With that example you would probably want to validate the input more carefully, but it should give you the idea. If you are expecting to get help with the tkinter portion, you'll need to provide more information and examples of what you have tried, your expected input (and method), and expected output.

Exclude certain keyword from URL

I am successfully able to get the URL using my technique, but the point is that I need to change the URL slightly, like this: "http://www.example.com/static/p/no-name-0330-227404-1.jpg". Whereas in the img tag I get this link: "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
HTML CODE:
<div class="swiper-wrapper"><img data-error-placeholder="PlaceholderPDP.jpg" class="swiper-lazy swiper-lazy-loaded" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>
Python Code:
imagesList = []
imagesList.append([re.findall(re.compile(u'http.*?\.jpg'), etree.tostring(imagesList).decode("utf-8")) for imagesList in productTree.xpath('//*[@class="swiper-wrapper"]/img')])
print (imagesList)
output:
[['http://www.example.com/static/p/no-name-8143-225244-1-product.jpg']]
NOTE: I need to remove "-product" from url and I have no idea why this url is inside two square brackets.
If you are intending to remove just the "-product" keyword, you can simply use the .replace() string method. Otherwise you can construct a regular expression to manipulate the string. Below is example code for the replace approach.
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myURL = myURL.replace("-product", "") # gives you "http://www.example.com/static/p/no-name-0330-227404-1.jpg"
print(myURL)
Regular expression version (probably not as clean a solution, in that it is harder to understand). However, it is better than the first approach because it dynamically discards the last -word (e.g. -product).
What I have done is capture 3 parts of the URL but omit the middle part, because that is the -product bit, then combine parts 1 and 3 to form your URL.
import re
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myPattern = r"(.*)(-.*)(\.jpg)$"
pattern = re.compile(myPattern)
match = re.search(pattern, myURL)
print(match.group(1) + match.group(3))
Same output as above:
http://www.example.com/static/p/no-name-0330-227404-1.jpg
If all the images have the word "product" could you just do a simple string replace and remove just that word? Whatever you are trying to do (including renaming files) I see that as the simplest solution.

Capture URL with query string as parameter in Flask route

Is there a way for Flask to accept a full URL as a URL parameter?
I am aware that <path:something> accepts paths with slashes. However I need to accept everything including a query string after ?, and path doesn't capture that.
http://example.com/someurl.com?andother?yetanother
I want to capture someurl.com?andother?yetanother. I don't know ahead of time what query args, if any, will be supplied. I'd like to avoid having to rebuild the query string from request.args.
The path pattern will let you capture more complicated route patterns, like URLs:
@app.route('/catch/<path:foo>')
def catch(foo):
    print(foo)
    return foo
The data past the ? is a query string, so it won't be included in that pattern. You can either access that part from request.query_string or build it back up from request.args as mentioned in the comments.
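For example, a minimal sketch that reattaches the raw query string to the captured path (the route and reassembly are illustrative assumptions, not the only way to do it):
from flask import Flask, request

app = Flask(__name__)

@app.route('/catch/<path:foo>')
def catch(foo):
    # request.query_string is the raw query string (bytes in recent Flask versions)
    qs = request.query_string.decode('utf-8')
    return foo + ('?' + qs if qs else '')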
Due to the way routing works, you will not be able to capture the query string as part of the path. Use <path:path> in the rule to capture arbitrary paths. Then access request.url to get the full URL that was accessed, including the query string. request.url always includes a ? even if there was no query string. It's valid, but you can strip that off if you don't want it.
@app.route("/<path:path>")
def index(path=None):
    return request.url.rstrip("?")
For example, accessing http://127.0.0.1:5000/hello?world would return http://127.0.0.1:5000/hello?world.

Loop For And Add To List

I'm trying to create a list of URL's visit_urls to visit.
At first I manually specify the first URL to visit with self.br.get(url)
and check how many pages the site has, for example 40 pages. I will know that number via "count", and I just want to swap the end of the URL with &page=2, &page=3, and so on up to &page=40, building them into a list.
Here is the loop part of my code. I only need a way to add all the pages into the visit_urls list.
visit_urls=[]
self.br.get(url)
count = self.br.find_elements_by_xpath("//*[@class='count']").get_attribute("innerHTML"):
for (here up to count)
    self.visit_urls.append(url + need to append this also to the end of the first url &page=2-count)
This code comes after a lot of research and I'm stuck so any help will be great!
Try something like this:
visit_urls = []
self.br.get(url)
# find_element (singular) returns one element; innerHTML is a string, so convert it
count = int(self.br.find_element_by_xpath("//*[@class='count']").get_attribute("innerHTML"))
for page_number in xrange(1, count + 1):
    page_url = '{url}&page={page_number}'.format(url=url, page_number=page_number)
    visit_urls.append(page_url)
This builds each page URL from the original url without reassigning it; if you reassigned url inside the loop, you would end up with URLs like http://www.mysite.com&page=1&page=2&page=3. Make sure url is always defined appropriately.
I'm assuming everything works and the issue you're having is in generating an array of all URLs based on your findings in "count".
The easiest thing to do would be if you already know the URL, and it's in the proper format, such as:
url = 'http://www.thisisapage.com/somethinghere/dosomething.php?page=1'
If that's the case, do something to strip the trailing 1, getting a 'baseurl' to act on (exactly how to do this depends on what the URLs are and how they are formed):
baseurl = 'http://www.thisisapage.com/somethinghere/dosomething.php?page='
Afterwards, just loop from 1 to count, appending the current iteration to the baseurl, as in the sketch below.
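A minimal sketch, assuming the page=1 format shown above and an already-known page count:
baseurl = 'http://www.thisisapage.com/somethinghere/dosomething.php?page='
count = 40  # assumed page count taken from the question

visit_urls = [baseurl + str(page) for page in range(1, count + 1)]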
Often, it's a lot easier to use a regular expression to do this if you're ever going to have complex URLs, or dynamic URLs that may include security tokens and the such.
For that, you could use something like:
import re

m = re.match(r'^(.*)(page=\d+&?)(.*)$', url)
for i in range(2, count + 1):
    self.visit_urls.append(m.group(1) + 'page=%i' % i + m.group(3))
Of course, since you're using a URL, that could be so many things, it will fall on you to make sure that the regular expression catches everything that it needs to. Mine was very simple based on the information you provided.
A very basic webcrawler in Python:
import re, urllib

print "Enter the URL you wish to crawl.."
print 'Usage - "http://example.com/"'
myurl = raw_input("#> ")  # raw_input, since input() would try to evaluate the text in Python 2
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print ee

Python URL splitting

I have a string like google.com in Python, which I would like to split into two parts: google and .com. The problem is where I have a URL such as subdomain.google.com, which I would like split into subdomain.google and .com.
How do I separate the rest of the URL from the TLD? It can't operate based on the last . in the URL because of TLDs such as .co.uk. Note the URL does not contain http:// or www.
tldextract looks like what you need. It deals with the .co.uk issue.
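A small sketch of how it is typically used (after pip install tldextract); the attribute names are from its documented API:
import tldextract

ext = tldextract.extract('subdomain.google.co.uk')
print(ext.subdomain)  # subdomain
print(ext.domain)     # google
print(ext.suffix)     # co.uk

# the "rest of the URL" vs. the TLD, as asked:
print('.'.join(part for part in (ext.subdomain, ext.domain) if part))  # subdomain.google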
I used tld and urllib, but did not find them satisfying.
I found this question multiple times on my Google search on how to parse a URL.
After a while, I took the time to make a regex and make it into an open source package.
It handles URLs which have a secondary top-domain like co.uk, and also supports national URLs with special characters.
url-parser on PyPi
URL Parser on GitHub
For you, it would be easy to use it like this:
Step one:
pip install url-parser
Step two:
from url_parser import parse_url

url = parse_url('subdomain.google.com')
url['subdomain']   # subdomain
url['domain']      # google
url['top_domain']  # com
You can use these keys to get the different part of the URL.
protocol
www
sub_domain
domain
top_domain
dir
file
fragment
query
To do this, you will need a list of valid domain names. The top level ones (.com, .org, etc.) and the country codes (.us, .fr, etc.) are easy to find. Try http://www.icann.org/en/resources/registries/tlds.
For the second level ones (.co.uk, .org.au) you might need to look up each country code to see its sub domains. Wikipedia is your friend.
Once you have the list, grab the last two parts from the name you have (google.com or co.uk) and see if it is in your second level list. If not, grab the last part and see if it is in your top level list.
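A rough sketch of that lookup; the suffix sets here are tiny stand-ins for the full ICANN list and the per-country second-level lists:
# The suffix sets below are placeholder assumptions; fill them from the real lists.
TOP_LEVEL = {'com', 'org', 'net', 'uk', 'au'}
SECOND_LEVEL = {'co.uk', 'org.au'}

def split_tld(name):
    parts = name.split('.')
    if '.'.join(parts[-2:]) in SECOND_LEVEL:
        return '.'.join(parts[:-2]), '.' + '.'.join(parts[-2:])
    if parts[-1] in TOP_LEVEL:
        return '.'.join(parts[:-1]), '.' + parts[-1]
    return name, ''

print(split_tld('subdomain.google.com'))    # ('subdomain.google', '.com')
print(split_tld('subdomain.google.co.uk'))  # ('subdomain.google', '.co.uk')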
