Combining three URL components into a single URL

Combining three URL components into a single URL - python

I am trying to write a function combine three components of a URL: a protocol, location, and resource, into a single URL.
I have the following code, and it works only partially, returning a URL with only the protocol and resource components, but not the location component.
Code:
from urllib.parse import urlparse
import os
def buildURL(protocol, location, resource):
return urllib.parse.urljoin(protocol, os.path.join(location,
resource))
Example: buildURL('http://', 'httpbin.org', '/get')
This returns http:///get. I a trying to debug this to also allow for the location parameter to be in the URL. It should be returning http://httpbin.org/get.
How can I build a URL successfully?

It's because you put /get in the os.path.join. you should call it like buildURL('http://', 'httpbin.org', 'get'). os.path.join will treat / as an absolute path that will be hooked from the root of the base location, which is the first parameter of the join function: location

You shouldn't be using os.path here at all. That module is for filesystem paths, e.g. to deal with things like /usr/bin/bash and C:\Documents and Settings\User\.
It's not for building URLs. They aren't affected by the user's host OS.
Instead, use urlunparse() or urlunsplit() from urllib.parse:
from urllib.parse import urlunparse
urlunparse(('https', 'httpbin.org', '/get', None, None, None))
# 'https://httpbin.org/get'

Related

Connexion/Flask application: get base_path from request

I have a connexion/flask/werkzeug application and I need to be able to obtain the "base_path" during requests. For example: my application is available at: http://0.0.0.0:8080/v1.0/ui/#/Pet, with the base_path being: "http://0.0.0.0:8080/v1.0".
I want to be able to get the base_path when the requestor performs any defined operation (GET, POST, PUT, etc). I have not been able to find an easy way to obtain the base path. Through a python debugger, I can see the base_path is available higher up in the stack but doesn't appear to be available to the application entrypoint.
#!/usr/bin/env python3
import connexion
import datetime
import logging
from connexion import NoContent
PETS = {}
def get_pet(pet_id):
pet = PETS.get(pet_id)
# >>>--> I WANT TO GET THE BASE_PATH OF THE REQUEST HERE <--<<<
return pet or ('Not found', 404)
For reasons I can not detail due to nda, I have multiple openapi specs for this application and it's important for me to know which base_path is being requested (as they are different). If somebody could help me figure out a way to obtain the base_path per request I would be greatly appreciative :)
Thank you!

Use connexion.request.base_url .
https://connexion.readthedocs.io/en/latest/request.html#header-parameters you can access the connexion.request inside your handler

refer to this topic (Incoming Request Data) from Flask documentation
and dump the incoming request with the before_request hook and extract the right one e.g request.base_url for your case:
from flask import .., request
#bp.before_request
def dump_incoming_request():
from pprint import pprint
pprint(request.__dict__.items())

Django : extract a path from a full URL

In a Django 1.8 simple tag, I need to resolve the path to the HTTP_REFERER found in the context. I have a piece of code that works, but I would like to know if a more elegant solution could be implemented using Django tools.
Here is my code :
from django.core.urlresolvers import resolve, Resolver404
# [...]
#register.simple_tag(takes_context=True)
def simple_tag_example(context):
# The referer is a full path: http://host:port/path/to/referer/
# We only want the path: /path/to/referer/
referer = context.request.META.get('HTTP_REFERER')
if referer is None:
return ''
# Build the string http://host:port/
prefix = '%s://%s' % (context.request.scheme, context.request.get_host())
path = referer.replace(prefix, '')
resolvermatch = resolve(path)
# Do something very interesting with this resolvermatch...
So I manually construct the string 'http://sub.domain.tld:port', then I remove it from the full path to HTTP_REFERER found in context.request.META. It works but it seems a bit overwhelming for me.
I tried to build a HttpRequest from referer without success. Is there a class or type that I can use to easily extract the path from an URL?

You can use urlparse module to extract the path:
try:
from urllib.parse import urlparse # Python 3
except ImportError:
from urlparse import urlparse # Python 2
parsed = urlparse('http://stackoverflow.com/questions/32809595')
print(parsed.path)
Output:
'/questions/32809595'

python os.listdir() for remote locations

I recently noticed that
os.listdir('http://chymera.eu/data/faceRT')
complains about not finding my directories.
What can I do to be able to run os.listdir() on remote locations? I have checked and this is not a permissions issue, I can open the folder via my browser and my webftp client says it's 755.
Whatever I do, I would NOT like to have to use login information. I made a decision about sharing when I set the directory permissions. If I say r+x for everyone then I want that to mean r+x for everyone.

os.listdir expects the argument to be a path on the filesystem. It does not attempt to understand URLs
You can use urllib to request the page and parse it to find the URLs

Ok, so I solved this by using the HTMLparser to parse my web index:
if source == 'server':
from HTMLParser import HTMLParser
import urllib
class ChrParser(HTMLParser):
def handle_starttag(self, tag, attrs):
if tag =='a':
for key, value in attrs:
if key == 'href' and value.endswith('.csv'):
pre_fileslist.append(value)
results_dir = 'http://chymera.eu/data/faceRT'
data_url = urllib.urlopen(results_dir).read()
parser = ChrParser()
pre_fileslist = []
parser.feed(data_url) # pre_fileslist gets populated here

Making relative paths absolute in python

I want to crawl web page with python, the problem is with relative paths, I have the following functions which normalize and derelativize urls in web page, I can not implement one part of derelativating function. Any ideas? :
def normalizeURL(url):
if url.startswith('http')==False:
url = "http://"+url
if url.startswith('http://www.')==False:
url = url[:7]+"www."+url[7:]
return url
def deRelativizePath(url, path):
url = normalizeURL(url)
if path.startswith('http'):
return path
if path.startswith('/')==False:
if url.endswith('/'):
return url+path
else:
return url+"/"+path
else:
#this part is missing
The problem is: I do not know how to get main url, they can be in many formats:
http://www.example.com
http://www.example.com/
http://www.sub.example.com
http://www.sub.example.com/
http://www.example.com/folder1/file1 #from this I should extract http://www.example.com/ then add path
...

I recommend that you consider using urlparse.urljoin() for this:
Construct a full ("absolute") URL by combining a "base URL" (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.

from urlparse import urlparse
And then parse into the respective parts.

Is there a convenient way to map a file uri to os.path?

A subsystem which I have no control over insists on providing filesystem paths in the form of a uri. Is there a python module/function which can convert this path into the appropriate form expected by the filesystem in a platform independent manner?

Use urllib.parse.urlparse to get the path from the URI:
import os
from urllib.parse import urlparse
p = urlparse('file://C:/test/doc.txt')
final_path = os.path.abspath(os.path.join(p.netloc, p.path))

The solution from #Jakob Bowyer doesn't convert URL encoded characters to regular UTF-8 characters. For that you need to use urllib.parse.unquote.
>>> from urllib.parse import unquote, urlparse
>>> unquote(urlparse('file:///home/user/some%20file.txt').path)
'/home/user/some file.txt'

Of all the answers so far, I found none that catch edge cases, doesn't require branching, are both 2/3 compatible, and cross-platform.
In short, this does the job, using only builtins:
try:
from urllib.parse import urlparse, unquote
from urllib.request import url2pathname
except ImportError:
# backwards compatability
from urlparse import urlparse
from urllib import unquote, url2pathname
def uri_to_path(uri):
parsed = urlparse(uri)
host = "{0}{0}{mnt}{0}".format(os.path.sep, mnt=parsed.netloc)
return os.path.normpath(
os.path.join(host, url2pathname(unquote(parsed.path)))
)
The tricky bit (I found) was when working in Windows with paths specifying a host. This is a non-issue outside of Windows: network locations in *NIX can only be reached via paths after being mounted to the root of the filesystem.
From Wikipedia:
A file URI takes the form of file://host/path , where host is the fully qualified domain name of the system on which the path is accessible [...]. If host is omitted, it is taken to be "localhost".
With that in mind, I make it a rule to ALWAYS prefix the path with the netloc provided by urlparse, before passing it to os.path.abspath, which is necessary as it removes any resulting redundant slashes (os.path.normpath, which also claims to fix the slashes, can get a little over-zealous in Windows, hence the use of abspath).
The other crucial component in the conversion is using unquote to escape/decode the URL percent-encoding, which your filesystem won't otherwise understand. Again, this might be a bigger issue on Windows, which allows things like $ and spaces in paths, which will have been encoded in the file URI.
For a demo:
import os
from pathlib import Path # This demo requires pip install for Python < 3.4
import sys
try:
from urllib.parse import urlparse, unquote
from urllib.request import url2pathname
except ImportError: # backwards compatability:
from urlparse import urlparse
from urllib import unquote, url2pathname
DIVIDER = "-" * 30
if sys.platform == "win32": # WINDOWS
filepaths = [
r"C:\Python27\Scripts\pip.exe",
r"C:\yikes\paths with spaces.txt",
r"\\localhost\c$\WINDOWS\clock.avi",
r"\\networkstorage\homes\rdekleer",
]
else: # *NIX
filepaths = [
os.path.expanduser("~/.profile"),
"/usr/share/python3/py3versions.py",
]
for path in filepaths:
uri = Path(path).as_uri()
parsed = urlparse(uri)
host = "{0}{0}{mnt}{0}".format(os.path.sep, mnt=parsed.netloc)
normpath = os.path.normpath(
os.path.join(host, url2pathname(unquote(parsed.path)))
)
absolutized = os.path.abspath(
os.path.join(host, url2pathname(unquote(parsed.path)))
)
result = ("{DIVIDER}"
"\norig path: \t{path}"
"\nconverted to URI:\t{uri}"
"\nrebuilt normpath:\t{normpath}"
"\nrebuilt abspath:\t{absolutized}").format(**locals())
print(result)
assert path == absolutized
Results (WINDOWS):
------------------------------
orig path: C:\Python27\Scripts\pip.exe
converted to URI: file:///C:/Python27/Scripts/pip.exe
rebuilt normpath: C:\Python27\Scripts\pip.exe
rebuilt abspath: C:\Python27\Scripts\pip.exe
------------------------------
orig path: C:\yikes\paths with spaces.txt
converted to URI: file:///C:/yikes/paths%20with%20spaces.txt
rebuilt normpath: C:\yikes\paths with spaces.txt
rebuilt abspath: C:\yikes\paths with spaces.txt
------------------------------
orig path: \\localhost\c$\WINDOWS\clock.avi
converted to URI: file://localhost/c%24/WINDOWS/clock.avi
rebuilt normpath: \localhost\c$\WINDOWS\clock.avi
rebuilt abspath: \\localhost\c$\WINDOWS\clock.avi
------------------------------
orig path: \\networkstorage\homes\rdekleer
converted to URI: file://networkstorage/homes/rdekleer
rebuilt normpath: \networkstorage\homes\rdekleer
rebuilt abspath: \\networkstorage\homes\rdekleer
Results (*NIX):
------------------------------
orig path: /home/rdekleer/.profile
converted to URI: file:///home/rdekleer/.profile
rebuilt normpath: /home/rdekleer/.profile
rebuilt abspath: /home/rdekleer/.profile
------------------------------
orig path: /usr/share/python3/py3versions.py
converted to URI: file:///usr/share/python3/py3versions.py
rebuilt normpath: /usr/share/python3/py3versions.py
rebuilt abspath: /usr/share/python3/py3versions.py

To convert a file uri to a path with python (specific to 3, I can make for python 2 if someone really wants it):
Parse the uri with urllib.parse.urlparse
Unquote the path component of the parsed uri with urllib.parse.unquote
then ...
a. If path is a windows path and starts with /: strip the first character of unquoted path component (path component of file:///C:/some/file.txt is /C:/some/file.txt which is not interpreted to be equivalent to C:\some\file.txt by pathlib.PureWindowsPath)
b. Otherwise just use the unquoted path component as is.
Here is a function that does this:
import urllib
import pathlib
def file_uri_to_path(file_uri, path_class=pathlib.PurePath):
"""
This function returns a pathlib.PurePath object for the supplied file URI.
:param str file_uri: The file URI ...
:param class path_class: The type of path in the file_uri. By default it uses
the system specific path pathlib.PurePath, to force a specific type of path
pass pathlib.PureWindowsPath or pathlib.PurePosixPath
:returns: the pathlib.PurePath object
:rtype: pathlib.PurePath
"""
windows_path = isinstance(path_class(),pathlib.PureWindowsPath)
file_uri_parsed = urllib.parse.urlparse(file_uri)
file_uri_path_unquoted = urllib.parse.unquote(file_uri_parsed.path)
if windows_path and file_uri_path_unquoted.startswith("/"):
result = path_class(file_uri_path_unquoted[1:])
else:
result = path_class(file_uri_path_unquoted)
if result.is_absolute() == False:
raise ValueError("Invalid file uri {} : resulting path {} not absolute".format(
file_uri, result))
return result
Usage examples (ran on linux):
>>> file_uri_to_path("file:///etc/hosts")
PurePosixPath('/etc/hosts')
>>> file_uri_to_path("file:///etc/hosts", pathlib.PurePosixPath)
PurePosixPath('/etc/hosts')
>>> file_uri_to_path("file:///C:/Program Files/Steam/", pathlib.PureWindowsPath)
PureWindowsPath('C:/Program Files/Steam')
>>> file_uri_to_path("file:/proc/cpuinfo", pathlib.PurePosixPath)
PurePosixPath('/proc/cpuinfo')
>>> file_uri_to_path("file:c:/system32/etc/hosts", pathlib.PureWindowsPath)
PureWindowsPath('c:/system32/etc/hosts')
This function works for windows and posix file URIs and it will handle file URIs without an authority section. It will however NOT do validation of the URI's authority so this will not be honoured:
IETF RFC 8089: The "file" URI Scheme / 2. Syntax
The "host" is the fully qualified domain name of the system on which
the file is accessible. This allows a client on another system to
know that it cannot access the file system, or perhaps that it needs
to use some other local mechanism to access the file.
Validation (pytest) for the function:
import os
import pytest
def validate(file_uri, expected_windows_path, expected_posix_path):
if expected_windows_path is not None:
expected_windows_path_object = pathlib.PureWindowsPath(expected_windows_path)
if expected_posix_path is not None:
expected_posix_path_object = pathlib.PurePosixPath(expected_posix_path)
if expected_windows_path is not None:
if os.name == "nt":
assert file_uri_to_path(file_uri) == expected_windows_path_object
assert file_uri_to_path(file_uri, pathlib.PureWindowsPath) == expected_windows_path_object
if expected_posix_path is not None:
if os.name != "nt":
assert file_uri_to_path(file_uri) == expected_posix_path_object
assert file_uri_to_path(file_uri, pathlib.PurePosixPath) == expected_posix_path_object
def test_some_paths():
validate(pathlib.PureWindowsPath(r"C:\Windows\System32\Drivers\etc\hosts").as_uri(),
expected_windows_path=r"C:\Windows\System32\Drivers\etc\hosts",
expected_posix_path=r"/C:/Windows/System32/Drivers/etc/hosts")
validate(pathlib.PurePosixPath(r"/C:/Windows/System32/Drivers/etc/hosts").as_uri(),
expected_windows_path=r"C:\Windows\System32\Drivers\etc\hosts",
expected_posix_path=r"/C:/Windows/System32/Drivers/etc/hosts")
validate(pathlib.PureWindowsPath(r"C:\some dir\some file").as_uri(),
expected_windows_path=r"C:\some dir\some file",
expected_posix_path=r"/C:/some dir/some file")
validate(pathlib.PurePosixPath(r"/C:/some dir/some file").as_uri(),
expected_windows_path=r"C:\some dir\some file",
expected_posix_path=r"/C:/some dir/some file")
def test_invalid_url():
with pytest.raises(ValueError) as excinfo:
validate(r"file://C:/test/doc.txt",
expected_windows_path=r"test\doc.txt",
expected_posix_path=r"/test/doc.txt")
assert "is not absolute" in str(excinfo.value)
def test_escaped():
validate(r"file:///home/user/some%20file.txt",
expected_windows_path=None,
expected_posix_path=r"/home/user/some file.txt")
validate(r"file:///C:/some%20dir/some%20file.txt",
expected_windows_path="C:\some dir\some file.txt",
expected_posix_path=r"/C:/some dir/some file.txt")
def test_no_authority():
validate(r"file:c:/path/to/file",
expected_windows_path=r"c:\path\to\file",
expected_posix_path=None)
validate(r"file:/path/to/file",
expected_windows_path=None,
expected_posix_path=r"/path/to/file")
This contribution is licensed (in addition to any other licenses which may apply) under the Zero-Clause BSD License (0BSD) license
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
To the extent possible under law, Iwan Aucamp has waived all copyright and related or neighboring rights to this stackexchange contribution. This work is published from: Norway.

The solution from #colton7909 is mostly correct and helped me get to this answer, but has some import errors with Python 3. That and I think this is a better way to deal with the 'file://' part of the URL than simply chopping off the first 7 characters. So I feel this is the most idiomatic way to do this using the standard library:
import urllib.parse
url_data = urllib.parse.urlparse('file:///home/user/some%20file.txt')
path = urllib.parse.unquote(url_data.path)
This example should produce the string '/home/user/some file.txt'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Combining three URL components into a single URL - python

It's because you put /get in the os.path.join. you should call it like buildURL('http://', 'httpbin.org', 'get'). os.path.join will treat / as an absolute path that will be hooked from the root of the base location, which is the first parameter of the join function: location

Related

Connexion/Flask application: get base_path from request

Django : extract a path from a full URL

python os.listdir() for remote locations

Making relative paths absolute in python

Is there a convenient way to map a file uri to os.path?

Categories

Resources