Making relative paths absolute in Python

I want to crawl a web page with Python. The problem is with relative paths. I have the following functions, which normalize and derelativize URLs found in a web page, but I cannot implement one part of the derelativizing function. Any ideas?
def normalizeURL(url):
    if url.startswith('http')==False:
        url = "http://"+url
    if url.startswith('http://www.')==False:
        url = url[:7]+"www."+url[7:]
    return url
def deRelativizePath(url, path):
    url = normalizeURL(url)
    if path.startswith('http'):
        return path
    if path.startswith('/')==False:
        if url.endswith('/'):
            return url+path
        else:
            return url+"/"+path
    else:
        #this part is missing
The problem is that I do not know how to get the main URL. The URLs can come in many formats:
http://www.example.com
http://www.example.com/
http://www.sub.example.com
http://www.sub.example.com/
http://www.example.com/folder1/file1 #from this I should extract http://www.example.com/ then add path
...

I recommend that you consider using urlparse.urljoin() for this (urllib.parse.urljoin() in Python 3):
Construct a full ("absolute") URL by combining a "base URL" (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.
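As a minimal sketch of how urljoin() could fill in the missing branch, here is a Python 3 version (where the function lives in urllib.parse). The name de_relativize_path and the sample URLs are only illustrative, and this skips the "www." normalization from the question.

```python
from urllib.parse import urljoin

def de_relativize_path(url, path):
    """Resolve path against url; absolute URLs are returned unchanged."""
    return urljoin(url, path)

base = "http://www.example.com/folder1/file1"
print(de_relativize_path(base, "/about.html"))   # http://www.example.com/about.html
print(de_relativize_path(base, "page2.html"))    # http://www.example.com/folder1/page2.html
print(de_relativize_path(base, "http://other.example.org/x"))  # http://other.example.org/x
```

Note that a path without a leading slash is resolved relative to the directory of the base URL (folder1/ here), which differs from simply appending it to the base.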

from urlparse import urlparse
Then parse the URL into its respective parts.
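For example, a rough sketch of getting the "main URL" from the parsed parts could look like this in Python 3 (the import is from urllib.parse there). The helper name site_root is made up for illustration.

```python
from urllib.parse import urlparse

def site_root(url):
    """Return scheme://host/ for a URL, dropping path, query and fragment."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

for url in ("http://www.example.com",
            "http://www.sub.example.com/",
            "http://www.example.com/folder1/file1"):
    print(site_root(url))
```

This assumes the URL already includes a scheme; without one, urlparse puts the host name into the path instead of netloc.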

Related

Is it always correct to use URLs like "./about.html" or "../about.htm" instead of Absolute URLS like /about?

I'm a computer science student. Recently we were tasked to develop a static HTTP server from scratch without using any HTTP modules, solely depending on socket programming. So this means that I had to write all the logic for HTTP message parsing, extracting headers, parsing URLs, etc.
However, I'm stuck with some confusion. As I have some experience in web development, I'm used to writing URLs in places like anchor tags as "/about" and "/articles/article-1". However, I've seen people sometimes use relative paths that follow their folder structure, like "./about.html" and "../contact.html". This always seemed like a bad idea to me. However, I realized that even though my code does not explicitly support these kinds of URLs, they seem to work anyhow.
Following is the python code I'm using to get the path from the HTTP message and then get the corresponding path in the file system.
def get_http_url(self, raw_request_headers: list[str]):
    """
    Method to get HTTP url by parsing request headers
    """
    if len(raw_request_headers) > 0:
        method_and_path_header = raw_request_headers[0]
        method_and_path_header_segments = method_and_path_header.split(" ")
        if len(method_and_path_header_segments) >= 2:
            """
            example: GET / HTTP/1.1 => ['GET', '/', 'HTTP/1.1'] => '/'
            """
            url = method_and_path_header_segments[1]
            return url
    return False

def get_resource_path_for_url(self, path: str | Literal[False]):
    """
    Method to get the resource path based on url
    """
    if not path:
        return False
    else:
        if path.endswith('/'):
            # Removing trailing '/' to make it easy to parse the url
            path = path[0:-1]
        # Split to see if the url also includes the file extension
        parts = path.split('.')
        if path == '':
            # if the requested path is "/"
            path_to_resource = os.path.join(
                os.getcwd(), "htdocs", "index.html")
        else:
            # Assumes the user entered a valid url with the resource's file extension as well, ex: http://localhost:2728/pages/about.html
            if len(parts) > 1:
                path_to_resource = os.path.join(
                    os.getcwd(), "htdocs", path[1:])  # Get the absolute path with the existing file extension
            else:
                # Assumes the user requested a url without an extension and as such is hoping for an html response
                path_to_resource = os.path.join(
                    os.getcwd(), "htdocs", f"{path[1:]}.html")  # Get the absolute path to the corresponding html file
        return path_to_resource
So in my code, I'm not explicitly adding any logic to handle that kind of relative path. But when I use things like ../about.html in my test HTML files, it somehow works.
Is this the expected behavior, and if so, where is it implemented? I'm on Windows, if that matters. And if this is expected, can I depend on it and conclude that it's safe to refer to HTML files and other assets with relative paths like this on my web server?
Thanks in advance for any help, and I apologize if my question is not clear or well-formed.
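A likely explanation, sketched here rather than taken from the question: the browser resolves relative references against the URL of the page that contains them before it sends the request (RFC 3986, section 5), so the server only ever sees an already-resolved path such as /about.html. Python's urllib.parse.urljoin performs the same kind of resolution, so it can show what a client would request. The host, port and page names below are assumptions for illustration, and requested_path is a made-up helper name.

```python
from urllib.parse import urljoin, urlsplit

def requested_path(page_url, href):
    """Return the path a client would request for href found on page_url."""
    return urlsplit(urljoin(page_url, href)).path

page = "http://localhost:2728/pages/index.html"
print(requested_path(page, "../about.html"))        # /about.html
print(requested_path(page, "./contact.html"))       # /pages/contact.html
print(requested_path(page, "/articles/article-1"))  # /articles/article-1
```

A client that skips this resolution could still send a raw ../ in the request line, so the server probably should not rely on the browser's behavior for safety.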

How can I modify the url of the Superset welcome page?

I would like to know how I can modify the URL to the welcome page.
Currently it is /superset/welcome.
It is handled in superset/views/core.py, in a method decorated with @expose('/welcome').
I know I can modify the code inside this @expose method, but I want to redirect to another URL.
So I want to find the line where there is:
welcome_page = /superset/welcome
As of Superset 1.3, you can change the default landing page by adding this code to your Superset config:
from flask import Flask, redirect
from flask_appbuilder import expose, IndexView
from superset.typing import FlaskResponse
class SupersetDashboardIndexView(IndexView):
    @expose("/")
    def index(self) -> FlaskResponse:
        return redirect("/dashboard/list/")
FAB_INDEX_VIEW = f"{SupersetDashboardIndexView.__module__}.{SupersetDashboardIndexView.__name__}"
In the above example, I am using /dashboard/list/ instead of the default /superset/welcome/.
The code above is Unlicensed and thus is free and unencumbered software released into the public domain.
In superset's file structure, navigate to:
superset/app.py
There you will find
class SupersetIndexView(IndexView):
    @expose("/")
    def index(self) -> FlaskResponse:
        return redirect("/superset/welcome")
Change this to the path you want to redirect to.

Combining three URL components into a single URL

I am trying to write a function to combine three components of a URL (a protocol, a location, and a resource) into a single URL.
I have the following code, and it works only partially, returning a URL with only the protocol and resource components, but not the location component.
Code:
import urllib.parse
import os
def buildURL(protocol, location, resource):
    return urllib.parse.urljoin(protocol, os.path.join(location, resource))
Example: buildURL('http://', 'httpbin.org', '/get')
This returns http:///get. I am trying to debug this so that the location parameter is also included in the URL. It should return http://httpbin.org/get.
How can I build a URL successfully?
It's because you put /get in the os.path.join. You should call it like buildURL('http://', 'httpbin.org', 'get'). os.path.join treats a component that starts with / as an absolute path, so it discards every component before it, including the first parameter of the join call: location.
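To illustrate the point about absolute components, here is a small sketch. It uses posixpath (the POSIX flavour of os.path) so that the result is the same on every OS; join_parts is a hypothetical helper name.

```python
import posixpath

def join_parts(location, resource):
    """Join two path components the way os.path.join does on POSIX."""
    return posixpath.join(location, resource)

print(join_parts("httpbin.org", "/get"))  # /get  (location is discarded)
print(join_parts("httpbin.org", "get"))   # httpbin.org/get
```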
You shouldn't be using os.path here at all. That module is for filesystem paths, e.g. to deal with things like /usr/bin/bash and C:\Documents and Settings\User\.
It's not for building URLs. They aren't affected by the user's host OS.
Instead, use urlunparse() or urlunsplit() from urllib.parse:
from urllib.parse import urlunparse
urlunparse(('https', 'httpbin.org', '/get', None, None, None))
# 'https://httpbin.org/get'
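Applied to the three-argument function from the question, a minimal sketch could look like the following. It assumes the protocol may arrive either as 'http' or as 'http://', and that the resource may or may not start with a slash; build_url is just an illustrative name.

```python
from urllib.parse import urlunsplit

def build_url(protocol, location, resource):
    """Combine scheme, host and path into one URL without os.path."""
    scheme = protocol.split(":", 1)[0]   # 'http://' -> 'http'
    path = "/" + resource.lstrip("/")    # exactly one leading slash
    return urlunsplit((scheme, location, path, "", ""))

print(build_url("http://", "httpbin.org", "/get"))  # http://httpbin.org/get
print(build_url("https", "httpbin.org", "get"))     # https://httpbin.org/get
```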

Django : extract a path from a full URL

In a Django 1.8 simple tag, I need to resolve the path to the HTTP_REFERER found in the context. I have a piece of code that works, but I would like to know if a more elegant solution could be implemented using Django tools.
Here is my code :
from django.core.urlresolvers import resolve, Resolver404
# [...]
@register.simple_tag(takes_context=True)
def simple_tag_example(context):
    # The referer is a full path: http://host:port/path/to/referer/
    # We only want the path: /path/to/referer/
    referer = context.request.META.get('HTTP_REFERER')
    if referer is None:
        return ''
    # Build the string http://host:port/
    prefix = '%s://%s' % (context.request.scheme, context.request.get_host())
    path = referer.replace(prefix, '')
    resolvermatch = resolve(path)
    # Do something very interesting with this resolvermatch...
So I manually construct the string 'http://sub.domain.tld:port', then remove it from the full HTTP_REFERER URL found in context.request.META. It works, but it seems a bit heavy-handed to me.
I tried to build an HttpRequest from the referer without success. Is there a class or type that I can use to easily extract the path from a URL?
You can use the urlparse module (urllib.parse in Python 3) to extract the path:
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
parsed = urlparse('http://stackoverflow.com/questions/32809595')
print(parsed.path)
Output:
/questions/32809595
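Applied to the referer case, the idea could be sketched as below. The helper referer_path is a made-up name, and the Django-specific part (passing the result to resolve()) is left out so that the snippet runs on its own.

```python
from urllib.parse import urlparse  # Python 3

def referer_path(referer):
    """Return only the path of a full referer URL, or '' if there is none."""
    if not referer:
        return ''
    return urlparse(referer).path

print(referer_path('http://host:8000/path/to/referer/'))  # /path/to/referer/
```

Inside the simple tag, the returned path would take the place of the manually built prefix and the replace() call.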

python os.listdir() for remote locations

I recently noticed that
os.listdir('http://chymera.eu/data/faceRT')
complains about not finding my directories.
What can I do to be able to run os.listdir() on remote locations? I have checked and this is not a permissions issue, I can open the folder via my browser and my webftp client says it's 755.
Whatever I do, I would NOT like to have to use login information. I made a decision about sharing when I set the directory permissions. If I say r+x for everyone then I want that to mean r+x for everyone.
os.listdir expects the argument to be a path on the filesystem. It does not attempt to understand URLs.
You can use urllib to request the page and parse it to find the URLs.
OK, so I solved this by using HTMLParser to parse my web index:
if source == 'server':
    from HTMLParser import HTMLParser
    import urllib
    class ChrParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag =='a':
                for key, value in attrs:
                    if key == 'href' and value.endswith('.csv'):
                        pre_fileslist.append(value)
    results_dir = 'http://chymera.eu/data/faceRT'
    data_url = urllib.urlopen(results_dir).read()
    parser = ChrParser()
    pre_fileslist = []
    parser.feed(data_url) # pre_fileslist gets populated here
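The snippet above is Python 2. A rough Python 3 equivalent, under the assumption that the server returns an HTML index page with plain <a href="..."> links, might look like this. The names CsvLinkParser, list_remote_csv and fetch_listing are hypothetical, and the links are made absolute with urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class CsvLinkParser(HTMLParser):
    """Collect absolute URLs of all .csv links found in an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value and value.endswith(".csv"):
                self.links.append(urljoin(self.base_url, value))

def list_remote_csv(html, base_url):
    """Parse an index page (as text) and return the .csv URLs it links to."""
    parser = CsvLinkParser(base_url)
    parser.feed(html)
    return parser.links

def fetch_listing(url):
    """Download the index page; needs network access, so it is not called here."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be along the lines of list_remote_csv(fetch_listing(url), url), where url ends with a slash so that relative links resolve inside the directory.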
