python os.listdir() for remote locations - python

I recently noticed that
os.listdir('http://chymera.eu/data/faceRT')
complains about not finding my directories.
What can I do to be able to run os.listdir() on remote locations? I have checked and this is not a permissions issue, I can open the folder via my browser and my webftp client says it's 755.
Whatever I do, I would NOT like to have to use login information. I made a decision about sharing when I set the directory permissions. If I say r+x for everyone then I want that to mean r+x for everyone.

os.listdir expects the argument to be a path on the filesystem. It does not attempt to understand URLs
You can use urllib to request the page and parse it to find the URLs

Ok, so I solved this by using the HTMLparser to parse my web index:
if source == 'server':
from HTMLParser import HTMLParser
import urllib
class ChrParser(HTMLParser):
def handle_starttag(self, tag, attrs):
if tag =='a':
for key, value in attrs:
if key == 'href' and value.endswith('.csv'):
pre_fileslist.append(value)
results_dir = 'http://chymera.eu/data/faceRT'
data_url = urllib.urlopen(results_dir).read()
parser = ChrParser()
pre_fileslist = []
parser.feed(data_url) # pre_fileslist gets populated here

Related

how to get the text in one of the divs? (html)

I am writing my bot, which so far has to get the text from the div from one page and put it in a variable, but this does not work out and the variable always remains empty. How i can extract it?
import telebot;
import requests
from lxml import etree
import lxml.html
import csv
bot = telebot.TeleBot('');
#bot.message_handler(content_types=['text'])
def get_text_messages(message):
api = requests.get("https://slovardalja.net/word.php?wordid=21880")
tree = lxml.html.document_fromstring(api.text)
text_original = tree.xpath('/html/body/table/tbody/tr[2]/td/table/tbody/tr/td[2]/index/div[2]/p[1]/strong/text()')
print(text_original)
bot.send_message(message.chat.id,str(text_original))
bot.polling(none_stop=True, interval=0)
https://slovardalja.net/word.php?wordid=21880
I think this code should get the word "ОЛЕКВАС", I copied the path to it and added /text(), but it doesn't work
I have no cyrillic on my system, but with a smaller xpath value and the usage from text_content it print something on shell, hopefully it helps
api = requests.get("https://slovardalja.net/word.php?wordid=21880")
tree = lxml.html.document_fromstring(api.text)
text_original = tree.xpath('//div[#align="justify"]/p/strong')
print(text_original[0].text_content())

Is it always correct to use URLs like "./about.html" or "../about.htm" instead of Absolute URLS like /about?

I'm a computer science student. Recently we were tasked to develop a static HTTP server from scratch without using any HTTP modules, solely depending on socket programming. So this means that I had to write all the logic for HTTP message parsing, extracting headers, parsing URLs, etc.
However, I'm stuck with some confusion. As I'm somewhat experienced in web development before, I'm used to using URLs in places like anchor tags like this "/about", and "/articles/article-1".However, I've seen people sometimes people to relative paths according to their folder structure like this. "./about.html", "../contact.html".This always seemed to be a bad idea to me. However, I realized that even though in my code I'm not supporting these kinds of URLs explicitly, it seems to work anyhow.
Following is the python code I'm using to get the path from the HTTP message and then get the corresponding path in the file system.
def get_http_url(self, raw_request_headers: list[str]):
"""
Method to get HTTP url by parsing request headers
"""
if len(raw_request_headers) > 0:
method_and_path_header = raw_request_headers[0]
method_and_path_header_segments = method_and_path_header.split(" ")
if len(method_and_path_header_segments) >= 2:
"""
example: GET / HTTP/1.1 => ['GET', '/', 'HTTP/1.1] => '/'
"""
url = method_and_path_header_segments[1]
return url
return False
def get_resource_path_for_url(self, path: str | Literal[False]):
"""
Method to get the resource path based on url
"""
if not path:
return False
else:
if path.endswith('/'):
# Removing trailing '/' to make it easy to parse the url
path = path[0:-1]
# Split to see if the url also includes the file extension
parts = path.split('.')
if path == '':
# if the requested path is "/"
path_to_resource = os.path.join(
os.getcwd(), "htdocs", "index.html")
else:
# Assumes the user entered a valid url with resources file extension as well, ex: http://localhost:2728/pages/about.html
if len(parts) > 1:
path_to_resource = os.path.join(
os.getcwd(), "htdocs", path[1:]) # Get the abslute path with the existing file extension
else:
# Assumes user requested a url without an extension and as such is hoping for a html response
path_to_resource = os.path.join(
os.getcwd(), "htdocs", f"{path[1:]}.html") # Get the absolute path to the corresponding html file
return path_to_resource
So in my code, I'm not explicitly adding any logic to handle that kind of relative path. But somehow, when I use things like ../about.html in my test HTML files, it somehow works?
Is this the expected behavior? As of now (I would like to know where this behavior is implemented), I'm on Windows if that matters. And if this is expected, can I depend on this behavior and conclude that it's safe to refer to HTML files and other assets with relative paths like this on my web server?
Thanks in advance for any help, and I apologize if my question is not clear or well-formed.

Django : extract a path from a full URL

In a Django 1.8 simple tag, I need to resolve the path to the HTTP_REFERER found in the context. I have a piece of code that works, but I would like to know if a more elegant solution could be implemented using Django tools.
Here is my code :
from django.core.urlresolvers import resolve, Resolver404
# [...]
#register.simple_tag(takes_context=True)
def simple_tag_example(context):
# The referer is a full path: http://host:port/path/to/referer/
# We only want the path: /path/to/referer/
referer = context.request.META.get('HTTP_REFERER')
if referer is None:
return ''
# Build the string http://host:port/
prefix = '%s://%s' % (context.request.scheme, context.request.get_host())
path = referer.replace(prefix, '')
resolvermatch = resolve(path)
# Do something very interesting with this resolvermatch...
So I manually construct the string 'http://sub.domain.tld:port', then I remove it from the full path to HTTP_REFERER found in context.request.META. It works but it seems a bit overwhelming for me.
I tried to build a HttpRequest from referer without success. Is there a class or type that I can use to easily extract the path from an URL?
You can use urlparse module to extract the path:
try:
from urllib.parse import urlparse # Python 3
except ImportError:
from urlparse import urlparse # Python 2
parsed = urlparse('http://stackoverflow.com/questions/32809595')
print(parsed.path)
Output:
'/questions/32809595'

Making relative paths absolute in python

I want to crawl web page with python, the problem is with relative paths, I have the following functions which normalize and derelativize urls in web page, I can not implement one part of derelativating function. Any ideas? :
def normalizeURL(url):
if url.startswith('http')==False:
url = "http://"+url
if url.startswith('http://www.')==False:
url = url[:7]+"www."+url[7:]
return url
def deRelativizePath(url, path):
url = normalizeURL(url)
if path.startswith('http'):
return path
if path.startswith('/')==False:
if url.endswith('/'):
return url+path
else:
return url+"/"+path
else:
#this part is missing
The problem is: I do not know how to get main url, they can be in many formats:
http://www.example.com
http://www.example.com/
http://www.sub.example.com
http://www.sub.example.com/
http://www.example.com/folder1/file1 #from this I should extract http://www.example.com/ then add path
...
I recommend that you consider using urlparse.urljoin() for this:
Construct a full ("absolute") URL by combining a "base URL" (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.
from urlparse import urlparse
And then parse into the respective parts.

Python script for "Google search by image"

I have checked Google Search API's and it seems that they have not released any API for searching "Images". So, I was wondering if there exists a python script/library through which I can automate the "search by image feature".
This was annoying enough to figure out that I thought I'd throw a comment on the first python-related stackoverflow result for "script google image search". The most annoying part of all this is setting up your proper application and custom search engine (CSE) in Google's web UI, but once you have your api key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil
url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]
i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
link = result['link']
image = requests.get(link, stream=True)
if image.status_code == 200:
m = re.search(r'[^\.]+$', link)
filename = './{}-{}.{}'.format(q, i, m.group())
with open(filename, 'wb') as f:
image.raw.decode_content = True
shutil.copyfileobj(image.raw, f)
i += 1
There is no API available but you are can parse the page and imitate the browser, but I don't know how much data you need to parse because google may limit or block access.
You can imitate the browser by simply using urllib and setting correct headers, but if you think parsing complex web-pages may be difficult from python, you can directly use a headless browser like phontomjs, inside a browser it is trivial to get correct elements using javascript/DOM
Note before trying all this check google's TOS
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.

Categories