Find names of all generated HTML files in Sphinx extension - python

I want to create a simple Sphinx extension which post-processes the HTML files generated by the HTML builder. I have written a post-processing routine using BeautifulSoup, but I ran into trouble converting it into a separate Sphinx extension.
I've registered my handler for the "build-finished" event using app.connect, but I still cannot figure out how to get the list of filenames to post-process.
How can I get the list of all HTML files that were generated? Or, at least:
How can I get the output directory? I've found that I can use env.found_docs and builder.get_target_uri to obtain the relative path of a generated page, but I still need the output directory name.

It's not mentioned in the Sphinx documentation for some reason, but the path to the output directory is available as app.outdir. After discovering this, it was easy to gather all the filenames I needed:
import os

def process_build_finished(app, exception):
    if exception is not None:
        return
    target_files = []
    for doc in app.env.found_docs:
        target_filename = app.builder.get_target_uri(doc)
        target_filename = os.path.join(app.outdir, target_filename)
        target_filename = os.path.abspath(target_filename)
        target_files.append(target_filename)
    ...
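For completeness, a minimal setup() that registers the handler might look like the following (a sketch; the metadata dict returned from setup() is optional):
def setup(app):
    # run the post-processing once the build has finished
    app.connect('build-finished', process_build_finished)
    return {'version': '0.1'}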

Related

How can one parse whole XML documents using the LXML Sax module?

I have a script that goes through a directory with many XML files and extracts or adds information to these files. I use XPath to identify the elements of interest.
The relevant piece of code is this:
import lxml.etree as et
import lxml.sax

# deleted non relevant code

for root, dirs, files in os.walk(ROOT):
    # iterate all files
    for file in files:
        if file.endswith('.xml'):
            # join root dir and file name
            file_path = os.path.join(ROOT, file)
            # load root element from file
            file_root = et.parse(file_path).getroot()
            # This is a function that I define elsewhere in which I use XPath to identify
            # relevant elements and extract, change or add some information
            xml_dosomething(file_root)
            # init tree object from file_root
            tree = et.ElementTree(file_root)
            # save modified xml tree object to file with an added suffix
            # so that I can keep a copy of the original
            tree.write(file_path.replace('.xml', '-clean.xml'),
                       encoding='utf-8',
                       doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">',
                       xml_declaration=True)
I have seen in various places that people recommend using Sax(on) to speed up the processing of large files. After checking the documentation of the lxml SAX module (https://lxml.de/sax.html), I'm at a loss as to how to modify my code so that I can leverage it. I can see the following in the documentation:
handler = lxml.sax.ElementTreeContentHandler()
Then there is a list of statements like handler.startElementNS((None, 'a'), 'a', {}) that would populate the 'handler' "document" with what would be the elements of the XML document. After that I see:
tree = handler.etree
lxml.etree.tostring(tree.getroot())
I think I understand what handler.etree does, but my problem is that I want 'handler' to be built from the files in the directory I'm working with, rather than from statements like 'handler.startElementNS' that I write by hand. What do I need to change in my code to get the SAX module to do the work that needs to be done with the files as input?
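For what it's worth, the page linked above also documents lxml.sax.saxify, which replays an already-parsed tree as SAX events into a content handler. A minimal sketch of feeding a file through a handler that way (the handler subclass and file name below are hypothetical, and this by itself won't make the processing faster):
import lxml.etree as et
import lxml.sax

class MyHandler(lxml.sax.ElementTreeContentHandler):
    # hypothetical handler; real code would override startElementNS() etc.
    pass

tree = et.parse('some-file.xml')   # placeholder file name
handler = MyHandler()
lxml.sax.saxify(tree, handler)     # push the parsed tree through the handler
result = handler.etree             # ElementTreeContentHandler rebuilds a tree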

Get URLS of files in Dropbox folder in python

I have a bunch of folders in Dropbox with pictures in them, and I'm trying to get a list of URLs for all of the pictures in a specific folder.
import requests
import json
import dropbox

TOKEN = 'my_access_token'

dbx = dropbox.Dropbox(TOKEN)
for entry in dbx.files_list_folder('/Main/Test').entries:
    # print(entry.name)
    print(entry.file_requests.FileRequest.url)
    # print(entry.files.Metadata.path_lower)
    # print(entry.file_properties.PropertyField)
Printing the entry name correctly lists all of the file names in the folder, but everything else fails with errors like 'FileMetadata' object has no attribute 'get_url'.
The files_list_folder method returns a ListFolderResult, where ListFolderResult.entries is a list of Metadata. Files in particular are FileMetadata.
Also, note that you aren't guaranteed to get everything back from a single files_list_folder call, so make sure you implement files_list_folder_continue as well. Refer to the documentation for more information.
The kind of link you mentioned is a shared link. A FileMetadata doesn't itself contain a link like that. You can get the path from path_lower though. For example, in the for loop in your code, that would look like print(entry.path_lower).
You should use sharing_list_shared_links to list existing links, and/or sharing_create_shared_link_with_settings to create shared links for any particular file as needed.
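Putting those pieces together, a rough sketch (the folder path and token are the placeholders from the question; sharing_create_shared_link_with_settings raises an ApiError if a link already exists for a file, which this sketch doesn't handle):
import dropbox

TOKEN = 'my_access_token'
dbx = dropbox.Dropbox(TOKEN)

result = dbx.files_list_folder('/Main/Test')
while True:
    for entry in result.entries:
        # only files get shared links here; sub-folders are skipped
        if isinstance(entry, dropbox.files.FileMetadata):
            link = dbx.sharing_create_shared_link_with_settings(entry.path_lower)
            print(entry.name, link.url)
    if not result.has_more:
        break
    result = dbx.files_list_folder_continue(result.cursor)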

Get the torrent download directory from a torrent file using python-libtorrent

I need the default directory a torrent file will create when it is started in any torrent manager, as a string. I'm not a programmer, but with some help I was able to obtain the contents (files) of the torrent as strings:
info = libtorrent.torrent_info(torrent_file)

for f in info.files():
    file_name = "%s" % (f.path)
    # do something with file_name
One thing to keep in mind is that there are two kinds of torrent files: single-file torrents and multi-file torrents. The typical filename structure of the two kinds is:
single-file torrents: save-path/torrent-name
multi-file torrents: save-path/torrent-name/all-files-in-torrent
It sounds like you're looking for the name of the directory the torrent's files are stored in (by convention of most clients), i.e. the torrent name.
Example code to do this in Python using libtorrent:
import libtorrent as lt
import sys

ti = lt.torrent_info(sys.argv[1])

if ti.num_files() > 1:
    print(ti.name())
else:
    # single-file torrent: name() may be a filename
    # instead of a directory name, so there is no
    # per-torrent sub-directory
    pass

Google Docs python gdata 2.0.16 upload file to existing collection

I have managed to create a simple app which deletes (bypassing the recycle bin) any files I want to. It can also upload files. The problem I am having is that I cannot specify which collection the new file should be uploaded to.
import gdata.data
import gdata.docs.client
import gdata.docs.data

def UploadFile(folder, filename, local_file, client):
    print "Upload Resource"
    doc = gdata.docs.data.Resource(type='document', title=filename)
    path = _GetDataFilePath(local_file)
    media = gdata.data.MediaSource()
    media.SetFileHandle(path, 'application/octet-stream')
    create_uri = gdata.docs.client.RESOURCE_UPLOAD_URI + '?convert=false'
    collection_resource = folder
    upload_doc = client.CreateResource(doc, create_uri=create_uri,
                                       collection=collection_resource, media=media)
    print 'Created, and uploaded:', upload_doc.title, doc.resource_id
From what I understand, CreateResource requires a resource object representing the collection. How do I get this object? The variable folder is currently just the string 'daily', which is the name of the collection; it is this variable which I need to replace with the collection resource.
From various sources, snippets, and generally stuff all over the place, I managed to work this out. You need to pass a URI to the FindAllResources function (which I found no mention of in the sample code from gdata).
I have written up in more detail how I managed to upload, delete (bypassing the bin), search for, and move files into collections here.
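A rough sketch of what that lookup might look like (the folder query URI and the GetAllResources call are assumptions based on the Documents List API; the answer above refers to the method as FindAllResources, and the collection is matched by its title, 'daily' in this case):
# hypothetical helper: fetch the collection resource whose title matches
def GetCollection(client, title):
    # restrict the feed to folders whose title matches exactly
    uri = ('https://docs.google.com/feeds/default/private/full/'
           '-/folder?title=%s&title-exact=true' % title)
    resources = client.GetAllResources(uri=uri)
    return resources[0] if resources else None

collection = GetCollection(client, 'daily')
upload_doc = client.CreateResource(doc, create_uri=create_uri,
                                   collection=collection, media=media)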

Determine if a listing is a directory or file in Python over FTP

Python has a standard library module ftplib to run FTP communications. It has two means of getting a listing of directory contents. One, FTP.nlst(), will return a list of the contents of a directory given a directory name as an argument. (It will return the name of a file if given a file name instead.) This is a robust way to list the contents of a directory but does not give any indication whether each item in the list is a file or directory. The other method is FTP.dir(), which gives a string formatted listing of the directory contents of the directory given as an argument (or of the file attributes, given a file name).
According to a previous question on Stack Overflow, parsing the results of dir() can be fragile (different servers may return different strings). I'm looking for some way to list just the directories contained within another directory over FTP, though. To the best of my knowledge, scraping for a d in the permissions part of the string is the only solution I've come up with, but I guess I can't guarantee that the permissions will appear in the same place between different servers. Is there a more robust solution to identifying directories over FTP?
Unfortunately FTP doesn't have a command to list just folders, so parsing the results you get from ftp.dir() would be 'best'.
A simple approach, assuming a standard ls-style listing (not a Windows FTP server):
from ftplib import FTP

ftp = FTP(host, user, passwd)
listing = []
ftp.dir(listing.append)  # dir() passes each line to a callback; it doesn't return them
for r in listing:
    if r.upper().startswith('D'):
        print r[58:]  # starting point; the column offset may differ between servers
Standard FTP Commands
Custom FTP Commands
If the FTP server supports the MLSD command, then please check that answer for a couple of useful classes (FTPDirectory and FTPTree).
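If MLSD is available, ftplib can also use it directly (FTP.mlsd was added in Python 3.3); a minimal sketch, with host, user and passwd as placeholders:
from ftplib import FTP

ftp = FTP(host, user, passwd)
for name, facts in ftp.mlsd():
    # the 'type' fact distinguishes directories from files
    if facts.get('type') == 'dir':
        print(name)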
Another way is to assume everything is a directory and try to change into it. If this succeeds, it is a directory; if it throws an ftplib.error_perm, it is probably a file. You can then catch the exception. Sure, this isn't really the safest approach, but neither is parsing the listing string for a leading 'd'.
Example
import ftplib

def processRecursive(ftp, directory):
    ftp.cwd(directory)
    # put whatever you want to do in each directory here
    # when you have called processRecursive with a file,
    # the cwd() call above raises error_perm and returns to the caller

    # get the files and directories contained in the current directory
    filenames = []
    ftp.retrlines('NLST', filenames.append)
    for name in filenames:
        try:
            processRecursive(ftp, name)
        except ftplib.error_perm:
            # put whatever you want to do with files here
            pass
    # put whatever you want to do after processing the files
    # and sub-directories of a directory here
    ftp.cwd('..')  # go back up so the caller keeps listing the right directory
