I would like to add Yogaglo support to youtube-dl. I've followed the guidance on GitHub and have drafted the following:
# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor


class YogagloIE(InfoExtractor):
    _SIGNIN_URL = 'https://www.yogaglo.com/login'
    _PASSWORD_URL = 'https://www.yogaglo.com/login/password'
    _USER_URL = 'https://www.yogaglo.com/login/user'
    _ACCOUNT_CREDENTIALS_HINT = 'Use --username and --password options to provide yogaglo.com account credentials.'
    _NETRC_MACHINE = 'yogaglo'

    def _real_initialize(self):
        self._login()

    _VALID_URL = r'https?://(?:www\.)?yogaglo\.com/class/(?P<id>[0-9]+)'
    _TEST = {
        'url': 'https://www.yogaglo.com/class/7206',
        'md5': 'TODO: md5 sum of the first 10241 bytes of the video file (use --test)',
        'info_dict': {
            'id': '7206',
            'ext': 'mp4',
            'title': 'Have a Great Day!'
            # TODO more properties, either as:
            # * A value
            # * MD5 checksum; start the string with md5:
            # * A regular expression; start the string with re:
            # * Any Python type (for example int or float)
        }
    }

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage, 'title')
        return {
            'id': video_id,
            'title': title,
            'description': self._og_search_description(webpage),
            'uploader': self._search_regex(
                r'<div[^>]+id="uploader"[^>]*>([^<]+)<', webpage, 'uploader', fatal=False),
            # TODO more properties (see youtube_dl/extractor/common.py)
        }
I've added YogagloIE to the list of extractors, but when I run it I get an error that the URL is not supported. This is really a first draft, and any guidance on how to improve it and make it work would be appreciated.
In Python, indentation is significant, so make sure you indent your class correctly.
After that, you must define a _login method or simply leave _real_initialize empty.
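If you do want a working login rather than an empty _real_initialize, a rough sketch of such a _login method might look like the following (the login endpoint behaviour and form field names are guesses; check the site's actual login request before relying on them):

    def _login(self):
        # requires: from ..utils import urlencode_postdata (add to the imports at the top)
        username, password = self._get_login_info()
        if username is None:
            self.to_screen(self._ACCOUNT_CREDENTIALS_HINT)
            return
        # NOTE: 'email' and 'password' are assumed form field names; inspect
        # the site's real login request to confirm them.
        self._download_webpage(
            self._SIGNIN_URL, None, 'Logging in',
            data=urlencode_postdata({
                'email': username,
                'password': password,
            }))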
With that being implemented, the extractor will be called (although of course it's not functional yet):
$ youtube-dl test:yogaglo
[TestURL] Test URL: https://www.yogaglo.com/class/7206
[Yogaglo] 7206: Downloading webpage
ERROR: Unable to extract title; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
You can pull this working state with the following command (maybe git stash everything beforehand, and rename your old yogaglo.py file to something else):
git pull https://github.com/phihag/youtube-dl.git yogaglo
I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)
The example above only gets information about the revisions, not the actual contents themselves. Here's a short Python script that downloads the full content and revision metadata of a page into individual JSON files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
    info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
    print(i, info['timestamp'])
    open("%s.json" % info['timestamp'], "w").write(json.dumps(
        {'info': info,
         'content': content}, indent=4))
While wandering around looking for clues to another question of my own (my way of saying I know nothing substantial about this topic!), I came upon this just after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look at the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, os, pickle
from mwclient import Site
from mwclient.page import Page

logging.root.setLevel(logging.DEBUG)

logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
page_name = env_page if env_page is not None else 'Stack Overflow'
page_name = Page.normalize_title(page_name)  # normalize the chosen title, not the raw env value
site = Site('en.wikipedia.org')  # https by default. change w/ `scheme=`
page = site.pages[page_name]

logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for revision in page.revisions():
    revisions.append(revision)

logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
    pickle.dump(revisions, f, protocol=0)  # protocol 0 allows backwards compatibility between machines
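Reading the revisions back out of the pickle is then straightforward; the filename below assumes the default 'Stack Overflow' page, so adjust it to whatever .pkl file the script actually wrote:

import pickle

# hypothetical filename; match it to the file produced by the script above
with open('Stack_OverflowRevisions.mediawiki.pkl', 'rb') as f:
    revisions = pickle.load(f)
print(len(revisions), 'revisions loaded')
print(revisions[0])  # each entry is a dict of revision metadata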
Is it possible to show HTML, like tables with CSS styling, instead of JSON/CSV/plain text?
I tried sending HTML as a string, but it just gets inserted as raw text.
Thanks in advance!
This is an old question, but I just went through this and wanted to share my solution in case anyone else is going through it as well.
What I ended up having to do is shim the rest_framework.compat.apply_markdown function to enable the tables extension. At the end of my settings.py, I added the following:
# Monkey patch rest_framework's markdown rendering function, to enable the
# tables extension.
import markdown
import rest_framework.compat
def apply_markdown(text):
    """
    Simple wrapper around :func:`markdown.markdown` to set the base level
    of '#' style headers to <h2>.
    """
    extensions = ['markdown.extensions.toc', 'markdown.extensions.tables']
    extension_configs = {
        'markdown.extensions.toc': {
            'baselevel': '2'
        }
    }
    md = markdown.Markdown(
        extensions=extensions, extension_configs=extension_configs
    )
    return md.convert(text)

rest_framework.compat.apply_markdown = apply_markdown
In this case I'm using DRF 3.6.4 and markdown 2.6.9. In the original rest_framework.compat.apply_markdown function there's some code that sets different options based on the version of markdown, but I omitted that in the shim.
Also note that the default tables extension may not give you tables styled the way you want. I ended up copying markdown/extensions/tables.py into a new module and adding class="table" to the table element. The source for that change is in a gist. For more information about the limited table syntax in markdown see this thread.
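If you would rather not maintain a copy of tables.py, a lighter-weight alternative is a tiny Treeprocessor extension that stamps class="table" onto every rendered table. This is only a sketch against Markdown 3.x (the 2.x extendMarkdown signature differs), not the approach described above:

import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor

class TableClassTreeprocessor(Treeprocessor):
    def run(self, root):
        # add a CSS class to every <table> produced by the tables extension
        for table in root.iter('table'):
            table.set('class', 'table')

class TableClassExtension(Extension):
    def extendMarkdown(self, md):
        md.treeprocessors.register(TableClassTreeprocessor(md), 'table_class', 5)

# usage:
# html = markdown.markdown(text, extensions=['tables', TableClassExtension()])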
For djangorestframework >= 3.7, update the default view description function.
https://www.django-rest-framework.org/api-guide/settings/#view_description_function
REST_FRAMEWORK = {
    # Module path to a callable which should have a signature (self, html=False)
    'VIEW_DESCRIPTION_FUNCTION': 'app_name.view_description.get_view_description',
}
view_description.py
import markdown
from django.utils.encoding import smart_text
from django.utils.html import escape
from django.utils.safestring import mark_safe
from rest_framework.compat import (
    HEADERID_EXT_PATH, LEVEL_PARAM, md_filter_add_syntax_highlight
)
from rest_framework.utils import formatting

TABLE_EXTENSION_PATH = 'markdown.extensions.tables'

def _apply_markdown(text):
    extensions = [HEADERID_EXT_PATH, TABLE_EXTENSION_PATH]
    extension_configs = {
        HEADERID_EXT_PATH: {
            LEVEL_PARAM: '2'
        }
    }
    md = markdown.Markdown(
        extensions=extensions, extension_configs=extension_configs
    )
    md_filter_add_syntax_highlight(md)
    return md.convert(text)

def get_view_description(view_cls, html=False):
    description = view_cls.__doc__ or ''
    description = formatting.dedent(smart_text(description))
    if html:
        return mark_safe(_apply_markdown(description))
    return description
In your settings.py, use renderer classes:
REST_FRAMEWORK = {
    'DEFAULT_RENDERER_CLASSES': [
        'rest_framework.renderers.AdminRenderer',
    ],
}
Have you installed the markdown and django-filter packages? These were required to get our HTML browsing capability working on a recent project.
sudo pip install markdown
sudo pip install django-filter
In Python 2, I can use the following code to resolve either a MacOS alias or a symbolic link:
from Carbon import File
File.FSResolveAliasFile(alias_fp, True)[0].as_pathname()
where alias_fp is the path to the file I'm curious about, stored as a string (source).
However, the documentation cheerfully tells me that the whole Carbon family of modules is deprecated. What should I be using instead?
EDIT: I believe the code below is a step in the right direction for the PyObjC approach. It doesn't resolve aliases, but it seems to detect them.
import os

from AppKit import NSWorkspace

def is_alias(path):
    uti, err = NSWorkspace.sharedWorkspace().typeOfFile_error_(
        os.path.realpath(path), None)
    if err:
        raise Exception(unicode(err))
    else:
        return "com.apple.alias-file" == uti
(source)
Unfortunately I'm not able to get @Milliways's solution working (knowing nothing about Cocoa), and stuff I find elsewhere on the internet looks far more complicated (perhaps it's handling all kinds of edge cases?).
The PyObjC bridge lets you access NSURL's bookmark handling, which is the modern (backwards compatible) replacement for aliases:
import os.path
from Foundation import *
def target_of_alias(path):
    url = NSURL.fileURLWithPath_(path)
    bookmarkData, error = NSURL.bookmarkDataWithContentsOfURL_error_(url, None)
    if bookmarkData is None:
        return None
    opts = NSURLBookmarkResolutionWithoutUI | NSURLBookmarkResolutionWithoutMounting
    resolved, stale, error = NSURL.URLByResolvingBookmarkData_options_relativeToURL_bookmarkDataIsStale_error_(
        bookmarkData, opts, None, None, None)
    return resolved.path()

def resolve_links_and_aliases(path):
    while True:
        alias_target = target_of_alias(path)
        if alias_target:
            path = alias_target
            continue
        if os.path.islink(path):
            path = os.path.realpath(path)
            continue
        return path
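Usage is then just the following (the path is made up; point it at a real Finder alias or symlink on your system):

print(resolve_links_and_aliases('/Users/me/Desktop/My Alias'))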
The following Cocoa code will resolve an alias.
NSURL *targetOfAlias(NSURL *url) {
    CFErrorRef *errorRef = NULL;
    CFDataRef bookmark = CFURLCreateBookmarkDataFromFile(NULL, (__bridge CFURLRef)url, errorRef);
    if (bookmark == nil) return nil;
    CFURLRef resolvedUrl = CFURLCreateByResolvingBookmarkData(NULL, bookmark, kCFBookmarkResolutionWithoutUIMask, NULL, NULL, false, errorRef);
    CFRelease(bookmark);
    return CFBridgingRelease(resolvedUrl);
}
I don't know how to invoke the Cocoa framework from Python, but I am sure someone has done it.
The following link shows code to resolve an alias or symlink: https://stackoverflow.com/a/21151368/838253
The APIs those modules use are deprecated by Apple, it appears. You should use POSIX APIs instead.
os.path.realpath(FILE_OBJECT.name)
I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.
To clarify: I might have this (overly simplified code).
<script>
    suffix = magicNumberFunctionIDontHaveAccessTo();
    url = "http://foobar.com/function?parameter=" + suffix;
    img = document.createElement('img');
    img.src = url;
    document.all.body.appendChild(img);
</script>
Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be much appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.
That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instance be started with the option captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick any one of the many http proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them) e.g. appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.
Now all you need is a script that:
- starts the proxy server in question (a bare-bones sketch of such a proxy is shown below);
- starts Firefox (or whatever) on your main desired URL with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well;
- waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds);
- reads XXX.txt to extract only the URLs you care about and records them wherever you wish;
- shuts down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".
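For what it's worth, a bare-bones sketch of such a logging proxy using only the standard library might look like this (plain-HTTP GET only, no HTTPS/CONNECT handling; the port and log file name are arbitrary):

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

LOG_FILE = 'XXX.txt'

class LoggingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # when the browser uses this as a proxy, self.path holds the absolute URL
        with open(LOG_FILE, 'a') as log:
            log.write(self.path + '\n')
        try:
            upstream = urlopen(self.path)
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except Exception:
            self.send_error(502)

if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8888), LoggingProxy).serve_forever()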
Use the Firebug Firefox plugin. It will show you all requests in real time, and you can even debug the JS in your browser or run it step by step.
Ultimately, I did it in Python, using Selenium RC. This solution requires the Python files for Selenium RC, and you need to start the Java server ("java -jar selenium-server.jar").
from selenium import selenium
import unittest
import lxml.html
class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost",
                                 4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):
        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute, link, pos) in lxml.html.iterlinks(htmldoc)]
        for url in url_list:
            try:
                sel = self.selenium
                sel.open(url)
                sel.select_window("null")
                js_code = '''
                myDomainWindow = this.browserbot.getUserWindow();
                for(obj in myDomainWindow) {
                    /* This code grabs the OMNITURE tracking pixel img */
                    if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {
                        var ret = myDomainWindow[obj].src;
                    }
                }
                ret;
                '''
                omniture_url = sel.get_eval(js_code)  # parse & process this however you want
            except Exception, e:
                print 'We ran into an error: %s' % (e,)

        self.assertEqual("expectedValue", observedValue)

    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is considered):
var ac = document.body.appendChild;
var sources = [];
document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    // call the original appendChild with the right `this`, otherwise it throws
    ac.call(document.body, child);
};
I am using Sphinx to document a webservice that will be deployed on different servers. The documentation is full of URL examples for the user to click, and they should just work. My problem is that the host, port, and deployment root will vary, and the documentation will have to be re-generated for every deployment.
I tried defining substitutions like this:
|base_url|/path
.. |base_url| replace:: http://localhost:8080
But the generated HTML is not what I want (doesn't include "/path" in the generated link):
http://localhost:8080/path
Does anybody know how to work around this?
New in Sphinx v1.0:
sphinx.ext.extlinks – Markup to shorten external links
https://www.sphinx-doc.org/en/master/usage/extensions/extlinks.html
The extension adds one new config value:
extlinks
This config value must be a dictionary of external sites, mapping unique short alias names to a base URL and a prefix. For example, to create an alias for the above mentioned issues, you would add
extlinks = {'issue':
    ('http://bitbucket.org/birkenfeld/sphinx/issue/%s', 'issue ')}
Now, you can use the alias name as a new role, e.g. :issue:`123`. This then inserts a link to http://bitbucket.org/birkenfeld/sphinx/issue/123. As you can see, the target given in the role is substituted in the base URL in the place of %s.
The link caption depends on the second item in the tuple, the prefix:
If the prefix is None, the link caption is the full URL.
If the prefix is the empty string, the link caption is the partial URL given in the role content (123 in this case.)
If the prefix is a non-empty string, the link caption is the partial URL, prepended by the prefix – in the above example, the link caption would be issue 123.
You can also use the usual “explicit title” syntax supported by other roles that generate links, i.e. :issue:`this issue <123>`. In this case, the prefix is not relevant.
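Applied to the question's base-URL problem, the conf.py entry might look roughly like this (the 'service' alias name is just an example, and the base URL would be whatever host/port you deploy to):

# conf.py
extensions = ['sphinx.ext.extlinks']

extlinks = {
    # alias name 'service' is arbitrary; a None caption means the full URL is shown
    'service': ('http://localhost:8080/%s', None),
}

Then writing :service:`path` in the docs renders as a clickable link to http://localhost:8080/path, and regenerating the docs for another deployment only means changing the base URL in conf.py.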
I had a similar problem where I also needed to substitute URLs in image targets.
The extlinks do not expand when used as the value of an image's :target: attribute.
Eventually I wrote a custom Sphinx transform that rewrites URLs that start with a given prefix, in my case http://mybase/. Here is the relevant code for conf.py:
from sphinx.transforms import SphinxTransform
class ReplaceMyBase(SphinxTransform):
    default_priority = 750
    prefix = 'http://mybase/'

    def apply(self):
        from docutils.nodes import reference, Text
        baseref = lambda o: (
            isinstance(o, reference) and
            o.get('refuri', '').startswith(self.prefix))
        basetext = lambda o: (
            isinstance(o, Text) and o.startswith(self.prefix))
        base = self.config.mybase.rstrip('/') + '/'
        for node in self.document.traverse(baseref):
            target = node['refuri'].replace(self.prefix, base, 1)
            node.replace_attr('refuri', target)
            for t in node.traverse(basetext):
                t1 = Text(t.replace(self.prefix, base, 1), t.rawsource)
                t.parent.replace(t, t1)
        return
# end of class

def setup(app):
    app.add_config_value('mybase', 'https://en.wikipedia.org/wiki', 'env')
    app.add_transform(ReplaceMyBase)
    return
This expands the following rst source to point to English wikipedia.
When conf.py sets mybase="https://es.wikipedia.org/wiki" the links would point to the Spanish wiki.
* inline link http://mybase/Helianthus
* `link with text <http://mybase/Helianthus>`_
* `link with separate definition`_
* image link |flowerimage|
.. _link with separate definition: http://mybase/Helianthus
.. |flowerimage| image:: https://upload.wikimedia.org/wikipedia/commons/f/f1/Tournesol.png
   :target: http://mybase/Helianthus
Ok, here's how I did it. First, apilinks.py (the Sphinx extension):
from docutils import nodes, utils
def setup(app):
    def api_link_role(role, rawtext, text, lineno, inliner, options={},
                      content=[]):
        ref = app.config.apilinks_base + text
        node = nodes.reference(rawtext, utils.unescape(ref), refuri=ref,
                               **options)
        return [node], []
    app.add_config_value('apilinks_base', 'http://localhost/', False)
    app.add_role('apilink', api_link_role)
Now, in conf.py, add 'apilinks' to the extensions list and set an appropriate value for 'apilinks_base' (otherwise, it will default to 'http://localhost/'). My file looks like this:
extensions = ['sphinx.ext.autodoc', 'apilinks']
# lots of other stuff
apilinks_base = 'http://host:88/base/'
Usage:
:apilink:`path`
Output:
http://host:88/base/path
You can write a Sphinx extension that creates a role like
:apilink:`path`
and generates the link from that. I never did this, so I can't help more than giving this pointer, sorry. You should try to look at how the various roles are implemented. Many are very similar to what you need, I think.