I am building a small program in Python, and I would like to have a GUI for some configuration settings. I have started with a BaseHTTPServer and am implementing a BaseHTTPRequestHandler to handle GET and POST requests, but I am wondering what best practice would be for the following problem.
I have two separate requests that result in very similar responses; that is, the two pages I return have a lot of HTML in common. I could create one template HTML page that I retrieve whenever either request comes in and fill in the missing pieces according to the specific request. But I feel like there should be a way to serve two separate HTML pages for the two requests while still sharing one template, so that I don't have to duplicate the common HTML.
I would like to know how best to handle this, ideally in a way that scales. Thanks!
This has nothing to do with BaseHTTPRequestHandler: its purpose is to serve the response. How you generate the HTML is a separate topic.
You should use a templating tool; there are a lot available for Python, and I would suggest Mako or Jinja2. Then, in your code, just render the real HTML from the template and use it in your handler's response.
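For illustration, here is a minimal sketch of that idea with Jinja2 and Python 3's http.server (the Python 2 module is called BaseHTTPServer); the paths, titles and form snippets are placeholders:

from http.server import BaseHTTPRequestHandler, HTTPServer
from jinja2 import Template  # pip install Jinja2

# One shared template; only the differing pieces are filled in per request.
PAGE = Template("""
<html>
  <head><title>{{ title }}</title></head>
  <body>
    <h1>{{ title }}</h1>
    {{ body }}
  </body>
</html>
""")

class ConfigHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pick the page-specific pieces based on the request path.
        if self.path == "/network":
            html = PAGE.render(title="Network settings", body="<form>...</form>")
        else:
            html = PAGE.render(title="General settings", body="<form>...</form>")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ConfigHandler).serve_forever()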
Related
This is a simple data visualization task that I am trying to accomplish: seeing data in real time using my web browser.
My idea is to take an HTML template and change only a few key values inside it once a second using Python. I need it to be HTML, and not something like PyGTK, because CSS formatting is very practical for my application.
Is that possible? What would be the easiest way?
The Flask module should solve your problem.
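For example, a minimal sketch (assuming Flask is installed; the template and the get_reading() helper are illustrative stand-ins for your real values). Here the page simply re-renders server-side once a second via a meta refresh, with no JavaScript at all:

import random
from flask import Flask, render_template_string  # pip install flask

app = Flask(__name__)

TEMPLATE = """
<html>
  <head>
    <meta http-equiv="refresh" content="1">  <!-- reload once a second -->
    <style>.value { font-size: 3em; color: steelblue; }</style>
  </head>
  <body>
    <div class="value">{{ value }}</div>
  </body>
</html>
"""

def get_reading():
    # Placeholder for whatever value you actually want to display.
    return random.random()

@app.route("/")
def index():
    return render_template_string(TEMPLATE, value=get_reading())

if __name__ == "__main__":
    app.run(debug=True)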
Use JavaScript's setInterval and XHR.
See https://www.w3schools.com/jsref/met_win_setinterval.asp
https://www.w3schools.com/xml/xml_http.asp
https://makitweb.com/how-to-fire-ajax-request-on-regular-interval/
AJAX updates part of a webpage in real time without refreshing the whole page.
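A sketch of that approach, again with Flask as the back end (the /data route and the random value are placeholders): the embedded script polls the JSON endpoint once a second with setInterval and XHR and updates only the one element, so the page never reloads:

import random
from flask import Flask, jsonify, render_template_string

app = Flask(__name__)

TEMPLATE = """
<html>
  <body>
    <div id="value">waiting...</div>
    <script>
      // Poll /data once a second and update the div in place.
      setInterval(function () {
        var xhr = new XMLHttpRequest();
        xhr.onload = function () {
          document.getElementById("value").textContent =
              JSON.parse(xhr.responseText).value;
        };
        xhr.open("GET", "/data");
        xhr.send();
      }, 1000);
    </script>
  </body>
</html>
"""

@app.route("/")
def index():
    return render_template_string(TEMPLATE)

@app.route("/data")
def data():
    return jsonify(value=random.random())  # placeholder data source

if __name__ == "__main__":
    app.run(debug=True)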
One way of using Ajax in Django is to use a Django Ajax framework.
The most commonly used is django-dajax, which is a powerful tool to easily and quickly develop asynchronous presentation logic in web applications, using Python and almost no JavaScript source code. It supports four of the most popular Ajax frameworks: Prototype, jQuery, Dojo and MooTools.
https://www.tutorialspoint.com/django/django_ajax.htm
I was hoping someone might be able to provide some insight into the feasibility of using the Scrapy Python framework to create a real-time wrapper.
To clarify my definition of the term "wrapper" in this context, let me describe my situation... I was hoping to use Scrapy to essentially script a solution that allows a user to execute a search query on a website, which in turn calls a Scrapy spider in real time, within which that spider is told to:
1) log in to a third-party website,
2) execute the user's search query,
3) retrieve only the actual HTML results for the returned query (by extracting the resulting HTML content using the unique result-set container class and/or XPath),
4) modify the extracted HTML results (by reforming the HTML and/or injecting a new header/footer or CSS elements),
5) and finally return the modified HTML results in real time so the HTML can be injected directly into the original domain, all transparently to the user.
I should point out that I am familiar with writing Scrapy spiders for large-scale bulk crawling, but I am less familiar with the prospect or feasibility of using it to construct this kind of real-time "wrapper".
If anyone has any insight, advice or examples which illustrate a similar situation I would greatly appreciate it. CH
You may try the HTQL browser interface for Python at http://htql.net/. An example of searching Bing in real time:
import htql

a = htql.Browser()
b = a.goUrl("http://www.bing.com/")
c = a.goForm("<form>1", {"q": "test"})
for d in htql.HTQL(c[0], "<a (tx like '%test%')>"):
    print(d)
e = a.click("<a (tx like '%test%' and not (href like '/search%'))>1")
It can be coupled with IRobotSoft scraper to do everything visually, by changing the browser to:
a = htql.Browser(2)
More details can be found in the manual at http://htql.net/htql-python-manual.pdf, or ask at http://irobotsoft.org/bb/.
Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or the Tumblr API).
Some extensive Googling did not turn up anything related to extracting notes from Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately, it looks like the Tumblr API has some limitations (it lacks meta information about reblogs, and notes are limited to 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Subsequent pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in the a element's onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
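If you would rather stay in Python, here is a rough sketch of the same walk with requests and BeautifulSoup; the CSS selector and the onclick parsing are assumptions about the notes markup described above and may need adjusting:

import re
import time
import requests
from bs4 import BeautifulSoup

BASE = "http://ronbarak.tumblr.com"
url = BASE + "/notes/40692813320/4Y70Zzacy"

while url:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Assumption: each note is an <li> inside an <ol class="notes">.
    for li in soup.select("ol.notes li"):
        print(li.get_text(" ", strip=True))

    # Assumption: the "more notes" link carries the next URL in its
    # onclick attribute, quoted with single quotes.
    more = soup.find("a", onclick=re.compile(r"/notes/"))
    match = re.search(r"'(/notes/[^']+)'", more["onclick"]) if more else None
    url = BASE + match.group(1) if match else None

    time.sleep(1)  # be gentle with the server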
As Fabio implies, it is better to use the API.
If for whatever reasons you cannot, then the tools you will use will depend on what you want to do with the data in the posts.
for a data dump: urllib will return the page you want as a string
looking for a specific section in the HTML: lxml is pretty good
looking for something in unruly HTML: definitely BeautifulSoup
looking for a specific item in a section: BeautifulSoup, lxml, or plain text parsing is what you need
need to put the data in a database/file: use Scrapy
Tumblr's URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you reach the end of the posts and the servers just stop returning any data.
So if you are going to brute-force your way through, you can simply tell your script to dump all the data to your hard drive until, say, the content tag is empty (a rough sketch follows below).
One last word of advice: please remember to put a small delay (e.g., time.sleep(1)) between requests, because otherwise you could put some stress on Tumblr's servers.
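A rough sketch of that brute-force loop, assuming the usual /page/N paging and a ".post" container; the blog URL and the selector are placeholders you would adapt to your target:

import time
import requests
from bs4 import BeautifulSoup

BLOG = "http://ronbarak.tumblr.com"  # placeholder blog

page = 1
while True:
    html = requests.get("%s/page/%d" % (BLOG, page)).text
    # Stop as soon as a page comes back with no posts in it.
    if not BeautifulSoup(html, "html.parser").select(".post"):
        break
    with open("page_%03d.html" % page, "w", encoding="utf-8") as f:
        f.write(html)      # dump the raw page to disk
    page += 1
    time.sleep(1)          # the small pause mentioned above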
"How to load all notes on tumblr?" also covers this topic, but unor's answer (above) handles it very well.
It is an HTML page including two forms. One of them is generated dynamically by JS when the page is loaded.
So, if I try to fetch them, only one form is returned, and the dynamically generated form is not found.
The question is:
how to fetch all forms, even if they are generated by JS.
As far as I know, Mechanize does not handle JavaScript.
That means you should either generate the form yourself (by reading the JS that creates the form, "translating" it to Python, and inserting the result in your script),
or:
Automate an actual browser that does understand JavaScript, using something like Ruby's Watir (a Python sketch with Selenium follows below).
Launch Firefox, use Live HTTP Headers to inspect what the JavaScript does, then imitate that using Mechanize / the relevant HTTP requests.
Use a browser that understands JavaScript as per the WWW::Mechanize::FAQ, i.e. a browser like WWW::Mechanize::Firefox or WWW::Scripter.
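From Python, a close equivalent of the "automate a real browser" option is Selenium. A minimal sketch (the URL is a placeholder, and you need the selenium package plus a matching browser driver installed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()        # any driver you have installed works
try:
    driver.get("http://example.com/page-with-js-form")   # placeholder URL
    driver.implicitly_wait(5)       # give the page's JS time to build the form

    # Both the static form and the JS-generated one are now in the DOM.
    for form in driver.find_elements(By.TAG_NAME, "form"):
        print(form.get_attribute("id"), form.get_attribute("action"))
finally:
    driver.quit()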
I'm trying to scrape a page on YouTube with Python, and the page has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has extensive APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at The Youtube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
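For what it's worth, that combination looked roughly like the sketch below. Note that the classic GData feed used here has since been retired, so treat the URL as illustrative of the approach rather than something to run today:

import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# GData-era feed URL (now retired); shown only to illustrate the pattern.
url = "http://gdata.youtube.com/feeds/api/videos?q=python&max-results=10"
xml_data = urllib.request.urlopen(url).read()

for entry in ET.fromstring(xml_data).findall(ATOM + "entry"):
    print(entry.find(ATOM + "title").text)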
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it -- technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug in Firefox, then turn on the Net panel in Firebug and click the desired link on YouTube. Now see what happens and which pages are requested, and find the ones responsible for the AJAX part of the page. Then you can use urllib or Mechanize to fetch those links. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be looking at user login credentials, session info or other header fields such as HTTP_REFERER, etc. In that case you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
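For instance, replaying a request spotted in the Net panel might look like this sketch (the endpoint URL and header values are placeholders; copy the real ones from the captured request):

import urllib.request

req = urllib.request.Request(
    "http://www.youtube.com/some_ajax_endpoint",   # placeholder: copy from Firebug
    headers={
        "User-Agent": "Mozilla/5.0",
        "Referer": "http://www.youtube.com/watch?v=XXXXXXXX",  # placeholder referer
    },
)
content = urllib.request.urlopen(req).read()
print(content[:500])   # then parse this content as needed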
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping web sites and can be combined with python-spidermonkey to follow JavaScript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.