Problem
Downloading a complete working offline copy of a website that loads links/images dynamically
Research
There are questions (e.g. [1], [2], [3]) on Stack Overflow addressing this issue. Most of the top answers use wget or httrack, both of which fail miserably (please do correct me if I am wrong) on pages that dynamically load links or use srcset instead of src for the img tag (or anything else loaded via JS). A rather obvious solution is Selenium. However, if you have ever used Selenium in production, you quickly start seeing the issues that arise from such a decision: it is resource heavy, the head-full driver is quite complex to use, and it is not built for this purpose. That being said, there are people claiming to have been using it easily in production for years.
Expected Solution
A script (preferably in python), that parses the page for links and loads them separately. I cannot seem to find any existing scripts that do that. If your solution is "so implement your own", then it is pointless to be asking the question in the first place, I am seeking an existing implementation.
Examples
Shopify.com
Websites built using Wix
There are now headless versions of Selenium and alternatives such as PhantomJS; either can be used with a small script to scrape any dynamically loaded website.
I had implemented a generic scraper here, and explained more about the topic here
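As a rough illustration of the link-parsing step (this is not the scraper linked above, only a minimal sketch using the standard library), the snippet below collects URLs from href, src and srcset attributes. It assumes you already have the rendered HTML as a string, for example from a headless browser after the JavaScript has run; the names `AssetLinkParser` and `extract_asset_urls` are made up for this example, and the comma split for srcset is naive (it will break on URLs that themselves contain commas).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetLinkParser(HTMLParser):
    """Collect absolute URLs found in href, src and srcset attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name in ("href", "src"):
                self._add(value)
            elif name == "srcset":
                # srcset is a comma-separated list of "url descriptor" entries,
                # e.g. "small.png 1x, large.png 2x" (naive split, see lead-in).
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self._add(parts[0])

    def _add(self, url):
        # Resolve relative URLs against the page URL and skip duplicates.
        absolute = urljoin(self.base_url, url.strip())
        if absolute not in self.urls:
            self.urls.append(absolute)


def extract_asset_urls(html, base_url):
    """Return the unique URLs referenced by the given (already rendered) HTML."""
    parser = AssetLinkParser(base_url)
    parser.feed(html)
    return parser.urls
```

Each returned URL can then be downloaded separately and the saved page rewritten to point at the local copies.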
Related
What even am I doing
So, as Minecraft Java has been slowly switching over to using Microsoft based accounts instead of solely Mojang accounts, I have been trying to put together an authentication method for a small launcher project I've been working on.
The First Issue.
I've been following a piece of documentation here, which had instructions on what GET and POST requests to send to which URLs, and how to parse them, etc. It's worked pretty well, except for The First Issue.
It was a dark and stormy night, and the Microsoft Authentication URL used Javascript for redirects, so the Requests library I was using in Python could not follow the redirects. There might be a way to parse the HTML content and find the redirections or something, but that is way above my head, because I am still new to even Python.
So I looked around for a solution that would let me follow the JavaScript redirects, and the best solution (in concept) looked to be using a headless browser. This led me down a long path until I came face to face with The Second Issue.
The Second Issue.
I looked around for a headless browser that I could use, and I found a couple:
Selenium, or
PyQT WebEngine or WebKit
(I know there are lots of others but I chose these and used them for examples)
From here, the issue isn't so much an issue to fix, but the issue of I don't know what I'm doing.
I looked into Selenium, and it looked promising, but the fact that I had to download a WebDriver confused me in terms of how I would package that, since this is going to be used for a distributed application.
I then looked into PyQT WebEngine, and it just confused me in all respects, so basically I just need some info on maybe how to use it. I also don't need to have to use PyQT to launch a window, or design my UI, or anything else. I already am planning to use Kivy for the GUI. I just need a headless browser or some other solution to follow Javascript redirects when sending a POST request to a certain URL.
So,
From here I just want to ask for advice on which route I should take, since there seems to be a broad range of options I could use. I've already mentioned what I need, so any advice on how or what I should use, in terms of headless browsers, libraries, etc., would be welcome.
Also if anyone has any other suggestions for how to authenticate a Microsoft account, please let me know.
I'm almost done
If there is anything I could answer or clarify, just let me know. I will highly appreciate all advice or suggestions.
Thanks,
Pyrotex7
Well to resolve this - I just went with PyQt in the end after messing around for a while.
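For anyone who wants to try the simpler route mentioned in the question (parsing the HTML for the redirect instead of running a browser), here is a minimal sketch. It only catches redirects written as a literal string, such as `window.location = "..."`, `location.replace("...")` or a meta refresh tag; whether the Microsoft login pages use that form is an assumption I have not verified, and the function name `find_redirect` is made up for this example.

```python
import re
from urllib.parse import urljoin

# Simple assignments (window.location = "...", location.href = '...')
# or calls (location.replace("..."), location.assign("...")).
_JS_REDIRECT = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
    r"""|location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)"""
)

# <meta http-equiv="refresh" content="0; url=..."> (attribute order assumed).
_META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["'][^"'>]*url=([^"'>]+)""",
    re.IGNORECASE,
)


def find_redirect(html, base_url):
    """Return the absolute redirect target found in the page, or None."""
    match = _JS_REDIRECT.search(html)
    if match:
        target = match.group(1) or match.group(2)
        return urljoin(base_url, target)
    match = _META_REFRESH.search(html)
    if match:
        return urljoin(base_url, match.group(1).strip())
    return None
```

The returned URL can be requested with the same session that fetched the page. If the target is computed at runtime rather than written as a literal, this approach will not find it and a real browser engine is still needed.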
My problem is as follows:
I have written some Python code, and I need to run it on a web page. Basically, whatever is printed to the console should be displayed as it is.
I have no experience in web development or the related libraries, and I need to get this done in a short time. How should I proceed?
Note: I might be plotting some graphs also. It would be great if they could be displayed all at once(sequentially) on the website
https://brython.info/
https://skulpt.org/
https://pyodide.org/en/stable/
There are multiple Python implementations for the browser; some are built on WebAssembly and some on JavaScript.
Is it a good idea to run Python in the browser as a replacement for JavaScript in 2022? No, it is not; learn JavaScript. No in-browser Python implementation can compete with JavaScript today, and most probably none ever will.
Browsers cannot execute Python code natively. However, you could, for instance, create a basic IDE in HTML and JS and send code written by a user on the page to a server, which would then run the code and send the results back to the client page.
Unfortunately, such a project is quite ambitious and complicated, especially when security and stability are of major concern, as executing client code is a very dangerous measure indeed and requires expertise in virtualization techniques and software.
Another Method could be to use a public API, which allows you to run Python code and fetch the results back. The procedure would be exactly the same as with the previous idea in terms of creating the web-client, but the heavy-lifting - which is actually executing the Python-code, would be taken care of for you.
As you can see, there is no concrete answer to this question, only suggestions.
A few useful links below:
https://docs.docker.com/
https://appdividend.com/2022/01/18/best-python-online-ide/
https://www.makeuseof.com/tag/programmer-browser-ides/
https://www.youtube.com/watch?v=og9Gaj1Hzag
How do I execute a string containing Python code in Python?
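To make the server-side step above more concrete (run the submitted code and return what it printed), here is a minimal sketch using only the standard library. The function name `run_and_capture` is made up for this example. It offers no isolation at all: `exec()` runs the code with the server's own privileges, so it is only reasonable for your own trusted scripts or inside a sandbox such as a container.

```python
import contextlib
import io
import traceback


def run_and_capture(source):
    """Execute Python source and return everything it printed as one string.

    WARNING: no sandboxing. Only call this with trusted code.
    """
    buffer = io.StringIO()
    namespace = {"__name__": "__main__"}
    # Anything the code prints goes into the buffer instead of the console.
    with contextlib.redirect_stdout(buffer):
        try:
            exec(source, namespace)
        except Exception:
            # Show the traceback as part of the output, as a console would.
            buffer.write(traceback.format_exc())
    return buffer.getvalue()
```

A web handler would pass the submitted text to this function and send the returned string back to the page, where it can be shown in a `<pre>` element.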
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavy weight solution, but I've seen people doing this with GreaseMonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
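A minimal sketch of that approach with the standard library follows. The endpoint and parameters are placeholders: you would copy the real ones from the requests you observe in the browser's network monitor. The two headers are an assumption about what a given site might check, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_xhr_request(endpoint, params):
    """Build a GET request that resembles the page's own AJAX call."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{endpoint}?{query}")
    # Some sites look at these headers before answering an AJAX call
    # (assumption; copy whatever the real request sends).
    request.add_header("X-Requested-With", "XMLHttpRequest")
    request.add_header("Accept", "application/json")
    return request


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `fetch_json(build_xhr_request("https://example.com/api/results", {"page": 1}))`, where the URL is hypothetical.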
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch onto the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be way easier than reverse engineering the JavaScript/Ajax calls. If needed, you can also use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to everything inside the browser out of the box, and you don't need to care about cookies, SSL and so on.
I found that the IE WebBrowser control has all kinds of quirks and workarounds, enough to justify some high-quality software that takes care of those inconsistencies, layered around the shdocvw.dll API and mshtml, and provides a framework.
This seems like a pretty common problem. I wonder why no one has developed a programmatic browser? I'm envisioning a Firefox you can call from the command line with a URL as an argument; it would load the page, run all of the initial page-load JS events and save the resulting file.
I mean Firefox, and other browsers already do this, why can't we simply strip off the UI stuff?
I want to programmatically modify the browsing history of Chrome from Python code.
I already know that many browsers use an SQLite database for the browsing history. I searched Google, and all the questions and answers were about importing/exporting the browsing history data.
However, what I want to do is modify the data in the database to delete specific sites, or all sites, that I've visited.
I would like to ask whether there are any Python modules that help with doing this through code.
If not, I would have to make the code take control of the mouse and the screen, open Chrome, go to the browsing history, select the rows to be deleted and press delete/confirm, which would be impossible for a beginner like me to find the determination and resources to do.
This may be helpful How can I delete all web history that matches a specific query in Google Chrome
It uses JavaScript, but you can easily translate it into some simple Python and use something like BeautifulSoup. Or just scroll down: there is some SQL further down the page which looks promising.
And because you'll already have the code in one language, translating it should be pretty simple, even for a beginner, and especially using Python. PS: this took me about as long to find as it took to read your question. It's rare that you will want to do something that someone hasn't already done, just maybe not in Python ;)
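For the SQL route, Python's built-in sqlite3 module is enough. The sketch below is a minimal example under some assumptions you should check against your own copy of the database: that the history file has a `urls` table with a `url` column and a `visits` table whose `url` column holds the id of a row in `urls`. The function name `delete_history_matching` is made up for this example. Close Chrome first (the file is locked while it runs) and work on a backup copy until you trust the result.

```python
import sqlite3


def delete_history_matching(db_path, pattern):
    """Delete history entries whose URL matches a SQL LIKE pattern.

    Returns the number of rows removed from the urls table.
    Table and column names are assumptions about the history schema.
    """
    connection = sqlite3.connect(db_path)
    try:
        with connection:  # commits on success, rolls back on error
            # Remove the individual visits first, then the URL rows themselves.
            connection.execute(
                "DELETE FROM visits WHERE url IN "
                "(SELECT id FROM urls WHERE url LIKE ?)",
                (pattern,),
            )
            cursor = connection.execute(
                "DELETE FROM urls WHERE url LIKE ?", (pattern,)
            )
            return cursor.rowcount
    finally:
        connection.close()
```

For example, `delete_history_matching(path_to_history_copy, "%example.com%")` would remove every entry containing example.com, and the pattern `"%"` would remove everything.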
How can you show a minimized view of an HTML page in a div (like the Google preview)?
http://img228.imageshack.us/i/minimized.png/
edit: ok.. i see its a picture on google, probably a minimized screenshot.
This is more or less a duplicate of the question: Create thumbnails from URLs using PHP
However, just to add my 2¢, my strong preference would be to use an existing web service, e.g. websnapr, as mentioned by thirtydot in the comments on your question. Generating the snapshots yourself will be difficult to scale well, and just the kind of thing I'd think is worth using an established service for.
If you really do want to do this yourself, I've had success using CutyCapt to generate snapshots of webpages - there are various other similar options (i.e. external programs you can call to do the rendering) mentioned in that other question.
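Calling such an external renderer is a one-liner in most languages. As a minimal sketch in Python (the same idea applies to PHP's exec), assuming a CutyCapt binary is on the PATH and using only its `--url` and `--out` options; the helper names are made up for this example:

```python
import subprocess


def build_capture_command(url, out_path, executable="CutyCapt"):
    """Return the argument list for rendering a page to an image file."""
    return [executable, f"--url={url}", f"--out={out_path}"]


def capture(url, out_path):
    """Run the renderer; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_capture_command(url, out_path), check=True)
```

The resulting full-size image still has to be scaled down to a thumbnail in a separate step.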
Google displays an image thumbnail, so you would need to generate an image using GD or ImageMagick.
The general flow would be:
Fetch the page content, including stylesheets and all images, via curl (potentially tricky to capture all the embedded files, but it shouldn't be beyond a competent PHP programmer).
Construct a rendering of the page inside PHP itself (EXTREMELY tricky! I wouldn't even know where to start with that, though there might be some kind of third-party extension available).
Use GD/ImageMagick/whatever to generate a thumbnail image in an appropriate format (shouldn't be too hard).
Clearly, it's the rendering the page from the HTML, CSS, images etc you downloaded that is going to be the difficult part.
Personally I'd be wondering if the effort involved is worth it.