I'm trying to convert cnn.com pages to PDF with WeasyPrint. It actually works, but with one unfriendly result: the top of each PDF is always covered by a black block, which is annoying when it hides the content. Does anybody know how to remove it, e.g. by defining a CSS stylesheet? Sincerely appreciated!
You can reproduce the problem with any article from cnn.com.
Or can you recommend a better conversion tool? I tried pdfkit, but it can't download the full page: the 'read more' button is always displayed, even after a User-Agent has been appended to the HTTP headers.
These tools behave differently from one website to another, which is weird.
import weasyprint

url = 'https://edition.cnn.com/2021/07/23/tech/taiwan-china-cybersecurity-intl-hnk/index.html'
weasyprint.HTML(url).write_pdf('1.pdf')
That is my code.
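For reference, WeasyPrint lets you pass extra stylesheets to write_pdf, which could hide the offending element. A minimal sketch of that idea; the selectors below are guesses, and you would need to inspect the CNN page to find the real class of the black block:

import weasyprint

url = 'https://edition.cnn.com/2021/07/23/tech/taiwan-china-cybersecurity-intl-hnk/index.html'

# Hypothetical selectors; replace them with whatever element the black block really is
hide_block = weasyprint.CSS(string='.header, .sticky-banner { display: none !important; }')

weasyprint.HTML(url).write_pdf('1.pdf', stylesheets=[hide_block])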
This is a bit of a long theoretical question about how img tags really work, for the purposes of web scraping. I've done a lot of research and have seen a bunch of working solutions, but I haven't felt that the core question was answered.
First off, my task: I wish to efficiently scrape ~100k HTML pages from a website and also download images on these pages, while respecting their robots.txt crawl rate of 3 seconds per page.
First, I built a scraper intending to just crawl the HTML and get a long list of image URLs to download on a second pass. But then I realized that, with ~10 images per page, this would be ~1M images. At a 3-second crawl rate, this would take ~30 days.
So I thought: "if I'm scraping using Selenium, the images are getting downloaded anyway! I can just download the images on page-scrape."
Now, my background research: This sent me down a rabbit hole, and I learned that the following options exist to download images on a page without making additional calls:
You can right-click and "Save as" (SO post)
You can screenshot the image (SO post)
Sometimes, weirdly, the image data is loaded into src anyway (SO post)
Selenium Wire exists, which is really the best way to address this. (SO Post)
These all seem like viable answers, but on a deeper level, they all (even Selenium Wire**) seem like hacks.
** Selenium Wire allows you to access the data in the requests made by Selenium. This is great, but I naively assumed that when a page is rendered and the images are placed in the img tags, they're in the page and I shouldn't have to worry about the requests that retrieved them.
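For what it's worth, here is a minimal sketch of the Selenium Wire approach, assuming Chrome and a placeholder URL; it filters the captured requests by Content-Type and writes the image bytes to disk, reusing the traffic the page load already generated:

from seleniumwire import webdriver  # pip install selenium-wire
from seleniumwire.utils import decode

driver = webdriver.Chrome()
driver.get('https://example.com/article')  # placeholder URL

for req in driver.requests:
    if req.response and req.response.headers.get('Content-Type', '').startswith('image/'):
        # Undo gzip/br if the server compressed the body; no extra request is made
        body = decode(req.response.body, req.response.headers.get('Content-Encoding', 'identity'))
        with open(req.url.split('/')[-1] or 'image.bin', 'wb') as f:
            f.write(body)

driver.quit()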
Now, finally, my question. I clearly have a fundamental misunderstanding about how the img tag works.
Why can't one directly access image data through the Selenium driver, which is loading and rendering the images anyway? The images are there; I see the images when the driver loads. Theoretically, naively, I would expect to be able to download whatever is loaded on the page.
The one parallel I know of is with iframes -- you can visually see the content of the iframe, but you can only scrape it after directing Selenium to switch into the frame (background). So naively I assumed there would be a switch method for img's as well. The fact that there isn't, and it's not clear how to use Selenium to download the image data, tells me that I'm not really understanding how a browser handles an img tag.
I understand all the hacks and the workarounds, but my question here is why?
First-time poster here.
I am just getting into Python and coding in general, and I am looking into the requests and BeautifulSoup libraries. I am trying to grab image URLs from Google Images. When inspecting the site in Chrome I can find the div and the correct img src URL. But when I open the HTML that requests gives me, I can find the same div, but the img src URL is something completely different and only leads to a black page if used.
[Screenshot: the HTML that requests returns]
[Screenshot: the HTML found in Chrome's inspect tool]
What I wonder, and want to understand is:
Why are these two HTMLs different?
How do I get the img src that the inspect tool shows, using requests?
Hope the question makes sense and thank you in advance for any help!
The differences between the response HTML and the code in Chrome's inspector may stem from updates to the page when JavaScript changes it. For example, when you use innerHTML to edit a div element, the code you add is added to the DOM, and thus to the code in the inspector, but it has no influence on the response.
You could also search for http:// at the beginning and .png, .jpg, or any other image format at the end.
Simply put, your code retrieves a single HTML page, and lets you access it, as it was retrieved. The browser, on the other hand, retrieves that HTML, but then lets the scripts embedded in (or linked from) it run, and these scripts often make significant modifications to the HTML (also known as DOM - Document Object Model). The browser's inspector inspects the fully modified DOM.
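As a rough illustration of the string search suggested above, you can scan the raw response for anything that looks like an image URL; whether Google's unmodified markup actually contains usable full-size URLs is not guaranteed, so treat this as a sketch:

import re
import requests

# Hypothetical query URL; the real Google Images page may differ
html = requests.get('https://www.google.com/search?q=kittens&tbm=isch',
                    headers={'User-Agent': 'Mozilla/5.0'}).text

# Find candidate image URLs: http(s)://...png/.jpg/.jpeg/.gif
urls = re.findall(r'https?://[^"\'\s]+?\.(?:png|jpe?g|gif)', html)
print(urls[:10])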
I am trying to scrape a web site using Python and Beautiful Soup. I have found that on some sites, image links that are visible in the browser cannot be seen in the source code. However, using Chrome's Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome's Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python, and I am able to get the source, but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
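As a concrete example of that route, here is a minimal sketch with Selenium driving headless Chrome (one of the options still maintained today), then handing the rendered DOM to BeautifulSoup; the URL is a placeholder and the div id is the one from the question:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

opts = Options()
opts.add_argument('--headless=new')  # run Chrome without a window
driver = webdriver.Chrome(options=opts)

driver.get('https://example.com/page-with-js')  # placeholder URL
html = driver.page_source  # the DOM after the scripts have run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', id='cntnt'))  # now contains the generated content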
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the Web Developer Toolbar in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
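Once you have spotted the right request in the network log, replaying it usually takes only a few lines; the endpoint below is hypothetical, and the User-Agent is set as noted above:

import requests

# Hypothetical endpoint copied from the browser's network log
api_url = 'https://example.com/api/comments?article=12345'

resp = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()  # such endpoints usually return JSON
print(data)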
I'm trying to use Python to navigate a website that has auth forms on its landing page, rendered by ASP scripts.
But when I use Python (with mechanize, requests, or urllib) to get the HTML of that site, I always end up with a semi-blank HTML file, due to those ASP scripts.
Would anyone know any method that I can use to get the final (as displayed on a browser) version of an ASP site?
Your target page is a frameset. There is nothing fancy going on from the server side that I can tell. When I use requests or urllib to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn't do anything. Basically, all there is here is a frameset with a single frame in it.
The frame target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it's not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren't ASCII, which means you need to guess what charset to decode if you want to process it as Unicode. But if you're in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won't matter.
I suspect your real problem is just that you don't know how HTML framesets work. You're downloading the frameset, not downloading the referenced frame, and wondering why the resulting page looks like an empty frameset.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5 they're finally gone. But as long as ancient websites are out there and need to be scraped, you need to know how they work.
This isn't a substitute for the full documentation, but here's the short version of what a web browser does with a frameset: For each frame tag, it follows the src attribute, then it replaces the contents of the frame tag with a #document tag with no attributes, with the results of reading the src URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn't affect you.
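Mimicking that by hand is straightforward. A sketch with requests and BeautifulSoup (the frameset URL is a placeholder): fetch the frameset, pull each frame's src, resolve it against the page URL, and fetch those documents too:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

frameset_url = 'https://example.com/login.asp'  # placeholder
outer = requests.get(frameset_url).text

soup = BeautifulSoup(outer, 'html.parser')
for frame in soup.find_all(['frame', 'iframe']):
    src = frame.get('src')
    if src:
        # Resolve a relative src against the frameset's URL, then fetch the frame itself
        inner = requests.get(urljoin(frameset_url, src)).text
        print(len(inner), 'bytes from', src)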
Meanwhile, if you're trying to learn web scraping, you really want to install your browser's "Web Developer Tools" (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from wget). So, next time you can say "In Chrome's Inspect Page, I see a #document under the frame, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the frame has no children".
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
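If those checks turn out to be in place, a requests.Session with browser-like headers and a polite delay between calls is usually enough to start; the header values and URLs here are only examples:

import time
import requests

session = requests.Session()  # a Session also keeps whatever cookies it is handed
session.headers.update({
    'User-Agent': 'Mozilla/5.0',          # look like a real browser
    'Referer': 'http://news.yahoo.com/',  # some endpoints check where you came from
})

for url in ['http://example.com/data/1', 'http://example.com/data/2']:  # placeholders
    resp = session.get(url)
    time.sleep(3)  # respect the rate limit rather than trying to beat it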