Any idea how to convert a webpage, as it would be rendered in a browser, into an RGBA image using Python?
I am not looking for the solutions I have already seen, which use scikit or similar libraries to pull a .png from a webpage. Nor am I looking for a Beautiful Soup-style solution that lets me access specific data from a webpage.
I am seeking a solution that renders the webpage into a pixel buffer that I can then manipulate with something like numpy / cv2. Is this possible?
One simple solution is to take a screenshot using the Selenium package.
See this example: https://pythonbasics.org/selenium-screenshot/#Take-screenshot-of-full-page-with-Python-Selenium
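To get a pixel buffer rather than a .png file on disk, the screenshot can be decoded in memory. Below is a minimal sketch, assuming Selenium 4 with Chrome, plus NumPy and Pillow for decoding. The helper names (chrome_args, png_bytes_to_rgba, render_page_rgba) and the default window size are my own choices for illustration, not part of any library.

```python
import io


def chrome_args(width, height):
    """Command-line flags for a headless Chrome window of a fixed size."""
    # "--headless=new" is for recent Chrome; older versions use "--headless".
    return ["--headless=new", f"--window-size={width},{height}"]


def png_bytes_to_rgba(png_bytes):
    """Decode PNG bytes into a writable (H, W, 4) uint8 array in RGBA order."""
    # Third-party imports are kept inside the functions so this module can be
    # imported on a machine without the browser/imaging stack installed.
    import numpy as np
    from PIL import Image

    with Image.open(io.BytesIO(png_bytes)) as img:
        return np.array(img.convert("RGBA"))


def render_page_rgba(url, width=1280, height=800):
    """Load `url` in headless Chrome and return the visible viewport as RGBA."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_args(width, height):
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # The screenshot stays in memory as PNG bytes; nothing is written to disk.
        return png_bytes_to_rgba(driver.get_screenshot_as_png())
    finally:
        driver.quit()
```

Something like render_page_rgba("https://example.com") should then give an array you can slice with numpy. Note that cv2 expects BGR(A) channel order, so a cv2.cvtColor(arr, cv2.COLOR_RGBA2BGRA) step is probably needed before passing it to OpenCV functions.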
Related
This is a bit of a long theoretical question about how img tags really work, for the purposes of web scraping. I've done a lot of research and have seen a bunch of working solutions, but I haven't felt that the core question was answered.
First off, my task: I wish to efficiently scrape ~100k HTML pages from a website and also download images on these pages, while respecting their robots.txt crawl rate of 3 seconds per page.
First, I built a scraper intending to just crawl the HTML and get a long list of image URLs to download on a second pass. But then I realized that, with ~10 images per page, this would be ~1M images. At a 3-second crawl rate, this would take ~35 days.
So I thought: "if I'm scraping using Selenium, the images are getting downloaded anyway! I can just download the images on page-scrape."
Now, my background research: This sent me down a rabbit hole, and I learned that the following options exist to download images on a page without making additional calls:
You can right-click and "Save as" (SO post)
You can screenshot the image (SO post)
Sometimes, weirdly, the image data is loaded into src anyway (SO post)
Selenium Wire exists, which is really the best way to address this. (SO post)
These all seem like viable answers, but on a deeper level, they all (even Selenium Wire**) seem like hacks.
** Selenium Wire allows you to access the data in the requests made by Selenium. This is great, but I naively assumed that when a page is rendered and the images are placed in the img tags, they're in the page and I shouldn't have to worry about the requests that retrieved them.
Now, finally, my question. I clearly have a fundamental misunderstanding about how the img tag works.
Why can't one directly access image data through the Selenium driver, which is loading and rendering the images anyway? The images are there; I see the images when the driver loads. Theoretically, naively, I would expect to be able to download whatever is loaded on the page.
The one parallel I know of is with iframes -- you can visually see the content of the iframe, but you can only scrape it after directing Selenium to switch into the frame (background). So naively I assumed there would be a switch method for img's as well. The fact that there isn't, and it's not clear how to use Selenium to download the image data, tells me that I'm not really understanding how a browser handles an img tag.
I understand all the hacks and the workarounds, but my question here is why?
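For reference, the Selenium Wire option listed above can be sketched roughly as follows. This assumes the third-party selenium-wire package, whose driver exposes captured traffic as driver.requests. The function names image_responses and scrape_images are hypothetical helpers of mine, and the filter relies only on the Content-Type header.

```python
def image_responses(requests):
    """Yield (url, body_bytes) for captured requests whose response is an image."""
    for request in requests:
        response = request.response
        if response is None:  # the request never received a response
            continue
        content_type = response.headers.get("Content-Type", "")
        if content_type.lower().startswith("image/"):
            yield request.url, response.body


def scrape_images(url):
    """Load `url` once and return the image payloads the browser fetched for it."""
    # Third-party import (pip install selenium-wire), kept local so the
    # filtering helper above can be used and tested without it.
    from seleniumwire import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return list(image_responses(driver.requests))
    finally:
        driver.quit()
```

This reuses the traffic from the single page load, so no extra requests are made for the images. Bodies may still carry a Content-Encoding (for example gzip) and might need decoding before being written out.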
First-time poster here.
I am just getting into Python and coding in general, and I am looking into the requests and BeautifulSoup libraries. I am trying to grab image URLs from Google Images. When inspecting the site in Chrome I can find the "div" and the correct img src URL. But when I open the HTML that requests gives me, I can find the same "div", but the img src URL is something completely different and only leads to a black page if used.
Image of the HTML that requests gets
Image of the HTML found in Chrome's inspect tool
What I wonder, and want to understand, is:
Why is the HTML different in the two cases?
How do I get the img src that is found with the inspect tool using requests?
Hope the question makes sense and thank you in advance for any help!
The differences between the response HTML and the code in the Chrome inspector may stem from updates that JavaScript makes to the page after it loads. For example, when a script uses innerHTML to edit a div element, the markup it adds becomes part of the DOM and therefore shows up in the inspector, but it has no influence on the HTTP response.
You can search the response for strings that start with http:// and end with .png, .jpg, or another image format.
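A rough sketch of that kind of search using a regular expression from the standard library. The pattern and the list of extensions are my own assumptions, and it will miss images whose URLs have no file extension.

```python
import re

# Match http:// or https:// URLs that end in a common image extension.
# The extension list is an assumption; extend it as needed.
IMAGE_URL_RE = re.compile(
    r"""https?://[^\s"'<>]+?\.(?:png|jpe?g|gif|webp)""",
    re.IGNORECASE,
)


def find_image_urls(html):
    """Return every image-looking URL found in `html`, in document order."""
    return IMAGE_URL_RE.findall(html)
```

For example, find_image_urls(response.text) on a requests response would list the candidate URLs present in the raw HTML. It only sees what the server sent, not anything added later by scripts.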
Simply put, your code retrieves a single HTML page and lets you access it exactly as it was retrieved. The browser, on the other hand, retrieves that HTML but then runs the scripts embedded in (or linked from) it, and these scripts often make significant modifications to the page's in-memory representation, the DOM (Document Object Model). The browser's inspector shows the fully modified DOM.
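One way to see the difference is to collect the img src values from both versions of the page and compare them. The sketch below assumes requests for the raw HTML and Selenium with Chrome for the rendered DOM. The names ImgSrcCollector, img_sources, fetch_static_html and fetch_rendered_html are hypothetical helpers, not library functions.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources(html):
    """Return the img src values found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


def fetch_static_html(url):
    """HTML exactly as the server sent it (no scripts are run)."""
    import requests  # third-party, imported lazily

    return requests.get(url, timeout=10).text


def fetch_rendered_html(url):
    """Serialized DOM after the browser has run the page's scripts."""
    from selenium import webdriver  # third-party, imported lazily

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Comparing img_sources(fetch_static_html(url)) with img_sources(fetch_rendered_html(url)) should show which src values only exist after scripts have run. If they differ, requests alone probably cannot produce the inspector's values, since it never executes JavaScript.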
I'm trying to scrape a website with real estate publications. Each publication looks like this:
https://www.portalinmobiliario.com/venta/casa/providencia-metropolitana/5427357-francisco-bilbao-amapolas-uda#position=5&type=item&tracking_id=cedfbb41-ce47-455d-af9f-825614199c5e
I have been able to extract all the information I need except the coordinates (GIS) of the publications. The maps appear to be pasted (not linked). Does anyone know how to do this?
Please help.
I'm using Selenium with Python 3 and Chrome.
This is the list of publications:
https://www.portalinmobiliario.com/venta/casa/propiedades-usadas/las-condes-metropolitana
If you click any property in that list, it will take you to the page where the map is displayed. I'm using a loop to go through all of them (one at a time).
The code is a bit long, but so far I have mostly been using find_element_by_class_name and find_element_by_xpath to find and extract the information. I tried using them for the map, but I don't know where to find the coordinates.
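One thing that might be worth trying, purely as a guess: embedded map images and inline scripts often carry the coordinates as a "lat,lon" pair somewhere in the page source. The sketch below scans a string for such pairs. I have not checked whether this particular site exposes coordinates this way, and both the pattern and the latitude-first ordering are assumptions.

```python
import re

# Two decimal numbers with at least four fractional digits, separated by a
# comma (or its URL-encoded form %2C). Assumed format, not verified for the site.
COORD_RE = re.compile(
    r"(?<![\d.])(-?\d{1,3}\.\d{4,})\s*(?:,|%2C)\s*(-?\d{1,3}\.\d{4,})",
    re.IGNORECASE,
)


def find_coordinate_pairs(text):
    """Return (lat, lon) tuples for number pairs that fall in valid ranges."""
    pairs = []
    for lat_text, lon_text in COORD_RE.findall(text):
        lat, lon = float(lat_text), float(lon_text)
        # Assumes latitude comes first; discard pairs outside valid ranges.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            pairs.append((lat, lon))
    return pairs
```

With Selenium this could be applied to driver.page_source, or to the src attribute of the map's img element if there is one. Any hits would still need to be checked by hand against the map shown on the page.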
I want to translate a site using Google Websites Translate and then download it as a PDF or JPG. I tried wkhtmltopdf, but Google Websites Translate returns the result in a frame, so when I capture the translated page (as PDF or JPG) I get an empty PDF.
Converting HTML to PDF may not work here.
Instead, try capturing snapshots of the webpages in PNG/JPEG format.
Try the FireShot Chrome extension.
I am not sure it will work, but it is worth a try.
When I open a page, I'm assuming all images are also downloaded with Mechanize.
Is there a way for me to just retrieve the source code?
If Mechanize doesn't allow for this, is there an alternative that does?
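As far as I know, a plain HTTP client only downloads the document at the URL you give it. Images, stylesheets and scripts are fetched only if something then requests them separately. A minimal sketch of one alternative, using urllib.request from the Python standard library (the function name fetch_source is just illustrative):

```python
import urllib.request


def fetch_source(url, timeout=10):
    """Return the document at `url` as text; no images, CSS, or scripts are fetched."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_source("https://example.com") makes a single request and returns only the HTML source.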