HTML snapshots using the selenium webdriver? - python

I am trying to capture all the visible content of a page as text. Let's say that one for example.
If I store the page source then I won't be capturing the comments section because it's loaded using javascript.
Is there a way to take HTML snapshots with selenium webdriver?
(Preferably expressed using the python wrapper)

Regardless of whether or not the HTML of the page is generated using JavaScript, you will still be able to capture it using driver.page_source.
I imagine the reason you haven't been able to capture the source of the comments section in your example is because it's contained in an iframe - In order to capture the html source for content within a frame/iframe you'll need to first switch focus to that particular frame followed by calling driver.page_source.

This code will take a screenshot of the entire page:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://dukescript.com/best/practices/2015/11/23/dynamic-templates.html')
driver.save_screenshot('screenshot.png')
driver.quit()
however, if you just want a screenshot of a specific element, you could use this:
def get_element_screenshot(element: WebElement) -> bytes:
driver = element._parent
ActionChains(driver).move_to_element(element).perform() # focus
src_base64 = driver.get_screenshot_as_base64()
scr_png = b64decode(src_base64)
scr_img = Image(blob=scr_png)
x = element.location["x"]
y = element.location["y"]
w = element.size["width"]
h = element.size["height"]
scr_img.crop(
left=math.floor(x),
top=math.floor(y),
width=math.ceil(w),
height=math.ceil(h))
return scr_img.make_blob()
Where the WebElement is the Element you're chasing. of course, this method requires you to import from base64 import b64decode and from wand.image import Image to handle the cropping.

Related

How to scrape a map on a website for informations stored in popups?

I'm relatively new to web scraping so I'm not sure about which approach I should use to collect informations in a specific scenario in which the informations are stored on a map and displayed in popups, such as : https://utils.ocim.fr/cartocim2/
Basically :
the website shows a map,
contact informations are displayed in popups,
a popup will appear when clicking on a geo-tag button,
targeted informations are those lines stored in that popup
I was thinking of using selenium + xpath method but I'm unsure regarding the way to deal :
with this amount of buttons that have to be clicked on
with the popups.
Would you have any resources / tips to advise me to know where to start ?
With great difficulty
Here's a start but it gets a little more complicated as the markers start overlapping so clicking the elements fails, might need to add a step to zoom in etc
from selenium import webdriver
import requests
import pandas as pd
url_base = r'https://utils.ocim.fr/cartocim2/'
driver = webdriver.Chrome(r'C:\Users\username\Downloads\chromedriver_win32\chromedriver.exe')
driver.get(url_base) #open page
#find all the icons
links = driver.find_elements_by_css_selector('div.leaflet-pane.leaflet-marker-pane > img')
import time
output = [] #temp table to append into
for i in range(5): #chaneg to len(links) when done
links[i].click() #click on first icon
output.append(driver.find_elements_by_xpath('//*[#id="popup-header"]')[0].text) #get the text of the name
time.sleep(1) #sleep
driver.find_element_by_css_selector('#initmap').click() #reset the map - needed as without it the next icon might not be on the screen due to map relocation or popup overlap
time.sleep(1)

How to interact with a print box using python

I have been using selenium to automatic printing documents and I am stuck on the print screen. As far as I know, selenium does not interact with the print screen so I am looking for an alternate situation that I can use with selenium. My code so far is down below, and all I need is code that will let me choose a new printer and then print. Also I want to change that printer to Save as PDF and then save the pdf to a file, so if that gives me a shortcut that would help a lot.
from selenium import webdriver
from selenium import *
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Remote(
command_executor='http://127.0.0.1:4444/wd/hub',
desired_capabilities=DesiredCapabilities.CHROME) #This is because I am using remote web driver for my Mac, but it is the same as regular web driver
driver.execute("window.print()")
#Need code here
I used window.print() followed by execution of a python string containing the necessary js commands:
print_function = '''
let A = document.getElementsByTagName('print-preview-app')[0].shadowRoot;
let B = A.getElementById('sidebar').children[0].shadowRoot;
let C = B.getElementById('button-strip').children[0]
C.click()
'''
driver.execute_script(print_function)
Keep in mind that you also need to use driver.swich_to.window(window_handles[i]) to make sure you're interacting with the print box.
Once you enter into a shadowRoot element, you don't have the full compliment of driver.find_element_by methods available to you. You're limited to the methods available via JS when you are searching within a shadowRoot.
Found a suggestion that might work for you.
How to convert webpage into PDF by using Python
Maybe try pdfkit
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')

Beautiful Soup can not find all image tags in html (stops exactly at 5)

I am trying to use beautifulsoup to get all the images of a site with a certain class. my issue is that when i run the code just to see if my code can find each image it only gets images 1-5. I think the issue is the html since images 6-end is located in a nested div but Find_all should be able to find all the img with the same class.
import requests, os, bs4, sys, webbrowser
url = 'https://mangapanda.onl/chapter/'
os.makedirs('manga', exist_ok=True)
comic = sys.argv[1:]
aComic = '-'.join(sys.argv[1:])
issue = input('which issue do you want?')
aIssue = ('/chapter-' + issue)
aComic = (aComic + '_110' + aIssue)
comicUrl = (url + aComic)
res = requests.get(comicUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
if comicElem == []:
print('nothing in the list')
else:
print('There are ' + str(len(comicElem)) + ' on this page')
for i in range(len(comicElem)):
comicPage = comicElem[i].get('src')
print(str(comicPage) + '\n')
is there something I am missing when it comes to using beautiful soup that could have helped me solve this? is it the html that is causing this problem? Was there a better way i could have diagnosis this problem myself that would have been in my realm of capability (side note: i am currently going through the book "Automating the Boring Stuff with Python". it is where i got the idea for this mini project and a decent idea of where my level of skill is with python. Lastly I am using BeautifulSoup to learn more about it. If possible i would like to solve this issue using BeautifulSoup will research other options of parsing through html if i need to.
Using firefox quantim 59.0.2
using python3
PS, if you know of other questions that may have answered this problem already feel free to just link me to it. I really wanted to just figure out the answer through someone else questions but it seems like my issue was pretty unique.
The problem is some of the images are being added to the DOM via Javascript after the page is loaded. So
res = requests.get(comicUrl)
gets the HTML and DOM before any modification are made by javascript. This is why
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
len(comicElem) # = 5
only finds 5 images.
If you want to get all the images that are loaded you cannot use the requests library. Here is an example using selenium:
from selenium import webdriver
browser = webdriver.Chrome('/Users/glenn/Downloads/chromedriver')
comicUrl = "https://mangapanda.onl/chapter/naruto_107/chapter-700.5"
browser.get(comicUrl)
images = browser.find_elements_by_class_name("PB0mN")
for image in images:
print(image.get_attribute('src'))
len(images) # = 18 images
See this post for additional resources for scraping javascript pages:
Web-scraping JavaScript page with Python
Regarding how to tell if the HTML is being modified using javascript?
I don't have any hard rules but these are some investigative steps you can carry out:
As you observed only finding 5 images originally with requests but seeing there are more images on the page is the first clue the DOM is being changed after it is loaded.
A second step: using the browser Developer Tools -> Scripts you can see there are several javascript files associated with the page. Note that not all javascript modify the DOM so the presence of these scripts does not necessarily mean they are modifying the DOM.
For further verification the DOM is being modified after the page is loaded:
Copy the html from Developer Tools -> View Page Source into an HTML formatter tool like http://htmlformatter.com, format the html and look at the line count. The Developer Tools -> View Page Source is the html that is sent by the server without any modifications.
Then copy the html from Developer Tools -> Elements (be sure to get the whole thing from <html>...</html>) and copy this into an HTML formatter tool like http://htmlformatter.com, format and look at the line count. The Developer Tools -> Elements html is the complete, modified DOM.
If the line counts are significantly different then you know the DOM is being modified after it is loaded.
Comparing line counts for "https://mangapanda.onl/chapter/naruto_107/chapter-700.5" shows 479 lines for the source html and 3245 lines for the complete DOM so you know something is modifying the DOM after the page is loaded.

How to remove black space from png data pulled from selenium?

I have the following png binary data that I was able to pull from a page utilizing selenium with the following code:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.polyvore.com/cgi/img-thing?.out=jpg&size=l&tid=39713077')
data = driver.get_screenshot_as_png()
However, the image looks like the following and I'd like to remove the black space around it:
The image is located here: http://www.polyvore.com/cgi/img-thing?.out=jpg&size=l&tid=39713077
Is there a way to remove the black space utilizing the binary data or get selenium to pull only the image and not the black background?
I've tried to utilize Pil, but I've only found ways to remove white space and not black space, plus it's difficult to turn it back to binary data, which I need.
I've also looked into the PNG Module, but I couldn't figure out how to turn it back to binary as well.
One solution would be to directly get the screenshot of the image:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.polyvore.com/cgi/img-thing?.out=jpg&size=l&tid=39713077')
element = driver.find_element_by_css_selector("img")
image = element.screenshot_as_png()
But unfortunately Firefox doesn't yet implement this feature.
Another way would be to crop the screenshot to the targeted element:
import StringIO
from selenium import webdriver
from PIL import Image
driver = webdriver.Firefox()
driver.get('http://www.polyvore.com/cgi/img-thing?.out=jpg&size=l&tid=39713077')
element = driver.find_element_by_css_selector("img")
rect = driver.execute_script("return arguments[0].getBoundingClientRect();", element)
screenshot = driver.get_screenshot_as_png()
img = Image.open(StringIO.StringIO(screenshot))
img_cropped = image.crop((rect['x'], rect['y'], rect['width'], rect['height']))
img_cropped.save('screenshot.png')

Python Selenium: click() cannot trigger event

I try to use selenium to mimic my action on a website to convert PDF files to EXCEL files. There are three steps to complete the conversion:
Upload the PDF file.
Input email address.
Click the 'convert' button.
I wrote the code as below. However, every time I click the button the page just refreshes without actually converting the file.
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("https://pdftoexcelonline.com/en/")
# Upload file
el_upload = driver.find_element_by_name("file")
el_upload.send_keys("/path/to/the/file")
# Input email
el_email = driver.find_element_by_name("email")
el_email.clear()
el_email.send_keys("<email address>")
# Convert button
el_button = driver.find_element_by_id("convert_now")
el_button.click()
time.sleep(10)
driver.close()
This page works well when I completed the steps manually. What is reason that my code did not trigger the conversion?
One possible reason is the not enough execution time. You can add some sleep after each action to verify. Treat it as workaround if work.

Categories