Selenium WebDriver bot to extract Bokeh images from local HTML (Python)
How can I improve this script so that it saves more than one image per run?
An HTML file with 3 graphs loads instantly, but one with 50 graphs takes a few minutes to load, so reloading the page for each image is not a workable approach.
At the moment I get only one image per run. After that I get this error:
Message:
stale element reference: element is not attached to the page document.
# encoding: utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = 'file:\\\\\\%s/26w0.html' % (os.getcwd())
driver.get(url)
elem = driver.find_element_by_class_name("bk-tool-icon-save")
saves = driver.find_elements_by_class_name("bk-tool-icon-save")
for i in range(len(saves)):
    print i
    driver.get(url)
    elem = driver.find_element_by_class_name("bk-tool-icon-save")
    saves = driver.find_elements_by_class_name("bk-tool-icon-save")
    saves[i].click()
    elem.send_keys(Keys.ENTER)
It's written in Python, but I'm open to suggestions:
if you know a solution in Java, .NET, or any other platform or language, you are welcome to post it.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Bokeh Plot</title>
<link rel="stylesheet" href="https://cdn.bokeh.org/bokeh/release/bokeh-0.12.4.min.css" type="text/css" />
<script type="text/javascript" src="https://cdn.bokeh.org/bokeh/release/bokeh-0.12.4.min.js"></script>
<script type="text/javascript">
Bokeh.set_log_level("info");
</script>
<style>
html {
width: 100%;
height: 100%;
}
body {
width: 90%;
height: 100%;
margin: auto;
}
</style>
</head>
<body>
<div class="bk-root">
<div class="bk-plotdiv" id="c5c2aae1-5936-42f6-b4e3-9bc4b46efd06"></div>
</div>
<script type="text/javascript">
(function() {
var fn = function() {
Bokeh.safely(function() {
var docs_json = {"2797a004-7f17-48fa-b80f-105be766d58d":{"roots":{"references":[{"attributes":{"fill_alpha":{"value":0.1},"fill_color":{"value":"#1f77b4"},"line_alpha":{"value":0.1},"line_color":{"value":"#1f77b4"},"size":{"units":"screen","value":8},"x":{"field":"x"},"y":{"field":"y"}},"id":"bb128c3f-68fb-4464-9d3f-c188b2ce8614","type":"Circle"},{"attributes":{"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"},"ticker":{"id":"ab09a83a-42c4-44e8-919d-30796ac60bfe","type":"BasicTicker"}},"id":"b2cbd3ba-7dbd-400e-9971-227edc77fed7","type":"Grid"},{"attributes":{"format":"%.1f ml"},"id":"de7cb3ad-08b7-45c5-9050-03c4b3621d74","type":"PrintfTickFormatter"},{"attributes":{"data_source":{"id":"83b66900-4cd4-4d1c-8f66-60c8665f23e6","type":"ColumnDataSource"},"glyph":{"id":"5e858549-3ff5-41e0-b90c-d3a6976a3686","type":"Line"},"hover_glyph":null,"nonselection_glyph":{"id":"e6565ecc-8d83-440f-af86-88b488066471","type":"Line"},"selection_glyph":null},"id":"d7941470-1f87-439f-a797-accc78f7cc69","type":"GlyphRenderer"},{"attributes":{"label":{"value":"75x^4 - 542.17x\u00b3 + 396.6x\u00b2 + 131.48x + 2.0519"},"renderers":[{"id":"9b3e35b4-210e-4989-a5c3-8015c3ae34bc","type":"GlyphRenderer"}]},"id":"6f8cc559-bea7-4209-99c5-7037deead29c","type":"LegendItem"},{"attributes":{"callback":null,"column_names":["x","y"],"data":{"x":[0.5,1,2,4,6],"y":[0.183,0.436,0.771,1.453,2.177]}},"id":"32b7f2b8-c72e-4caf-982d-49753e787111","type":"ColumnDataSource"},{"attributes":{"format":"%.1f 
ml"},"id":"e4ec96e3-88cc-4a38-9c98-013e60302040","type":"PrintfTickFormatter"},{"attributes":{"axis_label":"OD=450nm","axis_label_text_font_style":"bold","formatter":{"id":"de7cb3ad-08b7-45c5-9050-03c4b3621d74","type":"PrintfTickFormatter"},"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"},"ticker":{"id":"8ab7c0bd-3ec6-4ba1-992b-9b1d81e19180","type":"BasicTicker"}},"id":"bf5574c8-01bd-4feb-9dc1-f2d43e3ff684","type":"LinearAxis"},{"attributes":{"fill_color":{"value":"white"},"line_color":{"value":"#1f77b4"},"size":{"units":"screen","value":8},"x":{"field":"x"},"y":{"field":"y"}},"id":"eda96e22-e71f-4452-a30f-b9a4b8718831","type":"Circle"},{"attributes":{"dimension":1,"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"},"ticker":{"id":"2e431824-d01c-4cce-abac-39320d6671b2","type":"BasicTicker"}},"id":"291ccc16-e4c1-4f7e-bae5-af652e9fd098","type":"Grid"},{"attributes":{},"id":"e4c1cac9-ed55-4ac3-8af2-e2390e64ddaa","type":"BasicTicker"},{"attributes":{"callback":null,"column_names":["x","y"],"data":{"x":[10,20,40,80,160],"y":[0.183,0.436,0.771,1.453,2.177]}},"id":"b609e524-f77b-493f-8713-715846980d9c","type":"ColumnDataSource"},{"attributes":{"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"351ccde3-723b-445d-8765-6cdcede88d0d","type":"SaveTool"},{"attributes":{"bottom_units":"screen","fill_alpha":{"value":0.5},"fill_color":{"value":"lightgrey"},"left_units":"screen","level":"overlay","line_alpha":{"value":1.0},"line_color":{"value":"black"},"line_dash":[4,4],"line_width":{"value":2},"plot":null,"render_mode":"css","right_units":"screen","top_units":"screen"},"id":"521e13fb-9301-4c4f-9622-2a15f6666f88","type":"BoxAnnotation"},{"attributes":{"axis_label":"95735 Human Apoptosense M30, 
pg/ml","axis_label_text_font_style":"bold","formatter":{"id":"30d8adbc-1bbd-49f9-9056-daf8cbc2920b","type":"PrintfTickFormatter"},"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"},"ticker":{"id":"edb441a1-4fdb-441f-82d7-ef5c9bf89a7b","type":"BasicTicker"}},"id":"7f6c26c5-6c0f-4b82-b54e-6ae035a42551","type":"LinearAxis"},{"attributes":{"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"},"ticker":{"id":"59097f65-8f85-4e09-a80b-af85f92666ea","type":"BasicTicker"}},"id":"43d8cc65-3140-400a-adbf-301b93cd21fb","type":"Grid"},{"attributes":{"children":[{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"},{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"},{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}]},"id":"fa51df2d-c281-4559-9fa2-05403688a06a","type":"Column"},{"attributes":{"bounds":[0,null],"callback":null,"end":2.5},"id":"4a564014-2efa-4bc1-9fc9-66d4a722c4da","type":"Range1d"},{"attributes":{"callback":null},"id":"5e810fb6-8963-47a2-b1ef-4cbf2976a259","type":"DataRange1d"},{"attributes":{"callback":null,"column_names":["x","y"],"data":{"x":[0.5,1,2,4,6],"y":[0.183,0.436,0.771,1.453,2.177]}},"id":"83b66900-4cd4-4d1c-8f66-60c8665f23e6","type":"ColumnDataSource"},{"attributes":{"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"a0a4a475-1273-44b6-b312-2e9e7a3db071","type":"WheelZoomTool"},{"attributes":{"bounds":[0,null],"callback":null,"end":2.5},"id":"be0121bb-c66c-4f58-bcc6-60dee73e3906","type":"Range1d"},{"attributes":{"format":"%d 
pmol"},"id":"e2f15250-4359-477a-993a-bace1c43204e","type":"PrintfTickFormatter"},{"attributes":{"items":[{"id":"6f8cc559-bea7-4209-99c5-7037deead29c","type":"LegendItem"}],"label_text_font_style":"bold","location":"top_left","plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"64422ead-b565-4cf7-ad62-71be497e3ad3","type":"Legend"},{"attributes":{"line_color":{"value":"#1f77b4"},"line_width":{"value":4},"x":{"field":"x"},"y":{"field":"y"}},"id":"6951ff93-78b0-4410-908a-b6235b52c745","type":"Line"},{"attributes":{"fill_alpha":{"value":0.1},"fill_color":{"value":"#1f77b4"},"line_alpha":{"value":0.1},"line_color":{"value":"#1f77b4"},"size":{"units":"screen","value":8},"x":{"field":"x"},"y":{"field":"y"}},"id":"801ad24a-8d5e-4a6d-bce2-c12c6cc76c28","type":"Circle"},{"attributes":{"callback":null,"column_names":["x","y"],"data":{"x":[9.375,18.75,37.5,75,150],"y":[0.183,0.436,0.771,1.453,2.177]}},"id":"fb013b9d-e02a-4dcf-8a39-53fd28e06b0a","type":"ColumnDataSource"},{"attributes":{"line_alpha":{"value":0.1},"line_color":{"value":"#1f77b4"},"line_width":{"value":4},"x":{"field":"x"},"y":{"field":"y"}},"id":"5c016a30-4628-476d-8271-db0cf0bb814f","type":"Line"},{"attributes":{"axis_label":"95734 Human Free carnitine(F-C)\n, 
pg/ml","axis_label_text_font_style":"bold","formatter":{"id":"7611152e-8efd-48a7-9d70-dd30e237ab41","type":"PrintfTickFormatter"},"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"},"ticker":{"id":"ab09a83a-42c4-44e8-919d-30796ac60bfe","type":"BasicTicker"}},"id":"ec75fa99-4693-4ca9-91fa-dd33ed3e20ef","type":"LinearAxis"},{"attributes":{"line_alpha":{"value":0.1},"line_color":{"value":"#1f77b4"},"line_width":{"value":4},"x":{"field":"x"},"y":{"field":"y"}},"id":"e6565ecc-8d83-440f-af86-88b488066471","type":"Line"},{"attributes":{},"id":"ab09a83a-42c4-44e8-919d-30796ac60bfe","type":"BasicTicker"},{"attributes":{"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"fb5572e9-b038-4e5b-8295-3ad612a31885","type":"ResetTool"},{"attributes":{},"id":"e22b7b7a-2d18-4526-80fa-e3001352a319","type":"ToolEvents"},{"attributes":{"line_alpha":{"value":0.1},"line_color":{"value":"#1f77b4"},"line_width":{"value":4},"x":{"field":"x"},"y":{"field":"y"}},"id":"99213003-bdf1-4278-b202-457d8f4b2e7f","type":"Line"},{"attributes":{"label":{"value":"75x^4 - 542.17x\u00b3 + 396.6x\u00b2 + 131.48x + 
2.0519"},"renderers":[{"id":"d7941470-1f87-439f-a797-accc78f7cc69","type":"GlyphRenderer"}]},"id":"2accad8a-abf2-468b-8b6a-2e03bc675907","type":"LegendItem"},{"attributes":{"dimension":1,"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"},"ticker":{"id":"e4c1cac9-ed55-4ac3-8af2-e2390e64ddaa","type":"BasicTicker"}},"id":"8e631177-2a67-4aab-8e2b-5b826bf4208b","type":"Grid"},{"attributes":{"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"3a0a1eb4-3e6d-46b8-ab89-cf8f4f1cbe2a","type":"PanTool"},{"attributes":{"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"91dcbc0c-4c67-43e6-83f0-734c4e51c422","type":"WheelZoomTool"},{"attributes":{"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"0d69a258-4b25-4097-b2fc-4c8ccee39867","type":"HelpTool"},{"attributes":{"bounds":[0,null],"callback":null,"end":2.5},"id":"f2efd5a3-a4d7-45b3-9dc1-7bda5ec259f5","type":"Range1d"},{"attributes":{"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"},"ticker":{"id":"edb441a1-4fdb-441f-82d7-ef5c9bf89a7b","type":"BasicTicker"}},"id":"3df0545d-95fb-4a61-96d8-044988c435ee","type":"Grid"},{"attributes":{"callback":null,"column_names":["x","y"],"data":{"x":[9.375,18.75,37.5,75,150],"y":[0.183,0.436,0.771,1.453,2.177]}},"id":"43064e70-6d90-4e15-a701-691886d6c98d","type":"ColumnDataSource"},{"attributes":{"data_source":{"id":"f80ed589-aa96-4e2b-b357-512934b32548","type":"ColumnDataSource"},"glyph":{"id":"54443d79-4bda-436a-b85c-b327adce5a92","type":"Circle"},"hover_glyph":null,"nonselection_glyph":{"id":"801ad24a-8d5e-4a6d-bce2-c12c6cc76c28","type":"Circle"},"selection_glyph":null},"id":"630f3983-2a3f-4aaa-bfca-49954efc0c25","type":"GlyphRenderer"},{"attributes":{"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"3ab3dce1-6b3d-49d9-b1ea-4081b39446f5","type":"HelpTool"},
{"attributes":{"overlay":{"id":"db8c5b58-879a-4a7f-bfe3-e9dd3b9c7d1d","type":"BoxAnnotation"},"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"93e96ad3-8068-48a1-9f6f-2236e0ede821","type":"BoxZoomTool"},{"attributes":{"items":[{"id":"e52ec974-f942-44a5-bc88-eee9aaeb26ad","type":"LegendItem"}],"label_text_font_style":"bold","location":"top_left","plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"29a55b43-6628-4c9b-88f2-5241cca6911f","type":"Legend"},{"attributes":{"data_source":{"id":"b609e524-f77b-493f-8713-715846980d9c","type":"ColumnDataSource"},"glyph":{"id":"8473f88f-cf74-4f31-b31a-d2c6db34374a","type":"Line"},"hover_glyph":null,"nonselection_glyph":{"id":"5c016a30-4628-476d-8271-db0cf0bb814f","type":"Line"},"selection_glyph":null},"id":"58aa88a4-8ff8-4f52-8a36-4ae90b87d9fb","type":"GlyphRenderer"},{"attributes":{},"id":"59097f65-8f85-4e09-a80b-af85f92666ea","type":"BasicTicker"},{"attributes":{"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"b6871fce-fdc4-4662-bb5d-85e9b7a4ad1e","type":"ResetTool"},{"attributes":{"fill_alpha":{"value":0.1},"fill_color":{"value":"#1f77b4"},"line_alpha":{"value":0.1},"line_color":{"value":"#1f77b4"},"size":{"units":"screen","value":8},"x":{"field":"x"},"y":{"field":"y"}},"id":"23334a4d-0a36-4888-a59f-a4a6250bd69c","type":"Circle"},{"attributes":{"callback":null},"id":"5b8d0607-7472-4e11-ae18-7254bfc50db3","type":"DataRange1d"},{"attributes":{},"id":"2e431824-d01c-4cce-abac-39320d6671b2","type":"BasicTicker"},{"attributes":{"axis_label":"95724 Human copeptin\n, 
pmol/L","axis_label_text_font_style":"bold","formatter":{"id":"e2f15250-4359-477a-993a-bace1c43204e","type":"PrintfTickFormatter"},"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"},"ticker":{"id":"59097f65-8f85-4e09-a80b-af85f92666ea","type":"BasicTicker"}},"id":"ca22b639-f183-4932-9035-6575f43477d4","type":"LinearAxis"},{"attributes":{"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"5d058e2a-1b8a-4886-bc0e-40a209e5c190","type":"SaveTool"},{"attributes":{"items":[{"id":"2accad8a-abf2-468b-8b6a-2e03bc675907","type":"LegendItem"}],"label_text_font_style":"bold","location":"top_left","plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"ba8c08f8-eb6a-430e-a73c-4293329538cc","type":"Legend"},{"attributes":{"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"77ca8c4c-cdfb-47af-a837-7c76e0338f7e","type":"WheelZoomTool"},{"attributes":{},"id":"edb441a1-4fdb-441f-82d7-ef5c9bf89a7b","type":"BasicTicker"},{"attributes":{"callback":null,"column_names":["x","y"],"data":{"x":[10,20,40,80,160],"y":[0.183,0.436,0.771,1.453,2.177]}},"id":"f80ed589-aa96-4e2b-b357-512934b32548","type":"ColumnDataSource"},{"attributes":{"background_fill_alpha":{"value":0.8},"below":[{"id":"ca22b639-f183-4932-9035-6575f43477d4","type":"LinearAxis"}],"left":[{"id":"6862c7c5-cb85-4243-9c7b-9a49642710f8","type":"LinearAxis"}],"plot_width":900,"renderers":[{"id":"ca22b639-f183-4932-9035-6575f43477d4","type":"LinearAxis"},{"id":"43d8cc65-3140-400a-adbf-301b93cd21fb","type":"Grid"},{"id":"6862c7c5-cb85-4243-9c7b-9a49642710f8","type":"LinearAxis"},{"id":"8e631177-2a67-4aab-8e2b-5b826bf4208b","type":"Grid"},{"id":"521e13fb-9301-4c4f-9622-2a15f6666f88","type":"BoxAnnotation"},{"id":"ba8c08f8-eb6a-430e-a73c-4293329538cc","type":"Legend"},{"id":"d7941470-1f87-439f-a797-accc78f7cc69","type":"GlyphRenderer"},{"id":"038d2c03-db1f-4246-969a-2f003bc58214","type
":"GlyphRenderer"}],"title":{"id":"f70da64c-6a60-4c81-ac9c-486574b9c4ed","type":"Title"},"tool_events":{"id":"e22b7b7a-2d18-4526-80fa-e3001352a319","type":"ToolEvents"},"toolbar":{"id":"f75e2cb9-f423-4383-a41d-638ff4a5a417","type":"Toolbar"},"x_range":{"id":"29683b67-b091-4a62-ab1f-d40b1289e9f4","type":"DataRange1d"},"y_range":{"id":"f2efd5a3-a4d7-45b3-9dc1-7bda5ec259f5","type":"Range1d"}},"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"},{"attributes":{"bottom_units":"screen","fill_alpha":{"value":0.5},"fill_color":{"value":"lightgrey"},"left_units":"screen","level":"overlay","line_alpha":{"value":1.0},"line_color":{"value":"black"},"line_dash":[4,4],"line_width":{"value":2},"plot":null,"render_mode":"css","right_units":"screen","top_units":"screen"},"id":"db8c5b58-879a-4a7f-bfe3-e9dd3b9c7d1d","type":"BoxAnnotation"},{"attributes":{"plot":null,"text":""},"id":"ba81564b-0c23-4a9e-826e-e43201b0e449","type":"Title"},{"attributes":{"fill_color":{"value":"white"},"line_color":{"value":"#1f77b4"},"size":{"units":"screen","value":8},"x":{"field":"x"},"y":{"field":"y"}},"id":"54443d79-4bda-436a-b85c-b327adce5a92","type":"Circle"},{"attributes":{"fill_color":{"value":"white"},"line_color":{"value":"#1f77b4"},"size":{"units":"screen","value":8},"x":{"field":"x"},"y":{"field":"y"}},"id":"a8506a17-fef3-45ea-8118-749941f73322","type":"Circle"},{"attributes":{"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"35150a0a-6a13-4fb1-9d65-680844b108fd","type":"HelpTool"},{"attributes":{"format":"%d pg"},"id":"30d8adbc-1bbd-49f9-9056-daf8cbc2920b","type":"PrintfTickFormatter"},{"attributes":{"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"ed65518e-1887-4053-bb39-3d162c719899","type":"ResetTool"},{"attributes":{"label":{"value":"75x^4 - 542.17x\u00b3 + 396.6x\u00b2 + 131.48x + 
2.0519"},"renderers":[{"id":"58aa88a4-8ff8-4f52-8a36-4ae90b87d9fb","type":"GlyphRenderer"}]},"id":"e52ec974-f942-44a5-bc88-eee9aaeb26ad","type":"LegendItem"},{"attributes":{"background_fill_alpha":{"value":0.8},"below":[{"id":"ec75fa99-4693-4ca9-91fa-dd33ed3e20ef","type":"LinearAxis"}],"left":[{"id":"2c445319-62e1-42a3-b270-59425c3c4100","type":"LinearAxis"}],"plot_width":900,"renderers":[{"id":"ec75fa99-4693-4ca9-91fa-dd33ed3e20ef","type":"LinearAxis"},{"id":"b2cbd3ba-7dbd-400e-9971-227edc77fed7","type":"Grid"},{"id":"2c445319-62e1-42a3-b270-59425c3c4100","type":"LinearAxis"},{"id":"291ccc16-e4c1-4f7e-bae5-af652e9fd098","type":"Grid"},{"id":"db8c5b58-879a-4a7f-bfe3-e9dd3b9c7d1d","type":"BoxAnnotation"},{"id":"64422ead-b565-4cf7-ad62-71be497e3ad3","type":"Legend"},{"id":"9b3e35b4-210e-4989-a5c3-8015c3ae34bc","type":"GlyphRenderer"},{"id":"3771367d-7eac-4366-82eb-2a6ae0581aad","type":"GlyphRenderer"}],"title":{"id":"e437cbca-8f2b-4cd4-83d8-b2c020726684","type":"Title"},"tool_events":{"id":"42e06622-2cec-43bc-8b0b-37606b4e2ed0","type":"ToolEvents"},"toolbar":{"id":"7c537b7a-60fc-4ed7-9f37-2e26d7f7ef84","type":"Toolbar"},"x_range":{"id":"5e810fb6-8963-47a2-b1ef-4cbf2976a259","type":"DataRange1d"},"y_range":{"id":"4a564014-2efa-4bc1-9fc9-66d4a722c4da","type":"Range1d"}},"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"},{"attributes":{"axis_label":"OD=450nm","axis_label_text_font_style":"bold","formatter":{"id":"e4ec96e3-88cc-4a38-9c98-013e60302040","type":"PrintfTickFormatter"},"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"},"ticker":{"id":"2e431824-d01c-4cce-abac-39320d6671b2","type":"BasicTicker"}},"id":"2c445319-62e1-42a3-b270-59425c3c4100","type":"LinearAxis"},{"attributes":{"overlay":{"id":"65d909ea-fabf-4424-9b9c-08853c798990","type":"BoxAnnotation"},"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"fdd8248a-9231-4b1b-96b2-50a93cc377d6","type":"BoxZoomTool"
},{"attributes":{"overlay":{"id":"521e13fb-9301-4c4f-9622-2a15f6666f88","type":"BoxAnnotation"},"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"d311ebd6-2b3c-407f-ae8d-921f407eb08d","type":"BoxZoomTool"},{"attributes":{"data_source":{"id":"43064e70-6d90-4e15-a701-691886d6c98d","type":"ColumnDataSource"},"glyph":{"id":"6951ff93-78b0-4410-908a-b6235b52c745","type":"Line"},"hover_glyph":null,"nonselection_glyph":{"id":"99213003-bdf1-4278-b202-457d8f4b2e7f","type":"Line"},"selection_glyph":null},"id":"9b3e35b4-210e-4989-a5c3-8015c3ae34bc","type":"GlyphRenderer"},{"attributes":{"callback":null},"id":"29683b67-b091-4a62-ab1f-d40b1289e9f4","type":"DataRange1d"},{"attributes":{"plot":{"id":"64a90e8d-fce9-4bbd-b044-1ceda097ba2d","subtype":"Figure","type":"Plot"}},"id":"4091c142-49a2-4042-9cbd-705b4fba8fb3","type":"SaveTool"},{"attributes":{"axis_label":"OD=450nm","axis_label_text_font_style":"bold","formatter":{"id":"210d7bd4-012f-4d10-8822-2e1e3c576cad","type":"PrintfTickFormatter"},"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"},"ticker":{"id":"e4c1cac9-ed55-4ac3-8af2-e2390e64ddaa","type":"BasicTicker"}},"id":"6862c7c5-cb85-4243-9c7b-9a49642710f8","type":"LinearAxis"},{"attributes":{"bottom_units":"screen","fill_alpha":{"value":0.5},"fill_color":{"value":"lightgrey"},"left_units":"screen","level":"overlay","line_alpha":{"value":1.0},"line_color":{"value":"black"},"line_dash":[4,4],"line_width":{"value":2},"plot":null,"render_mode":"css","right_units":"screen","top_units":"screen"},"id":"65d909ea-fabf-4424-9b9c-08853c798990","type":"BoxAnnotation"},{"attributes":{"format":"%.1f 
L"},"id":"210d7bd4-012f-4d10-8822-2e1e3c576cad","type":"PrintfTickFormatter"},{"attributes":{},"id":"9c1cc749-78d3-499e-8166-9d0b4ca3c80e","type":"ToolEvents"},{"attributes":{"line_color":{"value":"#1f77b4"},"line_width":{"value":4},"x":{"field":"x"},"y":{"field":"y"}},"id":"8473f88f-cf74-4f31-b31a-d2c6db34374a","type":"Line"},{"attributes":{"plot":null,"text":""},"id":"e437cbca-8f2b-4cd4-83d8-b2c020726684","type":"Title"},{"attributes":{"plot":{"id":"a51e3dcc-2662-4735-b1ca-48832a7264b6","subtype":"Figure","type":"Plot"}},"id":"9f06704d-ad5c-45c2-aa0d-08c4327129aa","type":"PanTool"},{"attributes":{"active_drag":"auto","active_scroll":"auto","active_tap":"auto","logo":null,"tools":[{"id":"3a0a1eb4-3e6d-46b8-ab89-cf8f4f1cbe2a","type":"PanTool"},{"id":"91dcbc0c-4c67-43e6-83f0-734c4e51c422","type":"WheelZoomTool"},{"id":"93e96ad3-8068-48a1-9f6f-2236e0ede821","type":"BoxZoomTool"},{"id":"4091c142-49a2-4042-9cbd-705b4fba8fb3","type":"SaveTool"},{"id":"ed65518e-1887-4053-bb39-3d162c719899","type":"ResetTool"},{"id":"0d69a258-4b25-4097-b2fc-4c8ccee39867","type":"HelpTool"}]},"id":"7c537b7a-60fc-4ed7-9f37-2e26d7f7ef84","type":"Toolbar"},{"attributes":{"data_source":{"id":"32b7f2b8-c72e-4caf-982d-49753e787111","type":"ColumnDataSource"},"glyph":{"id":"eda96e22-e71f-4452-a30f-b9a4b8718831","type":"Circle"},"hover_glyph":null,"nonselection_glyph":{"id":"bb128c3f-68fb-4464-9d3f-c188b2ce8614","type":"Circle"},"selection_glyph":null},"id":"038d2c03-db1f-4246-969a-2f003bc58214","type":"GlyphRenderer"},{"attributes":{"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"}},"id":"9273eb57-9cd5-47ec-b61a-391ca387cf96","type":"PanTool"},{"attributes":{"background_fill_alpha":{"value":0.8},"below":[{"id":"7f6c26c5-6c0f-4b82-b54e-6ae035a42551","type":"LinearAxis"}],"left":[{"id":"bf5574c8-01bd-4feb-9dc1-f2d43e3ff684","type":"LinearAxis"}],"plot_width":900,"renderers":[{"id":"7f6c26c5-6c0f-4b82-b54e-6ae035a42551","type":"LinearAxis"},{"id":"3df0545d-95fb-4a6
1-96d8-044988c435ee","type":"Grid"},{"id":"bf5574c8-01bd-4feb-9dc1-f2d43e3ff684","type":"LinearAxis"},{"id":"03fd8f50-9f84-414a-921e-f9215101a4d3","type":"Grid"},{"id":"65d909ea-fabf-4424-9b9c-08853c798990","type":"BoxAnnotation"},{"id":"29a55b43-6628-4c9b-88f2-5241cca6911f","type":"Legend"},{"id":"58aa88a4-8ff8-4f52-8a36-4ae90b87d9fb","type":"GlyphRenderer"},{"id":"630f3983-2a3f-4aaa-bfca-49954efc0c25","type":"GlyphRenderer"}],"title":{"id":"ba81564b-0c23-4a9e-826e-e43201b0e449","type":"Title"},"tool_events":{"id":"9c1cc749-78d3-499e-8166-9d0b4ca3c80e","type":"ToolEvents"},"toolbar":{"id":"d81630f8-b9ed-4cce-907b-0141d2e05635","type":"Toolbar"},"x_range":{"id":"5b8d0607-7472-4e11-ae18-7254bfc50db3","type":"DataRange1d"},"y_range":{"id":"be0121bb-c66c-4f58-bcc6-60dee73e3906","type":"Range1d"}},"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"},{"attributes":{},"id":"42e06622-2cec-43bc-8b0b-37606b4e2ed0","type":"ToolEvents"},{"attributes":{"plot":null,"text":""},"id":"f70da64c-6a60-4c81-ac9c-486574b9c4ed","type":"Title"},{"attributes":{"data_source":{"id":"fb013b9d-e02a-4dcf-8a39-53fd28e06b0a","type":"ColumnDataSource"},"glyph":{"id":"a8506a17-fef3-45ea-8118-749941f73322","type":"Circle"},"hover_glyph":null,"nonselection_glyph":{"id":"23334a4d-0a36-4888-a59f-a4a6250bd69c","type":"Circle"},"selection_glyph":null},"id":"3771367d-7eac-4366-82eb-2a6ae0581aad","type":"GlyphRenderer"},{"attributes":{"line_color":{"value":"#1f77b4"},"line_width":{"value":4},"x":{"field":"x"},"y":{"field":"y"}},"id":"5e858549-3ff5-41e0-b90c-d3a6976a3686","type":"Line"},{"attributes":{"format":"%d 
pg"},"id":"7611152e-8efd-48a7-9d70-dd30e237ab41","type":"PrintfTickFormatter"},{"attributes":{"dimension":1,"plot":{"id":"e44c3f42-efae-400d-a997-de22cc1d9be2","subtype":"Figure","type":"Plot"},"ticker":{"id":"8ab7c0bd-3ec6-4ba1-992b-9b1d81e19180","type":"BasicTicker"}},"id":"03fd8f50-9f84-414a-921e-f9215101a4d3","type":"Grid"},{"attributes":{},"id":"8ab7c0bd-3ec6-4ba1-992b-9b1d81e19180","type":"BasicTicker"},{"attributes":{"active_drag":"auto","active_scroll":"auto","active_tap":"auto","logo":null,"tools":[{"id":"9273eb57-9cd5-47ec-b61a-391ca387cf96","type":"PanTool"},{"id":"77ca8c4c-cdfb-47af-a837-7c76e0338f7e","type":"WheelZoomTool"},{"id":"fdd8248a-9231-4b1b-96b2-50a93cc377d6","type":"BoxZoomTool"},{"id":"5d058e2a-1b8a-4886-bc0e-40a209e5c190","type":"SaveTool"},{"id":"b6871fce-fdc4-4662-bb5d-85e9b7a4ad1e","type":"ResetTool"},{"id":"3ab3dce1-6b3d-49d9-b1ea-4081b39446f5","type":"HelpTool"}]},"id":"d81630f8-b9ed-4cce-907b-0141d2e05635","type":"Toolbar"},{"attributes":{"active_drag":"auto","active_scroll":"auto","active_tap":"auto","logo":null,"tools":[{"id":"9f06704d-ad5c-45c2-aa0d-08c4327129aa","type":"PanTool"},{"id":"a0a4a475-1273-44b6-b312-2e9e7a3db071","type":"WheelZoomTool"},{"id":"d311ebd6-2b3c-407f-ae8d-921f407eb08d","type":"BoxZoomTool"},{"id":"351ccde3-723b-445d-8765-6cdcede88d0d","type":"SaveTool"},{"id":"fb5572e9-b038-4e5b-8295-3ad612a31885","type":"ResetTool"},{"id":"35150a0a-6a13-4fb1-9d65-680844b108fd","type":"HelpTool"}]},"id":"f75e2cb9-f423-4383-a41d-638ff4a5a417","type":"Toolbar"}],"root_ids":["fa51df2d-c281-4559-9fa2-05403688a06a"]},"title":"Bokeh Application","version":"0.12.4"}};
var render_items = [{"docid":"2797a004-7f17-48fa-b80f-105be766d58d","elementid":"c5c2aae1-5936-42f6-b4e3-9bc4b46efd06","modelid":"fa51df2d-c281-4559-9fa2-05403688a06a"}];
Bokeh.embed.embed_items(docs_json, render_items);
});
};
if (document.readyState != "loading") fn();
else document.addEventListener("DOMContentLoaded", fn);
})();
</script>
</body>
</html>
I can't keep doing this manually,
because I need thousands of images.
P.S.
To @e1che:
I've changed a few things in your code:
for i in range(len(saves)):
    print i
    saves[i].click()
    saves[i].send_keys(Keys.ENTER)
It looks simpler and better, but it still doesn't work correctly.
Your main problem is that you give the driver a "new" URL on every iteration: calling driver.get(url) inside the loop reloads the page.
You should simplify your for loop a bit.
for i in range(len(saves)):
    print(i)
    save_button = saves[i]
    save_button.click()
    # The original sent ENTER to elem, which is always the first save button. Was that intended?
    save_button.send_keys(Keys.ENTER)
I don't see the 50 images that you refer to in the question, but the main issue is that you reload the page before clicking all the save buttons on it. You can fix that with the code below.
# files is a list of strings, each one the path to an HTML file on disk
for file in files:
    driver.get(file)
    for e in driver.find_elements_by_class_name("bk-tool-icon-save"):
        e.click()
This should save all the loaded images. Once you provide more details on how and when the other 50 images are loaded, we can answer that part. You just need to wait until all the images are loaded; then the code above will work and save them all.
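The "wait until everything is loaded" step could be sketched as a small polling helper. This is a minimal sketch and not part of the original answer: `wait_for_stable_count` and its parameters are hypothetical names, and it assumes the page counts as loaded once the number of save icons has stopped growing for a few consecutive polls. When the expected number of plots is known in advance, Selenium's own `WebDriverWait` can serve the same purpose.

```python
import time


def wait_for_stable_count(count_fn, timeout=300.0, poll=1.0, stable_polls=3):
    """Poll count_fn() until it returns the same non-zero value
    stable_polls times in a row, then return that value.

    count_fn is any zero-argument callable, for example:
        lambda: len(driver.find_elements("class name", "bk-tool-icon-save"))
    Raises TimeoutError if the count does not settle within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    last, streak = None, 0
    while True:
        current = count_fn()
        if current and current == last:
            streak += 1                      # same non-zero reading again
        else:
            streak = 1 if current else 0     # new value: restart the streak
        if streak >= stable_polls:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("element count did not settle in time")
        last = current
        time.sleep(poll)
```

The generous default timeout reflects the question's remark that a page with 50 graphs can take a few minutes to load; the right values depend on the machine and the files.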
Also, elem is just the first element of the saves collection, so there's no need to store both. I'm not sure what the purpose of send_keys() is; it shouldn't be needed, so I removed it as well.
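Putting these points together, a single pass per file might look like the sketch below. It rests on several assumptions: `html_file_urls` and `click_all_save_buttons` are hypothetical helper names; the locator is passed as the plain string "class name" (the value of Selenium's `By.CLASS_NAME`) through the generic `find_elements(by, value)` call, which also works on Selenium 4 where `find_elements_by_class_name` is no longer available; and the browser is assumed not to prompt for each download.

```python
from pathlib import Path


def html_file_urls(directory):
    """Return file:// URLs for every .html file in directory, sorted by name."""
    return [p.resolve().as_uri() for p in sorted(Path(directory).glob("*.html"))]


def click_all_save_buttons(driver, url, class_name="bk-tool-icon-save"):
    """Load url once and click every save icon found on the page.

    driver is expected to be a Selenium WebDriver (any object with get() and
    find_elements(by, value) works). Returns the number of icons clicked.
    """
    driver.get(url)
    # Elements are looked up after the page load and used immediately,
    # so no reference is kept across a reload where it could go stale.
    buttons = driver.find_elements("class name", class_name)
    for button in buttons:
        button.click()
    return len(buttons)
```

With a real driver, this would be used as `for url in html_file_urls("."): click_all_save_buttons(driver, url)`, so each file is loaded exactly once regardless of how many plots it contains.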
Related
Using BeautifulSoup to open product pages in different tabs for an entered search term on Amazon
I'm quite new to Python and very new to web scraping. I'm currently following along in Al Sweigart's book Automate the Boring Stuff With Python, and there is a suggested practice assignment, which is basically to make a program that does this:
takes an input for a product to search for on Amazon
uses requests.get() and .text() to get the HTML of that search page
searches the HTML with BeautifulSoup for CSS selectors that signify links to product pages
in separate tabs, opens tabs to the top five products for your search result
Here's my code:
#! python3
# Searches amazon for the inputted product (either through command line or input) and opens 5 tabs with the top
# items for that search.
import requests, sys, bs4, webbrowser

if len(sys.argv) > 1:  # if there are system arguments
    res = requests.get('https://www.amazon.com/s?k=' + ''.join(sys.argv))
    res.raise_for_status
else:  # take input
    print('what product would you like to search Amazon for?')
    product = str(input())
    res = requests.get('https://www.amazon.com/s?k=' + ''.join(product))
    res.raise_for_status

# retrieve top search links:
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print(res.text)  # TO CHECK HTML OF SITE, GET RID OF DURING ACTUAL PROGRAM

# open a new tab for the top 5 items, and get the css selector for links
# a list of all things on the downloaded page that are within the css selector 'a-link-normal a-text-normal'
linkElems = soup.select('a-link-normal a-text-normal')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    urlToOpen = 'https://www.amazon.com/' + linkElems[i].get('href')
    print('Opening', urlToOpen)
    webbrowser.open(urlToOpen)

I think I've selected the correct CSS selector ("a-link-normal a-text-normal"), so I think the problem is with res.text(): when I print it to see what it looks like, the HTML content does not seem to be complete, and it does not contain the actual HTML I see when I look at the same site using Inspect Element in Chrome.
Additionally, none of that HTML contains anything like "a-link-normal a-text-normal". As a sample, this is what res.text() looks like for a search for 'big pencil':
what product would you like to search Amazon for?
big pencil
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <title>Sorry! Something went wrong!</title>
  <style>
  html, body { padding: 0; margin: 0 }
  img { border: 0 }
  #a { background: #232f3e; padding: 11px 11px 11px 192px }
  #b { position: absolute; left: 22px; top: 12px }
  #c { position: relative; max-width: 800px; padding: 0 40px 0 0 }
  #e, #f { height: 35px; border: 0; font-size: 1em }
  #e { width: 100%; margin: 0; padding: 0 10px; border-radius: 4px 0 0 4px }
  #f { cursor: pointer; background: #febd69; font-weight: bold; border-radius: 0 4px 4px 0; -webkit-appearance: none; position: absolute; top: 0; right: 0; padding: 0 12px }
  @media (max-width: 500px) {
    #a { padding: 55px 10px 10px }
    #b { left: 6px }
  }
  #g { text-align: center; margin: 30px 0 }
  #g img { max-width: 90% }
  #d { display: none }
  #d[src] { display: inline }
  </style>
</head>
<body>
  <img id="b" src="https://images-na.ssl-images-amazon.com/images/G/01/error/logo._TTD_.png" alt="Amazon.com">
  <form id="a" accept-charset="utf-8" action="/s" method="GET" role="search">
    <div id="c">
      <input id="e" name="field-keywords" placeholder="Search">
      <input name="ref" type="hidden" value="cs_503_search">
      <input id="f" type="submit" value="Go">
    </div>
  </form>
  <div id="g">
    <div><a href="/ref=cs_503_link"><img src="https://images-na.ssl-images-amazon.com/images/G/01/error/500_503.png" alt="Sorry! Something went wrong on our end. Please go back and try again or go to Amazon's home page."></a>
    </div>
    <img id="d" alt="Dogs of Amazon">
    <script>document.getElementById("d").src = "https://images-na.ssl-images-amazon.com/images/G/01/error/" + (Math.floor(Math.random() * 43) + 1) + "._TTD_.jpg";</script>
  </div>
</body>
</html>
Thank you so much for your patience.
This is a classic case where you won't find anything if you scrape the site directly with a parser such as BeautifulSoup. The site first sends your browser an initial chunk of code, the same as what you posted for 'big pencil', and then loads the rest of the page's elements via JavaScript. You'll need to use Selenium WebDriver to load the page first and then fetch the code from the browser. This is equivalent to opening your browser's console, going to the Elements tab, and looking for the classes you mentioned. To see the difference, compare the page's source code with the code in the Elements tab.

So you need to hand BS4 the data that has been loaded into the browser:

import bs4
from selenium import webdriver

# This is the Chromedriver, which opens a new browser instance for you. More info in the docs.
browser = webdriver.Chrome("path_to_chromedriver")
browser.get(url)  # Load the URL in the browser
# Now pass the rendered source to BS4 and go on with extracting the elements
soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')

This is very basic code for understanding Selenium; in a production use case you may want to use a headless browser such as PhantomJS.

References: Chromedriver, Selenium, Selenium with Python
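Once the rendered source is in hand, the extraction step could look roughly like the sketch below. It assumes the result links carry both classes quoted in the question ("a-link-normal a-text-normal"); the function name extract_result_links and the sample markup are hypothetical and only stand in for browser.page_source.

```python
from bs4 import BeautifulSoup


def extract_result_links(page_source):
    """Return (text, href) pairs for anchors that carry both result classes."""
    soup = BeautifulSoup(page_source, "html.parser")
    # The CSS selector requires BOTH classes to be present on the <a> tag.
    anchors = soup.select("a.a-link-normal.a-text-normal")
    return [(a.get_text(strip=True), a.get("href")) for a in anchors]


# Hypothetical rendered markup, standing in for browser.page_source.
SAMPLE_RENDERED_HTML = """
<div>
  <a class="a-link-normal a-text-normal" href="/dp/EXAMPLE1">Big pencil</a>
  <a class="a-link-normal" href="/help">Help</a>
</div>
"""

links = extract_result_links(SAMPLE_RENDERED_HTML)
print(links)
```

With a live session, the same function would be called as extract_result_links(browser.page_source) after browser.get(url) has returned.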
selenium webdriver(chrome) elements differ from those of driver.page_source
I tried to scrape an article on Medium, but it failed because Selenium's driver.page_source doesn't contain the target div.

Example: Demystifying Python Decorators in Less Than 10 Minutes
https://medium.com/@adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c

On this page, the class of the div that holds the content is "x y z ab ac ez af ag", but this element doesn't show up in driver.page_source. The short code is below. It is NOT a timeout problem. It seems as if driver.page_source has not been processed by JavaScript, but I'm not sure.

ARTICLE = "https://medium.com/@adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c"
driver.get(ARTICLE)
text_soup = BeautifulSoup(driver.page_source, "html5lib")
text = text_soup.select(".x.y.z.ab.ac.ez.af.ag")
print(text)  # => []

I expected the output of driver.page_source to be the same as what the Elements panel of the Chrome developer console shows.

Update: I ran an experiment. I suspected that webdriver couldn't return the HTML source after JavaScript had processed it, so I loaded the HTML file below with Selenium. But I got the HTML with the element removed.

Result:

webdriver and the ordinary Chrome console give the same output -> processed

<html lang="en"> <body> <script type="text/javascript"> document.querySelector("#id").remove(); </script> </body></html>

wget / requests -> not processed

<html lang="en"> <body> <div id="id"> test element </div> <script type="text/javascript"> document.querySelector("#id").remove(); </script> </body></html>
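The "wget / requests -> not processed" half of the experiment can be reproduced without a browser. The sketch below feeds the test file from the question to Python's standard-library HTML parser, which never executes scripts, and records every id attribute it sees; the class name IdCollector is made up for this illustration.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute found in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)


# The test file from the question, as wget or requests would receive it.
RAW_HTML = """<html lang="en"> <body> <div id="id"> test element </div>
<script type="text/javascript"> document.querySelector("#id").remove(); </script>
</body></html>"""

collector = IdCollector()
collector.feed(RAW_HTML)
static_ids = collector.ids
print(static_ids)
```

The div is still reported here because the script is treated as plain text and is never run, whereas a real browser driven by webdriver executes it and drops the element before page_source is read.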
Exporting Images from Vincent in ipython
I've been using vincent in Python to make choropleth maps, and now I'd like to turn them into images to use in a presentation. Does anyone know how to do this? Below is the code I'm using:

county_borders = r'us_counties.topo.json'
geo_data = [{'name': 'counties',
             'url': county_borders,
             'feature': 'us_counties.geo'}]
choro_pop = vincent.Map(data=counties_only, geo_data=geo_data, scale=1500,
                        projection='albersUsa', data_bind='CENSUS2010POP',
                        data_key='FIPS', map_key={'counties': 'properties.FIPS'})
choro_pop.marks[0].properties.enter.stroke_opacity = ValueRef(value=0.5)
choro_pop.rebind(column='ESTIMATESBASE2010', brew='OrRd')
choro_pop.to_json('Counties_Population_choropleth.json', html_out=True,
                  html_path='Counties_Population_choropleth.html')
choro_pop.display()

This gives me a map in the IPython notebook (yay!) and outputs both an HTML file and a .json file. The HTML file is just a "scaffold": as far as I can tell it doesn't contain any data, and it doesn't display anything when opened in a browser (I've tried Chrome). I know the .json file is dictionary-like, but I'm not sure how to use it to draw a nice image. Thanks!

Edit 1: This is what's in the HTML file:

<html>
<head>
<title>Vega Scaffold</title>
<script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
<script src="http://d3js.org/topojson.v1.min.js"></script>
<script src="http://d3js.org/d3.geo.projection.v0.min.js" charset="utf-8"> </script>
<script src="http://trifacta.github.com/vega/vega.js"></script>
</head>
<body>
<div id="vis"></div>
</body>
<script type="text/javascript">
// parse a spec and create a visualization view
function parse(spec) {
  vg.parse.spec(spec, function(chart) { chart({el:"#vis"}).update(); });
}
parse("Counties_Population_change_choropleth.json");
</script>
</html>
How to concatenate two html file bodies with BeautifulSoup?
I need to concatenate the bodies of two HTML files into one HTML file, with a bit of arbitrary HTML as a separator in between. I have code that used to work for this, but it stopped working when I upgraded from Xubuntu 11.10 (or was it 11.04?) to 12.10, probably due to a BeautifulSoup update (I'm currently using 3.2.1; I don't know what version I had previously) or to a vim update (I use vim to auto-generate the HTML files from plaintext ones). This is the stripped-down version of the code:

from BeautifulSoup import BeautifulSoup

soup_original_1 = BeautifulSoup(''.join(open('test1.html')))
soup_original_2 = BeautifulSoup(''.join(open('test2.html')))
contents_1 = soup_original_1.body.renderContents()
contents_2 = soup_original_2.body.renderContents()
contents_both = contents_1 + "\n<b>SEPARATOR\n</b>" + contents_2
soup_new = BeautifulSoup(''.join(open('test1.html')))
while len(soup_new.body.contents):
    soup_new.body.contents[0].extract()
soup_new.body.insert(0, contents_both)

The bodies of the two input files used for the test case are very simple: contents_1 is '\n<pre>\nFile 1\n</pre>\n' and contents_2 is '\n<pre>\nFile 2\n</pre>\n'.

I would like soup_new.body.renderContents() to be a concatenation of those two with the separator text in between, but instead all the <'s change into &lt; etc. The desired result is '\n<pre>\nFile 1\n</pre>\n\n<b>SEPARATOR\n</b>\n<pre>\nFile 2\n</pre>\n', which is what I used to get prior to the OS update; the current result is '\n&lt;pre&gt;\nFile 1\n&lt;/pre&gt;\n\n&lt;b&gt;SEPARATOR\n&lt;/b&gt;\n&lt;pre&gt;\nFile 2\n&lt;/pre&gt;\n', which is pretty useless.

How do I make BeautifulSoup stop turning < into &lt; etc. when inserting HTML as a string into a soup object's body? Or should I just be doing this in an entirely different way? (This is my only experience with BeautifulSoup and most other HTML parsing, so I'm guessing this may well be the case.)
The html files are automatically generated from plaintext files with vim (the real cases I use are obviously more complicated, and involve custom syntax highlighting, which is why I'm doing it this way at all). The full test1.html file looks like this, and test2.html is identical except for contents and title. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> <title>~/programs/lab_notebook_and_printing/concatenate-html_problem_2013/test1.txt.html</title> <meta name="Generator" content="Vim/7.3" /> <meta name="plugin-version" content="vim7.3_v10" /> <meta name="syntax" content="none" /> <meta name="settings" content="ignore_folding,use_css,pre_wrap,expand_tabs,ignore_conceal" /> <style type="text/css"> pre { white-space: pre-wrap; font-family: monospace; color: #000000; background-color: #ffffff; white-space: pre-wrap; word-wrap: break-word } body { font-family: monospace; color: #000000; background-color: #ffffff; font-size: 0.875em } </style> </head> <body> <pre> File 1 </pre> </body> </html>
Reading the HTML as text just to insert it back into HTML, and fighting the encoding and decoding in both directions, creates a whole lot of extra work that's very difficult to get right.

The easy thing to do is just not do that. You want to insert everything in the body of test2 after everything in the body of test1, right? So just do that:

# list(...) takes a snapshot first: append() moves each element out of
# soup_original_2, which would otherwise change the body while it is iterated.
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)

To append a separator first, just do the same thing with the separator:

b = soup_original_1.new_tag('b')
b.append('SEPARATOR')
soup_original_1.body.append(b)
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)

That's it. See the documentation section Modifying the tree for a tutorial that covers all of this.
As mentioned in the comments on abarnert's answer, there is a problem with append. This answer by Martijn Pieters does the job. From BeautifulSoup 4.4 onwards (released July '15), you can use:

import copy

document2.body.append(copy.copy(element))
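Putting the copy-based approach together with the separator from the question, a minimal sketch might look like this. It assumes BeautifulSoup 4.4 or later (the bs4 package rather than the BeautifulSoup 3 import used in the question); the helper name concatenate_bodies and the two inline documents are hypothetical.

```python
import copy

from bs4 import BeautifulSoup


def concatenate_bodies(html_1, html_2, separator_text="SEPARATOR"):
    """Append a <b> separator and copies of html_2's body children to html_1's body."""
    soup_1 = BeautifulSoup(html_1, "html.parser")
    soup_2 = BeautifulSoup(html_2, "html.parser")

    separator = soup_1.new_tag("b")
    separator.string = separator_text
    soup_1.body.append(separator)

    # copy.copy() leaves each element in soup_2, so the list being
    # iterated is never modified while the loop runs.
    for element in soup_2.body.contents:
        soup_1.body.append(copy.copy(element))
    return soup_1, soup_2


# Hypothetical stand-ins for test1.html and test2.html.
HTML_1 = "<html><body><pre>File 1</pre></body></html>"
HTML_2 = "<html><body><pre>File 2</pre></body></html>"

merged, untouched = concatenate_bodies(HTML_1, HTML_2)
merged_body = merged.body.decode_contents()
print(merged_body)
```

Because the elements are copied rather than moved, the second document can still be used afterwards, and the markup is inserted as real tags instead of an escaped string.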
I was having trouble looping through the elements of my HTML documents: BeautifulSoup just wasn't successfully parsing some of my HTML files. I ended up inserting a tag around all of the elements inside the body tag:

<body><span id="entirebody"> : </span></body>

This meant that all of the elements were enclosed in one span element and were processed successfully. I still want to dig into exactly what happens when I don't do this, but it is one way to get around problems you might encounter.

import re

def insertSpan(htmlString):
    '''
    Insert a span tag around all of body contents:
    <body><span id="entirebody">....</span></body>
    '''
    subRe = re.compile(r'(<body>)(.*)(<\/body>)', re.DOTALL)
    # Raw string so that \g<1> etc. reach re.sub as group references
    htmlString = subRe.sub(r'\g<1><span id="entirebody">\g<2></span>\g<3>', htmlString)
    return htmlString
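The pattern above only matches a bare, lowercase <body> tag. As a possible variation (not part of the original answer), the sketch below also accepts attributes on the tag and ignores case; the names BODY_RE and wrap_body_in_span are made up for this example, and span_id is assumed to be a plain identifier without backslashes.

```python
import re

# <body ...> with optional attributes, everything up to the last </body>.
BODY_RE = re.compile(r"(<body\b[^>]*>)(.*)(</body>)", re.DOTALL | re.IGNORECASE)


def wrap_body_in_span(html_string, span_id="entirebody"):
    """Wrap the whole body contents in a single <span id=...> element."""
    replacement = r'\g<1><span id="%s">\g<2></span>\g<3>' % span_id
    return BODY_RE.sub(replacement, html_string, count=1)


wrapped = wrap_body_in_span('<BODY class="x">\n<pre>File 1</pre>\n</BODY>')
print(wrapped)
```

If no body tag is found, the input string is returned unchanged, so the function can be applied to fragments as well.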
Getting html elements of another page with window.open()
I'm trying to access HTML elements of a page that I open with window.open(). These are my HTML files:

firstpage.html:

<html> <body> <script> function openPage() { return window.open("secondpage.html") } </script> </body> </html>

secondpage.html:

<html> <body> <h1>Hello!</h1> <p>This is the second page</p> </body> </html>

This is what I'm doing:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("firstpage.html")
html = browser.execute_script("return openPage().document;")
print(html)

What I expect to get is a reference to the document element of the second page. This seems to work in the Firefox web console. When I test the script, the second page opens, but the first page seems to hang, and after a while I get a dialog saying: "A script on this page may be busy, or it may have stopped responding. You can stop the script now, or you can continue to see if the script will complete." with "Stop" and "Continue" buttons. If I press "Continue", the dialog keeps reappearing; when I eventually press "Stop", the Python variable html contains the same text as the dialog. What am I doing wrong?

EDIT: As in @e1che's answer, this is the right way to do it:

firstpage.html:

<html> <body> <script> function openPage() { window.open("secondpage.html", "secondpagewindow") } </script> </body> </html>

The Python code:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("firstpage.html")
browser.execute_script("openPage()")
browser.switch_to_window("secondpagewindow")
print(browser.page_source)
You're nearly there, but you forgot to switch your driver to the new window. Right after your browser.execute_script("return openPage().document;") call, add something like:

browser.switch_to_window("windowName")

I'll let you search for more information and tricks ;)
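When the new window has no name, one option is to compare the window handles before and after the script runs. The sketch below assumes a Selenium WebDriver instance that exposes window_handles and switch_to.window(), which is how newer Selenium releases spell the switch_to_window call used in this thread; the helper name switch_to_new_window is hypothetical.

```python
def switch_to_new_window(driver, handles_before):
    """Switch `driver` to the first window handle not in `handles_before`.

    Returns the handle that was switched to and raises RuntimeError
    if no new window has appeared.
    """
    for handle in driver.window_handles:
        if handle not in handles_before:
            driver.switch_to.window(handle)
            return handle
    raise RuntimeError("no new window was opened")


# Intended use with a live browser session (not executed here):
#   handles_before = set(browser.window_handles)
#   browser.execute_script("openPage()")
#   switch_to_new_window(browser, handles_before)
#   print(browser.page_source)  # now the source of secondpage.html
```

Taking the driver as a parameter keeps the helper independent of which browser is being driven.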