I am scraping pages like this one:
site to scrape
I am using Python with Selenium and connecting through ProxyCrawler. One of the things I need to do is follow all the links that say For details, click here and grab the text there. The links look like this:
<a href='javascript:void(0)' onclick=javascript:submitLink('TIDFT/AE/VI/IS/ID100201','KQ','KQ')>For details, click here</a>
As you can see, each link's URL gets constructed by a function called submitLink. The function is not defined in the page source; rather, it is defined in an external .js file referenced in the head. I tried injecting the file into the DOM to make the function run, but have failed so far. For more details, see my question here.
So I'm trying instead to click each link to make the script run. However, this doesn't work with ProxyCrawler. If I connect directly, the links work fine but obviously that exposes my scraper.
Here is the minimum workable code:
from selenium import webdriver
from urllib import parse
apikey = MY_KEY
scrapeurl = 'https://www.timaticweb.com/cgi-bin/tim_website_client.cgi?SpecData=1&VISA=&page=both&NA=' + \
'ZW' + '&DE=' + 'AE' + '&user=KQ&subuser=KQ'
selenurl = 'https://api.proxycrawl.com/?token=' + apikey + '&url=' + parse.quote(scrapeurl)
DRIVER_PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(executable_path = DRIVER_PATH)
driver.get(selenurl)
#driver.get(scrapeurl)
link = driver.find_element_by_xpath(".//a[contains(@onclick, 'submitLink')]")
link.click()
The above works if I use scrapeurl. It doesn't work with selenurl. Is there a way to use ProxyCrawler and still be able to click on those links?
Related
I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:
import time
from selenium import webdriver
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)
# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)
At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.
content = driver.page_source
with open('webpage.html', 'w') as f:
    f.write(content)
I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box and does not provide a way to click it (as two commenters point out).
Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?
UPDATE - Follow up question for James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on the right).
This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, yet the text at issue only shows up in the browser-rendered view of the page saved by James' script:
<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>
Any idea why this is happening?
As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so to do so you could instead use an external automation library like pyautogui.
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
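For instance, on macOS the save shortcut is Cmd+S rather than Ctrl+S. A minimal sketch of a platform check, assuming pyautogui's 'command' key name:
import sys
import pyautogui

# pick the save-page shortcut for the current OS
if sys.platform == 'darwin':
    pyautogui.hotkey('command', 's')   # macOS
else:
    pyautogui.hotkey('ctrl', 's')      # Linux / Windows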
Full code, in which I also added conditional waits to improve speed:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)
# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.
Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.
This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.
from selenium import webdriver
import chromedriver_binary
from lxml import html
import requests
import os
driver = webdriver.Chrome()
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
base = 'https://blast.ncbi.nlm.nih.gov/'
driver.get(URL)
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
content = driver.page_source
# write the page content
os.mkdir('page')
with open('page/page.html', 'w') as fp:
    fp.write(content)
# download the referenced files to the same path as in the html
sess = requests.Session()
sess.get(base) # sets cookies
# parse html
h = html.fromstring(content)
# get css/js files loaded in the head
for hr in h.xpath('head//@href'):
    if not hr.startswith('http'):
        local_path = 'page/' + hr
        hr = base + hr
        res = sess.get(hr)
        if not os.path.exists(os.path.dirname(local_path)):
            os.makedirs(os.path.dirname(local_path))
        with open(local_path, 'wb') as fp:
            fp.write(res.content)
# get image/js files from the body. skip anything loaded from outside sources
for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
        continue
    local_path = 'page/' + src
    print(local_path)
    src = base + src
    res = sess.get(src)
    if not os.path.exists(os.path.dirname(local_path)):
        os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
        fp.write(res.content)
You should have a folder called page with a file called page.html in it with the content you are after.
Inspired by FThompson's answer above, I came up with the following tool that can download full/complete html for a given page url (see: https://github.com/markfront/SinglePageFullHtml)
UPDATE - following up on Max's suggestion, below are the steps to use the tool:
Clone the project, then run maven to build:
$> git clone https://github.com/markfront/SinglePageFullHtml.git
$> cd ~/git/SinglePageFullHtml
$> mvn clean compile package
Find the generated jar file in target folder: SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar
Run the jar in command line like:
$> java -jar ./target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar <page_url>
The result file name will have the prefix "FP" followed by the hashcode of the page url, with the file extension ".html". It will be found in the folder "/tmp" (which you can get via System.getProperty("java.io.tmpdir") in Java); if it is not there, look in your home directory (System.getProperty("user.home")).
The result file will be a big fat self-contained html file that includes everything (css, javascript, images, etc.) referred to by the original html source.
I'd advise you to try SikuliX, which is an image-based automation tool for operating any widget within the PC OS. It supports Python syntax, runs from the command line, and is maybe the simplest way to solve your problem.
All you need to do is give it a screenshot and call the SikuliX script from your Python automation script (with os.system("xxxx") or subprocess...).
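For example, a minimal sketch of kicking off a SikuliX script from Python via subprocess; the jar path and script name here are hypothetical placeholders, and -r is SikuliX's run-script option:
import subprocess

SIKULIX_JAR = "/opt/sikulix/sikulixide-2.0.5.jar"   # assumed install location
SCRIPT = "/home/me/scripts/save_page.sikuli"        # a script recorded in the SikuliX IDE

# run the SikuliX script and report its exit status
ret = subprocess.call(["java", "-jar", SIKULIX_JAR, "-r", SCRIPT])
print("SikuliX exited with", ret)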
I'm currently working on a learning project for web scraping.
I've picked my site:
https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers#Page0
On this page, there is a button at the bottom that displays the next 10 products; without this button being clicked, the next batch of products is not shown. However, the URL does not change when the button is clicked.
I wanted to ask how I can solve this using the requests module.
My code is below:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")
c = r.content
soup = BeautifulSoup(c,"html.parser")
all=soup.find_all("div",{"class":"product"})
for item in all:
    print(item.find({"h2": "productInfo"}).text.replace('\h2','').replace(" ", ""))
    print(item.find("span",{"class": "condition"}).text + " " + item.find("span",{"class": "value"}).text)
    try:
        print(item.find_all("span",{"class": "condition"})[1].text + " " + item.find_all("span",{"class": "value"})[1].text)
    except:
        print("No Preowned")
    print(" ")
Try this code to get all the items available on that page. You can use Chrome dev tools to find the URL below, which has a page-number parameter you can increment.
from bs4 import BeautifulSoup
import requests
page_link = "https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers&pageNumber={}&pageMode=true"
page_no = 0
while True:
    page_no += 1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text, 'lxml')
    container = soup.select(".productInfo h2")
    if len(container) <= 1:
        break
    for content in container:
        print(content.text)
Output of the last few titles:
ARK Survival Evolved
Kingdom Come Deliverance Special Edition
Halo 5 Guardians
Sonic Forces
The Elder Scrolls Online: Summerset - Digital
You need to use a tool that supports javascript/jquery execution, e.g. Selenium (you can still parse the rendered page source with BeautifulSoup afterwards).
The problem you're facing is that the content you try to access gets created dynamically via javascript when the mentioned button is clicked.
When you request the page, the additional html elements you want to read from have not been created yet, so BeautifulSoup can't find them.
Using selenium you can click buttons/fill out forms and much more. You can also wait for the server to create the content you want to access.
The documentation of Selenium should be self-explanatory; a rough sketch of the approach follows.
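As a sketch of that approach (the "show more" button selector below is an assumption; inspect the page to find its real id or class):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/"
           "?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")

# click the "show more" button a few times, waiting for it to be clickable each time
for _ in range(3):
    try:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, ".showMoreProducts"))  # hypothetical selector
        )
        button.click()
    except Exception:
        break  # button gone or never appeared, so stop loading more

# the extra products are now in the DOM, so BeautifulSoup can see them
soup = BeautifulSoup(driver.page_source, "html.parser")
for item in soup.find_all("div", {"class": "product"}):
    title = item.find("h2")
    if title:
        print(title.text.strip())

driver.quit()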
I am very new to web scraping with Python. On the web page I am trying to scrape, I can enter the string 'ABC' in the text box and click search. This gives me the details of 'ABC', but under the same URL; the URL does not change. I am trying to scrape those result details.
I have gotten as far as the "search" click, but I do not know how to capture the results of the search (the details for the search string 'ABC'). Please suggest how I could achieve this.
from selenium import webdriver
import webbrowser
new = 2 # open in a new tab, if possible
path_to_chromedriver = 'C:/Tech-stuffs/chromedriver/chromedriver.exe' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
url = 'https://www.federalreserve.gov/apps/mdrm/data-dictionary'
browser.get(url)
browser.find_element_by_xpath('//*[@id="form0"]/table/tbody/tr[2]/td/label[2]').click()
browser.find_element_by_xpath("//select[@id='SelectedReportForm']/option[@value='1']").click()
browser.find_element_by_xpath('//*[@id="Search"]').click()
Use find_elements_by_xpath() with an XPath that matches all of the search results, then iterate through them with a for loop and print each result's text. That should, at the bare minimum, get what you want.
results = browser.find_elements_by_xpath('//table//tr')
for result in results:
    print "%s\n" % result.text
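If the results are rendered asynchronously after the click, it can help to wait for the rows to appear before reading them. A minimal sketch, reusing the //table//tr locator from above (it may need adjusting for the actual results markup):
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 20 seconds for at least one result row to be present
WebDriverWait(browser, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, '//table//tr'))
)
results = browser.find_elements_by_xpath('//table//tr')
for result in results:
    print(result.text)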
I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search result pages and scrapes each page for all of the URLs. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just loop over the URL to capture each search result, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: while you could try to access the POST request directly and find a way to reach the next page, as I suggested in my comment, if an alternative is acceptable, Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()
url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)
# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href')+'\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.
These are the steps I need to automate:
1) Log in
2) Select an option from a drop down menu (To acces a list of products)
3) search something on the search field (The product we are looking for)
4) click a link (To open up the product's options)
5) click another link(To compile all the .pdf files relevant to said product in a bigger .pdf)
6) wait for a .pdf to load and then download it.(Save the .pdf on my machine with the name of the product as the file name)
I want to know if this is possible. If it is, where can I find how to do it?
Is it pivotal that there is actual clicking involved? If you're just looking to download PDFs then I suggest you use the Requests library. You might also want to consider using Scrapy.
In terms of searching on the site, you may want to use Fiddler to capture the HTTP POST request and then replicate that in Python.
Here is some code that might be useful as a starting place - these functions would login to a server and download a target file.
import requests

main_headers = {}  # fill in any request headers you need
proxies = {}       # fill in proxy settings if required

def login():
    login_url = 'http://www.example.com'
    payload = 'usr=username&pwd=password'
    connection = requests.Session()
    post_login = connection.post(data=payload,
                                 url=login_url,
                                 headers=main_headers,
                                 proxies=proxies,
                                 allow_redirects=True)
    return connection

def download(connection):
    directory = "C:\\example"
    url = "http://example.com/download.pdf"
    filename = directory + '\\' + url[url.rfind("/")+1:]
    r = connection.get(url=url,
                       headers=main_headers,
                       proxies=proxies)
    file_size = int(r.headers["Content-Length"])
    block_size = 1024
    mode = 'wb'
    print "\tDownloading: %s [%sKB]" % (filename, int(file_size/1024))
    if r.status_code == 200:
        with open(filename, mode) as f:
            for chunk in r.iter_content(block_size):
                f.write(chunk)
For static sites you can use the mechanize module, available from PyPi, it does everything you want - except it does not run Javascript and thus does not work on dynamic websites. Also it is strictly Python 2 only.
easy_install mechanize
For something way more complicated you might have to use python bindings for Selenium (install instructions) to control an external browser; or use spynner that embeds a web browser. However these 2 are far more difficult to set up.
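If you go the mechanize route, a rough sketch might look like the following (Python 2; the login URL, form field names, and PDF URL are hypothetical placeholders):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # ignore robots.txt if the site disallows bots
br.open("http://www.example.com/login")

br.select_form(nr=0)                 # pick the first form on the login page
br["username"] = "my_user"           # hypothetical field names
br["password"] = "my_password"
br.submit()

# download the PDF with the now-authenticated browser session
br.retrieve("http://www.example.com/files/product.pdf", "product.pdf")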
Sure, just use selenium webdriver
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://your-website.com')
search_box = browser.find_element_by_css_selector('input[id=search]')
search_box.send_keys('my search term')
browser.find_element_by_css_selector('input[type=submit]').click()
That would get you through the visit-the-page, enter-a-search-term, click-search stage of your problem. Read through the API for the rest.
Mechanize has problems at the moment because so much of a web page is generated via JavaScript, and if it can't render that, you can't do much with the page.
It helps if you understand CSS selectors; otherwise you can find elements by id, XPath, or other locators...
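For illustration, the same search box from the snippet above could be located in any of these ways (the id and XPath values are assumptions based on the earlier CSS selector):
# three equivalent ways to locate the same (assumed) search input
search_box = browser.find_element_by_css_selector('input#search')
search_box = browser.find_element_by_id('search')
search_box = browser.find_element_by_xpath('//input[@id="search"]')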