I'm using Selenium in Python to test a data table (not really an HTML table; it's built from multiple divs).
This is what my table looks like:
<div class="products">
<div class="product">
<span class="original-price">20$</span>
<span class="discounted-price">10$</span>
</div>
<div class="product">
<span class="price">20$</span>
</div>
...
</div>
There are multiple products, and some have a discounted price.
This is my script:
products = self.driver.find_elements_by_css_selector('.products > div')
for product in products:
    found_price = True
    try:
        original_price = product.find_element_by_css_selector('.original-price').text
        reduced_price = product.find_element_by_css_selector('.discounted-price').text
    except NoSuchElementException:
        try:
            original_price = product.find_element_by_css_selector('.price').text
            reduced_price = original_price
        except NoSuchElementException:
            found_price = False
    if found_price: check_price(original_price, reduced_price)
But my script runs very slowly. It sends a remote_connection request each time find_element_by_css_selector is called, like this one:
2018-02-27 13:48:08 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62147/session/14902b71a0f812fa74f81524f0eb1386/elements {"using": "css selector", "sessionId": "14902b71a0f812fa74f81524f0eb1386", "value": ".products > div .original-price"}
2018-02-27 13:48:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Any ideas on how to improve its performance?
Thanks!
One of the parameters of the remote WebDriver is keep_alive. As the doc says:
keep_alive - Whether to configure remote_connection.RemoteConnection to use
HTTP keep-alive. Defaults to False.
Keeping the connection alive would improve the speed as it does not have to connect on every find request.
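For example, a minimal sketch of enabling it when constructing a remote driver (the server URL and capabilities below are just placeholders, not taken from your setup):

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# keep_alive=True makes RemoteConnection reuse a single HTTP connection
# instead of opening a new one for every WebDriver command
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.CHROME,
    keep_alive=True)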
It is still inconclusive why you feel your script runs very slowly. As you mentioned, your code sending a remote_connection request each time find_element_by_css_selector is called is exactly how it is defined in the WebDriver W3C Candidate Recommendation.
A small test with the Search Box of the Google Home Page, i.e. https://www.google.co.in, with all the major variants of WebDrivers and web browsers reveals the following:
Each time you search for a WebElement in the HTML DOM as follows:
product.find_element_by_css_selector('.original-price')
The following request is generated:
[selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62147/session/14902b71a0f812fa74f81524f0eb1386/elements {"using": "css selector", "sessionId": "14902b71a0f812fa74f81524f0eb1386", "value": ".products > div .original-price"}
[selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
On a successful search, the following response is sent back from the web browser:
webdriver::server DEBUG <- 200 OK {"value":{"element-6066-11e4-a52e-4f735466cecf":"6e35faa4-233f-400c-a6c7-6a66b54a69e5"}}
You can find a detailed discussion in Values returned by webdrivers.
However, there will be some deviation in performance depending on the Locator Strategy you use.
There have been quite some experiments and benchmarking about the performance aspects of the locators. You can find some discussions here :
Locator Performance Metrics using Selenium
Css Vs. X Path
Css Vs. X Path, Under a Microscope
Css Vs. X Path, Under a Microscope (Part 2)
What is the difference between css-selector & Xpath? which is better(according to performance & for cross browser testing)?
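Beyond the locator choice, if the sheer number of per-element round-trips is the bottleneck, one option (not covered in the answer above, just a sketch based on the question's markup) is to collect all the prices in a single execute_script call:

# Gather every product's prices in one JavaScript round-trip instead of
# issuing two or three find_element calls per product.
price_pairs = self.driver.execute_script("""
    return Array.from(document.querySelectorAll('.products > div')).map(function (p) {
        var orig = p.querySelector('.original-price') || p.querySelector('.price');
        var disc = p.querySelector('.discounted-price') || orig;
        return orig ? [orig.textContent, disc.textContent] : null;
    }).filter(function (x) { return x; });
""")
for original_price, reduced_price in price_pairs:
    check_price(original_price, reduced_price)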
Related
I've been building this scraper (with some massive help from users here) to get data on some companies' debt with the public sector and I've been able to get to the site, input the desired
search parameters and scrape the first 50 results (out of 300). The problem I've encountered is that this page's pagination has the following characteristics:
It does not possess a next page button
The URL doesn't change with the pagination
The pagination is done with a Javascript script
Here's the code so far:
path_driver = "C:/Users/CS330584/Documents/Documentos de Defesa da Concorrência/Automatização de Processos/chromedriver.exe"
website = "https://sat.sef.sc.gov.br/tax.NET/Sat.Dva.Web/ConsultaPublicaDevedores.aspx"
value_search = "300"
final_table = []
driver = webdriver.Chrome(path_driver)
driver.get(website)
search_max = driver.find_element_by_id("Body_Main_Main_ctl00_txtTotalDevedores")
search_max.send_keys(value_search)
btn_consult = driver.find_element_by_id("Body_Main_Main_ctl00_btnBuscar")
btn_consult.click()
driver.implicitly_wait(10)
cnpjs = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[1]")
empresas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[2]")
dividas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[3]")
for i in range(len(empresas)):
    temp_data = {'CNPJ': cnpjs[i].text,
                 'Empresas': empresas[i].text,
                 'Divida': dividas[i].text}
    final_table.append(temp_data)
How can I navigate through the pages in order to scrape their data? Thank you all for the help!
If you inspect the page and look at what happens when you click on one of the page-number links, you'll see that the anchor's href is actually executing some JavaScript.
If you take that JavaScript call out of the href attribute (and decode the &quot; entities back into quotation marks), you'll see two function calls that look like this:
GridView_ScrollToTop("Body_Main_Main_grpDevedores_gridView");
__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$5');
Now, I didn't take the time to analyze these functions in depth, but you don't really need to. You can see that the first call causes the browser to scroll to the top, and the second call actually causes the next page of data to load on the page. For your purposes, you only care about the second call.
You can mess around with this in the browser: just perform your search and then, in the JS console, paste in the JS call, exchanging the number for the page you want to look at.
If you can do it via JS in the console on the webpage, you can do it with Selenium. You would do something like this to "click" each tab:
for i in range(1, 7):
    js = "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$" + str(i) + "');"
    driver.execute_script(js)
    # do scraping stuff
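A rough sketch of combining the postback with the scraping loop is shown below. The element ids come from the question's code; waiting for the first row of the grid to go stale before re-reading it is an assumption about how the postback refreshes the table, so adjust as needed:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

grid_xpath = "//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr"
for page in range(2, 7):  # page 1 is already displayed after the search
    old_row = driver.find_element(By.XPATH, grid_xpath + "[1]")
    driver.execute_script(
        "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$%d');" % page)
    # wait until the old first row is replaced, i.e. the new page has loaded
    WebDriverWait(driver, 10).until(EC.staleness_of(old_row))
    for row in driver.find_elements(By.XPATH, grid_xpath):
        cells = row.find_elements(By.TAG_NAME, "td")
        if len(cells) >= 3:
            final_table.append({'CNPJ': cells[0].text,
                                'Empresas': cells[1].text,
                                'Divida': cells[2].text})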
Hi, I was wondering why, if I have a certain page's URL and use Selenium like this:
webdriver.get(url)
webdriver.page_source
the source code given by Selenium lacks elements that are there when inspecting the page in the browser?
Is it some kind of way the website protects itself from scraping?
Try adding some delay between webdriver.get(url) and webdriver.page_source to let the page load completely.
Generally it should give you the entire page source with all the tags and tag attributes, but this is only applicable to static web pages.
For dynamic web pages, webdriver.page_source will only give you whatever is available in the DOM at that point in time, because the DOM gets updated based on user interaction with the page.
Note that the content of iframes is not included in page_source in any case.
If the site you are scraping is a dynamic website, it takes some time to load, as the JavaScript has to run, do some DOM manipulation, etc., and only after this do you get the full source code of the page.
So it is better to add some delay between your get request and getting the page source.
import time
webdriver.get(url)
# pauses execution for x seconds.
time.sleep(x)
webdriver.page_source
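If you would rather not hard-code a sleep, an explicit wait is usually more reliable. This is only a sketch: "some-id" is a placeholder for any element that appears once the page has finished rendering, and it reuses the question's webdriver variable as the driver object.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for a known element to appear, then read the source
WebDriverWait(webdriver, 10).until(
    EC.presence_of_element_located((By.ID, "some-id")))
html = webdriver.page_source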
The page source might contain just one link to a JavaScript file, and many of the controls you see on the page will have been generated on your side, in your browser, by running that JS code.
The page source is:
<script>
[1,2,3,4,5].map(i => document.write(`<p id="${i}">${i}</p>`))
</script>
Virtual DOM is:
<p id="1">1</p>
<p id="2">2</p>
<p id="3">3</p>
<p id="4">4</p>
<p id="5">5</p>
To get Virtual DOM HTML:
document.querySelector('html').innerHTML
<script>
[1,2,3,4,5].map(i => document.write(`<p id="${i}">${i}</p>`))
console.log(document.querySelector('body').innerHTML)
</script>
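From Selenium, the equivalent of that console call (a one-liner sketch) would be:

# returns the rendered DOM, not the original page source
rendered_html = driver.execute_script("return document.querySelector('html').innerHTML")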
I'm trying to extract a simple title of a product from amazon.com using the id of the span which contains the title.
This is what I wrote:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
title = soup.find(id='productTitle').get_text()
print(title)
I keep getting either None or an empty list, or I can't extract anything at all and it gives me an AttributeError saying that the object I used doesn't have an attribute get_text, which raised another question: how do I get the text of this simple span?
I'd really appreciate it if someone could figure it out and help me.
Thanks in advance.
Problem
Running your code and checking the res value, you would get a 503 error. This means that the service is unavailable (HTTP status 503).
Solution
Following up, using this SO post, it seems that adding headers={"User-Agent": "Defined"} to the get request does work:
res = requests.get(url, headers={"User-Agent": "Defined"})
This will return a 200 (OK) response.
The Twist
Amazon actually checks for web scrapers, and even though you will get a page back, printing the result (print(soup)) will likely show you the following:
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
...
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
The counter
But you can use selenium to simulate a human. A minimal working example for me was the following:
import selenium.webdriver
url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
driver = selenium.webdriver.Firefox()
driver.get(url)
title = driver.find_element_by_id('productTitle').text
print(title)
Which prints out
Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black
A small thing when using Selenium is that it is much slower than the requests library. Also, a new window will pop up showing the page, but luckily we can do something about that by using a headless driver.
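For example, a sketch of running the same scrape headlessly (assuming a reasonably recent Selenium 3 with geckodriver; the --headless flag is the standard Firefox one):

import selenium.webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # no browser window will be shown
driver = selenium.webdriver.Firefox(options=options)
driver.get(url)
print(driver.find_element_by_id('productTitle').text)
driver.quit()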
I am scraping data from a site with a paginated table (max 500 results, 25 results per page). When I use Chrome to "view source" I can see all 500 results; however, once the JS renders in Selenium, only 25 results show when using driver.page_source.
I have tried passing the cookies and headers off to requests, but that's not reliable and I need to stick with Selenium. I have also made a janky solution of clicking through the paginator's next button, but there must be a better way!
So how does one capture the full page source prior to JS rendering using selenium with the python bindings?
There might be a simpler way but it turns out you can do all kinds of asynchronous things from the browser including fetch:
def fetch(url):
    return driver.execute_async_script("""
        (async () => {
            let r = await fetch('""" + url + """')
            arguments[0](await r.text())
        })()
    """)
html = fetch('https://stackoverflow.com/')
Same-origin policy will apply.
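A variant of the same idea that passes the URL as a script argument instead of concatenating it into the JS source (just a sketch; it avoids quoting problems if the URL contains special characters):

def fetch(url):
    # the URL arrives as arguments[0]; the async callback is the last argument
    return driver.execute_async_script("""
        var done = arguments[arguments.length - 1];
        fetch(arguments[0]).then(function (r) { return r.text(); }).then(done);
    """, url)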
I have to parse only the positions and points from this link. That link has 21 listings (I don't actually know what to call them) on it, and each listing has 40 players on it except the last one. Now I have written code which is like this:
from bs4 import BeautifulSoup
import urllib2
def overall_standing():
    url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]

    gDictPlayerPointsInfo = {}
    for url in url_list:
        print url
        header = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url, headers=header)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page)
        table = soup.find("table", {"class": "ladder zebra player-ratings"})
        lCount = 1
        for row in table.find_all("tr"):
            lPlayerName = ""
            lTeamName = ""
            lPosition = ""
            lPoint = ""
            for cell in row.find_all("td"):
                if lCount == 2:
                    lPlayerName = str(cell.get_text()).strip().upper()
                elif lCount == 3:
                    lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
                elif lCount == 4:
                    lPosition = str(cell.get_text().strip())
                elif lCount == 6:
                    lPoint = str(cell.get_text().strip())
                lCount += 1
            if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
                print lTeamName, lPlayerName, lPoint
            if lPlayerName <> "" and lTeamName <> "":
                lStr = lPosition + "," + lPoint
                # if gDictPlayerPointsInfo.has_key(lTeamName):
                #     gDictPlayerPointsInfo[lTeamName].append({lPlayerName: lStr})
                # else:
                gDictPlayerPointsInfo[lTeamName + "," + lPlayerName] = lStr
            lCount = 1

    lfp = open("a.txt", "w")
    for key in gDictPlayerPointsInfo:
        if key.find("RICHMOND"):
            lfp.write(str(gDictPlayerPointsInfo[key]))
    lfp.close()
    return gDictPlayerPointsInfo
# overall_standing()
But the problem is it always gives me the first listing's points and positions and ignores the other 20. How could I get the positions and points for all 21? I have heard Scrapy can do this type of thing pretty easily, but I am not fully familiar with Scrapy. Is there any other way possible than using Scrapy?
This is happening because of how these links are handled by the browser rather than the server: the portion of the link after the # symbol, called the fragment identifier, is never sent to the server. It is processed by the browser and refers to some anchor or JavaScript behaviour, i.e. loading a different set of results.
I would suggest two approaches: either find a way to use a link the server can evaluate, so you can keep using something like Scrapy, or use a webdriver like Selenium.
Scrapy
Your first step is to identify the JavaScript load call, often AJAX, and use those links to pull your information. These are calls to the site's DB. This can be done by opening your web inspector and watching the network traffic as you click through to the next page of search results.
After the click, we can see that there is a new call to this URL:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40
This URL returns a JSON file which can be parsed, and you can even shorten your steps, as it looks like you can control more of what information is returned to you.
You could either write a method to generate a series of links for you:
def gen_url(page_no):
    return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"
and then, for example, use Scrapy with the seed list:
seed = [gen_url(i) for i in range(1, 22)]  # one URL per page, pages 1 through 21
or you can try tweaking the url parameters and see what you get, maybe you can get multiple pages at a time:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200
I changed the pageSize parameter at the end to 200, since it seems this corresponds directly to the number of results returned.
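For example, a sketch of pulling the data straight from that JSON endpoint with the same urllib2 approach the question already uses (the query parameters come from the observed request above; the structure of the JSON payload is not shown here, so inspect it before picking out the position and points fields):

import json
import urllib2

api_url = ("http://www.afl.com.au/api/cfs/afl/playerRatings"
           "?roundId=CD_R201401408&pageNum=1&pageSize=200")
req = urllib2.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.load(urllib2.urlopen(req))
print data  # inspect the structure, then pull out each player's position and points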
NOTE: There is a chance this method will not work, as sites sometimes block their data API from outside usage by screening the IP the request comes from.
If this is the case, you should go with the following approach.
Selenium (or other webdriver)
Using something like Selenium, which is a webdriver, you can use what is loaded into a browser to evaluate data that is loaded after the server has returned the webpage.
There is some initial setup that needs to be done in order for Selenium to be usable, but it is a very powerful tool once you have it working.
A simple example of this would be:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
You will see a Python-controlled Firefox browser (this can be done with other browsers too) open on your screen and load the URL you provide, then follow the commands you give it, which can even be done from a shell (useful for debugging), and you can search and parse HTML in the same way you would with Scrapy (the code below continues from the previous snippet).
If you want to perform something like clicking the next page button:
driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
That expression may need some tweaking, but it intends to find all the li elements with class='page' that are inside the div with class='pagination'. The // means a shortened path between elements; your other alternative would be something like /html/body/div/div/... until you get to the one in question, which is why //div/... is useful and appealing.
For specific help and reference on locating elements, see their documentation page.
My usual method is trial and error for this, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy. After setting up the driver as above, I usually try and build my expression:
Say you have an html structure like:
<html>
  <head></head>
  <body>
    <div id="container">
      <div id="info-i-want">
        treasure chest
      </div>
    </div>
  </body>
</html>
I would start with something like:
>>> print driver.find_element_by_xpath("//body").get_attribute("outerHTML")
<body>
  <div id="container">
    <div id="info-i-want">
      treasure chest
    </div>
  </div>
</body>
>>> print driver.find_element_by_xpath("//div[@id='container']").get_attribute("outerHTML")
<div id="container">
  <div id="info-i-want">
    treasure chest
  </div>
</div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").get_attribute("outerHTML")
<div id="info-i-want">
  treasure chest
</div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").text
treasure chest
>>> # BOOM TREASURE!
Usually it will be more complex, but this is a good and often necessary debugging tactic.
Back to your case, you could then save them out into an array:
links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
and then one by one click them, scrape the new data, click the next one:
import time
from selenium import webdriver

driver = None
try:
    driver = webdriver.Firefox()
    driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
    #
    # Scrape the first page
    #
    links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
    for link in links:
        link.click()
        #
        # scrape the next page
        #
        time.sleep(1)  # pause for a time period to let the data load
finally:
    if driver:
        driver.close()
It is best to wrap it all in a try...finally type block to make sure you close the driver instance.
If you decide to delve deeper into the selenium approach, you can refer to their docs which have excellent and very explicit documentation and examples.
Happy scraping!