I've been building this scraper (with some massive help from users here) to get data on some companies' debt with the public sector. I've been able to get to the site, input the desired search parameters, and scrape the first 50 results (out of 300). The problem I've encountered is that this page's pagination has the following characteristics:
It does not possess a next page button
The URL doesn't change with the pagination
The pagination is done with a JavaScript script
Here's the code so far:
from selenium import webdriver

path_driver = "C:/Users/CS330584/Documents/Documentos de Defesa da Concorrência/Automatização de Processos/chromedriver.exe"
website = "https://sat.sef.sc.gov.br/tax.NET/Sat.Dva.Web/ConsultaPublicaDevedores.aspx"
value_search = "300"
final_table = []

driver = webdriver.Chrome(path_driver)
driver.get(website)

# fill in the maximum number of results and run the search
search_max = driver.find_element_by_id("Body_Main_Main_ctl00_txtTotalDevedores")
search_max.send_keys(value_search)
btn_consult = driver.find_element_by_id("Body_Main_Main_ctl00_btnBuscar")
btn_consult.click()
driver.implicitly_wait(10)

# grab the three columns of the results grid (first 50 rows only)
cnpjs = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[1]")
empresas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[2]")
dividas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[3]")

for i in range(len(empresas)):
    temp_data = {'CNPJ': cnpjs[i].text,
                 'Empresas': empresas[i].text,
                 'Divida': dividas[i].text}
    final_table.append(temp_data)
How can I navigate through the pages in order to scrape their data? Thank you all for the help!
If you inspect the page and look at what happens when you click on one of the page links, you'll see in the <a> tag that they're actually executing some JavaScript. It looks like this:
<a href="javascript:GridView_ScrollToTop(&quot;Body_Main_Main_grpDevedores_gridView&quot;);__doPostBack(&quot;ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView&quot;,&quot;Page$5&quot;);">5</a>
But if you take that JavaScript call out of the href attribute (and turn the &quot; entities back into quotation marks), you'll see two function calls that look like this:
GridView_ScrollToTop("Body_Main_Main_grpDevedores_gridView");
__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$5');
Now, I didn't take the time to analyze these functions in depth, but you don't really need to. You can see that the first call causes the browser to scroll to the top, and the second call actually causes the next page of data to load on the page. For your purposes, you only care about the second call.
You can mess around with this in the browser: just perform your search and then, in the JS console, paste in the JS call, exchanging the number for the page you want to look at.
If you can do it via JS in the console on the webpage, you can do it with Selenium. You would do something like this to "click" each tab:
for i in range(1, 7):
    js = "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$" + str(i) + "');"
    driver.execute_script(js)
    # do scraping stuff
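Putting that together with the scraping code from the question, a full pagination loop might look like the sketch below. The fixed sleep after each postback is an assumption; an explicit wait on the grid element would be more robust.
import time

final_table = []
num_pages = 6  # 300 results at 50 per page, per the question

for page in range(1, num_pages + 1):
    if page > 1:
        # trigger the ASP.NET postback that loads the requested page of the grid
        js = "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$" + str(page) + "');"
        driver.execute_script(js)
        time.sleep(2)  # crude wait for the grid to refresh

    cnpjs = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[1]")
    empresas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[2]")
    dividas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[3]")

    for i in range(len(empresas)):
        final_table.append({'CNPJ': cnpjs[i].text,
                            'Empresas': empresas[i].text,
                            'Divida': dividas[i].text})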
The webpage is: https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0
As you can see, there are 25 items in the selling part of this page; when you click one of them, it opens a new tab and shows you that specific item's details.
Now I want to make a program to get those 25 item URLs and save them in a list. My problem is that, as you can see in the page inspector, their tags are <div> elements when they should be <a> tags, and I can't find any 'href' attribute related to them.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0')
link = driver.find_elements_by_tag_name('a')
link2 = [l.get_attribute('href') for l in link]
I thought I could do it with the code above, but the problem is what I described. Any suggestions?
Looks like you are trying to scrape a page that is powered by React. There are no href attributes because JavaScript is powering all the linking. Your best bet is to use Selenium to execute a click on each of the div elements, switch to the newly opened tab, and use something like this code to get the URL of the page it has taken you to:
import time

links = driver.find_elements_by_class_name('card-header')
urls = []

for link in links:
    link.click()  # opens the item in a new tab
    driver.switch_to.window(driver.window_handles[1])
    url = driver.current_url
    urls.append(url)
    driver.close()  # close the item tab
    driver.switch_to.window(driver.window_handles[0])  # back to the main tab
    time.sleep(1)
Note that the code closes the new tab each time and goes back to the main tab. I added time.sleep() so it doesn't go too fast.
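If the new tab sometimes isn't open yet when the script switches to it, an explicit wait on the window count is a more reliable alternative to the fixed sleep. A sketch (the class name is taken from the code above; everything else is standard Selenium):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

urls = []
for link in driver.find_elements_by_class_name('card-header'):
    link.click()
    # wait until the second window/tab actually exists before switching to it
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    driver.switch_to.window(driver.window_handles[1])
    urls.append(driver.current_url)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])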
from selenium.common.exceptions import NoSuchElementException

for i in range(2, 12):
    try:
        # q = '//*[@id="Content"]/tbody/tr[18]/td/table/tbody/tr/td[' + str(i) + ']'
        # continue_link = browser.find_element_by_xpath(q).click()
        q = str(i)
        continue_link = browser.find_element_by_partial_link_text('Page$' + q).click()
    except NoSuchElementException:
        break  # stop when the page link cannot be found
I am able to use the commented lines, but there is a limitation when it comes to page 11: I get stuck at the "..." symbol that appears after pages 9 and 10.
So instead of that I chose to use the link href, but the next page is loaded using the JS __doPostBack method. I need to go through each page until the last one. Please help.
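Since the links trigger __doPostBack, one option is to skip the visible page links entirely and execute the postback yourself, the same way as in the first answer above. A sketch; the control name below is a placeholder, so copy the real one from the first __doPostBack argument in any page link's href on your site:
import time

GRID_CONTROL = 'ctl00$YourGridControlID'  # placeholder control name, not the real one

for i in range(2, 12):
    # fire the same postback the page links would fire, regardless of the "..." link
    browser.execute_script("__doPostBack(arguments[0], arguments[1]);", GRID_CONTROL, 'Page$' + str(i))
    time.sleep(2)  # crude wait for the grid to reload; an explicit wait would be more robust
    # scrape the current page here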
So I'm trying to scrape census data from a website that changes dynamically when a county is selected from the drop down menu. It looks like this:
<select id="cat_id_select_GEO" onchange="changeHeaderSelection('GEO');">
  <option value="0500000US01001" selected="selected">Autauga County, Alabama</option>
</select>
a link
So from the research I've done, it sounds like I need to make some sort of GET request (Selenium?), but I am completely lost on how to do this. I know how to get the data I want once I've made the county selection, but I've never had to scrape something where the website changes dynamically (i.e. the URL doesn't change).
I understand that some may find this to be a simple question... but I've read numerous other similar questions and would greatly benefit from someone walking me through an example, and/or directing me to a solid guide.
This is what I've been messing around with so far. I can see it kind of works at selecting the values... but it spits out this error: Message: stale element reference: element is not attached to the page document
(Session info: chrome=74.0.3729.169)
from selenium import webdriver
from selenium.webdriver.support.ui import Select

for index, row in StateURLs.iterrows():
    url = row['URL']
    state = row['STATE']
    driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
    driver.get(url)
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    options = select_county.options
    for index in range(0, len(options) - 1):
        select_county.select_by_index(index)
I also would love help on how to then hand these pages to BeautifulSoup so I can scrape each page after the selection is made.
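One common pattern for the stale element error is to re-locate the <select> on each iteration, since the old reference is invalidated when the page re-renders; the updated page can then be handed to BeautifulSoup via driver.page_source. A sketch under those assumptions:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
driver.get(url)  # one of the state URLs from StateURLs

num_options = len(Select(driver.find_element_by_id('cat_id_select_GEO')).options)
for i in range(num_options):
    # re-locate the dropdown each time so we never hold a stale reference
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    select_county.select_by_index(i)
    time.sleep(2)  # crude wait for the page to update; an explicit wait would be more robust
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... pull whatever county-level data you need out of `soup` here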
The main landing page makes GET requests with a query string; these return a JSON string containing the info that is first returned when you submit your query, including further URLs that are listed on the results page.
import requests
search_term = 'searchTerm: Autauga County, Alabama'
search_term = search_term.replace(' ','+')
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=').json()
Here is an example of that json
I can generate the correct URL to use in the browser, which returns all that data as JSON, but I can't seem to configure requests so that it works. Perhaps someone else can pick this up and work it out; I will look again tomorrow.
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=', allow_redirects= True).json()
url = 'https://factfinder.census.gov' + r['CFMetaData']['measuresAndLinks']['links']['2017 American Community Survey'][0]['url']
code = url.split('/')[-2]
url = 'https://factfinder.census.gov/tablerestful/tableServices/renderProductData?renderForMap=f&renderForChart=f&pid=ACS_17_5YR_{}&prodToReplace=ACS_16_5YR_{}&log=t&_ts=576607332612'.format(code, code)
I have to parse only the positions and points from this link. That link has 21 listings (I don't actually know what to call them) on it, and each listing has 40 players on it except the last one. Now I have written code which looks like this:
from bs4 import BeautifulSoup
import urllib2

def overall_standing():
    url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]

    gDictPlayerPointsInfo = {}
    for url in url_list:
        print url
        header = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url, headers=header)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page)
        table = soup.find("table", {"class": "ladder zebra player-ratings"})
        lCount = 1
        for row in table.find_all("tr"):
            lPlayerName = ""
            lTeamName = ""
            lPosition = ""
            lPoint = ""
            for cell in row.find_all("td"):
                if lCount == 2:
                    lPlayerName = str(cell.get_text()).strip().upper()
                elif lCount == 3:
                    lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
                elif lCount == 4:
                    lPosition = str(cell.get_text().strip())
                elif lCount == 6:
                    lPoint = str(cell.get_text().strip())
                lCount += 1
            if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
                print lTeamName, lPlayerName, lPoint
            if lPlayerName <> "" and lTeamName <> "":
                lStr = lPosition + "," + lPoint
                # if gDictPlayerPointsInfo.has_key(lTeamName):
                #     gDictPlayerPointsInfo[lTeamName].append({lPlayerName:lStr})
                # else:
                gDictPlayerPointsInfo[lTeamName + "," + lPlayerName] = lStr
            lCount = 1

    lfp = open("a.txt", "w")
    for key in gDictPlayerPointsInfo:
        if key.find("RICHMOND"):
            lfp.write(str(gDictPlayerPointsInfo[key]))
    lfp.close()

    return gDictPlayerPointsInfo

# overall_standing()
But the problem is it always gives me the first listing's points and positions; it ignores the other 20. How can I get the positions and points for all 21 pages? Now, I've heard scrapy can do this type of thing pretty easily, but I am not fully familiar with scrapy. Is there any way to do it other than using scrapy?
This is happening because these links are not really handled by the server: the portion of the link after the # symbol, called the fragment identifier, is processed by the browser and refers to an anchor or to some JavaScript behavior, i.e. loading a different set of results.
I would suggest two approaches: either find a link that the server can evaluate, so that you can continue using scrapy, or use a webdriver like selenium.
Scrapy
Your first step is to identify the JavaScript load call, often AJAX, and use those links to pull your information. These are calls to the site's DB. This can be done by opening your web inspector and watching the network traffic as you click the next search result page:
and then, after the click, we can see that there is a new call to this URL:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40
This URL returns a JSON file which can be parsed, and you can even shorten your steps, as it looks like you can control what information is returned to you.
You could either write a method to generate a series of links for you:
def gen_url(page_no):
    return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"
and then, for example, use scrapy with the seed list:
seed = [gen_url(i) for i in range(1, 22)]  # pages 1 through 21
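A minimal scrapy spider built on that seed list might look like the sketch below. The JSON field names ('playerRatings', 'position', 'ratingPoints') are assumptions, since the real response structure isn't shown here, so inspect one response and adjust the keys accordingly:
import json
import scrapy

class PlayerRatingsSpider(scrapy.Spider):
    name = "player_ratings"
    # pages 1 through 21, 40 players per page
    start_urls = [
        "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=%d&pageSize=40" % i
        for i in range(1, 22)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # assumed key names -- check them against a real JSON response
        for player in data.get("playerRatings", []):
            yield {
                "position": player.get("position"),
                "points": player.get("ratingPoints"),
            }
You could run it with something like scrapy runspider spider.py -o ratings.json.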
or you can try tweaking the url parameters and see what you get, maybe you can get multiple pages at a time:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200
I changed the pageSize parameter at the end to 200, since it seems this corresponds directly to the number of results returned.
NOTE: There is a chance this method will not work, as sites sometimes block their data API from outside usage by screening the IP addresses requests come from.
If this is the case you should go with the following approach.
Selenium (or other webdriver)
Using something like selenium which is a webdriver, you can use what is loaded into a browser to evaluate data that is loaded after the server has returned the webpage.
There is some initial setup needed in order for selenium to be usable, but it is a very powerful tool once you have it working.
A simple example of this would be:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
You will see a Python-controlled Firefox browser (this can be done with other browsers too) open on your screen and load the URL you provide. It will then follow the commands you give it, which can even be done from a shell (useful for debugging), and you can search and parse HTML in the same way you would with scrapy (code continued from the previous code section...).
If you want to perform something like clicking the next page button:
driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
That expression may need some tweaking, but it intends to find all the li elements with class='page' that are inside the div with class='pagination'. The // means any descendant path between elements; your other alternative would be something like /html/body/div/div/... until you get to the one in question, which is why //div/... is useful and appealing.
For specific help and reference on locating elements see their page
My usual method is trial and error for this, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy. After setting up the driver as above, I usually try and build my expression:
Say you have an html structure like:
<html>
  <head></head>
  <body>
    <div id="container">
      <div id="info-i-want">
        treasure chest
      </div>
    </div>
  </body>
</html>
I would start with something like:
>>> print driver.get_element_by_xpath("//body")
'<body>
<div id="container">
<div id="info-i-want">
treasure chest
</div>
</div>
</body>'
>>> print driver.get_element_by_xpath("//div[#id='container']")
<div id="container">
<div id="info-i-want">
treasure chest
</div>
</div>
>>> print driver.get_element_by_xpath("//div[#id='info-i-want']")
<div id="info-i-want">
treasure chest
</div>
>>> print driver.get_element_by_xpath("//div[#id='info-i-want']/text()")
treasure chest
>>> # BOOM TREASURE!
Usually it will be more complex, but this is a good and often necessary debugging tactic.
Back to your case, you could then save them out into an array:
links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
and then one by one click them, scrape the new data, click the next one:
import time
from selenium import webdriver

driver = None
try:
    driver = webdriver.Firefox()
    driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
    #
    # Scrape the first page
    #
    links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
    for link in links:
        link.click()
        #
        # scrape the next page
        #
        time.sleep(1)  # pause for a time period to let the data load
finally:
    if driver:
        driver.close()
It is best to wrap it all in a try...finally type block to make sure you close the driver instance.
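To fill in the "scrape the first page" / "scrape the next page" placeholders above, something like the following sketch could run after the initial load and again after each click. The cell indices mirror the ones used in the question's BeautifulSoup code, so treat them as assumptions to verify against the live table:
def scrape_current_page(driver, results):
    # rows of the ratings table shown on the currently loaded page
    rows = driver.find_elements_by_xpath("//table[contains(@class, 'player-ratings')]//tr")
    for row in rows:
        cells = row.find_elements_by_tag_name("td")
        if len(cells) >= 6:
            # assumed column layout, based on the question: 2nd=player, 4th=position, 6th=points
            results.append({
                "player": cells[1].text.strip(),
                "position": cells[3].text.strip(),
                "points": cells[5].text.strip(),
            })
Call it once before the click loop and once after each link.click() and the sleep.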
If you decide to delve deeper into the selenium approach, you can refer to their docs which have excellent and very explicit documentation and examples.
Happy scraping!
I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends http://www.google.com/trends/hottrends#pn=p5
This is my current code. However, I realized the full html is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?
from selenium import webdriver
from bs4 import BeautifulSoup
googleURL = "http://www.google.com/trends/hottrends#pn=p5"
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source
soup = BeautifulSoup(content)
print soup
Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.
For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a string of this format exists for every date whose content has loaded), your Python might look something like this:
content = browser.page_source
desired_content_is_loaded = False

while not desired_content_is_loaded:
    if "Friday, February 15, 2013" not in content:
        # keep asking the page to load older items
        browser.execute_script("control.moreData();")
        content = browser.page_source
    else:
        desired_content_is_loaded = True
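The other option mentioned above, clicking the id="moreLink" element instead of calling the JS directly, might look like this sketch:
while "Friday, February 15, 2013" not in browser.page_source:
    # re-locate the More... control each time in case the page re-renders it
    browser.find_element_by_id("moreLink").click()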
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. What that tells me is that those items are loaded dynamically; they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling whether async JS will complete before or after any other event; it completes when it's ready, and that could be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source: it depends on how far along the async JS happens to be at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.
import time

browser.get(googleURL)
time.sleep(3)
content = browser.page_source
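A slightly more robust alternative to a fixed sleep is an explicit wait, for example waiting for the More... control mentioned above to be present (a sketch, assuming that element reliably appears once the dynamic content has rendered):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get(googleURL)
# wait up to 10 seconds for the dynamically rendered "More..." control to show up
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "moreLink")))
content = browser.page_source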