Web Scraper Using Scrapy - python

I have to parse only the positions and points from this link. That link has 21 listings (I don't know actually what to call them) on it and each listing has 40 players on it expect the last one. Now I have written a code which is like this,
from bs4 import BeautifulSoup
import urllib2
def overall_standing():
url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#",
"http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
# "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
"http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]
gDictPlayerPointsInfo = {}
for url in url_list:
print url
header = {'User-Agent': 'Mozilla/5.0'}
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find("table", { "class" : "ladder zebra player-ratings" })
lCount = 1
for row in table.find_all("tr"):
lPlayerName = ""
lTeamName = ""
lPosition = ""
lPoint = ""
for cell in row.find_all("td"):
if lCount == 2:
lPlayerName = str(cell.get_text()).strip().upper()
elif lCount == 3:
lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
elif lCount == 4:
lPosition = str(cell.get_text().strip())
elif lCount == 6:
lPoint = str(cell.get_text().strip())
lCount += 1
if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
print lTeamName, lPlayerName, lPoint
if lPlayerName <> "" and lTeamName <> "":
lStr = lPosition + "," + lPoint
# if gDictPlayerPointsInfo.has_key(lTeamName):
# gDictPlayerPointsInfo[lTeamName].append({lPlayerName:lStr})
# else:
gDictPlayerPointsInfo[lTeamName+","+lPlayerName] = lStr
lCount = 1
lfp = open("a.txt","w")
for key in gDictPlayerPointsInfo:
if key.find("RICHMOND"):
lfp.write(str(gDictPlayerPointsInfo[key]))
lfp.close()
return gDictPlayerPointsInfo
# overall_standing()
but the problem is it always gives me the first listing's points and positions, It ignored the other 20. How could I get the positions and points for the whole 21? Now I heard scrapy can do this type thing pretty easy but I am not fully familiar with scrapy. Is there any other way possible than using scrapy.

This is happening because these links are handled by the server, and often the portion of the link followed by the # symbol, called the fragment identifier, is processed by the browser and refers to some link or javascript behavior, i.e. loading a different set of results.
I would suggest two appraoches, either finding a way to use a link which the server can evaluate that you could continue using scrapy with or using a webdriver like selenium.
Scrapy
Your first step is to identify the javascript load call, often ajax, and use those links to pull your information. These are calls to the site's DB. This can be done by opening your web inspector and watching the network traffic as you click the next search result page:
and then after the click
we can see that there is a new call the this url:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40
This url returns a json file which can be parsed, and you can even shorten your steps are it looks like you can control more what information is returned to you.
You could either write a method to generate a series of links for you:
def gen_url(page_no):
return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"
and then, for example, use scrapy with the seed list:
seed = [gen_url(i) for i in range(20)]
or you can try tweaking the url parameters and see what you get, maybe you can get multiple pages at a time:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200
I changed the end the pageSize parameter to 200 since it seems this corresponds directly to the number of results returned.
NOTE There is a chance this method would not work as sites sometimes block their data API from outside usage via screening the ip of where the request is coming from.
If this is the case you should go with the following approach.
Selenium (or other webdriver)
Using something like selenium which is a webdriver, you can use what is loaded into a browser to evaluate data that is loaded after the server has returned the webpage.
There is some initial setup that needs to be set up in order for selenium to be usable, but it is a very powerful tool once you have it working.
A simple example of this would be:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
You will see a python controlled Firefox browser (this can be done with other browsers too) open on your screen and load the url you provide, then follow the commands you give it which can be done from a shell even (useful for debugging) and you can search and parse html in the same way you would with scrapy (code contd from previous code section...)
If you want to perform something like clicking the next page button:
driver.find_elements_by_xpath("//div[#class='pagination']//li[#class='page']")
That expression may need some tweaking but it intends to find all the li elements of class='page' that are in the div with class='pagination', the // means shortened path between elements, your other alternative would be like /html/body/div/div/..... until you get to the one in question, which is why //div/... is useful and appealing.
For specific help and reference on locating elements see their page
My usual method is trial and error for this, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy. After setting up the driver as above, I usually try and build my expression:
Say you have an html structure like:
<html>
<head></head>
<body>
<div id="container">
<div id="info-i-want">
treasure chest
</div>
</div>
</body>
</html>
I would start with something like:
>>> print driver.get_element_by_xpath("//body")
'<body>
<div id="container">
<div id="info-i-want">
treasure chest
</div>
</div>
</body>'
>>> print driver.get_element_by_xpath("//div[#id='container']")
<div id="container">
<div id="info-i-want">
treasure chest
</div>
</div>
>>> print driver.get_element_by_xpath("//div[#id='info-i-want']")
<div id="info-i-want">
treasure chest
</div>
>>> print driver.get_element_by_xpath("//div[#id='info-i-want']/text()")
treasure chest
>>> # BOOM TREASURE!
Usually it will be more complex, but this is a good and often necessary debugging tactic.
Back to your case, you could then save them out into an array:
links = driver.find_elements_by_xpath("//div[#class='pagination']//li[#class='page']")
and then one by one click them, scrape the new data, click the next one:
import time
from selenium import webdriver
driver = None
try:
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
#
# Scrape the first page
#
links = driver.find_elements_by_xpath("//div[#class='pagination']//li[#class='page']")
for link in links:
link.click()
#
# scrape the next page
#
time.sleep(1) # pause for a time period to let the data load
finally:
if driver:
driver.close()
It is best to wrap it all in a try...finally type block to make sure you close the driver instance.
If you decide to delve deeper into the selenium approach, you can refer to their docs which have excellent and very explicit documentation and examples.
Happy scraping!

Related

Python Selenium - Scraping javascript pagination

I've been building this scraper (with some massive help from users here) to get data on some companies' debt with the public sector and I've been able to get to the site, input the desired
search parameters and scrape the first 50 results (out of 300). The problem I've encountered is that this page's pagination has the following characteristics:
It does not possess a next page button
The URL doesn't change with the pagination
The pagination is done with a Javascript script
Here's the code so far:
path_driver = "C:/Users/CS330584/Documents/Documentos de Defesa da Concorrência/Automatização de Processos/chromedriver.exe"
website = "https://sat.sef.sc.gov.br/tax.NET/Sat.Dva.Web/ConsultaPublicaDevedores.aspx"
value_search = "300"
final_table = []
driver = webdriver.Chrome(path_driver)
driver.get(website)
search_max = driver.find_element_by_id("Body_Main_Main_ctl00_txtTotalDevedores")
search_max.send_keys(value_search)
btn_consult = driver.find_element_by_id("Body_Main_Main_ctl00_btnBuscar")
btn_consult.click()
driver.implicitly_wait(10)
cnpjs = driver.find_elements_by_xpath("//*[#id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[1]")
empresas = driver.find_elements_by_xpath("//*[#id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[2]")
dividas = driver.find_elements_by_xpath("//*[#id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[3]")
for i in range(len(empresas)):
temp_data = {'CNPJ' : cnpjs[i].text,
'Empresas' : empresas[i].text,
'Divida' : dividas[i].text
}
final_table.append(temp_data)
How can I navigate through the pages in order to scrape their data ? Thank you all for the help!
If you inspect the page and look at what happens when you click on the next page button, you'll see in the tag they're actually executing some javascript. It looks like this:
<font style="vertical-align: inherit;"><font style="vertical-align: inherit;">6</font></font>
But if you take that javascript call out of that href tag (and fix the " to be quotations) you'll see two function calls that look like this:
GridView_ScrollToTop("Body_Main_Main_grpDevedores_gridView");
__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$5');
Now I didn't take the time to analyze these functions in depth, but you don't really need to. You see the first call causes the browser to scroll to the top, and the second call actually causes the next page of data to load on the page. For your purposes, you only care about the second call.
You can mess around with this in the browser; Just perform your search and then, in the JS console, paste in the JS call, exchanging the number for the page you want to look at.
If you can do it via JS in the console on the webpage, you can do it with Selenium. You would do something like this to "click" each tab:
for(i in range(1, 7)):
js = "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$" + str(i) + "');"
driver.execute_script(js)
#do scraping stuff

Web scrape website with dropdown menu that changes website dynamically (onchange)

So I'm trying to scrape census data from a website that changes dynamically when a county is selected from the drop down menu. It looks like this:
<select id="cat_id_select_GEO" onchange="changeHeaderSelection('GEO');
<option value="0500000US01001" select="selected">Autaga County, Alabama</option>
<select>
a link
So from the research i've done, it sounds like i need to make some sort of Get request? (selenium?) but I am completely lost on how to do this. I know how to get the data i want, once i've made the county selection. But I've never had to scrape something where the website changes dynamically (i.e. the url doesn't change)
I understand that some may find this to be a simple question... but I've read numerous other similar questions and would greatly benefit from someone walking me through example, and/or directing me to a solid guide.
this is what i've been messing around with so far. I can see it kinda works at selecting the values... but it spits out this error: Message: stale element reference: element is not attached to the page document
(Session info: chrome=74.0.3729.169)
for index, row in StateURLs.iterrows():
url = row['URL']
state = row['STATE']
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
driver.get(url)
select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
options = select_county.options
for index in range(0, len(options) - 1):
select_county.select_by_index(index)
I also would love help on how to then convert this webpages to beautiful soup so i can scrape each page after the selection is made
The main landing page does get requests with a query string that returns a json string containing the info from that is first returned when you submit your query including further urls that are listed on the results page.
import requests
search_term = 'searchTerm: Autauga County, Alabama'
search_term = search_term.replace(' ','+')
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=').json()
Here is an example of that json
I can generate the correct url to use in the browser which returns all that data as json but can't seem to configure requests so works. Perhaps someone else can pick up this and work it out. I will look again tomorrow.
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=', allow_redirects= True).json()
url = 'https://factfinder.census.gov' + r['CFMetaData']['measuresAndLinks']['links']['2017 American Community Survey'][0]['url']
code = url.split('/')[-2]
url = 'https://factfinder.census.gov/tablerestful/tableServices/renderProductData?renderForMap=f&renderForChart=f&pid=ACS_17_5YR_{}&prodToReplace=ACS_16_5YR_{}&log=t&_ts=576607332612'.format(code, code)

Python Request Module - Displaying More Results

I'm currently working on a learner project for webscraping
I've picked my site:
https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers#Page0
On this page, there is a button on the bottom that displays the list of the next 10 products there without this button being clicked it does not display the next batch of products however the URL does not change when the button is clicked.
I wanted to ask how I will solve this dilemma using requests module.
My code is below:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")
c = r.content
soup = BeautifulSoup(c,"html.parser")
all=soup.find_all("div",{"class":"product"})
for item in all:
print(item.find({"h2": "productInfo"}).text.replace('\h2','').replace(" ", ""))
print(item.find("span",{"class": "condition"}).text + " " + item.find("span",{"class": "value"}).text )
try:
print(item.find_all("span",{"class": "condition"})[1].text + " " + item.find_all("span",{"class": "value"})[1].text )
except:
print("No Preowned")
print(" ")
Try this code to get all the items available in that page. You can make use of chrome dev tools to retrieve this url in which there is an option for page number increment.
from bs4 import BeautifulSoup
import requests
page_link = "https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers&pageNumber={}&pageMode=true"
page_no = 0
while True:
page_no+=1
res = requests.get(page_link.format(page_no))
soup = BeautifulSoup(res.text,'lxml')
container = soup.select(".productInfo h2")
if len(container)<=1:break
for content in container:
print(content.text)
Output of the last few titles:
ARK Survival Evolved
Kingdom Come Deliverance Special Edition
Halo 5 Guardians
Sonic Forces
The Elder Scrolls Online: Summerset - Digital
you need to use a webcrawler that supports javascript/jquery execution - i.e. selenium (it uses BoutifulSoup under the hood)
The problem you're facing is that the content you try to access gets created dynamically via javascript when the mentioned button is clicked.
When you request the page the additional html elements you want to read from are not created so BoutifulSoup cant find them.
Using selenium you can click buttons/fill out forms and much more. You can also wait for the server to create the content you want to access.
The documentation of selenium should be self explaining...

Scrape with BeautifulSoup from site that uses AJAX pagination using Python

I'm fairly new to coding and Python so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search results pages and scrapes each page for all of the urls. I've got all of the scrapping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop with the url to capture each search result but that's not possible. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
soup = BeautifulSoup(page)
snippet = soup.find_all('a', attrs={'item-title'})
for a in snippet:
logfile.write ("http://www.heritage.org" + a.get('href') + "\n")
print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness, while you may try accessing the POST request and to find a way round to access to next page, like I suggested in my comment, if an alternative is possible, using Selenium will be quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()
url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)
# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
while True:
try:
# sleep here to allow time for page load
sleep(5)
# grab the Next button if it exists
btn_next = driver.find_element_by_class_name('next')
# find all item-title a href and write to file
links = driver.find_elements_by_class_name('item-title')
print "Page: {} -- {} urls to write...".format(pages, len(links))
for link in links:
f.write(link.get_attribute('href')+'\n')
# Exit if no more Next button is found, ie. last page
if btn_next is None:
print "crawling completed."
exit(-1)
# otherwise click the Next button and repeat crawling the urls
pages += 1
btn_next.send_keys(Keys.RETURN)
# you should specify the exception here
except:
print "Error found, crawling stopped"
exit(-1)
Hope this helps.

Selenium download full html page

I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends http://www.google.com/trends/hottrends#pn=p5
This is my current code. However, I realized the full html is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?
from selenium import webdriver
from bs4 import BeautifulSoup
googleURL = "http://www.google.com/trends/hottrends#pn=p5"
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source
soup = BeautifulSoup(content)
print soup
Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.
For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a string of this format exists for every date, for loaded content) your python might look something like this:
content = browser.page_source
desired_content_is_loaded = false;
while (desired_content_is_loaded == false):
if not "Friday, February 15, 2013" in content:
sel.run_script("control.moreData();")
content = browser.page_source
else:
desired_content_is_loaded = true;
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. What that tells me, is that the those items are loaded dynamically. Meaning, they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling if async JS will complete before or after any other event. It completes when it's ready, and could be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source because it depends how fast async JS happens to be working at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.
browser.get(googleURL)
time.sleep(3)
content = browser.page_source

Categories