Scraping second page of dynamic element on webpage with Python

I work for a group that's looking to pull automated reports of port statuses in Texas. To do this, I'm trying to web scrape (which I'm not too familiar with) the Coast Guard Homeport site. I've managed to use Selenium to pull all the information using the XPath of the page's table of ports, but there is one port (Victoria) on 'page 2' of the table that the script is not able to see. The XPath does not change when I tab between the pages, so I'm not sure how to locate it. Any help would be much appreciated!
Edit: the page uses JavaScript elements.
https://homeport.uscg.mil/port-directory/corpus-christi
import time
import pandas as pd
from selenium import webdriver

url = 'https://homeport.uscg.mil/port-directory/corpus-christi'
xpath = "/html/body/form/div[12]/div[2]/div[2]/div[2]/div[3]/div[1]/div[4]/div/div/div/div/div/div[1]/div/div[2]/div[1]/div/div/div[2]/div/div[2]/div/table"
portsList = ['CORPUS CHRISTI','ORANGE','BEAUMONT','VICTORIA','CALHOUN','HARLINGEN','PALACIOS','PORT ISABEL','PORT LAVACA','PORT MANSFIELD']
df = pd.DataFrame(index=portsList, columns=['status','comments','dateupdated'])

driver = webdriver.Chrome(executable_path=r"C:\Users\M3ECHJJJ\Documents\chromedriver.exe")
urlpage = url+page  # `page` and `parsePara` are defined elsewhere in my script
driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(15)
results = driver.find_elements_by_xpath(xpath)
ports_split = results[0].text.split('\n')
i = 0
for port in ports_split:
    if port.upper() in portsList:
        print(port)
        df.xs(port.upper())['status'], df.xs(port.upper())['comments'], df.xs(port.upper())['dateupdated'] = parsePara(ports_split[i+1])
    i = i+1
driver.quit()

First, be very careful writing automation that acts against government websites (or any website for that matter) and be sure you're allowed to do so. You may also find that many sites offer the information you're looking for in a structured format, like through an API or data download.
While selenium provides a great tool for automating the browser, its capabilities for locating elements and parsing HTML may leave much to be desired. In cases like this, I might reach for BeautifulSoup as a complementary tool to use alongside browser automation.
BeautifulSoup supports many of the same kinds of locators as selenium, but it also provides additional capabilities, including the ability to define your own criteria for locating elements.
For example, you can define a function to use very specific rules for locating elements (tags). The function should return True for tags that match your interests.
from bs4 import BeautifulSoup

def important_table(tag):
    """Given a particular tag, return True if it's what you're looking for"""
    return bool(
        # match <table> elements
        tag.name == 'table' and
        # check for expected text
        any(port_name in tag.text for port_name in portsList) and
        # check element attributes (substitute the real class name here)
        "classname" in tag.get('class', []) and
        # has at least 3 rows
        len(tag.find_all('tr')) > 3
        # and so on
    )
This is just an example, but you can write this function however you like to fit your need.
Then you could apply it like so:
...
driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(15)

html = driver.page_source  # get the DOM content as a string
soup = BeautifulSoup(html, 'html.parser')
table = soup.find(important_table)
for row in table.find_all('tr'):
    print(row.text)

Related

Python (with selenium) how to modify and activate elements to update webpage before scraping data

I am trying to scrape some data from https://marvelsnapzone.com/decks/ but I would like to modify the table of decks before scraping them. For example:
Adding card names:
I am trying to add a new div id="tagsblock" with certain card names, like class="tag card" "Angela".
Executing the "Search":
I would then like to trigger the id="searchdecks" control to update the table of decks.
Sorting by ascending "Likes":
Lastly, I want to edit the span data-sorttype="likes" element so that it becomes span data-sorttype="likes" class="asc".
Below is my current python script which doesn't seem to sort the "Likes" before scraping the deck info. It also currently does not add cards or execute the "Search".
import re
import requests
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrap():
    url = 'https://marvelsnapzone.com/decks'
    chrome_options = Options()
    chrome_options.headless = True
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=chrome_options)
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # I would like to add cards and execute the "Search" option here
    selects = soup.findAll('span', {'data-sorttype': 'likes'})
    for select in selects:
        browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
        # this does not seem to sort the table, this is based on the data scraped later

    links = soup.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})
    # ... more web-scraping code ...
    # I am able to scrape the information after this, but I am struggling to modify the table
    # before scraping the information.

if __name__ == '__main__':
    characters = scrap()
Usually sites like this are dynamic: they load new data via a script when you click a button. This means that in these cases, setting an attribute with selenium will not make the site change.
That said, your code has some errors which I think come from assuming that selenium and beautifulsoup talk to each other (i.e. interact); they do not.
By doing this
soup = BeautifulSoup(...)
browser.execute_script(...)
links = soup.findAll(...)
you are trying to "update" soup by executing a script, but it doesn't work like that: soup is a static snapshot, parsed once from the HTML string, and it has no connection to the live browser. So when you run soup.findAll(...) you are working with the "old" soup, which doesn't contain any changes made by browser.execute_script(...).
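If you do change the live page with selenium and want beautifulsoup to see that change, you have to grab the page source again and parse a fresh soup. A minimal sketch of that pattern, reusing the browser object and the link class from your code:
from bs4 import BeautifulSoup

# perform whatever action changes the page (a click, a script, ...) with selenium first,
# then re-read the DOM and build a *new* soup from it
html_after = browser.page_source
soup_after = BeautifulSoup(html_after, 'html.parser')
links = soup_after.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})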
By doing this
browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
you are trying to use selenium to set an attribute on an object found with beautifulsoup. You cannot do this. The correct way is to find the element with selenium:
from selenium.webdriver.common.by import By

select = browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]')
browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
Anyway this doesn't work because as I said in the beginning, if you set an attribute with selenium the site will not change.
In this code block
selects = soup.findAll('span', {'data-sorttype': 'likes'})
for select in selects:
    # do something with select
why use a loop if you just want to set one attribute? Use soup.find (which returns a single tag) instead of soup.findAll (which returns a list, in this case a list with only one element):
select = soup.find('span', {'data-sorttype': 'likes'})
# do something with select
So the correct sequence of commands to sort the table and then scrape it with beautifulsoup is the following:
browser.get(url)

# click on the "Likes" header to sort the table
select = browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]')
select.click()
time.sleep(2)  # wait to be sure that the table is sorted

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})
Notice that beautifulsoup is not mandatory for scraping the page; you can use selenium alone too.
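For example, here is a rough sketch of the same sort-then-scrape done entirely with selenium, reusing the browser and url from the code above (the deck-link classes come from the question; the 2-second pause is just an assumption):
import time
from selenium.webdriver.common.by import By

browser.get(url)
browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]').click()
time.sleep(2)  # wait for the table to re-sort

# read the deck links directly from selenium, no beautifulsoup involved
links = browser.find_elements(By.CSS_SELECTOR, 'a.card.cardtooltip.maindeckcard.tooltiploaded')
for link in links:
    print(link.get_attribute('href'))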

Get dynamically generated content with python Selenium

This question has been asked before, but I've searched and tried and still can't get it to work. I'm a beginner when it comes to Selenium.
Have a look at: https://finance.yahoo.com/quote/FB
I'm trying to web scrape the "Recommended Rating", which in this case at the time of writing is 2. I've tried:
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)
rating = driver.find_element_by_css_selector('#Col2-4-QuoteModule-Proxy > div > section > div > div > div')
print(rating.text)
...which doesn't give me an error, but doesn't print any text either. I've also tried with xpath, class_name, etc. Instead I tried:
source = driver.page_source
print(source)
This doesn't work either, I'm just getting the actual source without the dynamically generated content. When I click "View Source" in Chrome, it's not there. I tried saving the webpage in chrome. Didn't work.
Then I discovered that if I save the entire webpage, including images and css-files and everything, the source code is different from the one where I just save the HTML.
The HTML-file I get when I save the entire webpage using Chrome DOES contain the information that I need, and at first I was thinking about using pyautogui to just Ctrl + S every webpage, but there must be another way.
The information that I need is obviously there in the HTML, but how do I get it without downloading the entire web page?
Try this to get at the dynamically generated content (i.e. the DOM after the JavaScript has run):
driver.execute_script("return document.body.innerHTML")
See similar question:
Running javascript in Selenium using Python
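As a rough sketch, you could feed that rendered HTML into BeautifulSoup and search it there; the div.rating-text selector used here is the one suggested in the next answer, so treat it as an assumption about the page:
from bs4 import BeautifulSoup

rendered = driver.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(rendered, 'html.parser')

rating = soup.select_one('div.rating-text')  # assumed selector, see the next answer
if rating is not None:
    print(rating.get_text(strip=True))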
The CSS selector, div.rating-text, is working just fine and is unique on the page. Returning .text will give you the value you are looking for.
First, you need to wait for the element to be clickable, then make sure you scroll down to the element before getting the rating. Try
element.location_once_scrolled_into_view
element.text
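Putting that together, a sketch of the wait-then-scroll-then-read sequence (the div.rating-text selector is from this answer; the 20-second timeout is arbitrary):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/FB')

# wait until the rating element is clickable, then scroll to it and read its text
wait = WebDriverWait(driver, 20)
element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.rating-text')))
element.location_once_scrolled_into_view  # accessing this property scrolls the element into view
print(element.text)

driver.quit()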
EDIT:
Use the following XPath selector:
'//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]'
Then you will have:
rating = driver.find_element_by_xpath('//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')
To extract the value of the slider, use:
val = rating.get_attribute("aria-label")
The script below answers a different question but somehow I think this is what you are after.
import requests
import pandas
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("AAA.csv", header=False)

Retrieve search results selenium python bs4

I successfully put together a script to retrieve search results from sales navigator in Linkedin. The following is the script, using python, selenium, and bs4.
browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
browser.get(url1)
time.sleep(15)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')
search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))
time.sleep(5)
browser.quit()
Irrespective of the number of results, only 10 results were ever returned. Upon further investigation into the source, I noticed the following:
The first 10 results sit at a different level of the markup, and the rest are under a div whose class is named "deferred area". Although the dt class name is the same for all the search results (result-lockup__name), because of the change in levels I am not able to access/retrieve the rest.
What would be the right way to retrieve all results in such a case?
EDIT 1
An example of how the tag levels are nested within an li:
And an example of the HTML of a result that is not being retrieved:
EDIT 2
The page source as requested
https://pastebin.com/D11YpHGQ
A lot of sites don't display all search results on page load; instead they load more only when needed, e.g. when the visitor keeps scrolling to indicate they want to view more.
We can use JavaScript to scroll to the bottom of the page for us, window.scrollTo(0, document.body.scrollHeight), forcing all results onto the page, after which we can grab the HTML. (You may want to loop this if you expect hundreds of results; see the sketch after the snippet below.)
The below should do the trick.
browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
browser.get(url1)
time.sleep(15)
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(15)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')
search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))
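And, as mentioned above, if you expect many pages of results you can loop the scroll until the page height stops growing. A rough sketch, continuing from the browser above (the pause length is just a guess):
import time

last_height = browser.execute_script('return document.body.scrollHeight')
while True:
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(5)  # give the deferred results time to load
    new_height = browser.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # no new content appeared, so we are at the real bottom
    last_height = new_height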

Python scrape table

I'm new to programming, so it's very likely my approach to this is not the right way to do it.
I'm trying to scrape the standings table from this site - http://www.flashscore.com/hockey/finland/liiga/ - and for now it would be fine if I could even scrape just one column with team names. I try to find td tags with the class "participant_name col_participant_name col_name", but the code returns empty brackets:
import requests
from bs4 import BeautifulSoup
import lxml

def table(url):
    teams = []
    source = requests.get(url).content
    soup = BeautifulSoup(source, "lxml")
    for td in soup.find_all("td"):
        team = td.find_all("participant_name col_participant_name col_name")
        teams.append(team)
    print(teams)

table("http://www.flashscore.com/hockey/finland/liiga/")
I tried using tr tag to retrieve whole rows, but no success either.
I think the main problem here is that you are trying to scrape dynamically generated content using requests. Note that there's no participant_name col_participant_name col_name text at all in the HTML source of the page, which means this content is being generated with JavaScript by the website. For that job you should use something like selenium together with ChromeDriver, or whichever driver you find better. Below is an example using both of the mentioned tools:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://www.flashscore.com/hockey/finland/liiga/"
driver = webdriver.Chrome()
driver.get(url)
source = driver.page_source
soup = BeautifulSoup(source, "lxml")
elements = soup.findAll('td', {'class':"participant_name col_participant_name col_name"})
I think another issue with your code is the way you were trying to access the tags: if you want to match a specific class or any other attribute, you can do so by passing a Python dictionary as an argument to the .findAll function.
Now we can use elements to find all the teams' names. Try print(elements[0]) and notice that the team's name is inside an a tag; we can access it using .a.text, so something like this:
teams = []
for item in elements:
    team = item.a.text
    print(team)
    teams.append(team)
print(teams)
teams now should be the desired output:
>>> teams
['Assat', 'Hameenlinna', 'IFK Helsinki', 'Ilves', 'Jyvaskyla', 'KalPa', 'Lukko', 'Pelicans', 'SaiPa', 'Tappara', 'TPS Turku', 'Karpat', 'KooKoo', 'Vaasan Sport', 'Jukurit']
teams could also be created using a list comprehension:
teams = [item.a.text for item in elements]
Mr Aguiar beat me to it! I will just point out that you can do it all with selenium alone. Of course he is correct in pointing out that this is one of the many sites that loads most of its content dynamically.
You might be interested in observing that I have used an xpath expression. These often make for compact ways of saying what you want. Not too hard to read once you get used to them.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://www.flashscore.com/hockey/finland/liiga/')
>>> items = driver.find_elements_by_xpath('.//span[@class="team_name_span"]/a[text()]')
>>> for item in items:
... item.text
...
'Assat'
'Hameenlinna'
'IFK Helsinki'
'Ilves'
'Jyvaskyla'
'KalPa'
'Lukko'
'Pelicans'
'SaiPa'
'Tappara'
'TPS Turku'
'Karpat'
'KooKoo'
'Vaasan Sport'
'Jukurit'
You're very close.
Start out being a little less ambitious, and just focus on "participant_name". Take a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all . I think you want something like:
for td in soup.find_all("td", "participant_name"):
Also, you must be seeing different web content than I am. After a wget of your URL, grep doesn't find "participant_name" in the text at all. You'll want to verify that your code is looking for an ID or a class that is actually present in the HTML text.
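A minimal sketch of that find_all suggestion, assuming you feed it HTML that actually contains the rendered table (for example driver.page_source from the selenium answers above, rather than a plain requests download):
from bs4 import BeautifulSoup

# `source` here is assumed to be HTML that already contains the rendered table,
# e.g. driver.page_source obtained with selenium
soup = BeautifulSoup(source, "html.parser")
teams = [td.get_text(strip=True) for td in soup.find_all("td", "participant_name")]
print(teams)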
Achieving the same using css selector which will let you make the code more readable and concise:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.flashscore.com/hockey/finland/liiga/')
for player_name in driver.find_elements_by_css_selector('.participant_name'):
    print(player_name.text)
driver.quit()

Parsing with BeautifulSoup Python with dynamic link

I was trying to parse table information listed on this site:
https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry
This is the code I'm using:
link = re.findall(re.compile('<a href="(.*?)">'), str(row))
link = 'https://www.theice.com' + link[0]
print link  # double-check that the link is correct

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
req = urllib2.Request(link, headers=headers)
try:
    pg = urllib2.urlopen(req).read()
    page = BeautifulSoup(pg)
except urllib2.HTTPError, e:
    print 'Error:', e.code, '\n', '\n'

table = page.find('table', attrs={'class': 'default'})
tr_odd = table.findAll('tr', attrs={'class': 'odd'})
tr_even = table.findAll('tr', attrs={'class': 'even'})
print tr_odd, tr_even
For some reason, during the urllib2.urlopen(req).read() step the link changes, i.e. the link doesn't contain the same URL as the one provided above. Therefore, my program opens a different page and the variable page stores information from this new, different site. Thus, my tr_odd and tr_even variables end up empty.
What could be the reason for the link changing? Is there another way to access the contents of this page? All I need are the table values.
The information in this page is being supplied by a JavaScript function. When you download the page with urllib you get the page before the JavaScript is executed. When you view the page in a standard browser manually, you see the HTML after the JavaScript has been executed.
To get at the data programmatically, you need to use some tool that can execute JavaScript. There are a number of 3rd party options available for Python, such as selenium, WebKit, or spidermonkey.
Here is an example of how to scrape the page using selenium (with phantomjs) and lxml:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry'

with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()')
    print('\n'.join(map(str, zip(*[iter(tds)]*5))))
yields
('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13')
('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')
('Sep13', '2/11/13', '9/27/13', '9/27/13', '9/27/13')
('Oct13', '2/11/13', '10/25/13', '10/25/13', '10/25/13')
...
('Aug18', '2/11/13', '8/31/18', '8/31/18', '8/31/18')
('Sep18', '2/11/13', '9/28/18', '9/28/18', '9/28/18')
('Oct18', '2/11/13', '10/26/18', '10/26/18', '10/26/18')
('Nov18', '2/11/13', '11/30/18', '11/30/18', '11/30/18')
('Dec18', '2/11/13', '12/28/18', '12/28/18', '12/28/18')
Explanation of the XPath:
lxml allows you to select tags using XPath.
The XPath
'//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()'
means
//table                            # search recursively for <table>
[@class="default"]                 # with an attribute class="default"
//tr                               # and find inside <table> all <tr> tags
[@class="odd" or @class="even"]    # that have attribute class="odd" or class="even"
/td                                # find the <td> tags which are direct children of the <tr> tags
/text()                            # return the text inside the <td> tag
Explanation of zip(*[iter(tds)]*5):
The tds is a list. It looks something like
['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13', 'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13',...]
Notice that each row of the table consists of 5 items. But our list is flat. So, to group every 5 items together into a tuple, we can use the grouper recipe. zip(*[iter(tds)]*5) is an application of the grouper recipe. It takes a flat list, like tds, and turns it into a list of tuples with every 5 items grouped together.
Here is an explanation of how the grouper recipe works. Please read that and if you have any question about it, I'll be glad to try to answer.
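A tiny worked example of that grouping trick on a made-up flat list:
flat = ['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13',
        'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13']

# the same iterator object is repeated 5 times, so zip pulls 5 consecutive
# items from it for each output tuple
rows = list(zip(*[iter(flat)] * 5))
print(rows)
# [('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13'),
#  ('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')]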
To get just the first column of the table, change the XPath to:
tds = doc.xpath(
    '''//table[@class="default"]
       //tr[@class="odd" or @class="even"]
       /td[1]/text()''')
print(tds)
For example,
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=6753474#expiry'

with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '''//table[@class="default"]
           //tr[@class="odd" or @class="even"]
           /td[1]/text()''')
    print(tds)
yields
['Jul13', 'Aug13', 'Sep13', 'Oct13', 'Nov13', 'Dec13', 'Jan14', 'Feb14', 'Mar14', 'Apr14', 'May14', 'Jun14', 'Jul14', 'Aug14', 'Sep14', 'Oct14', 'Nov14', 'Dec14', 'Jan15', 'Feb15', 'Mar15', 'Apr15', 'May15', 'Jun15', 'Jul15', 'Aug15', 'Sep15', 'Oct15', 'Nov15', 'Dec15']
I don't think the link is actually changing.
Anyway, the problem is that your regex is wrong. If you take the links it prints out and paste it into a browser, you get a blank page, or the wrong page, or a redirect to the wrong page. And Python is going to download the exact same thing.
Here's what your regex finds for a link from the actual page:
/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104
Notice that &amp; there? You need to decode it to & or your URL is wrong. Instead of having a query-string variable specId with value 19118104, you've got a query-string variable amp;specId (although technically, you can't have unescaped semicolons like that either, so everything from jsessionid on is a fragment).
You'll notice that if you paste that URL as-is into a browser, you get a blank page. If you remove the extra amp;, then you get the right page (after a redirect). And the same is true in Python.
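As a small sketch, you could unescape the extracted href before requesting it; this uses Python 3's html.unescape (Python 2 has a similar unescape method on HTMLParser):
import html

raw_href = '/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104'

# turn &amp; back into & before building the full URL
fixed_href = html.unescape(raw_href)
full_url = 'https://www.theice.com' + fixed_href
print(full_url)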
