Parsing with BeautifulSoup in Python with a dynamic link

I was trying to parse table information listed on this site:
https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry
Here is the code I'm using:
import re
import urllib2
from bs4 import BeautifulSoup

# `row` comes from earlier in my script (a table row already extracted)
link = re.findall(re.compile('<a href="(.*?)">'), str(row))
link = 'https://www.theice.com' + link[0]
print link  # Double check if link is correct

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
req = urllib2.Request(link, headers=headers)
try:
    pg = urllib2.urlopen(req).read()
    page = BeautifulSoup(pg)
except urllib2.HTTPError, e:
    print 'Error:', e.code, '\n'

table = page.find('table', attrs={'class': 'default'})
tr_odd = table.findAll('tr', attrs={'class': 'odd'})
tr_even = table.findAll('tr', attrs={'class': 'even'})
print tr_odd, tr_even
For some reason, during the urllib2.urlopen(req).read() step, the link changes, i.e., the page fetched does not come from the same URL as the one provided above. Therefore, my program opens a different page, and the variable page stores information from this new, different site. Thus, my tr_odd and tr_even variables are empty.
What could be the reason for the link changing? Is there another way to access the contents of this page? All I need are the table values.

The information in this page is being supplied by a JavaScript function. When you download the page with urllib you get the page before the JavaScript is executed. When you view the page in a standard browser manually, you see the HTML after the JavaScript has been executed.
To get at the data programmatically, you need a tool that can execute JavaScript. There are a number of third-party options available for Python, such as Selenium, WebKit, or SpiderMonkey.
Here is an example of how to scrape the page using Selenium (with PhantomJS) and lxml:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry'

with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()')
    print('\n'.join(map(str, zip(*[iter(tds)]*5))))
yields
('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13')
('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')
('Sep13', '2/11/13', '9/27/13', '9/27/13', '9/27/13')
('Oct13', '2/11/13', '10/25/13', '10/25/13', '10/25/13')
...
('Aug18', '2/11/13', '8/31/18', '8/31/18', '8/31/18')
('Sep18', '2/11/13', '9/28/18', '9/28/18', '9/28/18')
('Oct18', '2/11/13', '10/26/18', '10/26/18', '10/26/18')
('Nov18', '2/11/13', '11/30/18', '11/30/18', '11/30/18')
('Dec18', '2/11/13', '12/28/18', '12/28/18', '12/28/18')
Explanation of the XPath:
lxml allows you to select tags using XPath.
The XPath
'//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()'
means
//table                              # search recursively for <table>
  [@class="default"]                 # with an attribute class="default"
  //tr                               # and find inside <table> all <tr> tags
    [@class="odd" or @class="even"]  # that have attribute class="odd" or class="even"
    /td                              # find the <td> tags which are direct children of the <tr> tags
      /text()                        # return the text inside the <td> tags
Explanation of zip(*[iter(tds)]*5):
The tds is a list. It looks something like
['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13', 'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13',...]
Notice that each row of the table consists of 5 items. But our list is flat. So, to group every 5 items together into a tuple, we can use the grouper recipe. zip(*[iter(tds)]*5) is an application of the grouper recipe. It takes a flat list, like tds, and turns it into a list of tuples with every 5 items grouped together.
Here is an explanation of how the grouper recipe works. Please read that and if you have any question about it, I'll be glad to try to answer.
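For instance, here is a minimal, self-contained sketch of the idiom with a small made-up list (not the real page data), just to show how the grouping behaves:
# Minimal sketch of the grouper idiom with a hypothetical flat list.
flat = ['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13',
        'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13']

# [iter(flat)]*5 is a list of 5 references to the SAME iterator,
# so zip pulls 5 consecutive items for each output tuple.
rows = list(zip(*[iter(flat)]*5))
print(rows)
# [('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13'),
#  ('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')]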
To get just the first column of the table, change the XPath to:
tds = doc.xpath(
    '''//table[@class="default"]
       //tr[@class="odd" or @class="even"]
       /td[1]/text()''')
print(tds)
For example,
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=6753474#expiry'

with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '''//table[@class="default"]
           //tr[@class="odd" or @class="even"]
           /td[1]/text()''')
    print(tds)
yields
['Jul13', 'Aug13', 'Sep13', 'Oct13', 'Nov13', 'Dec13', 'Jan14', 'Feb14', 'Mar14', 'Apr14', 'May14', 'Jun14', 'Jul14', 'Aug14', 'Sep14', 'Oct14', 'Nov14', 'Dec14', 'Jan15', 'Feb15', 'Mar15', 'Apr15', 'May15', 'Jun15', 'Jul15', 'Aug15', 'Sep15', 'Oct15', 'Nov15', 'Dec15']

I don't think the link is actually changing.
Anyway, the problem is that your regex is wrong. If you take the link it prints out and paste it into a browser, you get a blank page, the wrong page, or a redirect to the wrong page. And Python is going to download exactly the same thing.
Here's what your regex finds for one of the links on the actual page:
/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104
Notice that &amp; there? You need to decode that to & or your URL is wrong. Instead of having a query-string variable specId with value 19118104, you've got a query-string variable amp;specId (although technically, you can't have unescaped semicolons like that either, so everything from jsession on is a fragment).
You'll notice that if you paste that URL into a browser, you get a blank page. If you remove the extra amp;, then you get the right page (after a redirect). And the same is true in Python.
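One way to sidestep the regex (and the entity-decoding problem) entirely is to let BeautifulSoup pull the href out of the row for you, since it decodes &amp; in attribute values. A rough sketch, assuming row is the BeautifulSoup tag you were running the regex against:
# Sketch: extract the href with BeautifulSoup instead of a regex.
# Assumes `row` is the BeautifulSoup tag the regex was being applied to.
a = row.find('a', href=True)
if a is not None:
    link = 'https://www.theice.com' + a['href']  # href comes back with &amp; decoded to &
    print(link)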

Related

Scraping second page of dynamic element on webpage with Python

I work for a group that's looking to pull automatic reports of port statuses in Tx. To do this, I'm trying to web scrape (which I'm not too familiar with) the Coast Guard Homeport site. I've managed to use Selenium to pull all the information using the xpath of the page's table with the ports, however there is one port (Victoria) on the 'page 2' of the table that the script is not able to see. The xpath does not change if I tab between the pages, so I'm not sure how to locate it. Any help would be much appreciated!
Edit: the page uses JavaScript elements.
https://homeport.uscg.mil/port-directory/corpus-christi
import time
import pandas as pd
from selenium import webdriver

url = 'https://homeport.uscg.mil/port-directory/corpus-christi'
xpath = "/html/body/form/div[12]/div[2]/div[2]/div[2]/div[3]/div[1]/div[4]/div/div/div/div/div/div[1]/div/div[2]/div[1]/div/div/div[2]/div/div[2]/div/table"
portsList = ['CORPUS CHRISTI','ORANGE','BEAUMONT','VICTORIA','CALHOUN','HARLINGEN','PALACIOS','PORT ISABEL','PORT LAVACA','PORT MANSFIELD']

df = pd.DataFrame(index=portsList, columns=['status','comments','dateupdated'])

driver = webdriver.Chrome(executable_path=r"C:\Users\M3ECHJJJ\Documents\chromedriver.exe")
urlpage = url + page  # `page` and parsePara() are defined elsewhere in the script (not shown)
driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(15)
results = driver.find_elements_by_xpath(xpath)
ports_split = results[0].text.split('\n')
i = 0
for port in ports_split:
    if port.upper() in portsList:
        print(port)
        df.xs(port.upper())['status'], df.xs(port.upper())['comments'], df.xs(port.upper())['dateupdated'] = parsePara(ports_split[i+1])
    i = i + 1
driver.quit()
First, be very careful writing automation that acts against government websites (or any website for that matter) and be sure you're allowed to do so. You may also find that many sites offer the information you're looking for in a structured format, like through an API or data download.
While selenium may provide you with a great tool for automating the browser, its capabilities for locating elements and parsing HTML may leave much to be desired. In cases like this, I might reach for using BeautifulSoup as a complementary tool to use alongside browser automation.
BeautifulSoup will support all the same locators as selenium, but also provides additional capabilities, including the ability to define your own criteria for locating elements.
For example, you can define a function to use very specific rules for locating elements (tags). The function should return True for tags that match your interests.
from bs4 import BeautifulSoup

def important_table(tag):
    """Given a particular tag, return True if it's what you're looking for"""
    return bool(
        # match <table> elements
        tag.name == 'table' and
        # check for expected text
        any(port_name in tag.text for port_name in portsList) and
        # check element attributes
        "classname" in tag.get('class', []) and
        # has at least 3 rows
        len(tag.find_all('tr')) > 3
        # and so on
    )
This is just an example, but you can write this function however you like to fit your need.
Then you could apply this like so
...
driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(15)

html = driver.page_source  # get the DOM content as a string
soup = BeautifulSoup(html, 'html.parser')
table = soup.find(important_table)
for row in table.find_all('tr'):
    print(row.text)

Get dynamically generated content with python Selenium

This question has been asked before, but I've searched and tried and still can't get it to work. I'm a beginner when it comes to Selenium.
Have a look at: https://finance.yahoo.com/quote/FB
I'm trying to web scrape the "Recommended Rating", which in this case at the time of writing is 2. I've tried:
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)
rating = driver.find_element_by_css_selector('#Col2-4-QuoteModule-Proxy > div > section > div > div > div')
print(rating.text)
...which doesn't give me an error, but doesn't print any text either. I've also tried with xpath, class_name, etc. Instead I tried:
source = driver.page_source
print(source)
This doesn't work either, I'm just getting the actual source without the dynamically generated content. When I click "View Source" in Chrome, it's not there. I tried saving the webpage in chrome. Didn't work.
Then I discovered that if I save the entire webpage, including images and css-files and everything, the source code is different from the one where I just save the HTML.
The HTML-file I get when I save the entire webpage using Chrome DOES contain the information that I need, and at first I was thinking about using pyautogui to just Ctrl + S every webpage, but there must be another way.
The information that I need is obviously there, in the HTML code, but how do I get it without downloading the entire web page?
Try this to get the dynamically generated content (the HTML after the JavaScript has been executed):
driver.execute_script("return document.body.innerHTML")
See similar question:
Running javascript in Selenium using Python
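For example, a rough sketch of feeding the returned HTML into BeautifulSoup (the div.rating-text selector is borrowed from the other answers on this page, and the fixed sleep is a naive stand-in for a proper wait):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)  # crude wait for the JavaScript-rendered content

# innerHTML of <body> after the JavaScript has run
html = driver.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(html, 'html.parser')

rating = soup.select_one('div.rating-text')
if rating is not None:
    print(rating.text)
driver.quit()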
The CSS selector, div.rating-text, is working just fine and is unique on the page. Returning .text will give you the value you are looking for.
First, you need to wait for the element to be clickable, then make sure you scroll down to the element before getting the rating. Try
element.location_once_scrolled_into_view
element.text
EDIT:
Use the following XPath selector:
'//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]'
Then you will have:
rating = driver.find_element_by_xpath('//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')
To extract the value of the slider, use
val = rating.get_attribute("aria-label")
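Putting those pieces together, a sketch of one way it could look (the XPath and the aria-label attribute come from this answer; the 10-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

xpath = ('//a[@data-test="recommendation-rating-header"]'
         '//following-sibling::div'
         '//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/FB')

# wait until the dynamically generated element is clickable
rating = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, xpath)))

# scroll to the element, then read the text and the slider value
rating.location_once_scrolled_into_view
print(rating.text)
print(rating.get_attribute("aria-label"))
driver.quit()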
The script below answers a different question but somehow I think this is what you are after.
import requests
import pandas
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("AAA.csv", header=False)

xpath query on a html table always returns empty string in python

Here's the Python code of what I've tried:
from lxml import html
import requests
page = requests.get('http://www.rsssf.com/tablese/eng2017det.html')
tree = html.fromstring(page.content)
print(tree.xpath('/html/body/table/tbody/tr[2]//text()'))
I'm always getting my output as [].
I have also checked the HTML page; the URL isn't broken.
Do not use the tbody tag in your XPath. The developer might omit this tag from the source HTML, and the browser adds it automatically while rendering the page, so it may not exist in the HTML that lxml actually receives.
Simply try
print(tree.xpath('/html/body/table//tr[2]//text()'))
or
print([i for i in tree.xpath('/html/body/table//tr[2]//text()') if i.strip()])
to avoid printing newline characters

Python scrape table

I'm new to programming so it's very likely my idea of doing what I'm trying to do is totally not the way to do that.
I'm trying to scrape the standings table from this site - http://www.flashscore.com/hockey/finland/liiga/ - for now it would be fine if I could even scrape one column with team names, so I tried to find td tags with the class "participant_name col_participant_name col_name", but the code returns empty brackets:
import requests
from bs4 import BeautifulSoup
import lxml

def table(url):
    teams = []
    source = requests.get(url).content
    soup = BeautifulSoup(source, "lxml")
    for td in soup.find_all("td"):
        team = td.find_all("participant_name col_participant_name col_name")
        teams.append(team)
    print(teams)

table("http://www.flashscore.com/hockey/finland/liiga/")
I tried using tr tag to retrieve whole rows, but no success either.
I think the main problem here is that you are trying to scrape dynamically generated content using requests. Note that there's no participant_name col_participant_name col_name text at all in the HTML source of the page, which means it is generated with JavaScript by the website. For that job you should use something like Selenium together with ChromeDriver (or whichever driver you prefer). Below is an example using both of the mentioned tools:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://www.flashscore.com/hockey/finland/liiga/"
driver = webdriver.Chrome()
driver.get(url)
source = driver.page_source
soup = BeautifulSoup(source, "lxml")
elements = soup.findAll('td', {'class':"participant_name col_participant_name col_name"})
I think another issue with your code is the way you were trying to access the tags. If you want to match a specific class or any other specific attribute, you can do so by passing a Python dictionary as an argument to the .findAll function.
Now we can use elements to find all the teams' names. Try print(elements[0]) and notice that the team's name is inside an a tag; we can access it using .a.text. So something like this:
teams = []
for item in elements:
    team = item.a.text
    print(team)
    teams.append(team)
print(teams)
teams now should be the desired output:
>>> teams
['Assat', 'Hameenlinna', 'IFK Helsinki', 'Ilves', 'Jyvaskyla', 'KalPa', 'Lukko', 'Pelicans', 'SaiPa', 'Tappara', 'TPS Turku', 'Karpat', 'KooKoo', 'Vaasan Sport', 'Jukurit']
teams could also be created using list comprehension:
teams = [item.a.text for item in elements]
Mr Aguiar beat me to it! I will just point out that you can do it all with selenium alone. Of course he is correct in pointing out that this is one of the many sites that loads most of its content dynamically.
You might be interested in observing that I have used an xpath expression. These often make for compact ways of saying what you want. Not too hard to read once you get used to them.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://www.flashscore.com/hockey/finland/liiga/')
>>> items = driver.find_elements_by_xpath('.//span[@class="team_name_span"]/a[text()]')
>>> for item in items:
... item.text
...
'Assat'
'Hameenlinna'
'IFK Helsinki'
'Ilves'
'Jyvaskyla'
'KalPa'
'Lukko'
'Pelicans'
'SaiPa'
'Tappara'
'TPS Turku'
'Karpat'
'KooKoo'
'Vaasan Sport'
'Jukurit'
You're very close.
Start out being a little less ambitious, and just focus on "participant_name". Take a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all . I think you want something like:
for td in soup.find_all("td", "participant_name"):
Also, you must be seeing different web content than I am. After a wget of your URL, grep doesn't find "participant_name" in the text at all. You'll want to verify that your code is looking for an ID or a class that is actually present in the HTML text.
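If you want to run that same check from Python instead of wget/grep, a quick sketch (requests is already imported in the question's code):
import requests

# Sanity check: is the class name present in the raw (pre-JavaScript) HTML?
raw = requests.get("http://www.flashscore.com/hockey/finland/liiga/").text
print("participant_name" in raw)  # False suggests the table is added by JavaScript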
Achieving the same using css selector which will let you make the code more readable and concise:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.flashscore.com/hockey/finland/liiga/')
for player_name in driver.find_elements_by_css_selector('.participant_name'):
    print(player_name.text)
driver.quit()

BeautifulSoup Returning empty array

I'm currently trying to scrape data off a website, but the code below returns an empty array "[]" for some reason. I can't figure out why. When I check the HTML it generates, there seem to be a lot of \t \r \n characters. I am unsure what the issue with my code is.
url = "http://www.hkex.com.hk/eng/csm/price_movement_result.htm?location=priceMoveSearch&PageNo=1&SearchMethod=2&mkt=hk&LangCode=en&StockType=ALL&Ranking=ByMC&x=51&y=6"
html = requests.get(url)
soup = BeautifulSoup(html.text,'html.parser')
rows = soup.find_all('tr')
print rows
I have attempted to parse without ".text" and also with "lxml" instead of "html.parser", but ended up with the same result.
EDIT: Found the workaround, used selenium to open the page and grab the source that way instead.
url = "http://www.hkex.com.hk/eng/csm/price_movement_result.htm?location=priceMoveSearch&PageNo=1&SearchMethod=2&mkt=hk&LangCode=en&StockType=ALL&Ranking=ByMC&x=51&y=6"
driver = webdriver.Firefox()
driver.get(url)
f = driver.page_source
soup = BeautifulSoup(f,'html.parser')
rows = soup.find_all('tr')
This page uses JavaScript to fetch data from the server. You can find the link that the JavaScript uses to request the data in Chrome's dev tools, so you can request that link directly to get the info you need:
http://www.hkex.com.hk/eng/csm/ws/Result.asmx/GetData?location=priceMoveSearch&SearchMethod=2&LangCode=en&StockCode=&StockName=&Ranking=ByMC&StockType=ALL&mkt=hk&PageNo=1&ATypeSHEx=&AType=&FDD=&FMM=&FYYYY=&TDD=&TMM=&TYYYY=
There is no need to use Selenium.
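A rough sketch of that approach (only printing the start of the response here, since the exact format the endpoint returns isn't shown in this answer):
import requests

data_url = ("http://www.hkex.com.hk/eng/csm/ws/Result.asmx/GetData"
            "?location=priceMoveSearch&SearchMethod=2&LangCode=en"
            "&StockCode=&StockName=&Ranking=ByMC&StockType=ALL&mkt=hk&PageNo=1"
            "&ATypeSHEx=&AType=&FDD=&FMM=&FYYYY=&TDD=&TMM=&TYYYY=")

resp = requests.get(data_url)
# Inspect the first part of the payload to see what format the service returns
print(resp.status_code)
print(resp.text[:500])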
There are no true HTML rows in the document. The rows are dynamically generated by JavaScript. BeautifulSoup cannot execute JavaScript.
If you view the contents of the html.text variable, you will notice that the content is generated dynamically and does not have any valid elements.
