Parsing an HTML page that changes content using a PostBack function - Python

I'm using Python with Requests and BeautifulSoup to parse the pages, and everything worked well until, on one of the pages, buttons appeared that have a PostBack function instead of a link.
The buttons have this structure:
<a onclick="PostBack('FollowLink','2');return false;" href="#">Continue</a>
I have no idea how to navigate to the next page, since the main URL remains unchanged.

You have two options. One is to manually inspect the JavaScript, see what the PostBack function does, and then simulate it. The other is to switch to something like Selenium, where you run an instance of Chrome that interprets the JavaScript for you. The first option would be less work.
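If you go the manual route and the page's PostBack helper follows the usual ASP.NET pattern (filling the __EVENTTARGET and __EVENTARGUMENT hidden fields and submitting the form), a minimal sketch with Requests could look like this. The URL is a placeholder and the field names are assumptions; inspect the PostBack function body and the form on the real page to confirm them.
# A minimal sketch, assuming the PostBack helper does a standard ASP.NET
# form post; verify the hidden field names against the real page.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/page.aspx"  # hypothetical page URL

session = requests.Session()
soup = BeautifulSoup(session.get(URL).text, "html.parser")

# Carry over every hidden input the server rendered (__VIEWSTATE etc.).
payload = {
    tag["name"]: tag.get("value", "")
    for tag in soup.select("input[type=hidden]")
    if tag.has_attr("name")
}
# Fill in what the onclick handler would have posted.
payload["__EVENTTARGET"] = "FollowLink"
payload["__EVENTARGUMENT"] = "2"

next_page = session.post(URL, data=payload)
print(next_page.status_code)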

Related

How to Fetch href links in Chromedriver?

I am trying to scrape the link from a button. Clicking the button opens a new tab that I can't navigate in, so I thought I'd scrape the link and go to it via webdriver.get(link), since this will run as a background program. I can't find any tutorials on this for the most recent version of Selenium. This is in Python.
I tried using
wd.find_element("xpath", 'xpath here')
but that just scrapes the button title. Is there a different tag I should be using?
I've also tried just clicking the button, but that opens a new tab and I don't know how to navigate in it, since it doesn't work by default and I'm still fairly new to ChromeDriver.
To my knowledge I can't use BeautifulSoup, since the page requires a login.
You need to get the href attribute of the button. If your code gets the right button you can just use
button.get_attribute("href")
Of course, if you get redirected using JavaScript this is a different story, but since you didn't specify, I will assume my answer works.
You can use the switch_to API to manage multiple windows (tabs) in the same test-case session:
driver.switch_to.window(name_or_handle)
As extra information: if you want to get an attribute value from an element, you can use the get_attribute() function:
link_value = driver.find_element(By.CSS_SELECTOR, selector).get_attribute("href")
P.S.: The example code is written in Python. If you use another language, the equivalent Selenium functions apply.
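Putting both pieces together, here is a short sketch (the page URL and selector are placeholders) that clicks a link which opens a new tab, switches to it, reads its URL, and switches back:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical page

original = driver.current_window_handle
driver.find_element(By.CSS_SELECTOR, "a.opens-new-tab").click()  # hypothetical selector

# The new tab's handle is whichever one we were not already on.
new_tab = [h for h in driver.window_handles if h != original][0]
driver.switch_to.window(new_tab)
link = driver.current_url

driver.close()                     # close the new tab
driver.switch_to.window(original)  # back to where we started
print(link)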

How to loop through postback links with selenium

I started working on my first web scraper with Python and Selenium. I'm sure it will be painfully obvious without me saying, but I'm still very new to this. This web scraper navigates to a website, performs a search to find a bunch of political candidate committee pages, clicks the first search result, then scrapes some text data into a dictionary and then a CSV. There are 50 results per page, and I could pass 50 different IDs into the code and it would work (I've tested it). But of course, I want this to be at least somewhat automated. I'd like it to loop through the 50 search results (candidate committee pages) and scrape them one after the other.
I thought a for loop would work well here. I would just need to loop through each of the 50 search result elements with the code that I know works. This is where I'm having issues. When I copy the html element corresponding to the search result link, this is what I get.
<a id="_ctl0_Content_dgdSearchResults__ctl2_lnkCandidate" class="grdBodyDisplay" href="javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')">ALLEN, KEVIN</a>
As you can see from the HTML above, the href attribute isn't a normal link. It's some sort of JavaScript postback thing that I don't really understand. After some googling, I still don't really get it. Some people are saying this means you have to make the program wait before you click the link, but my original code doesn't do that. My code performs the search and clicks the first link without issue. I just have to pass it the id.
I thought a good first step would be to scrape the search results page to get a list of links. Then I could iterate through a list of links with the rest of the scraping code. After some messing around I tried this:
links = driver.find_elements_by_tag_name('a')
for i in links:
    print(i.get_attribute('href'))
This gives me a list of all the links on the page, and after playing with the list a little bit, it narrows down to a list of 50 of these, corresponding to the 50 search results (notice the IDs change by one number):
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
etc
That's what the href attribute gives me...but are those even links? How do I work with them? Is this the wrong way to go about iterating through search results? I feel like I am so close to getting this to work! I'd appreciate any suggestions you have. Thanks!
__doPostBack() function
Postback is the mechanism by which the page contents are posted back to the server in response to an event in a page control, for example a button click, or an index-change event when AutoPostBack is set to true. All web controls except the Button and ImageButton controls post the form to the server by calling a JavaScript function named __doPostBack(); Button and ImageButton use the browser's native ability to submit the form. The ASP.NET runtime automatically inserts the definition of __doPostBack() into the HTML output whenever the page contains a control that can initiate a postback.
An example definition of __doPostBack:
<html>
<body>
<form name="form1" method="post" id="form1">
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>
<input type="hidden" name="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" value="" />
<a id="LinkButton1" href="javascript:__doPostBack('LinkButton1','')">LinkButton</a>
</form>
</body>
</html>
This use case
Extracting the value of the href attributes will always give similar output:
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
A potential solution (sketched below) would be to:
Open the <a> tags in an adjacent tab using CONTROL + Click
Switch to the adjacent tab
Extract the current_url
Switch back to the main tab.
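A sketch of that approach under Selenium 4. The result-link selector is a guess based on the HTML above, and this relies on the browser actually opening a background tab for CONTROL + Click, which a javascript: href may not do; verify on the live page.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://example.com/search-results")  # hypothetical page

main_tab = driver.current_window_handle
urls = []
for link in driver.find_elements(By.CSS_SELECTOR, "a.grdBodyDisplay"):
    # CONTROL + Click to open the result in an adjacent tab.
    ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
    new_tab = [h for h in driver.window_handles if h != main_tab][0]
    driver.switch_to.window(new_tab)
    urls.append(driver.current_url)  # extract the resolved URL
    driver.close()
    driver.switch_to.window(main_tab)
print(urls)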
No, these aren't links that you can just load in a browser or with Selenium. From what I can tell with a little googling, the first argument to __doPostBack() is the id for a button (or maybe another element) on the page.
In more general terms, "post back" refers to making a POST request back to the exact same URL as the current page. From what I can tell, __doPostBack() performs a post back by simulating a click on the specified element.
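Since these aren't loadable URLs, another way to iterate is to click each result in place and navigate back between them. A sketch, with a placeholder selector; the links must be re-found on every pass because the postback replaces the page and stales old element references (and note that going back after a POST can re-submit the request on some sites):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/search-results")  # hypothetical page

SELECTOR = "a.grdBodyDisplay"  # placeholder for the real result links
count = len(driver.find_elements(By.CSS_SELECTOR, SELECTOR))
for i in range(count):
    # Re-locate on every pass: the previous postback replaced the DOM.
    driver.find_elements(By.CSS_SELECTOR, SELECTOR)[i].click()
    print(driver.title)  # scrape whatever you need from the detail page
    driver.back()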

MechanicalSoup tricky html tables

I'm completely green to MechanicalSoup and web scraping.
I have been working on parsing an HTML timetable and turning it into an iCalendar (.ics) file to get it on mobile (which I have successfully done, yay).
To make that work, I downloaded the HTML of the timetable site once I had selected my timetable. Now I need to use Python to actually navigate to the timetable.
Here is my code so far (I am stuck because the HTML is sooo messy I don't know how to do it, and the documentation for MechanicalSoup is not that large yet):
import argparse
import mechanicalsoup
from getpass import getpass
browser = mechanicalsoup.StatefulBrowser(
    soup_config={'features': 'lxml'},
    raise_on_404=True,
    user_agent='MyBot/0.1: mysite.example.com/bot_info',
)
browser.open("http://keaplan.kea.dk/sws/prodE2017/default.aspx")
browser.select_form(WHAT TO SELECT :D)
See the HTML here :( http://keaplan.kea.dk/sws/prodE2017/default.aspx
I want to do the following:
td class="FilterPanel" # go to the table containing this td
div id="pFilter" # set value to BYG
div id="pObject" # set value to BAKINT-2l
submit (which will redirect to the timetable I need)
and download the HTML from the submitted redirect.
Help is lovingly appreciated!
The argument of select_form is a CSS selector. If you have just one form, then "form" will do the trick (the next version of MechanicalSoup will actually have this as the default argument). Otherwise, use your browser's developer tools; for example, Firefox has Right-Click -> Inspect Element -> Right-Click -> Copy -> CSS Selector, which can be a good starting point.
In your case, even though there's a funny layout, there is only one form, so:
browser.select_form("form")
Unfortunately, the page you are pointing to is partly generated with JavaScript (the select element you're searching for doesn't appear in the soup object obtained by parsing the page). You can see what MechanicalSoup sees of your page with
browser.launch_browser()
:-(. You can work around the issue by creating the missing controls yourself with new_control, as in the sketch below.
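A sketch of that workaround; the control names and values come straight from the question's list above and may not match what the server actually expects, so check the real POST in your browser's network tab first.
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(soup_config={'features': 'lxml'})
browser.open("http://keaplan.kea.dk/sws/prodE2017/default.aspx")
form = browser.select_form("form")

# Recreate the controls the page's JavaScript would normally build.
form.new_control("select", "pFilter", "BYG")
form.new_control("select", "pObject", "BAKINT-2l")

response = browser.submit_selected()
print(response.text)  # HTML of the timetable page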

How to scrape a value from a page that loads dynamically?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from the API and use the result to create the node in the DOM showing the available jobs. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping with a JavaScript-capable browser via Selenium.
One final thing I want to mention is that regular expressions are a fragile way to parse HTML; you should generally prefer a library like BeautifulSoup.
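If the page does turn out to fetch the number over an API, the Requests version can be as small as this; the endpoint and the response shape here are entirely hypothetical, to be replaced with what you find in the browser dev tools under Network > XHR:
import requests

resp = requests.get("https://example.com/api/jobs/count")  # hypothetical endpoint
resp.raise_for_status()
print(resp.json()["available_jobs"])  # hypothetical response field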
1. A value can be loaded dynamically with AJAX. AJAX loads asynchronously, meaning the rest of the site does not wait for it to render; that's why elements loaded with AJAX don't appear when you fetch the initial DOM.
2. For scraping dynamic content you should use Selenium; there are tutorials available.
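For instance, a minimal Selenium sketch (the selector is a placeholder for the real <span>) that waits for the value to render and then reads it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical homepage

# Wait up to 10 seconds for the AJAX-loaded span to become visible.
span = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "span.jobs-count"))  # hypothetical
)
print(span.text)
driver.quit()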
For data that loads dynamically, you should look for an XHR request in the Network tab; if you can make that data work for you, then voilà!
You can use PhantomJS; it's a headless browser, and it captures the HTML of the page with the dynamically loaded content.
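PhantomJS is no longer maintained, and recent Selenium releases dropped support for it, so as a substitute, here is the same idea with headless Chrome instead:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # hypothetical page
html = driver.page_source  # DOM after the JavaScript has run
driver.quit()
print(html)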

How can Selenium (or BeautifulSoup) be used to access these hidden elements?

Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (also when trying to access and click the a tag or driver.get() the href of the a element).
Is there another way to access the hidden elements? For instance, when reading the html of the entire page, the elements of different pagination are shown, but are apparently inaccessible with BeautifulSoup.
Pagination was added for humans. Maybe you used the wrong XPath or CSS selector. Check it.
Use this XPath:
//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
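Building on that, a sketch that pages through the results with explicit waits; the page count, listing selector, and the position of the "next" control are placeholders to verify against the live page (this uses the Selenium 4 find_elements API):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("http://www.rehabs.com/local/jacksonville-fl/")

for _ in range(4):  # hypothetical page count
    # Scrape the listings on the current page (selector from the XPath above).
    for a in driver.find_elements(By.CSS_SELECTOR, "#listing-basic article div.h3 a"):
        print(a.get_attribute("href"))
    # Wait for the pagination to be clickable, then advance a page.
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, ".pagination li a"))
    )
    driver.find_elements(By.CSS_SELECTOR, ".pagination li a")[1].click()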
