I started working on my first web scraper with Python and Selenium. I'm sure it will be painfully obvious without me saying, but I'm still very new to this. This web scraper navigates to a website, performs a search to find a bunch of political candidate committee pages, clicks the first search result, then scrapes some text data into a dictionary and then a CSV. There are 50 results per page, and I could pass 50 different IDs into the code and it would work (I've tested it). But of course, I want this to be at least somewhat automated. I'd like it to loop through the 50 search results (candidate committee pages) and scrape them one after the other.
I thought a for loop would work well here. I would just need to loop through each of the 50 search result elements with the code that I know works. This is where I'm having issues. When I copy the html element corresponding to the search result link, this is what I get.
a id="_ctl0_Content_dgdSearchResults__ctl2_lnkCandidate" class="grdBodyDisplay" href="javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')">ALLEN, KEVIN</a>
As you can see from the html above, the href attribute isn't a normal link. It's some sort of JavaScript postback thing that I don't really understand. After some googling, I still don't really get it. Some people are saying this means you have to make the program wait before you click the link, but my original code doesn't do that. My code performs the search and clicks the first link without issue. I just have to pass it the id.
I thought a good first step would be to scrape the search results page to get a list of links. Then I could iterate through a list of links with the rest of the scraping code. After some messing around I tried this:
links = driver.find_elements_by_tag_name('a')
for i in links:
    print(i.get_attribute('href'))
This gives me a list of all the links on the page, and after playing with the list a little bit, it narrows down to 50 of these, corresponding to the 50 search results (notice the IDs change by one number):
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
etc.
That's what the href attribute gives me...but are those even links? How do I work with them? Is this the wrong way to go about iterating through search results? I feel like I am so close to getting this to work! I'd appreciate any suggestions you have. Thanks!
__doPostBack() function
Postback is the functionality where the page contents are posted to the server because an event occurred in a page control, for example a button click, or an index-change event when AutoPostBack is set to true. All web controls except the Button and ImageButton controls call a JavaScript function named __doPostBack() to post the form to the server; Button and ImageButton use the browser's native ability to submit the form. The ASP.NET runtime automatically inserts the definition of the __doPostBack() function into the HTML output whenever the page contains a control that can initiate a postback.
An example definition of __doPostBack():
<html>
<body>
<form id="form1" name="form1" method="post">
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
</script>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<a id="LinkButton1" href="javascript:__doPostBack('LinkButton1','')">LinkButton</a>
</form>
</body>
</html>
This use case
Extracting the value of the href attribute will always give similar output:
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
A potential solution would be to (see the sketch after these steps):
Open the <a> element in an adjacent tab using CONTROL + Click
Switch to the adjacent tab
Extract the current_url
Switch back to the main tab.
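A rough sketch of those steps in Python (not tested against this site; it assumes driver is already on the results page and that the postback really does open the candidate page in a new tab):

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Locator is based on the class/href pattern shown in the question
results = driver.find_elements_by_css_selector("a.grdBodyDisplay[href*='__doPostBack']")
for link in results:
    # CONTROL + Click to open the result in an adjacent tab
    ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
    driver.switch_to.window(driver.window_handles[-1])  # switch to the adjacent tab
    print(driver.current_url)                           # extract the current_url
    driver.close()
    driver.switch_to.window(driver.window_handles[0])   # switch back to the main tab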
No, these aren't links that you can just load in a browser or with Selenium. From what I can tell with a little googling, the first argument to __doPostBack() is the id for a button (or maybe another element) on the page.
In more general terms, "post back" refers to making a POST request back to the exact same URL as the current page. From what I can tell, __doPostBack() performs a post back by simulating a click on the specified element.
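Since the postback is just the same form being re-submitted with __EVENTTARGET filled in, it can in principle be replicated without a browser. A minimal sketch using requests and BeautifulSoup (the URL is hypothetical, and the hidden-field handling assumes the standard ASP.NET WebForms fields such as __VIEWSTATE are present):

import requests
from bs4 import BeautifulSoup

URL = "http://example.com/CandidateSearch.aspx"  # hypothetical search-results page

session = requests.Session()
soup = BeautifulSoup(session.get(URL).text, "html.parser")

# Collect the hidden ASP.NET state fields so the server accepts the posted-back form
data = {inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]") if inp.get("name")}

# Simulate __doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
data["__EVENTTARGET"] = "_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate"
data["__EVENTARGUMENT"] = ""

response = session.post(URL, data=data)  # POST back to the same URL as the current page
print(response.status_code)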
Related
I'm using Python with Requests and BeautifulSoup to parse the pages, and everything worked well until, on one of the pages, buttons appeared that have a PostBack function instead of a link.
Buttons have this structure:
<a onclick="PostBack('FollowLink','2');return false;" href="#">Continue</a>
I have no idea how to navigate to the next page since the main link remains unchanged.
You have two options. One is to manually inspect the JavaScript and see what the PostBack function does, then simulate it. The other is to switch to something like Selenium, where you run an instance of Chrome that interprets the JavaScript for you. The first option would be less work.
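A minimal sketch of the second option (Selenium), using the "Continue" link text from the snippet above; the staleness wait is an assumption about how the page refreshes after the postback:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com/list")  # hypothetical page containing the "Continue" button

continue_link = driver.find_element_by_link_text("Continue")
continue_link.click()  # Selenium runs the onclick PostBack('FollowLink','2') for us

# The URL stays the same, so wait for the old element to go stale before parsing
WebDriverWait(driver, 10).until(EC.staleness_of(continue_link))
html = driver.page_source  # hand this off to BeautifulSoup as before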
I am executing a python script to scrape data off the first 5 pages of searching papers on the SSRN database. Currently what I have working right now is
links = browser.find_elements_by_xpath("//h3//a")
for link in links:
    href = link.get_attribute("href")
    extract_information(href, sheet, excel_data_pointer, workbook)
This is all currently being done on page 1 of my website. I also have some additional search preferences specified in my code, namely a sort order that is the last option on the search dropdown. I access this on the first page by using:
select_element = Select(browser.find_element_by_css_selector('select#sort-by'))
#Modify me to change the Sorting filter
select_element.select_by_visible_text("Date Posted, Descending")
browser.implicitly_wait(2)
Then I run the code I stated at the start, extract the links, and use a helper function to perform my scraping. My problem begins on the second page. When it loads, the browser pauses for a second and switches back to the old search preferences, and now it looks like:
I am using a for loop to traverse through the pages in the script accessing them by:
for i in range(2, num_papers):
    browser.find_element_by_xpath("//*[@id='maincontent']/div/div[1]/div/div[3]/div/ul/li[" + str(i) + "]/a").click()
My first question would be why does this happen? Does the browser not automatically remember search preferences? I have tried accessing the drop down again on this page but it doesn't seem to change. Is there a way to have selenium browsers memorise these preferences?
EDIT: I was also searching by "Title, Abstract and Keywords" in my original search, which worked well on page 1, but by page 2 it started to search through "Title and Abstract" only (the default option).
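One workaround sketch, not from this thread: since each pagination click appears to trigger a postback that resets the preferences, re-apply the sort selection after every click (reusing the question's own locators):

from selenium.webdriver.support.ui import Select

def apply_search_preferences(browser):
    # Re-select the sort order on the current page (same code as on page 1)
    select_element = Select(browser.find_element_by_css_selector('select#sort-by'))
    select_element.select_by_visible_text("Date Posted, Descending")
    browser.implicitly_wait(2)

for i in range(2, num_papers):
    browser.find_element_by_xpath(
        "//*[@id='maincontent']/div/div[1]/div/div[3]/div/ul/li[" + str(i) + "]/a"
    ).click()
    apply_search_preferences(browser)  # the postback reset the preferences, so set them again
    # ...then re-collect the links and call extract_information() as on page 1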
I am currently trying to crawl through an entire website with a specified crawl depth using selenium-python. I started with Google and thought of moving forward by crawling it while simultaneously developing the code.
The way it works is: If the page is 'www.google.com' and has 15 links within it, once all the links are fetched, it is stored in a dictionary with 'www.google.com' as the key and a list of 15 links as value. Then each of the 15 links are then taken from the corresponding dictionary and the crawling continues in a recursive manner.
The problem with this is that it moves forward based on the href attribute of every link found on a page. But not every link will have an href attribute.
For example: as it crawled and reached the My Account page, it has Help and Feedback in its footer, which has an outerHTML of <span role="button" tabindex="0" class="fK1S1c" jsname="ngKiOe">Help and Feedback</span>.
So what I am not sure about is what can be done in a context where a "link" is driven by JavaScript/AJAX: it does not have an href, but instead opens up a modal window/dialog box or the like.
You might need to find a pattern of design for links. For example, you could have a link with an anchor tag, and in your case a span.
It depends on the design of the webpage and how the developers intended to design the HTML elements through attributes/identifiers. For example, if the devs decided to have a common class value for all the links that don't use an anchor tag, it would be easy to identify all those elements.
You could also try writing a script to fetch all the elements with the expected tag name (for example, span) and try clicking on them. You could then fetch the details of the backend response/log. Those clicks for which you get an additional response/log would mean there is additional code written behind them, giving us an idea that it is not a static element.
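A rough sketch of that last suggestion, keying off the role="button" attribute from the span markup quoted in the question (whether that attribute is present on every such element is an assumption):

# Fetch the span-based "links" and try clicking each one to see which trigger real requests
span_buttons = driver.find_elements_by_css_selector('span[role="button"]')
for element in span_buttons:
    try:
        element.click()
        # inspect driver.current_url, new window handles, or browser logs here to
        # detect whether the click caused navigation or an AJAX call
    except Exception:
        pass  # the element may be hidden or not clickable at this point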
I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at its AJAX requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question would be how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:
You can see that it's making a POST request to download the urls to the articles.
Every value is self-explanatory here except maybe for docId and timestamp. docId seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article urls.
You can mess around more to reverse engineer how the website does it and what every parameter is used for, but this is a good start.
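For instance, a minimal sketch of requesting that URL with requests (parameter values are the ones shown above; how the article links appear in the response body is left to inspection):

import requests

# Parameters copied from the URL above; timestamp omitted, pullCount raised to 100
params = {
    "blogs": "true",
    "commentary": "true",
    "docId": "1275261016",
    "premium": "true",
    "pullCount": "100",
    "pulse": "true",
    "rtheadlines": "true",
    "topic": "All Topics",
    "topstories": "true",
    "video": "true",
}
response = requests.get("http://www.marketwatch.com/newsviewer/mktwheadlines", params=params)
print(response.status_code)
# the response body holds the <li> elements containing the article urls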
I'm trying to use scrapy to get content rendered only after a javascript: link is clicked. As the links don't appear to follow a systematic numbering scheme, I don't know how to
1 - activate a javascript: link to expand a collapsed panel
2 - activate a (now visible) javascript: link to cause the popup to be rendered so that its content (the abstract) can be scraped
The site https://b-com.mci-group.com/EventProgramme/EHA19.aspx contains links to abstracts that will be presented at a conference I plan to attend. The site's export to PDF is buggy, in that it duplicates a lot of data at PDF generation time. Rather than dealing with the bug, I turned to scrapy only to realize that I'm in over my head. I've read:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
and
How to scrape coupon code of coupon site (coupon code comes on clicking button)
But I don't think I'm able to connect the dots. I've also seen mentions of Selenium, but I'm not sure that I must resort to that.
I have made little progress, and wonder if I can get a push in the right direction, with the following information in hand:
In order to make the POST request that will expand the collapsed panel (item 1 above), I have traced that the on-page JS javascript:ShowCollapsiblePanel(116114,1695,44,191); will result in a POST request to TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml with payload:
{"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}
The parameters for eventSessionID and eventSessionWebSiteSetupViewID are clearly in the javascript:ShowCollapsiblePanel text.
How do I use scrapy to iterate over all of the links of form javascript:ShowCollapsiblePanel? I tried to use SgmlLinkExtractor, but that didn't return any of the javascript:ShowCollapsiblePanel() links - I suspect that they don't meet the criteria for "links".
UPDATE
Making progress, I've found that SgmlLinkExtractor is not the right way to go, and the much simpler:
sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]').re('(\d+),(\d+),(\d+),(\d+)')
in the scrapy console returns all of the numeric parameters for each javascript:ShowCollapsiblePanel() call (of course, right now they are all in one long string, but I'm just messing around in the console).
The next step will be to take the first javascript:ShowCollapsiblePanel(), generate the POST request, and analyze the response to see if it contains what I see when I click the link in the browser.
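Not from the thread, but a sketch of what generating that POST in Scrapy might look like (TARGETURLOFWEBSITE stays a placeholder for the site's base URL, and the JSON content type is an assumption):

import json
import scrapy

TARGETURLOFWEBSITE = "https://..."  # placeholder for the site's base URL

def build_panel_request(event_session_id, view_id, callback):
    # Reproduce the AJAX POST traced above for a single collapsible panel
    payload = {
        "eventSessionID": event_session_id,
        "eventSessionWebSiteSetupViewID": view_id,
    }
    return scrapy.Request(
        url=TARGETURLOFWEBSITE + "/EventSessionAjaxService/GetSessionDetailsHtml",
        method="POST",
        body=json.dumps(payload),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        callback=callback,
    )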
I fought with a similar problem, and after much pulling out of hair I pulled the data set I needed with import.io, which has a visual-type scraper but is able to run with JavaScript enabled, which did just what I needed, and it's free. There's also a fork of scrapy on GitHub I saw last night that looked just like the import.io scraper... it's called... give me a min.
Portia, but I don't know if it'll do what you want:
https://codeload.github.com/scrapinghub/portia/zip/master
Good luck!