I am currently trying to crawl through an entire website with a specified crawl depth using selenium-python. I started with Google, with the idea of crawling it and developing the code as I go.
The way it works is: if the page is 'www.google.com' and has 15 links within it, once all the links are fetched they are stored in a dictionary with 'www.google.com' as the key and the list of 15 links as the value. Each of those 15 links is then taken from the dictionary and the crawling continues recursively.
The problem is that it moves forward based on the href attribute of every link found on a page, but not every link has an href attribute.
For example: as it crawled and reached the My Account page, the footer contains Help and Feedback, which has an outerHTML of <span role="button" tabindex="0" class="fK1S1c" jsname="ngKiOe">Help and Feedback</span>.
What I am not sure about is what can be done in such a context, where the "link" is driven by JavaScript/AJAX: it has no href but opens a modal window/dialog box or the like.
You might need to find a pattern in how the links are designed. For example, a link could be an anchor tag or, as in your case, a span.
It depends on the design of the webpage and how the developers intended to design the HTML elements through attributes/identifiers.
For example, if the developers use a common class value for all the "links" that are not anchor tags, it becomes easy to identify all of those elements.
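For instance, a quick sketch of that idea, assuming the class value from the span in the question is shared by similar controls:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://myaccount.google.com/")  # the page from the question

# gather all JS-driven "link-like" spans by their shared class value
elements = driver.find_elements_by_css_selector("span.fK1S1c")
print(len(elements))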
You could also try writing a script that fetches all the elements with the expected tag name (for example, span) and tries clicking on them, while capturing the backend response/log details. The clicks that produce an additional response/log entry indicate that there is code wired up behind the element, i.e. that it is not a static element.
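A minimal sketch of that click-and-observe approach, assuming chromedriver and the My Account page from the question; the span[role='button'] selector is illustrative:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://myaccount.google.com/")

# count the candidate "link-like" spans, then click them one by one,
# re-locating each time to avoid stale element references after navigation
count = len(driver.find_elements_by_css_selector("span[role='button']"))
for index in range(count):
    elements = driver.find_elements_by_css_selector("span[role='button']")
    if index >= len(elements):
        break
    element = elements[index]
    before = driver.current_url
    try:
        element.click()
    except Exception:
        continue  # not clickable right now (hidden, covered, etc.)
    if driver.current_url != before:
        print(element.text, "navigated to", driver.current_url)
        driver.back()
    else:
        # no navigation: most likely a JavaScript-driven modal/dialog
        print("JS-driven control (no href):", element.text)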
I started working on my first web scraper with python and selenium. I'm sure it will be painfully obvious without me saying, but I'm still very new to this. This web scraper navigates to a website, performs a search to find a bunch of political candidate committee pages, clicks the first search result, then scrapes some text data into a dictionary and then a csv. There are 50 results per page and I could pass 50 different id's into the code and it would work (I've tested it). But of course, I want this to be at least somewhat automated. I'd like it to loop through the 50 search results (candidate committee pages) and scrape them one after the other.
I thought a for loop would work well here. I would just need to loop through each of the 50 search result elements with the code that I know works. This is where I'm having issues. When I copy the html element corresponding to the search result link, this is what I get.
a id="_ctl0_Content_dgdSearchResults__ctl2_lnkCandidate" class="grdBodyDisplay" href="javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')">ALLEN, KEVIN</a>
As you can see from the html above, the href attribute isn't a normal link. It's some sort of javascript Postback thing that I don't really understand. After some googling, I still don't really get it. Some people are saying this means you have to make the program wait before you click the link, but my original code doesn't do that. My code performs the search and clicks the first link without issue. I just have to pass it the id.
I thought a good first step would be to scrape the search results page to get a list of links. Then I could iterate through a list of links with the rest of the scraping code. After some messing around I tried this:
links = driver.find_elements_by_tag_name('a')
for i in links:
    print(i.get_attribute('href'))
This gives me a list of all the links on the page, and after playing with the list a little bit, it narrows down to a list of 50 of these corresponding to the 50 search results (notice the id's change by 1 number):
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
etc
That's what the href attribute gives me... but are those even links? How do I work with them? Is this the wrong way to go about iterating through the search results? I feel like I am so close to getting this to work! I'd appreciate any suggestions you have. Thanks!
__doPostBack() function
Postback is the functionality where the page contents are posted to the server due to the occurrence of an event in a page control, for example a button click or an index-changed event when AutoPostBack is set to true. All the web controls except the Button and ImageButton controls can call a JavaScript function named __doPostBack() to post the form to the server; the Button and ImageButton controls use the browser's native ability to submit the form. The ASP.NET runtime automatically inserts the definition of __doPostBack() into the HTML output when the page contains a control that can initiate a postback.
An example definition of __doPostBack():
<html>
<body>
    <form name="form1" method="post" id="form1">
        <div>
            <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
            <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
            <script type="text/javascript">
            //<![CDATA[
            var theForm = document.forms['form1'];
            if (!theForm) {
                theForm = document.form1;
            }
            function __doPostBack(eventTarget, eventArgument) {
                if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
                    theForm.__EVENTTARGET.value = eventTarget;
                    theForm.__EVENTARGUMENT.value = eventArgument;
                    theForm.submit();
                }
            }
            //]]>
            </script>
        </div>
        <a id="LinkButton1" href="javascript:__doPostBack('LinkButton1','')">LinkButton</a>
    </form>
</body>
</html>
This use case
Extracting the value of the href attribute will always give similar output:
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
A potential solution would be to (see the sketch after this list):
Open the <a> tag in an adjacent tab using CONTROL + Click
Switch to the adjacent tab
Extract the current_url
Switch back to the main tab.
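A rough Selenium sketch of those steps; the URL is a placeholder, the link selector comes from the question's markup, and the fixed sleep is only for illustration:

import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://example.com/search-results")  # placeholder for the results page

main_tab = driver.current_window_handle
count = len(driver.find_elements_by_css_selector("a.grdBodyDisplay"))
for index in range(count):
    # re-locate on every pass to avoid stale element references
    link = driver.find_elements_by_css_selector("a.grdBodyDisplay")[index]
    # CONTROL + Click to open the result in an adjacent tab
    ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
    time.sleep(2)  # crude wait for the adjacent tab to open
    new_tabs = [handle for handle in driver.window_handles if handle != main_tab]
    if not new_tabs:
        continue  # nothing opened; the control probably posted back in place
    driver.switch_to.window(new_tabs[-1])
    print(driver.current_url)  # extract the real URL of the result
    driver.close()
    driver.switch_to.window(main_tab)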
No, these aren't links that you can just load in a browser or with Selenium. From what I can tell with a little googling, the first argument to __doPostBack() is the id for a button (or maybe another element) on the page.
In more general terms, "post back" refers to making a POST request back to the exact same URL as the current page. From what I can tell, __doPostBack() performs a post back by simulating a click on the specified element.
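If you stay in Selenium, one practical alternative is to click the results one at a time and navigate back; a sketch under the assumption that each click posts back and loads the candidate's detail page, with the selector taken from the question's markup:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/search-results")  # placeholder for the results page

count = len(driver.find_elements_by_css_selector("a.grdBodyDisplay"))
for index in range(count):
    # re-query each time: the postback reloads the page and stales old elements
    result = driver.find_elements_by_css_selector("a.grdBodyDisplay")[index]
    result.click()   # triggers __doPostBack and loads the detail page
    # ... scrape the candidate committee page here ...
    driver.back()    # return to the search results for the next item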
Currently I use Python with Selenium for scraping. There are many ways to locate data in Selenium, and I used to use CSS selectors.
But then I realised that only tag names are guaranteed to be present on every website.
For example, not every website uses classes or IDs. Take Wikipedia: it mostly uses plain tags,
like <h1> or <a>, without any class or id on them.
That is the limitation of scraping by tag name: it picks up every element under that tag.
For example: if I want to scrape table contents which sit under a <p> tag, then it scrapes the table contents as well as all the descriptions which are not needed.
My question is: is it possible to scrape only the required elements under a tag, without copying every other element under that tag?
For instance, if I want to scrape content from, say, Amazon, it should select only the product names in <h1> tags, not all the headings in <h1> tags which are not product names.
If you know any other method/locator to use, apart from tag names, you can also tell me. But the condition is that it must be present on every website, or at least most websites.
Any help would be appreciated 😊...
I am new to scrapy and am trying to extract Google News results from the link given below:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
"cholera" key word was provided that shows small blocks of various news associated with cholera key world further I try this with scrapy to extract the each block that contents individual news.
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text represents the div class="ts _JGs _KHs _oGs _KGs _jHs" of each block of news,
but it returns None.
After struggling, I found a way to scrape the desired data with a very simple trick:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
and the CSS selector for class="g" can be used to extract the desired blocks like this:
response.css(".g").extract()
which returns a list of all the individual news blocks, which can then be accessed by list index like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]
In the scrapy shell, use view(response) and you will see in a web browser what you fetch().
Google uses JavaScript to display data, but it can also send a page which doesn't use JavaScript. The page without JavaScript usually has different tags and classes, though.
You can also turn off JavaScript in your browser and then open Google to see those tags.
Try this:
response.css('#search td ::text').extract()
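For example, in the scrapy shell, using the URL from the question:

fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
view(response)                               # see the non-JavaScript page scrapy received
response.css('#search td ::text').extract()  # text from that layout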
I'm new to scrapy but have been using Python for a while. I took lessons from the scrapy docs along with the xpath selectors. Now, I would like to use that knowledge for a small project. I'm trying to scrape the job links and the associated info like job title, location, emails (if any), and phone numbers (if any) from the job board https://www.germanystartupjobs.com/ using scrapy.
I have this starter code,
import scrapy

class GermanSpider(scrapy.Spider):

    # spider name
    name = 'germany'

    # the first page of the website
    start_urls = ['https://www.germanystartupjobs.com/']
    print start_urls

    def parse(self, response):
        pass

    def parse_detail(self, response):
        pass
and will run the spider with scrapy runspider germany
Inside the parse function, I would like to get the hrefs and details inside the parse_detail function.
When I opened the mentioned page with the Chrome developer tools and inspected the listed jobs, I saw that all the jobs are inside this ul:
<ul id="job-listing-view" class="job_listings job-listings-table-bordered">
and then the separate jobs are listed in the many inner divs of
<div class="job-info-row-listing-class"> with associated info; for example, the href is provided inside <a href="https://www.germanystartupjobs.com/job/foodpanda-berlin-germany-2-sem-manager-mf/">
Other divs provide the job title, company name, location, etc., with divs such as
<div>
    <h4 class="job-title-class">
        SEM Manager (m/f)
    </h4>
</div>
<div class="job-company-name">
    <normal>foodpanda</normal>
</div>
<div class="location">
    <div class="job-location-class">
        <i class="glyphicon glyphicon-map-marker"></i>
        Berlin, Germany
    </div>
</div>
The first step will be to get the hrefs using the parse function and then get the associated info inside parse_detail using the response. I find that the email and the phone number are only provided when you open the link from the href, but the title and location are provided inside the divs on the current page.
As I mentioned, I have okay programming skills in Python, but I struggle with using xpaths even after going through this tutorial. How do I find the links and the associated info? Some sample code with a little explanation would help a lot.
I tried using the code
# firstly
for element in response.css("job-info-row-listing-class"):
    href = element.xpath('@href').extract()[0]
    print href
    yield scrapy.Request(href, callback=self.parse_detail)

# secondly
values = response.xpath('//div[@class="job-info-row-listing-class"]//a/text()').extract()
for v in values:
    print v

#
values = response.xpath('//ul[@id="job-listing-view"]//div[@class="job-info-row-listing-class"]//a/text()').extract()
They seem to return nothing so far after running the spider using scrapy runspider germany
You probably won't be able to extract the information on this site that easily, since the actual job-listings are loaded as a POST-request.
How do you know this?
Type scrapy shell "https://www.germanystartupjobs.com/" in your terminal of choice. (This opens up the, you guessed it, shell, which is highly recommended when first starting to scrape a website. There you can try out functions, xpaths, etc.)
In the shell, type view(response). This opens the response scrapy is getting in your default browser.
When the page has finished loading, you should be able to see that there are no job listings. This is because they are loaded through a POST request.
How do we find out what request it is? (I work with Firebug for FireFox, don't know how it works on Chrome)
Fire up Firebug (e.g. by right-clicking on an element and clicking Inspect with Firebug). This opens up Firebug, which is essentially like the Developer Tools in Chrome. I prefer it.
Here you can click the Network-Tab. If there is nothing there, reload the page.
Now you should be able to see the request with which the job listings are loaded.
In this case, the request to https://www.germanystartupjobs.com/jm-ajax/get_listings/ returns a JSON object (click JSON) with the HTML code as part of it.
For your spider this means that you will need to tell scrapy to get this request and process the HTML-part of the JSON-object in order to be able to apply your xpaths.
You do this by importing the json module at the top of your spider and then doing something along the lines of:
data = json.loads(response.body)
html = data['html']
selector = scrapy.Selector(text=html, type="html")
For example, if you'd like to extract all the urls from the site and follow them, you'd need to specify the xpath where the urls are found and yield a new request to each url. So basically you're telling scrapy: "Look, here is the url, now go and follow it."
An example for an xpath would be:
url = selector.xpath('//a/#href').extract()
So everything in the brackets is your xpath. You don't need to specify all the path from ul[#id="job-listing-view"]/ or so, you just need to make sure it is an identifiable path. Here for example, we only have the urls in the a-tags that you want, there are no other a-tags on the site.
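Putting those pieces together, a minimal spider might look roughly like this; the endpoint and the 'html' key come from the steps above, while the empty POST form data, the callback names and the plain //a/@href xpath are assumptions you may need to adjust:

import json
import scrapy

class GermanyStartupJobsSpider(scrapy.Spider):
    name = 'germany'

    def start_requests(self):
        # the listings endpoint observed in the Network tab (it was a POST request)
        yield scrapy.FormRequest(
            'https://www.germanystartupjobs.com/jm-ajax/get_listings/',
            formdata={},  # add any form parameters shown in the Network tab if required
            callback=self.parse,
        )

    def parse(self, response):
        # the endpoint returns JSON; the rendered listings sit in its 'html' key
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type="html")
        for url in selector.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        # scrape job title, location, email, phone, ... from the job page here
        pass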
This is pretty much the basic stuff.
I strongly recommend playing around in the shell until you feel you have got the hang of xpaths. Take a site that looks quite simple, without any such requests, and see if you can find any element you want through xpaths.
I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector, you can see that it's making a POST request to download the urls of the articles.
Every value is self-explanatory here except maybe for docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page) and it seems to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article urls.
You can mess around more to reverse engineer how the website does it and what each parameter is for, but this is a good start.
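A short sketch of using that URL from Python; the parameter values mirror the example URL above, requests and BeautifulSoup are assumed since the question mentions Beautiful Soup, and the plain <a> selector is an assumption about the returned markup:

import requests
from bs4 import BeautifulSoup

params = {
    'blogs': 'true',
    'commentary': 'true',
    'docId': '1275261016',
    'premium': 'true',
    'pullCount': '100',
    'pulse': 'true',
    'rtheadlines': 'true',
    'topic': 'All Topics',
    'topstories': 'true',
    'video': 'true',
}
response = requests.get('http://www.marketwatch.com/newsviewer/mktwheadlines', params=params)
soup = BeautifulSoup(response.text, 'html.parser')

# print every link found in the returned headline markup
for anchor in soup.find_all('a', href=True):
    print(anchor['href'])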