I'm new to Scrapy but have been using Python for a while. I worked through the Scrapy docs along with the XPath selectors tutorial. Now I would like to apply that knowledge to a small project: scraping the job links and the associated info (job title, location, emails if any, phone numbers if any) from the job board https://www.germanystartupjobs.com/ using Scrapy.
I have this starter code:
import scrapy


class GermanSpider(scrapy.Spider):
    # spider name
    name = 'germany'
    # the first page of the website
    start_urls = ['https://www.germanystartupjobs.com/']
    print(start_urls)

    def parse(self, response):
        pass

    def parse_detail(self, response):
        pass
and I will run the spider with scrapy runspider germany.
Inside the parse function, I would like to get the hrefs, and then get the details inside the parse_detail function.
When I opened the mentioned page with Chrome developer tools and inspected the listed jobs, I saw that all the jobs are inside this ul:
<ul id="job-listing-view" class="job_listings job-listings-table-bordered">
and then the separate jobs are listed in many inner divs of
<div class="job-info-row-listing-class"> with associated info; for example, the href is provided inside <a href="https://www.germanystartupjobs.com/job/foodpanda-berlin-germany-2-sem-manager-mf/">
Other divs provide the job title, company name, location etc. with divs such as
<div>
<h4 class="job-title-class">
SEM Manager (m/f) </h4>
</div>
<div class="job-company-name">
<normal>foodpanda<normal> </normal></normal></div>
</div>
<div class="location">
<div class="job-location-class"><i class="glyphicon glyphicon-map-marker"></i>
Berlin, Germany </div>
</div>
The first step will be to get the hrefs using the parse function and then the associated info inside parse_detail using the response. I find that the email and the phone number are only provided when you open the links from the hrefs, but the title and location are provided inside the divs of the current page.
As I mentioned, I have okay programming skills in Python, but I struggle with XPaths even after going through the tutorial. How do I find the links and the associated info? Some sample code with a little explanation would help a lot.
I tried using this code:
# firstly
for element in response.css(".job-info-row-listing-class"):
    href = element.xpath('.//a/@href').extract()[0]
    print(href)
    yield scrapy.Request(href, callback=self.parse_detail)

# secondly
values = response.xpath('//div[@class="job-info-row-listing-class"]//a/text()').extract()
for v in values:
    print(v)

# thirdly
values = response.xpath('//ul[@id="job-listing-view"]//div[@class="job-info-row-listing-class"]//a/text()').extract()
They seem to return nothing so far after running the spider using scrapy runspider germany.
You probably won't be able to extract the information on this site that easily, since the actual job listings are loaded via a POST request.
How do you know this?
Type scrapy shell "https://www.germanystartupjobs.com/" in your terminal of choice. (This opens up the, you guessed it, shell, which is highly recommended when first starting to scrape a website. There you can try out functions, XPaths, etc.)
In the shell, type view(response). This opens the response scrapy is getting in your default browser.
When the page has finished loading, you should be able to see that there are no job listings. This is because they are loaded through a POST request.
How do we find out which request it is? (I work with Firebug for Firefox; I don't know how it works in Chrome.)
Fire up Firebug (e.g. by right-clicking on an element and clicking Inspect with Firebug). This opens up Firebug, which is essentially like the Developer Tools in Chrome. I prefer it.
Here you can click the Network-Tab. If there is nothing there, reload the page.
Now you should be able to see the request with which the job listings are loaded.
In this case, the request to https://www.germanystartupjobs.com/jm-ajax/get_listings/ returns a JSON object (click JSON) with the HTML code as part of it.
For your spider this means that you will need to tell Scrapy to fetch this request and process the HTML part of the JSON object in order to be able to apply your XPaths.
You do this by importing the json module at the top of your spider and then doing something along the lines of:
data = json.loads(response.body)
html = data['html']
selector = scrapy.Selector(text=html, type="html")
For example, if you'd like to extract all the URLs from the site and follow them, you'd need to specify the XPath where the URLs are found and yield a new request to each URL. So basically you're telling Scrapy: "Look, here is the URL, now go and follow it."
An example of an XPath would be:
url = selector.xpath('//a/@href').extract()
So everything in the brackets is your XPath. You don't need to specify the full path from ul[@id="job-listing-view"]/ or so; you just need to make sure it is an identifiable path. Here, for example, the a-tags contain only the URLs you want; there are no other a-tags on the site.
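Putting the pieces together, a minimal sketch of such a spider might look like the following. It assumes the jm-ajax/get_listings/ endpoint mentioned above also answers a plain GET from start_urls (if it insists on a POST, swap the request for a scrapy.FormRequest), and the XPath inside parse_detail is a placeholder for whatever the detail pages actually contain:
import json

import scrapy


class GermanyJobsSpider(scrapy.Spider):
    name = 'germany_jobs'
    # the AJAX endpoint that returns the listings as JSON (see above)
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']

    def parse(self, response):
        # the response body is JSON; the listing HTML sits under the 'html' key
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type="html")
        # follow every job link found in the embedded HTML
        for href in selector.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # placeholder extraction -- adjust the XPath to the real detail-page markup
        yield {
            'title': response.xpath('//h4[@class="job-title-class"]/text()').extract_first(),
            'url': response.url,
        }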
This is pretty much the basic stuff.
I strongly recommend playing around in the shell until you feel you have the hang of XPaths. Take a site that looks quite easy, without any such requests, and see if you can find any element you want through XPaths.
Related
I am currently trying to crawl through an entire website with selenium-python, using a specified crawl depth. I started with Google and thought of moving forward by crawling it and developing the code simultaneously.
The way it works is: if the page is 'www.google.com' and has 15 links in it, once all the links are fetched, they are stored in a dictionary with 'www.google.com' as the key and the list of 15 links as the value. Then each of the 15 links is taken from the dictionary and the crawling continues recursively.
The problem with this is that it moves forward based on the href attribute of every link found on a page, but not every link has an href attribute.
For example: as it crawled and reached the My Account page, it found Help and Feedback in the footer, which has an outerHTML of <span role="button" tabindex="0" class="fK1S1c" jsname="ngKiOe">Help and Feedback</span>.
So what I am not sure about is: what can be done in such a context, where a "link" is driven by JavaScript/AJAX, i.e. it does not have an href but opens a modal window/dialog box or the like?
You might need to find a design pattern for links. For example, you could have a link with an anchor tag and, in your case, a span.
It depends on the design of the webpage, i.e. how the developers intended to design the HTML elements through attributes/identifiers. For example, if the devs decided to use a common class value for all the "links" that are not anchor tags, it would be easy to identify all those elements.
You could also try writing a script to fetch all the elements with the expected tag name (for example, span) and try clicking on them, as in the sketch below. You can then fetch the details of the backend response/log: for those clicks where you get an additional response/log entry, it means there is extra code wired up behind the element, giving us an idea that it is not a static element.
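A minimal Selenium sketch of that idea, assuming you only care about span elements that declare role="button" (as in the example above); the selector and the way the outcome of each click is recorded are illustrative, not a fixed recipe:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

driver = webdriver.Firefox()
driver.get("https://www.google.com/")
start_url = driver.current_url

# assumption: clickable-but-href-less elements look like <span role="button">
selector = 'span[role="button"]'
count = len(driver.find_elements_by_css_selector(selector))

for index in range(count):
    # re-locate on every pass, since a click can invalidate old element handles
    elements = driver.find_elements_by_css_selector(selector)
    if index >= len(elements):
        break
    try:
        elements[index].click()
    except WebDriverException:
        continue  # hidden or otherwise not clickable
    if driver.current_url != start_url:
        print("click navigated to:", driver.current_url)
        driver.get(start_url)  # return to where we started
    else:
        print("no navigation; likely a modal/dialog (element #%d)" % index)

driver.quit()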
There is something I cannot seem to grasp:
How can I make BeautifulSoup parse every page by navigating with the "Next page" link, up until the last page, and stop parsing when there is no "Next page" link found? The site in question is the one referenced in the answer below.
I tried looking for the Next button element name and used 'find' to find it, but I do not know how to make it recur and iterate until all pages are scraped.
Thank you
Beautiful Soup will only give you the tools; how to go about navigating pages is something you need to work out, in a flow-diagram sense.
Taking the page you mentioned and clicking through a few of the pages, it seems that when we are on page 1, nothing extra is shown in the URL:
htt...ru/moskva/transport
and we see in the source of the page:
<div class="pagination-pages clearfix">
<span class="pagination-page pagination-page_current">1</span>
<a class="pagination-page" href="/moskva/transport?p=2">2</a>
Let's check what happens when we go to page 2:
ht...ru/moskva/transport?p=2
<div class="pagination-pages clearfix">
<a class="pagination-page" href="/moskva/transport">1</a>
<span class="pagination-page pagination-page_current">2</span>
<a class="pagination-page" href="/moskva/transport?p=3">3</a>
Perfect, now we have the layout. One more thing to know before we make our beautiful soup: what happens when we go to a page past the last available page, which at the time of this writing was 40161?
ht...ru/moskva/transport?p=40161
We change this to:
ht...ru/moskva/transport?p=40162
The page seems to go back to page 1 automatically. Great!
So now we have everything we need to make our soup loop.
Instead of clicking next each time, just construct the URL; you know the elements required:
url = ht...ru/moskva/$searchterm?p=$pagenum
I'm assuming transport is the search term? I don't know, I can't read Russian, but you get the idea. Construct the URL, then make a requests call:
request = requests.get(url)
mysoup = bs4.BeautifulSoup(request.text, 'html.parser')
Now you can wrap that whole thing in a while loop, and each time (except the first time) check:
mysoup.select('.pagination-page_current')[0].text == '1'
This says: each time we get the page, find the currently selected page by using the class pagination-page_current. select() returns a list, so we take the first element with [0], get its text with .text, and see if it equals '1'.
This should only be true in two cases: on the first page you request, and when you request a page past the last one (which falls back to page 1). So you can use this to start and stop the script, or however you want.
This should be everything you need to do this properly. :)
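Putting the whole answer together, a sketch of the loop might look like the following. The base URL is elided here just as it is above, so treat BASE_URL as a placeholder you fill in yourself; it also assumes requesting p=1 returns the first page:
import bs4
import requests

BASE_URL = "htt...ru/moskva/transport"  # placeholder: the elided site from the question

page = 1
while True:
    response = requests.get(BASE_URL, params={"p": page})
    mysoup = bs4.BeautifulSoup(response.text, "html.parser")

    current = mysoup.select(".pagination-page_current")[0].text
    # past the last page the site falls back to page 1, so stop there
    if page > 1 and current == "1":
        break

    # ... scrape the listings on this page here ...

    page += 1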
BeautifulSoup by itself does not load pages. You need to use something like requests: fetch the URL you want to follow, load it, and pass its content to another BS4 soup.
import requests
from bs4 import BeautifulSoup

# Fetch your url
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # You can now scrape the new page
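If you want to follow the actual "Next page" anchor rather than constructing the URL yourself, here is a sketch under the assumption, based on the pagination HTML shown in the other answer, that the anchor right after the highlighted current-page span points at the next page and disappears on the last one:
import requests
from bs4 import BeautifulSoup

url = "htt...ru/moskva/transport"  # placeholder for the elided site in the question

while url:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    # ... scrape whatever you need from this page here ...

    # the anchor following the current-page span is assumed to be the next page
    current = soup.find("span", class_="pagination-page_current")
    next_link = current.find_next_sibling("a") if current is not None else None
    url = requests.compat.urljoin(r.url, next_link["href"]) if next_link else None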
I'm building a web app that searches all the shoe sizes that are in stock for each model of shoe.
So for example, for a website having a list of shoes:
http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522
I'll need to go inside each link to scrape this information.
Is there any way I can effectively do this with Scrapy (or something else)? Or is it impossible to do it?
It is possible and it is one of Scrapy's core functionalities.
For example, to scrape every shoe on this site, what you would do is:
In your spider variables: start_urls = ['http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522']
Then in your parse(self, response) your code should look like this:
for shoe_url in response.xpath(<ENTER_THE_XPATH>).extract():
    yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)
and in the method parse_shoe, which we registered as the callback in the for loop, you should extract all the information you need.
Now what happens here is that the spider starts to crawl the URL in start_urls, and then for every URL that matches the XPath we specified, it parses it with the parse_shoe function, where you can simply extract the shoe sizes.
You can also follow the "Follow Links" part of the tutorial on Scrapy's main site - it is very clear.
For completeness, I looked for the right XPath for you on that page; it should be '*//ul[@class="medium-3 columns product-list product-grid"]//a/@href'
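For reference, the whole answer assembled into one spider might look roughly like this; the listing XPath is the one suggested above, and the size XPath inside parse_shoe is a placeholder you would need to work out from the real product-page markup:
import scrapy


class ShoeSpider(scrapy.Spider):
    name = 'shoes'
    start_urls = ['http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522']

    def parse(self, response):
        # follow every shoe link on the listing page
        listing_xpath = '*//ul[@class="medium-3 columns product-list product-grid"]//a/@href'
        for shoe_url in response.xpath(listing_xpath).extract():
            yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)

    def parse_shoe(self, response):
        # placeholder extraction -- adjust to what the product page actually contains
        yield {
            'url': response.url,
            'sizes': response.xpath('//select[@id="size-select"]/option/text()').extract(),
        }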
Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website, http://www.hm.com/sg/products/ladies, and I am interested in getting all the divs with class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
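Once Selenium has rendered the page, you can hand the HTML over to BeautifulSoup as usual. A small sketch, assuming the 'product-list-item' class from the question is what the rendered markup actually uses:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.hm.com/sg/products/ladies")
html = driver.page_source  # HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
# class name taken from the question; verify it against the rendered page
for div in soup.find_all("div", class_="product-list-item"):
    link = div.find("a")
    if link is not None:
        print(link.get("href"))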
Alternatively, you can get all the info by changing the URL to the one the page itself requests; that URL can be found in Chrome dev tools > Network.
The reason why you got nothing from that specific URL is simply that the info you need is not there.
So first let me explain a little bit about how that page is loaded in a browser: when you request that page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). Then the browser starts to read/parse that content; it basically tells the browser where to find all the information it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes.
When you "inspect element" in Chrome, the page is already fully loaded. The info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all traffic when that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser! (And that's only part of the whole page-loading process.) So by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need to bother with BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job.
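A rough sketch of that approach, in the same urllib2 style as the question. The endpoint below is only a placeholder standing in for whichever api.hm.com request shows up in your traffic capture; the real URL and the JSON structure have to come from your own capture:
import json
import urllib2

# placeholder: substitute the actual api.hm.com URL you captured with Fiddler
API_URL = "https://api.hm.com/..."

request = urllib2.Request(API_URL, headers={"User-Agent": "Mozilla/5.0"})
response = urllib2.urlopen(request)
data = json.loads(response.read())

# inspect the structure first, e.g. data.keys(), then pull out the product
# entries and their links from wherever the capture shows they live
print(json.dumps(data, indent=2)[:1000])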
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')

scrapdiv = open('scrapdiv.txt', 'w')

product_lists = soup.findAll("div", {"class": "o-product-list"})
print(product_lists)
for product_list in product_lists:
    print(product_list)
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
I am new to Python and web crawling. I intend to scrape the links in the top stories of a website. I was told to look at its AJAX requests and send similar ones. The problem is that all the requests for the links look the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinitely scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector, you can see that it's making a POST request to download the URLs of the articles.
Every value is self-explanatory here, except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
You can mess around more to reverse-engineer how the website does it and what every keyword is used for, but this is a good start.
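A small sketch of fetching that URL with requests and pulling out the links with Beautiful Soup (which the question already uses). The parameters are the ones visible in the URL above; how the article links appear in the returned HTML is an assumption, so adjust the final loop to whatever the response actually contains:
import requests
from bs4 import BeautifulSoup

url = "http://www.marketwatch.com/newsviewer/mktwheadlines"
params = {
    "blogs": "true",
    "commentary": "true",
    "docId": "1275261016",
    "premium": "true",
    "pullCount": "100",
    "pulse": "true",
    "rtheadlines": "true",
    "topic": "All Topics",
    "topstories": "true",
    "video": "true",
}

response = requests.get(url, params=params)
soup = BeautifulSoup(response.text, "html.parser")

# assumption: article links come back as plain <a href="..."> tags in the response
for a in soup.find_all("a", href=True):
    print(a["href"])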