I'm building a web app that searches for all the shoe sizes that are in stock for each model of shoe.
So for example, for a website having a list of shoes:
http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522
I'll need to go inside each link to scrape this information.
Is there any way I can effectively do this with Scrapy (or something else)? Or is it impossible to do it?
It is possible and it is one of Scrapy's core functionalities.
For example, to scrape every shoe on this site, you would do the following:
In your spider, set start_urls = ['http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522']
Then in your parse(self, response) method, your code should look like this:
for shoe_url in response.xpath('<ENTER_THE_XPATH>').extract():
    yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)
and in the parse_shoe method, which we registered as the callback in the for loop, you should extract all the information you need.
What happens here is that the spider starts by crawling the URL in start_urls, and then for every url matched by the XPath we specified it calls the parse_shoe callback, where you can simply extract the shoe sizes.
You can also follow the "Following links" part of the tutorial on Scrapy's main site - it is very clear.
For completeness I looked for the right XPath for you on that page; it should be '//ul[@class="medium-3 columns product-list product-grid"]//a/@href'
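Putting this together, a minimal sketch of such a spider could look like the following (the name and size selectors in parse_shoe are assumptions that need to be checked against the real product pages):

import scrapy


class ShoeSpider(scrapy.Spider):
    name = "shoes"
    start_urls = [
        "http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*"
        "&facet=ads_f40502_ntk_cs%253A%2522Nike%2522"
    ]

    def parse(self, response):
        # follow every product link found in the listing grid
        for shoe_url in response.xpath(
            '//ul[@class="medium-3 columns product-list product-grid"]//a/@href'
        ).extract():
            yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)

    def parse_shoe(self, response):
        # hypothetical selectors: adjust them to the real markup of the product page
        yield {
            "name": response.xpath("//h1/text()").extract_first(),
            "sizes": response.xpath('//select[@id="size"]/option/text()').extract(),
        }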
I am trying to scrape soccer players’ data using python’s Scrapy package. The website I’m scraping has the format
https://www.example.com/players — I’ll refer to it as “Homepage”
Here, there is a list of players playing in the league. To get to the data I'm looking for starting at the Homepage, I have to click a player's name, which takes me to an "overview" page for that player containing the data I need. To get the data for the second player, I have to go back up to the Homepage, click the name of the second player and scrape the data, then back up to the Homepage again, click the name of the third player, and so on. So how should I go about doing this task in Scrapy? Should I use scrapy.Spider or CrawlSpider? How do I tell Scrapy that I want to go into a specific page (a player's overview page) and back out to the Homepage where the list of all players exists, so I can move on to the next player and repeat the same process? Thank you in advance!
Assuming that the page isn't rendered with JavaScript, Scrapy would be a great tool for this.
I would suggest reading the installation docs and the tutorial to get a general understanding of how it works, where to begin and how to start a new project.
Here is an example of what your spider could look like:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://example.com/homepage"]

    def parse(self, response):
        # placeholder selector: replace it with the XPath/CSS path that selects each player's URL
        for player_url in response.xpath("//a[@class='player-link']/@href").getall():
            yield scrapy.Request(response.urljoin(player_url), callback=self.parse_player)

    def parse_player(self, response):
        # scrape the player data into a dictionary and then yield it as an item (placeholder selector)
        yield {"name": response.xpath("//h1/text()").get()}
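You can then run the spider and export the scraped items to a file with a command along these lines (the output file name is just an example):

scrapy crawl myspider -o players.json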
Installation docs
Scrapy Tutorial
I am trying to crawl a COVID-19 statistics website which has a bunch of links to pages with the statistics for different countries. The links all have a class name that makes them easy to access using CSS selectors ('mt_a'). There is no continuity between the countries, so if you are on the page for one of them, there is no link to go to the next country. I am a complete beginner with Scrapy and I'm not sure what I should do if my goal is to scrape all of the (roughly 200) links listed on the root page for the same few pieces of information. Any guidance on what I should be trying to do would be appreciated.
The link I'm trying to scrape: https://www.worldometers.info/coronavirus/
(scroll down to see country links)
What I would do is create two spiders. One would parse the home page and extract all the specific links to country pages (the href attributes within the anchor tags, e.g. href="country/us/") and then build full urls from these relative links, so that you get a proper url like https://www.worldometers.info/coronavirus/country/us/.
Then the second spider is given the list of all country urls and then goes on to crawl all individual pages and extract information from those.
For example, you get a list of urls from the first spider:
urls = ['https://www.worldometers.info/coronavirus/country/us/',
        'https://www.worldometers.info/coronavirus/country/russia/']
Then in the second spider you give that list to the start_urls attribute.
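A rough sketch of both spiders, assuming the mt_a class mentioned in the question and a placeholder h1 selector for the country pages, could look like this:

import scrapy


class CountryLinkSpider(scrapy.Spider):
    # first spider: collect the full country urls from the home page
    name = "country_links"
    start_urls = ["https://www.worldometers.info/coronavirus/"]

    def parse(self, response):
        for href in response.css("a.mt_a::attr(href)").getall():
            yield {"url": response.urljoin(href)}


class CountryStatsSpider(scrapy.Spider):
    # second spider: visit each country page collected by the first spider
    name = "country_stats"
    start_urls = [
        "https://www.worldometers.info/coronavirus/country/us/",
        "https://www.worldometers.info/coronavirus/country/russia/",
    ]

    def parse(self, response):
        # hypothetical extraction: adjust the selectors to the statistics you need
        yield {
            "country": response.url,
            "title": response.css("h1::text").get(),
        }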
I think others have already answered the question, but here is the page for Link extractors.
I have this project that I'm trying to put together for a data analytics experiment. I have a pipeline in mind but I don't exactly know how to go on about getting the data I need.
I want to crawl a website, find all internal and external links, separate them, and crawl the external links recursively until a certain depth is reached. I want to do this to create a graph of all the connections for a website, and then use centrality algorithms to find the center node and proceed from there.
Ideally, I would like to use python 2 for this project.
I had a look at Scrapy, Beautiful Soup and other libraries, but it is all quite confusing.
Any help and/or advice would be much appreciated, especially on the crawling and on creating the graph.
Thank you
EDIT:
I'm trying to implement the solution you suggested with the code below. I can see in the debug output that it is finding the links, but either they are not being saved in the LinkList class, or I'm extracting them wrong and they are getting filtered.
Any suggestions?
class LinkList(Item):
    url = Field()


class WebcrawlerSpider(CrawlSpider):
    name = 'webcrawler'
    allowed_domains = ['https://www.wehiweb.com']
    start_urls = ['https://www.wehiweb.com']

    rules = (
        Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
    )

    def parse_obj(self, response):
        item = LinkList()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        yield item


def main():
    links = LinkList()
    process = CrawlerProcess()
    process.crawl(WebcrawlerSpider)
    process.start()
    print(links.items())


if __name__ == "__main__":
    main()
Scrapy should work fine for this. Most people use it to extract data from websites (scraping), but it can be used for simple crawling as well.
In scrapy you have spiders that crawl websites and follow links. A scrapy project can consist of many spiders, but in the standard setup each spider will have its own queue and do its own task.
As you described your use case, I would recommend two separate scrapy spiders:
one for onsite crawling, with an allowed_domains setting restricted to this domain and a very high or even 0 (= infinite) DEPTH_LIMIT setting, so that it will crawl the whole domain
one for offsite crawling, with an empty allowed_domains (which allows all domains) and a low DEPTH_LIMIT setting, so that it will stop after a certain number of hops (a minimal sketch of both follows below)
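For illustration, a minimal sketch of how those two spiders could be configured (the domain and spider names are placeholders; DEPTH_LIMIT can also be set globally in settings.py):

import scrapy


class OnsiteSpider(scrapy.Spider):
    name = "onsite"
    allowed_domains = ["example.com"]        # only follow links within this domain
    start_urls = ["https://example.com/"]
    custom_settings = {"DEPTH_LIMIT": 0}     # 0 = no depth limit, crawl the whole domain

    def parse(self, response):
        # the link-extraction logic shown further down goes here
        pass


class OffsiteSpider(scrapy.Spider):
    name = "offsite"
    # no allowed_domains attribute, so all domains are allowed
    start_urls = ["https://example.com/"]
    custom_settings = {"DEPTH_LIMIT": 3}     # stop after a few hops

    def parse(self, response):
        pass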
From your parse method's perspective scrapy has a concept of Request and Item. You can return both Request and Item from the method that parses your response:
requests will trigger scrapy to visit a website and in turn call your parse method on the result
items allow you to specify the results you define for your project
So whenever you want to follow a link you will yield a Request from your parse method. And for all results of your project you will yield Item.
In your case, I'd say that your Item is something like this:
class LinkItem(scrapy.Item):
    link_source = scrapy.Field()
    link_target = scrapy.Field()
This will allow you to return the item link_source="http://example.com/", link_target="http://example.com/subsite" if you are on the page http://example.com/ and find a link to /subsite:
def parse(self, response):
    # Here: code to parse the website (e.g. with scrapy selectors
    # or beautifulsoup, but I think scrapy selectors should suffice).
    # After parsing, you have a list "links".
    for link in links:
        yield scrapy.Request(link)  # make scrapy continue the crawl

        item = LinkItem()
        item['link_source'] = response.url
        item['link_target'] = link
        yield item  # return the result we want (connections in link graph)
You might notice that I did not do any depth checking, etc. You don't have to do this manually in your parse method: Scrapy ships with middleware. One of the middlewares is called OffsiteMiddleware; it checks whether your spider is allowed to visit specific domains (via the allowed_domains option, check the Scrapy tutorials). Another one is DepthMiddleware (also check the tutorials).
These results can be written anywhere you want. Scrapy ships with something called feed exports which allow you to write data to files. If you need something more advanced, e.g. a database, you can look at scrapy's Pipeline.
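For example, the link items can be written straight to a JSON file from the command line (using the webcrawler spider from the question above):

scrapy crawl webcrawler -o links.json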
I currently do not see the need for other libraries and projects apart from scrapy for your data collection.
Of course when you want to work with the data, you might need specialized data structures instead of plain text files.
I'm new to Scrapy but have been using Python for a while. I took lessons from the Scrapy docs along with the XPath selectors. Now, I would like to turn that knowledge into a small project. I'm trying to scrape the job links and the associated info like job title, location, emails (if any) and phone numbers (if any) from the job board https://www.germanystartupjobs.com/ using Scrapy.
I have this starter code,
import scrapy


class GermanSpider(scrapy.Spider):

    # spider name
    name = 'germany'

    # the first page of the website
    start_urls = ['https://www.germanystartupjobs.com/']
    print start_urls

    def parse(self, response):
        pass

    def parse_detail(self, response):
        pass
and I will run the spider with scrapy runspider germany.
Inside the parse function, I would like to get the hrefs, and then the details inside the parse_detail function.
When I open the mentioned page with Chrome developer tools and inspect the listed jobs, I see that all the jobs are inside this ul:
<ul id="job-listing-view" class="job_listings job-listings-table-bordered">
and then the separate jobs are listed in the many inner divs of
<div class="job-info-row-listing-class"> with the associated info; for example, the href is provided inside <a href="https://www.germanystartupjobs.com/job/foodpanda-berlin-germany-2-sem-manager-mf/">
Other divs provide the job title, company name, location etc., such as:
<div>
    <h4 class="job-title-class">
        SEM Manager (m/f)
    </h4>
</div>
<div class="job-company-name">
    <normal>foodpanda<normal> </normal></normal>
</div>
</div>
<div class="location">
    <div class="job-location-class">
        <i class="glyphicon glyphicon-map-marker"></i>
        Berlin, Germany
    </div>
</div>
The first step will be to get the hrefs using the parse function, and then the associated info inside parse_detail using the response. I find that the email and phone number are only provided when you open the links from the hrefs, but the title and location are provided inside the divs of the current page.
As I mentioned, I have okay programming skills in Python, but I struggle with using XPaths even after going through this tutorial. How do I find the links and the associated info? Some sample code with a little explanation would help a lot.
I tried using this code:
# firstly
for element in response.css("job-info-row-listing-class"):
    href = element.xpath('@href').extract()[0]
    print href
    yield scrapy.Request(href, callback=self.parse_detail)

# secondly
values = response.xpath('//div[@class="job-info-row-listing-class"]//a/text()').extract()
for v in values:
    print v

#
values = response.xpath('//ul[@id="job-listing-view"]//div[@class="job-info-row-listing-class"]//a/text()').extract()
They all seem to return nothing so far after running the spider with scrapy runspider germany.
You probably won't be able to extract the information from this site that easily, since the actual job listings are loaded via a POST request.
How do you know this?
Type scrapy shell "https://www.germanystartupjobs.com/" in your terminal of choice. (This opens up the, you guessed it, shell, which is highly recommended when first starting to scrape a website. There you can try out functions, XPaths etc.)
In the shell, type view(response). This opens the response scrapy is getting in your default browser.
When the page has finished loading, you should be able to see, that there are no job listings. This is because they are loaded through a POST-Request.
How do we find out what request it is? (I work with Firebug for FireFox, don't know how it works on Chrome)
Fire up Firebug (e.g. by right-clicking on an element and clicking Inspect with Firebug). This opens up Firebug, which is essentially like the developer tools in Chrome. I prefer it.
Here you can click the Network-Tab. If there is nothing there, reload the page.
Now you should be able to see the request with which the job listings are loaded.
In this case, the request to https://www.germanystartupjobs.com/jm-ajax/get_listings/ returns a JSON object (click JSON) with the HTML code as part of it.
For your spider this means that you will need to tell scrapy to get this request and process the HTML-part of the JSON-object in order to be able to apply your xpaths.
You do this by importing the json module at the top of your spider and then doing something along the lines of:
import json

data = json.loads(response.body)
html = data['html']
selector = scrapy.Selector(text=html, type="html")
For example, if you'd like to extract all the urls from the site and follow them, you'd need to specify the XPath where the urls are found and yield a new request for each url. So basically you're telling scrapy "Look, here is the url, now go and follow it".
An example for an xpath would be:
url = selector.xpath('//a/#href').extract()
So everything inside the brackets is your XPath. You don't need to specify the whole path from ul[@id="job-listing-view"]/ or so; you just need to make sure it is an identifiable path. Here, for example, we only have the urls you want in the a-tags; there are no other a-tags on the site.
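Putting these pieces together, a sketch of such a spider could look like this (the get_listings url is the one observed above; the title selector in parse_detail is an assumption to be verified against the detail pages):

import json

import scrapy


class GermanSpider(scrapy.Spider):
    name = 'germany'
    # request the AJAX endpoint directly instead of the start page
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']

    def parse(self, response):
        # the endpoint returns JSON; the listings' HTML sits in the "html" key
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type="html")
        for href in selector.xpath('//a/@href').extract():
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # hypothetical selector: adjust it to the markup of the job detail page
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }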
This is pretty much the basic stuff.
I strongly recommend that you play around in the shell until you feel you have got the hang of XPaths. Take a site that looks quite simple, without any such requests, and see if you can find any element you want through XPaths.
I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite-scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser's inspector, you can see that it's making a POST request to download the urls of the articles.
Every value is self-explanatory here except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page) and it seems to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So all in all you can actually view the results in your browser, by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article urls.
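For instance, a minimal Scrapy sketch that requests this url and collects the article links might look like this (the li//a XPath is an assumption based on the description above and should be checked against the actual response):

import scrapy


class HeadlinesSpider(scrapy.Spider):
    name = "mktw_headlines"
    start_urls = [
        "http://www.marketwatch.com/newsviewer/mktwheadlines"
        "?blogs=true&commentary=true&docId=1275261016&premium=true"
        "&pullCount=100&pulse=true&rtheadlines=true"
        "&topic=All%20Topics&topstories=true&video=true"
    ]

    def parse(self, response):
        # assumed structure: article links sit in <a> tags inside <li> elements
        for href in response.xpath("//li//a/@href").extract():
            yield {"link": response.urljoin(href)}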
You can mess around more to reverse engineer how the website does it and what each parameter is used for, but this is a good start.