I am trying to scrape soccer players’ data using Python’s Scrapy package. The website I’m scraping has the format
https://www.example.com/players — I’ll refer to it as “Homepage”
Here, there is a list of players playing in the league. To get to the data I’m looking for, starting at the Homepage, I have to click a player’s name, which takes me to an “overview” page for that player containing the data I need. To scrape the second player’s data, I have to go back to the Homepage, click the second player’s name, scrape the data, go back to the Homepage again, click the third player’s name, and so on. How should I go about doing this task in Scrapy? Should I use scrapy.Spider or CrawlSpider? How do I tell Scrapy that I want to go into a specific page (a player’s overview page) and back out to the Homepage, where the list of all players is, so I can move on to the next player and repeat the process? Thank you in advance!
Assuming the page isn't rendered with JavaScript, Scrapy would be a great tool for this.
I would suggest reading the installation docs and the tutorial to get a general understanding of how it works, where to begin and how to start a new project.
Here is an example of what your spider could look like:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://example.com/homepage"]

    def parse(self, response):
        # NOTE: the selector below is a placeholder; replace it with the
        # xpath or css selector that matches the player links on the page.
        for href in response.css("a.player::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_player)

    def parse_player(self, response):
        # Scrape the player data into a dictionary and then yield it as an
        # item, e.g. {"name": ..., "team": ...}
        yield {"url": response.url}
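If you save this to a file, you can try it without creating a full project by using runspider; the output file name here is just an example:
scrapy runspider myspider.py -o players.json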
Installation docs
Scrapy Tutorial
I am trying to crawl a COVID-19 statistics website which has a bunch of links to pages with statistics for different countries. The links all have a class name that makes them easy to access using CSS selectors ('mt_a'). There is no continuity between the countries: if you are on the page for one of them, there is no link to the next country. I am a complete beginner with Scrapy and I'm not sure what I should do if my goal is to scrape all the (roughly 200) links listed on the root page for the same few pieces of information. Any guidance on what I should be trying to do would be appreciated.
The link I'm trying to scrape: https://www.worldometers.info/coronavirus/
(scroll down to see country links)
What I would do is create two spiders. The first one parses the home page and extracts the links to the country pages from the href attributes of the anchor tags, e.g. href="country/us/", and then builds full URLs from these relative links so that you get a proper URL like https://www.worldometers.info/coronavirus/country/us/.
The second spider is then given the list of all country URLs, crawls each individual page, and extracts the information from it.
For example, you get a list of URLs from the first spider:
urls = [
    'https://www.worldometers.info/coronavirus/country/us/',
    'https://www.worldometers.info/coronavirus/country/russia/',
]
Then in the second spider you give that list to the start_urls attribute.
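A minimal sketch of what that first spider could look like, assuming the 'mt_a' class from the question is unique to the country links (the spider name is my own placeholder):
import scrapy


class CountryLinksSpider(scrapy.Spider):
    name = "country_links"
    start_urls = ["https://www.worldometers.info/coronavirus/"]

    def parse(self, response):
        # The country links all carry the 'mt_a' class; their hrefs are
        # relative, so urljoin() turns them into absolute URLs.
        for href in response.css("a.mt_a::attr(href)").getall():
            yield {"country_url": response.urljoin(href)}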
I think others have already answered the question, but here is the page for Link extractors.
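As a small illustration, a link extractor can do the same filtering in one call; restrict_css below again assumes the 'mt_a' class marks the country links:
from scrapy.linkextractors import LinkExtractor

# Given a downloaded response, pull out only the country links
extractor = LinkExtractor(restrict_css="a.mt_a")
country_urls = [link.url for link in extractor.extract_links(response)]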
I have a project that I'm trying to put together for a data analytics experiment. I have a pipeline in mind but I don't exactly know how to go about getting the data I need.
I want to crawl a website and find all internal and external links, separate them, and crawl the external links recursively until a certain depth is reached. I want to do this to create a graph of all the connections for a website, and then use centrality algorithms to find the center node and proceed from there.
Ideally, I would like to use Python 2 for this project.
I had a look at Scrapy, Beautiful Soup and other libraries, but it is all quite confusing.
Any help and/or advice would be much appreciated, especially on crawling and creating the graph.
Thank you
EDIT:
I'm trying to implement the solution you suggested. With the code below I can see in the debug output that the links are being found, but either they are not being saved into the LinkList items or I'm extracting them wrong and they are getting filtered.
Any suggestions?
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LinkList(Item):
    url = Field()


class WebcrawlerSpider(CrawlSpider):
    name = 'webcrawler'
    allowed_domains = ['https://www.wehiweb.com']
    start_urls = ['https://www.wehiweb.com']
    rules = (
        Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
    )

    def parse_obj(self, response):
        item = LinkList()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        yield item


def main():
    links = LinkList()
    process = CrawlerProcess()
    process.crawl(WebcrawlerSpider)
    process.start()
    print(links.items())


if __name__ == "__main__":
    main()
Scrapy should work fine for this. Most people use it to extract data from websites (scraping), but it can be used for simple crawling as well.
In scrapy you have spiders that crawl websites and follow links. A scrapy project can consist of many spiders, but in the standard setup each spider will have its own queue and do its own task.
For the use case you described, I would recommend two separate Scrapy spiders:
one for onsite scraping, with allowed_domains restricted to this single domain and a very high or even 0 (= infinite) DEPTH_LIMIT setting, so that it crawls the whole domain
one for offsite scraping, with an empty allowed_domains (which allows all domains) and a low DEPTH_LIMIT setting, so that it stops after a certain number of hops; a sketch of both spiders follows below
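Here is a minimal sketch of the two spider skeletons, using per-spider custom_settings (names and start URL are placeholders):
import scrapy


class OnsiteSpider(scrapy.Spider):
    name = "onsite"
    allowed_domains = ["example.com"]  # only follow links on this domain
    start_urls = ["https://example.com/"]
    custom_settings = {"DEPTH_LIMIT": 0}  # 0 means no depth limit

    def parse(self, response):
        ...  # parse and yield items/requests here


class OffsiteSpider(scrapy.Spider):
    name = "offsite"
    # no allowed_domains: every domain is allowed
    start_urls = ["https://example.com/"]
    custom_settings = {"DEPTH_LIMIT": 3}  # stop after a few hops

    def parse(self, response):
        ...  # parse and yield items/requests here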
From your parse method's perspective, Scrapy has the concepts of Request and Item. You can return both Requests and Items from the method that parses your response:
requests will trigger Scrapy to visit a page and, in turn, call your parse method on the result
items carry the results you want to collect for your project
So whenever you want to follow a link, you yield a Request from your parse method. And for every result of your project, you yield an Item.
In your case, I'd say that your Item is something like this:
class LinkItem(scrapy.Item):
    link_source = scrapy.Field()
    link_target = scrapy.Field()
This will allow you to return the item link_source="http://example.com/", link_target="http://example.com/subsite" if you are on the page http://example.com/ and find a link to /subsite:
def parse(self, response):
    # Here: code to parse the website (e.g. with scrapy selectors or
    # beautifulsoup, but I think scrapy selectors should suffice).
    # After parsing, you have a list "links":
    links = [response.urljoin(href)
             for href in response.css('a::attr(href)').getall()]
    for link in links:
        yield scrapy.Request(link)  # make scrapy continue the crawl
        item = LinkItem()
        item['link_source'] = response.url
        item['link_target'] = link
        yield item  # return the result we want (connections in link graph)
You might notice that I did not do any depth checking. You don't have to do this manually in your parse method; Scrapy ships with middleware. One of the middlewares is called OffsiteMiddleware and checks whether your spider is allowed to visit specific domains (via the allowed_domains option, check the Scrapy tutorials). Another one is DepthMiddleware (also covered in the tutorials).
These results can be written anywhere you want. Scrapy ships with something called feed exports which allows you to write data to files. If you need something more advanced, e.g. a database, you can look at Scrapy's item pipelines.
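For example, running one of the spiders sketched above like this (the output file name is arbitrary) writes every yielded item to a JSON file via the feed exports:
scrapy crawl onsite -o links.json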
I currently do not see the need for other libraries and projects apart from scrapy for your data collection.
Of course when you want to work with the data, you might need specialized data structures instead of plain text files.
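If you do export the LinkItems to links.json as above, a sketch of the graph step could look like this; networkx is just one option for the centrality part:
import json
import networkx as nx

# Load the edge list produced by the crawl (file name from the example above)
with open("links.json") as f:
    edges = json.load(f)

graph = nx.DiGraph()
for edge in edges:
    graph.add_edge(edge["link_source"], edge["link_target"])

# Rank pages by degree centrality to find the "center" node
centrality = nx.degree_centrality(graph)
print(max(centrality, key=centrality.get))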
I'm building a web app that looks up all the shoe sizes that are in stock for each model of shoe.
So for example, for a website having a list of shoes:
http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522
I'll need to go inside each link to scrape this information.
Is there any way I can effectively do this with Scrapy (or something else)? Or is it impossible to do it?
It is possible and it is one of Scrapy's core functionalities.
For example, to scrape every shoe on this site, you would do the following:
In your spider, set start_urls = ['http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522']
Then in your parse(self, response) method, the code should look like this:
for shoe_url in response.xpath(<ENTER_THE_XPATH>).extract():
    yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)
and in the parse_shoe method, which we registered as the callback in the for loop, you should extract all the information you need.
What happens here is that the spider starts crawling the URL in start_urls, and then for every URL that matches the xpath we specified, it parses that page with the parse_shoe function, where you can simply extract the shoe sizes.
You can also follow the "Following links" part of the tutorial on Scrapy's main site; it is very clear.
For completeness, I looked up the right xpath for you on that page; it should be '*//ul[@class="medium-3 columns product-list product-grid"]//a/@href'
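Putting the pieces together, a minimal version of the whole spider might look like this (parse_shoe only records the URL, since the exact size selectors depend on the page markup):
import scrapy


class ShoeSpider(scrapy.Spider):
    name = "shoes"
    start_urls = ['http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522']

    def parse(self, response):
        xpath = '*//ul[@class="medium-3 columns product-list product-grid"]//a/@href'
        for shoe_url in response.xpath(xpath).extract():
            yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)

    def parse_shoe(self, response):
        # Replace this with selectors for the sizes that are in stock
        yield {"url": response.url}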
I am using Scrapy to scrape the following website's posts. I have written code that gives me the max_id, i.e. the latest post number. For example, for http://papa-gen.livejournal.com/: if I have the max_id, in theory I can create a for loop from 1 through max_id and scrape every post.
The problem is that there are not nearly as many posts as the max_id suggests. For example, the max_id for the website above is 2870789 for the post from December 17th, but the post from December 16th has the number 2870614, a difference of 175. If I loop through all 2870789 numbers I will reach every post, but the code will of course not be very efficient. My idea is instead to follow the previous and forward buttons on the website from my Python code and loop in that manner.
Could someone explain how I could accomplish this using Scrapy?
Scrapy has extensive documentation. There is an example of using the CrawlSpider class to accomplish what you're describing, which you can modify to look something like this...
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'livejournal.com'
    allowed_domains = ['livejournal.com']
    start_urls = ['http://www.papa-gen.livejournal.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=('skip=', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # here is where the parsing happens
        pass
The basic idea is to specify rules that match the links. Scrapy will add them to the list of URLs to visit and then call the callback function with the page data once each URL is fetched.
I am trying to scrape some forums with Scrapy and store the data in a database, but I don't know how to do it efficiently when it comes to updating the database. This is what my spider looks like:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class ForumSpider(CrawlSpider):
    name = "forum"
    allowed_domains = ["forums.example.com"]
    start_urls = ["forums.example.com/index.php"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/forum?id=\d+',)),
             follow=True, callback='parse_index'),
    )

    def parse_index(self, response):
        hxs = HtmlXPathSelector(response)
        # parsing....looking for threads.....
        # pass the data to pipeline and store in to the db....
        pass
My problem is that when I scrape the same forum again, say a week later, there is no point in going through all the pages, because any threads with new posts will be on top of the inactive threads. My idea is to check the first page of a forum (forums.example.com/forum?id=1): if it finds a thread with the same URL and the same number of replies on page one, there is no point in going to the second page, and the spider should proceed to the next forum (forums.example.com/forum?id=2). I tried modifying start_urls and rules, but they did not seem to respond once the spider was running. Is there a way to do this in Scrapy?
My second problem is how to use different pipelines for different spiders. I found something on Stack Overflow, but it seems like Scrapy isn't built to do this; it seems you are supposed to create a new project for each site.
Am I using the wrong tool for this, or am I missing something? I thought about using mechanize and lxml instead, but then I would need to implement the twisted integration, unicode handling and so on myself, which makes me want to stick with Scrapy.
Thanks
What you are asking for is to create HTTP requests on the fly.
Inside the parse_index function, do this:
request = self.make_requests_from_url('http://forums.example.com/forum?id=2')
return request
If you want to submit multiple HTTP requests, return a list of them.
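For example, a sketch (the forum ids here are made up):
def parse_index(self, response):
    # ...after checking page one of the current forum...
    # queue several other forums in one go by returning a list of requests:
    return [self.make_requests_from_url(
                'http://forums.example.com/forum?id=%d' % forum_id)
            for forum_id in (2, 3, 4)]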
See the documentation for Request in Scrapy.
You are right about the second thing: you are supposed to write different spiders if you want to extract different types of data from different websites.