I am trying to scrape some forums with Scrapy and store the data in a database, but I don't know how to do it efficiently when it comes to updating the database. This is what my spider looks like:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ForumSpider(CrawlSpider):
    name = "forum"
    allowed_domains = ["forums.example.com"]
    start_urls = ["forums.example.com/index.php"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/forum?id=\d+',)),
             follow=True, callback='parse_index'),
    )

    def parse_index(self, response):
        hxs = HtmlXPathSelector(response)
        # parsing... looking for threads...
        # pass the data to the pipeline and store it in the db...
My problem is that when I scrape the same forum again, say a week later, there is no point in going through all the pages, because new threads, or any threads with new posts, will sit on top of the inactive ones. My idea is to check the first page of a forum (forums.example.com/forum?id=1): if it finds a thread with the same URL and the same number of replies on page one, there is no point in going to the second page, and the spider should proceed to another forum (forums.example.com/forum?id=2). I tried modifying start_urls and rules, but they don't seem to have any effect once the spider is running. Is there a way to do this in Scrapy?
My second problem is how to use a different pipeline for different spiders. I found something on Stack Overflow, but it seems like Scrapy isn't built for this; it seems like you are supposed to create a new project for different sites.
Am I using the wrong tool for this, or am I missing something? I thought about using mechanize and lxml instead, but then I would have to handle Twisted, unicode and so on myself, which makes me want to stick with Scrapy.
Thanks
What you are asking for is to create HTTP requests on the fly.
Inside the parse_index function, do this:
request = self.make_requests_from_url('http://forums.example.com/forum?id=2')
return request
If you want to submit multiple HTTP requests, return a list of requests.
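For example, to check the first page and then move on to other forums instead of paginating, something along these lines (the forum ids are placeholders, and note that make_requests_from_url is deprecated in recent Scrapy versions in favour of building scrapy.Request objects directly):

def parse_index(self, response):
    # ...check the threads on this index page...
    # if nothing new was found, jump to other forums instead of paginating
    urls = ['http://forums.example.com/forum?id=2',
            'http://forums.example.com/forum?id=3']
    return [self.make_requests_from_url(url) for url in urls]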
See Request in the Scrapy documentation.
You are right about the second thing: you are supposed to write different spiders if you want to extract different types of data from different websites.
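That said, if you do want a different pipeline per spider within a single project, one common approach (just a sketch; ForumPipeline and the myproject path are made-up names) is to override ITEM_PIPELINES in the spider's custom_settings:

from scrapy.spiders import CrawlSpider

class ForumSpider(CrawlSpider):
    name = "forum"
    # only this spider will use ForumPipeline (a hypothetical pipeline class
    # defined in myproject/pipelines.py); other spiders keep the project defaults
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.ForumPipeline': 300},
    }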
I'm using Scrapy (on PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in a list format, and save the results in a CSV file.
I tried the following code but it returns empty brackets []. After inspecting the HTML code I discovered that the content is rendered with AngularJS.
If someone has a solution for that it would be great
Thank you
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        yield {'productnames': product}
You won't be able to get the desired products by parsing the HTML. The page is heavily JavaScript-driven, so Scrapy won't see that content in the raw response.
The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website generates the product details via an API; if we can mimic the request the browser makes to obtain that product information, we can get the data in a nice, neat format.
First you have to set ROBOTSTXT_OBEY = False in settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
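For example, in settings.py (the DOWNLOAD_DELAY line is just my suggestion for slowing the crawl down, not something the API requires):

# settings.py
ROBOTSTXT_OBEY = False
# optional: throttle requests to reduce the chance of getting banned (my addition)
DOWNLOAD_DELAY = 1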
Code Example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['woolworths.com.au']
    # extra parameters passed along with the request via meta (see explanation below)
    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers',
    }

    def start_requests(self):
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        data = response.json()
        for a in data:
            yield {
                'name': a['Name'],
            }
Explanation
We start off with our defined URL in start_requests. This URL is the specific API URL Woolworths uses to obtain the information for iced tea. For any other part of the site, the portion of the URL after /products/ will be specific to that part of the website.
The reason we're doing this is that driving a browser is slow and brittle. Hitting the API directly is fast, and the information it returns is usually highly structured and much better suited to scraping.
So how do we get the URL, you may be asking? You need to inspect the page and find the correct request. Open the network tab of the developer tools and reload the website; you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking it and then clicking Preview gives you a panel on the right-hand side with all the details of the products.
In that preview you can see the product data. We can then grab the request URL and anything else we need from this request.
I will often copy this request as cURL and paste it into curl.trillworks.com, which converts the cURL command to Python, giving you nicely formatted headers and any other data needed to mimic the request.
Playing around with this in Jupyter, it turns out you actually only need the params, not the headers, which is much better.
So back to the code. We make a request, using the meta argument to pass on the data (remember, because it's defined outside the function we have to refer to it as self.data), and specifying parse as the callback.
We can use the response.json() method to convert the JSON body into a list of Python dictionaries, one per product. You must have Scrapy 2.2 or later to use this method. Otherwise you could use data = json.loads(response.text), but you'll have to add import json at the top of the script.
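For older Scrapy versions the fallback would look something like this:

import json

def parse(self, response):
    # same as response.json(), but works on Scrapy versions before 2.2
    data = json.loads(response.text)
    for a in data:
        yield {'name': a['Name']}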
From the preview, and from playing about with the JSON in requests, we can see that these dictionaries sit inside a list, so we can use a for loop to go round each product, which is what we are doing here.
We then yield a dictionary to extract the data: a refers to each product, which is its own dictionary, and a['Name'] looks up the key 'Name' in that dictionary to give us the product name. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get at the data I want.
The only thing left to do is run scrapy crawl test -o products.csv to output this to a CSV file.
I can't really help you more than this until you specify what other data you want from this page. Please remember that you're going against what the site wants you to do, and for any other page on the website you will need to find the specific API URL that serves those products. I have shown you the way to do this; if you want to automate it, it's worth struggling with it yourself. We are here to help, but attempting it on your part is how you're going to learn.
Additional Information on Approaches to Dynamic Content
There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-driven websites. The default should be to try to re-engineer the requests the browser makes to load the page's information. That is what the JavaScript on this site (and many others) is doing: it makes HTTP requests to display new information without reloading the page. If we can mimic those requests, we can get the information we want. This is the most efficient way to get dynamic content.
In order of preference:
1. Re-engineering the HTTP requests
2. scrapy-splash
3. scrapy-selenium
4. Importing the selenium package into your scripts
Scrapy-splash is slightly better than the selenium package, as it pre-renders the page, giving you access to the selectors with the data already populated. Selenium is slow and prone to errors, but it will let you mimic browser activity.
There are multiple ways to include selenium in your scripts; see below for an overview.
Recommended Reading/Research
Look at the scrapy documentation with regard to dynamic content here
This will give you an overview of the steps for handling dynamic content. Generally speaking, selenium should be thought of as a last resort; it's pretty inefficient for larger-scale scraping.
If you are considering adding the selenium package into your script, that might be the lowest barrier to entry for getting it working, but it's not necessarily efficient. At the end of the day Scrapy is a framework, but there is a lot of flexibility in adding third-party packages. The spider scripts are just Python classes, with the Scrapy architecture running in the background. As long as you're mindful of the response and translate some of the selenium calls to work with Scrapy, you should be able to drop selenium into your scripts. I would say this solution is probably the least efficient, though.
Consider using scrapy-splash: Splash pre-renders the page and lets you add JavaScript execution. The docs are here, and there's a good article from Scrapinghub here.
Scrapy-selenium is a package with a custom Scrapy downloader middleware that allows you to perform selenium actions and execute JavaScript. The docs are here; you'll need to play around to work out the procedure from them, as they don't have the same level of detail as the selenium package itself.
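As a rough sketch of the scrapy-selenium setup (the settings names below come from the scrapy-selenium README, so double-check them against the current docs; the driver path, URL and selector are placeholders):

# settings.py
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'   # placeholder path
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

# spider
import scrapy
from scrapy_selenium import SeleniumRequest

class RenderedSpider(scrapy.Spider):
    name = 'rendered_example'

    def start_requests(self):
        # SeleniumRequest renders the page in a real browser before your callback runs
        yield SeleniumRequest(url='https://www.example.com', callback=self.parse)

    def parse(self, response):
        # the response now contains the JavaScript-rendered HTML
        yield {'title': response.css('title::text').get()}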
I am quite new to Scrapy, but I am designing a web scrape to pull certain information from GoFundMe, specifically, in this case, the number of people who have donated to a project. I have written an XPath expression which works fine in Chrome but returns null in Scrapy.
A random example project is https://www.gofundme.com/f/passage/donations, which at present has 22 donations. The expression below, when entered in Chrome's inspector, gives me "Donations(22)", which is what I need:
//h2[@class="heading-5 mb0"]/text()
However, in my Scrapy spider the following yields null:
import scrapy

class DonationsSpider(scrapy.Spider):
    name = 'get_donations'
    start_urls = [
        'https://www.gofundme.com/f/passage/donations'
    ]

    def parse(self, response):
        amount_of_donations = response.xpath('//h2[@class="heading-5 mb0"]/text()').extract_first()
        yield {
            'Donations': amount_of_donations
        }
Does anyone know why Scrapy is unable to see this value?
I am doing this in an attempt to find out how many times the rest of the spider needs to loop, as when I hard code this value it works with no problems and yields all of the donations.
That's because many requests are going on behind the scenes to fulfil "https://www.gofundme.com/f/passage/donations". Your Chrome is smart enough to understand JavaScript: it runs the JavaScript code and fetches responses from several different endpoints to build the page. One of those is a request to the endpoint "https://gateway.gofundme.com/web-gateway/v1/feed/passage/counts", which loads the data you're looking for. Your Python script can't do that, and it isn't the recommended approach anyway.
Instead you can call that API directly and you'll get the data. The good news is that the endpoint responds with JSON, which is well structured and easy to parse.
I'm sure you're also interested in the data that comes from this endpoint: "https://gateway.gofundme.com/web-gateway/v1/feed/passage/donations?limit=20&offset=0&sort=recent"
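A minimal sketch of calling the counts endpoint directly (I haven't inspected the exact JSON layout, so the yielded structure is just the raw parsed body; dig into the real response to pick out the fields you need):

import scrapy

class DonationCountsSpider(scrapy.Spider):
    name = 'donation_counts'
    # the counts endpoint mentioned above
    start_urls = ['https://gateway.gofundme.com/web-gateway/v1/feed/passage/counts']

    def parse(self, response):
        data = response.json()  # Scrapy >= 2.2; otherwise use json.loads(response.text)
        yield {'counts': data}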
I have a project that I'm trying to put together for a data analytics experiment. I have a pipeline in mind, but I don't know exactly how to go about getting the data I need.
I want to crawl a website, find all internal and external links, separate them, and crawl the external links recursively until a certain depth is reached. I want to do this to create a graph of all the connections for a website, and then use centrality algorithms to find the center node and proceed from there.
Ideally, I would like to use python 2 for this project.
I had a look at scrapy, beautiful soup and other libraries but it is all quite confusing.
Any help and/or advice would be much appreciated on crawling and creating the graph especially
Thank you
EDIT:
I'm trying to implement the solution you suggested, and with the code below I can see in the debug output that it is finding the links, but either they are not being saved in the LinkList items or I'm extracting them wrong and they are getting filtered.
Any suggestions?
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.crawler import CrawlerProcess

class LinkList(Item):
    url = Field()

class WebcrawlerSpider(CrawlSpider):
    name = 'webcrawler'
    allowed_domains = ['https://www.wehiweb.com']
    start_urls = ['https://www.wehiweb.com']

    rules = (
        Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
    )

    def parse_obj(self, response):
        item = LinkList()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        yield item

def main():
    links = LinkList()
    process = CrawlerProcess()
    process.crawl(WebcrawlerSpider)
    process.start()
    print(links.items())

if __name__ == "__main__":
    main()
Scrapy should work fine for this. Most people use it to extract data from websites (scraping), but it can be used for simple crawling as well.
In scrapy you have spiders that crawl websites and follow links. A scrapy project can consist of many spiders, but in the standard setup each spider will have its own queue and do its own task.
As you described your use case, I would recommend two separate scrapy spiders:
one for onsite scraping, with an allowed_domains setting restricted to this domain and a very high or even 0 (= infinite) DEPTH_LIMIT setting, so that it will crawl the whole domain
one for offsite scraping, with an empty allowed_domains (which allows all domains) and a low DEPTH_LIMIT setting, so that it will stop after a certain number of hops (see the sketch after this list)
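A minimal sketch of those two configurations (the spider names and the parse_obj callback are placeholders; DEPTH_LIMIT is Scrapy's built-in depth setting, where 0 means unlimited):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class OnsiteSpider(CrawlSpider):
    """Crawls only the target domain, with no depth limit."""
    name = 'onsite'
    allowed_domains = ['wehiweb.com']        # offsite requests are dropped
    start_urls = ['https://www.wehiweb.com']
    custom_settings = {'DEPTH_LIMIT': 0}     # 0 = unlimited depth
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        yield {'url': response.url}

class OffsiteSpider(CrawlSpider):
    """Follows links to any domain, but only a few hops deep."""
    name = 'offsite'
    # no allowed_domains attribute -> all domains are allowed
    start_urls = ['https://www.wehiweb.com']
    custom_settings = {'DEPTH_LIMIT': 2}     # stop after two hops
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        yield {'url': response.url}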
From your parse method's perspective, Scrapy has the concepts of Request and Item. You can return both Requests and Items from the method that parses your response:
requests will trigger scrapy to visit a website and in turn call your parse method on the result
items allow you to specify the results you define for your project
So whenever you want to follow a link you will yield a Request from your parse method. And for all results of your project you will yield Item.
In your case, I'd say that your Item is something like this:
class LinkItem(scrapy.Item):
    link_source = scrapy.Field()
    link_target = scrapy.Field()
This will allow you to return the item link_source="http://example.com/", link_target="http://example.com/subsite" if you are on page http://example.com/ and found a link to /subsite:
def parse(self, response):
    # Code to parse the website: scrapy selectors should suffice
    # (you could also use beautifulsoup here).
    # After parsing, you have a list "links":
    links = [response.urljoin(href) for href in response.css('a::attr(href)').getall()]
    for link in links:
        yield Request(link)  # make scrapy continue the crawl
        item = LinkItem()
        item['link_source'] = response.url
        item['link_target'] = link
        yield item  # return the result we want (connections in the link graph)
You might notice that I did not do any depth checking. You don't have to do this manually in your parse method; Scrapy ships with middleware for it. One of the middlewares is called OffsiteMiddleware: it checks whether your spider is allowed to visit specific domains (via the allowed_domains option, see the Scrapy tutorials). Another one is DepthMiddleware (also covered in the tutorials).
The results can be written anywhere you want. Scrapy ships with something called feed exports, which let you write data to files. If you need something more advanced, e.g. a database, look at Scrapy's item pipelines.
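For example, a feed export is just a command-line option (the output filename is arbitrary):

scrapy crawl webcrawler -o links.json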
I currently do not see the need for other libraries and projects apart from scrapy for your data collection.
Of course when you want to work with the data, you might need specialized data structures instead of plain text files.
So I want to scrape articles from a site that has pagination. Basically, every page is a list of article links, and the spider follows those links in a parse_article method, as well as following the successive next-page links. However, is there a way to make this stop after a given number of articles have been scraped? For example, this is what I have so far using a CrawlSpider:
rules = (
    # next page rule:
    Rule(LinkExtractor(restrict_xpaths="//a[@class='next']"), follow=True),
    # Extract all internal links which follow this regex:
    Rule(LinkExtractor(allow=('REGEXHERE',), deny=()), callback='parse_article'),
)

def parse_article(self, response):
    # do parsing stuff here
I want to stop following the next page once I've parsed 150 articles. It doesn't matter if I scrape a little more than 150; I just want to stop going to the next page once I've hit that number. Is there any way to do that? Something like having a counter in the parse_article method? I'm just new to Scrapy, so I'm not sure what to try. I looked into DEPTH_LIMIT, but I'm not sure that's what I'm looking for.
Any help would be greatly appreciated, thanks!
You could achieve that by setting:
CLOSESPIDER_ITEMCOUNT = 150
In your project settings.
If you have multiple spiders in your project and only want a particular one to be affected by this setting, set it in the custom_settings class variable:
custom_settings = { 'CLOSESPIDER_ITEMCOUNT': 150 }
The approach I take in my spiders is to have a donescraping flag, check it first thing in each of my parse_* functions, and return an empty result if it's set.
This adds the graceful behaviour of allowing items and URLs already in the download queue to finish processing while not fetching any more items.
I've never used CLOSESPIDER_ITEMCOUNT, so I don't know whether it closes the spider that gracefully; I expect it does not.
At the beginning of every parse function:
# early exit if done scraping
if self.donescraping:
    return None
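A fuller sketch of that pattern, combining the flag with a counter (the URLs, selectors and the 150 limit are placeholders matching the question):

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://example.com/articles']   # placeholder
    donescraping = False
    article_count = 0

    def parse(self, response):
        # early exit if done scraping
        if self.donescraping:
            return
        for href in response.css('a.article::attr(href)').getall():   # placeholder selector
            yield response.follow(href, callback=self.parse_article)
        next_page = response.css('a.next::attr(href)').get()
        if next_page and not self.donescraping:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        if self.donescraping:
            return
        self.article_count += 1
        if self.article_count >= 150:
            self.donescraping = True
        # do parsing stuff here and yield items
        yield {'url': response.url}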
My query is about the CrawlSpider.
I understand that the link extractor rules attribute is a class-level (static) variable. Can I change the rules at runtime, say like this:
@classmethod
def set_rules(cls, rules):
    cls.rules = rules
by
self.set_rules(rules)
Is this acceptable practice for the CrawlSpider? If not, please suggest an appropriate method.
My use case:
I'm using Scrapy to crawl certain categories A, B, C, ..., Z of a particular website. Each category has 1000 links spread over 10 pages.
When Scrapy hits a link in some category that is "too old", I'd like the crawler to stop following/crawling the remainder of the 10 pages ONLY for that category, hence my requirement for dynamic rule changes.
Please point me in the right direction.
Thanks!
The rules in a spider aren't meant to be changed dynamically; they are compiled when the CrawlSpider is instantiated. You could always change spider.rules and re-run spider._compile_rules(), but I advise against it.
The rules create a set of instructions for the crawler about what to queue up to crawl (i.e. which Requests to enqueue). Those requests aren't revisited and re-evaluated before they are dispatched, because the rules weren't designed to change. So even if you did change the rules dynamically, you may still end up making a bunch of requests you didn't intend to, and still crawl a bunch of content you didn't mean to.
For instance, if your target site is set up so that the page for "Category A" contains links to pages 1 to 10 of "Category A", then Scrapy will queue up requests for all 10 of those pages. If page 2 turns out to have entries that are "too old", changing the rules achieves nothing, because the requests for pages 3-10 are already queued to go.
As @imx51 said, it would be much better to write a downloader middleware. It can drop each request that you no longer want to make, since it is triggered for every request before it is downloaded.
I would suggest that you write your own custom downloader middleware. This would allow you to filter out the requests that you no longer want to make.
You can find further details about the architecture overview of Scrapy here: http://doc.scrapy.org/en/master/topics/architecture.html
And details about downloader middleware, and how to write your own, here: http://doc.scrapy.org/en/master/topics/downloader-middleware.html
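A rough sketch of such a downloader middleware (the closed_categories attribute and the 'category' meta key are conventions I'm making up here: your spider would set request.meta['category'] when generating requests and add a category to spider.closed_categories once it sees a thread that is too old):

from scrapy.exceptions import IgnoreRequest

class SkipOldCategoryMiddleware:
    """Drops queued requests for categories the spider has flagged as too old."""

    def process_request(self, request, spider):
        category = request.meta.get('category')
        closed = getattr(spider, 'closed_categories', set())
        if category is not None and category in closed:
            # dropping the request here means pages 3-10 of a closed category
            # are never downloaded, even though they were already queued
            raise IgnoreRequest(f'category {category!r} marked as too old')
        return None  # let every other request through unchanged

# enable it in settings.py (the module path is a placeholder):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SkipOldCategoryMiddleware': 543}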