Is there any Scrapy module available to build referrer chains while crawling URLs?
Let's say, for instance, I start my crawl from http://www.example.com and move to http://www.new-example.com, and then from http://www.new-example.com to http://very-new-example.com.
Can I create URL chains (a CSV or JSON file) like this:
http://www.example.com, http://www.new-example.com
http://www.example.com, http://www.new-example.com, http://very-new-example.com
and so on. If there's no module or implementation available at the moment, what other options can I try?
Yes, you can keep track of referrals by keeping a list on the spider that is accessible from all of its callback methods, for example:

referral_url_list = []

def call_back1(self, response):
    self.referral_url_list.append(response.url)

def call_back2(self, response):
    self.referral_url_list.append(response.url)

def call_back3(self, response):
    self.referral_url_list.append(response.url)

After the spider finishes, which you can detect with spider signals, you can write the CSV or JSON file in the signal handler.
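A minimal sketch of that idea, assuming a spider that hops from page to page through two callbacks; the chain is accumulated in a list shared by all callbacks and dumped to a JSON file when the crawl ends (the closed method below is Scrapy's shortcut for the spider_closed signal; the output file name and the link selector are placeholders):

import json
import scrapy

class ReferralChainSpider(scrapy.Spider):
    name = "referral_chain"
    start_urls = ["http://www.example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.referral_url_list = []  # shared by every callback

    def parse(self, response):
        self.referral_url_list.append(response.url)
        next_url = response.css("a::attr(href)").get()  # placeholder selector
        if next_url:
            yield response.follow(next_url, callback=self.parse_next)

    def parse_next(self, response):
        self.referral_url_list.append(response.url)

    def closed(self, reason):
        # runs on the spider_closed signal: persist the accumulated chain
        with open("referral_chain.json", "w") as f:
            json.dump(self.referral_url_list, f)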
I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I'm logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" no longer opens the website; it only does so if I put it after parse or parse2). My understanding is that I chain multiple "parse" functions one after another to navigate through the website.
Datacamp says I always need to start with a "start request function", which is what I tried, but in the YouTube videos etc. I saw that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open each link in order to see the main page that it refers to.
In order to achieve this, first identify what you need to scrape and mark CSS or XPath selectors for those links, then put them in a rule. Every rule defaults to the parse callback, or you can assign it to some other method. I am attaching a dummy example of creating rules, and you can map it to your case (a fuller sketch follows the snippet):
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
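A fuller sketch of how such rules sit inside a CrawlSpider, assuming hypothetical CSS selectors for the header links and the product links (adjust them to the real page):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HeaderLinksSpider(CrawlSpider):
    name = "header_links"
    start_urls = ["http://www.example.com"]

    rules = (
        # follow every link found in the (hypothetical) header navigation
        Rule(LinkExtractor(restrict_css=["nav.header"])),
        # open the product links and hand each page to parse_item
        Rule(LinkExtractor(restrict_css=["div.products"]), callback="parse_item"),
    )

    def parse_item(self, response):
        # placeholder extraction; replace with the fields you actually need
        yield {"url": response.url, "title": response.css("title::text").get()}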
I am making a spider which will crawl the entire site on the first run and store the data in my database.
But I will keep running this spider on a weekly basis to get updates from the crawled site into my database, and I don't want Scrapy to crawl the pages which are already present in my database. How do I achieve this? I have made two plans:
1] Make a crawler that fetches the entire site and somehow stores the first fetched URL in a CSV file, then keeps following the next pages. Then make another crawler that starts fetching backwards, meaning it takes its input from the URL in the CSV and keeps running until prev_page exists. This way I will get the data, but the URL in the CSV will be crawled twice.
2] Make a crawler that checks a condition: if the data is already in the database, then stop. Is that possible? This would be the most productive way, but I can't find a way to do it. Maybe making log files might help in some way?
Update
The site is a blog which updates frequently and is sorted with the latest post on top.
Something like this:

from scrapy import Spider
from scrapy.http import Request, FormRequest

class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class = "post-inner post-hover"]/h2/a/@href').extract()
        for url in urls:
            if never_visited(url, database):
                yield Request(url, callback=self.parse_foo)

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        save_url(response.request.url, database)
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()

        yield {
            'name': name,
            'info': info
        }
You just need to implement the never_visited and save_url functions.
never_visited checks in your database whether the URL is already there; save_url adds the URL to your database. A sketch of both follows.
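A minimal sketch of those two helpers, assuming the "database" passed around in the spider is simply a sqlite3 connection (the file, table, and column names are placeholders):

import sqlite3

# open (or create) the database and the table that remembers crawled URLs
database = sqlite3.connect("crawled.db")
database.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def never_visited(url, db):
    # True if the URL has not been stored by a previous run
    return db.execute("SELECT 1 FROM visited WHERE url = ?", (url,)).fetchone() is None

def save_url(url, db):
    # remember the URL so the next weekly run skips it
    db.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    db.commit()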
I am new to Scrapy and Python, so I am a beginner. I want Scrapy to read a text file with a seed list of around 100k URLs, visit each URL, extract all external URLs (URLs of other sites) found on each of those seed URLs, and export the results to a separate text file.
Scrapy should only visit the URLs in the text file, not spider out and follow any other URL.
I want Scrapy to work as fast as possible; I have a very powerful server with a 1 Gbps line. Each URL in my list is from a unique domain, so I won't be hitting any one site hard at all and thus won't be encountering IP blocks.
How would I go about creating a project in Scrapy to be able to extract all external links from a list of URLs stored in a text file?
Thanks.
You should use:
1. the start_requests method for reading the list of URLs;
2. a CSS or XPath selector for all "a" HTML elements.
from scrapy import Spider, Request, Item, Field

class YourItem(Item):
    parent_url = Field()
    child_urls = Field()

class YourSpider(Spider):
    name = "your_spider"

    def start_requests(self):
        with open('your_input.txt', 'r') as f:  # read the list of urls
            for url in f.readlines():  # process each of them
                yield Request(url.strip(), callback=self.parse)

    def parse(self, response):
        item = YourItem(parent_url=response.url)
        item['child_urls'] = response.css('a::attr(href)').extract()
        return item
More info about start_requests here:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
For exporting scraped items to another file, use an Item Pipeline or a Feed Export. A basic pipeline example is here:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file
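For reference, a minimal pipeline along the lines of that documentation example, writing one JSON object per line to a file (the file name is arbitrary, and the pipeline still has to be enabled in the ITEM_PIPELINES setting):

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('child_urls.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line, so the output is easy to process later
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Alternatively, the feed exports can do this without any pipeline code, e.g. scrapy crawl your_spider -o child_urls.jl on the command line.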
These days I'm making a spider with Scrapy in Python.
It's basically a simple spider class that does simple parsing of some fields in an HTML page.
I don't use the start_urls[] Scrapy field, but I use a personalized list like this:
class start_urls_mod():
    def __init__(self, url, data):
        self.url = url
        self.data = data

# Defined in the class:
url_to_scrape = []

# Populated in the body in this way
self.url_to_scrape.append(start_urls_mod(url_found, str(data_found)))
passing the URLs in this way:

for any_url in self.url_to_scrape:
    yield scrapy.Request(any_url.url, callback=self.parse_page)
It works well with a limited number of URLs, around 3000.
But when I run a test it finds about 32532 URLs to scrape, and in the JSON output file I find only about 3000 scraped URLs.
My function recalls itself:

yield scrapy.Request(any_url.url, callback=self.parse_page)

So the question is: is there some memory limit for Scrapy items?
No, not unless you have specified CLOSESPIDER_ITEMCOUNT in your settings.
Maybe Scrapy is finding duplicates in your requests; please check whether the stats in your logs contain something like dupefilter/filtered.
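If the duplicate filter turns out to be dropping those requests and you really want every one of them crawled, the usual workaround is to pass dont_filter=True when yielding them; a minimal sketch against the loop from the question:

for any_url in self.url_to_scrape:
    # dont_filter=True keeps Scrapy's dupefilter from discarding this request
    yield scrapy.Request(any_url.url, callback=self.parse_page, dont_filter=True)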
I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a single crawl spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method where you extract the links, yield a new request for each link, like:

yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")

self.make_requests_from_url is already implemented in Spider.
An example of such a spider:
from scrapy import Spider
from scrapy.selector import Selector

class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            user_name = Selector(text=response.text).xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name)  # MyItem is assumed to be defined elsewhere
        except Exception:
            pass
You can handle the other requests using a different parsing function. Do that by returning a Request object and specifying the callback explicitly (the self.make_requests_from_url function calls the parse function by default):

Request(url=url, callback=self.parse_user_page)
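Put together, a minimal sketch of a single spider that replaces the two-spider setup: one callback collects the detail-page links from the home page (Spider A's job) and another scrapes each detail page (Spider B's job). The selectors and the yielded fields are placeholders, since the question doesn't show the actual page structure:

import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # Spider A's job: collect the individual detail-page links
        for href in response.css("a::attr(href)").extract():
            if "Home/Detail?id=" in href:
                yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Spider B's job: scrape the individual page
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }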