Scrapy - How to keep track of start url - python

Given a pool of start URLs, I would like to identify in the parse_item() function which of those URLs a given response originated from.
As far as I can tell, Scrapy spiders start crawling from the initial pool of start URLs, but when parsing there is no trace of which of those URLs was the original one. How would it be possible to keep track of the starting point?

If you need the URL being parsed inside the spider, just use response.url:

def parse_item(self, response):
    print(response.url)

but in case you need it outside the spider, I can think of the following ways:
Use the Scrapy core API.
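For example, a minimal sketch using CrawlerProcess (this assumes it is run from inside the Scrapy project so the spider can be looked up by its name; the spider name and URLs are illustrative):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# pass the start urls to the spider as a constructor argument, like the -a option below
process.crawl('myspider', myurls='url1,url2')
process.start()  # blocks until the crawl is finished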
You can also call Scrapy from an external Python module with an OS command (which apparently is not recommended):
In scrapycaller.py:

from subprocess import call

urls = 'url1,url2'
cmd = 'scrapy crawl myspider -a myurls={}'.format(urls)
call(cmd, shell=True)
Inside myspider:

import scrapy

class mySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, myurls='', *args, **kwargs):
        super(mySpider, self).__init__(*args, **kwargs)
        self.start_urls = myurls.split(",")
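Coming back to the original question of knowing, inside parse_item(), which start URL a response came from: one possible approach, sketched here with illustrative spider and meta key names, is to carry the start URL along in request.meta:

import scrapy

class StartUrlSpider(scrapy.Spider):
    name = 'start_url_tracker'
    start_urls = ['http://example.com/a', 'http://example.com/b']

    def start_requests(self):
        for url in self.start_urls:
            # remember which start URL this branch of the crawl came from
            yield scrapy.Request(url, callback=self.parse_item,
                                 meta={'start_url': url})

    def parse_item(self, response):
        start_url = response.meta['start_url']
        self.logger.info('parsed %s (started from %s)', response.url, start_url)
        # propagate the original start URL to every followed link
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse_item,
                                  meta={'start_url': start_url})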

Related

Parsing output from scrapy splash

I'm testing out a Splash instance with Scrapy 1.6, following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser

class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5})

    def parse(self, response):
        # response.body is the result of the render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None
The output opens up in notepad rather than a browser. How can I open this in a browser?
If you are using the Splash middleware and the rest of the setup, the Splash response goes into the regular response object, which you can access via response.css and response.xpath. Depending on which endpoint you use, you can execute JavaScript and do other things.
If you need to move around a page and interact with it, you will need to write a Lua script and execute it with the proper endpoint (a rough sketch follows). As far as parsing the output goes, it automatically ends up in the response object.
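A minimal sketch of that Lua route, assuming the scrapy-splash setup from the linked posts (the script body, spider name and selector are illustrative):

import scrapy
from scrapy_splash import SplashRequest

# illustrative Lua script: load the page, wait, then return the rendered HTML
lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2.0))
    -- clicking, scrolling, form filling etc. would go here
    return {html = splash:html()}
end
"""

class LuaSpider(scrapy.Spider):
    name = 'lua_example'
    start_urls = ['http://yahoo.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'lua_source': lua_script})

    def parse(self, response):
        # the HTML returned by the script ends up in the regular response object
        yield {'title': response.css('title::text').extract_first()}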
Get rid of open_in_browser. I'm not exactly sure what you are doing, but if all you want to do is parse the page, you can do so like this:
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could please clarify your question; most people don't want to follow links and guess what you're having trouble with.
Update for clarified question:
It sounds like you may want scrapy shell with Splash; this will let you experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
To access Splash in a browser, simply go to http://0.0.0.0:8050/ and input the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.

Scrapy request, shell Fetch() in spider

I'm trying to reach a specific page, let's call it http://example.com/puppers. This page cannot be reached when connecting directly using scrapy shell or a standard scrapy.Request (it results in an HTTP 405).
However, when I use scrapy shell 'http://example.com/kittens' first, and then use fetch('http://example.com/puppers'), it works and I get a 200 OK HTTP code. I can then extract data using scrapy shell.
I tried implementing this in my script by altering the referer (using url #1), the user-agent and a few other headers while connecting to the puppers page (url #2). I still get a 405.
I appreciate all the help. Thank you.
start_urls = ['http://example.com/kittens']

def parse(self, response):
    yield scrapy.Request(
        url="http://example.com/puppers",
        callback=self.parse_puppers,
    )

def parse_puppers(self, response):
    # process your puppers
    ...
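For context, a complete spider built around this pattern might look like the sketch below (the class name and the logging in parse_puppers are illustrative). The idea is that the crawl visits the page that works first and only then requests the one that returned 405; Scrapy keeps cookies between requests by default, which may be what makes the shell fetch() sequence succeed. Setting the Referer explicitly is an extra assumption, not something the answer requires:

import scrapy

class PuppersSpider(scrapy.Spider):
    name = "puppers"
    start_urls = ["http://example.com/kittens"]

    def parse(self, response):
        # visit the working page first, then request the protected one;
        # any cookies set by the first response are reused automatically
        yield scrapy.Request(
            url="http://example.com/puppers",
            headers={"Referer": response.url},  # assumption: the site may also check the Referer
            callback=self.parse_puppers,
        )

    def parse_puppers(self, response):
        self.logger.info("got %s with status %s", response.url, response.status)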

Running a Scrapy Crawler

I am very new to Python and Scrapy, and I have written a crawler in PyCharm as follows:
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re

class TutsplusItem(scrapy.Item):
    title = scrapy.Field()

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We store already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and link not in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item
However, when I run the above code, nothing prints on the screen! What is wrong with my code?
Put the code in a text file, name it something like your_spider.py and run the spider using the runspider command:
scrapy runspider your_spider.py
You would typically start scrapy using scrapy crawl, which will hook everything up for you and start the crawling.
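For the spider in the question (name "tutsplus"), that would be, run from the project root (assuming the project was generated with scrapy startproject):

scrapy crawl tutsplus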
It also looks like the code in your post was not properly indented (only one line inside parse when they all should be).
To run a spider from within PyCharm you need to configure the "Run/Debug configuration" properly. Running your_spider.py as a standalone script wouldn't result in anything.
As mentioned by @stranac, scrapy crawl is the way to go, with scrapy being a binary and crawl an argument of that binary.
Configure Run/Debug
In the main menu, go to:
Run > Edit Configurations...
Find the appropriate scrapy binary within your virtualenv and set its absolute path as the Script.
It should look something like this:
/home/username/.virtualenvs/your_virtualenv_name/bin/scrapy
In the script parameters field, set the parameters the scrapy binary will execute. In your case, you want to start your spider. This is how it should look:
crawl your_spider_name, e.g. crawl tutsplus
Make sure that the Python interpreter is the one where you set up Scrapy and the other packages needed for your project.
Make sure that the working directory is the directory containing settings.py, which is also generated by Scrapy.
From now on you should be able to Run and Debug your spiders from within PyCharm.
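As an alternative (not part of the answer above, just a common pattern), a small runner script lets PyCharm run and debug the spider directly, assuming it sits next to scrapy.cfg and the spider is named tutsplus:

# run_tutsplus.py - set this file as the Script in the Run/Debug configuration
from scrapy.cmdline import execute

execute(["scrapy", "crawl", "tutsplus"])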

Python Scrapy does not crawl website

I am new to Python and Scrapy and trying to get through a small example, however I am having some problems!
I am able to crawl the first given URL only, but I am unable to crawl more than one page or an entire website for that matter!
Please help me or give me some advice on how I can crawl an entire website or more pages in general...
The example I am doing is very simple...
My items.py:

import scrapy

class WikiItem(scrapy.Item):
    title = scrapy.Field()
My wikip.py (the spider):

import scrapy
from wiki.items import WikiItem

class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item
When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:
title
Portal:Arts
Can anyone give me insight as to why it is not following urls and crawling deeper?
I have checked some related SO questions but they have not helped to solve the issue
scrapy.Spider is the simplest spider. Rename your class so it is not called CrawlSpider, since CrawlSpider is one of Scrapy's generic spiders.
One of the options below can be used:
e.g.:
1. class WikiSpider(scrapy.Spider)
2. class WikiSpider(CrawlSpider)
If you are using the first option, you need to code the logic for following the links on that webpage yourself (see the sketch below).
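A minimal sketch of that hand-rolled link following for the first option, reusing the selectors from the question (the class name and the /wiki/ filter are illustrative; note that allowed_domains holds a bare domain here):

import scrapy
from wiki.items import WikiItem

class WikiLinksSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ['https://en.wikipedia.org/wiki/Portal:Arts']

    def parse(self, response):
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item
        # hand-rolled link following: queue every /wiki/ link found on the page
        for href in response.xpath('//a/@href').extract():
            if href.startswith('/wiki/'):
                yield response.follow(href, callback=self.parse)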
For the second option you can do the following:
After the start URLs you need to define the rules, as below:
rules = (
    Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also, please rename the function defined as "parse" if you use CrawlSpider. CrawlSpider uses the parse method internally to implement its logic, so by overriding parse you break it, which is why the crawl spider doesn't work.
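Putting the pieces together, a CrawlSpider version might look like the sketch below (the class name is illustrative, the allow pattern is broadened from the answer's example, and allowed_domains is a bare domain rather than a URL path):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wiki.items import WikiItem

class WikiPortalSpider(CrawlSpider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ['https://en.wikipedia.org/wiki/Portal:Arts']

    rules = (
        Rule(LinkExtractor(allow=(r'https://en\.wikipedia\.org/wiki/.*',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # the callback is not called "parse", so CrawlSpider's own logic keeps working
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item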

Build referer URL chains while crawling data through scrapy?

Is there any Scrapy module available to build referrer chains while crawling URLs?
Let's say for instance I start my crawl from http://www.example.com and move to http://www.new-example.com, and then from http://www.new-example.com to http://very-new-example.com.
Can I create URL chains (a CSV or JSON file) like this:
http://www.example.com, http://www.new-example.com
http://www.example.com, http://www.new-example.com, http://very-new-example.com
and so on? If there's no module or implementation available at the moment, what other options can I try?
Yes, you can keep track of referrals by making a global list which is accessible by all methods, for example:
referral_url_list = []

def call_back1(self, response):
    self.referral_url_list.append(response.url)

def call_back2(self, response):
    self.referral_url_list.append(response.url)

def call_back3(self, response):
    self.referral_url_list.append(response.url)
After the spider completes, which you can detect via spider signals, you can write the CSV or JSON file in the signal handler.
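A minimal sketch of that signal wiring (the spider and output file names are illustrative):

import json

from scrapy import signals, Spider

class ReferralSpider(Spider):
    name = 'referrals'
    referral_url_list = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ReferralSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run spider_closed once the crawl has finished
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # dump the collected referral list to a JSON file
        with open('referral_chains.json', 'w') as f:
            json.dump(self.referral_url_list, f, indent=2)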
