I am very new to Python and Scrapy, and I have written a crawler in PyCharm as follows:
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re

class TutsplusItem(scrapy.Item):
    title = scrapy.Field()

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # We stored already crawled links in this list
        crawledLinks = []

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and not link in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item
However, when I run the code above, nothing prints on the screen! What is wrong in my code?
Put the code in a text file, name it something like your_spider.py, and run the spider using the runspider command:
scrapy runspider your_spider.py
You would typically start scrapy using scrapy crawl, which will hook everything up for you and start the crawling.
It also looks like your code is not properly indented (only one line inside parse when they all should be).
To run a spider from within PyCharm you need to configure the "Run/Debug Configuration" properly. Running your_spider.py as a standalone script wouldn't result in anything.
As mentioned by @stranac, scrapy crawl is the way to go, with scrapy being the executable and crawl an argument passed to it.
Configure Run/Debug
In the main menu go to:
Run > Edit Configurations...
Find the appropriate scrapy binary within your virtualenv and set its absolute path as Script.
This should look something like this:
/home/username/.virtualenvs/your_virtualenv_name/bin/scrapy
In the Parameters field, set the arguments the scrapy executable will receive. In your case, you want to start your spider. This is how it should look:
crawl your_spider_name e.g. crawl tutsplus
Make sure that the Python interpreter is the one where you set up Scrapy and the other packages needed for your project.
Make sure that the working directory is the directory containing settings.py, which is also generated by Scrapy.
From now on you should be able to Run and Debug your spiders from within PyCharm.
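Alternatively, instead of pointing the configuration at the scrapy binary, you can add a small runner script to the project and run or debug that file directly in PyCharm. This is only a minimal sketch (the file name run.py is arbitrary, and it assumes the spider lives inside a Scrapy project so that get_project_settings() can find settings.py; the spider name tutsplus is taken from the question):

# run.py - put it next to scrapy.cfg and run/debug it directly from PyCharm
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl("tutsplus")                         # spider name, as defined in the spider class
process.start()                                   # blocks until the crawl finishes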
Related
I’m trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I’m logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" no longer opens the website; it only works if I put it after parse or parse2). My understanding is that I chain multiple "parse" functions one after another to navigate through the website.
Datacamp says I always need to start with a "start request function", which is what I tried, but in YouTube videos etc. I saw that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open that link in order to see the main page to which that link was referring.
In order to achieve this, first identify what you need to scrape, write CSS or XPath selectors for those links, and put them in a rule. Each rule either just follows the extracted links or passes the responses to a callback you assign. I am attaching a dummy example of creating rules, and you can map it to your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item'),
)
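For context, here is a minimal sketch of how such rules sit inside a CrawlSpider. The selectors ("header a", "div.product a"), the spider name, and parse_item are placeholders I made up, not values from your site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HeaderCrawler(CrawlSpider):
    name = "header_crawler"            # hypothetical spider name
    start_urls = ["<some website>"]    # placeholder, as in the question

    rules = (
        # Follow links found in the header; without a callback they are only followed
        Rule(LinkExtractor(restrict_css=["header a"])),
        # Parse the pages those header links lead to
        Rule(LinkExtractor(restrict_css=["div.product a"]), callback="parse_item"),
    )

    def parse_item(self, response):
        # Extract whatever you need from the target page
        yield {"url": response.url, "title": response.css("title::text").extract_first()}

Note that CrawlSpider applies its rules to responses handled by its built-in parse callback, so combining rules with a custom login callback (as in your FormRequest chain) takes a bit of extra care.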
I am trying to build a Scrapy project by following a book.
After running the commands 'scrapy startproject tutorial', 'cd tutorial', and 'scrapy genspider quotes quotes.toscrape.com', then adding the parse function and changing the items, the detailed code is as follows:
quotes.py:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
items.py:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
The class QuoteItem can't be recognized in quotes.py (see the error screenshot).
After I changed the import to 'from tutorial.tutorial.items import QuoteItem' and ran 'scrapy crawl quotes', there was another error (see the second error screenshot). Because of this, the results can't be saved. Can someone help? Thanks in advance.
It's working fine with that code! Try using scrapy runspider yourspiderfile.py instead of scrapy crawl quotes. There is no error in the code.
From Scrapy Tutorial:
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
In newer versions of Scrapy, go to the top-level directory and run:
scrapy runspider xx/spiders/xxx.py  (the full path from the top-level directory)
In your xxx.py, make sure you import the item like this:
from xx.items import xxItem
# (remember to import the right class name, not always QuoteItem)
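As a quick sanity check (a hedged example, assuming the standard layout created by scrapy startproject tutorial), run the spider from the project root, i.e. the directory that contains scrapy.cfg, and write the items to a feed file:

cd tutorial            # the directory containing scrapy.cfg
scrapy crawl quotes -o quotes.json

When the command is run from that directory, the import from tutorial.items import QuoteItem should resolve correctly, and the scraped quotes end up in quotes.json.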
Pay attention
It's recommended to run Scrapy on a Linux system or WSL; it does not always work well on Windows. If you run into a problem like the spider not being found, it may be a system issue.
Given a pool of start URLs, I would like to identify in the parse_item() function the origin URL.
As far as I understand, Scrapy spiders start crawling from the initial pool of start URLs, but when parsing there is no trace of which of those URLs was the initial one. How would it be possible to keep track of the starting point?
If you need the URL currently being parsed inside the spider, just use response.url:
def parse_item(self, response):
    print(response.url)
but in case you need it outside the spider, I can think of the following ways:
Use the Scrapy core API
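A minimal sketch of the core API route (the import path is hypothetical; adjust it to where your spider class actually lives):

from scrapy.crawler import CrawlerProcess
from myproject.spiders.myspider import mySpider  # hypothetical module path

process = CrawlerProcess()
# Keyword arguments passed to crawl() are forwarded to the spider's __init__,
# so the start URLs can be handed over just like with -a in the example below.
process.crawl(mySpider, myurls="url1,url2")
process.start()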
You can also call Scrapy from an external Python module with an OS command (which apparently is not recommended):
in scrapycaller.py
from subprocess import call
urls = 'url1,url2'
cmd = 'scrapy crawl myspider -a myurls={}'.format(urls)
call(cmd, shell=True)
Inside myspider:
class mySpider(scrapy.Spider):
    def __init__(self, myurls=''):
        self.start_urls = myurls.split(",")
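If what you actually need is the original start URL later in the callback chain (rather than the URL of the current response), one common pattern, shown here only as a sketch with made-up URLs, is to stamp each request with its start URL in request.meta and pass it along:

import scrapy

class OriginTrackingSpider(scrapy.Spider):
    name = "origin_tracking"  # hypothetical example spider
    start_urls = ["http://example.com/a", "http://example.com/b"]

    def start_requests(self):
        for url in self.start_urls:
            # remember which start URL this branch of the crawl came from
            yield scrapy.Request(url, callback=self.parse, meta={"start_url": url})

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse_item,
                                  meta={"start_url": response.meta["start_url"]})

    def parse_item(self, response):
        # both the current URL and the originating start URL are available here
        yield {"url": response.url, "start_url": response.meta["start_url"]}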
I am new to Python and Scrapy and am trying to work through a small example; however, I am having some problems!
I am able to crawl the first given URL only, but I am unable to crawl more than one page or an entire website for that matter!
Please help me or give me some advice on how I can crawl an entire website or more pages in general...
The example I am doing is very simple...
My items.py
import scrapy

class WikiItem(scrapy.Item):
    title = scrapy.Field()
my wikip.py (the spider)
import scrapy
from wiki.items import WikiItem

class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item
When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:
title
Portal:Arts
Can anyone give me insight as to why it is not following urls and crawling deeper?
I have checked some related SO questions but they have not helped to solve the issue
scrapy.Spider is the simplest spider. Rename your class from CrawlSpider, since CrawlSpider is one of Scrapy's generic spiders.
One of the options below can be used:
e.g. 1. class WikiSpider(scrapy.Spider)
or 2. class WikiSpider(CrawlSpider)
If you are using the first option, you need to code the logic for following the links you need on that webpage.
For second option you can do the below:
After the start urls you need to define the rule as below:
rules = (
    Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also, please rename the function defined as "parse" if you use CrawlSpider. CrawlSpider uses the parse method internally to implement its logic; by defining your own parse you override it, and hence the crawl spider doesn't work.
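A rough sketch of the second option for this particular spider (the allow pattern r"/wiki/" is only an illustration, adjust it to the pages you actually want; also note that allowed_domains should contain only the domain, without a path, or the offsite filter will not behave as expected):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wiki.items import WikiItem

class WikiSpider(CrawlSpider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org"]                    # domain only, no path
    start_urls = ["http://en.wikipedia.org/wiki/Portal:Arts"]

    rules = (
        # follow /wiki/ links and hand each fetched page to parse_item (not parse)
        Rule(LinkExtractor(allow=(r"/wiki/",)), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = WikiItem()
        item["title"] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item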
I want to crawl some information from a webpage using Python and Scrapy, but when I try to do it, the output of my item is empty...
First of all I've started a new project with scrapy. Then I've written the following in the items.py file:
import scrapy

class KakerlakeItem(scrapy.Item):
    info = scrapy.Field()
    pass
Next, I've created a new file in the spider's folder with the following code:
import scrapy
from kakerlake.items import KakerlakeItem

class Kakerlakespider(scrapy.Spider):
    name = 'Coco'
    allowed_domains = ['http://www.goeuro.es/']
    start_urls = ['http://www.goeuro.es/search/NTYzY2U2Njk4YzA1ZDoyNzE2OTU4ODM=']

    def parse(self, response):
        item = KakerlakeItem()
        item['info'] = response.xpath('//span[@class= "inline-b height-100"]/text()').extract()
        # yield item
        return item
I expect that by running scrapy crawl Coco -o data.json in the console I will get what I want, but instead I obtain a JSON file containing {'info': []}, that is, an empty item.
I've tried a lot of things and I don't know why it doesn't work correctly...
Your XPath is invalid for the page, as there isn't a single element with the class "inline-b" or "height-100". The page is heavily modified via JavaScript, so what you see in a browser will not be representative of what Scrapy receives.
xpath results:
>>> response.xpath('//span[contains(#class, "inline-b")]')
[]
>>> response.xpath('//span[contains(#class, "height-100")]')
[]
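One quick way to check what Scrapy actually receives for that page (as opposed to what the browser renders after JavaScript runs) is the Scrapy shell; for example:

scrapy shell "http://www.goeuro.es/search/NTYzY2U2Njk4YzA1ZDoyNzE2OTU4ODM="
>>> view(response)   # opens the raw HTML Scrapy downloaded in your browser
>>> response.xpath('//span[contains(@class, "inline-b")]').extract_first()

If the selector comes back empty there too, the data is most likely filled in by JavaScript, which matches the point above.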
Try removing the pass in KakerlakeItem(scrapy.Item)?