Python Scrapy not executing scrapy.Request callback function for every link - python

I am trying to make an eBay spider that goes through each product link on a listing page, visits each link, and does something with the resulting page in the parse_link function.
I am scraping this link (the start URL in the code below).
In the parse function it iterates over each link fine and prints each link fine, but it only calls the parse_link callback for one link on a page.
I mean each page has 50 or so products; I extract each product link and want to visit every one of them and handle it in parse_link,
but for each page the parse_link function gets called for only one link (out of the 50 or so links).
Here is the code:
class EbayspiderSpider(scrapy.Spider):
    name = "ebayspider"
    #allowed_domains = ["ebay.com"]
    start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']

    def parse(self, response):
        global c
        for attr in response.xpath('//*[@id="ListViewInner"]/li'):
            item = EbayItem()
            linkse = '.vip ::attr(href)'
            link = attr.css('a.vip ::attr(href)').extract_first()
            c += 1
            print '', 'I AM HERE', link, '\t', c
            yield scrapy.Request(link, callback=self.parse_link, meta={'item': item})

        next_page = '.gspr.next ::attr(href)'
        next_page = response.css(next_page).extract_first()
        print '\nI AM NEXT PAGE\n'
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

    def parse_link(self, response):
        global c2
        c2 += 1
        print '\n\n\tIam in parselink\t', c2
See: for every 50 or so links, Scrapy executes parse_link only once. I am printing counters (using global variables) for how many links were extracted and how many times parse_link has been executed:
shady#shadyD:~/Desktop/ebay$ scrapy crawl ebayspider
ENTER THE URL TO SCRAPE : http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562
2017-05-13 22:44:31 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: ebay)
2017-05-13 22:44:31 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ebay.spiders', 'SPIDER_MODULES': ['ebay.spiders'], 'BOT_NAME': 'ebay'}
2017-05-13 22:44:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-13 22:44:33 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:38079/session {"requiredCapabilities": {}, "desiredCapabilities": {"platform": "ANY", "browserName": "chrome", "version": "", "chromeOptions": {"args": [], "extensions": []}, "javascriptEnabled": true}}
2017-05-13 22:44:33 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-05-13 22:44:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-13 22:44:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-13 22:44:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-13 22:44:33 [scrapy.core.engine] INFO: Spider opened
2017-05-13 22:44:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-13 22:44:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-13 22:44:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562> (referer: None)
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320B-320BL-320BLN-321B-322BL-325BL-151-9385-/361916086833?hash=item5443e13a31:g:NMwAAOSwX~dWomWJ 1
I AM HERE http://www.ebay.com/itm/257954A1-New-Case-580SL-580SM-580SL-Series-2-Backhoe-Loader-Hydraulic-Pump-/361345120303?hash=item5421d8f82f:g:KQEAAOSwBLlVVP0X 2
I AM HERE http://www.ebay.com/itm/Case-580K-forward-reverse-transmission-shuttle-kit-includ-NEW-PUMP-SEALS-GASKETS-/110777599002?hash=item19cadc041a:g:QBgAAOSwh-1W2GkE 3
I AM HERE http://www.ebay.com/itm/Case-Loader-Backhoe-580L-Hydraulic-Pump-130258A1-130258A2-15-spline-NEW-/361889539361?hash=item54424c2521:g:nzgAAOSw9GhYiQzz 4
I AM HERE http://www.ebay.com/itm/Hitachi-EX60-PLAIN-Excavator-Service-Manual-Shop-Repair-Book-KM-099-00-KM09900-/132118077640?hash=item1ec2d9e0c8:g:DLkAAOxyVLNS6Cj7 5
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-416E-420D-420E-428D-Backhoe-3054c-C4-4-engine-TurboCharger-turbo-/361576953143?hash=item542faa7537:g:I78AAOSw3ihXTZwm 6
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-311B-312-312B-Stepping-Throttle-Motor-1200002-120-0002-/131402610746?hash=item1e9834b83a:g:hBUAAOSwpdpVX4DS 7
I AM HERE http://www.ebay.com/itm/Fuel-Cap-Case-Backhoe-Skid-Steer-1845c-1845-1840-1835-1835b-1835c-diesel-or-gas-/132102578279?hash=item1ec1ed6067:g:LCYAAOSwGYVXCDJ4 8
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-312C-312CL-Stepping-Throttle-Motor-247-5207-2475207-/112125482091?hash=item1a1b33146b:g:1wAAAOSw9IpX0HLt 9
I AM HERE http://www.ebay.com/itm/AT179792-John-Deere-Loader-Backhoe-310E-310G-310K-310J-710D-Hydraulic-Pump-NEW-/111290280036?hash=item19e96ae864:g:hxQAAOSw2GlXEW8g 10
I AM HERE http://www.ebay.com/itm/L32129-CASE-580C-480C-Brake-master-cylinder-REPAIR-KIT-480B-580B-530-570-480-430-/112228195723?hash=item1a21525d8b:g:lWEAAOSwux5YRucG 11
I AM HERE http://www.ebay.com/itm/John-Deere-210C-310C-310D-310E-410B-410C-510C-710C-King-pin-Kingpin-kit-T184816-/112266699462?hash=item1a239de2c6:g:~qAAAOSw44BYfmcP 12
I AM HERE http://www.ebay.com/itm/Case-257948A1-580L-580L-580SL-580M-580SM-590SL-590SM-Series-2-Coupler-17-spline-/131506726034?hash=item1e9e696492:g:ZnkAAOSwPgxVTNAx 13
I AM HERE http://www.ebay.com/itm/Construction-Equipment-key-set-John-Deere-Hitachi-JD-JCB-excavator-backhoe-multi-/360445978301?hash=item53ec4126bd:g:1HkAAMXQlUNRLOiF 14
I AM HERE http://www.ebay.com/itm/Case-580C-580E-forward-reverse-transmission-shuttle-kit-includ-NEW-SEALS-GASKETS-/361588374712?hash=item543058bcb8:g:kOYAAOSwDuJW2Gna 15
I AM HERE http://www.ebay.com/itm/John-Deere-300D-310D-315D-TRANSMISSION-REVERSER-SOLENOID-ASSEMBLY-EARLY-AT163601-/361435304759?hash=item5427391337:g:5rsAAOSwnipWXft4 16
I AM HERE http://www.ebay.com/itm/Bobcat-743-Service-Manual-Book-Skid-steer-6566109-/131768685855?hash=item1eae06951f:g:rgcAAOSwQgpW~nqW 17
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320C-312c-330c-325c-1573198-157-3198-panel-/112063225844?hash=item1a177d1ff4:g:BtgAAOSwepZXTfZ~ 18
I AM HERE http://www.ebay.com/itm/Ford-NEW-HOLLAND-Loader-BACKHOE-Hydraulic-pump-550-535-555-D1NN600B-Cessna-/360202190657?hash=item53ddb93f41:g:3gkAAOSwPgxVP5VF 19
I AM HERE http://www.ebay.com/itm/87435827-New-Case-590SL-590SM-Series-1-2-Backhoe-Loader-Hydraulic-oil-Pump-14S-/131992359553?hash=item1ebb5b9281:g:KQEAAOSwBLlVVP0X 20
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-311B-312-312B-Stepping-Throttle-Motor-2475227-247-5227-/111677605339?hash=item1a008105db:g:stsAAOSwNSxVX4kG 21
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-938H-950H-962H-416E-Wheel-Loader-Locking-Fuel-Tank-Cap-2849039-/111446084638?hash=item19f2b44c1e:g:u0IAAOxy1klRdqOQ 22
I AM HERE http://www.ebay.com/itm/FORD-BACKHOE-Hydraulic-pump-555C-555D-655D-E7NN600CA-/361376010222?hash=item5423b04fee:g:UdkAAOSwu4BV4J6T 23
I AM HERE http://www.ebay.com/itm/John-Deere-Excavator-AT154524-High-Speed-Solenoid-valve-490E-790ELC-790E-pump-/131623918235?hash=item1ea5659a9b:g:o-EAAOSwo0JWF~PC 24
I AM HERE http://www.ebay.com/itm/John-Deere-350C-450C-Dozer-Loader-Arm-Rest-PAIR-SEAT-/360164308266?hash=item53db77352a:m:m-79tleHP2PC3zD-HqRPMQw 25
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-D3-D3B-D3C-D4B-D4C-D4H-D5C-Dozer-3204-Engine-water-pump-NEW-/112061839578?hash=item1a1767f8da:g:6x0AAOSwIgNXjkNm 26
I AM HERE http://www.ebay.com/itm/International-IH-TD5-OLD-Crawler-Dozer-Seat-cushions-/110840656548?hash=item19ce9e32a4:m:mu5f6-grIZNQVtDoLSDcDJg 27
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-D3C-Series-III-D4G-D4H-8E4148-Arm-rests-rest-cushion-Dozer-seat-/131827423319?hash=item1eb186d857:g:JxMAAOSwQaJXRdzW 28
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320C-321C-322C-325C-260-2160-2602160-gauge-/112014409886?hash=item1a1494409e:g:BtgAAOSwepZXTfZ~ 29
I AM HERE http://www.ebay.com/itm/John-Deere-JD-NON-Turbo-Muffler-AT83613-210C-300D-310C-310D-315C-315D-400G-410B-/361917008791?hash=item5443ef4b97:g:U0wAAOSw~CRTpFsn 30
I AM HERE http://www.ebay.com/itm/John-Deere-210C-310D-Shuttle-transmission-Overhaul-Kit-With-Pump-Forward-Reverse-/361916993624?hash=item5443ef1058:g:8cUAAOSwDNdVp7-1 31
I AM HERE http://www.ebay.com/itm/AT318659-AT139444-John-Deere-Loader-Brake-Hydraulic-Pump-NEW-SURPLUS-544E-544G-/132040240495?hash=item1ebe362d6f:g:mRMAAOSwJ7RYWWUF 32
I AM HERE http://www.ebay.com/itm/Hitachi-EX60-PLAIN-Excavator-PARTS-Manual-Book-P10717-P107E16-Machine-Comp-/132110375418?hash=item1ec26459fa:g:rbwAAOSwPe1UAQal 33
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-D2-ENGINE-SERVICE-REPAIR-manual-book-D311-212-motor-grader-/360724733057?hash=item53fcde9c81:m:mfYRAKtemeCg_HnjxHAiO0w 34
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-312C-315C-318C-319C-260-2160-2602160-gauge-/131833751423?hash=item1eb1e7677f:g:BtgAAOSwepZXTfZ~ 35
I AM HERE http://www.ebay.com/itm/121335A1-Case-580L-580L-Series-2-Backhoe-Throttle-Cable-BENT-77-75-LONG-BEND-/361891435313?hash=item5442691331:g:lgcAAOSwhOdXogxu 36
I AM HERE http://www.ebay.com/itm/Heavy-Construction-Equipment-21-Key-Set-Cat-Case-Deere-Komatsu-Volvo-Truck-Laser-/111018804148?hash=item19d93c83b4:m:mm5Eephzc48HDdiNjCCaxtg 37
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-320B-322B-325B-throttle-motor-governor-2475232-247-5232-5-pin-/112183024608?hash=item1a1ea11be0:g:4bUAAOSwXeJYESNh 38
I AM HERE http://www.ebay.com/itm/John-Deere-REAR-Window-BOTTOM-300D-310D-310E-410D-410E-510D-710D-Backhoe-T132952-/111788475468?hash=item1a071cc44c:m:mM6nkmXre_mrGj9gBQbSQHQ 39
I AM HERE http://www.ebay.com/itm/JD-John-Deere-200CLC-120CLC-Excavator-Cab-Front-Upper-Glass-Window-4602562-120C-/361479558328?hash=item5429dc54b8:g:WvEAAOSw2s1Uz-er 40
I AM HERE http://www.ebay.com/itm/Hitachi-Excavator-Front-Lower-Glass-Window-4369588-/110718985349?hash=item19c75da485:m:mettchbVo-QopfqTgIqtY3g 41
I AM HERE http://www.ebay.com/itm/Caterpillar-D6M-D6N-D6R-D8R-Suspension-Seat-6W9744-Cat-/361294230211?hash=item541ed072c3:g:3wAAAOSwNSxVULZJ 42
I AM HERE http://www.ebay.com/itm/Komatsu-D20A-3-D20P-7-D21P-7-Dozer-Track-Adjuster-Seal-Kit-909036-WITH-BUSHING-/132165283763?hash=item1ec5aa2fb3:g:-0MAAOSwdzVXl3CN 43
I AM HERE http://www.ebay.com/itm/Locking-Fuel-Cap-John-Deere-310S-310SE-410E-backhoe-AT176378-NEW-310-S-SE-410-E-/361853261989?hash=item54402298a5:g:NUIAAOSwOtdYUEnj 44
I AM HERE http://www.ebay.com/itm/John-Deere-450G-455G-550G-555G-650G-Dozer-Loader-Arm-Rest-rests-/361912161141?hash=item5443a55375:g:7rkAAOSw3xJVVhwe 45
I AM HERE http://www.ebay.com/itm/John-Deere-AT418735-RIGHT-bucket-Handle-CT322-240-250-260-270-Skid-Steer-loader-/112335938162?hash=item1a27be6272:g:A2MAAOSwTM5YyYCc 46
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Tooth-Penetration-Rock-Tip-220-9092-2209092-320C-320D-325C-325D-/361928972291?hash=item5444a5d803:g:nGsAAOxy4YdTV~Qx 47
I AM HERE http://www.ebay.com/itm/John-Deere-AT418734-LEFT-Bucket-Handle-CT322-240-250-260-270-Skid-Steer-loader-/132127244893?hash=item1ec365c25d:g:5doAAOSwax5YyYAH 48
I AM HERE http://www.ebay.com/itm/4N9618-CAT-Caterpillar-977L-966C-235-D6C-3306-ENGINE-caterpiller-dozer-loader-/112360381857?hash=item1a29335da1:g:dLsAAOSwuLZY5lPU 49
I AM HERE http://www.ebay.com/itm/Bobcat-763-763F-Service-Manual-Book-Skid-steer-6900091-repair-shop-book-/131531875901?hash=item1e9fe9263d:g:VUsAAOxyOlhS0EiN 50
I AM NEXT PAGE
2017-05-13 22:44:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/itm/Bobcat-763-763F-Service-Manual-Book-Skid-steer-6900091-repair-shop-book-/131531875901?hash=item1e9fe9263d:g:VUsAAOxyOlhS0EiN> (referer: http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562)
Iam in parselink 2
2017-05-13 22:44:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=2&_skc=50&rt=nc> (referer: http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562)
I AM HERE http://www.ebay.com/itm/Hitachi-EX120-3-Excavator-Service-Technical-WorkShop-Manual-Shop-KM135E00-/361971788377?hash=item5447332a59:g:uXEAAMXQEgpTERZv 51
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320B-320BL-320BLN-321B-322BL-325BL-106-0172-/112208711245?hash=item1a20290e4d:g:NMwAAOSwX~dWomWJ 52
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-D4D-Seat-Cushion-Set-Arm-Rest-Dozer-9M6702-8K9100-3K4403-NEW-/111027276253?hash=item19d9bdc9dd:g:taYAAMXQhuVROmSf 53
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-DOOR-UPPER-RH-LH-85801626-/111004314632?hash=item19d85f6c08:g:kSkAAOxyzHxRL8~e 54
I AM HERE http://www.ebay.com/itm/187-8391-1878391-Caterpillar-Cat-Oil-Cooler-939C-D4C-D5C-933C-D3C-Series-3-/132036431899?hash=item1ebdfc101b:g:VhQAAOSw3YNXYtcn 55
I AM HERE http://www.ebay.com/itm/A137187-CASE-BACKHOE-Power-Steering-pump-480B-580B-530-NEW-A36559-/132028859390?hash=item1ebd8883fe:g:HMsAAOSwzOxUWpVL 56
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-953-7N5538-Exhaust-flex-pipe-EARLY-S-N-/361407787737?hash=item54259532d9:g:n3YAAOSwo6lWHQOL 57
I AM HERE http://www.ebay.com/itm/LINKBELT-Excavator-locking-Fuel-Cap-with-keys-KHH0140-/131504146758?hash=item1e9e420946:g:FHUAAOSwPhdVSLkJ 58
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-D4H-D5H-D6D-EXHAUST-PIPE-LOCKING-RAIN-CAP-5-INCH-/131962111459?hash=item1eb98e05e3:g:0fgAAOSwpLNX9qT1 59
I AM HERE http://www.ebay.com/itm/Caterpillar-CAT-Dozer-D5C-D5G-rear-sprocket-segments-NEW-1979677-1979678-CR6602-/361403972171?hash=item54255afa4b:g:qJsAAOSwLqFV9tkk 60
I AM HERE http://www.ebay.com/itm/John-Deere-4265372-RPM-sensor-110-120-160C-200C-330CLC-490E-790ELC-892E-HITACHI-/131567763291?hash=item1ea20cbf5b:g:PZYAAOSwPcVVup-H 61
I AM HERE http://www.ebay.com/itm/CATERPILLAR-D3B-931B-arm-rests-9C4136-5G2621-/360160327148?hash=item53db3a75ec:m:mY4iFhRua2zcfV6IL5i8csQ 62
I AM HERE http://www.ebay.com/itm/Bobcat-864-Operation-Maintenance-Manual-Book-6900953-operator-skid-steer-Track-/131664897965?hash=item1ea7d6e7ad:g:exkAAOSwcBhWXem~ 63
I AM HERE http://www.ebay.com/itm/Case-550G-650G-750G-850G-1150G-arm-rests-194738A1-144427A1-seat-cushion-crawler-/112393155898?hash=item1a2b27753a:g:GVEAAOSw5L9XDoN- 64
I AM HERE http://www.ebay.com/itm/7834-41-3002-7834-41-3003-Komatsu-PC300-7-PC360-7-PC400-7-Throttle-motor-/132135899267?hash=item1ec3e9d083:g:ulMAAOSw4A5Y1Agl 65
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-955H-Crawler-Loader-Dozer-Parts-Manual-Book-NEW-60A8413-and-up-/361855690487?hash=item544047a6f7:g:FeUAAOSwux5YVDfu 66
I AM HERE http://www.ebay.com/itm/Case-580CK-530-530ck-2wd-Power-Steering-cylinder-A37859-A37509-/111184835276?hash=item19e321f2cc:g:h~QAAOxyGstR8DSu 67
I AM HERE http://www.ebay.com/itm/Case-Backhoe-580-SUPER-L-580L-590SL-Radiator-234876A1-234876A2-Metal-tank-580SL-/111646548306?hash=item19fea72152:g:3igAAOxyI8lR8TnL 68
I AM HERE http://www.ebay.com/itm/Dresser-International-TD7C-TD8C-TD7E-TD12-TD15E-Dozer-Fuel-Cap-701922C2-103768C1-/132062834112?hash=item1ebf8eedc0:g:-CEAAOSwImRYeOug 69
I AM HERE http://www.ebay.com/itm/JD-John-Deere-120-160LC-200LC-230LC-Excavator-Cab-Door-Lower-Glass-4383401-/360651229974?hash=item53f87d0b16:g:fhUAAMXQDfdRqPQ5 70
I AM HERE http://www.ebay.com/itm/New-Holland-LB75b-loader-backhoe-operators-manual-operator-operation-maintenance-/361287895632?hash=item541e6fca50:g:1WAAAOSwAvJW9X~t 71
I AM HERE http://www.ebay.com/itm/Bobcat-743-early-parts-Manual-Book-Skid-steer-loader-6566179-/112084996042?hash=item1a18c94fca:g:wAoAAOxykmZTNY92 72
I AM HERE http://www.ebay.com/itm/Dresser-TD15E-Operator-Maintenance-Manual-International-crawler-dozer-operation-/111385189587?hash=item19ef131cd3:g:qDYAAOSwnQhXohwA 73
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-REAR-BACK-85801632-/360573341694?hash=item53f3d88ffe:g:nDQAAOxyyF5RL9H2 74
I AM HERE http://www.ebay.com/itm/DEERE-160LC-200LC-230LC-330LC-370-GLASS-LOWER-AT214097-/361070972976?hash=item541181d030:m:mettchbVo-QopfqTgIqtY3g 75
I AM HERE http://www.ebay.com/itm/John-Deere-NEW-Turbocharger-turbo-545D-590D-595-495D-EXCAVATOR-JD-RE26342-NEW-/131458659790?hash=item1e9b8bf5ce:g:3c4AAOxyu4dRwzW4 76
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-DOOR-FRONT-LOWER-LH-85801623-/361342507318?hash=item5421b11936:g:ZbYAAOSwPcVVpsif 77
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-311B-312-312B-Stepping-Throttle-Motor-247-5231-1190633-/132186922816?hash=item1ec6f45f40:g:hBUAAOSwpdpVX4DS 78
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-330C-260-2160-2602160-gauge-/361578440228?hash=item542fc12624:g:BtgAAOSwepZXTfZ~ 79
I AM HERE http://www.ebay.com/itm/John-Deere-210C-310D-Shuttle-Reverser-Overhaul-Kit-With-Pump-Forward-Reverse-/131963132435?hash=item1eb99d9a13:g:8cUAAOSwDNdVp7-1 80
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Multi-Terrain-Skid-Steer-Loader-Suspension-seat-cushion-kit-/360880511219?hash=item54062798f3:m:m5Tt8bBvIax8MVfT4VqcQgA 81
I AM HERE http://www.ebay.com/itm/Case-310G-Crawler-Tractor-4pc-Seat-Cushion-set-/361381166532?hash=item5423fefdc4:g:hzAAAOSwSdZWdHZS 82
I AM HERE http://www.ebay.com/itm/International-IH-500-OLD-Crawler-Dozer-Seat-cushions-/110598250697?hash=item19c02b60c9:g:DQ0AAMXQTT9RwIuh 83
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Excavator-Locking-Fuel-Cap-0963100-key-E110-E120-E70B-E110B-312-/110702080613?hash=item19c65bb265:g:pLwAAOxy2YtRwx2L 84
I AM HERE http://www.ebay.com/itm/Fuel-Cap-Case-Backhoe-Skid-Steer-1845c-1845-1840-1835-1835b-1835c-diesel-or-gas-/132102578719?hash=item1ec1ed621f:g:~IcAAOSwgZ1Xvyk9 85
I AM HERE http://www.ebay.com/itm/87433897-New-Case-580SL-580SM-580SL-Series-1-2-Backhoe-Hydraulic-Pump-14-Spline-/112192774351?hash=item1a1f35e0cf:g:KQEAAOSwBLlVVP0X 86
I AM HERE http://www.ebay.com/itm/Case-580K-580SK-580L-580SL-BACKHOE-Right-Door-Rear-Hinged-Window-Glass-R52882-/111777519523?hash=item1a067597a3:m:mUh405BlfpMRnDzu0J8qEEw 87
I AM HERE http://www.ebay.com/itm/Case-backhoe-door-spring-580E-580K-580SK-580SL-580SL-SERIES-2-580L-F44881-/111485899971?hash=item19f513d4c3:m:mpgpGQ1o0j_2ewhNIMMA53w 88
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-DOOR-LOWER-LH-85801625-/111002325387?hash=item19d841118b:g:HrIAAMXQySpRL9SJ 89
I AM HERE http://www.ebay.com/itm/International-Dresser-TD8E-Dozer-4pc-Seat-Cushion-set-TD8C-IH-/131522416031?hash=item1e9f58cd9f:g:qC0AAOSwqBJXUJIL 90
I AM HERE http://www.ebay.com/itm/John-Deere-450G-550G-650G-Crawler-Dozer-Operators-Manual-Maintenance-OMT163974-/132190364513?hash=item1ec728e361:g:lUAAAOxygPtS59xJ 91
I AM HERE http://www.ebay.com/itm/Heavy-Construction-Equipment-key-set-excavator-bull-dozer-broom-forklift-loaders-/110751342295?hash=item19c94b5ed7:m:mm5Eephzc48HDdiNjCCaxtg 92
I AM HERE http://www.ebay.com/itm/International-IH-Dresser-TD15B-TD15C-Crawler-Loader-Seat-Cushion-set-4-pieces-/111731372191?hash=item1a03b5709f:g:TrAAAOSwDNdVu5He 93
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Skid-Steer-loader-Suspension-COMPLETE-Seat-247-247B-more-/131069185959?hash=item1e84550fa7:g:kQYAAOxy4dNSqIYD 94
I AM HERE http://www.ebay.com/itm/John-Deere-JD-Loader-Backhoe-710D-310g-310E-310J-310K-Hydraulic-charge-Pump-/131129131733?hash=item1e87e7c2d5:g:zFQAAOxy9eVRJ9cw 95
I AM HERE http://www.ebay.com/itm/Case-480E-480ELL-LANDSCAPE-Backhoe-4x4-4wd-FRONT-RIM-wheel-New-D126930-12-X-16-5-/360913564299?hash=item54081ff28b:m:mYte9AXdktKLD9H-HOFJthQ 96
I AM HERE http://www.ebay.com/itm/Bobcat-763F-763-Operation-Maintenance-Manual-operator-owner-6900788-/360337555830?hash=item53e5cac176:g:4IQAAOxy4dNSxZHP 97
I AM HERE http://www.ebay.com/itm/Bobcat-753H-753-H-Service-Manual-Book-Skid-steer-loader-6900090-/131522633242?hash=item1e9f5c1e1a:g:1JEAAOxyUrZS-j4Q 98
I AM HERE http://www.ebay.com/itm/John-Deere-JD-550-Crawler-Dozer-Parts-Manual-PC1437-/131985496504?hash=item1ebaf2d9b8:g:GkIAAOSwPgxVLR7f 99
I AM HERE http://www.ebay.com/itm/Case-IH-580D-580SE-580SD-Backhoe-Rear-Closure-Panel-Cab-Glass-Window-CG3116-NEW-/111070117033?hash=item19dc4b7ca9:g:jHEAAOxykVNRwL34 100
I AM NEXT PAGE
2017-05-13 22:44:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/itm/Case-IH-580D-580SE-580SD-Backhoe-Rear-Closure-Panel-Cab-Glass-Window-CG3116-NEW-/111070117033?hash=item19dc4b7ca9:g:jHEAAOxykVNRwL34> (referer: http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=2&_skc=50&rt=nc)
Iam in parselink 3
2017-05-13 22:44:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=3&_skc=100&rt=nc> (referer: http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=2&_skc=50&rt=nc)
I AM HERE http://www.ebay.com/itm/John-Deere-Hitachi-Zaxis-110-120-160-200-225-230-Alternator-1812005304-Excavator-/360495635483?hash=item53ef36dc1b:m:mqifohjA-IWXcIg_oWMee1Q 101
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-955H-Crawler-Loader-Dozer-Parts-Manual-Book-NEW-60A8413-and-up-/361855690487?hash=item54404
EDIT:
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ebay project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ebay'
SPIDER_MODULES = ['ebay.spiders']
NEWSPIDER_MODULE = 'ebay.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ebay (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ebay.middlewares.EbaySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ebay.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'ebay.pipelines.EbayPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
items.py
import scrapy
from scrapy.item import Item, Field
class EbayItem(scrapy.Item):
    NAME = scrapy.Field()
    MPN = scrapy.Field()
    ITEMID = scrapy.Field()
    PRICE = scrapy.Field()
    FREIGHT_1_for_quan_1 = scrapy.Field()
    FREIGHT_2_for_quan_2 = scrapy.Field()
    DATE = scrapy.Field()
    QUANTITY = scrapy.Field()
    CATAGORY = scrapy.Field()
    SUBCATAGORY = scrapy.Field()
    SUBCHILDCATAGORY = scrapy.Field()
pipelines.py (although I have not touched this file)
class EbayPipeline(object):
    def process_item(self, item, spider):
        return item
Middleware.py (have not touched this file either)
from scrapy import signals


class EbaySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Solution: no fix needed, it seems to be working fine
I quickly ran your code (with only slight modifications like removing the global vars and replacing EbayItem) and it works fine and visits all the URLs you are creating.
Explanation / What's going on here:
I suspect your scraper is scheduling the URLs in a way that makes it appear as if it is not visiting all links. But it will, only later.
I suspect you have set CONCURRENT_REQUESTS = 2. That's why Scrapy schedules only 2 of the 51 URLs to be processed next. Among these 2 URLs is the next-page URL, which creates another 51 requests, and these new requests push the old 49 requests further back in the queue ... and so it goes until there are no more next-page links.
If you run the scraper long enough you will see that all links are visited sooner or later. Most probably the 49 "missing" requests that were created first will be visited last.
You can also remove the creation of the next_page request to confirm that all 50 or so links on a single page are visited.
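If you want to see the item pages fetched before the spider moves on to the next listing page, one option (just a sketch from my side; priority is not something your code uses yet) is to give the detail requests a higher priority than the pagination request, because Scrapy pops higher-priority requests from its scheduler first:

# Inside parse() - sketch only: detail pages get a higher priority than pagination.
yield scrapy.Request(link, callback=self.parse_link,
                     meta={'item': item}, priority=10)

# The next-page request keeps the default priority of 0, so the already
# queued detail pages are downloaded before it.
if next_page:
    yield scrapy.Request(urljoin(response.url, next_page),
                         callback=self.parse, priority=0)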

Related

Scrapy will start by command line but not with CrawlerProcess

Here is my spider:
import scrapy


class PhonesCDSpider(scrapy.Spider):
    name = "phones_CD"
    custom_settings = {
        "FEEDS": {
            "Spiders/spiders/cd.json": {"format": "json"},
        },
    }
    start_urls = [
        'https://www.cdiscount.com/telephonie/telephone-mobile/smartphones/tous-nos-smartphones/l-144040211.html'
    ]

    def parse(self, response):
        for phone in response.css('div.prdtBlocInline.jsPrdtBlocInline'):
            phone_url = phone.css('div.prdtBlocInline.jsPrdtBlocInline a::attr(href)').get()
            # go to the phone page
            yield response.follow(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('span.fpPrice.price.jsMainPrice.jsProductPrice.hideFromPro::attr(content)').get(),
            'EAN': response.css('script').getall(),
            'image_url': response.css('div.fpMainImg a::attr(href)').get(),
            'url': response.url
        }
If I start it in the terminal with: scrapy crawl phones_CD -O test.json, it works fine. But if I run it in my python script (where the other crawlers work and are configured the same way):
def all_crawlers():
    process = CrawlerProcess()
    process.crawl(PhonesCBSpider)
    process.crawl(PhonesKFSpider)
    process.crawl(PhonesMMSpider)
    process.crawl(PhonesCDSpider)
    process.start()

all_crawlers()
I get an error; here is the traceback:
2021-01-05 18:16:06 [scrapy.core.engine] INFO: Spider opened
2021-01-05 18:16:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-05 18:16:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6026
2021-01-05 18:16:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/telephonie/telephone-mobile/smartphones/tous-nos-smartphones/l-144040211.html> (referer: None)
2021-01-05 18:16:07 [scrapy.core.engine] INFO: Closing spider (finished)
Thanks in advance for your time!
According to the Scrapy docs on feed exports,
the FEEDS setting does not support a relative path like your "Spiders/spiders/cd.json".
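One way around this (an untested sketch on my side) is to give the feed an absolute path, for example built from the location of the spider file, so the output does not depend on whichever working directory CrawlerProcess happens to run in:

import pathlib
import scrapy

class PhonesCDSpider(scrapy.Spider):
    name = "phones_CD"
    custom_settings = {
        "FEEDS": {
            # absolute path next to this file instead of the relative
            # "Spiders/spiders/cd.json"
            str(pathlib.Path(__file__).parent / "cd.json"): {"format": "json"},
        },
    }
    # ... rest of the spider unchanged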

Is there any way to get text inside anchor tag in Scrapy's Crawlspider?

I have a CrawlSpider which crawls a given site up to a certain depth and downloads the PDFs on that site. Everything works fine, but along with the link of the PDF I also need the text inside the anchor tag.
For example:
<a href='../some/pdf/url/pdfname.pdf'>Project Report</a>
Consider this anchor tag: in the callback I get the response object, and along with that object I need the text inside the tag, e.g. 'Project Report'.
Is there any way to get this information along with the response object? I have gone through https://docs.scrapy.org/en/latest/topics/selectors.html but it is not what I am looking for.
Sample code:
class DocumunetPipeline(scrapy.Item):
    document_url = scrapy.Field()
    name = scrapy.Field()  # name of pdf/doc file
    depth = scrapy.Field()


class MySpider(CrawlSpider):
    name = 'pdf'
    start_urls = ['http://www.someurl.com']
    allowed_domains = ['someurl.com']
    rules = (
        Rule(LinkExtractor(tags="a", deny_extensions=[]),
             callback='parse_document', follow=True),
    )

    def parse_document(self, response):
        content_type = (response.headers
                        .get('Content-Type', None)
                        .decode("utf-8"))
        url = response.url
        if content_type == "application/pdf":
            name = response.headers.get('Content-Disposition', None)
            document = DocumunetPipeline()
            document['document_url'] = url
            document['name'] = name
            document['depth'] = response.meta.get('depth', None)
            yield document
It seems like it's not documented, but the meta attribute does contain the link text. It is updated in this line.
A minimal example would be:
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class LinkTextSpider(CrawlSpider):
    name = 'linktext'
    start_urls = ['https://example.org']
    rules = [
        Rule(LinkExtractor(), callback='parse_document'),
    ]

    def parse_document(self, response):
        return dict(
            url=response.url,
            link_text=response.meta['link_text'],
        )
Which produces an output similar to:
2019-04-01 12:03:30 [scrapy.core.engine] INFO: Spider opened
2019-04-01 12:03:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 12:03:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-01 12:03:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2019-04-01 12:03:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.iana.org/domains/reserved> from <GET http://www.iana.org/domains/example>
2019-04-01 12:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.iana.org/domains/reserved> (referer: None)
2019-04-01 12:03:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iana.org/domains/reserved>
{'url': 'https://www.iana.org/domains/reserved', 'link_text': 'More information...'}
2019-04-01 12:03:33 [scrapy.core.engine] INFO: Closing spider (finished)
I believe the best way to achieve that is not to use crawling rules, but regular crawling instead, with your own parse_* methods to handle all responses.
Then, when you yield a request that has parse_document as the callback, you can include the link text in the meta parameter of your request and read it from response.meta in your parse_document method.
class MySpider(CrawlSpider):
    name = 'pdf'
    start_urls = ['http://www.someurl.com']
    allowed_domains = ['someurl.com']

    def parse(self, response):
        for link in response.css('a'):
            yield response.follow(
                link,
                callback=self.parse_document,
                meta={'link_text': link.xpath('text()').get()}
            )

    def parse_document(self, response):
        # …
        if content_type == "application/pdf":
            # …
            document = DocumunetPipeline()
            # …
            document['link_text'] = response.meta['link_text']
            yield document

Sequential scraping from multiple start_urls leading to error in parsing

First, highest appreciation for all of your work answering noob questions like this one.
Second, as it seems to be a quite common problem, I found (IMO) related questions such as:
Scrapy: Wait for a specific url to be parsed before parsing others
However, at my current state of understanding it is not straightforward to adapt the suggestions to my specific case, and I would really appreciate your help.
Problem outline (running on Python 3.7.1, Scrapy 1.5.1):
I want to scrape data from every link collected on pages like this
https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1
then from all links on another collection
https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650
I manage to get the desired information (only two elements shown here) if I run the spider for one start page (e.g. page 1 or 650) at a time. (Note that I restricted the number of links crawled per page to 2.) However, once I have multiple start_urls (setting two elements in the list [1,650] in the code below), the parsed data is no longer consistent; apparently at least one element is not found by XPath. I suspect some (or a lot of) incorrect logic in how I handle/pass the requests, which leads to an unintended parsing order.
Code:
class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            print('#### START REQUESTS: ', url)
            yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self, response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ', link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)

    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')
        print('#### PARSER OUTPUT: ')
        key = [route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value = [route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key, value))
        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])
        print('Length of dict extracted from Route: {}'.format(len(route)))
        return
Command prompt
2019-03-18 15:42:27 [scrapy.core.engine] INFO: Spider opened
2019-03-18 15:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-18 15:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
#### START REQUESTS: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1
2019-03-18 15:42:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1> (referer: None)
#### START REQUESTS: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650
##### PARSING: /gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort
##### PARSING: /gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn
2019-03-18 15:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650> (referer: None)
##### PARSING: /gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue
##### PARSING: /gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule
2019-03-18 15:42:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route: Blinnenhorn/Corno Cieco
Comments: Am Samstag Aufstieg zur Corno Gries Hütte, ca. 2,5h ab All Acqua. Zustieg problemslos auf guter Spur. Zur Verwunderung waren wir die einzigsten auf der Hütte. Danke an Monika für die herzliche Bewirtung...
Length of dict extracted from Route: 27
2019-03-18 15:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route: Cima Portule
Comments: Sehr viel Schnee in dieser Gegend und viel Spirarbeit geleiset, deshalb auch viel Zeit gebraucht.
Length of dict extracted from Route: 19
2019-03-18 15:42:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route: Schwändeliflue
Comments: Wege und Pfade meist schneefrei, da im Gebiet viel Hochmoor ist, z.t. sumpfig. Oberhalb 1600m und in Schattenlagen bis 1400m etwas Schnee (max.Schuhtief). Wetter sonnig und sehr warm für die Jahreszeit, T-Shirt - Wetter, Frühlingshaft....
Length of dict extracted from Route: 17
2019-03-18 15:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route: Beaufort
2019-03-18 15:42:40 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
Traceback (most recent call last):
File "C:\Users\Lenovo\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\Lenovo\Dropbox\Code\avalanche\scrapy\slf1\slf1\spiders\slf_spider1.py", line 38, in parse_gipfelbuch_item
print('Comments: ', fields['Verhältnis-Beschreibung'])
KeyError: 'Verhältnis-Beschreibung'
2019-03-18 15:42:40 [scrapy.core.engine] INFO: Closing spider (finished)
Question:
How do I have to structure the first (for links) and second (for content) parsing steps correctly? Why is the "PARSER OUTPUT" not in the order I would expect (first page 1, links top to bottom, then the second start page, links top to bottom)?
I have already tried reducing CONCURRENT_REQUESTS to 1 and setting DOWNLOAD_DELAY = 2.
I hope the question is clear enough... big thanks in advance.
If the problem is visiting multiple start URLs at the same time, you can visit them one by one using the spider_idle signal (https://docs.scrapy.org/en/latest/topics/signals.html).
The idea is the following:
1. start_requests only visits the first URL
2. when the spider gets idle, the spider_idle method is called
3. the spider_idle method deletes the first URL and visits the second URL
4. and so on...
The code would be something like this (I didn't try it):
# extra imports needed for the idle-signal approach
from scrapy import signals
from scrapy.http import Request


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SlfSpider1Spider, cls).from_crawler(crawler, *args, **kwargs)
        # Here you set which method the spider has to run when it gets idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    # Method which starts the requests by visiting the URLs specified in start_urls
    def start_requests(self):
        # the spider visits only the first provided URL
        url = self.start_urls[0]
        print('#### START REQUESTS: ', url)
        yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self, response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ', link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)

    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')
        print('#### PARSER OUTPUT: ')
        key = [route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value = [route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key, value))
        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])
        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

    # When the spider gets idle, it deletes the first url and visits the second, and so on...
    def spider_idle(self, spider):
        del(self.start_urls[0])
        if len(self.start_urls) > 0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)

Scrapy not working (noob level) - 0 pages crawled 0 items crawled

I've been trying to follow the Scrapy tutorial, but I'm stuck and have no idea where the mistake is.
It runs, but no items are crawled.
I get the following output:
C:\Users\xxx\allegro>scrapy crawl AllegroPrices
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: AllegroPrices)
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'allegro.spiders', 'SPIDER_MODULES': ['allegro.spiders'], 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'AllegroPrices'}
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'allegro.middlewares.AllegroSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled item pipelines:
['allegro.pipelines.AllegroPipeline']
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider opened
2017-12-10 22:25:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-10 22:25:15 [AllegroPrices] INFO: Spider opened: AllegroPrices
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-10 22:25:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 527000),
'log_count/INFO': 8,
'start_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 517000)}
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider closed (finished)
My spider file:
# -*- coding: utf-8 -*-
import scrapy
from allegro.items import AllegroItem


class AllegroPrices(scrapy.Spider):
    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]

    #Use working product URL below
    start_urls = [
        "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
        "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
        "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
        "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
    ]

    def parse(self, response):
        items = AllegroItem()
        title = response.xpath('//h1[@class="title"]//text()').extract()
        sale_price = response.xpath('//div[@class="price"]//text()').extract()
        seller = response.xpath('//div[@class="btn btn-default btn-user"]/span/text()').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_seller'] = ''.join(seller).strip()
        yield items
Settings:
# -*- coding: utf-8 -*-
# Scrapy settings for allegro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'AllegroPrices'
SPIDER_MODULES = ['allegro.spiders']
NEWSPIDER_MODULE = 'allegro.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'allegro (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'allegro.middlewares.AllegroSpiderMiddleware': 543,
}
LOG_LEVEL = 'INFO'
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'allegro.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'allegro.pipelines.AllegroPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pipeline:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class AllegroPipeline(object):
    def process_item(self, item, spider):
        return item
Items:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class AllegroItem(scrapy.Item):
    # define the fields for your item here like:
    product_name = scrapy.Field()
    product_sale_price = scrapy.Field()
    product_seller = scrapy.Field()
I have no problem running it as a standalone script, without creating a project, and saving to a CSV file.
And I don't have to change the USER_AGENT.
Maybe there is a problem with some of your settings. You didn't put the URL of the tutorial, so I can't check it.
Or you simply have wrong indentation and start_urls and parse() are not inside the class. Indentation is very important in Python.
BTW: you forgot /a/ in the XPath for the seller.
import scrapy

#class AllegroItem(scrapy.Item):
#    product_name = scrapy.Field()
#    product_sale_price = scrapy.Field()
#    product_seller = scrapy.Field()


class AllegroPrices(scrapy.Spider):
    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]
    start_urls = [
        "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
        "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
        "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
        "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
    ]

    def parse(self, response):
        title = response.xpath('//h1[@class="title"]//text()').extract()
        sale_price = response.xpath('//div[@class="price"]//text()').extract()
        seller = response.xpath('//div[@class="btn btn-default btn-user"]/a/span/text()').extract()
        title = title[0].strip()
        print(title, sale_price, seller)
        yield {'title': title, 'price': sale_price, 'seller': seller}

        #items = AllegroItem()
        #items['product_name'] = ''.join(title).strip()
        #items['product_sale_price'] = ''.join(sale_price).strip()
        #items['product_seller'] = ''.join(seller).strip()
        #yield items


# --- run it as standalone script without project and save in CSV ---
from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess()
c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv'
})
c.crawl(AllegroPrices)
c.start()
Result in CSV file:
title,price,seller
STAR WARS: EMPIRE AT WAR [2006] DVD BOX,"24,90 zł",CDkingpl
DIABLO II: LORD OF DESTRUCTION 2 PC BIG BOX ENG,"149,00 zł",CDkingpl
HEAVY GEAR II 2 | PC ENG CDkingpl,"19,90 zł",CDkingpl
DIABLO II 2 | PC DVD BOX | ENG,"24,90 zł",CDkingpl

Python / Scrapy: CrawlSpider stops after fetching start_urls

I have wasted days trying to get my mind around Scrapy, reading the docs and other Scrapy blogs and Q&As ... and now I am about to do what men hate most: ask for directions ;-) The problem is: my spider opens, fetches the start_urls, but apparently does nothing with them. Instead it closes immediately, and that was that. Apparently, I do not even get to the first self.log() statement.
What I've got so far is this:
# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *


class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.
When doing scrapy crawl KiSpider, this results in:
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider)
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider (info#defrent.de)', 'DOWNLOAD_DELAY': 0.25}
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 465,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 48998,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)}
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished)
Is it that the login routine should not end with a callback, but with some kind of return/yield statement? Or what am I doing wrong? Unfortunately, the docs and tutorials I have seen so far only give me a vague idea of how every bit connects to the others; Scrapy's docs in particular seem to be written as a reference for people who already know a lot about Scrapy.
Somewhat frustrated greetings
Christopher
rules = (
    # First rule that matches a given link is followed / parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)
You do not need the allow parameter, because there is only one link in the tag selected by the XPath.
I do not understand the regex in the allow parameter, but at least you should escape the ?.
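For example, with the ? escaped, the first rule could look like this (an untested sketch; I also lowercased Default to match the URLs shown in start_urls and removed the stray ] after 206):

Rule(
    LinkExtractor(
        # \? matches the literal question mark before the query string
        allow=r'default\.aspx\?pageid=(202|206)&page=\d+',
        # but only within the pagination table cell:
        restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
    ),
    follow=True,
),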
