I have a problem with my script such that the same file name, and pdf is downloading. I have checked the output of my results without downloadfile and I get unique data. It's when I use the pipeline that it somehow produces duplicates for download.
Here's my script:
import scrapy
from environment.items import fcpItem
class fscSpider(scrapy.Spider):
name = 'fsc'
start_urls = ['https://fsc.org/en/members']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
callback = self.parse
)
def parse(self, response):
content = response.xpath("(//div[#class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[#class='content__item even field__item'])[position() >1]")
loader = fcpItem()
names_add = response.xpath(".//div[#class = 'field__item resource-item']/article//span[#class='media-caption file-caption']/text()").getall()
url = response.xpath(".//div[#class = 'field__item resource-item']/article/div[#class='actions']/a//#href").getall()
pdf=[response.urljoin(x) for x in url if '#' is not x]
names = [x.split(' ')[0] for x in names_add]
for nm, pd in zip(names, pdf):
loader['names'] = nm
loader['pdfs'] = [pd]
yield loader
items.py
class fcpItem(scrapy.Item):
names = Field()
pdfs = Field()
results = Field()
pipelines.py
class DownfilesPipeline(FilesPipeline):
def file_path(self, request, response=None, info=None, item=None):
items = item['names']+'.pdf'
return items
settings.py
from pathlib import Path
import os
BASE_DIR = Path(__file__).resolve().parent.parent
FILES_STORE = os.path.join(BASE_DIR, 'fsc')
ROBOTSTXT_OBEY = False
FILES_URLS_FIELD = 'pdfs'
FILES_RESULT_FIELD = 'results'
ITEM_PIPELINES = {
'environment.pipelines.pipelines.DownfilesPipeline': 150
}
I am using css instead of xpath.
From the chrome debug panel, the tag is root of item of PDF list.
Under that div tag has title of PDF and tag for file download URL
Between root tag and tag two child's and sibling relation so xpath is not clean method and hard, a css much better is can easley pick up from root to . it don't necessary relation ship path. css can skip relationship and just sub/or grand sub is not matter. It also provides not necessary to consider index problem which is URL array and title array sync by index match.
Other key point are URL path decoding and file_urls needs to set array type even if single item.
fsc_spider.py
import scrapy
import urllib.parse
from quotes.items import fcpItem
class fscSpider(scrapy.Spider):
name = 'fsc'
start_urls = [
'https://fsc.org/en/members',
]
def parse(self, response):
for book in response.css('div.field__item.resource-item'):
url = urllib.parse.unquote(book.css('div.actions a::attr(href)').get(), encoding='utf-8', errors='replace')
url_left = url[0:url.rfind('/')]+'/'
title = book.css('span.media-caption.file-caption::text').get()
item = fcpItem()
item['original_file_name'] = title.replace(' ','_')
item['file_urls'] = ['https://fsc.org'+url_left+title.replace(' ','%20')]
yield item
items.py
import scrapy
class fcpItem(scrapy.Item):
file_urls = scrapy.Field()
files = scrapy.Field
original_file_name = scrapy.Field()
pipelines.py
import scrapy
from scrapy.pipelines.files import FilesPipeline
class fscPipeline(FilesPipeline):
def file_path(self, request, response=None, info=None):
file_name: str = request.url.split("/")[-1].replace('%20','_')
return file_name
settings.py
BOT_NAME = 'quotes'
FILES_STORE = 'downloads'
SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'
FEED_EXPORT_ENCODING = 'utf-8'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = { 'quotes.pipelines.fscPipeline': 1}
file structure
execution
quotes>scrapy crawl fsc
result
The problem is that you are overwriting the same scrapy item every iteration.
What you need to do is create a new item for each time your parse method yields. I have tested this and confirmed that it does produce the results you desire.
I made and inline not in my example below on the line that needs to be changed.
For example:
import scrapy
from environment.items import fcpItem
class fscSpider(scrapy.Spider):
name = 'fsc'
start_urls = ['https://fsc.org/en/members']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
callback = self.parse
)
def parse(self, response):
content = response.xpath("(//div[#class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[#class='content__item even field__item'])[position() >1]")
names_add = response.xpath(".//div[#class = 'field__item resource-item']/article//span[#class='media-caption file-caption']/text()").getall()
url = response.xpath(".//div[#class = 'field__item resource-item']/article/div[#class='actions']/a//#href").getall()
pdf=[response.urljoin(x) for x in url if '#' is not x]
names = [x.split(' ')[0] for x in names_add]
for nm, pd in zip(names, pdf):
loader = fcpItem() # Here you create a new item each iteration
loader['names'] = nm
loader['pdfs'] = [pd]
yield loader
Related
As stated in the title, I am looking to replace the image path name with the item path name here is my example:
Running my scrapy, I get the files as the standard SHA1 hash format.
If possible I would also appreciate if it can get the first image instead of the whole group.
URL name - https://www.antaira.com/products/10-100Mbps/LNX-500A
Expected image name - LNX-500A.jpg
Spider.py
from copyreg import clear_extension_cache
import scrapy
from ..items import AntairaItem
class ImageDownload(scrapy.Spider):
name = 'ImageDownload'
allowed_domains = ['antaira.com']
start_urls = [
'https://www.antaira.com/products/10-100Mbps/LNX-500A',
]
def parse_images(self, response):
raw_image_urls = response.css('.image img ::attr(src)').getall()
clean_image_urls = []
for img_url in raw_image_urls:
clean_image_urls.append(response.urljoin(img_url))
yield {
'image_urls' : clean_image_urls
}
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import json
class AntairaPipeline:
def process_item(self, item, spider):
# calling dumps to create json data.
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
def open_spider(self, spider):
self.file = open('result.json', 'w')
def close_spider(self, spider):
self.file.close()
class customImagePipeline(ImagesPipeline):
def file_path(self, request, response=None, info=None):
#item-request.meta['item'] # Like this you can use all from the item, not just url
#image_guid = request.meta.get('filename', '')
image_guid = request.url.split('/')[-1]
#image_direct = request.meta.get('directoryname', '')
return 'full/%s.jpg' % (image_guid)
#Name thumbnail version
def thumb_path(self, request, thumb_id, response=None, info=None):
image_guid = thumb_id + response.url.split('/')[-1]
return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)
def get_media_requests(self, item, info):
#return [Request(x, meta={'filename': item['image_name']})
# for x in item.get(self.image_urls_field, [])]\
for image in item['images']:
yield Request(image)
I understand there is a way to get the meta data, but I would like it to have the name of the item product if possible for the image, thank you.
EDIT -
Original File Name - 12f6537bd206cf58e86365ed6b7c1fb446c533b2.jpg
Required file name - "LNX_500A_01.jpg" - using the last part of the start_url path
if more than one if not than "LNX_500A.jpg"
So it took a little tweaking but this is what I got. I extracted the name of the item and all of the images with xpath expressions, and then in the image pipeline I add the item name and the file numbers to the requests meta keyword arg. Then add those two together in the file_path method of the pipeline.
You could just as easily split the request url and use that as the file name as well. Both approaches will do the trick.
Also for some reason I wasn't getting any images at all with the css selector so I switched it to an xpath expression. If the css works for you then you can switch it back and it should still work.
spider file
import scrapy
from ..items import MyItem
class ImageDownload(scrapy.Spider):
name = 'ImageDownload'
allowed_domains = ['antaira.com']
start_urls = [
'https://www.antaira.com/products/10-100Mbps/LNX-500A',
]
def parse(self, response):
item = MyItem()
raw_image_urls = response.xpath('//div[#class="selectors"]/a/#href').getall()
name = response.xpath("//h1[#class='product-name']/text()").get()
filename = name.split(' ')[0].strip()
urls = [response.urljoin(i) for i in raw_image_urls]
item["name"] = filename
item["image_urls"] = urls
yield item
items.py
from scrapy import Item, Field
class MyItem(Item):
name = Field()
image_urls = Field()
images = Field()
pipelines.py
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
class ImagePipeline(ImagesPipeline):
def file_path(self, request, response=None, info=None, *args, item=None):
filename = request.meta["filename"].strip()
number = request.meta["file_num"]
return filename + "_" + str(number) + ".jpg"
def get_media_requests(self, item, info):
name = item["name"]
for i, url in enumerate(item["image_urls"]):
meta = {"filename": name, "file_num": i}
yield Request(url, meta=meta)
settings.py
ITEM_PIPELINES = {
'project.pipelines.ImagePipeline': 1,
}
IMAGES_STORE = 'image_dir'
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'
With all of this and running scrapy crawl ImageDownloads it creates this directory:
Project
| - image_dir
| | - LNX-500A_0.jpg
| | - LNX-500A_1.jpg
| | - LNX-500A_2.jpg
| | - LNX-500A_3.jpg
| | - LNX-500A_4.jpg
|
| - project
| - __init__.py
| - items.py
| - middlewares.py
| - pipelines.py
| - settings.py
|
| - spiders
| - antaira.py
And these are the files that were created.
LNX-500A_0.jpg LNX-500A_1.jpg LNX-500A_2.jpg
LNX-500A_3.jpg LNX-500A_4.jpg
I am new to scrapy.I downloaded some files using the code bellow. I want to change the names of my downloaded files but I don't know how.
For example, I want to have a list containing names and use it to rename the files that I downloaded.
Any help will be appreciated
my spider
import scrapy from scrapy.loader
import ItemLoader from demo_downloader.items
import DemoDownloaderItem
class FileDownloader(scrapy.Spider):
name = "file_downloader"
def start_requests(self):
urls = [
"https://www.data.gouv.fr/en/datasets/bases-de-donnees-annuelles-des-accidents-corporels-de-la-circulation-routiere-annees-de-2005-a-2019/#_"
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
for link in response.xpath('//article[#class = "card resource-card "]'):
name = link.xpath('.//h4[#class="ellipsis"]/text()').extract_first()
if ".csv" in name:
loader = ItemLoader(item=DemoDownloaderItem(), selector=link)
absolute_url = link.xpath(".//a[#class = 'btn btn-sm btn-primary']//#href").extract_first()
loader.add_value("file_urls", absolute_url)
loader.add_value("files", name)
yield loader.load_item()
items.py
from scrapy.item import Field, Item
class DemoDownloaderItem(Item):
file_urls = Field()
files = Field()
pipelines.py
from itemadapter import ItemAdapter
class DemoDownloaderPipeline:
def process_item(self, item, spider):
return item
settings.py
BOT_NAME = 'demo_downloader'
SPIDER_MODULES = ['demo_downloader.spiders']
NEWSPIDER_MODULE = 'demo_downloader.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
'scrapy.pipelines.files.FilesPipeline': 1
}
DOWNLOAD_TIMEOUT = 1200
FILES_STORE = "C:\\Users\\EL\\Desktop\\work\\demo_downloader"
MEDIA_ALLOW_REDIRECTS = True
I am new to Scrapy . I am trying to download files using media pipeline. But when I am running spider no files are stored in the folder.
spider:
import scrapy
from scrapy import Request
from pagalworld.items import PagalworldItem
class JobsSpider(scrapy.Spider):
name = "songs"
allowed_domains = ["pagalworld.me"]
start_urls =['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']
def parse(self, response):
urls = response.xpath('//div[#class="pageLinkList"]/ul/li/a/#href').extract()
for link in urls:
yield Request(link, callback=self.parse_page, )
def parse_page(self, response):
songName=response.xpath('//li/b/a/#href').extract()
for song in songName:
yield Request(song,callback=self.parsing_link)
def parsing_link(self,response):
item= PagalworldItem()
item['file_urls']=response.xpath('//div[#class="menu_row"]/a[#class="touch"]/#href').extract()
yield{"download_link":item['file_urls']}
Item file:
import scrapy
class PagalworldItem(scrapy.Item):
file_urls=scrapy.Field()
Settings File:
BOT_NAME = 'pagalworld'
SPIDER_MODULES = ['pagalworld.spiders']
NEWSPIDER_MODULE = 'pagalworld.spiders'
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 5
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
'scrapy.pipelines.files.FilesPipeline': 1
}
FILES_STORE = '/tmp/media/'
The output looks like this:
def parsing_link(self,response):
item= PagalworldItem()
item['file_urls']=response.xpath('//div[#class="menu_row"]/a[#class="touch"]/#href').extract()
yield{"download_link":item['file_urls']}
You are yielding:
yield {"download_link": ['http://someurl.com']}
where for scrapy's Media/File pipeline to work you need to yield and item that contains file_urls field. So try this instead:
def parsing_link(self,response):
item= PagalworldItem()
item['file_urls']=response.xpath('//div[#class="menu_row"]/a[#class="touch"]/#href').extract()
yield item
My spider runs without displaying any errors but the images are not stored in the folder here are my scrapy files:
Spider.py:
import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem
class productionSpider(scrapy.Spider):
name = "production"
allowed_domains = ["someurl.com"]
start_urls = [
"someurl.com"
]
def parse(self, response):
for sel in response.xpath('//html/body'):
item = ProductionItem()
img_url = sel.xpath('//a[#data-tealium-id="detail_nav_showphotos"]/#href').extract()[0]
yield scrapy.Request(urlparse.urljoin(response.url, img_url),callback=self.parseBasicListingInfo, meta={'item': item})
def parseBasicListingInfo(item, response):
item = response.request.meta['item']
item = ListResidentialItem()
try:
image_urls = map(unicode.strip,response.xpath('//a[#itemprop="contentUrl"]/#data-href').extract())
item['image_urls'] = [ x for x in image_urls]
except IndexError:
item['image_urls'] = ''
return item
settings.py:
from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline
BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'
CONCURRENT_REQUESTS = 250
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}
items.py
# -*- coding: utf-8 -*-
import scrapy
class ProductionItem(scrapy.Item):
img_url = scrapy.Field()
# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
image_urls = scrapy.Field()
images = scrapy.Field()
pass
My pipeline file is empty i'm not sure what i am suppose to add to the pipeline.py file.
Any help is greatly appreciated.
My Working end result:
spider.py:
import scrapy
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem
from production.items import ImageItem
class productionSpider(scrapy.Spider):
name = "production"
allowed_domains = ["url"]
start_urls = [
"startingurl.com"
]
def parse(self, response):
for sel in response.xpath('//html/body'):
item = ProductionItem()
img_url = sel.xpath('//a[#idd="followclaslink"]/#href').extract()[0]
yield scrapy.Request(urlparse.urljoin(response.url, img_url),callback=self.parseImages, meta={'item': item})
def parseImages(self, response):
for elem in response.xpath("//img"):
img_url = elem.xpath("#src").extract_first()
yield ImageItem(image_urls=[img_url])
Settings.py
BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
IMAGES_STORE = '/Users/home/images'
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# Disable cookies (enabled by default)
items.py
# -*- coding: utf-8 -*-
import scrapy
class ProductionItem(scrapy.Item):
img_url = scrapy.Field()
# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
image_urls = scrapy.Field()
images = scrapy.Field()
class ImageItem(scrapy.Item):
image_urls = scrapy.Field()
images = scrapy.Field()
pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
class MyImagesPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
yield scrapy.Request(image_url)
def item_completed(self, results, item, info):
image_paths = [x['path'] for ok, x in results if ok]
if not image_paths:
raise DropItem("Item contains no images")
item['image_paths'] = image_paths
return item
Since you don't know what to put in the pipelines I assume you can use the default pipeline for images provided by scrapy so in the settings.py file you can just declare it like
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline':1
}
Also, your images path is wrong the / means that you are going to the absolute root path of your machine, so you either put the absolute path to where you want to save or just do a relative path from where you are running your crawler
IMAGES_STORE = '/home/user/Documents/scrapy_project/images'
or
IMAGES_STORE = 'images'
Now, in the spider you extract the url but you don't save it into the item
item['image_urls'] = sel.xpath('//a[#data-tealium-id="detail_nav_showphotos"]/#href').extract_first()
The field has to literally be image_urls if you're using the default pipeline.
Now, in the items.py file you need to add the following 2 fields (both are required with this literal name)
image_urls=Field()
images=Field()
That should work
In my case it was the IMAGES_STORE path that was causing the problem
I did IMAGES_STORE = 'images' and it worked like a charm!
Here is complete code:
Settings:
ITEM_PIPELINES = {
'mutualartproject.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = 'images'
Pipline:
class MyImagesPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
yield scrapy.Request(image_url)
def item_completed(self, results, item, info):
image_paths = [x['path'] for ok, x in results if ok]
if not image_paths:
raise DropItem("Item contains no images")
return item
Just adding my misstake here which threw me of for several hours. Perhaps it can help someone.
From scrapy docs (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline):
Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For some reason I used a colon ":" instead of an equal sign "=".
# My misstake:
IMAGES_STORE : '/Users/my_user/images'
# Working code
IMAGES_STORE = '/Users/my_user/images'
This dosen't return an error but instead leads to the pipeline not loading at all which for me was pretty hard to trouble shoot.
You have to enable SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES in the settings.py file
class AljazeeraSpider(XMLFeedSpider):
name = "aljazeera"
allowed_domains = ["aljazeera.com"]
start_urls = [
'http://www.aljazeera.com/',
]
def parse(self, response):
hxs = HtmlXPathSelector(response) # The xPath selector
titles = hxs.select('//div[contains(#class,"SkyScrapperBoxes")]/div[contains(#class,"skyscLines")]')
if not titles:
MailNotify().send_mail("Aljazeera", "Scraper Report")
items = []
for titles in titles:
item = NewsItem()
item['title'] = escape(''.join(titles.select('a/text()').extract()))
item['link'] = "http://www.aljazeera.com" + escape(''.join(titles.select('a/#href').extract()))
item['description'] = ''
item = Request(item['link'], meta={'item': item}, callback=self.parse_detail)
items.append(item)
return items
def parse_detail(self, response):
item = response.meta['item']
sel = HtmlXPathSelector(response)
detail = sel.select('//td[#class = "DetailedSummary"]')
item['details'] = remove_html_tags(escape(''.join(detail.select('p').extract())))
item['location'] = ''
published_date = sel.select('//span[#id = "ctl00_cphBody_lblDate"]')
item['published_date'] = escape(''.join(published_date.select('text()').extract()))
return item
I am currently working on Scrapy to crawl the website. I have some knowledge about unittest in python. But,How can I write the unittest to check that link is working, and item['location'], item['details'] are returning the value or not? I have learned Scrapy contract but cannot understand anything.So, how can write the unittest in this case?
If we are talking specifically about how to test the spiders (not pipelines, or loaders), then what we did is provided a "fake response" from a local HTML file. Sample code:
import os
from scrapy.http import Request, TextResponse
def fake_response(file_name=None, url=None):
"""Create a Scrapy fake HTTP response from a HTML file"""
if not url:
url = 'http://www.example.com'
request = Request(url=url)
if file_name:
if not file_name[0] == '/':
responses_dir = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(responses_dir, file_name)
else:
file_path = file_name
file_content = open(file_path, 'r').read()
else:
file_content = ''
response = TextResponse(url=url, request=request, body=file_content,
encoding='utf-8')
return response
Then, in your TestCase class, call the fake_response() function and feed the response to the parse() callback:
from unittest.case import TestCase
class MyTestCase(TestCase):
def setUp(self):
self.spider = MySpider()
def test_parse(self):
response = fake_response('input.html')
item = self.spider.parse(response)
self.assertEqual(item['title'], 'My Title')
# ...
Aside from that, you should definitely start using Item Loaders with input and output processors - this would help to achieve a better modularity and, hence, isolation - spider would just yield item instances, data preparation and modification would be incapsulated inside the loader, which you would test separately.