I have a spider that I wrote using the Scrapy framework, and I am having trouble getting any pipelines to work. I have the following code in my pipelines.py:
class FilePipeline(object):
    def __init__(self):
        self.file = open('items.txt', 'wb')

    def process_item(self, item, spider):
        line = item['title'] + '\n'
        self.file.write(line)
        return item
and my CrawlSpider subclass has this line to activate the pipeline for this class.
ITEM_PIPELINES = [
    'event.pipelines.FilePipeline'
]
However when I run it using
scrapy crawl my_spider
I get a line that says
2010-11-03 20:24:06+0000 [scrapy] DEBUG: Enabled item pipelines:
with no pipelines (I presume this is where the logging should output them).
I have tried looking through the documentation, but there don't seem to be any full examples of a whole project that would show whether I have missed anything.
Any suggestions on what to try next, or where to look for further documentation?
Got it! The line needs to go in the settings module for the project. Now it works!
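A minimal sketch of what that looks like in the project's settings.py, assuming the project package is named event as in the question. Recent Scrapy versions expect ITEM_PIPELINES to be a dict mapping each class path to an order value (the list form above is the older syntax); the value 300 here is an arbitrary choice for illustration.

```python
# settings.py (the project settings module, not the spider class)
ITEM_PIPELINES = {
    # Lower numbers run earlier; 300 is an arbitrary illustrative value.
    "event.pipelines.FilePipeline": 300,
}
```

With the setting in place, the "Enabled item pipelines" log line should list the pipeline class.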
I'm willing to bet that it's a capitalisation difference in the word pipeline somewhere:
Pipeline vs. PipeLine
I notice 'event.pipelines.FilePipeline' uses the former, whereas your code uses the latter: which do your filenames use?
(I have fallen victim to this spelling mistake many times!)
I took the Data Camp Web Scraping with Python course and am trying to run the 'capstone' web scraper in my own environment (the course takes place in a special in-browser environment). The code is intended to scrape the titles and descriptions of courses from the Data Camp webpage.
I've spent a good deal of time tinkering here and there, and at this point I am hoping that the community can help me out.
The code I am trying to run is:
# Import scrapy
import scrapy
# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class YourSpider(scrapy.Spider):
    name = 'yourspider'

    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url= https://www.datacamp.com, callback = self.parse)

    def parse (self, response):
        # Parser, Maybe this is where my issue lies
        crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
        crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
        for crs_title, crs_descr in zip(crs_titles, crs_descrs):
            dc_dict[crs_title] = crs_descr

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()
# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()
# Print a preview of courses
previewCourses(dc_dict)
I get the following output:
C:\Users*\PycharmProjects\TestScrape\venv\Scripts\python.exe C:/Users/*/PycharmProjects/TestScrape/main.py
File "C:\Users******\PycharmProjects\TestScrape\main.py", line 20
yield scrapy.Request(url=https://www.datacamp.com, callback=self.parse1)
^
SyntaxError: invalid syntax
Process finished with exit code 1
I notice that the parse method in line 20 remains grey in my PyCharm window. Maybe I am missing something important in the parse method?
Any help in getting the code to run would be greatly appreciated!
Thank you,
-WolfHawk
The error message is triggered in the following line:
yield scrapy.Request(url=https://www.datacamp.com, callback = self.parse)
The url argument must be a string, and string literals are delimited by ' or " at both ends.
Try this:
yield scrapy.Request(url='https://www.datacamp.com', callback = self.parse)
If this is your full code, you are also missing the function previewCourses. Check if it is provided to you or write it yourself with something like this:
def previewCourses(dict_to_print):
    for key, value in dict_to_print.items():
        print(key, value)
As far as I can see nobody has asked this question, and I'm completely stuck trying to solve it. What I have is a spider that sometimes won't return any results (either no results exist or the site could not be scraped due to, say, the robots.txt file), and this results in an empty, headerless, csv file. I have a robot looking to pick up this file, so when it is empty the robot doesn't realise it is finished and in any case without headers cannot understand it.
What I want is to output the csv file with headers every time, even if there are no results. I've tried using json instead but have the same issue - if there is an error or there are no results the file is empty.
I'd be quite happy to call something on the spider closing (for whatever reason, even an error in initialising due to say, a bad url) and writing something to the file.
Thanks.
I solved this by amending my item pipeline and not using the feed exporter on the command line. This let me use close_spider to write the header when there are no results.
I'd obviously welcome any improvements if I've missed something.
Pipeline code:
from scrapy.exceptions import DropItem
from scrapy.exporters import CsvItemExporter
import csv
from Generic.items import GenericItem

class GenericPipeline:
    def __init__(self):
        self.emails_seen = set()
        self.files = {}

    def open_spider(self, spider):
        self.file = open("results.csv", 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        if not self.emails_seen:
            header = GenericItem()
            header["email"] = "None Found"
            self.exporter.export_item(header)
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        if item["email"] in self.emails_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.emails_seen.add(item["email"])
            self.exporter.export_item(item)
            return item
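One possible variation, sketched here with the standard-library csv module instead of CsvItemExporter, is to write the header row as soon as the spider opens, so the file has headers whether or not any items arrive. The class name HeaderFirstCsvPipeline, the path attribute, and the single "email" column are assumptions for illustration. If the crawl fails before open_spider is called, nothing is written in this sketch either.

```python
import csv


class HeaderFirstCsvPipeline:
    """Hypothetical pipeline: the header is written before any item is seen."""

    fieldnames = ["email"]  # assumption: items carry a single "email" field
    path = "results.csv"    # assumption: same output file name as above

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings itself.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(
            self.file, fieldnames=self.fieldnames, extrasaction="ignore"
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy Item objects.
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()
```

Duplicate filtering is left out to keep the sketch short; the DropItem check from the pipeline above could be added to process_item unchanged.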
I am new to Scrapy, and I built a simple spider that scrapes my local news site for titles and amount of comments. It scrapes well, but I have a problem with my language encoding.
I have created a Scrapy project that I then run through anaconda prompt to save the output to a file like so (from the project directory):
scrapy crawl MySpider -o test.csv
When I then open the CSV file with the following code:
with open('test.csv', 'r', encoding = "L2") as f:
file = f.read()
I also tried saving it to json, opening in excel, changing to different encodings from there ... always unreadable, but the characters differ. I am Czech if that is relevant. I need characters like ěščřžýáíé etc., but it is Latin.
What I get: Varuje pĹ\x99ed
What I want: Varuje před
Here is my spider code. I did not change anything in the settings or pipeline, though I tried multiple tips from other threads that do this. I have already spent 2 hours on this, browsing Stack Overflow and the documentation, and I can't find the solution. I'm not a programmer, so this may be the reason. Anyway:
import scrapy

urls = []
for number in range(1,101):
    urls.append('https://www.idnes.cz/zpravy/domaci/'+str(number))

class MySpider(scrapy.Spider):
    name = "MySpider"

    def start_requests(self):
        urls = ['https://www.idnes.cz/zpravy/domaci/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        articleBlocks = response.xpath('//div[contains(@class,"art")]')
        articleLinks = articleBlocks.xpath('.//a[@class="art-link"]/@href')
        linksToFollow = articleLinks.extract()
        for url in linksToFollow:
            yield response.follow(url = url, callback = self.parse_arts)
            print(url)

    def parse_arts(self, response):
        for article in response.css('div#content'):
            yield {
                'title': article.css('h1::text').get(),
                'comments': article.css('li.community-discusion > a > span::text').get(),
            }
Scrapy saves feed exports with utf-8 encoding by default.
Opening the file with the correct encoding displays the characters fine.
If you want to change the encoding used, you can do so by using the FEED_EXPORT_ENCODING setting (or using FEEDS instead).
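A small sketch of why the text looked garbled, using only the sample string from the question: the bytes are valid UTF-8, but the question opened the file with encoding "L2", which Python treats as an alias of ISO-8859-2 (Latin-2). The variable names are illustrative.

```python
# Sample text from the question, encoded the way the feed export wrote it.
text = "Varuje před"
raw = text.encode("utf-8")

# Decoding with the matching codec round-trips cleanly.
decoded_ok = raw.decode("utf-8")

# Decoding the same bytes as Latin-2 splits the two-byte "ř"
# into two unrelated characters.
decoded_bad = raw.decode("iso8859_2")

print(decoded_ok)
print(repr(decoded_bad))
```

Under that reading, opening the export with open('test.csv', encoding='utf-8') should show the expected characters.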
After one more hour of trial and error, I solved this. The problem was not in Scrapy, which was correctly saving in UTF-8; the problem was in the command:
scrapy crawl idnes_spider -o test.csv
that I ran to save it. When I run the command:
scrapy crawl idnes_spider -s FEED_URI=test.csv -s FEED_FORMAT=csv
It works.
I'm a newbie to Python and Scrapy.
When I run the 'scrapy crawl name' command, the cmd window is very busy for a while, but in the end it doesn't produce any HTML files.
There seem to be lots of questions about Scrapy not working, but I couldn't find one like this case, so I am posting this question.
This is my code.
import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://blog.scrapinghub.com/page/1/',
        'https://blog.scrapinghub.com/page/2/'
    ]

    def parse(self, response):
        page = reponse.url.split('/')[-1]
        filename = 'posts-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
I ran 'cd postscrape' to go to the folder where all these files and the venv are located.
Then I activated the venv with 'call venv\Scripts\activate.bat'.
Finally I ran 'scrapy crawl posts' in the cmd window where the venv was activated.
As you can see, this code should produce two HTML files, 'posts-1.html' and 'posts-2.html'.
The command doesn't return any error message and seems to be busy doing something, but in the end it produces nothing.
What's the problem?
Thank you genius!
There is no need to write items to a file manually. You can simply yield items and pass the -o flag as follows:
scrapy crawl some_spider -o some_file_name.json
You can find more details in the documentation.
You missed the letter 's' in 'response'.
page = reponse.url.split('/')[-1]
-->
page = response.url.split('/')[-1]
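Separately from the typo, it may be worth checking which path segment the index selects. The start URLs in the question end with a slash, so splitting on '/' leaves an empty string in the last position. A quick check in plain Python, using one of those URLs:

```python
url = "https://blog.scrapinghub.com/page/1/"
parts = url.split("/")

last = parts[-1]         # empty string, because the URL ends with "/"
second_last = parts[-2]  # the page number as a string

name_with_last = "posts-%s.html" % last
name_with_second_last = "posts-%s.html" % second_last
print(repr(last), repr(second_last))
```

With [-1] both pages would be written to the same 'posts-.html' file, so [-2] looks like the index that matches the file names the question expects.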
I'm trying to save screenshots of scraped webpages with Scrapy Splash. I've copied and pasted the code found here into my pipeline folder: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Here's the code from the url:
import scrapy
import hashlib
from urllib.parse import quote

class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await spider.crawler.engine.download(request, spider)

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
I've also followed the instructions for setting up splash found here: https://github.com/scrapy-plugins/scrapy-splash
When I call the command scrapy crawl spider, everything works correctly except the pipeline.
This is the "Error" I'm seeing.
<coroutine object ScreenshotPipeline.process_item at 0x7f29a9c7c8c0>
The spider is yielding the item correctly, but it will not process the item.
Does anyone have any advice? Thank you.
Edit:
I think what is going on is that Scrapy is calling the process_item() method as you normally would. However according to these docs: https://docs.python.org/3/library/asyncio-task.html a coroutine object must be called differently.
asyncio.run(process_item()) rather than process_item().
I think I may have to modify the source code?
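The behaviour described in the edit can be reproduced in plain Python, independent of Scrapy. In this sketch process_item is a hypothetical stand-in function, not the pipeline method itself: calling an async function only creates a coroutine object, and its body runs only once something awaits it or hands it to an event loop.

```python
import asyncio


async def process_item(item):
    # Hypothetical stand-in for an async pipeline method.
    return item


# Calling the function does not run its body; it returns a coroutine object.
pending = process_item({"url": "https://example.com"})
kind = type(pending).__name__

# Driving the coroutine on an event loop produces the actual return value.
result = asyncio.run(pending)
print(kind, result)
```

This matches the "<coroutine object ...>" output above: whatever called process_item received the coroutine and did not await it.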
You should use scrapy-splash in the spider script itself, not in the pipelines.
I followed these docs and it works for me.