Python Scrapy: saving to csv/json does not encode Latin2 properly - python

I am new to Scrapy, and I built a simple spider that scrapes my local news site for article titles and the number of comments. It scrapes fine, but I have a problem with my language encoding.
I have created a Scrapy project that I then run through anaconda prompt to save the output to a file like so (from the project directory):
scrapy crawl MySpider -o test.csv
When I then open the csv file with the following code:
with open('test.csv', 'r', encoding="L2") as f:
    file = f.read()
I also tried saving it to json and opening the file in Excel, switching between different encodings from there; the output is always unreadable, though the garbled characters differ. I am Czech, if that is relevant, so I need characters like ěščřžýáíé, but what I get looks like Latin-encoded mojibake.
What I get: Varuje pĹ\x99ed
What I want: Varuje před
Here is my spider code. I did not change anything in settings.py or the pipeline, though I tried multiple tips from other threads. I have already spent two hours on this, browsing Stack Overflow and the documentation, and I can't find a solution; it's becoming a headache. I'm not a programmer, so that may be part of the problem. Anyway:
import scrapy

urls = []
for number in range(1, 101):
    urls.append('https://www.idnes.cz/zpravy/domaci/' + str(number))

class MySpider(scrapy.Spider):
    name = "MySpider"

    def start_requests(self):
        urls = ['https://www.idnes.cz/zpravy/domaci/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        articleBlocks = response.xpath('//div[contains(@class,"art")]')
        articleLinks = articleBlocks.xpath('.//a[@class="art-link"]/@href')
        linksToFollow = articleLinks.extract()
        for url in linksToFollow:
            yield response.follow(url=url, callback=self.parse_arts)
            print(url)

    def parse_arts(self, response):
        for article in response.css('div#content'):
            yield {
                'title': article.css('h1::text').get(),
                'comments': article.css('li.community-discusion > a > span::text').get(),
            }

Scrapy saves feed exports with utf-8 encoding by default.
Opening the file with the correct encoding displays the characters fine.
If you want to change the encoding used, you can do so by using the FEED_EXPORT_ENCODING setting (or using FEEDS instead).
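For example, reading the exported file back with the encoding Scrapy actually used should already give readable Czech characters (a minimal sketch, assuming the default utf-8 export):

with open('test.csv', 'r', encoding='utf-8') as f:
    file = f.read()

And if you really want the feed itself written in Latin-2, a hedged sketch of the settings.py change would be:

# settings.py
# FEED_EXPORT_ENCODING takes any Python codec name; 'iso-8859-2' is Latin-2.
FEED_EXPORT_ENCODING = 'iso-8859-2'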

After one more hour of trial and error, I solved this. The problem was not in Scrapy; it was correctly saving in utf-8. The problem was the command:
scrapy crawl idnes_spider -o test.csv
that I ran to save the output. When I run this command instead:
scrapy crawl idnes_spider -s FEED_URI=test.csv -s FEED_FORMAT=csv
it works.
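For reference, FEED_URI and FEED_FORMAT are deprecated in newer Scrapy releases in favor of the FEEDS setting, so on Scrapy 2.1 or later an equivalent (a sketch, not tested against this project) would be to put this in settings.py:

# settings.py
FEEDS = {
    'test.csv': {'format': 'csv', 'encoding': 'utf8'},
}

and then run a plain scrapy crawl idnes_spider as before.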

Related

Scrapy Doesn't output json when called from another script

I've written a scrapy crawler to fetch me some sweet sweet data and it works. I'm very impressed with myself for the achievement. I even created a jupyter notebook to process the data from the Json file I created.
But I've created the program so that people at work can use it, and getting them to navigate to a folder and use command lines isn't going to work, so I wanted to make something that I can call and then process afterwards. But for some reason Scrapy just isn't playing ball. I've found a few bits of help, but once the crawl has completed, the json output I've requested doesn't appear. But when I run it from the command line, it shows up.
def parse(self, response):
    resp_dict = json.loads(response.body)
    f = open(file_name, 'w')
    json.dump(resp_dict, f, indent=4)
    f.close()
This is the bit that works, sometimes. I just don't understand why it won't give me an output when called from a different script. I've also tried to add this in, but I think I'm putting it in the wrong place.
settings = get_project_settings()
settings.set('FEED_FORMAT', 'json')
settings.set('FEED_URI', 'result.json')
I can successfully call the Scrapy spider, and I can see the terminal showing me what's going on. But I just can't get the json output. Literally tearing my hair out now.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(HoggleSpider)
process.start()
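If the rest of the project settings should still apply, a variant closer to the asker's settings.set attempt (a sketch, assuming Scrapy 2.1 or later for the FEEDS setting, and that HoggleSpider is importable from the project) would be:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py, then override only the feed output.
settings = get_project_settings()
settings.set('FEEDS', {'result.json': {'format': 'json'}})

process = CrawlerProcess(settings)
process.crawl(HoggleSpider)
process.start()

The key point is that the feed settings must already be in the Settings object handed to CrawlerProcess; changing them after the crawl has been scheduled will not affect the output.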

'scrapy crawl' does things but doesn't make files

I'm a newbie to Python and Scrapy.
When I run the 'scrapy crawl name' command, the cmd window works away busily, but in the end it doesn't spit out any HTML files.
There seem to be lots of questions about Scrapy not working, but I couldn't find one that matches this case, so I am posting this question.
This is my code.
import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://blog.scrapinghub.com/page/1/',
        'https://blog.scrapinghub.com/page/2/'
    ]

    def parse(self, response):
        page = reponse.url.split('/')[-1]
        filename = 'posts-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
I went into the project folder with 'cd postscrape', where all these files and the venv are laid out.
Then I activated the venv with 'call venv\Scripts\activate.bat'.
Finally I ran 'scrapy crawl posts' in the cmd window in which the venv was activated.
As you can see, this code should spit out two HTML files, 'posts-1.html' and 'posts-2.html'.
The command doesn't return any error message and seems to be doing something busily, but in the end it produces nothing.
What's the problem??
Thank you genius!
There is no need to write items to a file manually. You can simply yield items and pass the -o flag as follows:
scrapy crawl some_spider -o some_file_name.json
You can find more in the documentation.
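As a hedged sketch of what that looks like for this spider, yielding items instead of writing files by hand (the title selector is an assumption, not taken from the original post):

import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://blog.scrapinghub.com/page/1/',
        'https://blog.scrapinghub.com/page/2/'
    ]

    def parse(self, response):
        # Yield a plain dict; the feed exporter writes it to the file given with -o.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }

Then scrapy crawl posts -o posts.json produces the output file without any manual open()/write() calls.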
You missed the letter 's' in 'response'.
page = reponse.url.split('/')[-1]
-->
page = response.url.split('/')[-1]

Scrapy Splash Screenshot Pipeline not working

I'm trying to save screenshots of scraped webpages with Scrapy Splash. I've copied and pasted the code found here into my pipeline folder: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Here's the code from the url:
import scrapy
import hashlib
from urllib.parse import quote

class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await spider.crawler.engine.download(request, spider)

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
I've also followed the instructions for setting up splash found here: https://github.com/scrapy-plugins/scrapy-splash
When I run the command scrapy crawl spider, everything works correctly except the pipeline.
This is the "Error" I'm seeing.
<coroutine object ScreenshotPipeline.process_item at 0x7f29a9c7c8c0>
The spider is yielding the item correctly, but it will not process the item.
Does anyone have any advice? Thank you.
Edit:
I think what is going on is that Scrapy is calling the process_item() method the way it normally would. However, according to these docs: https://docs.python.org/3/library/asyncio-task.html a coroutine object must be called differently,
e.g. asyncio.run(process_item()) rather than process_item().
I think I may have to modify the source code?
You should use scrapy-splash inside the spider itself, not in the pipelines.
I followed these docs and it works for me.
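A minimal sketch of that approach with scrapy-splash's SplashRequest and the render.png endpoint (this assumes Splash is running on localhost:8050 and the middleware settings from the scrapy-splash README are already in settings.py; the target URL is just a placeholder):

import hashlib

import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    name = "screenshots"
    start_urls = ["https://example.com"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_png,
                                endpoint="render.png",
                                args={"wait": 0.5})

    def parse_png(self, response):
        # response.body is the PNG rendered by Splash.
        url_hash = hashlib.md5(response.url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)
        yield {"url": response.url, "screenshot_filename": filename}

This sidesteps the coroutine problem entirely, because no async pipeline method is involved.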

Why am I getting an error when I runspider?

I am currently working through an exercise where I put Amazon reviews for a specific product into a csv file. I have put together my code to extract the data, but I am getting a syntax error when I go to run runspider to write the csv. I copied this part directly from the practice module I am following, so I wasn't quite sure what the issue could be. All of the resources I have found on runspider indicate that the code should be correct, but clearly I've done something wrong here.
Here is my code. I am getting an error on the very last line:
import scrapy

# Implementing Spider
class ReviewspiderSpider(scrapy.Spider):
    # Name of Spider
    name = 'reviewspider'
    allowed_domains = ["amazon.com"]
    start_urls = ['https://www.amazon.com/product-reviews/B07N49F51N/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=']

    def parse(self, response):
        names = response.xpath('//span[@class="a-profile-name"]/text()').extract()
        reviewTitles = response.xpath('//a[@data-hook="review-title"]/span/text()').extract()
        starRatings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
        reviews = response.xpath('//span[@data-hook="review-body"]/span/text()').extract()
        noOfComments = response.xpath('//span[@class="a-size-base"]/text()').extract()

        for (name, title, rating, review, comments) in zip(names, reviewTitles, starRatings, reviews, noOfComments):
            yield {'Name': name, 'Title': title, 'Rating': rating, 'Review': review, 'No of Comments': comments}

scrapy runspider spiders/reviewspider.py -t csv -o - > amazonreviews.csv
Here is the Error Message:
File "<ipython-input-35-6e8796e727d9>", line 22
scrapy runspider <reviewspider.py> -t csv -o - > amazonreviews.csv
^
SyntaxError: invalid syntax
What am I missing here? I am very new to Python, webscraping and scrapy so any and all breakdown/insight is useful.
The line
scrapy runspider spiders/reviewspider.py -t csv -o - > amazonreviews.csv
is not part of your Python code. It is just the command used to run your spider, so it belongs in the shell, not in your script.
Go to your project location via cmd or the anaconda prompt and try:
scrapy runspider reviewspider.py -t csv -o amazonreviews.csv
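If you would rather stay inside the notebook instead of a separate shell, a hedged alternative is to run the spider programmatically (a sketch, assuming Scrapy 2.1 or later for the FEEDS setting and that ReviewspiderSpider is importable in the notebook):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {
        "amazonreviews.csv": {"format": "csv"},
    },
})
process.crawl(ReviewspiderSpider)
process.start()

Note that CrawlerProcess can only be started once per Python process, so re-running the cell usually requires restarting the kernel.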

Extract URLs recursively from website archives in scrapy

Hi, I want to crawl the data from http://economictimes.indiatimes.com/archive.cms, where all the urls are archived based on date, month and year. To get the url list first, I am using the code from https://github.com/FraPochetti/StocksProject/blob/master/financeCrawler/financeCrawler/spiders/urlGenerator.py, modified for my website as:
import scrapy
import urllib

def etUrl():
    totalWeeks = []
    totalPosts = []
    url = 'http://economictimes.indiatimes.com/archive.cms'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)
    months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
    admittMonths = 12*(2013-2007) + 8
    months = months[:admittMonths]
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news/day\\d+\.cms')
        totalWeeks += weeks
    for week in totalWeeks:
        data = urllib.urlopen(week).read()
        hxs = scrapy.Selector(text=data)
        posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
        totalPosts += posts
    with open("eturls.txt", "a") as myfile:
        for post in totalPosts:
            post = post + '\n'
            myfile.write(post)

etUrl()
saved file as urlGenerator.py and ran with the command $ python urlGenerator.py
I am getting no result. Could someone tell me how to adapt this code to my website's use case, or suggest another solution?
Try stepping through your code one line at a time using pdb. Run python -m pdb urlGenerator.py and follow the instructions for using pdb in the linked page.
If you step through your code line by line you can immediately see that the line
data = urllib.urlopen(url).read()
is failing to return something useful:
(pdb) print(data)
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://economictimes.indiatimes.com/archive.cms" on this server.<P>
Reference #18.6057c817.1508411706.1c3ffe4
</BODY>
</HTML>
It seems that they are not allowing access by Python's urllib. As pointed out in the comments, you really shouldn't be using urllib anyway; Scrapy is already adept at dealing with this.
A lot of the rest of your code is clearly broken as well. For example this line:
months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
returns an empty list even given the real HTML from this site. If you look at the HTML, it is clearly in a table, not unordered lists (<ul>). You also have the URL format wrong. Instead, something like this would work:
months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
If you want to build a web scraper, rather than starting from some code you found (that isn't even correct) and trying to blindly modify it, try following the official tutorial for Scrapy and start with some very simple examples, then build up from there. For example:
import scrapy
import scrapy.crawler

class EtSpider(scrapy.Spider):
    name = 'et'
    start_urls = ["https://economictimes.indiatimes.com/archive.cms"]

    def parse(self, response):
        months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
        for month in months:
            self.logger.info(month)

process = scrapy.crawler.CrawlerProcess()
process.crawl(EtSpider)
process.start()
This runs correctly, and you can clearly see it finding the correct URLs for the individual months, as printed to the log. Now you can go from there and use callbacks, as explained in the documentation, to make further additional requests.
In the end you'll save yourself a lot of time and hassle by reading the docs and getting some understanding of what you're doing rather than taking some dubious code off the internet and trying to shoehorn it into your problem.
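As a sketch of that next step, the month links can be followed with callbacks instead of urllib (the XPaths for the day and article pages are assumptions based on the original urlGenerator.py, not verified against the live site):

import scrapy
import scrapy.crawler

class EtArchiveSpider(scrapy.Spider):
    name = 'et_archive'
    start_urls = ["https://economictimes.indiatimes.com/archive.cms"]

    def parse(self, response):
        months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
        for month in months:
            yield response.follow(month, callback=self.parse_month)

    def parse_month(self, response):
        # Assumed structure: the month page links to per-day archive pages.
        for day in response.xpath('//table//a/@href').getall():
            yield response.follow(day, callback=self.parse_day)

    def parse_day(self, response):
        # Assumed structure: the day page lists the individual articles.
        for post in response.xpath('//ul[@class="archive"]/li/h1/a/@href').getall():
            yield {'url': response.urljoin(post)}

process = scrapy.crawler.CrawlerProcess()
process.crawl(EtArchiveSpider)
process.start()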
