Why am I getting an error when I runspider? - python

I am working through an exercise where I put Amazon reviews for a specific product into a CSV file. I have put together code to extract the data, but I get a syntax error when I use runspider to write the CSV. I copied this part directly from the practice module I am following, so I am not sure what the issue could be. All of the resources I have found on runspider indicate that the code should be correct, but clearly I've done something wrong here.
Here is my code. I am getting an error on the very last line:
import scrapy
# Implementing Spider
class ReviewspiderSpider(scrapy.Spider):
    # Name of Spider
    name = 'reviewspider'
    allowed_domains = ["amazon.com"]
    start_urls = ['https://www.amazon.com/product-reviews/B07N49F51N/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=']
    def parse(self, response):
        names = response.xpath('//span[@class="a-profile-name"]/text()').extract()
        reviewTitles = response.xpath('//a[@data-hook="review-title"]/span/text()').extract()
        starRatings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
        reviews = response.xpath('//span[@data-hook="review-body"]/span/text()').extract()
        noOfComments = response.xpath('//span[@class="a-size-base"]/text()').extract()
        for (name, title, rating, review, comments) in zip(names, reviewTitles, starRatings, reviews, noOfComments):
            yield {'Name': name, 'Title': title, 'Rating': rating, 'Review': review, 'No of Comments': comments }
scrapy runspider spiders/reviewspider.py -t csv -o - > amazonreviews.csv
Here is the Error Message:
File "<ipython-input-35-6e8796e727d9>", line 22
scrapy runspider <reviewspider.py> -t csv -o - > amazonreviews.csv
^
SyntaxError: invalid syntax
What am I missing here? I am very new to Python, webscraping and scrapy so any and all breakdown/insight is useful.

The line
scrapy runspider spiders/reviewspider.py -t csv -o - > amazonreviews.csv
is not part of your Python code. It is a shell command that runs your spider.
Open cmd or Anaconda Prompt, go to your project location, and run:
scrapy runspider reviewspider.py -t csv -o amazonreviews.csv
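To see why the interpreter rejects that line, here is a minimal sketch using only the standard library. It asks Python's own parser whether a piece of text is valid Python. The helper name is_valid_python is hypothetical and is not part of Scrapy.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python code, False on a SyntaxError."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# The shell command from the answer: the Python parser cannot read it.
shell_command = "scrapy runspider reviewspider.py -t csv -o amazonreviews.csv"
# An ordinary Python statement for comparison.
python_statement = "import scrapy"

command_is_python = is_valid_python(shell_command)
statement_is_python = is_valid_python(python_statement)
print(command_is_python, statement_is_python)
```

The traceback in the question comes from an IPython input cell. In IPython or Jupyter, a line that starts with ! is passed to the shell instead of the Python parser, so that may be another way to run the command from a notebook.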

Related

First Python Scrapy Web Scraper Not Working

I took the Data Camp Web Scraping with Python course and am trying to run the 'capstone' web scraper in my own environment (the course takes place in a special in-browser environment). The code is intended to scrape the titles and descriptions of courses from the Data Camp webpage.
I've spent a good deal of time tinkering here and there, and at this point I am hoping that the community can help me out.
The code I am trying to run is:
# Import scrapy
import scrapy
# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess
# Create the Spider class
class YourSpider(scrapy.Spider):
    name = 'yourspider'
    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url= https://www.datacamp.com, callback = self.parse)
    def parse (self, response):
        # Parser, Maybe this is where my issue lies
        crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
        crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
        for crs_title, crs_descr in zip(crs_titles, crs_descrs):
            dc_dict[crs_title] = crs_descr
# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()
# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()
# Print a preview of courses
previewCourses(dc_dict)
I get the following output:
C:\Users*\PycharmProjects\TestScrape\venv\Scripts\python.exe C:/Users/*/PycharmProjects/TestScrape/main.py
File "C:\Users******\PycharmProjects\TestScrape\main.py", line 20
yield scrapy.Request(url=https://www.datacamp.com, callback=self.parse1)
^
SyntaxError: invalid syntax
Process finished with exit code 1
I notice that the parse method in line 20 remains grey in my PyCharm window. Maybe I am missing something important in the parse method?
Any help in getting the code to run would be greatly appreciated!
Thank you,
-WolfHawk
The error message is triggered in the following line:
yield scrapy.Request(url=https://www.datacamp.com, callback = self.parse)
The url argument must be a string, and a string literal has to be enclosed in ' or " quotes.
Try this:
yield scrapy.Request(url='https://www.datacamp.com', callback = self.parse)
If this is your full code, you are also missing the function previewCourses. Check if it is provided to you or write it yourself with something like this:
def previewCourses(dict_to_print):
    for key, value in dict_to_print.items():
        print(key, value)

Error on running "scrapy crawl quotes" and "scrapy genspider quotes quotes.toscrape.com" commands

I followed the tutorial to make a web scraper.
Overview:
created a virtual environment (virtualenv .)
activated it (.\Scripts\activate)
changed to the directory where scrapy.cfg lies (cd quotetutorial)
created quotes_spider.py
executed scrapy crawl quotes and scrapy genspider quotes quotes.toscrape.com, and got the same error from both
spider_quotes.py file content:-
class QuoteSpider(scrapy.Spider):#inheriting from class scrapy from spider
    name='quotes'
    start_urls=['https://quotes.toscrape.com/']
    def parse(self,response)
        title=response.css('title').extract()
        yield {'titletext': title}
Even after running scrapy crawl quotes in the folder that contains the scrapy.cfg file, I am getting this error.
error message
You forgot the colon (:) at the end of the method definition:
class QuoteSpider(scrapy.Spider):#inheriting from class scrapy from spider
    name='quotes'
    start_urls=['https://quotes.toscrape.com/']
    def parse(self,response): # <- : added
        title=response.css('title').extract()
        yield {'titletext': title}

Python Scrapy: saving to csv/json does not encode Latin2 properly

I am new to Scrapy, and I built a simple spider that scrapes my local news site for titles and amount of comments. It scrapes well, but I have a problem with my language encoding.
I have created a Scrapy project that I then run through anaconda prompt to save the output to a file like so (from the project directory):
scrapy crawl MySpider -o test.csv
I then open the file with the following code:
with open('test.csv', 'r', encoding = "L2") as f:
    file = f.read()
I also tried saving it to json, opening in excel, changing to different encodings from there ... always unreadable, but the characters differ. I am Czech if that is relevant. I need characters like ěščřžýáíé etc., but it is Latin.
What I get: Varuje pĹ\x99ed
What I want: Varuje před
Here is my spider code. I did not change anything in settings or pipeline, though I tried multiple tips from other threads that do this. I spent 2 hours on this already, browsing stack overflow and documentation and I can't find the solution, it's becoming a headache for me. I'm not a programmer so this may be the reason... anyway:
urls = []
for number in range(1,101):
    urls.append('https://www.idnes.cz/zpravy/domaci/'+str(number))
class MySpider(scrapy.Spider):
    name = "MySpider"
    def start_requests(self):
        urls = ['https://www.idnes.cz/zpravy/domaci/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)
    def parse_main(self, response):
        articleBlocks = response.xpath('//div[contains(@class,"art")]')
        articleLinks = articleBlocks.xpath('.//a[@class="art-link"]/@href')
        linksToFollow = articleLinks.extract()
        for url in linksToFollow:
            yield response.follow(url = url, callback = self.parse_arts)
            print(url)
    def parse_arts(self, response):
        for article in response.css('div#content'):
            yield {
                'title': article.css('h1::text').get(),
                'comments': article.css('li.community-discusion > a > span::text').get(),
            }
Scrapy saves feed exports with utf-8 encoding by default.
Opening the file with the correct encoding displays the characters fine.
If you want to change the encoding used, you can do so by using the FEED_EXPORT_ENCODING setting (or using FEEDS instead).
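The garbled text in the question is what UTF-8 bytes look like when they are decoded as Latin-2. Here is a minimal sketch of that round trip using only the standard library. The variable names are illustrative, and "iso-8859-2" is the Latin-2 codec that the question refers to as "L2".

```python
# "Varuje před", written with an escape so the source stays ASCII-safe.
original = "Varuje p\u0159ed"

# Scrapy's feed exports write UTF-8 bytes by default.
utf8_bytes = original.encode("utf-8")

# Reading those bytes back as Latin-2 splits the two-byte "ř" into two
# unrelated characters, which is the mojibake shown in the question.
garbled = utf8_bytes.decode("iso-8859-2")

# Reading the same bytes back as UTF-8 restores the text.
restored = utf8_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```

Under that reading, opening the exported file with encoding="utf-8" instead of "L2" should show the Czech characters correctly without changing anything in Scrapy.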
After one more hour of trial and error, I solved this. The problem was not in Scrapy, it was correctly saving in utf-8, the problem was in the command:
scrapy crawl idnes_spider -o test.csv
that I ran to save it. When I run the command:
scrapy crawl idnes_spider -s FEED_URI=test.csv -s FEED_FORMAT=csv
It works.

'scrapy crawl' does things but doesn't make files

I'm new to Python Scrapy.
When I run the 'scrapy crawl name' command, the cmd window is busy for a while, but in the end it doesn't produce any HTML files.
There seem to be lots of questions about Scrapy not working, but I couldn't find one like this case, so I am posting this question.
This is my code.
import scrapy
class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://blog.scrapinghub.com/page/1/',
        'https://blog.scrapinghub.com/page/2/'
    ]
    def parse(self, response):
        page = reponse.url.split('/')[-1]
        filename = 'posts-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
I went into 'postscrape' (cd postscrape), where all these files and the venv are located.
Then I activated the venv with 'call venv\Scripts\activate.bat'.
Finally, I ran 'scrapy crawl posts' in the cmd window where the venv was activated.
As you can see, this code should produce two HTML files, 'posts-1.html' and 'posts-2.html'.
The command doesn't return any error message and seems busy for a while, but in the end it produces nothing.
What's the problem?
Thank you genius!
There is no need to write items to a file manually. You can simply yield items and pass the -o flag as follows:
scrapy crawl some_spider -o some_file_name.json
You can find more in the documentation.
You are missing the letter 's' in 'response':
page = reponse.url.split('/')[-1]
-->
page = response.url.split('/')[-1]
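One more thing may be worth checking once the typo is fixed. Both start URLs end with a slash, so the last element of split('/') is an empty string rather than the page number. Here is a minimal sketch in plain Python. The helper name page_filename is hypothetical, and stripping the trailing slash is only one possible way to handle it.

```python
def page_filename(url: str) -> str:
    """Build 'posts-<page>.html' from a URL that may end with a slash."""
    # Drop a trailing slash first so the last path segment is the page number.
    page = url.rstrip("/").split("/")[-1]
    return "posts-%s.html" % page


url = "https://blog.scrapinghub.com/page/1/"

# With the trailing slash in place, the last piece after splitting is empty.
last_segment = url.split("/")[-1]

# After stripping the slash, the last piece is the page number.
filename = page_filename(url)

print(repr(last_segment), filename)
```

With the original [-1] index, both pages would be written to a file named 'posts-.html', and the second response would overwrite the first.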

Python Scrapy not outputting to csv file

What am I doing wrong with the script so it's not outputting a csv file with the data? I am running the script with scrapy runspider yellowpages.py -o items.csv and still nothing is coming out but a blank csv file. I have followed different things here and also watched youtube trying to figure out where I am making the mistake and still cannot figure out what I am not doing correctly.
# -*- coding: utf-8 -*-
import scrapy
import requests
search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()
class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]
    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)
Here is a simple spider that runs without a project.
I wrote comments in the code to make it easier to understand. This spider looks for all blocks on all pages for a pair of parameters, "servise" (the service to search for) and "location".
To run it in your case, use:
scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv
The code also works with other queries. For example:
scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json
etc...
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider
class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']
    # We can use any pair servise + location on our request
    def __init__(self, servise=None, location=None):
        self.servise = servise
        self.location = location
    def parse(self, response):
        # If "servise" and "location" are defined
        if self.servise and self.location:
            # Create search phrase using "servise" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send request with url "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Else close our spider
            # You can add a default value if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()
    def parse_result(self, response):
        # all blocks without AD posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If we have next page url
        if next_page:
            # Send request with url "yellowpages.com" + "next_page", then call parse_result
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)
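The search URL in the spider above is built with str.format, so a location such as "Hammond, LA" goes into the query string with its space and comma unescaped. As a possible alternative, here is a minimal sketch that builds the same URL with the standard library so the values are percent-encoded explicitly. The function name build_search_url is hypothetical and is not part of Scrapy.

```python
from urllib.parse import urlencode, urljoin


def build_search_url(base_url: str, service: str, location: str) -> str:
    """Join the site root with an encoded 'search?...' query string."""
    # urlencode escapes spaces, commas and other reserved characters.
    query = urlencode({
        "search_terms": service,
        "geo_location_terms": location,
    })
    return urljoin(base_url, "search?" + query)


search_url = build_search_url("https://www.yellowpages.com/", "Plumbers", "Hammond, LA")
print(search_url)
```

Inside the spider, the same idea could be applied by passing the encoded query to response.urljoin in place of the formatted string.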
for item in items:
    print(item)
Use yield instead of print there:
for item in items:
    yield item
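The difference between the two loops can be sketched without Scrapy. A function that only prints hands nothing back to its caller, while a generator hands back every yielded value, and those values are what a feed export has to work with. The function names and hrefs below are hypothetical. Wrapping each value in a dict is an assumption based on the other spiders in this thread, which all yield dicts.

```python
def parse_with_print(hrefs):
    """Print each href; the caller receives nothing."""
    for href in hrefs:
        print(href)


def parse_with_yield(hrefs):
    """Yield one dict per href; the caller can collect them."""
    for href in hrefs:
        yield {"url": href}


# Hypothetical hrefs standing in for the extracted links.
hrefs = ["/hammond-la/mip/example-a", "/hammond-la/mip/example-b"]

printed_result = parse_with_print(hrefs)       # None: nothing to export
yielded_items = list(parse_with_yield(hrefs))  # two dicts, ready to export

print(printed_result, yielded_items)
```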
On inspection of your code, I notice a number of problems:
First, you initialize items to a tuple, when it should be a list: items = [].
You should change your name property to reflect the name you want on your crawler so you can use it like so: scrapy crawl my_crawler where name = "my_crawler".
start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search string you want to use. If you have a number of search strings and want to iterate over them, I would suggest using a middleware.
When you try to extract the data from CSS you're forgetting to call extract() (or getall()), which would actually turn your selector into string data that you can use.
Also, you shouldn't be redirecting to the standard output stream because a lot of logging goes there and it'll make your output file really messy. Instead, you should extract the responses into items using loaders.
Finally, you're probably missing the appropriate settings from your settings.py file. You can find the relevant documentation here.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]
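FEED_EXPORT_FIELDS decides which item fields are exported and in what order. Here is a rough standard-library sketch of that idea using csv.DictWriter. It only imitates the behaviour and is not how Scrapy implements it. The field names and the sample item are hypothetical.

```python
import csv
import io

# Hypothetical field list, playing the role of FEED_EXPORT_FIELDS.
EXPORT_FIELDS = ["title", "url"]

# Hypothetical scraped item with an extra key and a different key order.
items = [
    {"url": "https://example.com/a", "title": "A", "phone": "555-0100"},
]

buffer = io.StringIO()
# fieldnames fixes the column order; extrasaction="ignore" drops unlisted keys.
writer = csv.DictWriter(buffer, fieldnames=EXPORT_FIELDS, extrasaction="ignore")
writer.writeheader()
writer.writerows(items)

csv_text = buffer.getvalue()
print(csv_text)
```

In a real project the two settings above would go in settings.py, with the actual keys of the yielded items in place of "Field 1", "Field 2" and "Field 3".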
