Hi, can someone help me out? I seem to be stuck. I am learning how to crawl and save into MySQL using Scrapy. I am trying to get Scrapy to crawl all of the website's pages, starting with "start_urls", but it does not automatically crawl all of the pages, only the first one. It does save that page into MySQL through pipelines.py. It also crawls all pages, and saves the data through pipelines.py, when I provide the URLs from a file with f = open("urls.txt").
Here is my code:
test.py
import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *

class GotP(CrawlSpider):
    name = "gotp"
    allowed_domains = ["www.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    rules = [
        Rule(SgmlLinkExtractor(
            allow=('')),
            callback ="parse",
            follow=True
        )
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.select('//div[@class="sliderforward arrow"]')
        for price in prices:
            item = GotPItem()
            item ["price"] = price.select("text()").extract()
            yield item
If I understand correctly, you are trying to follow the pagination and extract the results.
In this case, you can avoid CrawlSpider and use the regular Spider class.
The idea is to parse the first page, extract the total results count, calculate how many pages there are, and yield scrapy.Request instances for the same URL, passing the result offset as the value of the s GET parameter.
Implementation example:
import scrapy

class GotP(scrapy.Spider):
    name = "gotp"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    results_per_page = 100

    def parse(self, response):
        # the first page reports the total number of results
        total_count = int(response.xpath('//span[@class="totalcount"]/text()').extract()[0])
        # request every results page by its offset (s=0, s=100, ...)
        for page in range(0, total_count, self.results_per_page):
            yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s&" % page, callback=self.parse_result, dont_filter=True)

    def parse_result(self, response):
        results = response.xpath("//p[@data-pid]")
        for result in results:
            try:
                print(result.xpath(".//span[@class='price']/text()").extract()[0])
            except IndexError:
                print("Unknown price")
This would follow the pagination and print prices on the console. Hope this is a good starting point.
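The offset arithmetic behind the pagination can be checked on its own, without running a crawl. The following is a minimal sketch in plain Python 3; the helper name page_urls is made up for illustration, and the page size of 100 is taken from the answer above rather than verified against the site.

```python
def page_urls(total_count, results_per_page=100,
              base_url="http://sfbay.craigslist.org/search/sss"):
    """Return one search URL per results page, using the s offset parameter."""
    return [
        "%s?s=%d" % (base_url, offset)
        for offset in range(0, total_count, results_per_page)
    ]


# 250 results at 100 per page need three requests: offsets 0, 100 and 200
for url in page_urls(250):
    print(url)
```

In a spider, each of these URLs would be wrapped in a scrapy.Request, as in the answer's parse method.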
Related
This is my first question here and I'm learning how to code by myself, so please bear with me.
I'm working on a final CS50 project in which I'm trying to build a website that aggregates online Spanish courses from edx.org and maybe other open online course websites. I'm using the Scrapy framework to scrape the filtered results for Spanish courses on edx.org... Here is my first Scrapy spider, in which I'm trying to follow each course link and get its name (once the code works, I also want the description, course URL and more).
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader

class Course_item(Item):
    name = Field()
    #description = Field()
    #img_url = Field()

class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['https://www.edx.org/']
    start_urls = ['https://www.edx.org/course/?language=Spanish']
    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        item = ItemLoader(Course_item, response)
        item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')
        yield item.load_item()
When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv" I get an empty CSV file, and I also think it is not reaching the right Spanish course results.
Basically, I want to visit each course listed at this link (edx Spanish courses) and get the name, description, provider, page URL and image URL.
Any ideas what the problem might be?
You can't get edx content with a simple request. The site uses JavaScript rendering to load the course elements dynamically, so CrawlSpider won't work in this case: you need to find specific elements inside the response body to generate a new Request that will get what you need.
The real request (to get the URLs of the courses) is this one, but you need to generate it from the previous response body (although you could just visit it directly and also get the correct data).
So, to generate the real request, you need data that is inside a script tag:
from scrapy import Spider
import re
import json

class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
        json_data = json.loads(parseable_json_data)
        ...
Now you have what you need in json_data and only need to build the URL string.
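The regex-plus-JSON step can be tried outside Scrapy. This is a minimal sketch in plain Python 3; the sample script text and its keys ("basePath", "catalog") are hypothetical stand-ins, not the real contents of the edx page, and the braces in the pattern are escaped here only to make the intent explicit.

```python
import json
import re

# Hypothetical script content, shaped like a call that passes a JSON object
# after "Drupal.settings, "; the real page's keys will differ.
SAMPLE_SCRIPT = (
    '<script>jQuery.extend(Drupal.settings, '
    '{"basePath": "/", "catalog": {"language": "Spanish"}});</script>'
)


def extract_settings(script_text):
    """Return the JSON object that follows 'Drupal.settings, ', or None if absent."""
    match = re.search(r'Drupal\.settings, (\{.+\})', script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


settings = extract_settings(SAMPLE_SCRIPT)
print(settings["catalog"]["language"])
```

Returning None when the pattern is missing avoids the AttributeError that calling .group(1) on a failed search would raise.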
This page uses JavaScript to get data from the server and add it to the page.
It uses URLs like
https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017
The last part is the course ID, which you can find in the HTML:
<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">
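The step from the HTML attribute to the API URL can be sketched with the standard library alone. This is only an illustration in Python 3: the names CourseIdParser and api_url_for are made up here, and the API prefix is the one quoted above, not something verified against the live site.

```python
from html.parser import HTMLParser

API_PREFIX = "https://www.edx.org/api/catalog/v2/courses/"


class CourseIdParser(HTMLParser):
    """Collect data-course-id from the element whose id is course-info-page."""

    def __init__(self):
        super().__init__()
        self.course_id = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id") == "course-info-page":
            self.course_id = attributes.get("data-course-id")


def api_url_for(html):
    """Return the catalog API URL for the course in html, or None if no ID is found."""
    parser = CourseIdParser()
    parser.feed(html)
    if parser.course_id is None:
        return None
    return API_PREFIX + parser.course_id


html = '<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">'
print(api_url_for(html))
```

Inside a spider, the same lookup is a single XPath expression on the response; the standalone version is only meant to make the URL construction easy to test.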
Code
from scrapy.http import Request
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
import json

class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['www.edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']
    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print('parse_item url:', response.url)
        # the course ID is stored in the data-course-id attribute of the page
        course_id = response.xpath('//*[@id="course-info-page"]/@data-course-id').extract_first()
        if course_id:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
            yield Request(url, callback=self.parse_json)

    def parse_json(self, response):
        print('parse_json url:', response.url)
        item = json.loads(response.body)
        return item

# run the spider as a standalone script and write the items to output.csv
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',  # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(Course_spider)
c.start()

from scrapy.http import Request
from scrapy import Spider
import json

class edx_scraper(Spider):
    name = "edxScraper"
    start_urls = [
        'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for course in data['objects']['results']:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
            yield response.follow(url, self.course_parse)
        # follow the next page only when the API actually returns a URL for it
        if data['objects'].get('next'):
            yield response.follow(data['objects']['next'], self.parse)

    def course_parse(self, response):
        course = json.loads(response.text)
        yield {
            'name': course['title'],
            'effort': course['effort'],
        }

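A "is there a next page" check of this kind is easy to get subtly wrong, because an expression such as 'next' in d is not None is a chained comparison in Python: it means ('next' in d) and (d is not None), so it stays true even when the value stored under "next" is None. Below is a minimal sketch in Python 3 with hypothetical payloads shaped like the search responses used above; the helper name next_page_url is made up for illustration.

```python
def next_page_url(data):
    """Return the URL of the next page, or None when there is none."""
    objects = data.get("objects") or {}
    return objects.get("next") or None


# Hypothetical payloads: a middle page that links onward and a last page that does not.
middle_page = {
    "objects": {"results": [], "next": "https://www.edx.org/api/v1/catalog/search?page=2"}
}
last_page = {"objects": {"results": [], "next": None}}

# The chained comparison is still true on the last page, although "next" is None.
chained = "next" in last_page["objects"] is not None

print(next_page_url(middle_page))
print(next_page_url(last_page))
print(chained)
```

Checking the value itself, rather than only the presence of the key, is what stops the crawl cleanly on the last page.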
I'm using the latest version of scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make scrapy crawl only the URL(s) fed to it as part of start_url list. In most cases I want to crawl only 1 page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth level=1 but I'm not sure that in testing it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider --- isn't that what I'm doing in the code? Also are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider meant for crawling multiple categories inside pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback of the start requests, and do not yield any further requests from it.
Below is the code for a spider that scrapes the title from a blog (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy
# DmozItem is assumed to be an Item with "title" and "article" fields defined in your project's items.py

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        # parse is the default callback for start_urls; no new requests are
        # yielded here, so only the listed pages are scraped
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        return [item]
The above code will only fetch the title and the article body of the given article.
I had the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed to
import scrapy
from scrapy.spiders import Spider
And changed the class to
class mySpider(Spider):
I'm using a Scrapy web crawler to extract a bunch of data. As I describe here, I've figured out a brute-force way to get the information I want, but it's really pretty crude: I just enumerate all the pages I want to scrape, which is a few hundred. I need to get this done, so I might just grit my teeth and bear it, but it would be so much nicer to automate this. How could this process be implemented with link extraction using Scrapy? I've looked at the documentation and made some experiments, as I describe in the question linked above, but nothing has worked yet. This is the brute-force code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from brute_force.items import BruteForceItem

class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/21/PWAZAZAZXVILEPWXV.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQCQILEPWB.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQKOILEPWD.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQUYILEPWF.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQMEILEKOCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQRNILEKOKO.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILUYUYKOTBCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZAZRNILEPWRN.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQPWILEPWC.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQILILEPWE.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQTBILEKOAZ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQXVILEKOPW.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZPWAZILEKOIL.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILRNUYKOTBUY.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        items.append(item)
        return items
I think this is what you want:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from brute_force.items import BruteForceItem
from urlparse import urljoin

class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ['http://tool.httpcn.com/Zi/BuShou.html']

    def parse(self, response):
        for url in response.css('td a::attr(href)').extract():
            # index pages go back to parse; everything else is a final page
            cb = self.parse if '/zi/bushou' in url.lower() else self.parse_item
            yield Request(urljoin(response.url, url), callback=cb)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        return item
Try this. It works as follows:
1.
The spider starts with the start_urls.
2.
self.parse finds all the a tags inside td tags.
If a URL contains '/zi/bushou', the response goes to self.parse again, because it is what you called the 'second layer'.
If it does not contain '/zi/bushou' (I think a more specific regex would be better here), I assume it is a final page that you want, and it goes to the parse_item function.
3.
self.parse_item is the function that extracts the information from the final page.
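The routing decision in step 2, with the "more specific regex" the answer suggests, can be sketched without Scrapy. This is a minimal Python 3 illustration: the pattern, the function name route and the index-page path /Zi/BuShou/2.html are assumptions made for the example, not facts about the site's layout.

```python
import re
from urllib.parse import urljoin, urlparse

# Assumed pattern for index ("second layer") pages: the path starts with /zi/bushou.
INDEX_PAGE = re.compile(r"^/zi/bushou", re.IGNORECASE)


def route(base_url, href):
    """Return (absolute_url, callback_name) for a link found on a page."""
    absolute = urljoin(base_url, href)
    path = urlparse(absolute).path
    callback = "parse" if INDEX_PAGE.match(path) else "parse_item"
    return absolute, callback


start = "http://tool.httpcn.com/Zi/BuShou.html"
print(route(start, "/Zi/BuShou/2.html"))  # hypothetical index page
print(route(start, "/Html/Zi/21/PWAZAZAZXVILEPWXV.shtml"))  # character page
```

Matching against the URL path only, and anchoring the pattern at its start, keeps a character page from being misrouted just because '/zi/bushou' happens to appear somewhere else in its URL.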
Problem: Scrapy keeps visiting a single url and keeps scraping it recursively. I have checked the response.url to ensure that this is a single page that it keeps scraping and there is no query string involved that may render the same page for different url.
What I have done to resolve it:
Under Scrapy/spider.py I noticed that dont_filter was set to True and changed it to False, but it didn't help.
I have also set unique = True in the code, but this didn't help either.
Additional information
The page that's given as start_url has only one link, to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/aj'),unique=('Yes')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            item = DmozItem()
            item['title']= site.xpath('/html/head/meta[3]').extract()
            item['req_url']= response.url
            items.append(item)
        return items
Scrapy, by default, would append into the output file if it exists. What you see in the output.csv is the results of multiple spider runs. Remove the output.csv before running the spider again.
I'm new to Scrapy, and with some tutorials I was able to scrape a few simple websites, but I'm facing an issue now with a new website where I have to fill a search form and extract the results. The response I get doesn't have the results.
Let's say for example, for the following site: http://www.beaurepaires.com.au/store-locator/
I want to provide a list of postcodes and extract information about stores in each postcode (store name and address).
I'm using the following code but it's not working, and I'm not sure where to start from.
class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["http://www.beaurepaires.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]
    #start_urls = ["http://www.beaurepaires.com.au/"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator', formdata={'dealer_postcode_textfield':'2115'}, callback=self.parseBeaurepaires)

    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)
        filename = "postcodetest3.txt"
        open(filename, 'wb').write(response.body)
        table = hxs.select("//div[@id='jl_results']/table/tbody")
        headers = table.select("tr[position()<=1]")
        data_rows = table.select("tr[position()>1]")
Thanks!!
The page load here relies heavily on JavaScript and is too complex for Scrapy. Here's an example of what I've come up with:
import re
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["beaurepaires.com.au", "gdt.rightthere.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator',
                                        formdata={'dealer_postcode_textfield':'2115'},
                                        callback=self.parseBeaurepaires)

    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)
        # the fourth script tag holds the LoadScripts(...) call with the data location
        script = str(hxs.select("//div[@id='jl_container']/script[4]/text()").extract()[0])
        url, script_name = re.findall(r'LoadScripts\("([a-zA-Z:/\.]+)", "(\w+)"', script)[0]
        url = "%s/locator/js/data/%s.js" % (url, script_name)
        yield Request(url=url, callback=self.parse_js)

    def parse_js(self, response):
        print(response.body)  # here are your locations - right, inside the js file
Note that this relies on regular expressions and hardcoded URLs, and you would still have to parse the JS file to get your locations. It is too fragile, even if you finish it and get the locations.
Just switch to in-browser tools like Selenium (or combine Scrapy with it).
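How fragile the LoadScripts extraction is can be seen by running the same regular expression on sample strings. This is a minimal sketch in Python 3: the sample script lines and the script name "beaurepaires" are made up for illustration, and only the pattern and the URL template come from the answer's code.

```python
import re

LOAD_SCRIPTS = re.compile(r'LoadScripts\("([a-zA-Z:/\.]+)", "(\w+)"')


def data_script_url(script_text):
    """Build the data file URL from a LoadScripts(...) call, or return None."""
    match = LOAD_SCRIPTS.search(script_text)
    if match is None:
        return None
    base_url, script_name = match.groups()
    return "%s/locator/js/data/%s.js" % (base_url, script_name)


# Hypothetical script lines; the real page content is not reproduced here.
works = 'LoadScripts("http://gdt.rightthere.com.au", "beaurepaires");'
breaks = 'LoadScripts("http://cdn-1.example.com", "beaurepaires");'

print(data_script_url(works))
print(data_script_url(breaks))  # a digit or hyphen in the host defeats the pattern
```

The second call returns None because the character class allows only letters, colons, slashes and dots, which is the kind of silent breakage that makes a browser-driven approach more robust here.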