I am trying to scrape ticket info from seatgeek, but I am struggling to do so. When I run my code, I get this:
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The idea is that I input the name of the show/event, Scrapy scrapes the URL of each of the show's performances, and then scrapes ticket prices, etc. My code is below:
import scrapy
from seatgeek import items

class seatgeekSpider(scrapy.Spider):
    name = "seatgeek_spider"

    showname = input("Enter Show name (lower case please): ")
    showname = showname.replace(' ', '-')
    start_urls = "https://seatgeek.com/" + showname + "-tickets.html"

    def parse_performance(self, response):
        for href in response.xpath('//a[@class="event-listing-title"]/@href').extract():
            yield scrapy.Request(
                url='https://seatgeek.com/' + href,
                callback=self.parse_ticketinv,
                method="POST",
                meta={'url': href})

    def parse_ticketinv(self, response):
        price = response.xpath('//span[@class="omnibox__listing__buy__price"]').extract()
        performance = response.xpath('//div[@class="event-detail-words faint-words"]/text()').extract()
        quantity = response.xpath('//div[@class="omnibox__seatview__availability"]/text()').extract()
        seatinfo = response.xpath('//div[@class="omnibox__listing__section"]/text()').extract()

        # creating scrapy items
        item = items.seatgeekItem()
        item['price'] = price
        item['performance'] = performance
        item['quantity'] = quantity
        item['seatinfo'] = seatinfo
        yield item
This is my items.py code:
import scrapy

class SeatgeekItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    price = scrapy.Field()
    performnace = scrapy.Field()
    quantity = scrapy.Field()
    seatinfo = scrapy.Field()
Any help would be greatly appreciated - thank you!
There are two immediate problems I can see:
start_urls should be a list; you should see an error like this as well:
Traceback (most recent call last):
(...)
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
By default, the callback used for urls in start_urls is parse(), which is not defined in your code. Maybe you should rename your parse_performance() method?
Also, spider arguments are the more common way to get user input.
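For illustration, here is a minimal sketch of what those fixes could look like together, with the show name passed as a spider argument instead of input(). The -a showname=... argument and the URL pattern are assumptions based on the code above, not verified against SeatGeek:

import scrapy

class SeatgeekSpider(scrapy.Spider):
    name = "seatgeek_spider"

    def __init__(self, showname='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # spider argument: scrapy crawl seatgeek_spider -a showname="some show"
        showname = showname.replace(' ', '-')
        # start_urls must be a list of full URLs, not a single string
        self.start_urls = ['https://seatgeek.com/' + showname + '-tickets.html']

    # the default callback for start_urls is parse(), so define it
    def parse(self, response):
        for href in response.xpath('//a[@class="event-listing-title"]/@href').extract():
            yield scrapy.Request(
                url=response.urljoin(href),
                callback=self.parse_ticketinv,
                meta={'url': href})

    def parse_ticketinv(self, response):
        # unchanged from the question
        ...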
I am scraping BBC food for recipes. The logic is as follows:
Main page with about 20 cuisines
-> in each cuisine, there's usually ~20 recipes on 1-3 pages for each letter.
-> in each recipe, there is about 6 things I scrape (ingredients, rating etc.)
Therefore, my logic is: get to main page, create request, extract all cuisine links, then follow each, from there extract each page of recipes, follow each recipe link, and from each recipe finally get all data. Note this is not finished yet as I need to implement the spider to also go through all letters.
I would love to have a 'category' column, i.e. for each recipe from the "african cuisine" link an entry that says "african", for each recipe from the "italian cuisine" link an "italian" entry, and so on.
Desired outcome:
cook_time prep_time name cuisine
10 30 A italian
20 10 B italian
30 20 C indian
20 10 D indian
30 20 E indian
Here is my spider so far:
import scrapy
from recipes_cuisines.items import RecipeItem

class ItalianSpider(scrapy.Spider):
    name = "italian_spider"

    def start_requests(self):
        start_urls = ['https://www.bbc.co.uk/food/cuisines']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_cuisines)

    def parse_cuisines(self, response):
        cuisine_cards = response.xpath('//a[contains(@class,"promo__cuisine")]/@href').extract()
        for url in cuisine_cards:
            yield response.follow(url=url, callback=self.parse_main)

    def parse_main(self, response):
        recipe_cards = response.xpath('//a[contains(@class,"main_course")]/@href').extract()
        for url in recipe_cards:
            yield response.follow(url=url, callback=self.parse_card)
        next_page = response.xpath('//div[@class="pagination gel-wrap"]/ul[@class="pagination__list"]/li[@class="pagination__list-item pagination__priority--0"]/a[@class="pagination__link gel-pica-bold"]/@href').get()
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            print(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_main)

    def parse_card(self, response):
        item = RecipeItem()
        item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').extract()
        item['prep_time'] = response.xpath('//div[contains(@class,"recipe-metadata-wrap")]/p[@class="recipe-metadata__prep-time"]/text()').extract_first()
        item['cook_time'] = response.xpath('//p[contains(@class,"cook-time")]/text()').extract_first()
        item['servings'] = response.xpath('//p[contains(@class,"serving")]/text()').extract_first()
        item['ratings_amount'] = response.xpath('//div[contains(@class,"aggregate-rating")]/span[contains(@class,"aggregate-rating__total")]/text()').extract()
        #item['ratings_amount'] = response.xpath('//*[@id="main-content"]/div[1]/div[4]/div/div[1]/div/div[1]/div[2]/div[1]/span[2]/text()').extract()
        item['ingredients'] = response.css('li.recipe-ingredients__list-item > a::text').extract()
        return item
and items:
import scrapy

class RecipeItem(scrapy.Item):
    name = scrapy.Field()
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()
    servings = scrapy.Field()
    ratings_amount = scrapy.Field()
    rating = scrapy.Field()
    ingredients = scrapy.Field()
    cuisine = scrapy.Field()
Note I am saving the output via
scrapy crawl italian_spider -o test.csv
I have read the documentation and tried several things, such as adding the extracted cuisine to the parse_cuisines or parse_main methods, but they all yield an error.
There are two ways here. The most common way to pass some information from one page to another is to use cb_kwargs in your scrapy.Request:
def parse_cuisine(self, response):
    cuisine = response.xpath('//h1/text()').get()
    for recipe_url in response.xpath('//div[@id="az-recipes--recipes"]//a[.//h3]/@href').getall():
        yield scrapy.Request(
            url=response.urljoin(recipe_url),
            callback=self.parse_recipe,
            cb_kwargs={'cuisine': cuisine},
        )

def parse_recipe(self, response, cuisine):
    print(cuisine)
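To tie this back to the desired cuisine column, the same value can be written straight into the item in the recipe callback; a sketch, assuming the RecipeItem from the question:

def parse_recipe(self, response, cuisine):
    item = RecipeItem()
    item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').get()
    item['cuisine'] = cuisine  # value carried over from the cuisine page via cb_kwargs
    yield item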
But on this website you can also find it on the recipe page itself (inside a JSON-LD script, after parsing the JSON):
def parse_recipe(self, response):
    recipe_raw = response.xpath('//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()').get()
    recipe = json.loads(recipe_raw)  # requires "import json" at the top of the spider
    cuisine = recipe['recipeCuisine']
Update: this XPath '//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()' finds a script node that has a type attribute with the value application/ld+json and that also contains the string "@type":"Recipe" in its text.
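As a rough sketch of how that could slot into the question's parse_card (assuming the page's JSON-LD actually exposes a recipeCuisine key, which is worth verifying per page):

import json

# inside the spider class from the question
def parse_card(self, response):
    item = RecipeItem()
    item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').get()
    # read the cuisine from the page's JSON-LD structured data
    recipe_raw = response.xpath(
        '//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()').get()
    if recipe_raw:
        item['cuisine'] = json.loads(recipe_raw).get('recipeCuisine')
    yield item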
I'm trying to extract JSON data from a website with Scrapy, but I'm facing some issues: when I run my spider it gives no error but says it crawled 0 pages. I also use the command-line option to store the output in a JSON file so I can inspect it.
The following code is my spider:
import scrapy

class WineSpider(scrapy.Spider):
    name = "SpidyWine"
    i = 0
    url = 'https://maiscarrinho.com/api/search?q=vinho&pageNumber=%s&pageSize=10'
    start_urls = [url % 1]

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['results']:
            yield {
                'Image': item.get('image')
            }
        if data['Image']:
            i = i + 1
            yield scrapy.Request(self.url % i, callback=self.parse)
And my items class:
import scrapy

class MaiscarrinhoItem(scrapy.Item):
    image = scrapy.Field()
    price = scrapy.Field()
    supermarket = scrapy.Field()
    promotion = scrapy.Field()
    wineName = scrapy.Field()
    brand = scrapy.Field()
For now, I'm just using the Image field in my spider to keep things simple.
Also, my idea when I wrote the if statement in my spider was to deal with the infinite scrolling: when the JSON API has 'Image', it means that the page has content.
Output in Console
Thanks in advance
You did everything right except for one very small mistake.
The field name which contains the image is Image, not image.
Try:
yield {
    'Image': item.get('Image')
}
There is probably also something wrong with your ITEM_PIPELINES setting in your settings.py file.
Answering my own question, after digging into my code for some time... I realized it was down to indentation errors and some syntax errors.
Another point was the pipeline: I forgot to change the last part to the real name of my pipeline, so instead of having 'Maiscarrinho.pipelines.SomePipeline': 300 I now have 'Maiscarrinho.pipelines.MaiscarrinhoPipeline': 300, as shown below.
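For reference, that entry lives in settings.py and looks roughly like this (a sketch based on the project name above; 300 is just the priority I used):

# settings.py
ITEM_PIPELINES = {
    'Maiscarrinho.pipelines.MaiscarrinhoPipeline': 300,
}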
The code below extracts the images as I want, but there is still one problem. Since the page has infinite scrolling, I have a condition to evaluate whether there is an element named 'Image', but for some reason I'm not getting the desired result. It should extract 40 pages, each with 10 images.
import scrapy
import json

class WineSpider(scrapy.Spider):
    name = "SpidyWine"
    url = 'https://maiscarrinho.com/api/search?q=vinho&pageNumber=%s&pageSize=10'
    start_urls = [url % 1]
    i = 1

    def parse(self, response):
        data = json.loads(response.body.decode('utf-8'))
        for item in data['results']:
            yield {
                'Image': item.get('Image')
            }
            if item.get('Image'):
                WineSpider.i += 1
                yield scrapy.Request(self.url % WineSpider.i, callback=self.parse)
I'm trying to get item fields info from different pages using scrapy.
What I am trying to do:
1. main_url > scrape all links from this page > go to each link
2. from each link > scrape info, put info in the items list and go to another link
3. from another link > scrape info and put info in the same items list
4. go to the next link... repeat steps 2-4
5. when all links are done, go to the next page and repeat steps 1-3
I found some information in the links below, but I still can't get the results I want:
How can i use multiple requests and pass items in between them in scrapy python
http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
Goal: to get the below layout results
What I've done is below
My item class
from scrapy.item import Item, Field

class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()
My spider class
import scrapy
from scrapy.http import Request
from myProject.items import myItems

class mySpider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['main_link']

    def parse(self, response):
        items = []
        list1 = response.xpath().extract()  # extract all info from here
        list2 = response.xpath().extract()  # extract all info from here
        for i, j in zip(list1, list2):
            link1 = 'http...' + i
            request = Request(link1, self.parseInfo1, dont_filter=True)
            request.meta['item'] = items
            yield request
            link2 = 'https...' + j
            request = Request(link2, self.parseInfo2, meta={'item': items}, dont_filter=True)
        # Code for crawling to next page

    def parseInfo1(self, response):
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.xpath().extract()
        item['info2'] = response.xpath().extract()
        items.append(item)
        return items

    def parseInfo2(self, response):
        item = myItems()
        items = response.meta['item']
        item['info3'] = response.xpath().extract()
        item['info4'] = response.xpath().extract()
        items.append(item)
        return items
I executed the spider by typing this on the terminal:
> scrapy crawl spider1 -o filename.csv -t csv
I got the results for all the fields, but they are not in the right order. My csv file looks like this:
Does anyone know how to get the results like in my "Goal" above?
I appreciate the help.
Thanks
Never mind, I found my mistake. I instantiated the myItems class twice, which resulted in two new objects and produced the results that I got.
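For anyone hitting the same problem, here is a rough sketch of the pattern that keeps both pages' fields on a single row: create the item once and chain the second request off the first callback instead of instantiating myItems in both. The XPaths and link-building are placeholders copied from the question, not working selectors:

import scrapy
from scrapy.http import Request
from myProject.items import myItems

class mySpider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['main_link']

    def parse(self, response):
        list1 = response.xpath().extract()  # placeholder selectors from the question
        list2 = response.xpath().extract()
        for i, j in zip(list1, list2):
            # carry the second URL along so parseInfo1 can chain to it
            yield Request('http...' + i, self.parseInfo1, dont_filter=True,
                          meta={'link2': 'https...' + j})

    def parseInfo1(self, response):
        item = myItems()  # the item is created exactly once
        item['info1'] = response.xpath().extract()
        item['info2'] = response.xpath().extract()
        yield Request(response.meta['link2'], self.parseInfo2,
                      meta={'item': item}, dont_filter=True)

    def parseInfo2(self, response):
        item = response.meta['item']  # same item, now getting its last two fields
        item['info3'] = response.xpath().extract()
        item['info4'] = response.xpath().extract()
        return item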
I am using Scrapy to collect some data. My Scrapy program collects 100 elements in one session. I need to limit it to 50 or any arbitrary number. How can I do that? Any solution is welcome. Thanks in advance.
# -*- coding: utf-8 -*-
import re
import scrapy

class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["raleigh.craigslist.org"]
    start_urls = [
        "http://raleigh.craigslist.org/search/bab"
    ]
    BASE_URL = 'http://raleigh.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/ral/bab/" + item_id
            item = DmozItem()
            item["link"] = response.url
            item["title"] = "".join(response.xpath("//span[@class='postingtitletext']//text()").extract())
            item["tag"] = "".join(response.xpath("//p[@class='attrgroup']/span/b/text()").extract()[0])
            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item
This is what the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting were made for:
An integer which specifies a number of items. If the spider scrapes more than that amount of items and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by the number of passed items.
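As a usage sketch, the limit can go in settings.py, in the spider's custom_settings, or be passed per run on the command line (50 is just the number from the question):

# settings.py (or the spider's custom_settings dict)
CLOSESPIDER_ITEMCOUNT = 50

scrapy crawl dmoz -s CLOSESPIDER_ITEMCOUNT=50

Note that requests already in flight are still processed after the limit is hit, so the final item count can slightly exceed it.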
I tried alecxe's answer, but I had to combine all 3 limits to make it work, so I'm leaving it here just in case someone else is having the same issue:
class GenericWebsiteSpider(scrapy.Spider):
    """This generic website spider extracts text from websites"""

    name = "generic_website"
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 15,
        'CONCURRENT_REQUESTS': 15,
        'CLOSESPIDER_ITEMCOUNT': 15
    }
    ...
This is a newbie question (new to Scrapy and first question on Stackoverflow):
I currently have a spider to crawl the following Amazon page (http://www.amazon.co.uk/Televisions-TVs-LED-LCD-Plasma/b/ref=sn_gfs_co_auto_560864_1?ie=UTF8&node=560864).
I am trying to scrape the title of each TV and its main (listed) price. I can successfully parse the TV name. However, the TVs listed on Amazon don't all have the same XPath elements; some have a main (listed) price, some have an "as New" price, and some also have an "as Used" price.
My issue is that when a TV does not have a main (listed) price, my CSV output does not record a NULL for that item but instead takes the next XPath item which does have a main price.
Is there a way to check whether an item exists in the XPath content and, if not, to get the spider or the pipeline to record a NULL or ""?
My main spider code is:
class AmazonSpider(BaseSpider):
    name = "amazon"
    allowed_domains = ["amazon.co.uk"]
    start_urls = [
        "http://www.amazon.co.uk/Televisions-TVs-LED-LCD-Plasma/b/ref=sn_gfs_co_auto_560864_1?ie=UTF8&node=560864"
    ]

    def parse(self, response):
        sel = Selector(response)
        title = sel.xpath('.//*[starts-with(@id,"result_")]/h3/a/span/text()').extract()
        price = sel.xpath('.//*[starts-with(@id,"result_")]/ul/li[1]/div/a/span/text()').extract()
        items = []
        for title, price in zip(title, price):
            item = AmazonItem()
            item["title"] = title.strip()
            item["price"] = price.strip()
            items.append(item)
        return items
My pipeline is:
class AmazonPipeline(object):
    def process_item(self, item, spider):
        return item
My items file is:
import scrapy
from scrapy.item import Item, Field

class AmazonItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
I am outputing to CSV as follows:
scrapy crawl amazon -o output.csv -t csv
Thanks in advance!
You can take the XPath relative to each result block so that this won't happen again.
Look at the code below; this might help:
def parse(self, response):
    selector_object = response.xpath('//div[starts-with(@id,"result_")]')
    for select in selector_object:
        title = select.xpath('./h3/a/span/text()').extract()
        title = title[0].strip() if title else 'N/A'
        price = select.xpath('./ul/li[1]/div/a/span/text()').extract()
        price = price[0].strip() if price else 'N/A'
        item = AmazonItem(
            title=title,
            price=price
        )
        yield item
I extended Jithin's approach with a couple of if/else statements, which helped solve my problem:
def parse(self, response):
    selector_object = response.xpath('//div[starts-with(@id,"result_")]')
    for select in selector_object:
        new_price = select.xpath('./ul/li[1]/a/span[1]/text()').extract()
        title = select.xpath('./h3/a/span/text()').extract()
        title = title[0].strip() if title else 'N/A'
        price = select.xpath('./ul/li[1]/div/a/span/text()').extract()
        if price:
            price = price[0].strip()
        elif new_price:
            price = new_price[0].strip()
        item = AmazonItem(
            title=title,
            price=price
        )
        yield item
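As a side note (an alternative to the explicit fallbacks above, not what either answer used): Scrapy's extract_first() also accepts a default value directly, which can shorten the missing-value handling:

title = select.xpath('./h3/a/span/text()').extract_first(default='N/A').strip()
price = select.xpath('./ul/li[1]/div/a/span/text()').extract_first(default='N/A').strip()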