Scrapy KeyError while processing - python

I couldn't find any answer to my problem so I hope it will be ok to ask here.
I am trying to scrape cinema shows and keep getting a KeyError.
What is really confusing for me is that the problem apparently lies in the pipeline. However, I have a second spider for an opera house with the exact same code (only the place is different) and it works just fine. "Shows" and "Place" refer to my Django models. I've changed their fields to CharFields, so it's not a problem with a wrong date/time format.
I also tried to use a dedicated Scrapy item, "KikaItem", instead of "ShowItem" (which is shared with my opera spider), but the error remains.
class ScrapyKika(object):
    def process_item(self, ShowItem, spider):
        place, created = Place.objects.get_or_create(name="kino kika")
        show = Shows.objects.update_or_create(
            time=ShowItem["time"],
            date=ShowItem["date"],
            place=place,
            defaults={'title': ShowItem["title"]}
        )
        return ShowItem
Here is my spider code. I expect the problem is somewhere here, because I used a different approach than in the opera spider. However, I am not sure what could be wrong.
import scrapy
from ..items import ShowItem, KikaItemLoader

class KikaSpider(scrapy.Spider):
    name = "kika"
    allowed_domains = ["http://www.kinokika.pl/dk.php"]
    start_urls = [
        "http://www.kinokika.pl/dk.php"
    ]

    def parse(self, response):
        divs = response.xpath('//b')
        for div in divs:
            l = KikaItemLoader(item=ShowItem(), response=response)
            l.add_xpath("title", "./text()")
            l.add_xpath("date", "./ancestor::ul[1]/preceding-sibling::h2[1]/text()")
            l.add_xpath("time", "./preceding-sibling::small[1]/text()")
            return l.load_item()
ItemLoader
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join
# strip_string and lowercase are custom processor functions defined elsewhere in the project

class KikaItemLoader(ItemLoader):
    title_in = MapCompose(strip_string, lowercase)
    title_out = Join()

    time_in = MapCompose(strip_string)
    time_out = Join()

    date_in = MapCompose(strip_string)
    date_out = Join()
Thank you for your time and sorry for any misspellings :)

Currently, your spider yields a single item:
{'title': u' '}
which does not have the date and time fields filled out. This is because of the way you initialize the ItemLoader class in your spider.
You should be initializing your item loader with a specific selector in mind. Replace:
for div in divs:
    l = KikaItemLoader(item=ShowItem(), response=response)
with:
for div in divs:
    l = KikaItemLoader(item=ShowItem(), selector=div)
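For reference, here is a minimal sketch of the whole callback with the loader bound to each div. It also uses yield instead of the original return, so that every show on the page is emitted; that part is an extra suggestion, not something the KeyError itself requires.

def parse(self, response):
    for div in response.xpath('//b'):
        # bind the loader to the current <b> node so the relative XPaths
        # below are evaluated against it, not against the whole page
        l = KikaItemLoader(item=ShowItem(), selector=div)
        l.add_xpath("title", "./text()")
        l.add_xpath("date", "./ancestor::ul[1]/preceding-sibling::h2[1]/text()")
        l.add_xpath("time", "./preceding-sibling::small[1]/text()")
        # yield, rather than return, so more than one item can be produced
        yield l.load_item()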

Related

How do I scrape a website that has a next button, and what if it uses scrolling?

I'm trying to scrape all the data from a website called quotestoscrape. But when I run my code it only gets one random quote, even though it should take at least all the quotes from that first page. Also, once I get the data from page 1, I want to get the data from all the other pages too.
So how do I fix this error, so that it takes all the data from page 1?
How do I get the data that is present on the next pages?
items.py file
import scrapy

class QuotetutorialItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
quotes_spider.py file
import scrapy
from ..items import QuotetutorialItem

class QuoteScrapy(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = QuotetutorialItem()
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
        yield items
Please tell me what change I should make.
As reported, your yield is missing an indentation level. To follow the next pages, just add a check for the next button and yield a request following it.
import scrapy

class QuoteScrapy(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = {}
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page)
As @LanteDellaRovere has correctly identified in a comment, the yield statement should be executed on each iteration of the for loop, which is why you are only seeing a single (presumably the last) quote from each page.
As for reading the subsequent pages, you could extract the next link from the <nav> element at the bottom of the page, but the URL structure is very simple: the links (when no tag is specified) are of the form
http://quotes.toscrape.com/page/N/
For N=1 you get the first page, so simply requesting the URLs for increasing values of N until you get a 404 response should work as a simplistic solution.
Not knowing much about Scrapy I can't give you exact code, but the examples at https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links are fairly helpful if you want a more sophisticated and Pythonic approach.
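If you want to try that page-increment idea in Scrapy, a rough sketch could look like the following. It stops when a page contains no quotes rather than waiting for a 404, since that check works regardless of how the site reports the end of the listing; the spider name here is made up for illustration.

import scrapy

class QuotesByPageSpider(scrapy.Spider):
    name = 'quotes_by_page'   # hypothetical name, just for this sketch
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        quotes = response.css('div.quote')
        if not quotes:
            return  # an empty page means we have gone past the last one
        for quote in quotes:
            yield {
                'title': quote.css('span.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                'tag': quote.css('.tag::text').extract(),
            }
        # build the next URL by incrementing N in /page/N/
        page_number = int(response.url.rstrip('/').rsplit('/', 1)[-1])
        yield response.follow('/page/%d/' % (page_number + 1), callback=self.parse)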

Text not visible Python

Why am I not getting the text? I've used this script on many websites and never came across this issue.
import scrapy.selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Prijsvergelijking_Final.items import PrijsvergelijkingFinalItem

vendors = []
for line in open("vendors.txt", "r"):
    vendors.append(line.strip("\n\-"))

e = {}
for vendor in vendors:
    e[vendor] = True

class ArtcrafttvSpider(CrawlSpider):
    name = "ARTCRAFTTV"
    allowed_domains = ["artencraft.be"]
    start_urls = ["https://www.artencraft.be/nl/beeld-en-geluid/televisie"]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="next"]',)),
                  callback="parse_start_url", follow=True),)

    def parse_start_url(self, response):
        products = response.xpath("//ul[@class='product-overview list']/li")
        for product in products:
            item = PrijsvergelijkingFinalItem()
            item["Product_a"] = product.xpath(".//a/span/h3/text()").extract_first().strip().replace("-", "")
            item["Product_price"] = product.xpath(".//a/h4/text()").extract_first()
            for word in item['Product_a'].split(" "):
                if word in e:
                    item['item_vendor'] = word
            yield item
The website's HTML and the results after the script is run were posted as screenshots (not reproduced here).
Any suggestions how I can solve this?
Short answer:
You have a wrong XPath for the price field value.
Detailed answer:
Do not always assume that the page structure is the same as what is displayed on your screen; it is NOT always WYSIWYG.
For some reason Inspect Element (Firefox) shows the price value as a child of the //a/h4 tag, but if you analyze the page source that is actually downloaded, you will see that the price value is present on the page, yet it is not a child of the //a/h4 tag but a child of the //a tag, so //a/text() would give you the desired value.
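Under that assumption, the price line in the original loop would change to something like the sketch below, which takes the first non-empty text node directly under the <a> element:

# inside the loop over products: read the text nodes that are direct
# children of <a> (not of <a>/<h4>) and keep the first non-empty one
price_texts = product.xpath(".//a/text()").extract()
item["Product_price"] = next((t.strip() for t in price_texts if t.strip()), None)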
It appears that the prices are loaded in via JavaScript or something similar: when I pull down the page from Python, I get no prices anywhere.
There are two possible things going on here. First, the prices might be loading in with JavaScript. If that's the case, I recommend looking at this answer: https://stackoverflow.com/a/26440563/629110 and the dryscrape library.
Second, if the prices are being blocked because of your user agent, you can try to change your user agent to that of a real browser: https://stackoverflow.com/a/10606260/629110 .
Try the user agent first (since it is easier).
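One way to try the user-agent route is to override the USER_AGENT setting for this spider only via custom_settings; the agent string below is just an example desktop-browser value, not something prescribed by the site.

from scrapy.spiders import CrawlSpider

class ArtcrafttvSpider(CrawlSpider):
    name = "ARTCRAFTTV"
    allowed_domains = ["artencraft.be"]
    start_urls = ["https://www.artencraft.be/nl/beeld-en-geluid/televisie"]

    # per-spider settings override; pretend to be a regular browser
    # instead of Scrapy's default user agent
    custom_settings = {
        "USER_AGENT": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/96.0.4664.110 Safari/537.36"),
    }

    # ... keep the original rules and parse_start_url unchanged ...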

How to crawl all webpages on website up to certain depth?

I have a website and I would like to find the webpage with information about job vacancies. There is usually only one such page. So I start crawling the website and I manage to get all webpages up to a certain depth. It works, but the pages are duplicated many times: instead of, say, 45 pages I get 1000. I know the reason why: every time my "parse" function is called, it parses all the links on the current webpage, so when I come to a new webpage, it crawls links, some of which have been crawled before.
1) I tried to move the "items=[]" list out of the parse function, but I get some global error. I don't know how to get a list of unique webpages. Once I have one, I will be able to choose the right page with simple URL parsing.
2) I also tried to have "Request" and "return items" in the "parse" function, but I get a syntax error: return inside generator.
I am using DEPTH_LIMIT. Do I really need to use Rules?
code:
import scrapy, urlparse, os
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem
from scrapy.utils.response import get_base_url
from scrapy.http import Request
from urlparse import urljoin
from datetime import datetime

class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    def parse(self, response):
        response.selector.remove_namespaces()
        # choose all "href" values: external websites as well as pages on our website
        urls = response.xpath('//@href').extract()
        items = []
        base_url = get_base_url(response)  # base url
        for url in urls:
            # we need only webpages, so we skip external sites and urls with strange characters
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                item = JobItem()
                absolute_url = urlparse.urljoin(base_url, url)
                item["link"] = absolute_url
                if item not in items:
                    items.append(item)
                    yield item
                    yield Request(absolute_url, callback=self.parse)
        #return items
You're appending item (a newly instantiated object) to your list items. Since item is always a newly created JobItem() object, it will never already exist in your list items.
To illustrate:
>>> class MyItem(object):
... pass
...
>>> a = MyItem()
>>> b = MyItem()
>>> a.url = "abc"
>>> b.url = "abc"
>>> a == b
False
Just because they have one attribute with the same value doesn't mean they are the same object.
Even if this worked, though, you're resetting the list items every time you call parse (i.e. for each request), so you'll never really remove duplicates.
Instead, you would be better off checking against the absolute_url itself, and keeping the list at the spider level:
class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]
    all_urls = []

    def parse(self, response):
        # remove "items = []"
        ...
        for url in urls:
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                absolute_url = urlparse.urljoin(base_url, url)
                if absolute_url not in self.all_urls:
                    self.all_urls.append(absolute_url)
                    item = JobItem()
                    item['link'] = absolute_url
                    yield item
                    yield Request(absolute_url, callback=self.parse)
This functionality, however, would be better served by creating a dupefilter instead (see here for more information). Additionally, I agree with @RodrigoNey: a CrawlSpider would likely serve your purpose better and be more maintainable in the long run.
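A rough sketch of that CrawlSpider alternative might look like the following; the spider name and the minimalist parse_page callback are made up for illustration. Deduplication comes from the LinkExtractor plus Scrapy's default dupefilter, and DEPTH_LIMIT still applies.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem

class JobCrawlSpider(CrawlSpider):
    name = "jobs_crawl"   # hypothetical name so it doesn't clash with "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    rules = (
        # follow every internal link; already-seen URLs are dropped by the
        # default dupefilter, so each page is parsed only once
        Rule(LinkExtractor(allow_domains=["www.gen-i.si"]),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        item = JobItem()
        item["link"] = response.url
        yield item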
I'm working on a web crawler and ended up making a list of links that needed to be crawled; once we visited a link, it was deleted from that list and added to the crawled list. Then you can use a "not in" check to decide whether to add a link, skip it, and so on.
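As a toy illustration of that bookkeeping (independent of Scrapy, with a placeholder where the actual fetching and link extraction would happen):

to_crawl = ["http://www.gen-i.si"]   # links that still need to be visited
crawled = set()                      # links that have already been visited

while to_crawl:
    url = to_crawl.pop()
    crawled.add(url)
    new_links = []   # placeholder: fetch `url` and extract its links here
    for link in new_links:
        # the "not in" checks keep each page from being queued twice
        if link not in crawled and link not in to_crawl:
            to_crawl.append(link)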

Scrapy: Defining XPaths with numbered Divs & Dynamically naming item fields

I'm a newbie with regard to Scrapy and Python. Would appreciate some help here please...
I'm scraping a site that uses divs, and can't for the life of me work out why this isn't working. I can only get Field1 and Data1 to populate; the overall plan is to get 10 data points from each page.
Have a look at my spider - I can't get Field2 or Data2 to populate correctly...
import scrapy
from tutorial.items import AttorneysItem

class AttorneysSpider(scrapy.Spider):
    name = "attorneys"
    allowed_domains = ["attorneys.co.za"]
    start_urls = [
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537",
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=776",
    ]

    def parse(self, response):
        for sel in response.xpath('//div//div//div[3]//div[1]//div//div'):
            item = AttorneysItem()
            item['Field1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[1]/text()').extract()
            item['Data1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[2]/text()').extract()
            item['Field2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[1]/text()').extract()
            item['Data2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[2]/text()').extract()
            yield item
It's super frustrating. The link to the site is http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537.
Thanks
Paddy
--------------UPDATE---------------------------
So I've gotten a bit further, but hit a wall again.
I can now select the elements okay, but I somehow need to define the item fields dynamically. The best I've been able to do is the code below, but it's not great, because the number of fields is not consistent and they are not always in the same order. Essentially, sometimes the firm's website is listed as the third field down, sometimes as the fifth.
def parse(self, response):
    item = AttorneysItem()
    item['a01Firm'] = response.xpath('//h1[@class="name-h1"]/text()').extract()
    item['a01Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0].strip()
    item['a01Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0].strip()
    item['a02Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1].strip()
    item['a02Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1].strip()
    item['a03Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[2].strip()
    item['a03Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[2].strip()
    item['a04Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[3].strip()
    item['a04Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[3].strip()
Thanks again to any and all who can help :D
There are several issues with the XPaths you provide:
You only need "//" at the beginning; the rest should be "/".
Extracting by element name alone is not clean. It hurts readability and possibly performance, because many, if not most, webpages contain deeply nested divs. Instead, make good use of class-based selectors.
Besides, you don't need the for loop.
One cleaner way to do this is as follows:
item = AttorneysItem()
item['Field1'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0]
item['Data1'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0]
item['Field2'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1]
item['Data2'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1]
yield item
In case you don't know, you can use scrapy shell to test your XPaths.
Simply type scrapy shell url on the command line, where url is the URL you are scraping.
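For example, with the first of the two start URLs, a session might look roughly like this (output omitted):

scrapy shell "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537"
>>> response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()
>>> response.xpath('//div[@class="col-lg-9"]/text()').extract()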

Having trouble understanding where to look in source code, in order to create a web scraper

I am a noob with Python; I've been teaching myself on and off since this summer. I am going through the Scrapy tutorial and occasionally reading more about HTML/XML to help me understand Scrapy. My project is to imitate the Scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread titles along with the thread URLs; should be simple!
My problem lies in not understanding XPath, and also HTML, I guess. When viewing the source code for the GameFAQs site, I am not sure what to look for in order to pull the link and title. I want to say just look at the anchor tag and grab the text, but I am confused about how.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["http://www.gamefaqs.com"]
    start_urls = ["http://www.gamefaqs.com/boards/916373-pc"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//a')
        items = []
        for site in sites:
            item = DmozItem()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
I want to change this to work on GameFAQs, so what would I put in this path?
I imagine the program returning results something like this:
thread name
thread url
I know the code is not really right, but could someone help me rewrite it to obtain those results? It would help me understand the scraping process better.
The layout and organization of a web page can change and deep tag based paths can be difficult to deal with. I prefer to pattern match the text of the links. Even if the link format changes, matching the new pattern is simple.
For gamefaqs the article links look like:
http://www.gamefaqs.com/boards/916373-pc/37644384
That's the protocol, domain name, literal 'boards' path. '916373-pc' identifies the forum area and '37644384' is the article ID.
We can match links for a specific forum area using a regular expression:
reLink = re.compile(r'.*\/boards\/916373-pc\/\d+$')
if reLink.match(link):
Or links for any forum area using:
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
if reLink.match(link):
Adding link matching to your code we get:
import re

reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//a')
    items = []
    for site in sites:
        # extract() returns a list; take the first href (if any) as a string
        links = site.select('@href').extract()
        if links and reLink.match(links[0]):
            item = DmozItem()
            item['link'] = links[0]
            item['desc'] = site.select('text()').extract()
            items.append(item)
    return items
Many sites have separate summary and detail pages or description and file links where the paths match a template with an article ID. If needed, you can parse the forum area and article ID like this:
reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match(link)
if m:
    areaStr = m.groupdict()['area']
    idStr = m.groupdict()['id']
idStr will be a string, which is fine for filling in a URL template, but if you need to calculate the previous ID, etc., then convert it to a number:
idInt = int(idStr)
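For instance, a hypothetical URL for the previous article could then be assembled like this:

# fill the /boards/<area>/<id> template with the parsed pieces
prev_url = "http://www.gamefaqs.com/boards/%s/%d" % (areaStr, idInt - 1)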
I hope this helps.
