Scrapy: Defining XPaths with numbered Divs & Dynamically naming item fields - python

I'm a newbie with regards to Scrapy and Python. I would appreciate some help here please...
I'm scraping a site that uses divs, and can't for the life of me work out why this isn't working. I can only get Field1 and Data1 to populate... the overall plan is to get 10 data points for each page...
Have a look at my spider - I can't get Field2 or Data2 to populate correctly...
import scrapy
from tutorial.items import AttorneysItem

class AttorneysSpider(scrapy.Spider):
    name = "attorneys"
    allowed_domains = ["attorneys.co.za"]
    start_urls = [
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537",
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=776",
    ]

    def parse(self, response):
        for sel in response.xpath('//div//div//div[3]//div[1]//div//div'):
            item = AttorneysItem()
            item['Field1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[1]/text()').extract()
            item['Data1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[2]/text()').extract()
            item['Field2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[1]/text()').extract()
            item['Data2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[2]/text()').extract()
            yield item
It's super frustrating. The link to the site is http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537.
Thanks
Paddy
--------------UPDATE---------------------------
So I've gotten a bit further, but hit a wall again.
I can now select the elements okay, but I somehow need to define the item fields dynamically... the best I've been able to do is the below, but it's not great because the number of fields is not consistent, and they are not always in the same order. Essentially what I am saying is that sometimes their website is listed as the third field down, and sometimes it's the fifth.
def parse(self, response):
    item = AttorneysItem()
    item['a01Firm'] = response.xpath('//h1[@class="name-h1"]/text()').extract()
    item['a01Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0].strip()
    item['a01Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0].strip()
    item['a02Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1].strip()
    item['a02Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1].strip()
    item['a03Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[2].strip()
    item['a03Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[2].strip()
    item['a04Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[3].strip()
    item['a04Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[3].strip()
Thanks again to any and all who can help :D

There are several issues with the XPaths you provide:
You only need "//" at the beginning; the rest should be "/".
Extracting by element name alone is not clean. It hurts readability and possibly performance, because many, if not most, webpages contain deeply nested divs. Instead, make good use of class selectors.
Besides, you don't need the for loop.
One cleaner way to do this is as follows:
item = AttorneysItem()
item['Field1'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0]
item['Data1'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0]
item['Field2'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1]
item['Data2'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1]
yield item
In case you don't know, you can use scrapy shell to test your xpath.
Simply type scrapy shell url in the command line, where url is the URL you are scraping.
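The update above mentions that the number and order of the label/value pairs vary from page to page. One way to avoid hard-coded indices (a sketch only, not part of the answer; it assumes AttorneysItem declares a dict-valued 'details' field, which is hypothetical) is to zip the two extracted lists:
def parse(self, response):
    labels = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()
    values = response.xpath('//div[@class="col-lg-9"]/text()').extract()
    item = AttorneysItem()
    item['a01Firm'] = response.xpath('//h1[@class="name-h1"]/text()').extract()
    # Pair each label with the value next to it, so the fields can appear
    # in any order and in any number on the page.
    item['details'] = {label.strip(): value.strip()
                       for label, value in zip(labels, values)}
    yield item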

Related

Scrapy KeyError while processing

I couldn't find any answer to my problem, so I hope it's ok to ask here.
I am trying to scrape cinema shows and keep getting the following error.
What is really confusing for me is that the problem apparently lies in the pipelines. However, I have a second spider for an opera house with the exact same code (only the place is different) and it works just fine. "Shows" and "Place" refer to my Django models. I've changed their fields to be CharFields, so it's not a problem with a wrong date/time format.
I also tried to use a dedicated scrapy item "KikaItem" instead of "ShowItem" (which is shared with my opera spider), but the error still remains.
class ScrapyKika(object):
    def process_item(self, ShowItem, spider):
        place, created = Place.objects.get_or_create(name="kino kika")
        show = Shows.objects.update_or_create(
            time=ShowItem["time"],
            date=ShowItem["date"],
            place=place,
            defaults={'title': ShowItem["title"]}
        )
        return ShowItem
Here is my spider code. I expect the problem is somewhere here, because I used a different approach here than in the opera one. However, I am not sure what could be wrong.
import scrapy
from ..items import ShowItem, KikaItemLoader

class KikaSpider(scrapy.Spider):
    name = "kika"
    allowed_domains = ["http://www.kinokika.pl/dk.php"]
    start_urls = [
        "http://www.kinokika.pl/dk.php"
    ]

    def parse(self, response):
        divs = response.xpath('//b')
        for div in divs:
            l = KikaItemLoader(item=ShowItem(), response=response)
            l.add_xpath("title", "./text()")
            l.add_xpath("date", "./ancestor::ul[1]/preceding-sibling::h2[1]/text()")
            l.add_xpath("time", "./preceding-sibling::small[1]/text()")
            return l.load_item()
ItemLoader:
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose

class KikaItemLoader(ItemLoader):
    # strip_string and lowercase are custom processor functions defined elsewhere
    title_in = MapCompose(strip_string, lowercase)
    title_out = Join()
    time_in = MapCompose(strip_string)
    time_out = Join()
    date_in = MapCompose(strip_string)
    date_out = Join()
Thank you for your time and sorry for any misspellings :)
Currently, your spider yields a single item:
{'title': u' '}
which does not have the date and time fields filled out. This is because of the way you initialize the ItemLoader class in your spider.
You should be initializing your item loader with a specific selector in mind. Replace:
for div in divs:
    l = KikaItemLoader(item=ShowItem(), response=response)
with:
for div in divs:
    l = KikaItemLoader(item=ShowItem(), selector=div)
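Putting the question's XPaths together with that change, a minimal sketch of the corrected parse method could look like this (switching return to yield inside the loop is an extra assumption, so that every <b> element produces its own item):
def parse(self, response):
    for div in response.xpath('//b'):
        # Bind the loader to this <b> node so the relative XPaths below are
        # evaluated against it, not against the whole response.
        l = KikaItemLoader(item=ShowItem(), selector=div)
        l.add_xpath("title", "./text()")
        l.add_xpath("date", "./ancestor::ul[1]/preceding-sibling::h2[1]/text()")
        l.add_xpath("time", "./preceding-sibling::small[1]/text()")
        yield l.load_item()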

Text not visible Python

Why am I not getting the text? I've used this script on many websites and never came across this issue.
import scrapy.selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Prijsvergelijking_Final.items import PrijsvergelijkingFinalItem

vendors = []
for line in open("vendors.txt", "r"):
    vendors.append(line.strip("\n\-"))

e = {}
for vendor in vendors:
    e[vendor] = True

class ArtcrafttvSpider(CrawlSpider):
    name = "ARTCRAFTTV"
    allowed_domains = ["artencraft.be"]
    start_urls = ["https://www.artencraft.be/nl/beeld-en-geluid/televisie"]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="next"]',)), callback="parse_start_url", follow=True),)

    def parse_start_url(self, response):
        products = response.xpath("//ul[@class='product-overview list']/li")
        for product in products:
            item = PrijsvergelijkingFinalItem()
            item["Product_a"] = product.xpath(".//a/span/h3/text()").extract_first().strip().replace("-", "")
            item["Product_price"] = product.xpath(".//a/h4/text()").extract_first()
            for word in item['Product_a'].split(" "):
                if word in e:
                    item['item_vendor'] = word
            yield item
Website code: (screenshot not included here)
Results after the script is run: (screenshot not included here)
Any suggestions how I can solve this?
Short answer:
You have the wrong XPath for the price field value.
Detailed:
Do not always assume that the page structure will be the same as what is displayed on your screen; it is NOT always WYSIWYG.
For some reason, inspect element (Firefox) shows the price value as a child of the //a/h4 tag, but if you analyze the downloaded page source, you will see that the price value is present on the page; it is not a child of the //a/h4 tag but a child of the //a tag, so //a/text() would give you the desired value.
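As a sketch of that fix inside the question's for loop (the next() guard is an assumption, to skip the whitespace-only text nodes that also sit directly under the <a> tag):
texts = product.xpath(".//a/text()").extract()
item["Product_price"] = next((t.strip() for t in texts if t.strip()), None)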
It appears that the prices are loaded in with JavaScript or something - when I pull down the page from Python I get no prices anywhere.
There are two possible things going on here: first, the prices might be loading in with JavaScript. If that's the case, I recommend looking at this answer: https://stackoverflow.com/a/26440563/629110 and the library dryscrape.
If the prices are being blocked because of your user agent, you can try changing your user agent to a real browser: https://stackoverflow.com/a/10606260/629110 .
Try the user agent first (since it is easier).
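If it does turn out to be the user agent, a minimal sketch is to set Scrapy's USER_AGENT setting to a real browser string in settings.py (the exact string below is only an example):
# settings.py
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0 Safari/537.36")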

Issue with loop in scrapy+selenium+phantomjs

I've been trying to build a small scraper for ebay (college assignment). I already figured out most of it, but I ran into an issue with my loop.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        service_args = ['--load-images=no']
        self.driver = webdriver.PhantomJS(executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd', service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        item = loopitems()
        for abc in range(2, 50):
            abc = str(abc)
            jackson = self.driver.execute_script("return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson == True:
                item['title'] = self.driver.execute_script("return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break
The URLs (start_urls are loaded from a txt file):
http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7
I'm running Scrapy version 0.24.6 and PhantomJS version 2.0. The objective is to go to the URLs and extract the variations or attributes from the eBay form.
The if statement right at the beginning of the loop is used to check whether the element exists, because Selenium returns a bad header error if it can't find an element. I also loop the (yield item) because I need each variation on a new row. I use execute_script because it is 100 times faster than using Selenium's get element by XPath.
The main problem I run into is the way Scrapy returns my item results: if I use one URL as my start_url it works like it should (it returns all items in a neat order). The second I add more URLs, I get a completely different result: all my items are scrambled, some items are returned multiple times, and it also varies almost every time. After countless tests I noticed that yield item was causing some kind of problem, so I removed it and tried just printing the results, and sure enough it returned them perfectly. I really need each item on a new row though, and the only way I've gotten that is by using yield item (maybe there's a better way?).
As of now I have just copy-pasted the looped code, changing the XPath option manually. That works as expected, but I really need to be able to loop through items in the future. If someone sees an error in my code or a better way to try it, please tell me. All responses are helpful...
Thanks
If I correctly understood what you want to do, I think this could help you:
Scrapy Crawl URLs in Order
The problem is that start_urls are not processed in order. They are passed to the start_requests method and returned with a downloaded response to the parse method. This is asynchronous.
Maybe this helps
# Do your thing
start_urls = [open('bobo.txt').readlines()[0].strip()]
other_urls = [l.strip() for l in open('bobo.txt').readlines()[1:]]
other_urls.reverse()
# Do your thing

def parse(self, response):
    # Do your thing
    if len(self.other_urls) != 0:
        url = self.other_urls.pop()
        yield Request(url=url, callback=self.parse)
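A slightly more complete sketch of the same idea, with assumed names, reading bobo.txt once and chaining the remaining URLs so that each request is issued only after the previous response has been parsed:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request

# Read the URL list once; the first entry seeds start_urls and the rest are
# stored in reverse so pop() hands them back in file order.
with open('bobo.txt') as f:
    urls = [l.strip() for l in f if l.strip()]

class OrderedLooperSpider(CrawlSpider):
    name = 'looper_ordered'
    allowed_domains = ['ebay.com']
    start_urls = urls[:1]
    other_urls = list(reversed(urls[1:]))

    def parse(self, response):
        # ... extract the variations here, as in the question ...
        if self.other_urls:
            yield Request(url=self.other_urls.pop(), callback=self.parse)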

Having trouble understanding where to look in source code, in order to create a web scraper

I am a noob with Python; I've been on and off teaching myself since this summer. I am going through the Scrapy tutorial, and occasionally reading more about HTML/XML to help me understand Scrapy. My project is to imitate the Scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread titles along with the thread URLs; it should be simple!
My problem lies in not understanding XPath, and also HTML I guess. When viewing the source code for the GameFAQs site, I am not sure what to look for in order to pull the link and title. I want to say just look at the anchor tag and grab the text, but I am confused about how.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["http://www.gamefaqs.com"]
    start_urls = ["http://www.gamefaqs.com/boards/916373-pc"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//a')
        items = []
        for site in sites:
            item = DmozItem()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
I want to change this to work on GameFAQs, so what would I put in this path?
I imagine the program returning results something like this
thread name
thread url
I know the code is not really right, but can someone help me rewrite this to obtain the results? It would help me understand the scraping process better.
The layout and organization of a web page can change and deep tag based paths can be difficult to deal with. I prefer to pattern match the text of the links. Even if the link format changes, matching the new pattern is simple.
For gamefaqs the article links look like:
http://www.gamefaqs.com/boards/916373-pc/37644384
That's the protocol, domain name, and the literal 'boards' path. '916373-pc' identifies the forum area and '37644384' is the article ID.
We can match links for a specific forum area using a regular expression:
reLink = re.compile(r'.*\/boards\/916373-pc\/\d+$')
if reLink.match(link):
Or any forum area using:
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
if reLink.match(link):
Adding link matching to your code we get:
import re

reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//a')
    items = []
    for site in sites:
        # extract() returns a list of strings, so match against the first href
        link = site.select('@href').extract()
        if link and reLink.match(link[0]):
            item = DmozItem()
            item['link'] = link
            item['desc'] = site.select('text()').extract()
            items.append(item)
    return items
Many sites have separate summary and detail pages or description and file links where the paths match a template with an article ID. If needed, you can parse the forum area and article ID like this:
reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match(link)
if m:
    areaStr = m.groupdict()['area']
    idStr = m.groupdict()['id']
idStr will be a string, which is fine for filling in a URL template, but if you need to calculate the previous ID, etc., then convert it to a number:
idInt = int(idStr)
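For illustration, a quick sketch running the named-group pattern against the example link from earlier (the values in the comments are what the groups capture):
import re

reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match('http://www.gamefaqs.com/boards/916373-pc/37644384')
if m:
    areaStr = m.groupdict()['area']  # '916373-pc'
    idStr = m.groupdict()['id']      # '37644384'
    idInt = int(idStr)               # 37644384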
I hope this helps.

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this in scrapy. I have a spider that crawls listing pages of items.
Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield the items. So far so good, everything works great.
But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).
And I'm not sure how to organize the code to achieve that, since the two links (the listings link and one particular item link) are followed differently, with callbacks called at different times, but they have to be correlated in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths='//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, which gives you listings of items, and, very importantly, to add follow=True to that rule.
And in parse_item() you have to yield a request instead of an item, but only after you load the item. The request is to the item detail URL. And you have to send the loaded item along with that request's callback. You do your job with the response, and that is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()

# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback=lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution would be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the item. Yes, it's much more trouble than normal parsing with Scrapy, but it should get the job done with the least hassle.
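A rough sketch of that manual fallback, assuming Python 2-era urllib2 and BeautifulSoup 4 (the helper name and the idea of calling it from parse_item are assumptions, not the answerer's code):
import urllib2
from bs4 import BeautifulSoup

def fetch_url_contents(url):
    # Fetch the detail page outside of Scrapy's scheduler and return its
    # visible text, ready to be stored on the item as url_contents.
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()

# Inside parse_item(), after loading the item:
#     item['url_contents'] = fetch_url_contents(detail_url)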
