Why am I not getting the text? I've used this script on many websites and never came across this issue.
import scrapy.selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Prijsvergelijking_Final.items import PrijsvergelijkingFinalItem

# Build a lookup of known vendor names from vendors.txt.
vendors = []
for line in open("vendors.txt", "r"):
    vendors.append(line.strip("\n-"))

e = {}
for vendor in vendors:
    e[vendor] = True

class ArtcrafttvSpider(CrawlSpider):
    name = "ARTCRAFTTV"
    allowed_domains = ["artencraft.be"]
    start_urls = ["https://www.artencraft.be/nl/beeld-en-geluid/televisie"]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="next"]',)), callback="parse_start_url", follow=True),)

    def parse_start_url(self, response):
        products = response.xpath("//ul[@class='product-overview list']/li")
        for product in products:
            item = PrijsvergelijkingFinalItem()
            item["Product_a"] = product.xpath(".//a/span/h3/text()").extract_first().strip().replace("-", "")
            item["Product_price"] = product.xpath(".//a/h4/text()").extract_first()
            for word in item['Product_a'].split(" "):
                if word in e:
                    item['item_vendor'] = word
            yield item
Website code and results after the script is run (screenshots not reproduced here):
Any suggestions how I can solve this?
Short answer: you have the wrong XPath for the price field value.
Detailed: do not assume that the page structure will always be the same as what is displayed on your screen; it is NOT always WYSIWYG.
Inspect Element (Firefox) shows the price value as a child of the //a/h4 tag, but if you analyze the page source that is actually downloaded, you will see that the price value is present on the page, yet it is not a child of //a/h4; it is a child of the //a tag, so //a/text() gives you the desired value.
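A minimal sketch of that fix, assuming the rest of parse_start_url stays unchanged (the extracted text may still need stripping of surrounding whitespace):

# Per the answer above: read the price text relative to the product's <a> tag,
# not from //a/h4, since that is where it sits in the downloaded HTML.
item["Product_price"] = product.xpath(".//a/text()").extract_first()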
It appears that the prices are loaded in via JavaScript: when I pull down the page from Python I get no prices anywhere.
There are two possible things going on here. First, the prices might be loading in with JavaScript. If that's the case, I recommend looking at this answer: https://stackoverflow.com/a/26440563/629110 and the dryscrape library.
If the prices are being blocked because of your user agent, you can try to change your user agent to a real browser: https://stackoverflow.com/a/10606260/629110 .
Try the user agent first (since it is easier).
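If you want to try the user-agent route, a hedged sketch of what that looks like in a Scrapy project (the UA string below is only an example of a real-browser string, not a required value):

# settings.py (or a spider's custom_settings dict)
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36")

Scrapy will then send this header with every request instead of its default "Scrapy/..." user agent.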
Related
CrawlSpider fetches only a subset of the matched links on the first page of the listings. Soon after, it moves to the second page, where it successfully follows all matched links, exactly as intended. How can I make CrawlSpider follow all matched links on the first page before proceeding to the second one?
I have added the process_links='link_filter' argument to the second Rule and verified that it matches all links as intended, but the spider follows only a seemingly semi-random subset of them.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

class ClassfiedsSpider(CrawlSpider):
    name = "classfieds_tests"
    start_urls = ["https://www.example.com/classifieds/category/laptops/"]

    rules = (
        # Pagination: follow the "next page" link.
        Rule(LinkExtractor(restrict_css=("ul[class=ipsPagination] > li[class=ipsPagination_next] > a")), process_links='pl_tmp'),  # callback='parse_start_url'),
        # Listings: follow each classified ad and parse it.
        Rule(LinkExtractor(restrict_css=("h4 > div > a")), process_links='link_filter', callback='parse_classfied', follow=False),
    )

    def pl_tmp(self, links):
        print([link.url for link in links])
        return links

    def link_filter(self, links):
        print("links: ", [re.search("(item/)(.*?)(-)", link.url).group(2) for link in links])
        # print("links: ", [link.url for link in links])
        return links
I expected that CrawlSpider would move to the second page only after it had finished following the links on the first.
After ~10 hours of digging through the source code, I was able to spot the problem in the way the scheduler stores requests in memory. The solution was to switch it to a FIFO queue so that older requests get fetched first. This can easily be done by setting, in settings.py:
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
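For context, the default in-memory queue is a LIFO queue, which gives depth-first-style ordering. The Scrapy documentation lists the related settings for a breadth-first crawl; a hedged settings.py sketch (only SCHEDULER_MEMORY_QUEUE was strictly needed for the problem above):

# settings.py: crawl breadth-first so earlier-scheduled requests are fetched first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'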
I couldn't find any answer to my problem, so I hope it is OK to ask here.
I am trying to scrape cinema shows and keep getting the following error.
What is really confusing for me is that the problem apparently lies in the pipelines. However, I have a second spider for an opera house with the exact same code (only the place is different) and it works just fine. "Shows" and "Place" refer to my Django models. I've changed their fields to CharFields, so it's not a problem with a wrong date/time format.
I also tried using a dedicated Scrapy item, "KikaItem", instead of "ShowItem" (which is shared with my opera spider), but the error remains.
class ScrapyKika(object):
    def process_item(self, ShowItem, spider):
        place, created = Place.objects.get_or_create(name="kino kika")
        show = Shows.objects.update_or_create(
            time=ShowItem["time"],
            date=ShowItem["date"],
            place=place,
            defaults={'title': ShowItem["title"]},
        )
        return ShowItem
Here is my spider code. I expect the problem is somewhere here, because I used a different approach than in the opera spider. However, I am not sure what could be wrong.
import scrapy

from ..items import ShowItem, KikaItemLoader

class KikaSpider(scrapy.Spider):
    name = "kika"
    allowed_domains = ["http://www.kinokika.pl/dk.php"]
    start_urls = [
        "http://www.kinokika.pl/dk.php"
    ]

    def parse(self, response):
        divs = response.xpath('//b')
        for div in divs:
            l = KikaItemLoader(item=ShowItem(), response=response)
            l.add_xpath("title", "./text()")
            l.add_xpath("date", "./ancestor::ul[1]/preceding-sibling::h2[1]/text()")
            l.add_xpath("time", "./preceding-sibling::small[1]/text()")
            return l.load_item()
ItemLoader:

# imports assumed for completeness; strip_string and lowercase are the project's own processor functions
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join

class KikaItemLoader(ItemLoader):
    title_in = MapCompose(strip_string, lowercase)
    title_out = Join()

    time_in = MapCompose(strip_string)
    time_out = Join()

    date_in = MapCompose(strip_string)
    date_out = Join()
Thank you for your time and sorry for any misspellings :)
Currently, your spider yields a single item:
{'title': u' '}
which does not have the date and time fields filled out. This is because of the way you initialize the ItemLoader class in your spider.
You should be initializing your item loader with a specific selector in mind. Replace:
for div in divs:
    l = KikaItemLoader(item=ShowItem(), response=response)
with:
for div in divs:
    l = KikaItemLoader(item=ShowItem(), selector=div)
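For context, a sketch of the whole corrected parse method under that change; the yield inside the loop is an additional adjustment (not part of the original answer) so that every matched <b> element produces an item instead of returning after the first one:

def parse(self, response):
    divs = response.xpath('//b')
    for div in divs:
        # Build each item relative to the current <b> element.
        l = KikaItemLoader(item=ShowItem(), selector=div)
        l.add_xpath("title", "./text()")
        l.add_xpath("date", "./ancestor::ul[1]/preceding-sibling::h2[1]/text()")
        l.add_xpath("time", "./preceding-sibling::small[1]/text()")
        yield l.load_item()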
I'm a newbie with regard to Scrapy and Python. Would appreciate some help here please...
I'm scraping a site that uses divs, and can't for the life of me work out why this isn't working. I can only get Field1 and Data1 to populate... the overall plan is to get 10 data points from each page...
Have a look at my spider - I can't get Field2 or Data2 to populate correctly...
import scrapy

from tutorial.items import AttorneysItem

class AttorneysSpider(scrapy.Spider):
    name = "attorneys"
    allowed_domains = ["attorneys.co.za"]
    start_urls = [
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537",
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=776",
    ]

    def parse(self, response):
        for sel in response.xpath('//div//div//div[3]//div[1]//div//div'):
            item = AttorneysItem()
            item['Field1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[1]/text()').extract()
            item['Data1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[2]/text()').extract()
            item['Field2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[1]/text()').extract()
            item['Data2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[2]/text()').extract()
            yield item
It's super frustrating. The link to the site is http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537.
Thanks
Paddy
--------------UPDATE---------------------------
So I've gotten a bit further, but hit a wall again.
I can now select the elements okay, but I somehow need to define the item fields dynamically... the best I've been able to do is the below, but it's not great because the number of fields is not consistent and they are not always in the same order. Essentially, what I am saying is that sometimes the website is listed as the third field down, sometimes as the fifth.
def parse(self, response):
    item = AttorneysItem()
    item['a01Firm'] = response.xpath('//h1[@class="name-h1"]/text()').extract()
    item['a01Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0].strip()
    item['a01Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0].strip()
    item['a02Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1].strip()
    item['a02Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1].strip()
    item['a03Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[2].strip()
    item['a03Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[2].strip()
    item['a04Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[3].strip()
    item['a04Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[3].strip()
Thanks again to any and all who can help :D
There are several issues with the XPath expressions you provide:
You only need "//" at the beginning; the rest should be "/".
Using only element names to extract is not clean. It leads to poor readability and possibly poor performance, because many, if not most, webpages contain several levels of nested divs. Instead, make good use of attribute selectors.
Besides, you don't need the for loop.
One cleaner way to do this is as follows:
item = AttorneysItem()
item['Field1'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0]
item['Data1'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0]
item['Field2'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1]
item['Data2'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1]
yield item
In case you don't know, you can use the Scrapy shell to test your XPath expressions.
Simply type scrapy shell url on the command line, where url is the URL of the page you are scraping.
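For example, a hypothetical shell session against one of your start URLs (the XPaths are the ones used above):

# $ scrapy shell "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537"
# Then, at the shell prompt, try the selectors interactively:
response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()
response.xpath('//div[@class="col-lg-9"]/text()').extract()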
Hi all, I am trying to get the whole set of results from the link given in the code, but my code is not returning all of them. The link says it contains 2132 results, yet it returns only 20:
from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import Flipkart

class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []
        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)
        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (it's doubtful one exists, but some sites do this when passed an optional query string, for example).
Handle the JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically, or use something like Selenium with PhantomJS to let a browser deal with it. I'd recommend the latter, since it's more robust than manually interpreting the JS yourself. See this question for more information, and Google around; there's plenty of material on this topic.
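Purely as an illustration of the Selenium option (not from the original answer; the spider name, scroll count, and waits are assumptions, and the product XPath is the one from the question):

import time

import scrapy
from selenium import webdriver

class FlipkartSeleniumSpider(scrapy.Spider):
    name = "flip_selenium"
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"]

    def __init__(self, *args, **kwargs):
        super(FlipkartSeleniumSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()  # or PhantomJS/Chrome, as mentioned above

    def parse(self, response):
        # Let the real browser load the page, then scroll a few times so the
        # JavaScript fetches additional result batches.
        self.driver.get(response.url)
        for _ in range(5):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # crude wait; a WebDriverWait condition would be cleaner
        # Hand the rendered HTML back to Scrapy's selectors.
        sel = scrapy.Selector(text=self.driver.page_source)
        for title in sel.xpath('//div[@class="pu-details lastUnit"]/div[1]/a/text()').extract():
            yield {"title": title.strip()}

    def closed(self, reason):
        self.driver.quit()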
I am a noob with Python; I've been teaching myself on and off since this summer. I am going through the Scrapy tutorial, and occasionally reading more about HTML/XML to help me understand Scrapy. My project is to imitate the Scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread titles along with the thread URLs; it should be simple!
My problem lies in not understanding XPath, and also HTML, I guess. When viewing the source code for the GameFAQs site, I am not sure what to look for in order to pull the link and the title. I want to say just look at the anchor tag and grab the text, but I am confused about how.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["http://www.gamefaqs.com"]
    start_urls = ["http://www.gamefaqs.com/boards/916373-pc"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//a')
        items = []
        for site in sites:
            item = DmozItem()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
I want to change this to work on GameFAQs, so what would I put in this path?
I imagine the program returning results something like this:
thread name
thread url
I know the code is not really right, but can someone help me rewrite it to obtain those results? It would help me understand the scraping process better.
The layout and organization of a web page can change, and deep tag-based paths can be difficult to deal with. I prefer to pattern-match the text of the links. Even if the link format changes, matching the new pattern is simple.
For GameFAQs the article links look like:
http://www.gamefaqs.com/boards/916373-pc/37644384
That's the protocol, the domain name, and the literal 'boards' path. '916373-pc' identifies the forum area and '37644384' is the article ID.
We can match links for a specific forum area using a regular expression:
reLink = re.compile(r'.*\/boards\/916373-pc\/\d+$')
if reLink.match(link):
Or any forum area using:
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
if reLink.match(link):
Adding link matching to your code, we get:
import re

reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//a')
    items = []
    for site in sites:
        # '@href' is relative to the current <a> element; extract() returns a
        # list, so take the first value (if any) before matching.
        links = site.select('@href').extract()
        link = links[0] if links else ''
        if reLink.match(link):
            item = DmozItem()
            item['link'] = link
            item['desc'] = site.select('text()').extract()
            items.append(item)
    return items
Many sites have separate summary and detail pages, or description and file links, where the paths match a template containing an article ID. If needed, you can parse the forum area and article ID like this:
reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match(link)
if m:
    areaStr = m.groupdict()['area']
    idStr = m.groupdict()['id']
idStr will be a string, which is fine for filling in a URL template, but if you need to calculate the previous ID, etc., then convert it to a number:
idInt = int(idStr)
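Putting those pieces together, a hypothetical illustration of the URL-template idea (the example link is the one shown earlier; the "previous article" URL is just for demonstration):

import re

reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')

link = "http://www.gamefaqs.com/boards/916373-pc/37644384"
m = reLink.match(link)
if m:
    areaStr = m.groupdict()['area']      # '916373-pc'
    idInt = int(m.groupdict()['id'])     # 37644384
    # Rebuild a URL for the previous article from the same template:
    prevUrl = "http://www.gamefaqs.com/boards/%s/%d" % (areaStr, idInt - 1)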
I hope this helps.