Issue with loop in scrapy+selenium+phantomjs - python

I've been trying to build a small scraper for ebay (college assignment). I already figured out most of it, but I ran into an issue with my loop.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        service_args = ['--load-images=no']
        self.driver = webdriver.PhantomJS(executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd', service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        item = loopitems()
        for abc in range(2, 50):
            abc = str(abc)
            # check whether the nth <option> exists before reading it
            jackson = self.driver.execute_script("return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson == True:
                item['title'] = self.driver.execute_script("return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break
The URLs (start_urls are loaded from a txt file):
http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7
I'm running Scrapy version 0.24.6 and PhantomJS version 2.0. The objective is to go to each URL and extract the variations/attributes from the eBay listing form.
The if statement right at the beginning of the loop is used to check whether the element exists, because Selenium returns a bad-header error if it can't find an element. I also yield the item inside the loop because I need each variation on a new row. I use execute_script because it is about 100 times faster than Selenium's find-element-by-XPath.
The main problem is the way Scrapy returns my item results. If I use one URL as my start_url it works like it should: it returns all items in a neat order. The second I add more URLs I get a completely different result: all my items are scrambled around, some items are returned multiple times, and the output varies almost every time. After countless tests I noticed that yield item is causing some kind of problem; when I removed it and just printed the results, they came out perfectly. I really need each item on a new row, though, and the only way I've managed to do that is with yield item (maybe there's a better way?).
For now I have just copy-pasted the looped code and changed the XPath option index manually. That works as expected, but I really need to be able to loop through the options in the future. If someone sees an error in my code or a better way to do this, please tell me. All responses are helpful...
Thanks

If I've understood correctly what you want to do, I think this could help you:
Scrapy Crawl URLs in Order
The problem is that start_urls are not processed in order. They are passed to the start_requests method, and each downloaded response is handed to the parse method asynchronously, so responses can arrive in any order.
Maybe this helps:
# Do your thing
start_urls = [open('bobo.txt').readlines()[0].strip()]
other_urls = [l.strip() for l in open('bobo.txt').readlines()[1:]]
other_urls.reverse()

# Do your thing
def parse(self, response):
    # Do your thing
    if len(self.other_urls) != 0:
        url = self.other_urls.pop()
        yield Request(url=url, callback=self.parse)
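If strict URL order turns out not to be required and the real need is just to know which variations belong to which listing, another option (only a sketch; it assumes loopitems has, or can be given, a url field) is to create a fresh item for every variation and record the source URL on it:
def parse(self, response):
    self.driver.get(response.url)
    for abc in range(2, 50):
        xpath = ".//div[5]/div[2]/select/option[" + str(abc) + "]"
        exists = self.driver.execute_script(
            "return !!document.evaluate(arguments[0], document, null, "
            "XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;", xpath)
        if not exists:
            break
        item = loopitems()                # fresh item per variation
        item['url'] = response.url        # assumed extra field, used to group rows per listing
        item['title'] = self.driver.execute_script(
            "return document.evaluate(arguments[0], document, null, "
            "XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;", xpath)
        yield item
The rows can then be grouped by url after the crawl, regardless of the order in which responses arrive.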

Related

extract_first only resulting in first item only .extract() does not work

I hope that you're all well. I am trying to learn Python through web scraping. My project at the moment is to scrape data from a games store. Initially I want to follow each product link and print the response from each one. There are 60 game links on the page that I wish for Scrapy to follow. Below is the code.
import scrapy

class GameSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['365games.co.uk']
    start_urls = ['https://www.365games.co.uk/3ds-games/']

    def parse(self, response):
        all_games = response.xpath('//*[@id="product_grid"]')
        for game in all_games:
            game_url = game.xpath('.//h3/a/@href').extract_first()
            yield scrapy.Request(game_url, callback=self.parse_game)

    def parse_game(self, response):
        print(response.status)
When I run this code, Scrapy goes through the first link and prints the response, but then stops. When I change the code to .extract() I get the following:
TypeError: Request url must be str or unicode, got list
The same applies with .get()/.getall(): .get() only returns the first link, and .getall() gives the above error.
Any help would be greatly appreciated, but please be gentle I am trying to learn.
Thanks in advance and best regards,
Gav
The error is saying that you are passing a list instead of a string to scrapy.Request. This tells us that game_url is actually a list, when you want a string. You are very close to the right thing here, but I believe your problem is that you are looping in the wrong place. Your first XPath returns just a single item, rather than a list of items. It is within this item that you want to find your game_urls, leading to:
def parse(self, response):
    product_grid = response.xpath('//*[@id="product_grid"]')
    for game_url in product_grid.xpath('.//h3/a/@href').getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
You could also combine your XPath queries directly into one:
def parse(self, response):
    all_games = response.xpath('//*[@id="product_grid"]//h3/a/@href')
    for game_url in all_games.getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
In this case you could also use follow instead of creating a new Request directly. You can even pass a selector rather than a string to follow, so you don't need getall(), and it knows how to deal with <a> elements so you don't need the @href either!
def parse(self, response):
    for game_url in response.xpath('//*[@id="product_grid"]//h3/a'):
        yield response.follow(game_url, callback=self.parse_game)
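If you are on a recent Scrapy release (2.0 or later), there is also follow_all, which accepts the XPath directly and yields all of the requests in one call; a minimal sketch:
def parse(self, response):
    yield from response.follow_all(
        xpath='//*[@id="product_grid"]//h3/a',
        callback=self.parse_game)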

How to crawl all webpages on website up to certain depth?

I have a website and I would like to find the webpage with information about job vacancies. There is usually only one such page. So I start crawling the website and I manage to get all webpages up to a certain depth. It works, but the pages are duplicated many times: instead of, let's say, 45 pages I get 1000 pages. I know the reason why: every time my "parse" function is called, it parses all the links on the current webpage, so when I come to a new webpage, it crawls all of its links, some of which have been crawled before.
1) I tried to move the "items = []" list outside the parse function, but I get a global-scope error. I don't know how to get a list of unique webpages. Once I have one, I will be able to choose the right page with simple URL parsing.
2) I also tried to have both "Request" and "return items" in the "parse" function, but I get a syntax error: return inside generator.
I am using DEPTH_LIMIT. Do I really need to use Rules?
code:
import scrapy, urlparse, os
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem
from scrapy.utils.response import get_base_url
from scrapy.http import Request
from urlparse import urljoin
from datetime import datetime

class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    def parse(self, response):
        response.selector.remove_namespaces()
        urls = response.xpath('//@href').extract()  # all "href" values: external sites as well as pages on our site
        items = []
        base_url = get_base_url(response)  # base url
        for url in urls:
            # we only want webpages, so skip external sites and urls with strange characters
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                item = JobItem()
                absolute_url = urlparse.urljoin(base_url, url)
                item["link"] = absolute_url
                if item not in items:
                    items.append(item)
                    yield item
                yield Request(absolute_url, callback=self.parse)
        #return items
You're appending item (a newly instantiated object) to your list items. Since item is always a new JobItem() object, it will never already exist in your list items.
To illustrate:
>>> class MyItem(object):
... pass
...
>>> a = MyItem()
>>> b = MyItem()
>>> a.url = "abc"
>>> b.url = "abc"
>>> a == b
False
Just because they have one attribute that is the same, doesn't mean they are the same object.
Even if this worked, though, you're resetting the list items every time you call parse (i.e. for each request), so you'll never really remove duplicates.
Instead, you would be better off checking against the absolute_url itself and keeping the list at the spider level:
class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]
    all_urls = []

    def parse(self, response):
        # remove "items = []"
        ...
        for url in urls:
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                absolute_url = urlparse.urljoin(base_url, url)
                if absolute_url not in self.all_urls:
                    self.all_urls.append(absolute_url)
                    item = JobItem()
                    item['link'] = absolute_url
                    yield item
                    yield Request(absolute_url, callback=self.parse)
This functionality, however, would be better served by creating a Dupefilter instead (see here for more information). Additionally, I agree with @RodrigoNey: a CrawlSpider would likely better serve your purpose and be more maintainable in the long run.
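To illustrate the CrawlSpider suggestion, here is a minimal sketch of what such a spider could look like, assuming the built-in duplicate request filter and DEPTH_LIMIT take care of revisits; the field name and domain are carried over from the question, and the depth value is an assumption:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem

class JobCrawlSpider(CrawlSpider):
    name = "jobs_crawl"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]
    custom_settings = {'DEPTH_LIMIT': 3}   # assumed depth; adjust as needed

    # follow every internal link; duplicate requests are filtered out by default
    rules = (
        Rule(LinkExtractor(allow_domains=["www.gen-i.si"]),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = JobItem()
        item['link'] = response.url
        yield item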
I was working on a web crawler and ended up keeping a list of links that needed to be crawled; once a link was visited, it was removed from that list and added to a crawled list. You can then use a "not in" check to decide whether to add, delete, etc.
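As a rough sketch of that idea (the names are illustrative, not from the original post), using sets for constant-time membership checks:
import scrapy

class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]
    to_crawl = set()      # links discovered but not yet visited
    crawled = set()       # links already visited

    def parse(self, response):
        self.crawled.add(response.url)
        self.to_crawl.discard(response.url)
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)
            if url not in self.crawled and url not in self.to_crawl:
                self.to_crawl.add(url)
                yield scrapy.Request(url, callback=self.parse)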

Scrapy: Defining XPaths with numbered Divs & Dynamically naming item fields

I'm a newbie with regards to Scrapy and Python. Would appreciate some help here please...
I'm scraping a site that uses divs, and can't for the life of me work out why this isn't working. I can only get Field1 and Data1 to populate... the overall plan is to get 10 data points for each page...
Have a look at my spider - I can't get Field2 or Data2 to populate correctly...
import scrapy
from tutorial.items import AttorneysItem

class AttorneysSpider(scrapy.Spider):
    name = "attorneys"
    allowed_domains = ["attorneys.co.za"]
    start_urls = [
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537",
        "http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=776",
    ]

    def parse(self, response):
        for sel in response.xpath('//div//div//div[3]//div[1]//div//div'):
            item = AttorneysItem()
            item['Field1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[1]/text()').extract()
            item['Data1'] = sel.xpath('//div//div//div[3]//div[1]//div[1]//div[2]/text()').extract()
            item['Field2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[1]/text()').extract()
            item['Data2'] = sel.xpath('//div//div//div[3]//div[1]//div[2]//div[2]/text()').extract()
            yield item
It's super frustrating. The link to the site is http://www.attorneys.co.za/CompanyHomePage.asp?CompanyID=537.
Thanks
Paddy
--------------UPDATE---------------------------
So I've gotten a bit further, but hit a wall again.
I can now select the elements okay, but I somehow need to dynamically define the item fields... the best I've been able to do is the code below, but it's not great because the number of fields is not consistent and they are not always in the same order. Essentially, what I am saying is that sometimes their website is listed as the third field down, sometimes as the fifth.
def parse(self, response):
    item = AttorneysItem()
    item['a01Firm'] = response.xpath('//h1[@class="name-h1"]/text()').extract()
    item['a01Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0].strip()
    item['a01Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0].strip()
    item['a02Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1].strip()
    item['a02Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1].strip()
    item['a03Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[2].strip()
    item['a03Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[2].strip()
    item['a04Field'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[3].strip()
    item['a04Data'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[3].strip()
Thanks again to any and all who can help :D
There are several issues with the xpath you provide:
You only need to use "//" at the beginning; the rest should be "/".
Using only element names to locate nodes is not clean. It hurts readability and possibly performance, since many, if not most, webpages contain deeply nested divs. Instead, make good use of class and id selectors.
Besides, you don't need the for loop.
One cleaner way to do this is as follows:
item = AttorneysItem()
item['Field1'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[0]
item['Data1'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[0]
item['Field2'] = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()[1]
item['Data2'] = response.xpath('//div[@class="col-lg-9"]/text()').extract()[1]
yield item
In case you don't know, you can use scrapy shell to test your xpath.
Simply type scrapy shell url in command line, where url corresponds to the url you are scraping.
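Regarding the update about fields not always appearing in the same order: one possibility (a sketch only, not tested against the site) is to zip the label and value columns together and key the data by the label text, so the order and count of fields no longer matter:
def parse(self, response):
    labels = response.xpath('//div[@class="col-lg-3 display-label"]/text()').extract()
    values = response.xpath('//div[@class="col-lg-9"]/text()').extract()
    item = {'Firm': response.xpath('//h1[@class="name-h1"]/text()').extract_first()}
    # pair each label with its value, e.g. item['Website'] = 'http://...'
    for label, value in zip(labels, values):
        item[label.strip()] = value.strip()
    yield item
Yielding a plain dict like this works in Scrapy 1.0 and later, and you can verify the two XPaths line up by trying them in scrapy shell first.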

Scrapy spider not showing whole result

Hi all, I am trying to get all of the results from the link given in the code, but my code is not returning them all. The link says it contains 2132 results, yet only 20 results are returned:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart

class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []
        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)
        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may do so when passed an optional query string, for example).
Handle JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically, or use something like Selenium with PhantomJS to let a browser deal with it (a rough sketch follows below). I'd recommend the latter, since it's more robust than interpreting the JS yourself. See this question for more information, and Google around; there's plenty of information on this topic.
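As a rough illustration of the Selenium route (a sketch only: the scroll-until-stable loop and the wait time are assumptions, the result XPath is taken from the question, and current Selenium versions would use headless Chrome/Firefox instead of PhantomJS):
import time
from selenium import webdriver

driver = webdriver.Firefox()   # or webdriver.Chrome(); PhantomJS is deprecated in newer Selenium
driver.get("http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All")

# keep scrolling to the bottom until no new results are appended
last_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the JS time to load the next batch
    titles = driver.find_elements_by_xpath('//div[@class="pu-details lastUnit"]/div[1]/a')
    if len(titles) == last_count:
        break
    last_count = len(titles)

for t in titles:
    print(t.text)
driver.quit()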

Scrapy Spider authenticate and iterate

Below is a Scrapy spider I have put together to pull some elements from a web page. I borrowed it from another Stack Overflow answer. It works, but I need more: I need to be able to walk the series of pages specified in the for loop inside the start_requests method after authenticating.
Yes, I did locate the Scrapy documentation discussing this along with a previous solution for something very similar. Neither one seems to make much sense. From what I can gather, I need to somehow create a request object and keep passing it along, but cannot seem to figure out how to do this.
Thank you in advance for your help.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

class MyBasicSpider(BaseSpider):
    name = "awBasic"
    allowed_domains = ["americanwhitewater.org"]

    def start_requests(self):
        '''
        Override BaseSpider.start_requests to crawl all reaches in series
        '''
        # for every integer from one to 5000
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield self.make_requests_from_url('https://mycrawlsite.com/{0}/'.format(iStr))

    def parse(self, response):
        # create xpath selector object instance with response
        hxs = HtmlXPathSelector(response)
        # get part of url string
        url = response.url
        id = re.findall('/(\d{4})/', url)[0]
        # selector 01
        attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]
        # selector for river section
        attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]
        # print results
        print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(id, attribute01, attribute02))
You may have to approach the problem from a different angle:
First of all, scrape the main page; it contains a login form, so you can use FormRequest to simulate a user login. Your parse method will likely look something like this:
# from scrapy.http import FormRequest
def parse(self, response):
    return [FormRequest.from_response(response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login)]
In after_login you check whether the authentication was successful, usually by scanning the response for error messages; if all went well and you're logged in, you can start generating requests for the pages you're after:
def after_login(self, response):
    if "Login failed" in response.body:
        self.log("Login failed", level=log.ERROR)
    else:
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield Request(url='https://mycrawlsite.com/{0}/'.format(iStr),
                          callback=self.scrape_page)
scrape_page will be called with each of the pages you created a request for; there you can finally extract the information you need using XPath, regex, etc.
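A minimal sketch of what scrape_page might look like, simply reusing the XPaths and regex from the question's original parse method (the placeholder site URL stays the same):
def scrape_page(self, response):
    hxs = HtmlXPathSelector(response)
    # pull the 4-digit id back out of the url
    id = re.findall('/(\d{4})/', response.url)[0]
    attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]
    attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]
    print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(id, attribute01, attribute02))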
BTW, you shouldn't 0-pad numbers manually; format will do it for you if you use the right format specifier.
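For example, the whole str-and-while-loop dance above collapses to a single format call:
iStr = '{0:04d}'.format(i)   # 7 -> '0007'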
