I have been using Scrapy for a few weeks and recently found that HtmlXPathSelector could not parse certain HTML files properly.
In the web page http://detail.zol.com.cn/series/268/10227_1.html there is a tag
`div id='param-more' class='mod_param '`.
When I used the XPath "//div[@id='param-more']" to select that tag, it returned [].
I have tried scrapy shell and got the same results.
When retrieving the web page with wget, I can also find the tag "div id='param-more' class='mod_param '" in the HTML source, so I don't think the problem is that the tag is only displayed after triggering an action.
Please give me some tips on how to solve this problem.
The following is the code snippet related to the problem. When processing the above URL, len(nodes_product) is always 0:
def parse_series(self, response):
    hxs = HtmlXPathSelector(response)
    xpath_product = "//div[@id='param-normal']/table//td[@class='name']/a | "\
                    "//div[@id='param-more']/table//td[@class='name']/a"
    nodes_product = hxs.select(xpath_product)
    if len(nodes_product) == 0:
        # there's only the title, no other products in the series
        .......
    else:
        .......
This appears to be a bug with XPathSelectors. I created a quick test spider and ran into the same problem. I believe it has something to do with the non-standard characters on the page.
I do not believe the problem is that the 'param-more' div is associated with any JavaScript event or CSS hiding. I disabled JavaScript and also changed my user-agent (and location) to see if this affected the data on the page. It didn't.
I was, however, able to parse the 'param-more' div using beautifulsoup:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup

class TestSpider(BaseSpider):
    name = "Test"
    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        #data = hxs.select("//div[@id='param-more']").extract()
        data = response.body
        soup = BeautifulSoup(data)
        print soup.find(id='param-more')
Someone else may know more about the XPathSelector issue, but for the time being you can save the HTML found by BeautifulSoup to an item and pass it into the pipeline.
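A minimal sketch of that workaround might look like the following (PageItem and its html field are hypothetical names, not taken from the original spider):

    from scrapy.item import Item, Field
    from bs4 import BeautifulSoup

    class PageItem(Item):
        # hypothetical item that just carries the raw HTML of the div
        html = Field()

    # inside the spider
    def parse(self, response):
        soup = BeautifulSoup(response.body)
        div = soup.find(id='param-more')
        item = PageItem()
        item['html'] = unicode(div) if div is not None else u''
        yield item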
Here is the link to the most recent beautifulsoup version: http://www.crummy.com/software/BeautifulSoup/#Download
UPDATE
I believe I found the specific issue. The webpage being discussed specifies in a meta tag that it uses the GB 2312 charset. The conversion from GB 2312 to unicode is problematic because there are some characters which do not have a unicode equivalent. This would not be an issue, except for the fact that UnicodeDammit, beautifulsoup's encoding detection module, actually determines the encoding to be ISO 8859-2. The problem is that lxml determines the encoding of a document by looking at the charset specified in the meta tag of the header. Thus, there is an encoding type mismatch between what lxml and scrapy perceive.
The following code demonstrates the above problem, and provides an alternative to having to rely on the BS4 library:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup
import chardet

class TestSpider(BaseSpider):
    name = "Test"
    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
    ]

    def parse(self, response):
        encoding = chardet.detect(response.body)['encoding']
        if encoding != 'utf-8':
            response.body = response.body.decode(encoding, 'replace').encode('utf-8')
        hxs = HtmlXPathSelector(response)
        data = hxs.select("//div[@id='param-more']").extract()
        #print encoding
        print data
Here, you can see that by forcing lxml to use UTF-8, it no longer attempts the conversion from what it perceives as GB 2312 to UTF-8.
In Scrapy, the HtmlXPathSelector's encoding is set in the scrapy/selector/lxmlsel.py module. This module passes the response body to the lxml parser using the response.encoding attribute, which is ultimately set in the scrapy/http/response/text.py module.
The code that handles setting the response.encoding attribute is as follows:
@property
def encoding(self):
    return self._get_encoding(infer=True)

def _get_encoding(self, infer=False):
    enc = self._declared_encoding()
    if enc and not encoding_exists(enc):
        enc = None
    if not enc and infer:
        enc = self._body_inferred_encoding()
    if not enc:
        enc = self._DEFAULT_ENCODING
    return resolve_encoding(enc)

def _declared_encoding(self):
    return self._encoding or self._headers_encoding() \
        or self._body_declared_encoding()
The important thing to note here is that _encoding and _headers_encoding (and thus _declared_encoding) will ultimately reflect the encoding declared in the meta tag or headers rather than using something like UnicodeDammit or chardet to determine the document's encoding. Thus, situations will arise where a document contains characters that are invalid for the encoding it declares, and I believe Scrapy overlooks this, ultimately resulting in the problem we are seeing here.
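If that is the case, one possible workaround (a sketch, not the canonical fix) is to hand the selector a response whose encoding is set explicitly, so the declared GB 2312 charset is never consulted. This assumes TextResponse.replace() accepts body and encoding keyword arguments:

    import chardet
    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):
        # Re-decode the raw body with chardet's guess and build a new response
        # that explicitly declares UTF-8, bypassing the meta-tag charset.
        guessed = chardet.detect(response.body)['encoding'] or 'utf-8'
        fixed = response.replace(
            body=response.body.decode(guessed, 'replace').encode('utf-8'),
            encoding='utf-8')
        hxs = HtmlXPathSelector(fixed)
        print hxs.select("//div[@id='param-more']").extract()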
'mod_param ' != 'mod_param'
The class does not equal "mod_param", but it does contain "mod_param"; note the blank space on the end:
stav@maia:~$ scrapy shell http://detail.zol.com.cn/series/268/10227_1.html
2012-08-23 09:17:28-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: hxs.select("//div[@class='mod_param']")
Out[1]: []

In [2]: hxs.select("//div[contains(@class,'mod_param')]")
Out[2]: [<HtmlXPathSelector xpath="//div[contains(@class,'mod_param')]" data=u'<div id="param-more" class="mod_param "'>]

In [3]: len(hxs.select("//div[contains(@class,'mod_param')]").extract())
Out[3]: 1

In [4]: len(hxs.select("//div[contains(@class,'mod_param')]").extract()[0])
Out[4]: 5372
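Applied to the xpath_product from the question, a sketch of the class-based variant would be (using contains() because of that trailing space):

    # contains() matches "mod_param " even with the trailing space,
    # where an exact @class='mod_param' comparison fails.
    xpath_product = ("//div[contains(@class, 'mod_param')]"
                     "/table//td[@class='name']/a")
    nodes_product = hxs.select(xpath_product)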
Related
I am building a crawl spider to scrape statutory law data from the following website (https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm). I am aiming to extract the statute text, which should be available at the XPath //div[@class = 'first']/p/text().
All of my Scrapy requests yield incomplete HTML responses, so the relevant XPath queries return an empty list. However, when I use the requests library, the HTML downloads correctly.
Using XPath tester online, I've verified that my xpath queries should produce the desired content. Using scrapy shell, I've viewed the response object from scrapy in my browser - and it looks just like it does when I'm browsing natively. I've tried enabling middleware for both BeautifulSoup and Selenium, but neither has appeared to work.
Here's my crawl spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AZspider(CrawlSpider):
    name = "arizona"
    start_urls = [
        "https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm",
    ]
    rule = (Rule(LinkExtractor(restrict_xpaths="//div[@class = 'article']"), callback="parse_stats_az", follow=True),)

    def parse_stats_az(self, response):
        statutes = response.xpath("//div[@class = 'first']/p")
        yield {
            "statutes": statutes
        }
And here's the code that successfully generated the correct response object:
import requests

az_leg = requests.get("https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm")
I've used Scrapy to get movie data, but some titles have special characters which are encoded improperly.
As an example there's a movie that has a link in a website:
Pokémon: Detective Pikachu
The conflict is with the "é" character when getting the movie name.
All the data is added to a json file using the terminal command "scrapy crawl movie -o movies.json"
If no FEED_EXPORT_ENCODING is provided in Scrapy's settings.py, the word Pokémon is written in the JSON file as "Pok\u00e9mon".
If FEED_EXPORT_ENCODING = 'utf-8' is used, the name is written as "PokÃ©mon".
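For reference, that setting is a single line in settings.py; a minimal sketch, assuming the default JSON feed exporter:

    # settings.py
    # Write real UTF-8 bytes instead of \u00e9-style escapes in the exported feed.
    FEED_EXPORT_ENCODING = 'utf-8'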
The parse method in the spider is as follows:
def parse(self, response):
    base_link = 'http://www.the-numbers.com'
    rows_in_big_table = response.xpath("//table/tr")
    movie_name = onerow.xpath('td/b/a/text()').extract()[0]
    movie_item['movie_name'] = movie_name
    yield movie_budget_item

    next_page = response.xpath('//div[@class="pagination"]/a[@class="active"]/following-sibling::a/@href').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
As extra information, this is what I get when opening the JSON file where the data is written:
<_io.TextIOWrapper name='movie.json' mode='r' encoding='cp1252'>
The goal is to get the character "é" in the word "Pokémon".
How would you tackle this problem, and why is it happening? I've been reading lots of information about encoding and in the Python documentation, but I can't find a solution.
I've also tried to use "unicodedata.normalize('NFKC', 'Pok\u00e9mon')" but without success.
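The cp1252 in the TextIOWrapper output above suggests the file is being read back with the Windows default encoding. A minimal sketch of reading the export with an explicit UTF-8 encoding, assuming it was written with FEED_EXPORT_ENCODING = 'utf-8':

    import io
    import json

    # Open the export with the same encoding it was written in,
    # rather than the platform default (cp1252 above).
    with io.open('movies.json', encoding='utf-8') as f:
        movies = json.load(f)
    print(movies[0]['movie_name'])  # should show the accented character correctly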
I appreciate your help! Thanks guys!
Use encoding ISO-8859-1
import scrapy
from bad_encoding.items import BadEncodingItem

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['www.the-numbers.com']
    start_urls = [
        'https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/301'
    ]

    custom_settings = {'FEED_EXPORT_ENCODING': 'ISO-8859-1'}

    def parse(self, response):
        for row in response.xpath('//table/tbody/tr'):
            items = BadEncodingItem()
            items['Rank'] = row.xpath('.//td[1]/text()').get()
            items['Released'] = row.xpath('.//td[2]/a/text()').get()
            items['Movie'] = row.xpath('.//td[3]/b/a/text()').get()
            items['Domestic'] = row.xpath('.//td[4]/text()').get()
            items['International'] = row.xpath('.//td[5]/text()').get()
            items['Worldwide'] = row.xpath('.//td[6]/text()').get()
            yield items
And this is my json file
Hi all, I am trying to get all the results from the link given in the code, but my code is not returning all of them. The link says it contains 2132 results, but only 20 are returned:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart

class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []
        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)
        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may do so when passed an optional query string, for example).
Handle the JavaScript in your spider. The default Scrapy downloader doesn't execute JavaScript, so you can either analyze the JS code and send the required requests yourself programmatically, or use something like Selenium with PhantomJS to let a browser deal with it. I'd recommend the latter since it's more robust than interpreting the JS yourself; a rough sketch follows below. See this question for more information, and Google around, there's plenty of information on this topic.
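A rough sketch of the second option, driving the page with Selenium inside parse (the scroll count and waits are illustrative, not tuned for this site):

    import time
    from selenium import webdriver
    from scrapy.selector import Selector
    from tutorial.items import Flipkart

    def parse(self, response):
        # Let a headless browser run the JS and trigger the infinite scroll.
        driver = webdriver.PhantomJS()
        driver.get(response.url)
        for _ in range(5):  # scroll a few times; adjust for the number of results
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
        sel = Selector(text=driver.page_source)
        driver.quit()
        for site in sel.xpath('//div[@class="pu-details lastUnit"]'):
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            yield item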
I'm having a nightmare with data scraped with Scrapy. Currently I encode it using UTF-8, i.e. detail_content.select('p/text()[1]').extract()[0].encode('utf-8'), save it into a JSON file, and then the captured text is displayed again using Django and a mobile app.
In the JSON file the text ends up escaped as unicode escapes: 'blah blah \u00a34,000 blah'.
Now my problem is that when I try to display the text in a Django template or the mobile app, the literal characters \u00a3 are displayed instead of £.
Should I not be storing escaped unicode in JSON? Would it be better to store ASCII in the JSON file using the JSON escaping? If so how do you go about doing this with scrapy?
Scrapy code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item, Field
import datetime
import unicodedata
import re

class Spider(BaseSpider):
    # spider stuff

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//ul[@class = "category3"]/li')
        for row in rows:
            item = Item()
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['header'] = str(row.select('div[2]/a/text()')
                                     .extract()[0].encode('utf-8'))
            else:
                item['header'] = ''
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['_id'] = str(row.select('div[2]/a/text()')
                                  .extract()[0].encode('utf-8'))
            else:
                item['_id'] = ''
            item['_id'] = self.slugify(item['_id'])[0:20]
            item_url = row.select('div[2]/a/@href').extract()
            today = datetime.datetime.now().isoformat()
            item['dateAdded'] = str(today)
            yield Request(item_url[0], meta={'item': item},
                          callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        detail_content = hxs.select('//*[@id="content-area"]')
        item = response.request.meta['item']
        item['description'] = str(detail_content.select('p/text()[1]')
                                  .extract()[0])
        item['itemUrl'] = str(detail_content.select('//a[@title="Blah"]/@href')
                              .extract()[0])
        item['image_urls'] = (detail_content.select('//img[@width="418"]/../@href')
                              .extract())
        print item
        return item
Ok this I find very odd:
item['header'] = str(row.select('div[2]/a/text()')
                     .extract()[0].encode('utf-8'))
It is not correct to do str(<some_value>.encode('utf-8')). That basically means you're converting a bunch of UTF-8 bytes to ASCII, which may yield errors when the UTF-8 bytes exceed 128.
Now, I strongly believe you're getting the characters from Scrapy already as unicode.
I receive errors like: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 127: ordinal not in range(128)
So, my suggestion is to change the code to this:
item['header'] = row.select('div[2]/a/text()').extract()[0].encode('utf-8')
Just remove the str() call. This will take the unicode received from Scrapy and turn it into UTF-8. Once it is UTF-8, be careful with string operations. Normally this conversion from unicode to a specific encoding should be done just before writing to disk.
Note that you have this kind of code in two places. Modify them both.
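Following that advice, one way to keep items as unicode inside the spider and only encode when writing is to do the conversion in a pipeline; a sketch (the file name and setup are illustrative, not from the original project):

    import json
    import codecs

    class JsonWriterPipeline(object):
        # Items stay unicode until this point; encoding happens only when writing.
        def open_spider(self, spider):
            self.file = codecs.open('items.json', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item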
UPDATE: Take a look at this, might be helpful: scrapy text encoding
Hope this helps!
This is the code for Spider1 that I've been trying to write within the Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem

class Spider1(CrawlSpider):
    domain_name = 'wc2'
    start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']
    rules = (
        Rule(SgmlLinkExtractor(allow=["hxs.select(
            '//td[@class='altRow'][1]/a/@href').re('/.a\w+')"]),
             callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        JD = FirmItem()
        JD['school'] = hxs.select(
            '//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)'
        )
        return JD

SPIDER = Spider1()
The regex in the rules successfully pulls all the bio urls that I want from the start url:
>>> hxs.select(
...     '//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto',
 u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic',
 u'/kallchurch', u'/jalleyne', u'/lalonzo', u'/malthoff', u'/valvarez', u'/camon',
 u'/randerson', u'/eandreeva', u'/pangeli', u'/jangland', u'/mantczak', u'/daranyi',
 u'/carhold', u'/marora', u'/garrington', u'/jartzinger', u'/sasayama',
 u'/masschenfeldt', u'/dattanasio', u'/watterbury', u'/jaudrlicka', u'/caverch',
 u'/fayanruoh', u'/razar']
>>>
But when I run the code I get
[wc2] ERROR: Error processing FirmItem(school=[]) -
[Failure instance: Traceback: <type 'exceptions.IndexError'>: list index out of range
This is the FirmItem in Items.py
from scrapy.item import Item, Field

class FirmItem(Item):
    school = Field()
    pass
Can you help me understand where the index error occurs?
It seems to me that it has something to do with SgmlLinkExtractor.
I've been trying to make this spider work for weeks with Scrapy. They have an excellent tutorial, but I am new to Python and web programming, so I don't understand how, for instance, SgmlLinkExtractor works behind the scenes.
Would it be easier for me to try to write a spider with the same simple functionality with Python libraries? I would appreciate any comments and help.
Thanks
SgmlLinkExtractor doesn't support selectors in its "allow" argument.
So this is wrong:
SgmlLinkExtractor(allow=["hxs.select('//td[#class='altRow'] ...')"])
This is right:
SgmlLinkExtractor(allow=[r"product\.php"])
The parse function is called for each match of your SgmlLinkExtractor.
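For this site, a sketch of what the simplified rule might look like (the allow pattern is only a guess based on the bio URLs shown above, e.g. /cabel and /jacevedo, and the callback name is hypothetical):

    rules = (
        Rule(
            # guessed pattern matching short lowercase bio paths like /cabel
            SgmlLinkExtractor(allow=[r'/[a-z]+$']),
            callback='parse_bio',
        ),
    )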
As Pablo mentioned you want to simplify your SgmlLinkExtractor.
I also tried to put the names scraped from the initial URL into a list and then pass each name to parse in the form of an absolute URL, such as http://www.whitecase.com/aabbas (for /aabbas).
The following code loops over the list, but I don't know how to pass this to parse. Do you think this is a better idea?
baseurl = 'http://www.whitecase.com'
names = ['aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']

def makeurl(baseurl, names):
    for x in names:
        url = baseurl + x
        baseurl = 'http://www.whitecase.com'
        x = ''
    return url
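One way to hand each constructed URL to parse is to yield Request objects, for example from the spider's start_requests method; a sketch (the placement inside the spider class and the normalized name list are assumed):

    from scrapy.http import Request

    def start_requests(self):
        baseurl = 'http://www.whitecase.com'
        names = ['/aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']
        for name in names:
            # one request per bio page, each handled by parse()
            yield Request(baseurl + name, callback=self.parse)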