I'm having a nightmare with data scraped with Scrapy. Currently I encode it using UTF-8, i.e. detail_content.select('p/text()[1]').extract()[0].encode('utf-8'), save it into a JSON file, and then the captured text is displayed again using Django and a mobile app.
In the JSON file the captured text gets escaped using unicode escapes: 'blah blah \u00a34,000 blah'
Now my problem is that when I try to display the text in a Django template or the mobile app, the literal characters are displayed: \u00a3 instead of £
Should I not be storing escaped unicode in JSON? Would it be better to store ASCII in the JSON file using the JSON escaping? If so, how do you go about doing this with Scrapy?
Scrapy code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item, Field
import datetime
import unicodedata
import re

class Spider(BaseSpider):
    # spider stuff

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//ul[@class = "category3"]/li')
        for row in rows:
            item = Item()
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['header'] = str(row.select('div[2]/a/text()')
                                     .extract()[0].encode('utf-8'))
            else:
                item['header'] = ''
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['_id'] = str(row.select('div[2]/a/text()')
                                  .extract()[0].encode('utf-8'))
            else:
                item['_id'] = ''
            item['_id'] = self.slugify(item['_id'])[0:20]
            item_url = row.select('div[2]/a/@href').extract()
            today = datetime.datetime.now().isoformat()
            item['dateAdded'] = str(today)
            yield Request(item_url[0], meta={'item': item},
                          callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        detail_content = hxs.select('//*[@id="content-area"]')
        item = response.request.meta['item']
        item['description'] = str(detail_content.select('p/text()[1]')
                                  .extract()[0])
        item['itemUrl'] = str(detail_content.select('//a[@title="Blah"]/@href')
                              .extract()[0])
        item['image_urls'] = detail_content.select('//img[@width="418"]/../@href').extract()
        print item
        return item
OK, this I find very odd:
item['header'] = str(row.select('div[2]/a/text()')
                     .extract()[0].encode('utf-8'))
It is not correct to do str(<some_value>.encode('utf-8')). That basically means you're converting a UTF-8 bunch of bytes to ASCII. This may yield errors when the UTF-8 bytes exceed 128.
Now, I strongly believe you're getting the characters from Scrapy already as unicode.
I receive errors like: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 127: ordinal not in range(128)
So, my suggestion is to change the code to this:
item['header'] = row.select('div[2]/a/text()').extract()[0].encode('utf-8')
Just remove the str() call. This will take the unicode received from Scrapy and turn it into UTF-8. Once it is UTF-8, be careful with string operations. Normally this conversion from unicode to a specific encoding should be done just before writing to disk.
Note that you have this kind of code in two places. Modify them both.
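For illustration, a minimal Python 2 session (the sample value is made up, reusing the £ from the question) showing the difference: encoding to UTF-8 explicitly works, while relying on str() and the default ASCII codec raises the kind of error quoted above:
value = u'blah \u00a34,000 blah'   # unicode, as extract() returns it
value.encode('utf-8')              # OK: 'blah \xc2\xa34,000 blah' (UTF-8 bytes)
str(value)                         # UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' ...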
UPDATE: Take a look at this, it might be helpful: scrapy text encoding
Hope this helps!
Related
I've used Scrapy to get movie data, but some of the movies have special characters which are encoded improperly.
As an example, there's a movie that has a link on a website:
Pokémon: Detective Pikachu
The conflict is with the "é" character when getting the movie name.
All the data is added to a json file using the terminal command "scrapy crawl movie -o movies.json"
If, in Scrapy's settings.py, no FEED_EXPORT_ENCODING is provided, the word Pokémon is written in the json file as "Pok\u00e9mon"
If FEED_EXPORT_ENCODING = 'utf-8' is used, the name shows up as "PokÃ©mon" when the file is read back (the UTF-8 bytes are decoded as cp1252, see the TextIOWrapper below)
The parse method in the spider is as follows:
def parse(self, response):
    base_link = 'http://www.the-numbers.com'
    rows_in_big_table = response.xpath("//table/tr")
    for onerow in rows_in_big_table:
        movie_item = MovieItem()  # item class assumed; not shown in the excerpt
        movie_name = onerow.xpath('td/b/a/text()').extract()[0]
        movie_item['movie_name'] = movie_name
        yield movie_item
    next_page = response.xpath('//div[@class="pagination"]/a[@class="active"]/following-sibling::a/@href').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
As extra information, I have this about the json file where the output is written:
<_io.TextIOWrapper name='movie.json' mode='r' encoding='cp1252'>
The goal is to get the character "é" in the word "Pokémon".
How would you tackle this problem, and why is this happening? I've been reading lots of info about encoding, and the Python documentation, but I can't find a solution.
I've also tried to use "unicodedata.normalize('NFKC', 'Pok\u00e9mon')" but without success.
I appreciate your help! Thanks guys!
Use encoding ISO-8859-1
import scrapy
from bad_encoding.items import BadEncodingItem

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['www.the-numbers.com']
    start_urls = [
        'https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/301'
    ]

    custom_settings = {'FEED_EXPORT_ENCODING': 'ISO-8859-1'}

    def parse(self, response):
        for row in response.xpath('//table/tbody/tr'):
            items = BadEncodingItem()
            items['Rank'] = row.xpath('.//td[1]/text()').get()
            items['Released'] = row.xpath('.//td[2]/a/text()').get()
            items['Movie'] = row.xpath('.//td[3]/b/a/text()').get()
            items['Domestic'] = row.xpath('.//td[4]/text()').get()
            items['International'] = row.xpath('.//td[5]/text()').get()
            items['Worldwide'] = row.xpath('.//td[6]/text()').get()
            yield items
And this is my json file
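If the goal is instead to keep the feed in UTF-8, a possible alternative (a sketch, assuming the same movies.json file and item fields as above) is to export with FEED_EXPORT_ENCODING = 'utf-8' and make sure the file is read back with that same encoding, instead of letting Python fall back to cp1252:
# settings.py (or custom_settings): write the feed as real UTF-8
FEED_EXPORT_ENCODING = 'utf-8'

# when reading the file back, open it with the matching encoding,
# otherwise Windows defaults to cp1252 and "é" shows up as "Ã©"
import io
import json

with io.open('movies.json', encoding='utf-8') as f:
    movies = json.load(f)
print(movies[0]['Movie'])   # e.g. Pokémon: Detective Pikachu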
I am trying to write the output of a scraped XML file to JSON. The scrape fails due to an item not being serializable.
From this question it's advised that you need to build a pipeline, but the answer itself is not provided, being out of scope for that question: SO scrapy serializer
So, referring to the scrapy docs,
they illustrate an example, however the docs then advise not to use this:
The purpose of JsonWriterPipeline is just to introduce how to write
item pipelines. If you really want to store all scraped items into a
JSON file you should use the Feed exports.
If I go to Feed exports this is shown:
JSON
FEED_FORMAT: json
Exporter used: JsonItemExporter
See this warning if you're using JSON with large feeds.
My issue still remains, as that, as I understand it, is for executing from the command line, like so:
scrapy runspider myxml.py -o ~/items.json -t json
However, this creates the error I was aiming to use a pipeline to solve.
TypeError: <bound method SelectorList.extract of [<Selector xpath='.//#venue' data=u'Royal Randwick'>]> is not JSON serializable
How do I create the json pipeline to rectify the json serialize error?
This is my code.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
# https://stackoverflow.com/a/27391649/461887
import json

class MyxmlSpider(scrapy.Spider):
    name = "myxml"
    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []
        for site in sites:
            item = ConvXmlItem()
            item['venue'] = site.xpath('.//@venue').extract
            item['name'] = site.xpath('.//race/@id').extract()
            item['url'] = site.xpath('.//race/@number').extract()
            item['description'] = site.xpath('.//race/@distance').extract()
            items.append(item)
        return items

# class JsonWriterPipeline(object):
#
#     def __init__(self):
#         self.file = open('items.jl', 'wb')
#
#     def process_item(self, item, spider):
#         line = json.dumps(dict(item)) + "\n"
#         self.file.write(line)
#         return item
The problem is here:
item['venue'] = site.xpath('.//@venue').extract
You've just forgotten to call extract. Replace it with:
item['venue'] = site.xpath('.//@venue').extract()
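And if you do still want a pipeline (as the question title asks), the commented-out JsonWriterPipeline from the question works once every field holds plain extracted values; a minimal sketch (the module path in settings.py is assumed from your project name):
import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        # works only if every field is a plain list/string, i.e. extract() was called
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

# settings.py (module path assumed):
# ITEM_PIPELINES = {'conv_xml.pipelines.JsonWriterPipeline': 300}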
When I append a Unicode string to the end of a str, I cannot click on the URL.
Bad:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Ángel_Garasa"
print url
Good:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Toby_Maquire"
print url
It appears that you're printing the results in an IDE, perhaps PyCharm. You need to percent encode a UTF-8 encoded version of the string:
import urllib
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
name = u"Ángel_Garasa"
print base_url + urllib.quote(name.encode("utf-8"))
This shows the fully percent-encoded URL, which is clickable.
In your case you need to update your code, so that the relevant field from the database is percent encoded. You only need to encode this one field to UTF-8 just for the percent encoding.
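A sketch of that change (the database variable names here are hypothetical, since that part of the code isn't shown):
# -*- coding: utf-8 -*-
import urllib

base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='

def build_url(title):
    # title is the unicode value read from the database; encode to UTF-8
    # only here, just for the percent encoding
    return base_url + urllib.quote(title.encode('utf-8'))

print build_url(u"Ángel_Garasa")
# ...&titles=%C3%81ngel_Garasa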
I have a problem with the encoding of text I am scraping from a website. Specifically, the Danish letters æ, ø and å are coming out wrong. I feel confident that the encoding of the webpage is UTF-8, since the browser shows it correctly with this encoding.
I have tried using BeautifulSoup as many of the other posts have suggested, but it didn't help. However, I probably did it wrong.
I am using python 2.7 on a windows 7 32 bit OS.
The code I have is this:
# -*- coding: UTF-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 3, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            item['SalgsType'] = site.select("td[4]/text()").extract()
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
It is the items 'Adresse' and 'Salgstype' that contain æ, ø, and å. Any help is greatly appreciated!
Cheers,
OK, doing some research I finally checked that those characters are indeed those letters, but in unicode. Since your cmd.exe doesn't understand unicode, it dumps the bytes of the characters.
You'll have to encode them first to UTF-8 and change the code page of cmd.exe to UTF-8.
Do this:
For every string you're going to output to the console, call its encode('utf-8') method, like this:
print whatever_string.encode('utf-8')
That's in your code; in your console, before invoking your script, do this:
> chcp 65001
> python your_script.py
Tested this in my Python interpreter:
>>> u'\xc6blevangen'.encode('utf-8')
'\xc3\x86blevangen'
Which is exactly the Æ character encoded in UTF-8 :)
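Applied to the spider above, a minimal sketch (field names taken from the question; the encoding happens only when printing to the console):
# inside parse(), after the item has been filled
for field in ('Adresse', 'SalgsType'):
    values = item[field]                  # extract() returns a list of unicode strings
    if values:
        print values[0].encode('utf-8')   # requires chcp 65001 in cmd.exe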
Hope it helps!
I have used Scrapy for a few weeks and recently I found that HtmlXPathSelector couldn't parse some HTML files properly.
In the web page http://detail.zol.com.cn/series/268/10227_1.html , there's only a tag named
`div id='param-more' class='mod_param '`.
When I used the XPath "//div[@id='param-more']" to select the tag, it returned [].
I have tried scrapy shell and got the same results.
When using wget to retrieve the web page, I could also find the tag "div id='param-more' class='mod_param '" in the html source file, so I don't think it's a case of the tag only being rendered after triggering some action.
Please give me some tips on how to solve this problem.
The following is the code snippet about the problem. When processing the above url, len(nodes_product) is always 0:
def parse_series(self, response):
    hxs = HtmlXPathSelector(response)

    xpath_product = "//div[@id='param-normal']/table//td[@class='name']/a | "\
                    "//div[@id='param-more']/table//td[@class='name']/a"
    nodes_product = hxs.select(xpath_product)

    if len(nodes_product) == 0:
        # there's only the title, no other products in the series
        .......
    else:
        .......
This appears to be a bug with XPathSelectors. I created a quick test spider and ran into the same problem. I believe it has something to do with the non-standard characters on the page.
I do not believe the problem is that the 'param-more' div is associated with any javascript event or CSS hiding. I disabled javascript and also changed my user-agent (and location) to see if this affected the data on the page. It didn't.
I was, however, able to parse the 'param-more' div using beautifulsoup:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup

class TestSpider(BaseSpider):
    name = "Test"
    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        #data = hxs.select("//div[@id='param-more']").extract()
        data = response.body
        soup = BeautifulSoup(data)
        print soup.find(id='param-more')
Someone else may know more about the XPathSelect issue, but for the time being, you can save the HTML found by beautifulsoup to an item and pass it into the pipeline.
Here is the link to the most recent beautifulsoup version: http://www.crummy.com/software/BeautifulSoup/#Download
UPDATE
I believe I found the specific issue. The webpage being discussed specifies in a meta tag that it uses the GB 2312 charset. The conversion from GB 2312 to unicode is problematic because there are some characters which do not have a unicode equivalent. This would not be an issue, except for the fact that UnicodeDammit, beautifulsoup's encoding detection module, actually determines the encoding to be ISO 8859-2. The problem is that lxml determines the encoding of a document by looking at the charset specified in the meta tag of the header. Thus, there is an encoding type mismatch between what lxml and scrapy perceive.
The following code demonstrates the above problem, and provides an alternative to having to rely on the BS4 library:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup
import chardet

class TestSpider(BaseSpider):
    name = "Test"
    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
    ]

    def parse(self, response):
        encoding = chardet.detect(response.body)['encoding']
        if encoding != 'utf-8':
            response.body = response.body.decode(encoding, 'replace').encode('utf-8')
        hxs = HtmlXPathSelector(response)
        data = hxs.select("//div[@id='param-more']").extract()
        #print encoding
        print data
Here, you see that by forcing lxml to use utf-8 encoding, it does not attempt to map from what it perceives as GB 2312->utf-8.
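As a side note, in more recent Scrapy versions Response.body is read-only, so the same idea would be written by building a re-encoded copy of the response instead of assigning to response.body; a sketch under that assumption:
import chardet

def parse(self, response):
    encoding = chardet.detect(response.body)['encoding']
    if encoding and encoding.lower() != 'utf-8':
        # create a new response with the body transcoded to UTF-8
        body = response.body.decode(encoding, 'replace').encode('utf-8')
        response = response.replace(body=body, encoding='utf-8')
    data = response.xpath("//div[@id='param-more']").extract()
    print(data)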
In Scrapy, the HtmlXPathSelector's encoding is set in the scrapy/selector/lxmlsel.py module. This module passes the response body to the lxml parser using the response.encoding attribute, which is ultimately set in the scrapy/http/response/text.py module.
The code that handles setting the response.encoding attribute is as follows:
@property
def encoding(self):
    return self._get_encoding(infer=True)

def _get_encoding(self, infer=False):
    enc = self._declared_encoding()
    if enc and not encoding_exists(enc):
        enc = None
    if not enc and infer:
        enc = self._body_inferred_encoding()
    if not enc:
        enc = self._DEFAULT_ENCODING
    return resolve_encoding(enc)

def _declared_encoding(self):
    return self._encoding or self._headers_encoding() \
        or self._body_declared_encoding()
The important thing to note here is that _headers_encoding and _encoding will ultimately reflect the encoding declared in the meta tag in the header, rather than using something like UnicodeDammit or chardet to determine the document's encoding. Thus, situations will arise where a document contains characters that are invalid for the encoding it has declared, and I believe Scrapy overlooks this, ultimately resulting in the problem we are seeing today.
'mod_param ' != 'mod_param'
The class does not equal "mod_param" but it does contain "mod_param"; note there is a blank space at the end:
stav@maia:~$ scrapy shell http://detail.zol.com.cn/series/268/10227_1.html
2012-08-23 09:17:28-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: hxs.select("//div[@class='mod_param']")
Out[1]: []

In [2]: hxs.select("//div[contains(@class,'mod_param')]")
Out[2]: [<HtmlXPathSelector xpath="//div[contains(@class,'mod_param')]" data=u'<div id="param-more" class="mod_param "'>]

In [3]: len(hxs.select("//div[contains(@class,'mod_param')]").extract())
Out[3]: 1

In [4]: len(hxs.select("//div[contains(@class,'mod_param')]").extract()[0])
Out[4]: 5372