Problems with character encoding when web scraping with Scrapy - Python

I have a problem with the encoding of the text I am scraping from a website. Specifically, the Danish letters æ, ø, and å come out wrong. I feel confident that the encoding of the webpage is UTF-8, since the browser shows it correctly with that encoding.
I have tried using BeautifulSoup, as many other posts suggest, but it didn't help. However, I probably did it wrong.
I am using Python 2.7 on 32-bit Windows 7.
The code I have is this:
# -*- coding: UTF-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 3, 1)]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            item['SalgsType'] = site.select("td[4]/text()").extract()
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
It is the fields 'Adresse' and 'SalgsType' that contain æ, ø, and å. Any help is greatly appreciated!
Cheers,

OK, after doing some research I finally confirmed that those characters are indeed the right letters, just in Unicode. Since your cmd.exe doesn't understand Unicode, it dumps the raw bytes of the characters.
You'll have to encode the strings to UTF-8 first, and change the code page of cmd.exe to UTF-8.
Do this:
For every string you're going to output to the console, call its encode('utf-8') method, like this:
print whatever_string.encode('utf-8')
That's in your code, and in your console, before invoking your script do this:
> chcp 65001
> python your_script.py
I tested this in my Python interpreter:
>>> u'\xc6blevangen'.encode('utf-8')
'\xc3\x86blevangen'
which is exactly the Æ character encoded in UTF-8 :)
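Applied to the spider above, a minimal sketch (assuming each field holds the list of Unicode strings that extract() returns) would be:

for item in items:
    for adresse in item['Adresse']:
        # encode the Unicode string to UTF-8 bytes so cmd.exe
        # (switched to code page 65001) can render æ, ø, and å
        print adresse.encode('utf-8')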
Hope it helps!

Related

Some "non usual characters" encoded incorrectly when being scraped using Scrapy

I've used Scrapy to get movie data, but some titles have special characters which are encoded improperly.
As an example, one movie is linked on the website as:
Pokémon: Detective Pikachu
The conflict is with the "é" character when getting the movie name.
All the data is added to a JSON file using the terminal command "scrapy crawl movie -o movies.json".
If no FEED_EXPORT_ENCODING is provided in Scrapy's settings.py, the word Pokémon is written in the JSON file as "Pok\u00e9mon".
If FEED_EXPORT_ENCODING = 'utf-8' is used, the name is written as "Pokémon".
The parse method in the spider is as follows:
def parse(self, response):
    base_link = 'http://www.the-numbers.com'
    rows_in_big_table = response.xpath("//table/tr")
    for onerow in rows_in_big_table:
        movie_name = onerow.xpath('td/b/a/text()').extract()[0]
        movie_item['movie_name'] = movie_name
        yield movie_item
    next_page = response.xpath('//div[@class="pagination"]/a[@class="active"]/following-sibling::a/@href').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
As extra information, this is what Python reports about the JSON file where the data is written:
<_io.TextIOWrapper name='movie.json' mode='r' encoding='cp1252'>
The goal is to get the character "é" in the word "Pokémon".
How would you tackle this problem, and why is it happening? I've been reading lots of info about encoding, and the Python documentation, but I can't find a solution.
I've also tried to use "unicodedata.normalize('NFKC', 'Pok\u00e9mon')", but without success.
I appreciate your help! Thanks guys!
Use encoding ISO-8859-1
import scrapy
from bad_encoding.items import BadEncodingItem

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['www.the-numbers.com']
    start_urls = [
        'https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/301'
    ]
    custom_settings = {'FEED_EXPORT_ENCODING': 'ISO-8859-1'}

    def parse(self, response):
        for row in response.xpath('//table/tbody/tr'):
            items = BadEncodingItem()
            items['Rank'] = row.xpath('.//td[1]/text()').get()
            items['Released'] = row.xpath('.//td[2]/a/text()').get()
            items['Movie'] = row.xpath('.//td[3]/b/a/text()').get()
            items['Domestic'] = row.xpath('.//td[4]/text()').get()
            items['International'] = row.xpath('.//td[5]/text()').get()
            items['Worldwide'] = row.xpath('.//td[6]/text()').get()
            yield items
And this is my json file
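One thing worth checking, as a hedged aside rather than part of the answer above: read the file back with the same encoding you exported with, instead of the platform default (your <_io.TextIOWrapper ...> output shows cp1252, the Windows default):

import io
import json

# open with an explicit encoding matching FEED_EXPORT_ENCODING,
# rather than relying on the platform default
with io.open('movies.json', encoding='ISO-8859-1') as f:
    movies = json.load(f)
print(movies[0]['Movie'])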

Back to basics: Scrapy

New to Scrapy and I definitely need pointers. I've run through some examples but I'm not getting some of the basics. I'm running Scrapy 1.0.3.
Spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from matrix_scrape.items import MatrixScrapeItem

class MySpider(BaseSpider):
    name = "matrix"
    allowed_domains = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]
    start_urls = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = MatrixScrapeItem()
        item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
        item['totalPledged'] = hxs.select("//*[@id="pledged"]/data").extract()
        print backers, totalPledged
Item:
import scrapy

class MatrixScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    backers = scrapy.Field()
    totalPledged = scrapy.Field()
    pass
I'm getting the error:
File "/home/will/Desktop/repos/scrapy/matrix_scrape/matrix_scrape/spiders/test.py", line 15
item['backers'] = hxs.select("//*[#id="backers_count"]/data").extract()
My questions are: why isn't the selecting and extracting working properly? I do see people just using Selector a lot instead of HtmlXPathSelector.
Also, I'm trying to save this to a CSV file and automate it based on time (extract these data points every 30 minutes). If anyone has any pointers for examples of that, they'd get super brownie points :)
The syntax error is caused by the way you use double quotes. Mix single and double quotes:
item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
item['totalPledged'] = hxs.select('//*[@id="pledged"]/data').extract()
As a side note, you can use response.xpath() shortcut instead of instantiating HtmlXPathSelector:
def parse(self, response):
    item = MatrixScrapeItem()
    item['backers'] = response.xpath('//*[@id="backers_count"]/data').extract()
    item['totalPledged'] = response.xpath('//*[@id="pledged"]/data').extract()
    print backers, totalPledged
And you've probably meant to get the text() of the data elements:
//*[#id="backers_count"]/data/text()
//*[#id="pledged"]/data/text()

Encode Scrapy data to display in Django and Android

I'm having a nightmare with data scraped with Scrapy. Currently I encode it using UTF-8, i.e. detail_content.select('p/text()[1]').extract()[0].encode('utf-8'), save it into a JSON file, and then the captured text is displayed again by Django and a mobile app.
In the JSON file the text is stored with Unicode escapes, e.g. 'blah blah \u00a34,000 blah'.
Now my problem is that when I try to display the text in a Django template or the mobile app, the literal characters display: \u00a3 instead of £.
Should I not be storing escaped Unicode in JSON? Would it be better to store ASCII in the JSON file using the JSON escaping? If so, how do you go about doing this with Scrapy?
Scrapy code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item, Field
import datetime
import unicodedata
import re

class Spider(BaseSpider):
    #spider stuff

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//ul[@class = "category3"]/li')
        for row in rows:
            item = Item()
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['header'] = str(row.select('div[2]/a/text()')
                                     .extract()[0].encode('utf-8'))
            else:
                item['header'] = ''
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['_id'] = str(row.select('div[2]/a/text()')
                                  .extract()[0].encode('utf-8'))
            else:
                item['_id'] = ''
            item['_id'] = self.slugify(item['_id'])[0:20]
            item_url = row.select('div[2]/a/@href').extract()
            today = datetime.datetime.now().isoformat()
            item['dateAdded'] = str(today)
            yield Request(item_url[0], meta={'item': item},
                          callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        detail_content = hxs.select('//*[@id="content-area"]')
        item = response.request.meta['item']
        item['description'] = str(detail_content.select('p/text()[1]')
                                  .extract()[0])
        item['itemUrl'] = str(detail_content.select('//a[@title="Blah"]/@href')
                              .extract()[0])
        item['image_urls'] = detail_content.select('//img[@width="418"]/../@href').extract()
        print item
        return item
OK, this I find very odd:
item['header'] = str(row.select('div[2]/a/text()')
                     .extract()[0].encode('utf-8'))
It is not correct to do str(<some_value>.encode('utf-8')). That basically means you're converting a UTF-8 bunch of bytes to ASCII. This may yield errors when the UTF-8 bytes exceed 128.
Now, I strongly believe you're getting the characters from Scrapy already as Unicode.
I receive errors like: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 127: ordinal not in range(128)
So, my suggestion is to change the code to this:
item['header'] = row.select('div[2]/a/text()').extract()[0].encode('utf-8')
Just remove the str() call. This will take the Unicode received from Scrapy and turn it into UTF-8. Once it is in UTF-8, be careful with string operations. Normally this conversion from Unicode to a specific encoding should be done just before writing to disk.
Note that you have this kind of code in two places. Modify them both.
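If you'd rather the JSON file contain the literal £ instead of \u00a3 escapes in the first place, here is a hedged sketch (this is the standard json module's behaviour, not anything Scrapy-specific): keep the values as Unicode and dump them with ensure_ascii=False:

import codecs
import json

# ensure_ascii=False keeps characters like £ literal instead of
# escaping them to \u00a3; write the file as UTF-8 text
with codecs.open('items.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps({'description': u'blah blah \u00a34,000 blah'},
                       ensure_ascii=False))

Django and the mobile app can then read the file as UTF-8 and display £ directly.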
UPDATE: Take a look at this, might be helpful: scrapy text encoding
Hope this helps!

Scrapy couldn't parse some html file correctly

I have used Scrapy for a few weeks and recently I have found that HtmlXPathSelector could not parse some HTML files properly.
In the web page http://detail.zol.com.cn/series/268/10227_1.html there's only one tag named
`div id='param-more' class='mod_param '`.
When I used the XPath "//div[@id='param-more']" to select the tag, it returned [].
I have tried scrapy shell and got the same result.
When using wget to retrieve the web page, I could also find the tag "div id='param-more' class='mod_param '" in the HTML source file, so I don't think the tag is inserted by a triggered action.
Please give me some tips on how to solve this problem.
The following is the code snippet for the problem. When processing the above URL, len(nodes_product) is always 0:
def parse_series(self, response):
    hxs = HtmlXPathSelector(response)
    xpath_product = "//div[@id='param-normal']/table//td[@class='name']/a | "\
                    "//div[@id='param-more']/table//td[@class='name']/a"
    nodes_product = hxs.select(xpath_product)
    if len(nodes_product) == 0:
        # there's only the title, no other products in the series
        .......
    else:
        .......
This appears to be a bug with XPathSelectors. I created a quick test spider and ran into the same problem. I believe it has something to do with the non-standard characters on the page.
I do not believe the problem is that the 'param-more' div is associated with any JavaScript event or CSS hiding. I disabled JavaScript and also changed my user agent (and location) to see if this affected the data on the page. It didn't.
I was, however, able to parse the 'param-more' div using BeautifulSoup:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup

class TestSpider(BaseSpider):
    name = "Test"
    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        #data = hxs.select("//div[@id='param-more']").extract()
        data = response.body
        soup = BeautifulSoup(data)
        print soup.find(id='param-more')
Someone else may know more about the XPathSelector issue, but for the time being, you can save the HTML found by BeautifulSoup to an item and pass it into the pipeline.
Here is the link to the most recent beautifulsoup version: http://www.crummy.com/software/BeautifulSoup/#Download
UPDATE
I believe I found the specific issue. The webpage being discussed specifies in a meta tag that it uses the GB 2312 charset. The conversion from GB 2312 to Unicode is problematic because some characters have no Unicode equivalent. That would not be an issue, except that UnicodeDammit, BeautifulSoup's encoding-detection module, actually determines the encoding to be ISO 8859-2. The problem is that lxml determines the encoding of a document by looking at the charset specified in the meta tag of the header. Thus, there is an encoding mismatch between what lxml and what Scrapy perceive.
The following code demonstrates the above problem, and provides an alternative to having to rely on the BS4 library:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup
import chardet

class TestSpider(BaseSpider):
    name = "Test"
    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
    ]

    def parse(self, response):
        encoding = chardet.detect(response.body)['encoding']
        if encoding != 'utf-8':
            body = response.body.decode(encoding, 'replace').encode('utf-8')
            response = response.replace(body=body, encoding='utf-8')
        hxs = HtmlXPathSelector(response)
        data = hxs.select("//div[@id='param-more']").extract()
        #print encoding
        print data
Here, you see that by forcing lxml to use utf-8 encoding, it does not attempt to map from what it perceives as GB 2312->utf-8.
In Scrapy, the HtmlXPathSelector's encoding is set in the scrapy/selector/lxmlsel.py module. This module passes the response body to the lxml parser using the response.encoding attribute, which is ultimately set in the scrapy/http/response/text.py module.
The code that handles setting the response.encoding attribute is as follows:
@property
def encoding(self):
    return self._get_encoding(infer=True)

def _get_encoding(self, infer=False):
    enc = self._declared_encoding()
    if enc and not encoding_exists(enc):
        enc = None
    if not enc and infer:
        enc = self._body_inferred_encoding()
    if not enc:
        enc = self._DEFAULT_ENCODING
    return resolve_encoding(enc)

def _declared_encoding(self):
    return self._encoding or self._headers_encoding() \
        or self._body_declared_encoding()
The important thing to note here is that both _headers_encoding and _encoding will ultimately reflect the encoding declared in the meta tag of the header, rather than using something like UnicodeDammit or chardet to determine the document's encoding. Thus, situations will arise where a document contains characters that are invalid for the encoding it declares, and I believe Scrapy overlooks this, ultimately resulting in the problem we are seeing here.
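Given that, a lighter workaround than re-encoding the body may be to override the inferred encoding directly. A hedged sketch, assuming response.replace() accepts the encoding keyword that TextResponse takes, and using gb18030 (a superset of GB 2312) to cover the problem characters:

def parse(self, response):
    # ignore the declared GB 2312 charset and decode as GB 18030,
    # which includes the characters that tripped up the conversion
    fixed = response.replace(encoding='gb18030')
    hxs = HtmlXPathSelector(fixed)
    print hxs.select("//div[@id='param-more']").extract()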
'mod_param ' != 'mod_param'
The class does not equal "mod_param", but it does contain "mod_param"; note there is a blank space at the end:
stav@maia:~$ scrapy shell http://detail.zol.com.cn/series/268/10227_1.html
2012-08-23 09:17:28-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
IPython 0.12.1 -- An enhanced Interactive Python.
In [1]: hxs.select("//div[@class='mod_param']")
Out[1]: []
In [2]: hxs.select("//div[contains(@class,'mod_param')]")
Out[2]: [<HtmlXPathSelector xpath="//div[contains(@class,'mod_param')]" data=u'<div id="param-more" class="mod_param "'>]
In [3]: len(hxs.select("//div[contains(@class,'mod_param')]").extract())
Out[3]: 1
In [4]: len(hxs.select("//div[contains(@class,'mod_param')]").extract()[0])
Out[4]: 5372

Scrapy spider index error

This is the code for Spider1 that I've been trying to write within the Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem

class Spider1(CrawlSpider):
    domain_name = 'wc2'
    start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']
    rules = (
        Rule(SgmlLinkExtractor(allow=["hxs.select('//td[@class='altRow'][1]/a/@href').re('/.a\w+')"]),
             callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        JD = FirmItem()
        JD['school'] = hxs.select(
            '//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return JD

SPIDER = Spider1()
The regex in the rules successfully pulls all the bio URLs that I want from the start URL:
>>> hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', u'/kallchurch', u'/jalleyne', u'/lalonzo', u'/malthoff', u'/valvarez', u'/camon', u'/randerson', u'/eandreeva', u'/pangeli', u'/jangland', u'/mantczak', u'/daranyi', u'/carhold', u'/marora', u'/garrington', u'/jartzinger', u'/sasayama', u'/masschenfeldt', u'/dattanasio', u'/watterbury', u'/jaudrlicka', u'/caverch', u'/fayanruoh', u'/razar']
>>>
But when I run the code I get
[wc2] ERROR: Error processing FirmItem(school=[]) -
[Failure instance: Traceback: <type 'exceptions.IndexError'>: list index out of range
This is the FirmItem in Items.py
from scrapy.item import Item, Field

class FirmItem(Item):
    school = Field()
    pass
Can you help me understand where the index error occurs?
It seems to me that it has something to do with SgmlLinkExtractor.
I've been trying to make this spider work for weeks with Scrapy. It has an excellent tutorial, but I am new to Python and web programming, so I don't understand how, for instance, SgmlLinkExtractor works behind the scenes.
Would it be easier for me to try to write a spider with the same simple functionality with Python libraries? I would appreciate any comments and help.
Thanks
SgmlLinkExtractor doesn't support selectors in its "allow" argument.
So this is wrong:
SgmlLinkExtractor(allow=["hxs.select('//td[#class='altRow'] ...')"])
This is right:
SgmlLinkExtractor(allow=[r"product\.php"])
The parse function is called for each match of your SgmlLinkExtractor.
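For this page specifically, a hedged example (untested; 'parse_bio' is an assumed callback name, and the regex just mirrors the '/.a\w+' pattern from the question) might look like:

rules = (
    # allow takes regexes matched against the URL, not selector expressions
    Rule(SgmlLinkExtractor(allow=[r'/.a\w+$']), callback='parse_bio'),
)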
As Pablo mentioned, you want to simplify your SgmlLinkExtractor.
I also tried to put the names scraped from the initial URL into a list and then pass each name to parse in the form of an absolute URL, such as http://www.whitecase.com/aabbas (for /aabbas).
The following code loops over the list, but I don't know how to pass the result to parse. Do you think this is a better idea?
baseurl = 'http://www.whitecase.com'
names = ['aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']

def makeurl(baseurl, names):
    for x in names:
        url = baseurl + x
        baseurl = 'http://www.whitecase.com'
        x = ''
    return url
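One way to feed such a list into the spider, as a hedged sketch (BaseSpider-era API; parse_bio is an assumed callback name), is to build the requests yourself in start_requests:

from scrapy.http import Request

def start_requests(self):
    baseurl = 'http://www.whitecase.com'
    # note: every entry needs the leading slash
    names = ['/aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']
    for name in names:
        # hand each absolute URL to the callback as its own request
        yield Request(baseurl + name, callback=self.parse_bio)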
