I'm writing a crawler with Python using BeautifulSoup, and everything was going swimmingly till I ran into this site:
http://www.elnorte.ec/
I'm getting the contents with the requests library:
r = requests.get('http://www.elnorte.ec/')
content = r.content
If I print the content variable at that point, all the Spanish special characters seem to be working fine. However, once I feed the content variable to BeautifulSoup, it all gets messed up:
soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&month=08&day=27&modid=203" title="1009 artÃculos en este dÃa">
...
It's apparently garbling all the Spanish special characters (accents and so on). I've tried content.decode('utf-8') and content.decode('latin-1'), and also tried the fromEncoding parameter to BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.
Any pointers would be much appreciated.
In your case, the page contains malformed utf-8 data, which confuses BeautifulSoup and makes it think the page uses windows-1252. You can use this trick:
soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))
by doing this you will discard any wrong symbols from the page source and BeautifulSoup will guess the encoding correctly.
You can replace 'ignore' with 'replace' and check the text for '�' (U+FFFD, the replacement character) to see what has been discarded.
Actually, it's a very hard task to write a crawler that guesses the page encoding correctly every time (browsers are very good at this nowadays). You can use modules like chardet, but in your case, for example, it guesses the encoding as ISO-8859-2, which is not correct either.
If you really need to be able to get the encoding of any page a user could possibly supply, you should either build a multi-level detection function (try utf-8, try latin-1, and so on, like we did in our project) or use the detection code from Firefox or Chromium as a C module.
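As a rough sketch of such a multi-level fallback function (illustrative only; the candidate list is an assumption you'd tune per project):

```python
def decode_with_fallbacks(raw, candidates=('utf-8', 'cp1252')):
    """Try each candidate encoding in order; fall back to a lossy decode.

    Note: latin-1/cp1252 accept almost any byte sequence, so such
    permissive encodings should come last in the candidate list.
    """
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep whatever is valid utf-8 and drop the rest.
    return raw.decode('utf-8', 'ignore')

print(decode_with_fallbacks('artículos'.encode('utf-8')))   # artículos
print(decode_with_fallbacks('artículos'.encode('latin-1'))) # artículos
```

The utf-8 attempt fails fast on latin-1 bytes, so both inputs round-trip to the same text.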
could you try:
r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())
r.close()
print x.prettify('latin-1')
I get the correct output.
Oh, and in this special case you could also use x.__str__(encoding='latin-1').
I guess this is because the content is in ISO-8859-1(5) and the meta http-equiv content-type incorrectly says "UTF-8".
Could you confirm?
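One way to check a suspicion like this (a sketch, not part of the original answer; the sample bytes are made up) is to extract the declared charset yourself and see whether the raw bytes actually decode with it:

```python
import re

# Made-up payload: a meta tag declaring UTF-8 over bytes that are
# actually latin-1 (b'caf\xe9' is not valid UTF-8).
raw = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">caf\xe9'

match = re.search(rb'charset=([-\w]+)', raw)
declared = match.group(1).decode('ascii').lower() if match else None

try:
    raw.decode(declared)
    consistent = True
except UnicodeDecodeError:
    consistent = False

print(declared)    # utf-8
print(consistent)  # False: the declaration lies about the bytes
```

When `consistent` is False, the meta declaration and the actual bytes disagree, which is exactly the situation described above.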
You can try this, which works for pages in any declared encoding (and falls back to BeautifulSoup's own detection otherwise):
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
headers = {"User-Agent": USERAGENT}
resp = requests.get(url, headers=headers)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
I'd suggest taking a more methodical, foolproof approach.
# 1. get the raw data
raw = urllib.urlopen('http://www.elnorte.ec/').read()
# 2. detect the encoding and convert to unicode
content = toUnicode(raw) # see my caricature for toUnicode below
# 3. pass unicode to beautiful soup.
soup = BeautifulSoup(content)
def toUnicode(s):
    if type(s) is unicode:
        return s
    elif type(s) is str:
        d = chardet.detect(s)
        (cs, conf) = (d['encoding'], d['confidence'])
        if conf > 0.80:
            try:
                return s.decode(cs, errors='replace')
            except Exception as ex:
                pass
    # force and return only ascii subset
    return unicode(''.join([i if ord(i) < 128 else ' ' for i in s]))
No matter what you throw at this, it will always send valid unicode to BeautifulSoup. As a result, your parse tree will behave much better and not fail in new and interesting ways every time you have new data.
Trial and error doesn't work in code; there are just too many combinations. :-)
The first answer is right; these functions are sometimes effective.
def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
       isinstance(number, float):
        converted_str = str(number)
    return converted_str

def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')

def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode
I spent the last 3 hours trying to solve this problem, even though there are plenty of solutions; they just don't work for me. I suspected the website I'm scraping might be corrupted, but Firefox shows the content perfectly. As I said, this has been asked before, but I think there is a difference in my code, and I want to learn what it is.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('link_for_scrapping').text
soup = BeautifulSoup(html_text, 'lxml')
print(soup.encoding)
soup.encoding = 'utf-8'
print(soup.encoding)
Output:
None
utf-8
Why is the encoding "None" at first? The content I'm looking for is written with Turkish characters, but in other people's code the encoding wasn't "None"; it was something like "ISO-xxxx-x".
Also, when I set it to "utf-8", nothing changes; the same weird characters are still there.
If we add this code, we can see it better:
menu = soup.find(class_="panel-grid-cell col-md-6").text
print(menu)
Output:
None
utf-8
1) 31.01.2022 Pazartesi Yemekler :
Mercimek Ãorba Fırın Patates Mor Dünya Salatası Sıhhiye Kırmızı Lahana Havuç Salata Elma *Etsiz PatatesKalori : 1099
Whether or not I change the encoding to utf-8, the problem persists.
Expected Output:
None
utf-8
1) 31.01.2022 Pazartesi Yemekler :
Mercimek Çorba Fırın Patates Mor Dünya Salatası Sıhhiye Kırmızı Lahana Havuç Salata Elma *Etsiz PatatesKalori : 1099
Thanks in advance!
The Problem:
import requests
r = requests.get('link')
print(r.encoding)
Output: ISO-8859-1
The server is not sending the appropriate header, requests doesn't parse <meta charset="utf-8" />, so it defaults to ISO-8859-1.
Solution 1: Tell requests what encoding to use
r.encoding = 'utf-8'
html_text = r.text
Solution 2: Do the decoding yourself
html_text = r.content.decode('utf-8')
Solution 3: Have requests take a guess
r.encoding = r.apparent_encoding
html_text = r.text
In any case, html_text will now contain the (correctly decoded) html source and can be fed to BeautifulSoup.
The encoding setting of BeautifulSoup didn't help, because at that point you already had a wrongly decoded string!
I have watched a video that teaches how to use BeautifulSoup and requests to scrape a website
Here's the code
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
pages_to_scrape = 1
for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    #print(soup.prettify())
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)
The code is working well, but in the results I noticed a weird character before the euro symbol, and when checking the HTML source I didn't find that character. Any idea why this character appears, and how it can be fixed? Is using replace enough, or is there a better approach?
It seems to me that you've framed your question wrongly. I assume you are using Windows, where your terminal/IDLE uses the default encoding of cp1252, but you are dealing with UTF-8, so you have to configure your terminal/IDLE to use UTF-8.
import requests
from bs4 import BeautifulSoup
def main(url):
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)

main("http://books.toscrape.com/catalogue/page-{}.html")
Try to always follow the DRY principle, which means "Don't Repeat Yourself".
Since you are dealing with the same host, you should maintain the same session instead of repeatedly opening a TCP socket stream, closing it, and opening it again. That can lead to your requests being blocked and treated as a DDoS attack when the TCP flags get captured by the back end. Imagine opening your browser, opening a website, closing the browser, and repeating that cycle!
Python functions usually look nicer and are easier to read than code that reads like journal text.
Notes: the usage of range(), the {} format string, and CSS selectors.
You could use page.content.decode('utf-8') instead of page.text. As people said in the comments, it is an encoding issue: .content returns the HTML as bytes, which you can then convert into a string with the right encoding using .decode('utf-8'), whereas .text returns a string with the wrong encoding (maybe cp1252). The final code may look like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
pages_to_scrape = 1
pages = []  # You forgot this line

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.content.decode('utf-8'), 'html.parser')  # Replace .text with .content.decode('utf-8')
    #print(soup.prettify())
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)
This should hopefully work.
P.S.: Sorry for writing this directly as an answer; I don't have enough reputation to write in comments. :D
I am trying to extract text from a Vietnamese website whose charset is UTF-8. However, the text I get always comes out as mangled ASCII-ish mojibake, and I can't find a way to convert it to unicode or get exactly the text shown on the website. As a result, I can't save it to a file as expected.
I know this is a very common problem with unicode in Python, but I still hope someone will help me figure it out. Thanks.
My code:
import requests, re, io
import simplejson as json
from lxml import html, etree
base = "http://www.amthuc365.vn/cong-thuc/"
page = requests.get(base + "trang-" + str(1) + ".html")
pageTree = html.fromstring(page.text)
links = pageTree.xpath('//ul[contains(@class, "mt30")]/li/a/@href')
names = pageTree.xpath('//h3[@class="title"]/a/text()')
for name in names[:1]:
    print name
    # Là m bánh oreo nhân bÆ¡ Äáºu phá»ng thÆ¡m bùi
but what I need is "Làm bánh oreo nhân bơ đậu phộng thơm bùi"
Thanks.
Just switching from page.text to page.content should make it work.
Explanation here.
Also see:
What is the difference between 'content' and 'text'
HTML encoding and lxml parsing
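To make the difference concrete, here is a toy illustration (not from the linked answers): .content holds the raw bytes, while .text is those bytes decoded with whatever encoding requests guessed, often ISO-8859-1 when the server omits a charset:

```python
# What response.content would hold for a UTF-8 page:
raw = 'Làm bánh oreo'.encode('utf-8')

# What response.text gives when requests wrongly guesses ISO-8859-1:
wrong = raw.decode('iso-8859-1')

# Decoding the raw bytes yourself with the real encoding:
right = raw.decode('utf-8')

print(wrong)  # mojibake, e.g. 'LÃ m bÃ¡nh...'
print(right)  # Làm bánh oreo
```

Passing the bytes (page.content) to lxml or BeautifulSoup lets the parser do its own detection instead of inheriting a bad guess.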
I'm trying to scrape links from Yahoo search results with the following Python code.
I'm using mechanize for the browser instance and BeautifulSoup for parsing the HTML.
The problem is that this script sometimes works fine and sometimes throws the following error:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
It's clearly something related to encoding and decoding, or gzip compression, I guess, but why does it work sometimes and not others? And how could it be fixed to work all the time?
The code follows. Run it 7-8 times and you will notice.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import mechanize
import urllib
from bs4 import BeautifulSoup
import re
#mechanize emulates a Browser
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent','chrome')]
term = "stock market".replace(" ","+")
query = "https://search.yahoo.com/search?q=" + term
htmltext = br.open(query).read()
htm = str(htmltext)
soup = BeautifulSoup(htm)
#Since all results are located in the ol tag
search = soup.findAll('ol')
searchtext = str(search)
#Using BeautifulSoup to parse the HTML source
soup1 = BeautifulSoup(searchtext)
#Each search result is contained within div tag
list_items = soup1.findAll('div', attrs={'class':'res'})
#List of first search result
list_item = str(list_items)
for li in list_items:
    list_item = str(li)
    soup2 = BeautifulSoup(list_item)
    link = soup2.findAll('a')
    print link[0].get('href')
    print ""
Here's an output screenshot:
http://pokit.org/get/img/1d47e0d0dc08342cce89bc32ae6b8e3c.jpg
I had issues with encoding on a project and developed a function to get the encoding of the page I was scraping; then you can decode to unicode for your function and try to prevent these errors. Regarding compression: what you need to do is write your code so that if it encounters a compressed response, it can deal with it.
from bs4 import BeautifulSoup, UnicodeDammit
import chardet
import re
def get_encoding(soup):
    """
    This is a method to find the encoding of a document.
    It takes in a BeautifulSoup object and retrieves the values of that document's meta tags.
    It checks for a meta charset first; if that exists, it returns it as the encoding.
    If charset doesn't exist, it checks for content-type and then content to try to find it.
    """
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                dic_of_possible_encodings = chardet.detect(unicode(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod
a link to deal with compressed data http://www.diveintopython.net/http_web_services/gzip_compression.html
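As a sketch of that idea (a hypothetical helper, not from the linked article): check the Content-Encoding header and gunzip the body only when needed:

```python
import gzip

def maybe_gunzip(body, content_encoding):
    """Return the response body, decompressed if it was gzip-encoded."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    return body

# Simulate a gzip-compressed response body:
compressed = gzip.compress(b'<html>stock market results</html>')
print(maybe_gunzip(compressed, 'gzip'))   # b'<html>stock market results</html>'
print(maybe_gunzip(b'plain body', None))  # b'plain body'
```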
from this question Check if GZIP file exists in Python
if any(os.path.isfile(f) for f in ['bob.asc', 'bob.asc.gz']):
    print 'yay'
I'm trying to get to grips with regex in Python. I'm writing a very simple script to scrape emails off a given URL.
import re
from urllib.request import *
url = input("Please insert the URL you wish to scrape> ")
page = urlopen(url)
content = page.read()
email_string = b'[a-z0-9_. A-Z]*@[a-z0-9_. A-Z]*.[a-zA-Z]'
emails_in_page = re.findall(email_string, content)
print("Here are the emails found: ")
for email in emails_in_page:
    print(email)
re.findall() returns a list, and when the program prints out the emails, the "b" from the regex string is included in the output, like this:
b'email1@email.com'
b'email2@email.com'
...
How can I have a clean list of emails printed out? (i.e. email1@email.com)
You are printing bytes objects. Decode them to strings:
encoding = page.headers.get_param('charset')
if encoding is None:
    encoding = 'utf8'  # sensible default

for email in emails_in_page:
    print(email.decode(encoding))
or decode the HTML page you retrieved:
encoding = page.headers.get_param('charset')
if encoding is None:
    encoding = 'utf8'  # sensible default

content = page.read().decode(encoding)
and use a unicode string regular expression:
email_string = '[a-z0-9_. A-Z]*@[a-z0-9_. A-Z]*.[a-zA-Z]'
Many webpages do not send a proper charset parameter in the content-type header, or set it wrong, so even the 'sensible default' can be wrong from time to time.
An HTML parsing library like BeautifulSoup would still do a better job of codec detection; it includes some more heuristics to make an educated guess:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.read(), from_encoding=page.headers.get_param('charset'))
for textelem in soup.find_all(text=re.compile(email_string)):
    print(textelem)