BeautifulSoup returning messed-up arabic characters - python

I am scraping an arabic website using BeautiifulSoup but the arabic characters returned are returned inverted and separate chars (pasting it here correctly encodes it so you have to trust me on that :).
The website charset is using UTF-8
<meta charset=UTF-8>
This is how I am parsing it:
url = 'https://new.kooora4live.net/matches-today-1/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml', from_encoding='utf-8')
Writing the requested HTML to a file with utf-8 encoding correctly formats it in the file so it seems as it's an issue with BeautifulSoup.
Any idea what am I doing wrong or how to fix it?
....
Update:
Encoding with utf-8-sig also doesn't work.

You need to set the page encoding to match its apparent encoding.
Try this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://new.kooora4live.net/matches-today-1/')
page.encoding = page.apparent_encoding
soup = BeautifulSoup(page.content, 'lxml').select("a")
print("\n".join(a.getText(strip=True) for a in soup))
This will print out:
الأخبار
أهم المباريات
جداول
ترتيب الفرق
ترتيب الهدافين
مباريات الأمس
مباريات اليوم
مباريات الغد
جمهورية التشيك7:00 PM0-0لم تبدأ بعدالدنماركبي ان ماكس 1احمد البلوشييورو 2020
اوكرانيا10:00 PM0-0لم تبدأ بعدإنجلترابي ان ماكس 1حسن العيدروسيورو 2020
and more ...

Related

BeautifulSoup shows strange text

I am trying to scrape data from a Bengali (language) website.
When I inspect element on that website, everything is as it should.
code:
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content, "lxml")
print(soup.prettify())
Part of the output:
<strong>
সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾
</strong>
সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾ >> should be >>"সচরাচর জিজ্ঞাসা"
I am not sure if it is ASCII or not. I used https://onlineasciitools.com/convert-ascii-to-unicode to convert that text into Unicode. As per this website, it may be ASCII. But I checked an ASCII table online and none of those characters were in it. So now I need to convert those text into readable stuff. Any help?
You should just decode the content, like this:
request.content.decode('utf-8')
Yes, its work. You need to decode('utf-8') request response.
import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content.decode('utf-8'), "lxml")
my_data = soup.find('div', {'class':'col-md-6 col-sm-6 col-xs-12 slider-button-center xs-mb-15'})
print(my_data.get_text(strip=True, separator='|'))
print output:
্বাস্থ্য বিষয়ক সেবা|(ডাক্তার, হাসপাতাল, ঔষধ, টেস্ট)|খাদ্য ও জরুরি সেবা|(খাদ্য, অ্যাম্বুলেন্স, ফায়ার সার্ভিস)|সচরাচর জিজ্ঞাসা|FAQ
The request returned by requests.get() returns both the raw byte content (request.content) and and the content decoded by the encoding declared in the content.
request.encoding is the actual encoding (which may not be UTF-8), and request.text is the already-decoded content.
Example using request.text instead:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.text, "lxml")
print(soup.find('title'))
<title>করোনা ভাইরাস ইনফো ২০১৯ | Coronavirus Disease 2019 (COVID-19) Information Bangladesh | corona.gov.bd</title>

Issue with parsing special characters in a utf-8 encoded page with bs4

I’m trying to parse a page and I’m having some issue with special characters such as é è à, etc.
According to the Firefox page information tool, the page is encoded in UTF - 8
My code is the following :
import bs4
import requests
url = 'https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html'
page = requests.get(url)
cae_obj_soup = bs4.BeautifulSoup(page.text, 'lxml', from_encoding='utf-8')
list_all_domain = cae_obj_soup.find_all('th')
for element in list_all_domain:
print(element.get_text())
The output is :
Pêche et piégeage
Exploitation forestière
I tried changing the encoding with iso-8859-1 (French encoding) and some other encodings without success. I read several posts on parsing special characters, and they basically states that it’s an issue of selecting the right encoding. Is there a possibility that I can’t decode correctly the special characters on some specific webpage or am I doing something wrong ?
The requests library takes a strict approach to the decoding of web pages. On the other hand, BeautifulSoup has powerful tools for determining the encoding of text. So it's better to pass the raw response from the request to BeautifulSoup, and let BeautifulSoup try to determine the encoding.
>>> r = requests.get('https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> list_all_domain = soup.find_all('th')
>>> [e.get_text() for e in list_all_domain]
['Agriculture', "Services relatifs à l'agriculture", 'Pêche et piégeage', ...]

Can't properly display characters using BeautifulSoup

I am trying to scrape some names of settlements from a website using a BeautifulSoup library. The website uses the 'windows-1250' character set, but some of the characters are not displayed properly. See the last name of the settlement, which should be Župkov.
Could you help me with this problem?
This is the code:
# imports
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
# create beautifulsoup object
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
plain_text = source_code.text
obce_soup = BeautifulSoup(plain_text, 'html.parser')
# define bs filter
def soup_filter_1(tag):
return tag.has_attr('href') and len(tag.attrs) == 1 and isinstance(tag.next_element, NavigableString)
# print settlement names
for tag in obce_soup.find_all(soup_filter_1):
print(tag.string)
I am using Python 3.5.1 and beautifulsoup 4.4.1.
The problem is not with beautifulsoup, it just cannot determine what encoding you have ( try print('encoding', obce_soup.original_encoding)) and this is caused by you handing it Unicode instead of bytes.
If you try this:
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
data_bytes = source_code.content # don't use .text it will try to make Unicode
obce_soup = BeautifulSoup(data_bytes, 'html.parser')
print('encoding', obce_soup.original_encoding)
to create your beautifulsoup object, you'll see it now gets the encoding right and your output is OK.
Since you know site's encoding, you can just pass it explicitly to BeautifulSoup constructor with response's content, not text:
source_code = requests.get(obce_url)
content = source_code.content
obce_soup = BeautifulSoup(content, 'html.parser', from_encoding='windows-1250')
Probably the server sends HTTP headers which specify the character set as UTF-8 but then the actual HTML uses Win-1250. So requests uses UTF-8 to decode the HTML data.
But you can get original data source_code.content and use decode('cp1250') to get correct characters.
plain_text = source_code.content.decode('cp1250')
Or you can manually set encoding before you get text
source_code.encoding = 'cp1250'
plain_text = source_code.text
You can also use the original data source_code.content in BS so it should use information within the HTML about its encoding
obce_soup = BeautifulSoup(source_code.content, 'html.parser')
see
print(obce_soup.declared_html_encoding)

Python: parsing UNICODE characters using bs4

I am building a python3 web crawler/scraper using bs4. The program crashes whenever it meets a UNICODE code character like a Chinese symbol. How do I modify my scraper so that it supports UNICODE?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup
def crawlForData(url):
r = urllib.request.urlopen(url)
soup = BeautifulSoup(r.read(),'html.parser')
result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
for p in result:
print(p)
url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)
You can try unicode() method. It decodes unicode strings.
or a way to go is
content.decode('utf-8','ignore')
where content is your string
The complete solution may be:
html = urllib2.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)

How to extract Urdu text from a webpage using beautifulsoup

I am using bs4 to extract text from a web document. But Its output is very strange. like
Ú©Ø¨Ú¾Û ÛÛ Ø¨Ø§Øª سÙجھ ÙÛÚº ÙÛ Ø§ÙØªÛ ØªÚ¾ÛÛ Ù¾Ú¾Ø± اÙÛØ³ØªÛ Ø§ÙÛØ³ØªÛ Ø¬Ø¨ Ú©ÚÚ¾ عÙ٠اÙÙÛ Ø´Ø±Ùع ÛÙØ¦Û ØªÙ Ø¨Ø§Øª Ú©ÚÚ¾ Ù¾ÙÛ Ù¾ÚÛÛÙÛک٠اب ÛÛ Ø¨Ø§Øª اÛØ³Û Ø³Ùجھ ÙÛÚº اÙØ¦Û Ú©Û Ø³ÙÚا اÙÙ¾ Ú©ÛÙÙÚ¯ÙÚº Ú©Û Ø¨Ú¾Û Ø´Ûئر کرÙÚºÛ ÚÙد رÙز Ùب٠ÙÛرا 8 Ù
I think it is some encoding. I am a new user of bs4. Please guide me how to decode it to show as urdu text.
Here is a document source whose title I want to extract
Follwoing code I am using to do it.
from bs4 import BeautifulSoup
import urllib2
import requests
url="http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'lxml')
print str(soup.title)
Burhan Khalid's answer works, but because the original web page is encoded in utf-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
You should update the requests' response field to match the original page's encoding:
from bs4 import BeautifulSoup
import urllib2
import requests
url="http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
# Update encoding to match source
r.encoding = "utf-8"
data = r.text
soup = BeautifulSoup(data,'lxml')
print str(soup.title)
Now any field you access will have the correct encoding rather than having to set to Urdu on a per field basis.
If you simply try to print the string, you'll get garbage characters out:
>>> import requests
>>> from bs4 import BeautifulSoup as bs4
>>> r = requests.get('http://blog.jang.com.pk/blog_details.asp?id=11058')
>>> s = bs4(r.text, 'lxml')
>>> print s.title.text
Ú©ÚÚ¾ تÙØ¬Û Ø§Ø³ طر٠بھÛ!
You need to encode it properly, since the result is a unicode bytestring.
>>> print s.title.text.encode('iso-8859-1')
کچھ توجہ اس طرف بھی!
If it displays the glyphs correctly, but in the wrong order (ie, they are not right-to-left), then this is a problem with your operating system/terminal/shell/program you are using to run the application.
The above is from gnome-terminal, which doesn't support Arabic RTL properly.
If I run the same code in mlterm:
The white box is there because I am using an Arabic font, which doesn't have all the characters in the Urdu language.
I think what is happening is that there some badly formed Unicode in the website response:
----> 1 r.content.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 1106: invalid continuation byte
Hence the text is being decoded using the ANSI codec which is obviously wrong. You can work around this issue by calling decode with the option errors='ignore' (we are using the content rather than text because this is the raw binary response from the website:
data = r.content.decode(errors='ignore')
soup = BeautifulSoup(data,'lxml')
print (str(soup.title))
<title>کچھ توجہ اس طرف بھی!</title>

Categories