Can't properly display characters using BeautifulSoup - python

I am trying to scrape the names of settlements from a website using the BeautifulSoup library. The website uses the 'windows-1250' character set, but some of the characters are not displayed properly; see the last settlement name, which should be Župkov.
Could you help me with this problem?
This is the code:
# imports
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

# create beautifulsoup object
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
plain_text = source_code.text
obce_soup = BeautifulSoup(plain_text, 'html.parser')

# define bs filter
def soup_filter_1(tag):
    return tag.has_attr('href') and len(tag.attrs) == 1 and isinstance(tag.next_element, NavigableString)

# print settlement names
for tag in obce_soup.find_all(soup_filter_1):
    print(tag.string)
I am using Python 3.5.1 and beautifulsoup 4.4.1.

The problem is not with BeautifulSoup: it simply cannot determine what encoding you have (try print('encoding', obce_soup.original_encoding)), and that is caused by you handing it Unicode text instead of bytes.
If you try this:
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
data_bytes = source_code.content  # don't use .text, it will try to make Unicode
obce_soup = BeautifulSoup(data_bytes, 'html.parser')
print('encoding', obce_soup.original_encoding)
to create your BeautifulSoup object, you'll see that it now detects the encoding correctly and your output is OK.
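For illustration, here is a minimal sketch of the difference (the exact encoding printed is whatever BeautifulSoup's detector reports for this page, so treat the values in the comments as assumptions):

import requests
from bs4 import BeautifulSoup

obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
response = requests.get(obce_url)

# Unicode in: there is nothing left to detect, so original_encoding is None
soup_from_text = BeautifulSoup(response.text, 'html.parser')
print(soup_from_text.original_encoding)   # None

# bytes in: BeautifulSoup runs its encoding detection on the raw data
soup_from_bytes = BeautifulSoup(response.content, 'html.parser')
print(soup_from_bytes.original_encoding)  # e.g. 'windows-1250'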

Since you know the site's encoding, you can just pass it explicitly to the BeautifulSoup constructor, along with the response's content rather than its text:
source_code = requests.get(obce_url)
content = source_code.content
obce_soup = BeautifulSoup(content, 'html.parser', from_encoding='windows-1250')
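Note that from_encoding overrides BeautifulSoup's own detection entirely, so it is only appropriate when you are certain the document really is windows-1250.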

The server probably sends HTTP headers that declare the character set as UTF-8, while the actual HTML uses Windows-1250, so requests uses UTF-8 to decode the HTML data.
But you can take the original bytes from source_code.content and use decode('cp1250') to get the correct characters:
plain_text = source_code.content.decode('cp1250')
Or you can set the encoding manually before you access the text:
source_code.encoding = 'cp1250'
plain_text = source_code.text
You can also hand the original bytes source_code.content to BeautifulSoup, so that it can use the encoding information declared inside the HTML itself:
obce_soup = BeautifulSoup(source_code.content, 'html.parser')
You can then check what the document declares:
print(obce_soup.declared_html_encoding)
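A small diagnostic sketch along these lines may help to see where the two declarations disagree (the values in the comments are assumptions about what this particular server sends):

import requests
from bs4 import BeautifulSoup

obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)

# what the HTTP layer claims, and what requests will therefore use for .text
print(source_code.headers.get('Content-Type'))  # e.g. 'text/html; charset=UTF-8'
print(source_code.encoding)

# what the HTML document itself declares in its <meta> tag
obce_soup = BeautifulSoup(source_code.content, 'html.parser')
print(obce_soup.declared_html_encoding)         # e.g. 'windows-1250'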

Related

Python web scraping - Why do I get this error?

I want to get text from a website using bs4, but I keep getting this error and I don't know why. This is the error: TypeError: slice indices must be integers or None or have an __index__ method.
This is my code:
from urllib.request import urlopen
import bs4
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
text = html.find("div", {"class":"gc-score__title"})  # the error is in this line
print(text)
On this line:
text = html.find("div", {"class":"gc-score__title"})
you are calling the str.find method, not the bs4.BeautifulSoup.find method, because html is a plain string.
So if you do
soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.find("div", {"class":"gc-score__title"})
print(text)
you will get rid of the error.
That said, the site is using JavaScript, so this will not yield what you expect. You will need to use tools like Selenium to scrape this site.
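For instance, a minimal Selenium sketch (assuming Selenium 4 with a Chrome driver available; the gc-score__title class comes from the question and only appears once the page's JavaScript has run):

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"

driver = webdriver.Chrome()  # assumes a Chrome driver is available
try:
    driver.get(url)
    driver.implicitly_wait(10)  # wait up to 10 s for JavaScript-rendered elements
    element = driver.find_element(By.CSS_SELECTOR, "div.gc-score__title")
    print(element.text)
finally:
    driver.quit()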
First, if you want BeautifulSoup to parse the data, you need to ask it to do that.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
soup = BeautifulSoup(html_bytes, "html.parser")
Then you can use soup.find to find <div> tags:
text = soup.find("div", {"class":"gc-score__title"})
That will eliminate the error. You were calling str.find because html is a string, and to pick tags out you need to call the find method of a bs4.BeautifulSoup object.
But besides eliminating the error, that line won't do what you want: it returns nothing, because the data at that URL does not contain the tag <div class="gc-score__title">.
Copy the contents of html_bytes into a text editor to confirm this.
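A quick way to confirm this programmatically, as a one-line check on the html_bytes already downloaded above:

# the class name never appears in the raw HTML; it is added later by JavaScript
print(b"gc-score__title" in html_bytes)  # expected: False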

BeautifulSoup shows strange text

I am trying to scrape data from a Bengali-language website.
When I inspect the elements on that website, everything looks as it should.
Code:
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content, "lxml")
print(soup.prettify())
Part of the output:
<strong>
সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾
</strong>
সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾ should be "সচরাচর জিজ্ঞাসা"
I am not sure if it is ASCII or not. I used https://onlineasciitools.com/convert-ascii-to-unicode to convert that text into Unicode. As per that website, it may be ASCII, but I checked an ASCII table online and none of those characters were in it. So now I need to convert that text into something readable. Any help?
You should just decode the content, like this:
request.content.decode('utf-8')
Yes, that works. You need to decode('utf-8') the request's response content:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content.decode('utf-8'), "lxml")
my_data = soup.find('div', {'class':'col-md-6 col-sm-6 col-xs-12 slider-button-center xs-mb-15'})
print(my_data.get_text(strip=True, separator='|'))
print output:
স্বাস্থ্য বিষয়ক সেবা|(ডাক্তার, হাসপাতাল, ঔষধ, টেস্ট)|খাদ্য ও জরুরি সেবা|(খাদ্য, অ্যাম্বুলেন্স, ফায়ার সার্ভিস)|সচরাচর জিজ্ঞাসা|FAQ
The response returned by requests.get() exposes both the raw byte content (request.content) and a decoded version of it.
request.encoding is the actual encoding (which may not be UTF-8), and request.text is the already-decoded content.
Example using request.text instead:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.text, "lxml")
print(soup.find('title'))
<title>করোনা ভাইরাস ইনফো ২০১৯ | Coronavirus Disease 2019 (COVID-19) Information Bangladesh | corona.gov.bd</title>
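If the decoded text ever looks wrong, a short sketch like this shows what requests decided (the printed values are assumptions about this particular server):

import requests

request = requests.get("https://corona.gov.bd/")
print(request.encoding)           # encoding taken from the HTTP headers, e.g. 'utf-8'
print(request.apparent_encoding)  # encoding guessed from the body bytes by charset detection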

How to stop BeautifulSoup from decoding HTML entities into symbols

I am trying to get all the links on a given website but am stuck on a problem with HTML entities. Here's my code that crawls websites using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
.
.
baseRequest = requests.get("https://www.example.com", SOME_HEADER_SETTINGS)
soup = BeautifulSoup(baseRequest.content, "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])
.
.
print(pageLinks)
The code becomes problematic when it sees this kind of element:
<a href="./page?id=123&sect=2">Link</a>
Instead of printing ["./page?id=123&sect=2"], it treats the &sect part as an HTML entity and shows this in the console:
["./page?id=123§=2"]
Is there a way to prevent this?
Here is one:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="./page?id=123&sect=2">Link</a>', "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])

# bytes view of the joined links, after the parser has already turned &sect into §
uncoded = ''.join(pageLinks).encode('utf-8')
# rebuild the same string character by character (an identity mapping, shown for comparison)
decoded = ''.join(map(lambda x: chr(ord(x)), ''.join(pageLinks)))
print('uncoded =', uncoded)
print('decoded =', decoded)
output
uncoded = b'./page?id=123\xc2\xa7=2'
decoded = ./page?id=123§=2
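Note that the snippet above only shows the byte and string views of the already-converted link. If the goal is to keep the raw href, one workaround (a sketch, not part of the original answer) is to escape the ampersands before the parser sees them, so &sect survives as literal text:

from bs4 import BeautifulSoup

raw_html = '<a href="./page?id=123&sect=2">Link</a>'

# turn every '&' into '&amp;' so the parser cannot interpret '&sect' as an entity;
# '&amp;' itself decodes back to a literal '&' during parsing
soup = BeautifulSoup(raw_html.replace('&', '&amp;'), 'html.parser')

print([a['href'] for a in soup.find_all('a')])  # ['./page?id=123&sect=2']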

BeautifulSoup does not work on subsequent pages

I cannot get the title of subsequent pages. Where is the problem?
from bs4 import BeautifulSoup
import urllib.request
# First page
source = urllib.request.urlopen('https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=').read()
soup = BeautifulSoup(source,'lxml')
print(soup.title) # shows title as expected
# Second page
source = urllib.request.urlopen('https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page=2').read()
soup = BeautifulSoup(source,'lxml')
print(soup.title) # shows None
I am not sure why only your second case fails. As mentioned in other SO threads, sometimes switching parsers helps.
I could get the second page to work with html.parser, though it printed a warning about decoding errors.
from bs4 import BeautifulSoup
import urllib.request
# Second page
source = urllib.request.urlopen('https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page=2').read()
soup = BeautifulSoup(source,'html.parser')
print(soup.title) # Now works
Output
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<title>YENIEMLAK.AZ Satılır Bina ev menzil </title>
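If the replacement characters matter, one option is to check which charset the server declares and decode the bytes yourself before parsing; a sketch of that idea:

from bs4 import BeautifulSoup
import urllib.request

url = 'https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page=2'
response = urllib.request.urlopen(url)

# charset from the Content-Type header, falling back to UTF-8 if none is declared
charset = response.headers.get_content_charset() or 'utf-8'
text = response.read().decode(charset, errors='replace')

soup = BeautifulSoup(text, 'html.parser')
print(soup.title)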

Python: parsing UNICODE characters using bs4

I am building a Python 3 web crawler/scraper using bs4. The program crashes whenever it meets a Unicode character such as a Chinese symbol. How do I modify my scraper so that it supports Unicode?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup

def crawlForData(url):
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)

url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)
In Python 3 every str is already Unicode, so there is no unicode() call to make; what usually helps is decoding the raw bytes yourself and dropping anything that cannot be decoded:
content.decode('utf-8', 'ignore')
where content is your bytes.
The complete solution may be:
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content, 'html.parser')
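Note that with this stack the crash often happens at print() time, when the console's encoding (for example cp437 on Windows) cannot represent the character, rather than inside BeautifulSoup. A small guard for that case (Python 3.7+ for reconfigure, hence the hasattr check):

import sys

# ask stdout to emit UTF-8 regardless of the console default
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print('\u4e2d\u6587')  # prints 中文 instead of raising UnicodeEncodeError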
