Python: parsing Unicode characters using bs4

I am building a Python 3 web crawler/scraper using bs4. The program crashes whenever it meets a Unicode character such as a Chinese character. How do I modify my scraper so that it supports Unicode?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup

def crawlForData(url):
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)

url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)

The unicode() built-in only exists in Python 2, so it won't help here. On Python 3 you decode the raw bytes yourself, ignoring anything that isn't valid UTF-8:
content.decode('utf-8', 'ignore')
where content is the bytes read from the response.
The complete solution may be:
html = urllib.request.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content, 'html.parser')
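Note that the crash may also happen at print() rather than during parsing, especially on Windows consoles whose default encoding cannot represent CJK characters. A minimal sketch of a workaround, assuming Python 3.7+:
import sys

# Reconfigure stdout as UTF-8 so printing CJK text doesn't raise
# UnicodeEncodeError on consoles with a narrow default encoding.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')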

Related

How to scrape Bengali characters properly with Python

I want to scrape Bengali characters properly with Python. I have already tried decoding the text, but it gives me an error, and when I encode it I get unreadable output.
from bs4 import BeautifulSoup
import requests

url = 'https://sattacademy.com/job-solution/view?cat_id=1&sub_cat_id=893'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
anchor = soup.findAll("a", attrs={"class": "nav-link"})
for ref in anchor:
    print(ref.text)
This is my output: garbled, unreadable characters instead of the Bengali text. But I want it displayed clearly. Thanks.
This link might help you; it shows how to print UTF-8 characters in Python:
Printing unicode number of chars in a string (Python)
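The garbling usually means requests guessed the wrong encoding for the response body. A minimal sketch of the usual fix, assuming the page is actually served as UTF-8:
from bs4 import BeautifulSoup
import requests

url = 'https://sattacademy.com/job-solution/view?cat_id=1&sub_cat_id=893'
res = requests.get(url)
res.encoding = 'utf-8'  # override requests' guess before reading res.text
soup = BeautifulSoup(res.text, 'html.parser')
for ref in soup.find_all("a", attrs={"class": "nav-link"}):
    print(ref.text)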

I'm trying to get Japanese sentences with Python 3.8, but I'm only getting unintelligible characters. Can somebody help me?

I'm using this Python script:
from lxml import html
import requests

page = requests.get('https://www.tanoshiijapanese.com/dictionary/sentences.cfm?j=%E6%BC%A2%E5%AD%97&e=&search=Search+%3E')
tree = html.fromstring(page.content)
sentences = tree.xpath('//div[@class="sentence"]/div[@class="sm"]/div[@class="jp"]/text()')
print('Sentences: ', sentences)
and I'm getting this
Sentences: ['ä»\x8aæ\x97¥ã\x81¯æ¼¢å\xad\x97ã\x81®æ\x9b¸ã\x81\x8då\x8f\x96ã\x82\x8aã\x81\x8cã\x81\x82ã\x82\x8bã\x80\x82', 'æ¼¢å\xad\x97ã\x82\x92æ\x9b¸ã\x81\x8fã\x81¨ã\x81\x8dã\x81¯ç\x82¹ã\x82\x84ã\x81¯ã\x82\x89ã\x81\x84ã\x81«æ°\x97ã\x82\x92ã\x81¤ã\x81\x91ã\x81¦ã\x80\x81ã\x81ªã\x82\x8bã\x81¹ã\x81\x8fæ\x97©ã\x81\x8fã\x81¦ã\x81\x84ã\x81\xadã\x81\x84ã\x81«æ\x9b¸ã\x81\x8dã\x81¾ã\x81\x97ã\x82\x87ã\x81\x86ã\x80\x82', '横ç\x9d\x80ã\x81\x97ã\x81ªã\x81\x84ã\x81§ã\x80\x81æ\x95\x99ã\x82\x8fã\x81£ã\x81\x9fæ¼¢å\xad\x97ã\x82\x92使ã\x81\x84ã\x81ªã\x81\x95ã\x81\x84ã\x80\x82', 'ã\x80\x8cé\x81\x93ã\x80\x8dã\x81¨ã\x81\x84ã\x81\x86æ¼¢å\xad\x97ã\x81®ç·\x8fç\x94»æ\x95°ã\x81¯ä½\x95ç\x94»ã\x81§ã\x81\x99ã\x81\x8bã\x80\x82', 'å½¼ã\x81¯æ¼¢å\xad\x97ã\x81\x8cå\x85¨ã\x81\x8fæ\x9b¸ã\x81\x91ã\x81ªã\x81\x84ã\x80\x82', 'å\x90\x9bã\x81¯ã\x81\x93ã\x81®æ¼¢å\xad\x97ã\x81\x8cèª\xadã\x82\x81ã\x81¾ã\x81\x99ã\x81\x8bã\x80\x82', 'æ¼¢å\xad\x97ã\x81¯èª\xadã\x82\x80ã\x81®ã\x81\x8cé\x9b£ã\x81\x97ã\x81\x84ã\x80\x82', 'ã\x81\x93ã\x81®æ¼¢å\xad\x97ã\x81¯ã\x81©ã\x81\x86ã\x81\x84ã\x81\x86æ\x84\x8få\x91³ã\x81§ã\x81\x99ã\x81\x8bã\x80\x82', 'ã\x81\x84ã\x81\x8bã\x81«ã\x82\x82ã\x83ªã\x82¾ã\x83¼ã\x83\x88ã\x81£ã\x81¦ã\x81\x8bã\x82\x93ã\x81\x98ã\x81®æ\xa0¼å¥½ã\x81\xadã\x80\x82', 'æ¼¢å\xad\x97ã\x82\x92å°\x91ã\x81\x97æ\x95\x99ã\x81\x88ã\x81¦ã\x81\x8fã\x81\xa0ã\x81\x95ã\x81\x84ã\x80\x82', '彼女ã\x81¯ã\x81\x93ã\x82\x93ã\x81ªé\x9b£ã\x81\x97ã\x81\x84æ¼¢å\xad\x97ã\x82\x82èª\xadã\x82\x81ã\x81¾ã\x81\x99ã\x80\x82', 'ã\x83\x88ã\x83\x9eã\x81\x95ã\x82\x93ã\x81¯å°\x8få\xad¦ç\x94\x9få\x90\x91ã\x81\x91ã\x81®æ\x9c¬ã\x81\x8cèª\xadã\x82\x81ã\x82\x8bã\x81\x90ã\x82\x89ã\x81\x84æ¼¢å\xad\x97ã\x82\x92ã\x81\x9fã\x81\x8fã\x81\x95ã\x82\x93è¦\x9aã\x81\x88ã\x81¦ã\x81\x84ã\x81¾ã\x81\x99ã\x80\x82', 'ä¸\xadå\x9b½ã\x81§ã\x81¯æ¼¢å\xad\x97ã\x81®å\xad\x97æ\x95°ã\x81\x8cå¤\x9aã\x81\x84ã\x81\x8bã\x82\x89 è¤\x87é\x9b\x91ã\x81ªç¹\x81ä½\x93å\xad\x97ã\x82\x92ã\x82\x84ã\x82\x81ã\x81¦è¦\x9aã\x81\x88ã\x82\x84ã\x81\x99ã\x81\x84ç°¡ä½\x93å\xad\x97ã\x81«ç½®ã\x81\x8dæ\x8f\x9bã\x81\x88ã\x80\x81è\xad\x98å\xad\x97ç\x8e\x87ã\x82\x92é«\x98ã\x82\x81ã\x82\x8bã\x81\x93ã\x81¨ã\x81\x8cç°¡ä½\x93å\xad\x97æ\x8e¨é\x80²ã\x81®ç\x9b®ç\x9a\x84ã\x81§ã\x81\x97ã\x81\x9fã\x80\x82', 'ï¼\x94ç´\x9aã\x81®æ¼¢å\xad\x97ã\x82\x92ã\x81©ã\x82\x8cã\x81\xa0ã\x81\x91è¦\x9aã\x81\x88ã\x81¦ã\x81\x84ã\x81¾ã\x81\x99ã\x81\x8bã\x80\x82', 'ã\x81\x9dã\x82\x93ã\x81ªæ¼¢å\xad\x97ã\x81¯å\x83\x95ã\x81\x8cèª\xadã\x82\x81ã\x81ªã\x81\x84ã\x81»ã\x81©ã\x81\x9fã\x81\x84ã\x81¸ã\x82\x93è¤\x87é\x9b\x91ã\x81ªã\x82\x93ã\x81\xa0ã\x80\x82', 'æ\x97¥æ\x9c¬èª\x9eã\x81¨ä¸\xadå\x9b½èª\x9eã\x81®æ¼¢å\xad\x97ã\x81®ç\x99ºé\x9f³ã\x81¯ã\x81¨ã\x81¦ã\x82\x82é\x81\x95ã\x81\x84ã\x81¾ã\x81\x99ã\x81\xadã\x80\x82', 'ç§\x81ã\x81¯æ¼¢å\xad\x97ã\x82\x92å\x8b\x89å¼·ã\x81\x97ã\x81¦ã\x81\x84ã\x81¾ã\x81\x99ã\x80\x82', 'æ¼¢å\xad\x97ã\x82\x92èª\xadã\x82\x80ã\x81®ã\x81¯é\x9b£ã\x81\x97ã\x81\x84ã\x81§ã\x81\x99ã\x80\x82', 'ã\x81\x93ã\x81®æ¼¢å\xad\x97ã\x81®èª\xadã\x81¿ã\x81\x8bã\x81\x9fã\x81¯ä½\x95ã\x81§ã\x81\x97ã\x82\x87ã\x81\x86ã\x81\x8bã\x80\x82', 'æ\x97¥æ\x9c¬ã\x81§ã\x81¯å®\x89ã\x81\x84æ¼¢å\xad\x97ã\x81®è¾\x9eæ\x9b¸ã\x81\x8cã\x81\x82ã\x82\x8cã\x81°ã\x80\x81è²·ã\x81\x84ã\x81¾ã\x81\x99ã\x80\x82']
Try to get information with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'https://www.tanoshiijapanese.com/dictionary/sentences.cfm?j=%E6%BC%A2%E5%AD%97&e=&search=Search+%3E'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for div in soup.select('.jp'):
    print(div.text)
Prints:
今日は漢字の書き取りがある。
漢字を書くときは点やはらいに気をつけて、なるべく早くていねいに書きましょう。
横着しないで、教わった漢字を使いなさい。
「道」という漢字の総画数は何画ですか。
彼は漢字が全く書けない。
君はこの漢字が読めますか。
漢字は読むのが難しい。
この漢字はどういう意味ですか。
いかにもリゾートってかんじの格好ね。
漢字を少し教えてください。
彼女はこんな難しい漢字も読めます。
トマさんは小学生向けの本が読めるぐらい漢字をたくさん覚えています。
中国では漢字の字数が多いから 複雑な繁体字をやめて覚えやすい簡体字に置き換え、識字率を高めることが簡体字推進の目的でした。
4級の漢字をどれだけ覚えていますか。
そんな漢字は僕が読めないほどたいへん複雑なんだ。
日本語と中国語の漢字の発音はとても違いますね。
私は漢字を勉強しています。
漢字を読むのは難しいです。
この漢字の読みかたは何でしょうか。
日本では安い漢字の辞書があれば、買います。
Note: It also depends on whether your terminal can display Unicode characters. If you see garbled text, try setting your terminal to UTF-8.
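If you want to stay with lxml instead of switching to BeautifulSoup, the same idea applies: make sure the bytes are decoded as UTF-8 before parsing. A sketch, assuming the page really is UTF-8:
from lxml import html
import requests

page = requests.get('https://www.tanoshiijapanese.com/dictionary/sentences.cfm?j=%E6%BC%A2%E5%AD%97&e=&search=Search+%3E')
page.encoding = 'utf-8'  # requests may otherwise guess ISO-8859-1 and produce mojibake
tree = html.fromstring(page.text)
sentences = tree.xpath('//div[@class="sentence"]/div[@class="sm"]/div[@class="jp"]/text()')
print('Sentences:', sentences)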

regex and urllib.request to scrape links from HTML

I am trying to parse an HTML page to extract all values matching this regex construction:
href="http://.+?"
This is the code:
import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"', html)
for link in links:
    print(link)
But I am getting an error saying:
TypeError: cannot use a string pattern on a bytes-like object
urlopen(url) returns a response object whose read() method gives you bytes, so your html variable contains bytes as well. You can decode it using something like this:
htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')
Then you can use html, which is now a string, in your regex.
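Hard-coding 'utf-8' works for most pages, but the server usually declares its charset in the Content-Type header. A slightly more defensive sketch (falling back to UTF-8 is an assumption, not something the standard library does for you):
import urllib.request

url = input('Enter - ')
response = urllib.request.urlopen(url)
# Use the charset the server declared; fall back to UTF-8 if it sent none.
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)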
Your html is a byte string; one quick fix is str(html):
re.findall(r'href="(http://.*?)"', str(html))
(this matches against the b'...' repr, escape sequences and all, so it is a bit of a hack). The cleaner option is a byte pattern:
re.findall(rb'href="(http://.*?)"', html)
If you would like all the links on a page, you don't even have to use regex, because you can just use bs4 to get what you need :-)
import requests
import bs4

soup = bs4.BeautifulSoup(requests.get('https://dr.dk/').text, 'lxml')
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
Hope it was helpful. Good luck with the project ;-)

How to search for a specific unicode string when web scraping?

I recently got interested in web scraping with Python and have tried it on some simple examples, but I don't know how to handle languages that don't use the ASCII character set, for example when searching for a specific string in the HTML file or writing those strings to a file.
from urllib.parse import urljoin
import requests
import bs4

website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
soup1 = bs4.BeautifulSoup(requests.get(book_url).text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
Looking at the website under book_url, each row has different text, but the text is in Persian.
Let's say I need the last row to be considered.
The text is "صدای کل کتاب".
How can I search for this string in <li>, <div>, and <a> tags?
You need to set the encoding on the requests response to UTF-8; it looks like the requests module was not using the decoding you wanted. As mentioned in this SO post, you can tell requests what encoding to expect.
from urllib.parse import urljoin
import requests
import bs4

website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
The only change here is
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')

Can't properly display characters using BeautifulSoup

I am trying to scrape the names of settlements from a website using the BeautifulSoup library. The website uses the 'windows-1250' character set, but some of the characters are not displayed properly; see the last settlement name, which should be Župkov.
Could you help me with this problem?
This is the code:
# imports
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

# create beautifulsoup object
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
plain_text = source_code.text
obce_soup = BeautifulSoup(plain_text, 'html.parser')

# define bs filter
def soup_filter_1(tag):
    return tag.has_attr('href') and len(tag.attrs) == 1 and isinstance(tag.next_element, NavigableString)

# print settlement names
for tag in obce_soup.find_all(soup_filter_1):
    print(tag.string)
I am using Python 3.5.1 and beautifulsoup 4.4.1.
The problem is not with BeautifulSoup; it just cannot determine what encoding you have (try print('encoding', obce_soup.original_encoding)), and this is caused by handing it already-decoded Unicode text instead of bytes.
If you try this:
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
data_bytes = source_code.content # don't use .text it will try to make Unicode
obce_soup = BeautifulSoup(data_bytes, 'html.parser')
print('encoding', obce_soup.original_encoding)
to create your beautifulsoup object, you'll see it now gets the encoding right and your output is OK.
Since you know the site's encoding, you can just pass it explicitly to the BeautifulSoup constructor along with the response's content, not its text:
source_code = requests.get(obce_url)
content = source_code.content
obce_soup = BeautifulSoup(content, 'html.parser', from_encoding='windows-1250')
Probably the server sends HTTP headers that declare the character set as UTF-8, but the actual HTML uses Windows-1250, so requests decodes the HTML data as UTF-8.
You can instead take the original bytes from source_code.content and use decode('cp1250') to get the correct characters:
plain_text = source_code.content.decode('cp1250')
Or you can set the encoding manually before you access the text:
source_code.encoding = 'cp1250'
plain_text = source_code.text
You can also hand the original bytes, source_code.content, to BeautifulSoup so it can use the encoding information inside the HTML itself:
obce_soup = BeautifulSoup(source_code.content, 'html.parser')
See:
print(obce_soup.declared_html_encoding)
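Putting the pieces together, a minimal diagnostic sketch for this class of problem (all three attributes are real requests/BeautifulSoup APIs):
import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500')
print(source_code.encoding)              # what requests guessed from the HTTP headers
obce_soup = BeautifulSoup(source_code.content, 'html.parser')
print(obce_soup.original_encoding)       # what BeautifulSoup detected from the bytes
print(obce_soup.declared_html_encoding)  # what the HTML itself declares, if anything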
