I want to scrape Bengali characters properly with Python. I already tried decoding the response, but it gives me an error, and when I encode it I get unreadable characters.
from bs4 import BeautifulSoup
import requests
import urllib3
import pprint
import json
from requests.models import DecodeError
from urllib3.util.url import Url
content = []
# for i in range(1, 25):
url = 'https://sattacademy.com/job-solution/view?cat_id=1&sub_cat_id=893'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
anchor = soup.findAll("a", attrs={"class": "nav-link"})
for ref in anchor:
    print(ref.text)
This is my output (screenshot omitted; it shows unreadable characters instead of the Bengali text).
But I want it printed clearly. Thanks.
This link might help you. It shows how to print UTF-8 characters in Python.
Printing unicode number of chars in a string (Python)
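If the page itself is UTF-8 but requests guesses a different encoding for res.text, forcing the encoding before parsing usually clears up the garbled output. A minimal sketch of that idea, assuming the page is served as UTF-8 (URL and class name taken from the question above):
import requests
from bs4 import BeautifulSoup

url = 'https://sattacademy.com/job-solution/view?cat_id=1&sub_cat_id=893'
res = requests.get(url)
res.encoding = 'utf-8'  # assumption: the page is UTF-8, so override requests' guess
soup = BeautifulSoup(res.text, 'html.parser')
for ref in soup.find_all('a', attrs={'class': 'nav-link'}):
    print(ref.text)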
Hello, I want to scrape a webpage. I posted my code below; the line I marked is the important one. It doesn't work - there is no error, but also no output. I need to concatenate two strings, and that is where the problem is.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
url_course_main='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course=url_course_main+soup.find_all('option')[1].get_text()  # <--- this line
html_content_course=requests.get(url_course).text
soup_course=BeautifulSoup(html_content_course,'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text())
When I change the line I marked to
url_course=url_course_main+'AKM'
it works.
Also, soup.find_all('option')[1].get_text() is equal to 'AKM'.
Can you guess where the mistake is?
Instead of the requests module, try Python's standard urllib.request. It seems that the requests module has a problem opening the page:
import urllib.request
from bs4 import BeautifulSoup
url='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_content, "lxml")
url_course_main='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course=url_course_main+soup.find_all('option')[1].get_text()
html_content_course=urllib.request.urlopen(url_course).read()
soup_course=BeautifulSoup(html_content_course,'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text(strip=True))
Prints:
2019-2020 Yaz Dönemi AKM Kodlu Derslerin Ders Programı
...
The problem is that get_text() returns 'AKM ' with a trailing space, and requests sends the URL with that space, so the server can't find the file 'AKM '.
I used >< in '>{}<'.format(param) to make the space visible - >AKM < - because without the markers it looks fine.
The code needs get_text(strip=True) or get_text().strip() to remove this space.
import requests
from bs4 import BeautifulSoup
url = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, 'lxml')
url_course_main = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
param = soup.find_all('option')[1].get_text()
print('>{}<'.format(param)) # I use `> <` to show spaces
param = soup.find_all('option')[1].get_text(strip=True)
print('>{}<'.format(param)) # I use `> <` to show spaces
url_course = url_course_main + param
html_content_course = requests.get(url_course).text
soup_course = BeautifulSoup(html_content_course, 'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text())
I'm using this python script
from lxml import html
import requests
page = requests.get('https://www.tanoshiijapanese.com/dictionary/sentences.cfm?j=%E6%BC%A2%E5%AD%97&e=&search=Search+%3E')
tree = html.fromstring(page.content)
sentences = tree.xpath('//div[@class="sentence"]/div[@class="sm"]/div[@class="jp"]/text()')
print ('Sentences: ', sentences)
and I'm getting this
Sentences: ['ä»\x8aæ\x97¥ã\x81¯æ¼¢å\xad\x97ã\x81®æ\x9b¸ã\x81\x8då\x8f\x96ã\x82\x8aã\x81\x8cã\x81\x82ã\x82\x8bã\x80\x82', 'æ¼¢å\xad\x97ã\x82\x92æ\x9b¸ã\x81\x8fã\x81¨ã\x81\x8dã\x81¯ç\x82¹ã\x82\x84ã\x81¯ã\x82\x89ã\x81\x84ã\x81«æ°\x97ã\x82\x92ã\x81¤ã\x81\x91ã\x81¦ã\x80\x81ã\x81ªã\x82\x8bã\x81¹ã\x81\x8fæ\x97©ã\x81\x8fã\x81¦ã\x81\x84ã\x81\xadã\x81\x84ã\x81«æ\x9b¸ã\x81\x8dã\x81¾ã\x81\x97ã\x82\x87ã\x81\x86ã\x80\x82', '横ç\x9d\x80ã\x81\x97ã\x81ªã\x81\x84ã\x81§ã\x80\x81æ\x95\x99ã\x82\x8fã\x81£ã\x81\x9fæ¼¢å\xad\x97ã\x82\x92使ã\x81\x84ã\x81ªã\x81\x95ã\x81\x84ã\x80\x82', 'ã\x80\x8cé\x81\x93ã\x80\x8dã\x81¨ã\x81\x84ã\x81\x86æ¼¢å\xad\x97ã\x81®ç·\x8fç\x94»æ\x95°ã\x81¯ä½\x95ç\x94»ã\x81§ã\x81\x99ã\x81\x8bã\x80\x82', 'å½¼ã\x81¯æ¼¢å\xad\x97ã\x81\x8cå\x85¨ã\x81\x8fæ\x9b¸ã\x81\x91ã\x81ªã\x81\x84ã\x80\x82', 'å\x90\x9bã\x81¯ã\x81\x93ã\x81®æ¼¢å\xad\x97ã\x81\x8cèª\xadã\x82\x81ã\x81¾ã\x81\x99ã\x81\x8bã\x80\x82', 'æ¼¢å\xad\x97ã\x81¯èª\xadã\x82\x80ã\x81®ã\x81\x8cé\x9b£ã\x81\x97ã\x81\x84ã\x80\x82', 'ã\x81\x93ã\x81®æ¼¢å\xad\x97ã\x81¯ã\x81©ã\x81\x86ã\x81\x84ã\x81\x86æ\x84\x8få\x91³ã\x81§ã\x81\x99ã\x81\x8bã\x80\x82', 'ã\x81\x84ã\x81\x8bã\x81«ã\x82\x82ã\x83ªã\x82¾ã\x83¼ã\x83\x88ã\x81£ã\x81¦ã\x81\x8bã\x82\x93ã\x81\x98ã\x81®æ\xa0¼å¥½ã\x81\xadã\x80\x82', 'æ¼¢å\xad\x97ã\x82\x92å°\x91ã\x81\x97æ\x95\x99ã\x81\x88ã\x81¦ã\x81\x8fã\x81\xa0ã\x81\x95ã\x81\x84ã\x80\x82', '彼女ã\x81¯ã\x81\x93ã\x82\x93ã\x81ªé\x9b£ã\x81\x97ã\x81\x84æ¼¢å\xad\x97ã\x82\x82èª\xadã\x82\x81ã\x81¾ã\x81\x99ã\x80\x82', 'ã\x83\x88ã\x83\x9eã\x81\x95ã\x82\x93ã\x81¯å°\x8få\xad¦ç\x94\x9få\x90\x91ã\x81\x91ã\x81®æ\x9c¬ã\x81\x8cèª\xadã\x82\x81ã\x82\x8bã\x81\x90ã\x82\x89ã\x81\x84æ¼¢å\xad\x97ã\x82\x92ã\x81\x9fã\x81\x8fã\x81\x95ã\x82\x93è¦\x9aã\x81\x88ã\x81¦ã\x81\x84ã\x81¾ã\x81\x99ã\x80\x82', 'ä¸\xadå\x9b½ã\x81§ã\x81¯æ¼¢å\xad\x97ã\x81®å\xad\x97æ\x95°ã\x81\x8cå¤\x9aã\x81\x84ã\x81\x8bã\x82\x89 è¤\x87é\x9b\x91ã\x81ªç¹\x81ä½\x93å\xad\x97ã\x82\x92ã\x82\x84ã\x82\x81ã\x81¦è¦\x9aã\x81\x88ã\x82\x84ã\x81\x99ã\x81\x84ç°¡ä½\x93å\xad\x97ã\x81«ç½®ã\x81\x8dæ\x8f\x9bã\x81\x88ã\x80\x81è\xad\x98å\xad\x97ç\x8e\x87ã\x82\x92é«\x98ã\x82\x81ã\x82\x8bã\x81\x93ã\x81¨ã\x81\x8cç°¡ä½\x93å\xad\x97æ\x8e¨é\x80²ã\x81®ç\x9b®ç\x9a\x84ã\x81§ã\x81\x97ã\x81\x9fã\x80\x82', 'ï¼\x94ç´\x9aã\x81®æ¼¢å\xad\x97ã\x82\x92ã\x81©ã\x82\x8cã\x81\xa0ã\x81\x91è¦\x9aã\x81\x88ã\x81¦ã\x81\x84ã\x81¾ã\x81\x99ã\x81\x8bã\x80\x82', 'ã\x81\x9dã\x82\x93ã\x81ªæ¼¢å\xad\x97ã\x81¯å\x83\x95ã\x81\x8cèª\xadã\x82\x81ã\x81ªã\x81\x84ã\x81»ã\x81©ã\x81\x9fã\x81\x84ã\x81¸ã\x82\x93è¤\x87é\x9b\x91ã\x81ªã\x82\x93ã\x81\xa0ã\x80\x82', 'æ\x97¥æ\x9c¬èª\x9eã\x81¨ä¸\xadå\x9b½èª\x9eã\x81®æ¼¢å\xad\x97ã\x81®ç\x99ºé\x9f³ã\x81¯ã\x81¨ã\x81¦ã\x82\x82é\x81\x95ã\x81\x84ã\x81¾ã\x81\x99ã\x81\xadã\x80\x82', 'ç§\x81ã\x81¯æ¼¢å\xad\x97ã\x82\x92å\x8b\x89å¼·ã\x81\x97ã\x81¦ã\x81\x84ã\x81¾ã\x81\x99ã\x80\x82', 'æ¼¢å\xad\x97ã\x82\x92èª\xadã\x82\x80ã\x81®ã\x81¯é\x9b£ã\x81\x97ã\x81\x84ã\x81§ã\x81\x99ã\x80\x82', 'ã\x81\x93ã\x81®æ¼¢å\xad\x97ã\x81®èª\xadã\x81¿ã\x81\x8bã\x81\x9fã\x81¯ä½\x95ã\x81§ã\x81\x97ã\x82\x87ã\x81\x86ã\x81\x8bã\x80\x82', 'æ\x97¥æ\x9c¬ã\x81§ã\x81¯å®\x89ã\x81\x84æ¼¢å\xad\x97ã\x81®è¾\x9eæ\x9b¸ã\x81\x8cã\x81\x82ã\x82\x8cã\x81°ã\x80\x81è²·ã\x81\x84ã\x81¾ã\x81\x99ã\x80\x82']
Try to get information with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = 'https://www.tanoshiijapanese.com/dictionary/sentences.cfm?j=%E6%BC%A2%E5%AD%97&e=&search=Search+%3E'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for div in soup.select('.jp'):
    print(div.text)
Prints:
今日は漢字の書き取りがある。
漢字を書くときは点やはらいに気をつけて、なるべく早くていねいに書きましょう。
横着しないで、教わった漢字を使いなさい。
「道」という漢字の総画数は何画ですか。
彼は漢字が全く書けない。
君はこの漢字が読めますか。
漢字は読むのが難しい。
この漢字はどういう意味ですか。
いかにもリゾートってかんじの格好ね。
漢字を少し教えてください。
彼女はこんな難しい漢字も読めます。
トマさんは小学生向けの本が読めるぐらい漢字をたくさん覚えています。
中国では漢字の字数が多いから 複雑な繁体字をやめて覚えやすい簡体字に置き換え、識字率を高めることが簡体字推進の目的でした。
4級の漢字をどれだけ覚えていますか。
そんな漢字は僕が読めないほどたいへん複雑なんだ。
日本語と中国語の漢字の発音はとても違いますね。
私は漢字を勉強しています。
漢字を読むのは難しいです。
この漢字の読みかたは何でしょうか。
日本では安い漢字の辞書があれば、買います。
Note: It also depends on whether your terminal can display Unicode characters. If you see garbled text, try setting your terminal to UTF-8.
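If you prefer to keep the lxml/XPath script from the question, the same idea - forcing the response encoding before parsing - may also work. A sketch, assuming the page is actually served as UTF-8:
from lxml import html
import requests

page = requests.get('https://www.tanoshiijapanese.com/dictionary/sentences.cfm?j=%E6%BC%A2%E5%AD%97&e=&search=Search+%3E')
page.encoding = 'utf-8'  # assumption: the page is UTF-8, so override requests' guess
tree = html.fromstring(page.text)
sentences = tree.xpath('//div[@class="sentence"]/div[@class="sm"]/div[@class="jp"]/text()')
print('Sentences:', sentences)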
I recently got interested in web scraping in Python and tried it on some simple examples, but I don't know how to handle languages that aren't covered by ASCII - for example, searching for a specific string in the HTML or writing those strings to a file.
from urllib.parse import urljoin
import requests
import bs4
website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
soup1 = bs4.BeautifulSoup(requests.get(book_url).text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
Looking at this website under book_url, each row has different text, and the text is in Persian.
Let's say I need the last row to be considered.
The text is "صدای کل کتاب"
How can I search for this string in <li>, <div>, and <a> tags?
You need to set the encoding on the requests response to UTF-8. It looks like the requests module was not using the decoding you wanted. As mentioned in this SO post, you can tell requests what encoding to expect.
from urllib.parse import urljoin
import requests
import bs4
website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
The only change here is
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
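To address the search part of the question: once the text decodes correctly, the Persian string can be matched like any other Python string. A minimal sketch, reusing soup1 from the snippet above (the callable filter is just one way to restrict the search to <li>, <div>, and <a> tags):
target = 'صدای کل کتاب'
matches = soup1.find_all(
    lambda tag: tag.name in ('li', 'div', 'a') and target in tag.get_text())
for tag in matches:
    print(tag.name, tag.get_text(strip=True))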
I am building a Python 3 web crawler/scraper using bs4. The program crashes whenever it meets a non-ASCII Unicode character, such as a Chinese character. How do I modify my scraper so that it supports Unicode?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup
def crawlForData(url):
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)
url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)
In Python 2 you could use the built-in unicode() to decode byte strings; in Python 3, a way to go is
content.decode('utf-8', 'ignore')
where content is your bytes object.
The complete solution may be:
html = urllib.request.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content, 'html.parser')
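If the crash turns out to be a UnicodeEncodeError raised by print() (common on Windows consoles), the parsing is fine and the output encoding is the real problem. One possible workaround, assuming Python 3.7+ where reconfigure() is available:
import sys
sys.stdout.reconfigure(encoding='utf-8', errors='replace')  # make print() emit UTF-8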
I've read quite a few posts here about this, but I'm very new to Python in general so I was hoping for some more info.
Essentially, I'm trying to write something that will pull word definitions from a site and write them to a file. I've been using BeautifulSoup, and I've made quite some progress, but here's my issue -
from __future__ import print_function
import requests
import urllib2, urllib
from BeautifulSoup import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)
print(visible_text, file=wordlist)
this seems to pull what I need, but puts it in this format
[u'passable\n adj 1: able to be passed or traversed or crossed; "the road is\n passable"
but I need it to be in plaintext. I've tried using a sanitizer (I was running it through bleach), but that didn't work. I've read some of the other answers here, but they don't explain HOW the code works, and I don't want to add something if I don't understand how it works.
Is there any way to just pull the plaintext?
edit: I ended up doing
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
print(visible_text, file=wordlist)
The code is already giving you plaintext; it just happens to have some characters encoded as entity references. In this case, special characters that form part of the XML/HTML syntax are encoded to prevent them from breaking the structure of the text.
To decode them, use the HTMLParser module:
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&quot;the road is passable&quot;')
>>> u'"the road is passable"'
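Note that HTMLParser is the Python 2 module name; on Python 3 the same helper lives in the html module. A quick sketch:
from html import unescape
print(unescape('&quot;the road is passable&quot;'))  # "the road is passable"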