I recently got interested in web scraping in Python and have worked through some simple examples, but I don't know how to handle languages that fall outside ASCII, for example when searching for a specific string in an HTML file or writing such strings to a file.
from urllib.parse import urljoin
import requests
import bs4

website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
soup1 = bs4.BeautifulSoup(requests.get(book_url).text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
Looking at this website under book_url, each row has different text, but the text is in Persian.
Let's say I need the last row to be considered.
The text is "صدای کل کتاب".
How can I search for this string in <li>, <div>, and <a> tags?
You need to set the encoding on the requests response to UTF-8. It looks like the requests module was not using the decoding you wanted. As mentioned in this SO post, you can tell requests what encoding to expect.
from urllib.parse import urljoin
import requests
import bs4

website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
The only change here is:
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
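To address the last part of the question (finding the row whose text is "صدای کل کتاب"), here is a minimal sketch, assuming the Persian phrase appears as the tag's own string rather than split across child tags; BeautifulSoup's string filter matches against a tag's text content:

import re

# Find <li>, <div>, and <a> tags whose string contains the Persian phrase.
# Once the response is decoded as UTF-8, this is an ordinary string match.
phrase = 'صدای کل کتاب'
for tag in soup1.find_all(['li', 'div', 'a'], string=re.compile(phrase)):
    print(tag.name, tag.get('href'))  # href is None for <li>/<div>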
I want to scrape Bengali characters properly with Python. I already tried decoding the response, but it gives me an error, and when I encode it I get unreadable output.
from bs4 import BeautifulSoup
import requests

content = []
# for i in range(1, 25):
url = 'https://sattacademy.com/job-solution/view?cat_id=1&sub_cat_id=893'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
anchor = soup.findAll("a", attrs={"class": "nav-link"})
for ref in anchor:
    print(ref.text)
My output is garbled (screenshot of unreadable characters omitted), but I want the Bengali text to print clearly. Thanks.
This link might help you; it shows how to print UTF-8 characters in Python:
Printing unicode number of chars in a string (Python)
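As a hedged sketch of the usual fix (an assumption, since the failing output isn't visible): tell requests which encoding to use before reading .text, so the Bengali characters are decoded correctly:

from bs4 import BeautifulSoup
import requests

url = 'https://sattacademy.com/job-solution/view?cat_id=1&sub_cat_id=893'
res = requests.get(url)
# Let requests guess the real encoding from the body instead of the headers;
# falling back to UTF-8 is a reasonable default for Bengali pages.
res.encoding = res.apparent_encoding or 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for ref in soup.find_all('a', attrs={'class': 'nav-link'}):
    print(ref.text)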
Hello, I want to scrape a web page. I posted my code below, and the line I marked is the important one. It didn't work: there is no error, but also no output. I need to concatenate two strings, and that is where the problem is.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

url_course_main = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course = url_course_main + soup.find_all('option')[1].get_text()   # <--- this line
html_content_course = requests.get(url_course).text
soup_course = BeautifulSoup(html_content_course, 'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text())
When I change the line I marked to
url_course = url_course_main + 'AKM'
it works.
Also, soup.find_all('option')[1].get_text() is equal to 'AKM'.
Can you guess where the mistake is?
Instead of the requests module, try Python's standard urllib.request. It seems that the requests module has a problem opening this page:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_content, "lxml")

url_course_main = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course = url_course_main + soup.find_all('option')[1].get_text()
html_content_course = urllib.request.urlopen(url_course).read()
soup_course = BeautifulSoup(html_content_course, 'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text(strip=True))
Prints:
2019-2020 Yaz Dönemi AKM Kodlu Derslerin Ders Programı
...
The problem is that get_text() returns 'AKM ' with a space at the end, and requests sends the URL with this space - so the server can't find a file called 'AKM ' with a space.
I used >< in the string '>{}<'.format(param) to make the space visible - >AKM < - because without >< it looks fine.
The code needs get_text(strip=True) or get_text().strip() to remove this space.
import requests
from bs4 import BeautifulSoup

url = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, 'lxml')

url_course_main = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='

param = soup.find_all('option')[1].get_text()
print('>{}<'.format(param))  # I use `> <` to show spaces

param = soup.find_all('option')[1].get_text(strip=True)
print('>{}<'.format(param))  # I use `> <` to show spaces

url_course = url_course_main + param
html_content_course = requests.get(url_course).text
soup_course = BeautifulSoup(html_content_course, 'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text())
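As a side note (my own suggestion, not part of the original answer): building the query string with urllib.parse.urlencode escapes stray characters such as spaces, which makes this class of bug easier to spot:

from urllib.parse import urlencode

base = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
param = 'AKM '  # note the trailing space
# urlencode escapes the space (as '+'), so it is visible in the final URL
url_course = base + '?' + urlencode({'fb': param})
print(url_course)  # ...prg.php?fb=AKM+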
I am trying to get all the links on a given website but am stuck with a problem involving HTML entities. Here's my code, which crawls websites using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
.
.
baseRequest = requests.get("https://www.example.com", SOME_HEADER_SETTINGS)
soup = BeautifulSoup(baseRequest.content, "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])
.
.
print(pageLinks)
The code becomes problematic when it sees this kind of element:
<a href="./page?id=123&sect=2">Link</a>
Instead of printing ["./page?id=123&sect=2"], it treats the &sect part as an HTML entity and shows this in the console:
["./page?id=123§=2"]
Is there a solution to prevent this?
Here is one:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="./page?id=123&sect=2">Link</a>', "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])

uncoded = ''.join(i for i in pageLinks).encode('utf-8')
decoded = ''.join(map(lambda x: chr(ord(x)), ''.join(i for i in pageLinks)))
print('uncoded =', uncoded)
print('decoded =', decoded)
output
uncoded = b'./page?id=123\xc2\xa7=2'
decoded = ./page?id=123§=2
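An alternative worth trying (my assumption, not from the original answer): the html5lib parser follows the HTML5 rule that a bare named reference like &sect immediately followed by = inside an attribute is not decoded, so the href should survive intact:

from bs4 import BeautifulSoup  # requires: pip install html5lib

soup = BeautifulSoup('<a href="./page?id=123&sect=2">Link</a>', 'html5lib')
# html5lib should leave "&sect=" alone here, unlike html.parser
print([a['href'] for a in soup.find_all('a')])  # expected: ['./page?id=123&sect=2']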
I'm using the lxml.html module:
from lxml import html

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(len(unis))

with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The website right here (http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z) is full of universities.
The problem is that it only parses up to the letter 'H' (244 universities).
I can't understand why, since as far as I can see it parses all the HTML to the end.
I also checked that 244 is not some limit on list size in Python 3.
That page simply isn't valid HTML; it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser:
from lxml.html.soupparser import parse
import urllib.request

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
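To complete the asker's original workflow, a small follow-up sketch (my addition) that counts the results and writes them out, specifying UTF-8 so non-ASCII university names don't trip the write:

print(len(unis))
with open('workfile.txt', 'w', encoding='utf-8') as f:
    for uni in unis:
        f.write(uni.strip() + '\n')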
For web scraping I recommend BeautifulSoup 4.
With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(), 'html.parser')

table = soup.find_all(lambda tag: tag.name == 'table')
for t in table:
    rows = t.find_all(lambda tag: tag.name == 'tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name == 'h3' and not tag.text.isspace() and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
I am building a Python 3 web crawler/scraper using bs4. The program crashes whenever it meets a non-ASCII character, such as a Chinese character. How do I modify my scraper so that it supports Unicode?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup

def crawlForData(url):
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)

url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)
In Python 2 you could use the built-in unicode() to decode byte strings, but that function no longer exists in Python 3. There, the way to go is
content.decode('utf-8', 'ignore')
where content is your bytes object.
The complete solution may be:
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content, 'html.parser')
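One more hedged guess, since the crash likely happens while printing: on consoles that can't encode Chinese characters, print() itself raises UnicodeEncodeError, and reconfiguring stdout (available since Python 3.7) avoids that:

import sys

# Force UTF-8 output regardless of the console's default encoding,
# replacing anything the terminal still can't display.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print('汉字 prints without crashing now')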