Why is BeautifulSoup not extracting full HTML from a static website? - python

I am trying to web-scrape a Chinese-language website, https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais_cn.asp, but the code below does not return the full HTML. The strange thing is that the same code does get the full HTML from the Portuguese version of the site, https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais.asp. What is the problem?
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen('https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais_cn.asp')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'lxml')
strhtm = soup.prettify()
print(strhtm)
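Two common causes are worth checking here, though both are assumptions without inspecting the server's actual response: the server may send a different (or truncated) page to clients that lack a browser-like User-Agent, and a strict parser can silently drop content when the markup is malformed. A sketch that sends a browser-style header and uses the lenient built-in html.parser:

```python
import requests
from bs4 import BeautifulSoup

# Sketch, assuming the server may vary its response by User-Agent.
url = 'https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais_cn.asp'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
resp.raise_for_status()

# requests decodes using the declared charset; the Chinese page may not
# be UTF-8, so resp.text is safer than decoding raw bytes yourself.
soup = BeautifulSoup(resp.text, 'html.parser')
print(len(soup.prettify()))  # compare against the Portuguese page's length
```

If the output is still short, trying the html5lib parser (`pip install html5lib`), which rebuilds broken markup the way a browser does, is a reasonable next step.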

Related

Web scraping with BeautifulSoup in Python

I tried to retrieve the table data from the link below with Python, but unfortunately the code brought back all the HTML tags without the table. Could you do me a favor and help me?
https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01
my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)

Neither Selenium nor BeautifulSoup showing full HTML source?

I tried using BeautifulSoup to parse a website, but when I printed "page_soup" I only got a portion of the HTML; the beginning portion, which has the info I need, was omitted. No one answered my question. After doing some research I tried using Selenium to access the full HTML, but I got the same result. Below are both of my attempts, with Selenium and BeautifulSoup. When I print the HTML it starts in the middle of the source code, skipping the doctype, lang attribute, and the other initial statements.
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
browser.get('https://coronavirusbellcurve.com/')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # specify a parser explicitly
print(soup)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

# pageRequest was undefined in the original; build it explicitly
pageRequest = Request('https://coronavirusbellcurve.com/',
                      headers={'User-Agent': 'Mozilla/5.0'})
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
print(page_soup)
The requests module seems to return the numbers in the first table on the page, assuming you are referring to the US totals.
import requests

r = requests.get('https://coronavirusbellcurve.com/')
print(r.text)  # .text decodes the raw bytes (.content) to a string
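To go from the raw response to the table values, the decoded text can be handed to BeautifulSoup; a sketch, assuming the first <table> on the page holds the totals you want:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get('https://coronavirusbellcurve.com/', timeout=30)
soup = BeautifulSoup(r.text, 'html.parser')
first_table = soup.find('table')
if first_table is not None:
    # Print each row as a list of cell strings
    for row in first_table.find_all('tr'):
        print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])
```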

Beautifulsoup can't find an excel-download link for Anaconda Python 2.7

I am trying to download an Excel file with Python automatically, but I could not find the Excel file's link tag with BeautifulSoup.
Here is my code snippet:
from urllib2 import urlopen  # Python 2.7
from bs4 import BeautifulSoup

url = "http://www.sse.com.cn/market/othersdata/margin/detail/"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
print(soup.find_all("a", {"class": "download-export"}))

Create a script to catch links on a webpage with python 3

I have to catch all the links of the topics in this page: https://www.inforge.net/xi/forums/liste-proxy.1118/
I've tried with this script:
import urllib.request
from bs4 import BeautifulSoup

url = urllib.request.urlopen("https://www.inforge.net/xi/forums/liste-proxy.1118/")
soup = BeautifulSoup(url, "lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
but it prints all the links on the page, not just the topic links I want. Could you suggest a fast way to do it? I'm still a newbie, and I started learning Python recently.
You can use BeautifulSoup to parse the HTML (this uses Python 3's urllib.request, matching your question):

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.inforge.net/xi/forums/liste-proxy.1118/'
soup = BeautifulSoup(urlopen(url), 'lxml')

Then find the topic links with:

for link in soup.find_all('a', {'class': 'PreviewTooltip'}):
    print(link.get('href'))

python beautifulsoup get html tag content

How can I get the content of an HTML tag with BeautifulSoup? For example, the content of the <title> tag?
I tried:
from bs4 import BeautifulSoup

url = 'http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
    print(each.get_text())
But nothing happened. I'm using Python 3.
You need to fetch the website data first; here you are passing the URL string itself to BeautifulSoup instead of the page content. You can fetch the page with the urllib.request module. Note that HTML documents have only one <title>, so there is no need for find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())
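The difference is easy to see on an inline document (no network needed); soup.title is a shortcut for soup.find('title'):

```python
from bs4 import BeautifulSoup

# Parse a small inline document rather than a URL string
html = '<html><head><title>Example Page</title></head><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.get_text())  # → Example Page
```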
