I tried to retrieve the table data from the link below with Python, but unfortunately the response contains all the HTML tags without the table itself. Could you do me a favor and help me?
https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01
my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)
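If the table on that page is filled in by JavaScript after the initial page load, requests will only ever see the empty shell, which would explain the output above. A minimal sketch of one workaround, using Selenium to let a real browser render the page first; it assumes the data ends up in an ordinary <table> element, which you would need to confirm in the browser's inspector:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm'
       '?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01')

driver = webdriver.Chrome()  # requires a matching chromedriver on PATH
try:
    driver.get(url)
    # wait up to 20 seconds for a <table> to appear in the rendered DOM
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for row in soup.find('table').find_all('tr'):
        print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])
finally:
    driver.quit()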
I am from Hong Kong and a new learner of Python. I am trying to scrape the information from the website below:
https://event.hktdc.com/en/?eventFormat=Exhibition&countryRegion=Hong-Kong&location=all&year=2023&p=1
My code is below:
import requests
from bs4 import BeautifulSoup
response = requests.get(
"https://event.hktdc.com/tc/?eventFormat=Exhibition&countryRegion=Hong-Kong&location=all&year=2023&p=1#list")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
How can I get the information from it?
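Before reaching for heavier tools, it is worth checking whether the listings are in the raw HTML at all; if they are not, they are loaded by JavaScript, and you will need either the site's underlying data API (visible in the browser's network tab) or a browser-automation tool such as Selenium. A rough diagnostic sketch; the search string is only a placeholder for an event title you can actually see in the browser:
import requests
from bs4 import BeautifulSoup

url = ('https://event.hktdc.com/en/?eventFormat=Exhibition'
       '&countryRegion=Hong-Kong&location=all&year=2023&p=1')
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

needle = 'Hong Kong'  # placeholder: replace with a title visible on the page
print('placeholder string found in raw HTML:', needle in r.text)
print('number of <a> tags in raw HTML:', len(soup.find_all('a')))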
I am scraping this site: https://finance.yahoo.com/quote/MSFT/press-releases.
In the browser, there are 20+ articles. However, when I pull the site's HTML down and load it into HTML Agility Pack, only the first three articles appear.
let client = new WebClient()
let uri = "https://finance.yahoo.com/quote/MSFT/press-releases"
let response = client.DownloadString(uri)
let doc = HtmlDocument()
doc.LoadHtml(response)
This works:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[1]")
node.InnerText
This does not work:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[10]")
node.InnerText
Is it because there are some janky li tags on the Yahoo site? Is it a limitation of HtmlAgilityPack?
I also wrote the same script in Python using BeautifulSoup and had the same problem:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://finance.yahoo.com/quote/MSFT/press-releases?"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', href=True):
    print(link['href'])
Thanks
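This is most likely not an HtmlAgilityPack limitation: the static HTML appears to contain only the first few articles, with the rest appended by JavaScript as the page is scrolled, so neither HtmlAgilityPack nor BeautifulSoup will see them. A hedged Python/Selenium sketch that scrolls the page a few times before counting list items; the number of scrolls and the id (copied from the XPath above) are guesses to adjust after inspecting the rendered page:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://finance.yahoo.com/quote/MSFT/press-releases'
driver = webdriver.Chrome()
try:
    driver.get(url)
    # scroll a few times so the JavaScript stream can load more articles
    for _ in range(5):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    stream = soup.find(id='summaryPressStream-0-Stream')
    items = stream.find_all('li') if stream else []
    print(len(items), 'articles found')
finally:
    driver.quit()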
I am using requests-HTML and BeautifulSoup to scrape a website; the code is below. The weird thing is that I can sometimes get the text from the page when using print(soup.get_text()), but I get some random characters when using print(soup) - see the attached image.
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

# url is the page being scraped (not shown in the question)
session = HTMLSession()
r = session.get(url)
soup = bs(r.content, "html.parser")
print(soup.get_text())
# print(soup)
The program returns this when I try to look at the soup:
I think the site is JavaScript-protected. Well, try this; it might help:
import requests
from bs4 import BeautifulSoup
r = requests.get(url)  # url is the page from the question
print(r.text)
# if you want the whole content, you can slice the response stored in r,
# or just parse it with bs4
soup = BeautifulSoup(r.text, "html.parser")
print(soup.text)
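Since the question already uses requests-HTML, another hedged option is to let it execute the page's JavaScript itself via render(), which downloads a headless Chromium the first time it is called; url below stands in for the page from the question:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://example.com/'  # placeholder for the page from the question
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=30)  # runs the page's JavaScript in headless Chromium
soup = BeautifulSoup(r.html.html, 'html.parser')
print(soup.get_text())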
I'm trying to parse this HTML and get the 53.1 and 41.7 values, but I'm not quite sure how to do it.
I've been trying to do it using Beautiful Soup.
Any suggestions or ideas would be greatly appreciated. Thanks.
from bs4 import BeautifulSoup
from urllib.request import urlopen

r = urlopen('url/to/open').read()
soup = BeautifulSoup(r, 'html.parser')
print(type(soup))
-OR-
from bs4 import BeautifulSoup
import requests
url = input("Enter a website to extract the URLs from: ")
r = requests.get("http://" + url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Notice the .find_all() method. Try exploring all the helper methods of BeautifulSoup. Good luck.
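Since the HTML in question is not reproduced above, its structure has to be assumed. A minimal sketch that pulls numbers such as 53.1 and 41.7 out of elements with a known class; the markup and the class name 'value' are hypothetical and would need to be replaced with whatever the real page uses:
from bs4 import BeautifulSoup

# hypothetical markup standing in for the real page
html = '''
<div class="stats">
  <span class="value">53.1</span>
  <span class="value">41.7</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
values = [float(tag.get_text(strip=True)) for tag in soup.find_all(class_='value')]
print(values)  # [53.1, 41.7]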
When I inspect the elements in my browser, I can clearly see the exact web content. But when I run the script below, some of the web page details are missing. On the web page I see "#document" elements, and those are missing when I run the script. How can I see the details of the #document elements or extract them with the script?
from bs4 import BeautifulSoup
import requests
response = requests.get('http://123.123.123.123/')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
You need to make additional requests to get the frame page contents as well:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

with requests.Session() as session:
    response = session.get(BASE_URL)
    soup = BeautifulSoup(response.content, 'html.parser')
    # each <frame> points at a separate document that has to be fetched on its own
    for frame in soup.select("frameset frame"):
        frame_url = urljoin(BASE_URL, frame["src"])
        response = session.get(frame_url)
        frame_soup = BeautifulSoup(response.content, 'html.parser')
        print(frame_soup.prettify())
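If the "#document" nodes in the inspector belong to <iframe> elements rather than an old-style <frameset>, the same idea applies with the selector swapped; a hedged variant:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

with requests.Session() as session:
    soup = BeautifulSoup(session.get(BASE_URL).content, 'html.parser')
    # each iframe's document has to be fetched separately, just like a frame
    for iframe in soup.select('iframe[src]'):
        iframe_url = urljoin(BASE_URL, iframe['src'])
        iframe_soup = BeautifulSoup(session.get(iframe_url).content, 'html.parser')
        print(iframe_soup.prettify())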