Failed to extract tables and data using beautifulsoup - python

I was trying to parse a Yahoo Finance webpage using BeautifulSoup. I am using Python 2.7 and bs4 4.3.2. My final objective is to extract in Python all the tabulated data from http://finance.yahoo.com/q/ae?s=PXT.TO. As a start, the following code cannot find any tables at the URL. What am I missing?
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://finance.yahoo.com/q/ae?s=PXT.TO"
soup = BeautifulSoup(urlopen(url).read())
table = soup.findAll("table")
print table
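As a sanity check, the same findAll("table") call works on static HTML, which suggests the problem lies in what the URL actually returns (the page may be redirected or rendered by JavaScript) rather than in the parsing code. A minimal sketch, using an invented HTML snippet and an explicitly named parser:

```python
from bs4 import BeautifulSoup

# A static stand-in for the page, to confirm the parsing logic itself.
# The snippet and its ids are invented for illustration.
html = """
<html><body>
  <table id="earnings"><tr><td>Q1</td><td>0.35</td></tr></table>
  <table id="estimates"><tr><td>Q2</td><td>0.40</td></tr></table>
</body></html>
"""

# Always name a parser explicitly; the default choice varies by install.
soup = BeautifulSoup(html, "html.parser")
tables = soup.findAll("table")
print(len(tables))                    # 2
print([t["id"] for t in tables])      # ['earnings', 'estimates']
```

If this prints the expected tables but the live page does not, inspect `urlopen(url).read()` directly to see whether the tables are present in the raw response at all.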

Related

Parsing SEC data from HTML file using Python

How do I parse the following SEC data from a .html website in Python?
I'm trying to parse the HTML from the following webpage: https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html
The .txt version of the page contains the three top-level elements that I need to extract and parse into data frames:
Header <SEC-HEADER>
Primary Document <edgarSubmission... >
Information Table <informationTable...>
I can see some of the information with the following code, but I am ignorant of how to find the equivalent text element in the HTML and extract it. How can I proceed?
import bs4
from bs4 import BeautifulSoup
import requests
url= "https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html"
request = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(request.text, 'lxml')
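Since the three top-level elements live in the filing's .txt version, one approach is to fetch that file and slice out each section by its tag pair before parsing further. A minimal sketch, using an invented miniature filing string and a hypothetical `extract_section` helper (the real file should be fetched from the .txt URL with a proper User-Agent, as above):

```python
import re

# A tiny stand-in for the .txt filing; the real file is much larger.
filing = """<SEC-HEADER>ACCESSION NUMBER: 0001831187-23-000001</SEC-HEADER>
<edgarSubmission><headerData>...</headerData></edgarSubmission>
<informationTable><infoTable>...</infoTable></informationTable>"""

def extract_section(text, tag):
    """Return the first <tag>...</tag> span, including the tags, or None."""
    pattern = r"<%s.*?</%s>" % (re.escape(tag), re.escape(tag))
    m = re.search(pattern, text, re.DOTALL)
    return m.group(0) if m else None

header = extract_section(filing, "SEC-HEADER")
info_table = extract_section(filing, "informationTable")
print(header)
```

Once each section is isolated, the XML-like ones (edgarSubmission, informationTable) can be handed to BeautifulSoup with an XML parser and flattened into data frames.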

Scraping pdfs from a webpage

I would like to download all financial reports for a given company from the Danish company register (CVR register). An example could be Chr. Hansen Holding in the link below:
https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da
Specifically, I would like to download all the PDF under the tab "Regnskaber" (=Financial reports). I do not have previous experience with webscraping using Python. I tried using BeautifulSoup, but given my non-existing experience, I cannot find the correct way to search from the response.
Below is what I tried, but no data is printed (i.e. it did not find any PDFs).
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')
for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])
All help and guidance will be much appreciated.
You should use select instead of findAll:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text, 'lxml')
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')
for link in pdfs:
    print(link['href'].split('/')[-1])
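The hrefs on such pages are often relative, which is presumably why urljoin is imported: it turns a relative href into a full download URL against the page's base. A small sketch with a hypothetical relative path (the href shown is invented for illustration):

```python
from urllib.parse import urljoin

base = "https://datacvr.virk.dk/data/visenhed"
# A hypothetical relative href, as it might appear in the page:
href = "/dokument/28318677/aarsrapport-2019.pdf"

full_url = urljoin(base, href)        # absolute URL, ready for requests.get
filename = href.split("/")[-1]        # local filename to save under
print(full_url)   # https://datacvr.virk.dk/dokument/28318677/aarsrapport-2019.pdf
print(filename)   # aarsrapport-2019.pdf
```

The full URL can then be downloaded with `requests.get(full_url)` and written to disk in binary mode.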

How can I get data from this link into a JSON?

I am trying to extract the search results from this link into a JSON file with Python, but normal request methods do not seem to work in this case. How can I extract all the results?
url = "https://apps.usp.org/app/worldwide/medQualityDatabase/reportResults.html?country=Ethiopia%2BGhana%2BKenya%2BMozambique%2BNigeria%2BCambodia%2BLao+PDR%2BPhilippines%2BThailand%2BViet+Nam%2BBolivia%2BColombia%2BEcuador%2BGuatemala%2BGuyana%2BPeru&period=2017%2B2016%2B2015%2B2014%2B2013%2B2012%2B2011%2B2010%2B2009%2B2008%2B2007%2B2006%2B2005%2B2004%2B2003&conclusion=Both&testType=Both&counterfeit=Both&recordstart=50"
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
results_page = BeautifulSoup(r.content, 'lxml')
Why am I not getting the full source code of the page?
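Pages like this typically fill the results table with a background request after the initial HTML loads, so the first response is not the full source; the browser's network tab usually reveals the underlying endpoint. Whatever way the rows are obtained, writing them out as JSON is straightforward with the standard library. A sketch with invented row data (the field names are hypothetical, not the site's actual schema):

```python
import json

# Hypothetical rows, as they might be parsed out of the results table:
rows = [
    {"country": "Ethiopia", "period": "2017", "conclusion": "Pass"},
    {"country": "Ghana", "period": "2016", "conclusion": "Fail"},
]

# Serialize the extracted rows to a JSON file.
with open("results.json", "w") as f:
    json.dump(rows, f, indent=2)

# Round-trip to confirm the file contains the same records.
with open("results.json") as f:
    loaded = json.load(f)
print(loaded == rows)   # True
```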

Scraping website in Python

I have a problem scraping a website in Python. Specifically, I cannot scrape a live-scores website with the BeautifulSoup library. The problem in my code is that the HTML elements are not inserted into a Python list.
import urllib3
from bs4 import BeautifulSoup
import requests
import pymysql
import timeit

data_list = []
url_p = requests.get('my url website')
soup = BeautifulSoup(url_p.text, 'html.parser')
vathmoi_table = soup.find("td", class_="label")
for table in soup.findAll("table"):
    print(table)
print(vathmoi_table)
for team_name in soup.findAll("td"):
    data_list_r = []
    simvolo = team_name.find("img")
    name = team_name.find("td", class_="label")
    vathmologia = team_name.find("td", class_="points")
    if name is not None:
        data_list_r.append(simvolo.get_text().strip())
        data_list_r.append(name.get_text().strip())
        data_list_r.append(vathmologia.get_text().strip())
        data_list.append(data_list_r)
    for tr_parse in team_name.findAll("tr"):
        team = tr_parse.find("td", class_="team")
        if team is not None:
            print(team.get_text())
print(data_list)
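One likely reason the list stays empty is that the loop iterates over <td> cells and then searches for other <td> elements inside each cell, which never match. Iterating over <tr> rows and reading the label/points cells of each row is the usual pattern. A self-contained sketch with an invented standings table:

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for the live-scores table:
html = """
<table>
  <tr><td class="team">AEK</td><td class="label">AEK</td><td class="points">45</td></tr>
  <tr><td class="team">PAOK</td><td class="label">PAOK</td><td class="points">42</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
data_list = []
for row in soup.find_all("tr"):                 # iterate rows, not cells
    name = row.find("td", class_="label")
    points = row.find("td", class_="points")
    if name is not None and points is not None:
        data_list.append([name.get_text().strip(), points.get_text().strip()])
print(data_list)   # [['AEK', '45'], ['PAOK', '42']]
```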

Scraping Product Names using BeautifulSoup

I'm using BeautifulSoup (BS4) to build a scraper tool that will allow me to pull the product name from any TopShop.com product page, which sits between <h1> tags. I can't figure out why the code I've written isn't working!
from urllib2 import urlopen
from bs4 import BeautifulSoup
import re
TopShop_URL = raw_input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()
soup = BeautifulSoup(ProductPage)
ProductNames = soup.find_all('h1')
print ProductNames
I got this working using requests (http://docs.python-requests.org/en/latest/):
from bs4 import BeautifulSoup
import requests
content = requests.get("TOPShop_URL").content
soup = BeautifulSoup(content)
product_names = soup.findAll("h1")
print product_names
Your code is correct, but the problem is that the div which contains the product name is generated dynamically via JavaScript.
To parse this element successfully, consider using Selenium or a similar tool that lets you parse the webpage after the DOM has fully loaded.
