I would like to download all financial reports for a given company from the Danish company register (the CVR register). An example could be Chr. Hansen Holding, linked below:
https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da
Specifically, I would like to download all the PDFs under the "Regnskaber" (= financial reports) tab. I have no previous experience with web scraping in Python. I tried using BeautifulSoup, but given my non-existent experience, I cannot find the correct way to search the response.
Below is what I tried, but no data is printed (i.e. it did not find any PDFs).
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = ("https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed"
            "&id=28318677&soeg=chr%20hansen&type=undefined&language=da")
response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')
for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])
All help and guidance will be much appreciated.
You should use select instead of findAll:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = ("https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed"
            "&id=28318677&soeg=chr%20hansen&type=undefined&language=da")
response = requests.get(web_page)
soup = BeautifulSoup(response.text, 'lxml')
# PDF links inside the "Regnskaber og nøgletal" accordion
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')
for link in pdfs:
    print(link['href'].split('/')[-1])
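If you also want to save the files instead of just printing their names, you could extend the loop along these lines (a sketch continuing from the snippet above; it assumes the href values may be relative, which is why urljoin is used):

for link in pdfs:
    pdf_url = urljoin(web_page, link['href'])  # resolve relative hrefs against the page URL
    filename = pdf_url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(pdf_url).content)  # fetch the PDF bytes and save them locally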
I know that this code works for other websites that end in .com. However, I noticed that the code doesn't work when I try to parse websites that end in .kr.
Can somebody help me figure out why this is happening, and suggest an alternative way to parse these kinds of websites?
Following is my code.
import requests
from bs4 import BeautifulSoup
URL = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='container')
print(results)
The URL here is a link to my timetable. I need to parse this website so that I can easily collect the information for the subjects and data relevant to the subject (duration, location, professor's name, etc.).
Thanks
The website serves dynamic content, so you get an empty response back - you can use Selenium to render the page first.
Example:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the timetable
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find(id='container')
print(results)
driver.close()
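A fixed time.sleep(5) works but is fragile; an explicit wait is usually more robust. Here is a sketch that waits for the container element instead (assuming, as above, that id='container' appears once the timetable has rendered):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get('https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe')
# block until the element is present, or raise TimeoutException after 10 seconds
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'container')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='container'))
driver.quit()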
I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites but this one seems to be an issue because of the Javascript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
Now if I run it like this, with get_text, I can find the titles in the result; however, as soon as I change it to find_all or find, the titles are lost. I can't find them using the web browser's inspect tool, because it is all JS-generated.
Any advice would be greatly appreciated.
You have to specify what to find - in this case <h2> - to get the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.
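If you want every supplier name on the page rather than just the first, find_all collects them all (continuing from the snippet above, and assuming each listing uses an <h2> the way the first one does):

for title in soup.find_all('h2'):
    print(title.get_text(strip=True))  # one supplier name per <h2>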
I am trying to get a movie rating from the website Letterboxd. I have used code like this on other websites and it has worked, but it is not getting the info I want off of this website.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://letterboxd.com/film/avengers-endgame/")
soup = BeautifulSoup(page.content, 'html.parser')
final = soup.find("section", attrs={"class": "section ratings-histogram-chart"})
print(final)
This prints nothing, but there is a tag with this class on the website, and the info I want is under it.
The reason behind this is that the website loads most of its content asynchronously, so you'll have to look at the HTTP requests it sends to the server in order to load the page content after the page layout. You can find them in the "Network" tab of the browser's developer tools (F12 key).
For instance, one of the APIs they use to load the rating is this one:
https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/
You can get the weighted average from another tag:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/film/avengers-endgame/')
soup = bs(r.content, 'lxml')
print(soup.select_one('[name="twitter:data2"]')['content'])
Text of the whole histogram:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/')
soup = bs(r.content, 'lxml')
ratings = [item['title'].replace('\xa0',' ') for item in soup.select('.tooltip')]
print(ratings)
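If you need the counts as numbers rather than display strings, you can pull the leading figure out of each title; a sketch continuing from the snippet above, assuming every tooltip title starts with a (possibly comma-separated) count:

import re

# e.g. '1,234 ratings (2%)' -> 1234; the exact title format is an assumption
counts = [int(re.match(r'[\d,]+', r).group().replace(',', '')) for r in ratings]
print(counts)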
I am trying to scrape company info from http://fortune.com/fortune500 for my thesis. When I downloaded the page text from the link, there were no links to parse. However, opening the link in Chrome automatically leads to the #1 company's page.
Could someone kindly explain what is happening, and how I can trace the links to the company pages from the original URL?
First you need to get the postid, then make a request to /data/franchise-list, then get the URL from the first article:
import json
import re
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

data = urlopen('http://fortune.com/fortune500/')
soup = BeautifulSoup(data, 'html.parser')
# the numeric post id is embedded in the <body> class list, e.g. "postid-12345"
postid = next(attr for attr in soup.body['class'] if attr.startswith('postid'))
postid = re.match(r'postid-(\d+)', postid).group(1)
url = "http://fortune.com/data/franchise-list/{postid}/1/".format(postid=postid)
data = json.load(urlopen(url))
resulting_url = urljoin(url, data['articles'][0]['url'])
print(resulting_url)
Prints:
http://fortune.com/fortune500/wal-mart-stores-inc-1/
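The same JSON response lists every company on that page, so you can enumerate them all rather than just the first (continuing from the snippet above):

# each entry in data['articles'] carries a relative company URL
for article in data['articles']:
    print(urljoin(url, article['url']))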
I'm using BeautifulSoup (BS4) to build a scraper tool that will allow me to pull the product name from any TopShop.com product page, which sits between 'h1' tags. Can't figure out why the code I've written isn't working!
from urllib.request import urlopen
from bs4 import BeautifulSoup

TopShop_URL = input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()
soup = BeautifulSoup(ProductPage, 'html.parser')
ProductNames = soup.find_all('h1')
print(ProductNames)
I got this working using requests (http://docs.python-requests.org/en/latest/):
from bs4 import BeautifulSoup
import requests

content = requests.get(TopShop_URL).content  # TopShop_URL: the URL entered in the question
soup = BeautifulSoup(content, 'html.parser')
product_names = soup.find_all("h1")
print(product_names)
Your code is correct, but the problem is that the div which contains the product name is dynamically generated via JavaScript.
In order to parse this element successfully, you should consider using Selenium or a similar tool that lets you parse the webpage after the DOM has fully loaded.
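For example, a minimal Selenium sketch along the lines of the everytime.kr answer above (the fixed five-second wait is a crude assumption; adjust it, and the driver setup, for your environment):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get(TopShop_URL)  # the product URL entered in the question
time.sleep(5)  # crude wait for the JS-rendered DOM to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find_all('h1'))  # the product name(s) should now be present
driver.quit()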