Python download multiple files from links on pages

I'm trying to download all the PGNs from this site.
I think I have to use urlopen to open each url and then use urlretrieve to download each pgn by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I'm also unsure of how urlretrieve works.
import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))

There is no short answer to your question. I will show you a complete solution and comment on the code.
First, import the necessary modules:
from bs4 import BeautifulSoup
import requests
import re
Next, fetch the index page and create a BeautifulSoup object:
req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")
I strongly advise using the lxml parser rather than the built-in html.parser.
After that, build the list of links to the game pages:
pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))
This finds every link whose href contains 'chessgame?'.
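The link filter can be tried offline. The following is a minimal sketch that uses only the standard library's html.parser in place of BeautifulSoup; the sample markup and the names HrefCollector and game_links are made up for illustration. Because the pattern is applied with search here, the leading and trailing .* are not needed in this sketch.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; the real index page is not reproduced here.
SAMPLE_HTML = """
<a href="/perl/chessgame?gid=1001">Game 1</a>
<a href="/perl/chessplayer?pid=42">Player</a>
<a href="/perl/chessgame?gid=1002">Game 2</a>
<a name="top">No href here</a>
"""


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def game_links(html):
    """Return the hrefs that contain 'chessgame?'."""
    collector = HrefCollector()
    collector.feed(html)
    pattern = re.compile(r"chessgame\?")
    return [h for h in collector.hrefs if pattern.search(h)]


print(game_links(SAMPLE_HTML))
```

Anchors without an href are skipped, so the later host + href concatenation never receives None.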
Now write a function that downloads a file for you:
def download_file(url):
    # Use the last path segment, without the query string, as the file name.
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
The final step repeats the previous steps for every game page and passes each file link to the downloader:
host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    file_link = soup.find('a',text=re.compile('.*download.*'))
    file_url = host + file_link.get('href')
    download_file(file_url)
(First you search for the link whose text contains 'download', then you build the full URL by concatenating the host name and the path, and finally you download the file.)
I hope you can use this code without correction!
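The question also asks how urlretrieve works. As a rough sketch: it copies the resource at a URL into a local file and returns the local path together with the response headers. To keep the example runnable without network access, a temporary local file addressed through a file:// URL stands in for a remote PGN; the helper name fetch_to and the file names are made up.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_to(url, destination):
    """Copy the resource at `url` into the file `destination`."""
    local_path, headers = urlretrieve(url, destination)
    return local_path


# Stand-in for a remote PGN: a temporary local file behind a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.pgn"
source.write_text('[Event "Example"]\n1. e4 e5 *\n')

saved = fetch_to(source.as_uri(), str(workdir / "copy.pgn"))
print(Path(saved).read_text())
```

With an http:// URL the call has the same shape; only the first argument changes.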

The accepted answer is fantastic but the task is embarrassingly parallel; there's no need to retrieve these sub-pages and files one at a time. This answer shows how to speed things up.
The first step is to use requests.Session() when sending multiple requests to a single host. Quoting Advanced Usage: Session Objects from the requests docs:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
Next, asyncio, multiprocessing or multithreading are available to parallelize the workload. Each has tradeoffs respective to the task at hand and which you choose is likely best determined by benchmarking and profiling. This page offers great examples of all three.
For the purposes of this post, I'll show multithreading. The GIL shouldn't be much of a bottleneck because the tasks are mostly IO-bound: they spend most of their time waiting for responses to requests that are on the wire. When a thread is blocked on IO, it can yield to a thread parsing HTML or doing other CPU-bound work.
Here's the code:
import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
def download_pgn(task):
    session, host, page, destination_path = task
    response = session.get(host + page)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
def main():
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8
    if not os.path.exists(destination_path):
        os.makedirs(destination_path)
    with requests.Session() as session:
        session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host, page.get("href"), destination_path)
            for page in pages
        ]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)
if __name__ == "__main__":
    main()
I used response.iter_content here which is unnecessary on such tiny text files, but is a generalization so the code will handle larger files in a memory-friendly way.
Results from a rough benchmark (the first request is a bottleneck):
max workers   session?   seconds
1             no         126
1             yes        111
8             no         24
8             yes        22
32            yes        16
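One caveat about the threaded code above: pool.map returns a lazy iterator, and an exception raised inside a worker is only re-raised when its result is retrieved from that iterator. Since the return value is discarded there, a failing raise_for_status() would go unnoticed. The sketch below shows one possible way to surface failures with submit and as_completed. The function fake_download is a hypothetical stand-in for download_pgn so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(name):
    """Hypothetical stand-in for download_pgn; fails for one input."""
    if name == "bad.pgn":
        raise ValueError("simulated HTTP error for " + name)
    return name


def run_all(names, max_workers=4):
    """Run every task and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its input so failures can be reported.
        futures = {pool.submit(fake_download, n): n for n in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                succeeded.append(future.result())
            except Exception as exc:
                failed.append((name, str(exc)))
    return sorted(succeeded), failed


ok, errors = run_all(["a.pgn", "bad.pgn", "b.pgn"])
print(ok)
print(errors)
```

A simpler alternative is to wrap the pool.map call in list(...), which re-raises the first worker exception in the main thread.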

Related

How to speedup BeautifulSoup web scraping project

I am working on a web scraping project that collects prices from a website using several URLs. I have run the following code, but it takes a long time to print the prices. I am using PyCharm on a MacBook Pro 13'' i5 (2020) 1.4 GHz with 8GB RAM, if this can help.
import ssl
import bs4
from urllib.request import Request, urlopen
import json
#to avoid SSL verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#Define the url to monitor
urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']
for i in urls:
    #Open the url to monitor using a new user agent so the website does not block you
    req = Request(
        url=i,
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    #Read the HTML code of the url
    webpage = urlopen(req, context=ctx).read()
    soup = bs4.BeautifulSoup(webpage, "html.parser")
    #Define the HTML element we need to screen and find prices (this time using Javascript)
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    price = int(data['offers']['price'])
    print(price)
Using only one url, the code works, but adding other urls and a simple for loop, it takes a while. How could I speed up the process? Thanks a lot!

You can speed up the processing using multi-threading or multi-processing. This example uses the multiprocessing module (with a Pool of 4 processes) to obtain the prices:
import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])
if __name__ == '__main__':
    urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']
    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            print(url, price)
Prints (for example, the order could vary):
https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/ 920
https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/ 13900
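The price extraction itself can be checked without hitting the site. The code above takes the last application/ld+json script with [-1], which depends on the page layout. The sketch below is one possible alternative that picks the first JSON-LD block containing an "offers" key. It uses only the standard library, and the sample HTML and the names JsonLdCollector and extract_price are assumptions made for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical product page fragment; real pages will differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "WebSite", "name": "Shop"}</script>
<script type="application/ld+json">
{"@type": "Product", "name": "Pendant", "offers": {"price": "920", "priceCurrency": "GBP"}}
</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def extract_price(html):
    """Return the price from the first JSON-LD block with an 'offers' key."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        data = json.loads(block)
        if isinstance(data, dict) and "offers" in data:
            return int(data["offers"]["price"])
    return None


print(extract_price(SAMPLE_HTML))
```

Returning None when no matching block exists lets the caller decide how to handle a page whose structure has changed.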

Scraping Google Scholar with urllib2 instead of requests

I have the simple script below which works just fine for fetching a list of articles from Google Scholar searching for a term of interest.
import urllib
import urllib2
import requests
from bs4 import BeautifulSoup
SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"
def searchScholar(searchStr, limit=10):
    """Search Google Scholar for articles and publications containing terms of interest"""
    url = SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL + "?q=" + urllib.quote_plus(searchStr) + "&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search"
    content = requests.get(url, verify=False).text
    page = BeautifulSoup(content, 'lxml')
    results = {}
    count = 0
    for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
        if count < limit:
            try:
                text = entry.a.text.encode("ascii", "ignore")
                url = entry.a['href']
                results[url] = text
                count += 1
            except:
                pass
    return results
queryStr = "Albert einstein"
pubs = searchScholar(queryStr, 10)
if len(pubs) == 0:
    print "No articles found"
else:
    for pub in pubs.keys():
        print pub + ' ' + pubs[pub]
However, I want to run this script as a CGI application on a remote server, without access to a console, so I cannot install any external Python modules. (I managed to 'install' BeautifulSoup without resorting to pip or easy_install by just copying the bs4 directory to my cgi-bin directory, but this trick did not work with requests because of its large number of dependencies.)
So, my question is: is it possible to use the built-in urllib2 or httplib Python modules instead of requests to fetch the Google Scholar page and then pass it to BeautifulSoup? It should be, because I found some code here which scrapes Google Scholar using just the standard libraries plus BeautifulSoup, but it is rather convoluted. I would prefer a much simpler solution, just by adapting my script to use the standard libraries instead of requests.
Could anyone give me some help?

This code is enough to perform a simple request using urllib2:
def get(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)')
    return urllib2.urlopen(req).read()
If you need to do something more advanced in the future, it will take more code. What requests does is simplify usage compared with the standard libraries.
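For readers on Python 3, where urllib2 was split into urllib.request and urllib.parse, a rough equivalent might look like the sketch below. It only builds the request and does not send it, so it runs offline. The helper name build_request and the shortened query parameters are assumptions, not part of the original script.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"


def build_request(search_str):
    """Build (but do not send) a GET request for a Scholar search."""
    # urlencode escapes the search terms, replacing spaces with '+'.
    query = urlencode({"q": search_str, "hl": "en"})
    req = Request(SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL + "?" + query)
    req.add_header("User-Agent", "Mozilla/5.0")
    return req


req = build_request("Albert einstein")
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req).read()
```

The bytes returned by read() can be passed straight to BeautifulSoup, as with the urllib2 version.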

How to download in python big media links of a web page behind a log in form?

I'm looking for some library or libraries in Python to:
a) log in a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks

You can use the widely used requests module (more than 35k stars on GitHub) and BeautifulSoup. The former transparently handles session cookies, redirections, encodings, compression and more. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for attributes of HTML tags.
A complete example follows, written for Python 3.5.2, for a website that you can scrape without a JavaScript engine (otherwise you can use Selenium). It sequentially downloads the links that have download in their URL.
import shutil
import sys
import requests
from bs4 import BeautifulSoup
""" Requirements: beautifulsoup4, requests """
SCHEMA_DOMAIN = 'https://exmaple.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# These are the name attributes of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
        'login[login]',
        'login[password]']
client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
        KEYS[1]: 'my_username',
        KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
                      data=data,
                      headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
# href=True skips anchors without an href, which would otherwise raise KeyError
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a', href=True)
             if 'download' in tag['href'])
for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code,
                                                     name),
                  file=sys.stderr)
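The memory-friendly part of the example above is shutil.copyfileobj, which reads from a file-like object and writes to another in fixed-size chunks. The following sketch isolates that step; io.BytesIO stands in for request.raw so no network is needed, and the helper name save_stream and the chunk size are assumptions.

```python
import io
import shutil
import tempfile
from pathlib import Path


def save_stream(source, path, chunk_size=64 * 1024):
    """Copy a readable binary stream to `path` in fixed-size chunks."""
    with open(path, "wb") as output:
        shutil.copyfileobj(source, output, chunk_size)
    return Path(path).stat().st_size


# io.BytesIO stands in for request.raw, which is also a file-like object.
payload = b"x" * 200_000
target = Path(tempfile.mkdtemp()) / "media.bin"
written = save_stream(io.BytesIO(payload), target)
print(written)
```

Only one chunk is held in memory at a time, regardless of how large the source is.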

You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, which are media links (.mp3, .mp4, .jpg, etc) in your case.
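As an illustration of that narrowing step, the sketch below filters links by file extension. Note that it uses urllib.parse and posixpath rather than re, so that query strings do not get in the way. The extension set, the sample links and the helper name is_media_link are assumptions.

```python
import posixpath
from urllib.parse import urlsplit

MEDIA_EXTENSIONS = {".mp3", ".mp4", ".jpg"}  # assumed set; adjust as needed


def is_media_link(href):
    """Return True if the URL path ends in one of MEDIA_EXTENSIONS."""
    if not href:
        return False
    path = urlsplit(href).path  # drops the query string and fragment
    return posixpath.splitext(path)[1].lower() in MEDIA_EXTENSIONS


# Hypothetical hrefs, including a None as returned for anchors without href.
links = [
    "/files/song.mp3?token=abc",
    "/about.html",
    None,
    "http://www.example.com/video/clip.MP4",
    "/images/mp3-guide/",
]
media = [link for link in links if is_media_link(link)]
print(media)
```

The None check matters because link.get('href') returns None for anchors that have no href attribute.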
Finally, use the requests module to stream the media files so that they don't take up too much memory, like so:
response = requests.get(url, stream=True) #URL here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
    if chunk: # filter out keep-alive new chunks
        handle.write(chunk)
handle.close()
When the stream parameter of get() is set to True, the content is not downloaded into RAM immediately. Instead, you iterate over the response in chunks of size chunk_size in the loop right after the get() call. Each chunk is written to disk before the next one is fetched, so the whole file is never held in RAM at once.
You will have to put this last chunk of code in a loop if you want to download media of every link in the links list.
You will probably have to end up making some changes to this code to make it work as I haven't tested it for your use case myself, but hopefully this gives a blueprint to work off of.

Scraping excel from website using python with _doPostBack link url hidden

For the last few days I have been trying to scrape the following website (link below), which has a few Excel files and PDFs available in a table. I can do it for the home page successfully, but there are 59 pages in total from which these files have to be scraped. On most websites I have seen so far, a query parameter in the URL changes as you move from one page to another. This site uses a _doPostBack function, which is probably why the URL stays the same on every page. I looked at several solutions and posts that suggest inspecting the parameters of the POST call and reusing them, but I cannot make sense of those parameters (this is the first time I am scraping a website).
Can someone please suggest a resource that can help me write code to move from one page to another using Python? The details are as follows:
Website link - http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx
My current code, which extracts the CAP Excel sheet from the home page (this works perfectly and is provided just for reference):
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import re
import urllib
Base = "http://accord.fairfactories.org/ffcweb/Web"
html = urlopen("http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx")
bs = BeautifulSoup(html)
name = bs.findAll("td", {"class":"column_style_right column_style_left"})
i = 1
for link in bs.findAll("a", {"id":re.compile(r"CAP(?!\w)")}):
    if 'href' in link.attrs:
        name = str(i)+".xlsx"
        a = link.attrs['href']
        b = a.strip("..")
        c = Base+b
        urlretrieve(c, name)
        i = i+1
Please let me know if I have missed anything while providing the information and please don't rate me -ve else I won't be able to ask any questions further

For ASP.NET (.aspx) sites, you need to look for hidden form fields such as __EVENTTARGET and __EVENTVALIDATION and post them with each request. The following gets all the pages using requests with bs4:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin # python 3 use from urllib.parse import urljoin
# All the keys need values set bar __EVENTTARGET, that stays the same.
data = {
"__EVENTTARGET": "gvFlex",
"__VIEWSTATE": "",
"__VIEWSTATEGENERATOR": "",
"__VIEWSTATEENCRYPTED": "",
"__EVENTVALIDATION": ""}
def validate(soup, data):
    for k in data:
        # update post values in data.
        if k != "__EVENTTARGET":
            data[k] = soup.select_one("#{}".format(k))["value"]
def get_all_excel():
    base = "http://accord.fairfactories.org/ffcweb/Web"
    url = "http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx"
    with requests.Session() as s:
        # Add a user agent for each subsequent request.
        s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
        r = s.get(url)
        bs = BeautifulSoup(r.content, "lxml")
        # get links from initial page.
        for xcl in bs.select("a[id*=CAP]"):
            yield urljoin(base, xcl["href"])
        # need to re-validate the post data in our dict for each request.
        validate(bs, data)
        # The value is quoted because $ is not valid in an unquoted CSS attribute value.
        last = bs.select_one('a[href*="Page$Last"]')
        i = 2
        # keep going until the last page button is not visible
        while last:
            # Increase the counter to set the target to the next page
            data["__EVENTARGUMENT"] = "Page${}".format(i)
            r = s.post(url, data=data)
            bs = BeautifulSoup(r.content, "lxml")
            for xcl in bs.select("a[id*=CAP]"):
                yield urljoin(base, xcl["href"])
            last = bs.select_one('a[href*="Page$Last"]')
            # again re-validate for next request
            validate(bs, data)
            i += 1
for x in (get_all_excel()):
    print(x)
If we run it on the first three pages, you can see we get the data you want:
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9965
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9552
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10650
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10086
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10905
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10840
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9229
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11310
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9178
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9614
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9734
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10063
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10871
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9468
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9799
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9278
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12252
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9342
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9966
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11595
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9652
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10271
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10365
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10087
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9967
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11740
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12375
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11643
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10952
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12013
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9810
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10953
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10038
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9664
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12256
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9262
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9210
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9968
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9811
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11610
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9455
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11899
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9766
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10088
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10366
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9393
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9813
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11795
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9814
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12187
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10954
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9556
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11709
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9676
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10251
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10602
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10089
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9908
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10358
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9469
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11333
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9238
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9816
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9817
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10736
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10622
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9394
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9818
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10592
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9395
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11271
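To make the postback mechanism more concrete, the sketch below builds the POST payload for a given page from the hidden inputs of a form. It is a simplified, offline illustration that uses the standard library's html.parser instead of bs4. The sample form, its shortened values and the names HiddenFieldCollector and next_page_payload are assumptions; only the gvFlex target and the Page$N argument format come from the answer above.

```python
from html.parser import HTMLParser

# Hypothetical, heavily shortened ASP.NET form; real values are long strings.
SAMPLE_HTML = """
<form method="post" action="./InspectionReportsEnglish.aspx">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
<input type="text" name="search" value="ignored" />
</form>
"""


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of all hidden <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


def next_page_payload(html, page_number, target="gvFlex"):
    """Build the form data that asks the server for `page_number`."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["__EVENTTARGET"] = target
    payload["__EVENTARGUMENT"] = "Page${}".format(page_number)
    return payload


payload = next_page_payload(SAMPLE_HTML, 2)
print(payload)
```

The hidden values must be re-read from every response, because the server issues a fresh __VIEWSTATE and __EVENTVALIDATION with each page.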

How to get contents of frames automatically if browser does not support frames + can't access frame directly

I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use beautiful soup or mechanize to open that URL, I get "Your browser does not support frames" -- and I get the same thing if I use the copy as curl feature in chrome dev tools.
The standard advice for the "Your browser does not support frames" when using mechanize or beautiful soup is to follow the source of each individual frame and load that frame. But if I do so, I get to an error message that the page is not authorized.
How can I proceed? I guess I could try this in zombie or phantom but I would prefer to not use those tools as I am not that familiar with them.

Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a set of underlying calls to un.org and daccess-ods.un.org that are important and set relevant cookies. This is why you need to maintain requests.Session() and visit several urls before getting access to the pdf.
Here's the complete code:
import re
from urlparse import urljoin  # Python 3: from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'
# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
# get frame links
soup = BeautifulSoup(response.text)
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]
# get header
session.get(header_link, headers={'Referer': URL})
# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text)
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)
# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text)
# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print document_link
# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)
# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
You should probably extract separate blocks of code into functions to make it more readable and reusable.
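As one example of such a function, the meta-refresh step appears twice in the code above and could be factored out roughly as follows. This is a sketch written for Python 3; the name refresh_target and the sample content strings are made up, and the case-insensitive match is an added assumption.

```python
import re
from urllib.parse import urljoin

BASE_ACCESS_URL = "http://daccess-ods.un.org"


def refresh_target(content, base=BASE_ACCESS_URL):
    """Return the absolute URL named in a meta-refresh content string, or None."""
    match = re.search(r"URL=(.*)", content, flags=re.IGNORECASE)
    if match is None:
        return None
    return urljoin(base, match.group(1).strip())


# Hypothetical content attribute values; the real paths will differ.
print(refresh_target("0; URL=/TMP/1234567.html"))
print(refresh_target("no redirect here"))
```

Returning None instead of raising lets the caller stop cleanly when a page has no redirect.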
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
Hope that helps.
