Downloading PDFs with Python?

I'm writing a script that uses a regex to find PDF links on a page and then download them. The script runs and names the files properly in my local directory, but it is not downloading the full PDF file. The PDFs it pulls are only 19 KB each and corrupted, when they should be approximately 15 MB.
import urllib, urllib2, re

url = 'http://www.website.com/Products'
destination = 'C:/Users/working/'

website = urllib2.urlopen(url)
html = website.read()
links = re.findall('.PDF">.*_geo.PDF', html)

for item in links:
    DL = item[6:]                         # strip the leading 'PDF">' part of the match
    DL_PATH = url + '/' + DL              # URL of the PDF
    SV_PATH = destination + DL            # local path to save to
    urllib.urlretrieve(DL_PATH, SV_PATH)
The url variable links to a page that lists all the PDFs. When you click on a PDF link it takes you to 'www.website.com/Products/NorthCarolina.pdf', which displays the PDF in the browser. I'm not sure if, because of this, I should be using a different Python method or module.

You could try something like this:
import requests

links = ['link.pdf']

for link in links:
    book_name = link.split('/')[-1]          # use the last URL segment as the filename
    with open(book_name, 'wb') as book:
        a = requests.get(link, stream=True)  # stream so the whole file is not loaded into memory at once
        for block in a.iter_content(512):    # write in 512-byte chunks
            if not block:
                break
            book.write(block)

You can also use your HTML knowledge (for parsing) and the BeautifulSoup library to find all the PDF links on a webpage and then download them in one pass.
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

html = urlopen(my_url).read()
html_page = bs(html, features="lxml")
After parsing, you can search for <a> tags, since all hyperlinks use them. Once you have all the <a> tags, you can narrow them down by checking whether the href ends with the .pdf extension. Here's a full explanation: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
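A minimal sketch of that approach, assuming Python 3 and using the question's product page as a stand-in URL (relative hrefs are resolved against the page URL):
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup as bs

my_url = 'http://www.website.com/Products'  # the listing page from the question
html = urlopen(my_url).read()
html_page = bs(html, features="lxml")

for a in html_page.find_all('a', href=True):
    href = a['href']
    if href.lower().endswith('.pdf'):
        pdf_url = urljoin(my_url, href)    # resolve relative links against the page URL
        filename = pdf_url.split('/')[-1]  # save under the file's own name
        urlretrieve(pdf_url, filename)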


Python scraping links from buttons with event

Link I want to scrape: https://digital.auraria.edu/work/ns/8fb66c05-0ad2-4e56-8cc7-6ced34d0c126
I'm currently having some trouble scraping the "Download" button on this website to download the PDF file using Python and Beautiful Soup, since normally there's a link
and I can just do
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all("a")
for link in links:
    if 'pdf' in link.get('href'):  # find if the book's pdf link is in there
        i += 1
        response = requests.get(link.get('href'))
        print(f"Retrieving PDF for: {title}")
        write_pdf(pdf_path, response.content)
However, I'm not quite sure what the link for the PDF is here. Do I need to use a headless browser, and how would I be able to extract this link?
The way I found the PDF link is by going to the page and looking at the page source. Then I used the finder tool and searched for PDF and found a meta tag.
<meta name="citation_pdf_url" content="https://dashboard.digital.auraria.edu/downloads/1e0b44c6-cd79-49a3-9eac-0b10d1a4392e"/>
I followed the link and it downloaded a PDF with the same title. In the code below, you can get either the entire tag or just its contents by appending .attrs.get('content').
Required -> pip install bs4 requests
from bs4 import BeautifulSoup
import requests
url = "https://digital.auraria.edu/work/ns/8fb66c05-0ad2-4e56-8cc7-6ced34d0c126"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
pdf_link = soup.find("meta", attrs={'name': "citation_pdf_url"}).attrs.get('content')
print(pdf_link)
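From there, a minimal sketch to save the PDF to disk (the output filename is just a placeholder):
# fetch the PDF found in the meta tag and write it to a local file
pdf_response = requests.get(pdf_link)
with open("book.pdf", "wb") as f:  # "book.pdf" is a hypothetical filename
    f.write(pdf_response.content)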
Good luck and let me know if you face any other issues!
Just scrape the filename and append it to this link. I got the link by actually downloading the file, copying its download address, removing the filename, and adding a different one to test it; it works like a charm. https://storage.googleapis.com/us-fedora/fcrepo.binary.directory/ea/05/2a/ea052a597af18fb6a46c44254e9e596a7e93571f?GoogleAccessId=k8s-storage%40repositories-307112.iam.gserviceaccount.com&Expires=1643411246&Signature=ICo51cFbe3By7JPJol8nLfxcic%2BV%2Bv1uvGYjodATCXJc2I6XWSi7JWC8l%2BM6BTSVFOL8A0YioZOQggY8Afc0JJtiwInkxFHmVjleQ41he3RK5pwF4IwONeuQxcgUXYzd8p94sA5L0YZC6drAFb9mx4AJLwTdKQt7dZh146FmaQYY8ElGT6BpHX2t%2BK31UGP0pC75uFGUq6b3IDK11gPOCSvnrLGSAM1yulE8togDgZmw0BU77nLPkinXSIATCTjlHNxf5aUxlJkg0%2FtSM21b53JFvHGHHCQf8QSKtST4WCBA1up6BVX1YLbGLZXxQ07mf8K7jnQ4U%2FXfnw6IoTpQxw%3D%3D&response-content-disposition=attachment%3B%20filename%3D
So, according to your example, the link would look like this: https://storage.googleapis.com/us-fedora/fcrepo.binary.directory/ea/05/2a/ea052a597af18fb6a46c44254e9e596a7e93571f?GoogleAccessId=k8s-storage%40repositories-307112.iam.gserviceaccount.com&Expires=1643411246&Signature=ICo51cFbe3By7JPJol8nLfxcic%2BV%2Bv1uvGYjodATCXJc2I6XWSi7JWC8l%2BM6BTSVFOL8A0YioZOQggY8Afc0JJtiwInkxFHmVjleQ41he3RK5pwF4IwONeuQxcgUXYzd8p94sA5L0YZC6drAFb9mx4AJLwTdKQt7dZh146FmaQYY8ElGT6BpHX2t%2BK31UGP0pC75uFGUq6b3IDK11gPOCSvnrLGSAM1yulE8togDgZmw0BU77nLPkinXSIATCTjlHNxf5aUxlJkg0%2FtSM21b53JFvHGHHCQf8QSKtST4WCBA1up6BVX1YLbGLZXxQ07mf8K7jnQ4U%2FXfnw6IoTpQxw%3D%3D&response-content-disposition=attachment%3B%20filename%3DIR00000195_00001.pdf

I want to download many files of the same file extension with either Wget or Python, from a given website link

I would like to download files of the file types .utu and .zip from the following Microsoft Flight Simulator AI traffic websites:
http://web.archive.org/web/20050315112710/http://www.projectai.com:80/libraries/acfiles.php?cat=6 (Current Repaints)
http://web.archive.org/web/20050315112940/http://www.projectai.com:80/libraries/acfiles.php?cat=1 (Vintage Repaints)
On each of those pages there are subcategories (Airbus, Boeing, etc.) for the AI aircraft types, and the repaint .zip file choices are shown when you click on the aircraft image.
The folder name then becomes http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/repaints.php?ac=number&cat=(number). Then, when you click the download, repaints.php? becomes download.php?fileid=(4 digit number).
What do I need to type to download all the .zip files at once? Clicking on them individually to download would take ages.
I would also like to download all files with the .utu extension, for Flight 1 Ultimate Traffic AI aircraft repaints, from the following webpage:
http://web.archive.org/web/20060512161232/http://ultimatetraffic.flight1.net:80/utfiles.asp?mode=1&index=0
When you click to download an Ultimate Traffic aircraft texture, the last folder path becomes /utfiles.asp?mode=download&id=F1AIRepaintNumbers-Numbers-Numbers.utu, and I would like to do the same as for the other websites.
I used the following code in Python 2.7.9, found in a video on YouTube, inserting my info to achieve my aim, but it unsurprisingly didn't work when I ran it (timeouts and errors, etc.), probably due to its simplicity:
import requests
from bs4 import BeautifulSoup
import wget

def download_links(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('a'):
        href = link.get('href')
        print(href)
        wget.download(href)

download_links('http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6')
Update: try this version; it should now download all the zip files from all the links on the first page:
from bs4 import BeautifulSoup
import requests, zipfile, io

def get_zips(zips_page):
    # print(zips_page)
    zips_source = requests.get(zips_page).text
    zip_soup = BeautifulSoup(zips_source, "html.parser")
    for zip_file in zip_soup.select('a[href*="download.php?fileid="]'):
        zip_url = link_root + zip_file['href']
        print('downloading', zip_file.text, '...',)
        r = requests.get(zip_url)
        with open(zip_file.text, 'wb') as zipFile:
            zipFile.write(r.content)

def download_links(root, cat):
    url = ''.join([root, cat])
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for zips_suffix in soup.select('a[href*="repaints.php?ac="]'):
        # get_zips(root, zips_suffix['href'])
        next_page = ''.join([root, zips_suffix['href']])
        get_zips(next_page)

link_root = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/'
category = 'acfiles.php?cat=6'
download_links(link_root, category)
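For the .utu files on the Flight 1 page, a similar sketch could filter links by the mode=download pattern described in the question; the selector and the filename handling below are assumptions, not something tested against that archived page:
from bs4 import BeautifulSoup
import requests

UT_ROOT = 'http://web.archive.org/web/20060512161232/http://ultimatetraffic.flight1.net:80/'

def download_utu_files(index_page):
    soup = BeautifulSoup(requests.get(index_page).text, "html.parser")
    # assumed pattern: download links look like utfiles.asp?mode=download&id=....utu
    for a in soup.select('a[href*="mode=download"]'):
        utu_url = UT_ROOT + a['href'].lstrip('/')
        filename = a['href'].split('id=')[-1]  # crude filename guess taken from the id parameter
        r = requests.get(utu_url)
        with open(filename, 'wb') as f:
            f.write(r.content)

download_utu_files(UT_ROOT + 'utfiles.asp?mode=1&index=0')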

Download all the files from a website

I need to download all the files under these links, where only the suburb name keeps changing in each link.
Just as a reference:
https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb
All the files under this search link:
https://www.data.vic.gov.au/data/dataset?q=2014+town+and+community+profile
Any possibilities?
Thanks :)
You can download a file like this:
import urllib2
response = urllib2.urlopen('http://www.example.com/file_to_download')
html = response.read()
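If you want to save what you fetched, a minimal follow-up (the filename here is just an example):
# write the downloaded bytes to a local file
with open('downloaded_file.pdf', 'wb') as f:
    f.write(html)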
To get all the links in a page
from bs4 import BeautifulSoup
import requests

r = requests.get("http://site-to.crawl")
data = r.text
soup = BeautifulSoup(data, "html.parser")  # pass an explicit parser to avoid the "no parser specified" warning
for link in soup.find_all('a'):
    print(link.get('href'))
You should first read the HTML, parse it using Beautiful Soup, and then find links according to the file type you want to download. For instance, if you want to download all PDF files, you can check whether the links end with the .pdf extension.
There's a good explanation and code available here:
https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
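Combining the two snippets, a minimal sketch that keeps only the links with the extension you want and downloads each match; the start URL is the Thornbury page from the question, and the .xlsx extension is just a guess at what the dataset offers:
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

page_url = "https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb"
wanted_ext = ".xlsx"  # hypothetical; change to whichever file type the dataset actually offers

soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.lower().endswith(wanted_ext):
        file_url = urljoin(page_url, href)  # handle relative links
        r = requests.get(file_url)
        with open(file_url.split("/")[-1], "wb") as f:
            f.write(r.content)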

Downloading target link html in a text file (Beautiful Soup - Python3)

I am completely new to Python and am studying web crawling.
I am trying to download each individual target link on a page.
So far, I have succeeded in extracting all the target URLs I need, but I have no idea how to download each target's HTML text into a text file.
Can someone give me a general idea?
url = ""
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
link1 = soup2.find_all('a', href=re.compile("drupal_lists"))
for t in link1:
print(t.attrs['href'])
Within your for loop, fetch each link's URL using the requests lib and write the contents to a file. Something like:
link_data = requests.get(t.attrs['href']).text
with open('file_to_write.out', 'w') as f:
    f.write(link_data)
You may want to change the filename for each link.
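A minimal sketch of the whole loop, writing each link's HTML to its own text file (the numbered filename scheme is just one possible choice):
import re

import requests
from bs4 import BeautifulSoup

url = ""  # fill in the listing page URL (left blank in the question as well)
soup = BeautifulSoup(requests.get(url).text, "lxml")

for i, t in enumerate(soup.find_all('a', href=re.compile("drupal_lists"))):
    link_data = requests.get(t.attrs['href']).text
    # one output file per link: page_0.txt, page_1.txt, ...
    with open('page_{}.txt'.format(i), 'w') as f:
        f.write(link_data)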

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have this search engine with a list of different URLs.
I want to get only the HTML tag containing a video embed URL, and then get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video or object, or the exact link of the video?
I need it to put in my iframe. I will integrate the Python with my PHP: getting the link of the video and outputting it with Python, then echoing it in my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib lib like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you just change details in the link to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do it for multiple URLs without having to parse, just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing: look until you find the link, then get the tag it's in, then the attribute.
for
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
The iframe because the link is in the iframe tag, and .get("src") because the link is the src attribute.
You can try the next one yourself, since you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
import urllib2
data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and tag.has_attr('href') and '.mp4' in tag['href'])]
print links
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; join them with the domain and you get the absolute path.
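A minimal sketch of that last step (Python 2, to match the snippet above), using urlparse.urljoin to turn the relative links into absolute URLs:
import urlparse  # in Python 3 this lives in urllib.parse

base = 'https://archive.org'
relative_links = ['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4']

# join each relative path with the domain to get an absolute URL
absolute_links = [urlparse.urljoin(base, link) for link in relative_links]
print absolute_links
# ['https://archive.org/download/20070519_detroit2/20070519_detroit_jungleearth.mp4']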
