I'm trying to take some links from a text file and download the pages onto my computer. However, I would like those downloaded pages to look exactly the same as they do in the browser. The wiki pages I downloaded are not the same: they don't display some of the pictures, and it's mostly just text when I open them.
How can I achieve what I want? I saw some things with Scrapy and Beautiful Soup, but I'm not experienced with them.
My code:
import urllib.request

links = []
fr = open('wiki_linkovi', 'r')
fw1 = open('imena_elemenata.txt', 'w')
link = fr.readlines()
j = 0
for i in link:
    base = 'https://en.wikipedia.org/wiki/'
    start = i.find(base) + len(base)
    end = i.find('\n', start)
    ime = i[start:end]  # article name, taken from the URL
    fw1.write(ime + '\n')
    response = urllib.request.urlopen(i)  # save starts here-----
    webContent = response.read()
    f = open(ime + '.html', 'wb')
    f.write(webContent)
    f.close()
    j = j + 1
    print(str(j) + '. link\n')
So yeah, in short, I'd like to download the webpage completely.
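The HTML file alone doesn't contain the pictures: they are separate resources that the page merely references, often via relative or protocol-relative URLs, so saving only the HTML gives you mostly text. A command-line tool like wget --page-requisites --convert-links handles this out of the box; in Python, here is a hedged sketch of also fetching a page's images (an illustration with assumed names, not code from the thread; it does not rewrite the image links inside the saved HTML):

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = 'https://en.wikipedia.org/wiki/Hydrogen'  # hypothetical example link
html = urllib.request.urlopen(page_url).read()
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img'):
    src = img.get('src')
    if not src:
        continue
    src = urljoin(page_url, src)  # resolves //upload.wikimedia.org/... references
    filename = src.split('/')[-1]
    with urllib.request.urlopen(src) as resp:
        with open(filename, 'wb') as f:
            f.write(resp.read())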
The original code is here: https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script that collects pictures from a website, to get better at web scraping.
I tried to get images from "https://500px.com/editors".
The first error was:
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did:
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this:
import requests
from bs4 import BeautifulSoup
import urllib.request
import random

url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a", {"class": "photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1, 500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?
Actually, the website is loaded via JavaScript, which fetches the data with an XHR request to the API used below.
So you can reach it directly via the API.
Note that you can increase the rpp=50 parameter to any number you want in order to get more than 50 results.
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to write it directly:
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds several image sizes, so you can choose your preferred one and save it; here I've taken the big one.
Saving directly:
import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        # "filename=" is 9 characters, so this assumes a header of the
        # form Content-Disposition: filename=<name>.jpg
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
Looking at the page you're trying to scrape, I noticed something: the data doesn't appear to load until a few moments after the page itself finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page because it does not run the JavaScript on the pages it pulls. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag, you'll see it's actually a templating placeholder used by JS UI frameworks.
Your options now are either to find out which APIs they're calling to get this data (check the network tab in your browser's inspector; if you're lucky, they may not require authentication) or to use a tool that runs JS on pages. One tool often recommended for this is Selenium, though I've never used it, so I'm not fully aware of its capabilities; I imagine it would noticeably increase the complexity of what you're trying to do.
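For completeness, a minimal sketch of the Selenium route (an illustration under assumptions, not tested against 500px: it assumes Selenium plus a matching ChromeDriver are installed, and that the rendered links match the a.photo_link selector from the original script):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes ChromeDriver is on your PATH
driver.get("https://500px.com/editors")
driver.implicitly_wait(10)  # give the JS framework time to render the gallery
for link in driver.find_elements(By.CSS_SELECTOR, "a.photo_link"):
    print(link.get_attribute("href"))
driver.quit()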
I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code, but the output that comes back does not seem to be properly decoded:
import requests

link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
r.text
The output looks like below:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get HTML. I also tried with BeautifulSoup, but it does not decode it either. I hope someone can help. Thank you, BR
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the binary data.
PDF files are not easy to deal with programmatically, but you might (for example) save the response to a file:
import requests

link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)
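If what you're actually after is the text inside the PDF (there is no HTML to get), a dedicated library can extract it. A minimal sketch using pypdf, assuming it is installed (pip install pypdf) and the file was saved as above:

from pypdf import PdfReader

reader = PdfReader('pdf.pdf')
for page in reader.pages:
    print(page.extract_text())  # extraction quality depends on how the PDF was produced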
I'm working on a program that downloads data from a series of URLs, like this:
https://server/api/getsensordetails.xmlid=sesnsorID&username=user&password=password
The program goes through a list of IDs (about 2,500) and requests each URL. I tried to do it using the following code:
import webbrowser
webbrowser.open(url)
but this code opens each URL in the browser and asks me to confirm the download. I need it to simply download the files without opening a browser, and certainly without having to confirm each one.
Thanks for everything.
You can use the Requests library:

import requests

print('Beginning file download with requests')

url = 'http://PathToFile.jpg'
r = requests.get(url)
with open('pathOfFileToReceiveDownload.jpg', 'wb') as f:
    f.write(r.content)
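Applied to your case, a hedged sketch of the loop over the sensor IDs (the ID list is hypothetical, and the '?' before the query string is an assumption, since the example URL in the question omits it):

import requests

sensor_ids = ['1001', '1002', '1003']  # hypothetical; read your ~2500 IDs from your own list

for sensor_id in sensor_ids:
    url = ('https://server/api/getsensordetails.xml'
           '?id={}&username=user&password=password'.format(sensor_id))
    r = requests.get(url)
    with open('sensor_{}.xml'.format(sensor_id), 'wb') as f:
        f.write(r.content)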
I download a bunch of PDF files and archive them.
Most of the documents work fine, but I have a problem with one.
The link to the document which doesn't work is:
https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf
When I download it manually in the browser, it works fine.
I tried two different approaches with Python to download it.
import requests

response = requests.get(
    'https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf',
    stream=True)
with open('test.pdf', 'wb') as r:
    for chunk in response.iter_content(2000):
        r.write(chunk)
Second approach:
def pdfDownload(url):
    response = requests.get(url)
    expdf = response.content
    egpdf = open('test.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
In both cases I get an error message when I try to open it afterwards.
You need to replace your URL with this one:
https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf?switchLocale=y&siteEntryPassthrough=true
The siteEntryPassthrough=true parameter appears to skip the site's entry/disclaimer page; without it, the server sends back that HTML page instead of the PDF, which is why the saved file won't open.
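For example, here is the second approach from the question with the adjusted URL (a sketch, not re-verified against the site):

import requests

url = ('https://www.ishares.com/de/professionelle-anleger/de/literature/fact-sheet/'
       'susm-ishares-msci-em-sri-ucits-etf-fund-fact-sheet-de-de.pdf'
       '?switchLocale=y&siteEntryPassthrough=true')
response = requests.get(url)
with open('test.pdf', 'wb') as f:
    f.write(response.content)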
I'm working on making a PDF web scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a URL and then get the PDFs and save them in a directory on my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on Stack Overflow seem to be helping me either.
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import shutil

bs = BeautifulSoup

url = input("Enter the URL you want to scrape from: ")
print("")
suffix = ".pdf"
link_list = []

def getPDFs():
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)
    #for link in soup.find_all('a'):  # Finds all links
    #    if suffix in str(link):  # If the link ends in .pdf
    #        link_list.append(link.get('href'))
    #print(link_list)
    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")

getPDFs()
Originally, I had gotten all of the links to the PDFs but did not know how to download them; the code for that is now commented out.
Now I've gotten to the point where I'm trying to download just one PDF, and a PDF does get downloaded, but it's a 0 KB file.
If it's of any use, I'm using Python 3.4.2
If this is something that does not require being logged in, you can use urlretrieve():
from urllib.request import urlretrieve

for link in link_list:
    # pass a filename as the second argument; without it, urlretrieve()
    # saves to a temporary file rather than your working directory
    urlretrieve(link, link.split('/')[-1])
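Putting that together with the commented-out link collection from the question, a hedged end-to-end sketch (it resolves possibly-relative hrefs with urljoin, and avoids f-strings so it also runs on Python 3.4):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlretrieve

url = input("Enter the URL you want to scrape from: ")
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.find_all('a'):
    href = link.get('href') or ''
    if href.endswith('.pdf'):
        full_url = urljoin(url, href)   # resolve relative links against the page URL
        filename = full_url.split('/')[-1]
        urlretrieve(full_url, filename)
        print("Saved " + filename)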