I want to download the ipranges.json (which is updated weekly) from https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
I have this Python code, but it keeps running forever.
import wget
URL = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
response = wget.download(URL, "ips.json")
print(response)
How can I download the JSON file in Python?
Because https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 is a link that triggers a JavaScript download, your code only downloads the page itself, not the file.
If you open the downloaded file, you will see HTML source rather than JSON.
Since the real file URL changes over time, we have to scrape it in a generic way.
For convenience I will not use wget; the two libraries used here are requests, to request the page and download the file, and BeautifulSoup, to parse the HTML.
# pip install requests
# pip install bs4
import requests
from bs4 import BeautifulSoup

# request the confirmation page
URL = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
page = requests.get(URL)

# parse the HTML to get the real download link
soup = BeautifulSoup(page.content, "html.parser")
link = soup.find('a', {'data-bi-containername': 'download retry'})['href']

# download the file and save it as azure_ips.json
file_download = requests.get(link)
with open("azure_ips.json", "wb") as f:
    f.write(file_download.content)
Related
I'm new to Python and would like your advice on an issue I've encountered recently. I'm doing a small project where I try to scrape a comic website to download a chapter (pictures). However, when printing out the page content for testing (because I tried to use BeautifulSoup.select() and got no result), it only showed a single line of HTML:
'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'
Any help would be really appreciated.
from requests_html import HTMLSession
session = HTMLSession()
res = session.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
res.html.render()
print(res.content)
I also tried this, but the result was the same.
import requests, bs4
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get(url, headers={"User-Agent": "Requests"})
res.raise_for_status()
# soup = bs4.BeautifulSoup(res.text, "html.parser")
# onePiece = soup.select(".page-chapter")
print(res.content)
Update: I installed Docker and Splash (on Windows 11) and it worked. I've included the updated code below. Thanks Franz and others for your help.
import os
import requests, bs4

os.makedirs("OnePiece", exist_ok=True)
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"

# ask the local Splash instance to render the page (JavaScript included)
res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 5})
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
onePiece = soup.find_all("img", class_="lazy")
for element in onePiece:
    imageLink = "https:" + element["data-cdn"]
    res = requests.get(imageLink)
    imageFile = open(os.path.join("OnePiece", os.path.basename(imageLink)), "wb")
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()
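For reference: the code above assumes a Splash instance is already listening on localhost:8050; a common way to start one (as mentioned in the update) is `docker run -p 8050:8050 scrapinghub/splash`.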
import urllib.request
request_url = urllib.request.urlopen('https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html')
print(request_url.read())
This will return the HTML code of the page.
By the way, that HTML loads several images; you will need a regex (or an HTML parser) to track down those image URLs and download them, as sketched below.
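A minimal sketch of that regex approach, assuming the data-cdn attributes actually appear in the HTML returned without JavaScript rendering (which, per the next answer, may not be the case for this site):

import os
import re
import urllib.request

url = 'https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html'
html = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')

# pull the lazily loaded image URLs out of the raw HTML; the data-cdn
# attribute name is borrowed from the Splash-based answer above
img_urls = re.findall(r'data-cdn="([^"]+)"', html)

for img_url in img_urls:
    full_url = "https:" + img_url if img_url.startswith("//") else img_url
    urllib.request.urlretrieve(full_url, os.path.basename(full_url))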
This response means the site expects a JavaScript-capable renderer that reloads the page using that cookie, so to get the content some workaround must be added.
I commonly use Splash, Scrapinghub's render engine, and adding a small wait to the render call is usually enough for all the content to load correctly. Other tools that render pages the same way are Selenium for Python or Puppeteer in JS.
Links for Splash and Puppeteer
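For illustration, a minimal sketch of the Selenium route mentioned above (assuming Chrome and the selenium package are installed; the 5-second wait is arbitrary):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
time.sleep(5)  # give the anti-bot cookie/JavaScript reload time to finish

# parse the fully rendered page source with the same selector as the Splash answer
soup = BeautifulSoup(driver.page_source, "html.parser")
images = soup.find_all("img", class_="lazy")
print(len(images))

driver.quit()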
I'm looking for some library or libraries in Python to:
a) log in to a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the widely used requests module (more than 35k stars on GitHub) together with BeautifulSoup. The former transparently handles session cookies, redirections, encodings, compression and more. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for properties of HTML tags.
Below is a complete example in Python 3.5.2 for a web site that you can scrape without a JavaScript engine (otherwise you can use Selenium), downloading sequentially the links that contain download in their URL.
import shutil
import sys

import requests
from bs4 import BeautifulSoup

""" Requirements: beautifulsoup4, requests """

SCHEMA_DOMAIN = 'https://example.com'
URL = SCHEMA_DOMAIN + '/house.php/'  # this is the log-in URL

# here are the name properties of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
        'login[login]',
        'login[password]']

client = requests.session()

# fetch the log-in page to read the CSRF token from its hidden input field
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
        KEYS[1]: 'my_username',
        KEYS[2]: 'my_password'}

# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
                      data=data,
                      headers=dict(Referer=URL))

# collect every link whose href contains "download"
soup = BeautifulSoup(request.text, features="html.parser")
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a')
             if 'download' in tag['href'])

# stream each file straight to disk
for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code, name),
                  file=sys.stderr)
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, keeping only the media links (.mp3, .mp4, .jpg, etc.) in your case, as in the sketch below.
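A minimal sketch of that filtering step (the extension list and the media_links name are only examples; adjust them to the media types you actually need):

import re

# keep only hrefs that end in a known media extension
media_pattern = re.compile(r'\.(mp3|mp4|jpg|jpeg|png)$', re.IGNORECASE)
media_links = [link for link in links if link and media_pattern.search(link)]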
Finally, use the requests module to stream the media files so that they don't take up too much memory, like so:
response = requests.get(url, stream=True)  # URL here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
    if chunk:  # filter out keep-alive new chunks
        handle.write(chunk)
handle.close()
When the stream attribute of get() is set to True, the content does not start downloading into RAM immediately; instead the response behaves like an iterable, which you can iterate over in chunks of size chunk_size in the loop right after the get() call. Each chunk is written to disk before the next one is fetched, so the full file is never held in RAM.
You will have to put this last chunk of code in a loop if you want to download the media from every link in the links list (a sketch follows below).
You will probably end up having to make some changes to this code to make it work, as I haven't tested it for your use case myself, but hopefully it gives you a blueprint to work from.
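For example, a rough sketch of that loop, combining the filtering and streaming steps above (deriving the target file name from the URL is just an assumption):

import os
import requests

for media_url in media_links:
    target_path = os.path.basename(media_url)  # assumption: the last URL segment is a usable file name
    response = requests.get(media_url, stream=True)
    with open(target_path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=512):
            if chunk:  # filter out keep-alive chunks
                handle.write(chunk)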
I'm writing a program/script in Python 3. I know how to download a single file from a URL, but I need to download a whole folder, unzip the files and merge the text files.
Is it possible to download all the files FROM HERE to a new folder on my computer with Python? I'm using urllib to download single files; can anyone give an example of how to download the whole folder from the link above?
Install bs4 and requests, then you can use code like this:
import bs4
import requests

url = "http://bossa.pl/pub/metastock/ofe/sesjaofe/"
r = requests.get(url)
data = bs4.BeautifulSoup(r.text, "html.parser")
for l in data.find_all("a"):
    r = requests.get(url + l["href"])
    print(r.status_code)
Then you have to save the data of each request into your directory, for example as sketched below.
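A minimal sketch of that saving step (assuming most hrefs on the listing page point to downloadable files; the out_dir name and the skipping of sort/parent links are assumptions you may need to adjust):

import os

import bs4
import requests

url = "http://bossa.pl/pub/metastock/ofe/sesjaofe/"
out_dir = "sesjaofe"  # hypothetical target folder
os.makedirs(out_dir, exist_ok=True)

data = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
for l in data.find_all("a"):
    href = l.get("href", "")
    if not href or href.startswith("?") or href.startswith("/"):
        continue  # skip sort links and the parent-directory entry
    r = requests.get(url + href)
    if r.status_code == 200:
        with open(os.path.join(out_dir, href), "wb") as f:
            f.write(r.content)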
I have written a script that scrapes a URL. It works fine on Linux, but I am getting an HTTP 503 error when running it on Windows 7, as if the URL itself had some issue.
I am using Python 2.7.11.
Please help.
Below is the script:
import sys      # Used to add the BeautifulSoup folder the import path
import urllib2  # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from bs4 import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0, 1000):
        url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        file = open("parseddata.txt", "wb")
        for cite in soup.findAll('cite'):
            print cite.text
            file.write(cite.text + "\n")
        # file.flush()
        # file.close()
When I run it on Windows 7, the cmd throws an HTTP 503 error stating that the issue is with the URL.
The URL works fine on Linux. In case the URL is actually wrong, please suggest alternatives.
Apparently with Python 2.7.2 on Windows, any time you send a custom User-agent header, urllib2 doesn't send that header. (source: https://stackoverflow.com/a/8994498/6479294).
So you might want to consider using requests instead of urllib2 on Windows:
import requests
# ...
page = requests.get(url)
soup = BeautifulSoup(page.text)
# etc...
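A slightly fuller, untested sketch of that substitution, keeping the custom User-agent from the original script (Google may still throttle or block repeated automated queries):

import requests
from bs4 import BeautifulSoup

# fetch one Google results page with an explicit User-Agent header
url = "http://www.google.com/search?q=site:theknot.com/us/&start=0"
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.text, "html.parser")

# print the URLs that Google shows inside <cite> tags
for cite in soup.findAll('cite'):
    print(cite.text)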
EDIT: Also a very good point to be made is that Google may be blocking your IP - they don't really like bots making 100-odd requests sequentially.
Is there a way I can click on the most recent URL on a web page, and then a URL within it, to download an exe file in Python?
I know how to download files from a static URL, but what about changing URLs?
Note: I want to go to the most recent URL out of all the URLs, then click a URL within it, and finally download the file.
Thanks in advance!
I used BeautifulSoup to accomplish this as suggested by ZZY. Thanks ZZY. Basically, we can do something like this:
page = self.authorizedopen(username, password, url)
text = page.read()
page.close()

soup = BeautifulSoup(text)
data = ''
for tag in soup.findAll('a', href=True):
    data = tag['href']
return url + '/' + data
And I kept manipulating the URL this way to reach the page I wanted, then used plain urllib2 to download the required file (a minimal sketch follows).
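A minimal sketch of that last download step (Python 2 style, to match the urllib2 mentioned above; the exe_url variable and the derived file name are assumptions):

import urllib2

# exe_url is whatever URL the link-following code above finally produced
response = urllib2.urlopen(exe_url)
with open(exe_url.rsplit('/', 1)[-1], 'wb') as out_file:
    out_file.write(response.read())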