I was using BeautifulSoup (bs4) in Python to download a wallpaper from nmgncp.com.
However, the code downloads only a 16 KB file, whereas the full image is around 300 KB.
Please help me. I have even tried the wget.download method.
PS: I am using Python 3.6 on Windows 10.
Here is my code:
from bs4 import BeautifulSoup
import requests
import datetime
import time
import re
import wget
import os
url='http://www.nmgncp.com/dark-wallpaper-1920x1080.html'
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
a = soup.findAll('img')[0].get('src')
newurl='http://www.nmgncp.com/'+a
print(newurl)
response = requests.get(newurl)
if response.status_code == 200:
    with open("C:/Users/KD/Desktop/Python_practice/newwww.jpg", 'wb') as f:
        f.write(response.content)
The source of your problem is a protection mechanism: the image URL requires a referer header, otherwise the request redirects to the HTML page.
Fixed source code:
from bs4 import BeautifulSoup
import requests
import datetime
import time
import re
import wget
import os
url='http://www.nmgncp.com/dark-wallpaper-1920x1080.html'
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
a = soup.findAll('img')[0].get('src')
newurl='http://www.nmgncp.com'+a
print(newurl)
response = requests.get(newurl, headers={'referer': newurl})
if response.status_code == 200:
    with open("C:/Users/KD/Desktop/Python_practice/newwww.jpg", 'wb') as f:
        f.write(response.content)
First of all, http://www.nmgncp.com/dark-wallpaper-1920x1080.html is an HTML document. Second, when you try to download an image by direct URL (like http://www.nmgncp.com/data/out/95/4351795-dark-wallpaper-1920x1080.jpg), it will also redirect you to an HTML document. This is most probably because the hoster (nmgncp.com) does not want to provide direct links to its images. It can check whether the image was requested directly by looking at the HTTP referer and deciding whether it is valid. So in this case you have to put in some more effort to make the hoster think you are a valid caller of direct URLs.
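One way to notice that this protection has kicked in (an assumption about this host, not something stated above) is to inspect the Content-Type of the response: a redirect back to the gallery serves text/html rather than image/jpeg. A minimal sketch:

```python
def looks_like_image(content_type):
    # A redirect back to the gallery page yields text/html instead of image/jpeg
    return content_type.startswith("image/")

# Hypothetical usage with the requests response from the fixed code above:
# resp = requests.get(newurl, headers={'referer': newurl})
# if not looks_like_image(resp.headers.get('Content-Type', '')):
#     print('Got an HTML page instead of the image; check the referer header')
```

This guards against silently saving a 16 KB HTML page with a .jpg extension.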
I want to download the zip file at this link. I tried various methods, but I couldn't do it.
url = "https://www.cms.gov/apps/ama/license.asp?file=http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Medicare_National_HCPCS_Aggregate_CY2017.zip"
# downloading with requests
# import the requests library
import requests
# download the file contents in binary format
r = requests.get(url)
# open method to open a file on your system and write the contents
with open("minemaster1.zip", "wb") as code:
    code.write(r.content)
# downloading with urllib
# import the urllib.request module
import urllib.request
# Copy a network object to a local file
urllib.request.urlretrieve(url, "minemaster.zip")
Can anybody help me resolve this issue?
They're using an accept/decline mechanism, so you'll need to add these parameters to the URL:
url = 'http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Medicare_National_HCPCS_Aggregate_CY2017.zip?agree=yes&next=Accept'
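Rather than hand-editing the query string, the same URL can be built with `urllib.parse.urlencode`, which also takes care of escaping (a sketch, not part of the original answer):

```python
from urllib.parse import urlencode

base = ('http://download.cms.gov/Research-Statistics-Data-and-Systems/'
        'Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/'
        'Downloads/Medicare_National_HCPCS_Aggregate_CY2017.zip')

# Parameters the license page would normally set when you click "Accept"
params = {'agree': 'yes', 'next': 'Accept'}
url = base + '?' + urlencode(params)
```

The resulting `url` is the same string as in the answer above and can be passed to `requests.get` or `urlretrieve` as before.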
I've been trying to write a function that receives a list of URLs and downloads each image from each URL into a given folder. I understand that I am supposed to be using the urllib library, but I am not sure how.
the function should start like this :
def download_images(img_urls, dest_dir):
I don't even know how to start, and could only find information online about downloading an image, but not into a specific folder. If anyone can help me understand how to do the above, it would be wonderful.
Thank you in advance :)
Try this:
import urllib.request
urllib.request.urlretrieve('http://image-url', '/dest/path/file_name.jpg')
You can use requests library, for example:
import requests
image_url = 'https://jessehouwing.net/content/images/size/w2000/2018/07/stackoverflow-1.png'
try:
    response = requests.get(image_url)
except requests.exceptions.RequestException:
    print('Error')
else:
    if response.status_code == 200:
        with open('stackoverflow-1.png', 'wb') as f:
            f.write(response.content)
Here is a simple solution for your problem: it uses urllib.request.urlretrieve to download the image from each URL in your list img_urls, and os.path.basename to get the file name from the URL, so you can save each image under its original name in dest_dir.
from urllib.request import urlretrieve
import os
def download_images(img_urls, dest_dir):
    for url in img_urls:
        urlretrieve(url, dest_dir + os.path.basename(url))
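One caveat with the snippet above: the string concatenation `dest_dir + os.path.basename(url)` breaks if `dest_dir` lacks a trailing separator. A variant using `os.path.join` (and creating the folder first) is safer; this is a sketch, not part of the original answer:

```python
import os
from urllib.request import urlretrieve

def target_path(url, dest_dir):
    # os.path.join works whether or not dest_dir ends with a separator
    return os.path.join(dest_dir, os.path.basename(url))

def download_images(img_urls, dest_dir):
    os.makedirs(dest_dir, exist_ok=True)  # create the folder if it does not exist
    for url in img_urls:
        urlretrieve(url, target_path(url, dest_dir))
```

`target_path('http://example.com/pics/cat.jpg', '/tmp/imgs')` gives `/tmp/imgs/cat.jpg` with or without a trailing slash on the directory.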
I have a URL which, upon loading, will automatically download a CSV to your machine.
I am trying to do this automatically in Python, and to control what the downloaded file is named. Here is my current code:
import os
import sys
import urllib.request
URL = "https://www.google.com/url?q=https%3A%2F%2Fbasketballmonster.com%2FDaily.aspx%3Fv%3D2%26exportcsv%3DXnZZUZaDa0E296JhVEGWbs8HRGOXsEkeJKs2towTT%2Fw%3D&sa=D&sntz=1&usg=AFQjCNHYm9T_QIZvEJ8qIKfyXQuZb4HPVA"
response = urllib.request.urlopen(URL)
URL2 = response.geturl()
urllib.request.urlretrieve(URL2, "file2.csv")
For the URL:
https://www.google.com/url?q=https%3A%2F%2Fbasketballmonster.com%2FDaily.aspx%3Fv%3D2%26exportcsv%3DXnZZUZaDa0E296JhVEGWbs8HRGOXsEkeJKs2towTT%2Fw%3D&sa=D&sntz=1&usg=AFQjCNHYm9T_QIZvEJ8qIKfyXQuZb4HPVA
(clicking that downloads a CSV to disk)
However, the downloaded CSV contains HTML markup instead of the actual data.
Any ideas on a solution?
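A likely cause (an assumption, since no answer is recorded here): `urlopen` fetches Google's wrapper page itself, and `geturl()` still points at google.com because the hop to basketballmonster.com is not an HTTP redirect. The real destination is carried in the `q=` query parameter, which can be extracted with `urllib.parse`:

```python
from urllib.parse import parse_qs, urlparse

def unwrap_google_url(url):
    # Google's /url wrapper carries the true destination in the q= parameter;
    # parse_qs also percent-decodes it for us
    return parse_qs(urlparse(url).query)['q'][0]
```

Passing the unwrapped URL to `urllib.request.urlretrieve(unwrap_google_url(URL), "file2.csv")` should then fetch the CSV directly (an untested assumption about the target site).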
I'm trying to programmatically access a website:
from robobrowser import RoboBrowser
import sys
browser = RoboBrowser(history=True)
browser.open('https://test.com/login')
loginForm = browser.get_form()
loginForm['UserName'] = 'username'
loginForm['Password'] = '*'
browser.submit_form(loginForm)
if browser.response.ok:
    if browser.response.content[2] == 'false':
        print(browser.response.content[4])
        sys.exit(1)
The website returned JSON (at least I think it's JSON), but I can't seem to find a RoboBrowser API for dealing with JSON.
{"RedirectUrl":null,"IsSuccess":false,"Message":null,"CustomMessage":null,"Errors":[{"Key":"CaptchaValue","Value":["Your response did not match. Please try again."]}],"Messages":{},"HasView":true.......}
As you can see, I want to test "IsSuccess" and print the error message. How can I proceed in this case?
Thanks.
Found a solution using the json module (Python 2):
import json
from StringIO import StringIO
json.load(StringIO(browser.response.content))
And for Python 3.x:
import io
import json
json.load(io.BytesIO(browser.response.content))
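For completeness, here is a sketch of testing "IsSuccess" and pulling out the first error message once the body has been parsed (the sample payload below is trimmed from the question). Since RoboBrowser wraps a requests response, `browser.response.json()` may also be available as a shortcut for the parsing step, but that is an assumption:

```python
import json

# Trimmed version of the JSON body shown in the question
sample = ('{"RedirectUrl":null,"IsSuccess":false,"Message":null,'
          '"Errors":[{"Key":"CaptchaValue","Value":'
          '["Your response did not match. Please try again."]}]}')

def first_error(raw):
    data = json.loads(raw)
    if not data['IsSuccess']:
        # Each entry in Errors has a Key and a list of message strings
        return data['Errors'][0]['Value'][0]
    return None
```

Calling `first_error(sample)` returns the captcha message, and a successful login (`"IsSuccess": true`) yields `None`.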
The printed HTML is garbled text, instead of what I expect to see via "view source" in the browser.
Why is that? How can I fix it easily?
Thank you for your help.
Same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html
I got the same garbled text using curl:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped, so this shows the correct HTML for me:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised answer, after reading the above, is:
import urllib
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
html now holds the decompressed HTML (print it to see).
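Under Python 3 the same decompression is shorter, since `gzip.decompress` works on bytes directly (a sketch; the StringIO module used above is Python 2 only — in real use you would pass it `response.read()`):

```python
import gzip

def gunzip(raw):
    # Equivalent of the StringIO + GzipFile dance above, for bytes in Python 3
    return gzip.decompress(raw)

# Round-trip demonstration with locally compressed data
round_trip = gunzip(gzip.compress(b'<html>ok</html>')).decode('utf-8')
```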
Try the requests library (Python Requests):
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is that the site uses gzip encoding. To my knowledge, urllib2 does not automatically decompress responses, so you end up with compressed HTML from sites that use that encoding. You can confirm this by printing the content headers from the response, like so:
print response.headers
There you will see that the "Content-Encoding" is gzip. To get around this using the standard urllib library, you would need to use the gzip module. Mechanize has the same problem because it uses the same urllib library under the hood. Requests handles this encoding for you and formats the response nicely.