I have made this simple download manager, but the problem is that it won't work on complex URLs, where pages are redirected.
# Extract everything after the last '/' in the URL
def str(d):
    for i in range(len(d)):
        if d[-i] == '/':
            x = -i
            break
    s = []
    l = len(d) + x + 1
    print d[l], d[len(d)-1]   # debug output
    s = d[l:]
    return s
import urllib2
url=raw_input()
filename=str(url)
webfile = urllib2.urlopen(url)
data = webfile.read()
fout =open(filename,"w")
fout.write(data)
fout.close()
webfile.close()
It doesn't work for http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&ved=0CG0QFjAI&url=http%3A%2F%2Fwww.iasted.org%2Fconferences%2Fformatting%2FPresentations-Tips.ppt&ei=clfWTpjZEIblrAfC8qWXDg&usg=AFQjCNEIgqx6x4ULHFXzzYDzCITuUJOczA&sig2=0VtKXPvoDnIq-lIR4S9LEQ
while it works for http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt
and both links point to the same file.
How do I solve the problem of redirection?
I think redirection is not the problem here:
urllib2 already follows redirects automatically, and Google only redirects to an error page when something goes wrong.
Try this script:
url1 = 'http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&ved=0CG0QFjAI&url=http%3A%2F%2Fwww.iasted.org%2Fconferences%2Fformatting%2FPresentations-Tips.ppt&ei=clfWTpjZEIblrAfC8qWXDg&usg=AFQjCNEIgqx6x4ULHFXzzYDzCITuUJOczA&sig2=0VtKXPvoDnIq-lIR4S9LEQ'
url2 = 'http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt'
from urlparse import urlsplit
from urllib2 import urlopen
for url in [url1, url2]:
    split = urlsplit(url)
    filename = split.path[split.path.rfind('/')+1:]
    if not filename:
        filename = split.query[split.query.rfind('/')+1:]
    f = open(filename, 'w')
    f.write(urlopen(url).read())
    f.close()
# Yields 2 files : url and Presentations-Tips.ppt [Both are ppt files]
The above script works every time.
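If you also want the real filename for the Google link (instead of a file literally named url), one option is to pull the embedded url= parameter out of the query string with parse_qs and fall back to the original URL when it is absent. This is just a sketch on top of the script above; the helper name real_target is illustrative:

from urlparse import urlsplit, parse_qs

def real_target(url):
    # Google's redirect links carry the actual destination in the 'url' query parameter
    query = parse_qs(urlsplit(url).query)
    return query['url'][0] if 'url' in query else url

print real_target(url1)   # http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt
print real_target(url2)   # unchanged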
In general, you handle redirection by using urllib2.HTTPRedirectHandler, like this:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
res = opener.open('http://example.com/some/url/')
However, it doesn't look like this will work for the Google URL you've given in your example, because rather than including a Location header in the response, the Google result looks like this:
<script>window.googleJavaScriptRedirect=1</script><script>var a=parent,b=parent.google,c=location;if(a!=window&&b){if(b.r){b.r=0;a.location.href="http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt";c.replace("about:blank");}}else{c.replace("http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt");};</script><noscript><META http-equiv="refresh" content="0;URL='http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt'"></noscript>
...which is to say, it uses a Javascript redirect, which substantially complicates your life. You could use Python's re module to extract the correct location from this block.
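For example, a rough sketch (the pattern is tied to the markup shown above, and google_url stands for the long redirect link from the question):

import re
import urllib2

html = urllib2.urlopen(google_url).read()
match = re.search(r"URL='(http[^']+)'", html)   # target of the <noscript> meta refresh
if match:
    target = match.group(1)
    data = urllib2.urlopen(target).read()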
Related
I want to download text files using Python; how can I do so?
I used urllib's urlopen(url).read(), but it gives me the bytes representation of the file.
For me, I had to do the following (Python 3):
from urllib.request import urlopen
data = urlopen("[your url goes here]").read().decode('utf-8')
# Do what you need to do with the data.
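If the goal is a local copy rather than a string in memory, a small variation on the same idea (the output filename here is arbitrary):

from urllib.request import urlopen

data = urlopen("[your url goes here]").read().decode('utf-8')
with open("downloaded.txt", "w", encoding="utf-8") as f:
    f.write(data)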
You can use multiple options.
For the simplest solution, you can use this:

import urllib.request

file_url = 'https://someurl.com/text_file.txt'
for line in urllib.request.urlopen(file_url):
    print(line.decode('utf-8'))
For a solution using the requests API:

import requests

file_url = 'https://someurl.com/text_file.txt'
response = requests.get(file_url)
if response.ok:
    data = response.text
    for line in data.split('\n'):
        print(line)
When downloading text files with Python, I like to use the wget module:
import wget
remote_url = 'https://www.google.com/test.txt'
local_file = 'local_copy.txt'
wget.download(remote_url, local_file)
If that doesn't work, try using urllib:
from urllib import request
remote_url = 'https://www.google.com/test.txt'
file = 'copy.txt'
request.urlretrieve(remote_url, file)
When you use the requests module, you are reading the file directly from the internet, which is why you see the text in byte format. Try writing the content to a file first, then open it manually on your desktop:

import requests

remote_url = 'https://test.com/test.txt'
local_file = 'local_file.txt'
data = requests.get(remote_url)
with open(local_file, 'wb') as file:
    file.write(data.content)
Basically, my goal is to fetch the filename, the extension and the content of an image by its URL. And my function should work for both of these URLs:
easy case:
https://image.shutterstock.com/image-photo/bright-spring-view-cameo-island-260nw-1048185397.jpg
hard case (does not end with filename.extension ):
https://images.unsplash.com/photo-1472214103451-9374bd1c798e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
Currently, what I have looks like this:
import urllib.request
from os.path import splitext, basename

def get_filename_from_url(url):
    result = urllib.request.urlretrieve(url)   # (local_path, headers)
    filename, file_ext = splitext(basename(url))
    print(filename, file_ext)
This works fine for the easy case, but apparently it is no solution for the hard-case URL. I have a feeling that I can use Python's requests module and parse the header to find the mimetype, and then use the mimetypes module's guess_extension functionality to extract the necessary data. So I went on to try this:
import requests
response = requests.get(url, stream=True)
Here, someone seems to describe the clue, but the problem is that using the hard-case URL I get something strange in the response dict items, and maybe my key issue is that I don't know the correct way to parse the header of the response to extract what I need.
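For example, I can read the media type from the response above, but that alone does not give me a filename or extension (the output comment is just what I would expect for an image URL):

print(response.headers.get('Content-Type'))   # e.g. 'image/jpeg'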
I've tried a third approach using urlparse:
import os
from urllib.parse import urlparse

result = urlparse(url)
print(os.path.basename(result.path))  # 'photo-1472214103451-9374bd1c798e'

which yields the filename, but again, I miss the extension here...
The ideal solution would be to get the filename, file extension and file content in one go, preferably being able to validate that the URL actually points to an image, not something else...
UPD:
The second element (result[1]) in result = urllib.request.urlretrieve(url) seems to contain the Content-Type, but I can't figure out how to extract it correctly.
One way is to query the content type:
>>> from urllib.request import urlopen
>>> response = urlopen(url)
>>> response.info().get_content_type()
'image/jpeg'
or using urlretrieve as in your edit:
>>> response = urllib.request.urlretrieve(url)
>>> response[1].get_content_type()
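If you also want the filename, extension and content in one go, here is a rough sketch combining the two ideas (requests for the download, mimetypes.guess_extension for the extension; the function name fetch_image is just illustrative):

import mimetypes
from os.path import splitext, basename
from urllib.parse import urlparse

import requests

def fetch_image(url):
    response = requests.get(url)
    content_type = response.headers.get('Content-Type', '').split(';')[0].strip()
    if not content_type.startswith('image/'):
        raise ValueError('Not an image: %s' % content_type)
    # name comes from the URL path, extension from the declared content type
    name, ext = splitext(basename(urlparse(url).path))
    if not ext:
        ext = mimetypes.guess_extension(content_type) or ''
    return name, ext, response.content

For the hard-case URL this gives the name from the path and the extension (e.g. '.jpg' or '.jpe', depending on your platform's mimetypes table) from the Content-Type header.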
I am trying to download data files from a website using urllib.
My code is
import urllib
url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list=['4260514','4260512','4260519']
parameter_list=['ecrec','ecday','flowrec','flowcday']
for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common+'A'+site+'/'+'a'+site+'_'+parameter+'.zip'
            urllib.urlretrieve(url, 'A'+site+'_'+parameter+'.zip')
        except ValueError:
            break
My issue is that some sites do not have all the parameter files. For example, with my code, site 1 doesn't have flowcday, but Python still creates the zip file with nothing in it. How can I stop Python from creating these files if there's no data?
Many thanks,
I think maybe urllib2.urlopen is more suitable in this situation.
import urllib2
from urllib2 import HTTPError

url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list = ['4260514','4260512','4260519']
parameter_list = ['ecrec','ecday','flowrec','flowcday']

for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common+'A'+site+'/'+'a'+site+'_'+parameter+'.zip'
            name = 'A'+site+'_'+parameter+'.zip'
            req = urllib2.urlopen(url)
            with open(name, 'wb') as fh:
                fh.write(req.read())
        except HTTPError, e:
            # urlopen raises before the file is created, so missing files leave nothing behind
            if e.code == 404:
                print name + ' not found. moving on...'
I am executing the Python script from the command line with this:
python myscript.py
This is my script
import subprocess

if item['image_urls']:
    for image_url in item['image_urls']:
        subprocess.call(['wget', '-nH', image_url, '-P', 'images/'])
Now when I run that, I see output like this on the screen:
HTTP request sent, awaiting response... 200 OK
Length: 4159 (4.1K) [image/png]
What I want is that there should be no output on the terminal.
I want to grab the output and find the image extension from there, i.e. grab png from [image/png], and rename the file to something.png.
Is this possible?
If all you want is to download something (as you do with wget), why not try urllib.urlretrieve in the standard Python library?
import os
import urllib
image_url = "https://www.google.com/images/srpr/logo3w.png"
image_filename = os.path.basename(image_url)
urllib.urlretrieve(image_url, image_filename)
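If you really do need wget itself, another possibility is to capture its output instead of letting it reach the terminal and pull the MIME type out with a regular expression. A sketch (wget writes its status lines to stderr, and the helper name is just illustrative):

import re
import subprocess

def wget_extension(image_url, directory='images'):
    process = subprocess.Popen(['wget', '-nH', '-P', directory, image_url],
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, err = process.communicate()
    match = re.search(r'\[image/(\w+)\]', err.decode('utf-8', 'replace'))
    return match.group(1) if match else None   # e.g. 'png'

Nothing is printed, and the return value gives you the extension to rename the file with.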
EDIT: If the images are dynamically redirected by a script, you may try the requests package to handle the redirection.
import requests
r = requests.get(image_url)
# here r.url will return the redirected true image url
image_filename = os.path.basename(r.url)
f = open(image_filename, 'wb')
f.write(r.content)
f.close()
I haven't tested the code since I couldn't find a suitable test case. One big advantage of requests is that it can also handle authorization.
EDIT2: If the image is dynamically served by a script, like a Gravatar image, you can usually find the filename in the response header's Content-Disposition field.
import urllib2
url = "http://www.gravatar.com/avatar/92fb4563ddc5ceeaa8b19b60a7a172f4"
req = urllib2.Request(url)
r = urllib2.urlopen(req)
# you can check the returned header and find where the filename is located
print r.headers.dict
s = r.headers.getheader('content-disposition')
# just parse the filename
filename = s[s.index('"')+1:s.rindex('"')]
f = open(filename, 'wb')
f.write(r.read())
f.close()
EDIT3: As @Alex suggested in the comment, you may need to sanitize the encoded filename in the returned header; I think just getting the basename is OK.
import os
# this will remove the dir path in the filename
# so that `../../../etc/passwd` will become `passwd`
filename = os.path.basename(filename)
I can’t really understand how YouTube serves videos, but I have been reading through what I can.
It seems like the old method get_video is now obsolete and can't be used any more. Is there another Pythonic and simple method for collecting YouTube videos?
You might have some luck with youtube-dl
http://rg3.github.com/youtube-dl/documentation.html
I'm not sure if there's a good API, but it's written in Python, so theoretically you could do something a little better than Popen :)
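For instance, a rough sketch of driving it as a library instead of through Popen (the exact option names depend on the youtube-dl version you have installed):

import youtube_dl   # pip install youtube_dl

options = {'outtmpl': '%(title)s.%(ext)s'}   # filename template; adjust to taste
ydl = youtube_dl.YoutubeDL(options)
ydl.download(['http://www.youtube.com/watch?v=Je_iqbgGXFw'])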
Here is a quick Python script which downloads a YouTube video. No bells and whistles, just scrapes out the necessary URLs, hits the generate_204 URL and then streams the data to a file:
import lxml.html
import re
import sys
import urllib
import urllib2

_RE_G204 = re.compile('"(http:.+.youtube.com.*\/generate_204[^"]+")', re.M)
_RE_URLS = re.compile('"fmt_url_map": "(\d*[^"]+)",.*', re.M)

def _fetch_url(url, ref=None, path=None):
    opener = urllib2.build_opener()
    headers = {}
    if ref:
        headers['Referer'] = ref
    request = urllib2.Request(url, headers=headers)
    handle = urllib2.urlopen(request)
    if not path:
        return handle.read()
    sys.stdout.write('saving: ')
    # Write result to file
    with open(path, 'wb') as out:
        while True:
            part = handle.read(65536)
            if not part:
                break
            out.write(part)
            sys.stdout.write('.')
            sys.stdout.flush()
    sys.stdout.write('\nFinished.\n')

def _extract(html):
    tree = lxml.html.fromstring(html)
    res = {'204': _RE_G204.findall(html)[0].replace('\\', '')}
    for script in tree.findall('.//script'):
        text = script.text_content()
        if 'fmt_url_map' not in text:
            continue
        # Found it. Extract the URLs we need
        for tmp in _RE_URLS.findall(text)[0].split(','):
            url_id, url = tmp.split('|')
            res[url_id] = url.replace('\\', '')
        break
    return res

def main():
    target = sys.argv[1]
    dest = sys.argv[2]
    html = _fetch_url(target)
    res = dict(_extract(html))
    # Hit the 'generate_204' URL first and remove it
    _fetch_url(res['204'], ref=target)
    del res['204']
    # Download the video. Now I grab the first 'download' URL and use it.
    first = res.values()[0]
    _fetch_url(first, ref=target, path=dest)

if __name__ == '__main__':
    main()
Running it:
python youdown.py 'http://www.youtube.com/watch?v=Je_iqbgGXFw' stevegadd.flv
saving: ........................... finished.
I would recommend writing your own parser using urllib2 or Beautiful Soup. You can look at the source code for DownThemAll to see how that plugin finds the video URL.