How to download large files in Python 2

I'm trying to download large files (approx. 1 GB) with the mechanize module, but I have been unsuccessful. I've searched for similar threads, but found only ones where the files are publicly accessible and no login is required. In my case the file is in a private section, so I need to log in before downloading. Here is what I've done so far.
import mechanize

g_form_id = ""

def is_form_found(form1):
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id

def select_form_with_id_using_br(br1, id1):
    global g_form_id
    g_form_id = id1
    try:
        br1.select_form(predicate=is_form_found)
    except mechanize.FormNotFoundError:
        print "form not found, id: " + g_form_id
        exit()

url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]

response = br.open(url_to_login)

# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False)    # allow everything to be written to
br.submit()

# Try to download file
br.retrieve(url_to_file, local_filename)
But I'm getting an error when 512MB is downloaded:
Traceback (most recent call last):
  File "dl.py", line 34, in <module>
    br.retrieve(br.retrieve(url_to_file, local_filename)
  File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
    block = fp.read(bs)
  File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
    self.__cache.write(data)
MemoryError: out of memory
Do you have any ideas how to solve this?
Thanks

You can use bs4 and requests to log in, then stream the content to disk. The login form has a few required fields, including a _token_ field that you definitely need to send back:
from bs4 import BeautifulSoup
import requests
from urlparse import urljoin

data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"

with requests.Session() as s:
    # update headers
    s.headers.update({'User-agent': 'Firefox'})
    # use bs4 to parse the form fields
    soup = BeautifulSoup(s.get(base).content)
    form = soup.select_one("#frm-loginForm")
    # works as it is a relative path. Not always the case.
    action = form["action"]
    # Get the rest of the fields; ignore password and email.
    for inp in form.find_all("input", {"name": True, "value": True}):
        name, value = inp["name"], inp["value"]
        if name not in data:
            data[name] = value
    # login
    s.post(urljoin(base, action), data=data)
    # get protected url, streaming the body to disk chunk by chunk
    with open(local_filename, "wb") as f:
        for chk in s.get(url_to_file, stream=True).iter_content(1024):
            f.write(chk)

Try downloading and writing the file in chunks; it looks like the whole file is being buffered in memory. You should also specify a Range header in your request if the server supports it, so a large file can be fetched piece by piece and a broken transfer resumed:
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
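For instance, a minimal sketch with requests (download_with_resume is my own name, and this assumes the server honours Range; s could be the logged-in Session from the answer above):

import os
import requests

def download_with_resume(session, url, filename, chunk_size=64 * 1024):
    # resume from whatever a previous partial download left behind
    offset = os.path.getsize(filename) if os.path.exists(filename) else 0
    headers = {'Range': 'bytes=%d-' % offset} if offset else {}
    r = session.get(url, headers=headers, stream=True)
    r.raise_for_status()
    # 206 Partial Content means the server honoured the Range header;
    # anything else and the file must be rewritten from the start
    mode = 'ab' if r.status_code == 206 else 'wb'
    with open(filename, mode) as f:
        for chunk in r.iter_content(chunk_size):
            f.write(chunk)

download_with_resume(s, url_to_file, local_filename)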

Related

Test if URLs in bookmarks are offline or online in Python

I wonder if it is possible to test the URLs from my bookmarks, so I can see whether each URL is still online or offline.
I can see that I can test a single URL with urllib2:
urllib2 code
import socket
from urllib2 import urlopen, URLError, HTTPError

socket.setdefaulttimeout(23)  # timeout in seconds
url = 'http://google.com/'
try:
    response = urlopen(url)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request. Reason:', str(e.code)
except URLError, e:
    print 'We failed to reach a server. Reason:', str(e.reason)
else:
    html = response.read()
    print 'got response!'
    # do something, turn the light on/off or whatever
My question is: can I get the links/URLs from my (Chrome) bookmarks and then test each URL in a for loop to see whether it is offline or online?
EDIT 26/02/2019:
Have tried this code, and get a "no folder found" error:
import json
from jsonpath_rw import parse
import os

# Parse the Bookmarks file from json into a dict
input_filename = os.path.join(os.getenv("APPDATA"), "\\Local\\Google\\Chrome\\User Data\\Default\\Bookmarks")
if os.path.isfile(input_filename):
    with open(input_filename) as data_file:
        bookmark_data = json.load(data_file)
    # Set an xpath expression for all 'url' children
    expr = parse('$..url')
    # print the value of all url keys
    print([x.value for x in expr.find(bookmark_data)])
else:
    print("File not found!")
    print(input_filename)
Chrome (or at least Chromium) stores your bookmarks in a file called Bookmarks in your Chrome config area. On Linux this is usually .config/chromium/Default/Bookmarks; on Windows it is AppData\Local\Google\Chrome\User Data\Default\Bookmarks (though you may need to hunt for it if your system is different).
Assuming you want to check all links, you probably want to recursively walk the tree, looking for url keys and getting their values. Since this is JSON, I would recommend using the JSONPath library (https://readthedocs.org/projects/jsonpath-rw/) rather than writing your own recursive function:
import json
from jsonpath_rw import parse

# Parse the Bookmarks file from json into a dict
with open('Bookmarks') as bm:
    data = json.load(bm)

# Set an xpath expression for all 'url' children
expr = parse('$..url')

# print the value of all url keys
print([x.value for x in expr.find(data)])
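From there you can feed the extracted URLs straight into the urllib2 check from the question. A minimal sketch (untested; note that bookmark files can also contain non-HTTP URLs such as chrome:// pages, which urlopen will reject):

import json
import socket
from urllib2 import urlopen, URLError, HTTPError
from jsonpath_rw import parse

socket.setdefaulttimeout(23)  # timeout in seconds

with open('Bookmarks') as bm:
    data = json.load(bm)

for match in parse('$..url').find(data):
    url = match.value
    try:
        urlopen(url)
    except HTTPError, e:
        print url, '-> offline? server answered with code', e.code
    except URLError, e:
        print url, '-> offline:', e.reason
    else:
        print url, '-> online'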

python script wordpress (mix google image + auto post wordpress)

Hello, can anyone help me combine two scripts in Python? I have a script that automatically posts articles to WordPress and another script that downloads photos from Google Images. I want a single script that downloads images from Google for a given keyword and posts them to a photoblog running on WordPress as new articles. I am a novice and would appreciate your help, please!
Sorry for my English.
google script:
# coding: utf-8

# In[ ]:

# Searching and Downloading Google Images/Image Links

# Import Libraries
import time  # Importing the time library to check the time of code execution
import sys   # Importing the System Library
import os
import urllib2

########### Edit From Here ###########

# This list is used to search keywords. You can edit this list to search for google images of your choice. You can simply add and remove elements of the list.
search_keyword = ['Australia']

# This list is used to further add a suffix to your search term. Each element of the list will help you download 100 images. The first element is blank, which denotes that no suffix is added to the search keyword of the above list. You can edit the list by adding/deleting elements from it. So if the first element of search_keyword is 'Australia' and the second element of keywords is 'high resolution', then it will search for 'Australia High Resolution'.
keywords = [' high resolution']

########### End of Editing ###########

# Downloading entire Web Document (Raw Page Content)
def download_page(url):
    version = (3, 0)
    cur_version = sys.version_info
    if cur_version >= version:  # If the Current Version of Python is 3.0 or above
        import urllib.request   # urllib library for Extracting web pages
        try:
            headers = {}
            headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
            req = urllib.request.Request(url, headers=headers)
            resp = urllib.request.urlopen(req)
            respData = str(resp.read())
            return respData
        except Exception as e:
            print(str(e))
    else:  # If the Current Version of Python is 2.x
        import urllib2
        try:
            headers = {}
            headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
            req = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(req)
            page = response.read()
            return page
        except:
            return "Page Not found"

# Finding 'Next Image' from the given raw page
def _images_get_next_item(s):
    start_line = s.find('rg_di')
    if start_line == -1:  # If no links are found then give an error!
        end_quote = 0
        link = "no_links"
        return link, end_quote
    else:
        start_line = s.find('"class="rg_meta"')
        start_content = s.find('"ou"', start_line + 1)
        end_content = s.find(',"ow"', start_content + 1)
        content_raw = str(s[start_content + 6:end_content - 1])
        return content_raw, end_content

# Getting all links with the help of '_images_get_next_item'
def _images_get_all_items(page):
    items = []
    while True:
        item, end_content = _images_get_next_item(page)
        if item == "no_links":
            break
        else:
            items.append(item)  # Append all the links to the list named 'items'
            time.sleep(0.1)     # Timer could be used to slow down the requests for image downloads
            page = page[end_content:]
    return items

############## Main Program ############
t0 = time.time()  # start the timer

# Download Image Links
i = 0
while i < len(search_keyword):
    items = []
    iteration = "Item no.: " + str(i + 1) + " -->" + " Item name = " + str(search_keyword[i])
    print (iteration)
    print ("Evaluating...")
    search_keywords = search_keyword[i]
    search = search_keywords.replace(' ', '%20')

    # make a search keyword directory
    try:
        os.makedirs(search_keywords)
    except OSError, e:
        if e.errno != 17:
            raise
        # time.sleep might help here
        pass

    j = 0
    while j < len(keywords):
        pure_keyword = keywords[j].replace(' ', '%20')
        url = 'https://www.google.com/search?q=' + search + pure_keyword + '&espv=2&biw=1366&bih=667&site=webhp&source=lnms&tbm=isch&sa=X&ei=XosDVaCXD8TasATItgE&ved=0CAcQ_AUoAg'
        raw_html = download_page(url)
        time.sleep(0.1)
        items = items + _images_get_all_items(raw_html)
        j = j + 1
    # print ("Image Links = " + str(items))
    print ("Total Image Links = " + str(len(items)))
    print ("\n")

    # This allows you to write all the links into a text file. The text file is created in the same directory as your code. You can comment out the below 3 lines to stop writing the output to the text file.
    info = open('output.txt', 'a')  # Open the text file called output.txt
    info.write(str(i) + ': ' + str(search_keyword[i - 1]) + ": " + str(items) + "\n\n\n")  # Write the links
    info.close()  # Close the file

    t1 = time.time()      # stop the timer
    total_time = t1 - t0  # Calculating the total time required to crawl and find all the links
    print("Total time taken: " + str(total_time) + " Seconds")
    print ("Starting Download...")

    ## To save images to the same directory
    # In this saving process we are just skipping the URL if there is any error
    k = 0
    errorCount = 0
    while k < len(items):
        from urllib2 import Request, urlopen
        from urllib2 import URLError, HTTPError

        try:
            req = Request(items[k], headers={"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"})
            response = urlopen(req, None, 15)
            output_file = open(search_keywords + "/" + str(k + 1) + ".jpg", 'wb')
            data = response.read()
            output_file.write(data)
            response.close()
            print("completed ====> " + str(k + 1))
            k = k + 1
        # Note: HTTPError and URLError are subclasses of IOError in Python 2,
        # so they must be caught before the plain IOError branch.
        except HTTPError as e:  # If there is any HTTPError
            errorCount += 1
            print("HTTPError " + str(k))
            k = k + 1
        except URLError as e:
            errorCount += 1
            print("URLError " + str(k))
            k = k + 1
        except IOError:  # If there is any other IOError
            errorCount += 1
            print("IOError on image " + str(k + 1))
            k = k + 1
    i = i + 1

print("\n")
print("Everything downloaded!")
print("\n" + str(errorCount) + " ----> total Errors")

#----End of the main program ----#

# In[ ]:
wordpress script:
import urllib
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods import posts
import xmlrpclib
from wordpress_xmlrpc.compat import xmlrpc_client
from wordpress_xmlrpc.methods import media, posts
import os

########################### Read Me First ###############################
'''
------------------------------------------In DETAIL--------------------------------
Description
===========
Add new posts to WordPress remotely using Python, via the XML-RPC interface provided by WordPress.

Installation Requirement
************************
Verify you meet the following requirements
==========================================
Install Python 2.7 (don't download 3+, as most libraries don't yet support version 3).
Install from PyPI using easy_install python-wordpress-xmlrpc
Easy_Install link: https://pypi.python.org/pypi/setuptools
==========================================
Windows Installation Guide
==========================
- Download and install Easy_Install from the above link. Extract the downloaded file, and from CMD go to the extracted directory and run 'python setup.py install'. This will install easy_install.
- Go to %/python27/script and run the following command: easy_install python-wordpress-xmlrpc

Ubuntu Installation Guide
=========================
sudo apt-get install python-setuptools
sudo easy_install python-wordpress-xmlrpc

Note: the script has dummy data to work with initially, which you can change or integrate with your own code to make it more dynamic.
****************************************
For Bugs/Suggestions
contact@waqasjamal.com
****************************************
------------------------------------------In DETAIL--------------------------------
'''

class Custom_WP_XMLRPC:
    def post_article(self, wpUrl, wpUserName, wpPassword, articleTitle, articleCategories, articleContent, articleTags, PhotoUrl):
        self.path = os.getcwd() + "\\00000001.jpg"
        self.articlePhotoUrl = PhotoUrl
        self.wpUrl = wpUrl
        self.wpUserName = wpUserName
        self.wpPassword = wpPassword
        # Download File
        f = open(self.path, 'wb')
        f.write(urllib.urlopen(self.articlePhotoUrl).read())
        f.close()
        # Upload to WordPress
        client = Client(self.wpUrl, self.wpUserName, self.wpPassword)
        filename = self.path
        # prepare metadata
        data = {'name': 'picture.jpg', 'type': 'image/jpg'}
        # read the binary file and let the XMLRPC library encode it into base64
        with open(filename, 'rb') as img:
            data['bits'] = xmlrpc_client.Binary(img.read())
        response = client.call(media.UploadFile(data))
        attachment_id = response['id']
        # Post
        post = WordPressPost()
        post.title = articleTitle
        post.content = articleContent
        post.terms_names = {'post_tag': articleTags, 'category': articleCategories}
        post.post_status = 'publish'
        post.thumbnail = attachment_id
        post.id = client.call(posts.NewPost(post))
        print 'Post Successfully posted. Its Id is: ', post.id

#########################################
#       POST & WP Credentials Detail    #
#########################################
# URL of image on the internet
articlePhotoUrl = 'http://i1.tribune.com.pk/wp-content/uploads/2013/07/584065-twitter-1375197036-960-640x480.jpg'
# Don't forget the /xmlrpc.php, because that's your posting address for the XML-RPC server
wpUrl = 'http://YourWebSite.com/xmlrpc.php'
# WordPress username
wpUserName = 'WordPressUsername'
# WordPress password
wpPassword = 'YourWordPressPassword'
# Post title
articleTitle = 'Testing Python Script version 3'
# Post body/description
articleContent = 'Final .... Testing Fully Automated'
# list of tags
articleTags = ['code', 'python']
# list of categories
articleCategories = ['language', 'art']

#########################################
# Creating class object & calling the XML-RPC custom post function
#########################################
xmlrpc_object = Custom_WP_XMLRPC()
# On post submission this function will print the post id
xmlrpc_object.post_article(wpUrl, wpUserName, wpPassword, articleTitle, articleCategories, articleContent, articleTags, articlePhotoUrl)
For example: with search_keyword = ['Australia'], download 20 images from Google Images and post them to WordPress as new posts. This is my attempt:

#########################################
#       POST & WP Credentials Detail    #
#########################################
# URL of image on the internet (note: this builds a Google search-results page, not a direct image URL)
articlePhotoUrl = 'https://www.google.com/search?q=' + search_keyword[0] + '&espv=2&biw=1366&bih=667&site=webhp&source=lnms&tbm=isch&sa=X&ei=XosDVaCXD8TasATItgE&ved=0CAcQ_AUoAg'
# Don't forget the /xmlrpc.php, because that's your posting address for the XML-RPC server
wpUrl = 'http://YourWebSite.com/xmlrpc.php'
# WordPress username
wpUserName = 'WordPressUsername'
# WordPress password
wpPassword = 'YourWordPressPassword'
# Post title
articleTitle = 'Testing Python Script version 3'
# Post body/description
articleContent = 'Final .... Testing Fully Automated'
# list of tags
articleTags = ['code', 'python']
# list of categories
articleCategories = ['language', 'art']

#########################################
# Creating class object & calling the XML-RPC custom post function
#########################################
xmlrpc_object = Custom_WP_XMLRPC()
# On post submission this function will print the post id
xmlrpc_object.post_article(wpUrl, wpUserName, wpPassword, articleTitle, articleCategories, articleContent, articleTags, articlePhotoUrl)
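One way to wire the two scripts together is a minimal, untested sketch like the following. It assumes both scripts are merged into one file (so download_page, _images_get_all_items, Custom_WP_XMLRPC and the credential variables above are all in scope), collects the direct image links the Google script scrapes, and hands each link to post_article, which already downloads a URL and uploads it:

# Minimal glue sketch (assumption: both scripts live in the same file; untested)
xmlrpc_object = Custom_WP_XMLRPC()

search = search_keyword[0].replace(' ', '%20')
url = 'https://www.google.com/search?q=' + search + '&tbm=isch'
items = _images_get_all_items(download_page(url))  # direct image links, not the results page

# post the first 20 images, one WordPress post per image
for n, image_url in enumerate(items[:20]):
    xmlrpc_object.post_article(wpUrl, wpUserName, wpPassword,
                               articleTitle + ' #' + str(n + 1),
                               articleCategories, articleContent,
                               articleTags, image_url)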

Mersive Solstice API: AttributeError: 'dict' object has no attribute 'm_displayInformation'

I have around 100 machines running Mersive Solstice, which is a wireless display tool. I'm trying to gather a few important pieces of information, in particular the fulfillment ID for the license for each installed instance.
Using the Solstice OpenControl API, found here, I whipped up a python script to grab everything I needed using a json GET. However, even when using the example GET from the documentation,
import requests
import json

url = 'http://ip-of-machine/api/stats'
r = requests.get(url)
jsonStats = json.loads(r.text)
usersConnected = jsonStats.m_statistics.m_connectedUsers
I encounter:
Traceback (most recent call last):
  File "C:/Python27/test.py", line 7, in <module>
    usersConnected = jsonStats.m_statistics.m_connectedUsers
AttributeError: 'dict' object has no attribute 'm_statistics'
This is very confusing. I've found plenty of similar questions on SO regarding this error, but none specifically about a GET request taken straight from an API reference guide.
Additionally, here is my script:
import requests
import json
from time import sleep

url = 'test'
f = open("ip.txt", "r")
while url != "":
    url = f.readline()
    url = url.rstrip('\n')
    print(url)
    try:
        r = requests.get(url)
    except:
        sleep(5)
    jsonConfig = json.loads(r.text)
    displayName = jsonConfig.m_displayInformation.m_displayName
    hostName = jsonConfig.m_displayInformation.m_hostName
    ipv4 = jsonConfig.m_displayInformation.m_ipv4
    fulfillmentId = jsonConfig.m_licenseCuration.fulfillmentId
    r.close()
f.close()
I import the URLs from a text document for easy keeping. I'm able to make the connection to the /api/config JSON, and when the URL is put into a browser it does spit out the JSON records.
JSON objects are parsed into Python dicts, which are mappings indexed by key, not objects with attributes; you are just accessing them the wrong way. I recommend reading up on Python data structures.
json.loads() returns a dictionary, not an object, so do:
dict['key']['key']
Here is how your code should look:
import requests
import json
from time import sleep

url = 'test'
f = open("ip.txt", "r")
while url != "":
    url = f.readline()
    url = url.rstrip('\n')
    print(url)
    try:
        response = requests.get(url)
        json_object = json.loads(response.text)
        displayName = json_object['m_displayInformation']['m_displayName']
        hostName = json_object['m_displayInformation']['m_hostName']
        ipv4 = json_object['m_displayInformation']['m_ipv4']
        fulfillmentId = json_object['m_licenseCuration']['fulfillmentId']
    except:
        pass
    response.close()
f.close()
I hope this was helpful!

how to download a file using python-sharepoint library

I am using this library https://github.com/ox-it/python-sharepoint to connect to a SharePoint list. I can authenticate and access the list fields, including the full URL to the file I want, and it seems this library does have is_file() and open() methods; however, I do not understand how to call them.
Any advice is appreciated!
from sharepoint import SharePointSite, basic_auth_opener

opener = basic_auth_opener(server_url, "domain/username", "password")
site = SharePointSite(server_url, opener)
sp_list = site.lists['ListName']
for row in sp_list.rows:
    print row.id, row.Title, row.Author['name'], row.Created, row.EncodedAbsUrl
    # download file
    # row.open() ??
To quote from the README file:
Support for document libraries is limited, but SharePointListRow
objects do support a is_file() method and an open() method for
accessing file data.
Basically you call these methods on the list row (which is of type SharePointListRow).
The open() method is actually the method of urllib2's opener, which you usually use like so:
import urllib2
opener = urllib2.build_opener()
response = opener.open('http://www.example.com/')
print ('READ CONTENTS:', response.read())
print ('URL :', response.geturl())
# ....
So you should be able to use it like this (I don't have any Sharepoint site to check this though):
from sharepoint import SharePointSite, basic_auth_opener

opener = basic_auth_opener(server_url, "domain/username", "password")
site = SharePointSite(server_url, opener)
sp_list = site.lists['ListName']
for row in sp_list.rows():               # <<<
    print row.id, row.Title, row.Author['name'], row.Created, row.EncodedAbsUrl
    # download file here
    print ("This row: ", row.name())     # <<<
    if row.is_file():                    # <<<
        response = row.open()            # <<<
        file_data = response.read()      # <<<
        # process the file data, e.g. write to disk
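For large documents you may prefer to copy the response to disk in chunks rather than call read() once; a small sketch under the same caveat that I have no SharePoint site to test against (local_filename is a placeholder name, not part of the python-sharepoint API):

# chunked copy of the row's file data to disk
local_filename = 'downloaded_file.bin'
with open(local_filename, 'wb') as out:
    while True:
        chunk = response.read(64 * 1024)  # urllib2-style responses accept read(size)
        if not chunk:
            break
        out.write(chunk)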

GraphAPIError: (#324) Requires upload file

I'm trying to upload a picture with the python sdk:
Code:
graph = facebook.GraphAPI(self.current_user.access_token)
graph.put_object("me", "photos", name = "test", message = raw_picture_data)
But I get the error "GraphAPIError: (#324) Requires upload file". I don't think it's a permissions issue, as I've requested perms="user_photos,friends_photos,publish_stream". Does anyone know what this error means and how to resolve it?
I used this library to encode the image: http://atlee.ca/software/poster/
Add this to facebook.py:
from poster.encode import *
from poster.streaminghttp import register_openers

def put_photo(self, source, album_id=None, message=""):
    object_id = album_id or "me"
    register_openers()
    # multipart_encode returns a data generator plus the matching headers
    datagen, headers = multipart_encode([('message', message), ('access_token', self.access_token), ('source', source)])
    req = urllib2.Request("https://graph.facebook.com/%s/photos" % object_id, datagen, headers)
    try:
        data = urllib2.urlopen(req).read()
    except urllib2.HTTPError as e:
        data = e.read()
    try:
        response = _parse_json(data)
        if response.get("error"):
            raise GraphAPIError(response["error"].get("code", 1), response["error"]["message"])
    except ValueError:
        response = data
    return response
Call the function with the photo as a file-like object:
graph = facebook.GraphAPI(access_token)
photo = open("myphoto.bmp", "rb")
graph.put_photo(photo, "me", "This is my brilliant photo")
The put_photo method was submitted by someone (I forget who) as a proposed addition to the SDK, but it didn't work for me until I used poster to encode the image.
Hope this helps.
Just battled with a similar error for a bit. I did not use the SDK, just a POST to the Graph API. For me, this error happened when I didn't supply a filename for the file-upload field in the "form" sent to Facebook.
This is my code (poster - http://pypi.python.org/pypi/poster/0.8.1)
from poster.encode import multipart_encode, MultipartParam

url = 'https://graph.facebook.com/me/photos?access_token=%s' % model.facebook_token
file_param = MultipartParam(name='source',
                            filename='photo.jpg',  # this is crucial!!!
                            fileobj=blob_reader)   # the blob reader is the file object for the file (with read() method)
message_param = MultipartParam(name='message',
                               value='test message')
datagen, headers = multipart_encode([file_param,
                                     message_param])

from google.appengine.api import urlfetch
result = urlfetch.fetch(url,
                        payload=''.join(datagen),
                        headers=headers,
                        method='POST')
return result.content
