I am trying to download HTML pages with a Python script through a TOR proxy server. It works, but it is extremely slow, and because the code is not well organized the IP is renewed most of the time instead of pages being downloaded. How can I speed up downloading through TOR, and how can I organize the code more efficiently?
There are two scripts. Script1 downloads HTML pages from the website; once the website blocks the IP, Script2 is executed to renew the IP with the help of the TOR proxy, and so on. The IP gets blocked after a few seconds.
Should I lower the number of threads? How? Please help me speed up the process; I am only getting 300-500 HTML pages per hour.
Here is the full code of Script1:
# -*- coding: UTF-8 -*-
import os
import sys
import socks
import socket
import subprocess
import time
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, '127.0.0.1', 9050, True)
socket.socket = socks.socksocket
import urllib2
class WebPage:

    def __init__(self, path, country, url, lower=0, upper=9999):
        self.dir = os.path.join(str(path), str(country))
        self.url = url
        try:
            fin = open(self.dir + "/limit.txt", 'r')
            limit = fin.readline()
            limits = str(limit).split(",")
            lower = int(limits[0])
            upper = int(limits[1])
            fin.close()
        except:
            fout = open(self.dir + "/limit.txt", 'wb')
            limits = str(lower) + "," + str(upper)
            fout.write(limits)
            fout.close()
        self.process_instances(lower, upper)

    def process_instances(self, lower, upper):
        try:
            os.stat(self.dir)
        except:
            os.mkdir(self.dir)
        for count in range(lower, upper + 1):
            if count == upper:
                print "all downloaded, quitting the app!!"
                break
            targetURL = self.url + "/" + str(count)
            print "Downloading :" + targetURL
            req = urllib2.Request(targetURL)
            try:
                response = urllib2.urlopen(req)
                the_page = response.read()
                if the_page.find("Your IP suspended") >= 0:
                    print "The IP is suspended"
                    fout = open(self.dir + "/limit.txt", 'wb')
                    limits = str(count) + "," + str(upper)
                    fout.write(limits)
                    fout.close()
                    break
                if the_page.find("Too many requests") >= 0:
                    print "Too many requests"
                    print "Renew IP...."
                    fout = open(self.dir + "/limit.txt", 'wb')
                    limits = str(count) + "," + str(upper)
                    fout.write(limits)
                    fout.close()
                    subprocess.Popen(r"C:\Users\John\Desktop\Data-Mine\yp\lol\lol2.py", shell=True)
                    time.sleep(2)
                    subprocess.call('lol1.py')
                if the_page.find("404 error") >= 0:
                    print "the page not exist"
                    continue
                self.saveHTML(count, the_page)
            except:
                print "The URL cannot be fetched"
                execfile('lol1.py')
                raise

    def saveHTML(self, count, content):
        fout = open(self.dir + "/" + str(count) + ".html", 'wb')
        fout.write(content)
        fout.close()


if __name__ == '__main__':
    if len(sys.argv) != 6:
        print "cannot process!!! Five parameters are required to run the process."
        print "Parameter 1 should be the path where to save the data, eg, /Users/john/data/"
        print "Parameter 2 should be the name of the country for which data is collected, eg, japan"
        print "Parameter 3 should be the URL from which the data to collect, eg, the website link"
        print "Parameter 4 should be the lower limit of the company id, eg, 11"
        print "Parameter 5 should be the upper limit of the company id, eg, 1000"
        print "The output will be saved as the HTML file for each company in the target folder's country"
        exit()
    else:
        path = str(sys.argv[1])
        country = str(sys.argv[2])
        url = str(sys.argv[3])
        lowerlimit = int(sys.argv[4])
        upperlimit = int(sys.argv[5])
        WebPage(path, country, url, lowerlimit, upperlimit)
TOR is very slow, so it is to be expected that you don't get that many pages per hour. There are, however, some ways to speed it up. Most notably, you could turn on GZIP compression for urllib (see this question for example) to improve the speed a little bit.
TOR as a protocol has rather low bandwidth, because the data needs to be relayed a few times and each relay must use its bandwidth for your request. If data is relayed 6 times - a rather probable number - you would need 6 times the bandwidth. GZIP compression can compress HTML to (in some cases) ~10% of the original size so that will probably speed up the process.
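For example, here is a minimal sketch of requesting gzip-compressed pages with urllib2 (Python 2, to match your script). The Accept-Encoding header and the Content-Encoding check are the standard way to do it; whether it actually helps depends on the target server compressing its responses:
import gzip
import urllib2
from StringIO import StringIO

def fetch_compressed(url):
    # Ask the server for a gzip-compressed response
    req = urllib2.Request(url)
    req.add_header('Accept-Encoding', 'gzip')
    response = urllib2.urlopen(req)
    data = response.read()
    # Only decompress if the server actually honoured the header
    if response.info().get('Content-Encoding') == 'gzip':
        data = gzip.GzipFile(fileobj=StringIO(data)).read()
    return data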
Related
I have a nice set of URLs saved in this format
Number Link
0 https://ipfs.io/ipfs/QmRRPWG96cmgTn2qSzjwr2qvfNEuhunv6FNeMFGa9bx6mQ
1 https://ipfs.io/ipfs/QmPbxeGcXhYQQNgsC6a36dDyYUcHgMLnGKnF8pVFmGsvqi
2 https://ipfs.io/ipfs/QmcJYkCKK7QPmYWjp4FD2e3Lv5WCGFuHNUByvGKBaytif4
3 https://ipfs.io/ipfs/QmYxT4LnK8sqLupjbS6eRvu1si7Ly2wFQAqFebxhWntcf6
4 https://ipfs.io/ipfs/QmSg9bPzW9anFYc3wWU5KnvymwkxQTpmqcRSfYj7UmiBa7
5 https://ipfs.io/ipfs/QmNwbd7ctEhGpVkP8nZvBBQfiNeFKRdxftJAxxEdkUKLcQ
6 https://ipfs.io/ipfs/QmWBgfBhyVmHNhBfEQ7p1P4Mpn7pm5b8KgSab2caELnTuV
7 https://ipfs.io/ipfs/QmRsJLrg27GQ1ZWyrXZFuJFdU5bapfzsyBfm3CAX1V1bw6
I am trying to use a loop to go through all of the links and save each file
import urllib.request
for x, y in zip(link, num):
    url = str(x)
    name = str(y)
    filename = "%s.png" % name
    urllib.request.urlretrieve(url, filename)
Every time I run this code I get this error
URLError: <urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>
What is weird is that if I just run the code on one URL then it works fine.
import urllib.request
name = 1
filename = "%s.png" % name
urllib.request.urlretrieve("https://ipfs.io/ipfs/QmcJYkCKK7QPmYWjp4FD2e3Lv5WCGFuHNUByvGKBaytif4", filename)
How can this be fixed so that the code runs in a loop with no errors?
thanks
EDIT
Here is some code that works for 1 image
import pandas as pd
import urllib.request
links = [['number', 'link'], ['1', 'https://ipfs.io/ipfs/QmPbxeGcXhYQQNgsC6a36dDyYUcHgMLnGKnF8pVFmGsvqi'], ['2', 'https://ipfs.io/ipfs/QmcJYkCKK7QPmYWjp4FD2e3Lv5WCGFuHNUByvGKBaytif4'], ['3', 'https://ipfs.io/ipfs/QmYxT4LnK8sqLupjbS6eRvu1si7Ly2wFQAqFebxhWntcf6']]
data = pd.DataFrame(links)
link = data.get('Link', None)
num = data.get('Number', None)
name = 1
filename = "%s.png" % name
urllib.request.urlretrieve("https://ipfs.io/ipfs/QmYxT4LnK8sqLupjbS6eRvu1si7Ly2wFQAqFebxhWntcf6", filename)
You are being throttled by the IPFS service. You need to implement API rate limiting (or see if the service has a premium option that allows you to pay for higher API request rates).
Here's one way to implement client-side rate limiting, using exponential backoff/retry:
Save this retry code as retry.py.
Fix a couple of Python 2-only constructs in retry.py so it runs under Python 3 (except ExceptionToCheck as e: at line 32 and print(msg) at line 37).
Modify your client code as follows:
import urllib.error
import urllib.request

from retry import retry

LINKS = [
    "https://ipfs.io/ipfs/QmRRPWG96cmgTn2qSzjwr2qvfNEuhunv6FNeMFGa9bx6mQ",
    "https://ipfs.io/ipfs/QmPbxeGcXhYQQNgsC6a36dDyYUcHgMLnGKnF8pVFmGsvqi",
    "https://ipfs.io/ipfs/QmcJYkCKK7QPmYWjp4FD2e3Lv5WCGFuHNUByvGKBaytif4",
    "https://ipfs.io/ipfs/QmYxT4LnK8sqLupjbS6eRvu1si7Ly2wFQAqFebxhWntcf6",
    "https://ipfs.io/ipfs/QmSg9bPzW9anFYc3wWU5KnvymwkxQTpmqcRSfYj7UmiBa7",
    "https://ipfs.io/ipfs/QmNwbd7ctEhGpVkP8nZvBBQfiNeFKRdxftJAxxEdkUKLcQ",
    "https://ipfs.io/ipfs/QmWBgfBhyVmHNhBfEQ7p1P4Mpn7pm5b8KgSab2caELnTuV",
    "https://ipfs.io/ipfs/QmRsJLrg27GQ1ZWyrXZFuJFdU5bapfzsyBfm3CAX1V1bw6",
]

@retry(urllib.error.URLError, tries=4)
def download(index, url):
    filename = "%s.png" % index
    urllib.request.urlretrieve(url, filename)

def main():
    for index, link in enumerate(LINKS):
        print(index, link)
        download(index, link)

if __name__ == '__main__':
    main()
I tested this code without retries and it was throttled (as expected). Then I added the retry decorator and it completed successfully (including a couple of expected retries).
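Since retry.py comes from an external link, here is a rough sketch of an exponential-backoff retry decorator along those lines (an assumption, not the exact linked code; defaults are illustrative):
import time
from functools import wraps

def retry(exception_to_check, tries=4, delay=3, backoff=2):
    # Retry the decorated function, doubling the wait between attempts
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            remaining, wait = tries, delay
            while remaining > 1:
                try:
                    return func(*args, **kwargs)
                except exception_to_check as e:
                    print("%s, retrying in %d seconds..." % (e, wait))
                    time.sleep(wait)
                    remaining -= 1
                    wait *= backoff
            return func(*args, **kwargs)  # final attempt; exceptions propagate
        return wrapper
    return decorator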
I've got a Python program on Windows 10 that runs a loop thousands of times with multiple functions that can sometimes hang and stop the program execution. Some are IO functions, and some are selenium webdriver functions. I'm trying to build a mechanism that will let me run a function, then kill that function after a specified number of seconds and try it again if that function didn't finish. If the function completes normally, let the program execution continue without waiting for the timeout to finish.
I've looked at at least a dozen different solutions, and can't find something that fits my requirements. Many require SIGNALS which is not available on Windows. Some spawn processes or threads which consume resources that can't easily be released, which is a problem when I'm going to run these functions thousands of times. Some work for very simple functions, but fail when a function makes a call to another function.
The situations this must work for:
Must run on Windows 10
A "driver.get" command for selenium webdriver to read a web page
A function that reads from or writes to a text file
A function that runs an external command (like checking my IP address or connecting to a VPN server)
I need to be able to specify a different timeout for each of these situations. A file write should take < 2 seconds, whereas a VPN server connection may take 20 seconds.
I've tried the following function libraries:
timeout-decorator 0.5.0
wrapt-timeout-decorator 1.3.1
func-timeout 4.3.5
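The call pattern I was trying with these libraries looks roughly like this (sketched against func-timeout's documented func_timeout(timeout, func, args=...) signature; the 20-second value and the helper name are just illustrations):
from func_timeout import func_timeout, FunctionTimedOut

def get_with_timeout(driver, url, seconds=20):
    # Abort driver.get() if it hangs for more than `seconds`
    try:
        return func_timeout(seconds, driver.get, args=(url,))
    except FunctionTimedOut:
        print('driver.get timed out after', seconds, 'seconds')
        return None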
Here is a trimmed version of my program that includes the functions I need to wrap in a timeout function:
import csv
import time
from datetime import date
from selenium import webdriver
import urllib.request
cities = []
total_cities = 0
city = ''
city_counter = 0
results = []
temp = ''
temp2 = 'IP address not found'
driver = None
if __name__ == '__main__':

    # Read city list
    with open('citylist.csv') as csvfile:
        readCity = csv.reader(csvfile, delimiter='\n')
        for row in csvfile:
            city = row.replace('\n', '')
            cities.append(city.replace('"', ''))

    # Get my IP address
    try:
        temp = urllib.request.urlopen('http://checkip.dyndns.org')
        temp = str(temp.read())
        found = temp.find(':')
        found2 = temp.find('<', found)
    except:
        pass
    if (temp.find('IP Address:') > -1):
        temp2 = temp[found+2:found2]
    print(' IP: [', temp2, ']\n', sep='')

    total_cities = len(cities)

    ## Open browser for automation
    try: driver.close()
    except AttributeError: driver = None
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-logging"])
    driver = webdriver.Chrome(options=options)

    # Search for links
    while (city_counter < total_cities):
        city = cities[city_counter]
        searchTerm = 'https://www.bing.com/search?q=skate park ' + city

        ## Perform search using designated search term
        driver.get(searchTerm)
        haystack = driver.page_source
        driver.get(searchTerm)

        found = 0
        found2 = 0
        while (found > -1):
            found = haystack.find('<a href=', found2)
            found2 = haystack.find('"', found + 10)
            if (haystack[found+9:found+13] == 'http'):
                results.append(haystack[found+9:found2])
        city_counter += 1

    driver.close()

    counter = 0
    while counter < len(results):
        print(counter, ': ', results[counter], sep='')
        counter += 1
The citylist.csv file:
"Oakland, CA",
"San Francisco, CA",
"San Jose, CA"
I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at +- 30 Kb per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of the file (already covered for Python here).
Just start several threads, fetch a different range with each, and you're done ;)
import threading
import urllib2

url = 'http://bigfile.com/bigfile.bin'  # the file from the question
chunk_size = 1 << 20                    # bytes requested per thread

def download(url, start):
    req = urllib2.Request(url)
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
parts = {}

# Initialize threads
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)

# Join threads back (order doesn't matter, you just want them all)
for i in threads:
    i.join()

# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (and servers where PHP scripts are responsible for data fetching often don't implement handling of it).
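One way to check this up front, using the Requests library from the question (a sketch; servers can still ignore ranges even when they advertise support):
import requests

head = requests.head("http://bigfile.com/bigfile.bin")  # URL from the question
if head.headers.get("Accept-Ranges") == "bytes":
    print("Server advertises byte-range support")
else:
    print("No Accept-Ranges header; a single connection may be the only option")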
Here's a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool  # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4)  # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break  # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
The end of file is detected when a server returns an empty body or a 416 HTTP code, or when the response size is not exactly chunksize.
It supports servers that don't understand the Range header (in that case everything is downloaded in a single request; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself).
It allows you to change the number of concurrent connections (pool size) and the number of bytes requested in a single HTTP request independently.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
This solution requires the Linux utility named "aria2c", but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the http directory list for location MY_HTTP_LOC. I tested this script on an instance of lighttpd/1.4.26 http server. But, you can easily modify this script so that it works for other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess

MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"

# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()

# extract relevant URL segments from source code
rgxp = '(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp, str(page))
files = []
for match in results:
    files.append(match[1])

# download (using aria2c) files
for afile in files:
    if os.path.exists(afile) and not os.path.exists(afile + '.aria2'):
        print 'Skipping already-retrieved file: ' + afile
    else:
        print 'Downloading file: ' + afile
        subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC + str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads and can do a lot more; it also shows a download progress bar by default.
For more info check this answer.
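A minimal usage sketch, assuming pySmartDL's documented SmartDL(url, dest) interface (the URL is the one from the question):
from pySmartDL import SmartDL

url = "http://bigfile.com/bigfile.bin"  # file from the question
dest = "."                              # save into the current directory

obj = SmartDL(url, dest)  # splits the download across several connections
obj.start()               # blocks until done, showing a progress bar
print(obj.get_dest())     # path of the downloaded file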
I've been trying very hard to get gpsfake to work with python-gps for the last few days.
I execute gpsfake with the following example sentences, which I found on the internet:
gpsfake test_data.log
test_data.log:
$PMGNST,05.40,2,T,961,08.3,+04583,00*4C
$GPGLL,5036.9881,N,00707.9142,E,125412.480,A*3F
$GPGGA,125412.48,5036.9881,N,00707.9142,E,2,04,20.5,00269,M,,,,*17
$GPRMC,125412.48,A,5036.9881,N,00707.9142,E,00.0,000.0,230506,00,E*4F
$GPGSA,A,2,27,04,08,24,,,,,,,,,20.5,20.5,*12
$GPGSV,3,1,10,13,81,052,,04,58,240,39,23,44,064,,24,43,188,36*75
$GPGSV,3,2,10,02,42,295,,27,34,177,40,20,21,113,,16,12,058,*7F
$GPGSV,3,3,10,08,07,189,38,10,05,293,,131,11,117,,120,28,209,*76
Once it's running, I execute a Python script I found here:
import gps
import os
import time
if __name__ == '__main__':
    session = gps.gps(verbose=1)
    while 1:
        os.system('clear')
        session.next()

        # a = altitude, d = date/time, m=mode,
        # o=postion/fix, s=status, y=satellites

        print
        print ' GPS reading'
        print '----------------------------------------'
        print 'latitude    ', session.fix.latitude
        print 'longitude   ', session.fix.longitude
        print time.strftime("%H:%M:%S")
        print
        print ' Satellites (total of', len(session.satellites), ' in view)'
        for i in session.satellites:
            print '\t', i

        time.sleep(3)
I get a connection, but then it gets stuck in class gpscommon, method read(), at line 80:
frag = self.sock.recv(4096)
which tells me that no data is received or even sent.
When I try to connect over the terminal with
connect 127.0.0.1 2947
the only response is
{"class":"VERSION","release":"3.4","rev":"3.4","proto_major":3,"proto_minor":6}
and nothing else.
Does anyone have an idea how to get the correct data? I tried various NMEA log files, so I think that's not the reason.
To get location reports you'll have to start 'watching' the devices gpsd/gpsfake is monitoring; try sending:
?WATCH={"enable":true,"json":true};
Which will enable updates from all devices in JSON format.
To stop updates:
?WATCH={"enable":false};
Once watching is enabled you'll get 'TPV' responses containing location data and 'SKY' responses with satellite data.
More information about clients: http://www.catb.org/gpsd/client-howto.html
GPSd JSON format: http://www.catb.org/gpsd/gpsd_json.html
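The same watch can also be enabled from the python-gps client itself instead of sending raw ?WATCH JSON. A sketch, assuming the gps bindings expose WATCH_ENABLE/WATCH_NEWSTYLE and session.stream() as recent gpsd releases do:
import gps

session = gps.gps(host="127.0.0.1", port="2947")
session.stream(gps.WATCH_ENABLE | gps.WATCH_NEWSTYLE)

while True:
    report = session.next()          # blocks until gpsd/gpsfake sends a report
    if report['class'] == 'TPV':     # position reports
        print getattr(report, 'lat', 'n/a'), getattr(report, 'lon', 'n/a')
    elif report['class'] == 'SKY':   # satellite reports
        print len(getattr(report, 'satellites', []))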
Currently, there's a game that has different groups, and you can play for a 'gold' prize every hour. Sometimes there is gold, sometimes there isn't. Every hour a post appears on Facebook saying "gold in group2" or "gold in group6", and at other times there is no post because no gold is offered for that hour. I want to write a small script that will check the site hourly, grab the result (whether there is gold or not, and in which group), and display it back to me. I want to write it in Python since I'm learning it. Would this be the best language to use, and how would I go about doing this? All I can really find is information on extracting links; I don't want to extract links, just the text. Thanks for any and all help. I appreciate it.
Check out urllib2 for getting HTML from a URL and BeautifulSoup/HTMLParser/etc. to parse the HTML. Then you could use something like this as a starting point for the script:
import sys
import time
import urllib2
import BeautifulSoup
import HTMLParser

def getSource(url, postdata):
    source = ""
    req = urllib2.Request(url, postdata)
    try:
        sock = urllib2.urlopen(req)
    except urllib2.URLError, exc:
        # handle the error..
        pass
    else:
        source = sock.read()
    finally:
        try:
            sock.close()
        except:
            pass
    return source

def parseSource(source):
    pass
    # parse source with BeautifulSoup/HTMLParser, or here...

def main():
    last_run = 0
    while True:
        t1 = time.time()
        # check if 1 hour has passed since last_run
        if t1 - last_run >= 3600:
            source = getSource("someurl.com", "user=me&blah=foo")
            last_run = time.time()
            parseSource(source)
        else:
            # sleep for 60 seconds and check time again.
            time.sleep(60)
    return 0

if __name__ == "__main__":
    sys.exit(main())
Here is a good article about parsing-html-with-python
I have something similar to what you have, but you left out what my main question revolves around. I looked at HTMLParser and BeautifulSoup, but I am unsure how to do something like if($posttext == gold) echo "gold in so and so". It seems like BeautifulSoup deals a lot with tags, and since Facebook posts can use a variety of tags, how would I go about doing just a search on the text and returning the 'post'?
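A text-only search with BeautifulSoup (ignoring the tag structure) could look roughly like this; the 'gold in groupN' pattern and the use of findAll(text=True) to flatten the page text are assumptions, not tied to Facebook's actual markup:
import re
from BeautifulSoup import BeautifulSoup

def parseSource(source):
    # Flatten the page to plain text, then search it for the announcement
    soup = BeautifulSoup(source)
    text = ''.join(soup.findAll(text=True))
    match = re.search(r'gold in (group\d+)', text, re.IGNORECASE)
    if match:
        print 'Gold available in', match.group(1)
    else:
        print 'No gold this hour'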