I wrote a multi-threaded HTTP downloader. It downloads a file faster than a single-threaded downloader, and the MD5 sum is correct. However, the speed it reports is so high that I do not believe it is the true value.
The unit is not printed yet, but I am sure it is KB/s. Please take a look at the part of my code that does the measurement.
# Setup the slaver
def _download(self):
    # Start download partital content when queue not empty
    while not self.configer.down_queue.empty():
        data_range = self.configer.down_queue.get()
        headers = {
            'Range': 'bytes={}-{}'.format(*data_range)
        }
        response = requests.get(
            self.configer.url, stream = True,
            headers = headers
        )
        start_point = data_range[0]
        for bunch in response.iter_content(self.block_size):
            _time = time.time()
            with self.file_lock:
                with open(
                    self.configer.path, 'r+b',
                    buffering = 1
                ) as f:
                    f.seek(start_point)
                    f.write(bunch)
                    f.flush()
            start_point += self.block_size
            self.worker_com.put((
                threading.current_thread().name,
                int(self.block_size / (time.time() - _time))
            ))
        self.configer.down_queue.task_done()

# speed monitor
def speed_monitor(self):
    while len(self.thread_list)>0:
        try:
            info = self.worker_com.get_nowait()
            self.speed[info[0]] = info[1]
        except queue.Empty:
            time.sleep(0.1)
            continue
        sys.stdout.write('\b'*64 + '{:10}'.format(self.total_speed)
            + ' thread num ' + '{:2}'.format(self.worker_count))
        sys.stdout.flush()
If you need more information, please visit my GitHub repository. I would appreciate it if you could point out my error. Thanks.
I want to know how much data has been downloaded in the last second.
I don't have any code yet, but I was wondering when I should start counting this second and how to do it.
Should I start counting before retrbinary() or after? Or am I completely wrong?
First, there are ready-made implementations of a transfer progress display, including the transfer speed.
For example, the progressbar2 module. See Show FTP download progress in Python (ProgressBar).
By default, progressbar2 displays the FileTransferSpeed widget, which shows the average transfer speed since the download started.
Note, though, that speed displays usually do not show that kind of speed. They display an average speed over the last few seconds, which makes the value more informative. progressbar2 has the AdaptiveTransferSpeed widget for that, but it seems to be broken.
If you want to implement the calculation on your own, and are happy with the simple average transfer speed since the download started, it is easy:
from ftplib import FTP
import time
import sys
import datetime

# host, user and passwd are placeholders for your own connection details
ftp = FTP(host, user, passwd)

print("Downloading")

total_length = 0
start_time = datetime.datetime.now()

def write(data):
    f.write(data)
    global total_length
    global start_time
    # len(data) is the number of bytes received; sys.getsizeof() would
    # also count the overhead of the Python bytes object
    total_length += len(data)
    elapsed = (datetime.datetime.now() - start_time)
    speed = (total_length / elapsed.total_seconds())
    print("\rElapsed: {0} Speed: {1:.2f} kB/s".format(str(elapsed), speed / 1024), end="")

f = open('file.dat', 'wb')
ftp.retrbinary("RETR /file.dat", write)
f.close()

print()
print("done")
It is much more difficult to calculate the average speed over the last few seconds. You have to remember the amount of data transferred at past moments. Borrowing (and fixing) the code from AdaptiveTransferSpeed, you get something like:
# f, ftp and start_time are set up as in the previous snippet
sample_times = []
sample_values = []
INTERVAL = datetime.timedelta(milliseconds=100)
last_update_time = None
samples = datetime.timedelta(seconds=2)
total_length = 0

def write(data):
    f.write(data)

    global total_length
    # count the received bytes, not the size of the Python object
    total_length += len(data)
    elapsed = (datetime.datetime.now() - start_time)

    if sample_times:
        sample_time = sample_times[-1]
    else:
        sample_time = datetime.datetime.min

    t = datetime.datetime.now()
    if t - sample_time > INTERVAL:
        # Add a sample but limit the size to `num_samples`
        sample_times.append(t)
        sample_values.append(total_length)

        minimum_time = t - samples
        minimum_value = sample_values[-1]
        while (sample_times[2:] and
               minimum_time > sample_times[1] and
               minimum_value > sample_values[1]):
            sample_times.pop(0)
            sample_values.pop(0)

    delta_time = sample_times[-1] - sample_times[0]
    delta_value = sample_values[-1] - sample_values[0]
    if delta_time:
        speed = (delta_value / delta_time.total_seconds())

        print("\rElapsed: {0} Speed: {1:.2f} kB/s".format(
            str(elapsed), speed / 1024), end="")

ftp.retrbinary("RETR /medium.dat", write)
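If you would rather not keep the bookkeeping in globals, the same "average over the last few seconds" idea can be wrapped in a small class. This is only a sketch under my own assumptions: the name SpeedMeter, its window parameter and the injectable clock are made up for illustration and are not part of progressbar2 or ftplib.

```python
import collections
import time


class SpeedMeter:
    """Average transfer speed over roughly the last `window` seconds."""

    def __init__(self, window=2.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.total = 0
        # (timestamp, total bytes so far) pairs, oldest first
        self.samples = collections.deque([(clock(), 0)])

    def add(self, nbytes):
        """Record nbytes just received and return the speed in bytes/s."""
        now = self.clock()
        self.total += nbytes
        self.samples.append((now, self.total))
        # Drop samples that fell out of the window, but keep the newest
        # sample that is at or before the window start as the baseline
        while len(self.samples) > 2 and self.samples[1][0] <= now - self.window:
            self.samples.popleft()
        t0, v0 = self.samples[0]
        elapsed = now - t0
        return (self.total - v0) / elapsed if elapsed > 0 else 0.0
```

Inside the retrbinary callback you would then call something like speed = meter.add(len(data)) after writing the data. The clock parameter exists only so that the class can be exercised with a fake clock instead of real waiting.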
I am learning Python at my high school. We are still learning.
We wrote this small script that loads a website using cookies and downloads files of about 400k each. But for some reason it is very slow. Our internet connection is very fast, so it should be able to download 20-30 files at once, but it downloads just one file at a time and still waits a few seconds before downloading the next. Why is this so? Please check the script and give suggestions. No matter where we run the script, it always downloads a maximum of 7-8 files per minute. That cannot be right.
Here it is:
import urllib.request,string,random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

ab = 0
faturanum = 20184009433300
while (ab != 1000000):
    try:
        ab = ab + 1
        opener = urllib.request.build_opener()
        a = id_generator() + ".pdf"
        faturanum = faturanum - 1
        fatura = str(faturanum)
        faturanome = fatura + ".pdf"
        opener.addheaders = [('Cookie', 'ASP.NET_SessionId=gpkufgrfzsk5abc0bc2v2v3e')]
        f = opener.open("https://urltest.com/Fatura/Pdf?nrFatura="+fatura)
        file = open(faturanome, 'wb')
        file.write(f.read())
        file.close()
        print("file downloaded:"+str(ab)+" downloaded!")
    except:
        pass
Why is this so slow? The remote server is very fast too. Is there some way to get better results? Maybe by putting several files in a queue? Like I said, we are still learning. We just want to find a way for the script to make several requests at once and get several files at once, instead of one at a time.
So this is what I wrote after hours of work, but it doesn't work.
Well, here we go.
What am I doing wrong? I am doing my best, but I have never used threading before. Hey #KlausD, check it and let me know what I am doing wrong. It is a website that requires cookies. It also needs to load the URL and save the result as a PDF.
Here is my attempt:
import os
import threading
import urllib.request,string,random
from queue import Queue

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

class Downloader(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            url = self.queue.get()
            self.download_file(url)
            self.queue.task_done()

    def download_file(self, url):
        faturanum = 20184009433300
        ab = ab + 1
        handle = urllib.request.urlopen(url)
        a = id_generator() + ".pdf"
        faturanum = faturanum - 1
        fatura = str(faturanum)
        faturanome = fatura + ".pdf"
        handle.addheaders = [('Cookie', 'ASP.NET_SessionId=v5x4k1m44saix1f5hpybf2qu')]
        fname = os.path.basename(url)
        with open(fname, "wb") as f:
            while True:
                chunk = handle.read(1024)
                if not chunk: break
                f.write(chunk)

def main(urls):
    queue = Queue()
    for i in range(5):
        t = Downloader(queue)
        t.setDaemon(True)
        t.start()
    for url in urls:
        queue.put(url)
    queue.join()

if __name__ == "__main__":
    urls = ["https://sitetest.com/Fatura/Pdf?nrFatura="+fatura"]
    main(urls)
What is wrong? This is my best attempt. Believe me, I am losing sleep trying to make it work.
Client side:
def send_file_to_hashed(data, tcpsock):
    time.sleep(1)
    f = data
    flag = 0
    i=0
    tcpsock.send(hashlib.sha256(f.read()).hexdigest())
    f.seek(0)
    time.sleep(1)
    l = f.read(BUFFER_SIZE-64)
    while True:
        while (l):
            tcpsock.send(hashlib.sha256(l).hexdigest() + l)
            time.sleep(1)
            hashok = tcpsock.recv(6)
            if hashok == "HASHOK":
                l = f.read(BUFFER_SIZE-64)
                flag = 1
            if hashok == "BROKEN":
                flag = 0
        if not l:
            time.sleep(1)
            tcpsock.send("DONE")
            break
    return (tcpsock,flag)

def upload(filename):
    flag = 0
    while(flag == 0):
        with open(os.getcwd()+'\\data\\'+ filename +'.csv', 'rU') as UL:
            tuplol = send_file_to_hashed(UL ,send_to_sock(filename +".csv",send_to("upload",TCP_IP,TCP_PORT)))
            (sock,flagn) = tuplol
            flag = flagn
            time.sleep(2)
    sock.close()
Server Side:
elif(message == "upload"):
    message = rec_OK(self.sock)
    fis = os.getcwd()+'/data/'+ time.strftime("%H:%M_%d_%m_%Y") + "_" + message
    f = open(fis , 'w')
    latest = open(os.getcwd()+'/data/' + message , 'w')
    time.sleep(1)
    filehash = rec_OK(self.sock)
    print("filehash:" + filehash)
    while True:
        time.sleep(1)
        rawdata = self.sock.recv(BUFFER_SIZE)
        log.write("rawdata :" + rawdata + "\n")
        data = rawdata[64:]
        dhash = rawdata[:64]
        log.write("chash: " + dhash + "\n")
        log.write("shash: " + hashlib.sha256(data).hexdigest() + "\n")
        if dhash == hashlib.sha256(data).hexdigest():
            f.write(data)
            latest.write(data)
            self.sock.send("HASHOK")
            log.write("HASHOK\n" )
            print"HASHOK"
        else:
            self.sock.send("HASHNO")
            print "HASHNO"
            log.write("HASHNO\n")
        if rawdata == "DONE":
            f.close()
            f = open(fis , 'r')
            if (hashlib.sha256(f.read()).hexdigest() == filehash):
                print "ULDONE"
                log.write("ULDONE")
                f.close()
                latest.close()
                break
            else:
                self.sock.send("BROKEN")
                print hashlib.sha256(f.read()).hexdigest()
                log.write("BROKEN")
                print filehash
                print "BROKEN UL"
                f.close()
So the data upload works fine in all the tests that I ran from my computer; it even worked fine when uploading data over my mobile connection. Still, sometimes people say it takes a long time, and they kill it after a few minutes. The data is there on their computers but not on the server. I don't know what is happening. Please help!
First of all: this is unrelated to SHA.
Streaming over the network is unpredictable. This line
rawdata = self.sock.recv(BUFFER_SIZE)
doesn't guarantee that you read BUFFER_SIZE bytes. In the worst case you may have read only 1 byte. Therefore your server side is completely broken, because it assumes that rawdata contains a whole message. It is even worse: if the client sends the command and the hash quickly, you may get e.g. rawdata == 'DONEa2daf78c44(...)', which is mixed output.
The "hanging" part just follows from that. Trace your code and see what happens when the server receives partial/broken messages (I already did that in my imagination :P).
Streaming over the network is almost never as easy as calling sock.send on one side and sock.recv on the other side. You need some buffering/framing protocol. For example, you can implement this simple protocol: always interpret the first two bytes as the size of the incoming message, like this:
client (pseudocode)
# convert len of msg into two-byte array
# I am assuming the max size of msg is 65535 (the largest value that fits in two bytes)
buf = bytearray([len(msg) & 255, len(msg) >> 8])
sock.sendall(buf)
sock.sendall(msg)
server (pseudocode)
size = to_int(sock.recv(1))
size += to_int(sock.recv(1)) << 8
# You need two calls to recv since recv(2) can return 1 byte.
# (well, you can try recv(2) with `if` here to avoid additional
# syscall, not sure if worth it)
buffer = bytearray()
while size > 0:
    tmp = sock.recv(size)
    buffer += tmp
    size -= len(tmp)
Now you have the properly read data in the buffer variable, which you can work with.
WARNING: the pseudocode for the server is simplified. For example, you need to check for an empty recv() result everywhere (including where size is calculated). This happens when the client disconnects.
So unfortunately there's a lot of work ahead of you. You have to rewrite the whole sending and receiving code.
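For reference, here is a runnable Python 3 sketch of the two-byte length-prefix protocol described above, including the empty recv() check. The helper names send_msg, recv_exact and recv_msg are my own and purely illustrative. I use struct with "<H" so that the low byte goes first, as in the pseudocode.

```python
import socket
import struct


def send_msg(sock, payload):
    """Send one message prefixed with its length (2 bytes, low byte first)."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length prefix")
    sock.sendall(struct.pack("<H", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            # An empty result means the peer closed the connection
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return bytes(buf)


def recv_msg(sock):
    """Receive one complete length-prefixed message."""
    (size,) = struct.unpack("<H", recv_exact(sock, 2))
    return recv_exact(sock, size)


if __name__ == "__main__":
    # Local demonstration over a connected socket pair
    a, b = socket.socketpair()
    send_msg(a, b"DONE")
    send_msg(a, b"a2daf78c44")
    print(recv_msg(b))
    print(recv_msg(b))
    a.close()
    b.close()
```

With this framing, a command such as DONE and a hash sent right after it arrive as two separate messages, even if the network delivers their bytes together.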
I'm writing a program for downloading images from the internet, and I would like to speed it up by making multiple requests at once.
I wrote some code, which you can see on GitHub.
The only way I know to request a webpage is like this:
def myrequest(url):
    worked = False
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    while not worked:
        try:
            webpage_read = urlopen(req).read()
            worked = True
        except:
            print("failed to connect to \n{}".format(url))
    return(webpage_read)

url = "http://www.mangahere.co/manga/mysterious_girlfriend_x"
webpage_read = myrequest(url).decode("utf-8")
The while loop is there because I definitely want to download every single picture, so I keep trying until it works (nothing can go wrong except urllib.error.HTTPError: HTTP Error 504: Gateway Time-out).
My question is: how do I run that multiple times at once?
My idea is to have a "commander" which runs 5 (or 85) Python scripts, gives each one a URL, and collects the webpages from them once they are finished, but this is definitely a silly solution :)
EDIT:
I used _thread, but it doesn't seem to speed up the program. That should have been the solution; am I doing it wrong? That is my new question.
You can use the link to get to my code on GitHub.
def thrue_thread_download_pics(path, url, ep, name):
    lock.acquire()
    global goal
    goal += 1
    lock.release()
    webpage_read = myrequest("{}/{}.html".format(url, ep))
    url_to_pic = webpage_read.decode("utf-8").split('" onerror="')[0].split('<img src="')[-1]
    pic = myrequest(url_to_pic)
    myfile = open("{}/pics/{}.jpg".format(path, name), "wb")
    myfile.write(pic)
    myfile.close()
    global finished
    finished += 1
and I'm using it here:
for url_ep in urls_eps:
    url, maxep = url_ep.split()
    maxep = int(maxep)
    chap = url.split("/")[-1][2:]
    if "." in chap:
        chap = chap.replace(".", "")
    else:
        chap = "{}0".format(chap)
    for ep in range(1, maxep + 1):
        ted = time.time()
        name = "{}{}".format(chap, "{}{}".format((2 - len(str(ep))) * "0", ep))
        if name in downloaded:
            continue
        _thread.start_new_thread(thrue_thread_download_pics, (path, url, ep, name))
checker = -1
while finished != goal:
    if finished != checker:
        checker = finished
        print("{} of {} downloaded".format(finished, goal))
    time.sleep(0.1)
Requests Futures is built on top of the very popular requests library and runs the requests in the background (in a pool of worker threads), so the calls return immediately instead of blocking:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
# These requests will run at the same time
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
# Get the first result
response_one = future_one.result()
print(response_one.status_code)
print(response_one.text)
# Get the second result
response_two = future_two.result()
print(response_two.status_code)
print(response_two.text)
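If you prefer to stay with the standard library, the same pattern can be sketched with concurrent.futures, which, as far as I know, is what Requests Futures builds on. The function names fetch and fetch_all and the max_workers value are my own choices for illustration, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch, max_workers=5):
    """Fetch all URLs concurrently; return {url: bytes or the exception raised}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap
        futures = {pool.submit(fetch, url): url for url in urls}
        # Collect results in completion order
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the error so the caller can decide whether to retry
                results[url] = exc
    return results
```

Failed downloads are returned as exception objects rather than retried forever, so the caller can decide which URLs to submit again. The fetch parameter is there so that a different download function (for example one that sets a User-Agent header or a cookie) can be passed in.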
Please take a moment to help me. How can I use seek with wfile:
self.wfile = self.connection.makefile('wb', self.wbufsize)
My code looks like this:
self.wfile.seek(offset, 0)
self.wfile.write(r.data)
But the problem is that IDLE shows this error every time I try to run my code:
self.wfile.seek(offset, 0)
io.UnsupportedOperation: seek
I thought wfile and open were the same, so why can't I seek as I can with open? Even if that is true, I still think there is a way to bypass this restriction.
Note: If you have ever heard of http.server or BaseHTTPServer, you probably know what wfile is.
EDIT: I edited my post to add my code. It is only a part of my full software, but the other parts are not really needed:
self.send_response(200)
self.end_headers()

def accelerator(url=None, splitBy=3):
    def buildRange(url, numsplits):
        global globaldownloadersave
        value = int(self.pool.urlopen('HEAD', url).headers["content-length"])
        print("Fullsize: ", value)
        print("Try devide with :", value / numsplits)
        lst = []
        for i in range(numsplits):
            if i == range(numsplits):
                lst.append('%s-%s' % (i * value//numsplits + 1, i * value//numsplits + 1 + (value - (i * value//numsplits + 1))))
            if i == 0:
                lst.append('%s-%s' % (0, value//numsplits))
            else:
                lst.append('%s-%s' % (i * value//numsplits + 1, (i + 1) * value//numsplits))
        return lst

    def downloadChunk(idx, irange):
        global globaldownloadersave
        r = self.pool.urlopen('GET', url, headers={'Range': 'bytes=' + str(irange)})
        offset = int(re.sub("(^.*?)-(.*?)$", "\\1", irange))
        offset2 = int(re.sub("(^.*?)-(.*?)$", "\\2", irange))
        self.wfile.seek(offset, 0)
        self.wfile.write(r.data)

    #self.data = io.BytesIO(b'')
    ranges = buildRange(url, splitBy)
    tasks = []
    # create one downloading thread per chunk
    #q = queue.Queue() big fail, so comment it
    downloaders = [
        threading.Thread(
            target=downloadChunk,
            args=(idx, irange),
        )
        for idx,irange in enumerate(ranges)
    ]
    # start threads, let run in parallel, wait for all to finish
    for th in downloaders:
        th.start()
    for th in downloaders:
        th.join()

accelerator(self.url, 4)
self.close_connection = 1
return
A socket connection is a stream. Bytes once read from a stream are gone, so seek makes no sense. While it is possible to keep all read bytes in memory and simulate a seek, this is normally not preferred. Try to write your code without the need for seek.
EDIT: I didn't see it at first sight: you are trying to seek in a writing stream. This will never be possible, because you cannot tell the receiving end "forget about everything I've sent, you are getting new data". If you really need that functionality, you have to save the data locally in a normal file and, when finished, send this file as one block to the client.
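A rough sketch of that last suggestion, assuming each chunk is described by its byte offset and a function that downloads it. The name assemble_and_send and the (offset, get_data) pairs are hypothetical; in the handler from the question, out_stream would be self.wfile.

```python
import shutil
import tempfile
import threading


def assemble_and_send(parts, out_stream):
    """Collect chunks in a seekable temp file, then stream it to out_stream.

    `parts` is an iterable of (offset, get_data) pairs, where get_data is a
    zero-argument callable returning the bytes that belong at `offset`.
    `out_stream` only needs a write() method; it is never seeked.
    """
    lock = threading.Lock()
    with tempfile.TemporaryFile() as tmp:
        def worker(offset, get_data):
            data = get_data()      # the slow part (the download) runs in parallel
            with lock:             # seek + write must not interleave between threads
                tmp.seek(offset)
                tmp.write(data)

        threads = [threading.Thread(target=worker, args=part) for part in parts]
        for th in threads:
            th.start()
        for th in threads:
            th.join()

        # Everything is on disk in the right order; send it in one pass
        tmp.seek(0)
        shutil.copyfileobj(tmp, out_stream)
```

Note that the client receives nothing until every chunk has arrived, and that an exception inside a worker thread is not propagated by this sketch, so real code would need to check for failed chunks before sending.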