Python: seek on wfile like open?

Please take a little of your time to help me. How can I use seek with wfile:
self.wfile = self.connection.makefile('wb', self.wbufsize)
My code looks like this:
self.wfile.seek(offset, 0)
self.wfile.write(r.data)
But the problem is, my IDE shows this error every time I try to run my code:
self.wfile.seek(offset, 0)
io.UnsupportedOperation: seek
I thought wfile and a file from open were the same, so why can't I seek like with open? Even if they are different, I still think there is a way to bypass this restriction.
Note: if you have ever heard of http.server or BaseHTTPServer, you probably know what wfile is.
EDIT: I have edited my post to add my code. This is only part of my full program, but the other parts are not really needed:
self.send_response(200)
self.end_headers()

def accelerator(url=None, splitBy=3):
    def buildRange(url, numsplits):
        global globaldownloadersave
        value = int(self.pool.urlopen('HEAD', url).headers["content-length"])
        print("Fullsize: ", value)
        print("Try devide with :", value / numsplits)
        lst = []
        for i in range(numsplits):
            if i == range(numsplits):
                lst.append('%s-%s' % (i * value//numsplits + 1, i * value//numsplits + 1 + (value - (i * value//numsplits + 1))))
            if i == 0:
                lst.append('%s-%s' % (0, value//numsplits))
            else:
                lst.append('%s-%s' % (i * value//numsplits + 1, (i + 1) * value//numsplits))
        return lst

    def downloadChunk(idx, irange):
        global globaldownloadersave
        r = self.pool.urlopen('GET', url, headers={'Range': 'bytes=' + str(irange)})
        offset = int(re.sub("(^.*?)-(.*?)$", "\\1", irange))
        offset2 = int(re.sub("(^.*?)-(.*?)$", "\\2", irange))
        self.wfile.seek(offset, 0)
        self.wfile.write(r.data)

    #self.data = io.BytesIO(b'')
    ranges = buildRange(url, splitBy)
    tasks = []
    # create one downloading thread per chunk
    #q = queue.Queue() big fail, so comment it
    downloaders = [
        threading.Thread(
            target=downloadChunk,
            args=(idx, irange),
        )
        for idx, irange in enumerate(ranges)
    ]
    # start threads, let run in parallel, wait for all to finish
    for th in downloaders:
        th.start()
    for th in downloaders:
        th.join()

accelerator(self.url, 4)
self.close_connection = 1
return

A socket connection is a stream. Bytes once read from a stream are gone, so seek makes no sense. While it is possible to keep all read bytes in memory and simulate a seek, this is normally not preferred. Try to write your code without the need for seek.
EDIT: I didn't see it at first sight: you are trying to seek in a writing stream. This will never be possible, because you cannot tell the receiving end "forget everything I've sent, here is new data". If you really need that functionality, you have to save the data locally in a normal file and, when finished, send this file as one block to the client.
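If the handler really does need the chunks in order, one option in the spirit of that advice is to buffer each chunk in memory (or a temp file) and write them to wfile in offset order once all download threads have joined. A minimal sketch, assuming the urllib3 pool and range strings from the question (chunks, download_chunk and flush_to_wfile are illustrative names, not the original code):

    # Sketch only: buffer each range chunk keyed by its starting offset,
    # then write the chunks to the non-seekable wfile in offset order.
    import threading

    chunks = {}                     # offset -> bytes
    chunks_lock = threading.Lock()

    def download_chunk(pool, url, irange):
        # irange is a string like "0-1023", as produced by buildRange()
        start = int(irange.split('-')[0])
        r = pool.urlopen('GET', url, headers={'Range': 'bytes=' + irange})
        with chunks_lock:
            chunks[start] = r.data

    def flush_to_wfile(wfile):
        # call this after all downloader threads have been joined
        for offset in sorted(chunks):
            wfile.write(chunks[offset])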

Related

Why is this small Python script so slow? Maybe I am using the wrong lib? Suggestions?

I am learning Python at my high school. We are still learning.
We wrote this small script that loads a website using cookies and downloads files of about 400 KB each. But for some reason it is very slow. Our internet connection is very fast; it should be able to download 20-30 files at once, but for some reason it downloads just one file at a time and still waits a few seconds before downloading the next. Why is that? Please check the script and give suggestions. No matter where we run the script, it never downloads more than 7-8 files per minute. That can't be right.
Check it out:
import urllib.request, string, random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

ab = 0
faturanum = 20184009433300
while (ab != 1000000):
    try:
        ab = ab + 1
        opener = urllib.request.build_opener()
        a = id_generator() + ".pdf"
        faturanum = faturanum - 1
        fatura = str(faturanum)
        faturanome = fatura + ".pdf"
        opener.addheaders = [('Cookie', 'ASP.NET_SessionId=gpkufgrfzsk5abc0bc2v2v3e')]
        f = opener.open("https://urltest.com/Fatura/Pdf?nrFatura=" + fatura)
        file = open(faturanome, 'wb')
        file.write(f.read())
        file.close()
        print("file downloaded:" + str(ab) + " downloaded!")
    except:
        pass
Why is this so slow? The remote server is very fast too. Is there some way to get better results? Maybe putting several files in a queue? Like I said, we are still learning. We just want to find a way for the script to make several requests at once and get several files at once, instead of one at a time.
So this is what I wrote after hours, but it doesn't work.
Well, here we go.
What am I doing wrong? I am doing my best, but I have never used threading before. Hey @KlausD, check it and let me know what I am doing wrong. It is a website that requires cookies. It also needs to load the URL and turn it into a PDF.
Check my attempt:
import os
import threading
import urllib.request, string, random
from queue import Queue

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

class Downloader(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            url = self.queue.get()
            self.download_file(url)
            self.queue.task_done()

    def download_file(self, url):
        faturanum = 20184009433300
        ab = ab + 1
        handle = urllib.request.urlopen(url)
        a = id_generator() + ".pdf"
        faturanum = faturanum - 1
        fatura = str(faturanum)
        faturanome = fatura + ".pdf"
        handle.addheaders = [('Cookie', 'ASP.NET_SessionId=v5x4k1m44saix1f5hpybf2qu')]
        fname = os.path.basename(url)
        with open(fname, "wb") as f:
            while True:
                chunk = handle.read(1024)
                if not chunk: break
                f.write(chunk)

def main(urls):
    queue = Queue()
    for i in range(5):
        t = Downloader(queue)
        t.setDaemon(True)
        t.start()
    for url in urls:
        queue.put(url)
    queue.join()

if __name__ == "__main__":
    urls = ["https://sitetest.com/Fatura/Pdf?nrFatura="+fatura"]
    main(urls)
What is wrong? This is my best effort; believe me, I am losing sleep trying to make it work.
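For comparison, a minimal sketch of one way the threaded download could be structured, assuming the placeholder URL and session cookie from the question; it uses concurrent.futures instead of a hand-rolled Thread subclass, and the invoice number range is made up:

    # A minimal sketch, not a drop-in fix: the URL pattern and cookie are the
    # placeholders from the question, and the invoice number range is made up.
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    BASE_URL = "https://urltest.com/Fatura/Pdf?nrFatura="
    COOKIE = "ASP.NET_SessionId=gpkufgrfzsk5abc0bc2v2v3e"

    def download_one(fatura):
        # each worker builds its own opener carrying the session cookie
        opener = urllib.request.build_opener()
        opener.addheaders = [('Cookie', COOKIE)]
        with opener.open(BASE_URL + str(fatura)) as resp, \
                open(str(fatura) + ".pdf", "wb") as out:
            out.write(resp.read())
        return fatura

    if __name__ == "__main__":
        # made-up range of invoice numbers, counting down as in the question
        faturas = range(20184009433300, 20184009433300 - 100, -1)
        with ThreadPoolExecutor(max_workers=20) as pool:
            for done in pool.map(download_one, faturas):
                print("downloaded:", done)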

python socket file transfer verified with sha256 not working, but only sometimes?

Client side:
def send_file_to_hashed(data, tcpsock):
    time.sleep(1)
    f = data
    flag = 0
    i = 0
    tcpsock.send(hashlib.sha256(f.read()).hexdigest())
    f.seek(0)
    time.sleep(1)
    l = f.read(BUFFER_SIZE - 64)
    while True:
        while (l):
            tcpsock.send(hashlib.sha256(l).hexdigest() + l)
            time.sleep(1)
            hashok = tcpsock.recv(6)
            if hashok == "HASHOK":
                l = f.read(BUFFER_SIZE - 64)
                flag = 1
            if hashok == "BROKEN":
                flag = 0
        if not l:
            time.sleep(1)
            tcpsock.send("DONE")
            break
    return (tcpsock, flag)

def upload(filename):
    flag = 0
    while (flag == 0):
        with open(os.getcwd() + '\\data\\' + filename + '.csv', 'rU') as UL:
            tuplol = send_file_to_hashed(UL, send_to_sock(filename + ".csv", send_to("upload", TCP_IP, TCP_PORT)))
            (sock, flagn) = tuplol
            flag = flagn
            time.sleep(2)
    sock.close()
Server Side:
elif (message == "upload"):
    message = rec_OK(self.sock)
    fis = os.getcwd() + '/data/' + time.strftime("%H:%M_%d_%m_%Y") + "_" + message
    f = open(fis, 'w')
    latest = open(os.getcwd() + '/data/' + message, 'w')
    time.sleep(1)
    filehash = rec_OK(self.sock)
    print("filehash:" + filehash)
    while True:
        time.sleep(1)
        rawdata = self.sock.recv(BUFFER_SIZE)
        log.write("rawdata :" + rawdata + "\n")
        data = rawdata[64:]
        dhash = rawdata[:64]
        log.write("chash: " + dhash + "\n")
        log.write("shash: " + hashlib.sha256(data).hexdigest() + "\n")
        if dhash == hashlib.sha256(data).hexdigest():
            f.write(data)
            latest.write(data)
            self.sock.send("HASHOK")
            log.write("HASHOK\n")
            print "HASHOK"
        else:
            self.sock.send("HASHNO")
            print "HASHNO"
            log.write("HASHNO\n")
        if rawdata == "DONE":
            f.close()
            f = open(fis, 'r')
            if (hashlib.sha256(f.read()).hexdigest() == filehash):
                print "ULDONE"
                log.write("ULDONE")
                f.close()
                latest.close()
                break
            else:
                self.sock.send("BROKEN")
                print hashlib.sha256(f.read()).hexdigest()
                log.write("BROKEN")
                print filehash
                print "BROKEN UL"
                f.close()
The data upload works fine in all the tests I ran from my computer; it even worked fine while uploading data over my mobile connection. But sometimes people say it takes a long time and they kill it after a few minutes. The data is there on their computers but not on the server. I don't know what is happening, please help!
First of all: this is unrelated to sha.
Streaming over the network is unpredictable. This line
rawdata = self.sock.recv(BUFFER_SIZE)
doesn't guarantee that you read BUFFER_SIZE bytes. You may have read only 1 byte in the worst case. Therefore your server side is completely broken because of the assumption that rawdata contains a whole message. It is even worse: if the client sends the command and the hash quickly, you may get e.g. rawdata == 'DONEa2daf78c44(...)', which is a mix of both.
The "hanging" part just follows from that. Trace your code and see what happens when the server receives partial/broken messages (I already did that in my imagination :P).
Streaming over the network is almost never as easy as calling sock.send on one side and sock.recv on the other side. You need some buffering/framing protocol. For example, you can implement this simple protocol: always interpret the first two bytes as the size of the incoming message, like this:
client (pseudocode)
# convert len of msg into two-byte array
# I am assuming the max size of msg is 65536
buf = bytearray([len(msg) & 255, len(msg) >> 8])
sock.sendall(buf)
sock.sendall(msg)
server (pseudocode)
size = to_int(sock.recv(1))
size += to_int(sock.recv(1)) << 8
# You need two calls to recv since recv(2) can return 1 byte.
# (well, you can try recv(2) with `if` here to avoid additional
# syscall, not sure if worth it)
buffer = bytearray()
while size > 0:
    tmp = sock.recv(size)
    buffer += tmp
    size -= len(tmp)
Now you have properly read data in buffer variable which you can work with.
WARNING: the pseudocode for the server is simplified. For example you need to check for empty recv() result everywhere (including where size is calculated). This is the case when the client disconnects.
So unfortunately there's a lot of work in front of you. You have to rewrite the whole sending and receiving code.
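For reference, a minimal self-contained sketch of that two-byte length prefix in Python (send_msg, recv_msg and recv_exactly are illustrative names; messages are assumed to fit in 65535 bytes):

    # Sketch of the framing protocol described above, not the asker's code.
    import struct

    def send_msg(sock, msg):
        # two-byte little-endian length, then the payload
        sock.sendall(struct.pack('<H', len(msg)))
        sock.sendall(msg)

    def recv_exactly(sock, n):
        # keep calling recv until exactly n bytes have been collected
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:                      # peer disconnected
                raise ConnectionError("socket closed mid-message")
            buf += chunk
        return bytes(buf)

    def recv_msg(sock):
        (size,) = struct.unpack('<H', recv_exactly(sock, 2))
        return recv_exactly(sock, size)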

parse pcap file with scapy

I am comparing scapy and dpkt in terms of speed. I have a directory with pcap files, which I parse, counting the HTTP requests in each file. Here's the scapy code:
import time
from scapy.all import *

def parse(f):
    x = 0
    pcap = rdpcap(f)
    for p in pcap:
        try:
            if p.haslayer(TCP) and p.getlayer(TCP).dport == 80 and p.haslayer(Raw):
                x = x + 1
        except:
            continue
    print x

if __name__ == '__main__':
    path = '/home/pcaps'
    start = time.time()
    for file in os.listdir(path):
        current = os.path.join(path, file)
        print current
        f = open(current)
        parse(f)
        f.close()
    end = time.time()
    print (end - start)
The script is really slow (it gets stuck after a few minutes) compared to the dpkt version:
import dpkt
import time
from os import walk
import os
import sys

def parse(f):
    x = 0
    try:
        pcap = dpkt.pcap.Reader(f)
    except:
        print "Invalid Header"
        return
    for ts, buf in pcap:
        try:
            eth = dpkt.ethernet.Ethernet(buf)
        except:
            continue
        if eth.type != 2048:
            continue
        try:
            ip = eth.data
        except:
            continue
        if ip.p == 6:
            if type(eth.data) == dpkt.ip.IP:
                tcp = ip.data
                if tcp.dport == 80:
                    try:
                        http = dpkt.http.Request(tcp.data)
                        x = x + 1
                    except:
                        continue
    print x

if __name__ == '__main__':
    path = '/home/pcaps'
    start = time.time()
    for file in os.listdir(path):
        current = os.path.join(path, file)
        print current
        f = open(current)
        parse(f)
        f.close()
    end = time.time()
    print (end - start)
So is there something wrong with the way I am using scapy? Or is it just that scapy is slower than dpkt?
You inspired me to compare. 2 GB PCAP. Dumb test. Simply counting the number of packets.
I'd expect this to be in single digit minutes with C++ / libpcap just based on previous timings of similar sized files. But this is something new. I wanted to prototype first. My velocity is generally higher in Python.
For my application, streaming is the only option. I'll be reading several of these PCAPs simultaneously and doing computations based on their contents. Can't just hold in memory. So I'm only comparing streaming calls.
scapy 2.4.5:
from scapy.all import *
import datetime
i=0
print(datetime.datetime.now())
for packet in PcapReader("/my.pcap"):
    i += 1
else:
    print(i)
print(datetime.datetime.now())
dpkt 1.9.7.2:
import datetime
import dpkt
print(datetime.datetime.now())
with open(pcap_file, 'rb') as f:
    pcap = dpkt.pcap.Reader(f)
    i = 0
    for timestamp, buf in pcap:
        i += 1
    else:
        print(i)
print(datetime.datetime.now())
Results:
Packet count is the same. So that's good. :-)
dpkt - Just under 10 minutes.
scapy - 35 minutes.
dpkt went first, so if the disk cache were helping either package, it would be scapy. And I think it might be, marginally: I did this previously with scapy only, and it took over 40 minutes.
In summary, thanks for your 5 year old question. It's still relevant today. I almost bailed on Python here because of the overly long read speeds from scapy. dpkt seems substantially more performant.
Side note, alternative packages:
https://pypi.org/project/python-libpcap/ I'm on python 3.10 and 0.4.0 seems broken for me, unfortunately.
https://pypi.org/project/libpcap/ I'd like to compare timings to this, but have found it much harder to get a minimal example going. Haven't spent much time though, to be fair.

Python: How to calculate multithreaded download speed

I wrote a multi-threaded HTTP downloader. It can now download a file faster than a single-threaded downloader, and the MD5 sum is correct. However, the speed it shows is so high that I do not believe it is the true value.
The unit is not printed yet, but I am sure it is KB/s. Please take a look at the part of my code that does the measuring.
# Setup the slaver
def _download(self):
    # Start downloading partial content while the queue is not empty
    while not self.configer.down_queue.empty():
        data_range = self.configer.down_queue.get()
        headers = {
            'Range': 'bytes={}-{}'.format(*data_range)
        }
        response = requests.get(
            self.configer.url, stream=True,
            headers=headers
        )
        start_point = data_range[0]
        for bunch in response.iter_content(self.block_size):
            _time = time.time()
            with self.file_lock:
                with open(
                    self.configer.path, 'r+b',
                    buffering=1
                ) as f:
                    f.seek(start_point)
                    f.write(bunch)
                    f.flush()
            start_point += self.block_size
            self.worker_com.put((
                threading.current_thread().name,
                int(self.block_size / (time.time() - _time))
            ))
        self.configer.down_queue.task_done()

# speed monitor
def speed_monitor(self):
    while len(self.thread_list) > 0:
        try:
            info = self.worker_com.get_nowait()
            self.speed[info[0]] = info[1]
        except queue.Empty:
            time.sleep(0.1)
            continue
        sys.stdout.write('\b' * 64 + '{:10}'.format(self.total_speed)
                         + ' thread num ' + '{:2}'.format(self.worker_count))
        sys.stdout.flush()
If you need more information, please visit my GitHub repository. I would appreciate it if you could point out my error. Thanks.
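As a point of comparison, one common way to measure aggregate download speed is to add the bytes written to a shared counter and sample it at a fixed interval, rather than timing each individual block write. A minimal sketch under assumed names (downloaded, note_bytes and speed_monitor here are illustrative, not taken from the repository):

    # Sketch only: workers report bytes written into one shared counter and a
    # monitor thread samples it once per second to compute KB/s across threads.
    import threading
    import time

    downloaded = 0
    counter_lock = threading.Lock()

    def note_bytes(n):
        """Call from a worker thread after writing n bytes."""
        global downloaded
        with counter_lock:
            downloaded += n

    def speed_monitor(stop_event, interval=1.0):
        last = 0
        while not stop_event.is_set():
            time.sleep(interval)
            with counter_lock:
                current = downloaded
            speed_kbs = (current - last) / interval / 1024
            last = current
            print('\r{:10.1f} KB/s'.format(speed_kbs), end='', flush=True)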

Python Multi Threads

I need to extract all the URLs from an IP list.
I wrote this Python script, but I have an issue: the same IP gets extracted multiple times (multiple threads are created with the same IP).
Could anyone improve on my solution using multithreading?
Sorry for my English.
Thanks, all.
import urllib2, os, re, sys, os, time, httplib, thread, argparse, random

try:
    ListaIP = open(sys.argv[1], "r").readlines()
except(IOError):
    print "Error: Check your IP list path\n"
    sys.exit(1)

def getIP():
    if len(ListaIP) != 0:
        value = random.sample(ListaIP, 1)
        ListaIP.remove(value[0])
        return value
    else:
        print "\nListaIPs sa terminat\n"
        sys.exit(1)

def extractURL(ip):
    print ip + '\n'
    page = urllib2.urlopen('http://sameip.org/ip/' + ip)
    html = page.read()
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
    outfile = open('2.log', 'a')
    outfile.write("\n".join(links))
    outfile.close()

def start():
    while True:
        if len(ListaIP) != 0:
            test = getIP()
            IP = ''.join(test).replace('\n', '')
            extractURL(IP)
        else:
            break

for x in range(0, 10):
    thread.start_new_thread(start, ())

while 1:
    pass
Use a threading.Lock. The lock should be global and created at the beginning, when you create the IP list.
Call lock.acquire() at the start of getIP() and release it before you leave the method.
What you are seeing is: thread 1 executes value = random.sample(...), and then thread 2 also executes value = random.sample(...) before thread 1 gets to the remove. So the item is still in the list at the time thread 2 gets there.
Therefore both threads have a chance of getting the same IP.
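A minimal sketch of that locking, assuming ListaIP, random and sys are set up as in the question's code:

    # Sketch only: serialize access to the shared list so two threads can
    # never pick and remove the same IP.
    import threading

    lock = threading.Lock()   # create once, next to ListaIP

    def getIP():
        with lock:            # acquired on entry, released on exit
            if len(ListaIP) != 0:
                value = random.sample(ListaIP, 1)
                ListaIP.remove(value[0])
                return value
        print("\nListaIPs sa terminat\n")
        sys.exit(1)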
